# Boolean Circuits
- Research Direction: Train a small network to compute boolean functions and reverse-engineer its implementation to understand how it learns and represents logical reasoning.
- Experiment Setup: Use different input/output structures, architectures (MLP or transformer), and task variations (single gate, fixed formula, formula distribution, causal inference) to study the network's learning process.
- Interpretability Analysis: Analyze the network's learned representations by visualizing decision boundaries, studying feature directions for logical primitives, and examining how intermediate layers progressively solve the problem.
- Network Learning Analysis: Investigate whether the network learns directions corresponding to individual variables, conjunctions, and intermediate subexpressions, and how attention routes information in transformers.
- Causal Interventions: Ablate neurons/heads to analyze the impact on logical subcircuits and understand the network’s causal structure.
- Superposition Analysis: Explore if the network represents more logical features than dimensions when trained on multiple formulas simultaneously, considering the trade-off with formula complexity and co-occurrence statistics.
- Potential Research Directions: Exploring transformers on variable-length formulas, out-of-distribution generalization, and comparisons to symbolic approaches.
- Experiment Focus: Investigating how attention mechanisms implement recursive evaluation of nested formulas.
- Visualization and Analysis: Suggestions for detailed experimental plans, analysis techniques (activation patching, probing), and visualization suite design.
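The task variations above (single gate, fixed formula, formula distribution) can be sketched as a small data-generation harness. This is an illustrative sketch only; the gate set, `random_formula`, and the tuple encoding of formulas are assumptions, not part of any existing codebase:

```python
import itertools
import random

# Hypothetical gate set; extend with NAND, NOR, etc. as needed.
GATES = {"AND": lambda a, b: a and b,
         "OR":  lambda a, b: a or b,
         "XOR": lambda a, b: a != b}

def random_formula(n_vars, depth, rng):
    """Sample a random boolean formula, encoded as a nested tuple."""
    if depth == 0:
        return ("VAR", rng.randrange(n_vars))
    gate = rng.choice(sorted(GATES))
    return (gate,
            random_formula(n_vars, depth - 1, rng),
            random_formula(n_vars, depth - 1, rng))

def evaluate(formula, x):
    """Recursively evaluate a formula on a 0/1 input tuple x."""
    op = formula[0]
    if op == "VAR":
        return bool(x[formula[1]])
    return GATES[op](evaluate(formula[1], x), evaluate(formula[2], x))

def truth_table(formula, n_vars):
    """Full (input, output) dataset for the fixed-formula variant."""
    return [(x, evaluate(formula, x))
            for x in itertools.product([0, 1], repeat=n_vars)]
```

For the single-gate and fixed-formula variants, the network can be trained on the full truth table; for the formula-distribution variant, a fresh `random_formula` would be sampled per example and serialized into the model's input alongside the variable assignment.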
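The causal-intervention step (ablating neurons to expose logical subcircuits) can be sketched as a zero-ablation on a toy two-layer network. The weights below are hand-constructed so that one hidden unit computes AND; this is a hypothetical minimal example, not a trained model:

```python
def relu(z):
    return max(0.0, z)

# Hand-constructed weights (hypothetical, not trained):
# hidden unit 0 pre-activation: x0 + x1 - 1  -> fires only on AND(x0, x1)
# hidden unit 1 pre-activation: x0 + x1      -> fires on OR(x0, x1)
W1 = [[1.0, 1.0],
      [1.0, 1.0]]
b1 = [-1.0, 0.0]
W2 = [1.0, 0.0]  # the output reads only hidden unit 0

def forward(x, ablate=None):
    """Run the net; `ablate` zeroes one hidden unit (zero-ablation)."""
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    if ablate is not None:
        h[ablate] = 0.0
    return sum(w * hi for w, hi in zip(W2, h))
```

Ablating unit 0 destroys the AND behaviour while ablating unit 1 leaves it intact, which is the kind of asymmetry the analysis would look for to attribute a logical subcircuit to specific neurons. In a real experiment the same intervention would be done with forward hooks on the trained model rather than hand-written weights.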
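The probing analysis (checking whether a direction in activation space encodes a variable) can be sketched with a mean-difference probe on synthetic activations. The planted-direction data below is an assumption for illustration; a mean-difference probe is one standard baseline, not necessarily the probe the notes intend:

```python
import random

def mean_diff_probe(acts, labels):
    """Probe direction: mean(acts | label=1) - mean(acts | label=0)."""
    pos = [a for a, l in zip(acts, labels) if l == 1]
    neg = [a for a, l in zip(acts, labels) if l == 0]
    mean_pos = [sum(col) / len(pos) for col in zip(*pos)]
    mean_neg = [sum(col) / len(neg) for col in zip(*neg)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def probe_accuracy(direction, acts, labels):
    """Classify by the sign of the projection onto the probe direction."""
    preds = [1 if sum(d * a for d, a in zip(direction, act)) > 0 else 0
             for act in acts]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Synthetic activations: coordinate 0 carries the variable, rest is noise.
rng = random.Random(0)
acts, labels = [], []
for _ in range(200):
    label = rng.randrange(2)
    acts.append([(2.0 if label else -2.0) + rng.gauss(0, 0.5),
                 rng.gauss(0, 1.0),
                 rng.gauss(0, 1.0)])
    labels.append(label)
```

On real data, `acts` would be intermediate-layer activations of the trained network and `labels` the value of a variable or subexpression; a high probe accuracy is evidence (though not proof) that the network linearly represents that feature.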