# Boolean Circuits

Research Direction: Train a small network to compute boolean functions and reverse-engineer its implementation to understand how it learns and represents logical reasoning.

Experiment Setup: Use different input/output encodings, architectures (MLP or transformer), and task variations (single gate, fixed formula, formula distribution, causal inference) to study the network’s learning process.
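The three task variations can be made concrete with small data generators. This is a minimal sketch; the function names and the particular fixed formula are illustrative choices, not part of the project spec.

```python
import itertools
import random

def single_gate_dataset(gate="AND"):
    """All input pairs for one gate -- the simplest task variation."""
    ops = {"AND": lambda a, b: a & b,
           "OR": lambda a, b: a | b,
           "XOR": lambda a, b: a ^ b}
    f = ops[gate]
    return [((a, b), f(a, b)) for a, b in itertools.product([0, 1], repeat=2)]

def fixed_formula_dataset(n_vars=4):
    """Every assignment for one fixed formula: (x0 AND x1) OR (NOT x2 AND x3)."""
    def formula(x):
        return (x[0] & x[1]) | ((1 - x[2]) & x[3])
    return [(x, formula(x)) for x in itertools.product([0, 1], repeat=n_vars)]

def sample_random_formula(n_vars=4, depth=2, rng=None):
    """Sample a random formula tree for the formula-distribution variation."""
    rng = rng or random.Random(0)
    if depth == 0:
        i = rng.randrange(n_vars)
        return lambda x, i=i: x[i]          # leaf: a single variable
    op = rng.choice(["and", "or", "not"])
    left = sample_random_formula(n_vars, depth - 1, rng)
    if op == "not":
        return lambda x: 1 - left(x)
    right = sample_random_formula(n_vars, depth - 1, rng)
    if op == "and":
        return lambda x: left(x) & right(x)
    return lambda x: left(x) | right(x)
```

Enumerating all assignments works for small variable counts; the formula-distribution variation would sample a fresh formula per training example.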

Interpretability: Analyze the network’s learned representations by visualizing decision boundaries, studying feature directions for logical primitives, and examining how intermediate layers progressively solve the problem.
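To make "intermediate layers progressively solve the problem" concrete, here is a hand-wired 2-2-1 XOR network used as a deterministic stand-in for a trained model: the first layer computes OR and AND, and the second combines them. A trained network would be inspected the same way, layer by layer.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)

W1 = np.array([[1.0, 1.0],      # column 0: fires on OR(a, b)
               [1.0, 1.0]])     # column 1: fires on AND(a, b)
b1 = np.array([-0.5, -1.5])
W2 = np.array([[1.0], [-1.0]])  # OR minus AND = XOR
b2 = np.array([-0.5])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = step(X @ W1 + b1)           # intermediate representation: [OR, AND]
Y = step(H @ W2 + b2)           # output: XOR

for x, h, y in zip(X, H, Y):
    print(x, "-> hidden", h, "-> out", y)
```

Plotting the two hidden pre-activations over the input square would show the two linear decision boundaries whose intersection carves out the XOR region.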

Network Learning: Investigate whether the network learns directions corresponding to individual variables, conjunctions, and intermediate subexpressions, and how attention routes information in transformers.
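A standard way to test for such directions is a linear probe. The sketch below fits a least-squares probe for the conjunction x0 AND x1 against a random-features layer standing in for real hidden activations; with few inputs and many features an R² near 1 is expected, a known caveat of probing small datasets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)  # 8 inputs
target = X[:, 0] * X[:, 1]                      # the conjunction x0 AND x1

# Stand-in activations; in practice, take a real intermediate layer.
W = rng.normal(size=(3, 16))
H = np.tanh(X @ W)

# Least-squares probe (with bias); R^2 measures linear decodability of AND.
design = np.c_[H, np.ones(len(H))]
w, *_ = np.linalg.lstsq(design, target, rcond=None)
pred = design @ w
r2 = 1 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)
print("probe R^2:", round(r2, 3))
```

The probe weight vector `w` (minus the bias term) is the candidate "conjunction direction"; comparing such directions across layers shows where subexpressions first become linearly readable.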

Causal Interventions: Ablate neurons/heads to analyze the impact on logical subcircuits and understand the network’s causal structure.
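A minimal ablation experiment, again using a hand-wired XOR network as a stand-in for a trained model: zero-ablating the AND unit should break exactly the (1,1) input, localizing that unit's causal role.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)
W1 = np.array([[1.0, 1.0], [1.0, 1.0]]); b1 = np.array([-0.5, -1.5])
W2 = np.array([[1.0], [-1.0]]); b2 = np.array([-0.5])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([0, 1, 1, 0])

def forward(X, ablate_unit=None):
    H = step(X @ W1 + b1)
    if ablate_unit is not None:
        H[:, ablate_unit] = 0.0      # zero-ablation of one hidden neuron
    return step(H @ W2 + b2).ravel()

acc_full = (forward(X) == xor).mean()
acc_no_and = (forward(X, ablate_unit=1) == xor).mean()
print("accuracy full:", acc_full, "| AND unit ablated:", acc_no_and)
```

For transformers, the same pattern applies with attention heads: replace a head's output with zeros (or a mean over a baseline distribution) and measure the accuracy drop per logical subcircuit.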

Superposition Analysis: Explore if the network represents more logical features than dimensions when trained on multiple formulas simultaneously, considering the trade-off with formula complexity and co-occurrence statistics.
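A toy check of the superposition hypothesis: store more binary "logical features" than dimensions as random unit directions, activate them sparsely, and see how well each can be read back with a dot product. The feature counts and sparsity here are arbitrary assumptions chosen to illustrate the interference trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 40, 16                       # more features than dimensions
F = rng.normal(size=(k, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # k unit feature directions

# Sparse activations: each sample turns on ~3 of the k features.
n = 200
A = (rng.random((n, k)) < 3 / k).astype(float)
X = A @ F                           # superposed d-dimensional representation

# Read features back by projecting onto each direction and thresholding;
# errors come from interference between non-orthogonal directions.
readout = (X @ F.T > 0.5).astype(float)
acc = (readout == A).mean()
print("per-feature readback accuracy:", round(acc, 3))
```

Sweeping `k`, `d`, and the sparsity level maps out when superposition is viable, which is the trade-off with formula complexity and co-occurrence statistics mentioned above: features that rarely co-occur interfere less.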

Directions:

  • Exploring transformers on variable-length formulas, out-of-distribution generalization, and comparison to symbolic approaches.
  • Investigating how attention mechanisms implement recursive evaluation of nested formulas.
  • Visualization and Analysis: Developing detailed experimental plans, analysis techniques (activation patching, probing), and a visualization suite.
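Activation patching, one of the techniques listed above, can be sketched on the same hand-wired XOR stand-in: run a clean and a corrupted input, copy one hidden unit's activation from the clean run into the corrupted run, and see which unit restores the clean output.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)
W1 = np.array([[1.0, 1.0], [1.0, 1.0]]); b1 = np.array([-0.5, -1.5])
W2 = np.array([[1.0], [-1.0]]); b2 = np.array([-0.5])

def forward(x, patch=None):
    h = step(x @ W1 + b1)
    if patch is not None:           # patch = (unit index, value from clean run)
        unit, value = patch
        h = h.copy(); h[unit] = value
    return float(step(h @ W2 + b2)[0]), h

clean, corrupted = np.array([1.0, 0.0]), np.array([1.0, 1.0])
y_clean, h_clean = forward(clean)       # XOR(1, 0) = 1
y_corr, _ = forward(corrupted)          # XOR(1, 1) = 0
for unit in range(2):
    y_patched, _ = forward(corrupted, patch=(unit, h_clean[unit]))
    print(f"patch unit {unit}: output {y_corr} -> {y_patched}")
```

Only patching the AND unit flips the output back to the clean value, identifying it as the causally relevant node for this input pair; the same loop over layers and positions is the core of patching experiments in transformers.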