In 1986, Hinton published “Learning Distributed Representations of Concepts”, with some of its results also appearing in the famous Nature paper he co-authored with Rumelhart and Williams, “Learning Representations by Back-Propagating Errors.” This work is among the earliest, if not the earliest, to showcase in-depth visualization and interpretation of neural network internals. This post offers a casual read for anyone curious about how early researchers began to make sense of what neural networks were learning, with a focus on the family-tree experiments and my own reproduction.
How Are Concepts Encoded in Neural Networks?
Back in 1986, there were two dominant but competing ideas about how concepts might be represented in artificial neural networks: localist representation and distributed representation.
- Localist representation assigns each concept to a single node in the network.
- Distributed representation, by contrast, encodes a concept through the joint activations of many nodes.
Hinton pointed out a limitation of distributed representations as they were commonly understood at the time: a concept tends to be encoded by a fixed activation pattern, regardless of context or role.
Take John, for example. At work, John might be represented as a researcher—his role involves reading papers, training models, and debugging code. But at home, John is a father of three, and his activities shift to storytelling, playing with the kids, and helping with homework. If a network uses only one fixed pattern to represent John, it can’t easily capture these role-dependent variations.
To address this limitation, Hinton proposed a new network architecture in which a concept and its role are encoded separately by distinct sets of neurons. As a proof of concept, he trained a five-layer neural network using gradient descent to solve a family tree prediction task. He then visualized and interpreted the network’s weights, demonstrating how concepts were encoded and uncovering what individual nodes had learned within the network.
Experiment Setting
To motivate the idea of role-specific nodes, Hinton introduced the family tree prediction problem:
You’re given two isomorphic family trees—one in English, one in Italian. “Isomorphic” here means the trees share the exact same structure, but the individuals have different names in each language. For example, as shown in the figure below, Christopher is married to Penelope, and they have two children: Arthur and Victoria. Each tree contains 12 individuals and there are a total of 24 unique names across both trees.
From the family trees, we can extract 12 types of relationships: father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, and niece. Across both trees, there are 104 relationship instances—triplets like (Margaret, has_father, Christopher). There are some special cases where the third item in the triplets contains not one but two individuals. For instance, Colin has two uncles: Arthur and Charles. The relationship triplet is represented by (Colin, has_uncle, (Arthur, Charles)).
The goal of the task is to learn a representation of the family tree that enables the network to predict the third item given the first two—for example, given (Margaret, has_father, ?), the model should output Christopher.
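To make the data format concrete, here is a minimal sketch of how such triplets can be turned into network inputs and targets (the relation labels and helper function below are illustrative; the encoding in my repo may differ in detail):

```python
import torch

# Two example triplets from the text; the full dataset contains 104 of them.
# The third element is a tuple because some relations have two answers
# (e.g., Colin has two uncles).
triples = [
    ("Margaret", "has_father", ("Christopher",)),
    ("Colin", "has_uncle", ("Arthur", "Charles")),
]

# Toy index tables built from these examples; the real vocabularies cover
# all 24 names and all 12 relationship types.
names = sorted({p for p, _, _ in triples} | {t for _, _, ts in triples for t in ts})
relations = sorted({r for _, r, _ in triples})
name_idx = {n: i for i, n in enumerate(names)}
rel_idx = {r: i for i, r in enumerate(relations)}

def encode(person, relation, targets, n_names=24, n_relations=12):
    """One-hot person plus one-hot relation -> 36-dim input; multi-hot 24-dim target."""
    x = torch.zeros(n_names + n_relations)
    x[name_idx[person]] = 1.0
    x[n_names + rel_idx[relation]] = 1.0
    y = torch.zeros(n_names)
    for t in targets:
        y[name_idx[t]] = 1.0
    return x, y

x, y = encode("Colin", "has_uncle", ("Arthur", "Charles"))
```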
Network Architecture
Hinton’s neural network architecture consists of five layers:
- The input layer has 36 nodes: the first 24 represent individual names, and the remaining 12 represent relationship types.
- The first hidden layer is split into two groups of 6 nodes each. One group is fully connected to the name inputs, and the other is fully connected to the relationship inputs; this separation allows the network to encode concepts and roles independently.
- The second and third hidden layers contain 12 and 6 fully connected nodes, respectively.
- The output layer has 24 nodes, each corresponding to a name, just like the input.
The network uses the sigmoid activation function as its non-linearity. The network architecture is shown in the figure below.
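Here is a minimal PyTorch sketch of this architecture (a simplified stand-in rather than an exact copy of the implementation in my repo):

```python
import torch
import torch.nn as nn

class FamilyTreeNet(nn.Module):
    """Five-layer network: separate 6-unit encoders for the person and the
    relation, followed by two fully connected hidden layers and a 24-way output."""

    def __init__(self):
        super().__init__()
        self.person_encoder = nn.Linear(24, 6)    # 24 name inputs -> 6 units
        self.relation_encoder = nn.Linear(12, 6)  # 12 relation inputs -> 6 units
        self.hidden2 = nn.Linear(12, 12)          # concatenated encodings -> 12 units
        self.hidden3 = nn.Linear(12, 6)           # 12 -> 6 units
        self.output = nn.Linear(6, 24)            # 6 -> 24 name outputs
        self.act = nn.Sigmoid()

    def forward(self, x):
        person, relation = x[..., :24], x[..., 24:]
        h1 = torch.cat([self.act(self.person_encoder(person)),
                        self.act(self.relation_encoder(relation))], dim=-1)
        h2 = self.act(self.hidden2(h1))
        h3 = self.act(self.hidden3(h2))
        return self.act(self.output(h3))  # one probability per name

model = FamilyTreeNet()
probs = model(torch.zeros(1, 36))  # shape (1, 24)
```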
Network Visualization and Interpretation
Some of the learned weights in the network turn out to be interpretable. To explore this, Hinton visualizes the weights of the first 6 nodes in the first hidden layer across different inputs.
In the figure below, each colored line corresponds to one hidden node. The size of each square indicates the magnitude of the weight: white squares represent positive weights, while black squares indicate negative ones.
Due to space constraints, only the English names are labeled. The weights for the corresponding Italian names (from the isomorphic tree) are shown as the second line within each node’s visualization.
The visualization above reveals several interesting patterns in how the network learns.
Node 1 and Node 5 both appear to distinguish between English and Italian names—Node 1 activates for English names, while Node 5 activates for Italian ones. Although their functions seem redundant, this is a natural outcome when the network is free to learn any features that reduce training loss. Such redundancy is often encouraged by weight decay.
The model maps the original 24-dimensional one-hot encoded input into a compressed 6-dimensional representation. This dimensionality reduction forces the network to retain only the most essential information. Given the isomorphic structure of the two family trees, the simplest compression is to encode the tree origin in just a single bit.
Remarkably, without any explicit knowledge of the input semantics, the network discovers this isomorphic property on its own—effectively using just one or two nodes to indicate which tree a name belongs to.
In other nodes, we observe that English names and their Italian counterparts produce nearly identical activations, reinforcing the symmetry between the trees.
Node 2 appears to encode generational depth: members of the third generation, like Charlotte and Colin, exhibit strong negative activations.
Node 6 seems to capture family branch information. All individuals on the right half of the family tree trigger negative activations—an intriguing and consistent pattern.
As Hinton noted, gender information is not encoded by the name input nodes. This makes sense: the task doesn’t require gender to determine relationships. For example, knowing Colin’s gender isn’t necessary to identify his father.
Hinton also analyzed the relation input nodes. One standout is Node 5 (middle figure in the bottom row), which clearly encodes gender—male relations lead to negative activations, while female relations result in positive ones.
Node 3 strongly correlates with generational hierarchy: it activates positively for senior relations and negatively for younger ones.
Reproduction
I re-implemented the entire experiment using PyTorch, and the code is publicly available on GitHub.
Random Seeds
Due to the small size of the dataset, the model is highly sensitive to weight initialization. To account for this, I measured test accuracy across 50 different random seeds.
Evaluation Metric
I used MultilabelAccuracy as the main metric. Predictions with output values above 0.5 are considered positive. In the multilabel setting, accuracy is 1 only if all labels are predicted correctly (exact match).
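Concretely, the exact-match criterion boils down to the following sketch (the function name and example values are mine, not taken from the repo):

```python
import torch

def exact_match_accuracy(outputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Multilabel exact-match accuracy: an example counts as correct only if
    every label matches after thresholding the outputs at 0.5."""
    preds = (outputs > 0.5).float()
    correct = (preds == targets).all(dim=-1)  # one boolean per example
    return correct.float().mean().item()

# Toy example with 4 labels: the first row matches exactly, the second does not.
outputs = torch.tensor([[0.9, 0.1, 0.8, 0.2],
                        [0.6, 0.7, 0.1, 0.4]])
targets = torch.tensor([[1., 0., 1., 0.],
                        [1., 0., 0., 0.]])
print(exact_match_accuracy(outputs, targets))  # 0.5
```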
Results
While staying as close to the original architecture as possible, I incorporated several modern training techniques. The best configuration achieved an average accuracy of 0.67 across seeds, with 10 out of 50 runs reaching perfect accuracy (4 out of 4 test cases correct).
What Helped, What Didn’t
Below is a summary of the techniques I tested, categorized by their impact on accuracy.
Improvements:
- Adam / AdamW
- Batch Normalization
- Deeper architectures (e.g., 12+6 → 6+6 → 6 → 12 → 24 → 24): A structured expansion/reduction strategy—especially 2× reduction and expansion—proved crucial for generalization.
- Learning rate warm-up: Consistently improved test performance slightly.
Observations:
Adam and BatchNorm led to more stable training dynamics. Larger, deeper models helped, but only when designed with structure (e.g., bottleneck-style). AdamW maintained average accuracy while increasing the chance of achieving a perfect score; a sketch of the AdamW plus warm-up configuration follows after this summary.
Degradations:
- SGD
- Layer Normalization
- ReLU activations
Neutral (No Clear Impact):
- Soft labels
- Multi-label loss functions
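To make the optimizer and warm-up settings concrete, here is a rough sketch of that configuration (the model, data, and hyperparameter values below are placeholders; see the repo for the actual setup):

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model (with a BatchNorm layer) and random stand-in data with the
# right shapes; in the real experiment these are the encoded family-tree triplets.
model = nn.Sequential(nn.Linear(36, 12), nn.BatchNorm1d(12), nn.Sigmoid(),
                      nn.Linear(12, 24), nn.Sigmoid())
train_x = torch.rand(100, 36).round()
train_y = torch.rand(100, 24).round()

optimizer = AdamW(model.parameters(), lr=1e-2, weight_decay=1e-4)  # illustrative values

warmup_epochs = 10  # linear warm-up from 10% to 100% of the base learning rate
scheduler = LambdaLR(optimizer, lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))

loss_fn = nn.BCELoss()  # outputs already pass through a sigmoid

for epoch in range(500):  # illustrative epoch count; the dataset is tiny, so full-batch updates are fine
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```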
Visualizations
I visualized the first-layer weights for both name and relationship inputs. Below is a plot for the name nodes using random seed 0, where the model correctly predicted 3 out of 4 labels. More visualizations are available in the GitHub repo.
From the visualizations:
- Nodes 3 and 4 appear to specialize in distinguishing between English and Italian names.
- Nodes 5 and 6 likely encode generational information, with deeper generations triggering more negative activations.
These patterns are consistent with the original findings by Hinton, though mine are somewhat less pronounced.
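For reference, a first-layer weight plot like the one above can be produced along these lines (a simplified matplotlib sketch with random stand-in weights; the repo version labels each name and covers both input groups):

```python
import matplotlib.pyplot as plt
import torch

# Random values stand in for the trained (6 hidden nodes) x (24 name inputs)
# weight matrix of the name encoder.
weights = torch.randn(6, 24)
vmax = weights.abs().max().item()

fig, axes = plt.subplots(6, 1, figsize=(10, 6), sharex=True)
for i, ax in enumerate(axes):
    # One row per hidden node; color encodes the sign and magnitude of each weight.
    ax.imshow(weights[i].unsqueeze(0).numpy(), cmap="bwr",
              vmin=-vmax, vmax=vmax, aspect="auto")
    ax.set_yticks([])
    ax.set_ylabel(f"node {i + 1}", rotation=0, labelpad=25)
axes[-1].set_xlabel("name input index (0-23)")
plt.tight_layout()
plt.show()
```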
Final Thoughts
Hinton’s paper, while groundbreaking, can be challenging to follow due to its age and style. I hope this blog offers a clearer understanding and encourages you to dive into the original work yourself—you might notice insights I missed.
Personally, I find great satisfaction in reproducing classic experiments. These small-scale, well-designed setups offer an ideal playground for experimenting with learning rates, optimizers, architectures, and more. They showcase how thoughtful design can lead to deep insights into learning systems. My implementation is not intended to be fully polished; you’re more than welcome to build on it and try methods that improve test accuracy.
Visualizing first-layer weights is a technique still widely used in computer vision—for instance, to reveal edge detectors or color blobs in CNNs. Higher-layer weights, however, become less interpretable due to increased complexity and non-linearity. In those cases, techniques like positive example tracing, gradient-based analysis, or sparse autoencoders (in transformers) are more effective.
Perhaps the most surprising insight from this reproduction was realizing how many standard training techniques—momentum, soft labels, learning rate warm-up, stochastic gradient descent—were already explored as early as the 1980s. We often take these tools for granted today, but Hinton’s foresight made it possible to train and generalize deep models long before deep learning became mainstream.
Role-Specific Representations?
The family tree experiment was motivated by the idea of role-specific representations. Hinton argued that, to truly understand relationships, neural networks need to represent not just who an entity is, but also what role it plays in a given context. To achieve this, he introduced the notion of grouped neurons—separate subsets of neurons encoding identity and role independently.
This concept foreshadows ideas seen in his later work, such as capsule networks, which aim to disentangle object properties and their spatial relationships. In modern large language models (LLMs), a similar effect is achieved through the attention mechanism, which inherently associates entities with their contextual roles in a dynamic and flexible way.