In 1986, Hinton published “Learning Distributed Representations of Concepts”, with some of its results also appearing in his famous Nature paper, “Learning Representations by Back-Propagating Errors”. This paper is among the earliest—if not the earliest—efforts to showcase in-depth visualization and interpretation of neural network internals. This post offers a casual read for anyone curious about how early researchers began to make sense of what neural networks were learning, with a focus on the family-tree experiments and my own reproduction.

How Are Concepts Encoded in Neural Networks?

Back in 1986, there were two dominant but competing ideas about how concepts might be represented in an artificial neural network: localist representation and distributed representation.

  • Localist representation, where each concept is assigned to a single node in the network.
  • Distributed representation, where, by contrast, a concept is encoded through the joint activations of many nodes.

Hinton pointed out a limitation of distributed representations as they were commonly understood at the time: a concept tends to be encoded by a fixed activation pattern, regardless of context or role.

Take John, for example. At work, John might be represented as a researcher—his role involves reading papers, training models, and debugging code. But at home, John is a father of three, and his activities shift to storytelling, playing with the kids, and helping with homework. If a network uses only one fixed pattern to represent John, it can’t easily capture these role-dependent variations.

To address this limitation, Hinton proposed a new network architecture in which a concept and its role are encoded separately by distinct sets of neurons. As a proof of concept, he trained a five-layer neural network using gradient descent to solve a family tree prediction task. He then visualized and interpreted the network’s weights, demonstrating how concepts were encoded and uncovering what individual nodes had learned within the network.

Family Tree Problem

To motivate the idea of role-specific nodes, Hinton introduced the family tree prediction problem:

You’re given two isomorphic family trees—one in English, one in Italian. “Isomorphic” here means the trees share the exact same structure, but the individuals have different names in each tree. For example, as shown in the figure below, Christopher is married to Penelope, and they have two children: Arthur and Victoria. Each tree contains 12 individuals and there are a total of 24 unique names across both trees.

hinton-family-tree-experiment

From the family trees, we are only interested in 12 types of relationships: father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, and niece. Grandparents are great but we won’t need them for now. Across both trees, there are 104 relationship instances—triplets like (Margaret, has_father, Christopher). There are some special cases where the third item in the triplets contains not one but two individuals. For instance, Colin has two uncles: Arthur and Charles. The relationship triplet is represented by (Colin, has_uncle, (Arthur, Charles)).

The goal of the task is to learn a representation of the family tree that enables the network to predict the third item given the first two—for example, given (Margaret, has_father, ?), the model should output Christopher.
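
In code, the whole dataset boils down to a list of such triplets. The snippet below shows one possible representation (my own illustration, not necessarily the exact format used later in my reproduction):

    # Each relationship instance: (person, relation, answers).
    # Most answers contain a single name; a few, like Colin's uncles, contain two.
    TRIPLES = [
        ("Margaret", "has_father", {"Christopher"}),
        ("Colin", "has_uncle", {"Arthur", "Charles"}),
        # ... 104 instances in total across the two trees
    ]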

Network Architecture

In the paper, Hinton proposed a five-layer MLP:

The input layer has 36 nodes: the first 24 nodes represent individual names, and the remaining 12 nodes represent relationship types.

The first hidden layer is split into two groups of 6 nodes each. One group is fully connected to the name inputs, and the other is fully connected to the relationship inputs—this separation allows the network to encode concepts and roles independently.

The second and third hidden layers contain 12 and 6 fully connected nodes, respectively.

The output layer has 24 nodes, each corresponding to a name, just like the input.

The network uses the sigmoid activation function as the non-linear mechanism. A visualization of the network architecture can be found below.

hinton-family-tree-network-architecture
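
For concreteness, here is a minimal PyTorch sketch of the architecture as described above. This is my own reconstruction, not Hinton's original code; the class and attribute names are mine, and the split first hidden layer is implemented as two separate linear layers whose outputs are concatenated.

    import torch
    import torch.nn as nn

    class FamilyTreeNet(nn.Module):
        """Five-layer network from the description above: 24+12 -> 6+6 -> 12 -> 6 -> 24."""
        def __init__(self):
            super().__init__()
            self.name_enc = nn.Linear(24, 6)   # person names -> 6-unit "concept" group
            self.rel_enc = nn.Linear(12, 6)    # relationship types -> 6-unit "role" group
            self.hidden2 = nn.Linear(12, 12)   # concatenated groups -> second hidden layer
            self.hidden3 = nn.Linear(12, 6)    # second -> third hidden layer
            self.out = nn.Linear(6, 24)        # third hidden layer -> one output unit per name
            self.act = nn.Sigmoid()            # the paper's non-linearity

        def forward(self, name_onehot, rel_onehot):
            h1 = torch.cat([self.act(self.name_enc(name_onehot)),
                            self.act(self.rel_enc(rel_onehot))], dim=-1)
            h2 = self.act(self.hidden2(h1))
            h3 = self.act(self.hidden3(h2))
            return self.act(self.out(h3))      # per-name scores in (0, 1)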

Network Visualization and Interpretation

Some of the learned weights in the network turn out to be interpretable. To explore this, Hinton visualizes the weights of the first 6 nodes (connected to name nodes) in the first hidden layer across different inputs.

In the figure below, each colored line corresponds to one hidden node. The size of each square indicates the strength of activation. Color represents activation sign: white squares represent positive activation, while black squares indicate negative activation.

hinton-family-tree-weights-visualization

Due to space constraints, only the English names are displayed in the paper. The corresponding Italian names (from the isomorphic tree) are shown as the second line within each node’s visualization.

The visualization above reveals several interesting patterns in how the network learns.

Node 1 and Node 5 both appear to distinguish between English and Italian names—Node 1 activates for English names, while Node 5 activates for Italian ones. Although their functions seem redundant, this is a natural outcome when the network is free to learn any features that reduce training loss. Such redundancy is often encouraged by weight decay.

The model maps the original 24-dimensional one-hot encoded input into a compressed 6-dimensional representation. This dimensionality reduction forces the network to retain only the most essential information. Given the isomorphic structure of the two family trees, the simplest compression is to encode the tree origin in just a single bit.

Remarkably, without any explicit knowledge of the input semantics, the network discovers this isomorphic property on its own—effectively using just one or two nodes to indicate which tree a name belongs to.

In other nodes, we observe that English names and their Italian counterparts produce nearly identical activations, reinforcing the symmetry between the trees.

Node 2 appears to encode generational depth: members of the third generation, like Charlotte and Colin, exhibit strong negative activations.

Node 6 seems to capture family branch information. All individuals on the right half of the family tree trigger negative activations—an intriguing and consistent pattern.

As Hinton noted, gender information is not encoded by the name input nodes. This makes sense: the task doesn’t require gender to determine relationships. For example, knowing Colin’s gender isn’t necessary to identify his father.

Hinton also analyzed the relation input nodes. One standout is Node 5 (middle figure in the bottom row), which clearly encodes gender information—male relations lead to negative activations, while female relations result in positive ones.

Node 3 strongly correlates with generational hierarchy: it activates positively for senior relations and negatively for younger ones.

Reproduction

I reproduced the family tree experiment using PyTorch, and the code is publicly available on GitHub. Almost 40 years have passed since the paper was published, and there are many excellent off-the-shelf training components available now that simply did not exist back then. My goal is therefore not a 100% faithful reproduction but to optimize the network’s performance on the test set. In addition to the MLP model used in the paper, I experimented with transformers and was able to get them to perform almost as well as the MLPs. Code is here.

Reproducibility: The Ghost That Haunts My Mind

When you publish a result, you expect it to hold universally. How can one confidently call themselves a scientist if their findings can’t be verified by peers? And yet, reproducibility remains a daunting challenge—haunting, even. It exists on at least two levels: same-machine reproducibility and cross-machine reproducibility. The former can often be achieved through careful coding practices and controlled environments. The latter is far more elusive.

Researchers have long understood the critical role of weight initialization in training deep neural networks. Poor initialization can lead to bad local minima, undermining the entire learning process. Geoffrey Hinton, who went on to share both a Turing Award and a Nobel Prize for his work on neural networks, introduced layer-by-layer pretraining with Boltzmann machines as a solution, an approach that set the stage for effective gradient-based learning. Ironically, he later acknowledged that clever random initialization strategies can achieve similar results, rendering the complex pretraining steps unnecessary in many cases.

But therein lies the paradox: the randomness that enables deep learning also complicates reproducibility. Fixing random seeds is vital to mitigate this issue. With this and package version management, same-machine reproducibility becomes achievable—your results remain consistent across multiple runs on the same setup.
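
For reference, here is roughly what “fixing random seeds” looks like in a PyTorch project. This is a generic sketch, not the exact setup in my repo:

    import random
    import numpy as np
    import torch

    def seed_everything(seed=42):
        """Pin every source of randomness this kind of project typically touches."""
        random.seed(seed)            # Python's built-in RNG
        np.random.seed(seed)         # NumPy RNG
        torch.manual_seed(seed)      # PyTorch RNGs (CPU and, if present, CUDA)
        # Prefer deterministic kernels where PyTorch provides them.
        torch.use_deterministic_algorithms(True, warn_only=True)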

Cross-machine reproducibility, however, is a different beast. In my own work, I’ve run identical code in identical Python environments on different machines, only to find mismatched results. I train exclusively on CPUs without multi-threaded data loading. Yet discrepancies persist. The culprit? Different CPUs often rely on different math acceleration libraries, each with subtle differences in implementation. These low-level variations can be enough to break reproducibility.

As a quick example of the discrepancy caused by different low-level libraries, I ran the following code on my MacBook Air M4 2025 and on Ubuntu 24.04.03 LTS with an Intel i7-12700K CPU, both with the same Python environment.

    import torch

    torch.manual_seed(42)                  # same seed, so both machines draw the same matrix
    x = torch.randn(10000, 10000)
    y = torch.mm(x, x.t())                 # large matmul dispatched to the CPU's BLAS backend
    print(y.sum().item())                  # the reduced sum differs across backends

I got 99229240.0 on my Mac and 99229296.0 on my Ubuntu machine, a relative difference of about 5.64e-7. Although negligible in most cases, this is enough to make the accuracy numbers differ across machines.

Digging into the rabbit hole of reproducibility might seem a bit crazy. I remember seeing news stories as a kid—scientists taking their own lives because their findings couldn’t be replicated. Maybe that’s when the obsession with reproducibility took root in me. Whether this fixation is necessary or just a burden, I honestly don’t know.

MLP Model

Due to the small size of the dataset, the model is especially sensitive to weight initialization. To account for this, I evaluated test-set performance across 50 different random seeds. Another design choice I made is to randomly reshuffle the dataset for each random seed so that we are not overfitting to a specific train/test split; although this adds to the complexity of reproduction, it makes the experiments more rigorous.

Remember that Colin has two uncles, so family tree prediction is a multi-label classification problem. I used MultilabelAccuracy as the main metric. Predictions with output values above 0.5 are considered positive. Accuracy is 1 if and only if all labels are predicted correctly (exact match).
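
Concretely, the exact-match criterion can be computed in a few lines of plain PyTorch. This is a sketch of the metric’s definition rather than the exact library call:

    import torch

    def exact_match_accuracy(outputs, targets):
        """Fraction of examples whose full 24-way label vector is predicted correctly."""
        preds = (outputs > 0.5).float()        # threshold the sigmoid outputs at 0.5
        hits = (preds == targets).all(dim=-1)  # every one of the 24 name slots must match
        return hits.float().mean().item()

    # For (Colin, has_uncle, (Arthur, Charles)), both uncle slots must be above 0.5
    # and all other names below it for the example to count as correct.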

The MLP model I trained has 5 layers: 24+12 → 6+6 → 6 → 12 → 32 → 24. It achieves an average accuracy of 0.725, with 20 out of 50 runs reaching perfect accuracy (4 out of 4 correct).

Below is a summary of the techniques I tested, categorized by their impact on accuracy.

Improvements:

  • Adam / AdamW: Adam performs much better than SGD at quickly reducing the training loss, and Adam with weight decay helps generalization. In all my experiments the training accuracy reaches 1, so the real problem lies in reducing overfitting.
  • Batch Normalization: a proven way to accelerate training.
  • Deeper architectures: a structured expansion/reduction strategy—especially 2× reduction and expansion—proved crucial for generalization.
  • Learning rate warm-up: Consistently improved test performance by a small margin.
  • Gradient clipping: like learning rate warm-up, another universal trick that slightly improves test accuracy.
  • Hard labels: it is counter-intuitive, but hard labels (1.1 and -0.1 instead of 1 and 0) actually help with test accuracy. The MSE loss we use is satisfied once the outputs reach 0 or 1, but a sigmoid output can never reach 1.1 or -0.1, so the loss keeps pushing positive and negative outputs apart and the model learns a larger margin between them (see the sketch after this list).
  • Tweaking the learning rate and number of epochs: it’s dirty work, but both need careful tuning. A parameter sweep would definitely help, but I won’t do one for now.
  • Bug fixing: leaving bias=True on a linear layer that feeds into a batch norm layer is an easy-to-make but hard-to-catch bug; the bias is redundant because batch norm subtracts the mean anyway (also shown in the sketch below).
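
Here is a minimal sketch of the last two points, hard labels for the MSE loss and bias=False in front of batch norm. The layer sizes and helper names are illustrative, not my exact setup:

    import torch
    import torch.nn as nn

    # Hard labels: ask the sigmoid outputs for values they can never reach, so the
    # MSE loss keeps pushing positives and negatives apart instead of stalling at 0/1.
    POS_LABEL, NEG_LABEL = 1.1, -0.1

    def make_targets(onehot):
        # onehot: (batch, 24) tensor with 1s at the correct names and 0s elsewhere
        return onehot * (POS_LABEL - NEG_LABEL) + NEG_LABEL

    targets = make_targets(torch.eye(24)[:2])   # two example rows of 1.1 / -0.1 targets

    # bias=False in front of BatchNorm1d: batch norm subtracts the mean anyway, so a
    # linear bias before it is redundant and easy to leave in by accident.
    block = nn.Sequential(
        nn.Linear(12, 24, bias=False),
        nn.BatchNorm1d(24),
        nn.Sigmoid(),
    )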

Degradations:

  • ReLU activation: Although ReLU has been widely used in training deep models, sigmoid still performs way better in this special case.
  • Custom weight initialization: I tried Xavier initialization and uniform initialization over ranges like (-0.3, 0.3), (-0.1, 0.1), and (-0.5, 0.5). Basically, anything other than the default initialization of the linear layer didn’t work.

Neutral (No Clear Impact):

  • Soft labels
  • Multi-label loss functions

Transformer Architecture

If we formulate the family tree prediction problem as a sequential prediction task, we can use decoder transformers (the GPT style) to tackle it. We can feed text inputs into an LLM and likely get very high accuracy. For this experiment, we’ll train a small transformer network from scratch to better align with the MLP setting.
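
For concreteness, a triplet can be encoded as a short token sequence over a shared vocabulary of the 24 names and 12 relations. The snippet below is an illustrative assumption about the encoding, not necessarily what my repo does:

    # Only a few vocabulary entries are shown here; the full lists have 24 names and 12 relations.
    NAMES = ["Christopher", "Penelope", "Arthur", "Victoria", "Colin", "Charlotte"]
    RELATIONS = ["has_father", "has_mother", "has_uncle", "has_aunt"]
    TOKEN_ID = {tok: i for i, tok in enumerate(NAMES + RELATIONS)}

    def encode(person, relation):
        """Turn a query into a 2-token prompt; the decoder should continue with the answer name(s)."""
        return [TOKEN_ID[person], TOKEN_ID[relation]]

    prompt = encode("Colin", "has_uncle")   # the model should continue with Arthur (and Charles)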

I borrowed a lot of code from nano-GPT. An initial version got 0.35 average accuracy. Not bad for a first try, but I had higher expectations for transformers. Most of the training tricks for the MLP also apply to the transformer: learning rate warm-up, gradient clipping, AdamW, learning rate/epoch tweaking, and so on. When I tried to tweak the network architecture itself, especially trying out new non-linearities, normalization, and other hyperparameters like the expansion factor, all trials failed badly. Transformers are not popular without reason: these network components are really put together in an “optimal” way, and changing anything inside breaks the whole system. I then restricted myself to changing only the following hyperparameters:

    n_layer: int
    n_head: int
    n_embd: int
    dropout: float

Parameter sweeping now comes in handy. I tried both wandb’s Bayesian search and a grid-search script that I vibe-coded. Below are the hyperparameters that worked best:

    n_layer: int = 2
    n_head: int = 14
    n_embd: int = 112
    dropout: float = 0.2

I gained the following insights from the parameter sweep, copied and translated from my training log:

From the analysis, n_head is an important factor: as long as it is high, the results won’t be poor. For n_layer, two layers are sufficient; more layers are fine but cause overfitting. n_embd_factor also needs to be high. Even with only two layers, this network already has 10 linear layers: each transformer layer contains one attention block and one MLP block, each of which has two linear layers, and on top of that the input has an embedding layer and the output has a final linear layer, for 10 in total. Dropout also needs to be sufficiently high to counteract overfitting.

A high n_head indeed has a suppressive effect on overfitting. I observed early on that multi-head attention is actually an ensemble mechanism, where each head is trained independently. Random initialization is also crucial for overfitting: initializing close to the global optimum results in less overfitting, otherwise overfitting becomes severe. Ensembling is an important way to mitigate overfitting: although the parameter count appears to increase at first glance, overfitting actually decreases. Therefore, while “more layers lead to more overfitting” holds, “more parameters lead to more overfitting” does not necessarily hold. So transformer networks need to grow not only taller but also wider: growing wider helps fight overfitting, while growing taller helps fight underfitting. By growing both taller and wider at the same time, the network becomes strong and robust.

It also seems that transformers are very sensitive to changes: a small change in any parameter leads to worse generalization.

Encoder Transformer

If we stick to the classification scenario, we can instead use an encoder transformer (the BERT style). The only differences are getting rid of the attention mask, prepending a <CLS> token to the input, and comparing its output to the target.

I only managed to get around 0.4 accuracy with the encoder model. The biggest difference, I suppose, is that the decoder model computes the loss on both the second input token and the output, while the encoder model does so only on the output. The training signal provided by the second input, although noisy, can actually make the model learn better.
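
To make the difference in training signal concrete, here is a toy comparison of where the loss is applied; the logits and token ids are made up purely for illustration:

    import torch
    import torch.nn.functional as F

    vocab_size = 36                               # 24 names + 12 relations
    torch.manual_seed(0)

    # Decoder (GPT-style): every position predicts the next token, so both the
    # relation token and the answer name contribute to the loss.
    decoder_logits = torch.randn(2, vocab_size)   # predictions after "Margaret" and after "has_father"
    decoder_targets = torch.tensor([25, 0])       # made-up ids for "has_father" and "Christopher"
    decoder_loss = F.cross_entropy(decoder_logits, decoder_targets)

    # Encoder (BERT-style): only the <CLS> position is supervised, so the answer
    # name is the only training signal.
    cls_logits = torch.randn(1, vocab_size)
    cls_target = torch.tensor([0])                # made-up id for "Christopher"
    encoder_loss = F.cross_entropy(cls_logits, cls_target)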

Visualizations

I visualized the first-layer weights of the MLP model for both name and relationship inputs. Below is a plot for the name nodes using random seed 0, where the model correctly predicted 3 out of 4 labels. More visualizations are available in the GitHub repo.

my-reproduction-weight-visualization-random-seed-0
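
A plot like the one above can be produced along these lines; this is a rough matplotlib sketch, and the layer and variable names are illustrative rather than the exact ones in my repo:

    import matplotlib.pyplot as plt

    def plot_name_weights(weight, names):
        """Show a (6 hidden nodes x 24 names) first-layer weight matrix as a gray grid."""
        w = weight.detach().cpu().numpy()              # weight: tensor of shape (6, 24)
        fig, ax = plt.subplots(figsize=(10, 3))
        im = ax.imshow(w, cmap="gray", aspect="auto")  # lighter ~ positive, darker ~ negative
        ax.set_yticks(range(w.shape[0]))
        ax.set_yticklabels([f"node {i + 1}" for i in range(w.shape[0])])
        ax.set_xticks(range(len(names)))
        ax.set_xticklabels(names, rotation=90)
        fig.colorbar(im, ax=ax)
        plt.tight_layout()
        plt.show()

    # e.g. plot_name_weights(model.name_enc.weight, NAMES)   # names here are illustrative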

From the visualizations:

  • Nodes 3 and 4 appear to specialize in distinguishing between English and Italian names.
  • Nodes 5 and 6 likely encode generational information, with deeper generations triggering more negative activations.

These patterns are consistent with the original findings by Hinton, though mine are somewhat less pronounced.

Final Thoughts

Hinton’s paper, while groundbreaking, can be challenging to follow due to its age and style. I hope this blog offers a clearer understanding. I encourage you to dive into the original work yourself—you might notice insights I missed.

Personally, I find great satisfaction in reproducing classic experiments. These small-scale, well-designed setups offer an ideal playground for experimenting with learning rates, optimizers, architectures, and more. They showcase how thoughtful experiment design can lead to deep insights into learning systems. My implementation is not intended to be fully polished. You’re more than welcome to build on top of it to try methods that can improve the testing accuracy.

Visualizing first-layer weights is a technique still widely used in computer vision—for instance, to reveal edge detectors or color blobs in CNNs. Higher-layer weights, however, become less interpretable due to increased complexity and non-linearity. In those cases, techniques like maximum-activating input images, gradient-modified input images (together with some prior to make images look natural), or sparse autoencoders (in transformers) are more effective.

Perhaps the most surprising insight from this reproduction was realizing how many standard training techniques—momentum, soft labels, learning rate warm-up, stochastic gradient descent—were already proposed and explored as early as the 1980s. We often take these tools for granted today, but Hinton’s focus on these training techniques paved the road for the wide adoption of artificial neural networks long before deep learning became mainstream.

Role-Specific Representations?

The family tree experiment was motivated by the idea of role-specific representations. Hinton argued that, to truly understand relationships, neural networks need to represent not just who an entity is, but also what role it plays in a given context. To achieve this, he introduced the notion of grouped neurons—separate subsets of neurons encoding identity and role independently.

This concept foreshadows ideas seen in his later work, such as capsule networks, which aim to disentangle object properties and their spatial relationships. In my opinion, modern large language models (LLMs) achieved a similar effect through the attention mechanism, which inherently associates entities with their contextual roles in a dynamic and flexible way.

Inductive Bias

Different models are designed to implement different inductive biases and thus suit different tasks. In a later paper (Learning Distributed Representations of Concepts Using Linear Relational Embedding), Hinton admits that the MLP models struggle to find an optimal solution and instead turns to a method called Linear Relational Embedding (LRE). The key idea is to represent concepts as vectors, binary relations as matrices, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept. A representation for concepts and relations is learned by maximizing an appropriate discriminative goodness function using gradient ascent.
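
A sketch of the core LRE operation looks like this; the dimensions and variable names are my own illustration, not taken from the paper:

    import torch

    d = 6                                    # embedding dimension (illustrative)
    concepts = torch.randn(24, d)            # one learned vector per person
    relations = torch.randn(12, d, d)        # one learned matrix per relation type

    def apply_relation(person_id, relation_id):
        """Relation matrix times concept vector ~ the vector of the related concept."""
        return relations[relation_id] @ concepts[person_id]

    # The predicted person is the concept whose vector lies closest to this product;
    # in the paper the representation is learned by maximizing a discriminative
    # goodness function with gradient ascent.
    pred = apply_relation(0, 0)
    answer = (concepts - pred).pow(2).sum(dim=-1).argmin()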

Is It Worth the Effort?

A question may arise as to why one would spend so much effort on a toy experiment that nobody cares about today. Growing up, I was taught the story of Da Vinci drawing eggs:

When young Leonardo was apprenticed to Andrea del Verrocchio, his master made him draw hundreds of eggs from different angles to teach him about form, light, and shadow. Verrocchio told him that no two eggs are identical and that each viewing angle reveals something new.

This story might be apocryphal, but I believe it captures something important in artistic education as well as in model training: it is through repeated simple experiments that one acquires intuition and knowledge of some fundamental truth.