Introduction
Transformer networks (Vaswani et al. 2017) have a remarkable ability to capture complex patterns and structures in their training data. Understanding how these neural networks work is not only an inspiring research problem, but also a practical necessity given their widespread deployment to millions of people.
Transformer models consist of two major components: attention layers and MLP layers. While significant progress has been made in understanding attention layers, such as the work on Transformer circuits by Elhage et al. (2021), our understanding of MLP layers remains limited.
Interestingly, MLP layers are one of the few places in transformer networks where a privileged basis can be found (Elhage et al. 2021): the model favors these basis directions because the non-linearity is applied pointwise along them. The basis directions are referred to as neurons, and the outputs of the activation function (see Figure 1) are referred to as neuron activations. Neural networks tend to use neurons to represent important features (Karpathy, Johnson, and Fei-Fei 2015; Geva et al. 2020, 2022), making them a good starting point for understanding transformers.
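As a concrete illustration, the sketch below shows one way to read these activation-function outputs from a single layer, assuming the Hugging Face transformers implementation of Llama-3 (where each decoder layer exposes an mlp.act_fn module); the model name and layer index are just examples, and module paths may differ in other codebases.

```python
# Minimal sketch: read the activation-function outputs ("neuron activations")
# of one Llama-3 MLP layer with a forward hook. Assumes the Hugging Face
# `transformers` Llama implementation, where each decoder layer exposes
# `mlp.act_fn`; module paths may differ elsewhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}

def hook(module, inputs, output):
    # `output` is act_fn(gate_proj(x)): one value per MLP neuron per token.
    captured["acts"] = output.detach()

layer_idx = 15  # any of the 32 layers
handle = model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(hook)

with torch.no_grad():
    batch = tokenizer("The pain finally subsided.", return_tensors="pt")
    model(**batch)

handle.remove()
print(captured["acts"].shape)  # (batch, sequence_length, 14336)
```

The tensor captured by the hook has one value per MLP neuron per token; those per-token values are exactly what the dataset ranks snippets by.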
In this work, we release a dataset of text snippets that strongly activate MLP neurons in the Llama-3-8B model. We chose the Llama-3-8B model for its strong evaluation performance and real-world usefulness.
We show examples of meaningful features discoverable with the dataset, and expect that many more can be found. We also anticipate that automated systems built on LLMs could greatly help uncover features from the dataset, as shown by Bills et al. (2023) and Bricken et al. (2023). By open-sourcing our work, we enable others to easily create similar datasets for other transformer models. To facilitate exploration of Llama-3 features, we provide a simple web interface at https://neuralblog.github.io/llama3-neurons/neuron_viewer.html.
Dataset
Overview. The dataset contains more than 14 million text snippets drawn from the FineWeb-Edu dataset (Guilherme et al. 2024). There are 32 snippets for each of the 458,752 MLP neurons in the Llama-3-8B model. Each snippet is 64 tokens long and strongly activates the corresponding neuron at the token in the middle of the snippet. See examples here and here. We describe how we constructed the dataset in the next section.
Open access. The data is freely available as the llama3-8b-mlp-neurons dataset on Hugging Face. Please note that we do not claim any copyright over the text snippets.
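For convenience, here is a minimal sketch of loading the snippets with the datasets library. The repo id below is a guess based on the dataset name and the columns are not spelled out here, so consult the dataset card for the actual identifiers and schema.

```python
# Minimal sketch of loading the snippet dataset with the `datasets` library.
# The repo id is hypothetical (inferred from the dataset name); inspect the
# dataset card on Hugging Face for the real identifier and column names.
from datasets import load_dataset

ds = load_dataset("neuralblog/llama3-8b-mlp-neurons", split="train")  # hypothetical repo id
print(ds)      # shows the actual columns and number of rows
print(ds[0])   # e.g. a layer index, a neuron index, and one 64-token snippet
```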
Example features. We found interesting features at all layers of the network. At lower layers, we discovered neurons triggered by a single word or subword, such as here and here. In higher layers, we observed neurons activated by more abstract concepts. For example, a neuron in layer 15 is triggered by the idea of “something being removed or relieved to alleviate a negative situation.” Another neuron in layer 24 activates when discussing highly intelligent, smart, and successful individuals. Interestingly, we found a neuron that is highly active when encountering a broken word.
Use the neuron viewer to explore other neurons by changing the layer and neuron index.
Method
We used the open-weight Llama-3-8B base model from Meta, a powerful model that can run on a single GPU. Since Meta did not release its training and evaluation data, we instead used the recently released FineWeb-Edu dataset (Guilherme et al. 2024), specifically the sample-10BT subset, which provides approximately 10 billion tokens.
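For reference, below is a minimal sketch of streaming that subset and cutting 128-token segments from its documents, as described in the next paragraph. The hub id HuggingFaceFW/fineweb-edu with the sample-10BT configuration is assumed to be the standard public one, the sampling strategy is one plausible choice rather than our exact pipeline, and tokenizer stands for the Llama-3-8B tokenizer loaded elsewhere.

```python
# Minimal sketch: stream the FineWeb-Edu sample-10BT subset and cut random
# 128-token segments from its documents. The hub id and configuration name
# are assumed to be the standard public ones; `tokenizer` is the Llama-3-8B
# tokenizer loaded elsewhere.
import random
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

def random_segments(tokenizer, seg_len=128):
    for doc in fineweb:
        ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
        if len(ids) < seg_len:
            continue  # skip documents shorter than one segment
        start = random.randrange(len(ids) - seg_len + 1)
        yield ids[start : start + seg_len]
```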
To collect neuron activations, we randomly sampled 128-token segments from the dataset and fed them into the network, recording the MLP neuron activations over the last 64 tokens of each segment. To reduce memory and disk usage during data collection, we kept, for each neuron, only the top 32 snippets whose 96th token produced the highest activation. Additionally, we stored only the position of each sequence in the source data rather than the sequence itself, which also significantly reduced memory usage.
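The per-neuron bookkeeping can be done with a running top-k buffer. The sketch below is an illustrative reconstruction under the assumptions above, not our exact collection code: activations from all layers are flattened into one vector per segment, only the activation at the tracked token position is considered, and only the segment index is stored so that the text can be looked up later.

```python
# Minimal sketch of per-neuron top-k bookkeeping: for every neuron, keep the
# K highest activations seen so far at the tracked token position, together
# with the index of the segment that produced them, so the snippet text can
# be recovered later from the segment's position alone.
import torch

NUM_NEURONS = 32 * 14336  # 458,752 MLP neurons in Llama-3-8B
K = 32

top_acts = torch.full((NUM_NEURONS, K), float("-inf"))
top_segments = torch.full((NUM_NEURONS, K), -1, dtype=torch.long)

def update(acts_at_position: torch.Tensor, segment_index: int):
    """acts_at_position: (NUM_NEURONS,) activations at the tracked token."""
    global top_acts, top_segments
    acts = acts_at_position.detach().float().to(top_acts.device)
    # Append the new candidate as an extra column, then keep the best K per neuron.
    cand_acts = torch.cat([top_acts, acts[:, None]], dim=1)
    cand_segs = torch.cat(
        [top_segments,
         torch.full((NUM_NEURONS, 1), segment_index, dtype=torch.long)],
        dim=1,
    )
    top_acts, idx = cand_acts.topk(K, dim=1)
    top_segments = torch.gather(cand_segs, 1, idx)
```

After the full pass, top_segments[n] identifies the 32 segments whose tracked token most strongly activates neuron n, and the corresponding snippets can be re-extracted from the source data.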
Note that the Llama-3-8B model has 32 layers, and each layer contains 14,336 MLP neurons, resulting in a total of 458,752 neurons in the entire network.
In total, we fed 4 million examples into the model and collected the top 32 examples for each MLP neuron across the network. The entire process took approximately 12 hours to complete on an A100 GPU with 48GB of RAM.
In the end, we obtained 32 snippets for each neuron, each 64 tokens long (32 prefix tokens, the token with the high activation, and 31 suffix tokens). The result is a dataset of more than 14 million text snippets, each of which strongly activates a specific MLP neuron at the token in the middle of the snippet.
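Concretely, given a stored segment and the index of its activating token, the snippet is a fixed 64-token window around that token; a small sketch (the helper name is ours):

```python
# Minimal sketch: recover a 64-token snippet around the activating token
# (32 prefix tokens, the activating token, 31 suffix tokens).
def snippet_around(token_ids, center):
    # `center` is the index of the activating token within the segment.
    assert 32 <= center and center + 32 <= len(token_ids)
    return token_ids[center - 32 : center + 32]  # 64 tokens; activating token at offset 32
```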
Conclusion
We hope the release of a dataset of text snippets that strongly activate MLP neurons in the Llama-3-8B model will facilitate research into understanding real-world large language models. We also expect that using this dataset together with LLM assistance can greatly extend our understanding of Llama-3 neurons. We encourage everyone to visit our neuron viewer page and explore the neurons themselves, which can greatly improve one's intuition about how these models work internally.
Citation
@misc{nguyen2024neurons,
author = {Nguyễn, Thông},
title = {Llama-3-8B {MLP} {Neurons}},
date = {2024-06-09},
url = {https://neuralblog.github.io/llama3-neurons},
langid = {en},
abstract = {We release a dataset of text snippets that strongly
activate MLP neurons in the Llama-3-8B model. Using this dataset, we
demonstrate examples of meaningful features that can be discovered.
We open-source our work, enabling others to easily create similar
datasets for other transformer models and uncover important
features. We also create a simple web interface to facilitate the
exploration of MLP neurons.}
}