Engineering Monosemanticity in Toy Models

11/16/2022
by Adam S. Jermyn, et al.

In some neural networks, individual neurons correspond to natural “features” in the input. Such monosemantic neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss simply by changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.
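The role of negative biases can be illustrated with a minimal NumPy sketch of a ReLU hidden layer. This is not the paper's model: the dimensions, weight scales, and the bias value of -0.2 are assumptions chosen only to show the qualitative effect, namely that a moderate negative bias silences weakly-driven neurons so that fewer neurons respond to each feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): k sparse input features,
# m hidden neurons, one ReLU hidden layer with a shared bias.
k, m = 6, 12
W_in = rng.normal(scale=0.3, size=(m, k))  # random encoder weights

def hidden(x, bias):
    """ReLU hidden activations for input x with a shared scalar bias."""
    return np.maximum(0.0, W_in @ x + bias)

# A sparse input: a single active feature.
x = np.zeros(k)
x[2] = 1.0

h_zero_bias = hidden(x, bias=0.0)
h_neg_bias = hidden(x, bias=-0.2)  # moderate negative bias

# Any neuron still active under the negative bias must have a
# pre-activation above 0.2, so it is also active at zero bias:
# the negative bias can only sparsify the response to each feature.
n_zero = int((h_zero_bias > 0).sum())
n_neg = int((h_neg_bias > 0).sum())
print(n_neg <= n_zero)  # → True
```

A sparser per-feature response is the intuition for why such minima tend to be more monosemantic: each feature drives fewer neurons, so each neuron is more likely to represent a single feature.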

