Multiple widely used neural networks are composed of two parts: the first part projects data points into another space, and the second part performs regression or classification on this space. By transforming raw data features into another, potentially more tractable space, deep learning models have recently shown potential in many areas, ranging from dialogue systems (vinyals2015neural; lopez2017alexa; chen2017survey) and medical image analysis (kononenko2001machine; ker2017deep; erickson2017machine; litjens2017survey; razzak2018deep; bakator2018deep) to robotics (peters2003reinforcement; kober2013reinforcement; pierson2017deep; sunderhauf2018limits).
One major challenge in deep learning is to better model intra- and inter-sample structures for complex data features. Recent works often model the intra-sample structure by considering the order and adjacency of the input features, for instance, positional encoding for texts/speech in Transformers (vaswani2017attention) and kernel width for images in convolutional neural networks (lecun2015deep). Regarding the inter-sample structure, the literature often assumes that a dataset can be represented in a continuous space and that an interpolation of two embeddings might be meaningful (bowman2016generating; chen2016infogan), while the data might be naturally discrete (NIPS2017_7a98af17). Moreover, mainstream approaches rely on a not fully transparent optimization function to reorganize the space. Therefore, in this work, we explore how to explicitly and dynamically rearrange the space (inter-sample structure) by leveraging the intra-sample structures.
Inspired by Atomic Physics, where the atom is the smallest unit of matter and is discretely distributed, we propose to model a data point as an atom. As illustrated on the left of Figure 2, an atom in the Bohr model (bohr1913constitution), a concept often adopted in Physics (halliday2013fundamentals) and Chemistry (brown2009chemistry), contains a dense nucleus, composed of positively charged protons and uncharged neutrons, surrounded by negatively charged electrons orbiting at a nucleus radius. Further, multiple atoms experience interatomic forces, composed of attractive and repulsive forces, that keep the atoms at a non-zero distance from each other. Such interatomic forces are also the reason atoms form the molecules, crystals and metals we observe in daily life.
In this paper, we propose Atom Modeling, a science- and theory-based method that explicitly models the intra-sample relation via atomic structure and the inter-sample relation via interatomic forces. Specifically, we consider a data point as an atom and let a model automatically learn the mapping of each component in a data point to an electron, a proton, or a neutron. We then estimate interatomic forces with the learned subatomic particles, nucleus radius and atomic spacing. Finally, the model is optimized to minimize the potential energy induced by the interatomic forces and to maintain the balance of total charges and the number of electrons, protons, and neutrons. We find this method not only effective, but also easy to implement in tens of lines for any model architecture.
We validate the effects of Atom Modeling on synthetic data and real data in the domains of text and image classification, as well as on convolutional neural networks and transformers. The empirical results show that Atom Modeling can consistently improve the performance across data amounts, domains, and output complexity. The analyses demonstrate that Atom Modeling can capture intra-sample structures with interpretable meanings of subatomic particles in a data point, while forming an inter-sample structure that increases the model expressivity.
Our contributions are:
We propose to look into the problem of discrete representation in deep learning models.
We propose Atom Modeling, a simple method of Atomic Physics for machine learning where the distances among data points (inter-sample structure) naturally depend on their intra-sample structure.
We empirically demonstrate that Atom Modeling can improve the performance across different setups and provides an interpretable atomic structure of a data point.
We are motivated by a property of the hidden layer of a neural network, and the property of the naturally existing atoms.
2.1 Discrete Representation
A neural network can be represented as a composition of functions , where each function is one of its layers. When a neural network is seen as two groups of functions , the first half encodes the input into a hidden space and the second half transforms the latent into the output space.
We consider a situation in which the model capacity of the second half is fixed. The latent space projected by the first-half functions is hence important to the final output of the model. For easier mathematical description, we denote a quantification of the simplicity of the encoded hidden space as . The whole model capacity is therefore bounded by:
Our intuition is that if the distance and the shape of two classes are hard to separate, the simplicity is small. Then the second half functions with a model capacity equal to a linear function cannot separate them. An example is shown in Figure 1. If the points’ distances are larger, they can be split by the same linear function. That is, the space simplicity can be promoted by a more discrete latent space, so one of the bounds of the model capacity can be improved.
However, if we only naively increase the distances among all the points, the space will be enlarged without bound and the relative positions might not change much. What is desired is a method that focuses more on separating points that are nearby but have different properties (e.g., the lower left side in the two plots of Figure 1).
2.2 Atomic Physics
The atom is a unit in nature that composes molecules, crystals, and metals and helps determine their properties (halliday2013fundamentals; brown2009chemistry). An atom consists of three types of subatomic particles: electrons, protons, and neutrons. Among them, an electron has a negative charge, a proton has a positive charge, and a neutron has no charge. In terms of their masses, scientists have empirically measured that an electron has approximately 1/1836 the mass of a proton and that a proton is slightly lighter than a neutron (mohr2008codata).
The atomic structure of a Bohr model (bohr1913constitution) is shown in Figure 2, where the protons and neutrons form a nucleus that occupies a small volume of the atom; the electrons orbit around the nucleus at a nucleus radius. Such atoms have two primary types of forces: intraatomic and interatomic forces. The intraatomic forces include the strong nuclear force that binds protons and neutrons into a nucleus, while the interatomic forces bind two atoms and prevent them from having zero distance.
The major motivating property of atoms is that the distance between two atoms is related to their atomic structures, such as the number of protons and electrons. As shown in Figure 2(c), the intuition is that only atoms that are already close to each other and have similar structures, e.g., they are all positively charged, will be pushed apart.
This property can be desirable for us to discretize data representation in neural networks. In this work, we (1) assume that there is intraatomic force in the mapping from a data point to the atomic structure and (2) borrow the idea of the interatomic forces to explicitly model the inter-sample structure.
3 Atom Modeling
Our proposed method includes two parts: the intra-sample structure modeling and the inter-sample structure regulation.
As illustrated in Figure 2, we regard each component in a data point as a subatomic particle. The mapping from each component to a subatomic particle forms the intra-sample structures, and these intra-sample structures are jointly learned with the dynamic distance function among data points, i.e., the inter-sample structures. Atom Modeling has the property of increasing the distance among data points with similar intra-sample structures while not affecting data points with different intra-sample structures. We provide the proof in the Appendix.
3.1 Intra-Sample Structure
A simple atomic structure (bohr1913constitution) includes three subatomic particles, protons, electrons, and neutrons. This structure has several properties, such as charges, masses, nucleus radius and the distances between two particles. We will show that these properties can be simply determined by a learnable charge associated with each component in a data point.
We assign a charge, , to each component of a data point. We set the range of to be [-1, 1] since, in nature, the charges of the electron, neutron, and proton are respectively -1, 0, and +1 (in units of the elementary charge). To obtain in a neural network, we first assume a model encodes each component of a data point into an -dimensional embedding in one of its hidden layers (the -th layer in Section 2.1). Then, without additional parameters, we can transform an arbitrary dimension of , denoted as , into the charge by , where indicates the Sigmoid function.
The empirical results in Atomic Physics show that electrons have very small masses, while the masses of protons and neutrons are at a similar level (mohr2008codata). Following these results, we approximate the mass of each component in a data point as shown in Figure 2(a), i.e., the mass is about one when the charge is larger than or equal to 0, and the mass is linearly reduced to zero when the charge is negative. Our analogy of mass is mathematically defined as .
An atomic nucleus is composed of protons and neutrons, which occupy nearly all the mass of an atom. Therefore, we approximate the position of the atomic nucleus by weighted average of each subatomic particle with its mass as the weight. The position of the atomic nucleus is then formulated as , where is the -th component of a data point (atom) , and is the number of components. is the position of each component in the embedding space. Without additional parameters, we take the remaining dimensions of (without ) as the position in the embedding space of each component in a data point, so .
Since the outermost particles of an atom are electrons, we heuristically approximate the nucleus radius as the average distance from electrons to the nucleus. The nucleus radius is hence formulated as , where denotes -norm.
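The quantities described in this subsection (charge, mass, nucleus position, nucleus radius) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the rescaling `2 * sigmoid(h) - 1`, the piecewise-linear `mass`, and the weighting of components by how negative their charge is when averaging electron distances are our assumptions, chosen to match the verbal description above. All function names are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def charge(h0):
    # Assumption: rescale the Sigmoid output from (0, 1) to (-1, 1).
    return 2.0 * sigmoid(h0) - 1.0


def mass(q):
    # About 1 for q >= 0; falls linearly to 0 as q goes from 0 to -1.
    return min(1.0, 1.0 + q)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nucleus_position(positions, masses):
    # Mass-weighted average of the component positions.
    total = sum(masses) or 1.0  # guard: an atom made only of electrons
    dim = len(positions[0])
    return [sum(m * p[d] for p, m in zip(positions, masses)) / total
            for d in range(dim)]


def nucleus_radius(positions, charges, center):
    # Assumption: a component counts as an electron in proportion to how
    # negative its charge is; the radius is the weighted mean distance.
    weights = [max(0.0, -q) for q in charges]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * euclidean(p, center)
               for p, w in zip(positions, weights)) / total


def atomize(embeddings):
    # First dimension -> charge; remaining dimensions -> position.
    charges = [charge(e[0]) for e in embeddings]
    positions = [list(e[1:]) for e in embeddings]
    masses = [mass(q) for q in charges]
    center = nucleus_position(positions, masses)
    radius = nucleus_radius(positions, charges, center)
    return charges, masses, center, radius
```

Using the first embedding dimension for the charge and the rest for the position keeps the sketch parameter-free, in line with the text.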
Considering the geometry of atomic structure, as illustrated in Figure 2(b), we approximate the distance between a pair of particles in different atoms (atoms , ) by if they are homoelectric (, where and ). If they are heteroelectric (), the distance is approximated by . The nucleus positions and nucleus radii with subscripts 1 and 2 correspond to atoms and respectively.
Balance of Charges and Number of Particles.
Moreover, we follow the findings in Atomic Physics that an atom tends to be electrically neutral for stability. Meanwhile, the number of neutrons is usually about the same as the number of protons. Therefore, we propose two losses, and , to regularize the charges and number of neutrons.
The idea is that is the mean square error between the total charge and zero, and is an approximation of the total number of charged particles (protons and electrons; ). The charged particles should occupy about 2/3 of the total number of particles in an atom (). That is, the remaining 1/3 is the number of neutrons.
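As a rough illustration of the two balance losses, the sketch below treats the charge loss as the squared total charge of one atom and the particle-number loss as the squared gap between the fraction of charged particles and 2/3. The use of the sum of absolute charges as the count of charged particles, the squared-error form of the second loss, and the function names are our assumptions; the exact formulas are not recoverable from the text above.

```python
def charge_balance_loss(charges):
    # Squared error between the total charge of one atom and zero.
    total = sum(charges)
    return total * total


def particle_number_loss(charges, target_fraction=2.0 / 3.0):
    # Assumption: sum(|q|) approximates the number of charged particles
    # (protons and electrons); neutrons have q close to 0.
    charged_fraction = sum(abs(q) for q in charges) / len(charges)
    return (charged_fraction - target_fraction) ** 2
```

For an atom with one proton, one electron, and one neutron (charges 1, -1, 0), both terms vanish; in a batch, each term would be averaged over atoms.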
3.2 Inter-Sample Structure
After assigning a charge for each component in a data point, we can take advantage of the derived characteristics, i.e., mass, nucleus, radius, and distance, to compute the interatomic forces for building the inter-sample structure.
As illustrated in Figure 2(c), a widely accepted understanding is that the interatomic force is composed of two parts: a repulsive force, which pushes two homoelectric particles apart, and a cohesive force, which pulls two heteroelectric particles closer. The combination of the two forces yields a curve of potential energy with respect to the atomic spacing. Such a curve has a Balance Point, which has the lowest potential energy and is often located at a short distance between two atoms that must be larger than zero. When the distance approaches zero, the potential energy increases dramatically, thus naturally preventing two atoms from becoming “continuous” (with zero distance).
In Atom Modeling, we compute the potential energy caused by Coulomb forces (halliday2013fundamentals) for randomly sampled particle pairs drawn from two data points (, ) and take the summation. That is, we aim to minimize the loss defined as:
During implementation, for computational efficiency, we randomly pair components from data points in the same batch to compute . Therefore, the order of complexity does not increase.
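The random pairing can be sketched as follows. This is a simplified illustration, not the paper's loss: it uses the textbook Coulomb potential energy of two point charges, proportional to q1 * q2 / r, with the constant dropped, and it measures r as the plain Euclidean distance between component positions. The paper's own particle distances (built from nucleus positions and radii) are not reproduced here, so the balance-point behaviour described above is not guaranteed by this simplified form. The names, the `eps` term, and the `(charge, position)` data layout are assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def coulomb_energy(q1, q2, distance, eps=1e-6):
    # Positive for like charges (repulsion), negative for opposite charges.
    return q1 * q2 / (distance + eps)


def interatomic_loss(atom_a, atom_b, num_pairs, rng):
    # Each atom is a list of (charge, position) components.
    # Sampling a fixed number of pairs keeps the cost linear in num_pairs
    # instead of quadratic in the number of components.
    total = 0.0
    for _ in range(num_pairs):
        qa, pa = rng.choice(atom_a)
        qb, pb = rng.choice(atom_b)
        total += coulomb_energy(qa, qb, euclidean(pa, pb))
    return total
```

A call such as `interatomic_loss(a, b, num_pairs=8, rng=random.Random(0))` returns the summed energy for eight sampled pairs.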
During training, we take a weighted sum of the three losses (the two balance losses and the interatomic potential-energy loss) and add it to the original loss function, e.g., a cross-entropy loss. The model will then be trained by minimizing the following loss function:
We empirically observe that the coefficients , , and do not require careful tuning and can often be the same number.
3.4 Theoretical Results
Data points with similar intra-sample structures can have longer balance distance; ones with more opposite intra-sample structures can have shorter balance distance. Specifically, the balance distance is proportional to , where indicates the difference between the intra-sample structures of two data points and satisfies .
We provide this proof in the Appendix.
In order to validate the effects of Atom Modeling, we conduct extensive experiments across multiple datasets in the synthetic, text (warstadt-etal-2019-neural; sheng2020investigating; grasser2018aspect) and image domains (parkhi2012cats; nilsback2008automated; deng2009imagenet) with two model architectures, transformer (vaswani2017attention; devlin2019bert) and convolutional neural network (he2016deep). We compare our method with two commonly seen distance functions, the p-norm with p = 1 and p = 2, as our designed baselines since, to the best of our knowledge, no prior work discusses this type of method. We also answer the question “what are the learned intra- and inter-structures?” with qualitative analyses in the next section.
4.1 Synthetic Experiment
To gain more insight into Atom Modeling, we design the following synthetic experiment. In this way, we can visualize the inter-sample structure in the latent space without a further projection, which could change the structure.
Setup - Data Generation.
Inspired by the complexity of real data, e.g., texts and images, we propose a synthetic dataset where each data point has multiple components. We design each data point to contain five 2-dimensional vectors, where each vector (input feature) is sampled from a Gaussian Mixture Model composed of two Gaussian distributions, and . If most of the input features in a data point are sampled from , we label the data point as 0; otherwise, we label it as 1. The generated dataset is plotted in Figure 4.
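The generation procedure can be sketched as below. The two Gaussian means, the shared standard deviation, and the equal mixture weights are placeholders of our own choosing, since the actual parameters are not given in the text above; the function name is hypothetical.

```python
import random


def sample_data_point(rng, means=((-1.0, -1.0), (1.0, 1.0)), std=1.0,
                      num_features=5):
    # Each feature is a 2-D vector drawn from one of two Gaussians
    # (assumption: both mixture components are equally likely).
    features = []
    from_first = 0
    for _ in range(num_features):
        k = rng.randrange(2)
        if k == 0:
            from_first += 1
        mean_x, mean_y = means[k]
        features.append((rng.gauss(mean_x, std), rng.gauss(mean_y, std)))
    # Label 0 if most features came from the first Gaussian, else 1.
    label = 0 if from_first > num_features / 2 else 1
    return features, label


rng = random.Random(0)
dataset = [sample_data_point(rng) for _ in range(100)]
```

With five features per data point, the majority is always well defined, so no tie-breaking rule is needed.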
Setup - Model Architecture.
For fair comparison and easy visualization, all four methods are trained on the same neural network with two linear layers. The first layer (2x3) is fed with the data points of size (5x2), and its output will be of size (5x3) in a hidden space. We then take the first dimension of this output as the weight (5x1) and the rest as the embeddings of the five input features (5x2). Hence, we can compute the weighted sum of the five embeddings and obtain a two-dimensional vector. We take this vector as the embedding of a data point and feed it into the second layer (2x2) to classify the data points as 0 or 1.
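The forward pass of this two-layer network can be sketched in plain Python as follows. Bias terms are omitted and the weight is used without any squashing, which are simplifying assumptions on our part; the function names are hypothetical.

```python
def apply_linear(x, weight):
    # weight has shape (in_dim x out_dim), matching the "(2x3)" notation.
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def forward(features, w1, w2):
    # features: five 2-D vectors (5x2); w1: (2x3); w2: (2x2).
    hidden = [apply_linear(f, w1) for f in features]   # (5x3)
    weights = [h[0] for h in hidden]                   # (5x1)
    embeddings = [h[1:] for h in hidden]               # (5x2)
    # Weighted sum of the five embeddings -> one 2-D data point embedding.
    pooled = [sum(w * e[d] for w, e in zip(weights, embeddings))
              for d in range(2)]
    return apply_linear(pooled, w2)                    # two class logits


def predict(logits):
    # Index of the largest logit (class 0 or 1).
    return max(range(len(logits)), key=lambda k: logits[k])
```

The pooled 2-D vector is the latent embedding that can be plotted directly, which is why no further projection is needed for visualization.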
As shown in Figure 5, across 10 random runs, the same model trained with only cross-entropy loss achieves an average accuracy of 87% with relatively low variance. When we additionally increase the 1-norm or 2-norm distance among data points in the latent space, the highest accuracy reached is about 90%. However, since the variance also increases, the average accuracy improves only slightly, to 87%-88%. If we apply Atom Modeling, the average accuracy is significantly improved to 92%, with a median of 96%. Overall, we observe that pushing data points apart in the latent space can raise the chance of improving the model performance. However, the 1-norm and 2-norm distances treat all data points the same, hence the learned data point positions in the latent space can remain relatively the same with only the scale enlarged. On the other hand, Atom Modeling does not treat all data points equally but focuses on increasing the distance between data points that should not be too close, e.g., points in the same neighborhood but in different classes.
4.2 Text Classification
We validate the effect of Atom Modeling on text classification, a common real-world application where the data points are composed of multiple, ordered tokens. The number of tokens varies across data points.
Results of text classification. We report the mean and the standard deviation (the value after ±) of each evaluation metric for three random runs.
We conduct experiments on three datasets with different scales and numbers of classes: CoLA (wang2018glue; warstadt-etal-2019-neural), Poem (sheng2020investigating), and Drugs (grasser2018aspect). CoLA is a set of about 10k sentences annotated with 2 classes indicating whether each is a grammatical English sentence according to linguistic theory. Poem is a corpus of about 1k examples for sentiment analysis of classic poems, with verses annotated with four classes: negative, no impact, positive, or mixed (both negative and positive). Drugs is a set of about 100k reviews, each labeled with one of 50 conditions, e.g., “depression”, “pain”, “acne”.
We fine-tune a pretrained language model, BERT-base-uncased (devlin2019bert), with cross-entropy loss. To implement 1-norm/2-norm distance regulation on BERT, we take the mean of the n-th layer hidden states of all tokens as the embedding of the data point. To atomize the n-th hidden layer of the BERT model ( here), we utilize the first dimension in the latent space of each token as its charge . Therefore 1-norm/2-norm/Atom Modeling are implemented upon BERT without extra parameters.
In Table 1, similar to the results of the synthetic experiment, we observe that in most cases Atom Modeling achieves the best results. The 2-norm distance method performs on par with Atom Modeling on the Drugs dataset. In contrast, the 1-norm distance method performs similarly or slightly better on CoLA and Poem, but worse than using only cross-entropy on Drugs. One difference from the synthetic experiment is that the transformer architecture does not directly operate on the data point embeddings, but on the component embeddings. In this way, the benefits of Atom Modeling might not be fully leveraged. However, we can still observe the improvements via the jointly learned component embeddings across different numbers of classes and dataset scales.
4.3 Image Classification
We evaluate Atom Modeling on image classification, a common real-world application where the data points are composed of pixels in a specific arrangement.
We conduct experiments on three datasets with different scales and numbers of classes: Oxford-IIIT Pets (parkhi2012cats), Oxford Flowers-102 (nilsback2008automated), and ImageNet (deng2009imagenet). Oxford-IIIT Pets consists of 37 cat and dog breeds with roughly 200 images for each class. Oxford Flowers-102 consists of 102 flower categories, each with between 40 and 258 images. ImageNet consists of about 1M images in 1000 classes. We resize all the images to 32x32 and apply the same data preprocessing method to these three datasets. Note that the resizing makes the tasks more challenging since some details in an image might disappear.
Based on a previous implementation of ResNet18 (he2016deep), we first concatenate the output channels of the first convolution layer and take the values at each pixel position as that pixel's embedding . To implement 1-norm/2-norm distance regulation, we take the mean of the pixel embeddings as the embedding of the data point. To atomize this layer in ResNet, we use the first dimension of each embedding as its charge . Therefore, 1-norm/2-norm/Atom Modeling are implemented on ResNet without extra parameters.
As listed in Table 2, similar to the results of both the synthetic experiment and text classification, we observe that Atom Modeling achieves the highest accuracy in all cases. In most cases, the 2-norm distance method slightly improves the results trained with only cross-entropy loss. However, the 1-norm distance method performs poorly in image classification. We also found that the performance improvements are larger on Oxford-IIIT Pets and Oxford Flowers-102. We conjecture that this is because these two datasets are more fine-grained classification tasks, since some categories of pets and flowers are not easy to distinguish.
[Table 2: image classification accuracy on Oxford-IIIT Pet, Oxford Flowers-102, and ImageNet]
5 Quantitative and Qualitative Analyses
Given our motivations, the discrete representation and the property of atomic physics, we are curious about (1) the impact of the atomized -th layer, (2) the learned latent space/inter-sample structure, and (3) the learned mapping from data point components to subatomic particles/intra-sample structures.
5.1 Atomized the -th Hidden Layer
The -th layer (in Section 2.1) splits a model into the first part, which projects data points into the latent space, and the second part, which transforms the latent space into the output. Since the goal of Atom Modeling is to reorganize the latent space so that the second part of the model can perform a better transformation, the choice of which hidden layer is reorganized might affect the final results. In Figure 6, which applies Atom Modeling to different layers of the BERT model on CoLA, we observe that discretizing the first layer generally gives the best performance. Recall that we are using a complex model architecture, which does not directly operate on the data point embedding, but on the component embeddings. We conjecture that this complexity of the model architecture requires a larger capacity in the second part of the model.
5.2 Visualization of Learned Inter-Sample Structure
After training with different methods, the projection of the same data points to the same latent space will have different relative positions. As plotted in Figure 7, we visualize four random runs of the learned inter-sample structures of models trained with only cross-entropy loss, 1-norm/2-norm distance, and Atom Modeling on the synthetic experiment. We find that training with only the cross-entropy loss results in embeddings that are entangled in the middle and form a long and thin shape. When applying the additional 1-norm distance, the shape is similar to that of cross-entropy but the scale of both axes is enlarged about five times. When using the 2-norm distance, the shape is more similar to the original dataset. However, there are still multiple overlaps and the scale is enlarged up to ten times. Finally, we find that Atom Modeling not only spreads out the distribution to resemble the original dataset, but also pushes apart the points in the middle, in line with our motivation. Meanwhile, the scale remains about the same as the scale learned by cross-entropy.
5.3 Visualization of Learned Intra-Sample Structures
In Atom Modeling, the inter-sample structure depends on the learned intra-sample structure. We look into the learned intra-sample structure in texts and images to understand how the charges are distributed among words and pixels in real-world data. For texts, we first extract the charges of a token in different contexts and then take an average. We plot the charges of 15 randomly sampled words of CoLA in Figure 9. We find that in both the CoLA and STSB datasets, the most frequent words, such as “more” and “also”, are learned to be electrons, while protons are often nouns with specific meanings, such as “water” and “violin”. This phenomenon validates our expectation that the key contexts are mapped to protons and that the assistant texts that often co-occur with the key contexts are mapped to electrons. For images, the charges of all pixels are plotted in Figure 9. We observe that the charge distribution often resembles a relief sculpture, where electrons are mapped to shadows on object edges and protons are the crucial parts for a model to classify the image. Moreover, we find that the charge distribution sometimes recovers missing details in the processed images. For example, in the rightmost image of Figure 9, the charge distribution draws the other petals that were originally hidden in the dark.
6 Related Work and Discussion
Recent works have tried to connect Physics with machine learning and data science (raissi2017physics; karpatne2017theory; bar2019learning; raissi2019physics; sitzmann2020implicit; NEURIPS2021_df438e52; NEURIPS2021_0a3b5a7a; NEURIPS2021_7ca57a9f). For computer vision, hasani2021liquid and NEURIPS2021_07845cd9 model visual reasoning by considering physical quantities such as velocity. tang2021image proposes to split images into amplitude and phase as waves to improve the MLP architecture. For natural language processing, there is a surge of research on the utility of Quantum Physics in texts (wu2021natural). For example, zhang2020quantum applies Quantum Physics to sentiment analysis for emotions in dialogues. Atom Modeling also explores the potential impact of natural science on machine learning. However, the goal of Atom Modeling is to approach the question: “how to promote the discrete nature of data points in a deep learning model?”
Some will find similarities between Atom Modeling and regularization (tibshirani1996regression; nie2010efficient; gulrajani2017improved) and kernel methods (muller2001introduction; keerthi2003asymptotic; hofmann2008kernel). There are two main differences. First, Atom Modeling constrains the underlying data representation rather than the model parameters. Second, Atom Modeling aims to enhance the expressivity of the inter-sample structure in a dataset and specifically models the discrete nature in a space of unchanged dimension, instead of mapping from a lower- to a higher-dimensional space.
In this paper, we propose Atom Modeling, a science- and theory-based approach to promote a discrete representation in a deep learning model. Moreover, by drawing an analogy between data and subatomic components, Atom Modeling learns interpretable intra-sample structure for each data point. An important property is that a model trained with Atom Modeling learns an inter-sample structure (distances among data points) depending on the intra-sample structure. Leveraging this property can potentially improve model capacity. Moving forward, we see that Atom Modeling provides a possible way to improve the current neural network and hope it provides a base for future works to consider more about the discrete nature.
Appendix A Proof of Theorem 3.1
Set and .
Set and .
If , the right hand side divided by zero is undefined.
If and recall that , then . This results in , contradicting the premise .
This equation can be satisfied only when . A balance point exists in .
Suppose that with , then,
which is monotonically decreasing given .