Symmetries frequently occur in machine learning. They express that the loss function of a model is invariant under a certain group of transformations. For example, the loss function of matrix factorization or word embedding models remains unchanged if we simultaneously rotate all embedding vectors with the same rotation matrix. This is an example of a continuous symmetry, since the rotations are parameterized by a continuum of real-valued angles.
Sometimes, the symmetry of a loss function is broken, e.g., due to the presence of an additional term that violates the symmetry. For example, this may be a weak regularizer. In this paper, we show that such symmetry breaking may induce slow convergence problems in gradient descent, in particular when the symmetry breaking is weak. We solve this problem with a new optimization algorithm.
Weak continuous symmetry breaking leads to an ill-conditioned optimization problem. When a loss function is invariant under a continuous symmetry, it has a manifold of degenerate (equivalent) minima. This is usually not a problem because any such minimum is a valid solution of the optimization problem. However, adding a small symmetry breaking term to the loss function lifts the degeneracy and forces the model to prefer one minimum over all others. As we show, this leads to an ill-conditioned Hessian of the loss, with a small curvature along symmetry directions and a large curvature perpendicular to them. The ill-conditioned Hessian results in slow convergence of gradient descent.
We propose an optimization algorithm that speeds up convergence by separating the optimization in the symmetry directions from the optimization in the remaining directions. At regular intervals, the algorithm efficiently minimizes the small symmetry breaking term in such a way that the minimization does not degrade the symmetry invariant term.
Symmetries can be broken explicitly, e.g., due to an additional term such as an regularizer in a word embedding model (Sun et al., 2016). However, perhaps more interestingly, symmetries can also be broken by couplings between model parameters. This is known as spontaneous symmetry breaking in the physics community.
One of our main findings is that spontaneous symmetry breaking occurs in certain time series models, such as dynamic matrix factorizations and dynamic word embedding models (Lu et al., 2009; Koren, 2010; Charlin et al., 2015; Bamler & Mandt, 2017; Rudolph & Blei, 2018). In these models, it turns out that model parameters may be smoothly twisted along the time axis, and that these twists contribute only little to the loss, thus leading to a small gradient. These inexpensive smooth twists are known in the physics community as Goldstone modes (Altland & Simons, 2010).
Our contributions are as follows:
We identify a broad class of models that suffer from slow convergence of gradient descent due to Goldstone modes. We explain both mathematically and pictorially how Goldstone modes lead to slow convergence.
Using ideas from gauge theories in physics, we propose Goldstone Gradient Descent (Goldstone-GD), an optimization algorithm that speeds up convergence by separating the optimization along symmetry directions from the remaining coordinate directions.
We evaluate the Goldstone-GD algorithm experimentally with dynamic matrix factorizations and Dynamic Word Embeddings. We find that Goldstone-GD converges orders of magnitude faster and finds more interpretable embedding vectors than standard gradient descent (GD) or GD with diagonal preconditioning.
For Dynamic Word Embeddings (Bamler & Mandt, 2017), Goldstone-GD allows us to find historic synonyms of modern English words, such as “wagon” for “car”. Without our advanced optimization algorithm, we were not able to perform this task.
Our paper is structured as follows. Section 2 describes related work. In Section 3, we specify the model class under consideration, introduce concrete example models, and discuss the slow convergence problem. In Section 4, we propose the Goldstone-GD algorithm that solves the slow convergence problem. We report experimental results in Section 5 and provide concluding remarks in Section 6.
2 Related Work
Our paper discusses continuous symmetries in machine learning and proposes a new optimization algorithm. In this section, we summarize related work on both aspects.
Most work on symmetries in machine learning focuses on discrete symmetries. Convolutional neural networks(LeCun et al., 1998) exploit the discrete translational symmetry of images. This idea was generalized to arbitrary discrete symmetries (Gens & Domingos, 2014), to the permutation symmetry of sets (Zaheer et al., 2017), and to discrete symmetries in graphical models (Bui et al., 2013; Noessner et al., 2013). Discrete symmetries do not cause an ill-conditioned optimization problem because they lead to isolated degenerate minima rather than a manifold of degenerate minima.
In this work, we consider models with continuous symmetries. Continuous rotational symmetries have been identified in deep neural networks (Badrinarayanan et al., 2015), matrix factorization (Mnih & Salakhutdinov, 2008; Gopalan et al., 2015), linear factor models (Murphy, 2012), and word embeddings (Mikolov et al., 2013a, b; Pennington et al., 2014; Barkan, 2017). Dynamic matrix factorizations (Lu et al., 2009; Koren, 2010; Sun et al., 2012; Charlin et al., 2015) and dynamic word embeddings (Bamler & Mandt, 2017; Rudolph & Blei, 2018) generalize these models to sequential data. These are the models whose optimization we address in this paper. A specialized optimization algorithm for a loss function with a continuous symmetry was presented in (Choi et al., 1999). Our discussion is more general since we only require invariance under a collective rotation of all feature vectors, and not under independent symmetry transformations of each individual feature.
The slow convergence in these models is caused by shallow directions of the loss function. Popular methods to escape a shallow valley of a loss function (Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2014) use diagonal preconditioning. As confirmed by our experiments, diagonal preconditioning does not speed up convergence when the shallow directions correspond to collective rotations of many model parameters, which are not aligned with the coordinate axes.
are a more sophisticated form of preconditioning, which has been applied to deep learning(Pascanu & Bengio, 2013) and to variational inference (Hoffman et al., 2013). Our proposed algorithm uses natural gradients in a subspace where they are cheap to obtain. Different to the Krylov subspace method (Vinyals & Povey, 2012), we construct the subspace such that it always contains the shallow directions of the loss.
3 Problem Setting
In this section, we formalize the notion of continuous symmetry breaking (Section 3.1), and we specify the type of models that we investigate in this paper (Section 3.2). We then show that the introduced models exhibit a specific type of symmetry breaking that is generically weak, which leads to slow convergence of gradient descent (Section 3.3).
3.1 Symmetry Breaking in Representation Learning
We formalize the notion of weak continuous symmetry breaking and show that it leads to an ill-conditioned optimization problem. The red surface in Figure 1a illustrates a rotationally symmetric loss . It has a ring of degenerate (i.e., equivalent) minima. The purple sphere depicts one arbitrarily chosen minimum. Tangential to the ring of degenerate minima, i.e., along the blue arrows, is flat.
For a machine learning example, consider factorizing a large matrix into the product of two smaller matrices and by minimizing the loss . Rotating all columns of and by the same orthogonal111We call a square matrix ‘orthogonal’ if is the identity. This is sometimes also called an ‘orthonormal’ matrix. rotation matrix such that and does not change since .
The continuous rotational symmetry of leads to a manifold of degenerate minima: if minimizes , then so does for any rotation matrix . On the manifold of degenerate minima, the gradient assumes the constant value of zero. A constant gradient (first derivative) means that the curvature (second derivative) is zero. More precisely, the Hessian of
has a zero eigenvalue for all eigenvectors that are tangential to the manifold of degenerate minima. Usually, a zero eigenvalue of the Hessian indicates a maximally ill-conditioned optimization problem, but this is not an issue here. The zero eigenvalue only means that convergence within the manifold of degenerate minima is infinitely slow. This is of no concern since any minimum is a valid solution of the optimization problem.
A problem arises when the continuous symmetry is weakly broken, e.g., by adding a small regularizer to the loss. A sufficiently small regularizer changes the eigenvalues of the Hessian only slightly, leaving it still ill-conditioned. However, even a small regularizer lifts the degeneracy and turns the manifold of exactly degenerate minima into a shallow valley with one preferred minimum. Convergence along this shallow valley is slow because of the ill-conditioned Hessian. This is the slow convergence problem that we address in this paper. We present a more natural setup that exhibits this problem in Sections 3.2-3.3. Our solution, presented in Section 4, is to separate the optimization in the symmetry directions from the optimization in the remaining directions.
3.2 Representation Learning for Time Series
We define a broad class of representation learning models for sequential data, and introduce three example models that are investigated experimentally in this paper. As we show in Section 3.3, the models presented here suffer from slow convergence due to a specific kind of symmetry breaking.
General Model Class.
We consider data that are associated with additional metadata , such as a time stamp. For each , the task is to learn a low dimensional representation by minimizing what we call a ‘local loss function’ . We add a quadratic regularizer that couples the representations along the -dimension. In a Bayesian setup, comes from the log-prior of the model. The total loss function is thus
For each task , the representation is a matrix whose columns are embedding vectors of some low dimension . We assume that is invariant under a collective rotation of all columns of : let be an arbitrary orthogonal rotation matrix of the same dimension as the embedding space, then
Finally, we consider a specific form of the regularizer which is quadratic in , and which is defined in terms of a sparse symmetric coupling matrix :
Here, the matrix-vector multiplications are carried out in -space, and the trace runs over the remaining dimensions. Note that, different to Section 3.1, we do not require to have a small coefficient. We only require the coupling matrix to be sparse. In the examples below, is tridiagonal and results from a Gaussian Markovian time series prior. In a more general setup, is the Laplacian matrix of a sparse weighted graph (Poignard et al., 2018). Here, is the adjacency matrix, whose entries are the coupling strengths, and the degree matrix is diagonal and defined such that the entries of each row of sum up to zero.
Equations 1, 2, and 3 specify the problem class of interest in this paper. The following paragraphs introduce the specific example models used in our experiments. In Section 3.3, we show that the sparse coupling in these models leads to weak continuous symmetry breaking and therefore to slow convergence of gradient descent (GD).
Model 1: Dense Dynamic Matrix Factorization.
Consider the task of factorizing a large matrix into a product of two smaller matrices. The latent representation is the concatenation of the two embedding matrices,
In a Gaussian matrix factorization, the local loss function is
In dynamic matrix factorization models, the data are observed sequentially at discrete time steps , and the representations capture the temporal evolution of latent embedding vectors. We use a Markovian Gaussian time series prior with a coupling strength , resulting in the regularizer
Here, the vector is the th column of the matrix , i.e., the th embedding vector, and is the number of columns. The regularizer allows the model to share statistical strength across time. By multiplying out the square, we find that has the form of Eq. 3 with a tridiagonal coupling matrix,
Model 2: Sparse Dynamic Matrix Factorization.
In a sparse matrix factorization, the local loss involves only few components of the matrix . The latent representation is again . We consider a model for movie ratings where each user rates only few movies. When user rates movie in time step , we model the log-likelihood to obtain the binary rating
with a logistic regression,
with the sigmoid function. Eq. 8 is the log-likelihood of the rating of a single movie by a single user. We obtain the full log-likelihood for time step by summing over the log-likelihoods of all ratings observed at time step . The local loss is
Here, is the Frobenius norm, and we add a quadratic regularizer with strength since data for some users or movies may be scarce. We distinguish this local regularizer from the time series regularizer , given again in Eq. 6, as the local regularizer does not break the rotational symmetry.
Model 3: Dynamic Word Embeddings.
Word embeddings map words from a large vocabulary to a low dimensional representation space such that neighboring words are semantically similar, and differences between word embedding vectors capture syntactic and semantic relations. We consider the Dynamic Word Embeddings model (Bamler & Mandt, 2017), which uses a probabilistic interpretation of the Skip-Gram model with negative sampling, also known as word2vec (Mikolov et al., 2013b; Barkan, 2017), and combines it with a time series prior. The model is trained on text sources with time stamps , and it assigns two time dependent embedding vectors and to each word from a fixed vocabulary. The embedding vectors are obtained by simultaneously factorizing two matrices, which contain so-called positive and negative counts of word-context pairs. Therefore, the representation for each time step is invariant under orthogonal transformations. The regularizer comes from the time series prior, which is a discretized Ornstein-Uhlenbeck process, i.e., it combines a random diffusion process with a local quadratic regularizer.
3.3 Symmetry Breaking in Representation Learning for Time Series
We show that the time series models introduced in Section 3.2 exhibit a variant of the symmetry breaking discussed in Section 3.1. This variant of symmetry breaking is generically weak, thus causing slow convergence. We discuss the convergence first geometrically and then more formally.
Geometric Picture: Goldstone Modes.
We find that minimizing the total loss in Eq. 1 with GD converges in two phases, illustrated in Figures 1b-d. Each purple sphere depicts an embedding vector for a time step of the model. The red surface is the rotationally symmetric local loss function . For a simpler visualization, we assume here that is the same function for every time step , and that each contains only a single embedding vector.
In the first phase, GD starts from a random initialization (Figure 1b) and quickly finds approximate minima of the local loss functions for each time step. At this stage, we observe in experiments that the parameters of the time series model twist smoothly around the rotation center of (Figure 1c). Such a configuration is called a Goldstone mode in the physics literature (Altland & Simons, 2010).
The coupling in Eqs. 1 and 6 penalizes Goldstone modes as it tries to pull neighboring spheres together. In the second phase of convergence, GD eliminates Goldstone modes and arrives eventually at the minimum of the total loss (Figure 1d; in this toy example, the chain contracts to a single point). Convergence in this phase is slow because, as we show below, the gradient of is suppressed in a Goldstone mode (Goldstone theorem) (Altland & Simons, 2010).
Figure 2 identifies the same two phases of convergence in an experiment. The plots show snapshots of the embedding space in a Gaussian dynamic matrix factorization with a representation space and time steps (details in Section 5.1). Points of equal color show the trajectory of one embedding vector, i.e., for and fixed .
We see that GD (top row of plots) arrives at a Goldstone mode, i.e., smooth twists around the origin, after about training iterations. In this toy experiment, the local loss is again identical for all . Thus, in the optimum, each trajectory contracts to a single point, but this contraction takes many more training iterations. By contrast, our algorithm evades Goldstone modes and converges much faster.
Formal Picture: Eigenvalues of the Hessian.
Goldstone modes decay slowly in GD because the gradient of the total loss is suppressed in a Goldstone mode. This can be seen by analyzing the eigenvalues of the Hessian of at its minimum. For a configuration that is close to the true minimum , the gradient is approximately .
The Hessian of in Eq. 1 is the sum of the Hessians of the local loss functions plus the Hessian of the regularizer . As discussed in Section 3.1, the Hessians of all have exact zero eigenvalues along the symmetry directions. Within this nullspace, only the Hessian of the regularizer remains. From Eq. 3, we find where
is the tensor product, andand are identity matrices in the input and embedding space, respectively. Thus, has the same eigenvalues as the Laplacian matrix of the coupling graph, each with multiplicity .
Since the rows of sum up to zero, has a zero eigenvalue for the eigenvector . This is because the total loss is exactly invariant under global rotations of all embeddings by the same rotation matrix. As discussed in Section 3.1, zero eigenvalues due to an exact continuous symmetry do not induce slow convergence of GD.
The speed of convergence is governed by the lowest nonzero eigenvalue of the Hessian, and therefore of . In a Markovian time series model, in Eq. 7 couples neighbors along a chain of length . Its lowest nonzero eigenvalue is (de Abreu, 2007), which vanishes as for large . This leads to the ill-conditioned Hessian and to the small gradient in a Goldstone mode. A more general model may couple tasks along a sparse graph other than a chain. The second lowest eigenvalue of the Laplacian matrix of a graph is called ‘algebraic connectivity’ (de Abreu, 2007), and it is small in sparse graphs.
4 Goldstone Gradient Descent
We now propose our solution to the slow convergence problem of Section 3.3. Algorithm LABEL:alg:gauge-gd summarizes our Goldstone Gradient Descent (Goldstone-GD) algorithm. We discuss details in Section 4.1
, and hyperparameters in Section4.2.
The algorithm minimizes a loss function of the form of Eqs. 1-3. We alternate between standard GD (lines LABEL:ln:begin-fullspace-LABEL:ln:standard-gradstep in Algorithm LABEL:alg:gauge-gd), and a specialized minimization in the subspace of symmetry transformations (lines LABEL:ln:begin-symspace-LABEL:ln:update-gamma). The latter efficiently minimizes the symmetry breaking regularizer without degrading the symmetry invariant local loss functions . We always perform several updates of each type in a row (hyperparameters and ) because switching between them incurs some overhead (lines LABEL:ln:prepare-gauge and LABEL:ln:apply-gauge). Algorithm LABEL:alg:gauge-gd presents Goldstone-GD in its simplest form. It is straight-forward to combine it with adaptive learning rates and minibatch sampling, see experiments in Section 5.
4.1 Optimization in the Symmetry Subspace
We explain lines LABEL:ln:prepare-gauge-LABEL:ln:apply-gauge of Algorithm LABEL:alg:gauge-gd. These steps minimize the total loss function while restricting updates of the model parameters to symmetry transformations. Let denote orthogonal matrices. The task is to minimize the following auxiliary loss function over ,
with the nonlinear constraint . If minimizes , then updating decreases the loss by eliminating all Goldstone modes. The second term on the right-hand side of Eq. 10 does not influence the minimization as it is independent of . Subtracting this term makes independent of the local loss functions : by using Eqs. 1-2, we can write in terms of only the regularizer ,
Artificial Gauge Fields.
We turn the constrained minimization of over into an unconstrained minimization using a result from the theory of Lie groups (Hall, 2015)
. Every special orthogonal matrix
is the matrix exponential of a skew symmetricmatrix . Here, skew symmetry means that , and the matrix exponential function is defined by its series expansion,
which is not to be confused with the componentwise exponential of (the term in Eq. 12 is the matrix product of with itself, not the componentwise square). Eq. 12 follows from the Lie group–Lie algebra correspondence for the Lie group (Hall, 2015). Note that is close to the identity if the entries of are small. To enforce skew symmetry of , we parameterize it via the skew symmetric part of an unconstrained matrix , i.e.,
We call the components of the gauge fields, invoking an analogy to gauge theory in physics.
Taylor Expansion in the Gauge Fields.
Eqs. 12-13 parameterize a valid rotation matrix in terms of an arbitrary matrix . This turns the constrained minimization of into an unconstrained one. However, the matrix-exponential function in Eq. 12 is numerically expensive, and its derivative is complicated because the group is non-abelian. We simplify the problem by introducing an approximateion. As the model parameters approach the minimum of , the optimal rotations that minimize converge to the identity, and thus the gauge fields converge to zero. In this limit, the approximation becomes exact.
We approximate the auxiliary loss function by a second order Taylor expansion . In detail, we truncate Eq. 12 after the term quadratic in and insert the truncated series into Eq. 11. We multiply out the quadratic form in the prior , Eq. 3, and neglect again all terms of higher than quadratic order in . Using the skew symmetry of and the symmetry of the Laplacian matrix , we find
where the trace runs over the embedding space, and for each , we define the matrix ,
We evaluate the matrices on line LABEL:ln:prepare-gauge in Algorithm LABEL:alg:gauge-gd. Note that the adjacency matrix is sparse, and that we only need to obtain those matrices for which .
We describe the numerical minimization of below. Once we obtain gauge fields that minimize , the optimal update step for the model parameters would be . For efficiency, we truncate the matrix exponential function after the linear term, resulting in line LABEL:ln:apply-gauge of Algorithm LABEL:alg:gauge-gd. We do not reset the gauge fields to zero after updating , so that the next minimization of starts with preinitialized . This turned out to speed up convergence in our experiments, possibly because acts like a momentum in the symmetry subspace.
Lines LABEL:ln:begin-symspace-LABEL:ln:update-gamma in Algorithm LABEL:alg:gauge-gd minimize over the gauge fields using GD. We speed up convergence using the fact that depends only on the prior and not on . Since we know the Hessian of , we can use natural gradients (Amari, 1998), resulting in the update step
where is a constant learning rate and is the pseudoinverse of the Laplacian matrix . We obtain by taking the eigendecomposition of and inverting the eigenvalues, except for the single zero eigenvalue corresponding to (irrelevant) global rotations, which we leave at zero. has to be obtained only once before entering the training loop.
We find that we can automatically set in Eq. 16 to a value that leads to fast convergence,
We arrive at this choice of learning rate by estimating the Hessian of. The preconditioning with in Eq. 16 takes into account the structure of the Hessian in -space, which enters in Eq. 14 via the adjacency matrix . The remaining factor , defined in Eq. 15, is quadratic in the components of and linear in . This suggests a learning rate . We find empirically for large models that the -dependency of leads to a small mismatch between and the Hessian of . The more conservative choice of learning rate in Eq. 17 leads to fast convergence of the gauge fields in all our experiments.
|LABEL:ln:standard-gradstep||gradient step in full param. space||model dependent|
|LABEL:ln:prepare-gauge||transformation to symmetry space|
|LABEL:ln:update-gamma||nat. grad. step in symmetry space|
|LABEL:ln:apply-gauge||transformation to full param. space|
Goldstone-GD has two integer hyperparameters, and , which control the frequency of execution of each operation. Table 1 lists the computational complexity of each operation, assuming that the sparse adjacency matrix has nonzero entries, as is the case in Markovian time series models (Eq. 7). Note that the embedding dimension is typically orders of magnitude smaller than the input dimension . Therefore, update steps in the symmetry subspace (line LABEL:ln:update-gamma) are cheap. In our experiments, we always set and such that the overhead from lines LABEL:ln:prepare-gauge-LABEL:ln:apply-gauge is less than .
We evaluate the proposed Goldstone-GD optimization algorithm on the three example models introduced in Section 3.2. We compare Goldstone-GD to standard GD, to AdaGrad (Duchi et al., 2011), and to Adam (Kingma & Ba, 2014). Goldstone-GD converges orders of magnitude faster and fits more interpretable word embeddings.
5.1 Visualizing Goldstone Modes With Artificial Data
Model and Data Preparation.
We fit the dynamic Gaussian matrix factorization model defined in Eqs. 4-6 in Section 3.2 to small scale artificial data. In order to visualize Goldstone modes in the embedding space we choose an embedding dimension of and, for this experiment only, we fit the model to time independent-data. This allows us to monitor convergence since we know that the matrices and that minimize the loss are also time-independent. We generate artificial data for the matrix by drawing the components of two matrices
from a standard normal distribution, forming
, and adding uncorrelated Gaussian noise with variance.
We use time steps and a coupling strength of . We train the model with standard GD (baseline) and with Goldstone-GD with and . We find fastest convergence for the baseline method if we clip the gradients to an interval and use a decreasing learning rate despite the noise-free gradient. Here, is the training iteration. We optimize the hyperparameters for fastest convergence in the baseline and find , , and .
Figure 2 compares convergence in the two algorithms. We discussed the figure at the end of Section 3.3. In summary, Goldstone-GD converges an order of magnitude faster even in this small scale setup that allows only for three different kinds of Goldstone modes (the skew symmetric gauge fields have only independent parameters). Once the minimization finds minima of the local losses , differences in the total loss between the two algorithms are small since Goldstone modes contribute only little to (this is why they decay slowly in GD).
5.2 MovieLens Recommendations
Model and Data Set.
We fit the sparse dynamic Bernoulli factorization model defined in Eqs. 6-9 in Section 3.2 to the Movielens 20M data set222https://grouplens.org/datasets/movielens/20m/ (Harper & Konstan, 2016). We use embedding dimension , coupling strength , and regularizer . The data set consists of million reviews of movies by
users with time stamps from 1995 to 2015. We binarize the ratings by splitting at the median, discarding ratings at the median, and we slice the remainingmillion data points into time bins of equal duration. We split randomly across all bins into training, validation, and test set.
Baseline and Hyperparameters.
We compare the proposed Goldstone-GD algorithm to GD with AdaGrad (Duchi et al., 2011) with a learning rate prefactor of obtained from cross-validation. Similar to Goldstone-GD, AdaGrad is designed to escape shallow valleys of the loss, but it uses only diagonal preconditioning. We compare to Goldstone-GD with and , using the same AdaGrad optimizer for update steps in the full parameter space.
|car||boat, saddle, canoe, wagon, box||shell, roof, ceiling, choir, central|
|computer||perspective, telescope, needle, mathematical, camera||organism, disturbing, sexual, rendering, bad|
|DNA||potassium, chemical, sodium, molecules, displacement||operates, differs, sharing, takes, keeps|
|electricity||vapor, virus, friction, fluid, molecular||exercising, inherent, seeks, takes, protect|
|tuberculosis||chronic, paralysis, irritation, disease, vomiting||trained, uniformly, extinguished, emerged, widely|
The additional operations in Goldstone-GD lead to a overhead in runtime. The upper panel in Figure 3 shows training curves for the loss using the baseline (purple) and Goldstone-GD (green). The loss drops faster in Goldstone-GD, but differences in terms of the full loss are small because the local loss functions are much larger than the regularizer in this experiment. The lower panel of Figure 3 shows only . Both algorithms converge to the same value of , but Goldstone-GD converges at least an order of magnitude faster. The difference in value is small because Goldstone modes contribute little to . They can, however, have a large influence on the parameter values, as we show next in experiments with Dynamic Word Embeddings.
5.3 Dynamic Word Embeddings
Model and Data Set.
We perform variational inference (Ranganath et al., 2014) in Dynamic Word Embeddings (DWE), see Section 3.2. We fit the model to digitized books from the years to in the Google Books corpus333http://storage.googleapis.com/books/ngrams/books/datasetsv2.html (Michel et al., 2011) (approximately words). We follow (Bamler & Mandt, 2017) for data preparation, resulting in a vocabulary size of , a training set of time step, and a test set of time steps. The DWE paper proposes two inference algorithms: filtering and smoothing. We use the smoothing algorithm, which has better predictive performance than filtering but suffers from Goldstone modes. We set the embedding dimension to due to hardware constraints and train for iterations using an Adam optimizer (Kingma & Ba, 2014) with a decaying prefactor of the adaptive learning rate, , where is the training iteration, , and . We find that this leads to better convergence than a constant prefactor. All other hyperparameters are the same as in (Bamler & Mandt, 2017). We compare the baseline to Goldstone-GD using the same learning rate schedule and , which leads to an runtime overhead.
By eliminating Goldstone modes, Goldstone-GD makes word embeddings comparable across the time dimension of the model. We demonstrate this in Table 2, which shows the result of ‘aging’ modern words, i.e., translating them from modern English to the English language of the year . For each query word , we report the five words whose embedding vectors at the first time step (year ) have largest overlap with the embedding vector of the query word at the last time step (year ). Overlap is measured in cosine distance (normalized scalar product), between the means of and under the variational distribution.
Goldstone-GD finds words that are plausible for the year while still being related to the query (e.g., means of transportation in a query for ‘car’). By contrast, the baseline method fails to find plausible results. Figure 4 provides more insight into the failure of the baseline method. It shows histograms of the cosine distance between word embeddings and for the same word from the first to the last time step. In Goldstone-GD (green), most embeddings have a large overlap because the meaning of most words does not change drastically over time. By contrast, in the baseline (purple), no embeddings overlap by more than between and , and some embeddings even change their orientation (negative overlap). We explain this counterintuitive result with the presence of Goldstone modes, i.e., the entire embedding spaces are rotated against each other.
For a quantitative comparison, we evaluate the predictive log-likelihood of the test set under the posterior mean, and find slightly better predictive performance with Goldstone-GD ( vs. per test point). The improvement is small because the training set is so large that the influence of the symmetry breaking regularizer is dwarfed in all but the symmetry directions by the log-likelihood of the data. The main advantage of Goldstone-GD are the more interpretable embeddings, as demonstrated in Table 2.
We identified a slow convergence problem in representation learning models with a continuous symmetry and a Markovian time series prior, and we solved the problem with a new optimization algorithm, Goldstone-GD. The algorithm separates the minimization in the symmetry subspace from the remaining coordinate directions. Our experiments showed that Goldstone-GD converges orders of magnitude faster and fits more interpretable embedding vectors, which can be compared across the time dimension of a model. Since continuous symmetries are common in representation learning, we believe that gauge theories and, more broadly, the theory of Lie groups are more widely useful in machine learning.
We thank Ari Pakman for valuable and detailed feedback that greatly improved the manuscript.
- Altland & Simons (2010) Altland, A. and Simons, B. D. Condensed matter field theory. Cambridge University Press, 2010.
- Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- Badrinarayanan et al. (2015) Badrinarayanan, V., Mishra, B., and Cipolla, R. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015.
- Bamler & Mandt (2017) Bamler, R. and Mandt, S. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 380–389, 2017.
Bayesian Neural Word Embedding.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Bui et al. (2013) Bui, H. H., Huynh, T. N., and Riedel, S. Automorphism groups of graphical models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 132–141, 2013.
- Charlin et al. (2015) Charlin, L., Ranganath, R., McInerney, J., and Blei, D. M. Dynamic Poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 155–162, 2015.
Choi et al. (1999)
Choi, S., Amari, S., Cichocki, A., and Liu, R.-w.
Natural gradient learning with a nonholonomic constraint for blind
deconvolution of multiple channels.
First International Workshop on Independent Component Analysis and Signal Separation, pp. 371–376, 1999.
- de Abreu (2007) de Abreu, N. M. M. Old and new results on algebraic connectivity of graphs. Linear Algebra and its Applications, 423(1):53–73, 2007.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Gens & Domingos (2014) Gens, R. and Domingos, P. M. Deep symmetry networks. In Advances in Neural Information Processing Systems 27, pp. 2537–2545. 2014.
- Gopalan et al. (2015) Gopalan, P., Hofman, J. M., and Blei, D. M. Scalable recommendation with hierarchical Poisson factorization. In UAI, pp. 326–335, 2015.
- Hall (2015) Hall, B. Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015.
- Harper & Konstan (2016) Harper, F. M. and Konstan, J. A. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
- Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. Stochastic Variational Inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR), 2014.
- Koren (2010) Koren, Y. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lu et al. (2009) Lu, Z., Agarwal, D., and Dhillon, I. S. A spatio–temporal approach to collaborative filtering. In ACM Conference on Recommender Systems (RecSys), 2009.
- Martens (2014) Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
- Michel et al. (2011) Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182, 2011.
- Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013a.
- Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013b.
- Mnih & Salakhutdinov (2008) Mnih, A. and Salakhutdinov, R. R. Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264, 2008.
- Murphy (2012) Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
- Noessner et al. (2013) Noessner, J., Niepert, M., and Stuckenschmidt, H. Rockit: Exploiting parallelism and symmetry for map inference in statistical relational models. In AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.
- Pascanu & Bengio (2013) Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
Pennington et al. (2014)
Pennington, J., Socher, R., and Manning, C.
Glove: Global vectors for word representation.
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
- Poignard et al. (2018) Poignard, C., Pereira, T., and Pade, J. P. Spectra of laplacian matrices of weighted graphs: structural genericity properties. SIAM Journal on Applied Mathematics, 78(1):372–394, 2018.
- Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
- Rudolph & Blei (2018) Rudolph, M. and Blei, D. Dynamic embeddings for language evolution. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1003–1011, 2018.
- Sun et al. (2016) Sun, F., Guo, J., Lan, Y., Xu, J., and Cheng, X. Sparse word embeddings using regularized online learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2915–2921, 2016.
- Sun et al. (2012) Sun, J. Z., Varshney, K. R., and Subbian, K. Dynamic matrix factorization: A state space approach. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1897–1900, 2012.
- Vinyals & Povey (2012) Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp. 1261–1268, 2012.
- Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394–3404, 2017.
- Zeiler (2012) Zeiler, M. D. ADADELTA: an Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.