We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. Calculating mutual information is intractable in this setting. Our key technical contribution is an adversarial objective that can be used to tractably estimate mutual information assuming only the feasibility of cross entropy calculation. We develop a concrete realization of this general formulation with Markov distributions over binary encodings. We report critical and unexpected findings on practical aspects of the objective such as the choice of variational priors. We apply our model on document hashing and show that it outperforms current best baselines based on discrete and vector quantized variational autoencoders. It also yields highly compressed interpretable representations.
Unsupervised learning of discrete representations is appealing because they correspond to natural symbolic representations in many domains (e.g., phonemes in speech signals, topics in text, and objects in images). However, working with discrete variables comes with technical challenges such as non-differentiability and nontrivial combinatorial optimization. Standard methods approach the problem within the framework of variational autoencoding
(Kingma & Welling, 2014; Rezende et al., 2014) and bypass these challenges by adopting some form of gradient approximation and possibly strong independence assumptions (Bengio et al., 2013; van den Oord et al., 2017).

In this paper we are instead interested in a promising alternative framework based on maximal mutual information (MMI). Unlike autoencoding, MMI estimates a distribution over latent variables without modeling raw signals by maximizing the mutual information between the latent and a target variable (Brown et al., 1992; Bell & Sejnowski, 1995; Tishby et al., 1999). It is well motivated as a principled approach to learning representations that retain only predictive information and drop noise. Its neural extensions have recently been quite successful in learning useful continuous representations across domains (Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Bachman et al., 2019).
We depart from these existing works on MMI in two important ways. First, we learn discrete structured representations. There are previous works on learning discrete representations with neural MMI (McAllester, 2018; Stratos, 2019)
, but they only consider unstructured representations, which can transmit at most $\log k$ bits of information for $k$ labels. Breaking this log bottleneck in the discrete regime requires making encodings structured, but that also makes exact computation intractable. Thus the feasibility of optimizing mutual information effectively in this setting has remained unclear. We develop a tractable formulation that requires only tractable cross entropy, obtained by a combination of mild structural assumptions and an appropriate loss function (see below).
Second, we consider a new mutual information estimator based on the difference of entropies for learning representations. This is a crucial departure from existing works that optimize variational lower bounds (Poole et al., 2019). Estimators of a lower bound on mutual information have been shown to suffer fundamental limitations (McAllester & Stratos, 2020), suggesting a need to investigate alternative estimators. Our estimator is neither a lower bound nor an upper bound, yet it can be optimized adversarially as in generative adversarial networks (GANs) (Goodfellow et al., 2014). We show for the first time that such adversarial optimization of mutual information is a viable option for learning meaningful representations.
Our proposed discrete structured MMI is novel and largely uncharted in the literature. An important contribution of this paper is charting practical considerations for this unfamiliar approach by developing a concrete realization based on a structured model over binary encodings. More specifically, the model encodes an observation into a zero-one vector of length $m$, resulting in $2^m$ possible encodings. We show how mutual information can be estimated efficiently by adversarial dynamic programming with controllable Markov assumptions. One critical and unexpected finding is that the expressiveness of the variational prior needs to be strictly greater than that of the model (i.e., it has to be higher-order Markovian).
To demonstrate the utility of our model in a real-world problem, we apply it to unsupervised document hashing (Dong et al., 2019; Shen et al., 2018; Chaidaroon et al., 2018). The task is to compress an article into a drastically smaller discrete encoding that preserves semantics. Our model outperforms current state-of-the-art baselines based on discrete VAEs (Kingma & Welling, 2014; Maddison et al., 2017; Jang et al., 2017) and VQ-VAEs (van den Oord et al., 2017) with Bernoulli priors. We additionally design a predictive version of document hashing in which the model is tasked with encoding a future article with the knowledge of a past article. We find that our model achieves favorable performance with highly compressed and interpretable representations.
Many successful approaches to unsupervised representation learning are based on density estimation. For instance, it is now very common in natural language processing to make use of continuous representations that are learned in the process of modeling a conditional distribution
, such as the conditional distribution of a word given a context window (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019). In this case an input pair is easily sampled from unlabeled data by masking an observed word. There is also much work that identifies representations, often continuous, with the latent variables in an unconditional density model of the data (Kingma & Welling, 2014; Rezende et al., 2014; Higgins et al., 2017).

Learning representations through density estimation, however, suffers from certain limitations. First, it may be unnecessary to fully model the density of noisy, raw data when we are only interested in learning representations. Second, many standard approaches to learning discrete-valued latent representations in the context of density estimation require the use of either biased gradient estimators (Bengio et al., 2013; van den Oord et al., 2017)
or high variance ones
(Mnih & Gregor, 2014; Mnih & Rezende, 2016).

Maximal mutual information (MMI) is a refreshingly different approach to unsupervised representation learning in which we estimate a conditional distribution over latent representations by maximizing mutual information under these distributions. In contrast with density estimation, there is no issue of modeling noise since the model never estimates a distribution over raw signals (i.e., there is no decoder). The mutual information objective has been shown to produce state-of-the-art representations of images, speech, and text (Bachman et al., 2019; Oord et al., 2018).
The focus with MMI so far has been largely limited to learning continuous representations. Existing works on learning discrete representations with MMI only consider unstructured one-of-$k$ representations (McAllester, 2018; Stratos, 2019) for computational reasons. Our main contribution is a tractable formulation for discrete structured MMI. This involves an adversarial objective reminiscent of GANs (Goodfellow et al., 2014) and radically different from existing MMI objectives based on variational lower bounds on mutual information (Poole et al., 2019). Beyond tractability, the choice of objective can be theoretically motivated as avoiding statistical limitations of estimating lower bounds on mutual information (McAllester & Stratos, 2020).
Let $p(x, y)$ denote an unknown but samplable joint distribution over raw signals $(X, Y)$. We assume discrete $(X, Y)$ for simplicity and relevance to our experimental setting (document hashing), but the formulation can be easily adapted to the continuous case. We introduce an encoder $p_\theta(z|x)$ that defines a conditional distribution over a discrete latent variable $Z$ representing the encoding of $X$, and aim to maximize $I(Z, Y)$: the mutual information between $Z$ and $Y$ under $p$ and $p_\theta$. By the data processing inequality, the objective is a lower bound on $I(X, Y)$ and can be viewed as distilling the predictive information of $X$ about $Y$ into $Z$.

This formulation alone is meaningless since it admits the trivial solution $Z = X$. In order to achieve compression, there are various options. In the information bottleneck method (Tishby et al., 1999), we additionally regularize the information rate of $Z$ by simultaneously minimizing $I(X, Z)$. Here we advocate a more direct approach by giving an explicit budget on the entropy of $Z$. Equivalently, we can simply maximize $I(Z, Y)$ with a finite encoding space $\mathcal{Z}$ so that $H(Z) \leq \log |\mathcal{Z}|$. Even with small $|\mathcal{Z}|$ the objective is intractable because it involves marginalization over $X$. Using the decomposition $I(Z, Y) = H(Z) - H(Z|Y)$, we introduce a variational model $q_\phi(z|y)$ and optimize
$$\max_{\theta,\phi}\; H_\theta(Z) - H^+_{\theta,\phi}(Z \mid Y) \tag{1}$$
where $H^+_{\theta,\phi}(Z|Y)$ denotes the cross entropy between $p_\theta(z|y)$ and $q_\phi(z|y)$. By the usual property of cross entropy, $H^+_{\theta,\phi}(Z|Y)$ is an upper bound on $H_\theta(Z|Y)$, hence the objective (1) is a lower bound on $I(Z, Y)$.
Unfortunately, the applicability of this variational lower bound is critically limited to settings in which the entropy of $Z$ is tractable (McAllester, 2018; Stratos, 2019) or constant with respect to learnable parameters (Chen et al., 2016; Alemi et al., 2017). When $|\mathcal{Z}|$ is small, $H_\theta(Z)$ can be easily estimated from $N$ iid samples $x_1, \ldots, x_N$ as
$$\widehat{H}_\theta(Z) = -\sum_{z \in \mathcal{Z}} \widehat{p}_\theta(z) \log \widehat{p}_\theta(z), \qquad \widehat{p}_\theta(z) = \frac{1}{N} \sum_{i=1}^N p_\theta(z \mid x_i) \tag{2}$$
But this explicit calculation is clearly infeasible for large $|\mathcal{Z}|$. Furthermore, the log bottleneck on the budget then implies that it is infeasible to achieve a large information rate using this naive formulation (e.g., even if we specify $|\mathcal{Z}|$ to be a trillion we have $H(Z) \leq \log_2 10^{12} \approx 40$ bits).
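For small encoding spaces, the plug-in estimate in equation (2) is a direct sum over all encodings. A minimal sketch, with function names that are ours rather than the paper's:

```python
import math

def brute_force_entropy(cond_probs):
    """Estimate H(Z) from N samples of x, where cond_probs[i][z] = p(z | x_i).

    Implements equation (2): average the conditionals into an empirical
    marginal p_hat(z), then take its entropy. Feasible only when the number
    of possible encodings |Z| is small enough to enumerate.
    """
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Empirical marginal: p_hat(z) = (1/N) * sum_i p(z | x_i)
    marginal = [sum(row[z] for row in cond_probs) / n for z in range(k)]
    # Entropy of the empirical marginal (in nats); skip zero-probability terms.
    return -sum(p * math.log(p) for p in marginal if p > 0.0)
```

Uniform conditionals recover the maximum entropy $\log |\mathcal{Z}|$, and a deterministic encoder that maps every input to the same code gives zero entropy.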
To allow for large $|\mathcal{Z}|$, we propose to make $Z$ structured. The simplest example of a structured $Z$ is a binary vector $z \in \{0,1\}^m$, which yields $|\mathcal{Z}| = 2^m$ so that $H(Z)$ can be as large as $m$ bits. More generally, $Z$ can be any structure whose number of values is exponential in some controllable integer $m$.
The first key ingredient in deriving a tractable formulation is the tractability of the cross entropy between $p_\theta(z|x)$ and $q_\phi(z|y)$.

Assumption 3.1. The cross entropy $H^+_{\theta,\phi}(Z|Y)$ estimated from $N$ iid samples of $(x, y) \sim p$ can be computed in time polynomial in $(m, N)$, where $m$ is the structure size parameter (e.g., the length of the binary vector $z$).
There is a class of structured probabilistic models with standard conditional independence assumptions for which Assumption 3.1 holds. For instance, in the case $\mathcal{Z} = \{0,1\}^m$, we may impose Markov assumptions and define (with the convention $z_j = \emptyset$ for $j \leq 0$)

$$p_\theta(z \mid x) = \prod_{t=1}^m p_\theta(z_t \mid z_{t-a:t-1}, x), \qquad q_\phi(z \mid y) = \prod_{t=1}^m q_\phi(z_t \mid z_{t-b:t-1}, y)$$

where $a$ and $b$ are the Markov orders of $p_\theta$ and $q_\phi$. It can be easily verified that the estimate of $H^+_{\theta,\phi}(Z|Y)$ based on a single sample $(x, y)$ is

$$-\sum_{t=1}^m \sum_{z_{t-b:t}} \mu_t(z_{t-b:t} \mid x) \log q_\phi(z_t \mid z_{t-b:t-1}, y)$$

where $\mu_t(z_{t-b:t} \mid x)$ is the marginal probability of the length-$(b+1)$ sequence ending at position $t$ under the conditional distribution $p_\theta(\cdot \mid x)$. These marginals can be computed by applying a variant of the forward algorithm (Rabiner, 1989) (see the supplementary material). We give a general algorithm, Algorithm 1, that computes the cross entropy between any distributions over $\{0,1\}^m$ with Markov orders $a \leq b$ in time linear in $m$ and exponential only in the Markov orders.

While we focus on the choice $\mathcal{Z} = \{0,1\}^m$ for concreteness, we emphasize that similar structural assumptions can be made to consider other structures. For instance, the conditional entropy of tree-structured $z$ can be computed using a variant of the inside algorithm (Hwa, 2000). We leave exploring other types of structure as future work.
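To make the dynamic program concrete, here is a minimal sketch for the first-order case (order-1 chains, dropping the conditioning on $x$ and $y$ for brevity; function names are ours). The paper's Algorithm 1 generalizes this recursion to arbitrary Markov orders:

```python
import math

def markov_cross_entropy(p_init, p_trans, q_init, q_trans, m):
    """Cross entropy H+(p, q) = -E_{z~p}[log q(z)] between two first-order
    Markov distributions over binary vectors z in {0,1}^m, in O(m) time.

    p_init[z], q_init[z] are initial probabilities; p_trans[a][b],
    q_trans[a][b] are transition probabilities from bit value a to b.
    """
    # Forward pass: unary marginals mu[t][z] = p(z_t = z).
    mu = [list(p_init)]
    for _ in range(1, m):
        prev = mu[-1]
        mu.append([sum(prev[a] * p_trans[a][b] for a in (0, 1)) for b in (0, 1)])
    # Cross entropy decomposes over positions under the Markov assumption.
    h = -sum(mu[0][z] * math.log(q_init[z]) for z in (0, 1))
    for t in range(1, m):
        for a in (0, 1):
            for b in (0, 1):
                pair = mu[t - 1][a] * p_trans[a][b]  # p(z_{t-1}=a, z_t=b)
                h -= pair * math.log(q_trans[a][b])
    return h
```

With $p = q$ this returns the entropy of $p$; by Gibbs' inequality any other $q$ gives a value at least as large, which is exactly the property the variational upper bound exploits.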
Assumption 3.1 allows us to estimate the second term (cross entropy) of the objective (1), but it is still insufficient for estimating the objective since the first term (entropy) remains intractable. Assumption 3.1 only imposes conditional independence: conditioning on the input $x$, the $t$-th bit of $z$ is independent of all but the previous $a$ bits under $p_\theta$. This independence breaks for the unconditional distribution $p_\theta(z) = \mathbb{E}_x[p_\theta(z|x)]$. Consequently the entropy term $H_\theta(Z)$ does not decompose. Note that while the conditional entropy $H_\theta(Z|X)$ remains tractable, we cannot use it as a meaningful approximation of $H_\theta(Z)$ since the error is $I(X, Z)$, which is zero iff $Z$ is independent of $X$ (i.e., vacuous encoding).
Thus we propose to introduce an additional variational model $q_\psi(z)$ to estimate the intractable distribution $p_\theta(z)$. We would like to make the resulting variational approximation a lower bound on $H_\theta(Z)$ so that the objective remains maximization over all models. Unfortunately, when the entropy is large (which is our setting) meaningful lower bounds are impossible (McAllester & Stratos, 2020). This motivates us to again consider the cross-entropy upper bound, with the following assumption.
Assumption 3.2. The cross entropy $H^+_{\theta,\psi}(Z)$ between $p_\theta(z)$ and $q_\psi(z)$ estimated from $N$ iid samples of $x \sim p$ can be computed in time polynomial in $(m, N)$.

In the binary vector setting $\mathcal{Z} = \{0,1\}^m$, we can define $q_\psi(z)$ to be a Markov model of order $c$:

$$q_\psi(z) = \prod_{t=1}^m q_\psi(z_t \mid z_{t-c:t-1})$$

Then Algorithm 1 can be used to estimate $H^+_{\theta,\psi}(Z)$ in time linear in $m$ and exponential only in $c$.
This gives our final objective

$$\max_{\theta,\phi}\ \min_{\psi}\ H^+_{\theta,\psi}(Z) - H^+_{\theta,\phi}(Z \mid Y) \tag{3}$$

which is tractable by Assumptions 3.1 and 3.2. Note that for any choice of $\theta$, exact optimization over $\phi$ and $\psi$ recovers the original objective $H_\theta(Z) - H_\theta(Z|Y) = I(Z, Y)$. The objective can be interpreted as a simultaneously collaborative and adversarial game. The second term (cross entropy minimization) encourages $p_\theta$ and $q_\phi$ to agree on the encoding of $x$. The first term (entropy maximization) encourages $p_\theta$ to diversify its prediction of $z$ but also to use information from $x$ to thwart the opponent $q_\psi$, who does not have access to $x$.
A notable aspect of the objective is that it is neither an upper bound nor a lower bound on $I(Z, Y)$; we cannot guarantee that the value of (3) estimated from samples is larger or smaller than $I(Z, Y)$. While we lack such guarantees, theoretical and empirical evidence that this bypasses the limitations of lower bounds on mutual information is given in McAllester & Stratos (2020).
At test time, given input $x$ we calculate

$$z^{\text{enc}}(x) = \arg\max_{z \in \mathcal{Z}} p_\theta(z \mid x)$$

and use $z^{\text{enc}}(x)$ as a discrete structured representation of $x$. In the current setting, in which $p_\theta(\cdot \mid x)$ is an order-$a$ Markov distribution over $\{0,1\}^m$, we can calculate $z^{\text{enc}}(x)$ in time linear in $m$ using a variant of the Viterbi algorithm (Viterbi, 1967).
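For the first-order case, the argmax encoding can be computed with a standard Viterbi recursion; a sketch with hypothetical names, where the conditioning on $x$ is absorbed into the log-probability tables:

```python
import math

def viterbi_binary(init_logp, trans_logp, m):
    """Most probable binary vector z in {0,1}^m under a first-order Markov
    model given as log-probabilities init_logp[z1] and trans_logp[a][b]."""
    delta = list(init_logp)  # delta[z] = best log-prob of a prefix ending in z
    back = []
    for _ in range(1, m):
        new_delta, ptr = [], []
        for b in (0, 1):
            best_a = max((0, 1), key=lambda a: delta[a] + trans_logp[a][b])
            new_delta.append(delta[best_a] + trans_logp[best_a][b])
            ptr.append(best_a)
        delta = new_delta
        back.append(ptr)
    # Recover the argmax by following back-pointers from the best final bit.
    z = [max((0, 1), key=lambda b: delta[b])]
    for ptr in reversed(back):
        z.append(ptr[z[-1]])
    z.reverse()
    return z
```

Higher Markov orders replace the two states per position with $2^a$ states, keeping the decoding linear in $m$.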
We give details of the proposed adversarial MMI training procedure in Algorithm 2. As input we assume model definitions $p_\theta$, $q_\phi$, and $q_\psi$ that satisfy Assumptions 3.1 and 3.2, a samplable data distribution $p(x, y)$, a gradient update function (we use Adam (Kingma & Ba, 2014) for all our experiments), and a validation task. The validation task evaluates the quality of the representations predicted by $p_\theta$ and is particularly needed since the running estimate of the adversarial objective (3) may not reflect actual progress. We delineate certain practical issues that are important in making Algorithm 2 effective.
We find that it is critical to make the variational prior $q_\psi(z)$ strictly more expressive than the posterior $p_\theta(z|x)$. That is, there must exist distributions over $\mathcal{Z}$ that can be modeled by $q_\psi$ but not by $p_\theta$ (conditioning on any $x$). In the context of Markov models over binary vectors, this means the Markov order $c$ of $q_\psi$ is strictly greater than the Markov order $a$ of $p_\theta$ (Algorithm 1 allows for any $a \leq c$). Recall that the bits of $z$ are conditionally independent under $p_\theta(z|x)$ but not independent under the marginal $p_\theta(z)$ (Section 3.1.2). Thus we must model $p_\theta(z)$ using a distribution that is strictly more powerful than the posterior. The benefit of this explicit joint entropy maximization is suggested in experiments later, in which we show that our approach is more effective at learning representations than discrete VAEs or VQ-VAEs (which do not explicitly maximize entropy) as $m$ becomes larger.
We also find it helpful to overparameterize $q_\psi$. We use a feedforward network with a tunable number of ReLU layers to define the distribution.
The adversarial objective (3) reduces to the non-adversarial objective (1) if the inner minimization over $\psi$ is solved exactly. We find it important to mimic this by taking multiple ($k$) gradient steps for $q_\psi$ with a large learning rate before taking a gradient step for $p_\theta$ and $q_\phi$. Aggressive inner-loop optimization has been shown to be helpful in other contexts such as VAEs (He et al., 2019). Note that $q_\psi$ is still carried across batches and not learned from scratch at every batch.
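The resulting alternation can be summarized schematically as follows (our sketch of the update schedule, not the authors' code; in Algorithm 2 each entry corresponds to an optimizer step on the named parameters):

```python
def ammi_update_schedule(num_batches, k):
    """Sketch of the alternation in Algorithm 2: for every batch, take k
    aggressive gradient steps on the variational prior q_psi (the inner
    minimization), then one step on the encoder p_theta and the variational
    decoder q_phi (the outer maximization). q_psi is carried across batches,
    never re-initialized."""
    schedule = []
    for _ in range(num_batches):
        schedule += ["min over q_psi"] * k          # inner adversary, large lr
        schedule.append("max over p_theta, q_phi")  # one outer step
    return schedule
```

The closer the inner player is to its optimum at each outer step, the closer the cross-entropy term is to the true entropy it upper-bounds.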
Finally, we find it useful to introduce a tunable weight $\beta$ for the entropy term, akin to the weight for the KL divergence in $\beta$-VAEs (Higgins et al., 2017). Optimizing the weighted objective corresponds to maximizing $\beta H^+_{\theta,\psi}(Z) - H^+_{\theta,\phi}(Z|Y)$. The weight can be used to determine a task-specific trade-off between predictiveness and diversity in the encodings.
We now study empirical aspects of our proposed adversarial MMI approach (henceforth AMMI) with extensive experiments. We consider unsupervised document hashing (Chaidaroon et al., 2018) as the main testbed for evaluating the quality of learned representations. The task is to compress an article into a drastically smaller discrete encoding that preserves semantics, and it is typically formulated as an autoencoding problem. To study methods in a predictive setting, we also develop a variant of this task in which the representation of an article is learned to be predictive of the encoding of a related article.
Let $X$ be a random variable corresponding to a document. The goal is to learn a document encoder $p_\theta(z|x)$ that defines a conditional distribution over binary hashes $z \in \{0,1\}^m$. The quality of document encodings is evaluated by the average top-100 precision. Specifically, given a document at test time, we retrieve its 100 nearest neighbors from the training set under the encoding, measured by the Hamming distance, and check how many of the neighbors have overlapping topic labels (thus we assume annotation only for evaluation).

In the literature this is typically approached as an autoencoding problem in which the encoder is estimated by maximizing the evidence lower bound (ELBO) on the marginalized log likelihood of training documents
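The evaluation described above can be sketched as follows (a hypothetical helper of ours, not the authors' code), with codes as binary tuples and labels as sets of topic labels:

```python
def top_k_precision(query_code, query_labels, train_codes, train_labels, k=100):
    """Retrieve the k nearest training codes by Hamming distance and return
    the fraction whose topic labels overlap with the query's labels."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    ranked = sorted(range(len(train_codes)),
                    key=lambda i: hamming(query_code, train_codes[i]))
    top = ranked[:k]
    return sum(1 for i in top if train_labels[i] & query_labels) / len(top)
```

The reported metric is this precision averaged over all test documents.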
$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, q(z)\right) \tag{4}$$
where $q(z)$ is a fixed prior suitable for the task. For example, the current state-of-the-art model (BMSH) defines $q(z)$ as a mixture of Bernoulli distributions (Dong et al., 2019). Here $q_\phi(z|x)$ is treated as a variational model that estimates the intractable posterior under the generative model.

In contrast, we propose to learn a document encoder by the following adversarial formulation of the mutual information between $X$ and $Z$:
$$\max_{\theta}\ \min_{\psi}\ H^+_{\theta,\psi}(Z) - H_\theta(Z \mid X) \tag{5}$$
where $q_\psi(z)$ is a variational model that estimates the intractable prior $p_\theta(z)$ under the encoder. This can be seen as a single-variable variant of the more general objective (3) in which $y = x$ and $q_\phi$ is tied with $p_\theta$. Note the absence of a decoder (Section 2).
We follow the standard setting in BMSH for all our models (Dong et al., 2019). The raw document representation $x$ is a high-dimensional TFIDF vector computed from preprocessed corpora (TMC, NG20, and Reuters) provided by Chaidaroon et al. (2018). We aim to learn an $m$-dimensional binary vector representation, where we vary the number of bits $m \in \{16, 32, 64, 128\}$. All VAE-based models compute a continuous embedding $e$ by feeding $x$ through a feedforward layer (FF) and apply some discretization operation on $e$ to obtain $z$. BMSH computes Bernoulli parameters from $e$ and samples $z$, from which $x$ is reconstructed.^1 The ELBO objective (4) is optimized by straight-through estimation (Bengio et al., 2013).^2

^1 We refer to Dong et al. (2019) for details of the decoder and the Bernoulli mixture prior since they are not needed for AMMI.
^2 That is, the decoder receives $z$ but backpropagates gradients directly to $e$.

We are interested in comparing AMMI to mainstream discrete representation learning techniques. Thus we additionally consider vector quantized VAEs (VQ-VAEs), which substitute sampling with a nearest neighbor lookup and have been shown to be useful in many tasks (van den Oord et al., 2017; Razavi et al., 2019). In particular, we adopt the decomposed vector quantization (DVQ) proposed in Kaiser et al. (2018). The model learns a set of codebooks, where each row index corresponds to a segment of the bits of $z$. The encoder computes $e$ and quantizes each segment of $e$ against the corresponding codebook (implicitly yielding $z$), from which the decoder reconstructs $x$. DVQ is trained by minimizing the reconstruction loss and the vector quantization loss. It can be seen as optimizing the ELBO objective (4) with a uniform prior over $z$ and a point-mass posterior.
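The decomposed quantization step can be sketched as follows (our illustration of the idea in Kaiser et al. (2018), with hypothetical names): the embedding is split into equal segments and each segment snaps to its nearest codebook row.

```python
def dvq_quantize(e, codebooks):
    """Split e into len(codebooks) equal segments; replace each segment by its
    nearest row (squared Euclidean distance) in the corresponding codebook.
    Returns the selected row indices (the discrete code) and the quantized
    vector that a decoder would reconstruct from."""
    num_books = len(codebooks)
    seg = len(e) // num_books
    indices, quantized = [], []
    for j, book in enumerate(codebooks):
        s = e[j * seg:(j + 1) * seg]
        dists = [sum((a - b) ** 2 for a, b in zip(s, row)) for row in book]
        i = min(range(len(book)), key=dists.__getitem__)
        indices.append(i)
        quantized.extend(book[i])
    return indices, quantized
```

With codebooks of $2^{m/D}$ rows each, the $D$ selected row indices together determine the $m$-bit code.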
Our model consists of an encoder $p_\theta(z|x)$ and a variational prior $q_\psi(z)$, which are respectively parameterized by order-$a$ and order-$c$ Markov models over $\{0,1\}^m$. The encoder computes, from a feedforward network over $x$, the model's probability of each bit $z_t$ conditioned on each value of the previous $a$ bits, for every position $t$. Similarly, the prior computes the probability of each bit conditioned on each value of the previous $c$ bits using a learnable embedding dictionary. We can then use Algorithm 1 to calculate the two terms of the objective (5), where in practice we use a batch of samples to estimate these quantities. The algorithm can be batched efficiently.
All hyperparameters for AMMI are shown in Algorithm 2. We perform random grid search on the validation portion of each dataset over the initialization scale, batch size, number of adversarial steps $k$, adversarial learning rate, learning rate, and entropy weight $\beta$. We similarly perform random grid search on all hyperparameters of BMSH and DVQ. We use an NVIDIA Quadro RTX 6000 with 24GB memory.
The Markov orders $a$ and $c$ of the encoder and the prior are also controllable hyperparameters in AMMI. We find that setting $a = 0$ is sufficient for this task (i.e., bits are independent conditioning on the document). But the choice of $c$ is crucial, as we show below.
| Method | TMC 16b | 32b | 64b | 128b | NG20 16b | 32b | 64b | 128b | Reuters 16b | 32b | 64b | 128b | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BOW | 50.86 | – | – | – | 9.22 | – | – | – | 57.62 | – | – | – | 39.23 |
| LSH | 43.93 | 45.14 | 45.53 | 47.73 | 5.97 | 6.66 | 7.70 | 9.49 | 32.15 | 38.62 | 46.67 | 51.94 | 31.79 |
| S-RBM | 51.08 | 51.66 | 51.90 | 51.37 | 6.04 | 5.33 | 6.23 | 6.42 | 57.40 | 61.54 | 61.77 | 64.52 | 39.61 |
| SpH | 60.55 | 62.81 | 61.43 | 58.91 | 32.00 | 37.09 | 31.96 | 27.16 | 63.40 | 65.13 | 62.90 | 60.45 | 51.98 |
| STH | 39.47 | 41.05 | 41.81 | 41.23 | 52.37 | 58.60 | 58.06 | 54.33 | 73.51 | 75.54 | 73.50 | 69.86 | 56.61 |
| VDSH | 68.53 | 71.08 | 44.10 | 58.47 | 39.04 | 43.27 | 17.31 | 5.22 | 71.65 | 77.53 | 74.56 | 73.18 | 53.66 |
| NASH | 65.73 | 69.21 | 65.48 | 59.98 | 51.08 | 56.71 | 50.71 | 46.64 | 76.24 | 79.93 | 78.12 | 75.59 | 64.62 |
| GMSH | 67.36 | 70.24 | 70.86 | 72.37 | 48.55 | 53.81 | 58.69 | 55.83 | 76.72 | 81.83 | 82.12 | 78.46 | 68.07 |
| DVQ | 71.47 | 73.27 | 75.17 | 76.24 | 47.23 | 54.45 | 58.77 | 62.10 | 79.57 | 83.43 | 83.73 | 86.27 | 70.98 |
| BMSH | 70.62 | 74.81 | 75.19 | 74.50 | 58.12 | 61.00 | 60.08 | 58.02 | 79.54 | 82.86 | 82.26 | 79.41 | 71.37 |
| AMMI | 71.17 | 73.67 | 75.05 | 76.24 | 55.49 | 59.58 | 63.80 | 65.74 | 82.62 | 83.39 | 85.18 | 86.16 | 73.17 |
| BMMI | 70.52 | – | – | – | 49.74 | – | – | – | 79.97 | – | – | – | – |
We first examine the feasibility of the variational approximation of the prior. To this end, we consider the small-bit setting ($m = 16$) in which we can explicitly enumerate all values of $z$ to estimate the entropy using equation (2). In this case the objective becomes non-adversarial. We refer to this model as BMMI (brute-force MMI); it consists only of $p_\theta$ trained by the exact objective (1).
Figure 1 shows two experiments that illustrate the importance of the Markov order $c$ of the variational prior $q_\psi$. First, we fix a BMMI model partially trained on Reuters and a random batch of samples, and calculate the empirical entropy by brute force. Then for each choice of $c$, we minimize the empirical cross entropy between $p_\theta$ and $q_\psi$ over $\psi$ with full hyperparameter tuning. Figure 1(a) shows that a sufficiently large $c$ is needed to achieve a realistic estimate of the empirical entropy, and the necessary value of $c$ clearly depends on the model being approximated.
Next, we examine the best achievable validation precision on Reuters across different values of $c$. We perform full hyperparameter tuning for BMMI and for AMMI with each choice of $c$. We see that the performance is poor for small $c$, but it quickly becomes competitive as $c$ grows and even surpasses the performance of BMMI. We hypothesize that the adversarial formulation has beneficial regularization effects in addition to making the objective tractable for large $m$; we leave a deeper investigation of this phenomenon as future work. We fix this choice of $c$ for all our experiments.
Table 1 shows top-100 precisions on the test portion of each dataset (TMC, NG20, and Reuters) using $m \in \{16, 32, 64, 128\}$ bits. Baselines include locality sensitive hashing (LSH) (Datar et al., 2004), stacked restricted Boltzmann machines (S-RBM) (Hinton, 2012), spectral hashing (SpH) (Weiss et al., 2009), self-taught hashing (STH) (Zhang et al., 2010), variational deep semantic hashing (VDSH) (Chaidaroon et al., 2018), and the neural architecture for semantic hashing (NASH) (Shen et al., 2018), as well as BMSH (Dong et al., 2019) and DVQ described in Section 4.1.1. The naive baseline BOW refers to the bag-of-words representation that indicates the presence of words in the document.^3

^3 The BOW dimension is the vocabulary size of each dataset.

We see that AMMI performs favorably against current state-of-the-art methods, yielding the best average precision across datasets and settings. In particular, the precision of AMMI is significantly higher than the best previous result given by BMSH when $m$ is large. With 128 bits, AMMI achieves 76.24 vs 74.50, 65.74 vs 58.02, and 86.16 vs 79.41 on TMC, NG20, and Reuters, respectively. We hypothesize that this is partly due to the explicit entropy maximization in AMMI that considers all bits jointly via dynamic programming. While this is implicitly enforced by the mixture prior in BMSH, direct entropy maximization seems to be more effective.
In the case of 16 bits, we also report precisions of BMMI, which estimates entropy exactly by brute force (it is computationally intractable to train BMMI with many more bits). We see that AMMI is again able to achieve better results, potentially due to regularization effects. Finally, we observe that the newly proposed DVQ baseline is quite competitive with BMSH and also achieves higher precision when $m$ is large; we suspect that the decomposed encoding allows the model to make better use of multiple codebooks, as reported in Kaiser et al. (2018).
Distance | Document |
---|---|
0 | O.J. Simpson lashed out at the family of the late Ronald Goldman, a day after they won the rights to Simpson’s canceled "If I Did It" book about the slayings of Goldman |
1 | News Corp. on Monday announced that it will cancel the release of a new book by former American football star O.J. Simpson and a related exclusive television interview |
5 | Phil Spector’s lawyers have asked the judge to tell jurors they must find the record producer either guilty or not guilty of murder with no option to find lesser offenses |
10 | Sen. Ted Stevens’ defense lawyer bore in on the prosecution’s chief witness on Tuesday, portraying him to a jury as someone who betrayed a longtime friend to protect his fortune. |
20 | Words that cannot be said on American television are not often uttered at the U.S. Supreme Court, at least not by high-priced lawyers and the justices themselves. |
50 | Sending a strong message that the faltering economy will be his top focus, President-elect Barack Obama on Friday urged Congress to pass an economic stimulus package |
90 | President Hu Jintao’s upcoming visits to Latin America and Greece would boost bilateral relations and deepen cooperation |
0 | Ukrainian President Leonid Kuchma had a meeting on Monday evening with Polish President Alexander Kwasniewski and Lithuanian President Valdas Adamkus |
1 | Radical Ukrainian opposition figure Yulia Timoshenko Wednesday ventured into the hostile eastern mining bastion of Prime Minister Viktor Yanukovich |
5 | Ukrainian President Viktor Yushchenko was forced into an emergency landing Thursday and seized the aircraft of his bitter political foe, Prime Minister Yulia Tymoshenko |
10 | On a clearing in this disputed city, where enemy homes were bulldozed after the conflict in August, Mayor Yuri M. Luzhkov promised this month to build a new neighborhood |
20 | Barack Obama is the "American Gorbachev" who will ultimately destroy the United States, militant Russian nationalist Vladimir Zhirinovsky said Tuesday. |
50 | Ministers from Pacific Rim nations warned Thursday that imposing trade barriers in reaction to the global economic downturn would only deepen the crisis. |
90 | We shall move the following graphics: US IRAQ QAEDA Graphic with portraits of Osama bin Laden and Colin Powell, examining US claims that the latest bin Laden tape reinforces |
0 | NASCAR has a new rivalry: Carl Edwards vs. Kyle Busch. Edwards called the latest installment payback and Busch promised that retribution will come down the road. |
1 | Penske Racing teammates Ryan Briscoe and Helio Castroneves filled the front-row Friday for Edmonton Indy, repeating their 1-2 finish in last week’s IndyCar race |
5 | Brazilian race-car driver Helio Castroneves upset Spice Girl Melanie Brown to capture the fifth "Dancing With the Stars" mirrorball trophy in the television dance competition. |
10 | Audi overcame the challenge of two Peugeot cars and wet racing conditions Sunday to win the 24 Hours of Le Mans for the fourth straight year. |
20 | The president of cycling’s governing body Monday insists the doping problems in his sport do not threaten its place in the Olympics. |
50 | Robert Pattinson, who stars as the vampire heartthrob Edward Cullen in the forthcoming movie "Twilight," stepped onto a riser at the King of Prussia Mall |
90 | Australian Prime Minister John Howard reshuffled his cabinet Tuesday, appointing Education Minister Brendan Nelson to the defence portfolio. |
0 | How did that Van Halen song go? “I found the simple life ain’t so simple … ” The iconic Los Angeles metal band is set to be inducted Monday night into |
8 | The boys from Van Halen, most notably mercurial guitarist Eddie Van Halen, showed up as promised at a news conference Monday to announce their fall tour with original singer |
20 | The PG-13-rated thriller gave 20th Century Fox its first No. 1 launch in seven months. The opening-night crowd was heavily male and young, matching the video-game |
50 | Due to Wednesday night’s victory, a mathematician and avid Vasco soccer fan calculated on Thursday that the team’s chances of being dropped into the second division fell by |
90 | A top Iranian minister who admitted to faking his university degree will face a motion of no confidence on Tuesday on charges that he tried to bribe members of Parliament |
Method | Dim | # Distinct Codes | Precision
---|---|---|---
BOW | 20000 | 208808 | 26.66 |
BMSH | 128 | 208004 | 75.77 |
DVQ | 128 | 208655 | 76.80 |
AMMI | 128 | 153123 | 79.14 |
Unsupervised document hashing only considers a single variable and does not test the conditional formulation. Hence we introduce a new task, predictive document hashing, in which $(X, Y)$ represent distinct articles that discuss the same event. It is clear that $I(X, Y)$ is large: the large uncertainty of a random article is dramatically reduced given a related article.
We construct such article pairs from the Who-Did-What dataset (Onishi et al., 2016). We remove all overlapping articles so that each article appears only once in the entire training/validation/test data, containing 104404/8928/7326 document pairs. We give more details of the dataset in the supplementary material. As before, we use TFIDF vectors over a vocabulary of 20000 types as raw input and consider the task of compressing them into $m$ binary bits. The quality of the binary encodings is measured by top-100 matching precision: given a test article $x$, we check whether the correct corresponding article $y$ is included in the 100 nearest neighbors of $x$ under the encoding, based on the Hamming distance.
AMMI now consists of a pair of encoders for $x$ and $y$ as well as a variational prior, which are parameterized by Markov models over $\{0,1\}^m$. Based on our findings in the previous section, we use conditionally independent encoders and a strictly higher-order prior. We train the model with Algorithm 2.
To compare with VAEs, we consider a conditional variant that optimizes the ELBO objective under the joint distribution, with an approximation of the true posterior that can be used as a document encoder. In this setting, training for BMSH remains unchanged except that the model predicts $y$ instead of $x$ for reconstruction and uses a conditional prior in the KL regularization term. The DVQ model likewise simply predicts $y$ instead of $x$, but loses its ELBO interpretation. We tune hyperparameters of all models as before.
Table 3 shows top-100 matching precisions on the test portion. We see that AMMI again achieves the best performance in comparison to BMSH and DVQ. We also report the number of distinct values of $z$ induced on the 208808 training articles (the union of all $x$ and $y$). We see that AMMI learns the most compact clustering, which nonetheless generalizes best.
We conduct qualitative analysis of the document encodings by examining articles with increasing Hamming distance (i.e., semantic drift). Table 2 shows illustrative examples. The article about the O.J. Simpson trial drifts to the Phil Spector trial, the Ted Stevens trial, and eventually other unrelated subjects in politics and economy. The article about NASCAR drifts to other racing reports, cycling, movie stars and politics.
We have presented AMMI, an approach to learning discrete structured representations by adversarially maximizing mutual information. It obviates the intractability of entropy estimation by making mild structural assumptions that apply to a wide class of models and by optimizing a difference of cross-entropy upper bounds. We have derived a concrete instance of the approach based on Markov models and identified important practical issues such as the expressiveness of the variational prior. We have demonstrated its utility on unsupervised document hashing by outperforming current best results. We have also proposed the predictive document hashing task and shown that AMMI yields high-quality semantic representations. Future work includes extending AMMI to other structured models and to cases in which even cross entropy calculation is intractable.
Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531–540, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

The forward algorithm is shown in Algorithm 3. We write $[m]$ to denote the set of integers $\{1, \ldots, m\}$ and $\mathbb{1}[A]$ for the indicator that is 1 if $A$ is true and 0 otherwise.
We take pairs from the Who-Did-What dataset (Onishi et al., 2016). The pairs in this dataset were constructed by drawing articles from the LDC Gigaword newswire corpus. A first article is drawn at random and then a list of candidate second articles is drawn using the first sentence of the first article as an information retrieval query. A second article is selected from the candidates using criteria described in Onishi et al. (2016), the most significant of which is that the second article must have occurred within a two week time interval of the first. We filtered article pairs so that each article is distinct in all data. The resulting dataset has 104404, 8928, and 7326 article pairs for training, validation, and evaluation.
To follow the standard setting in unsupervised document hashing, we represent each article as a TFIDF vector using a vocabulary of size 20000. We use SentencePiece (Kudo & Richardson, 2018) to learn an optimal text tokenization for the target vocabulary size.