Learning Discrete Structured Representations by Adversarially Maximizing Mutual Information

04/08/2020
by   Karl Stratos, et al.

We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. Calculating mutual information is intractable in this setting. Our key technical contribution is an adversarial objective that can be used to tractably estimate mutual information assuming only the feasibility of cross entropy calculation. We develop a concrete realization of this general formulation with Markov distributions over binary encodings. We report critical and unexpected findings on practical aspects of the objective such as the choice of variational priors. We apply our model on document hashing and show that it outperforms current best baselines based on discrete and vector quantized variational autoencoders. It also yields highly compressed interpretable representations.



1 Introduction

Unsupervised learning of discrete representations is appealing because they correspond to natural symbolic representations in many domains (e.g., phonemes in speech signals, topics in text, and objects in images). However, working with discrete variables comes with technical challenges such as non-differentiability and nontrivial combinatorial optimization. Standard methods approach the problem within the framework of variational autoencoding (Kingma & Welling, 2014; Rezende et al., 2014) and bypass these challenges by adopting some form of gradient approximation and possibly strong independence assumptions (Bengio et al., 2013; van den Oord et al., 2017).

In this paper we are instead interested in a promising alternative framework based on maximal mutual information (MMI). Unlike autoencoding, MMI estimates a distribution over latent variables without modeling raw signals by maximizing the mutual information between the latent and a target variable (Brown et al., 1992; Bell & Sejnowski, 1995; Tishby et al., 1999). It is well motivated as a principled approach to learning representations that retain only predictive information and drop noise. Its neural extensions have recently been quite successful in learning useful continuous representations across domains (Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Bachman et al., 2019).

We depart from these existing works on MMI in two important ways. First, we learn discrete structured representations. There are previous works on learning discrete representations with neural MMI (McAllester, 2018; Stratos, 2019), but they consider only unstructured representations, which can transmit at most the logarithm of the number of labels in bits of information. Breaking this log bottleneck in the discrete regime requires making encodings structured, but doing so also makes exact computation intractable. Thus the feasibility of optimizing mutual information effectively in this setting remains unclear. We develop a tractable formulation that requires only tractable cross entropy, by a combination of mild structural assumptions and an appropriate loss function (see below).

Second, we consider a new mutual information estimator based on the difference of entropies for learning representations. This is a crucial departure from existing works that optimize variational lower bounds (Poole et al., 2019). Estimators of a lower bound on mutual information have been shown to suffer fundamental limitations (McAllester & Stratos, 2020), suggesting a need to investigate alternative estimators. Our estimator is neither a lower bound nor an upper bound, yet it can be optimized adversarially as in generative adversarial networks (GANs) (Goodfellow et al., 2014). We show for the first time that such adversarial optimization of mutual information is a viable option for learning meaningful representations.

Our proposed discrete structured MMI is novel and largely uncharted in the literature. An important contribution of this paper is charting practical considerations for this unfamiliar approach by developing a concrete realization based on a structured model over binary encodings. More specifically, the model encodes an observation into a zero-one vector of length m, resulting in 2^m possible encodings. We show how mutual information can be estimated efficiently by adversarial dynamic programming with controllable Markov assumptions. One critical and unexpected finding is that the expressiveness of the variational prior needs to be strictly greater than that of the model (i.e., it has to be higher-order Markovian).

To demonstrate the utility of our model in a real-world problem, we apply it to unsupervised document hashing (Dong et al., 2019; Shen et al., 2018; Chaidaroon et al., 2018). The task is to compress an article into a drastically smaller discrete encoding that preserves semantics. Our model outperforms current state-of-the-art baselines based on discrete VAEs (Kingma & Welling, 2014; Maddison et al., 2017; Jang et al., 2017) and VQ-VAEs (van den Oord et al., 2017) with Bernoulli priors. We additionally design a predictive version of document hashing in which the model is tasked with encoding a future article with the knowledge of a past article. We find that our model achieves favorable performance with highly compressed and interpretable representations.

2 Related Work

Many successful approaches to unsupervised representation learning are based on density estimation. For instance, it is now very common in natural language processing to make use of continuous representations that are learned in the process of modeling a conditional distribution, such as the conditional distribution of a word given a context window (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019). In this case the input pair is easily sampled from unlabeled data by masking an observed word. There is also much work that identifies representations, often continuous, with the latent variables in an unconditional density model of the data (Kingma & Welling, 2014; Rezende et al., 2014; Higgins et al., 2017).

Learning representations through density estimation, however, suffers from certain limitations. First, it may be unnecessary to fully model the density of noisy, raw data when we are only interested in learning representations. Second, many standard approaches to learning discrete-valued latent representations in the context of density estimation require the use of either biased gradient estimators (Bengio et al., 2013; van den Oord et al., 2017) or high-variance ones (Mnih & Gregor, 2014; Mnih & Rezende, 2016).

Maximal mutual information (MMI) is a refreshingly different approach to unsupervised representation learning in which we estimate a conditional distribution over latent representations by maximizing mutual information under these distributions. In contrast with density estimation, there is no issue of modeling noise since the model never estimates a distribution over raw signals (i.e., there is no decoder). The mutual information objective has been shown to produce state-of-the-art representations of images, speech, and text (Bachman et al., 2019; Oord et al., 2018).

The focus with MMI so far has been largely limited to learning continuous representations. Existing works on learning discrete representations with MMI only consider unstructured representations over a single categorical label (McAllester, 2018; Stratos, 2019) for computational reasons. Our main contribution is a tractable formulation for discrete structured MMI. This involves an adversarial objective reminiscent of GANs (Goodfellow et al., 2014) and radically different from existing MMI objectives based on variational lower bounds of mutual information (Poole et al., 2019). Beyond tractability, the choice of the objective can be theoretically motivated as avoiding statistical limitations of estimating lower bounds on mutual information (McAllester & Stratos, 2020).

3 Discrete Structured MMI

Let D denote an unknown but samplable joint distribution over raw signals (X, Y). We assume discrete signals for simplicity and relevance to our experimental setting (document hashing), but the formulation can be easily adapted to the continuous case. We introduce an encoder p_θ(z|x) that defines a conditional distribution over a discrete latent variable Z representing the encoding of X, and aim to maximize I(Z; Y): the mutual information between Z and Y under D and the encoder. By the data processing inequality, this objective is a lower bound on I(X; Y) and can be viewed as distilling the predictive information of X about Y into Z.

This formulation alone is meaningless since it admits the trivial solution of copying, Z = X. In order to achieve compression, there are various options. In the information bottleneck method (Tishby et al., 1999), we additionally regularize the information rate of the encoding by simultaneously minimizing I(X; Z). Here we advocate a more direct approach by giving an explicit budget on the entropy of Z.

Equivalently, we can simply maximize I(Z; Y) with a finite encoding space Z such that H(Z) ≤ log |Z|. Even with small |Z| the objective is intractable because it involves marginalization over the unknown distribution of raw signals. Using that I(Z; Y) = H(Z) - H(Z | Y), we introduce a variational model q_φ(z|y) and optimize

(1)

where the second term denotes the cross entropy between the true conditional distribution of Z given Y (induced by the encoder and the data) and the variational model q_φ. By the usual property of cross entropy, this term is an upper bound on H(Z | Y), hence the objective (1) is a lower bound on I(Z; Y).
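As a concrete sketch of the bound, writing X for the input, Y for the target, Z ~ p_θ(z|x) for the encoding, and q_φ(z|y) for the variational model (notation assumed here for illustration):

$$
I(Z;Y) \;=\; H(Z) - H(Z \mid Y) \;\geq\; H(Z) \;-\; \underbrace{\mathbb{E}_{(x,y)\sim D,\; z\sim p_\theta(\cdot\mid x)}\big[-\log q_\phi(z\mid y)\big]}_{\text{cross-entropy upper bound on } H(Z\mid Y)}
$$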

Unfortunately, the applicability of this variational lower bound is critically limited to settings in which the entropy of Z is tractable (McAllester, 2018; Stratos, 2019) or constant with respect to learnable parameters (Chen et al., 2016; Alemi et al., 2017). When |Z| is small, H(Z) can be easily estimated from samples as

(2)

But this explicit calculation is clearly infeasible for large |Z|. Furthermore, the log |Z| bottleneck on the entropy budget then implies that it is infeasible to achieve a large information rate using this naive formulation (e.g., even if we let |Z| be a trillion, the encoding can carry fewer than 40 bits of information).
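For reference, a standard plug-in estimate of this form (with the particular symbols assumed here) first estimates the marginal of Z from N samples and then computes its entropy exactly:

$$
\hat{p}(z) \;=\; \frac{1}{N}\sum_{n=1}^{N} p_\theta(z \mid x_n),
\qquad
\hat{H}(Z) \;=\; -\sum_{z \in \mathcal{Z}} \hat{p}(z)\,\log \hat{p}(z)
$$

which requires an explicit sum over all |Z| possible encodings.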

3.1 Tractable Formulation

To allow for large |Z|, we propose to make Z structured. The simplest example of a structured Z is a binary vector of length m, which yields |Z| = 2^m so that H(Z) can be as large as m bits. More generally, Z can be any structure whose size is exponential in some controllable integer m.

3.1.1 Tractable Cross Entropy

The first key ingredient in deriving a tractable formulation is the tractability of the cross entropy between the encoder p_θ(z|x) and the variational model q_φ(z|y).

Assumption 3.1.

The cross entropy between p_θ(z|x) and q_φ(z|y), estimated from iid samples of D, can be computed in time polynomial in m, where |Z| = 2^m.

There is a class of structured probabilistic models with standard conditional independence assumptions such that Assumption 3.1 holds. For instance, in the case of binary vectors Z = {0,1}^m, we may impose Markov assumptions and define (with the convention that conditioning on positions before the first bit is dropped)

$$
p_\theta(z \mid x) \;=\; \prod_{i=1}^{m} p_\theta\big(z_i \,\big|\, z_{i-\alpha}, \ldots, z_{i-1}, x\big),
\qquad
q_\phi(z \mid y) \;=\; \prod_{i=1}^{m} q_\phi\big(z_i \,\big|\, z_{i-\beta}, \ldots, z_{i-1}, y\big)
$$

where α and β are the Markov orders of p_θ and q_φ. It can be easily verified that the estimate of the cross entropy based on a single sample (x, y) ~ D is

$$
-\sum_{i=1}^{m} \;\sum_{z_{i-\beta}, \ldots, z_{i}} \mu_i\big(z_{i-\beta}, \ldots, z_{i}\big)\, \log q_\phi\big(z_i \,\big|\, z_{i-\beta}, \ldots, z_{i-1}, y\big)
$$

where μ_i(z_{i-β}, …, z_i) is the marginal probability of the length-(β+1) sequence ending at position i under the conditional distribution p_θ(· | x). These marginals can be computed by applying a variant of the forward algorithm (Rabiner, 1989) (see the supplementary material). We give a general algorithm that computes the cross entropy between any distributions over {0,1}^m with Markov orders α and β, in time linear in m and exponential only in the Markov orders, in Algorithm 1.

While we focus on the binary-vector choice Z = {0,1}^m for concreteness, we emphasize that similar structural assumptions can be made for other structures. For instance, the conditional entropy of a tree-structured Z can be computed using a variant of the inside algorithm (Hwa, 2000). We leave exploring other types of structure as future work.

3.1.2 Adversarial Optimization

Assumption 3.1 allows us to estimate the second term (cross entropy) of the objective (1), but it is still insufficient for estimating the objective since the first term (entropy) remains intractable. Assumption 3.1 only imposes conditional independence: conditioning on the input x, each bit of z interacts with only a bounded window of preceding bits under p_θ. This independence breaks for the unconditional distribution of Z obtained by marginalizing over x. Consequently the entropy term H(Z) does not decompose. Note that while the conditional entropy H(Z | X) remains tractable, we cannot use it as a meaningful approximation of H(Z) since the error is H(Z) - H(Z | X) = I(X; Z), which is zero iff Z is independent of X (i.e., a vacuous encoding).

Thus we propose to introduce an additional variational model q_ψ(z) to estimate the intractable marginal distribution of Z. We would like to make the resulting variational approximation a lower bound on H(Z) so that the objective remains a maximization over all models. Unfortunately, when the entropy is large (which is our setting) meaningful lower bounds are impossible (McAllester & Stratos, 2020). This motivates us to again consider the cross-entropy upper bound, with the following assumption.

Assumption 3.2.

The cross entropy between the distribution of Z under p_θ and the variational prior q_ψ(z), estimated from iid samples of D, can be computed in time polynomial in m, where |Z| = 2^m.

In the binary vector setting Z = {0,1}^m, we can define q_ψ(z) to be a Markov model of order β over the bits of z. Then Algorithm 1 can be used to estimate the cross entropy between p_θ(z|x) and q_ψ(z) in time linear in m.

  Input: an order-α conditional model p over {0,1}^m and an order-β model q over {0,1}^m
  Subroutine: Forward (Algorithm 3), which computes the marginal probability under p of every length-(β+1) bit sequence ending at each position
  Output: Cross entropy between p and q
  Runtime: linear in m, exponential only in the Markov orders
  Forward computation: run Forward on p to obtain the window marginals
  Marginals: for each position and each assignment of the window ending at that position, read off its marginal probability under p (windows are truncated at the start of the sequence)
  Cross entropy: accumulate, over positions and window assignments, the negative log-probability under q weighted by the corresponding marginal under p
Algorithm 1 CrossEntropy
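As an illustration of how this computation can be carried out in practice, here is a minimal sketch (not the released implementation) for the special case in which the encoder factorizes the bits independently given x (order 0) and the variational model is a first-order Markov chain over the bits; higher orders would require the forward-algorithm marginals of Algorithm 3.

```python
import torch

def cross_entropy_markov1(enc_probs, prior_init_logp, prior_trans_logp):
    """Estimate E_x E_{z ~ p(.|x)}[-log q(z)] for an encoder p(z|x) with
    conditionally independent bits and a first-order Markov model q(z).

    enc_probs:        (B, m) probabilities p(z_i = 1 | x) for a batch.
    prior_init_logp:  (2,)   log q(z_1).
    prior_trans_logp: (2, 2) log q(z_i | z_{i-1}), rows indexed by z_{i-1}.
    """
    # Per-bit marginals under p(.|x): index 0 is P(z_i = 0), index 1 is P(z_i = 1).
    mu = torch.stack([1.0 - enc_probs, enc_probs], dim=-1)          # (B, m, 2)
    # First bit: -sum_{z_1} p(z_1 | x) log q(z_1).
    ce = -(mu[:, 0, :] * prior_init_logp).sum(dim=-1)               # (B,)
    # Remaining bits: pairwise marginals are outer products because the bits
    # are conditionally independent given x under the encoder.
    pair = mu[:, :-1, :, None] * mu[:, 1:, None, :]                 # (B, m-1, 2, 2)
    ce = ce - (pair * prior_trans_logp).sum(dim=(1, 2, 3))          # (B,)
    return ce.mean()                                                # batch estimate
```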

This gives our final objective

(3)

which is tractable by Assumptions 3.1 and 3.2. Note that for any choice of the encoder parameters, exact optimization over q_φ and q_ψ recovers the original objective I(Z; Y). The objective can be interpreted as a simultaneously collaborative and adversarial game. The second term (cross-entropy minimization) encourages the encoder and q_φ to agree on the encoding of x. The first term (entropy maximization) encourages the encoder to diversify its prediction of z but also to use information from x to thwart the opponent q_ψ, which does not have access to x.
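Under the notation assumed above (encoder p_θ(z|x), variational conditional q_φ(z|y), variational prior q_ψ(z)), one way to write this objective is:

$$
\max_{\theta}\;\Big[\;\min_{\psi}\;\mathbb{E}_{x\sim D,\; z\sim p_\theta(\cdot\mid x)}\big[-\log q_\psi(z)\big]
\;-\;\min_{\phi}\;\mathbb{E}_{(x,y)\sim D,\; z\sim p_\theta(\cdot\mid x)}\big[-\log q_\phi(z\mid y)\big]\;\Big]
$$

where the first term is a cross-entropy surrogate for H(Z) and the second for H(Z | Y).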

A notable aspect of the objective is that it is neither an upper bound nor a lower bound on I(Z; Y); we cannot guarantee that its estimate from samples is larger or smaller than the true mutual information. While we lack such guarantees, theoretical and empirical evidence that this bypasses limitations of lower bounds on mutual information has been given by McAllester & Stratos (2020).

Inference

At test time, given input x we calculate the most probable encoding z* = argmax_z p_θ(z | x) and use z* as a discrete structured representation of x. In the current setting, in which p_θ(· | x) is a low-order Markov distribution over {0,1}^m, we can calculate z* in time linear in m using a variant of the Viterbi algorithm (Viterbi, 1967).
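For concreteness, a minimal sketch of this decoding for the first-order case (the score tensors are assumed to come from the encoder as a function of x; the order-0 case reduces to thresholding each bit independently):

```python
import numpy as np

def viterbi_binary(init_logp, trans_logp):
    """Most likely bit sequence under a first-order Markov distribution over {0,1}^m.

    init_logp:  (2,)        log-scores for z_1.
    trans_logp: (m-1, 2, 2) log-scores for z_{i+1} given z_i, indexed [i, z_i, z_{i+1}].
    """
    m = trans_logp.shape[0] + 1
    score = init_logp.copy()                  # best score of a prefix ending in each value
    back = np.zeros((m - 1, 2), dtype=int)    # backpointers
    for i in range(m - 1):
        cand = score[:, None] + trans_logp[i]  # indexed by (previous bit, current bit)
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    z = np.zeros(m, dtype=int)
    z[-1] = int(score.argmax())
    for i in range(m - 2, -1, -1):
        z[i] = back[i, z[i + 1]]
    return z
```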

3.2 Practical Issues

We give details of the proposed adversarial MMI training procedure in Algorithm 2. As input we assume model definitions p_θ, q_φ, and q_ψ that satisfy Assumptions 3.1 and 3.2, a samplable data distribution D, a gradient update function (we use Adam for all our experiments (Kingma & Ba, 2014)), and a validation task. The validation task evaluates the quality of the representation predicted by the encoder and is particularly needed since the running estimate of the adversarial objective (3) may not reflect actual progress. We delineate certain practical issues that are important in making Algorithm 2 effective.

  Input: Models p_θ, q_φ, q_ψ satisfying Assumptions 3.1 and 3.2; samplable D; gradient update function; validation task
  Hyperparameters: Initialization range, batch size, number of adversarial gradient steps, adversarial learning rate, learning rate, entropy weight
  repeat
     for each sampled batch do
        For the given number of adversarial steps: update q_ψ to minimize the cross-entropy estimate against the current encoder
        Take one gradient step on the encoder (and q_φ) to maximize the objective (3)
     end for
  until the validation task stops improving
Algorithm 2 AdversarialMMI
Expressive variational prior

We find that it is critical to make the variational prior q_ψ strictly more expressive than the posterior p_θ. That is, there exist distributions over Z that can be modeled by q_ψ but not by p_θ (conditioning on any input). In the context of Markov models over binary vectors, this means the Markov order of q_ψ is strictly greater than the Markov order of p_θ (Algorithm 1 allows any pair of orders). Recall that the bits of z are conditionally independent given x under p_θ but are not independent under the marginal distribution of Z (Section 3.1.2). Thus we must model the marginal using a distribution that is strictly more powerful than the conditional. The benefit of this explicit joint entropy maximization is suggested in later experiments, in which we show that our approach is more effective at learning representations than discrete VAEs or VQ-VAEs (which do not explicitly maximize entropy) as the number of bits becomes larger.

We also find it helpful to overparameterize q_ψ. We use a feedforward network with a tunable number of ReLU layers to define the distribution.

Aggressive inner-loop optimization

The adversarial objective (3) reduces to the non-adversarial objective (1) if the inner minimization over q_ψ is solved exactly. We find it important to mimic this by taking multiple gradient steps for q_ψ with a large learning rate before taking a gradient step for the encoder. Aggressive inner-loop optimization has been shown to be helpful in other contexts such as VAEs (He et al., 2019). Note that q_ψ is still carried across batches and not learned from scratch at every batch.
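A minimal sketch of one outer iteration of such a loop, assuming hypothetical `cross_entropy_prior` and `cross_entropy_cond` helpers for the two cross-entropy terms (the encoder optimizer is assumed to also cover q_φ, which is trained cooperatively):

```python
def ammi_outer_step(encoder, q_prior, q_cond, batch_x, batch_y,
                    opt_enc, opt_prior, k_inner=5, entropy_weight=1.0):
    """One outer step of adversarial MMI: fit the variational prior aggressively,
    then ascend the entropy-difference objective with respect to the encoder."""
    # Inner loop: minimize the cross entropy of the prior against the
    # (frozen) encoder distribution, taking several steps per outer step.
    for _ in range(k_inner):
        opt_prior.zero_grad()
        ce_prior = cross_entropy_prior(encoder(batch_x).detach(), q_prior)
        ce_prior.backward()
        opt_prior.step()

    # Outer step: maximize (entropy surrogate) - (conditional cross entropy).
    opt_enc.zero_grad()
    enc_out = encoder(batch_x)
    ce_prior = cross_entropy_prior(enc_out, q_prior)          # surrogate for H(Z)
    ce_cond = cross_entropy_cond(enc_out, q_cond, batch_y)    # surrogate for H(Z|Y)
    loss = -(entropy_weight * ce_prior - ce_cond)             # negate for gradient ascent
    loss.backward()
    opt_enc.step()
    return float(ce_prior), float(ce_cond)
```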

Entropy weight

Finally, we find it useful to introduce a tunable weight on the entropy term, akin to the weight on the KL divergence in β-VAEs (Higgins et al., 2017). Optimizing this weighted objective simply rescales the entropy term in (3). The weight can be used to determine a task-specific trade-off between predictiveness and diversity in the encodings.

4 Experiments

We now study empirical aspects of our proposed adversarial MMI approach (henceforth AMMI) with extensive experiments. We consider unsupervised document hashing (Chaidaroon et al., 2018) as the main testbed for evaluating the quality of learned representations. The task is to compress an article into a drastically smaller discrete encoding that preserves semantics, and it is typically formulated as an autoencoding problem. To study methods in a predictive setting, we also develop a variant of this task in which the representation of an article is learned to be predictive of the encoding of a related article.

4.1 Unsupervised Document Hashing

Let X be a random variable corresponding to a document. The goal is to learn a document encoder p_θ(z|x) that defines a conditional distribution over binary hashes z ∈ {0,1}^m. The quality of document encodings is evaluated by the average top-100 precision. Specifically, given a document at test time, we retrieve 100 nearest neighbors from the training set under the encoding, measured by the Hamming distance, and check how many of the neighbors have overlapping topic labels (thus we assume annotation only for evaluation).
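As a sketch of this evaluation (the array layout and the use of label sets are assumptions of this illustration, not a specification from the paper):

```python
import numpy as np

def top100_precision(query_codes, train_codes, query_labels, train_labels):
    """Average top-100 retrieval precision under Hamming distance.

    query_codes, train_codes: 0/1 arrays of shape (Nq, m) and (Nt, m).
    query_labels, train_labels: lists of sets of topic labels.
    """
    precisions = []
    for code, labels in zip(query_codes, query_labels):
        dists = (train_codes != code).sum(axis=1)         # Hamming distances
        nearest = np.argsort(dists, kind="stable")[:100]  # 100 nearest training docs
        hits = sum(bool(labels & train_labels[i]) for i in nearest)
        precisions.append(hits / 100.0)
    return float(np.mean(precisions))
```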

In the literature this is typically approached as an autoencoding problem in which a generative model of documents with latent hashes is estimated by maximizing the evidence lower bound (ELBO) on the marginal log likelihood of training documents

(4)

where the prior over hashes is fixed to a form suitable for the task. For example, the current state-of-the-art model (BMSH) defines the prior as a mixture of Bernoulli distributions (Dong et al., 2019). Here the encoder is treated as a variational model that estimates the intractable posterior under the generative model.
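In standard notation (the specific symbols are assumed here), the ELBO for a document x with latent hash z, encoder q(z|x), decoder p(x|z), and fixed prior p(z) is:

$$
\mathcal{L}(x) \;=\; \mathbb{E}_{q(z\mid x)}\big[\log p(x\mid z)\big] \;-\; \mathrm{KL}\big(q(z\mid x)\,\|\,p(z)\big) \;\le\; \log p(x)
$$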

In contrast, we propose to learn a document encoder p_θ(z|x) by the following adversarial formulation of the mutual information between the document X and its encoding Z:

(5)

where q_ψ(z) is a variational model that estimates the intractable prior (the marginal of Z) under the encoder. This can be seen as a single-variable variant of the more general objective (3) in which the target is the document itself and the variational conditional is tied with the encoder. Note the absence of a decoder (Section 2).
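Under the assumed notation, and using the fact that tying the conditional model to the encoder turns the second cross-entropy term into the tractable conditional entropy, this single-variable objective can be sketched as:

$$
\max_{\theta}\;\Big[\;\min_{\psi}\;\mathbb{E}_{x\sim D,\; z\sim p_\theta(\cdot\mid x)}\big[-\log q_\psi(z)\big]\;-\;H_\theta(Z\mid X)\;\Big]
$$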

4.1.1 Models

BMSH

We follow the standard setting of BMSH for all our models (Dong et al., 2019). The raw document representation is a high-dimensional TFIDF vector computed from preprocessed corpora (TMC, NG20, and Reuters) provided by Chaidaroon et al. (2018). We aim to learn an m-dimensional binary vector representation, where we vary the number of bits m over 16, 32, 64, and 128. All VAE-based models compute a continuous embedding by feeding the TFIDF vector through a feedforward layer (FF) and apply some discretization operation on the embedding to obtain the bits. BMSH computes Bernoulli probabilities from the embedding and samples the bits, from which the document is reconstructed.[1] The ELBO objective (4) is optimized by straight-through estimation (Bengio et al., 2013).[2]

[1] We refer to Dong et al. (2019) for details of the decoder and the Bernoulli mixture prior since they are not needed for AMMI.
[2] That is, the decoder receives the sampled bits so as to backpropagate gradients directly to the underlying probabilities.
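A minimal sketch of the straight-through trick used here (a generic version, not BMSH's exact implementation):

```python
import torch

def straight_through_sample(probs):
    """Sample hard 0/1 bits in the forward pass while letting gradients flow
    to `probs` as if the sampling were the identity (Bengio et al., 2013)."""
    hard = torch.bernoulli(probs)
    return hard + probs - probs.detach()  # forward: hard bits; backward: identity
```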

DVQ

We are interested in comparing AMMI to mainstream discrete representation learning techniques. Thus we additionally consider vector quantized VAEs (VQ-VAEs), which substitute sampling with a nearest-neighbor lookup and have been shown to be useful in many tasks (van den Oord et al., 2017; Razavi et al., 2019). In particular, we adopt the decomposed vector quantization (DVQ) proposed by Kaiser et al. (2018). The model learns multiple codebooks whose row indices correspond to blocks of bits in the hash. The encoder computes a continuous embedding, splits it into segments, and quantizes each segment against its codebook (implicitly yielding the discrete code), from which the decoder reconstructs the document. DVQ is trained by minimizing the reconstruction loss and the vector quantization loss. It can be seen as optimizing the ELBO objective (4) with a uniform prior over codes and a point-mass posterior.
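A sketch of the decomposed quantization step in the spirit of Kaiser et al. (2018) (the segment and codebook shapes are assumptions of this illustration):

```python
import torch

def decomposed_quantize(e, codebooks):
    """Split the encoder output into len(codebooks) segments and quantize each
    segment against its own codebook by nearest-neighbor lookup.

    e:         (B, D * d) continuous encoder outputs.
    codebooks: list of D tensors, each of shape (K, d).
    """
    segments = e.chunk(len(codebooks), dim=-1)
    quantized, indices = [], []
    for seg, cb in zip(segments, codebooks):
        idx = torch.cdist(seg, cb).argmin(dim=-1)   # nearest codebook row per example
        q = cb[idx]
        # Straight-through: copy gradients from the quantized output to the segment.
        quantized.append(seg + (q - seg).detach())
        indices.append(idx)
    return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```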

AMMI

Our model consists of an encoder p_θ(z|x) and a variational prior q_ψ(z), which are respectively parameterized by Markov models over {0,1}^m of orders α and β. For every bit position, the encoder computes the probability of the bit conditioned on the document and on each assignment of the preceding α bits. Similarly, the prior computes the probability of each bit conditioned on each assignment of the preceding β bits, using a learnable embedding dictionary. We can then use Algorithm 1 to calculate the two cross-entropy terms of the objective, where in practice we use a batch of samples to estimate these quantities. The algorithm can be batched efficiently.

4.1.2 Hyperparameter Tuning

All hyperparameters for AMMI are shown in Algorithm 2. We perform random grid search on the validation portion of each dataset and find effective ranges for each hyperparameter (initialization scale, batch size, number of adversarial steps, adversarial learning rate, learning rate, and entropy weight). We similarly perform random grid search on all hyperparameters of BMSH and DVQ. We use an NVIDIA Quadro RTX 6000 with 24GB memory.

The Markov orders α and β of the encoder and the prior are also controllable hyperparameters in AMMI. We find that setting α = 0 is sufficient for this task (i.e., bits are independent conditioning on the document). But the choice of β is crucial, as we show below.



Figure 1: Behavior of the model using different Markov orders β for the variational prior; the number of bits and the Markov order of the encoder are fixed. For each choice of β, all hyperparameters of the variational model are fully optimized. (a) The cross-entropy upper bound estimated on a fixed batch of samples. (b) The validation precision on Reuters. The gray line corresponds to the setting in which we calculate the entropy by brute force.

Data      TMC: 16b 32b 64b 128b      NG20: 16b 32b 64b 128b      Reuters: 16b 32b 64b 128b      Avg
BOW 50.86 9.22 57.62 39.23
LSH 43.93 45.14 45.53 47.73 5.97 6.66 7.70 9.49 32.15 38.62 46.67 51.94 31.79
S-RBM 51.08 51.66 51.90 51.37 6.04 5.33 6.23 6.42 57.40 61.54 61.77 64.52 39.61
SpH 60.55 62.81 61.43 58.91 32.00 37.09 31.96 27.16 63.40 65.13 62.90 60.45 51.98
STH 39.47 41.05 41.81 41.23 52.37 58.60 58.06 54.33 73.51 75.54 73.50 69.86 56.61
VDSH 68.53 71.08 44.10 58.47 39.04 43.27 17.31 5.22 71.65 77.53 74.56 73.18 53.66
NASH 65.73 69.21 65.48 59.98 51.08 56.71 50.71 46.64 76.24 79.93 78.12 75.59 64.62
GMSH 67.36 70.24 70.86 72.37 48.55 53.81 58.69 55.83 76.72 81.83 82.12 78.46 68.07
DVQ 71.47 73.27 75.17 76.24 47.23 54.45 58.77 62.10 79.57 83.43 83.73 86.27 70.98
BMSH 70.62 74.81 75.19 74.50 58.12 61.00 60.08 58.02 79.54 82.86 82.26 79.41 71.37
AMMI 71.17 73.67 75.05 76.24 55.49 59.58 63.80 65.74 82.62 83.39 85.18 86.16 73.17
BMMI 70.52 49.74 79.97
Table 1: Results on unsupervised document hashing. For each dataset we show test precisions of the top 100 retrieved documents using 16-, 32-, 64-, and 128-bit binary vector encodings. BOW does not depend on the number of bits, and BMMI is trained only in the 16-bit setting. See the main text for task and model descriptions.

4.1.3 Importance of the Markov Order of the Variational Prior

We first examine the feasibility of the variational approximation of the prior. To this end, we consider the small-bit setting in which we can explicitly enumerate all values of z to estimate the entropy using equation (2). In this case the objective becomes non-adversarial. We refer to this model as BMMI (brute-force MMI), which consists only of the encoder trained with the exact entropy.

Figure 1 shows two experiments that illustrate the importance of the Markov order of the variational prior q_ψ. First, we fix a BMMI encoder partially trained on Reuters and a random batch of samples, and calculate the empirical entropy by brute force. Then for each choice of β, we minimize the empirical cross entropy between the encoder distribution and q_ψ with full hyperparameter tuning. Figure 1(a) shows that a sufficiently high order β is needed to achieve a realistic estimate of the empirical entropy. The necessary value of β clearly depends on the encoder: a moderately high order already yields an accurate estimate for the partially trained BMMI used in this experiment.

Next, we examine the best achievable validation precision on Reuters across different values of β. We perform full hyperparameter tuning for BMMI (exact entropy) and for AMMI at each β. We see that the performance is poor for small β, but it quickly becomes competitive as β grows and even surpasses the performance of BMMI. We hypothesize that the adversarial formulation has beneficial regularization effects in addition to making the objective tractable for large encoding spaces; we leave a deeper investigation of this phenomenon as future work. We fix β to the value chosen here for all our experiments.

4.1.4 Results

Table 1 shows top-100 precisions on the test portion of each dataset (TMC, NG20, and Reuters) using 16, 32, 64, and 128 bits. Baselines include locality-sensitive hashing (LSH) (Datar et al., 2004), stacked restricted Boltzmann machines (S-RBM) (Hinton, 2012), spectral hashing (SpH) (Weiss et al., 2009), self-taught hashing (STH) (Zhang et al., 2010), variational deep semantic hashing (VDSH) (Chaidaroon et al., 2018), and neural architecture for semantic hashing (NASH) (Shen et al., 2018), as well as BMSH (Dong et al., 2019) and DVQ described in Section 4.1.1. The naive baseline BOW refers to the bag-of-words representation that indicates the presence of words in the document.[3]

[3] The vocabulary sizes differ across TMC, NG20, and Reuters.

We see that AMMI compares favorably to current state-of-the-art methods, yielding the best average precision across datasets and settings. In particular, the precision of AMMI is significantly higher than the best previous result given by BMSH when the number of bits is large. With 128 bits, AMMI achieves 76.24 vs 74.50, 65.74 vs 58.02, and 86.16 vs 79.41 on TMC, NG20, and Reuters. We hypothesize that this is partly due to the explicit entropy maximization in AMMI that considers all bits jointly via dynamic programming. While this is implicitly enforced by the mixture prior in BMSH, direct entropy maximization appears to be more effective.

In the 16-bit case, we also report precisions of BMMI, which estimates the entropy exactly by brute force (it is computationally intractable to train BMMI with more than 16 bits). We see that AMMI is again able to achieve better results, potentially due to regularization effects. Finally, we observe that the newly proposed DVQ baseline is quite competitive with BMSH and also achieves higher precision when the number of bits is large; we suspect that the decomposed encoding allows the model to make better use of multiple codebooks, as reported by Kaiser et al. (2018).


Distance Document
0 O.J. Simpson lashed out at the family of the late Ronald Goldman, a day after they won the rights to Simpson’s canceled "If I Did It" book about the slayings of Goldman
1 News Corp. on Monday announced that it will cancel the release of a new book by former American football star O.J. Simpson and a related exclusive television interview
5 Phil Spector’s lawyers have asked the judge to tell jurors they must find the record producer either guilty or not guilty of murder with no option to find lesser offenses
10 Sen. Ted Stevens’ defense lawyer bore in on the prosecution’s chief witness on Tuesday, portraying him to a jury as someone who betrayed a longtime friend to protect his fortune.
20 Words that cannot be said on American television are not often uttered at the U.S. Supreme Court, at least not by high-priced lawyers and the justices themselves.
50 Cols 1-6: Sending a strong message that the faltering economy will be his top focus, President-elect Barack Obama on Friday urged Congress to pass an economic stimulus package
90 President Hu Jintao’s upcoming visits to Latin America and Greece would boost bilateral relations and deepen cooperation
0 Ukrainian President Leonid Kuchma had a meeting on Monday evening with Polish President Alexander Kwasniewski and Lithuanian President Valdas Adamkus
1 Radical Ukrainian opposition figure Yulia Timoshenko Wednesday ventured into the hostile eastern mining bastion of Prime Minister Viktor Yanukovich
5 Ukrainian President Viktor Yushchenko was forced into an emergency landing Thursday and seized the aircraft of his bitter political foe, Prime Minister Yulia Tymoshenko
10 On a clearing in this disputed city, where enemy homes were bulldozed after the conflict in August, Mayor Yuri M. Luzhkov promised this month to build a new neighborhood
20 Barack Obama is the "American Gorbachev" who will ultimately destroy the United States, militant Russian nationalist Vladimir Zhirinovsky said Tuesday.
50 Ministers from Pacific Rim nations warned Thursday that imposing trade barriers in reaction to the global economic downturn would only deepen the crisis.
90 We shall move the following graphics: US IRAQ QAEDA Graphic with portraits of Osama bin Laden and Colin Powell, examining US claims that the latest bin Laden tape reinforces
0 NASCAR has a new rivalry: Carl Edwards vs. Kyle Busch. Edwards called the latest installment payback and Busch promised that retribution will come down the road.
1 Penske Racing teammates Ryan Briscoe and Helio Castroneves filled the front-row Friday for Edmonton Indy, repeating their 1-2 finish in last week’s IndyCar race
5 Brazilian race-car driver Helio Castroneves upset Spice Girl Melanie Brown to capture the fifth "Dancing With the Stars" mirrorball trophy in the television dance competition.
10 Audi overcame the challenge of two Peugeot cars and wet racing conditions Sunday to win the 24 Hours of Le Mans for the fourth straight year.
20 The president of cycling’s governing body Monday insists the doping problems in his sport do not threaten its place in the Olympics.
50 Robert Pattinson, who stars as the vampire heartthrob Edward Cullen in the forthcoming movie "Twilight," stepped onto a riser at the King of Prussia Mall
90 Australian Prime Minister John Howard reshuffled his cabinet Tuesday, appointing Education Minister Brendan Nelson to the defence portfolio.
0 How did that Van Halen song go? “I found the simple life ain’t so simple … ” The iconic Los Angeles metal band is set to be inducted Monday night into
8 The boys from Van Halen, most notably mercurial guitarist Eddie Van Halen, showed up as promised at a news conference Monday to announce their fall tour with original singer
20 The PG-13-rated thriller gave 20th Century Fox its first No. 1 launch in seven months. The opening-night crowd was heavily male and young, matching the video-game
50 Due to Wednesday night’s victory, a mathematician and avid Vasco soccer fan calculated on Thursday that the team’s chances of being dropped into the second division fell by
90 A top Iranian minister who admitted to faking his university degree will face a motion of no confidence on Tuesday on charges that he tried to bribe members of Parliament
Table 2: Qualitative analysis of AMMI document representations learned by predictive document hashing. For each considered document (shown at distance 0), we show documents at increasing Hamming distances under the learned representations to examine their semantic drift.

Model Dim # Distinct Codes Precision
BOW 20000 208808 26.66
BMSH 128 208004 75.77
DVQ 128 208655 76.80
AMMI 128 153123 79.14
Table 3: Results on predictive document hashing. For each model we show the representation dimension, number of distinct codes (i.e., clusters) induced on 208808 training documents, and top-100 precision on the test set.

4.2 Predictive Document Hashing

Unsupervised document hashing only considers a single variable and does not test the conditional formulation in which the encoding of one variable must be predicted from another. Hence we introduce a new task, predictive document hashing, in which X and Y represent distinct articles that discuss the same event. It is clear that I(X; Y) is large: the large uncertainty of a random article is dramatically reduced given a related article.

We construct such article pairs from the Who-Did-What dataset (Onishi et al., 2016). We remove all overlapping articles so that each article appears only once in the entire training/validation/test data containing 104404/8928/7326 document pairs. We give more details of the dataset in the supplementary material. Similarly as before, we use 20000-dimensional TFIDF representations as raw input and consider the task of compressing them into 128 binary bits. The quality of the binary encodings is measured by top-100 matching precision: given a test article, we check whether the correct corresponding article is included in its 100 nearest neighbors under the encoding, based on the Hamming distance.

4.2.1 Models

AMMI now consists of a pair of encoders p_θ(z|x) and q_φ(z|y) as well as a variational prior q_ψ(z), which are respectively parameterized by Markov models over {0,1}^m of controllable orders. Based on our findings in the previous section, we use order 0 for the encoders and a strictly higher order for the prior. We train the model with Algorithm 2.

To compare with VAEs, we consider a conditional variant that optimizes the ELBO objective for the conditional likelihood of one article given the other, with an approximation of the true posterior that can be used as a document encoder. In this setting, training for BMSH remains unchanged except that the model reconstructs the related article instead of the input article and uses a conditional prior in the KL regularization term. The DVQ model likewise simply predicts the related article instead of the input but loses its ELBO interpretation. We tune hyperparameters of all models as before.

4.2.2 Results and Qualitative Analysis

Table 3 shows top-100 precisions on the test portion. We see that AMMI again achieves the best performance in comparison to BMSH and DVQ. We also report the number of distinct encodings induced on the 208808 training articles (the union of the paired articles). We see that AMMI learns the most compact clustering, which nonetheless generalizes best.

We conduct qualitative analysis of the document encodings by examining articles with increasing Hamming distance (i.e., semantic drift). Table 2 shows illustrative examples. The article about the O.J. Simpson trial drifts to the Phil Spector trial, the Ted Stevens trial, and eventually other unrelated subjects in politics and economy. The article about NASCAR drifts to other racing reports, cycling, movie stars and politics.

5 Conclusions

We have presented AMMI, an approach to learning discrete structured representations by adversarially maximizing mutual information. It obviates the intractability of entropy estimation by making mild structural assumptions that apply to a wide class of models and optimizing the difference of cross-entropy upper bounds. We have derived a concrete instance of the approach based on Markov models and identified important practical issues such as the expressiveness of the variational prior. We have demonstrated its utility on unsupervised document hashing by outperforming current best results. We have also proposed the predictive document hashing task and showed that AMMI yields high-quality semantic representations. Future work includes extending AMMI to other structured models and extending it to cases in which even cross entropy calculation is intractable.

References

  • Alemi et al. (2017) Alemi, A., Fischer, I., Dillon, J., and Murphy, K. Deep variational information bottleneck. In ICLR, 2017. URL https://arxiv.org/abs/1612.00410.
  • Bachman et al. (2019) Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519, 2019.
  • Belghazi et al. (2018) Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531–540, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Bell & Sejnowski (1995) Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Brown et al. (1992) Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
  • Chaidaroon et al. (2018) Chaidaroon, S., Ebesu, T., and Fang, Y. Deep semantic text hashing with weak supervision. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1109–1112, 2018.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
  • Datar et al. (2004) Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262, 2004.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
  • Dong et al. (2019) Dong, W., Su, Q., Shen, D., and Chen, C. Document hashing with mixture-prior generative models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5229–5238, 2019.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • He et al. (2019) He, J., Spokoyny, D., Neubig, G., and Berg-Kirkpatrick, T. Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations (ICLR), New Orleans, LA, USA, May 2019. URL https://openreview.net/pdf?id=rylDfnCqF7.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • Hinton (2012) Hinton, G. E. A practical guide to training restricted boltzmann machines. In Neural networks: Tricks of the trade, pp. 599–619. Springer, 2012.
  • Hjelm et al. (2019) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bklr3j0cKX.
  • Hwa (2000) Hwa, R. Sample selection for statistical grammar induction. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, pp. 45–52. Association for Computational Linguistics, 2000.
  • Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. In Proceedings of International Conference on Learning Representations (ICLR), 2017. URL https://arxiv.org/abs/1611.01144.
  • Kaiser et al. (2018) Kaiser, L., Bengio, S., Roy, A., Vaswani, A., Parmar, N., Uszkoreit, J., and Shazeer, N. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pp. 2390–2399, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Proceedings of International Conference on Learning Representations (ICLR), 2014. URL https://arxiv.org/abs/1312.6114.
  • Kudo & Richardson (2018) Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, 2018.
  • Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of International Conference on Learning Representations (ICLR), 2017. URL https://arxiv.org/abs/1611.00712.
  • McAllester (2018) McAllester, D. Information theoretic co-training. arXiv preprint arXiv:1802.07572, 2018.
  • McAllester & Stratos (2020) McAllester, D. and Stratos, K. Formal limitations on the measurement of mutual information. In Artificial Intelligence and Statistics, 2020.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML, 2014.
  • Mnih & Rezende (2016) Mnih, A. and Rezende, D. J. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML, 2016.
  • Onishi et al. (2016) Onishi, T., Wang, H., Bansal, M., Gimpel, K., and McAllester, D. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2230–2235, 2016.
  • Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, 2018.
  • Poole et al. (2019) Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180, 2019.
  • Rabiner (1989) Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • Razavi et al. (2019) Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, pp. 14837–14847, 2019.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 1278–1286. PMLR, 2014.
  • Shen et al. (2018) Shen, D., Su, Q., Chapfuwa, P., Wang, W., Wang, G., Henao, R., and Carin, L. Nash: Toward end-to-end neural architecture for generative semantic hashing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2041–2050, 2018.
  • Stratos (2019) Stratos, K. Mutual information maximization for simple and accurate part-of-speech induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019.
  • Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
  • van den Oord et al. (2017) van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
  • Viterbi (1967) Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 13(2):260–269, 1967.
  • Weiss et al. (2009) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Advances in neural information processing systems, pp. 1753–1760, 2009.
  • Zhang et al. (2010) Zhang, D., Wang, J., Cai, D., and Lu, J. Self-taught hashing for fast similarity search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 18–25, 2010.


Appendix A Forward Algorithm

The forward algorithm is shown in Algorithm 3. We write [m] to denote the set of integers {1, …, m}, and we use an indicator notation that equals 1 if its argument is true and 0 otherwise.

  Input: an order-α conditional model p over {0,1}^m (conditioned on the given input)
  Output: For every position i, the marginal probability under p of each length-(β+1) bit sequence ending at position i
  Runtime: linear in m, exponential only in the Markov orders
  Base: initialize the marginal of the first window directly from p
  Main: For each subsequent position, extend each window marginal by one bit using p and sum out the oldest bit
Algorithm 3 Forward

Appendix B Dataset Construction for Predictive Document Hashing

We take pairs from the Who-Did-What dataset (Onishi et al., 2016). The pairs in this dataset were constructed by drawing articles from the LDC Gigaword newswire corpus. A first article is drawn at random and then a list of candidate second articles is drawn using the first sentence of the first article as an information retrieval query. A second article is selected from the candidates using criteria described in Onishi et al. (2016), the most significant of which is that the second article must have occurred within a two week time interval of the first. We filtered article pairs so that each article is distinct in all data. The resulting dataset has 104404, 8928, and 7326 article pairs for training, validation, and evaluation.

To follow the standard setting in unsupervised document hashing, we represent each article as a TFIDF vector using a vocabulary of size 20000. We use SentencePiece (Kudo & Richardson, 2018) to learn an optimal text tokenization for the target vocabulary size.
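A minimal sketch of this preprocessing (the file names and the use of scikit-learn's TfidfVectorizer are assumptions of this illustration, not details from the paper):

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer

# Train a SentencePiece model with the target vocabulary size (hypothetical paths).
spm.SentencePieceTrainer.train(
    input="articles.txt", model_prefix="wdw_sp", vocab_size=20000)
sp = spm.SentencePieceProcessor(model_file="wdw_sp.model")

# Represent each article as a TFIDF vector over the learned subword vocabulary.
with open("articles.txt") as f:
    articles = [line.strip() for line in f]
vectorizer = TfidfVectorizer(
    tokenizer=lambda text: sp.encode(text, out_type=str),
    lowercase=False, token_pattern=None)
tfidf = vectorizer.fit_transform(articles)   # sparse matrix, one row per article
```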