One of the most common and important tasks in machine learning is building generative models that can capture and learn a wide variety of data distributions. Recent developments in generative modeling concentrate around two major areas of research: variational autoencoders (VAEs), which aim at capturing latent representations of data while keeping them restricted to a known distribution (e.g., the normal distribution), and generative adversarial networks (GANs) [3, 8], grounded in game theory, with a strong emphasis on creating realistic samples from underlying distributions.
These kinds of models are not only known for generating data from the distribution represented by the training examples but are also used to train informative and discriminative feature embeddings. Such embeddings can be obtained either with unsupervised data alone, by exploiting the discriminative properties the GAN's discriminator acquires during adversarial training [8, 12], or by using some subset of labeled data and incorporating semi-supervised mechanisms while training the generative model [9, 13].
In this work we concentrate on obtaining better feature representations for image data using semi-supervised learning with a model based on Bidirectional Generative Adversarial Networks (BiGANs) / Adversarially Learned Inference (ALI) [1, 2]. In order to incorporate semi-supervised data into the training procedure, we propose to enrich the primary training objective with an additional triplet loss term that operates on the labeled examples.
Our approach is inspired by the work [13], where a triplet loss was used to increase the quality of the feature representation in the discriminator. Contrary to that approach, we make use of an additional model in the BiGAN architecture, the encoder, and aim at increasing the quality of the feature representation in the coding space, which is further used by the generator to create artificial samples. Practically, this means that the feature representation can be used not only for classification and retrieval purposes but also for generating artificial images similar to existing ones.
The contribution of the paper is twofold. We introduce a new GAN training procedure for learning latent representations that extends the models presented in [1, 2] and is inspired by [13] for semi-supervised learning. We show that Triplet BiGAN results in superior scores in classification and image retrieval tasks.
This work is organized as follows. In Sec. 2 we present the basic concepts related to GAN models and triplet learning. In Sec. 3 we describe our approach, the Triplet BiGAN model. In Sec. 4 we provide the results obtained by Triplet BiGAN on two challenging tasks: image classification and image retrieval. This work is summarized in Sec. 5.
2 Related works
2.1 Generative Adversarial Networks
Since their inception, Generative Adversarial Networks (GANs) [3] have become one of the most popular models in the field of generative computer vision. Their main advantages come from their straightforward architecture and their ability to produce state-of-the-art results. Studies performed in recent years propose many performance, stability and usage improvements over the original version, with Deep Convolutional GAN (DCGAN) [8] and Improved GAN [9] being used most often as architectural baselines in pure image generation tasks.
The main idea of GANs is based on game theory and assumes the training of two competing networks: a generator $G$ and a discriminator $D$. The goal of GANs is to train the generator $G$ to sample from the data distribution $p_{data}(\mathbf{x})$ by transforming a vector of noise $\mathbf{z}$ to the data space. The discriminator $D$ is trained to distinguish the samples generated by $G$ from the samples drawn from $p_{data}(\mathbf{x})$. The training problem is formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))], \quad (1)$$

where $p_{data}(\mathbf{x})$ is the true data distribution and $p_{\mathbf{z}}(\mathbf{z})$ is the prior distribution over the noise vectors that are transformed to the data space.

The model is usually trained with gradient-based approaches by taking a minibatch of fake images, generated by transforming random vectors sampled from $p_{\mathbf{z}}(\mathbf{z})$ via the generator $G$, and a minibatch of data samples drawn from $p_{data}(\mathbf{x})$. These are used to maximize $V(D, G)$ with respect to the parameters of $D$ while keeping $G$ constant, and then to minimize $V(D, G)$ with respect to the parameters of $G$ while keeping $D$ constant.
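The value function in eq. (1) can be estimated on minibatches. The following sketch uses a toy logistic discriminator and a linear generator (both are illustrative stand-ins, not the architectures used in this paper) to compute the Monte Carlo estimate that the alternating updates would maximize for $D$ and minimize for $G$:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    """Toy logistic discriminator: D(x) = sigmoid(w . x)."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def generator(z, a):
    """Toy linear generator mapping noise z to the data space."""
    return z @ a

def gan_value(x_real, z, w, a):
    """Minibatch estimate of V(D, G) =
    E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = discriminator(x_real, w)
    d_fake = discriminator(generator(z, a), w)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# minibatch of real samples and noise vectors
x_real = rng.normal(loc=2.0, size=(64, 3))
z = rng.normal(size=(64, 2))
w = rng.normal(size=3)   # discriminator parameters
a = rng.normal(size=(2, 3))  # generator parameters

v = gan_value(x_real, z, w, a)  # D maximizes this, G minimizes it
```

Since both expectations are over logarithms of values in (0, 1), the estimate is always negative; the equilibrium of the game is reached when $D$ outputs 1/2 everywhere.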
2.2 Bidirectional Generative Adversarial Networks
The BiGAN model, presented in [1, 2], extends the original GAN model with an additional encoder module $E$ that maps examples from the data space $\mathbf{x}$ to the latent space $\mathbf{z}$. By incorporating the encoder into the GAN architecture, we can code examples in the same space that is used as a seed for generating artificial samples.
The objective function used to train the BiGAN model can be defined in the following manner:

$$V(D, E, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log D(\mathbf{x}, E(\mathbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log(1 - D(G(\mathbf{z}), \mathbf{z}))]. \quad (2)$$

The adversarial paradigm applied to train the BiGAN model is analogous to that of the GAN model. The goal of training is to solve the min-max problem stated below:

$$\min_{G, E} \max_D V(D, E, G). \quad (3)$$

In practice, the model is trained with an alternating procedure, where the parameters of the discriminator $D$ are updated by optimizing the following loss function:

$$L_D = -\mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log D(\mathbf{x}, E(\mathbf{x}))] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log(1 - D(G(\mathbf{z}), \mathbf{z}))], \quad (4)$$

and the parameters of the generator $G$ and the encoder $E$ are jointly trained by optimizing the following loss:

$$L_{EG} = -\mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log(1 - D(\mathbf{x}, E(\mathbf{x})))] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log D(G(\mathbf{z}), \mathbf{z})]. \quad (5)$$

Since, as the authors show in [1, 2], the optimal encoder learns to invert the examples from the true data distribution, the same loss can be applied to both the encoder and the generator parameters.
Experiments show that the encoder, despite learning in a purely unsupervised way, is able to embed meaningful features, which later show during reconstruction. The inclusion of the additional module raises a question about the quality of this feature representation for classification and image retrieval tasks. The approach of combining objectives seems promising, as the encoder module is explicitly trained for feature embedding, as opposed to the discriminator, whose main task is to categorize samples into real and fake.
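The two BiGAN losses above can be sketched on a minibatch. In the toy example below, linear maps stand in for the encoder and generator networks and a logistic model over the concatenated (image, code) pair stands in for the joint discriminator; these are illustrative assumptions, not the paper's architectures:

```python
import numpy as np

rng = np.random.default_rng(1)

def joint_discriminator(x, z, w):
    """Toy logistic discriminator on concatenated (x, z) pairs."""
    xz = np.concatenate([x, z], axis=1)
    return 1.0 / (1.0 + np.exp(-xz @ w))

def bigan_losses(x, z, encode, generate, w):
    """L_D pushes D(x, E(x)) -> 1 and D(G(z), z) -> 0; L_EG is the
    opposing criterion minimized jointly by the encoder and generator."""
    d_real = joint_discriminator(x, encode(x), w)
    d_fake = joint_discriminator(generate(z), z, w)
    loss_d = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))
    loss_eg = -np.mean(np.log(1 - d_real)) - np.mean(np.log(d_fake))
    return loss_d, loss_eg

# toy linear encoder/generator (stand-ins for the neural networks)
A = rng.normal(size=(4, 2))
encode = lambda x: x @ A       # E: data space -> latent space
generate = lambda z: z @ A.T   # G: latent space -> data space

x = rng.normal(size=(32, 4))   # minibatch from the data distribution
z = rng.normal(size=(32, 2))   # minibatch from the noise prior
w = rng.normal(size=6)

loss_d, loss_eg = bigan_losses(x, z, encode, generate, w)
```

Note that the discriminator never sees an image or a code alone: it always judges the joint pair, which is what forces $E$ and $G$ to approximately invert each other.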
2.3 Triplet Networks
Triplet networks [4] consist of three instances of the same neural network that share parameters. During training, the triplet model receives three examples from the training data: the reference sample $\mathbf{x}$, the positive sample $\mathbf{x}^{+}$ (a sample that is in some way similar to the reference sample, e.g. it belongs to the same class) and the negative sample $\mathbf{x}^{-}$ (one that is dissimilar to the reference sample). The goal is to train the triplet network in such a way that the distance from the encoded query example to the encoded negative example is greater than the distance from the encoded query example to the encoded positive example. In the general case, these distances are computed as the L2 norm between feature vectors, i.e. $d_{+} = \|f(\mathbf{x}) - f(\mathbf{x}^{+})\|_2$ and $d_{-} = \|f(\mathbf{x}) - f(\mathbf{x}^{-})\|_2$.
During training, the triplet model makes use of the probability that the distance from the query example to the negative example is greater than its distance to the positive one, which can be defined in the following way:

$$p(d_{+} < d_{-}) = \frac{\exp(d_{-})}{\exp(d_{+}) + \exp(d_{-})}. \quad (6)$$

We formulate the objective function for a single triplet $(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})$ in the following manner:

$$L_T(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-}) = -\log p(d_{+} < d_{-}). \quad (7)$$
The parameters of the model are updated with a gradient-based approach that optimizes the objective function over minibatches of triplets selected from the data. Usually, triplet selection is performed randomly (assuming that $\mathbf{x}$ and $\mathbf{x}^{+}$ are closer than $\mathbf{x}$ and $\mathbf{x}^{-}$), but there are other approaches that speed up the training process. The most popular is to construct the training triplets by taking into consideration the hardest negative samples $\mathbf{x}^{-}$, i.e. those closest to the currently selected reference sample $\mathbf{x}$.
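The softmax-of-distances loss in eqs. (6)-(7) and hard-negative mining can be sketched directly on feature vectors (the embedding function $f$ is assumed to have already been applied):

```python
import numpy as np

def triplet_loss(f_ref, f_pos, f_neg):
    """Eq. (6)-(7): with d+ = ||f(x) - f(x+)||_2 and
    d- = ||f(x) - f(x-)||_2, minimize -log p(d+ < d-) where
    p = exp(d-) / (exp(d+) + exp(d-))."""
    d_pos = np.linalg.norm(f_ref - f_pos)
    d_neg = np.linalg.norm(f_ref - f_neg)
    # log-sum-exp form for numerical stability
    m = max(d_pos, d_neg)
    log_p = d_neg - m - np.log(np.exp(d_pos - m) + np.exp(d_neg - m))
    return -log_p

def hardest_negative(f_ref, f_candidates):
    """Hard mining: pick the negative closest to the reference."""
    dists = np.linalg.norm(f_candidates - f_ref, axis=1)
    return f_candidates[np.argmin(dists)]

f_ref = np.array([0.0, 0.0])
f_pos = np.array([0.1, 0.0])   # close to the reference
f_neg = np.array([3.0, 3.0])   # far from the reference

good = triplet_loss(f_ref, f_pos, f_neg)  # well-separated triplet
bad = triplet_loss(f_ref, f_neg, f_pos)   # roles swapped
```

A well-separated triplet yields a loss near zero, while a triplet with the positive and negative roles swapped is heavily penalized, which is exactly the gradient signal that pulls same-class codes together.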
3 Triplet BiGANs
In this work we introduce the Triplet BiGAN model, which combines the benefits of BiGAN, in terms of learning interesting representations in latent space, with the superior power of the triplet model, which trains well using supervised data. The core idea of our approach is to incorporate the encoder model of BiGAN to act as a triplet network on the labeled part of the training data (see fig. 1).
In terms of training the Triplet BiGAN, we simply modify the $L_{EG}$ criterion (see eq. (5)) by incorporating an additional triplet term:

$$L_{EG,T} = L_{EG} + \lambda \, \mathbb{E}_{(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-}) \sim p_T}[L_T(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})], \quad (8)$$

where $L_T$ is the triplet loss defined by eq. (7), with the encoder $E$ used as the embedding function $f$; $\lambda$ is a hyperparameter that represents the impact of the triplet loss on the global criterion; and $p_T$ is the distribution that generates triplets $(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})$ in which $\mathbf{x}$ and $\mathbf{x}^{+}$ are from the same class and $\mathbf{x}$ and $\mathbf{x}^{-}$ are from different classes.
The Triplet BiGAN model is dedicated to solving semi-supervised problems, where only some portion of labeled data is available. In practice, we do not have access to $p_T$; therefore we sample the triplets from the available labeled portion of the dataset.
The training procedure for the model is described in alg. 1. We assume that $E$, $G$ and $D$ are neural networks described by the parameters $\theta_E$, $\theta_G$ and $\theta_D$, respectively. For the training procedure we assume access to unsupervised data and some portion of supervised data. In each training iteration, we randomly sample a noise vector $\mathbf{z}$ from the normal distribution and pass it through the generator to obtain the fake sample $G(\mathbf{z})$. We select a sample $\mathbf{x}$ from the unlabeled data and a triplet $(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})$ from the labeled data. Using the encoder, we obtain the coding vector $E(\mathbf{x})$ corresponding to the sample $\mathbf{x}$. Next, we update the parameters of the discriminator by optimizing the criterion $L_D$. During the same iteration, we update the parameters of the generator and the encoder by optimizing $L_{EG,T}$. The procedure is repeated until convergence.
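The data flow of one such iteration can be sketched as follows. The encoder, generator and discriminator below are trivial stand-ins (slicing, duplication and a logistic over sums) so the sketch stays self-contained; parameter updates via ADAM are elided, and the weight `lam` is a hypothetical value, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.1  # hypothetical weight of the triplet term

def encode(x):    # stand-in encoder E: data -> latent
    return x[:, :2]

def generate(z):  # stand-in generator G: latent -> data
    return np.concatenate([z, z], axis=1)

def discriminate(x, z):  # stand-in joint discriminator D(x, z)
    return 1.0 / (1.0 + np.exp(-(x.sum(axis=1) + z.sum(axis=1))))

def triplet_term(e_ref, e_pos, e_neg):
    """Mean of -log p(d+ < d-) over the triplet minibatch,
    written as log(1 + exp(d+ - d-)) for stability."""
    d_pos = np.linalg.norm(e_ref - e_pos, axis=1)
    d_neg = np.linalg.norm(e_ref - e_neg, axis=1)
    return np.mean(np.logaddexp(0.0, d_pos - d_neg))

# one training iteration (parameter updates elided)
z = rng.normal(size=(16, 2))                 # noise minibatch
x_fake = generate(z)                         # fake samples G(z)
x = rng.normal(size=(16, 4))                 # unlabeled minibatch
x_ref, x_pos, x_neg = (rng.normal(size=(8, 4)) for _ in range(3))  # labeled triplets

d_real = discriminate(x, encode(x))
d_fake = discriminate(x_fake, z)
loss_d = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))
loss_eg = -np.mean(np.log(1 - d_real)) - np.mean(np.log(d_fake))
loss_eg += lam * triplet_term(encode(x_ref), encode(x_pos), encode(x_neg))
# loss_d and loss_eg would now drive alternating gradient steps
```

Note that the triplet term touches only the encoder path, which is how the labeled examples shape the coding space without changing the adversarial game itself.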
In the practical implementation, we make use of stochastic gradient optimization techniques and perform gradient updates using the ADAM method. We also initialize the parameters of Triplet BiGAN by training a simple BiGAN (without the triplet term, i.e. with $\lambda = 0$) for a given number of epochs.
The motivation behind this approach is to increase the discriminative capabilities of the codes obtained from the latent space of the BiGAN model using some portion of labeled examples involved in triplet training. As a result, we obtain an encoding model $E$ that is not only capable of coding the data examples for further reconstruction but can also be used as a good-quality feature embedding for tasks like image classification or retrieval.
4 Experiments

The goal of the experiments is to evaluate the discriminative properties of the encoder in two challenging tasks: image retrieval and classification. We compare the results with two reference approaches: a triplet network trained only with supervised data and a simple BiGAN model, where the latent representation of the encoder is used for evaluation.
The models were trained on two datasets: Street View House Numbers (SVHN) and CIFAR10. In each dataset, the last 50 examples of each class were used as a validation set and were not used for training the models. During training, only a selected portion of the training set had assigned labels. The next subsection presents results obtained when using only 100, 200, 300, 400 or 500 labeled examples per class. For testing purposes, we trained the classifier only on the images from the training split that were given a label for triplet training.
Retrieval was evaluated with accuracy and mean average precision (mAP). For classification, a 9-nearest-neighbors classifier was used, with neighbor votes weighted by distance-based importance. Mean average precision was calculated over the full length of the encoded data. Cluster visualization was performed by applying t-SNE [7] with the Euclidean metric, perplexity 30 and the Barnes-Hut approximation for 1000 iterations.
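The distance-weighted nearest-neighbor vote used for classification can be sketched as follows (a toy two-class embedded training set stands in for the encoder features; the helper name and data are illustrative):

```python
import numpy as np

def knn_predict(query, feats, labels, k=9):
    """k-NN classification in feature space with inverse-distance
    vote weights, so closer neighbors count more."""
    dists = np.linalg.norm(feats - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] + 1e-12)  # avoid division by zero
    votes = {}
    for lab, w in zip(labels[idx], weights):
        votes[lab] = votes.get(lab, 0.0) + w
    return max(votes, key=votes.get)

# toy embedded training set: two well-separated clusters
rng = np.random.default_rng(3)
feats = np.concatenate([rng.normal(0.0, 0.1, size=(20, 2)),
                        rng.normal(5.0, 0.1, size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)

pred = knn_predict(np.array([5.0, 5.0]), feats, labels)
```

A query near the second cluster is assigned its label; with well-separated encoder features, all 9 neighbors typically agree, and the distance weighting only matters near class boundaries.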
The architectures of discriminator, encoder, and generator were as presented in .
The encoder network is a 7-layer convolutional neural network that learns the mapping from the image space to the feature space. After each convolutional layer (excluding the last), batch normalization is performed, and the output is passed through a leaky ReLU activation function. After the penultimate convolutional block (i.e. a convolutional layer with normalization and activation function), a reparametrization trick is performed.
The generator network is a neural network with seven transposed-convolution layers. After each layer (except the last), batch normalization is performed, and the output is passed through a leaky ReLU activation function. After the last transposed-convolution layer, we squash the features to the [0, 1] range with the sigmoid function.
The discriminator consists of three neural networks: one discriminates in the image space and one discriminates in the encoding space. Both map their inputs into a discriminative latent space, and each returns a same-size vector. A third network takes the concatenation of these vectors as input and returns a decision on whether an input tuple (image, encoding) comes from the encoding or the generative part of the Triplet BiGAN network. The image discriminator is made of five convolution layers with a leaky ReLU nonlinearity after each of them. The encoding discriminator consists of two convolution layers with a leaky ReLU nonlinearity after each of them, and the joint discriminator is another three convolutional layers with leaky ReLUs between them and a sigmoid nonlinearity at the end.
To assess classification accuracy, experiments were performed to test the influence of the feature vector size and the number of labeled images per class used for semi-supervised learning. For each model, experiments were run with feature vectors of 16, 32, 64 or 128 (256 for SVHN) variables, using 500 labeled images per class. Additionally, using a feature vector of size 64, the experiments measured the impact of the number of labeled examples available during training, with possible values of 100, 200, 400 and 500 (the latter only for CIFAR10). The experiments were conducted on the CIFAR10 and SVHN datasets.
To assess image retrieval quality, experiments tested the influence of the feature vector size and the number of labeled images per class used for semi-supervised learning. For each sample in the test data, the algorithm sorts the images from the training dataset from closest to furthest. Distances are calculated based on the Euclidean distance between the images' feature vectors, to check whether samples that are close to each other in the data space (images belonging to the same class) are also close to each other in the feature space (their representation vectors are similar). In the ideal situation (mAP = 1), all relevant training images, i.e. those belonging to the same class as the query, would be ranked first. With 10 classes in each dataset, mAP = 0.1 may be considered random ordering, as it roughly means that, on average, only every tenth retrieved image was of the same class as the test image.
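The average precision of a single retrieval query, computed from the distance-based ranking described above, can be sketched as follows (the four-item database is a toy illustration):

```python
import numpy as np

def average_precision(query_feat, query_label, feats, labels):
    """AP for one query: rank the database by Euclidean distance in
    feature space, then average the precision measured at each
    position where a same-class (relevant) image appears."""
    order = np.argsort(np.linalg.norm(feats - query_feat, axis=1))
    relevant = (labels[order] == query_label).astype(float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return float(np.sum(precision_at_k * relevant) / relevant.sum())

# a perfect ordering scores AP = 1.0; interleaved classes score lower
feats = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = np.array([0, 0, 1, 1])
ap = average_precision(np.array([0.0]), 0, feats, labels)
```

mAP is then the mean of these per-query values over the whole test set, which is what the tables in this section report.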
Results presented in the tables below indicate that the average precision of image retrieval increases by 0.05 - 0.15 in all but one experiment when using the Triplet BiGAN method, as opposed to using only labeled examples.
Figure 2 presents a visualization of the closest images from the restricted training set (only 500 examples per class from the original training set were used, the same examples that were used for semi-supervised learning). Closeness was evaluated between each image from a randomly chosen sample of 5 test images and each image from the restricted training set. The distance between images was calculated by encoding each image into its feature vector and computing the Euclidean distance between the selected test and training images.
As seen in the visualization, the BiGAN model, despite being trained in a purely unsupervised way, can still embed similar images to similar vectors. However, its closest pictures tend to contain occasional errors, which is rarely the case for retrieval using the triplet models.
A notable example showing the better results of Triplet BiGAN in comparison to the regular triplet model is the 4th image among the selected test pictures (a grey frog). Using feature vectors of size 32 and 64, Triplet BiGAN was able to correctly retrieve other frog and toad images. The same image caused problems for the original triplet model, not to mention BiGAN. This shows that additional unsupervised learning of the underlying data structure is indeed beneficial for finding subtle differences in images and can improve the quality of the feature embedding.
b) Clusterization results for models with feature vector size and trained (Triplet and Triplet BiGAN) using 500 labeled examples per class.
The figures below show visualizations of the embedding quality of the tested models. Each sub-figure presents embeddings mapped to 2 dimensions using the t-SNE algorithm for one of the three models: Triplet, BiGAN and Triplet BiGAN, on the training set and the test set. For the Triplet and Triplet BiGAN models, 500 labeled samples per class were used. Two experiments were performed: one with a feature vector of size 32 and one with a feature vector of size 64, as mentioned in the figure captions. The t-SNE algorithm ran for 1000 iterations using a perplexity of 30 and the Euclidean metric for distance calculation. In the visualization, each class is marked with its own color, which is preserved through all sub-figures.
In the classification and retrieval experiments, Triplet BiGAN achieved worse results (tables 1, 3, 5, 7) than the Triplet GAN presented in [13]. However, we believe that our proposed model still has several advantages in comparison to the reference method: since in Triplet BiGAN we perform metric learning on the encoder (unlike in [13], where metric learning is done on the discriminator features), the learned representation can be used not only for classification and retrieval but also by the generator to create artificial samples.
As the visualizations suggest, both the Triplet and Triplet BiGAN models had no problem learning the clusterization of the training sets. The output of t-SNE clearly shows a separate group for each class of samples for these models. This is not the case for the BiGAN model. However, although BiGAN was trained without a distance-based objective, one can still spot concentrations of particular colors. This aligns with the observation that the encoder learns to embed meaningful features into the feature vector, including those that are somewhat characteristic of specific classes.
Regarding the test sets, Triplet and Triplet BiGAN did not generalize to create perfect separations of the classes. The models rather learn to bind particular classes into small, homogeneous groups, which are not clearly visible in the visualizations but are sufficient to perform classification using the nearest-neighbor algorithm. In the case of the BiGAN model, the embedding features learned on the training set do not transfer well to the test set, creating a somewhat chaotic collection of points that yields image retrieval results close to random.
5 Summary

This work presented the Triplet BiGAN model, which uses a joint optimization criterion: to learn to generate and encode images and to recognize the similarity of given objects. Experiments show that the features extracted by the encoder, despite learning only on true data (in opposition to the features learned by the discriminator, which learns on both real and generated data), may be used as the basis of an image classification, retrieval, grouping or autoencoder model.
Also included in this work are descriptions of the models that were essential milestones in the fields of generative models and distance learning and that served as an inspiration for the presented framework.
-  Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. arXiv preprint arXiv:1605.09782 (2016)
-  Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially learned inference. arXiv preprint arXiv:1606.00704 (2016)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
-  Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition. pp. 84–92. Springer (2015)
-  Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
-  Kumar, B., Carneiro, G., Reid, I., et al.: Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5385–5394 (2016)
-  Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research 9(Nov), 2579–2605 (2008)
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
-  Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in Neural Information Processing Systems. pp. 2234–2242 (2016)
-  Yao, T., Long, F., Mei, T., Rui, Y.: Deep semantic-preserving and ranking-based hashing for image retrieval. In: IJCAI. pp. 3931–3937 (2016)
-  Zhuang, B., Lin, G., Shen, C., Reid, I.: Fast training of triplet-based deep binary embedding networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5955–5964 (2016)
-  Zieba, M., Semberecki, P., El-Gaaly, T., Trzcinski, T.: Bingan: Learning compact binary descriptors with a regularized gan. arXiv preprint arXiv:1806.06778 (2018)
-  Zieba, M., Wang, L.: Training triplet networks with gan. arXiv preprint arXiv:1704.02227 (2017)