1 Introduction
Matching pairs of data is a crucial part of many machine learning tasks, including recommendation
sarwar2001item ; DBLP:journals/corr/abs-1804-09979 , person re-identification zheng2015scalable , and image search DBLP:journals/corr/WangSLRWPCW14 ; parkhi2015deep , all of which are typical industrial applications. Over the past decade, deep learning frameworks for matching data, e.g., images, have served as the core of such systems. Aside from these tasks, an extension to matching multiple instances, namely set-to-set matching, has recently emerged as an important element of various applications required by new web technologies and services. A representative example in e-commerce is the recommendation of a group of fashion items deemed to match the collection of fashion items already owned by the user. Regarding each group as a set, we can view this problem as one of set-to-set matching. Another example is group re-identification in surveillance systems DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 , which have recently started implementing functions to track known groups of suspicious persons or criminals. This task can also be cast as a set-to-set matching problem.
The difficulty of set-to-set matching, in comparison with ordinary data matching, lies in the two types of exchangeability required: one is exchangeability between the pair of sets, and the other is invariance across different permutations of the items in each set. A function that calculates a matching score should provide an invariant response, regardless of the order of the two sets or the permutations of the items.
The main focus of this paper is an architecture that preserves the aforementioned exchangeability properties and, at the same time, achieves high performance on set-to-set matching tasks. We consider an architecture comprising two modules: 1) the feature extractor and 2) the matching layer. A straightforward method of ensuring exchangeability is to apply the same set-feature extractor, such as the Set Transformer lee2019set , separately to each of the sets, and to compute a symmetric matching score with the extracted features. In this study, however, we argue that allowing the feature extractor to include interactions between the two sets improves the feature representations for the task of set-to-set matching. We propose a deep learning architecture for 1) feature extraction, named the cross-set transformation, which iteratively provides the interactions between the pair of sets to each other in the intermediate layers of the feature extractor. The proposed architecture also includes 2) a matching layer, named the cross-similarity function, which calculates the matching score between the features of the set members across the two sets. Our model guarantees both types of exchangeability in modules 1) and 2).
We discuss the set-to-set matching problem in a supervised setting, where examples of correct pairs of sets are provided as training data. The objective of supervised learning is to train the feature extractor and the matching layer so that appropriate features for matching are extracted. The model is then used to find the correct pair of sets from a group of candidates. We propose an efficient training framework for the proposed set-to-set matching architecture.
The effectiveness of our approach is demonstrated in two real-world applications. First, we consider fashion set recommendation, where examples of outfits (correct combinations of clothing items) are used as training data. Using a large number of outfit examples in the form of images, we aim to match the correct pair of sets defined through subset and superset matching. In the subset matching problem, we randomly split an outfit in half beforehand to form two subsets, which we then use as a correct pair. Superset matching is a multiple-outfit version of subset matching. Next, we conduct experiments on a simple type of group re-identification DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 . The objective is to match two groups composed of the same persons captured in individual images provided by the Market-1501 dataset zheng2015scalable , under “noisy” and “non-noisy” conditions. The proposed methods are compared with a baseline model, a straightforward extension of the Set Transformer lee2019set to the set-to-set matching problem.
The main contributions of this paper are as follows:
- A novel deep learning architecture is proposed to provide the two types of exchangeability required for set-to-set matching.
- The proposed feature extractor, which uses the interactions between the two sets, is shown to extract better features for set-to-set matching.
- The proposed models demonstrate better performance than the baseline on the fashion set recommendation and group re-identification tasks, supporting the claim that the interactions improve both the accuracy and robustness of the set-matching procedure.

2 Preliminaries: Set-to-Set Matching
2.1 Problem Formulation
To describe the task of matching two sets, we introduce the necessary notation. Let $\Omega$ be a finite set of all items. Sets $\mathcal{X} = \{x_1, \dots, x_N\}$ and $\mathcal{Y} = \{y_1, \dots, y_M\}$ as data are subsets of $\Omega$, where $x_i, y_j \in \Omega$; hence $\mathcal{X} \subseteq \Omega$ and $\mathcal{Y} \subseteq \Omega$. Let $\mathbf{x}_i \in \mathbb{R}^{d}$ and $\mathbf{y}_j \in \mathbb{R}^{d}$ be feature vectors representing the features of $x_i$ and $y_j$, respectively. Let $\{\mathbf{x}_i\}_{i=1}^{N}$ and $\{\mathbf{y}_j\}_{j=1}^{M}$ be the sets of these feature vectors; hereinafter, we identify each set with its set of feature vectors where no confusion arises. The function $f(\mathcal{X}, \mathcal{Y})$ calculates a matching score between the two sets $\mathcal{X}$ and $\mathcal{Y}$. Guaranteeing the exchangeability of the set-to-set matching requires that the matching function $f$ is symmetric and invariant under any permutation of items within each set (see Section 2.2).
We consider tasks where the matching function is used to select a correct matching. Given $K$ candidate pairs of sets $(\mathcal{X}, \mathcal{Y}_k)$, where $k = 1, \dots, K$ and exactly one candidate is the correct counterpart, we choose $\mathcal{Y}_{\hat{k}}$ as the correct one so that $f(\mathcal{X}, \mathcal{Y}_{\hat{k}})$ achieves the maximum score amongst the candidates. In this study, a supervised learning setting is considered, where the function $f$ is trained to distinguish the correct pair from unmatched pairs. The details of the training method are deferred to Section 3.4.
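As a minimal sketch of this selection step (the helper name `select_match` is ours, and `f` stands for the learned matching function):

```python
import torch

def select_match(f, X, candidates):
    """Pick the index of the candidate set with the highest matching score.

    f          -- learned matching function f(X, Y) -> scalar score
    X          -- reference set of feature vectors, shape (N, d)
    candidates -- list of K candidate sets, each of shape (M_k, d)
    """
    scores = torch.stack([f(X, Y) for Y in candidates])
    return int(torch.argmax(scores))  # index of the predicted correct pair
```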
2.2 Mappings of Exchangeability
We present a brief review of the foundational elements of our models and of set-input functions, to clarify the exchangeability properties they realize.
2.2.1 Permutation Invariance
A set-input function $f$ is said to be permutation invariant if

$$f(\pi_x \mathcal{X}, \pi_y \mathcal{Y}) = f(\mathcal{X}, \mathcal{Y}) \qquad (1)$$

for permutations $\pi_x$ on $\mathcal{X}$ and $\pi_y$ on $\mathcal{Y}$, where $\pi \mathcal{X}$ denotes reordering the members of $\mathcal{X}$ by $\pi$.
2.2.2 Permutation Equivariance
A map $g$ is said to be permutation equivariant if

$$g(\pi_x \mathcal{X}, \pi_y \mathcal{Y}) = \pi_x\, g(\mathcal{X}, \mathcal{Y}) \qquad (2)$$

for permutations $\pi_x$ and $\pi_y$, where $\pi_x$ and $\pi_y$ are on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Note that $g$ is permutation invariant for permutations within $\mathcal{Y}$.
2.2.3 Symmetric Function
A map $f$ is said to be symmetric if

$$f(\mathcal{X}, \mathcal{Y}) = f(\mathcal{Y}, \mathcal{X}). \qquad (3)$$
2.2.4 Two-Set-Permutation Equivariance
Given $\mathcal{X}$ and $\mathcal{Y}$, a map $G$ is said to be two-set-permutation equivariant if

$$G(\tau(\mathcal{X}, \mathcal{Y})) = \tau(G(\mathcal{X}, \mathcal{Y})) \qquad (4)$$

for any permutation operator $\tau$ exchanging the two sets.
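To make these definitions concrete, the following toy check (our own illustration, not the paper's model) verifies Eqs. (1) and (3) for a sum-pooling score and Eq. (2) for a simple member-wise shift:

```python
import torch

def toy_score(X, Y):
    # Sum pooling within each set, then a symmetric inner product:
    # invariant to item order (Eq. 1) and symmetric in (X, Y) (Eq. 3).
    return torch.dot(X.sum(dim=0), Y.sum(dim=0))

X, Y = torch.randn(5, 8), torch.randn(3, 8)
pX, pY = torch.randperm(5), torch.randperm(3)

assert torch.allclose(toy_score(X, Y), toy_score(X[pX], Y[pY]))  # invariance
assert torch.allclose(toy_score(X, Y), toy_score(Y, X))          # symmetry

def toy_equivariant(X, Y):
    # Shifts each x_i by the mean of Y: permuting X permutes the output
    # rows the same way (Eq. 2); permuting Y leaves the output unchanged.
    return X + Y.mean(dim=0)

out = toy_equivariant(X, Y)
assert torch.allclose(out[pX], toy_equivariant(X[pX], Y[pY]))    # equivariance
```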
3 Matching and Learning for Sets
In this section, we describe our set-to-set matching method. Based on the problem scenario described in Section 2.1, we state the design motivations with a concrete real-life application in Section 3.1. We then detail 2) the matching layer, the cross-similarity function, in Section 3.2, and 1) the feature extractor, the cross-set transformation, in Section 3.3. Finally, we discuss training procedures for our model in Section 3.4. Figure 1 shows the proposed architecture.
3.1 Motivation
The set-to-set matching problem motivates us to construct a specific architecture, one which includes the interactions between the members of different sets. Take the case of set-to-set fashion item recommendation: here $\mathcal{X}$ and $\mathcal{Y}$ represent outfits (or subsets/supersets of outfits), and the set members $x_i$ and $y_j$ are the fashion items composing the outfits. Using a conventional deep convolutional neural network, namely Inception-v3 szegedy2016rethinking , described later, we can extract image features $\mathbf{x}_i$ and $\mathbf{y}_j$ from garment images that contain the individual fashion items representing $x_i$ and $y_j$, respectively. Assuming that the image features represent not only low-level features such as colors and edges, but also high-level ones, such as style features DBLP:journals/corr/abs-1807-03133 affecting the impression of the outfit, combinations of items must be taken into account to fully consider fashion compatibility han2017learning ; he2016learning . Hence, we consider that 1) using the interactions between set members of different sets is crucial in feature extraction, and that, after the feature extraction, 2) the resulting features of the two sets must be compared by estimating a score for the possibility of their being a correct pair. Finally, a supervised framework is required to learn the appropriate features to match.
3.2 Calculating Matching Score for Sets
We introduce a matching layer to calculate the matching score between two given sets, mapping $2^{\mathbb{R}^d} \times 2^{\mathbb{R}^d} \to \mathbb{R}$. It is designed to calculate the inner product for every combination of set members across both sets, so we call it cross-similarity (CS), defined as follows:

$$\mathrm{CS}(\mathcal{X}, \mathcal{Y}) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \sigma\big(g(\mathbf{x}_i)\big)^{\top} \sigma\big(g(\mathbf{y}_j)\big), \qquad (5)$$

where $\mathbf{x}_i$ and $\mathbf{y}_j$ are feature vectors in $\mathcal{X}$ and $\mathcal{Y}$, respectively, $g$ is a linear function allowing conversions into a lower-dimensional space using weights $W \in \mathbb{R}^{d' \times d}$, e.g., $g(\mathbf{x}) = W\mathbf{x}$, $\sigma$ is a non-negative mapping, i.e., ReLU glorot2011deep , and $d'$ is the number of dimensions of the lower-dimensional space. CS can be seen as a calculation of the average similarity in the linear subspace created by the dimensionality reduction $g$. Note that CS corresponds to the (normalized) inner product in the linear subspace if both sets contain only one set member.

Instead of calculating a single CS, we utilize multiple CSs (mCS) to combine the cross-similarities calculated with different linear mappings; the procedure runs as follows:

$$s_c = \mathrm{CS}_c(\mathcal{X}, \mathcal{Y}), \quad c = 1, \dots, C, \qquad (6)$$
$$\mathbf{s} = [s_1; s_2; \cdots; s_C], \qquad (7)$$
$$\mathrm{mCS}(\mathcal{X}, \mathcal{Y}) = h(\mathbf{s}), \qquad (8)$$

where $\mathrm{CS}_c$ denotes CS computed with its own weights $W_c$, $[\cdot\,;\cdot]$ indicates the concatenation, and the linear function $h$ with weights $\mathbf{w} \in \mathbb{R}^{C}$ maps $\mathbb{R}^{C} \to \mathbb{R}$. Because CS is permutation invariant (definition in Eq. (1)), mCS is also permutation invariant:
Property 1.
Both CS and mCS are permutation invariant.
Additionally, because CS is symmetric (definition in Eq. (3)), mCS is symmetric as well:
Property 2.
Both CS and mCS are symmetric.
This symmetric property entails that CS and mCS satisfy the exchangeability criterion for the pair of sets, i.e., $\mathrm{CS}(\mathcal{X}, \mathcal{Y}) = \mathrm{CS}(\mathcal{Y}, \mathcal{X})$ and $\mathrm{mCS}(\mathcal{X}, \mathcal{Y}) = \mathrm{mCS}(\mathcal{Y}, \mathcal{X})$.
Next, to allow for comparison against the scores for other matching candidates, the output of mCS or CS is fed into the loss function, e.g., a softmax function with cross-entropy loss. That is, the task of maximizing the matching score is translated into a minimization of the loss function, using the given label.
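A minimal PyTorch sketch of the matching layer, following our rendering of Eqs. (5)–(8) above (class, argument names, and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultiCrossSimilarity(nn.Module):
    """mCS: combine C cross-similarities, each in its own linear subspace."""

    def __init__(self, d, d_low, num_cs):
        super().__init__()
        # num_cs linear maps g_c into a d_low-dimensional subspace (Eq. 5).
        self.projs = nn.ModuleList(nn.Linear(d, d_low, bias=False)
                                   for _ in range(num_cs))
        # Linear function h combining the scores into one (Eq. 8).
        self.h = nn.Linear(num_cs, 1, bias=False)

    def cross_similarity(self, proj, X, Y):
        # sigma(g(x_i))^T sigma(g(y_j)), averaged over all NM pairs (Eq. 5);
        # symmetric, since the mean of the score matrix equals that of its
        # transpose, and invariant to row (member) order.
        gx = torch.relu(proj(X))          # (N, d_low)
        gy = torch.relu(proj(Y))          # (M, d_low)
        return (gx @ gy.t()).mean()       # scalar CS(X, Y)

    def forward(self, X, Y):
        s = torch.stack([self.cross_similarity(p, X, Y) for p in self.projs])
        return self.h(s).squeeze()        # scalar mCS(X, Y)

mcs = MultiCrossSimilarity(d=128, d_low=32, num_cs=4)
score = mcs(torch.randn(6, 128), torch.randn(4, 128))
```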
3.3 Cross-Set Feature Transformation
We construct the architecture of the feature extractor, which transforms sets of features using the interactions between the pair of sets and extracts the desired features for the subsequent matching stage.
Here, consider the transformation of a pair of sets of feature vectors into new feature representations on $\mathbb{R}^{d}$. Let $l$ be the iteration (layer) number of the cross-set transformation layers. Our feature extraction can then be described as a map $(\mathcal{X}^{(l)}, \mathcal{Y}^{(l)}) \mapsto (\mathcal{X}^{(l+1)}, \mathcal{Y}^{(l+1)})$, where $\mathcal{X}^{(l)} = \{\mathbf{x}_1^{(l)}, \dots, \mathbf{x}_N^{(l)}\}$, $\mathcal{Y}^{(l)} = \{\mathbf{y}_1^{(l)}, \dots, \mathbf{y}_M^{(l)}\}$, and $\mathbf{x}_i^{(l)}, \mathbf{y}_j^{(l)} \in \mathbb{R}^{d}$. For example, $\mathbf{x}_i^{(l)}$ denotes the feature vector extracted by the $l$-th layer representing $x_i$, and $\mathbf{y}_j^{(l)}$ is defined similarly. Note that the initial feature vectors with $l = 0$ are found with a typical feature extractor, i.e., a deep convolutional neural network applied to an individual image. Then, we construct a parallel architecture, the cross-set transformation, with an asymmetric transformation $g$, as follows:

$$\mathcal{X}^{(l+1)} = g\big(\mathcal{X}^{(l)}, \mathcal{Y}^{(l)}\big), \qquad \mathcal{Y}^{(l+1)} = g\big(\mathcal{Y}^{(l)}, \mathcal{X}^{(l)}\big), \qquad (9)$$

where $g(\mathcal{X}, \mathcal{Y})$ is a permutation equivariant function for the first argument (defined in Eq. (2)) that transforms the set features in the first argument into new feature representations, regardless of the order of the set features in the second argument. Furthermore, residual paths he2016deep may be used in Eq. (9) if required.
We propose two possible feature extractors for $g$: an attention-based function and an affinity-based function. Both are constructed to assign the “matched” feature vectors to the “reference” feature vector, taking account of interactions between the two sets. For simplicity, we explain the case of extracting the features for $\mathcal{X}$ (we can exchange the roles of $\mathcal{X}$ and $\mathcal{Y}$ to extract those for $\mathcal{Y}$).
The attention-based function of $g$ transforms each member of $\mathcal{X}$ as follows:

$$[g_{\mathrm{att}}(\mathcal{X}, \mathcal{Y})]_i = \frac{1}{M} \sum_{j=1}^{M} \sigma\big((W_Q \mathbf{x}_i)^{\top} W_K \mathbf{y}_j\big)\, W_V \mathbf{y}_j, \qquad (10)$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d}$ are the weights of linear transformations, e.g., $\ell(\mathbf{x}) = W_Q \mathbf{x}$, and $\sigma$ is the non-negative mapping. Here, $d_h = d$ if a multihead structure is not utilized, as described later. It is clear that the attention-based function transforms the respective feature vectors, giving the interactions between the set members of $\mathcal{X}$ and $\mathcal{Y}$.

Note that our attention-based function has a strong relation to dot-product attention DBLP:journals/corr/VaswaniSPUJGKP17 ; lee2019set , which has in the past been introduced to calculate the weighted average on $\mathcal{Y}$ using softmax-normalized coefficients. However, the softmax operation would be inconsistent with our matching objective, as through normalization it increases the coefficients even in unmatched cases of $\mathcal{X}$ and $\mathcal{Y}$. To preserve non-linearity, we instead use the non-negative weighted sum and then average it, as in Eq. (10).
The affinity-based function of $g$ transforms each member of $\mathcal{X}$ as follows:

$$[g_{\mathrm{aff}}(\mathcal{X}, \mathcal{Y})]_i = \frac{1}{M} \sum_{j=1}^{M} \sigma\big((W_1 \mathbf{x}_i)^{\top} W_2 \mathbf{y}_j\big)\, W_2 \mathbf{y}_j, \qquad (11)$$

where $W_1, W_2 \in \mathbb{R}^{d_h \times d}$ and $\sigma$ is the non-negative mapping. Using the two linear transformations $W_1$ and $W_2$, the affinity-based function combines the resembling features so that the resulting feature vectors for $\mathcal{X}$ have similar representations to the (linearly transformed) vectors in $\mathcal{Y}$.
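Minimal sketches of the two functions under the reconstruction above (single-head case; module and weight names are our own):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Attention-based g(X, Y): non-negative weighted average of mapped Y (Eq. 10)."""

    def __init__(self, d, d_h):
        super().__init__()
        self.q = nn.Linear(d, d_h, bias=False)  # applied to the reference set X
        self.k = nn.Linear(d, d_h, bias=False)  # applied to the matched set Y
        self.v = nn.Linear(d, d_h, bias=False)

    def forward(self, X, Y):
        # ReLU instead of softmax: unmatched pairs keep near-zero weights.
        w = torch.relu(self.q(X) @ self.k(Y).t())   # (N, M) coefficients
        return (w @ self.v(Y)) / Y.shape[0]         # average over the M members

class CrossAffinity(nn.Module):
    """Affinity-based g(X, Y): two linear maps; values share the second map (Eq. 11)."""

    def __init__(self, d, d_h):
        super().__init__()
        self.w1 = nn.Linear(d, d_h, bias=False)  # applied to the reference set X
        self.w2 = nn.Linear(d, d_h, bias=False)  # applied to the matched set Y

    def forward(self, X, Y):
        Yt = self.w2(Y)                          # (M, d_h) transformed Y
        A = torch.relu(self.w1(X) @ Yt.t())      # (N, M) affinities
        return (A @ Yt) / Y.shape[0]             # combine resembling features
```

Both modules are permutation equivariant in the first argument and invariant to the order of the second, as required by Eq. (2).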
Other simple permutation equivariant functions of the first argument (e.g., a member-wise feed-forward map) may be utilized. However, without any interactions between the two sets, we consider such functions incapable of extracting features rich enough to yield accurate matching between two sets.
Instead of performing $g$ singly, we introduce a multihead structure DBLP:journals/corr/VaswaniSPUJGKP17 to our feature extractor, where each head is a permutation equivariant function producing $d_h$-dimensional member features. Denoting the output of the $h$-th head as $g_h(\mathcal{X}, \mathcal{Y})$, the multihead version is defined as $[g_1(\mathcal{X}, \mathcal{Y}); \cdots; g_H(\mathcal{X}, \mathcal{Y})]$, where $[\cdot\,;\cdot]$ indicates a concatenation for each corresponding set member and $H$ is the number of heads. Note that the multihead structure is related to recent models such as MobileNet DBLP:journals/corr/HowardZCKWWAA17 , which isolates and places the convolutional operations in parallel to reduce the calculation costs whilst preserving the accuracy of the recognition. We assume that the multihead structure provides various interactions between the set members, reducing the calculation costs as well.
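Continuing the sketch above, a multihead wrapper might read as follows (assuming, as is common practice, that each head outputs $d_h = d/H$ dimensions so that concatenation restores $d$):

```python
class MultiHead(nn.Module):
    """Run H independent cross-set functions and concatenate per set member."""

    def __init__(self, make_head, H):
        super().__init__()
        self.heads = nn.ModuleList(make_head() for _ in range(H))

    def forward(self, X, Y):
        # Each head maps to (N, d_h); concatenation yields (N, H * d_h).
        return torch.cat([g(X, Y) for g in self.heads], dim=-1)

# Four heads of width 32 reproduce a 128-dimensional member feature.
mh = MultiHead(lambda: CrossAttention(d=128, d_h=32), H=4)
```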
3.3.1 Analysis
We briefly show that our architecture is permutation invariant and symmetric, as follows. Note that operations performed independently for each set member do not affect these properties, so we focus on the matching layer (CS or mCS) and the feature extractor (the cross-set transformation).
Proposition 1.
A composition of the function CS or mCS with the cross-set transformation $g$, i.e., $\mathrm{CS}\big(g(\mathcal{X}, \mathcal{Y}), g(\mathcal{Y}, \mathcal{X})\big)$ or $\mathrm{mCS}\big(g(\mathcal{X}, \mathcal{Y}), g(\mathcal{Y}, \mathcal{X})\big)$, is symmetric and permutation invariant.
Proof.
For permutation invariance, because the composite of a permutation equivariant function with a permutation invariant function such as CS or mCS (described in Property 1) is permutation invariant, both compositions above are permutation invariant.

To argue that our architecture is symmetric, we use the fact that a symmetric function composed with the two-set-permutation equivariant function defined in Eq. (4) is symmetric, i.e., $\mathrm{CS}(\tau(G(\mathcal{X}, \mathcal{Y}))) = \mathrm{CS}(G(\mathcal{X}, \mathcal{Y}))$. Writing the cross-set transformation as $G(\mathcal{X}, \mathcal{Y}) = \big(g(\mathcal{X}, \mathcal{Y}), g(\mathcal{Y}, \mathcal{X})\big)$ and assuming that it exhibits the weight-sharing structure, $G$ is a two-set-permutation equivariant function, i.e., $G(\tau(\mathcal{X}, \mathcal{Y})) = \tau(G(\mathcal{X}, \mathcal{Y}))$, since exchanging the inputs simply exchanges the responses $g(\mathcal{X}, \mathcal{Y})$ and $g(\mathcal{Y}, \mathcal{X})$ for any permutation operator $\tau$ exchanging the two sets. Thus, we have the following:
Property 3.
The cross-set transformation is equivariant under two-set-permutation.
Note that we can stack cross-set transformations in a way that preserves the symmetric architecture within multiple layers, by combining with other networks that operate upon the sets or items independently. We discuss the overall architecture in Section 5.2.
3.4 Training for Pairs of Sets
We attempt to train our model efficiently using multiple correct pairs taken together. As described in the problem formulation, $K$ candidates are provided to find the correct one-to-one matching; that is, $K$ candidates per reference set $\mathcal{X}$ are fed into the matching process in each training iteration. However, if we prepared separate candidates for every reference set, the amount of data to be processed per iteration would multiply, making the computation inefficient.
To train our model efficiently, we create matching candidates from the correct pairs. Let $(\mathcal{X}_k, \mathcal{Y}_k)$ be a correct pair of sets, where $k = 1, \dots, K$. From those $K$ pairs, by extracting all $\mathcal{Y}_k$, we can create the set of candidates $\mathcal{C} = \{\mathcal{Y}_k\}_{k=1}^{K}$. That is, $\mathcal{C}$ is composed of sets exhibiting a correct relation only to the respective $\mathcal{X}_k$, and can be used as the set of candidates for each $\mathcal{X}_k$ in the training stage. Note that we can construct these candidates by assuming that exactly one correct pair exists for the respective sets, as described in the problem formulation. An example is shown in Figure 2.
Compared with typical mini-batch training, supposing that the set size is $\bar{n}$ on average, the $K$-pair training method utilizes $2K\bar{n}$ data (images) per training iteration; this can be regarded as the size of the mini-batch. We consider the above training method a set version of the $N$-pair loss NIPS2016_6200 , so we call it the $K$-pair-set loss herein.
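A sketch of one training step with this loss (our own rendering; `match_fn` stands for the full model, e.g., the cross-set transformation followed by mCS):

```python
import torch
import torch.nn.functional as F

def k_pair_set_loss(match_fn, Xs, Ys):
    """Xs[k] and Ys[k] form the k-th correct pair; every Ys[j] with j != k
    serves as a negative candidate for Xs[k]."""
    K = len(Xs)
    # K x K score matrix: entry (k, j) = match score between Xs[k] and Ys[j].
    scores = torch.stack([torch.stack([match_fn(X, Y) for Y in Ys]) for X in Xs])
    target = torch.arange(K)  # the correct candidate sits on the diagonal
    return F.cross_entropy(scores, target)
```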
If the quantity of set data is not large, we can use other training frameworks, e.g., a triplet loss with the softplus function DBLP:journals/corr/HermansBL17 ; the triplet loss is applied to the relations among the reference set $\mathcal{X}_k$, the positive candidate $\mathcal{Y}_k$, and a negative candidate $\mathcal{Y}_j$, where $j \neq k$. We demonstrate that both training frameworks can be adopted for training our models in the experiments.

4 Related Works
4.1 Set-Input Methods
Deep learning architectures for set data have been actively studied DBLP:journals/corr/LiCZL16 ; vinyals2015order ; lee2019set ; DBLP:journals/corr/ZaheerKRPSS17 . In the work of Lee et al. lee2019set , a set-feature representation was introduced by applying a self-attention-based Transformer DBLP:journals/corr/VaswaniSPUJGKP17 to set data; their encoder–decoder model, called the Set Transformer, transforms set data into a vector/matrix representation in the feature space. Zaheer et al. DBLP:journals/corr/ZaheerKRPSS17 derived a condition for permutation invariance/equivariance in functions and introduced an operator referred to as deep sets. These models can manage set data serving multiple objectives, such as set classification, computation over image collections, and text retrieval. However, constructing a deep learning model that can manage multiple sets has not been well studied.
4.2 Methods for Matching Two Sets
Matching multiple data points is related to the setting of measuring the discrepancy between two distributions gretton2005measuring ; muandet2017kernel ; poczos2012support ; muandet2012learning ; DBLP:journals/corr/ZhangXHL17 ; DBLP:journals/corr/LiCCYP17 . However, to the best of our knowledge, deep learning for matching two sets of data has not been well studied.
4.3 Applications
Many fashion item recommendation studies have investigated natural combinations of fashion items, the so-called visual fashion compatibility, to recommend fashion items or outfits han2017learning ; he2016learning ; hsiao2018creating ; DBLP:journals/corr/abs-1804-09979 ; DBLP:journals/corr/abs-1807-03133 ; DBLP:journals/corr/abs-1803-09196 ; DBLP:journals/corr/abs-1902-08009 ; DBLP:journals/corr/abs-1810-02443 . In this study, the main difficulties of the subset/superset matching procedures lie in satisfying the fashion compatibility requirements of the matched sets.
In applications such as group re-identification DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 or multi-shot person re-identification wang2014person , problems of multiple instance matching arise. One proposed group re-identification scenario requires the detection of known groups in videos DBLP:journals/corr/abs-1905-07108 . Because our experiments focus on set-to-set matching using given cropped images, we regard ours as a simple type of group re-identification.
5 Experiments
5.1 Baseline for Comparison
We validate our architecture through comparison with other set-matching models. However, to the best of our knowledge, no prior studies use deep neural networks for matching two sets. Instead, we extend a state-of-the-art set-input method to the set-to-set matching procedure and consider it an acceptable baseline for comparison.
We straightforwardly extend the Set Transformer lee2019set to a two-set matching method. Here, the Set Transformer transforms a set of feature vectors into a single vector on $\mathbb{R}^{d}$. Denoting the Set Transformer model by $\mathrm{ST}$, we perform the extension by calculating the matching score between the two sets $\mathcal{X}$ and $\mathcal{Y}$ via the inner product $\mathrm{ST}(\mathcal{X})^{\top}\, \mathrm{ST}(\mathcal{Y})$, sharing the weights between the two instances of $\mathrm{ST}$. Note that this extension of the Set Transformer satisfies the exchangeability criteria for set-to-set matching; however, it provides no interactions between the two sets. The other processes and scenarios are the same as in our methodology.
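The baseline score can be sketched as follows (`set_encoder` stands for any off-the-shelf Set Transformer implementation returning a single vector per set; the wrapper is our own):

```python
import torch
import torch.nn as nn

class SetTransformerMatcher(nn.Module):
    """Baseline: encode each set independently, then take an inner product."""

    def __init__(self, set_encoder):
        super().__init__()
        self.enc = set_encoder  # shared weights for both sets

    def forward(self, X, Y):
        # Symmetric and permutation invariant, but the two sets never
        # interact before the final inner product.
        return torch.dot(self.enc(X), self.enc(Y))
```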
Comparing the experimental results of our models with the results of the extended Set Transformer-based model serves as a performance comparison and also an evaluation for our models, providing insight into whether our architecture is valid or not.
5.2 Overall Architecture
In applications of our model, we use an encoder–decoder structure, inspired by the Transformer models DBLP:journals/corr/VaswaniSPUJGKP17 ; lee2019set . For example, Vaswani et al. regarded the Transformer as an encoder–decoder model for text translation DBLP:journals/corr/VaswaniSPUJGKP17 ; the encoder transforms a set of features within the input domain, and the decoder transforms the resultant set of features onto the output domain. Because the translation is unidirectional, the Transformer contains a single encoder–decoder structure. Meanwhile, an iterative encoder–decoder model, the Stacked Hourglass model, has been proposed and demonstrates high accuracy in the task of human pose estimation DBLP:journals/corr/NewellYD16 . Borrowing from the above architectures, we construct our overall architecture by combining the encoder lee2019set , which is a permutation equivariant function called a self-attention block, with the decoder, which is our cross-set transformation. We then repeat the encoder–decoder structure multiple times in succession (described in Figure 1).

For the experiments, we construct our models as follows. We set both the number of cross-set transformation layers and the number of encoder layers to 2; that is, we iteratively perform the one-layered encoder and the one-layered cross-set transformation two times in succession. Here, the resultant set of feature vectors extracted from each encoder is fed into the next cross-set transformation and also into the respective residual paths. Alongside this, we apply a feed-forward network, comprising two-layered linear transformations with a leaky ReLU maas2013rectifier , to the first argument of each function $g$. Note that we use the two types of function composing the cross-set transformation: the attention-based function in Eq. (10) and the affinity-based function in Eq. (11).
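Putting the modules together, the stacked encoder–decoder loop might be sketched as follows (a simplification; the exact residual placement is our assumption based on the description above):

```python
import torch.nn as nn

class SetMatchingModel(nn.Module):
    """Repeat (shared-weight encoder, cross-set transformation), then score."""

    def __init__(self, encoders, crosses, matching_layer):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)  # permutation equivariant blocks
        self.crosses = nn.ModuleList(crosses)    # e.g., multihead CrossAttention
        self.match = matching_layer              # e.g., MultiCrossSimilarity

    def forward(self, X, Y):
        for enc, cross in zip(self.encoders, self.crosses):
            X, Y = enc(X), enc(Y)                # same encoder for both sets
            # Parallel cross-set update with residual paths (Eq. 9): the
            # right-hand side uses the pre-update X and Y for both calls.
            X, Y = X + cross(X, Y), Y + cross(Y, X)
        return self.match(X, Y)
```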
To extract the individual feature vector from each of the images within a set, we use convolutional neural networks (CNNs), with a different CNN for each task. For the fashion set recommendation task, we use the Inception-v3 szegedy2016rethinking model pre-trained on the ILSVRC-2012 ImageNet dataset russakovsky2015imagenet . Using this model, we extract the feature vector from the second-last layer, placed between the global average pooling layer and the fully connected layer. We linearly transform this feature vector into $\mathbb{R}^{d}$ to serve as one of the set members; the two sets of collected feature vectors are then fed into the set-input functions. For the group re-identification task, on the other hand, we utilize a simple four-layered CNN without any pre-training. This CNN transforms an RGB image into a feature map, and we then apply global average pooling so that the resultant feature vector is on $\mathbb{R}^{d}$ as well.

5.3 Training Details
In the training stage, we set the number of matching candidates $K$ to 16, 4, and 16 for subset matching, superset matching, and group re-identification, respectively.
We use stochastic gradient descent with a learning rate of 0.005, a momentum of 0.5, and a weight decay of 0.00004. The learning rate decays every 16 epochs by a factor of 0.7. For fashion set recommendation, we set the maximum number of epochs to 32, which requires a week of training on an Amazon SageMaker ml.p3.8xlarge instance. For group re-identification, we set the maximum number of epochs to 128, which takes a few hours. We train the CNN and the set matching model simultaneously.

In selecting the loss function, to reduce the learning time, we use the $K$-pair-set loss for superset matching and group re-identification. We use the triplet loss with the softplus function for subset matching.
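In PyTorch terms, the schedule described above corresponds roughly to the following sketch (`model` stands for the full network, CNN backbone plus set-matching layers):

```python
import torch

opt = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.5,
                      weight_decay=0.00004)
# Degrade the learning rate by a factor of 0.7 every 16 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=16, gamma=0.7)
```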
5.4 Fashion Set Recommendation
5.4.1 Dataset
We examine set-to-set matching for fashion set recommendation using the IQON dataset DBLP:journals/corr/abs-1807-03133 . IQON (www.iqon.jp) is a user-participatory fashion web service for sharing women's outfits. The IQON dataset consists of recently created, high-quality outfits, including 199,792 items grouped into 88,674 outfits. We split these outfits into groups, using 70,997 for training, 8,842 for validation, and 8,835 for testing.
For training with the IQON dataset, we set the maximum and minimum numbers of items for each outfit as 8 and 4, respectively; if the outfit contains more than 8 items, then we randomly select 8 items from it. The outfits contain roughly 5.5 items on average.
5.4.2 Preparing Set Pairs
To construct the correct pair of sets to be matched, we randomly halve a given outfit $\mathcal{Z}$ into two non-empty proper subsets $\mathcal{X}$ and $\mathcal{Y}$, such that $\mathcal{Z} = \mathcal{X} \cup \mathcal{Y}$ and $\mathcal{X} \cap \mathcal{Y} = \emptyset$. We use different splitting methods for subset matching and superset matching.
We expect our model to reconstruct the original outfit by combining the two subsets $\mathcal{X}$ and $\mathcal{Y}$, provided such an inverse mapping exists. In the reconstruction, we assume that the desired features either remain within both subsets or are extracted during matching. For example, we regard the desired features as the discriminative features, which serve to recognize the fashion compatibility han2017learning ; he2016learning or infer the visual style of the outfit. That is, in the matching of the two subsets, such desired features must be obtained.
We perform our experiments using these subsets, trying to find the correct pairs. Here, we consider matching the two subsets $\mathcal{X}$ and $\mathcal{Y}$; we call this problem subset matching. In subset matching, $K$ subsets are provided as a set of matching candidates whilst maintaining the category restrictions for each fashion item; that is, the candidates contain only same-category fashion items when fed into the training or testing stages. Note that, without any category restrictions, the models tend to be trained to select the candidate whose fashion-item categories, e.g., shoes, do not overlap with those of the reference subset $\mathcal{X}$. To avoid this situation, we introduce category restrictions on the candidates in each training/testing iteration.
We also extend subset matching to superset matching, which presents a more complex problem. We choose $n$ outfits $\mathcal{Z}_1, \dots, \mathcal{Z}_n$ randomly and split each randomly in half into $(\mathcal{X}_i, \mathcal{Y}_i)$, where $i = 1, \dots, n$. Then we create the two supersets $\mathcal{X} = \bigcup_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{Y} = \bigcup_{i=1}^{n} \mathcal{Y}_i$, which serve as a correct pair for the superset matching problem. We consider the superset a multimodal/mixture set consisting of multiple fashion styles, such that the matching problem is one of finding similar supersets in terms of these mixed fashion styles. Because the supersets already contain overlapping fashion-item categories, providing category restrictions to the candidates is not necessarily required in superset matching, so we do not impose them. Note that we fixed $n$ in the training stage and selected $n \in \{2, 4\}$ in the test stage (reported as Mix:2 and Mix:4 in Table 1).
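The pair construction described above can be sketched as follows (our own illustration; items stand for image features or item IDs):

```python
import random

def split_outfit(outfit):
    """Randomly halve an outfit into two non-empty proper subsets."""
    items = list(outfit)
    random.shuffle(items)
    half = len(items) // 2
    return items[:half], items[half:]        # (X, Y): a correct subset pair

def make_superset_pair(outfits):
    """Split n outfits and pool the halves into two matched supersets."""
    halves = [split_outfit(o) for o in outfits]
    X = [item for first, _ in halves for item in first]
    Y = [item for _, second in halves for item in second]
    return X, Y
```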
5.4.3 Subset/Superset Matching
We discuss the experimental results of matching subsets/supersets. Figures 3(a) and 3(b) show the learning curves of our models and the baseline, the Set Transformer-based model. Our models converged well in both the subset and superset matching tasks, while the learning curve of the baseline model showed higher losses, indicating the difficulty of learning with the baseline model in subset/superset matching. Moreover, Table 1 shows significantly different results between our models and the baseline. Here, Cross Attention and Cross Affinity denote our models with the attention-based and affinity-based functions, respectively. In both subset and superset matching, our models demonstrated better results than the baseline. For example, in the simplest superset matching scenario, with Cand:4 and Mix:2, the accuracies of the baseline and our model with the affinity-based function were 73.5% and 90.6%, respectively.
Figure 3: Learning curves of our models and the baseline for (a) subset matching and (b) superset matching.
Comparing the attention-based and affinity-based functions, the learning curves in Figure 3 show almost the same losses; however, Table 1 shows that the affinity-based function performed better in both subset and superset matching.
Table 1: Matching accuracy (%) for subset and superset matching.

| Method | Subset (Cand:4) | Subset (Cand:8) | Superset (Cand:4, Mix:2) | Superset (Cand:4, Mix:4) | Superset (Cand:8, Mix:2) | Superset (Cand:8, Mix:4) |
|---|---|---|---|---|---|---|
| Set Trans. (baseline) | 39.2 | 22.7 | 73.5 | 65.3 | 57.5 | 49.6 |
| Cross Attention | 58.1 | 41.9 | 88.8 | 74.3 | 80.6 | 58.9 |
| Cross Affinity | 60.2 | 43.3 | 90.6 | 75.9 | 82.8 | 61.9 |
5.5 Group Re-Identification
We present the results of group re-identification, where the task is to identify the pair of sets consisting of individual images of the same persons. We report our results using an extension of a well-known person re-identification dataset, Market-1501 zheng2015scalable , which provides cropped images containing individual persons alongside their respective labels. We evaluated the accuracy using the training/validation data, including the query/reference splits, as divided by the dataset authors. Here we regard the sets of reference and query data as $\mathcal{X}$ and $\mathcal{Y}$, respectively.
For the experiment, using Market-1501, we construct image sets composed of multiple persons. Each set consists of 3–8 persons and contains three different images of each. We validate the matching accuracy of the models and their robustness to “noise” as well. Here, “noise” means an unexpected person mixed into the group; in real-world applications, both the reference set and the query set may contain “non-target” persons owing to errors, such as human error or misdetection by the bounding box detector. Such “noisy” situations cannot be avoided completely; therefore, investigating the robustness of group re-identification is crucial. Figure 4 shows an example, which includes a “non-target” person. Here, we trained the models under the same “noisy” or “non-noisy” settings as used for each test, to investigate the robustness of the models for set-to-set matching under noisy situations. In the test stage, we set the number of candidates to 5.
Table 2 compares the results of our models with the baseline. In the non-noisy case, all models showed almost perfect accuracies; we consider that “averaging feature vectors in sets” achieves high matching accuracy in this simple case, so all the models produced accurate results. As the ratio of noise increased, the accuracy degraded across all models. However, our models remained more accurate than the baseline, even in the noisy cases. For example, when the “noise” was five persons in the set $\mathcal{Y}$, and the remaining three persons were the correct targets in both sets $\mathcal{X}$ and $\mathcal{Y}$, the accuracies of the baseline and the proposed model with the affinity-based function were 80.4% and 92.4%, respectively. The differences in accuracy were preserved even in noisier settings. These results support the claim that providing the interactions between two sets improves both the accuracy and the robustness of the matching procedure.

Table 2: Group re-identification accuracy (%). Columns correspond to (ratio of “noise” persons in $\mathcal{X}$, ratio of “noise” persons in $\mathcal{Y}$) settings, from non-noisy (left) to the noisiest (right).

| Method | | | | | | | |
|---|---|---|---|---|---|---|---|
| Set Trans. (baseline) | 99.5 | 95.1 | 89.9 | 85.7 | 80.4 | 65.7 | 48.1 |
| Cross Attention | 99.6 | 96.9 | 94.8 | 91.9 | 90.7 | 72.9 | 56.1 |
| Cross Affinity | 99.7 | 96.5 | 92.5 | 94.4 | 92.4 | 72.0 | 61.7 |
6 Conclusion
In this paper, we investigated the set-to-set matching problem. We proposed a novel architecture including 1) the cross-set transformation and 2) the cross-similarity function, along with a training framework for large amounts of set data.
We showed that our architecture preserves the two types of exchangeability, for a pair of sets and for the items within each set (Proposition 1), which must be satisfied in the set-to-set matching procedure.
We demonstrated that our models performed well compared with the baseline, a Set Transformer-based model, in experiments on fashion set recommendation and group re-identification. These results support the claim that feature representations extracted with interactions between the set members of the two sets improve the accuracy and robustness of set-to-set matching.
References
- (1) B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, et al., Item-based collaborative filtering recommendation algorithms, WWW 1 (2001) 285–295.
- (2) P. Tangseng, K. Yamaguchi, T. Okatani, Recommending outfits from personal closet, CoRR abs/1804.09979. URL http://arxiv.org/abs/1804.09979
- (3) L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE International Conference on Computer Vision, 2015.
- (4) J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, CoRR abs/1404.4661. URL http://arxiv.org/abs/1404.4661
- (5) O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition, in: BMVC, Vol. 1, 2015, p. 6.
- (6) G. Lisanti, N. Martinel, A. D. Bimbo, G. L. Foresti, Group re-identification via unsupervised transfer of sparse features encoding, CoRR abs/1707.09173. URL http://arxiv.org/abs/1707.09173
- (7) H. Xiao, W. Lin, B. Sheng, K. Lu, J. Yan, J. Wang, E. Ding, Y. Zhang, H. Xiong, Group re-identification: Leveraging and integrating multi-grain information, in: Proceedings of the 26th ACM International Conference on Multimedia, MM '18, ACM, New York, NY, USA, 2018, pp. 192–200. doi:10.1145/3240508.3240539. URL http://doi.acm.org/10.1145/3240508.3240539
- (8) W. Lin, Y. Li, H. Xiao, J. See, J. Zou, H. Xiong, J. Wang, T. Mei, Group re-identification with multi-grained matching and integration, CoRR abs/1905.07108. URL http://arxiv.org/abs/1905.07108
- (9) J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, Y. W. Teh, Set transformer: A framework for attention-based permutation-invariant neural networks, in: International Conference on Machine Learning, 2019, pp. 3744–3753.
- (10) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
- (11) T. Nakamura, R. Goto, Outfit generation and style extraction via bidirectional LSTM and autoencoder, CoRR abs/1807.03133. URL http://arxiv.org/abs/1807.03133
- (12) X. Han, Z. Wu, Y.-G. Jiang, L. S. Davis, Learning fashion compatibility with bidirectional LSTMs, in: Proceedings of the 25th ACM International Conference on Multimedia, ACM, 2017, pp. 1078–1086.
- (13) R. He, C. Packer, J. McAuley, Learning compatibility across categories for heterogeneous item recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 937–942.
- (14) X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
- (15) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- (16) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762. URL http://arxiv.org/abs/1706.03762
- (17) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, CoRR abs/1704.04861. URL http://arxiv.org/abs/1704.04861
- (18) K. Sohn, Improved deep metric learning with multi-class N-pair loss objective, in: Advances in Neural Information Processing Systems 29, Curran Associates, Inc., 2016, pp. 1857–1865. URL http://papers.nips.cc/paper/6200-improved-deep-metric-learning-with-multi-class-n-pair-loss-objective.pdf
- (19) A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, CoRR abs/1703.07737. URL http://arxiv.org/abs/1703.07737
- (20) Y. Li, L. Cao, J. Zhu, J. Luo, Mining fashion outfit composition using an end-to-end deep learning approach on set data, CoRR abs/1608.03016. URL http://arxiv.org/abs/1608.03016
- (21) O. Vinyals, S. Bengio, M. Kudlur, Order matters: Sequence to sequence for sets, arXiv preprint arXiv:1511.06391.
- (22) M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, A. J. Smola, Deep sets, CoRR abs/1703.06114. URL http://arxiv.org/abs/1703.06114
- (23) A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, in: International Conference on Algorithmic Learning Theory, Springer, 2005, pp. 63–77.
- (24) K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al., Kernel mean embedding of distributions: A review and beyond, Foundations and Trends in Machine Learning 10 (1-2) (2017) 1–141.
- (25) B. Póczos, L. Xiong, D. Sutherland, J. Schneider, Support distribution machines, arXiv preprint arXiv 1202.
- (26) K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf, Learning from distributions via support measure machines, in: Advances in Neural Information Processing Systems, 2012, pp. 10–18.
- (27) Y. Zhang, T. Xiang, T. M. Hospedales, H. Lu, Deep mutual learning, CoRR abs/1706.00384. URL http://arxiv.org/abs/1706.00384
- (28) C. Li, W. Chang, Y. Cheng, Y. Yang, B. Póczos, MMD GAN: Towards deeper understanding of moment matching network, CoRR abs/1705.08584. URL http://arxiv.org/abs/1705.08584
- (29) W.-L. Hsiao, K. Grauman, Creating capsule wardrobes from fashion images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7161–7170.
- (30) M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, D. A. Forsyth, Learning type-aware embeddings for fashion compatibility, CoRR abs/1803.09196. URL http://arxiv.org/abs/1803.09196
- (31) Z. Cui, Z. Li, S. Wu, X. Zhang, L. Wang, Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks, CoRR abs/1902.08009. URL http://arxiv.org/abs/1902.08009
- (32) T. He, Y. Hu, FashionNet: Personalized outfit recommendation with deep neural network, CoRR abs/1810.02443. URL http://arxiv.org/abs/1810.02443
- (33) X. Wang, R. Zhao, Person re-identification: System design and evaluation overview, in: Person Re-Identification, Springer, 2014, pp. 351–370.
- (34) A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, CoRR abs/1603.06937. URL http://arxiv.org/abs/1603.06937
- (35) A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proc. ICML, Vol. 30, 2013, p. 3.
- (36) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.