Deep Set-to-Set Matching and Learning

by Yuki Saito, et al.

Matching two sets of items, known as the set-to-set matching problem, has recently attracted attention. The difficulty of set-to-set matching, compared with ordinary data matching, lies in the exchangeability required in 1) set-feature extraction and 2) the set-matching score: both the pair of sets and the items in each set should be exchangeable. In this paper, we propose a deep learning architecture for set-to-set matching that overcomes these difficulties and includes two novel modules: 1) a cross-set transformation and 2) a cross-similarity function. The former provides exchangeable set features through interactions between the two sets in intermediate layers, and the latter provides exchangeable set matching by calculating the cross-feature similarity of items between the two sets. We evaluate the methods through experiments on two industrial applications: fashion set recommendation and group re-identification. Through these experiments, we show that the proposed methods perform better than a baseline given by an extension of the Set Transformer, the state-of-the-art set-input function.




1 Introduction

Matching pairs of data is a crucial part of many machine learning tasks, including recommendation sarwar2001item ; DBLP:journals/corr/abs-1804-09979 , person re-identification zheng2015scalable , image search DBLP:journals/corr/WangSLRWPCW14 , face recognition parkhi2015deep , etc., as typical industrial applications. Over the past decade, a deep learning framework for matching up data, e.g., images, has served as the core of such systems.

Aside from these tasks, an extension of multiple instance matching, namely set-to-set matching, has recently been raised as an important element of various applications required by emerging web technologies or services. A representative example in e-commerce is the recommendation of a group of fashion items deemed to match the collection of fashion items already owned by the user. Regarding the group as a set, we can see this problem as one of set-to-set matching. Another example is group re-identification in surveillance systems DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 , which has recently started implementing a function to track a known group of suspicious persons or criminals. This task can also be simplified as a set-to-set matching problem.

The difficulty of set-to-set matching, in comparison with ordinary data matching, lies in the two types of exchangeability required: one is exchangeability between the pair of sets, and the other is invariance across different permutations of the items in each set. A function that calculates a matching score should provide an invariant response, regardless of the order of the two sets or the permutations of the items.

The main focus of this paper is an architecture that preserves the aforementioned exchangeability properties, and at the same time, realizes a high performance in the set-to-set matching tasks. We consider the architecture of two modules: 1) the feature extractor, and 2) the matching layer. A straightforward method of ensuring exchangeability is to apply the same set-feature extractor, such as Set Transformer lee2019set , separately to each of the sets, and compute the symmetric matching score with the extracted features. In this study, however, we argue that allowing the feature extractor to include interactions between the two sets will improve the feature representations for the task of set-to-set matching. We propose a deep learning architecture for 1) feature extraction, named cross-set transformation, which iteratively provides the interactions between the pair of sets to each other in the intermediate layers of the feature extractor. The proposed architecture also includes 2) a matching layer, named cross-similarity function, that calculates the matching score between the features of the set members across the two sets. Our model guarantees both types of exchangeability in the modules of 1) and 2).

We discuss the set-to-set matching problem in a supervised setting, where examples of correct pairs of sets are provided as training data. The objective of the supervised learning is to train the feature extractor and matching layer, so that an appropriate set of features to be matched can be extracted. The model is then used to find a correct pair of sets from a group of candidates. We propose an efficient training framework for the proposed set-to-set matching architecture.

The effectiveness of our approach is demonstrated in two real-world applications. First, we consider fashion set recommendations, where examples of the outfits are used as the training data, to show correct combinations of items (clothes). Using a large number of examples of the outfits in the form of images, we aim to match the correct pair of defined sets through subset and superset matching. In the subset matching problem, we randomly split an outfit in half beforehand, to form two subsets, and then use them as a correct pair. Superset matching is a multiple-outfit version of subset matching. Next, we conduct experiments on a simple type of group re-identification DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 . The objective is to match the two groups composed of the same persons captured in individual images provided by the Market-1501 dataset zheng2015scalable , under “noisy” and “non-noisy” conditions. The proposed methods are compared with a baseline model, which in this case is a straightforward extension of the Set Transformer lee2019set , in the set-to-set matching problem.

The main contributions of this paper are as follows:

  • A novel deep learning architecture is proposed to provide the two types of exchangeability required for set-to-set matching.

  • The proposed feature extractor using the interactions between two sets is shown to extract better features for set-to-set matching.

  • The proposed models demonstrate better performance than the baseline for the fashion set recommendation and group re-identification tasks, supporting the claim that the interactions improve both the accuracy and robustness of the set-matching procedure.

Figure 1: Example of our architecture. and indicate an -th (one-layered) encoder, sharing weights within the same layer, and a fully connected layer, respectively. Here, we omitted the multihead structure in the function .

2 Preliminaries: Set-to-Set Matching

2.1 Problem Formulation

To describe the task of matching two sets, we introduce the necessary notation. Let be a finite set of all items. Sets and as data are subsets of , where and ; hence and . Let be feature vectors representing the features of and , respectively. Let and be sets of these feature vectors, where .

The function calculates a matching score between the two sets and . Guaranteeing the exchangeability of the set-to-set matching requires that the matching function is symmetric and invariant under any permutation of items within each set (see Section 2.2).

We consider tasks where the matching function is used to select a correct matching. Given candidate pairs of sets , where and , we choose as a correct one so that achieves the maximum score from amongst the candidates. In this study, a supervised learning setting is considered, where the function is trained to classify the correct pair and unmatched pairs. The details of the training method are deferred to Section 3.4.

2.2 Mappings of Exchangeability

We present a brief review of the foundational elements of our models and set-input functions, to demonstrate the exchangeability they realize.

2.2.1 Permutation Invariance

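A set-input function $f$ of two sets $\mathcal{X}$ and $\mathcal{Y}$ is said to be permutation invariant if

$$f(\mathcal{X}, \mathcal{Y}) = f(\pi_x \mathcal{X}, \pi_y \mathcal{Y}) \qquad (1)$$

for permutations $\pi_x$ on the items of $\mathcal{X}$ and $\pi_y$ on the items of $\mathcal{Y}$.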

2.2.2 Permutation Equivariance

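A map $f$ of two sets $\mathcal{X}$ and $\mathcal{Y}$ is said to be permutation equivariant if

$$f(\pi_x \mathcal{X}, \pi_y \mathcal{Y}) = \pi_x f(\mathcal{X}, \mathcal{Y}) \qquad (2)$$

for permutations $\pi_x$ and $\pi_y$, where $\pi_x$ and $\pi_y$ are on the items of $\mathcal{X}$ and $\mathcal{Y}$, respectively. Note that $f$ is permutation invariant for permutations within $\mathcal{Y}$.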

2.2.3 Symmetric Function

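A map $f$ of two sets $\mathcal{X}$ and $\mathcal{Y}$ is said to be symmetric if

$$f(\mathcal{X}, \mathcal{Y}) = f(\mathcal{Y}, \mathcal{X}) \qquad (3)$$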

2.2.4 Two-Set-Permutation Equivariance

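Given two sets $\mathcal{X}$ and $\mathcal{Y}$, a map $f$ from a pair of sets to a pair of sets is said to be two-set-permutation equivariant if

$$f(\sigma(\mathcal{X}, \mathcal{Y})) = \sigma f(\mathcal{X}, \mathcal{Y}) \qquad (4)$$

for any permutation operator $\sigma$ exchanging the two sets.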
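The four properties of Section 2.2 can be checked numerically on toy functions. The sketch below is illustrative only: `mean_pair_score`, `shift_by_other_mean`, and `swap_layer` are hypothetical stand-ins chosen because they are easy to verify, not the functions used in the proposed model.

```python
import numpy as np

def mean_pair_score(X, Y):
    """Toy score: mean inner product over all cross-set item pairs."""
    return float((X @ Y.T).mean())

def shift_by_other_mean(X, Y):
    """Toy per-item map: each item of X is shifted by the mean item of Y."""
    return X + Y.mean(axis=0)

def swap_layer(X, Y):
    """Toy two-set map that applies the same function to both orderings."""
    return shift_by_other_mean(X, Y), shift_by_other_mean(Y, X)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # 3 items with 4-dimensional features
Y = rng.normal(size=(5, 4))  # 5 items with 4-dimensional features
px, py = rng.permutation(3), rng.permutation(5)

# Permutation invariance: reordering items inside each set changes nothing.
is_invariant = np.isclose(mean_pair_score(X[px], Y[py]), mean_pair_score(X, Y))
# Permutation equivariance in the first argument, invariance in the second.
is_equivariant = np.allclose(shift_by_other_mean(X[px], Y[py]),
                             shift_by_other_mean(X, Y)[px])
# Symmetry: exchanging the two sets changes nothing.
is_symmetric = np.isclose(mean_pair_score(Y, X), mean_pair_score(X, Y))
# Two-set-permutation equivariance: exchanging inputs exchanges outputs.
A, B = swap_layer(X, Y)
B2, A2 = swap_layer(Y, X)
is_two_set_equivariant = np.allclose(A, A2) and np.allclose(B, B2)
```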

3 Matching and Learning for Sets

In this section, the set-to-set matching problem is described. Based on the problem scenario described in Section 2.1, we state the design motivations, including a concrete example of a real-life application in Section 3.1. We then describe in detail the architecture of 2) the matching layer, cross-similarity function in Section 3.2, and 1) the feature extractor, cross-set transformation in Section 3.3. Finally, we discuss training procedures for our model in Section 3.4. Figure 1 shows the proposed architecture.

3.1 Motivation

The set-to-set matching problem motivates us to construct a specific architecture, one which includes the interactions between the set members of different sets. As an example, we take the case of set-to-set fashion item recommendations; here and represent the outfits or subsets/supersets of the outfits, and set members and are fashion items coordinating the outfits. Using a conventional deep convolutional neural network, e.g., Inception-v3 szegedy2016rethinking , which will be described later, we can extract image features and from garment images that contain individual fashion items representing and , respectively. Assuming that the image features represent not only low-level features such as colors and edges but also high-level ones, such as style features DBLP:journals/corr/abs-1807-03133 affecting the impression of the outfit, taking into account combinations of items is required to fully consider fashion compatibility han2017learning ; he2016learning . Hence, we consider that 1) using the interactions between set members of different sets is crucial in feature extraction, and also that, after the feature extraction, 2) the resulting features of the two sets must be measured by estimating a score of the possibility of being the correct pair. Finally, the supervised framework is required to learn the appropriate features to match.

3.2 Calculating Matching Score for Sets

We introduce a matching layer to calculate the matching score between two given sets, mapping . It is designed to calculate the inner product for every combination of set members across both sets, so we call this cross-similarity (CS), which is defined as follows:


where and are feature vectors in and , respectively, is a linear function allowing conversions into a lower-dimensional space using weights , e.g., , is a non-negative mapping, e.g., ReLU glorot2011deep , and is the number of dimensions of the lower-dimensional space. CS can be seen as a calculation of the average similarity in the linear subspaces created by the dimensionality reduction . Note that CS corresponds to the (normalized) inner product in the linear subspace if both sets contain only one set member.
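As a rough illustration of the description above, the following sketch averages a ReLU-rectified inner product over every cross-set pair after a shared linear projection. The exact normalization is an assumption: dividing by the square root of the projected dimension is one plausible reading of "(normalized) inner product", and the names `cross_similarity` and `W` are hypothetical.

```python
import numpy as np

def cross_similarity(X, Y, W):
    """Average rectified similarity over all cross-set pairs.

    X: (N, d) features of the first set, Y: (M, d) features of the second set,
    W: (d_low, d) shared projection weights (assumed form of the linear map).
    """
    PX = X @ W.T                       # (N, d_low) projected items of X
    PY = Y @ W.T                       # (M, d_low) projected items of Y
    d_low = W.shape[0]
    sims = PX @ PY.T / np.sqrt(d_low)  # assumed scaling of the inner product
    return float(np.maximum(sims, 0.0).mean())  # ReLU, then average over N*M pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))
Y = rng.normal(size=(4, 6))
W = rng.normal(size=(2, 6))
score = cross_similarity(X, Y, W)
```

Because the same projection is applied to both sets and the pairwise terms are simply averaged, the score does not depend on item order or on which set is passed first.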

Instead of calculating CS singly, we utilize multiple CSs (mCS) to combine the cross-similarities calculated with different linear mappings. The procedure runs as follows:


where indicates the concatenation, and the linear function with weights maps . Because CS is permutation invariant (definition in Eq. (1)), mCS is also permutation invariant:

Property 1.

Both CS and mCS are permutation invariant.

Additionally, because CS is symmetric (definition in Eq. (3)), mCS is symmetric as well:

Property 2.

Both CS and mCS are symmetric.

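This symmetric property entails that CS and mCS satisfy the exchangeability criterion for the pair of sets, i.e., $\mathrm{CS}(\mathcal{X}, \mathcal{Y}) = \mathrm{CS}(\mathcal{Y}, \mathcal{X})$ and $\mathrm{mCS}(\mathcal{X}, \mathcal{Y}) = \mathrm{mCS}(\mathcal{Y}, \mathcal{X})$.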
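Properties 1 and 2 can be checked on a small sketch of mCS: several cross-similarities with different projections are concatenated and mapped to a scalar by a linear function. The form of `cross_similarity` (including its scaling) and the names `Ws` and `w_out` are assumptions made for illustration.

```python
import numpy as np

def cross_similarity(X, Y, W):
    """Assumed CS form: mean ReLU of scaled inner products after projection W."""
    sims = (X @ W.T) @ (Y @ W.T).T / np.sqrt(W.shape[0])
    return float(np.maximum(sims, 0.0).mean())

def multi_cross_similarity(X, Y, Ws, w_out):
    """Concatenate one CS value per projection, then apply a linear map to a scalar."""
    scores = np.array([cross_similarity(X, Y, W) for W in Ws])
    return float(scores @ w_out)

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 6))
Y = rng.normal(size=(5, 6))
Ws = [rng.normal(size=(2, 6)) for _ in range(4)]  # four hypothetical projections
w_out = rng.normal(size=4)

score = multi_cross_similarity(X, Y, Ws, w_out)
score_swapped = multi_cross_similarity(Y, X, Ws, w_out)          # exchange the sets
score_permuted = multi_cross_similarity(X[rng.permutation(3)],   # reorder the items
                                        Y[rng.permutation(5)], Ws, w_out)
```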

Next, to allow for comparison against the scores for other matching candidates, the output of mCS or CS is fed into the loss function, e.g., a softmax function with cross-entropy loss. That is, the task of maximizing the matching score is translated into a minimization of the loss function, using the given label.
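A minimal sketch of this step, assuming the matching scores of all candidates for one reference set are collected in a vector and the index of the correct candidate is known. The function name `candidate_cross_entropy` is hypothetical.

```python
import numpy as np

def candidate_cross_entropy(scores, correct_index):
    """Softmax cross-entropy over candidate matching scores."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()                       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    return float(-log_probs[correct_index])

# The loss is small when the correct candidate has the largest score.
loss_good = candidate_cross_entropy([2.0, 0.0, 0.0], correct_index=0)
loss_bad = candidate_cross_entropy([2.0, 0.0, 0.0], correct_index=1)
```

Minimizing this loss pushes the score of the correct pair above the scores of the other candidates.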

3.3 Cross-Set Feature Transformation

We construct the architecture of the feature extractor, which transforms sets of features using the interactions between the pair of sets, and extracts the desired features to be matched in the post-processing stages.

Here, consider the transformation of a pair of set-feature vectors into new feature representations on . Let be the iteration (layer) number of the cross-set transformation layers. Our feature extraction then can be described as a map of , where , , , , , and . For example, denotes the feature vector extracted by the -th layer representing , and is defined similarly. Note that the initial feature vectors with are found with a typical feature extractor, i.e., a deep convolutional neural network for an individual image. Then, we construct a parallel architecture, cross-set transformation, with an asymmetric transformation , as follows:


where or is a permutation equivariant function for the first argument (defined in Eq. (2)), that transforms the set features in the first argument into new feature representations, regardless of the order of the set features in the second argument. Furthermore, residual paths he2016deep may be used in Eq. (9) if required.

We propose two possible feature extractors for : an attention-based function, and an affinity-based function. Both are constructed to assign the “matched” feature vectors to the “reference” feature vector, taking account of interactions between the two sets. For simplicity, we provide an explanation via the case of extracting the features for as follows (we can easily exchange and for ).

The attention-based function of maps as follows:


where , , , , and denotes a linear transformation, e.g., . Here, if a multihead structure is not utilized, as described later. It is clear that the attention-based function transforms the respective feature vectors, giving the interactions between the set members of and .

Note that our attention-based function has a strong relation to dot-product attention DBLP:journals/corr/VaswaniSPUJGKP17 ; lee2019set , which has in the past been introduced to calculate the weighted average on using as the coefficients. However, the operation would be inconsistent with our matching objective, as through normalization it increases the coefficients even in unmatched cases of and . To preserve non-linearity, we use instead the non-negative weighted sum and then average it using Eq. (10).
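The difference from softmax attention can be sketched as follows: the coefficients are rectified rather than normalized, and the weighted sum is averaged over the second set. This is a sketch under assumptions; the scaling by the square root of the feature dimension, the absence of a residual path, and the names `Wq`, `Wk`, and `Wv` are not taken from the text.

```python
import numpy as np

def attention_transform(X, Y, Wq, Wk, Wv):
    """Transform the items of X using non-negative weights computed against Y."""
    Q = X @ Wq.T                                              # (N, d) queries from X
    K = Y @ Wk.T                                              # (M, d) keys from Y
    V = Y @ Wv.T                                              # (M, d) values from Y
    weights = np.maximum(Q @ K.T / np.sqrt(Q.shape[1]), 0.0)  # ReLU, no softmax
    return weights @ V / len(Y)                               # average over Y's items

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 6))
Y = rng.normal(size=(5, 6))
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))

out = attention_transform(X, Y, Wq, Wk, Wv)
px, py = rng.permutation(3), rng.permutation(5)
out_perm = attention_transform(X[px], Y[py], Wq, Wk, Wv)
```

With rectified weights, an item of X that matches nothing in Y receives coefficients near zero instead of a distribution forced to sum to one.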

The affinity-based function of maps as follows:


where , , and . Using the two linear transformations and , the affinity-based function combines the resembling features so that the resulting feature vectors for have similar representations to the (linearly transformed) vectors in .

Other simple permutation equivariant functions of , i.e., , may be utilized. However, without any interactions between the two sets, we consider such a function incapable of extracting features rich enough for accurate set-to-set matching.

Instead of performing singly, we introduce a multihead structure DBLP:journals/corr/VaswaniSPUJGKP17 to our feature extractor , which is a permutation equivariant function mapping . Denoting the output of as , the multihead version of is defined as , where indicates a concatenation for each corresponding set member in , , and . Note that the multihead structure is related to recent models such as MobileNet DBLP:journals/corr/HowardZCKWWAA17 , which isolates and places the convolutional operations in parallel to reduce the calculation costs whilst preserving the accuracy of the recognition. We assume that the multihead structure provides various interactions between the set members, reducing the calculation costs as well.

3.3.1 Analysis

We briefly show that our architecture is permutation invariant and symmetric using trivial solutions as follows. Note that the independent operations for each set member are not related to the above properties, so we focus on the matching layer of CS or mCS and the feature extractor of the cross-set transformation.

Proposition 1.

A composition of the function or with the cross-set transformation , i.e., or , is symmetric and permutation invariant.


For permutation invariance, because a composite function of a permutation equivariant function and permutation invariant function or (described in Property 1) is permutation invariant, both and are permutation invariant.

To argue that our architecture is symmetric, we use the fact that the symmetric function composed with the two-set-permutation equivariant function defined in Eq. (4) is symmetric, i.e., . Assuming that the cross-set transformation exhibits the weight-sharing structure, is a two-set-permutation equivariant function, i.e., , where and for any permutation operator exchanging the responses of and . Thus, we have the following:

Property 3.

The cross-set transformation is equivariant under two-set-permutation.

Using Property 2 and Property 3, we prove Proposition 1, which guarantees the exchangeability of the set-to-set matching procedure. ∎

Note that we can stack cross-set transformations in a way that preserves the symmetric architecture within multiple layers, by combining with other networks that operate upon the sets or items independently. We discuss the overall architecture in Section 5.2.
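The argument above can be checked numerically. In the sketch below, the same function `g` (a hypothetical stand-in with a residual path, not the attention-based or affinity-based function itself) is applied to both orderings of the pair, the layer is stacked twice, and a symmetric score is computed on the result.

```python
import numpy as np

def g(X, Y):
    """Stand-in shared transformation: residual plus averaged, ReLU-weighted items of Y."""
    weights = np.maximum(X @ Y.T, 0.0)
    return X + weights @ Y / len(Y)

def cross_set_layer(X, Y):
    """Weight sharing: the same g produces the new features of both sets."""
    return g(X, Y), g(Y, X)

def stacked(X, Y, num_layers=2):
    for _ in range(num_layers):
        X, Y = cross_set_layer(X, Y)
    return X, Y

def score(X, Y):
    """Symmetric, permutation-invariant toy matching score."""
    return float(np.maximum(X @ Y.T, 0.0).mean())

rng = np.random.default_rng(3)
X = 0.1 * rng.normal(size=(3, 4))
Y = 0.1 * rng.normal(size=(5, 4))

fx, fy = stacked(X, Y)
gy, gx = stacked(Y, X)  # exchanging the inputs should exchange the outputs
s_xy = score(*stacked(X, Y))
s_yx = score(*stacked(Y, X))
```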

3.4 Training for Pairs of Sets

We attempt to train our model efficiently using multiple correct pairs taken together. As described in the problem formulation, candidates are provided to find the correct one-to-one matching. Here, candidates per reference set of are fed into the matching process in each training iteration. However, if we prepare different candidates for each reference set of , the calculations for data processing would be inefficient.

To train our model efficiently, we create matching candidates from the correct pairs. Let be a correct pair of sets, where . From those -pair, by extracting all , we can create the set of as . That is, is composed of sets exhibiting correct relations to the respective , and can be used as a set of candidates for each in the training stage. Note that we can construct these candidates by assuming that one correct pair exists for the respective sets, as described in the problem formulation. An example is shown in Figure  2.

Compared with typical mini-batch training, supposing that the set size is on average, the -pair training method utilizes data (images) per training iteration; this can be regarded as the size of the mini-batch. We regard the above training method as a set version of the -pair training loss NIPS2016_6200 , so we call it the -pair-set loss herein.
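One way to sketch this training scheme: every reference set is scored against the second set of every correct pair in the batch, and a softmax cross-entropy is applied with the own partner as the correct candidate. The function `k_pair_set_loss` and the toy scorer are hypothetical; averaging the loss over the reference sets is an assumption.

```python
import numpy as np

def k_pair_set_loss(pairs, score_fn):
    """pairs: list of (X_k, Y_k) correct pairs; candidates are all Y_j in the batch."""
    scores = np.array([[score_fn(X, Y) for (_, Y) in pairs] for (X, _) in pairs])
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # the correct candidate is on the diagonal

def mean_pair_score(X, Y):
    """Toy matching score used only for this illustration."""
    return float((X @ Y.T).mean())

K = 4
basis = 3.0 * np.eye(K)
# Toy data: the two sets of pair k contain copies of the same basis vector.
pairs = [(np.tile(basis[k], (2, 1)), np.tile(basis[k], (3, 1))) for k in range(K)]

loss_matched = k_pair_set_loss(pairs, mean_pair_score)
loss_uninformative = k_pair_set_loss(pairs, lambda X, Y: 0.0)
```

Reusing the partners of the other pairs as negatives means that no additional candidate sets have to be loaded for each reference set.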

If the quantity of set data is not large, we can use other training frameworks, i.e., a triplet loss with the softplus function DBLP:journals/corr/HermansBL17 ; we can use the triplet loss for the relations among the reference set of , the positive candidate , and the negative candidate , where . We demonstrate that the above two training frameworks can be adopted for training our models in the experiment.
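A sketch of the triplet variant, assuming the loss is the softplus of the difference between the matching score of the negative candidate and that of the positive candidate. The sign convention is an assumption made because the model outputs similarity scores rather than distances; `triplet_softplus_loss` is a hypothetical name.

```python
import numpy as np

def triplet_softplus_loss(score_pos, score_neg):
    """softplus(score_neg - score_pos) = log(1 + exp(score_neg - score_pos))."""
    return float(np.logaddexp(0.0, score_neg - score_pos))

loss_correct_order = triplet_softplus_loss(score_pos=3.0, score_neg=0.0)
loss_wrong_order = triplet_softplus_loss(score_pos=0.0, score_neg=3.0)
loss_tie = triplet_softplus_loss(score_pos=1.0, score_neg=1.0)
```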

Figure 2: Example of -pair-set.

4 Related Works

4.1 Set-Input Methods

Deep learning architectures for set data are an active research area and have been well studied DBLP:journals/corr/LiCZL16 ; vinyals2015order ; lee2019set ; DBLP:journals/corr/ZaheerKRPSS17 . In the work of Lee et al. lee2019set , a set-feature representation was introduced by applying a self-attention based Transformer DBLP:journals/corr/VaswaniSPUJGKP17 to set data. An encoder–decoder model called Set Transformer transforms the set data into a vector/matrix representation in the feature space. Zaheer et al. DBLP:journals/corr/ZaheerKRPSS17 derived a condition for the property of permutation invariance/equivariance in functions, and introduced an operator referred to as deep sets. These models can manage set data that serve multiple objectives, such as set classification, calculation from images, text retrieval, etc. However, constructing a deep learning model that can manage multiple sets has not been well studied.

4.2 Methods for Matching Two Sets

Matching multiple data points is related to the problem of comparing two distributions gretton2005measuring ; muandet2017kernel ; poczos2012support ; muandet2012learning ; DBLP:journals/corr/ZhangXHL17 ; DBLP:journals/corr/LiCCYP17 . However, to the best of our knowledge, deep learning for matching two sets of data has not been well studied.

4.3 Applications

Many fashion item recommendation studies have investigated natural combinations of fashion items, the so-called visual fashion compatibility, to recommend fashion items or outfits han2017learning ; he2016learning ; hsiao2018creating ; DBLP:journals/corr/abs-1804-09979 ; DBLP:journals/corr/abs-1807-03133 ; DBLP:journals/corr/abs-1803-09196 ; DBLP:journals/corr/abs-1902-08009 ; DBLP:journals/corr/abs-1810-02443 . In this study, the main difficulties of the subset/superset matching procedures lie in satisfying the fashion compatibility requirements of the matched sets.

In the applications of group re-identification DBLP:journals/corr/LisantiMBF17 ; Xiao:2018:GRL:3240508.3240539 ; DBLP:journals/corr/abs-1905-07108 or multi-shot person re-identification wang2014person , problems of multiple instance matching arise. One group re-identification scenario has been proposed that requires the detection of known groups from videos DBLP:journals/corr/abs-1905-07108 . Because our experiments focus on set-to-set matching using given cropped images, we regard it as a simple type of group re-identification.

5 Experiments

5.1 Baseline for Comparison

We validate our architecture through comparison with other set-matching models. However, to the best of our knowledge, studies using deep neural networks for matching two sets are non-existent. Instead, we use an extension from the state-of-the-art set-input method to a set-to-set matching procedure, and consider this acceptable for the comparison.

We straightforwardly extend the Set Transformer lee2019set to a method for matching two sets. Here, the Set Transformer transforms a set of feature vectors into a vector on . Denoting the Set Transformer model , we perform the extension by calculating the matching score between the two sets and via the inner product , sharing the weights between the two . Note that this extension of the Set Transformer satisfies the exchangeability criteria for set-to-set matching; however, it provides no interactions between the two sets. The other processes and scenarios are the same as in our methodology.
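The structure of this baseline can be sketched with a much simpler stand-in encoder. In the code below, `encode_set` uses mean pooling followed by a linear map, which is not the Set Transformer; it only illustrates how a shared permutation-invariant encoder combined with an inner product yields an exchangeable score without any interaction between the two sets.

```python
import numpy as np

def encode_set(X, W):
    """Stand-in set encoder (NOT the Set Transformer): mean pooling, then a linear map."""
    return W @ X.mean(axis=0)

def baseline_score(X, Y, W):
    """Each set is encoded independently with shared weights, then compared."""
    return float(encode_set(X, W) @ encode_set(Y, W))

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 6))
Y = rng.normal(size=(5, 6))
W = rng.normal(size=(4, 6))

s = baseline_score(X, Y, W)
s_swapped = baseline_score(Y, X, W)
s_permuted = baseline_score(X[rng.permutation(3)], Y[rng.permutation(5)], W)
```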

Comparing the experimental results of our models with the results of the extended Set Transformer-based model serves as a performance comparison and also an evaluation for our models, providing insight into whether our architecture is valid or not.

5.2 Overall Architecture

In applications of our model, we use an encoder–decoder structure, inspired by the Transformer models DBLP:journals/corr/VaswaniSPUJGKP17 ; lee2019set . As an example, Vaswani et al. regarded the Transformer as an encoder–decoder model for text translation DBLP:journals/corr/VaswaniSPUJGKP17 ; the encoder transforms a set of features within the input domain, and the decoder transforms the resultant set of features onto the output domain. Because the translation is unidirectional, the one-encoder–decoder structure is included in the Transformer. Meanwhile, an iterative model of the encoder–decoder, i.e., the Stacked Hourglass model, has been proposed and demonstrates a high accuracy in the task of human pose estimation DBLP:journals/corr/NewellYD16 . Borrowing from the above architectures, we construct our overall architecture by combining the encoder lee2019set , which is a permutation equivariant function called a self-attention block, with the decoder, which is a function of our cross-set transformation. We then repeat the encoder–decoder structure times in succession (described in Figure 1).

For the experiments, we construct our models as follows. We set both the number of cross-set transformation layers and the number of encoder layers to 2. That is, we iteratively perform the one-layered encoder and the one-layered cross-set transformation two times in succession. Here, the resultant set of feature vectors extracted from each encoder model is fed into the next cross-set transformation and also the respective residual paths. Alongside this, we apply a feed-forward network, which comprises two-layered linear transformations with a leaky ReLU maas2013rectifier to the first argument of each function . Note that we use the two types of function composing the cross-set transformation: the attention-based function in Eq. (10) and the affinity-based function in Eq. (11).

To extract the individual feature vector from one of the images within the set, we use convolutional neural networks (CNNs) as the feature extractor. We use different CNNs for each task. For the task of the fashion set recommendation, we use the Inception-v3 szegedy2016rethinking model, which is pre-trained using the ILSVRC-2012 ImageNet dataset russakovsky2015imagenet . Using this model, we extract the feature vector on from the second-last layer placed between the global average pooling layer and the fully connected layer. We linearly transform the feature vector into to serve as one of the set members, then the two sets of collected feature vectors are fed into the set-input functions. On the other hand, we utilize a simple four-layered CNN without any pre-training for the group re-identification task. This CNN transforms an RGB image into the feature vector mapping channels using kernels, and we then apply global average pooling so that the resultant feature vector is on as well.

5.3 Training Details

In the training stage, we set the number of match candidates to 16, 4, and 16 for subset matching, superset matching, and group re-identification, respectively.

We use a stochastic gradient descent method with a learning rate of 0.005, a momentum of 0.5, and a weight decay of 0.00004. The learning rate is decayed by a factor of 0.7 every 16 epochs. For fashion set recommendations, we set the maximum number of epochs to 32, which requires a week for training on Amazon SageMaker ml.p3.8xlarge. For group re-identification, we set the maximum number of epochs to 128, which takes a few hours. We train both the CNN and the set matching model simultaneously.
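The step schedule described above can be written as a small function. The zero-based epoch index and the choice to apply the first decay at epoch 16 are assumptions; `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, base_lr=0.005, decay=0.7, step=16):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Rates at the start of each 16-epoch block for a 128-epoch run.
schedule = [learning_rate(epoch) for epoch in range(0, 128, 16)]
```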

In the selection of the loss function, to reduce the learning time, we use the -pair-set loss for the tasks of superset matching and group re-identification. We also use the triplet loss with softplus function for the subset matching.

5.4 Fashion Set Recommendation

5.4.1 Dataset

We examine the set-to-set matching for the fashion set recommendation using the IQON dataset DBLP:journals/corr/abs-1807-03133 . IQON is a user-participating fashion web service sharing outfits for women. The IQON dataset consists of recently created, high-quality outfits, including 199,792 items grouped into 88,674 outfits. We split these outfits into groups, using 70,997 for training, 8,842 for validation, and 8,835 for testing.

For training with the IQON dataset, we set the maximum and minimum numbers of items for each outfit as 8 and 4, respectively; if the outfit contains more than 8 items, then we randomly select 8 items from it. The outfits contain roughly 5.5 items on average.

5.4.2 Preparing Set Pairs

To construct the correct pair of sets to be matched, we randomly halve the given outfit into two non-empty proper subsets and as follows: , where . We use different splitting methods for subset matching and superset matching.
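A minimal sketch of the halving step, combined with the cap of 8 items per outfit from Section 5.4.1. How an odd number of items is divided is an assumption (here the second subset receives the extra item), and `halve_outfit` is a hypothetical name.

```python
import random

def halve_outfit(items, rng, max_items=8):
    """Randomly split an outfit into two disjoint, non-empty subsets."""
    # Sampling without replacement both caps the outfit size and shuffles it.
    shuffled = rng.sample(list(items), min(len(items), max_items))
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)
outfit = ["coat", "shirt", "skirt", "boots", "bag", "scarf"]  # made-up item names
subset_x, subset_y = halve_outfit(outfit, rng)
```

With at least 4 items per outfit, both subsets contain at least 2 items, so neither is empty.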

We expect our model to reconstruct the original outfit by combining two subsets, and , provided such an inverse mapping exists. In the reconstruction, we assume that the desired features either remain within both subsets or are extracted during matching. For example, we regard the desired features as the discriminative features, which serve to recognize the fashion compatibility han2017learning ; he2016learning or infer the visual style of the outfit. That is, in the matching of the two subsets, such desired features to be matched must be obtained.

We perform our experiments using these subsets, to try to find the correct pairs. Here, we consider matching two subsets and ; we call this problem subset matching. In the subset matching, subsets are provided as a set of matching candidates, whilst maintaining the category restrictions for each fashion item. That is, these candidates only contain the same-category fashion items and are fed into the training or testing stages. Note that without any category restrictions, the models tend to be trained to select the candidate that contains non-overlapped fashion category items, e.g., shoes, with the reference subset . To avoid this situation, we introduce category restrictions to the candidates in each training/testing iteration.

Also, we extend the problem of subset matching to superset matching which presents more complex problem situations. We choose outfits randomly and split the respective outfits randomly in half , where . Then we create two supersets and . These two supersets serve as a correct pair for the superset matching problem. We consider the superset as a multimodal/mixture set, which consists of multiple fashion styles, such that the matching problem is one of finding similar supersets in terms of these mixed fashion styles. Because each superset has a category overlap of fashion items themselves, providing category restrictions to the candidates is not necessarily required in the superset matching, so we do not give the restrictions. Note that we used in the training stage and selected in the test stage.
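The superset construction can be sketched in the same way: several outfits are drawn, each is halved at random, and the halves are pooled into two supersets. The helper name `make_superset_pair` and its `mix` parameter (the number of outfits mixed) are hypothetical, and the outfits below are made-up identifiers.

```python
import random

def make_superset_pair(outfits, mix, rng):
    """Pool random halves of `mix` outfits into two supersets forming a correct pair."""
    superset_x, superset_y = [], []
    for outfit in rng.sample(outfits, mix):
        shuffled = rng.sample(outfit, len(outfit))  # random order of this outfit
        half = len(shuffled) // 2
        superset_x.extend(shuffled[:half])
        superset_y.extend(shuffled[half:])
    return superset_x, superset_y

rng = random.Random(0)
outfits = [[f"outfit{i}_item{j}" for j in range(4)] for i in range(5)]
superset_x, superset_y = make_superset_pair(outfits, mix=2, rng=rng)
```

Each superset therefore contains items from every mixed outfit, which is what makes the pair a match in terms of mixed styles.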

5.4.3 Subset/Superset Matching

We discuss the experimental results of matching subsets/supersets. Figures 3(a) and 3(b) show the differences between the learning curves for our models and the baseline, the Set Transformer-based model. Our models seemed to converge well in both the subset and superset matching tasks, while the learning curve of the baseline model showed higher losses compared with our models, indicating the difficulties of learning with the baseline model in subset/superset matching. Moreover, Table 1 shows a clear gap in accuracy between our models and the baseline. Here, Cross Attention and Cross Affinity denote our models with the attention-based and affinity-based functions, respectively. In both the subset and superset matching, our models demonstrated better results compared with the baseline. For example, in the simplest scenario for superset matching with Cand: 4 and Mix: 2, the accuracies of the baseline and our model with affinity-based function were 73.5% and 90.6%, respectively.

(a) Learning curves for subset matching.
(b) Learning curves for superset matching.
Figure 3: Learning curves. Here, the training losses are smoothed by a moving average with a window size of 30 for visualization. Cross_AT, Cross_AF, and Set_Trans denote our model with attention-based function, our model with affinity-based function, and the baseline model, respectively.
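The smoothing mentioned in the caption can be sketched as below. The source only states that a moving average with a window size of 30 is used; the trailing (rather than centered) window and the handling of the first few points are assumptions made for illustration, and the function name is hypothetical.

```python
def moving_average(values, window=30):
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

For instance, `moving_average([1, 2, 3, 4], window=2)` returns `[1.0, 1.5, 2.5, 3.5]`.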

Comparing the attention-based and affinity-based functions, the learning curves in Figure 3 show almost the same losses; however, Table 1 shows that the affinity-based function achieved higher accuracy in every subset and superset matching setting.

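| Method | Subset, Cand:4 | Subset, Cand:8 | Superset, Cand:4, Mix:2 | Superset, Cand:4, Mix:4 | Superset, Cand:8, Mix:2 | Superset, Cand:8, Mix:4 |
| --- | --- | --- | --- | --- | --- | --- |
| Set Trans. (baseline) | 39.2 | 22.7 | 73.5 | 65.3 | 57.5 | 49.6 |
| Cross Attention | 58.1 | 41.9 | 88.8 | 74.3 | 80.6 | 58.9 |
| Cross Affinity | 60.2 | 43.3 | 90.6 | 75.9 | 82.8 | 61.9 |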
Table 1: Accuracy of subset/superset matching (%). Cand and Mix indicate the number of candidates for matching and the number of outfits mixed in the supersets, respectively.
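To make the accuracy figures easier to interpret, the following sketch shows one way such a candidate-selection accuracy could be computed. It assumes that each trial has exactly one correct candidate and that the model outputs one matching score per candidate; both the function names and this protocol are assumptions for illustration, not details confirmed by the source.

```python
def top1_accuracy(score_rows, correct_indices):
    """Percentage of trials in which the highest-scoring candidate is the correct one."""
    hits = 0
    for scores, correct in zip(score_rows, correct_indices):
        predicted = max(range(len(scores)), key=lambda j: scores[j])
        hits += int(predicted == correct)
    return 100.0 * hits / len(score_rows)

def chance_level(n_candidates):
    """Expected accuracy (%) of uniform random guessing with one correct candidate."""
    return 100.0 / n_candidates
```

Under this assumption, random guessing would give 25% for Cand:4 and 12.5% for Cand:8, which provides a reference point for the values in Table 1.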

5.5 Group Re-Identification

We present the results of group re-identification, where the task is to identify the pair of sets which consist of individual images of the same persons. We report our results using an extension of a well-known person re-identification dataset, Market-1501 zheng2015scalable , which provides cropped images containing individual persons alongside their respective labels. We evaluated the accuracy using the training/validation data, including the query/reference splits, which were divided by the author. Here we regard sets of reference and query data as and , respectively.

For the experiment, we construct image sets composed of multiple persons from Market-1501. Each set consists of 3–8 persons and contains three different images of each person. We evaluate both the matching accuracy of the models and their robustness to “noise”. Here, “noise” means an unexpected person mixed into a group; in real-world applications, both the reference set and the query set may contain “non-target” persons owing to errors such as human error or misclassification by the bounding box detector. Such “noisy” situations cannot be avoided completely; therefore, investigating the robustness of group re-identification is crucial. Figure 4 shows an example that includes a “non-target” person. To investigate the robustness of the models for set-to-set matching under noisy conditions, we trained the models under the same “noisy” or “non-noisy” setting as was used in each test. In the test stage, we set the number of candidates to 5.
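The set construction described above can be sketched as follows. This is a simplified illustration under several assumptions: images are indexed by person id in two hypothetical dictionaries (one for reference data, one for query data), noise persons are added only to the query set, and all names are invented for the example rather than taken from the source.

```python
import random

def sample_images(images_by_person, person_ids, n_images, rng):
    """Draw n_images distinct images for each listed person."""
    items = []
    for pid in person_ids:
        items.extend(rng.sample(images_by_person[pid], n_images))
    return items

def make_noisy_pair(ref_images, qry_images, n_target, n_noise, rng, n_images=3):
    """Build a reference/query pair sharing n_target persons.

    The query set additionally receives n_noise "non-target" persons.
    """
    persons = rng.sample(sorted(ref_images), n_target + n_noise)
    targets, noise = persons[:n_target], persons[n_target:]
    reference = sample_images(ref_images, targets, n_images, rng)
    query = sample_images(qry_images, targets + noise, n_images, rng)
    return reference, query, targets, noise

rng = random.Random(0)
# Hypothetical toy data: 10 persons with 4 images each in both pools.
ref_images = {p: [f"ref_p{p}_img{i}" for i in range(4)] for p in range(10)}
qry_images = {p: [f"qry_p{p}_img{i}" for i in range(4)] for p in range(10)}
reference, query, targets, noise = make_noisy_pair(
    ref_images, qry_images, n_target=3, n_noise=1, rng=rng
)
```

Here the reference set holds 9 images (three persons) and the query set holds 12 (the same three persons plus one non-target person), similar in spirit to the example in Figure 4.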

Table 2 compares the results of our models with the baseline. In the non-noisy case, all models showed almost perfect accuracies; we consider “averaging feature vectors in sets” achieves the high matching accuracy in this simple case, so all the models showed accurate results. By increasing the ratio of noise, the accuracy was degraded across all models. However, our models were more accurate than the baseline, even in the noisy case. For example, when the “noise” was five persons in the set , and the remaining three persons were the correct target persons in both the sets and , the accuracies of the baseline and proposed model with affinity-based functions were 80.4% and 92.4%, respectively. The differences in the accuracies were preserved even in more noisy settings. Those results support the claim that providing the interactions between two sets improves both the accuracy and the robustness of the matching procedure.
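The remark about “averaging feature vectors in sets” can be illustrated with a toy example. The sketch below is not any of the evaluated models; it only shows, with hypothetical two-dimensional feature vectors, that a mean-pooled cosine score is symmetric in the two sets and that a single non-target vector lowers the score of an otherwise matching pair.

```python
import math

def mean_pool(vectors):
    """Average a set of equal-length feature vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool_score(set_x, set_y):
    """Matching score from averaged features; symmetric in the two sets."""
    return cosine(mean_pool(set_x), mean_pool(set_y))

# Hypothetical feature vectors for illustration only.
clean_x = [[1.0, 0.0], [0.8, 0.2]]
clean_y = [[0.9, 0.1], [1.0, 0.0]]
noisy_y = clean_y + [[0.0, 1.0]]  # one "non-target" feature vector mixed in

clean_score = mean_pool_score(clean_x, clean_y)
noisy_score = mean_pool_score(clean_x, noisy_y)
```

In this toy case `noisy_score` is lower than `clean_score`, which is consistent with the observation that accuracy degrades as the ratio of noise increases.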

Figure 4: An example of a correct pair for group re-identification in the noisy setting. contains four persons, including a “non-target” person who is not included in .
(Ratio of “noise” persons in , Ratio of “noise” persons in )
Set Trans. (baseline) 99.5 95.1 89.9 85.7 80.4 65.7 48.1
Cross Attention 99.6 96.9 94.8 91.9 90.7 72.9 56.1
Cross Affinity 99.7 96.5 92.5 94.4 92.4 72.0 61.7
Table 2: Accuracy of group re-id. (%).

6 Conclusion

In this paper, we investigated a set-to-set matching problem. We proposed a novel architecture including 1) the cross-set transformation and 2) the cross-similarity function, along with the training framework for a large amount of set-data.

We showed that our architecture preserves the two types of exchangeability, for the pair of sets and for the items within each set (Proposition 1), both of which must be satisfied in set-to-set matching.

We demonstrated that our models outperformed the baseline, a Set Transformer-based model, in experiments on fashion set recommendation and group re-identification. These results support the claim that feature representations extracted through interactions between the members of the two sets improve the accuracy and robustness of set-to-set matching.