
# MessyTable: Instance Association in Multiple Camera Views

We present an interesting and challenging dataset that features a large number of scenes with messy tables captured from multiple camera views. Each scene in this dataset is highly complex, containing multiple object instances that could be identical, stacked and occluded by other instances. The key challenge is to associate all instances given the RGB images of all views. This seemingly simple task surprisingly defeats many popular methods and heuristics that are commonly assumed to perform well in object association. The dataset challenges existing methods in mining subtle appearance differences, reasoning based on contexts, and fusing appearance with geometric cues for establishing an association. We report interesting findings with some popular baselines, and discuss how this dataset could help inspire new problems and catalyse more robust formulations to tackle real-world instance association problems. Project page: [MessyTable](https://caizhongang.github.io/projects/MessyTable/)

## 1 Introduction

We introduce a new and interesting dataset, MessyTable. It contains over 5,000 scenes, each of which is captured by nine cameras in one of the 600 configurations of camera poses. Each scene is shot with a random cluttered background and different lighting conditions, with about 30 general objects on average. The objects are chosen arbitrarily from 120 classes of possible instances. Figure 1 depicts some scene examples. The goal is to associate the different objects in a scene, i.e., to find the correct match for each instance across views.

The seemingly easy task on this dataset is surprisingly challenging. The relative pose between two cameras can be large, and therefore, an object may appear to be very different when viewed from different angles. Objects are heavily occluded while some of them can be elevated by other objects in a cluttered scene. Hence, appearance viewed from different cameras is always partial. The problem is further complicated with similar-looking or even identical instances.

Solving the aforementioned problems is non-trivial. The geometric constraint is hard to use right away. Multi-view epipolar geometry is ambiguous when a pixel can correspond to all points on the epipolar line in the other view. Homographic projection assumes a reference plane, which is not always available. To associate an object across camera views, a method needs to distinguish subtle differences between similar-looking objects. Fine-grained recognition is still a challenging problem in computer vision. When a scene contains identical-looking objects, the method is required to search for neighbouring cues in the vicinity of the objects to differentiate them. The neighbouring configuration, however, can be occluded and highly non-rigid with changing relative position due to different camera views.

While the method developed from MessyTable can be applied to some immediate applications such as automatic check-out [38] in supermarkets, e.g., leveraging multiple views to prevent counting errors on the merchandise, the instance association problem found in this dataset is reminiscent of many real-world problems such as person re-identification or object tracking across views. Both of these real-world problems require some sort of association between objects, either through appearances, group configurations, or temporal cues. Solving these real-world problems requires one to train a model using domain-specific data. Nonetheless, they still share challenges and concerns similar to those of the setting in MessyTable.

MessyTable is not expected to replace the functionalities of domain-specific datasets. It aims to be a general dataset offering fundamental challenges to existing vision algorithms, with the hope of inspiring new problems and encouraging novel solutions. In this paper, apart from describing the details of MessyTable, we also present the results of applying some baselines and heuristics to address the instance association problem. We also show how a deep learning-based method developed from MessyTable can be migrated to other real-world multi-camera domains and achieve good results.

## 2 Related Work

Related Problems. Various computer vision tasks, such as re-identification and tracking, can be viewed as forms of instance association. Despite differences in the problem settings, they share common challenges as featured in MessyTable, including subtle appearance difference, heavy occlusion and viewpoint variation. We take inspiration from methods in these fields for developing a potential solution for multi-camera instance association.

Re-identification (ReID) aims to associate the query instance (e.g., a person) and the instances in the gallery [11]. Association performance suffers from drastic appearance differences caused by viewpoint variation, and heavy occlusion in crowded scenes. Therefore, the appearance feature alone can be insufficient for satisfactory results. In this regard, instead of distinguishing individual persons in isolation (e.g.,[34, 47, 49]), an alternative solution proposed by [48] exploits contextual cues: as people often walk in groups in crowded scenes, it associates the same group of people over space and time.

Multi-object Tracking (MOT) is the task to associate instances across sequential frames, leveraging the availability of both appearance features and temporal cues[23, 31, 28]. It suffers from ID switches and fragmentation primarily caused by occlusion [21, 18]. MOT in a multi-camera setting is formally referred to as Multi-Target Multi-Camera Tracking (MTMCT) [25], which also suffers from viewpoint variation in cross-camera tracklet association [46, 9, 16]. In addition, MTMCT with overlapping field of view[41, 5] is similar to MessyTable’s multi-camera setting. Thus, studies conducted on MessyTable might be inspiring for a better cross-camera association performance in MTMCT.

Related Datasets. Many datasets for ReID and MOT offer prominent challenges [12] that are common in real life. For instance, CUHK03[17], MSMT17[37], and MOT16[22] feature occlusion and viewpoint variation, and many other datasets [43, 40] also feature illumination variations.

There are several multi-camera datasets. Though originally designed for different purposes, they can be used for evaluating instance association. MPII Multi-Kinect (MPII MK) [35] is designed for object instance detection and collected on a flat kitchen countertop with nine classes of kitchenwares captured in four fixed views. The dataset features some level of occlusion, but the scenes are relatively simple for evaluating general object association. EPFL Multi-View Multi-Class (EPFL MVMC) [26] contains only people, cars, and buses and is built from video sequences of six static cameras taken at the university campus (with the road, bus stop, and parking slots). WILDTRACK [5], captured with seven static cameras in a public open area, is the latest challenging dataset for multi-view people detection.

Compared to existing datasets, MessyTable aims to offer fundamental challenges that are not limited to specific classes. MessyTable also contains a large number of camera setup configurations for a larger variety of camera poses and an abundance of identical instances that are absent in other datasets.

## 3 MessyTable Dataset

MessyTable is a large-scale multi-camera general object dataset designed for instance association tasks. It comprises 120 classes of common on-table objects (Figure 2), encompassing a wide variety of sizes, colors, textures, and materials. Nine cameras are arranged in 567 different setup configurations, giving rise to 20,412 pairs of relative camera poses (Section 3.1). A total of 5,579 scenes, each containing 6 to 73 randomly selected instances, are divided into three levels of difficulties. Harder scenes have more occlusions, more similar- or identical-looking instances, and proportionally fewer instances in the overlapping areas (Section 3.2). The total 50,211 images in MessyTable are densely annotated with 1,219,240 bounding boxes. Each bounding box has an instance ID for instance association across cameras (Section 3.3). To make MessyTable more representative, varying light conditions and backgrounds are added. Details of the data collection can be found in the Supplementary Materials.
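The headline counts are mutually consistent, which a few lines of arithmetic confirm. The sketch below assumes that every unordered pair of the nine cameras in a setup counts as one relative pose and that each camera contributes exactly one image per scene; the variable names are illustrative only.

```python
from math import comb

num_cameras = 9
num_setups = 567
num_scenes = 5579
num_boxes = 1_219_240

# Unordered camera pairs within one setup: C(9, 2).
pairs_per_setup = comb(num_cameras, 2)
# Every setup contributes its own set of relative camera poses.
relative_pose_pairs = pairs_per_setup * num_setups
# One image per camera per scene (assumption stated above).
num_images = num_scenes * num_cameras
# Average annotation density.
boxes_per_image = num_boxes / num_images

print(pairs_per_setup, relative_pose_pairs, num_images, round(boxes_per_image, 1))
```

Under these assumptions the 36 pairs per setup reproduce the 20,412 relative poses and the 50,211 images quoted above, with roughly 24 bounding boxes per image.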

### 3.1 Variations in View Angles

For a camera pair with a large angle difference, the same instance may appear to be very different (e.g., instance ID 5 in Figure 1) in the two views. Existing multi-camera datasets typically have their cameras installed on static structures [35, 26], even at similar heights [5]. This significantly limits the variation as the data essentially collapses to a very limited set of modes. In contrast, MessyTable has not only a high number of cameras but also a large variation in cameras’ poses. The camera stands are arbitrarily adjusted between scenes, resulting in an extremely diverse distribution of camera poses and large variance of angle difference between cameras, as shown in Figure 3.

### 3.2 Variations in Scenes

Partial and Full Occlusions. As shown in Figure 4(a) and (b), partial occlusion results in loss of appearance features [1], making matching more difficult across cameras; full occlusion completely removes the object from one’s view despite its existence in the scene. To effectively benchmark algorithms, in addition to dense clutter, artificial obstacles (such as cardboard boxes) are inserted into the scene.

Similar- or Identical-looking Instances. It is common to have similar and identical objects placed in the vicinity as illustrated in Figure 4(c) and (d). Similar or identical instances are challenging for association. MessyTable has multiple duplicates of the same appearance included in each scene, a unique feature that is not present in other datasets such as [35].

Variations in Elevation. Many works simplify the matching problem by assuming all objects are in contact with the same plane [8, 6, 1, 19]. However, this assumption often does not hold as the scene gets more complicated. To mimic the most general and realistic scenarios in real life, object instances in MessyTable are allowed to be stacked or placed on elevated surfaces as shown in Figure 4(e) and (f).

Data Splits. The dataset is collected with three different complexity levels: Easy, Medium, and Hard, with each subset accounting for 30%, 50%, and 20% of the total scenes. For each complexity level, we randomly partition data equally (1:1:1) into the training, validation, and test sets.
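A per-level random partition of this kind could be sketched as follows. This is not the authors' split script: the scene identifiers, the fixed seed, and the handling of remainders (leftover scenes go to the test set) are all assumptions made for illustration.

```python
import random

def split_scenes(scene_ids_by_level, seed=0):
    """Partition each complexity level roughly 1:1:1 into train/val/test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for ids in scene_ids_by_level.values():
        ids = list(ids)
        rng.shuffle(ids)
        third = len(ids) // 3
        splits["train"] += ids[:third]
        splits["val"] += ids[third:2 * third]
        splits["test"] += ids[2 * third:]  # remainder lands in the test set
    return splits

# Toy example mirroring the 30% / 50% / 20% proportions with 100 scenes.
scenes = {
    "easy": [f"easy_{i}" for i in range(30)],
    "medium": [f"medium_{i}" for i in range(50)],
    "hard": [f"hard_{i}" for i in range(20)],
}
splits = split_scenes(scenes)
```

Splitting within each level, rather than over the pooled scenes, keeps the Easy/Medium/Hard proportions approximately equal across the three sets.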

The camera angle differences are similar among the three levels of complexity. But the number of instances, the fraction of overlapped instances and the fraction of identical instances are significantly different as shown in Figure 5. Furthermore, as shown in the example scenes, the larger number of instances in a harder scene significantly increases the chance of heavy occlusion. We empirically show that these challenges undermine the association performance of various methods.

### 3.3 Data Annotation

We use OpenCV [2] for calibrating intrinsic and extrinsic parameters. As for the instance association annotation, we gather a team of 40 professional annotators and design a three-stage annotation scheme to obtain reliable annotations in a timely manner. The annotators first annotate bounding boxes to enclose all foreground objects (localization stage), followed by assigning class labels to the bounding boxes (classification stage). In the last stage, we develop an interactive tool for the annotators to group bounding boxes of the same object in all nine views and assign them the same instance ID (association stage). Bounding boxes in different views with the same instance ID are associated (corresponding to the same object instance in the scene) and the ID is unique in each view. For each stage, the annotators are split into two groups in the ratio of 4:1 for annotation and quality inspection.

It is worth mentioning that the interactive tool has two desirable features to boost efficiency and minimize errors: first, the class labels are used to filter out irrelevant bounding boxes during the association stage; second, the association results correct errors in the classification stage as the disagreement of classification labels from different views triggers reannotation. The details of the data annotation can be found in the Supplementary Materials. In short, MessyTable provides the following annotations:

• intrinsic and extrinsic parameters for the cameras;

• regular bounding boxes (with class labels) for all 120 foreground objects;

• instance ID for each bounding box

## 4 Baselines

In this section, we describe a few popular methods and heuristics that leverage appearance and geometric cues, and a new baseline that additionally exploits contextual information. We adopt a multiple instance association framework, in which pairwise distances between two sets of instances are computed first. After that, the association is formulated as a maximum bipartite matching problem and solved by the Kuhn-Munkres (Hungarian) algorithm.
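The framework can be illustrated on a toy distance matrix. The sketch below is a stand-in rather than the Kuhn-Munkres algorithm itself: it finds the same minimum-total-distance one-to-one assignment by exhaustive search, which is only practical for a handful of instances. The function name `associate` and the example distances are hypothetical.

```python
from itertools import permutations

def associate(dist):
    """Return the one-to-one assignment with minimum total distance.

    dist[i][j] is the distance between instance i in view A and
    instance j in view B; requires len(dist) <= len(dist[0]).
    Exhaustive search stands in for Kuhn-Munkres, which reaches the
    same optimum in polynomial time.
    """
    n, m = len(dist), len(dist[0])
    best_cost, best_cols = float("inf"), None
    for cols in permutations(range(m), n):
        cost = sum(dist[i][cols[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_cols = cost, cols
    return list(enumerate(best_cols)), best_cost

# Two instances in view A, three in view B (made-up distances).
dist = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9]]
matches, total = associate(dist)
print(matches, total)
```

Every baseline in this section plugs into this step by supplying a different way of filling `dist`.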

### 4.1 Appearance Information

The appearance feature of the instance itself is the most fundamental information for instance association. As the instances are defined by bounding boxes, which are essentially image patches, we find instance association largely resembles the patch matching (local feature matching) problem.

Local feature matching is one of the key steps for low-level multiple camera tasks [45, 3, 39]. Various hand-crafted feature descriptors, e.g., SIFT [20], have been widely used in this task. We implement a classical matching pipeline including SIFT keypoint description of the instances and K-means clustering for the formation of a visual bag of words (VBoW). The distance between two VBoW representations is computed via chi-square ($\chi^2$).
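For reference, one common form of the chi-square distance between two histograms is sketched below. The paper does not state which variant is used, so the factor of 0.5 and the skipping of empty bins are assumptions; the histograms are made-up normalized visual-word counts.

```python
def chi_square_distance(hist_a, hist_b):
    """Chi-square distance: 0.5 * sum((a - b)^2 / (a + b)) over bins."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

# Hypothetical normalized VBoW histograms over three visual words.
h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
d = chi_square_distance(h1, h2)
print(d)
```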

The application of deep learning has led to significant progress in patch matching [13, 44, 7]. The recent works highlight the use of CNN-based discriminative feature extractors such as DeepDesc [33] to directly output feature vectors, and the distance between two vectors can be computed using L2 distance; MatchNet [13] uses metric networks instead of L2 distance for better performance; DeepCompare [44] proposes to use both multi-resolution patches and metric networks. We use these three works as baselines.

We also implement a standard triplet network architecture with a feature extractor supervised by the triplet loss during training, which has been proven effective to capture subtle appearance difference in face recognition [30]. It is referred to as TripletNet in the experiments and L2 distance is used to measure the feature dissimilarity.
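The triplet loss can be written in a few lines. This is a minimal sketch of the standard margin-based formulation, not the training code used for TripletNet: the margin of 0.5, the use of unsquared L2 distance inside the loss, and the toy feature vectors are assumptions.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the positive is not closer than the
    negative by at least `margin` (assumed value)."""
    return max(0.0, l2_distance(anchor, positive)
                    - l2_distance(anchor, negative) + margin)

# Toy 2-D features: the positive is at distance 1, the negative at distance 5.
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
easy = triplet_loss(anchor, positive, negative)  # already well separated
hard = triplet_loss(anchor, negative, positive)  # roles swapped
print(easy, hard)
```

The loss vanishes once same-instance patches are closer than different-instance patches by the margin, which is what pushes the extractor to separate similar-looking objects.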

### 4.2 Surrounding Information

Inspired by Zheng et al. [48] in ReID, which addresses occlusion and view variation by associating a group of people instead of the individuals, we propose to look outside of the tight bounding box and involve the neighboring information, hoping to tackle not only occlusion and viewpoint variation, but also the existence of similar-looking instances.

The most intuitive idea is to expand the receptive field by a linear scaling ratio, i.e., cropping an area larger than the actual bounding box. This modification on the network input is referred to in the experiments as zoom-out, and the ratio as zoom-out ratio.
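A zoom-out crop can be sketched as scaling the box about its center. The `(x1, y1, x2, y2)` box format and the clipping to the image boundary are assumptions; the paper does not specify how crops that extend past the image are handled.

```python
def zoom_out_box(box, ratio, image_width, image_height):
    """Scale a box about its center by `ratio`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * ratio / 2
    half_h = (y2 - y1) * ratio / 2
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(image_width, cx + half_w), min(image_height, cy + half_h))

# A 20x40 box with zoom-out ratio 2 in a 100x90 image; the bottom is clipped.
crop = zoom_out_box((40, 40, 60, 80), ratio=2, image_width=100, image_height=90)
print(crop)
```

A ratio of 1 returns the original bounding box, so the appearance-only setting is a special case of the same crop.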

We take further inspiration from human behavior: one may look at the surroundings for more visual cues only if the appearance of the instances themselves is not informative. Hence, we design a simple network (named Appearance-Surrounding Network, ASNet; Figure 6), which has two branches for appearance and surrounding feature extraction, fused as follows:

$$d^{ab} = (1 - \lambda) \times D_{l2}(v^{a}_{app}, v^{b}_{app}) + \lambda \times D_{l2}(v^{a}_{sur}, v^{b}_{sur}) \tag{1}$$

$$\lambda = S_c(v^{a}_{app}, v^{b}_{app}) \tag{2}$$

where $a$ and $b$ are superscripts for two patches, and $v_{app}$ and $v_{sur}$ are appearance and surrounding feature vectors, respectively. $D_{l2}$ is L2 distance, $S_c$ is a similarity score between the two appearance feature vectors, and $\lambda$ is the weighting factor to fuse the appearance and surrounding branches. The fusion is designed such that, if the appearance features are similar, the network will place more weight on the surrounding than the appearance. Note that $\lambda$ is jointly optimized in an end-to-end network; it is not a hyperparameter to be set manually.
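Equations (1) and (2) can be traced with toy vectors. The sketch below assumes that $S_c$ is cosine similarity, which this text does not confirm, and it operates on hand-written feature vectors rather than network outputs.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def fused_distance(app_a, app_b, sur_a, sur_b):
    """Eq. (1), with lambda from Eq. (2); S_c assumed to be cosine similarity."""
    lam = cosine_similarity(app_a, app_b)
    return ((1 - lam) * l2_distance(app_a, app_b)
            + lam * l2_distance(sur_a, sur_b))

# Identical appearance: lambda = 1, so only the surroundings decide.
same_look = fused_distance([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Orthogonal appearance: lambda = 0, so only the appearance decides.
diff_look = fused_distance([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [3.0, 4.0])
print(same_look, diff_look)
```

The two cases show the intended behavior: when two patches look alike, the fused distance falls back on their surroundings to tell them apart.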

### 4.3 Geometric Methods

Homographic projection-based methods are very popular and used extensively in past works on Multi-Target Multi-Camera Tracking (MTMCT)[41, 42] and Multi-View People Detection[8, 6, 1, 19]. The mid-points of the bottom edges of the bounding boxes [19] are typically projected to a common coordinate system. The instances can thus be associated based on L2 distance between two sets of projected points. It assumes that all instances are placed on one reference 2D plane (e.g., the ground) and this simplification allows for an unambiguous pixel-to-pixel projection across cameras.
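The projection step can be sketched as follows, assuming boxes are given as `(x1, y1, x2, y2)` in image coordinates with y increasing downward and that a 3x3 homography `H` to the common plane is already known. The matrix used here is a made-up pure translation, chosen only to make the result easy to check.

```python
def bottom_midpoint(box):
    """Mid-point of the bottom edge of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, y2)

def project(H, point):
    """Apply a 3x3 homography to a 2-D point in homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

# Hypothetical homography: shift by (10, 20) on the reference plane.
H = [[1, 0, 10],
     [0, 1, 20],
     [0, 0, 1]]
ground_point = project(H, bottom_midpoint((0, 0, 10, 10)))
print(ground_point)
```

Points projected from two views are then compared by L2 distance; the result is only meaningful for objects that actually touch the reference plane.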

We also make use of epipolar geometry, which does not assume a reference plane. For a pair of bounding boxes in two views, the bounding box center in the first view is used to compute an epipolar line in the second view using the calibrated camera parameters. The distance between the bounding box center in second view and the epipolar line is added to the overall distance between the two bounding boxes. It is a soft constraint, since it does not accept or reject the matches, but penalizes unlikely matches by a large distance.
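The point-to-epipolar-line distance can be sketched as below. It assumes a fundamental matrix `F` with the convention $x_b^\top F x_a = 0$ has already been derived from the calibrated parameters; the example `F` corresponds to an idealized rectified pair in which epipolar lines are horizontal, and is not taken from the dataset.

```python
import math

def epipolar_distance(F, center_a, center_b):
    """Distance from center_b (view B) to the epipolar line of center_a.

    The line in view B is l = F @ [xa, ya, 1], written as a*x + b*y + c = 0.
    """
    xa = (center_a[0], center_a[1], 1.0)
    a, b, c = (sum(F[r][k] * xa[k] for k in range(3)) for r in range(3))
    return abs(a * center_b[0] + b * center_b[1] + c) / math.hypot(a, b)

# Idealized rectified pair: matching points share the same image row.
F = [[0, 0, 0],
     [0, 0, -1],
     [0, 1, 0]]
penalty = epipolar_distance(F, center_a=(100, 50), center_b=(30, 57))
print(penalty)
```

Because the value is added to the pairwise distance rather than thresholded, a match far from the epipolar line is discouraged but never forbidden.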

## 5 Experiments

Unless specified otherwise, we choose ResNet-18 as a light-weight backbone for all models and a zoom-out ratio of 2 for models with zoom-out, and we use a mixture of the Easy, Medium, and Hard sets for training and evaluation.

### 5.1 Evaluation Metrics

AP: Class-agnostic Average Precision is used to evaluate the algorithm’s ability to differentiate positive and negative matches, independent of the choice of the threshold value. All distances are scaled into a range of 0 and 1, and the confidence score is obtained by 1 - x, where x is the scaled distance.
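A minimal sketch of this scoring pipeline follows. Min-max scaling is assumed as the way distances are brought into [0, 1], and AP is computed in its non-interpolated form (mean precision at the rank of each positive); the paper does not spell out either choice. The distances and labels are made up, and the scaling assumes the distances are not all equal.

```python
def scale_to_unit(distances):
    """Min-max scale distances into [0, 1] (assumed scaling)."""
    lo, hi = min(distances), max(distances)
    return [(d - lo) / (hi - lo) for d in distances]

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive pair."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Four instance pairs: label 1 marks a true match.
distances = [0.2, 1.0, 0.6, 1.8]
labels = [1, 0, 0, 1]
scores = [1 - x for x in scale_to_unit(distances)]  # confidence = 1 - x
ap = average_precision(scores, labels)
print(ap)
```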

FPR-95: False Positive Rate at 95% recall [13] is commonly used in patch-based matching tasks and is adopted as a supplement to AP. However, it is worth noting that in the patch matching problem, the positive and negative examples are balanced in the evaluation, which is not the case in our task where the negative examples largely outnumber the positive ones.
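FPR-95 can be sketched by lowering the confidence threshold until 95% of the positives are accepted and then counting the negatives accepted along the way. Tie handling is ignored and labels are assumed to be 0 or 1; the scores below are made up.

```python
import math

def fpr_at_recall(scores, labels, recall=0.95):
    """False positive rate at the first threshold reaching `recall`."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    needed = math.ceil(recall * num_pos)  # positives that must be accepted
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        if tp >= needed:
            break
    return fp / num_neg

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
fpr95 = fpr_at_recall(scores, labels)
print(fpr95)
```

Since the denominator is the number of negatives, the heavy class imbalance noted above changes how the value should be read compared with balanced patch-matching benchmarks.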

IPAA: We introduce a new metric, Image Pair Association Accuracy (IPAA), that evaluates the image-pair level association results instead of the instance-pair level confidence scores. IPAA is computed as the fraction of image pairs with no less than X% of the objects associated correctly (written as IPAA-X). In our experiments, we observed that IPAA is more stringent than AP, making it ideal for showing differences between models with reasonably high AP values. Details can be found in the Supplementary Materials.
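Given per-image-pair counts, IPAA-X reduces to a thresholded fraction. The sketch below takes the number of correctly associated objects and the number of objects for each image pair as input; how those counts are obtained is defined in the Supplementary Materials and is not reproduced here. The example counts are invented.

```python
def ipaa(pair_results, x):
    """Fraction of image pairs with at least x% of objects correct.

    pair_results: list of (num_correct, num_objects), one per image pair.
    Integer comparison avoids floating-point issues at the threshold.
    """
    passed = sum(1 for correct, total in pair_results
                 if correct * 100 >= x * total)
    return passed / len(pair_results)

# Four hypothetical image pairs with 10 objects each.
pairs = [(10, 10), (9, 10), (8, 10), (5, 10)]
print(ipaa(pairs, 100), ipaa(pairs, 90), ipaa(pairs, 80))
```

IPAA-100 only credits image pairs in which every object is associated correctly, which is why it separates models whose AP values look similar.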

### 5.2 Benchmarking Baselines on MessyTable

In this section, we analyze and provide explanations for the performances of baselines on MessyTable, collated in Table 1.

Homographic projection performs poorly on MessyTable. The result is not surprising as objects in MessyTable can be placed on different elevated surfaces, violating the 2D reference plane assumption that is critical to accurate projection.

The SIFT-based classical method gives a poor performance as the hand-crafted key points tend to cluster around edges and texture-rich areas, leading to an unbalanced distribution. Hence, texture-less instances have very scarce key points, resulting in ineffective feature representation.

Deep learning-based patch matching SOTAs such as MatchNet [13], DeepCompare [44], and DeepDesc [33] give sub-optimal results as they struggle in distinguishing identical objects, which are abundant in MessyTable. Interestingly, our experiments show that a deeper backbone does not improve performance for MatchNet and DeepCompare, as their performances may be bottlenecked by their simple metric network designs. TripletNet with a triplet architecture outperforms these three models with a Siamese architecture by a clear margin (around a 0.25 increment in AP).

We compare TripletNet and ASNet on surrounding information extraction. Naive inclusion of surrounding information (TripletNet+ZO) worsens the association results, as a larger receptive field may introduce noise. In contrast, ASNet trains a specialized branch for the surrounding information to extract meaningful features. Figure 7 visualizes the feature map activations, showing that ASNet effectively learns to use surrounding information whereas TripletNet+ZO tends to focus on the instance itself. However, we highlight that despite a considerable improvement, ASNet only achieves a moderate AP of 0.524. This leaves great potential for improvement.

We also show that adding soft geometric constraints to ASNet gives further improvement (around 0.05 improvement in AP), indicating that the geometric information is complementary to appearance and surrounding information. However, the performance, especially in terms of the stringent metric IPAA-100, is still unsatisfactory.

### 5.3 Effects of View Variation and Scene Complexity

We ablate the challenges featured in MessyTable and their effects on instance association.

We compare the performances of several relatively strong baseline methods at various angle differences in Figure 8. The performance under all three metrics deteriorates rapidly as the angle difference increases. As shown in Figure 1, a large angle difference leads to differences in not only the appearance of an instance itself, but also its relative position within its context.

In addition, we test the same trained model on Easy, Medium, and Hard test sets. The three test sets have the same distribution of angle differences, but different scene complexity in terms of the number of instances, percentage of identical objects, and the extent of overlapping (Figure 5). The performance drops significantly in harder scenes, as shown in Table 2. We offer three explanations. First, with more instances on the table, harder scenes contain more occlusion, as shown in Figure 5(a). Second, it is more common to have identical objects closely placed or stacked together, leading to similar surrounding features and geometric distances (Figure 9), making such instances indistinguishable. Third, harder scenes have a smaller fraction of instances in the overlapping area; this may lead to more false positive matches between non-overlapped similar or identical objects, which contributes to higher FPR-95 values.

The above challenges demand a powerful feature extractor that is invariant to viewpoint changes, robust under occlusion, and able to learn the surrounding feature effectively. Yet all baselines have limited capabilities in these respects.

### 5.4 MessyTable as a Benchmark and a Training Source

We further validate the usefulness of MessyTable by conducting experiments on three public multi-camera datasets (Table 3), which gives the following insights:

First, methods that saturate MPII MK and EPFL MVMC are far from saturating MessyTable (Table 1). Note that both datasets have a limited number of classes and instances. Hence, this result highlights the need for MessyTable, a more realistic and challenging dataset for research of instance association.

Second, it is observed that algorithms show consistent trends on MessyTable and other datasets, that is, an algorithm that performs better on MessyTable also performs better on all other datasets. This shows MessyTable can serve as a highly indicative benchmark for multi-camera instance association.

Third, models pretrained on MessyTable consistently perform better than those pretrained on ImageNet, showing MessyTable is a better training source for instance association tasks. Note that EPFL MVMC has three classes (people, cars, and buses) and WILDTRACK is a challenging people dataset. It shows that a model trained on the general objects in MessyTable, learns feature extraction that is readily transferable across domains without feature engineering.

## 6 Discussion

In this work, we have presented MessyTable, a large-scale multi-camera general object dataset for instance association. MessyTable features the prominent challenges for instance association such as appearance inconsistency due to view angle differences, partial and full occlusion, similar and identical-looking objects, difference in elevation, and limited usefulness of geometric constraints. We show in the experiments that it is useful in two more ways. First, MessyTable is a highly indicative benchmark for instance association algorithms. Second, it can be used as a training source for domain-specific instance association tasks.

By benchmarking baselines on MessyTable, we obtain important insights for instance association: appearance feature is insufficient especially in the presence of identical objects; our proposed simple baseline, ASNet, incorporates the surrounding information into association and effectively improves the association performance. In addition, we show that epipolar geometry as a soft constraint is complementary to ASNet.

Although the combined use of appearance features, context information, and geometric cues achieves reasonably good performance, ASNet is still inadequate to tackle all challenges. Therefore, we ask three important questions: (1) how to extract stronger appearance, neighbouring and geometric cues? (2) is there a smarter way to fuse these cues? (3) is there more information that we can leverage to tackle instance association?

The experimental results on MessyTable point to many directions worth exploring. First, increasing view angle difference leads to a sharp deterioration of instance association performance of all baselines, highlighting the need for research on methods that capture view-invariant features and non-rigid contextual information. Second, methods give poorer performances as the scenes get more complicated; failure cases show that identical instances placed close to each other are extremely difficult to address even though the strongest baseline already leverages appearance, surrounding and geometric cues. Hence, more in-depth object relationship reasoning may be helpful to distinguish such instances.

Acknowledgements. This research was supported by SenseTime-NTU Collaboration Project, Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP.

## A Content Summary

In the supplementary materials, we provide additional details on:

• data collection procedure;

• data annotation procedure;

• full list of the 120 classes of objects;

• example scenes of three difficulty levels: Easy, Medium, and Hard;

• statistics of MessyTable and the three datasets evaluated in Section 5.4;

• framework;

• proposed metric IPAA;

• baselines

## B Additional Details on Data Collection

We gather a team of 10 people for data collection, whom we refer to as data collectors. We define the terms “setup” and “scene” as follows: a setup is an arrangement of nine cameras. The camera poses are randomly set for a setup and are reset for subsequent setups. A scene is an arrangement of all objects on the table: a random set of objects is placed on the table. These objects are then cleared from the table and replaced with a new random set of objects for subsequent scenes. With each setup, each camera captures one photo for each scene; a total of 10 scenes are collected for each setup.

### B.1 Setup

Camera Poses and Extrinsic Calibration. For each setup, the camera poses are varied, except for camera #1, which provides a bird’s eye view of the scene. Certain camera poses are deliberately arranged to be very near the table surface, to collect images of an incomplete scene. A calibration board with six large ArUco [27, 10] markers is then placed on the table, at a position that is visible to all cameras. The detected marker corners are used to compute the transformation matrix from the board frame to the camera frame by solving the perspective-n-point problem [2].
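Once each camera has a board-to-camera transform, the relative pose between any two cameras follows by composition. The sketch below illustrates that standard step and is not taken from the authors' calibration code; it assumes each transform is given as a rotation matrix `R` and translation `t` with $X_{cam} = R X_{board} + t$, and the example poses are invented.

```python
def mat_mul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def mat_vec(R, v):
    return [sum(R[r][k] * v[k] for k in range(3)) for r in range(3)]

def transpose(R):
    return [list(col) for col in zip(*R)]

def relative_pose(R1, t1, R2, t2):
    """Transform from camera 1 to camera 2: X_c2 = R_rel @ X_c1 + t_rel."""
    R_rel = mat_mul(R2, transpose(R1))  # inverse of a rotation is its transpose
    rotated_t1 = mat_vec(R_rel, t1)
    t_rel = [t2[k] - rotated_t1[k] for k in range(3)]
    return R_rel, t_rel

# Camera 1: no rotation. Camera 2: rotated 90 degrees about the z-axis.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
R_rel, t_rel = relative_pose(identity, [1, 0, 0], rot_z_90, [0, 0, 2])
print(R_rel, t_rel)
```

A relative pose of this form is the kind of input from which the epipolar constraint in Section 4.3 can be derived.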

Lighting Conditions Variations in lighting often severely affect the performances of visual algorithms. Data augmentation [32] and artificially generated shadows [38] can be unrealistic. Hence, we combine fixed light sources with mobile studio lighting kits to add lighting variations to the dataset such as different light directions and intensity, shadows, and reflective materials. The lighting is adjusted for every setup.

### B.2 Scene

For object placements, we only provide vague instructions to the data collectors about the approximate numbers of objects to be used for Easy, Medium, and Hard scenes respectively; the data collectors make their own decisions at choosing a set of objects and the pattern to place the objects on the table. Hence, we ensure that the object placements resemble the in-the-wild arrangements as much as possible.

For backgrounds, we include baskets and cardboard boxes during data capturing. They serve various purposes, including as occlusion, as platforms for other objects, etc. We also have coasters, placemats, and tablecloths underneath each scene which come in different sizes, patterns, colors, and textures, and are commonly found in natural scenes.

## C Additional Details on Data Annotation

The interactive tool we design for the association stage is shown in Figure 10. By selecting bounding boxes, these bounding boxes are assigned the same instance ID. The tool is designed with the following features to increase efficiency and to minimize errors:

Irrelevant Bounding Box Filtering. Once a bounding box is selected (by clicking on it) in any view, only the bounding boxes of the same class or similar classes will remain displayed in other views. It is worth noting that we choose to keep similar classes, in addition to the same class, because the labels from the classification stage can be erroneous (an object may be wrongly annotated with a class similar to its true class). Classes are considered to be similar based on their categories (the grouping is listed in Table 4).

Classification Annotation Verification The tool checks if the bounding boxes with the same instance ID have the same class labels. It notifies annotators if any disagreement is detected, and performs automatic correction based on majority voting of the class label amongst nine views, each annotated independently in the classification stage.
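The verification step amounts to a majority vote over the per-view labels of one instance. The sketch below is an illustration only: the class names and view numbers are hypothetical, and ties are broken by first occurrence, a detail the text does not specify.

```python
from collections import Counter

def verify_class_labels(labels_by_view):
    """Return (majority label, whether any view disagrees).

    labels_by_view maps a view index to the class label assigned to the
    same instance ID in that view.
    """
    counts = Counter(labels_by_view.values())
    majority_label, _ = counts.most_common(1)[0]
    disagreement = len(counts) > 1
    return majority_label, disagreement

# One instance visible in four of the nine views (hypothetical labels).
label, flagged = verify_class_labels({1: "cup", 2: "cup", 3: "mug", 5: "cup"})
print(label, flagged)
```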

## F Additional Statistics of MessyTable and Other Datasets

Table 5 shows additional statistics of MessyTable and the three datasets that were evaluated in Section 5.4.

## G Additional Details on the Framework

As shown in Figure 13, all baselines discussed in the main paper are essentially different ways to compute pair-wise distances. Homographic projection uses the pixel distance between two sets of projected points; SIFT uses the chi-square distance between two bag-of-visual-words representations; MatchNet and DeepCompare use metric networks to compute the similarity between extracted feature vectors; DeepDesc, TripletNet, and ASNet use the L2 distance; the epipolar soft constraint uses the pixel distance between a bounding box center point and an epipolar line.
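As a rough illustration of two of these distances, the sketch below computes the L2 distance between two feature vectors and the pixel distance from a bounding box center to an epipolar line. The function names are hypothetical, and obtaining the epipolar line from a fundamental matrix `F` is an assumption made for illustration; it is not taken from the authors' implementation.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def epipolar_distance(F, center_1, center_2):
    """Pixel distance from center_2 (view 2) to the epipolar line of center_1.

    F is a 3x3 fundamental matrix given as nested lists (assumed input),
    mapping a point in view 1 to a line in view 2. Centers are (x, y)
    pixel coordinates of bounding box centers.
    """
    x, y = center_1
    # Epipolar line l = F [x, y, 1]^T, written as a*u + b*v + c = 0.
    a, b, c = (row[0] * x + row[1] * y + row[2] for row in F)
    u, v = center_2
    # Standard point-to-line distance in the image plane.
    return abs(a * u + b * v + c) / math.hypot(a, b)
```

For a rectified pair with a purely horizontal baseline, `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` maps every point to the horizontal line through its own row, so the distance reduces to the difference in vertical pixel coordinates.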

## H Additional Details on the Proposed Metric: Image-pair Association Accuracy (IPAA)

The motivation for IPAA is to gauge performance at the image-pair level, whereas AP and FPR-95 gauge performance at the instance-pair level: AP and FPR-95 evaluate the matching score (confidence score) of each instance pair against its ground truth (0 or 1), but do not directly provide insight into the matching quality of an image pair, which contains many instance pairs. In contrast, IPAA is computed as the fraction of image pairs in which no less than X% of the objects are associated correctly (written as IPAA-X). The computation of the percentage of correctly associated objects for each image pair is shown in Figure 14.
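The final aggregation step of IPAA-X can be sketched as below. The sketch assumes that the fraction of correctly associated objects has already been computed for each image pair (the procedure of Figure 14, which is not reproduced here); the function name `ipaa` is hypothetical.

```python
def ipaa(correct_fractions, x):
    """IPAA-X: fraction of image pairs with at least x% of objects correct.

    correct_fractions: one value in [0, 1] per image pair, giving the
    fraction of objects in that pair that are associated correctly
    (assumed to be precomputed).
    x: threshold in percent, e.g. 100, 90, or 80.
    """
    if not correct_fractions:
        raise ValueError("need at least one image pair")
    threshold = x / 100.0
    passed = sum(1 for f in correct_fractions if f >= threshold)
    return passed / len(correct_fractions)
```

With four image pairs whose fractions are 1.0, 0.95, 0.8, and 0.5, only the first pair counts towards IPAA-100, the first two count towards IPAA-90, and the first three count towards IPAA-80.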

## I Additional Details on Baselines

This section provides more details on the baselines. These details are omitted from the main paper due to space constraints, but they offer important insights into the instance association problem.

### I.1 Additional Results on Zoom-out Ratio

The key hyperparameter of our baseline ASNet is the zoom-out ratio, which controls how much surrounding information is included. We conduct experiments with different zoom-out ratios. The results show that including surrounding information significantly improves association performance (compared with a zoom-out ratio of 1). We simply choose a zoom-out ratio of 2, as the performance is not sensitive to the ratio within the range [1.2, 2.4]. As the zoom-out ratio increases beyond 2.4, however, the performance starts to decline. We argue that although a larger zoom-out ratio includes more of the surrounding area, the model is unable to extract an effective embedding for the surrounding features. This can be a direction for future research.
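One plausible reading of the zoom-out operation is sketched below: the bounding box is enlarged about its center by the zoom-out ratio and then clipped to the image. Both the center-based scaling and the clipping are assumptions made for illustration, since this section does not specify how regions outside the image are handled; `zoom_out_box` is a hypothetical name.

```python
def zoom_out_box(box, ratio, image_size):
    """Enlarge a box about its center by `ratio`, clipped to the image.

    box: (x1, y1, x2, y2) in pixels.
    ratio: zoom-out ratio; 1 leaves the box unchanged.
    image_size: (width, height) in pixels.
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * ratio / 2.0
    half_h = (y2 - y1) * ratio / 2.0
    # Clip to the image bounds (an assumption; padding is another option).
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(width), cx + half_w),
        min(float(height), cy + half_h),
    )
```

With a ratio of 2, a 100 x 100 box becomes a 200 x 200 crop around the same center, so three quarters of the cropped area is surrounding context.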

### I.2 More Details on Using Bounding Boxes from Detectors

We also evaluate our trained ASNet model on the test set with bounding boxes generated by detectors instead of the ground truth bounding boxes. These detected bounding boxes suffer from false positives (false detections), false negatives (missed detections), and imperfect localization and dimensions.

It is worth noting that the detected bounding boxes undergo post-processing to obtain instance IDs from the ground truth. For a given image, bipartite matching is performed between the detected bounding boxes and the ground truth bounding boxes based on pair-wise IoUs. The matched detected bounding boxes are assigned the instance IDs of the ground truth bounding boxes, whereas the unmatched detected bounding boxes are assigned unique instance IDs.
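The post-processing described above can be sketched as follows, using `scipy.optimize.linear_sum_assignment` for the bipartite matching. The minimum-IoU cut-off `min_iou`, the `first_new_id` parameter, and the function names are assumptions introduced for illustration; the text does not state whether a threshold is applied or how the unique IDs are generated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign_instance_ids(detected, gt_boxes, gt_ids, first_new_id, min_iou=0.5):
    """Give each detected box a ground-truth instance ID or a fresh unique ID.

    first_new_id: first integer ID to hand out to unmatched detections
    (hypothetical parameter; it must not collide with any ground-truth ID).
    min_iou: matches below this IoU are treated as unmatched (assumed value).
    """
    ids = [None] * len(detected)
    if detected and gt_boxes:
        ious = np.array([[iou(d, g) for g in gt_boxes] for d in detected])
        # Bipartite matching that maximizes the total IoU of matched pairs.
        rows, cols = linear_sum_assignment(ious, maximize=True)
        for r, c in zip(rows, cols):
            if ious[r, c] >= min_iou:
                ids[r] = gt_ids[c]
    next_id = first_new_id
    for i, assigned in enumerate(ids):
        if assigned is None:  # unmatched detection, e.g. a false positive
            ids[i] = next_id
            next_id += 1
    return ids
```

Ground truth boxes that remain unmatched correspond to missed detections and simply do not appear in the output.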

The results are collated in Table 6. Instance association is challenging in itself, and even more so when combined with a detection stage. The weaker the upstream detection model, the worse the association performance. We point out that joint optimization of the detection and association stages can be a direction for future research.

### I.3 Additional Visualization of Scenes Where Geometric Cues Are Necessary

Figure 16 visualizes scenes where both the appearance features and the surrounding features are similar for different object instances. In this scenario, geometric cues are particularly helpful because they penalize the geometrically infeasible pair (i.e., the false pair), making the overall distance of the false pair larger than that of the true pair.
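The effect can be illustrated with a toy calculation. The weighted sum below is only an assumed fusion rule chosen for illustration, and the weight and all distance values are made up; this section does not specify how the appearance and geometric distances are actually combined.

```python
def overall_distance(appearance_dist, epipolar_dist_px, weight=0.01):
    """Toy fusion of appearance and geometric distances (assumed form).

    weight converts the epipolar distance in pixels into the same scale
    as the appearance distance; its value here is hypothetical.
    """
    return appearance_dist + weight * epipolar_dist_px

# Two candidate pairs that look alike: identical appearance distance.
true_pair = overall_distance(appearance_dist=0.30, epipolar_dist_px=2.0)
false_pair = overall_distance(appearance_dist=0.30, epipolar_dist_px=80.0)

# The geometric term separates the two pairs that appearance cannot.
assert true_pair < false_pair
```

Appearance alone cannot rank the two candidates here, whereas the geometric penalty makes the geometrically infeasible pair the more distant one.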

### I.4 Additional Results from Structure from Motion Baseline

Structure from Motion (SfM) can be used to generate 3D structure from multiple views [14, 39]. The 3D structure can be trivially used for instance association from multiple views, as pixel correspondences are known. However, an inherent limitation of SfM is that only the intersection of the cameras' views can be reconstructed, whereas instance association from multiple views should cover their union. Besides, SfM is sensitive to repetitive patterns and to reflective and textureless surfaces [15]. We apply three state-of-the-art SfM engines, COLMAP [29], OpenMVG [24], and Theia [36], to the scenes of MessyTable. The first two fail to converge, whereas Theia gives incomplete reconstruction results, shown in Figure 17.

### I.5 Visualization of SIFT Keypoints

We visualize the keypoints detected by SIFT in Figure 18. SIFT keypoints clearly cluster in feature-rich regions such as edges and patterns, whereas textureless instances have very few keypoints. This imbalanced distribution of keypoints is likely the reason for the poor performance of the SIFT baseline.

## References

• [1] P. Baqué, F. Fleuret, and P. Fua (2017) Deep occlusion reasoning for multi-camera multi-target detection. In ICCV, Cited by: §3.2, §3.2, §4.3.
• [2] G. Bradski (2000) The OpenCV library. Dr. Dobb’s Journal of Software Tools. Cited by: §B.1, §3.3.
• [3] A. Caliskan, A. Mustafa, E. Imre, and A. Hilton (2019) Learning dense wide baseline stereo matching for people. In ICCVW, Cited by: Figure 3, §4.1.
• [4] D. Cernea (2015) OpenMVS: open multiple view stereovision. Cited by: Figure 17.
• [5] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret (2018) WILDTRACK: a multi-camera HD dataset for dense unscripted pedestrian detection. In CVPR, Cited by: §2, §2, §3.1.
• [6] T. Chavdarova et al. (2017) Deep multi-camera people detection. In ICMLA, Cited by: §3.2, §4.3.
• [7] G. Csurka and M. Humenberger (2018) From handcrafted to deep local features for computer vision applications. CoRR abs/1807.10254. Cited by: §4.1.
• [8] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua (2007) Multicamera people tracking with a probabilistic occupancy map. PAMI. Cited by: §3.2, §4.3.
• [9] J. Gao and R. Nevatia (2018) Revisiting temporal modeling for video-based person ReID. CoRR abs/1805.02104. Cited by: §2.
• [10] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer (2016) Generation of fiducial marker dictionaries using mixed integer linear programming. Cited by: §B.1.
• [11] S. Gong, M. Cristani, S. Yan, and C. C. Loy (2014) Person re-identification. Springer. Cited by: §2.
• [12] M. Gou, Z. Wu, A. Rates-Borras, O. Camps, R. J. Radke, et al. (2018) A systematic evaluation and benchmark for person re-identification: features, metrics, and datasets. PAMI. Cited by: §2.
• [13] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg (2015) Matchnet: unifying feature and metric learning for patch-based matching. In CVPR, Cited by: §4.1, §5.1, §5.2.
• [14] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge University Press. Cited by: §I.4.
• [15] H. Hirschmuller (2007) Stereo processing by semiglobal matching and mutual information. PAMI. Cited by: §I.4.
• [16] H. Hsu, T. Huang, G. Wang, J. Cai, Z. Lei, and J. Hwang (2019) Multi-camera tracking of vehicles based on deep features Re-ID and trajectory-based camera link models. In CVPRW, Cited by: §2.
• [17] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In CVPR, Cited by: §2.
• [18] W. Li, J. Mu, and G. Liu (2019) Multiple object tracking with motion and appearance cues. In ICCVW, Cited by: §2.
• [19] A. López-Cifuentes, M. Escudero-Viñolo, J. Bescós, and P. Carballeira (2018) Semantic driven multi-camera pedestrian detection. CoRR abs/1812.10779. Cited by: §3.2, §4.3.
• [20] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. IJCV. Cited by: §4.1.
• [21] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, X. Zhao, and T. Kim (2014) Multiple object tracking: a literature review. CoRR abs/1409.7618. Cited by: §2.
• [22] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. CoRR abs/1603.00831. Cited by: §2.
• [23] A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler (2017) Online multi-target tracking using recurrent neural networks. In AAAI, Cited by: §2.
• [24] P. Moulon, P. Monasse, R. Perrot, and R. Marlet (2016) OpenMVG: open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, Cited by: §I.4.
• [25] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, Cited by: §2.
• [26] G. Roig, X. Boix, H. B. Shitrit, and P. Fua (2011) Conditional random fields for multi-camera object detection. In ICCV, Cited by: §2, §3.1.
• [27] F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer (2018) Speeded up detection of squared fiducial markers. Image and Vision Computing. Cited by: §B.1.
• [28] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In ICCV, Cited by: §2.
• [29] J. L. Schönberger and J. Frahm (2016) Structure-from-Motion Revisited. In CVPR, Cited by: §I.4.
• [30] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, Cited by: §4.1.
• [31] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker (2017) Deep network flow for multi-object tracking. In CVPR, Cited by: §2.
• [32] C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data. Cited by: §B.1.
• [33] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In ICCV, Cited by: §4.1, §5.2.
• [34] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, Cited by: §2.
• [35] W. Susanto, M. Rohrbach, and B. Schiele (2012) 3D object detection with multiple kinects. In ECCV, Cited by: §2, §3.1, §3.2.
• [36] C. Sweeney. Theia multiview geometry library: tutorial & reference. http://theia-sfm.org Cited by: Figure 17, §I.4.
• [37] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer GAN to bridge domain gap for person re-identification. In CVPR, Cited by: §2.
• [38] X. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu (2019) RPC: a large-scale retail product checkout dataset. CoRR abs/1901.07249. Cited by: §1, §B.1.
• [39] S. Winder, G. Hua, and M. Brown (2009) Picking the best DAISY. In CVPR, Cited by: §4.1, §I.4.
• [40] Y. Xu, X. Zhou, S. Chen, and F. Li (2019) Deep learning for multiple object tracking: a survey. IET Computer Vision. Cited by: §2.
• [41] Y. Xu, X. Liu, Y. Liu, and S. Zhu (2016) Multi-view people tracking via hierarchical trajectory composition. In CVPR, Cited by: §2, §4.3.
• [42] Y. Xu, X. Liu, L. Qin, and S. Zhu (2017) Cross-view people tracking by scene-centered spatio-temporal parsing. In AAAI, Cited by: §4.3.
• [43] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi (2020) Deep learning for person re-identification: a survey and outlook. CoRR abs/2001.04193. Cited by: §2.
• [44] S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In CVPR, Cited by: §4.1, §5.2.
• [45] J. Zbontar and Y. LeCun (2015) Computing the stereo matching cost with a convolutional neural network. In CVPR, Cited by: §4.1.
• [46] Z. Zhang, J. Wu, X. Zhang, and C. Zhang (2017) Multi-target, multi-camera tracking by hierarchical clustering: recent progress on DukeMTMC project. CoRR abs/1712.09531. Cited by: §2.
• [47] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, Cited by: §2.
• [48] W. Zheng, S. Gong, and T. Xiang (2009) Associating groups of people. In BMVC, Cited by: §2, §4.2.
• [49] Y. Zhou and L. Shao (2018) Aware attentive multi-view inference for vehicle re-identification. In CVPR, Cited by: §2.