
Background Mixup Data Augmentation for Hand and Object-in-Contact Detection

by Koya Tango, et al.

Detecting the positions of human hands and objects-in-contact (hand-object detection) in each video frame is vital for understanding human activities from videos. For training an object detector, a method called Mixup, which overlays two training images to mitigate data bias, has been empirically shown to be effective for data augmentation. However, in hand-object detection, mixing two hand-manipulation images produces unintended biases, e.g., the concentration of hands and objects in a specific region degrades the ability of the hand-object detector to identify object boundaries. We propose a data-augmentation method called Background Mixup that leverages data-mixing regularization while reducing the unintended effects in hand-object detection. Instead of mixing two images where a hand and an object in contact appear, we mix a target training image with background images without hands and objects-in-contact extracted from external image sources, and use the mixed images for training the detector. Our experiments demonstrated that the proposed method can effectively reduce false positives and improve the performance of hand-object detection in both supervised and semi-supervised learning settings.





1 Introduction

Detecting the positions of a person’s hands and objects-in-contact (hand-object detection) from an image provides an important clue for understanding how the person interacts with the physical world. Hand-object detection is applicable to recognizing a person’s primitive actions, such as “taking” or “pushing”, and to logging the person’s interactions with the environment [yagi2021go]. Shan et al. [Shan20] built a hand-object detector for localizing hands and interacting objects on a large-scale dataset collected in naturalistic household situations, such as kitchen work [damen2018scaling, li2018eye, sigurdsson2018actor], DIY [Shan20], and craft work [Shan20, sigurdsson2018actor].

However, a hand-object detector trained on such household images may not generalize well to other hand-manipulation images. For instance, images from biological laboratories or factories have data distributions significantly different from the daily scenes used in training. To build an accurate hand-object detector for such unique application domains, a large amount of data and labels must be collected from scratch. However, data collection and annotation can be difficult for various reasons, such as cost or privacy issues; in particular, expert knowledge is required to annotate data in these specific domains. Under these limitations, a hand-object detector may overfit the small training set and lack generalization ability.

To improve the generalization ability of the detector trained on a small dataset, data augmentation is a key component in training. Recently, Mixup [zhang2017mixup], a method that overlays two different images, has been used as an empirically strong augmentation for object detection [zhou2021instant]. Nevertheless, naively applying Mixup induces unintended biases in hand-object detection. As shown in Figure 1, (a) contact states become ambiguous when hand-object pairs from different images overlap, and (b) the concentration of hands and objects in a specific local region makes identifying object boundaries difficult. These unintended mixtures will degrade the performance of a hand-object detector.

(a) Ambiguous contact states
(b) Ambiguous object boundaries
Figure 1: Problems with Mixup. Naively mixing two images causes ambiguity in (a) contact states and (b) object boundaries.
Figure 2: Overview of Background Mixup. We aim to improve diversity in training data while preserving foreground’s semantics.

To handle this, we propose a novel data-augmentation method, called Background Mixup, that utilizes data-mixing regularization while reducing the unintended effects in hand-object detection. As shown in Figure 2, we augment a training image by mixing it with background images from external sources that contain no foreground (i.e., no hands or objects-in-contact), and we use the mixed images for training a hand-object detector. The contributions of this paper are summarized as follows.

  • We propose a novel data-augmentation method, Background Mixup, that mixes a training image and a background image to improve the generalization ability of a hand-object detector in a small dataset.

  • Our experiments showed that Background Mixup outperforms Mixup for hand-object detection in both supervised and semi-supervised learning settings.

  • Our method was also shown to be effective in reducing the number of false-positive predictions, whereas Mixup degrades this metric.

2 Related Work

2.1 Hand and Object-in-Contact

Jointly analyzing hands and objects-in-contact serves to understand human behavior [li2015delving, bambach2015lending]. While these studies were conducted on a limited scale of data, Shan et al. [Shan20] proposed a large-scale dataset, collected from daily-life sources such as EPIC-KITCHENS 2018 [damen2018scaling], EGTEA [li2018eye], and CharadesEgo [sigurdsson2018actor], for training a hand-object detector that localizes hands and interacting objects. However, directly fine-tuning the hand-object detector on a small and specific dataset can lead to limited performance, as discussed in Section 1.

To overcome this, we developed Background Mixup to improve the generalization ability of a hand-object detector on a small dataset in specific domains such as biomedical experiments and factory work.

2.2 Mixture-Based Data Augmentation

Mixture-based data augmentation mixes input data with other inputs to increase the diversity of data on a small dataset and improve the generalization performance of the model. Several mixture-based methods, such as Mixup [zhang2017mixup], CutMix [yun2019cutmix], Mosaic [ge2021yolox], and Cutout [devries2017cutout], have been used in many downstream tasks.

These mixture-based methods are also used for semi-supervised object detection [liu2021unbiased, zhou2021instant]. Unbiased-Teacher [liu2021unbiased] uses Cutout, while Instant-Teaching [zhou2021instant] uses Mixup and Mosaic, showing that Mixup in particular contributes to improving object-detection performance. However, applying Mixup induces unintended biases in hand-object detection, as discussed in Section 1, and these unintended mixtures degrade the performance of a hand-object detector.
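As a concrete illustration (a minimal sketch, not the implementation used in any of the cited works), standard Mixup blends two whole images with a Beta-sampled weight, so the hands and objects of both images stay visible in the result:

```python
import numpy as np

def mixup(img_a, img_b, alpha=1.0, rng=None):
    """Standard Mixup [zhang2017mixup]: blend two images with a
    Beta(alpha, alpha)-sampled weight. Both foregrounds remain visible
    in the mixed image, which is the source of the ambiguities in
    contact states and object boundaries discussed above."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    # In Mixup for detection, the bounding boxes of BOTH images are kept.
    return mixed, lam
```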

3 Proposed Method

In this section, we introduce the proposed training of a hand-object detector with Background Mixup data augmentation. Let D_train and D_test be the sets of training and testing images, respectively. When D_train is small, a hand-object detector trained on it may not generalize well to D_test due to over-fitting. To solve this problem, we propose Background Mixup, which uses background images containing no foreground entities (i.e., no hands or objects-in-contact) to increase the diversity of the training data.

We use a trained hand-object detector [Shan20] to extract background images from an external image source (e.g., kitchens) that differs from our target training and testing data. We keep the images in which the detector finds neither an object-in-contact nor a hand, and from them construct a set of background images.
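The background-extraction step can be sketched as follows (the `detector` interface here is hypothetical; the paper uses the trained hand-object detector of Shan et al. [Shan20]):

```python
def collect_backgrounds(frames, detector, score_thresh=0.5):
    """Keep only frames in which the detector finds neither a hand nor an
    object-in-contact above the confidence threshold; these frames form
    the background set used for mixing."""
    backgrounds = []
    for frame in frames:
        # `detector` is assumed to return a list of (label, score) pairs.
        detections = detector(frame)
        if all(score < score_thresh for _, score in detections):
            backgrounds.append(frame)
    return backgrounds
```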

Figure 3 shows a comparison between Mixup and Background Mixup. With Mixup, the foreground and background are combined, causing unintended effects that make the contact state ambiguous or make it difficult to identify the boundaries of objects, as shown in Figure 1. In contrast, Background Mixup reduces such unintended effects by mixing the training image with the background image, which can retain the foreground of the training image.

Figure 3: Comparison of Mixup [zhang2017mixup] and Background Mixup.

We denote a training image by x and a background image by x_b. We define Background Mixup as:

\tilde{x} = \lambda x + (1 - \lambda)\, x_b,

where x and x_b are randomly sampled from the training set and the background set, respectively, and \lambda \in [0, 1] is a parameter controlling the degree of the mixture. Following Mixup, \lambda is drawn from a beta distribution, \lambda \sim \mathrm{Beta}(\alpha, \beta), where \alpha and \beta are hyperparameters that determine the shape of the distribution.

We use the mixed images for training the hand-object detector instead of the original training images. This method adds only a small computational cost at the training stage, and no additional cost is required at inference.
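The mixing step itself reduces to a weighted sum over pixel arrays; a minimal sketch (the default hyperparameter values are illustrative, not those used in the paper):

```python
import numpy as np

def background_mixup(img, bg, alpha=1.0, beta=1.0, rng=None):
    """Mix a training image with a hand/object-free background image,
    with the mixing weight drawn from Beta(alpha, beta). Unlike Mixup,
    the annotations (bounding boxes) of `img` are kept unchanged, since
    `bg` contributes no foreground entities."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, beta))
    return lam * img.astype(np.float32) + (1.0 - lam) * bg.astype(np.float32)
```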

4 Experiments

4.1 Experimental Setup

Datasets. We validated our method on various hand manipulation datasets including biomedical experiments, mock factories, and kitchens. We used a first-person video dataset of biomedical experiments and a mock factory environment dataset [ragusa2021meccano] as specific application domains where the data size and variety are limited. We also used a kitchen environment dataset [damen2018scaling, Shan20] including diverse cooking scenes.

For the dataset of biomedical experiments, we recorded 12 videos that contained basic actions such as preparing reagents in a biomedical lab. The bounding boxes of hands and objects-in-contact were annotated by an expert in the field. The total duration is 27 minutes, and the number of annotated frames is 3,093. We split the 12 videos into 6:3:3 for train:val:test.

For the set of background images in the experiments on the biomedical and factory datasets, we used EPIC-KITCHENS-100 [Damen2021RESCALING], consisting of cooking scenes in kitchens, as an external image source. For the experiments on the cooking dataset, we used Something-Something V2 [goyal2017something], consisting of daily scenes, as an external image source to augment the background appearance.

Training details. We measured the performance of our method in supervised and semi-supervised learning settings. For supervised learning, we fine-tuned the pre-trained hand-object detector proposed by Shan et al. [Shan20]. For semi-supervised learning, we trained the hand-object detector in the training pipeline of Unbiased-Teacher (UB-Teacher) [liu2021unbiased]. We evaluated performance by the average precision (AP) of hands and objects-in-contact. Note that we do not report hand AP on the mock factory environment dataset [ragusa2021meccano] because its hand bounding boxes are not annotated.

Baselines. We denote Background Mixup using EPIC-KITCHENS-100 [Damen2021RESCALING] backgrounds as BG-Mix-EK and Background Mixup using Something-Something V2 [goyal2017something] backgrounds as BG-Mix-SS. We prepared two variants of Mixup as comparison methods: Mixup, the original formulation that combines two images within the dataset, and Mixup-EK, which mixes a training image with a randomly selected image from EPIC-KITCHENS 2018 [Shan20, damen2018scaling].

4.2 Quantitative Evaluation

4.2.1 Supervised Learning

                Biomedical              Factory    Cooking
Model           hand AP  obj AP  mAP    obj AP     hand AP  obj AP  mAP
Supervised      90.9     70.6    80.7   45.0       90.6     66.4    78.5
  + Mixup       90.9     70.4    80.6   44.6       90.6     65.6    78.1
  + Mixup-EK    90.9     69.8    80.4   44.6       -        -       -
  + BG-Mix-EK   90.9     72.2    81.0   45.2       -        -       -
  + BG-Mix-SS   -        -       -      -          90.6     65.7    78.2
Table 1: Quantitative comparisons in supervised learning. Hand AP is not reported on the Factory dataset, where hands are not annotated.

Table 1 lists the results in the supervised learning setting. On the biomedical and mock factory environment datasets, BG-Mix-EK exhibited the highest mAP and object AP, while Mixup and Mixup-EK decreased these metrics. This is because our method avoids the unintended effects shown in Figure 1, which degrade the performance of detecting an object-in-contact. On the cooking dataset, however, both Mixup and BG-Mix-SS obtained lower object AP than the Supervised baseline, because that dataset already has diverse foreground and background appearances even without data augmentation. The hand AP of Background Mixup did not increase because hand AP was already saturated.

4.2.2 Semi-Supervised Learning

                Biomedical              Factory    Cooking
Model           hand AP  obj AP  mAP    obj AP     hand AP  obj AP  mAP
UB-Teacher      90.6     64.0    77.3   27.0       90.5     41.4    65.9
  + Mixup       90.9     62.4    76.6   30.4       90.4     46.5    68.5
  + Mixup-EK    90.8     65.4    78.1   31.0       -        -       -
  + BG-Mix-EK   90.8     66.4    78.6   32.6       -        -       -
  + BG-Mix-SS   -        -       -      -          90.5     47.2    68.9
Table 2: Quantitative comparisons in semi-supervised learning at 1% labels.

Table 2 shows the results for semi-supervised learning with 1% labeled data. BG-Mix-EK and BG-Mix-SS exhibited the highest performance on all datasets, except for hand AP on the biomedical data. On the cooking dataset, which has a large variety of objects and backgrounds, BG-Mix-SS improved the performance of semi-supervised learning even though both BG-Mix-SS and Mixup decreased supervised performance, as shown in Section 4.2.1. This indicates that Background Mixup is effective under limited labels, where the fully-supervised model struggles to generalize to unknown test data.

4.2.3 Analysis of False-positive Predictions

Although mAP is a standard evaluation criterion in object detection, the mAP score can be inflated by allowing many false positives with low confidence [kaggle2021map]. However, detection results that contain many false positives are problematic in real scenarios. We therefore measured precision, the percentage of true positives among the predictions detected as positive, to quantify the share of false positives in the detector's output. In other words, the lower the precision, the higher the percentage of false positives.
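Concretely, precision at a fixed confidence threshold can be computed as follows (a sketch in which the IoU matching against ground truth is abstracted into a per-prediction true-positive flag):

```python
def precision_at_threshold(predictions, conf_thresh=0.1):
    """predictions: list of (score, is_true_positive) pairs.
    Precision = TP / (TP + FP) among the predictions kept at the
    threshold; lower precision means a higher share of false positives."""
    kept = [is_tp for score, is_tp in predictions if score >= conf_thresh]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)
```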

Table 3 compares precision in semi-supervised learning with 1% labeled data and a confidence threshold of 0.1. While Mixup-EK improves mAP, it decreases precision. This indicates that training the detector on mixed images with many bounding boxes overlapping in a specific area, as illustrated in Figure 1(b), biases the detector toward more false positives. BG-Mix-EK improves mAP without increasing false positives because it preserves the hand-object contact information and avoids concentrating hands and objects-in-contact in a local region.

Model           mAP    Precision (hand)  Precision (obj)
UB-Teacher      77.3   87.1              49.7
  + Mixup       76.6   76.8              42.0
  + Mixup-EK    78.1   75.6              38.7
  + BG-Mix-EK   78.6   89.1              48.8
Table 3: Comparisons of false-positive predictions on the biomedical experiments dataset.

4.3 Qualitative Evaluation

(a) UB-Teacher
(b) +Mixup
(c) +Mixup-EK
(d) +BG-Mix-EK
Figure 4: Qualitative results in detecting object-in-contact.

Figure 4 shows the inference results for objects-in-contact with a confidence threshold of 0.1. We observed that (b) Mixup and (c) Mixup-EK increased the number of false positives (e.g., red bounding boxes far from the ground-truth boxes). In contrast, the predictions of (d) Background Mixup are less noisy and localize the object-in-contact more accurately with respect to the ground truth. Increasing the diversity of the background without changing the foreground semantics thus improves the performance of a hand-object detector without increasing the number of false positives.

5 Conclusion

We proposed Background Mixup, which mixes training images with background images that contain neither hands nor objects-in-contact, whereas Mixup mixes both the foreground (i.e., the hand and the object-in-contact) and the background. Background Mixup can improve the performance of a hand-object detector on small datasets, such as biomedical experiments and mock factory environments, by increasing the diversity of background appearances while avoiding the unintended effects caused by Mixup. We also showed that Background Mixup is effective in reducing the number of false positives.