Deep learning has enabled outstanding results across a wide range of applications in computer vision and image processing [alom2019state, shrestha2019review]. However, the diversity of datasets and neural network architectures necessitates a careful selection of the model architecture and training data that best match the target application. Often, many models are available for the same task. These models might be trained on different datasets, or might come in different capacities, architectures, or even bit precisions.
Motivation: A natural question that arises in this case is whether we can combine the neural networks so that one combined network can perform the same task as several input networks. Fig. 1 shows an example, where two input object detection models that detect ‘person’ and ‘vehicle’ are combined into one model. The benefits of combining models include: a) possible latency improvements due to running one inference as opposed to many, b) in case input models cover partially overlapping or non-overlapping classes/categories, one can build a stronger model with the union of the classes/categories through model composition (i.e. merging models’ skills as in Fig. 1), and c) for applications involving model deployment, e.g. for cloud services providers, it can reduce the deployment frequency/load.
Challenges: Creating a combined model from several input models is a challenging task. First, depending on the target task, the output model may need to have a specific architecture, and not necessarily one that is dictated by the input models. The input models themselves might also have different architectures. Second, in case input models are provided by users of a cloud system, or by different model creators/clients, the individual model owners would likely prefer not to share their training data, labels, or even weights or code. A privacy-preserving model composition approach should rely on only a minimum amount of information from the model creators. Third, input models may have only partially overlapping or disjoint class categories. This poses a major challenge when combining the individual models.
Existing methods: The existing solutions are mostly based on techniques such as knowledge distillation [hinton2015distilling, incrementaldistillation, banitalebidehkordi2021revisiting] or ensembling [zhou2002ensembling], which may be useful when classes/categories are identical and labeled data are available, but not for the case of arbitrary classes/categories with only unlabeled data. More details regarding the existing approaches are provided in section 2. In summary, to the best of our knowledge, the existing methods do not fully address the three challenges mentioned above.
Our contributions: In this paper, we propose a simple yet effective method for the model composition of neural networks. Our method supports the combination of an arbitrary number of networks with arbitrary architectures. To train a combined model, we leverage the abundance of unlabeled data; labels or the original training data of the input models are not required. However, if any labeled data are available, the algorithm uses them to further boost the performance of the output model. Furthermore, we put no restrictions on the type and number of object categories of the input models. We demonstrate the effectiveness of our method through an extensive set of experiments on the task of object detection.
2 Related works
Related to our work are the following approaches:
Network Ensembling: Ensembling is a common way of aggregating the predictions of multiple models. Ensembling strategies are well explored in the literature [zhou2002ensembling, casadoensemble, solovyev2019weighted]. The simplest approach is naive averaging of the predictions.
Architectural Combination: These methods create new architectures from the input models. Adaptive Feeding (AF) [zhou2017adaptive] proposes to simultaneously use a small and a large network that are trained to perform the same task. A linear classifier decides which examples go to the small model and which go to the large one. Their goal was to improve the inference speed. In another work, Unifying&Merging (U&M) [chou2018unifying] proposes to design a new architecture based on existing input architectures, to support learning multiple tasks.
MultiTask Networks: MultiTask networks learn multiple tasks in one model [ruder2017overview, VandenhendeGGB20, JhaKBN20]. The tasks are trained simultaneously, rather than by combining already trained individual networks.
Incremental Learning (IL): IL methods gradually add new categories while trying to limit catastrophic forgetting [peng2020faster].
Dataset Merging: Dataset merging [rame2018omnia] is the closest work to our study. It proposes to combine datasets by filling in the missing annotations of non-overlapping categories.
While existing works are related to the problem we study, none of them directly addresses it. Specifically, our proposed method combines neural networks trained for the same task (e.g. classification, detection, etc.) using unlabeled data. If any labels are available, it uses them to further boost the performance. In contrast, most existing methods mentioned above require labels to be available. In addition, our method supports overlapping or disjoint target label categories, while existing methods such as multi-task networks, U&M, AF, ensembling, or Dataset Merging only support homogeneous categories. Moreover, our method is architecture agnostic, whereas AF, MultiTask, IL, and Dataset Merging do not support arbitrary input model architectures.
It is also worth noting that most existing methods require access to input model weights or code, to construct a combined model. Methods such as multi-tasking, IL, Dataset Merging or U&M need full access to input models, in order to design a new combined architecture. Our method only requires an inference API, and thus treats the input models as black boxes, which in turn leads to a better privacy protection for the clients.
3 Model composition strategy
This section provides details regarding the model composition method we use in this paper. Fig. 2 shows the inputs and outputs of this process, and Fig. 3 shows a flow diagram of the steps within this method. As observed from Fig. 3, a number of models are provided as inputs. We then collect the predictions of these models over an unlabeled set of images. These predictions are filtered and aggregated to form a set of generated pseudo-labels, which is later used to train the output model. If any labeled data are available, the output model is fine-tuned on them. Algorithm 1 shows a breakdown of these steps.
An embodiment of how our method would be implemented for usage in a cloud services provider platform is demonstrated in Fig. 4. As shown in this figure, in the context of a cloud services provider, model composition can be leveraged for: less frequent model/data deployment/transfer, building stronger models, faster overall inference, empowering the model markets, and encouraging users to share their models in an incentive sharing strategy.
3.1 Filtering pseudo-labels
Since the input model predictions are not always perfectly accurate, the generated pseudo-labels will be noisy, and therefore less reliable. Such training examples could have an adverse impact on the training of the output model. We filter them out by employing an entropy-based thresholding mechanism.
For a given input $x$ and a network function $f$ such that $p = f(x)$ is the predicted class-probability vector, the entropy is given by $H(p) = -\sum_i p_i \log p_i$. An unreliable pseudo-label may be discarded if $H(p) > \tau$, for some threshold $\tau$. Although entropy thresholding does not guarantee perfect filtering of bad pseudo-labels, in practice it works well and has been used as a confidence indicator for similar purposes [teerapittayanon2016branchynet, saporta2020esl, rottmann2018deep]. Note that for some tasks such as object detection, models output a confidence score that can also be used for filtering bad pseudo-labels.
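The entropy-based filter described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the use of the natural logarithm, and the threshold value in the usage example are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_by_entropy(predictions, threshold):
    """Keep only confident predictions and turn them into pseudo-labels.

    predictions: list of (sample_id, probs), where probs is a list of
    class probabilities that sums to 1.
    threshold: hypothetical entropy threshold; predictions whose entropy
    exceeds it are discarded.
    Returns a list of (sample_id, predicted_class_index).
    """
    kept = []
    for sample_id, probs in predictions:
        if entropy(probs) <= threshold:
            label = max(range(len(probs)), key=probs.__getitem__)
            kept.append((sample_id, label))
    return kept


# Usage: the confident prediction survives, the uniform one is dropped.
preds = [("img_a", [0.9, 0.1]), ("img_b", [0.5, 0.5])]
pseudo_labels = filter_by_entropy(preds, threshold=0.5)
```

A lower threshold keeps fewer but cleaner pseudo-labels, so the value is a trade-off that would need to be tuned per task.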
3.2 Aggregation of pseudo-labels
Next, in the pseudo-label aggregation phase, we employ a consensus-based strategy, where the majority of the input models need to agree on a pseudo-label in order for it to qualify as a candidate and pass to the next step. Pseudo-label aggregation can be done in various ways such as unanimous (all models agree), affirmative (union of all predictions), consensus (majority voting), etc. [casadoensemble]. Our experiments showed that all these methods can be used with minor performance variations. We chose the consensus approach for the experiments since it intuitively makes more sense when combining a larger number of models (see Section 4 for a 10-model example). Note that for some tasks such as image classification, the aggregation is a simple majority voting mechanism. For other tasks such as object detection, it becomes more complicated due to the nature of the task. Here, we review our method of pseudo-label aggregation for object detection, which can also be extended to other similar tasks such as instance segmentation, tracking, etc.
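For the image classification case, the three aggregation modes named above can be illustrated as follows. This is a sketch under simplifying assumptions: every input model is assumed to support every class, each model casts exactly one vote per image, and the function name and the strict "more than half" majority rule are illustrative choices.

```python
from collections import Counter


def aggregate_labels(votes, mode="consensus"):
    """Aggregate per-model class votes for a single image.

    votes: list with one predicted label per input model.
    Returns the list of accepted labels (possibly empty).
    """
    counts = Counter(votes)
    n_models = len(votes)
    if mode == "unanimous":      # all models agree
        return [c for c, k in counts.items() if k == n_models]
    if mode == "consensus":      # strict majority of the models agree
        return [c for c, k in counts.items() if k > n_models / 2]
    if mode == "affirmative":    # union of all predictions
        return sorted(counts)
    raise ValueError(f"unknown aggregation mode: {mode}")


votes = ["cat", "cat", "dog"]
consensus = aggregate_labels(votes, "consensus")
unanimous = aggregate_labels(votes, "unanimous")
affirmative = aggregate_labels(votes, "affirmative")
```

With three voters, two agreeing models are enough for consensus, while the unanimous mode rejects the image and the affirmative mode keeps both candidate labels.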
Details of the pseudo-label aggregation strategy: Let $U$ denote the unlabeled dataset used, and let $N$ be the number of input models. The input to the pseudo-label aggregation procedure is a list $D = [D_1, \dots, D_N]$, where each $D_i$ is itself a list of detections from input model $i$ over all unlabeled training images in $U$. We then create a new list $P = [P_1, \dots, P_{|U|}]$ so that each $P_j$ contains the predictions of all models on one single image, and the length of $P$ is equal to the number of images in $U$.
Next, for each element $P_j$, we unite the predictions by their category names and the overlap of their bounding boxes. If the overlapped area of any two elements in $P_j$ is higher than a certain threshold, and these two elements are of the same category, then they are treated as detections of the same object and are grouped together into a sub-list $G_{j,k}$. Subsequently, we decide whether to keep each $G_{j,k}$ depending on the number of unique models with predictions included in $G_{j,k}$, denoted by $n_{j,k}$. In the strictest case, $G_{j,k}$ is kept only when $n_{j,k} = m_{j,k}$, where $m_{j,k}$ is the maximum number of models that may predict the object category corresponding to $G_{j,k}$; if we want majority voting, then $G_{j,k}$ is kept when $n_{j,k} > m_{j,k}/2$; if a simple stacking strategy is used, then $G_{j,k}$ is kept regardless of $n_{j,k}$. At this point, each retained group could still have several candidate detections for the same region. Processing all of them through the detection network is not only cumbersome but could also decrease the overall performance. Therefore, we applied soft non-maximum suppression (Soft-NMS) [bodla2017soft] to each retained group to filter the predictions a second time. Algorithm 2 formally captures these steps and Figure 5 demonstrates an example.
Remark 1: For image $j$, $G_{j,k}$ represents a list of bounding boxes predicted for a particular object $k$, i.e. detections of the same object by different models. $n_{j,k}$ is the number of unique models in $G_{j,k}$. $m_{j,k}$ is the number of models that have the category of $k$ in their label set (i.e. the number of models that actually have the capability of detecting that object category). As such, in general $n_{j,k} \le m_{j,k}$. In an ideal case where all eligible models can detect an object $k$, we will have $n_{j,k} = m_{j,k}$. If all input models have the same category label set, $m_{j,k} = N$. In the case that input models have different but overlapping categories (i.e. there is at least one category that is not supported by all models), for at least some $j$ and $k$, $m_{j,k} < N$. If all models have strictly different categories (no overlap), $m_{j,k} = 1$. Finally, if some particular categories belong to only one model, for those categories $m_{j,k} = 1$.
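The grouping and voting steps for a single image can be sketched as below. This is an illustrative approximation rather than Algorithm 2 itself. The detection dictionary layout, the function names, the greedy grouping against the first box of each group, the use of IoU as the overlap measure, and the strict-majority rule are assumptions. For brevity, the final Soft-NMS step is replaced by simply keeping the highest-scoring box of each retained group.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def group_detections(dets, iou_thr=0.5):
    """Greedily group same-category detections that overlap enough."""
    groups = []
    for det in dets:
        for group in groups:
            ref = group[0]
            if (ref["category"] == det["category"]
                    and iou(ref["box"], det["box"]) > iou_thr):
                group.append(det)
                break
        else:
            groups.append([det])
    return groups


def aggregate_image(dets, label_sets, mode="consensus", iou_thr=0.5):
    """Vote on each group; label_sets maps model id -> set of categories."""
    kept = []
    for group in group_detections(dets, iou_thr):
        category = group[0]["category"]
        # Unique models that contributed a detection to this group.
        n_unique = len({d["model"] for d in group})
        # Models whose label set contains this category (eligible voters).
        n_eligible = sum(category in cats for cats in label_sets.values())
        if mode == "unanimous":
            accept = n_unique == n_eligible
        elif mode == "consensus":
            accept = n_unique > n_eligible / 2
        elif mode == "affirmative":
            accept = True
        else:
            raise ValueError(f"unknown aggregation mode: {mode}")
        if accept:
            # Stand-in for Soft-NMS: keep the highest-scoring detection.
            kept.append(max(group, key=lambda d: d["score"]))
    return kept


label_sets = {"A": {"person", "car"}, "B": {"person"}, "C": {"person", "car"}}
dets = [
    {"model": "A", "category": "person", "box": (0, 0, 10, 10), "score": 0.9},
    {"model": "B", "category": "person", "box": (1, 1, 10, 10), "score": 0.8},
    {"model": "A", "category": "car", "box": (20, 20, 30, 30), "score": 0.7},
    {"model": "C", "category": "person", "box": (50, 50, 60, 60), "score": 0.6},
]
result = aggregate_image(dets, label_sets, mode="consensus")
```

In this example only the first 'person' object is kept: two of the three eligible models agree on it, whereas the 'car' and the second 'person' are each supported by a single model. Counting only eligible models is what lets categories owned by a single model pass the vote.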
Remark 2: The experiments in Section 4 contain various practical scenarios in which different aspects of our method are evaluated. Moreover, Figure 9 in the supplementary materials [supplementary] shows a scenario where 10 input models with a diverse count and type of object categories are combined. For example, in this figure ‘teddy bear’ appears in only one input model, whereas categories such as ‘bicycle’ and ‘potted plant’ appear in other subsets of the input models. In addition, we also explore in Section 4 the task of combining a face detection model with a mask detection one.
3.3 Training pipeline
Once pseudo-labels are filtered and combined, they will be used to train the output model architecture. Any available labeled data will be used in a final fine-tuning stage to improve the performance. Note that pseudo-labels are generated from unlabeled data. This is because input model owners may only share an interface to their model API for inference, not necessarily the weights, code, architecture, training data, or labels, to protect their privacy. We treat the input models as black boxes. In other words, we only pass a set of arbitrary unlabeled images through them, and collect their predictions to use as pseudo-labels. This further allows us to choose an arbitrary architecture and size for the output model that combines the class categories of the input models. Consequently, our model composition method is agnostic to the training hyper-parameters of the input models such as various optimizers, learning rate schedules, batch sizes, etc.
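The black-box collection step can be sketched as follows, assuming each input model is exposed only as an inference callable. The function name, the detection dictionary layout, and the confidence cut-off are hypothetical; in a real deployment each callable would wrap a call to the model owner's inference API.

```python
def collect_pseudo_labels(models, images, score_thr=0.5):
    """Query black-box models and gather their detections per image.

    models: dict mapping a model id to a callable(image) that returns a
        list of detection dicts with 'category', 'box', and 'score' keys.
    images: iterable of unlabeled images (any object the callables accept).
    score_thr: hypothetical confidence cut-off used as a first filter.
    Returns one list of detections per image, tagged with the model id.
    """
    per_image = []
    for image in images:
        detections = []
        for model_id, predict in models.items():
            for det in predict(image):
                if det["score"] >= score_thr:
                    detections.append({**det, "model": model_id})
        per_image.append(detections)
    return per_image


# Usage with stub callables standing in for remote inference APIs.
stub_models = {
    "A": lambda img: [
        {"category": "person", "box": (0, 0, 1, 1), "score": 0.9},
        {"category": "car", "box": (0, 0, 1, 1), "score": 0.2},
    ],
    "B": lambda img: [],
}
collected = collect_pseudo_labels(stub_models, ["img1", "img2"])
```

Because only the callables are touched, nothing in this step depends on the input models' weights, architectures, or training data.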
It is also worth noting that this way of creating composite models can help with light domain shifts. As we see in section 4, input models trained on different datasets (for the same task) can still be effectively combined (even with different sets of categories). To what extent exactly our method can robustly support domain shifts remains out of the scope of this work, and we leave that as a future direction.
4 Experiment results and discussions
4.1 Experiments setup
Selected model architectures: We selected object detection as the main experimental task due to its importance and widespread usage in practical applications. We also provide results on the task of image classification, as it is often used as a baseline experimental task. For object detection, we utilized the following architectures: EfficientDet-D0 [tan2020efficientdet], EfficientDet-D1 [tan2020efficientdet], and RetinaNet-ResNet-50 [lin2017focal]. For the classification task, we used ResNet-18 [resnet], ResNet-152 [resnet], and DenseNet-121 [densenet].
Datasets: We used three benchmark datasets for object detection: COCO [COCO], Pascal-VOC [Pascal-VOC], and Open-Images-V5 [OID] (referred to as OID hereafter). For classification, we used the Caltech-256 [Caltech-256] and OID datasets.
Evaluation metrics: We follow the common practice by using the mean Average Precision, mAP @IoU=0.50:0.95, as the main metric to evaluate the performance of object detection models. We report top-1 accuracy for classification.
Training protocols and settings: We adopt the code from [EffDet-repo] for the object detection experiments, and use the same training hyper-parameters with ImageNet [imagenet] pre-trained backbones. We trained all the models using SGD with a momentum of 0.9. We warmed up the learning rate during the first epoch and then trained for the remaining 300 epochs using a cosine decay rule, with a moving average decay set at 0.9998. Soft-NMS was utilized to filter the pseudo-label detections in our method. We used an IoU threshold of 0.5 and a confidence threshold of 0.001. For the classification experiments, models were trained for 200 epochs, using an in-house code-base. An SGD optimizer with momentum 0.9 was used, and the learning rate was exponentially increased from 0 to 0.01 over the first 8 epochs and then annealed down exponentially to 0.0001 in the remaining epochs.
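The shape of the detection learning-rate schedule (a one-epoch warm-up followed by cosine decay over the remaining 300 epochs) can be sketched as below. The learning-rate values in the usage example are placeholders, not the settings used in the experiments. The linear form of the warm-up and the decay to zero are likewise assumptions made for illustration.

```python
import math


def lr_at_epoch(epoch, base_lr, warmup_start_lr,
                warmup_epochs=1, total_epochs=301):
    """Learning rate at a (possibly fractional) epoch.

    Linear warm-up from warmup_start_lr to base_lr during the first
    warmup_epochs, then cosine decay from base_lr towards zero over the
    remaining total_epochs - warmup_epochs epochs.
    """
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Placeholder values chosen only to make the curve concrete.
schedule = [lr_at_epoch(e, base_lr=0.1, warmup_start_lr=0.01)
            for e in (0, 0.5, 1, 151, 301)]
```

The weight moving average mentioned in the text is a separate mechanism and is not modeled here.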
4.2 Object detection results
Our experiments are categorized in various scenarios, which are explained in this subsection. These scenarios cover various possible cases of input models’ architectures, training data, and what kind of unlabeled data were used in our algorithm. Table 1 provides a summary of these scenarios. In Table 1, training data in each case is constructed from the training set of VOC, COCO, OID, or a subset of them (unlabeled). Validation sets are also built from the validation sets of VOC and COCO: a subset of COCO (union of input categories) for scenario 1, union of the val set of COCO & VOC for scenario 2 & 3, and val set of COCO for scenario 4. As such, validation sets may be different across different scenarios, but are the same within one scenario. Moreover, the class distributions of data for scenario 1 & 4 are shown in Table 2 and Figure 9 (supplementary), and for scenario 2 & 3 it follows the distributions of COCO & VOC. Next, we will go over the details of each experiment.
Scenario 1: Combining detectors with different expertise: We took 3 models, each trained on a subset of the COCO dataset but designed for a different purpose, one for detection of transportation related objects, one for sports related, and the other for home objects. These categories have some partial overlap. Table 2 shows the object categories used for each model. The combined model achieved by our model composition procedure combines the skills of the input models, and builds a stronger model with all object categories. We tried our method in two ways: one using unlabeled COCO images (similar data distribution to training data, but without using the labels), and the other using unlabeled OID (open images) dataset (entirely different dataset with different distribution). The upper-bound of the performance would be to train a model with all labels of all object categories (supervised). This model achieved 35.11% mAP on validation set of COCO (considering only object categories corresponding to the ones it was trained on). On the same validation set, our method achieved 32.61% when using unlabeled COCO, and 30.97% when using unlabeled OID. This shows that our method can effectively combine the models with different expertise, and achieve a performance close to that of the supervised upper-bound model. We further investigated the performance of our method if partial labels are available for fine-tuning, in a semi-supervised manner. Table 3 shows the results for this experiment. As observed, with fine-tuning, our method could even surpass the supervised model with 100% of labels.
Table 1: Summary of the experiment scenarios.
| Scenario | Architecture | Model role | Data | Data size |
|---|---|---|---|---|
| Scenario 1 | EffDet-D0 | input (supervised) | COCO subset 1 | 72K |
| Scenario 1 | EffDet-D0 | input (supervised) | COCO subset 2 | 66K |
| Scenario 1 | EffDet-D0 | input (supervised) | COCO subset 3 | 81K |
| Scenario 1 | EffDet-D0 | Upper-bound | COCO subsets union | 89K |
| Scenario 1 | EffDet-D0 | ModelComp (Ours) | unlabeled COCO | 118K |
| Scenario 1 | EffDet-D0 | ModelComp (Ours) | unlabeled OID | 1.9M |
| Scenario 2 | EffDet-D0 | ModelComp (Ours) | unlabeled COCO+VOC | 135K |
| Scenario 2 | EffDet-D0 | ModelComp (Ours) | unlabeled OID | 1.9M |
| Scenario 3 | RetinaNet-R50 | ModelComp (Ours) | unlabeled COCO+VOC | 135K |
| Scenario 3 | RetinaNet-R50 | ModelComp (Ours) | unlabeled OID | 1.9M |
| Scenario 4 | EffDet-D0 | 10 inputs (supervised) | 10 COCO partitions | 12K each |
| Scenario 4 | EffDet-D0 | ModelComp (Ours) | unlabeled COCO | 118K |
| Scenario 4 | EffDet-D0 | ModelComp (Ours) | unlabeled OID | 100K |
Table 2: Object categories supported by each input model in scenario 1.
| Model skill | Categories supported |
|---|---|
| Transportation | person, bicycle, car, motorcycle, bus, truck, traffic light, fire hydrant, stop sign, parking meter |
| Sports | person, bicycle, frisbee, skis, snowboard, sports ball, skateboard, baseball bat, baseball glove, motorcycle |
| Home | person, bicycle, chair, couch, bed, dining table, skateboard, refrigerator, toilet, tv |
|Proportion of labeled data used (%)|
|Scenario 1||Ours: COCO+FT||32.6||32.6||33.5||33.7||34||34||35.7|
|Scenario 2||Ours: [COCO+VOC]+FT||29||29.2||29.3||30||30.3||30.4||33.1|
|Scenario 3||Ours: [COCO+VOC]+FT||34||34.2||34.6||35||35.4||35.9||38|
|Scenario 4||Ours: COCO+FT||24.4||24.5||26.7||27.7||28.6||29.1||33.1|
Scenario 2: Combining input models that are trained on entirely different datasets. In scenario 1, input models had different expertise, obtained by training on different subsets of COCO (examples roughly came from a similar distribution). Scenario 2 investigates a more challenging case, where input models were trained with data from entirely different datasets, and hence different distributions. To this end, we trained input models on the COCO and Pascal-VOC datasets, respectively. Similar to the previous scenario, we studied two choices of unlabeled data for our Model Composition method: a) unlabeled data from the same distribution as the training data (in this case COCO+VOC images without using labels), and b) unlabeled data from a different dataset altogether, e.g. the OID dataset. Note that the input models were trained on a different number of object categories (with overlap), and the output combined model was trained to support the union of the object categories of the input models.
Table 3 shows the results of this experiment. It is observed from Table 3 that in the unsupervised case (i.e. no labeled data was used), our method achieves 29% and 27.4% mAP, close to the fully supervised performance of the upper-bound model. We also see from Table 3 that when partially labeled data are used for further fine-tuning, our method shows significant improvements over supervised training. In particular, when using 1%, 5%, and 10% of labels, our method shows +22.1%, +13%, and +10.3% gaps over supervised training.
Scenario 3: Combining input models with different architectures that are trained on entirely different datasets. In this scenario, we studied the most generic case, in which input models have different architectures, are trained on different datasets, and have a different number of object categories. The output model was also chosen to have a different architecture than the input models (see Table 1). This scenario evaluated whether our method can combine the knowledge of models trained under different circumstances, with different data and architectures, into a desired new and different architecture.
Table 3 shows the results of this experiment. It is observed from Table 3 that our method is very effective, and in some cases performs even better than supervised training with 100% of labels. When partial labels are available for fine-tuning, our method shows a strong performance, with large gaps compared to supervised training, especially in the low label range. Moreover, Table 3 shows that unsupervised training with our method achieved an mAP of 34%, only 1% below supervised training with all labels. After fine-tuning, we were able to meet the same performance as supervised training with only 10% of the original data.
Scenario 4: Having a large number of input models. This scenario investigates the case when a larger number of input models are provided. This increases the diversity among the models, since they can be trained on different data or object categories, and thus results in a more challenging situation. To this end, we assumed 10 input models. Each model was trained on a randomly selected subset of the COCO dataset, so that the training data for each model had no overlap with that of the other models. However, object categories could overlap, as their type and count were chosen randomly. The supplementary materials [supplementary] provide a visualization of the type and count of the object categories used for these 10 models. Since each model was trained with roughly 10% of the COCO training set, the different number of object categories for different models resulted in a different per-class size of training data. The imbalance here made model composition harder, but mimicked realistic situations where training data can in fact be imbalanced for input models. As mentioned, for these 10 models, categories were randomly selected and the number of categories was selected from 5, 10, 20, 30, and 40. Note that generating pseudo-labels from 10 models on unlabeled data can be time-consuming (although it can be parallelized in production). Therefore, we only used 100K randomly selected examples from the OID dataset for this experiment.
We observe from Table 3 that our model composition method can effectively combine the 10 input models into a single new model with the union of their object categories.
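The data split described for scenario 4 can be approximated as follows. This is a hypothetical sketch of one way to produce such a split, not the procedure actually used: the function name, the round-robin image assignment, and the seeding are assumptions, while the number of models and the candidate category counts follow the text.

```python
import random


def make_partitions(image_ids, categories, n_models=10,
                    category_counts=(5, 10, 20, 30, 40), seed=0):
    """Split images into disjoint subsets with random category sets.

    Each subset receives roughly 1/n_models of the images (no image is
    shared between subsets) and a randomly sized, randomly chosen set of
    categories, so category sets may overlap between subsets.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    all_categories = list(categories)
    partitions = []
    for i in range(n_models):
        subset = ids[i::n_models]  # disjoint round-robin slice
        k = min(rng.choice(category_counts), len(all_categories))
        chosen = sorted(rng.sample(all_categories, k))
        partitions.append({"images": subset, "categories": chosen})
    return partitions


# Usage with toy identifiers (80 placeholder category names).
parts = make_partitions(range(1000), [f"cat_{i}" for i in range(80)])
```

Because the category count varies between subsets while the image count stays roughly constant, the per-class amount of training data differs across models, which reproduces the imbalance discussed above.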
Table 4: Results of combining face and mask detection models.
| Model/Expertise | Train set | Validation set | AP (%) |
|---|---|---|---|
| input: Face (D0) | face data 1 (20007) | face data 1 (4079) | 52.29 |
| input: Face (D0) | MAFA-faces (30870) | MAFA-faces (5338) | 44.86 |
| input: Mask (D1) | MAFA-masks (30870) | MAFA-masks (5338) | 29.63 |
| ModelComp (R50): w/o filtering & aggregation | face+mask (50877) | face+mask (9417) | 30.72 |
| ModelComp (R50): w/o aggregation | face+mask (50877) | face+mask (9417) | 34.48 |
| Ours, ModelComp: Face & Mask (R50) | face+mask (50877) | face+mask (9417) | 38.90 |
Remark 3: A note on the unsupervised performance of OID: As observed in Table 3, in the challenging scenarios 3 and 4, the unsupervised (0% labels) performance of model composition with OID is considerably lower than that of COCO or [COCO+VOC]. In this regard, a few points are worth mentioning:
In general, using unrelated arbitrary data is expected to result in a lower performance compared to using data from the same distribution as the input models’ train set, since pseudo-labels will be less reliable. This is exacerbated in challenging tasks such as scenario 3 where input models are trained on different data and have different architectures with respect to each other and the output model; or in scenario 4 where there are a large number of input models trained on different small-scale data.
It is worth reminding that the case of purely unsupervised model composition means combining an arbitrary number of black-box models (trained on arbitrary data with arbitrary architecture or categories), all without using any labels. In that sense, the real baseline to compare against is the supervised training, which performs much worse than model composition in low data regimes, even in the case of unrelated OID data.
Moreover, the main goal of the paper is to explore whether or not neural networks can be combined using only unlabeled data, and if yes, to what extent (hence the title). We observe from the results that the answer is for the most part yes; however, in case unlabeled data from the original distributions was not available, for some challenging scenarios a small percentage of labels may be needed to achieve a decent performance.
In a completely unsupervised setting, model composition can still effectively combine input models. The performance will be improved if the unlabeled set size is larger.
Remark 4: A note on practical applications: As mentioned in Section 3, a fundamental motivation for our work is a cloud services application, as shown in Figure 4, in which engineers and expert users can leverage a model composition service to build stronger models with combined skills, especially in the presence of a large variety of trained models and datasets on the cloud. Different scenarios in the experiments were also inspired by such a philosophy, but designed at various levels of difficulty. Here, we add a new practical use-case. In this new experiment, we combine separate models of face and mask detection, to build one that is suitable for both face & mask detection. Results are shown in Table 4.
Ablation on the number of categories: Next, we study the impact of the number of class categories on the performance. To this end, we take two input models from scenario 4, and combine them with a varying number of classes. The first model is trained on 5 object categories and the second on 10 (see Figure 9 in the supplementary materials). Each time we add a number of random categories of the second model to those of the first, so the combined model can have 5, 6, …, 15 classes. Table 5 shows the results. Note that in each case the validation/training set will be different, as it includes images of a particular set of categories. The two input models have mAP of 27.1% and 25.4%, respectively (each has roughly 12K training and 1K validation examples). In general, a higher number of classes results in slightly lower mAP, but we should also note that the unlabeled set also becomes larger (i.e. more pseudo-labels).
4.3 Image classification results
In addition to our main results on the task of object detection, we also provide a highlight of our results on the task of image classification. Similar to object detection, we designed the classification experiments in the form of different scenarios.
Scenario 1: 3 input models, ResNet-18, each trained on 1/3 of the Caltech-256 dataset.
Scenario 2: 3 models, ResNet-18, ResNet-152, and DenseNet-121, trained on Caltech-256.
In both scenarios, we tried model composition with unlabeled data from the Caltech-256 dataset (i.e. similar data distribution but without labels), and with a 160K-image unlabeled subset of OID (i.e. a different dataset altogether). Table 6 shows the results for these scenarios.
In Table 7, we provide a comparison between our method and two additional baselines: i) a simple model ensemble that directly aggregates the predictions of the input models; ii) knowledge distillation using the input models as teachers, such as [hinton2015distilling, ahn2019variational]. For the second baseline, we consider vanilla distillation [hinton2015distilling] but with soft labels.
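The two baselines can be illustrated with the following sketch. It assumes that all teachers share the same label set, that the ensemble simply averages class probabilities, and that the distillation target is the averaged teacher distribution; these choices and the function names are illustrative, not a description of the exact baseline implementations.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(prob_lists):
    """Average the class-probability vectors of several models."""
    n = len(prob_lists)
    return [sum(column) / n for column in zip(*prob_lists)]


def soft_label_distillation_loss(student_logits, teacher_probs,
                                 temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0.0)


# Usage: two teachers, one student, a two-class problem.
teacher_target = ensemble_average([[0.8, 0.2], [0.6, 0.4]])
loss = soft_label_distillation_loss([0.0, 0.0], [0.5, 0.5])
```

Both baselines require the teachers to agree on the set of classes, which is the restriction that model composition is designed to lift.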
It is observed from the results that the proposed method is effective in combining image classification models. In both the unsupervised and semi-supervised cases, our method performs competitively compared to supervised models, even when 100% of labels are used.
|Proportion of labeled data used (%)|
|Scenario 1||Ours: Caltech+FT||83||82.6||82.8||82.9||83||83.5|
|Scenario 2||Ours: Caltech+FT||83.2||82.1||81.8||81.8||83.1||83.3|
|Scenario 1||Scenario 2|
5 Conclusion
This paper proposed a method for combining multiple trained neural networks into a single model, using unlabeled data. To this end, first the input models’ predictions (pseudo-labels) were collected. The pseudo-labels were then filtered based on confidence scores of the predictions. Next, a consensus aggregation strategy was incorporated to combine these pseudo-labels. The remaining pseudo-labels were used to train the output model. The proposed method supported the use of an arbitrary number of input models with arbitrary architectures and categories. Performance evaluations on various datasets, tasks, and network architectures demonstrated the effectiveness of the proposed method.
6 Supplementary materials
This section contains the supplementary materials.
6.1 Source code
We share our implementation code to make it easy to reproduce our results. The source-code is attached to the supplementary materials in a ‘code’ directory. We also provide detailed instructions for training and evaluating our models in ‘README.md’ files.
6.2 Additional visualizations
Fig. 6 provides a visualization of object detection results of Table 3. We observe from this figure that in low data regimes model composition performs considerably better than supervised training with partial data. Fig. 7 shows an extended visualization on the cloud embodiment introduced in the paper. In this figure, we provide an easier comparison between before & after incorporating the model composition as a service. Moreover, Fig. 8 demonstrates an example of pseudo-label aggregation procedure of Algorithm 2. In addition, Fig. 9 visualizes the data splits of object detection scenario 4, where we combined 10 models trained on different COCO subsets.