Revealing the Underlying Patterns: Investigating Dataset Similarity, Performance, and Generalization

08/07/2023
by   Akshit Achara, et al.
0

Supervised deep learning models require significant amount of labelled data to achieve an acceptable performance on a specific task. However, when tested on unseen data, the models may not perform well. Therefore, the models need to be trained with additional and varying labelled data to improve the generalization. In this work, our goal is to understand the models, their performance and generalization. We establish image-image, dataset-dataset, and image-dataset distances to gain insights into the model's behavior. Our proposed distance metric when combined with model performance can help in selecting an appropriate model/architecture from a pool of candidate architectures. We have shown that the generalization of these models can be improved by only adding a small number of unseen images (say 1, 3 or 7) into the training set. Our proposed approach reduces training and annotation costs while providing an estimate of model performance on unseen data in dynamic environments.

READ FULL TEXT

page 5

page 8

page 11

page 12

research
11/17/2022

Data-Centric Debugging: mitigating model failures via targeted data collection

Deep neural networks can be unreliable in the real world when the traini...
research
05/19/2022

Dataset Pruning: Reducing Training Data by Examining Generalization Influence

The great success of deep learning heavily relies on increasingly larger...
research
05/22/2022

Investigating classification learning curves for automatically generated and labelled plant images

In the context of supervised machine learning a learning curve describes...
research
09/16/2020

Exploring Bayesian Surprise to Prevent Overfitting and to Predict Model Performance in Non-Intrusive Load Monitoring

Non-Intrusive Load Monitoring (NILM) is a field of research focused on s...
research
05/20/2021

Probing the Effect of Selection Bias on NN Generalization with a Thought Experiment

Learned networks in the domain of visual recognition and cognition impre...
research
11/15/2021

Reducing the Long Tail Losses in Scientific Emulations with Active Learning

Deep-learning-based models are increasingly used to emulate scientific s...
research
08/14/2020

Quantification of BERT Diagnosis Generalizability Across Medical Specialties Using Semantic Dataset Distance

Deep learning models in healthcare may fail to generalize on data from u...

Please sign up or login with your details

Forgot password? Click here to reset