MedShift: identifying shift data for medical dataset curation

12/27/2021
by Xiaoyuan Guo, et al.

To curate a high-quality dataset, identifying data variance between internal and external sources is a fundamental and crucial step. However, methods for detecting shift or variance in data have not been studied extensively. Two challenges stand in the way: the lack of effective approaches for learning dense representations of a dataset, and the difficulty of sharing private data across medical institutions. To overcome these problems, we propose a unified pipeline called MedShift to detect the top-level shift samples and thus facilitate medical data curation. Given an internal dataset A as the base source, we first train anomaly detectors for each class of dataset A to learn the internal distributions in an unsupervised way. Second, without exchanging data across sources, we run the trained anomaly detectors on an external dataset B for each class. The data samples with high anomaly scores are identified as shift data. To quantify the shiftness of the external dataset, we cluster B's data into groups class-wise based on the obtained scores. We then train a multi-class classifier on A and measure the shiftness by how the classifier's performance on B varies as we gradually drop the group with the largest anomaly score for each class. Additionally, we adapt a dataset quality metric to help inspect the distribution differences across multiple medical sources. We verify the efficacy of MedShift on musculoskeletal radiographs (MURA) and chest X-ray datasets from more than one external source. Experiments show that our proposed shift data detection pipeline can help medical centers curate high-quality datasets more efficiently. An interface introduction video to visualize our results is available at https://youtu.be/V3BF0P1sxQE.
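The flow described above (per-class anomaly detectors fitted on A, scoring of B, class-wise grouping by score, and classifier evaluation while dropping the highest-scoring groups) can be illustrated with a small sketch. This is not the authors' implementation: a per-feature z-score distance stands in for the learned anomaly detectors, a 1-D k-means stands in for the clustering step, and a nearest-centroid rule stands in for the multi-class classifier. All function names (`fit_detector`, `kmeans_1d`, `shift_curve`, and so on) and the synthetic 2-D data are hypothetical and chosen only for illustration.

```python
import random
import statistics

def fit_detector(samples):
    """Stand-in per-class detector: per-feature mean and std of internal data."""
    cols = list(zip(*samples))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero std
    return means, stds

def anomaly_score(detector, x):
    """Distance of x from the class centre, in standardised units."""
    means, stds = detector
    return sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)) ** 0.5

def kmeans_1d(scores, k, iters=20):
    """Group scalar scores into k clusters; label 0 is the lowest-score group."""
    lo, hi = min(scores), max(scores)
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(s - centres[j])) for s in scores]
        for j in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == j]
            if members:
                centres[j] = statistics.fmean(members)
    return labels

def fit_classifier(data_by_class):
    """Stand-in classifier: one centroid per class, learned on internal data."""
    return {c: [statistics.fmean(col) for col in zip(*xs)]
            for c, xs in data_by_class.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def shift_curve(internal, external, k=3):
    """Accuracy on external data after dropping 0, 1, ..., k-1 top groups per class."""
    detectors = {c: fit_detector(xs) for c, xs in internal.items()}
    clf = fit_classifier(internal)
    grouped = []  # (group index, true class, sample)
    for c, xs in external.items():
        scores = [anomaly_score(detectors[c], x) for x in xs]
        grouped += [(g, c, x) for g, x in zip(kmeans_1d(scores, k), xs)]
    curve = []
    for dropped in range(k):
        kept = [(c, x) for g, c, x in grouped if g < k - dropped]
        correct = sum(predict(clf, x) == c for c, x in kept)
        curve.append(correct / len(kept) if kept else float("nan"))
    return curve

def blob(rng, centre, n, spread=1.0):
    """Synthetic Gaussian samples around a centre (toy data only)."""
    return [[rng.gauss(c, spread) for c in centre] for _ in range(n)]

rng = random.Random(0)
internal = {0: blob(rng, (0, 0), 200), 1: blob(rng, (6, 0), 200)}
external = {  # each class mixes in-distribution samples with 50 shifted ones
    0: blob(rng, (0, 0), 150) + blob(rng, (5, 5), 50),
    1: blob(rng, (6, 0), 150) + blob(rng, (1, -5), 50),
}
curve = shift_curve(internal, external)
print(curve)
```

In this toy setting, a large gap between the first and last entries of `curve` suggests that the highest-scoring groups of the external data differ substantially from the internal distribution. Only anomaly scores and group labels are computed on the external side, which mirrors the abstract's point that the raw data need not be exchanged.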


