From Fake to Real (FFR): A two-stage training pipeline for mitigating spurious correlations with synthetic data
Visual recognition models are prone to learning spurious correlations induced by an imbalanced training set where certain groups (Females) are under-represented in certain classes (Programmers). Generative models offer a promising direction in mitigating this bias by generating synthetic data for the minority samples and thus balancing the training set. However, prior work that uses these approaches overlooks that visual recognition models could often learn to differentiate between real and synthetic images and thus fail to unlearn the bias in the original dataset. In our work, we propose a novel two-stage pipeline to mitigate this issue where 1) we pre-train a model on a balanced synthetic dataset and then 2) fine-tune on the real data. Using this pipeline, we avoid training on both real and synthetic data, thus avoiding the bias between real and synthetic data. Moreover, we learn robust features against the bias in the first step that mitigate the bias in the second step. Moreover, our pipeline naturally integrates with bias mitigation methods; they can be simply applied to the fine-tuning step. As our experiments prove, our pipeline can further improve the performance of bias mitigation methods obtaining state-of-the-art performance on three large-scale datasets.
READ FULL TEXT