Overlearning Reveals Sensitive Attributes
'Overlearning' means that a model trained for a seemingly simple objective implicitly learns to recognize attributes that are (1) statistically uncorrelated with the objective, and (2) sensitive from a privacy or bias perspective. For example, a binary gender classifier of facial images also learns to recognize races (even races that are not represented in the training data) and identities. We demonstrate overlearning in several image-analysis and NLP models and analyze its harmful consequences. First, inference-time internal representations of an overlearned model reveal sensitive attributes of the input, breaking privacy protections such as model partitioning. Second, an overlearned model can be 're-purposed' for a different, uncorrelated task. Overlearning may be inherent to some tasks. We show that techniques for censoring unwanted properties from representations either fail or degrade the model's performance on both the original and the unintended tasks. This is a challenge for regulations that aim to prevent models from learning or using certain attributes.
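As a rough illustration of the first consequence (inference-time representations leaking sensitive attributes), the sketch below trains a linear probe on the frozen intermediate representation of a model that was optimized for an unrelated main task. All model names, dimensions, and the synthetic data here are illustrative assumptions, not the paper's actual architectures, datasets, or attack pipeline.

```python
# Hypothetical sketch of an overlearning probe: fit a small classifier on an
# intermediate representation of a model trained for a different objective.
# Shapes, labels, and synthetic data are assumptions for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

class MainModel(nn.Module):
    """Model trained for a seemingly simple main task (e.g., binary classification)."""
    def __init__(self, in_dim=128, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, 2)  # main-task logits

    def forward(self, x):
        z = self.encoder(x)        # intermediate representation observed at inference time
        return self.head(z), z

model = MainModel()
# (Assume `model` has already been trained on the main task; training is omitted.)

# The adversary observes intermediate representations (e.g., in a model-partitioning
# setting) plus a small set labeled with a hypothetical sensitive attribute.
x_attack = torch.randn(256, 128)            # inputs whose representations are observed
y_sensitive = torch.randint(0, 10, (256,))  # hypothetical sensitive labels (e.g., identity)

with torch.no_grad():
    _, reps = model(x_attack)               # frozen representations of the main model

probe = nn.Linear(reps.shape[1], 10)        # linear probe for the sensitive attribute
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(reps), y_sensitive)
    loss.backward()
    opt.step()

# Probe accuracy well above chance would indicate that the representation encodes
# the sensitive attribute even though it was never a training target.
with torch.no_grad():
    acc = (probe(reps).argmax(dim=1) == y_sensitive).float().mean()
print(f"probe accuracy on sensitive attribute: {acc:.2f}")
```

On real data, comparing the probe's accuracy against chance level (and against a probe trained on raw inputs) is one way to gauge how much of the sensitive attribute the main model's representation has absorbed.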