Moderator: Solon Barocas
Is my dataset biased? The answer is likely, yes. In machine learning, “dataset bias” happens when the training data is not representative of future test data. Finite datasets cannot include all variations possible in the real world, so every machine learning dataset is biased in some way. Yet, machine learning progress is traditionally measured by testing on in-distribution data. This obscures the real danger that models will fail on new domains. For example, a pedestrian detector trained on pictures of people in the sidewalk could fail on jaywalkers. A medical classifier could fail on data from a new sensor or hospital. The good news is, we can fight dataset bias with techniques from domain adaptation, semi-supervised learning and generative modeling. I will describe the evolution of efforts to improve domain transfer, their successes and failures, and a vision for the future.