Leakage and the Reproducibility Crisis in ML-based Science by Sayash Kapoor, a second-year PhD candidate at Princeton University.

IDSAI Research Seminar Series 2022-2023

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences.

An Institute for Data Science and Artificial Intelligence seminar
Date	9 November 2022
Time	13:30 to 14:30
Place	Streatham Court Old C. Hybrid delivery by Zoom.

Event details

However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems.

We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

Delivery and Registration:

The seminar will be delivered in hybrid format, with Sayash presenting remotely. To register, please click here. Registration closes: Wednesday, 9 November 2022 at 09:00 (GMT).

Whilst we appreciate the flexibility that hybrid delivery brings, we would encourage you to come along in person, where there will be tea and coffee beforehand at 13:00.

If you have any queries, please contact idsai@exeter.ac.uk.

This forms part of the IDSAI Research Seminar Series for 2022-2023. Click here to find out more.
