Dr Thomas Lyall Keevers1
1DSTG, Sydney, Australia
Cross-validation is the de facto standard for assessing model quality in machine learning. One data set is used to train or calibrate a complex model, while a second, independent data set is used to estimate its generalization performance. The elegance and simplicity of this approach are sometimes taken to render traditional theoretical approaches unnecessary. This attitude is exemplified in Chris Anderson’s provocative essay “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”,
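The train-then-test procedure described above can be sketched in a few lines. This is a minimal illustration using synthetic data and a deliberately simple least-squares "model"; the data, fold count, and helper functions are all hypothetical choices, standing in for any learner one might validate this way.

```python
import numpy as np

# Synthetic data (hypothetical): y is a noisy linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)

def fit(X_train, y_train):
    # "Training": estimate a slope by least squares.
    return (X_train @ y_train) / (X_train @ X_train)

def mse(slope, X_test, y_test):
    # Mean squared error on data the model never saw during fitting.
    return np.mean((y_test - slope * X_test) ** 2)

# k-fold cross-validation: each fold takes a turn as the held-out set,
# while the remaining folds are used for training.
k = 5
folds = np.array_split(np.arange(len(X)), k)
scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    slope = fit(X[train_idx], y[train_idx])
    scores.append(mse(slope, X[test_idx], y[test_idx]))

cv_error = np.mean(scores)  # estimate of generalization error
```

The averaged held-out error is what the essay's "Correlation is enough" position treats as the final word on model quality.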
‘Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.’
However, a number of recent studies call into question the sufficiency of cross-validation. Neural networks that achieve human-level accuracy in image recognition are vulnerable to adversarial examples: images that become misclassified after minuscule, carefully chosen manipulations, despite the models’ strong cross-validation performance. Google Flu Trends accurately predicted influenza outbreaks for several years before the model suddenly mispredicted outbreak timings and intensities.
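The adversarial-example phenomenon can be illustrated with a toy linear classifier. The numbers below are hypothetical and chosen only to show the mechanism: a small, gradient-sign-style perturbation to every input feature accumulates across dimensions and flips a confidently correct prediction, in the spirit of the fast gradient sign method.

```python
import numpy as np

# Toy linear classifier (hypothetical weights, not a real image model):
# score(x) = w . x, predicting class +1 when the score is positive.
d = 100
w = np.full(d, 0.5)    # classifier weights
x = np.full(d, 0.02)   # input, confidently classified as +1

clean_score = w @ x    # 100 * 0.5 * 0.02 = 1.0

# Adversarial perturbation: nudge every feature by a small epsilon in
# the direction that decreases the score (here, against sign(w)).
eps = 0.03
x_adv = x - eps * np.sign(w)

adv_score = w @ x_adv  # 1.0 - eps * sum(|w|) = 1.0 - 1.5 = -0.5
```

Each feature moves by only 0.03, yet the score drops by `eps * sum(|w|)`, which grows with the input dimension; in high-dimensional settings such as images, imperceptible per-pixel changes can therefore overturn a confident classification.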
We argue that the data science community should embrace a holistic approach to model validation. This means complementing cross-validation with assessment of data quality and completeness, identification of subjective judgements in model assessment, and an extended peer review process.
Thomas Keevers is an analyst at Defence Science and Technology Group. He facilitates the Data Science Reading Group at Australian Technology Park.