“Quantifying Independently Reproducible Machine Learning”, 2020-02-06:
How reproducible is the latest ML research, and can we begin to quantify what impacts its reproducibility? This question served as the motivation for my NeurIPS 2019 paper. Based on a combination of masochism and stubbornness, over the past eight years I have attempted to implement various ML algorithms from scratch. This has resulted in an ML library called JSAT. My investigation into reproducible ML has also relied on personal notes and records hosted on Mendeley and GitHub. With these data, and clearly no instinct for preserving my own sanity, I set out to quantify and verify reproducibility! As I soon learned, I would be engaging in meta-science, the study of science itself.
…Some of the results were unsurprising. For example, the number of authors shouldn’t have any particular bearing on a paper’s reproducibility, and indeed it did not show a statistically significant relationship. Hyperparameters are the knobs we can adjust to change an algorithm’s behavior, but they are not learned by the algorithm itself. Instead, we humans must set their values (or devise a clever way to pick them). Whether or not a paper detailed the hyperparameters used was found to be statistically significant, and we can intuit why: if you don’t tell the reader what the settings were, the reader has to guess. That takes time and effort, and is error-prone! So, some of our results lend credence to ideas the community has already been pursuing to make papers more reproducible. What is important is that we can now quantify why these are good things to pursue. Other findings follow basic logic, such as the finding that papers that are easier to read are easier to reproduce, likely because they are easier to understand.
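To make the flavor of this kind of analysis concrete, here is a minimal sketch, with entirely invented counts, of testing whether a binary paper attribute (such as reporting hyperparameters) relates to reproduction success. This is not the paper’s data or its exact test; it only illustrates the general approach of relating paper features to reproduction outcomes.

```python
# Hypothetical sketch: does reporting hyperparameters relate to whether a
# paper could be reproduced? The 2x2 counts below are invented for
# illustration; they are NOT the paper's data.
from scipy.stats import fisher_exact

#                      reproduced  not reproduced
table = [[90, 30],   # hyperparameters reported
         [40, 45]]   # hyperparameters not reported

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p-value indicates the attribute and the reproduction outcome are
# unlikely to be independent, i.e. a statistically significant relationship.
```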
…Having fewer equations per page makes a paper more reproducible. (A sketch of how one might test this kind of relationship follows this list of findings.)
Empirical papers may be more reproducible than theory-oriented papers.
Sharing code is not a panacea
Papers with detailed pseudo code are just as reproducible as papers with no pseudo code.
Creating simplified example problems does not appear to help with reproducibility.
Please, check your email
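For a finding like the equations-per-page result above, the comparison is between a numeric paper feature and a binary outcome. Below is a hedged sketch, again with invented numbers, of one standard way to test such a relationship; the paper’s actual data and choice of test may differ.

```python
# Hypothetical sketch: comparing equation density between papers that were
# and were not successfully reproduced. All values are invented for
# illustration; they are NOT data from the paper.
from scipy.stats import mannwhitneyu

eqs_per_page_reproduced = [0.5, 1.0, 1.2, 2.0, 0.8, 1.5]
eqs_per_page_not_reproduced = [2.5, 3.0, 1.8, 4.0, 2.2, 3.5]

# Nonparametric test: are the two distributions of equations-per-page
# significantly different?
stat, p_value = mannwhitneyu(eqs_per_page_reproduced,
                             eqs_per_page_not_reproduced,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
# A significant difference would support the claim that equation density
# relates to how likely a paper is to be independently reproduced.
```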
…After completing this effort, my inclination is that there is room for improvement, but that we in the AI/ML field are doing a better job than most disciplines. A 62% success rate is higher than many meta-analyses from other sciences, and I suspect my 62% number is lower than reality…Finally, it has been pointed out to me that I may have created the most un-reproducible ML research ever: because the data are my own attempts at implementation, nobody else can replicate them exactly. In reality, this points to a number of open issues in how we do the science of meta-science, that is, how we study the way we implement and evaluate our research. With that, I hope I’ve encouraged you to read my paper for further details and discussion. Think about how your own work fits into the larger picture of human knowledge and science. As the avalanche of new AI and ML research continues to grow, our ability to leverage and learn from all this work will be highly dependent on our ability to distill ever more knowledge down to a digestible form. At the same time, our processes and systems must produce reproducible work that does not lead us astray. I have more work I would like to do in this space, and I hope you will join me.