Reproducible Machine Learning Results By Default
It is best practice to have reproducible results in software projects. In fact, it may well be the norm by now, and if it is not, it should be.
Any developer chosen at random should be able to follow your process to check out the code base from revision control and produce a working build of the software, ready to use. It is even better if you also have a procedure for setting up an environment and for releasing the software to users or operational environments.
It is the tools and the process that make the result reproducible. In this blog post, you will see that it is just as critical to make the results of your machine learning projects reproducible, and that practitioners and researchers in machine learning struggle with this.
As a programmer and developer, you already have the tools and the process needed to leap ahead, if you can apply the discipline.
Reproducibility of Results in the Computational Sciences
Reproducibility of experiments is one of the main principles of the scientific method. You write up what you did, but other scientists do not have to take your word for it: they can follow the same procedure and expect to get the same result.
Work in the computational sciences consists of code, running on computers, that reads and writes data. Experiments reported without explicitly capturing these elements are very likely not easily reproducible. And if the experiment cannot be reproduced, of what value is the work?
This is an open problem in the computational sciences, and it is becoming more of a concern as more fields rely on computational experimental results. In this section, we review this open problem through a few papers that consider the issue.
Ten Simple Rules for Reproducible Computational Research
This article was published in PLoS Computational Biology in 2013 by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. In it, the authors describe ten simple rules that, if followed, are expected to result in more accessible and reproducible computational research. The rules are summarized here:
- Rule 1: For every result, keep track of how it was produced.
- Rule 2: Avoid manual data manipulation steps.
- Rule 3: Archive the exact versions of all external programs used.
- Rule 4: Version control all custom scripts.
- Rule 5: Record all intermediate results, when possible in standardized formats.
- Rule 6: For analyses that include randomness, note underlying random seeds (see the sketch after this list).
- Rule 7: Always store raw data behind plots.
- Rule 8: Generate hierarchical analysis output, allowing layers of increasing detail to be inspected.
- Rule 9: Connect textual statements to underlying results.
- Rule 10: Provide public access to scripts, runs, and results.
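Rule 6 is straightforward to act on in code. Here is a minimal sketch, assuming Python with NumPy and an illustrative result.json output file, of fixing the random seed and recording it alongside the result it produced:

```python
# Minimal sketch of Rule 6: fix the random seed and record it with the result.
# The output filename and the stand-in "experiment" are illustrative only.
import json
import random

import numpy as np

SEED = 1337  # chosen once and stored with the results

random.seed(SEED)
np.random.seed(SEED)

# Stand-in for an analysis that depends on randomness.
scores = [random.random() for _ in range(5)]

# Record the seed next to the outcome so the run can be repeated exactly.
with open("result.json", "w") as f:
    json.dump({"seed": SEED, "scores": scores}, f, indent=2)
```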
The authors are writing from the field of computational biology. Nonetheless, we would argue that the rules do not go far enough: we find them descriptive, where we would be a lot more prescriptive.
For instance, for Rule 2, “Avoid Manual Data Manipulation Steps”, we would argue that all data manipulation must be automated. For Rule 4, “Version Control All Custom Scripts”, we would argue that the complete automated procedure for producing the work products should be under revision control.
If you are a developer familiar with professional practice, your mind should already be full of ideas about how dependency management, build systems, markup systems for documents that can execute embedded code, and continuous integration tools could bring real rigor.
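One small piece of that rigor is capturing the exact dependency versions a result was produced with. Here is a minimal sketch, assuming Python, a hypothetical environment.json manifest, and an illustrative package list, that records the interpreter and library versions used for a run:

```python
# Sketch of recording exact dependency versions for a run, using only the
# standard library. The manifest filename and package list are assumptions.
import json
import platform
from importlib import metadata

packages = ["numpy", "scipy", "scikit-learn"]  # whatever the project imports

manifest = {"python": platform.python_version(), "packages": {}}
for name in packages:
    try:
        manifest["packages"][name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        manifest["packages"][name] = "not installed"

# Check this file into revision control alongside the results it describes.
with open("environment.json", "w") as f:
    json.dump(manifest, f, indent=2)
```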
Accessible Reproducible Research
In an article by Jill Mesirov published in Science in 2010, the author proposes terminology for systems that facilitate reproducible computational research by scientists, specifically:
- Reproducible Research System (RRS): made up of a Reproducible Research Environment and a Reproducible Research Publisher.
- Reproducible Research Environment (RRE): the computational tools, the management of data, analyses, and results, and the ability to package them together for redistribution.
- Reproducible Research Publisher (RRP): the document preparation system, which connects to the reproducible research environment and provides the ability to embed analyses and results.
A prototype system created for gene expression analysis experiments, called GenePattern-Word RRS, is described.
Again, looking from the perspective of software development and the tools available, the RRE looks like revision control plus a build system with dependency management plus a continuous integration server. The RRP looks like a markup system with linking and a build process.
An invitation to reproducible computational research
This paper by David Donoho was published in Biostatistics in 2010. It is an excellent paper, and we agree with the points it raises. For instance:
Computational reproducibility is not an afterthought; it is something that must be designed into a project from the start.
We could not have put it better ourselves. In the paper, the author describes the benefits of building reproducibility into computational research. For the researcher, the advantages are:
- Improved work and work habits
- Improved teamwork
- Greater impact (less inadvertent competition and more acknowledgment)
- Increased continuity and cumulative impact
The advantages the author lists for the taxpayers who fund the research are:
- Stewardship of public goods
- Public access to public goods
We have made some of the same arguments to co-workers off the cuff, and it is great to be able to point to this paper, which makes the case far better.
Making Scientific Computations Reproducible
Published in Computing in Science & Engineering in 2000 by Matthias Schwab, Martin Karrenbach, and Jon Claerbout. The opening sentences of this paper are excellent:
“Typically, research involving scientific computations is reproducible in principle but not in practice. The published documents are merely the advertisement of scholarship, whereas the computer programs, input data, parameter values, etc. embody the scholarship itself. Consequently, authors are usually unable to reproduce their own research after a few months or years.”
The paper describes the standardization of computational experiments through the adoption of GNU Make, a conventional project structure, and the distribution of experimental project files online. These practices were made standard within the Stanford Exploration Project (SEP).
The motivating problem was the loss of programming effort when a graduate student left the group, and the resulting inability to reproduce and build upon their experiments.
The ideas of a standard project structure and a build system feel completely natural to a developer.
Reproducibility by Default in Machine Learning
The key point we want to make is this: do not ignore the excellent practices that have become standard in software development when you start working in machine learning. Use them and build on top of them.
There are blueprints for machine learning projects available online. The following are some tips for reusing software tools to make reproducibility the default for applied machine learning and machine learning projects generally.
- Use a build system and have all results generated automatically by build targets. If it's not automated, it's not part of the project; that is, if we have an idea for a graph or an analysis, we automate its generation (see the sketch after this list).
- Automate all data selection, pre-processing, and transformations. We even add wget calls for fetching data files when working on machine learning competitions. We want to be able to get up and running from scratch on a new workstation or a fast server.
- Use revision control and tag milestones.
- Strongly consider checking in dependencies, or at the very least linking to them.
- Avoid writing code: write thin scripts, and use standard tools and standard Unix commands to tie things together. Writing heavy-duty code is a last resort during analysis, or a final step before operations.
- Use a markup language to create reports for analysis and presentation output products. We like to sketch out lots of interesting ideas in a batch, implement them all, and let the build system generate them the next time it runs. This lets us review and think deeply about the observations later, when we are not in idea mode.
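As a concrete illustration of the first two tips, here is a minimal sketch of a build-target style script; the URL, file paths, and processing steps are all placeholders, and a real project might express the same targets in GNU Make instead:

```python
# Sketch of idempotent "build targets": each step runs only if its output
# is missing. All names here (URL, paths, cleaning step) are hypothetical.
import urllib.request
from pathlib import Path

DATA_URL = "https://example.com/dataset.csv"  # placeholder data source
RAW = Path("data/raw.csv")
CLEAN = Path("data/clean.csv")
REPORT = Path("reports/summary.txt")


def fetch_raw():
    if not RAW.exists():
        RAW.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, RAW)


def preprocess():
    if not CLEAN.exists():
        lines = RAW.read_text().splitlines()
        # Stand-in for real cleaning: drop blank lines.
        CLEAN.write_text("\n".join(line for line in lines if line.strip()))


def report():
    if not REPORT.exists():
        REPORT.parent.mkdir(parents=True, exist_ok=True)
        rows = CLEAN.read_text().splitlines()
        REPORT.write_text(f"rows: {len(rows)}\n")


if __name__ == "__main__":
    fetch_raw()
    preprocess()
    report()
```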
Pro Tip
Use a continuous integration server to run your test harness often (daily or hourly).
We have checks in our test harness for the presence of output products, and it creates them if they are missing. That means that every time we run the harness, only things that have changed, or results that are missing, are computed. In turn, that means we can let our minds run wild, keep adding algorithms, data transforms, and all kinds of crazy ideas to the harness, and some server somewhere will compute the missing outputs on the next run for us to review.
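A minimal sketch of that pattern, assuming Python, illustrative configuration names, and a stub evaluate() function standing in for a real train-and-score step:

```python
# Sketch of a harness that only computes results that are missing on disk.
# The configuration names and the evaluate() stub are hypothetical.
import json
from pathlib import Path

RESULTS = Path("results")
CONFIGS = ["baseline", "random_forest", "gradient_boosting"]  # grows over time


def evaluate(name: str) -> dict:
    # Stand-in for a real train/evaluate run.
    return {"config": name, "score": 0.0}


def run_harness():
    RESULTS.mkdir(exist_ok=True)
    for name in CONFIGS:
        out = RESULTS / f"{name}.json"
        if out.exists():
            continue  # already computed on an earlier run
        out.write_text(json.dumps(evaluate(name), indent=2))


if __name__ == "__main__":
    run_harness()
```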
The separation this imposes between generating results and evaluating them really speeds up progress on a project.
When we find bugs in our harness, we can delete the affected results and rebuild them all with confidence on the next cycle.
Conclusion
In this blog post, you have seen that applied machine learning is project work, with source data, code, computations, intermediate work products, output work products, and probably all manner of things in between.
If you treat a machine learning project like a software project, you get the benefits of reproducibility by default. You will also gain the additional benefits of speed and confidence, which lead to better results.
Resources
If you would like to read more on these issues, the resources used in researching this post are listed below.
- Reproducibility – Wikipedia page
- Ten Simple Rules for Reproducible Computational Research, Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig, 2013
- Accessible Reproducible Research, Jill Mesirov, 2010
- An invitation to reproducible computational research, David Donoho, 2010
- Making scientific computations reproducible, Matthias Schwab, Martin Karrenbach and Jon Claerbout, 2000
- Reproducible Research with R and RStudio by Christopher Gandrud, a book on this topic using R