How to Lay Out and Manage Your ML Project
Project layout is as crucial for machine learning projects as it is for software development projects. We think of it as a language: a project layout organizes your thoughts and gives you context for ideas, just as knowing the names of things gives you a foundation for thinking.
In this blog article by AICorespot, we outline some considerations for the layout and management of your machine learning project. These are closely tied to the goals of project and scientific reproducibility. There is no “ideal” way; choose and adopt the practices that best fit your preferences and project requirements.
Workflow Motivating Questions
Jeromy Anglim gave a presentation at the Melbourne R Users group in 2010 on the state of project layout for R.
The following are the motivating questions from Jeromy’s presentation. How do you:
- Divide a project into files and folders?
- Integrate R analyses into a report?
- Convert default R output into publication quality tables, figures, and text?
- Develop the final product?
- Sequence the analyses?
- Divide code into functions?
A video of the presentation is available on YouTube.
Objectives for Project Workflow
David Smith summarizes what he believes are the objectives of a good project workflow in his article A workflow for R. They are worth keeping in mind when designing your own project layout.
- Transparency: Logical and clear layout for the project making it intuitive for the reader.
- Maintainability: Easy to modify the project, thanks to standard names for files and directories.
- Modularity: Discrete tasks split into separate scripts, each with a single responsibility.
- Portability: Easy to move the project to another system (relative paths and known dependencies).
- Reproducibility: Easy for you in the future, or for someone else, to rerun the project and regenerate the same artifacts.
- Efficiency: Less thought spent on meta-project details such as tooling, and more on the problem you are solving.
John Myles White has an R project called ProjectTemplate that aims to automatically create a well-defined layout for a statistical analysis project. It provides conventions and utilities for automatically loading and munging data.
The project layout is bigger than we would prefer, but it offers insight into a highly structured way of organizing a project:
- Cache: Preprocessed datasets that do not need to be regenerated each time you run an analysis.
- Config: Configuration settings for the project.
- Data: Raw data files.
- Munge: Preprocessing data munging code, the outputs of which are put in cache.
- Src: Statistical analysis scripts
- Diagnostics: Scripts to diagnose data sets for corruption or outliers.
- Doc: Documentation authored about the analysis.
- Graphs: Graphs developed from analysis.
- Lib: Helper library functions but not the core statistical analysis.
- Logs: Output of scripts and any automatic logging.
- Profiling: Scripts to benchmark the timing of your code.
- Reports: Output reports and content that could go into reports like tables.
- Tests: Unit tests and regression suite for your code.
- README: Notes that orient any newcomers to the project.
- TODO: List of future enhancements and bug fixes you choose to make.
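To make the layout above concrete, here is a minimal Python sketch that creates the same folders and files in an empty directory. It is not ProjectTemplate's own tooling (which is an R package); the lower-cased folder names, the function name create_skeleton and the project name my_project are assumptions chosen for illustration.

```python
from pathlib import Path
import tempfile

# Folder and file names follow the list above, lower-cased by assumption.
# This is an illustration only, not ProjectTemplate's own code.
FOLDERS = [
    "cache", "config", "data", "munge", "src", "diagnostics", "doc",
    "graphs", "lib", "logs", "profiling", "reports", "tests",
]
FILES = ["README", "TODO"]


def create_skeleton(root: Path) -> list[Path]:
    """Create the folders and empty files under root; return every path made."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # also creates root itself
        created.append(folder)
    for name in FILES:
        file_path = root / name
        file_path.touch(exist_ok=True)
        created.append(file_path)
    return created


if __name__ == "__main__":
    # Work in a throwaway directory so nothing real is overwritten.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_skeleton(Path(tmp) / "my_project"):
            print(path.name)
```

Because the function only creates what is missing, it can be rerun safely on an existing project. You would likely trim the folder list to what your project actually needs.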
Software Carpentry provides a short presentation entitled Data Management. Its approach to data management draws inspiration from an article by William Stafford Noble entitled A Quick Guide to Organizing Computational Biology Projects.
The presentation describes the problems with maintaining several versions of data on disk or in version control. It comments on the primary requirement in data archiving and proposes a strategy of dated directory names and per-file metadata files that are themselves managed in version control. It’s an interesting approach.
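The dated-directory idea can be sketched in a few lines of Python. This is only one possible reading of the strategy: the function name archive_dataset, the metadata fields and the .meta.json suffix are all assumptions made for illustration, not details taken from the presentation.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def archive_dataset(data_root: Path, filename: str, content: bytes,
                    source: str, day: date) -> Path:
    """Store content under data_root/YYYY-MM-DD/ with a metadata file beside it."""
    folder = data_root / day.isoformat()  # e.g. data/2010-01-15
    folder.mkdir(parents=True, exist_ok=True)
    data_path = folder / filename
    data_path.write_bytes(content)
    # Hypothetical metadata fields; record whatever your project needs.
    metadata = {
        "file": filename,
        "retrieved": day.isoformat(),
        "source": source,
        "size_bytes": len(content),
    }
    # The metadata file is small and text-based, so it is easy to keep
    # under version control.
    meta_path = folder / (filename + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return data_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = archive_dataset(Path(tmp) / "data", "sample.csv",
                                b"a,b\n1,2\n", "hypothetical source",
                                date.today())
        print(saved.parent.name, saved.name)
```

A new download on a later date lands in a new directory, so older versions of the data are never overwritten.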
There is a lot of discussion of best practices for project layout and code organization for data analysis projects on question-and-answer sites. Some popular examples include:
- How Do You Manage Your Files and Directories for your Projects?
- Workflow for statistical analysis and report writing
- Project organization with R
- What are effective ways to organize R code and output?
A good example is the question How to efficiently manage a statistical analysis project?, which was turned into a community wiki describing best practices. In summary, these practices are divided into the following sections:
- Data management: Use a directory structure, never modify raw data directly, check data consistency, use GNU make.
- Coding: Organize code into functional units, document everything, keep custom functions in a dedicated file.
- Analysis: Document your random seeds, separate parameters into config files, use multivariate plots.
- Versioning: Use version control, back up everything, and use an issue tracker.
- Editing/Reporting: Combine code and reporting, and use formal report generators.
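Two of the analysis practices above, documenting random seeds and separating parameters into config files, can be illustrated with a short Python sketch. The config keys seed and test_fraction, and the function name split_indices, are hypothetical. The config is inlined as a string here only to keep the example self-contained; in a real project it would live in its own file.

```python
import json
import random

# Hypothetical config contents. In a real project this text would sit in a
# file of its own (for example under a config folder), not in the script.
CONFIG_TEXT = '{"seed": 42, "test_fraction": 0.25}'


def split_indices(n_rows: int, config: dict) -> tuple[list[int], list[int]]:
    """Shuffle row indices with the configured seed; return (train, test)."""
    rng = random.Random(config["seed"])  # local generator, seed comes from config
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * config["test_fraction"])
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    config = json.loads(CONFIG_TEXT)
    train, test = split_indices(20, config)
    print("seed:", config["seed"])
    print("train rows:", len(train), "test rows:", len(test))
```

Because the seed is read from the config rather than buried in the code, rerunning with the same config reproduces the same split, and changing the experiment means editing one file.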
With every project we try to refine our project layout. It is difficult because projects vary in their data and aims, as do the languages and tools. We have tried both all-compiled-code and all-scripting-language versions. Some tips we have found useful:
- Stick to a POSIX filesystem layout (var, etc, bin, lib, and so on)
- Put all commands in scripts.
- Call all scripts from GNU make targets.
- Have make targets that create the environment and download public datasets.
- Create recipes and let the infrastructure check for and create any missing output products on every run.
This final point is a game changer. It lets you pipeline your workflow and define recipes with wild abandon for tasks such as data analysis, preprocessing, model configuration, and feature selection. The framework knows how to execute the recipes and creates the results for you to review.
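The idea of recipes that fill in missing outputs can be sketched without any build tool. The Python below is a deliberately simplified stand-in for what GNU make does, not a replacement for it: it only checks whether an output file exists and, unlike make, does not compare timestamps. The names build, write_raw and write_clean are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable

# A recipe maps an output file to (input files, function that writes the output).
Recipe = tuple[list[Path], Callable[[list[Path], Path], None]]


def build(target: Path, recipes: dict[Path, Recipe], log: list[str]) -> None:
    """Create target if it is missing, first building any missing inputs."""
    if target.exists():
        return  # output already present, nothing to do
    if target not in recipes:
        raise FileNotFoundError(f"no recipe and no file for {target}")
    inputs, action = recipes[target]
    for dependency in inputs:
        build(dependency, recipes, log)
    action(inputs, target)
    log.append(target.name)  # record what was actually rebuilt


def write_raw(inputs: list[Path], output: Path) -> None:
    """Stand-in for downloading a public dataset."""
    output.write_text("3\n1\n2\n")


def write_clean(inputs: list[Path], output: Path) -> None:
    """Stand-in for a preprocessing step: sort the raw numbers."""
    numbers = sorted(int(line) for line in inputs[0].read_text().split())
    output.write_text("\n".join(str(n) for n in numbers) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw, clean = Path(tmp) / "raw.txt", Path(tmp) / "clean.txt"
        recipes = {raw: ([], write_raw), clean: ([raw], write_clean)}
        log: list[str] = []
        build(clean, recipes, log)  # builds raw.txt first, then clean.txt
        build(clean, recipes, log)  # second run finds the outputs and does nothing
        print(log)
```

In practice each recipe would be a make target calling a script, which also gives you timestamp-based rebuilds when an input changes.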