Documenting your work is necessary but tedious, regardless of the type of work you do. Tracking and reproducing work is becoming more standardized for most generic web-connected applications and workflows (e.g., document state-saving and tracking through Google Docs, and code version control systems like GitHub), but there is currently no widely accepted standard or simple automation for data science and machine learning. This is not to say developers and data scientists don’t track their work, but their process tends to be manual, repetitive, time-consuming, and rarely automated.
Not only is keeping track of the state of your work an important part of getting things done, but automating and observing best practices in tracking also drives better productivity and collaboration. Below, we propose some best practices and an open source system for solving tracking and reproducibility when working with data and machine learning. Furthermore, we introduce a new way to automate this process. Let’s start with the goals that we strive to achieve.
Prevent lost history of your trained models, configurations, metrics, and environments
Ensure accurate reporting of results
Avoid errors when repeating, learning from, and reproducing someone else’s work (or even your own work!)
Broadly, we can divide tracking in data projects into 5 categories: code, configurations, environments, performance metrics, and files. Below, we break down the current workflows and their problems, some best practices for handling these workflows, and the Datmo equivalents of those best practices.
Tracking Problems with Existing Solutions
Best Practices with Existing Solutions
Keeping track of workflows while working with and modeling data shouldn’t be like pulling teeth. Datmo’s simple command-line interface takes into account the common workflows and best practices that data scientists and developers are used to, and automates the entire tracking process.
More specifically, Datmo enables tracking all 5 components above via Snapshots of models. In addition, Datmo enables simple orchestration of tasks and runs to ensure replicability of snapshots across machines. Datmo provides 2 main value propositions:
Tracks models via snapshots, which are points in time of a trained model
Enables orchestration of tasks that run in parallel while keeping their results isolated
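To make the idea of a snapshot more concrete, here is a minimal sketch in plain Python of a record that captures the five components at a point in time. This is not Datmo's API or implementation: the names `Snapshot`, `take_snapshot`, `capture_environment`, and `file_digest` are hypothetical and chosen only for illustration, and the code version is assumed to be supplied by the caller (for example, a commit hash from your version control system).

```python
import hashlib
import json
import platform
from dataclasses import dataclass, asdict


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def capture_environment():
    """Record a small, illustrative subset of the runtime environment."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time record of the five tracked components."""
    code_version: str   # assumed to be provided by the caller, e.g. a commit hash
    config: dict        # hyperparameters and other settings
    environment: dict   # interpreter and platform details
    metrics: dict       # performance metrics such as accuracy
    files: dict         # file path -> SHA-256 digest of its contents

    def snapshot_id(self):
        """Derive a deterministic ID from the snapshot's contents."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def take_snapshot(code_version, config, metrics, paths):
    """Build a Snapshot from the current state of a project."""
    return Snapshot(
        code_version=code_version,
        config=dict(config),
        environment=capture_environment(),
        metrics=dict(metrics),
        files={str(p): file_digest(p) for p in paths},
    )
```

Because the ID in this sketch is derived from the contents, two snapshots taken in the same environment with identical code version, configuration, metrics, and files share an ID, while a change to any one of them produces a different ID. That is one simple way to tell whether a result has actually been reproduced.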
Datmo’s CLI, along with its GUI platform, allows you to build, track, share, and collaborate on data projects. Collaborators can see all of the work you did, including snapshots and task runs. They can then improve, comment on, and start new experiments from that work, improving the way everyone works on their data projects.
We hope no one will ever have to track their work manually again. We look forward to your feedback!
Check out our simple Iris Classification demo below:
Sign up for our newsletter at https://datmo.com/ to get updates on our progress and tips on how to improve your workflow.