Countless people are trying to get into data science, but many are intimidated by the idea of learning a new workflow. In the last year alone, the number of people worldwide reading about and interacting with machine learning has jumped from ~100k in 2016 to over 10 million in 2017.
The two biggest camps of people learning data science are software engineers looking to learn about quantitative development, and quantitatively oriented mathematicians and statisticians looking to scale their analysis and impact using machine learning (ML) and advanced analysis.
For software engineers, there are actually a lot of similarities to the “plan-build-test-deploy” (PBTD) process that you already know well.
We’re going to break down the data science development and production pipeline, and show how you can leverage the knowledge you already have about the PBTD workflow to hit the ground running.

> # Data Science follows the same PBTD loop you already know and love.
While the process looks the same at the high level, there are some specifics that make data science slightly different.
“Every extra minute spent planning is 10 minutes less spent debugging”
Data science, just like any task, requires planning; much like software engineering, developing a specification is part of the game. Here are the key steps:
Identify business needs and create metrics: leverage business stakeholder knowledge to identify business needs and translate those needs into metrics that can be reliably measured. Machine learning needs certain values to minimize or maximize to assess its effectiveness.
Design experiments: come up with a set of experiments to generate the metrics you care about and identify the algorithms you need to use to get the desired results.
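To make "metrics that can be reliably measured" concrete, here is a minimal sketch. It assumes a hypothetical business need (catching customers who are about to churn) and translates it into precision and recall; the labels and the function name are made up for illustration and do not come from any real project.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when there are no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Which of the two numbers matters more is a planning decision: if missing a churning customer is expensive, recall is the value to maximize; if each false alarm is expensive, precision is.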
Once you’ve come up with a plan, you can start to build your system. In software engineering this is where you write your app’s code. In data science it’s just a few more steps:
Environment setup: Based on the algorithm and experiment architecture you planned for, identify the library and framework dependencies you have (e.g. numpy, pandas, Python 2.7, TensorFlow, etc). Then set it up on the required hardware (local vs. distributed solutions).
Data Exploration: Explore the data tables or sources that can help generate the metrics you care about. This is where you clean your data, extract relevant variables, and engineer features that will be fed as inputs into your machine learning algorithm.
Model Generation: Write the code that executes the algorithm you have chosen. This is the crux of the data science process and has a few key components:
Algorithm selection: select the model that makes most sense e.g. linear regression, random forests, neural network, etc.
Debugging: The model only works if the code that generates it can execute your algorithm and data cleaning steps without error. Failures can show up as a runtime/compilation error, or as a functional error that processes data incorrectly and thus results in a broken model.
Optimization: Most machine learning algorithms are optimizations that are not guaranteed to reach a global maximum or minimum. That’s just fancy parlance for saying you are not guaranteed to get the best answer in a given amount of time. Unlike software engineering, where you typically program functions with deterministic outputs, these processes are often non-deterministic. To optimize, you tune hyperparameters, which control how the optimization procedure calculates the best “weights” by minimizing a quantity known as the loss function. We can think of this as the “compilation” step of data science: in software engineering, this step turns our code into binary machine language; in data science, the data is turned into the “weights” of the model via this optimization process.
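The "compilation" analogy can be sketched in a few lines. The example below is an illustrative assumption, not the method of any particular library: it uses plain gradient descent on a toy linear regression, with mean squared error as the loss function and `learning_rate` and `epochs` as the hyperparameters. The data and function names are made up.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit the weights w and b by gradient descent.

    Returns (w, b, history), where history holds the loss after each epoch.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient; the step size is a hyperparameter.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(mse_loss(w, b, xs, ys))
    return w, b, history


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, history = train(xs, ys)
print(f"w={w:.3f} b={b:.3f} final loss={history[-1]:.6f}")
```

Here the data goes in and the weights `w` and `b` come out, much like source code going in and a binary coming out. The hyperparameters decide whether that works at all: on this data, a much larger `learning_rate` such as 0.5, run for a handful of epochs, makes the loss grow instead of shrink.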
Testing in software comes in many different forms. For data science, there are a few types which are particularly relevant.
Unit testing: we can test a trained model by checking if the data we’re working with in our production pipeline is compatible with the model we have trained during the Build phase. By testing with real data outside of that which was used to train the model, we’ll get a better idea of whether outliers or strange data points return unexpected results from our trained model.
Integration testing: we can test how well the trained model behaves within the ecosystem of our product. For example, if the trained model is expecting data with a particular schema, does it parse it properly and return outcomes as expected based on the understanding of the model when we built it? In practice, this would look like a picture color recognition model erroring because one of the parameters it was accidentally passed was the date of origin. Containerization and microservices help make this possible (see Deployment below for details).
Performance/Stress testing: Much like in software engineering, we check the worst case scenario to test performance. For code we have the compilation time (which has become negligible in most cases) and big-O notation to denote computational complexity for runtime. In data science we use the training time and testing time as indicators. How does model training runtime compare to less greedy optimization methods? How does the model testing runtime compare to simpler models with fewer weights? The additional runtime of more complex models does not always translate to much better performance, and is sometimes not worth the large increase in runtime.
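The schema check and the runtime measurement described above might look something like the sketch below. Everything in it is a hypothetical stand-in: `EXPECTED_FIELDS` is an assumed schema for the picture color recognition example, and `dominant_color` is a trivial placeholder for a trained model, not a real one.

```python
import time

# Assumed schema: the model expects exactly three numeric channel values.
EXPECTED_FIELDS = {"red": float, "green": float, "blue": float}


def validate_record(record, schema=EXPECTED_FIELDS):
    """Raise ValueError if the record does not match the schema."""
    unexpected = set(record) - set(schema)
    missing = set(schema) - set(record)
    if unexpected or missing:
        raise ValueError(
            f"schema mismatch: unexpected={sorted(unexpected)}, "
            f"missing={sorted(missing)}"
        )
    for name, expected_type in schema.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return record


def dominant_color(record):
    """Placeholder for a trained model: returns the strongest channel."""
    validate_record(record)
    return max(record, key=record.get)


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


good = {"red": 0.9, "green": 0.2, "blue": 0.1}
# The same record with a date of origin passed in by accident.
bad = dict(good, date_of_origin="2017-06-01")

prediction, seconds = timed(dominant_color, good)

try:
    dominant_color(bad)
    rejected = False
except ValueError:
    rejected = True
```

Failing loudly on the stray `date_of_origin` field is a design choice: a model that silently ignores or misreads an extra parameter is much harder to debug than one that refuses the input.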
Once we have tested the trained model, we push it to production, similar to software engineering. There are many tools for deploying production code, all built around the ability to track changes with git commits. For data science projects, there is a huge void in deployment tools, as only the largest technology companies have the resources required to build proprietary in-house solutions. Datmo is building the platform and CLI that solve these problems specifically for data science and machine learning development. Drawing parallels to SWE, here are the key components:
Version control: Datmo is like Git for data projects: revert to prior models, share them with others, and compare results
Provisioning microservices: Provision servers and containers specifically for the exact model you want to work with
Continuous Integration: Prior to deployment, you can run the tests mentioned above to ensure the trained model works before pushing the microservice to production — no interruption of service!
Let’s see it in action
In Part 2 (not yet released), we’ll walk you through this process for a simple model — a sorting hat! You will provide a selfie / picture of your face and the sorting hat will sort you into the appropriate Hogwarts house!
Plan: We are going to build a model with a single image as an input and the output will be a set of 4 categories: Hufflepuff, Gryffindor, Slytherin, and Ravenclaw. We will use a feature extractor on the image and port the result into a logistic regression classifier. Our training data will come from a set of images and houses of Harry Potter characters.
Build: We will use Python and Scikit-learn to implement our algorithm. Then we will run the training images through the feature extractor and train the classifier.
Test: From the training set we will hold out a small subset and test on that validation set during training to confirm the model works. Then we will run a set of new images through the model to see how they perform. Since I don’t know any wizards or have a real sorting hat to corroborate, I can’t provide a quantitative measure for my integration tests.
Deploy: Create a prediction function that is translated to a simple RESTful API and deploy it locally on my server with an open port for others to access.
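As a rough preview of the Deploy step, here is a minimal sketch of a prediction function behind a small HTTP endpoint, using only the Python standard library. It is an assumption about how this could be wired up, not the Part 2 implementation: `predict_house` is a placeholder that ignores the real feature extractor and classifier, and the `/predict` path, the JSON shape, and the port are all made up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HOUSES = ["Hufflepuff", "Gryffindor", "Slytherin", "Ravenclaw"]


def predict_house(features):
    """Placeholder for the trained classifier (hypothetical).

    A real version would pass extracted image features to the logistic
    regression model; this stub picks a house from the feature count so
    the API wiring can be exercised.
    """
    return HOUSES[len(features) % len(HOUSES)]


def handle_predict(body):
    """Turn a JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": [numbers]}'}
    return 200, {"house": predict_house(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Listen on all interfaces; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server, after which a POST to `/predict` with a body like `{"features": [0.1, 0.2, 0.3]}` returns a JSON object with a `house` key. Keeping `handle_predict` separate from the HTTP handler means the prediction logic can be tested without opening a port.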