The Machine Learning Project Pipeline - A binary classification example
Binary classification is a common machine learning task; in this article we give an overview of the process involved.
A brief overview of building your first binary classification model
Binary classification is the task of classifying the elements of a set into two groups, based on a classification rule. The most common example that you will probably be familiar with is whether an email is spam or ham. Straightforward, right?
Using machine learning for this means that you have historical, labeled data on which you base your model. In essence, that is all that supervised machine learning is: learning and improving from historic data.
Popular algorithms that can be used for binary classification include:
· Logistic Regression
· k-Nearest Neighbours
· Decision Trees
· Random Forests
· Support Vector Machine
· Naive Bayes
Which algorithm to use depends entirely on your use case and data. I won’t be going into depth on these algorithms here, but feel free to read up on them to deepen your understanding.
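As a rough orientation (not part of the original walkthrough), here is a minimal sketch of where each of these algorithms lives in scikit-learn; every one of them exposes the same fit/predict interface, which makes it easy to swap them in and out later on.

```python
# Illustrative only: each classifier below follows scikit-learn's
# common estimator API (.fit(X, y) to train, .predict(X) to classify).
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

classifiers = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(),
    GaussianNB(),
]
```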
Let’s refer to our example of spam emails. The problem statement is: “Is this email spam or not?”.
This seems pretty easy to answer if you were to manually look at a specific email and its contents. However, trying to do this for 306.4 billion emails, which is the average number of emails sent every single day, is impossible.
This is where the beauty of machine learning comes in. Machine learning can learn patterns across a multitude of data points and automate a decision. This significantly improves our email problem and makes it feasible.
This is by far the most important step in any machine learning project. You will spend hours combing and preparing your data. Then, just when you are happy with your data set, you will inevitably find some more tweaks that could be made.
The popular phrase “garbage in, garbage out” could not be truer. You can have the most advanced model ever created, but you will always fall short of your goal if your data is not correctly prepared and cleaned. Understanding your data is a crucial part of the process. You need to know the ins and outs of what your problem requires.
I recommend looking at this article for some tips and tricks.
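To make the idea of cleaning concrete, here is a hedged sketch of a few typical preparation steps using pandas. The file name and column names (emails.csv, text, label) are hypothetical placeholders; your own data set will look different.

```python
# A sketch of common cleaning steps, assuming a hypothetical emails.csv
# with "text" and "label" columns.
import pandas as pd

df = pd.read_csv("emails.csv")                    # hypothetical file name

df = df.drop_duplicates()                         # remove exact duplicate rows
df = df.dropna(subset=["text", "label"])          # drop rows missing key fields
df["text"] = df["text"].str.lower().str.strip()   # normalise the raw text

print(df["label"].value_counts())                 # check the class balance early
```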
The next step will be splitting your data into 2 sets: one for your model to train on and another for it to test on.
Conventional wisdom says that the more examples you have, the smaller the test proportion of your train/test split can be: with plenty of data, even a small test fraction contains enough examples to keep the variance of your score low, which frees up more examples for training and produces a more robust model. In data science, more data is almost always a good thing.
A good starting point is a 70% train / 30% test split.
Scikit-learn is a library you will undoubtedly be using when creating your model. Within this library is a function called train_test_split, which makes it convenient to randomly split data into a chosen proportion.
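A quick sketch of that split, assuming X is your prepared feature matrix and y your label vector:

```python
# Split the cleaned data into train and test sets with scikit-learn.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,      # the 70/30 split suggested above
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # keep the spam/ham ratio the same in both sets
)
```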
You now have everything you need to build your model:
1. You have your cleaned data.
2. You have split it into train and test sets.
Now you will want to try out a few of the different algorithms mentioned to compare which one best explains your data.
Depending on your chosen algorithm, different scoring metrics are used to find a model that best fits your data.
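As a sketch of what that comparison might look like (the choice of candidates and of F1 as the metric here is illustrative, not a prescription), assuming X_train, X_test, y_train and y_test from the split above:

```python
# Compare a few candidate algorithms on the same train/test split.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # F1 is one reasonable metric for binary classification; pick whatever
    # matches your problem (accuracy, precision, recall, ROC AUC, ...).
    print(f"{name}: F1 = {f1_score(y_test, preds):.3f}")
```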
As a data scientist, you should always look for accurate yet fast methods or functions to do the modeling work. If a method is inherently slow, it will create an execution bottleneck for large data sets.
Parameters that define the model architecture are referred to as hyperparameters. Thus, this process of searching for the ideal model architecture is referred to as hyperparameter tuning.
Two built-in scikit-learn utilities that are easy to use and highly effective are RandomizedSearchCV, which fits and scores the model on a fixed number of randomly sampled parameter combinations from a grid, and GridSearchCV, which does the same but exhaustively tests every combination in the grid.
Using these functions sets you up to maximize the power of your algorithm.
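Here is a hedged sketch of tuning a random forest with RandomizedSearchCV; the parameter grid is purely illustrative, and X_train, y_train are assumed from the earlier split.

```python
# Randomized hyperparameter search over an illustrative grid.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,          # sample 10 random combinations from the grid
    cv=5,               # 5-fold cross-validation for each combination
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# GridSearchCV has the same interface but tries every combination:
# GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=5)
```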
Now for the exciting step of testing your model: making predictions on your test data is the key measure of whether you built a successful model.
A common way to test accuracy is to construct a confusion matrix and calculate the specificity and sensitivity. You will find yourself refitting and testing models again and again, learning with each iteration where you can improve your model. Here is a great article explaining different measures of accuracy.
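As a quick sketch, sensitivity and specificity can be read straight off the confusion matrix; y_test and preds are assumed to come from the model comparison above.

```python
# Compute sensitivity and specificity from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate

print(f"Sensitivity: {sensitivity:.3f}, Specificity: {specificity:.3f}")
```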
Cross-validation is done to prevent your evaluation from overfitting to a single train/test split. A common method of this is k-fold cross-validation.
The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
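In scikit-learn this can be done in a couple of lines; the sketch below assumes X_train and y_train from earlier and uses a random forest purely as an example.

```python
# 5-fold cross-validation: the averaged scores give a less
# split-dependent estimate of model performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X_train, y_train,
    cv=5,               # k = 5 folds
)
print(scores.mean(), scores.std())
```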
Model building requires you to have an understanding of both business and coding. If you write the perfect code, but you don’t understand the business or the problem you are creating the model for, you will surely be setting yourself up for a very hard time!
Once you are comfortable that your data is in a state that will yield good results, you are already more than halfway to the finish line. The rest is just selecting a good model and using hyperparameter tuning to give you that final boost.