Text Classification

Supervised text classification - A Beginner’s Guide

Text classification is not as difficult as you think! Here we provide a simple guide to get you started.


To solve any problem with machine learning, you need data. This data can come in many familiar forms:

· Numerical data (count of monthly car sales or total income received from these sales)

· Categorical data (the make and model of cars sold)

· Time-series data (comparison of monthly car sale numbers over the year).

Machine learning algorithms have adapted to allow for the modelling of other types of data:

1. Modelling images may involve representing each pixel of an image as a binary number or series of binary numbers.

2. Modelling sounds may involve converting that sound into a digital signal, sampling the signal at regular time intervals, and converting these to binary numbers.

Text processing is fast becoming one of the more popular tasks in machine learning. Machine Translation (translating sentences from one language to another), Text Summarisation (providing a summary of a given piece of writing) and Question Answering (answering posed questions) are just a few examples of text processing applications in real-world use today.

A further application of text processing is supervised text classification.

What is Supervised Text Classification?

Supervised text classification is the preferred machine learning technique when the goal of your analysis is to automatically classify pieces of text into one or more defined categories. Problems that benefit from it include:

· Spam filtering (detecting and classifying spam and non-spam emails)

· Sentiment analysis (determining user sentiment behind a social media post)

· News article classification (categorising a news article into one of a few defined topics)

Supervised text classification example: Hotel review classification task

The aim of this task is to detect the sentiment behind hotel reviews. We will explore a labelled dataset of hotel reviews and accompanying labels and train our model on these data pairs. Using our trained model, we want to be able to determine the sentiment behind any future hotel review received. I have included code snippets to make it easier to follow.


The dataset we are going to use to train our model is available on Kaggle and consists of 6 998 hotel reviews and accompanying labels: excellent, good, average, bad and pathetic.

Machine learning library

We will use the Scikit-learn (Sklearn) library for our modelling exercise. Sklearn is a free software machine learning library for Python. I often use this library as it includes tools that cover a wide variety of machine learning tasks, has comprehensive documentation, and is developed by a large community of developers and machine learning experts.

Required packages

Several Python packages are required to implement the modelling exercise. These will need to be imported into our notebook. A few of these will be discussed in more detail in later sections:

· Pandas: used for data manipulation and analysis

· Numpy: used for supporting large, multi-dimensional arrays and matrices, along with high-level mathematical functions to operate on these arrays

· Clean text: used for manipulating raw text into a more useable format which we can work with within our machine learning model

· Sklearn’s train_test_split: a function in Sklearn model selection for splitting data arrays into two subsets: training data and testing data

· Sklearn’s TfidfVectorizer: used to transform the review text data into new feature vectors to be used to predict the review sentiment

· Sklearn’s GaussianNB: implements the Gaussian Naive Bayes algorithm for our multiclass classification model (each review receives exactly one of the five labels)

· Sklearn’s accuracy_score: our chosen metric for measuring our classification model’s predictive accuracy

Import the following libraries:

import pandas as pd
import numpy as np
from cleantext import clean
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

If any of the required libraries are missing from your system, you can install them by following the instructions on their official pages:

· Pandas

· Numpy

· Cleantext

· Scikit-learn

Importing the dataset

We use pandas.read_csv to read in our CSV file of labelled hotel reviews and assign this to a dataframe named ‘data’.

data = pd.read_csv('data.csv')

Calling data allows us to view our dataframe. We see that our dataframe consists of 6 998 rows, each row consisting of a unique ID, hotel review and label.

The dataframe containing the text we wish to classify.
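A quick way to sanity-check a freshly loaded dataframe is to look at its shape and at how the labels are distributed. The sketch below uses a tiny made-up dataframe in place of data.csv; the rows are invented, and the column names Id, Reviews and Label are assumptions based on the code used later in this guide.

```python
import pandas as pd

# Tiny stand-in for the Kaggle file; the rows are invented for illustration.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Reviews": ["Lovely stay", "Dirty room", "It was fine", "Lovely pool"],
    "Label": ["excellent", "bad", "average", "excellent"],
})

n_rows, n_cols = toy.shape                    # (number of rows, number of columns)
label_counts = toy["Label"].value_counts()    # how many reviews carry each label

print(n_rows, n_cols)
print(label_counts)
```

Checking the label counts early is useful because a heavily imbalanced dataset can make an accuracy figure look better or worse than the model really is.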

Handling missing values

We can use data.isnull().sum() to determine whether there are any missing values in our dataframe:

The output shows us that there is one missing review and there are no missing labels. Since there is only one missing value, the prudent thing would be to delete the row containing this missing value.

We can do this by redefining data as data.dropna(). Running data to view our dataframe once more, we see that our dataframe consists of 6 997 rows, with the row containing missing data dropped.

data = data.dropna()
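To see what these two calls do, here is a minimal sketch on a three-row toy dataframe with one missing review. The data is made up; only isnull().sum() and dropna() come from the steps above.

```python
import numpy as np
import pandas as pd

# Invented rows: the second review is missing.
toy = pd.DataFrame({
    "Reviews": ["Great view", np.nan, "Noisy at night"],
    "Label": ["good", "bad", "bad"],
})

missing_per_column = toy.isnull().sum()   # count of missing values in each column
cleaned = toy.dropna()                    # drop every row that contains a missing value

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

The remaining rows keep their original index labels; calling reset_index(drop=True) renumbers them if that matters for later steps.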

Cleaning the text

We need to use the cleantext package to define a function that will allow us to manipulate the raw review text into a more useable format which we can work with within our machine learning model.

We define the function do_clean(text) which accepts any string and returns cleaned text. We apply this function to each review in our dataset and replace the raw reviews with the cleaned version.

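def do_clean(text):
    """Clean the given text.

    text (str): text to be cleaned.
    """
    # NOTE: the argument values below are assumptions, chosen to match the
    # cleaned output described in this guide (lowercase text, no numbers).
    return clean(text,
        fix_unicode=True,                  # Fix various unicode errors
        to_ascii=True,                     # Transliterate to closest ASCII representation
        lower=True,                        # Lowercase text
        no_line_breaks=True,               # Fully strip line breaks as opposed to only normalizing them
        no_urls=True,                      # Replace all URLs with a special token
        no_emails=True,                    # Replace all email addresses with a special token
        no_phone_numbers=True,             # Replace all phone numbers with a special token
        no_numbers=True,                   # Replace all numbers with a special token
        no_digits=True,                    # Replace all digits with a special token
        no_currency_symbols=True,          # Replace all currency symbols with a special token
        no_punct=True,                     # Fully remove punctuation
        replace_with_url="",               # Special tokens (assumed empty, i.e. removal)
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="",
        replace_with_currency_symbol="",
        lang="en",                         # Set to 'de' for German special handling
    )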

data['Reviews'] = data['Reviews'].apply(do_clean)

The effect of cleaning text with the cleantext library.

The text has been cleaned and we notice there are no longer any capital letters or numbers.
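If you want a feel for what this kind of cleaning does without installing anything, the following is a rough approximation using only the Python standard library. It is not how cleantext works internally, and the function name rough_clean is hypothetical; it simply lowercases the text, drops digits and punctuation, and tidies up whitespace.

```python
import re
import string

def rough_clean(text):
    """Lowercase text, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                                   # remove runs of digits
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove ASCII punctuation
    return re.sub(r"\s+", " ", text).strip()                           # collapse repeated spaces

example = rough_clean("Stayed 3 nights, LOVED the pool!!")
print(example)  # stayed nights loved the pool
```

A dedicated package is still the better choice for real data, since it also handles unicode problems, URLs, email addresses and similar cases that this sketch ignores.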

Train test split

We define our independent X variable as the column of cleaned reviews and our dependent y variable as the label column. We then use Sklearn’s train_test_split function to randomly split the data into a training set and a test set. I specified the size of the test set to be 20% of the total data.

X = data['Reviews']

y = data['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Note that the size of the training set is now 5 597 rows and the size of the testing set is 1 400 rows.
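The split sizes can be checked on stand-in data of the same length. In the sketch below, the rows and labels are placeholders, and random_state and stratify are optional arguments that the split above does not use: the first makes the split repeatable, and the second keeps the class proportions similar in both subsets.

```python
from sklearn.model_selection import train_test_split

rows = list(range(6997))            # stand-in for the 6 997 cleaned reviews
labels = [i % 5 for i in rows]      # five made-up classes, spread evenly

X_train, X_test, y_train, y_test = train_test_split(
    rows, labels,
    test_size=0.2,
    random_state=42,    # fixed seed so the same split is produced on every run
    stratify=labels,    # keep class proportions similar in train and test
)

print(len(X_train), len(X_test))  # 5597 1400
```

Without a fixed random_state, each run produces a different split, so the accuracy you measure at the end can vary slightly from run to run.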

Feature engineering: TF-IDF Vectors as features

This step involves transforming the cleaned review text data into new feature vectors to be used to predict the review sentiment. We will use Sklearn’s TfidfVectorizer to do this.

TF-IDF stands for term frequency-inverse document frequency. We use TF-IDF to compute feature vectors which consist of weights corresponding to the relative frequency with which certain words appear.

The TF-IDF weight for a word is a measure used to assess the importance of a word to a particular review within a corpus. The importance increases proportionally to the number of times a word appears in the review but is offset by the frequency of the word in the entire review corpus.

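The TF-IDF weight is calculated for each word in a document and consists of two parts:

· TF or Term Frequency measures how frequently a term appears in a document

· IDF or Inverse Document Frequency measures how rare a term is across the whole corpus, so that words appearing in most reviews receive a lower weight than words appearing in only a few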
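To make the two parts concrete, the sketch below computes TF-IDF by hand for three made-up documents and compares the result with TfidfVectorizer. It assumes Sklearn's default settings: raw counts as the term frequency, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and each document vector scaled to unit length.

```python
import math

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good hotel", "bad hotel", "good good staff"]   # invented mini-corpus

vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {w: sum(w in doc.split() for doc in docs) for w in vocab}
# Smoothed inverse document frequency (Sklearn's default formula).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Return the unit-length TF-IDF vector for one document."""
    words = doc.split()
    raw = [words.count(w) * idf[w] for w in vocab]      # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))           # Euclidean length
    return [v / norm for v in raw]

manual = np.array([tfidf_row(doc) for doc in docs])
from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()

print(vocab)
print(np.allclose(manual, from_sklearn))
```

Here "hotel" and "good" each occur in two of the three documents, so they get a lower IDF than "bad" or "staff", which each occur in only one.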

Feature generation

We use the hyperparameter max_features to build a vocabulary that only considers the top five words, ranked by term frequency across the corpus of reviews. This hyperparameter should be adjusted based on the volume of data given.

The fit_transform method is then used on the training data to learn the vocabulary and IDF weights and to calculate the TF-IDF scores.

The transform method is then used on the testing data, so that the vocabulary and IDF weights learned from the training data are applied to the test data without refitting.

tfidfconverter = TfidfVectorizer(max_features=5)

X_train_Tfidf_df = tfidfconverter.fit_transform(X_train).toarray()

X_train_Tfidf_df = pd.DataFrame(X_train_Tfidf_df)

X_test_Tfidf_df = tfidfconverter.transform(X_test).toarray()

X_test_Tfidf_df = pd.DataFrame(X_test_Tfidf_df)

The effect of applying Tfidf on the data.

Calling X_train_Tfidf_df shows a dataframe with five columns, one per vocabulary word, holding the TF-IDF score of that word for each review. The five words are those with the highest term frequency across the corpus of reviews.
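The effect of max_features, and of fitting only on the training data, is easier to see on a tiny example. The documents below are made up, and max_features is set to 2 purely so that the result is easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good hotel good staff", "bad hotel", "good pool"]   # invented
test_docs = ["good hotel with terrible wifi"]                      # invented

vectorizer = TfidfVectorizer(max_features=2)
train_matrix = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)         # reuses them without refitting

kept_terms = sorted(vectorizer.vocabulary_)           # the words that became columns
print(kept_terms)
print(train_matrix.shape, test_matrix.shape)
```

Only "good" (three occurrences) and "hotel" (two occurrences) survive, and words in the test document that were never seen during fitting are ignored. Both matrices therefore have the same two columns, which is what the model needs.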


We can now train our model using the training sets with any multiclass classification model. I have used the Gaussian Naive Bayes model here, but any other should suffice. We do this by defining the model and fitting our training sets to this model.

model = GaussianNB()

model.fit(X_train_Tfidf_df, y_train)


We are now ready to predict the response for our converted test dataset using the model.predict() method.

y_pred = model.predict(X_test_Tfidf_df)
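The same steps apply to any new review received later: it has to pass through the already fitted vectorizer (transform, not fit_transform) before the model can label it. The sketch below shows this end to end on four invented reviews with two made-up labels, so it is an illustration of the flow rather than of the hotel model itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Invented training data with two labels, for illustration only.
train_reviews = [
    "wonderful excellent stay",
    "excellent wonderful room",
    "terrible dirty room",
    "dirty terrible stay",
]
train_labels = ["good", "good", "bad", "bad"]

vectorizer = TfidfVectorizer()
# GaussianNB expects a dense array, hence the call to toarray().
X_train_dense = vectorizer.fit_transform(train_reviews).toarray()

model = GaussianNB()
model.fit(X_train_dense, train_labels)

new_review = ["wonderful excellent"]
X_new = vectorizer.transform(new_review).toarray()   # transform only, never refit
prediction = model.predict(X_new)

print(prediction[0])
```

In the hotel review task, the new review would also be passed through do_clean first, so that it is prepared in the same way as the training data.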

Metric for measuring our classification model’s predictive accuracy

Accuracy can then be measured by comparing the predicted labels with the true labels in the test set. The model was able to correctly classify the review sentiment about 56% of the time.

accuracy_score(y_test, y_pred)

output: 0.557
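Accuracy is simply the fraction of predictions that match the true label. The sketch below checks this by hand on five made-up labels and also builds a confusion matrix, which is not used in the guide above but shows where the mistakes fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels, for illustration only.
y_true = ["good", "bad", "good", "average", "bad"]
y_hat = ["good", "bad", "average", "average", "good"]

# Fraction of positions where the prediction equals the true label.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
library_accuracy = accuracy_score(y_true, y_hat)

label_order = ["average", "bad", "good"]
# Rows are true labels, columns are predicted labels, both in label_order.
matrix = confusion_matrix(y_true, y_hat, labels=label_order)

print(manual_accuracy, library_accuracy)
print(matrix)
```

With five classes, a single accuracy figure hides which labels get confused with each other, so the confusion matrix is a helpful companion when deciding what to improve next.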


We have just created our very own text classification model, which completes this simple guide to supervised text classification in Python (with code).
