Supervised Text Classification - A Beginner’s Guide
Text classification is not as difficult as you think! Here we provide a simple guide to get you started.
To solve any problem with machine learning, you need data. This data can come in many familiar forms:
· Numerical data (count of monthly car sales or total income received from these sales)
· Categorical data (the make and model of cars sold)
· Time-series data (comparison of monthly car sale numbers over the year).
Machine learning algorithms have adapted to allow for the modelling of other types of data:
1. Modelling images may involve representing each pixel of an image as a binary number or series of binary numbers.
2. Modelling sounds may involve converting that sound into a digital signal, sampling the signal at regular time intervals, and converting these to binary numbers.
Text processing is fast becoming one of the more popular tasks in many machine learning applications: Machine Translation (translating sentences from one language to another), Text Summarisation (providing a summary of a given piece of writing) and Question Answering (answering posed questions) are just a few examples of text processing machine learning applications that are being used in the real world today.
A further application of text processing is supervised text classification.
Supervised text classification is the preferred machine learning technique when the goal of your analysis is to automatically classify pieces of text into one or more defined categories. Typical problems that benefit from its use include:
· Spam filtering (detecting and classifying spam and non-spam emails)
· Sentiment analysis (determining user sentiment behind a social media post)
· News article classification (categorising a news article into one of a few defined topics)
Supervised text classification example: Hotel review classification task
The aim of this task is to detect the sentiment behind hotel reviews. We will explore a labelled dataset of hotel reviews and accompanying labels and train our model on these data pairs. Using our trained model, we want to be able to determine the sentiment behind any future hotel review received. I have included code snippets to make it easier to follow.
The dataset we are going to use to train our model is available on Kaggle and consists of 6,998 hotel reviews and accompanying labels: excellent, good, average, bad and pathetic.
We will use the Scikit-learn (Sklearn) library for our modelling exercise. Sklearn is a free software machine learning library for Python. I often use this library as it includes tools that cover a wide variety of machine learning tasks, has comprehensive documentation, and is developed by a large community of developers and machine learning experts.
Several Python packages are required to implement the modelling exercise. These will need to be imported into our notebook. A few of these will be discussed in more detail in later sections:
· Pandas: used for data manipulation and analysis
· Numpy: used for supporting large, multi-dimensional arrays and matrices, along with high-level mathematical functions to operate on these arrays
· Clean text: used for manipulating raw text into a more usable format that we can work with in our machine learning model
· Sklearn’s train_test_split: a function in Sklearn model selection for splitting data arrays into two subsets: training data and testing data
· Sklearn’s TfidfVectorizer: used to transform the review text data into new feature vectors to be used to predict the review sentiment
· Sklearn’s GaussianNB: implements the Gaussian Naive Bayes algorithm for our multiclass classification model
· Sklearn’s accuracy_score: our chosen metric for measuring our classification model’s predictive accuracy
Import the following libraries:
import pandas as pd
import numpy as np
from cleantext import clean
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
If your system does not have any of the required libraries, you can install them by following the instructions on their official pages:
· Pandas
· Numpy
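Alternatively, all of the required packages can typically be installed from the command line with pip (assuming the cleaning package used here is the clean-text distribution on PyPI):
pip install pandas numpy clean-text scikit-learn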
We use pandas.read_csv to read in our CSV file of labelled hotel reviews and assign this to a dataframe named ‘data’.
data = pd.read_csv('data.csv')
Calling data allows us to view our dataframe. We see that our dataframe consists of 6,998 rows, each row consisting of a unique ID, hotel review and label.
data
We can use data.isnull().sum() to determine whether there are any missing values in our dataframe:
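data.isnull().sum()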
The output shows us that there is one missing review and no missing labels. Since there is only one missing value, the prudent thing would be to delete the row containing it.
We can do this by redefining data as data.dropna(). Running data to view our dataframe once more, we see that our dataframe consists of 6,997 rows, with the row containing missing data dropped.
data = data.dropna()
We need to use the cleantext package to define a function that will allow us to manipulate the raw review text into a more usable format that we can work with in our machine learning model.
We define the function do_clean(text) which accepts any string and returns cleaned text. We apply this function to each review in our dataset and replace the raw reviews with the cleaned version.
def do_clean(text):
    """Cleans text given.

    Parameters:
        text (str): text to be cleaned.
    """
    return clean(text,
                 # Fix various unicode errors
                 fix_unicode=True,
                 # Transliterate to closest ASCII representation
                 to_ascii=True,
                 # Lowercase text
                 lower=True,
                 # Fully strip line breaks as opposed to only normalizing them
                 no_line_breaks=False,
                 # Replace all URLs with a special token
                 no_urls=False,
                 # Replace all email addresses with a special token
                 no_emails=False,
                 # Replace all phone numbers with a special token
                 no_phone_numbers=True,
                 # Replace all numbers with a special token
                 no_numbers=True,
                 # Replace all digits with a special token
                 no_digits=True,
                 # Replace all currency symbols with a special token
                 no_currency_symbols=False,
                 # Fully remove punctuation
                 no_punct=False,
                 replace_with_url="<URL>",
                 replace_with_email="<EMAIL>",
                 replace_with_phone_number="<ACCOUNT>",
                 replace_with_number="<NUMBER>",
                 replace_with_digit="0",
                 replace_with_currency_symbol="<CUR>",
                 # Set to 'de' for German special handling
                 lang="en")
data['Reviews'] = data['Reviews'].apply(do_clean)
data.head()
The text has been cleaned and we notice there are no longer any capital letters or numbers.
We define our independent X variable as the column of cleaned reviews and our dependent y variable as the label column. We then use Sklearn’s train_test_split function to randomly split the data into a training set and a test set. I specified the size of the test set to be 20% of the total data.
X = data['Reviews']
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Note that the size of the training set is now 5,597 rows and the size of the testing set is 1,400 rows.
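You can confirm the split sizes yourself with a quick check, for example:
print(len(X_train), len(X_test))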
This step involves transforming the cleaned review text data into new feature vectors to be used to predict the review sentiment. We will use Sklearn’s TfidfVectorizer to do this.
TF-IDF stands for term frequency-inverse document frequency. We use TF-IDF to compute feature vectors which consist of weights corresponding to the relative frequency in which certain words appear.
The TF-IDF weight for a word is a measure used to assess the importance of a word to a particular review within a corpus. The importance increases proportionally to the number of times a word appears in the review but is offset by the frequency of the word in the entire review corpus.
The TF-IDF weight is calculated for each word in a document and consists of two parts:
TF or Term Frequency measures how frequently a term appears in a document
IDF or Inverse Document Frequency measures how rare a term is across all documents, so words that appear in many reviews are down-weighted
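As a quick illustration, here is a small sketch on a made-up three-review toy corpus (not the hotel dataset) showing the weights TfidfVectorizer produces; get_feature_names_out is available in recent scikit-learn versions:
toy_reviews = ["great room great service", "terrible room", "great location"]
toy_vectorizer = TfidfVectorizer()
toy_weights = toy_vectorizer.fit_transform(toy_reviews)
# Each row is a review, each column a vocabulary word, each value a TF-IDF weight
print(toy_vectorizer.get_feature_names_out())
print(toy_weights.toarray().round(2))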
We use the max_features hyperparameter to build a vocabulary that only considers the top five words, ordered by term frequency across the corpus of reviews. This hyperparameter should be adjusted based on the volume of data given.
The fit_transform method is then used on the training data to calculate the TF-IDF scores.
The transform method is then used on the testing data, so that the vocabulary and IDF weights learned from fitting on the training data are applied to transform the test data.
tfidfconverter = TfidfVectorizer(max_features=5)
X_train_Tfidf_df = tfidfconverter.fit_transform(X_train).toarray()
X_train_Tfidf_df = pd.DataFrame(X_train_Tfidf_df)
X_test_Tfidf_df = tfidfconverter.transform(X_test).toarray()
X_test_Tfidf_df = pd.DataFrame(X_test_Tfidf_df)
X_train_Tfidf_df
From the above we see that calling X_train_Tfidf_df gives a dataframe with five columns, one for each of the top five words ordered by term frequency across the corpus of reviews, where each value is the TF-IDF score of that word in the corresponding review.
We can now train our model using the training sets with any multiclass classification model. I have used the Gaussian Naive Bayes model here, but any other should suffice. We do this by defining the model and fitting our training sets to this model.
model = GaussianNB()
model.fit(X_train_Tfidf_df, y_train)
We are now ready to predict the response for our converted test dataset using the model.predict() method.
y_pred = model.predict(X_test_Tfidf_df)
Accuracy can then be measured by comparing the predicted labels with the labels in the test set. The model was able to correctly classify the review sentiment about 56% of the time.
accuracy_score(y_test, y_pred)
output: 0.557
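Finally, here is a minimal sketch of how a brand-new review could be classified, assuming do_clean, tfidfconverter and model from the steps above are still in scope; the example review text is made up:
new_review = "The staff were friendly and the room was spotless"
# Clean the raw text, convert it to TF-IDF features, then predict its label
new_review_tfidf = tfidfconverter.transform([do_clean(new_review)]).toarray()
print(model.predict(new_review_tfidf))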
We have just created our very own text classification model, working through a simple guide to supervised text classification in Python (with code).