Build your own text classifier with fewer than 25 lines of code using fastText

Ravindra Elicherla
6 min readAug 29, 2018

Text Classification

Text classification is a basic machine learning technique used to classify text into different categories. There are plenty of use cases for text classification: spam filtering, sentiment analysis, classifying product reviews, driving the customer’s browsing experience based on what she searches or browses, targeted marketing based on what the customer does online, and so on. In this example, we will use supervised classification of text, which works on the “train” and “validate” principle. We feed labeled data to the machine learning algorithm to work on. After the algorithm is trained, we use the validation dataset to measure its accuracy. The effectiveness of the output depends on the quality of the data and the strength of the algorithm. In this example, there is no need to write any algorithm; we will use fastText’s built-in algorithm.

fasttext

In this blog we will classify consumer complaints into one or more relevant categories automatically using fastText. fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware and was open-sourced by Facebook.

Installing fastText is very easy. I am using macOS.

$ git clone https://github.com/facebookresearch/fastText.git

$ cd fastText

$ make

You can check that the fastText installation succeeded by running the command below.

Ravindras-MacBook-Pro:fastText ravindraprasad$ ./fasttext

Get and prepare data:

Download the Consumer Complaints data CSV file from here. This dataset contains complaints received about financial products and services, neatly categorised into various products. The CSV file has columns such as Date received, Product, Sub-product, Issue, Sub-issue, and Consumer complaint narrative. In this example we are interested in Product and Consumer complaint narrative. The problem statement we are trying to solve is: “When a customer writes a new complaint, how do we categorise it into a product automatically?” Is it not interesting? Let’s start with the solution now.

Read and process the file using the Python code below.
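The original post shows this code as a screenshot. Below is a minimal sketch of the idea, assuming pandas and the column names from the CFPB file (“Product”, “Consumer complaint narrative”); the small inline sample stands in for the real downloaded CSV:

```python
import pandas as pd
from io import StringIO

# Stand-in for the downloaded Consumer_Complaints.csv file.
csv_data = StringIO(
    "Product,Consumer complaint narrative\n"
    "Credit reporting,An account on my credit report is wrong\n"
    "Debt collection,This company refuses to provide me validation\n"
    "Mortgage,\n"
)

# Keep only the two columns we need and drop rows with an empty narrative.
consumercompliants = pd.read_csv(
    csv_data, usecols=["Product", "Consumer complaint narrative"]
)
consumercompliants = consumercompliants.dropna(
    subset=["Consumer complaint narrative"]
)
print(consumercompliants.head())
```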

Your output will look something like this.

fastText expects the data to be formatted something like this:

__label__1 this is my text

__label__2 this is also my text

We now need to prepare the Product data the same way. The output should look like:

__label__credit_reporting An account on my credit report…

__label__Debt_collection This company refuses to provide me…

Let’s extend the previous code.
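That extension is also a screenshot in the original; here is a hedged sketch of the labelling step, using a tiny stand-in for the consumercompliants dataframe loaded earlier:

```python
import pandas as pd

# Minimal stand-in for the dataframe built in the previous step.
consumercompliants = pd.DataFrame({
    "Product": ["Credit reporting", "Debt collection"],
    "Consumer complaint narrative": [
        "An account on my credit report is wrong",
        "This company refuses to provide me validation",
    ],
})

# Prefix each narrative with "__label__<product>", replacing spaces in the
# product name with underscores so the label stays a single token.
labels = "__label__" + consumercompliants["Product"].str.replace(" ", "_")
lines = labels + " " + consumercompliants["Consumer complaint narrative"]

# One labeled example per line, as fastText expects.
with open("consumer.complaints.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
print(lines.iloc[0])
```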

Give this command to check the contents:

consumercompliants.head(200)

Below is the output.

Also check the tail:

consumercompliants.tail(1000)

In the terminal, give this command to see if the file was written correctly:

head consumer.complaints.txt

Count the number of records:

wc consumer.complaints.txt

The output is:

314263 62315198 415588908 consumer.complaints.txt

wc reports lines, words, and bytes, so the file has 314,263 records.

We will make two datasets: about 80% of the data (250,000 records) for training and 20% (64,263 records) for validation (testing).

head -n 250000 consumer.complaints.txt > complaints.train.txt

tail -n 64263 consumer.complaints.txt > complaints.valid.txt

Now train the model using fastText:

./fasttext supervised -input complaints.train.txt -output model_complaints

We can now test the model interactively; the trailing - tells fastText to read complaints from standard input:

./fasttext predict model_complaints.bin -

It classifies some complaints correctly, but others did not go well, especially phone and customer service. I liked “Where is money?”: the model correctly predicted that it is related to money transfers. However, I would think “I need to speak to someone” should ideally be related to service, but in this case it shows up under Mortgage. We will enhance the model further in later steps.

We can also check the effectiveness of the model using test data.

./fasttext test model_complaints.bin complaints.valid.txt

Here N is the number of test records, followed by precision at 1 (P@1) and recall at 1 (R@1).

Precision is the fraction of the labels predicted by fastText that are correct. Recall is the fraction of the real labels that were successfully predicted.
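To make that concrete, here is a toy illustration (not fastText’s implementation): with exactly one true label per complaint and one predicted label, P@1 and R@1 reduce to the same fraction of correct predictions.

```python
# Toy single-label example: one true label and one prediction per complaint.
true_labels = ["mortgage", "debt_collection", "credit_reporting", "mortgage"]
predicted   = ["mortgage", "credit_reporting", "credit_reporting", "mortgage"]

correct = sum(t == p for t, p in zip(true_labels, predicted))
precision_at_1 = correct / len(predicted)    # correct predictions / predictions made
recall_at_1 = correct / len(true_labels)     # true labels recovered / total true labels
print(precision_at_1, recall_at_1)  # 0.75 0.75
```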

We will try improving the model by separating punctuation from the words and changing upper-case letters to lower case.

cat consumer.complaints.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > consumer.processed.txt

Let’s again make the data for training and testing.

head -n 250000 consumer.processed.txt > complaints.processed.train.txt

tail -n 64263 consumer.processed.txt > complaints.processed.valid.txt

./fasttext supervised -input complaints.processed.train.txt -output model_complaints_processed

The vocabulary dropped from 229,616 words to 114,973, almost a 50% reduction.

./fasttext test model_complaints_processed.bin complaints.processed.valid.txt

You can see that precision increased from 0.788 to 0.793, about 0.6%. Not a big change.

The number of times each example is seen (also known as the number of epochs) can be increased using the -epoch option:

./fasttext supervised -input complaints.processed.train.txt -output model_complaints_processed -epoch 25

Check the model’s effectiveness now:

./fasttext test model_complaints_processed.bin complaints.processed.valid.txt

Precision increased from 0.793 to 0.802, about 1%.

Now let’s change the learning rate. A learning rate of 0 would mean that the model does not change at all and thus does not learn anything. Good values of the learning rate are in the range 0.1 to 1.0.

./fasttext supervised -input complaints.processed.train.txt -output model_complaints_processed -lr 1.0

./fasttext test model_complaints_processed.bin complaints.processed.valid.txt

There is no change.

Now let’s try both epoch and learning rate together.

./fasttext supervised -input complaints.processed.train.txt -output model_complaints_processed -lr 1.0 -epoch 25

./fasttext test model_complaints_processed.bin complaints.processed.valid.txt

That does not look like a good attempt; precision went down. Let’s revert to just -epoch 25.

Finally, let’s try word n-grams.

./fasttext supervised -input complaints.processed.train.txt -output model_complaints_processed -epoch 25 -wordNgrams 2

./fasttext test model_complaints_processed.bin complaints.processed.valid.txt

Precision is now up from 0.802 to 0.814, about 1.5%. Let’s try some manual examples on this model.

./fasttext predict model_complaints_processed.bin -

Not fully accurate. Well, the machine will learn with time and more real-world scenarios. When I get time, I will try removing the stop words and test the model again.

That’s all, folks. If you liked the article, please share and don’t forget to clap.

Thanks to Sunil M for helping me with beautiful Python code.

