BBC News Classification Kaggle Mini-Project¶

The objectives of this mini-project are:

  • Perform an exploratory data analysis (EDA) procedure.
  • Build and train an unsupervised learning model using Non-negative Matrix Factorisation (NMF).
  • Build and train a supervised learning model.
  • Compare the two models.

Project setup¶

Import the required modules and load the datasets.

In [7]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/learn-ai-bbc/BBC News Train.csv
/kaggle/input/learn-ai-bbc/BBC News Sample Solution.csv
/kaggle/input/learn-ai-bbc/BBC News Test.csv
In [26]:
from itertools import permutations
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from wordcloud import WordCloud
In [9]:
data_train = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Train.csv")
data_test = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Test.csv")
sample_solution = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Sample Solution.csv")

Inspection¶

Training data¶

In [10]:
print(data_train.info())
data_train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.1+ KB
None
Out[10]:
ArticleId Text Category
0 1833 worldcom ex-boss launches defence lawyers defe... business
1 154 german business confidence slides german busin... business
2 1101 bbc poll indicates economic gloom citizens in ... business
3 1976 lifestyle governs mobile choice faster bett... tech
4 917 enron bosses in $168m payout eighteen former e... business

Using the data tab in the competition, we can see that all rows are valid, but only 1440 out of 1490 values in the Text column are unique.

In addition, the Category column has a dtype of object rather than a categorical dtype.

The next cell fixes these two issues.

In [11]:
data_train = data_train.drop_duplicates(subset = ["Text"])
data_train.Category = pd.Categorical(data_train.Category)

Next we check the proportions of the categories using a pie chart. From the chart below, the categories appear reasonably well balanced.

In [12]:
data_train.Category.value_counts().plot.pie(autopct='%1.1f%%', ylabel='')
Out[12]:
<Axes: >
[Pie chart: proportion of each category in the training data]

Test data and provided sample solution¶

Both have the same structure as the training data, except that the test set drops the Category column and the sample solution drops the Text column.

In [13]:
print(data_test.info())
data_test.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  735 non-null    int64 
 1   Text       735 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.6+ KB
None
Out[13]:
ArticleId Text
0 1018 qpr keeper day heads for preston queens park r...
1 1319 software watching while you work software that...
2 1138 d arcy injury adds to ireland woe gordon d arc...
3 459 india s reliance family feud heats up the ongo...
4 1020 boro suffer morrison injury blow middlesbrough...
In [14]:
print(sample_solution.info())
sample_solution.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  735 non-null    int64 
 1   Category   735 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.6+ KB
None
Out[14]:
ArticleId Category
0 1018 sport
1 1319 tech
2 1138 business
3 459 entertainment
4 1020 politics

Preprocessing¶

Tf-idf converts a raw count matrix into a representation better suited to our use case: it reduces the impact of words that occur in many documents, so that words occurring less frequently overall, which are more likely to define the topic of an article, carry proportionally more weight.
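For reference, with scikit-learn's defaults (smooth_idf=True, followed by L2 row normalisation; see the TfidfTransformer reference at the end), the weight of a term $t$ in a document $d$ is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \left(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\right),$$

where $n$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$.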

Before applying tf-idf, there are preprocessing tasks under consideration:

  1. Convert to lower case.
  2. Remove punctuation.
  3. Remove stopwords.
  4. Stemming/Lemmatisation.
  5. Stripping extra white space.

These steps remove uninformative features (such as stopwords or words that appear frequently regardless of topic). Words that are essentially the same, such as Business and business, should be treated as a single token so that their counts are pooled, making the features more useful.

Fortunately, all of these except lemmatisation can be handled by TfidfVectorizer itself, which greatly simplifies the implementation. When analyzer='word' (the default), punctuation and extra white space are stripped automatically during tokenisation, and lowercasing and stopword removal are controlled by the lowercase and stop_words parameters, as the toy example below illustrates.
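A quick sanity check (a toy example, not part of the pipeline): the default word analyser with stop_words='english' already lowercases, strips punctuation and extra white space, and drops stopwords.

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: mixed case, punctuation and a stopword ("the").
toy = ["Business booms!  business, BUSINESS...", "The economy slows."]
demo_vectoriser = TfidfVectorizer(stop_words="english")  # lowercase=True and analyzer="word" by default
demo_vectoriser.fit(toy)
print(demo_vectoriser.get_feature_names_out())  # expected: ['booms' 'business' 'economy' 'slows']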

An idea to increase performance¶

As for lemmatisation, it seems potentially useful according to this Stack Exchange answer (reference 2 below). However, this paper (reference 3) suggests that the added complexity is not worth it, so I decided not to implement it after all. For reference, a sketch of how it could have been wired in is shown below.
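For reference only (not used in this notebook), a minimal sketch of how a lemmatiser could be plugged into TfidfVectorizer via a custom tokenizer, assuming NLTK and its wordnet data are installed:

In [ ]:
import re
from nltk.stem import WordNetLemmatizer  # assumes nltk and its 'wordnet' corpus are available

lemmatiser = WordNetLemmatizer()
token_re = re.compile(r"(?u)\b\w\w+\b")  # TfidfVectorizer's default token pattern

def lemma_tokeniser(text):
    # Lowercase, tokenise, then lemmatise each token (noun lemmas by default).
    return [lemmatiser.lemmatize(tok) for tok in token_re.findall(text.lower())]

# Passing a custom tokenizer together with stop_words triggers a consistency
# warning from scikit-learn, but otherwise works:
# vectoriser = TfidfVectorizer(tokenizer=lemma_tokeniser, stop_words="english")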

Instead, I will try replacing the numbers in the text with the token NUM using a regex. This is much easier to implement and should give some insight into whether the kind of normalisation performed by stemming/lemmatisation actually helps.

In [38]:
vectoriser = TfidfVectorizer(stop_words="english") # lowercase is True by default and analyzer==word.
combined_text = pd.concat([data_train.Text, data_test.Text], ignore_index=True)
vectorised = vectoriser.fit_transform(combined_text)
print(vectorised.shape)
(2175, 29126)

Now we visualise the most common words. Because we used the version of the text that retains numbers, some of them appear here; among the numbers, 000 and specific years occur most frequently.

In [37]:
word_frequencies = pd.DataFrame(vectorised.toarray(), columns=vectoriser.get_feature_names_out()).T.sum(axis=1)
wordcloud = WordCloud(background_color='white').generate_from_frequencies(word_frequencies.to_dict())
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
[Word cloud of the most frequent terms across the combined corpus]

Building and training the NMF model¶

Since NMF is unsupervised and needs no labels, the model below is fitted on the combined train and test tf-idf matrix built above. Label assignment and accuracy are then computed from the train portion of the predictions only, as those are the rows with known categories. (One could instead fit on the training articles alone, but because no labels are used there is no leakage in including the unlabelled test articles; see the note before the test predictions below.)

Accuracy will be used as the performance metric as it is the one used by the competition.

An earlier run of the code below gave an accuracy of 90.8% on the training portion; the run shown here reports 91.7%. Surprisingly, when tested using data_test and submitted, it returned an even higher accuracy of 92.2% (V5).

In [42]:
nmf = NMF(n_components=5)
train_pred = nmf.fit_transform(vectorised).argmax(axis=1)[:1440]
In [43]:
def label_permute_compare(ytdf,yp):
    """
    ytdf: labels column from dataframe object.
    yp: label prediction output.
    Returns permuted label order and accuracy.
    Example output: ('business', 'politics', 'sport', 'entertainment', 'tech'), 0.74 .
    """
    y_true = ytdf
    best_acc = 0.
    for perm in permutations(ytdf.cat.categories):
        y_pred = [perm[i] for i in yp]
        accuracy = accuracy_score(y_true, y_pred)
        if accuracy > best_acc:
            best_acc = accuracy
            best_perm = perm
    return best_perm, best_acc

labels, acc = label_permute_compare(data_train.Category, train_pred)
print("Accuracy: %.1f%%" % (acc * 100))
confusion_matrix(data_train.Category, [labels[x] for x in train_pred])
Accuracy: 91.7%
Out[43]:
array([[302,   1,  17,   1,  14],
       [  5, 211,   7,   3,  37],
       [ 13,   0, 242,   2,   9],
       [  1,   2,   0, 339,   0],
       [  0,   3,   0,   4, 227]])
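
The brute-force search above checks all 5! = 120 assignments, which is perfectly fine for five topics. As a side note (a sketch only, not needed here), the same cluster-to-label matching can be done in polynomial time with the Hungarian algorithm from SciPy, which would matter if the number of topics were larger:

In [ ]:
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def label_assign_compare(ytdf, yp):
    # Match NMF component ids to category names by maximising the diagonal
    # of the contingency table: O(k^3) instead of O(k!).
    categories = list(ytdf.cat.categories)
    y_codes = ytdf.cat.codes.to_numpy()    # true labels encoded 0..k-1, in the same order as categories
    cont = confusion_matrix(y_codes, yp)   # rows: true categories, columns: NMF components
    rows, cols = linear_sum_assignment(cont, maximize=True)
    mapping = {c: categories[r] for r, c in zip(rows, cols)}  # component id -> category name
    labels = tuple(mapping[i] for i in range(len(categories)))
    acc = cont[rows, cols].sum() / len(y_codes)
    return labels, acc

# labels, acc = label_assign_compare(data_train.Category, train_pred)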

Try replacing the numbers with NUM¶

Unfortunately, we see here that reducing vocabulary complexity by generalising the numbers actually decreases the performance of the model.

In [44]:
text_without_nums = combined_text.str.replace(r'\d+', 'NUM', regex=True)
vectorised_without_nums = TfidfVectorizer(stop_words="english").fit_transform(text_without_nums)
without_nums_pred = NMF(5).fit_transform(vectorised_without_nums).argmax(axis=1)[:1440]
labels_without_num, acc_without_num = label_permute_compare(data_train.Category, without_nums_pred)
print("Accuracy: %.1f%%" % (acc_without_num * 100))
confusion_matrix(data_train.Category, [labels_without_num[x] for x in without_nums_pred])
Accuracy: 88.0%
Out[44]:
array([[287,   1,  30,   1,  16],
       [ 20, 200,   8,   6,  29],
       [ 12,   0, 241,   5,   8],
       [ 19,   2,   0, 321,   0],
       [  7,   3,   0,   6, 218]])

Changing the hyperparameters¶

In the previous models, parameters such as the maximum number of features and the normalisation scheme were left at their defaults. We now try different combinations to improve accuracy.

In [48]:
def gen_model(vectoriser):
    model = {"vectoriser": vectoriser}
    vectorised = vectoriser.fit_transform(combined_text)
    nmf = NMF(n_components=5)
    model["nmf"] = nmf
    y_pred = nmf.fit_transform(vectorised).argmax(axis=1)[:1440]
    model["pred"] = y_pred
    model["labels"], model["acc"] = label_permute_compare(data_train.Category, y_pred)
    return model

best_accuracy = 0.
for norm in ("l1", "l2", None):
    for max_features in (2000, 4000, 6000, 8000, None):
        model = gen_model(TfidfVectorizer(stop_words="english", norm=norm, max_features=max_features))
        print("norm=%s, max_features=%s, acc=%.5f%%" % (norm, max_features, model["acc"] * 100))
        if model["acc"] > best_accuracy:
            best_model = model
            best_accuracy = model["acc"]
print("Best model accuracy = %.3f%%" % (best_model["acc"] * 100))
confusion_matrix(data_train.Category, [best_model["labels"][x] for x in best_model["pred"]])
norm=l1, max_features=2000, acc=90.20833%
norm=l1, max_features=4000, acc=90.27778%
norm=l1, max_features=6000, acc=90.97222%
norm=l1, max_features=8000, acc=90.97222%
norm=l1, max_features=None, acc=90.69444%
norm=l2, max_features=2000, acc=91.25000%
norm=l2, max_features=4000, acc=91.73611%
norm=l2, max_features=6000, acc=91.52778%
norm=l2, max_features=8000, acc=91.45833%
norm=l2, max_features=None, acc=91.73611%
norm=None, max_features=2000, acc=47.22222%
norm=None, max_features=4000, acc=42.29167%
norm=None, max_features=6000, acc=42.01389%
norm=None, max_features=8000, acc=40.69444%
norm=None, max_features=None, acc=45.69444%
Best model accuracy = 91.736%
Out[48]:
array([[304,   1,  18,   0,  12],
       [  5, 209,   7,   4,  38],
       [ 13,   0, 243,   2,   8],
       [  2,   2,   0, 338,   0],
       [  0,   3,   0,   4, 227]])

Test on data_test¶

The model has been fitted on the train and test datasets together, since NMF does not require labels to train. Because the test labels are never used, there is no "cheating" involved: in a real-life scenario nothing prevents an unsupervised model from being fitted on the unlabelled data it is about to predict, which is exactly what we do here.

In [50]:
vectorised = best_model["vectoriser"].transform(combined_text)
test_pred = best_model["nmf"].transform(vectorised).argmax(axis=1)[1440:]
In [51]:
sample_solution.Category = [best_model["labels"][x] for x in test_pred]
print("The order of solution df and test df is equivalent:", all(sample_solution.ArticleId == data_test.ArticleId))
print(sample_solution)
sample_solution.to_csv("submission.csv", index=False)
The order of solution df and test df is equivalent: True
     ArticleId       Category
0         1018          sport
1         1319           tech
2         1138          sport
3          459       business
4         1020          sport
..         ...            ...
730       1923       business
731        373  entertainment
732       1704           tech
733        206       business
734        471       politics

[735 rows x 2 columns]

Building and training the Supervised Learning model¶

We will now compare this performance with that of supervised learning. For the demonstration, one of the simplest supervised classifiers will be used: logistic regression. To start with, the model is trained on 80% of the labelled data.

In [23]:
train_vectorised = best_model["vectoriser"].transform(data_train.Text)
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.2, random_state=4)
clf = LogisticRegression(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
confusion_matrix(y_test, clf.predict(X_test))
Training score: 99.7%
Test score: 96.9%
Out[23]:
array([[69,  0,  0,  0,  1],
       [ 0, 46,  0,  0,  0],
       [ 4,  0, 54,  0,  0],
       [ 0,  0,  0, 65,  0],
       [ 0,  3,  0,  1, 45]])

It is evident that supervised learning outperforms the unsupervised approach on this task when labels are readily available. However, the gap between the training score and the test score raises concerns, as such models have a tendency to overfit. We will therefore reduce the size of the training data in decrements of 20% to see how that affects performance.

In [24]:
print("Training size: 60%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.4, random_state=4)
clf = LogisticRegression(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
print("\nTraining size: 40%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.6, random_state=4)
clf = LogisticRegression(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
print("\nTraining size: 20%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.8, random_state=4)
clf = LogisticRegression(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
Training size: 60%
Training score: 99.8%
Test score: 97.0%

Training size: 40%
Training score: 99.8%
Test score: 96.5%

Training size: 20%
Training score: 100.0%
Test score: 92.4%

Interestingly, both training and test scores increase slightly (by about 0.1%) at first. Only when the training set shrinks to 20% does the model clearly overfit, with a training score of 100% against a test score of 92.4%. The model is therefore somewhat vulnerable to overfitting, but it is data-efficient: roughly half of the dataset is enough to reach high test scores.
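As a possible follow-up (a sketch only, not run above), scikit-learn's learning_curve sweeps the training-set size under cross-validation, which would give a less noisy picture of data efficiency than the single random splits used here:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Reuses train_vectorised and data_train from the cells above.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=4),
    train_vectorised, data_train.Category,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="train")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="validation")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()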

Trying the same with LinearSVC¶

In [31]:
print("Training size: 80%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.2, random_state=4)
clf = LinearSVC(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
Training size: 80%
Training score: 100.0%
Test score: 96.9%
In [30]:
print("Training size: 60%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.4, random_state=4)
clf = LinearSVC(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
print("\nTraining size: 40%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.6, random_state=4)
clf = LinearSVC(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
print("\nTraining size: 20%")
X_train, X_test, y_train, y_test = train_test_split(train_vectorised, data_train.Category, test_size=.8, random_state=4)
clf = LinearSVC(random_state=4).fit(X_train, y_train)
print("Training score: %.1f%%" % (clf.score(X_train, y_train) * 100))
print("Test score: %.1f%%" % (clf.score(X_test, y_test) * 100))
Training size: 60%
Training score: 100.0%
Test score: 97.2%

Training size: 40%
Training score: 100.0%
Test score: 96.9%

Training size: 20%
Training score: 100.0%
Test score: 95.7%

It seems from these tests that LinearSVC does not improve much on LogisticRegression in terms of peak performance, but it wins on data efficiency, achieving 95.7% accuracy with just 20% of the training data.
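A more robust head-to-head comparison could use cross-validation rather than single splits; a sketch (not run above) using the same vectorised training data:

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Reuses train_vectorised and data_train from the cells above.
for name, estimator in [("LogisticRegression", LogisticRegression(random_state=4)),
                        ("LinearSVC", LinearSVC(random_state=4))]:
    scores = cross_val_score(estimator, train_vectorised, data_train.Category,
                             cv=5, scoring="accuracy")
    print("%s: %.1f%% (+/- %.1f%%)" % (name, scores.mean() * 100, scores.std() * 100))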

References¶

  1. How to apply a custom stemmer before passing the training corpus to TfidfVectorizer in sklearn?
  2. How does adding/omitting lemmatization affect TF-IDF?
  3. Comparing Apples to Apple: The Effects of Stemmers on Topic Models
  4. Sklearn: adding lemmatizer to CountVectorizer
  5. TfidfTransformer