Detecting Spam Emails With Artificial Neural Networks

Spam Classifier

Spam Classifier Kaggle Competition: 96% accuracy solution

In this notebook I will be completing the challenge listed at https://inclass.kaggle.com/c/adcg-ss14-challenge-02-spam-mails-detection.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from bs4 import BeautifulSoup
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.feature_extraction.text import CountVectorizer

Step 1: Collect Data

Run extract_data.py in the working directory to get the formatted train and test datasets. Kaggle provides an extraction script that may work; however, I wrote my own that makes use of "eml_parser" to extract the subject and sender.

My parser extracts both the "subject" and "from" fields, and that's the data we'll use to train our model. The extractor writes the data to "train_data.csv" and "test_data.csv".
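For reference, here is a minimal sketch of what such an extractor can look like. The actual extract_data.py uses "eml_parser"; this sketch swaps in the standard library's email module instead, and the directory layout, file glob, and omission of the ID and prediction columns are all assumptions made for brevity:

import csv
import glob
from email import policy
from email.parser import BytesParser

def extract_emails(eml_dir, out_csv):
    """Pull the subject and from headers out of every .eml file in a directory."""
    rows = []
    for path in sorted(glob.glob(f"{eml_dir}/*.eml")):
        with open(path, 'rb') as f:
            msg = BytesParser(policy=policy.default).parse(f)
        #headers can be missing entirely in malformed spam; keep None for now
        rows.append({'subject': msg['subject'], 'from': msg['from']})
    with open(out_csv, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['subject', 'from'])
        writer.writeheader()
        writer.writerows(rows)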

Step 2: Preprocess Data

In [2]:
train_data = pd.read_csv('train_data.csv',index_col=0)
test_data = pd.read_csv('test_data.csv',index_col=0)
train_data.head()
Out[2]:
ID subject from prediction
0 1734 瑪瑙戒指-2-148- anlin002@ms82.url.com.tw 0
1 1662 [SAtalk] Version 1.1.11 of the DCC (fwd) tony@svanstrom.com 1
2 2288 Special offer for hibody, 80% better price niymekigif5719@telecomitalia.it 0
3 217 Fw: PROTECT YOUR COMPUTER,YOU NEED SYSTEMWORKS... eklabunde@hotmail.com 0
4 1084 Fwd: Norton makes the BEST software available,... cesarsimone@hotmail.com 0

Replace missing "subject" values with "no subject" and missing "from" values with "?".

In [3]:
print("train missing values",train_data.isnull().values.sum())
print("test missing values",test_data.isnull().values.sum())

#get indexes of missing values
train_indices = np.where(train_data['subject'].isna())
test_indices = np.where(test_data['subject'].isna())

#fill missing subject values
train_data['subject'] = train_data['subject'].fillna('no subject')
test_data['subject'] = test_data['subject'].fillna('no subject')
#fill missing 'from' values
train_data['from'] = train_data['from'].fillna('?')
test_data['from'] = test_data['from'].fillna('?')
    
train missing values 3
test missing values 4

Vectorize the columns with CountVectorizer from sklearn. During this step I'll be vectorizing both columns so they can be fed through the neural net. Each column we want to pass through must be vectorized separately.

Step 1: Create an instance of the vectorizer for each column
Step 2: Fit on the train data ONLY for each column
Step 3: Transform each column using .transform
Step 4: Concatenate the columns together

A toy example of what CountVectorizer produces is shown below, followed by the full helper function.
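To make Steps 1 through 3 concrete, here is a toy example (the example texts are made up) showing how CountVectorizer turns strings into a document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free money now", "meeting notes attached", "free meeting"]
vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform(docs)  #fit learns the vocabulary, transform counts words

print(vectorizer.get_feature_names())
#['attached', 'free', 'meeting', 'money', 'notes', 'now']
print(counts.todense())
#[[0 1 0 1 0 1]
# [1 0 1 0 1 0]
# [0 1 1 0 0 0]]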

In [4]:
"""
function that transforms N columns in a dataset to vectors and then concats them.
Must have == number of training and test data columns



Args:
    VectorizerClass: type of vectorizer to use
    train_data: your training data
    test_data: your test data
    
Returns vectorized columns of dataset for test and train data
"""

def vectorize_columns(VectorizerClass, train_data, test_data):
    #if your naughty
    if len(train_data) != len(test_data):
        raise ValueError("Please provide an equal number of columns in train_data and test_data")
        
    train_columns = []
    test_columns = []
    for train_column, test_column in zip(train_data, test_data):        
        #create new instance of vectorizer each iteration
        vectorizer = VectorizerClass(lowercase=True)
        #fit only on training data
        vectorizer.fit(train_column)
        #transform both train and test
        transform_train = vectorizer.transform(train_column)
        transform_test = vectorizer.transform(test_column)
        #create dataframe for later concat
        train = pd.DataFrame(transform_train.todense(), columns=vectorizer.get_feature_names())
        test = pd.DataFrame(transform_test.todense(), columns=vectorizer.get_feature_names())
        #append to corresponding lists
        train_columns.append(train)
        test_columns.append(test)
        
        
    X_train = pd.concat(train_columns,axis=1)
    X_test = pd.concat(test_columns,axis=1)
    
    return X_train, X_test

X_train, X_test = vectorize_columns(
    CountVectorizer,
    train_data=[train_data['subject'], train_data['from']],
    test_data=[test_data['subject'], test_data['from']],
)

print(X_train.shape)
print(X_test.shape)
(2500, 6311)
(1827, 6311)

Get the labels to be used for training.

In [5]:
#get y labels
y_train = train_data['prediction']

Build Model

We'll be using a shallow neural network with two dense layers, the last of which uses a sigmoid activation function so that it outputs a probability. For our loss function we'll be using binary crossentropy, which is the appropriate loss when you have only two classes to predict.

We're posing the question: what is the probability that the email is not spam? Ideally we want a probability of 1 (100%) for not spam and 0 (0%) for spam.
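As a sanity check on that loss, binary crossentropy for a single example is -(y*log(p) + (1-y)*log(1-p)), averaged over the batch. A quick numpy sketch with made-up predictions:

import numpy as np

def binary_crossentropy(y_true, y_pred):
    #average of -(y*log(p) + (1-y)*log(1-p)) over all examples
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

#a confident correct prediction gives a small loss,
#a confident wrong one a large loss
print(binary_crossentropy([1, 0], [0.9, 0.1]))  #~0.105
print(binary_crossentropy([1, 0], [0.1, 0.9]))  #~2.303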

In [6]:
input_dim = X_train.shape[1]
def create_model():
    model = Sequential()

    model.add(Dense(10, input_dim=input_dim, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))

    model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['acc'])
    
    return model
    

Using KerasClassifier and GridSearchCV, find the optimal hyperparameters and print the results.

In [7]:
# create model
model = KerasClassifier(build_fn=create_model, verbose=1)

# define the grid search parameters
epochs = [4,5,6,7,8,9,10,11,12]
#pass params must be a dict.
param_grid = dict(epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
/home/tyler/.local/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
Epoch 1/11
2500/2500 [==============================] - 1s 218us/sample - loss: 0.5848 - acc: 0.8148
Epoch 2/11
2500/2500 [==============================] - 0s 178us/sample - loss: 0.3402 - acc: 0.9532
Epoch 3/11
2500/2500 [==============================] - 0s 179us/sample - loss: 0.1919 - acc: 0.9848
Epoch 4/11
2500/2500 [==============================] - 0s 181us/sample - loss: 0.1176 - acc: 0.9916
Epoch 5/11
2500/2500 [==============================] - 0s 182us/sample - loss: 0.0785 - acc: 0.9948
Epoch 6/11
2500/2500 [==============================] - 0s 183us/sample - loss: 0.0554 - acc: 0.9960
Epoch 7/11
2500/2500 [==============================] - 0s 183us/sample - loss: 0.0410 - acc: 0.9972
Epoch 8/11
2500/2500 [==============================] - 0s 183us/sample - loss: 0.0313 - acc: 0.9984
Epoch 9/11
2500/2500 [==============================] - 0s 182us/sample - loss: 0.0246 - acc: 0.9992
Epoch 10/11
2500/2500 [==============================] - 0s 178us/sample - loss: 0.0197 - acc: 0.9996
Epoch 11/11
2500/2500 [==============================] - 0s 197us/sample - loss: 0.0161 - acc: 0.9996
Best: 0.964000 using {'epochs': 11}
0.948800 (0.003988) with: {'epochs': 4}
0.948800 (0.013910) with: {'epochs': 5}
0.950800 (0.007641) with: {'epochs': 6}
0.963200 (0.004418) with: {'epochs': 7}
0.963600 (0.003443) with: {'epochs': 8}
0.960400 (0.005948) with: {'epochs': 9}
0.962800 (0.007771) with: {'epochs': 10}
0.964000 (0.008371) with: {'epochs': 11}
0.963200 (0.007344) with: {'epochs': 12}
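The grid isn't limited to epochs. Below is a sketch of a wider (and correspondingly slower) search, assuming hypothetically that we also want to tune the batch size and the hidden layer width; the latter requires create_model to accept a units argument, which the version above does not:

def create_model(units=10):
    model = Sequential()
    model.add(Dense(units, input_dim=input_dim, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model

model = KerasClassifier(build_fn=create_model, verbose=0)
#every combination is tried, so the grid grows multiplicatively
param_grid = dict(epochs=[8, 11], batch_size=[10, 32], units=[10, 50])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)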

Fit the model using the best hyperparameters.

In [9]:
best = grid_result.best_params_
tensorboard = TensorBoard(log_dir='./logs')
history = model.fit(X_train, y_train, epochs=best['epochs'], verbose=1, validation_split=0.20, batch_size=10, callbacks=[tensorboard])
Train on 2000 samples, validate on 500 samples
Epoch 1/11
2000/2000 [==============================] - 1s 420us/sample - loss: 0.5136 - acc: 0.8395 - val_loss: 0.3314 - val_acc: 0.9260
Epoch 2/11
2000/2000 [==============================] - 1s 297us/sample - loss: 0.2052 - acc: 0.9710 - val_loss: 0.1764 - val_acc: 0.9640
Epoch 3/11
2000/2000 [==============================] - 1s 291us/sample - loss: 0.0925 - acc: 0.9920 - val_loss: 0.1301 - val_acc: 0.9620
Epoch 4/11
2000/2000 [==============================] - 1s 289us/sample - loss: 0.0518 - acc: 0.9940 - val_loss: 0.1105 - val_acc: 0.9620
Epoch 5/11
2000/2000 [==============================] - 1s 298us/sample - loss: 0.0327 - acc: 0.9965 - val_loss: 0.0995 - val_acc: 0.9660
Epoch 6/11
2000/2000 [==============================] - 1s 294us/sample - loss: 0.0220 - acc: 0.9970 - val_loss: 0.0932 - val_acc: 0.9660
Epoch 7/11
2000/2000 [==============================] - 1s 295us/sample - loss: 0.0154 - acc: 0.9990 - val_loss: 0.0881 - val_acc: 0.9660
Epoch 8/11
2000/2000 [==============================] - 1s 292us/sample - loss: 0.0111 - acc: 0.9995 - val_loss: 0.0842 - val_acc: 0.9640
Epoch 9/11
2000/2000 [==============================] - 1s 304us/sample - loss: 0.0084 - acc: 0.9995 - val_loss: 0.0827 - val_acc: 0.9660
Epoch 10/11
2000/2000 [==============================] - 1s 294us/sample - loss: 0.0064 - acc: 1.0000 - val_loss: 0.0807 - val_acc: 0.9660
Epoch 11/11
2000/2000 [==============================] - 1s 294us/sample - loss: 0.0050 - acc: 1.0000 - val_loss: 0.0796 - val_acc: 0.9660
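The TensorBoard callback wrote its logs to ./logs; to browse the training curves interactively, run tensorboard --logdir ./logs from the working directory and open the URL it prints (http://localhost:6006 by default).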

Plot Accuracy and Loss

In [10]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
In [11]:
plot_history(history)

Use the model to predict classes for the test set.

In [12]:
results = model.predict(X_test)
1827/1827 [==============================] - 0s 81us/sample
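Note that KerasClassifier.predict returns hard class labels (the sigmoid output thresholded at 0.5), which is what the Kaggle submission expects here. If you instead want the raw class probabilities, the wrapper also exposes predict_proba, e.g.:

#class probabilities instead of hard 0/1 labels
probabilities = model.predict_proba(X_test)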
In [13]:
kaggle_prediction = pd.DataFrame(columns=['Id', 'Prediction'])
In [14]:
kaggle_prediction['Id'] = test_data['ID']
In [15]:
kaggle_prediction['Prediction'] = results
kaggle_prediction.head(10)
Out[15]:
Id Prediction
0 1131 0
1 805 1
2 336 1
3 220 0
4 111 0
5 208 1
6 1574 0
7 1324 1
8 692 0
9 1243 1
In [16]:
kaggle_prediction.to_csv('keras_spam_predictions.csv' ,index=False)
