AI: Deep Learning for Web Log Anomaly Detection

For this project, I wanted to focus on anomaly detection in the domain of cyber security, and analyzing web logs for anomalies seemed like a great place to start. After doing some research, unsupervised deep learning looked like a strong fit for this type of analysis. An autoencoder neural network is a popular way to detect anomalies in data. The autoencoder tries to learn to approximate the identity function:

Identity function: h(x) ≈ x
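A minimal sketch of what this means in practice: a trained autoencoder produces a reconstruction that only approximates the input, and the gap between the two is the reconstruction error used for anomaly detection. The vectors here are made-up values, not model output.

```python
import numpy as np

# A perfectly trained autoencoder would satisfy h(x) = x. The bottleneck
# forces lossy compression, so the reconstruction x_hat only approximates x.
x = np.array([0.2, 0.7, 0.1])         # an input vector (made up)
x_hat = np.array([0.25, 0.65, 0.12])  # a hypothetical reconstruction

# Mean squared error between input and reconstruction -- the quantity an
# autoencoder-based anomaly detector thresholds on.
mse = np.mean((x - x_hat) ** 2)
```

Normal inputs resemble the training data and reconstruct with low error; anomalous inputs do not, and their error stands out.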

Here is what a typical autoencoder model might look like:

Autoencoder model

For detailed information on these models, there are plenty of blog posts and research papers for the curious mind.

As I needed comprehensive data, I looked for a set of web logs that could easily be run through my autoencoder model. I found a dataset on Kaggle: https://www.kaggle.com/shawon10/web-log-dataset#webLog.csv . This dataset is a 10,787 × 4 matrix. The four columns represent the IP address, the timestamp, the directory requested, and the HTTP response code. I removed the timestamp column from my data because nearly every one of its entries is unique and would not help elicit a pattern in the data that aids anomaly detection. Here are some charts from the output of the model:
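The preprocessing step can be sketched roughly as follows. The column names and the inline rows are assumptions standing in for the actual Kaggle CSV; the point is dropping the timestamp and turning the remaining categorical columns into a numeric matrix.

```python
import pandas as pd

# A few made-up rows in place of webLog.csv (column names assumed).
log = pd.DataFrame({
    "IP":     ["10.128.2.1", "10.131.0.1", "10.130.2.1"],
    "Time":   ["29/Nov/2017:06:58:55", "29/Nov/2017:06:59:02", "29/Nov/2017:06:59:03"],
    "URL":    ["GET /login.php HTTP/1.1", "GET /home.php HTTP/1.1", "GET /js/modernizr.js HTTP/1.1"],
    "Status": [200, 200, 200],
})

# Drop the timestamp: nearly every value is unique, so it adds noise
# rather than a learnable pattern.
log = log.drop(columns=["Time"])

# Integer-encode the remaining categorical columns so the autoencoder
# receives a purely numeric 3-column matrix (IP, URL, Status).
encoded = log.apply(lambda col: col.astype("category").cat.codes)
```

Any reversible encoding works here, since the goal is only to give the network numeric inputs whose patterns it can learn to reconstruct.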

Statistics on the reconstruction errors: [figure]

Binning of the reconstruction errors: [figure]

Plotting of the reconstruction errors vs. the data: [figure]

The first bubble in the upper-left part of the last chart is a non-patterned data point that I purposely included to verify the model is working correctly. As you can see, it does indeed stand out. I created a pipeline to extract all original data entries whose mean squared error (reconstruction error) falls above the 99th percentile. This is the threshold I used to automatically detect anomalies. Samples of the data above the threshold can be seen below; all of the data points above the threshold are available on GitHub as a separate text file. You can verify for yourself that these directories are unique in the original dataset. It is remarkable that the model was able to figure out which values are anomalies based only on some hyperparameters and training on this data.

10.4.5.2
GET /madeup.php HTTP/1.1
200
----------------------------------
10.130.2.1
GET /profile.php?user=bala HTTP/1.1
200
----------------------------------
10.131.2.1
GET /edit.php?name=bala HTTP/1.1
200
----------------------------------
10.131.2.1
GET /contestproblem.php?name=Toph%20Contest%202 HTTP/1.1
200
----------------------------------
10.131.2.1
GET /details.php?id=3 HTTP/1.1
200
----------------------------------
10.131.2.1
GET /contestsubmission.php?id=4 HTTP/1.1
200
----------------------------------
10.131.2.1
GET /edit.php?name=ksrsingh HTTP/1.1
200
----------------------------------
10.131.0.1
GET /showcode.php?id=285&nm=ksrsingh HTTP/1.1
200
----------------------------------
10.128.2.1
GET /allsubmission.php?name=shawon HTTP/1.1
200
----------------------------------
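The extraction pipeline described above can be sketched in a few lines: score every log entry by its reconstruction MSE, take the 99th percentile of those scores as the threshold, and keep the entries above it. The error values here are synthetic stand-ins for real model output.

```python
import numpy as np

# One reconstruction error (MSE) per log entry; synthetic placeholder
# values in place of the autoencoder's actual output.
rng = np.random.default_rng(0)
errors = rng.exponential(scale=1.0, size=10787)

# 99th percentile of the errors is the anomaly threshold, so roughly
# 1% of the 10,787 entries will be flagged.
threshold = np.percentile(errors, 99)
anomalous_idx = np.where(errors > threshold)[0]
```

With the indices in hand, the matching rows of the original (pre-encoding) log can be written out for manual review, as in the samples above.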

If there are issues accessing my GitHub repo below, I have a zipped file with my code, model, and datasets here: Repo Copy

Please see my GitHub for the code, model, and dataset related to this project.

I've also included my output from Keras below:

Found 271 unique tokens.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_4 (InputLayer)         (None, 3)                 0
_________________________________________________________________
dense_13 (Dense)             (None, 2)                 8
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 3
_________________________________________________________________
dense_15 (Dense)             (None, 2)                 4
_________________________________________________________________
dense_16 (Dense)             (None, 3)                 9
=================================================================
Total params: 24
Trainable params: 24
Non-trainable params: 0
_________________________________________________________________
Train on 8630 samples, validate on 2157 samples
Epoch 1/50
8630/8630 [==============================] - 1s 77us/step - loss: 544.2785 - acc: 0.3645 - val_loss: 250.2417 - val_acc: 0.0000e+00
Epoch 2/50
8630/8630 [==============================] - 1s 58us/step - loss: 542.9074 - acc: 0.8287 - val_loss: 249.4843 - val_acc: 0.0000e+00
Epoch 3/50
8630/8630 [==============================] - 0s 56us/step - loss: 541.6439 - acc: 0.1955 - val_loss: 248.8086 - val_acc: 0.0000e+00
Epoch 4/50
8630/8630 [==============================] - 0s 56us/step - loss: 540.5283 - acc: 0.5802 - val_loss: 248.2224 - val_acc: 0.0000e+00
Epoch 5/50
8630/8630 [==============================] - 0s 57us/step - loss: 539.5738 - acc: 0.9196 - val_loss: 247.7275 - val_acc: 0.9986
Epoch 6/50
8630/8630 [==============================] - 0s 58us/step - loss: 538.7705 - acc: 0.9461 - val_loss: 247.3153 - val_acc: 0.9986
Epoch 7/50
8630/8630 [==============================] - 0s 56us/step - loss: 538.1015 - acc: 0.9461 - val_loss: 246.9732 - val_acc: 0.9986
Epoch 8/50
8630/8630 [==============================] - 0s 57us/step - loss: 537.5472 - acc: 0.9461 - val_loss: 246.6904 - val_acc: 0.9986
Epoch 9/50
8630/8630 [==============================] - 0s 57us/step - loss: 537.0872 - acc: 0.9461 - val_loss: 246.4559 - val_acc: 0.9986
.................................................................
Epoch 45/50
8630/8630 [==============================] - 0s 57us/step - loss: 534.4239 - acc: 0.9461 - val_loss: 245.0778 - val_acc: 0.9986
Epoch 46/50
8630/8630 [==============================] - 0s 56us/step - loss: 534.4204 - acc: 0.9461 - val_loss: 245.0758 - val_acc: 0.9986
Epoch 47/50
8630/8630 [==============================] - 0s 56us/step - loss: 534.4172 - acc: 0.9461 - val_loss: 245.0742 - val_acc: 0.9986
Epoch 48/50
8630/8630 [==============================] - 0s 57us/step - loss: 534.4143 - acc: 0.9461 - val_loss: 245.0727 - val_acc: 0.9986
Epoch 49/50
8630/8630 [==============================] - 0s 56us/step - loss: 534.4117 - acc: 0.9461 - val_loss: 245.0713 - val_acc: 0.9986
Epoch 50/50
8630/8630 [==============================] - 0s 56us/step - loss: 534.4094 - acc: 0.9461 - val_loss: 245.0701 - val_acc: 0.9986
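The parameter counts in the summary above follow directly from the formula for a fully connected (Dense) layer: params = units × (inputs + 1), where the +1 accounts for the bias term. A quick check against the 3 → 2 → 1 → 2 → 3 architecture the summary implies:

```python
def dense_params(inputs: int, units: int) -> int:
    # Each unit has one weight per input plus a bias term.
    return units * (inputs + 1)

# (inputs, units) for each Dense layer in the 3 -> 2 -> 1 -> 2 -> 3 stack.
layers = [(3, 2), (2, 1), (1, 2), (2, 3)]
counts = [dense_params(i, u) for i, u in layers]
# counts reproduces the 8, 3, 4, 9 per-layer params and the total of 24.
```

The single-unit middle layer is the bottleneck that forces the network to compress each 3-column log entry, which is what makes the reconstruction error meaningful.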