AI Security — the concept of Overfitting in Machine Learning

Vishal Thakur
8 min read · Apr 10, 2024


Overfitting, like Underfitting, is an important concept to understand if you’re interested in AI Security. In this post, we’ll take a look at what Overfitting means, how it can occur, and a hands-on example of an overfit model.

Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. When data scientists use machine learning models for making predictions, they first train the model on a known data set. Then, based on this information, the model tries to predict outcomes for new data sets. An overfit model can give inaccurate predictions and cannot perform well for all types of new data.¹

First up, have your AI Security lab up and running. If you don’t have one, you can use this post to set one up and come back here once ready.

Setup and load data for testing

For this lab, we’ll again use the fashion_mnist dataset that we used previously to demonstrate how underfitting works.

Fashion-MNIST is a dataset of Zalando’s article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.²

Create a Python file (*.py) at this point, give it a name of your choice, and keep adding the code provided below to it for testing.

Let’s start by loading the data and preparing it:

import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

train_images = train_images / 255.0
test_images = test_images / 255.0

train_images = train_images.reshape((-1, 28, 28, 1))
test_images = test_images.reshape((-1, 28, 28, 1))

Run the above code and you should see output similar to the image below:
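If you want to see what the scaling and reshape steps do without downloading the dataset, here is a minimal sketch on synthetic data. The array `fake_images` and the batch size of 4 are made up for illustration; only the two transformations mirror the lab code.

```python
import numpy as np

# Synthetic stand-in for a small batch of Fashion-MNIST images:
# 4 grayscale images, 28x28 pixels, integer intensities in [0, 255].
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Scale pixel values into [0, 1], as done for train_images and test_images.
scaled = fake_images / 255.0

# Add a trailing channel axis; -1 lets NumPy infer the batch size.
reshaped = scaled.reshape((-1, 28, 28, 1))

print(reshaped.shape)  # (4, 28, 28, 1)
print(reshaped.min() >= 0.0, reshaped.max() <= 1.0)  # True True
```

The extra axis of size 1 is the channel dimension that a Conv2D layer expects for grayscale input.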


In order to achieve overfitting, we need to increase the complexity of our model as the first step.

In the code below, you can see that we have changed quite a few things, compared to the similar code section in our underfitting post.

We won’t dive deep into this code section in this post (that’s best left for a separate post) but take note that we can increase the complexity of an AI model by increasing the number of parameters and the depth of the architecture.

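model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),  # needed before the Dense layers
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Assumed compile settings; the labels are integer class IDs
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])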
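To get a feel for how much "complexity" this architecture carries, we can roughly count its trainable parameters by hand. This is a back-of-the-envelope sketch, not output from Keras: it assumes the Keras defaults (no padding on the convolutions, 2x2 pooling) and assumes a Flatten layer sits between the last pooling layer and the first Dense layer. The helper names are made up for this example.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights for every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """One weight per input/output pair plus one bias per output."""
    return in_units * out_units + out_units


def after_conv_and_pool(size, kernel=3, pool=2):
    """Spatial size after an unpadded conv followed by max pooling."""
    return (size - kernel + 1) // pool


side = after_conv_and_pool(28)    # 28 -> 26 -> 13
side = after_conv_and_pool(side)  # 13 -> 11 -> 5
flat_units = side * side * 128    # assumes a Flatten layer here

layer_params = {
    "conv2d_64": conv2d_params(3, 3, 1, 64),
    "conv2d_128": conv2d_params(3, 3, 64, 128),
    "dense_128": dense_params(flat_units, 128),
    "dense_10": dense_params(128, 10),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print("total:", total)  # total: 485514
```

Under these assumptions, most of the parameters sit in the first Dense layer. That is a lot of capacity for 28x28 grayscale images, which is what gives the model room to memorize the training set.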


Train the Model

In order to achieve overfitting, we’ll need to train the model for a large number of epochs. Because we are not using early stopping, the model will train for the full 50 epochs in this case.

history = model.fit(train_images, train_labels, epochs=50, validation_split=0.2)

This could take a while, as we’ve gone for 50 ‘epochs’ in our code.
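For contrast, here is a rough idea of what early stopping would have done had we used it. This is a simplified pure-Python sketch of a patience rule, not the Keras callback itself. The function name and the `patience` value of 3 are assumptions for illustration. The validation accuracies are the first 15 epochs from the results listed further down in this post, rounded to four decimals.

```python
def early_stop_epoch(val_scores, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the score has failed to improve on its best
    value for `patience` consecutive epochs (higher score is better).
    """
    best_score = float("-inf")
    best_epoch = 0
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training runs to the end.
    return len(val_scores), best_epoch


# Validation accuracy for epochs 1-15 of this lab's run (rounded).
val_acc_first_15 = [
    0.8771, 0.8963, 0.9028, 0.9058, 0.9153,
    0.9147, 0.9108, 0.9177, 0.9150, 0.9208,
    0.9100, 0.9169, 0.9139, 0.9143, 0.9138,
]

stop_epoch, best_epoch = early_stop_epoch(val_acc_first_15, patience=3)
print(stop_epoch, best_epoch)  # 13 10
```

With this rule, training would have stopped at epoch 13 instead of running all 50. Skipping it is exactly what lets the model keep fitting the training data.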


Now, we are ready to evaluate the model and test its accuracy.

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)

In this instance, we have achieved high accuracy as you can see in the image below:

In order to see if overfitting has occurred, we need to visualize the results.

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(50)

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()  # needed when running as a script rather than in a notebook

Looking at the graphs below, we can see that the training accuracy keeps increasing while the validation accuracy ‘plateaus’ after an initial increase.

We can also simply print out the accuracy data for a more granular look by running the following code (for each epoch):

training_accuracy = history.history['accuracy']
validation_accuracy = history.history['val_accuracy']

for epoch in range(len(training_accuracy)):
    print(f"Epoch {epoch+1}:")
    print(f" Training Accuracy: {training_accuracy[epoch]}")
    print(f" Validation Accuracy: {validation_accuracy[epoch]}")

Here are the results for the current model for each of the 50 epochs:

Epoch 1:
Training Accuracy: 0.836020827293396
Validation Accuracy: 0.8770833611488342
Epoch 2:
Training Accuracy: 0.8925208449363708
Validation Accuracy: 0.8963333368301392
Epoch 3:
Training Accuracy: 0.9106666445732117
Validation Accuracy: 0.9028333425521851
Epoch 4:
Training Accuracy: 0.9213333129882812
Validation Accuracy: 0.9058333039283752
Epoch 5:
Training Accuracy: 0.9327499866485596
Validation Accuracy: 0.9153333306312561
Epoch 6:
Training Accuracy: 0.9419375061988831
Validation Accuracy: 0.9147499799728394
Epoch 7:
Training Accuracy: 0.9495208263397217
Validation Accuracy: 0.9108333587646484
Epoch 8:
Training Accuracy: 0.9580416679382324
Validation Accuracy: 0.9176666736602783
Epoch 9:
Training Accuracy: 0.9645624756813049
Validation Accuracy: 0.9150000214576721
Epoch 10:
Training Accuracy: 0.9696249961853027
Validation Accuracy: 0.9207500219345093
Epoch 11:
Training Accuracy: 0.9739999771118164
Validation Accuracy: 0.9100000262260437
Epoch 12:
Training Accuracy: 0.9785416722297668
Validation Accuracy: 0.9169166684150696
Epoch 13:
Training Accuracy: 0.9796666502952576
Validation Accuracy: 0.9139166474342346
Epoch 14:
Training Accuracy: 0.9827499985694885
Validation Accuracy: 0.9142500162124634
Epoch 15:
Training Accuracy: 0.9847916960716248
Validation Accuracy: 0.9138333201408386
Epoch 16:
Training Accuracy: 0.9861041903495789
Validation Accuracy: 0.9172499775886536
Epoch 17:
Training Accuracy: 0.9881250262260437
Validation Accuracy: 0.9129999876022339
Epoch 18:
Training Accuracy: 0.9873124957084656
Validation Accuracy: 0.9104166626930237
Epoch 19:
Training Accuracy: 0.9887083172798157
Validation Accuracy: 0.9099166393280029
Epoch 20:
Training Accuracy: 0.9896458387374878
Validation Accuracy: 0.9139999747276306
Epoch 21:
Training Accuracy: 0.9903125166893005
Validation Accuracy: 0.9115833044052124
Epoch 22:
Training Accuracy: 0.9913125038146973
Validation Accuracy: 0.9131666421890259
Epoch 23:
Training Accuracy: 0.9902916550636292
Validation Accuracy: 0.9134166836738586
Epoch 24:
Training Accuracy: 0.991812527179718
Validation Accuracy: 0.9141666889190674
Epoch 25:
Training Accuracy: 0.99197918176651
Validation Accuracy: 0.9131666421890259
Epoch 26:
Training Accuracy: 0.992145836353302
Validation Accuracy: 0.9079166650772095
Epoch 27:
Training Accuracy: 0.9912291765213013
Validation Accuracy: 0.9160000085830688
Epoch 28:
Training Accuracy: 0.9945416450500488
Validation Accuracy: 0.9135000109672546
Epoch 29:
Training Accuracy: 0.9927916526794434
Validation Accuracy: 0.9128333330154419
Epoch 30:
Training Accuracy: 0.9939791560173035
Validation Accuracy: 0.9135000109672546
Epoch 31:
Training Accuracy: 0.9925833344459534
Validation Accuracy: 0.9170833230018616
Epoch 32:
Training Accuracy: 0.9934791922569275
Validation Accuracy: 0.909416675567627
Epoch 33:
Training Accuracy: 0.9928541779518127
Validation Accuracy: 0.9085000157356262
Epoch 34:
Training Accuracy: 0.9948333501815796
Validation Accuracy: 0.9139999747276306
Epoch 35:
Training Accuracy: 0.9943541884422302
Validation Accuracy: 0.909583330154419
Epoch 36:
Training Accuracy: 0.9935833215713501
Validation Accuracy: 0.9089999794960022
Epoch 37:
Training Accuracy: 0.9949374794960022
Validation Accuracy: 0.9130833148956299
Epoch 38:
Training Accuracy: 0.9937291741371155
Validation Accuracy: 0.9144999980926514
Epoch 39:
Training Accuracy: 0.995145857334137
Validation Accuracy: 0.9141666889190674
Epoch 40:
Training Accuracy: 0.9937499761581421
Validation Accuracy: 0.9139166474342346
Epoch 41:
Training Accuracy: 0.9948333501815796
Validation Accuracy: 0.9139166474342346
Epoch 42:
Training Accuracy: 0.9951249957084656
Validation Accuracy: 0.9166666865348816
Epoch 43:
Training Accuracy: 0.9948124885559082
Validation Accuracy: 0.9159166812896729
Epoch 44:
Training Accuracy: 0.9954166412353516
Validation Accuracy: 0.9127500057220459
Epoch 45:
Training Accuracy: 0.9952083230018616
Validation Accuracy: 0.9144166707992554
Epoch 46:
Training Accuracy: 0.995437502861023
Validation Accuracy: 0.9162499904632568
Epoch 47:
Training Accuracy: 0.995187520980835
Validation Accuracy: 0.9105833172798157
Epoch 48:
Training Accuracy: 0.9949374794960022
Validation Accuracy: 0.9150000214576721
Epoch 49:
Training Accuracy: 0.9965624809265137
Validation Accuracy: 0.9069166779518127
Epoch 50:
Training Accuracy: 0.9944791793823242
Validation Accuracy: 0.9125000238418579
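One quick way to read these numbers is to look at the gap between training and validation accuracy at a few checkpoints. The sketch below hard-codes five epochs from the results above, rounded to four decimals. The choice of epochs and the variable names are just for illustration.

```python
# (training accuracy, validation accuracy) for selected epochs,
# copied from the results above and rounded to four decimals.
logged = {
    1: (0.8360, 0.8771),
    5: (0.9327, 0.9153),
    10: (0.9696, 0.9208),
    25: (0.9920, 0.9132),
    50: (0.9945, 0.9125),
}

# A growing positive gap means the model keeps improving on data it
# has seen while standing still on data it has not.
gaps = {epoch: round(train - val, 4) for epoch, (train, val) in logged.items()}

for epoch, gap in gaps.items():
    print(f"Epoch {epoch:>2}: train - val = {gap:+.4f}")
```

The gap starts out slightly negative, then widens steadily to about eight percentage points by epoch 50, even though validation accuracy itself barely moves after epoch 10.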

Let’s take it up a notch and confirm overfitting in our model before we wrap this up.

Time to introduce F1 to this mix

Not that F1…!

In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. The F1 score is the harmonic mean of precision and recall.

So how does it fit into our little lab here? A significant difference in F1 scores between training and testing can indicate overfitting.

Now, a bit about precision and recall.

Precision is defined as the number of true positive predictions divided by the total number of positive predictions.

Recall is the number of true positive predictions divided by the total number of actual positives in the data.
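To make these definitions concrete, here is a small sketch that computes precision, recall and F1 from raw counts. The counts (8 true positives, 2 false positives, 4 false negatives) are made up for the example, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by being strong on only one of them.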

And now that we know what we’re talking about, let’s get this newfound knowledge to work.

The Plan:

We calculate the F1 values for our model for the training and test datasets. Once we have those, we can compare these values and determine if overfitting has occurred.
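Since Fashion-MNIST has 10 classes rather than two, the code in this lab asks scikit-learn for a macro average (average='macro'): F1 is computed per class and the per-class scores are averaged with equal weight. Here is a pure-Python sketch of that idea on a made-up three-class example. The function name and the toy labels are assumptions for illustration, and this is not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 written in terms of counts: 2TP / (2TP + FP + FN).
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: per-class F1 is 0.5, 0.8 and 2/3.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(toy_true, toy_pred), 4))  # 0.6556
```

Macro averaging treats every class the same regardless of how many examples it has, which suits Fashion-MNIST since its classes are balanced.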

Run this code for the test dataset:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = test_labels
y_pred = model.predict(test_images).argmax(axis=1)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(f"Precision: {precision}\nRecall: {recall}\nF1 Score: {f1}")

Here are the results:

And now run this code for the training dataset:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

train_predictions = model.predict(train_images)
train_pred_labels = np.argmax(train_predictions, axis=1)

train_precision = precision_score(train_labels, train_pred_labels, average='macro')
train_recall = recall_score(train_labels, train_pred_labels, average='macro')
train_f1 = f1_score(train_labels, train_pred_labels, average='macro')

print(f"Training Precision: {train_precision}")
print(f"Training Recall: {train_recall}")
print(f"Training F1 Score: {train_f1}")

Here are the results for the training dataset:

Comparing the test and training metrics, we can see that there is a significant difference! The F1 score for the training dataset is quite high (98%), whereas the score for the test dataset is comparatively low (90%).

Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the extent that it negatively impacts the model’s ability to generalize to unseen data.

The results show that the score drops considerably from the training dataset to the test dataset, clearly indicating overfitting.

In addition, the high precision score on the training data suggests that the model has fit the training dataset almost perfectly. The gap in performance also shows that the model’s ability to generalize from the training data to unseen test data is compromised, a classic symptom of overfitting.


In this lab, we’ve delved into key aspects of evaluating and addressing model performance issues in machine learning, with a particular focus on overfitting. We’ve learned that precision, recall, and the F1 score are crucial metrics for assessing a model’s ability to generalize beyond its training data. Through comparing these metrics between training and testing datasets, we identified signs of overfitting — a condition where a model learns the training data too well, to the detriment of its performance on new, unseen data.

In conclusion, achieving a well-balanced model that generalizes well to new data involves careful monitoring of performance metrics, recognizing signs of overfitting, and applying appropriate techniques to mitigate it. By fostering a deeper understanding of these concepts and strategies, we can develop more reliable and effective machine learning models that perform well across a variety of tasks and datasets.

AWS: Machine Learning and AI

Images created by DALL-E — Code execution on Google Colab Research


