Enhancing Financial Security: Credit Card Fraud Detection With Random Forest Classifier
Machine learning has revolutionized industries across the world. From manufacturing to retail to healthcare, it has improved productivity and efficiency, optimized business costs, and even facilitated better decision-making.
As technology makes online transactions more accessible and seamless, it also opens the door for potential fraudsters seeking financial gain at others' expense. It is therefore important to build systems that protect users from people with malicious intent.
In this blog post, you will discover how to build a credit card fraud detection model. You will learn how an organization leverages its available data to detect and prevent fraudulent transactions on its platform.
About the dataset
The dataset was sourced from Kaggle, a repository for datasets across various topics and domains. It contains credit card transactions made by European cardholders in 2023 and comprises over 550,000 records, anonymized to protect the cardholders' identities.
Import necessary libraries
# Import necessary libraries
import pandas as pd #data manipulation and analysis
import numpy as np #numerical computations
import matplotlib.pyplot as plt #data visualization
import seaborn as sns #data visualization
from sklearn.model_selection import train_test_split #splitting data into training and testing sets
from sklearn.ensemble import RandomForestClassifier #for implementing Random Forest algorithm
from sklearn.metrics import confusion_matrix, classification_report #model evaluation
- The code imports the essential libraries for data manipulation, visualization, model building, and model evaluation.
Load the dataset
df = pd.read_csv('creditcard_2023.csv')
# check the first 5 records
df.head()
- The code reads a CSV file named `creditcard_2023.csv` into a Pandas DataFrame (`df`) and displays the first 5 records for a quick preview of the dataset's structure and content.
Generate descriptive statistics
# Generate summary statistics
df.describe()
- The describe() method provides summary statistics for the numerical features, such as the count, mean, standard deviation, minimum, maximum, and various percentiles.
Check for Null values
df.isnull().sum()
- The code checks for missing (null) values in the DataFrame (`df`) and returns the total count of missing values for each column.
Exploratory Data Analysis
To understand the credit card fraud dataset, you will analyze each feature with different graphical representations: a bar chart for the categorical target and histograms and boxplots for the numerical features.
Target/Dependent feature
df['Class'].value_counts()
# bar chart
plt.figure(figsize = (4, 4))
slices = df['Class'].value_counts().sort_index() # counts for Class 0, then Class 1
labels = ['Not Fraud', 'Fraud'] # Class 0 = legitimate, Class 1 = fraud
plt.bar(x = slices.index, height = slices.values, tick_label = labels)
plt.title('The Bar Chart of the Fraud Class')
plt.ylabel('Count')
plt.xlabel('Class')
plt.tight_layout()
plt.show()
This code counts the occurrences of each class and visualizes the distribution of fraud and non-fraud cases in the dataset. This helps to check whether the classes are balanced.
There are equal numbers of fraudulent and non-fraudulent transactions in the dataset: 284,315 of each class.
Numerical feature histogram distribution
for feature in df.columns[df.dtypes == 'float64']:
    plt.figure(figsize=(12, 5))
    df[feature].hist(bins=50) # plot the histogram on the current figure
    plt.title(f'{feature} histogram distribution')
    plt.ylabel('Frequency')
    plt.xlabel(feature)
    plt.show()
This code generates a histogram for every numerical (float64) feature in the dataset, which helps you understand how each feature is distributed.
Outlier detection
for feature in df.columns[df.dtypes == 'float64']:
    plt.figure(figsize=(12, 5))
    sns.boxplot(x = 'Class', y = feature, data = df)
    plt.title(f'The boxplot distribution of {feature} vs target variable')
    plt.show()
This code creates a boxplot of each numerical feature against the target variable (`Class`). This helps to identify data points that differ significantly from other observations within each class.
Since all the features in the dataset are already numeric, you can skip the feature engineering steps typically used for non-numeric data, such as encoding categorical variables or converting text to numeric representations.
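As a quick sanity check, you can confirm that every column really is numeric before skipping those steps. Here is a minimal sketch, assuming `df` is loaded as above:
# Count the columns by dtype; for this dataset everything should be numeric
print(df.dtypes.value_counts())
# Any non-numeric columns would show up here (expected to be empty)
non_numeric = df.select_dtypes(exclude='number').columns
print(list(non_numeric))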
Model Building
For tree-based models such as Decision Trees, Random Forest, Gradient Boosting, Extreme Gradient Boosting, LightGBM, and CatBoost, it is not necessary to perform the following data preprocessing steps:
1. Feature Selection
This is the process of selecting a subset of the available features in a dataset for modeling. Tree-based models inherently rank features by importance: features that are less informative or relevant receive lower importance scores and are less likely to be used at the nodes of the trees (see the sketch after this list).
2. Feature Scaling
This process could be either Normalization (Min-Max scaling) or Standardization (Z-score normalization). Scaling brings all features into a comparable range, usually between 0 and 1 or -1 and 1, so that a feature with a larger magnitude does not gain undue advantage over one with a smaller magnitude. Tree-based models split on feature value thresholds, so their decisions are unaffected by the scale of the features.
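As an illustration of point 1, a fitted Random Forest exposes the importance it assigned to each feature. This is a minimal sketch, assuming the `clf` model and `X` predictors defined in the sections below:
# Rank the features by the importance the fitted forest assigned them,
# assuming `clf` has been fitted on `X_train` as shown later in this post
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10)) # the 10 most informative features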
Train Test Split
# target and predictors
X = df.drop(['Class', 'id'], axis = 1)
y = df['Class']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- This code separates the target variable from the predictors, then reserves 80% of the data for training and the remaining 20% for testing the accuracy of the trained model.
Train a Random Forest Classifier
# instantiate a Random Forest model
clf = RandomForestClassifier(random_state=42)
# fit the model
clf.fit(X_train, y_train)
# check accuracy score
clf.score(X_test, y_test)
This code instantiates a RandomForestClassifier with the random state set to 42 to ensure reproducibility, then fits the model on the training predictors and target variable.
It also evaluates the model's accuracy on the test data (`X_test` and `y_test`), returning the proportion of correctly classified instances.
Model Evaluation
# predict test data
y_pred = clf.predict(X_test)
# confusion matrix
print(confusion_matrix(y_test, y_pred))
# plot confusion matrix in a heatmap
plt.figure(figsize = (8, 5))
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot = True, fmt = 'd', linewidths = 3)
plt.show()
# classification report
print(classification_report(y_test, y_pred))
After training the model, you need to ascertain the model's accuracy by making predictions on the test data. The model is evaluated by:
- Predicting the class labels for the test dataset.
- Computing and plotting the confusion matrix, which shows the model's performance in terms of true positives, true negatives, false positives, and false negatives (see the sketch below).
- Checking a detailed classification report, which includes precision, recall, and F1-score for each class.
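To make the link between the confusion matrix and the classification report concrete, here is a minimal sketch that derives precision and recall for the fraud class directly from the matrix entries, assuming `y_test` and `y_pred` from the code above:
# Unpack the binary confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp) # of the transactions flagged as fraud, the share that truly were
recall = tp / (tp + fn)    # of the actual fraud cases, the share the model caught
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}')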
Conclusion
This article has provided a step-by-step guide to building a credit card fraud detection system.
Using the Random Forest Classifier, we have demonstrated how to enhance financial security with machine learning. We used Exploratory Data Analysis to understand the data distribution and identify outliers, and we skipped the feature engineering steps meant for non-numeric data since all the features were already numeric.
We then advanced to the model-building phase with the Random Forest Classifier, which intrinsically ranks features by importance and is robust to differences in feature scale.
By evaluating the model’s performance using a confusion matrix and classification report, we confirmed its accuracy in detecting fraudulent transactions.