Understanding and Preventing Data Leakage in Machine Learning

Imagine a student named Bauer who took an Algebra class with his classmates. Bauer paid attention during the lessons but did not fully understand the underlying principles of Algebra. Two weeks later, the teacher gave the class a test. Fortunately for Bauer, he had managed to find out in advance what the test questions would be. With this foreknowledge, he memorized the steps and answers. As a result, he did brilliantly on the test and earned the praise of his teacher and classmates.

Because of his excellent performance on the test, Bauer was chosen to represent his school in a community Algebra competition. He arrived at the competition and discovered that the questions were completely different from the ones he had memorized for the test. Without a solid understanding of Algebra's underlying principles, Bauer struggled with the new and unfamiliar questions. He performed poorly to the surprise and disappointment of his teacher and classmates.

This is essentially what data leakage is, and what it does to a Machine Learning (ML) model once that model is pushed into production.

In this blog post, you will learn:

  • what data leakage is

  • types of data leakage

  • signs of data leakage

  • its impact on ML models, and

  • how to prevent it.

What is Data Leakage?

Data leakage occurs when the data used to train an ML algorithm contains information about what you are trying to predict that would not be available at prediction time. The model therefore already "knows" something about the test data or its target variable before it is ever evaluated.

Just as Bauer had prior access to the test questions, data leakage occurs when the model gets information about the test data during training that the model would not have in a real-world scenario.

Data leakage leads to deceptively high accuracy on the training and test data, but results in poor performance when the model is deployed into the production environment.

Data leakage is a subtle problem in machine learning that must be avoided to create a robust and generalized predictive model.

Types of Data Leakage

  1. Target leakage: In ML, the features can be divided into Predictors (independent variables) and the Target (dependent variable). The predictors are the inputs used to predict the target, while the target is the value the model is trained to output. Target leakage occurs when the target feature is included among the predictors used in training the model. The model then predicts using target information that will not be available during deployment. It is important to correctly separate the predictors from the target feature before training the model.

  2. Train-Test Contamination: To build a robust and generalized model, it is necessary to test its performance on previously unseen data. Data leakage occurs when preprocessing techniques like normalization, scaling, and missing value imputation are fitted on the entire dataset before it is split. This happens because the parameters of those techniques (for example, the mean and standard deviation used for scaling) are then computed partly from the test data, so information about the test data reaches the model.

    Therefore, the whole dataset should be split into the train (to train the model) and test (to validate the model) datasets before any preprocessing is applied. Every preprocessing technique should learn its parameters from the training data only, and those same parameters should then be applied to the test data to avoid train-test contamination.

  3. Information Leakage: This happens when a feature that provides excessive information about the target variable is included among the predictors. This feature could be a near-direct proxy for the target feature or contain information obtained or created after the target feature is determined. It is closely related to Target leakage, except that the leaking feature is not the Target feature itself.
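
To make train-test contamination concrete, here is a minimal sketch using only the Python standard library. The feature values and the helper names `fit_scaler` and `transform` are made up for illustration; they are not from any particular library. The sketch compares scaling parameters learned from the whole dataset (contaminated) with parameters learned from the training rows only (clean).

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values only."""
    return mean(values), pstdev(values)

def transform(values, params):
    """Apply previously learned parameters to any split."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values; the last two rows play the role of the test set.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 50.0, 60.0]
train, test = feature[:6], feature[6:]

# Contaminated: the test values influence the mean and standard deviation.
leaky_params = fit_scaler(feature)

# Clean: parameters are learned from the training rows only...
clean_params = fit_scaler(train)

# ...and the same training parameters are reused on both splits.
train_scaled = transform(train, clean_params)
test_scaled = transform(test, clean_params)

print("mean learned from all rows:  ", leaky_params[0])
print("mean learned from train rows:", clean_params[0])
```

The two means differ because the first one has already "seen" the test rows. Libraries such as scikit-learn package this fit-on-train, apply-to-test pattern in their transformer and `Pipeline` classes, which is usually a safer choice than hand-written helpers like the ones above.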

Implications of Data Leakage in Machine Learning

  1. Model overfitting: The model performs extremely well on the test dataset because it was exposed to information from the test dataset during the training phase. Instead of learning general patterns, the model memorizes the pattern in the leaked information specific to the test data. This overfitting creates a false belief that the model is good because it outputs correct labels.

  2. Reduced Model Performance: When the model is deployed to production and run on new, unseen data, its performance drops. This happens because the model did not learn the underlying patterns and relationships in the data and therefore cannot generalize well to unseen data.

  3. Increased costs: Training an ML model requires time, data collection, computational power, and human resources. When the ML model fails to generalize and perform well in production due to data leakage, it has to be retrained or replaced. This incurs higher costs as additional resources are required to retrain or replace the failed model. Data leakage can also lead to a loss of revenue if the deployed model keeps producing wrong decisions.

  4. Loss of trust: When the ML model makes incorrect decisions in production due to data leakage, it can be detrimental to the organization that relies on the model. This is especially critical in industries such as healthcare, where ML model accuracy has a significant impact. Users and stakeholders can come to distrust the model as a result.

Signs of Data Leakage

Data leakage is one of the subtle errors that can occur when creating an ML model. An inflated model performance and suspiciously low error rates are classic signs that should raise suspicion of data leakage.

  • Inflated model performance: When you see unusually high performance, be skeptical about your model result and check again for possible data leakage. A too-good-to-be-true result should raise suspicion rather than excitement.

  • Suspiciously low error rate: A low error rate goes hand in hand with inflated model performance. An unusually low error may indicate that the model has seen some information about the test data during training.
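
When results look too good to be true, one simple follow-up check is to look for individual features that almost perfectly track the target. The sketch below is one possible way to do that with plain Python. The data, the feature names, and the 0.95 threshold are all hypothetical, and a high correlation is only a prompt for investigation, not proof of leakage.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def flag_suspicious_features(features, target, threshold=0.95):
    """Return names of features whose absolute correlation with the target
    meets or exceeds the (assumed) threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Hypothetical data: "refund_issued" is recorded after the outcome is known,
# so it mirrors the target exactly.
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "account_age_months": [12, 3, 13, 30, 7, 9, 15, 22],
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}

suspicious = flag_suspicious_features(features, target)
print(suspicious)
```

A flagged feature is not automatically a leak. The question to ask is whether its value would really be known at the moment a prediction has to be made.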

Ways to Prevent Data Leakage

  1. Split data before data preprocessing: To avoid data leakage due to train-test contamination, you have to split the entire dataset before preprocessing your data. Preprocessing techniques like normalization, scaling, missing value imputation, grouping, etc., should be fitted on the training data only and then applied, with the same learned parameters, to the test data.

  2. Understand your data: This is where domain knowledge is important. You should thoroughly understand your data, its source, and how it was collected and combined. This will help you identify and remove from the training data any features that are generated after the outcome you are predicting is known.

  3. Exploratory Data Analysis (EDA): Conducting a thorough EDA will reveal features that are suspiciously highly correlated with the target feature. Such features should be investigated and, if they would not be available at prediction time, dropped from the training dataset to prevent information leakage.

  4. Have a fellow data scientist take a look: Share your code with a fellow data scientist for review. A fresh pair of eyes can reveal where data leakage may have occurred in the ML pipeline.

  5. Time-based splitting: When working with time series data, the data should never be split randomly, but according to its time sequence. Every observation in the test data must come later in time than the observations in the training data, so that the model is never trained on the future.
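
The time-based split from the last item can be sketched as follows. This is a minimal illustration under assumed names: `chronological_split`, the `day` and `sales` fields, and the 25% test fraction are all hypothetical choices, not a prescribed method.

```python
from datetime import date

def chronological_split(rows, time_key, test_fraction=0.25):
    """Sort rows by time and hold out the most recent fraction as the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_test = max(1, int(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]

# Hypothetical daily records, deliberately stored out of order.
rows = [
    {"day": date(2024, 1, 5), "sales": 130},
    {"day": date(2024, 1, 1), "sales": 100},
    {"day": date(2024, 1, 8), "sales": 170},
    {"day": date(2024, 1, 3), "sales": 115},
    {"day": date(2024, 1, 2), "sales": 105},
    {"day": date(2024, 1, 7), "sales": 160},
    {"day": date(2024, 1, 4), "sales": 120},
    {"day": date(2024, 1, 6), "sales": 140},
]

train_rows, test_rows = chronological_split(rows, time_key="day")

# Every training row is strictly earlier than every test row.
latest_train_day = max(row["day"] for row in train_rows)
earliest_test_day = min(row["day"] for row in test_rows)
print(latest_train_day, "<", earliest_test_day)
```

Any preprocessing would then be fitted on `train_rows` only and applied to `test_rows`, in line with the first item in the list above.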

Conclusion

Data leakage is a common issue in ML that even experienced researchers run into. Understanding it and equipping yourself with strategies to prevent it is crucial. A good model is largely dependent on the quality of the data.

Data leakage should be guarded against while preparing the data and improving its quality. Preventing it requires careful handling of the feature engineering process. The entire dataset should be split before preprocessing, and features that are suspiciously highly correlated with the target should be identified through exploratory data analysis and removed if they would not be available at prediction time.

Collaborating with peers for code review and using established coding libraries are important practices for identifying and preventing data leakage.