Data Cleaning
Data cleaning includes processes such as filling in missing values and handling inconsistencies. It detects corrupt data and replaces or modifies it.
Missing Values
Understanding missing values is essential if you want to manage and interpret data successfully.
As you can see in the table below, the data belongs to a bank: each row is a separate customer, and each column contains details about that customer, such as age and credit amount.
Some cells either contain NA or are simply empty. This is missing data.
Every piece of information about a customer is crucial for the bank; if any of it is missing, the bank will find it difficult to predict the risk of providing a loan to that customer.
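To make this concrete, here is a minimal sketch of such a table built with pandas; the column names and values are invented for illustration and are not the bank's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table: each row is a customer, each column a detail
# such as age and credit amount. NaN/None mark the missing cells.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [34, np.nan, 52, 41],
    "credit_amount": [12000, 5000, np.nan, 8000],
    "job": ["skilled", "unskilled", None, "skilled"],
})

print(customers)
```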
Handling Missing Data
Handling missing data intelligently results in a more robust model capable of handling complex tasks. There are many ways to handle missing data.
Removing the Data
Checking for missing values is the first and most important step in data pre-processing, because most models cannot accept data with missing values.
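As a quick sketch, pandas can count the missing values per column before you decide what to remove; the file path below is an assumption and should point to wherever the dataset lives on your machine.

```python
import pandas as pd

# Load the dataset; the path is an assumption -- adjust it to your setup.
df = pd.read_csv("Banking_Marketing.csv")

# Number of missing values in each column.
print(df.isna().sum())

# Percentage of missing values in each column, useful for threshold decisions.
print(df.isna().mean() * 100)
```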
This is a very simple and commonly used method for handling missing values: we delete a row if it contains missing values, or we delete a column if more than 70%-75% of its data is missing.
Again, the threshold value is not fixed; it depends on how much data you are willing to lose.
The benefit of this approach is that it is quick and easy, and in many cases no data is better than bad data. The drawback is that you may end up losing important information: deleting a row discards all of its other values, and deleting a column removes an entire feature.
Here is an example of how we remove missing data. You will need the Banking_Marketing.csv dataset to try out this example.
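The following is a minimal sketch of both removal strategies, assuming Banking_Marketing.csv sits in the working directory; the 70% threshold follows the rule of thumb above.

```python
import pandas as pd

df = pd.read_csv("Banking_Marketing.csv")  # path is an assumption

# Strategy 1: drop every row that contains at least one missing value.
df_rows_dropped = df.dropna(axis=0)

# Strategy 2: drop any column where more than 70% of the values are missing.
missing_fraction = df.isna().mean()
cols_to_drop = missing_fraction[missing_fraction > 0.70].index
df_cols_dropped = df.drop(columns=cols_to_drop)

print(df.shape, df_rows_dropped.shape, df_cols_dropped.shape)
```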
Mean/Median/Mode Imputation
In the case of numerical data, we can compute its mean or median and use the result to replace missing values.
In the case of categorical (non-numerical) data, we can compute its mode and use that to replace the missing values. This is known as imputation.
The benefit of using imputation, rather than just removing data, is that it prevents data loss.
The drawback is that you don't know how accurate using the mean, median, or mode is going to be in a given situation.
Here is an example of how we impute missing data. You will need the Banking_Marketing.csv dataset to try out this example.
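Here is a minimal sketch of imputation with pandas, again assuming Banking_Marketing.csv is available; the age and job column names are placeholders and may differ in the actual file.

```python
import pandas as pd

df = pd.read_csv("Banking_Marketing.csv")  # path is an assumption

# Numerical column: replace missing values with the mean (or the median).
df["age"] = df["age"].fillna(df["age"].mean())
# df["age"] = df["age"].fillna(df["age"].median())  # median alternative

# Categorical column: replace missing values with the mode (most frequent value).
df["job"] = df["job"].fillna(df["job"].mode()[0])

# Confirm that the imputed columns no longer contain missing values.
print(df[["age", "job"]].isna().sum())
```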
Outliers
Outliers are values that are very large or very small with respect to the distribution of the other data.
We can only find outliers in numerical data. Box plots are one good way to find the outliers in a dataset, as you can see in the following figure.
Consider a sample dataset of temperatures from a place at different times:
71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69
First, we'll sort the data: 60, 69, 70, 70, 70, 70, 71, 71, 72, 72, 90, 320
Next, we'll calculate the median (Q2). The median is the middle value of the sorted data. With 12 values, the middle terms are the 6th and 7th values, which are 70 and 71.
The median is (70 + 71) / 2 = 70.5
Then we'll calculate the lower quartile (Q1). Q1 is the middle value (median) of the first half of the dataset.
First half of the data = 60, 69, 70, 70, 70, 70
Points 3 and 4 of the bottom 6 are both equal to 70.
The average is (70+70)/2 = 70
Q1=70
Then we calculate the upper quartile (Q3). Q3 is the middle value (median) of the second half of the dataset.
Second half of the data = 71, 71, 72, 72, 90, 320
Points 3 and 4 of the upper 6 are 72 and 72.
The average is (72+72)/2=72
Q3=72
Then we find the interquartile range (IQR).
IQR = Q3 - Q1 = 72 - 70
IQR = 2
Next, we find the upper and lower fences.
Lower fence = Q1 - 1.5(IQR) = 70 - 1.5(2) = 67
Upper fence = Q3 + 1.5(IQR) = 72 + 1.5(2) = 75
Boundaries of our fences = 67 and 75
Any data point lower than the lower fence or greater than the upper fence is an outlier.
Thus, the outliers in this example are 60, 90, and 320.
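As a sketch, the same arithmetic can be reproduced in Python; note that NumPy's default percentile interpolation happens to give the same quartiles as the median-of-halves method used above for this particular data.

```python
import numpy as np

temps = np.array([71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69])

q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are outliers.
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]
print(q1, q3, iqr)               # 70.0 72.0 2.0
print(lower_fence, upper_fence)  # 67.0 75.0
print(outliers)                  # [ 90  60 320]
```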
Here is an example of how we find and remove outliers from a dataset. You will need the german_credit_data.csv dataset to try out this example.
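Below is a minimal sketch of IQR-based outlier removal with pandas, assuming german_credit_data.csv is in the working directory; the "Credit amount" column name is a placeholder and may differ in the actual file.

```python
import pandas as pd

df = pd.read_csv("german_credit_data.csv")  # path is an assumption

col = "Credit amount"  # placeholder numerical column
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the rows whose value falls inside the fences.
df_no_outliers = df[(df[col] >= lower_fence) & (df[col] <= upper_fence)]
print(df.shape, df_no_outliers.shape)
```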