Data Integration


We combine data from different sources to get a unified structure with more meaningful and valuable information. 

This step is mostly needed when the data is segregated across different sources. 

To keep things simple, let's assume we have data in CSV format in different places, all describing the same scenario. Say we have some data about an employee in a database. 

We can't expect all the data about the employee to reside in the same table. It's possible that the employee's personal data will be located in one table, the employee's project history will be in a second table, the employee's time-in and time-out details will be in another table, and so on. 

So, if we want to do some analysis about the employee, we need to get all the employee data in one common place. This process of bringing data together in one place is called data integration. 

To do data integration, we can merge multiple pandas DataFrames using the merge() function. 


Here is an example of how we merge data from multiple datasets. You would need the student.csv and mark.csv datasets to try out this example.


Open the Colab Notebook
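
If you want a feel for what the notebook does, here is a minimal sketch of the merge. The shared key column name, Student_id, is an assumption for illustration; use whichever column the two files actually have in common.

import pandas as pd

# Load the two fragments of the data.
student = pd.read_csv('student.csv')
mark = pd.read_csv('mark.csv')

# Merge the two DataFrames on their shared key column.
# 'Student_id' is an assumed column name; replace it with the
# key column your files actually share.
combined = pd.merge(student, mark, on='Student_id')

print(combined.head())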



Data Transformation


Previously, we saw how we can combine data from different sources into a unified DataFrame. 

Now, we have a lot of columns that have different types of data. 

Our goal is to transform the data into a machine-learning-digestible format. 

All machine learning algorithms are based on mathematics. 

So, we need to convert all the columns into numerical format. Before that, let's see all the different types of data we have. 


Taking a broader perspective, data is classified into numerical and categorical data: 

Numerical: As the name suggests, this is numeric data that is quantifiable. 

Categorical: Non-numeric data, such as strings, that is qualitative in nature. 


Numerical data is further divided into the following: 


Discrete: To explain in simple terms, any numerical data that is countable is called discrete. 

For example: the number of people in a family or the number of students in a class. 

Discrete data can only take certain values (such as 1, 2, 3, 4, and so on). 


Continuous: Any numerical data that is measurable is called continuous. 

For example: the height of a person or the time taken to reach a destination. 

Continuous data can take virtually any value (for example, 1.25, 3.8888, and 77.1276). 



Categorical data is further divided into the following: 


Ordered: Any categorical data that has some order associated with it is called ordered categorical data. 

For example: movie ratings (excellent, good, bad, worst) and feedback (happy, not bad, bad). 

You can think of ordered data as being something you could mark on a scale. 



Nominal: Any categorical data that has no order is called nominal categorical data. 

Examples include gender and country. 



Handling Categorical Data


Some algorithms, such as decision trees, can work well with categorical data, but most machine learning algorithms cannot operate on it directly: they require both the input and the output to be in numerical form. If the output to be predicted is categorical, we convert it back from numerical to categorical form after prediction. 


Let's discuss some key challenges that we face while dealing with categorical data: 


High cardinality: Cardinality means uniqueness in data. The data column, in this case, will have a lot of different values. A good example is User ID – in a table of 500 different users, the User ID column would have 500 unique values. 

Rare occurrences: These data columns might have variables that occur very rarely and therefore would not be significant enough to have an impact on the model. 

Frequent occurrences: A single category might occur so often that the column has very little variance, and so it fails to make an impact on the model. 

Won’t fit: Left unprocessed, categorical data simply won’t fit into a model that expects numerical input. 


Encoding

To address the problems associated with categorical data, we can use encoding. This is the process by which we convert a categorical variable into a numerical form.



Replacing

This is a technique in which we replace the categorical data with a number. It is a simple replacement and does not involve much logical processing.


Here is an example of how we replace categorical data with a number. You would need the student.csv dataset to try out this example.


Open the Colab Notebook
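
As a quick sketch of the idea, assuming student.csv has a categorical Grade column (the column name and its values are assumptions for illustration), the replacement can be done with pandas' replace():

import pandas as pd

df = pd.read_csv('student.csv')

# Replace each category with a number of our choosing.
# 'Grade' and its values are assumed for illustration; use a
# categorical column that actually exists in your file.
df['Grade'] = df['Grade'].replace({'1st Class': 1, '2nd Class': 2, '3rd Class': 3})

print(df.head())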



Label Encoding


This is a technique in which we replace each value in a categorical column with numbers from 0 to N-1.

For example, say we've got a list of employee names in a column. After performing label encoding, each employee name will be assigned a numeric label. 

But this might not be suitable for all cases because the model might consider numeric values to be weights assigned to the data. 

Label encoding is the best method to use for ordinal data. The scikit-learn library provides LabelEncoder(), which helps with label encoding. 


Here is an example of how to convert categorical data to numerical data using label encoding. You would need the Banking_Marketing.csv dataset to try out this example.


Open the Colab Notebook
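
Here is a minimal sketch with scikit-learn's LabelEncoder. The column name 'education' is an assumption for illustration; pick any categorical column present in the file.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('Banking_Marketing.csv')

# Fit the encoder and replace each category with an integer
# label from 0 to N-1. 'education' is an assumed column name.
encoder = LabelEncoder()
df['education'] = encoder.fit_transform(df['education'])

# classes_ records which original category each label stands for.
print(dict(enumerate(encoder.classes_)))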



One Hot Encoding


In label encoding, categorical data is converted to numerical data, and the values are assigned labels (such as 1, 2, and 3). 

Predictive models that use this numerical data for analysis might sometimes mistake these labels for some kind of order (for example, a model might think that a label of 3 is "better" than a label of 1, which is incorrect). 

In order to avoid this confusion, we can use one-hot encoding. Here, the label-encoded data is further split into n columns, where n denotes the total number of unique labels generated while performing label encoding. 

For example, say that three labels are generated through label encoding. Then, one-hot encoding will split the column into three parts, so the value of n is 3. 


Open the Colab Notebook
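
As a sketch on invented data, pandas' get_dummies() performs one-hot encoding directly, without a separate label-encoding step:

import pandas as pd

# A toy column with three unique labels, so n = 3.
df = pd.DataFrame({'rating': ['good', 'bad', 'excellent', 'good']})

# get_dummies() creates one 0/1 column per unique label.
one_hot = pd.get_dummies(df['rating'])

print(one_hot)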



Data in Different Scales 


In real life, values in a dataset might have a variety of different magnitudes, ranges, or scales. Algorithms that use distance as a parameter may not weigh features on different scales in the same way. 

There are various data transformation techniques that are used to transform the features of our data so that they use the same scale, magnitude, or range. This ensures that each feature has an appropriate effect on a model's predictions. 

Some features in our data might have high-magnitude values (for example, annual salary), while others might have relatively low values (for example, the number of years worked at a company). Just because some data has smaller values does not mean it is less significant. 

So, to make sure our predictions are not skewed by the different magnitudes of features in our data, we can perform feature scaling, standardization, or normalization (three similar ways of dealing with magnitude issues in data). 

Standard Scaling
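
Standard scaling (also called z-score standardization) rescales each feature to a mean of 0 and a standard deviation of 1. Here is a minimal sketch using scikit-learn's StandardScaler on invented data:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented data: a high-magnitude feature (annual salary) next to
# a low-magnitude one (years worked at the company).
X = np.array([[50000.0, 2.0],
              [64000.0, 5.0],
              [120000.0, 10.0]])

# Each column is rescaled to zero mean and unit variance.
scaled = StandardScaler().fit_transform(X)

print(scaled)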


Min Max Scaling
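
Min max scaling squeezes each feature into a fixed range, usually [0, 1], by computing (x - min) / (max - min) per column. Here is a sketch with scikit-learn's MinMaxScaler on the same kind of invented data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50000.0, 2.0],
              [64000.0, 5.0],
              [120000.0, 10.0]])

# Each column is mapped linearly onto [0, 1].
scaled = MinMaxScaler().fit_transform(X)

print(scaled)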


Discretization of Continuous Data
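
Discretization (binning) turns a continuous column into a small set of categories. Here is a minimal sketch using pandas' cut() on invented ages; the bin edges and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 58, 72])

# cut() assigns each continuous value to a labelled bin.
groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                labels=['child', 'young adult', 'adult', 'senior'])

print(groups)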