Roadmap for Building Machine Learning Models
Data Pre-processing
- This is the first step in building a machine learning model. Data pre-processing refers to the transformation of data before feeding it into the model.
- It deals with the techniques that are used to convert unusable raw data into clean, reliable data.
- Since data collection is rarely performed in a controlled manner, raw data often contains outliers (for example, age = 120), nonsensical data combinations (for example, model: bicycle, type: 4-wheeler), missing values, scale problems, and so on.
- Feeding such raw data directly into a machine learning model can compromise the quality of the results, which makes pre-processing one of the most important steps in the data science process.
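The problems listed above can be illustrated with a small pandas sketch. The column names, the values, and the age threshold of 100 are all assumptions made up for this example; the right cleaning rules depend on the dataset at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records; every column name and value is illustrative only.
raw = pd.DataFrame({
    "age": [25, 41, 120, np.nan, 33],
    "model": ["car", "bicycle", "car", "truck", "car"],
    "type": ["4-wheeler", "4-wheeler", "4-wheeler", "4-wheeler", "4-wheeler"],
    "income": [32000, 58000, 61000, 45000, np.nan],
})

# 1. Drop nonsensical combinations (a bicycle cannot be a 4-wheeler).
is_nonsense = (raw["model"] == "bicycle") & (raw["type"] == "4-wheeler")
clean = raw[~is_nonsense].copy()

# 2. Treat implausible ages as missing (the cutoff of 100 is an assumption).
clean.loc[clean["age"] > 100, "age"] = np.nan

# 3. Fill missing numeric values with the column median.
numeric_cols = ["age", "income"]
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# 4. Put the numeric columns on a common scale (zero mean, unit variance).
clean[numeric_cols] = (
    clean[numeric_cols] - clean[numeric_cols].mean()
) / clean[numeric_cols].std()

print(clean)
```

Median imputation and standardization are only one possible choice here; other strategies (dropping rows, min-max scaling) may suit a given dataset better.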
Model Learning
- After pre-processing the data and splitting it into train/test sets, we move on to modeling.
- Models are sets of well-defined methods, called algorithms, that use pre-processed data to learn patterns, which can later be used to make predictions.
- There are different types of learning algorithms, including supervised, semi-supervised, unsupervised, and reinforcement learning.
Model Evaluation
- In this stage, the models are evaluated with the help of specific performance metrics.
- With these metrics, we can go on to tune the hyperparameters of a model in order to improve it.
- This process is called hyperparameter optimization.
- We will repeat this step until we are satisfied with the performance.
Prediction
- Once we are happy with the results from the evaluation step, we will then move on to predictions.
- Predictions are made by the trained model when it is exposed to a new dataset.
- In a business setting, these predictions can be shared with decision makers to make effective business choices.
Model Deployment
- The machine learning process does not stop with model building and prediction.
- It also involves deploying the model, that is, using it to build an application that works with new data.
- Depending on the business requirements, the deployment may be a report, or it may be a set of repetitive data science steps that need to be executed.
- After deployment, a model needs proper management and maintenance at regular intervals to keep it up and running.
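The learning, evaluation, tuning, and prediction stages described above can be sketched with scikit-learn. This is a minimal illustration, not a prescribed workflow: the synthetic dataset, the choice of logistic regression, the accuracy metric, and the grid of `C` values are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for an already pre-processed dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model learning: fit a supervised algorithm on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: score the model with a performance metric.
baseline_accuracy = accuracy_score(y_test, model.predict(X_test))

# Hyperparameter optimization: cross-validated search over C,
# using only the training data.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

# Prediction: apply the tuned model to previously unseen rows.
new_data = X_test[:5]
predictions = search.predict(new_data)

print("baseline accuracy:", baseline_accuracy)
print("tuned accuracy:", tuned_accuracy)
print("best C:", search.best_params_["C"])
print("predictions:", predictions)
```

The search is run on the training set only, so the test set stays untouched until the final accuracy check.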
Data Representation
The main objective of machine learning is to build models that understand data and find underlying patterns. In order to do so, it is very important to feed the data in a way that is interpretable by the computer.
To feed the data into a model, it must be represented as a table or a matrix of the required dimensions.
Converting your data into the correct tabular form is one of the first steps before pre-processing can properly begin.
![](https://cds.santechz.com/userfiles/media/uploaded/lme9y837.png)
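As a rough sketch of what "represented as a table or a matrix" can look like in practice, the snippet below turns a few records into a pandas DataFrame and then into a NumPy matrix. The column names and values are hypothetical and serve only to show the shapes involved.

```python
import pandas as pd

# Hypothetical observations, one dictionary per row.
records = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age": 29},
    {"height_cm": 182.0, "weight_kg": 80.0, "age": 41},
    {"height_cm": 158.0, "weight_kg": 52.0, "age": 35},
]

# Tabular form: rows are observations, columns are features.
table = pd.DataFrame(records)

# Matrix form: a 2-D NumPy array with the same layout.
matrix = table.to_numpy()

print(table.shape)   # (rows, columns)
print(matrix.shape)
```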
Independent Variables
These are all the features in the DataFrame except the target variable. Together they have size (m, n), where m is the number of observations and n is the number of features. Many algorithms work best when these variables are approximately normally distributed, and ideally they should NOT contain:
- Missing or NULL values
- Categorical features with high cardinality (too many distinct values)
- Outliers
- Data on different scales
- Human error
- Multicollinearity (independent variables that are correlated)
- Very large independent feature sets (too many independent variables to be manageable)
- Sparse data
- Special characters
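Several of the issues in this list can be checked with a few pandas calls. The sketch below uses a made-up feature table (all names and values are assumptions) and only reports potential problems; deciding how to fix them is a separate step.

```python
import numpy as np
import pandas as pd

# Hypothetical independent variables; names and values are illustrative only.
features = pd.DataFrame({
    "engine_cc": [1200.0, 1500.0, np.nan, 2000.0, 1800.0],
    "engine_litres": [1.2, 1.5, 1.6, 2.0, 1.8],
    "price_band": ["low", "mid", "mid", "high", "high"],
    "owner_id": ["a1", "b2", "c3", "d4", "e5"],
})

# Missing or NULL values: count them per column.
missing_per_column = features.isna().sum()

# High cardinality: number of distinct values in each non-numeric column.
cardinality = features.select_dtypes(exclude="number").nunique()

# Multicollinearity: pairwise correlation between numeric columns.
numeric = features.select_dtypes(include="number")
correlation = numeric.corr()

# Data on different scales: compare the ranges of the numeric columns.
scale_summary = numeric.describe().loc[["min", "max"]]

print(missing_per_column)
print(cardinality)
print(correlation)
print(scale_summary)
```

Here `owner_id` has as many distinct values as there are rows, and `engine_cc` and `engine_litres` carry nearly the same information, so both would be candidates for removal or re-encoding.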
Feature Matrix and Target Vector
A single piece of data is called a scalar. A group of scalars is called a vector, and a group of vectors is called a matrix.
A matrix is represented in rows and columns.
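In NumPy terms, the three concepts differ only in their number of dimensions. The values below are arbitrary and chosen purely for illustration.

```python
import numpy as np

scalar = np.float64(3.5)                  # a single value
vector = np.array([3.5, 1.0, 2.0])        # a group of scalars
matrix = np.array([[3.5, 1.0, 2.0],
                   [0.5, 4.0, 6.0]])      # a group of vectors (rows)

print(np.ndim(scalar), vector.ndim, matrix.ndim)  # 0 1 2
print(matrix.shape)                               # (2, 3): 2 rows, 3 columns
```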
The feature matrix is made up of the independent columns, and the target vector is the column that depends on them. To get a better understanding of this, let's look at the following table:
As you can see in the table, there are various columns: Car Model, Car Capacity, Car Brand, and Car Price.
All columns except Car Price are independent variables and together form the feature matrix. Car Price is the dependent variable: it depends on the other columns (Car Model, Car Capacity, and Car Brand), which makes it the target vector.
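The split into a feature matrix and a target vector can be sketched in pandas as follows. The column names come from the text, but the rows are hypothetical placeholder values invented for this example and do not reproduce the original table.

```python
import pandas as pd

# Placeholder rows; only the column names are taken from the text.
cars = pd.DataFrame({
    "Car Model": ["Model A", "Model B", "Model C"],
    "Car Capacity": [4, 5, 7],
    "Car Brand": ["Brand X", "Brand Y", "Brand X"],
    "Car Price": [15000, 22000, 31000],
})

# Feature matrix: every column except the target.
X = cars.drop(columns=["Car Price"])

# Target vector: the dependent variable.
y = cars["Car Price"]

print(X.shape)  # (m, n): m observations, n features
print(y.shape)  # (m,)
```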