Supervised Learning
Supervised learning is broadly split into two categories:
- Classification mainly deals with categorical target variables. A classification algorithm predicts which group or class a data point belongs to.
  - When the prediction is between two classes, it is known as binary classification. An example is predicting whether or not a customer will buy a product (in this case, the classes are yes and no).
  - If the prediction involves more than two target classes, it is known as multi-class classification; for example, predicting which of several products a customer will buy.
- Regression deals with numerical target variables. A regression algorithm predicts the numerical value of the target variable based on patterns learned from the training dataset.
  - Linear regression models the relationship between one or more predictor variables and one outcome variable. For example, linear regression could help to quantify the relative impacts of age, gender, and diet (the predictor variables) on height (the outcome variable).
Classification models are commonly evaluated using the following metrics:
- Confusion matrix
- Precision
- Recall
- Accuracy
- F1 score
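To make the linear regression item above concrete, here is a minimal sketch of simple linear regression with a single predictor, fitted by ordinary least squares in plain Python. The function name `fit_line` and the age and height figures are made up for illustration; in practice a library such as scikit-learn would normally do this work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: age in years (predictor) and height in cm (outcome).
ages = [2, 4, 6, 8, 10]
heights = [86, 100, 114, 128, 142]

slope, intercept = fit_line(ages, heights)
predicted_height = slope * 7 + intercept  # prediction for an unseen age

print(slope, intercept, predicted_height)  # 7.0 72.0 121.0
```

The slope is the numerical effect of the predictor on the outcome: in this toy data, each extra year of age adds 7 cm to the predicted height.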
Time Series Analysis
Time series analysis, as the name suggests, deals with data recorded over time, that is, data in chronological order.
Stock market prediction and customer churn prediction are two examples of tasks that rely on time series data.
Depending on the requirements, time series analysis can be either a regression or a classification task.
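As a rough illustration of how one chronological series can feed either kind of task, the sketch below builds lagged windows from a short list of values. The helper `make_windows` and the price figures are hypothetical; the regression target is the next value, and the classification target is whether the next value is higher than the last one observed.

```python
def make_windows(series, n_lags):
    """Turn a chronological series into (previous values, next value) pairs."""
    features, targets = [], []
    for i in range(n_lags, len(series)):
        features.append(series[i - n_lags:i])  # the n_lags values before i
        targets.append(series[i])              # the value to predict
    return features, targets


# Hypothetical daily closing prices, oldest first.
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 10.7]

X, y_regression = make_windows(prices, n_lags=2)

# Classification framing: 1 if the next price is above the latest known price.
y_classification = [int(target > window[-1]) for window, target in zip(X, y_regression)]

print(y_regression)      # [10.2, 10.8, 11.0, 10.7]
print(y_classification)  # [0, 1, 1, 0]
```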
Unsupervised Learning
Unlike supervised learning, the unsupervised learning process involves data that is neither classified nor labeled.
The algorithm analyzes the data without guidance. The job of the machine is to group unsorted information according to similarities in the data.
The aim is for the model to spot patterns in the data in order to give some insight into what the data is telling us and to make predictions.
An example is taking a large volume of unlabeled customer data and using it to find patterns that cluster customers into different groups.
Different products could then be marketed to the different groups for maximum profitability.
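The customer grouping idea can be sketched with a simplified one-dimensional k-means, written here in plain Python. The function `kmeans_1d`, the starting centers, and the spend figures are assumptions for illustration only; real projects usually work with several features and a library implementation.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Group numbers around the given starting centers (simplified k-means)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical annual spend per customer; no labels are provided.
spend = [12, 15, 14, 90, 95, 88, 13, 92]

centers, groups = kmeans_1d(spend, centers=[0, 100])

print(centers)  # [13.5, 91.25]
print(groups)   # [[12, 15, 14, 13], [90, 95, 88, 92]]
```

No labels were supplied, yet the low-spend and high-spend customers end up in separate groups purely because of how similar their values are.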
Unsupervised learning is broadly categorized into two types:
- Clustering: A clustering procedure groups similar data points together, which helps to discover the inherent patterns in the data.
- Association: Association rule learning finds relationships between items in large datasets, such as the observation that when someone buys product 1, they also tend to buy product 2.
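The "product 1, then product 2" rule can be quantified with two standard measures, support and confidence. The sketch below computes both in plain Python; the function names and the five baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated share of antecedent transactions that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"product 1", "product 2"},
    {"product 1", "product 2", "product 3"},
    {"product 1"},
    {"product 2", "product 3"},
    {"product 1", "product 2"},
]

rule_support = support(baskets, {"product 1", "product 2"})
rule_confidence = confidence(baskets, {"product 1"}, {"product 2"})

print(rule_support)     # 3 of 5 baskets contain both products
print(rule_confidence)  # about 0.75: 3 of the 4 baskets with product 1
```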
Reinforcement Learning
Reinforcement learning is a broad area of machine learning in which the machine learns which step to take next in an environment by looking at the results of actions it has already performed.
Reinforcement learning has no labeled correct answer to learn from; instead, the learning agent decides what should be done to perform the specified task. It learns from its prior experience. This kind of learning involves both a reward and a penalty.
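As a hedged sketch of the reward-and-penalty loop, the example below uses tabular Q-learning, one common reinforcement learning technique that the text above does not name specifically. The corridor environment, the reward values, and all hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES = 5              # corridor cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]        # move one cell left or one cell right
NAMES = ["left", "right"]


def step(state, action):
    """Return (next_state, reward, done) for one move in the corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True    # reward for reaching the goal
    return next_state, -0.01, False     # small penalty for every other move


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values from trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):            # cap on episode length
            if rng.random() < epsilon:  # explore: try a random action
                a = rng.randrange(2)
            else:                       # exploit: use what was learned so far
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q_table = train()
policy = [NAMES[max(range(2), key=lambda i: q_table[s][i])] for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct. It only sees rewards and penalties, and the preferred action in each cell emerges from the values it accumulates.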
No matter the type of machine learning you're using, you'll want to be able to measure how effective your model is. You can do this using various performance metrics.
Performance Metrics
There are different evaluation metrics in machine learning, and the right choice depends on the type of problem and the requirements.
For classification, commonly used metrics include the confusion matrix, precision, recall, accuracy, and the F1 score.
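As a minimal sketch, the common classification metrics (confusion matrix counts, precision, recall, accuracy, and F1 score) can be computed from lists of actual and predicted labels as follows. The function names and the example labels are hypothetical, and binary labels with `1` as the positive class are assumed.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # correctly predicted positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # correctly predicted negative
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical labels: 1 = customer bought the product, 0 = did not.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 1]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(actual, predicted)

print(counts)   # (3, 2, 1, 2)
print(metrics)
```

Precision asks how many predicted positives were right, recall asks how many actual positives were found, and F1 is the harmonic mean of the two.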
Train and Test Data
Once you've pre-processed your data into a format that's ready to be used by your model, you need to split up your data into train and test sets.
This is because your machine learning algorithm will use the data in the training set to learn what it needs to know.
It will then make a prediction about the data in the test set, using what it has learned.
You can then compare this prediction against the actual target variables in the test set in order to see how accurate your model is.
Example: Splitting Data into Train and Test Sets
We will split the data proportionally. The larger portion of the data will be the train set and the smaller portion will be the test set.
This will help to ensure that you are using enough data to accurately train your model.
In general, we carry out the train-test split with an 80:20 ratio, a convention that is often loosely associated with the Pareto principle.
The Pareto principle states that "for many events, roughly 80% of the effects come from 20% of the causes." If you have a large dataset, however, the exact ratio matters less: an 80:20, 90:10, or 60:40 split can all work, as long as both sets remain large enough to be representative.
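The proportional split described above can be sketched in plain Python as follows. The helper `split_rows`, its parameters, and the stand-in data are hypothetical; scikit-learn's `train_test_split` is a widely used ready-made alternative.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the input untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 rows of a dataset.
rows = list(range(100))

train_rows, test_rows = split_rows(rows, test_fraction=0.2)

print(len(train_rows), len(test_rows))  # 80 20
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which matters when the file is sorted. For chronological data, a time-ordered split is usually more appropriate than a random one.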
For this example, we will use the USA_Housing.csv dataset.