The idea of one-hot encoding labels in supervised learning isn't new. It grew out of a simple necessity: data science and machine learning algorithms can only understand categorical data once it's expressed numerically.

Since most machine learning algorithms can't work with categorical data directly (even purely numerical datasets with large gaps between feature values usually need to be normalized), the simplest way out is to represent the categories with dummy numerical values.

The idea is simple: assign ONE to the correct class and ZERO to every other class.
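
In code, the rule is a one-liner. Here's a minimal Python sketch, assuming the full list of classes is known up front (the class names here are just placeholders):

```python
classes = ["Apple", "Orange", "Lemon"]

def one_hot(label):
    # ONE for the correct class, ZERO for every other class.
    return [1 if c == label else 0 for c in classes]

print(one_hot("Orange"))  # [0, 1, 0]
```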

In machine learning, one-hot encoding is very widely used for correctly calculating loss functions. (A loss function in supervised learning tells us how far a predicted value is from the expected value.)

Take a supervised learning example: we're trying to train a model to classify fruits as apples, oranges, or lemons.

Here’s a small dataset of 5 samples:

ID  Fruit
1   Lemon
2   Apple
3   Lemon
4   Orange
5   Apple

Using the raw integer labels directly (say Apple = 1, Orange = 2, Lemon = 3) and a squared-error loss, let's calculate the loss for two cases of wrong predictions:

Case 1: the true class is Apple (1) and the model predicts Orange (2). The loss is (1 - 2)² = 1.
Case 2: the true class is Apple (1) and the model predicts Lemon (3). The loss is (1 - 3)² = 4.

What this difference in loss values tells the model is that class 1 (Apple) is somehow closer to class 2 (Orange) than to class 3 (Lemon), which is very misleading: the classes are categorical, and no such ordering or distance between them actually exists.

Think about how much larger these gaps would grow if there were more classes/labels.
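
Here's that arithmetic as a minimal Python sketch; the squared-error loss and the Apple = 1, Orange = 2, Lemon = 3 mapping are assumptions for illustration, not a fixed convention:

```python
def squared_error(expected, predicted):
    # Squared difference between the integer class labels.
    return (expected - predicted) ** 2

# True class: Apple (1), assuming the mapping Apple=1, Orange=2, Lemon=3.
print(squared_error(1, 2))  # predicted Orange -> loss 1
print(squared_error(1, 3))  # predicted Lemon  -> loss 4, "four times as wrong"
```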

Here’s the same dataset after one-hot encoding:

ID  Apple  Orange  Lemon
1   0      0       1
2   1      0       0
3   0      0       1
4   0      1       0
5   1      0       0
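
One way to build this table programmatically is with pandas' get_dummies (sklearn's OneHotEncoder is a common alternative); a minimal sketch:

```python
import pandas as pd

# The same 5-sample fruit dataset.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Fruit": ["Lemon", "Apple", "Lemon", "Orange", "Apple"],
})

# One column per class: 1 for the correct class, 0 everywhere else.
one_hot = pd.get_dummies(df["Fruit"], dtype=int)

# get_dummies orders columns alphabetically; reorder to match the table above.
one_hot = one_hot[["Apple", "Orange", "Lemon"]]
print(pd.concat([df["ID"], one_hot], axis=1))
```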

Now, let's take the same two cases of wrong predictions and calculate the loss again. The true class, Apple, is now the vector [1, 0, 0]. Predicting Orange ([0, 1, 0]) gives a squared error of (1 - 0)² + (0 - 1)² + (0 - 0)² = 2, and predicting Lemon ([0, 0, 1]) gives exactly the same value, 2. Both wrong predictions are now equally wrong.
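
And the same comparison in code, reusing the squared-error idea from before, now summed over the one-hot vectors:

```python
def sse(expected, predicted):
    # Sum of squared differences across the one-hot vector.
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

apple, orange, lemon = [1, 0, 0], [0, 1, 0], [0, 0, 1]

print(sse(apple, orange))  # predicted Orange -> loss 2
print(sse(apple, lemon))   # predicted Lemon  -> loss 2, equally wrong
```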

Now, this makes more sense for the model.