One-Hot Encoding

What is One-Hot Encoding?

One-hot encoding is a method used in machine learning to represent categorical data numerically. In short, it produces a vector whose length equals the number of categories in the data set. If a data point belongs to the ith category, every component of this vector is assigned the value 0 except the ith component, which is assigned the value 1. In this way, one can keep track of the categories in a numerically meaningful way.
The question may arise: what happens if there are multiple 1’s? Which classification is correct? By definition this cannot happen, because a one-hot encoded vector has exactly one position set to 1. For example, [0,0,0,1,0] is a valid one-hot encoding that tells you the object belongs to the category in position 4 (index 3 with zero-based array indexing). In contrast, [0,1,0,1,0] and [1,1,1,1,1] are invalid one-hot encodings.
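
The validity rule above can be sketched as a small check. This is only an illustration: the function name is_one_hot is made up for this example and does not come from any library.

```python
def is_one_hot(vector):
    """Return True if the vector holds only 0s and 1s and exactly one 1."""
    only_zeros_and_ones = all(v in (0, 1) for v in vector)
    return only_zeros_and_ones and sum(vector) == 1


print(is_one_hot([0, 0, 0, 1, 0]))  # True: a single 1 at index 3
print(is_one_hot([0, 1, 0, 1, 0]))  # False: two positions are set to 1
print(is_one_hot([1, 1, 1, 1, 1]))  # False: every position is set to 1
```

Under this check, a vector of all zeros is rejected as well, since it contains no 1 at all.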


Consider the problem of classifying a person into one of four categories: male, female, gender-neutral, and other. We can represent each person as an array with four positions, one per category. Let’s say that, walking down the street, we encounter 4 people who identify as female, 3 who identify as male, one who identifies as gender-neutral, and 2 who identify as something other than those three categories. Every person in the same category receives the same encoding, so we can represent these people in the following way:

[0,1,0,0] // female
[1,0,0,0] // male
[0,0,1,0] // gender-neutral
[0,0,0,1] // other
Notice that every encoding uses the same category order. It is important to establish the order of the array before anything is encoded, and to keep that order fixed afterwards. For this example, we kept the order in which the categories were initially given, so our template array is [male, female, gender-neutral, other].
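
As a minimal sketch of this example in Python, the template array can be stored as a list and each person's label looked up in it. The names CATEGORIES, one_hot, people, and encoded are chosen for this illustration only.

```python
# Template array: this fixed order decides which position each category gets.
CATEGORIES = ["male", "female", "gender-neutral", "other"]


def one_hot(label, categories=CATEGORIES):
    """Return a list with a 1 at the label's position and 0 everywhere else."""
    if label not in categories:
        raise ValueError(f"unknown category: {label!r}")
    index = categories.index(label)
    return [1 if i == index else 0 for i in range(len(categories))]


# The ten people from the example: 4 female, 3 male, 1 gender-neutral, 2 other.
people = ["female"] * 4 + ["male"] * 3 + ["gender-neutral"] + ["other"] * 2
encoded = [one_hot(person) for person in people]

print(one_hot("female"))          # [0, 1, 0, 0]
print(one_hot("gender-neutral"))  # [0, 0, 1, 0]
```

Changing the order of CATEGORIES changes every encoding, which is why the template has to be settled first. In practice, libraries such as pandas and scikit-learn provide their own one-hot encoders; the hand-written version here is only meant to make the idea concrete.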