Categorical data input to multilayer perceptrons - a quick discussion

Author: Goran Trlin

This article is a brief discussion of the use of categorical data as inputs and/or outputs of different neural networks, instead of the continuous values used in the previous article.

Although there are different ways of encoding categorical data, the most popular ones are:

  • One-hot vectors
  • Data embeddings

One-hot vectors

One-hot encodings represent categorical values as vectors in which every element is zero except one. Example: the word "car" could be represented as [0,0,1,0,0,0], while the word "engine" could be represented as [0,0,0,0,0,1]. This approach is very common in machine learning applications. It is the encoding format we used in the previous tutorial as the output format of our multilayer perceptron network, and it is also the input format of the basic recurrent neural network created in this tutorial.
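
As a quick illustration, here is a minimal Python sketch of one-hot encoding over a small vocabulary. The vocabulary and word positions are hypothetical, chosen only so that "car" and "engine" match the example vectors above; they are not taken from the tutorials.

    # A minimal sketch of one-hot encoding for a small, hypothetical vocabulary.
    import numpy as np

    vocabulary = ["road", "wheel", "car", "driver", "mountain", "engine"]

    def one_hot(word, vocab):
        """Return a vector of zeros with a single 1 at the word's index."""
        vec = np.zeros(len(vocab))
        vec[vocab.index(word)] = 1.0
        return vec

    print(one_hot("car", vocabulary))     # [0. 0. 1. 0. 0. 0.]
    print(one_hot("engine", vocabulary))  # [0. 0. 0. 0. 0. 1.]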

The one-hot vector representation is well-understood, powerful and commonly used in practice. However, it does have its drawbacks.

The main downsides of one-hot encoding are its sparsity and the lack of a meaningful distance metric between different one-hot vectors. The sparsity shows up as higher computational cost and memory consumption, since the vector length grows with the number of categories. The equidistance of one-hot vectors is acceptable in some scenarios, but less than optimal in most others, such as Natural Language Processing (NLP), where some words are clearly more related than others.
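
The equidistance is easy to check: any two distinct one-hot vectors have cosine similarity 0 and Euclidean distance sqrt(2), no matter how related the underlying categories are. The short sketch below reuses the illustrative encoding from above.

    # Any two distinct one-hot vectors are equally far apart, so the encoding
    # carries no information about how related two categories are.
    import numpy as np

    car      = np.array([0, 0, 1, 0, 0, 0], dtype=float)
    engine   = np.array([0, 0, 0, 0, 0, 1], dtype=float)
    mountain = np.array([0, 0, 0, 0, 1, 0], dtype=float)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(car, engine))           # 0.0
    print(cosine(car, mountain))         # 0.0 -- identical to the pair above
    print(np.linalg.norm(car - engine))  # 1.414... (sqrt(2)) for every pair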

Data embeddings

Categorical data embeddings can be thought of as distributed representations of categorical values. Each category is still represented by a vector, but most of the elements of that vector are non-zero. We can think of a data embedding as a vector of floating-point numbers, for example [0.35,0.22,0.89,0.125,0.25,0.52]. One important advantage of this representation over one-hot vectors is its ability to preserve distances between data points. For example, words that often appear in the same context, such as "car" and "engine", can have a high cosine similarity, while dissimilar words such as "car" and "mountain" can have a small one.
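
The sketch below uses made-up embedding vectors purely for illustration (real embeddings would be learned from data) to show how cosine similarity can separate related and unrelated words.

    # Hypothetical embedding vectors: "car" and "engine" point in a similar
    # direction, "mountain" does not. The values are invented for illustration.
    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    car      = np.array([0.35, 0.22, 0.89, 0.125, 0.25, 0.52])
    engine   = np.array([0.30, 0.25, 0.80, 0.100, 0.20, 0.60])
    mountain = np.array([-0.70, 0.05, -0.10, 0.90, -0.30, 0.01])

    print(cosine(car, engine))    # close to 1.0
    print(cosine(car, mountain))  # much smaller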

An important difference between one-hot vectors and data embeddings is that the former are typically predefined (known throughout training and prediction), while the latter need to be learned from data examples. Data embeddings are typically extracted from the weights between the input and hidden layers of a standard feed-forward neural network.
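
The following sketch shows why the input-to-hidden weight matrix can be read as an embedding table: with a one-hot input, the hidden-layer activation is exactly one row of that matrix. The vocabulary size and embedding dimension below are arbitrary choices for illustration.

    # With a one-hot input, the forward pass to the hidden layer just selects
    # one row of the input-to-hidden weight matrix W.
    import numpy as np

    vocab_size, embedding_dim = 6, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(vocab_size, embedding_dim))  # input-to-hidden weights

    one_hot_car = np.zeros(vocab_size)
    one_hot_car[2] = 1.0                              # "car" at index 2

    hidden = one_hot_car @ W                          # hidden-layer activation
    print(np.allclose(hidden, W[2]))                  # True: row 2 of W is the embedding of "car"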

Embeddings are mostly used in Natural Language Processing (NLP). Typical NLP use cases involve predicting the surrounding context from a single word (Skip-gram models), predicting a single word from its surrounding context (CBOW models), and so on.
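
As a rough sketch of that difference, the snippet below builds Skip-gram-style (center word, context word) pairs and CBOW-style (context words, center word) pairs from a toy sentence. The sentence and window size are illustrative only.

    # Building Skip-gram and CBOW training examples from a toy sentence.
    sentence = ["the", "car", "engine", "started", "near", "the", "mountain"]
    window = 2

    skipgram_pairs = []  # (center word -> one context word) pairs
    cbow_pairs = []      # (context words -> center word) pairs

    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        skipgram_pairs.extend((center, c) for c in context)
        cbow_pairs.append((context, center))

    print(skipgram_pairs[:4])  # [('the', 'car'), ('the', 'engine'), ('car', 'the'), ('car', 'engine')]
    print(cbow_pairs[1])       # (['the', 'engine', 'started'], 'car')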

Click here to see an example of a basic Skip-gram based neural network used for the purpose of generating word embeddings for a small vocabulary.