Encoding categorical features is a very common problem in machine learning, and the approach generally depends on the type of categorical feature.

For ordinal features, whether of low or high cardinality, it's easy to assign a real number because the categories have an inherent hierarchy. For example, alert levels, ranks, etc.
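As a minimal sketch, assuming alert severity levels as a hypothetical ordinal feature, the encoding is just a direct mapping that respects the ordering:

```python
# Hypothetical example: alert severity has an inherent order,
# so each level maps directly to a real number.
alert_order = {"low": 0, "medium": 1, "high": 2, "critical": 3}

alerts = ["medium", "critical", "low"]
encoded = [alert_order[a] for a in alerts]  # [1, 3, 0]
```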

Temporal features can also be handled well by using cyclical representations. For example, one can represent an hour of the day with just two features (its sine and cosine values).
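A quick sketch of the sine/cosine idea (the function name is just for illustration): each hour is mapped to a point on the unit circle, so 23:00 and 00:00 end up close together instead of 23 units apart:

```python
import math

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# encode_hour(23) and encode_hour(0) are nearby in this 2-D space,
# unlike the raw integers 23 and 0.
```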

Low-cardinality nominal features are generally handled well by methods like one-hot encoding. The major issues arise with high-cardinality nominal features, where one can end up with hundreds or thousands of categories (e.g. customer names/users, destination codes, etc.). In these scenarios, methods like one-hot encoding suffer from the curse of dimensionality. Feature hashing can help, but it has its own disadvantages: collisions between categories can cause loss of information.
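To make the collision problem concrete, here is a hedged sketch of feature hashing using only the standard library (the function and bucket count are illustrative, not a production implementation): each category is hashed into a fixed number of buckets, and two distinct categories can land in the same bucket.

```python
import hashlib

def hash_bucket(category, n_buckets=8):
    """Deterministically map a category string to one of n_buckets slots.

    Distinct categories may collide in the same slot, which is where
    feature hashing loses information.
    """
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# With thousands of categories and only 8 buckets,
# collisions are guaranteed by the pigeonhole principle.
```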

The most common way to handle high-cardinality nominal features is some kind of statistical encoding, where the encoding depends on a statistic of the feature itself, of the target variable, or of several other features. For example, if the target variable is a binary class, one can use the ratio of positives to the total count in historical data for each category as the statistic.
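The positive-ratio statistic described above can be sketched as follows (the function name and toy data are hypothetical):

```python
from collections import defaultdict

def positive_ratio_encoding(categories, targets):
    """For each category, compute positives / total from historical data."""
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for cat, y in zip(categories, targets):
        counts[cat][0] += y
        counts[cat][1] += 1
    return {cat: pos / total for cat, (pos, total) in counts.items()}

cats = ["a", "a", "b", "b", "b"]
ys = [1, 0, 1, 1, 0]
positive_ratio_encoding(cats, ys)  # a -> 0.5, b -> 2/3
```

In practice one would also smooth these ratios toward the global mean for rare categories, since a category seen only once gets an extreme 0 or 1.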

Problem

When the data contains multiple categorical variables and the target variable we want to predict is continuous, a simple statistic-based encoding such as the mean, median, or some quantile might not always be very relevant, because of the underlying data distribution.

Here, we want to find a real-valued number for each category that preserves some or most of the similarity information in the target distributions of the respective category values.

For our problem, we want to predict the yield (revenue per pound, our target variable) based on different categorical features such as product_code, origin_code, and customer_code, along with a number of other continuous features. All the categorical features here are of high cardinality, and the yield can be very different for the same product code if the origin or customer code differs.

For example, yield distributions for the different product codes look like the following (Assume that Rick and Morty planets are our products 🙂):

https://s3-ap-southeast-1.amazonaws.com/assets.tech.portcast.io/product_yield_dist.html

We'll try to find the encoding for the product_code here as an example.

Approach