Encoding categorical features is a very common problem in machine learning, and the approach generally depends on the type of categorical feature.
For ordinal features, whether low or high cardinality, it's easy to assign a real number because of the inherent hierarchy: for example, alert levels, ranks, etc.
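As a minimal sketch (the alert levels and the mapping here are hypothetical, not from our data), an ordinal feature can be encoded by mapping it through its ordering:

```python
import pandas as pd

# Hypothetical ordinal feature: the hierarchy dictates the assigned numbers.
df = pd.DataFrame({"alert_level": ["low", "high", "medium", "low"]})
alert_order = {"low": 0, "medium": 1, "high": 2}
df["alert_level_encoded"] = df["alert_level"].map(alert_order)
```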
Temporal features can also be handled well with cyclical representations. For example, one can represent the hour of the day with just two features (its sine and cosine values).
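A sketch of the idea for the hour of the day; placed on a circle, 23:00 and 00:00 end up close together, unlike with a raw integer:

```python
import numpy as np

hours = np.arange(24)

# Two features place the hours on a circle of period 24.
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```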
Low-cardinality nominal features are generally handled well by methods like one-hot encoding. The major issues arise with high-cardinality nominal features, where one can end up with hundreds or thousands of categories (e.g. customer names/users, destination codes, etc.). In these scenarios, methods like one-hot encoding come with the curse of dimensionality. Feature hashing can help, but it has its own disadvantages: distinct categories can collide, which loses information.
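A sketch of that trade-off using scikit-learn's FeatureHasher (the customer values are made up): hashing caps the dimensionality, at the cost of possible collisions:

```python
from sklearn.feature_extraction import FeatureHasher

# 1000 hypothetical customers; one-hot encoding would need 1000 columns.
rows = [{"customer": f"customer_{i}"} for i in range(1000)]

# Hashing caps the width at 32 columns, but distinct customers can now
# collide into the same column, losing information.
hasher = FeatureHasher(n_features=32, input_type="dict")
hashed = hasher.transform(rows)  # sparse matrix of shape (1000, 32)
```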
The most common way to handle high-cardinality nominal features is some kind of statistical encoding, where the encoding depends on a statistic of the feature itself, the target variable, or multiple other features. For example, if your target variable is a binary class, one can use the ratio of positives to the total in the historical data for the selected category as the statistic.
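A minimal sketch of that statistic with pandas (the column names and data are hypothetical):

```python
import pandas as pd

# Hypothetical binary-target data: 1 = positive outcome.
df = pd.DataFrame({
    "customer_code": ["A", "A", "B", "B", "B", "C"],
    "target":        [1,   0,   1,   1,   0,   0],
})

# Ratio of positives to total per category, then mapped back onto each row.
positive_rate = df.groupby("customer_code")["target"].mean()
df["customer_code_encoded"] = df["customer_code"].map(positive_rate)
```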
When we have multiple categorical variables in the data and the target variable we want to predict is continuous, a simple statistic-based encoding like the mean, median, or some quantile might not always be very relevant because of the underlying data distribution: if the target for a category is, say, bimodal, its mean can land in a region where hardly any observations actually fall.
Here, we want to find a real-valued number for each category value that preserves some or most of the similarity information based on the target distribution for the respective category values.
For our problem, we want to predict the yield (revenue per pound, our target variable) based on categorical features like product_code, origin_code, and customer_code, along with a bunch of other continuous features. All the categorical features here are of high cardinality. The yield could be very different for the same product code if the origin or the customer code is different.
For example, yield distributions for the different product codes look like the following (Assume that Rick and Morty planets are our products 🙂):
[Interactive plot: yield distributions per product code](https://s3-ap-southeast-1.amazonaws.com/assets.tech.portcast.io/product_yield_dist.html)
We'll try to find the encoding for the product_code here as an example.
The goal is a generalised approach that finds a floating-point representation (encoding) for each category value such that encodings with high proximity have similar yields. Here we assume a linear relationship between the derived encodings and the yield.
We use the negative absolute correlation between the encodings and the yields as the loss function: minimising it maximises the absolute correlation between the encodings and their respective target yields. With $x_i$ as the encoding assigned to data point $i$ and $y_i$ as its yield, the loss we want to minimise is:
$$-\left\lvert \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \right\rvert$$
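A sketch of this loss in Python, assuming `encodings` holds one candidate value per unique category and `category_indices` maps every data row to its category (both names are ours, for illustration):

```python
import numpy as np

def loss(encodings, category_indices, yields):
    """Negative absolute Pearson correlation between encodings and yields."""
    # Map each row's category index to its candidate encoding value.
    x = encodings[category_indices]
    r = np.corrcoef(x, yields)[0, 1]
    return -abs(r)
```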
To find the global minimum, we use scipy's dual_annealing. This is a stochastic global optimisation approach that combines the generalisation of CSA (Classical Simulated Annealing) and FSA (Fast Simulated Annealing). It finds a global minimum of a function within given bounds (0-1 in our case). Simulated annealing lets us reach an acceptable (not always the best) solution much faster than a brute-force search.
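A sketch of the optimisation with `scipy.optimize.dual_annealing` on synthetic data (the loss is the one defined above, repeated here so the snippet runs on its own):

```python
import numpy as np
from scipy.optimize import dual_annealing

def loss(encodings, category_indices, yields):
    r = np.corrcoef(encodings[category_indices], yields)[0, 1]
    return -abs(r)

# Synthetic data: 5 products, yield loosely driven by the product identity.
rng = np.random.default_rng(0)
category_indices = rng.integers(0, 5, size=500)
yields = rng.normal(loc=category_indices, scale=0.5)

# One encoding per unique product, each bounded to [0, 1].
bounds = [(0.0, 1.0)] * 5
result = dual_annealing(loss, bounds, args=(category_indices, yields), seed=42)
encodings = result.x  # learned encoding for each product code
```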
Note that since the data will be imbalanced (some products will have many more occurrences than others), we sample it so that each category value gets equal weightage.
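A sketch of such balancing with pandas (the helper and its parameters are hypothetical):

```python
import pandas as pd

def balanced_sample(df, category_col, n_per_category=100, seed=0):
    """Draw the same number of rows per category (with replacement for
    rare categories) so each category gets equal weight in the loss."""
    return (
        df.groupby(category_col, group_keys=False)
          .apply(lambda g: g.sample(n=n_per_category, replace=True,
                                    random_state=seed))
          .reset_index(drop=True)
    )
```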