Elo Merchant Category Recommendation Silver Medal

Competition

Elo Merchant Category Recommendation was an early 2019 competition hosted by Elo on Kaggle. After the excitement of reaching the final of the Data Science Game 2018 and taking Christmas off to rest, I was eager to jump into another data science competition, and Elo provided a great chance for that.

After two months of hard competition, all the hard work paid off with a silver medal for finishing in the top 5% of competitors from all over the world. Below is a copy of my thought process, pipeline and solution from my original post on Kaggle.

Introduction

Hello everyone, thanks for the great competition and discussion. Much was learned and put into practice.

I went into this competition marked by Yifan Xie's words:

> Each and every kaggle competition shall be treated as a work/study project in which you should setup a proper workflow to structure your work.

With that in mind, I hope someone can take something from this, just as I learned a lot from all the discussions.

Organization

GitHub for code, reproducible Jupyter notebooks, LightGBM/XGBoost for testing things, and an ensemble of 4 models for submitting. The train_model function here was a great find.
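That helper was essentially a cross-validated training loop that returns out-of-fold and test predictions, which is also what made the ensembling later on straightforward. A minimal sketch of that kind of function, assuming pandas inputs and LightGBM (this is my reconstruction, not the kernel's exact code):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def train_model(X, y, X_test, params, n_folds=5):
    """K-fold LightGBM training; returns out-of-fold and averaged test predictions."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))

    for train_idx, valid_idx in folds.split(X):
        train_set = lgb.Dataset(X.iloc[train_idx], y.iloc[train_idx])
        valid_set = lgb.Dataset(X.iloc[valid_idx], y.iloc[valid_idx])
        model = lgb.train(params, train_set, num_boost_round=10000,
                          valid_sets=[valid_set],
                          callbacks=[lgb.early_stopping(100)])
        oof[valid_idx] = model.predict(X.iloc[valid_idx],
                                       num_iteration=model.best_iteration)
        test_preds += model.predict(X_test,
                                    num_iteration=model.best_iteration) / n_folds
    return oof, test_preds
```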

Feature Engineering

Much feature engineering was done. It would not have been possible without the much-used memory-reduction script frequently shared here (a sketch of the idea follows the list below). Here is everything that went into the final models, in the order I tried it. Since most ideas are not new, I will link to other users' example implementations:

  • Basic time features were initially done, things such as this

  • More heavy time feature engineering, things such as here, here and here.

  • Negative month lag features

  • RFM (recency, frequency, monetary), from both the customer_id and merchant_id, weighted and not weighted (see the sketch after this list)

  • Aggregations over categories that only had a few examples. For these, special care was taken not to include information from whatever was in the current validation set, i.e., for each pass of the fold I calculated them from that fold's training data only (see the sketch after this list)

  • More and more aggregations and stats
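The memory-reduction idea mentioned above is simple: downcast each numeric column to the smallest dtype that can hold its range. A minimal sketch of that trick, not the exact script that was shared:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits their range."""
    for col in df.select_dtypes(include=[np.number]).columns:
        col_min, col_max = df[col].min(), df[col].max()
        if np.issubdtype(df[col].dtype, np.integer):
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # Stop at float32: float16 loses too much precision for later aggregates.
            if np.finfo(np.float32).min <= col_min <= col_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
    return df
```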
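For the RFM features, a rough sketch of the recency/frequency/monetary aggregates over the transactions table; the exact definitions and weighting I used differed, so treat the details as illustrative:

```python
import pandas as pd

def rfm_features(trans: pd.DataFrame, key: str = "card_id") -> pd.DataFrame:
    """Recency / frequency / monetary aggregates per key
    (the key can be the customer/card id or merchant_id)."""
    trans = trans.copy()
    trans["purchase_date"] = pd.to_datetime(trans["purchase_date"])
    snapshot = trans["purchase_date"].max()  # reference date for recency

    return trans.groupby(key).agg(
        recency=("purchase_date", lambda d: (snapshot - d.max()).days),
        frequency=("purchase_date", "count"),
        monetary=("purchase_amount", "sum"),
    ).add_prefix(f"{key}_rfm_")

# A weighted variant would discount older transactions before aggregating,
# e.g. multiplying purchase_amount by a decay based on month_lag.
```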
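And a sketch of the fold-safe aggregation from the last bullets: each row's feature is computed only from the training rows of its fold, so nothing from the validation split leaks into it. The function and column names here are mine:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def fold_safe_mean(train: pd.DataFrame, group_col: str, value_col: str,
                   n_folds: int = 5) -> pd.Series:
    """Per-group mean of value_col, computed separately for every fold
    from that fold's training rows only."""
    feature = pd.Series(np.nan, index=train.index)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for fit_idx, val_idx in folds.split(train):
        # Aggregate on the fold's training part...
        means = train.iloc[fit_idx].groupby(group_col)[value_col].mean()
        # ...and map it onto the fold's validation part; groups that only
        # appear in the validation rows simply stay NaN.
        feature.iloc[val_idx] = train.iloc[val_idx][group_col].map(means).values
    return feature
```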

Models

The final model combined a model trained with outliers and one trained without them, as mentioned here, with some extra modifications: a third model predicted which instances were outliers, and I tuned a threshold on its output to decide which instances to treat as outliers. For those instances, the prediction was the average of the two models; for all others, the prediction came from the model trained without outliers.
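In code, the blending rule looked roughly like this; the array names and the threshold value are placeholders (the real threshold was tuned on validation):

```python
import numpy as np

def blend_predictions(pred_with_outliers: np.ndarray,
                      pred_without_outliers: np.ndarray,
                      outlier_prob: np.ndarray,
                      threshold: float = 0.02) -> np.ndarray:
    """Average the two regressors where the classifier flags a likely
    outlier; otherwise trust the model trained without outliers."""
    is_outlier = outlier_prob > threshold  # threshold tuned on validation
    return np.where(is_outlier,
                    (pred_with_outliers + pred_without_outliers) / 2.0,
                    pred_without_outliers)
```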

Each model was in itself an ensemble of XGBoost, CatBoost and LightGBM - stacking was tried, but the results weren't an improvement.
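The per-library combination itself was just a plain average of the three sets of predictions (the arrays below are placeholders):

```python
import numpy as np

# Placeholder predictions from the three libraries.
xgb_pred = np.array([3.2, -1.1, 0.4])
cat_pred = np.array([3.0, -0.9, 0.5])
lgb_pred = np.array([3.1, -1.0, 0.3])

# A simple average; a stacked meta-model over the same predictions
# did not improve on this.
ensemble_pred = np.mean([xgb_pred, cat_pred, lgb_pred], axis=0)
```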

Things I didn’t have the time to do

Late into the competition I had to take a break due to work and traveling, so some things in the pipeline could not be seen through to the end.

I was playing with the idea of manually identifying states (and possibly a number of cities, from the correlation of spending and population) and including external economic and social data, based on my finding that category 2 identifies the five different regions of Brazil, which other people also noticed.

Neural networks were also one of my next go-to things to try, but alas, no time :/

Ideas that failed

FM and FFM - I spent some time understanding the theory, but in the end all the models I built were very subpar.

Featuretools - it kind of worked and some of the features were slightly useful, but not as much as I expected, although I was also lacking in computing power by that point.