Elo Merchant Category Recommendation was an early 2019 competition hosted by Elo on Kaggle. After the excitement of reaching the final of the Data Science Game 2018 and taking Christmas off to rest, I was eager to jump into another data science competition, and Elo provided a great chance for that.
After 2 months of tough competition, the hard work paid off with a silver medal for finishing in the top 5% of competitors from all over the world. Below is a copy of my thought process, pipeline and solution from my original post on Kaggle.
Hello everyone, thanks for the great competition and discussion. Much was learned and put into practice.
> Each and every kaggle competition shall be treated as a work/study project in which you should setup a proper workflow to structure your work.
With that in mind, I hope someone can take something from this, just as I learnt a lot from all the discussions.
GitHub for code, reproducible Jupyter notebooks, lgbm/xgboost for testing things, and an ensemble of 4 models for submitting. The train_model function here was a great find.
Much feature engineering was done. It would not have been possible without the memory usage script frequently shared here. Here is everything that went into the final models, in order of what I tried. Since most ideas are not new, I will link to other users' example implementations.
Basic time features were initially done, things such as this
Negative month lag features
RFM, from both the customer_id and merchant_id, weighted and not weighted
Aggregations that only had a few examples. For these, special care was taken not to include information from whatever was in the current validation set, i.e., for each pass of the fold I would calculate them from the training data of that fold only.
More and more aggregations and stats
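The fold-wise calculation mentioned for the small aggregations can be sketched roughly as below. This is not the code from the competition, only a minimal illustration: it assumes the aggregate is a per-group mean of some target-dependent column, and the function name `out_of_fold_group_mean`, the column names and the fallback to the fold's overall mean are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_group_mean(df, group_col, value_col, n_splits=5, seed=0):
    """Per-group mean of value_col, computed for each validation fold
    from that fold's training rows only (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(df):
        train = df.iloc[train_idx]
        # The statistics never see the rows that are being validated.
        group_means = train.groupby(group_col)[value_col].mean()
        # Assumption: groups absent from the training part get the
        # overall mean of the training part.
        fallback = train[value_col].mean()
        valid_groups = df[group_col].iloc[valid_idx]
        encoded.iloc[valid_idx] = (
            valid_groups.map(group_means).fillna(fallback).to_numpy()
        )
    return encoded


# Tiny usage example with made-up data.
demo = pd.DataFrame(
    {
        "merchant": ["m1", "m1", "m2", "m2", "m3", "m1"],
        "target": [0.5, -1.0, 2.0, 0.0, -33.2, 1.5],
    }
)
demo["merchant_target_mean"] = out_of_fold_group_mean(
    demo, "merchant", "target", n_splits=3
)
```

The point of the design is that a row's feature value never depends on its own target, which matters most for groups with only a handful of rows.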
The final model was based on combining models trained with and without outliers, as mentioned here, with some extra modifications. Basically, using a 3rd model to predict outliers, I found an ideal threshold amount to flag as outliers; for those instances, the prediction was the average of the two models, and for all others the prediction came from the model without outliers.
Each model was in itself an ensemble of xgboost, catboost and lightgbm. Stacking was tried, but the results weren’t an improvement.
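As a rough illustration of the outlier-aware combination described above, here is a minimal sketch. It assumes the 3rd model outputs a score per row and that the threshold is a cutoff on that score (it could equally be a count of top-scored rows); the function name and arguments are hypothetical, not taken from the original code.

```python
import numpy as np


def blend_with_outlier_model(pred_with, pred_without, outlier_score, threshold):
    """Combine two regressors using an outlier classifier's score.

    pred_with:     predictions of the model trained with outliers
    pred_without:  predictions of the model trained without outliers
    outlier_score: score from the 3rd model (assumed: higher = more likely outlier)
    threshold:     assumed cutoff above which a row is treated as an outlier
    """
    pred_with = np.asarray(pred_with, dtype=float)
    pred_without = np.asarray(pred_without, dtype=float)
    outlier_score = np.asarray(outlier_score, dtype=float)

    likely_outlier = outlier_score >= threshold
    averaged = (pred_with + pred_without) / 2.0
    # Flagged rows get the average of both models, the rest keep the
    # prediction of the model trained without outliers.
    return np.where(likely_outlier, averaged, pred_without)


# Usage with made-up numbers: only the first row is flagged.
blended = blend_with_outlier_model(
    pred_with=[-10.0, 1.0],
    pred_without=[-2.0, 1.5],
    outlier_score=[0.9, 0.1],
    threshold=0.5,
)
```

The threshold itself would be tuned on out-of-fold predictions, for example by trying a grid of cutoffs and keeping the one with the lowest validation RMSE.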
Things I didn’t have the time to do
Late into the competition I had to take a break due to work/traveling, so some things in the pipeline could not be tried to the end.
I was playing with the idea of manually identifying states (and possibly a number of cities, from the correlation of spending and population) and including external economic and social data, based on my finding that category 2 identifies the 5 different regions in Brazil, which other people also found out.
Neural networks were also one of my next go-to things to try but alas, no time :/
Ideas that failed
FM and FFM - spent some time understanding the theory, but in the end all the models built were very subpar.
Featuretools - it kind of worked and some of the features were slightly useful, but not as much as I expected, although I was also lacking in computing power by this point.