Customer segmentation and prediction for Arvato Financial Services.

Khurazov Ruslan
9 min read · Oct 11, 2021

The full project notebook can be found on GitHub.

This is the final project of the Data Science Nanodegree by Udacity, created in collaboration with Bertelsmann Arvato. The project focuses on clustering the general population of Germany and analyzing data on customers of a mail-order company. In the first part, the K-Means clustering algorithm is used to segment the data into distinctive groups. The other goal of the project is to build a classifier that predicts a customer's response to a marketing campaign: I try different classifiers, evaluate them and pick the best-performing one. After the classifier is chosen, I tune its hyperparameters and submit the result to the competition page on kaggle.com.

The data for this project is provided by Arvato Financial Services and consists of 4 .csv and 2 .xlsx files:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
Figure 1. First look at general population data
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
Figure 2. First look at customers data
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
Figure 3. First look at training set
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
Figure 4. First look at testing set
  • DIAS_Information_Levels-Attributes_2017.xlsx: a top-level list of attributes and descriptions, organized by informational category (this file ended up not being used in the project).
  • DIAS_Attributes-Values_2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.
Figure 5. First look at variables information

Unsupervised learning

The goal of this part is to analyze both the general population of Germany and the company's customers, split them into distinctive groups and try to figure out the characteristics that make up a customer. We have two big datasets for this section of the project: Udacity_AZDIAS and Udacity_CUSTOMERS. Considering the size of the datasets, a lot of time was spent on exploring the data. The first step was to get to know the features and what they represent. It turned out that we had detailed information on only 272 features, so for this part of the project only those features were kept and the rest were dropped. Also, some of the missing values in the data were encoded as -1, 0, 9, 'X' or 'XX'. Those values were decoded back to NaN so we could see how much data was actually missing.
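Decoding those "unknown" codes can be done with a small helper driven by the attribute/value spreadsheet. The snippet below is a minimal sketch, assuming the mapping has already been parsed into a dictionary; the columns and codes shown are illustrative examples, not the full mapping or the project's exact code:

```python
import numpy as np
import pandas as pd

# Illustrative subset of the "value means unknown" mapping parsed from
# DIAS_Attributes-Values_2017.xlsx (the real dictionary covers all 272 features).
unknown_codes = {
    'ALTERSKATEGORIE_GROB': [-1, 0, 9],
    'CAMEO_DEUG_2015': [-1, 'X'],
    'CAMEO_INTL_2015': [-1, 'XX'],
}

def decode_unknowns(df, unknown_codes):
    """Replace codes that stand for 'unknown' with NaN, column by column."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df

azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';')  # assumes ';' separator
azdias = decode_unknowns(azdias, unknown_codes)
```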

Figure 6. Histogram of the amount of missing values in %

Only a few columns had more than 20% of their data missing. After dropping those, 263 columns were left to work with.
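The drop itself is a couple of lines on the missing-value shares (a sketch, continuing from the decoded dataframe above):

```python
# Keep only columns with less than 20% of values missing.
missing_share = azdias.isnull().mean()        # fraction missing per column
azdias = azdias.loc[:, missing_share < 0.20]
print(f'{azdias.shape[1]} columns left')      # 263 in the project
```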

After that, two new columns were engineered from the 'PRAEGENDE_JUGENDJAHRE' column: 'YOUTH_DECADE', showing the decade in which a person spent their youth, and 'MOVEMENT', describing the dominating movement (mainstream or avantgarde) of that person's youth. It also turned out that the absolute majority of the company's customers were older people who spent their youth in the 1940s, 1950s, 1960s and 1970s.
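The split can be done with two lookup dictionaries. The mapping below follows my reading of the DIAS documentation (codes 1-15 combine a decade with a mainstream or avantgarde movement); treat the exact assignments as an assumption rather than the project's literal code:

```python
# Decade of youth and dominating movement derived from the PRAEGENDE_JUGENDJAHRE codes.
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
movement_map = {code: ('Avantgarde' if code in {2, 4, 6, 7, 9, 11, 13, 15} else 'Mainstream')
                for code in range(1, 16)}

azdias['YOUTH_DECADE'] = azdias['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
azdias['MOVEMENT'] = azdias['PRAEGENDE_JUGENDJAHRE'].map(movement_map)
azdias = azdias.drop(columns=['PRAEGENDE_JUGENDJAHRE'])
```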

Figure 7. Most of the customers are older people

After the new columns were created, it was time to fill in the missing values. I decided to impute mean values in numerical columns and modal values in categorical columns. Then object-type data was label-encoded with Scikit-learn's LabelEncoder, and the whole dataset was scaled before applying Principal Component Analysis to reduce the number of features.
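A rough sketch of those preprocessing steps (mean/mode imputation, label encoding, scaling), assuming the dataframe from the previous steps; in the project these steps were later wrapped into a pipeline:

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

num_cols = azdias.select_dtypes(include='number').columns
cat_cols = azdias.select_dtypes(include='object').columns

# Mean for numerical columns, mode for categorical ones.
azdias[num_cols] = azdias[num_cols].fillna(azdias[num_cols].mean())
azdias[cat_cols] = azdias[cat_cols].fillna(azdias[cat_cols].mode().iloc[0])

# Label-encode the object-type columns.
for col in cat_cols:
    azdias[col] = LabelEncoder().fit_transform(azdias[col].astype(str))

# Scale everything before PCA.
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias)
```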

Figure 8. 90 components explain about 80% of variance

All the steps mentioned above were wrapped up in a pipeline to process the data. As for PCA, 90 principal components were chosen, as they explain around 80% of the data variance. After applying PCA, the data was ready for K-Means clustering.
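The PCA step itself looks roughly like this; 'customers_scaled' is assumed to be the customers dataset passed through the same preprocessing and scaler:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=90, random_state=42)
azdias_pca = pca.fit_transform(azdias_scaled)     # fit on the general population
customers_pca = pca.transform(customers_scaled)   # reuse the same components for customers

print(pca.explained_variance_ratio_.sum())        # ~0.80 in the project
```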

Figure 9. Elbow method to pick a better number of clusters

I used the elbow method to determine the optimal number of clusters for K-Means. The graph showed a gradual decrease without any distinctive breaking point. I chose 10 clusters because the slope seems to flatten out a bit after 10.
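A sketch of the elbow plot; in practice a sample of the PCA-transformed data (or MiniBatchKMeans) keeps the loop manageable on roughly 900k rows:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(2, 21)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(azdias_pca)
    inertias.append(km.inertia_)   # within-cluster sum of squares

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()
```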

Once the number of clusters was set, K-Means clustering was applied to both the general population and the customers datasets, and a 'CLUSTERS' column was created in each of them.
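Fitting the final model on the general population and reusing it for the customers keeps the cluster definitions identical across both datasets (a sketch):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
azdias['CLUSTERS'] = kmeans.fit_predict(azdias_pca)     # fit on the general population
customers['CLUSTERS'] = kmeans.predict(customers_pca)   # assign customers to the same clusters
```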

Figure 10. Clusters 6, 9 and 4 hold more than 70% of all customers

As it turned out, three clusters contained the majority of customers: clusters 6, 9 and 4 combined held around 72% of all the customers. So I decided to compare those clusters to the general population and found several columns with significant differences between the general population and the customer clusters.
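Comparing the cluster shares of the two datasets takes only a few lines (a sketch):

```python
import pandas as pd

pop_share = azdias['CLUSTERS'].value_counts(normalize=True)
cust_share = customers['CLUSTERS'].value_counts(normalize=True)

comparison = (pd.DataFrame({'population': pop_share, 'customers': cust_share})
                .fillna(0)
                .sort_values('customers', ascending=False))
print(comparison.head(3))   # clusters 6, 9 and 4 hold ~72% of customers
```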

Figure 11. ‘ALTERSKATEGORIE_GROB’ column

The 'ALTERSKATEGORIE_GROB' column supported the idea that most of the customers are older people: the absolute majority of people in the customer clusters are older than 60.

Figure 12. ‘ANREDE_KZ’ column

I also found the 'ANREDE_KZ' column, which encodes a person's sex, very interesting. The analysis shows that while the general population is evenly split between males and females (with slightly more females), people in the biggest customer clusters are predominantly male.

Figure 13. ‘CAMEO_DEUG_2015’ column
Figure 14. ‘FINANZTYP’ column

The graphs above show that our customers are mostly well-off people from the middle and higher income classes. They also see themselves as money-savvy, saving or investing their money.

Figure 15. ‘D19_GESAMT_DATUM’ column

The 'D19_GESAMT_DATUM' column shows that while for most of the general population we have no or only very old information on transactions with the complete file TOTAL (I couldn't really figure out what TOTAL means; a Google search suggests it is somehow connected with taxes?), known customers are very active in that regard.

Figure 16. ‘EWDICHTE’ column
Figure 17. ‘KBA05_ANTG1’ column
Figure 18. ‘LP_FAMILIE_FEIN’ column

I found the three columns above interesting as well because they show that our customers prefer less densely populated areas with a high share of 1–2 family houses and tend to live in two-generational or even multi-generational families.

Figure 19. ‘GREEN_AVANTGARDE’ column

While 80% of the general population does not belong to the green avantgarde, almost 80% of customer cluster 4, for example, does.

Figure 20. ‘MOBI_REGIO’ column

Our customers show low or very low mobility, whereas the general population shows high mobility.

After exploring and segmenting the data, we can be pretty sure that the typical customer is a person of the older generation, mostly male, from the middle, upper-middle or upper income class, who lives in a less populated area, tends to have a big family and is thus not likely to move somewhere new.

Supervised learning

In this part we need to build a classifier that is able to predict a customer's response to a marketing campaign. We have two datasets to work with for this section of the project: Udacity_MAILOUT_052018_TRAIN.csv (a dataset with a known target, the 'RESPONSE' column) and Udacity_MAILOUT_052018_TEST.csv (a dataset where the target needs to be predicted). Data processing for this part of the project was almost the same as for the unsupervised learning, except that at first I kept all the columns, including those we have no information on. Missing values were decoded back to NaN to see how much data was missing.

Figure 21. Histogram of the amount of missing data in %

For the classification part I kept all columns that had less than 50% of their data missing (359 columns were left). Missing values were dealt with as in the first part of the project (mean values for numerical columns, modal values for categorical columns).

Figure 22. Unbalanced train data

The data was highly unbalanced, with a little over 1% of the responses being positive. In this case it made sense to choose the ROC AUC score as the evaluation metric (the competition on kaggle.com also uses this metric to evaluate submissions). I tried oversampling the data to balance it, but that led to overfitting, so eventually I used a model trained on undersampled data, which actually worked much better and faster. Instead of reducing the dimensionality with PCA, I used the feature_importances_ attribute of a Random Forest classifier to filter out irrelevant variables. For the baseline, a threshold of 0.002 was selected (this parameter was later tuned). All the processing steps were put into a pipeline, and then cross validation was used to find the best-performing model.
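The undersampling and the importance-based feature selection could look roughly like this; 'train' stands for the loaded training dataframe, and the code is a sketch of the idea rather than the project's exact pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Balance the classes by undersampling the negative responses.
positives = train[train['RESPONSE'] == 1]
negatives = train[train['RESPONSE'] == 0].sample(n=len(positives), random_state=42)
under = pd.concat([positives, negatives]).sample(frac=1, random_state=42)  # shuffle

X, y = under.drop(columns=['RESPONSE']), under['RESPONSE']

# Use Random Forest feature importances to drop irrelevant columns.
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
selected = importances[importances > 0.002].index   # baseline threshold of 0.002
X_selected = X[selected]
```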

A function was created to build the pipeline and cross validate different models (a hedged sketch of the idea is shown below). All the models had their default settings except for random_state, which was set to 42 for reproducibility. For the first few attempts the function kept returning NaN as a score, until I set cross validation's error_score parameter to 'raise' and saw the actual error.
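Since the original gist is not reproduced here, this is only a sketch of what such a comparison function could look like; the exact list of candidate models is my assumption:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def compare_models(X, y, models, cv=10):
    """Cross-validate each model with ROC AUC and return the mean scores."""
    scores = {}
    for name, model in models.items():
        cv_scores = cross_val_score(model, X, y, cv=cv,
                                    scoring='roc_auc', error_score='raise')
        scores[name] = cv_scores.mean()
    return pd.Series(scores).sort_values(ascending=False)

models = {
    'logistic_regression': LogisticRegression(random_state=42, max_iter=1000),
    'random_forest': RandomForestClassifier(random_state=42),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
    'xgboost': XGBClassifier(random_state=42, eval_metric='logloss'),
    'lightgbm': LGBMClassifier(random_state=42),
}
results = compare_models(X_selected, y, models)
```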

After running the function several times, the results were stored in separate dataframes. I used slightly different sets of features to find the best-performing model.

Figure 23

These are the results of the following setup:

  • Random Forest Classifier was trained on the full dataset to find feature importances
  • Models were cross validated on the full dataset as well.

LightGBM happened to perform very well. In general, tree-based models performed much better on this task than the others.

Figure 24

Here I used the feature importances from the Random Forest trained on the full dataset to select features, then trained the models on an undersampled dataset with just 1064 rows (532 positive and 532 negative responses). The results are much better for most of the models, and training was about 13 times faster.

Figure 25

This is the result of training the Random Forest on the undersampled data to extract the important features and then training the models on the undersampled data as well. XGBoost performed best and was chosen as the main classifier.

After the model was picked, I used the hyperopt library for Bayesian optimization to tune its hyperparameters and improve the score. After 500 evaluations, which took about 3 hours, the best set of parameters was chosen.
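A sketch of the tuning setup; the search space below is illustrative, not the exact one used in the project, and the chosen parameter values themselves are not reproduced here:

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Illustrative search space for XGBoost.
space = {
    'max_depth': hp.choice('max_depth', list(range(3, 11))),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'n_estimators': hp.choice('n_estimators', list(range(100, 1001, 100))),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
}

def objective(params):
    model = XGBClassifier(random_state=42, eval_metric='logloss', **params)
    score = cross_val_score(model, X_selected, y, cv=10, scoring='roc_auc').mean()
    return {'loss': -score, 'status': STATUS_OK}   # hyperopt minimizes, so negate AUC

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=500, trials=trials)
```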

These parameters were used in the final model on the test dataset, which was submitted to kaggle.com and received a score of 0.80251.
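The final step, sketched below, fits the tuned model on the training data and writes the submission file. Here 'best_params' stands for the tuned values (not reproduced above), 'test_processed' and 'test_ids' for the test set after the same preprocessing and its raw id column, and the 'LNR'/'RESPONSE' layout is my assumption about the competition's expected submission format:

```python
import pandas as pd
from xgboost import XGBClassifier

final_model = XGBClassifier(random_state=42, eval_metric='logloss', **best_params)
final_model.fit(X_selected, y)

# Probability of a positive response for every person in the test set.
proba = final_model.predict_proba(test_processed[selected])[:, 1]

submission = pd.DataFrame({'LNR': test_ids, 'RESPONSE': proba})
submission.to_csv('submission.csv', index=False)
```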

Figure 26. Public leaderboard

Interestingly, the most important feature for the Random Forest, with a value of 0.066813, was 'D19_SOZIALES', which we have no information on. All the 'D19_' columns, however, seem to represent a person's activity with a certain type of product (bank, insurance, etc.). I believe that having information on some of these features could help in processing the data and improving the score. The features that we discovered to be important during the cluster analysis, a person's age ('ALTERSKATEGORIE_GROB') and sex ('ANREDE_KZ'), were in the top 10 most important features for the Random Forest's decision making, but most of the features in that list had no detailed description. With XGBoost performing so well, I tried to use it for feature selection too, but it didn't work as well and the score was lower. Trying another search space for the Bayesian optimization might also improve the score.

Overall, it was a very interesting and diverse project. The first part took the most time because of the data exploration and the size of the files we worked with. It was a lot of fun trying different features and pipelines during the supervised learning part of the project, and it will be very interesting to see how the result holds up on the private leaderboard. The model should be robust, considering it was chosen after 10-fold cross validation.

I'd like to thank this post by Natassha Selvaraj for explaining the basics of customer segmentation, this post by Abhijith Chandradas for showing how to plot radar charts and this post by Wai for helping with hyperparameter tuning.

I'd also like to thank the Data Science Nanodegree and its authors. As I started learning data science from scratch, I found the course very challenging but interesting to go through at the same time.

Check out my GitHub to see other projects I've done. Any comments are deeply appreciated.
