When it comes to choosing which algorithm to deploy in production, the deciding factors go far beyond measuring each algorithm's prediction accuracy. You should be asking yourself: Is the algorithm fast enough to handle the volumes of data we will encounter in production? How much memory does it consume? But probably the biggest trade-off against the accuracy of machine learning models is their explainability: can I explain the model and understand what it learned?
In this blog we will walk you through the “spectrum of complexity” of machine learning models, exploring the trade-offs between simple, easy-to-interpret models and more complex ones, and examining the explainability of each algorithm we encounter.
Although simple, explainable models often fall short when it comes to finding complex patterns in the data, the ability to interpret and peer into those “explainable” models can help us:
- “Debug” the model: make sure it learned actual patterns and that there is no “data leakage” or overfitting; without that, gaining trust in the model is hard (a quick leakage smell test is sketched right after this list)
- Derive insights
- Improve the model: identify weak spots and vulnerabilities in the model in order to improve it, and gather ideas for additional high-impact features
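Here is that quick leakage smell test. This is our own illustrative sketch, not part of the pipeline below: the idea is that if any single raw column ranks the label almost perfectly on its own, it is probably leaking the target.

import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(df: pd.DataFrame, labels: pd.Series) -> pd.Series:
    """Score every numeric column by how well it alone ranks the binary label."""
    scores = {}
    for col in df.select_dtypes(include='number').columns:
        auc = roc_auc_score(labels, df[col].fillna(0))
        scores[col] = max(auc, 1 - auc)  # direction of the correlation doesn't matter
    return pd.Series(scores).sort_values(ascending=False)

# Columns scoring suspiciously close to 1.0 deserve a hard look before training, e.g.:
# print(single_feature_auc(features, labels).head())  # once the feature table below exists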
Let’s start by getting our hands dirty with real-life data so we can explore and compare how different models can impact a business decision.
We’ll start by importing the relevant libraries:
# Modeling libraries
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Libraries for manipulating data
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
And we’ll go with an e-commerce use case that requires us to build a model that predicts whether or not a website visitor will buy our product.
We can use a propensity model like this to optimize the funnel and the user experience, for example by showing discounts to visitors who are less likely to make a purchase on their own.
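As a rough sketch of how such a model plugs into that decision (toy data and a made-up 0.3 probability threshold, purely for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for real visitor features and conversion labels
rng = np.random.RandomState(0)
visitors = pd.DataFrame(rng.rand(100, 3), columns=['clicks', 'scrolls', 'page_views'])
converted = (visitors['page_views'] + 0.3 * rng.rand(100) > 0.7).astype(int)

# Any classifier exposing predict_proba gives us a propensity score per visitor
model = LogisticRegression().fit(visitors, converted)
purchase_probability = model.predict_proba(visitors)[:, 1]

# Business rule: show a discount only to visitors unlikely to buy on their own
show_discount = purchase_probability < 0.3
print(f"{show_discount.sum()} of {len(visitors)} visitors would see a discount")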
The data sources we will be dealing with are:
- Core tagged data: whether the customer is a paying customer or not
- CRM data: for this model we will only observe customers that have an existing CRM record with basic information. This information will be used to connect the data to external and third-party sources
- Website analytics: clicks, scrolls, page views per user
- Catalog: connected to the website analytics to get more context on the products the customer is looking at
The target column we will want to predict is the ‘paying customer (1/0).’
paying_users_df = pd.read_csv('~/blogs/propensity_modeling/conversiton.csv')
crm_df = pd.read_csv('~/blogs/propensity_modeling/crm.csv')
website_analytics = pd.read_csv('~/blogs/propensity_modeling/website_analytics.csv')
catalog = pd.read_csv('~/blogs/propensity_modeling/catalog.csv')

paying_users_df.head()
Out[245]:

 | user_id | paying customer (1/0) |
---|---|---|
0 | id_6443772949357306687 | 1 |
1 | id_565093448583115634 | 0 |
2 | id_2864656352317940240 | 0 |
3 | id_5728513598666929907 | 0 |
4 | id_475202609164704146 | 1 |
To keep our focus on explainability, and to avoid spending most of our time engineering features just to have something to build our models on, we will use Explorium’s automated feature discovery capabilities (via the programmatic API) to automate that part of the process.
In short, Explorium receives the raw, disconnected datasets, enriches them with many external sources (e.g. professional and academic background, demographics, spending habits by zip code, date-related events, geographical attributes), automatically generates a huge number of candidate features (from the internal data as well as the external data), and selects the best subset of features for predicting our target column.
Here is how we use the SDK:
from explorium_sdk.data_bundle import DataBundle
from explorium_sdk.features import search_for_features

data_bundle = DataBundle(
    core_dataset=paying_users_df,
    contexual_datasets=[crm_df, website_analytics, catalog],
    connections={
        paying_users_df['user_id']: crm_df['user_id'],
        paying_users_df['user_id']: website_analytics['user_id'],
        website_analytics['sku']: catalog['sku']
    }
)

features = search_for_features(
    data_bundle,
    label_column='paying customer (1/0)',
    augment_with_external_datasets=True
)
features.head()
Out[246]:

 | paying customer (1/0) | Spending Score (Apparel) By Zip Code | Person Occupation (by email).industry == “Retail” | Count Distinct (page_view_url) | avg(timestamp – website_visit_start_time) | avg(seconds_between_events) | Popular Gender->F | avg(Average Number of Reviews) | Popular Gender->M | Mean Family Income | max(Separated) | var(Separated) | min(Percentage of Homes With Some Type of Debt) | mean(Mean Monthly Owner Costs) | sum(Divorced) | max(Percentage of Homes With Some Type of Debt) | max(Married) | var(Population) | var(Female Population) | Popular Gender->empty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 410.684851 | False | 15.0 | -17808.058824 | 61.73 | True | 21.750000 | False | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.000000 | 0.00000 | 0.00000 | 0.00000 | 0.000000e+00 | 0.000000 | 0 |
1 | 0 | 244.235297 | False | 2.0 | -17805.000000 | 95.44 | True | 0.000000 | False | 115900.6672 | 0.01312 | 0.00000 | 0.66580 | 872.916850 | 0.05977 | 0.66580 | 0.61735 | 0.000000e+00 | 0.000000 | 0 |
2 | 0 | 100.088075 | False | 2.0 | -17763.000000 | 84.59 | False | 19.000000 | False | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.000000 | 0.00000 | 0.00000 | 0.00000 | 0.000000e+00 | 0.000000 | 1 |
3 | 0 | 2.388643 | False | 6.0 | -17774.000000 | 61.55 | False | 162.288000 | True | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.000000 | 0.00000 | 0.00000 | 0.00000 | 0.000000e+00 | 0.000000 | 0 |
4 | 1 | 812.205378 | True | 5.0 | -17787.000000 | 85.00 | True | 18.333333 | False | 101454.3116 | 0.02281 | 0.000173 | 0.61426 | 715.876413 | 0.24723 | 0.73133 | 0.67378 | 1.359672e+06 | 311440.333333 | 0 |
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Separate the label column from the features
labels = features['paying customer (1/0)']
features = features.drop('paying customer (1/0)', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42
)
Let’s start with a member of the family of models we all learned during our first dive into data science: Logistic Regression, which is simply a linear model for classification. From an explainability point of view it’s perfect because it’s linear. From an accuracy point of view, it’s (usually) weak for the same reason: no interactions between features and no complex patterns are learned. So although it is very robust to overfitting, it is also at a higher risk of underfitting.
Let’s train the model on the features we’ve extracted:
classifier = LogisticRegressionCV()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)[:, 1]

print(f"AUC score from linear model: {roc_auc_score(y_true=y_test, y_score=predictions)}")
AUC score from linear model: 0.6042318566835239
This is not a good fit. A 0.604 AUC score pretty much means the model learned very little. Later on we will demonstrate how more complex models capture more of the patterns in the data and score a higher AUC.
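To put that number in context, here is a quick baseline check (our addition, not part of the original flow): a classifier that ignores the features entirely lands at an AUC of 0.5, so 0.604 is only a small step above guessing.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# A "no-skill" model that predicts the class prior for every sample
baseline = DummyClassifier(strategy='prior').fit(X_train, y_train)
baseline_scores = baseline.predict_proba(X_test)[:, 1]

print(f"AUC score for a no-skill baseline: {roc_auc_score(y_true=y_test, y_score=baseline_scores)}")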
But for now let’s interpret the model to extract insights and basic patterns. Because the model is so simple (the prediction is just a sigmoid applied to a weighted sum of the features), we can plot the essence of what it learned pretty easily:
%matplotlib inline

weights = pd.Series(classifier.coef_[0], index=features.columns)
weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
weights[:13].plot(kind='barh', color='lightseagreen')
Out[250]: <matplotlib.axes._subplots.AxesSubplot at 0x135ddf5f8>
The correlations are pretty straightforward:
Spending score (a geo-based enrichment) is positively correlated with conversion, while the number of physical stores around the customer’s location is actually a negative factor. That makes sense given that the model is built on e-commerce data: the more stores you have around your neighborhood, the less you feel the need to buy on the internet. But linear weights can make for weak classifiers when the patterns are more complicated and interactions between features would help uncover predictive factors.
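If the bar chart isn’t handy, the same ranking can be read straight off the fitted coefficients, reusing the weights Series built above. One caveat worth stating: raw coefficients are only comparable across features that live on similar scales, so standardizing the features first makes this ranking more trustworthy.

# Features pushing the predicted conversion probability up...
print(weights.sort_values(ascending=False).head(5))

# ...and the ones pushing it down
print(weights.sort_values().head(5))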
Let’s take a model which is a bit more complex from that point of view: Decision Tree.
Decision trees help us model interactions between different features, learning rules from the data by finding optimal split points.
Let’s start by training a simple decision tree:
classifier = DecisionTreeClassifier(max_depth=10)
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)[:, 1]

print(f"AUC score for decision tree model: {roc_auc_score(y_true=y_test, y_score=predictions)}")
AUC score for decision tree model: 0.6222223419617647
An improvement of roughly two AUC points over the linear model isn’t a huge one, but it is definitely a move in the right direction. Decision trees are wonderful models for extracting rules that are a bit more complicated than a human could come up with on their own (or that would at least take a human a lot of time to find).
Let’s visualize the tree:
from explorium_sdk.utils import visualize_decision_tree

visualize_decision_tree(classifier, features)
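If you don’t have that helper available, scikit-learn ships a plainer text view of the same rules; here is a sketch using the decision tree we just trained:

from sklearn.tree import export_text

# Text dump of the learned split conditions, truncated for readability
print(export_text(classifier, feature_names=list(features.columns), max_depth=3))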
The visualization is a pretty cool way to extract insights from the model: every node holds a condition. For example, the first condition checks whether the customer’s salary is larger than $5,980.00; if the condition is true for the specific sample we go left, if not, we go right.
The more blue a branch gets, the higher the probability that the visitor will purchase the product; the more red it gets, the smaller the chance the visitor will convert into a paying customer. As we can see, the decision tree classifier learned different, potentially more complex things than the logistic regression did. For example, customers with an estimated payroll larger than $5,980 who spend more than 0.954 seconds between activities on the website (clicks, scrolls, etc.) are much more likely to make a purchase.
That rule is a very interesting one, as well as an actionable one. Maybe we should change the website specifically for those users? Maybe raise our prices or disable some promotions?
But the cool thing is that those rules were inferred automatically by the model! No human was needed to tune those rules and thresholds until they captured patterns that work.
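One way to sanity-check a rule like this is to compare conversion rates inside and outside the segment it defines. The column names below are assumptions about how the feature table is laid out (the payroll column in particular is hypothetical), so treat this as a template rather than something runnable as-is:

# Hypothetical column names; adjust them to your actual feature table
segment = (
    (features['Estimated Payroll'] > 5980)
    & (features['avg(seconds_between_events)'] > 0.954)
)

print(f"Conversion rate inside the segment:  {labels[segment].mean():.2%}")
print(f"Conversion rate outside the segment: {labels[~segment].mean():.2%}")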
Now we will move on to a more complicated algorithm: Random Forest. It’s way more complicated than a single decision tree because, well, it contains many decision trees. A random forest is an ensemble that fuses multiple “weak” learners (decision trees) into one strong model.
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=10, n_estimators=500)
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)[:, 1]

print(f"AUC score for random forest model: {roc_auc_score(y_true=y_test, y_score=predictions)}")
AUC score for random forest model: 0.6784661821787067
We see an improvement of about six AUC points over the decision tree and seven over the linear model. The more complex model wins in this situation: it grasps more complex patterns in the data, which allows it to make more accurate predictions. (Models with too many parameters can overfit, but that is out of the scope of this blog.)
Although it’s far more accurate, this model is hard to analyze: it is a combination of 500 different decision trees, so actually uncovering the patterns it discovered in the data would require us to look at each and every one of them.
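We can’t read 500 trees, but the forest does expose one coarse, built-in lens: an aggregate importance score for each feature, averaged over all of its trees. It is a global ranking rather than an explanation of any individual prediction, but it’s a start:

# Aggregate (impurity-based) importance of each feature across the 500 trees
importances = pd.Series(classifier.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(10))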
Last but not least, let’s see what happens when we go even further in the spectrum of complexity.
Let’s do a bit of “stacking”: we will train multiple machine learning models, then combine their predictions and feed them into one final model.
n_rows = len(features)

classifiers = [
    RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=-1),
    XGBClassifier(n_jobs=-1),
    KNeighborsClassifier(),
    GaussianNB(),
    DecisionTreeClassifier(max_depth=7),
    LogisticRegressionCV(n_jobs=-1)
]

predictions_as_features = []
for cls in classifiers:
    cls_predictions = cross_val_predict(cls, features, labels, method='predict_proba')[:, 1]
    print(f'{cls.__class__.__name__} AUC score == {roc_auc_score(y_true=labels, y_score=cls_predictions)}')
    predictions_as_features.append(cls_predictions.reshape(n_rows, -1))

# Create features from the low-level classifiers
predictions_as_features = np.concatenate(predictions_as_features, axis=1)

# Train a random forest model on top
score = np.mean(cross_val_score(
    RandomForestClassifier(max_depth=4),
    predictions_as_features,
    labels,
    scoring='roc_auc',
    cv=5
))

print('------------------------------------')
print(f'AUC score of combination of models {score}')
RandomForestClassifier AUC score == 0.6811876656908529
XGBClassifier AUC score == 0.6928055019470039
KNeighborsClassifier AUC score == 0.5560842777695425
GaussianNB AUC score == 0.5945711663764063
DecisionTreeClassifier AUC score == 0.6219268004797422
LogisticRegressionCV AUC score == 0.5747947765285704
------------------------------------
AUC score of combination of models 0.70120148588394
We got an additional 3.4% on our AUC!
Obviously, the model is now much more complex, harder to explain, and more difficult to derive insights from.
But that’s the message of this blog post: there is a trade-off between complexity and explainability, or at least there used to be. In the next blog post we’ll show how model explainability tools (like LIME, SHAP, and more) are disrupting this balance by making even complex models explainable.