{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Visualization and Explainability\n", "\n", "Model explainability remains a hurdle to the widespread adoption and understanding of machine learning. In this notebook, we will train and visualize a neural network that predicts credit card defaults based on credit usage and payment history, plus some demographic information. The goal is to explore how we can use VIP to visualize the output of complex machine learning models, and then to explain the results in terms of the input features.\n", "\n", "In the final portion of the notebook, we also visualize the results of a grid search over hyperparameters to determine which combinations of these hyperparameters optimize a gradient boosting machine (GBM)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting up WebSocket connection to: ws://localhost:12345/api\n", "Connection Successful! Initializing session.\n" ] } ], "source": [ "# Import the VIP API and open a session\n", "from virtualitics import api\n", "vip = api.VIP()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import Data and Preprocess\n", "\n", "Data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients\n", "\n", "This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. All amounts are given in New Taiwan dollars (NT dollars)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
 | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20000 | Female | university | married | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
1 | 120000 | Female | university | single | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
2 | 90000 | Female | university | single | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
3 | 50000 | Female | university | married | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
4 | 50000 | Male | university | married | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 24 columns
 | SmartMapping Rank | Feature | Correlated Group |
---|---|---|---|
0 | 1 | PAY_3 | None |
1 | 2 | PAY_2 | None |
2 | 3 | PAY_0 | None |
3 | 4 | PAY_4 | None |
4 | 5 | PAY_5 | None |
Insight |
---|
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 50000 and 70000, PAY_AMT2 is between 10 and 60994, and PAY_AMT1 is between 28 and 4448. Out of 3,303 points in this region, 23.46% are 1 vs. only 21.95% overall. |
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 10000 and 40000, PAY_AMT2 is between 12 and 28883, and PAY_AMT1 is between 100 and 36147. Out of 2,682 points in this region, 28.82% are 1 vs. only 21.46% overall. |
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 80000 and 710000, PAY_AMT2 is between 0 and 316, and PAY_AMT1 is between 0 and 18. Out of 2,171 points in this region, 29.53% are 1 vs. only 21.54% overall. |
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 10000 and 70000, PAY_AMT2 is between 0 and 50000, and PAY_AMT1 is between 0 and 16. Out of 1,714 points in this region, 46.97% are 1 vs. only 20.61% overall. |
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 10000 and 140000, PAY_AMT2 is between 0 and 9, and PAY_AMT1 is between 39 and 125000. Out of 1,685 points in this region, 40.12% are 1 vs. only 21.05% overall. |
The 1 category makes up a relatively high distribution of points when LIMIT_BAL is between 80000 and 680000, PAY_AMT2 is between 322 and 149654, and PAY_AMT1 is between 0 and 21. Out of 1,495 points in this region, 33.18% are 1 vs. only 21.54% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 150000 and 300000, PAY_AMT2 is between 876 and 4003, and PAY_AMT1 is between 29 and 164163. Out of 2,764 points in this region, 88.49% are 0 vs. only 76.8% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 150000 and 300000, PAY_AMT2 is between 4004 and 8858, and PAY_AMT1 is between 22 and 304815. Out of 2,375 points in this region, 85.6% are 0 vs. only 77.22% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 310000 and 1000000, PAY_AMT2 is between 7524 and 1684259, and PAY_AMT1 is between 33 and 873552. Out of 1,680 points in this region, 90.6% are 0 vs. only 77.13% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 50000 and 140000, PAY_AMT2 is between 10 and 140000, and PAY_AMT1 is between 4560 and 143713. Out of 1,663 points in this region, 84.97% are 0 vs. only 77.46% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 310000 and 800000, PAY_AMT2 is between 884 and 7507, and PAY_AMT1 is between 31 and 368199. Out of 1,627 points in this region, 92.5% are 0 vs. only 77.04% overall. |
The 0 category makes up a relatively high distribution of points when LIMIT_BAL is between 150000 and 300000, PAY_AMT2 is between 8866 and 237648, and PAY_AMT1 is between 22 and 273844. Out of 1,520 points in this region, 89.21% are 0 vs. only 77.28% overall. |
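Each insight above compares the share of one class inside a feature region with its share across the data. The following is a minimal sketch of that kind of comparison, not VIP's actual method: the rows, the `class_share` helper, and the region bounds are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy rows; the values are illustrative and NOT taken from the dataset.
rows = [
    {"LIMIT_BAL": 20000, "PAY_AMT1": 0, "default": 1},
    {"LIMIT_BAL": 50000, "PAY_AMT1": 10, "default": 1},
    {"LIMIT_BAL": 60000, "PAY_AMT1": 5, "default": 0},
    {"LIMIT_BAL": 200000, "PAY_AMT1": 3000, "default": 0},
    {"LIMIT_BAL": 250000, "PAY_AMT1": 1500, "default": 0},
    {"LIMIT_BAL": 300000, "PAY_AMT1": 8000, "default": 0},
    {"LIMIT_BAL": 30000, "PAY_AMT1": 0, "default": 1},
    {"LIMIT_BAL": 180000, "PAY_AMT1": 2500, "default": 1},
]


def class_share(subset, label):
    """Return the percentage of rows in `subset` whose class equals `label`."""
    return 100.0 * sum(r["default"] == label for r in subset) / len(subset)


# Region defined by inclusive bounds on two features (assumed bounds).
region = [
    r for r in rows
    if 10000 <= r["LIMIT_BAL"] <= 70000 and 0 <= r["PAY_AMT1"] <= 16
]

region_share = class_share(region, 1)
overall_share = class_share(rows, 1)

print(f"Out of {len(region)} points in this region, "
      f"{region_share:.2f}% are 1 vs. {overall_share:.2f}% overall.")
```

A region is worth reporting when its class share differs noticeably from the overall share. How VIP selects the regions and computes the "overall" baseline is not stated in this notebook, so the sketch only mirrors the wording of the insights.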
 | SmartMapping Rank | Feature | Correlated Group |
---|---|---|---|
0 | 1 | ON_TIME | None |
1 | 2 | PAY_0 | None |
2 | 3 | PAY_2 | None |
3 | 4 | PAY_3 | None |
4 | 5 | PAY_4 | None |