4. Evaluation Metrics for Classification¶
Last week we trained a model for churn. How do we know if it’s good?
The fourth week of Machine Learning Zoomcamp is about different metrics to evaluate a binary classifier. These measures include accuracy, the confusion table, precision, recall, ROC curves (TPR, FPR, random model, and ideal model), AUROC, and cross-validation.
4.1 Evaluation metrics: session overview¶
- Dataset: https://www.kaggle.com/blastchar/telco-customer-churn
- https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Metric – function that compares the predictions with the actual values and outputs a single number that tells how good the predictions are
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import urllib.request
url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
filename = 'data-week-3.csv'
# download the dataset once and read it from the local copy
urllib.request.urlretrieve(url, filename)
df = pd.read_csv(filename)
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn = (df.churn == 'yes').astype(int)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
del df_train['churn']
del df_val['churn']
del df_test['churn']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = [
'gender',
'seniorcitizen',
'partner',
'dependents',
'phoneservice',
'multiplelines',
'internetservice',
'onlinesecurity',
'onlinebackup',
'deviceprotection',
'techsupport',
'streamingtv',
'streamingmovies',
'contract',
'paperlessbilling',
'paymentmethod',
]
dv = DictVectorizer(sparse=False)
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)
y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()
0.8034066713981547
4.2 Accuracy and dummy model¶
Accuracy measures the fraction of correct predictions. Specifically, it is the number of correct predictions divided by the total number of predictions.
The decision threshold does not have to be 0.5. In this particular problem, however, the cutoff associated with the highest accuracy (about 80%) turns out to be 0.5.
Note that if we build a dummy model in which the decision cutoff is 1, so that the algorithm predicts that no clients will churn, the accuracy would be 73%. The improvement of the original model over this dummy model is therefore smaller than we might expect. A scikit-learn baseline of this kind is sketched right after the list below.
Therefore, in this problem accuracy alone cannot tell us how good the model is, because the dataset is imbalanced: there are many more instances of one class than of the other. This is also known as class imbalance.
Classes and methods:
- np.linspace(x, y, z) – returns a NumPy array of z evenly spaced values from x to y
- Counter(x) – class from collections that counts how many times each value occurs in the iterable x
- accuracy_score(x, y) – function from sklearn.metrics that calculates the accuracy of a model, given the target values x and the predictions y
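For reference, scikit-learn also ships a baseline classifier that always predicts the majority class. This is a small sketch, not part of the original lesson; its accuracy on the validation set should match the share of the majority class (about 73%, as computed later in this section).
from sklearn.dummy import DummyClassifier

# baseline that always predicts the most frequent class (here: no churn)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy.score(X_val, y_val)  # accuracy of the dummy model on the validation set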
len(y_val)
1409
(y_val == churn_decision).sum() / len(y_val)
0.8034066713981547
So we have the accuracy of our base model using 0.5 as the cutoff between churn and no churn. Now we will vary that threshold to see whether the accuracy of our model improves or not.
The np.linspace() method can be used to create an array of thresholds. Here we want 21 values from 0 to 1, which starts at 0 and increments by 0.05 at each step.
thresholds = np.linspace(0, 1, 21)
thresholds
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
scores = []
for t in thresholds:
    churn_decision = (y_pred >= t)
    score = (y_val == churn_decision).mean()
    print('%.2f %.3f' % (t, score))
    scores.append(score)
0.00 0.274
0.05 0.510
0.10 0.591
0.15 0.666
0.20 0.710
0.25 0.739
0.30 0.760
0.35 0.772
0.40 0.785
0.45 0.793
0.50 0.803
0.55 0.801
0.60 0.796
0.65 0.786
0.70 0.766
0.75 0.744
0.80 0.734
0.85 0.726
0.90 0.726
0.95 0.726
1.00 0.726
plt.plot(thresholds, scores)
[<matplotlib.lines.Line2D at 0x29303c8cd30>]
We can see that 0.5 is the best one. We used our own expression for calculating accuracy, (y_val == churn_decision).mean(). sklearn has a function for accuracy, accuracy_score. Let's use that instead.
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred >= 0.5)
0.8034066713981547
Let's add that into our previous loop to calculate the accuracy at each value of the thresholds array.
scores = []
for t in thresholds:
    score = accuracy_score(y_val, y_pred >= t)
    print('%.2f %.3f' % (t, score))
    scores.append(score)
0.00 0.274
0.05 0.510
0.10 0.591
0.15 0.666
0.20 0.710
0.25 0.739
0.30 0.760
0.35 0.772
0.40 0.785
0.45 0.793
0.50 0.803
0.55 0.801
0.60 0.796
0.65 0.786
0.70 0.766
0.75 0.744
0.80 0.734
0.85 0.726
0.90 0.726
0.95 0.726
1.00 0.726
plt.plot(thresholds, scores)
[<matplotlib.lines.Line2D at 0x29303cd7730>]
At the threshold of 1.0 we can see our accuracy is 72.6%. That must mean that the non-churn rate in the validation set is 72.6%. Let's verify that using Counter, which simply counts how many of the predictions are True and how many are False.
from collections import Counter
Counter(y_pred >= 1.0)
Counter({False: 1409})
As we can see, the number of False records is 1409, which is the size of our array, so none of the predicted probabilities reach 1 and nobody is predicted to churn. Below, a small calculation shows that the accuracy at this threshold is simply the non-churn rate of our dataset.
1 - y_val.mean()
0.7260468417317246
So the actual non-churn rate is 72.6% for the validation dataset and our model is only about 7.7 percentage points better. So accuracy isn't the best choice for scoring our model. This is because we have what is called class imbalance: 72.6% of customers don't churn compared to 27.4% of customers that do churn.
y_val.mean()
0.2739531582682754
4.3 Confusion table¶
The confusion table is a way to measure the different types of errors and correct decisions that a binary classifier can make. With this information, it is possible to evaluate the quality of the model in different ways.
If we predict the probability of churning from a customer, we have the following scenarios:
- No churn – Negative class
- Customer did not churn – True Negative (TN)
- Customer churned – False Negative (FN)
- Churn – Positive class
- Customer churned – True Positive (TP)
- Customer did not churn – False Positive (FP)
The confusion table helps us summarize the measures explained above in a tabular format, as shown below:
Actual/Predictions | Negative | Positive |
---|---|---|
Negative | TN | FP |
Positive | FN | TP |
The accuracy corresponds to the sum of TN and TP divided by the total number of observations.
actual_positive = (y_val == 1)
actual_negative = (y_val == 0)
actual_positive, actual_negative
(array([False, False, False, ..., False, True, True]), array([ True, True, True, ..., True, False, False]))
t = 0.5
predict_positive = (y_pred >= t)
predict_negative = (y_pred < t)
predict_positive, predict_negative
(array([False, False, False, ..., False, True, True]), array([ True, True, True, ..., True, False, False]))
In effect, what we are setting up is an element-wise logical AND: the output is True only when both values are True; in every other case (both False, or one True and one False) the output is False.
predict_positive & actual_positive
array([False, False, False, ..., False, True, True])
(predict_positive & actual_positive).sum()
210
We have 210 True Positives and 922 True Negatives
tp = (predict_positive & actual_positive).sum()
tn = (predict_negative & actual_negative).sum()
tp, tn
(210, 922)
We have 101 False Positives and 176 False Negatives
fp = (predict_positive & actual_negative).sum()
fn = (predict_negative & actual_positive).sum()
fp, fn
(101, 176)
confusion_matrix = np.array([
[tn, fp],
[fn, tp]
])
confusion_matrix
array([[922, 101], [176, 210]])
What all of this is telling us is that with our current model we would be sending out 101 discount emails to people who are not at risk of churning, and we would miss out by not sending 176 discount emails to people who were going to churn. In the first case we lose money because we gave discounts to people who were not at risk of churning; in the second case we miss out on future revenue by not attempting to retain customers who are at high risk of churning.
(confusion_matrix / confusion_matrix.sum()).round(2)
array([[0.65, 0.07], [0.12, 0.15]])
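scikit-learn can produce the same table directly. A short sketch; for binary 0/1 labels its output convention puts TN in the top-left, matching the layout we built by hand above.
from sklearn.metrics import confusion_matrix

# rows: actual negative/positive, columns: predicted negative/positive
confusion_matrix(y_val, y_pred >= 0.5)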
4.4 Precision and Recall¶
Precision tells us the fraction of positive predictions that are correct. It takes into account only the positive predictions (TP and FP – the second column of the confusion matrix), as stated in the following formula:
Precision = TP / (TP + FP)
Recall measures the fraction of correctly identified positive instances. It considers only the actual positives (TP and FN – the second row of the confusion table). The formula of this metric is presented below:
Recall = TP / (TP + FN)
In this problem, the precision and recall values were about 67% and 54% respectively. So, these measures reflect errors of our model that accuracy did not reveal due to the class imbalance.
Precision¶
(tp + tn) / (tp + tn + fp + fn) # accuracy score of our model
0.8034066713981547
For the precision score we are only interested in the predictions that say churn, i.e. the TP and FP.
p = tp / (fp + tp)
p
0.6752411575562701
Recall¶
r = tp / (tp + fn)
r
0.5440414507772021
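Both values are also available from sklearn.metrics; a minimal sketch at the same 0.5 threshold, which should reproduce the numbers above.
from sklearn.metrics import precision_score, recall_score

precision_score(y_val, y_pred >= 0.5)  # fraction of predicted churners that actually churn
recall_score(y_val, y_pred >= 0.5)     # fraction of actual churners the model catches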
4.5 ROC Curves¶
ROC stands for Receiver Operating Characteristic, and this idea was applied during the Second World War for evaluating the strength of radar detectors. This measure considers the False Positive Rate (FPR) and the True Positive Rate (TPR), which are derived from the values of the confusion matrix.
FPR is the fraction of false positives (FP) divided by the total number of actual negatives (FP and TN – the first row of the confusion matrix), and we want to minimize it. The formula of FPR is the following:
FPR = FP / (FP + TN)
On the other hand, TPR (or Recall) is the fraction of true positives (TP) divided by the total number of actual positives (TP and FN – the second row of the confusion table), and we want to maximize this metric. The formula of this measure is presented below:
TPR = TP / (TP + FN)
ROC curves consider Recall (TPR) and FPR under all the possible thresholds. At thresholds of 0 and 1, both TPR and FPR take the opposite value of the threshold (1 and 0 respectively), but the two rates have different meanings, as we explained before.
We need to compare the ROC curve against a point of reference to evaluate its performance, so the corresponding curves of a random model and an ideal model are required. It is possible to plot the FPR and Recall scores against the threshold, or to plot FPR vs Recall.
Classes and methods:
- np.repeat([x, y], [z, w]) – returns a NumPy array with the value x repeated z times followed by the value y repeated w times
- roc_curve(x, y) – function from sklearn.metrics that returns the false positive rates, true positive rates, and thresholds, given the target values x and the predicted scores y
TPR and FPR¶
TPR is computed with the same formula as recall.
tpr = tp / (fn + tp)
tpr
0.5440414507772021
fpr = fp / (tn + fp)
fpr
0.09872922776148582
scores = []
thresholds = np.linspace(0, 1, 101)
for t in thresholds:
    actual_positive = (y_val == 1)
    actual_negative = (y_val == 0)

    predict_positive = (y_pred >= t)
    predict_negative = (y_pred < t)

    tp = (predict_positive & actual_positive).sum()
    tn = (predict_negative & actual_negative).sum()

    fp = (predict_positive & actual_negative).sum()
    fn = (predict_negative & actual_positive).sum()

    scores.append((t, tp, fp, fn, tn))
scores
[(0.0, 386, 1023, 0, 0), (0.01, 385, 914, 1, 109), (0.02, 384, 830, 2, 193), (0.03, 383, 766, 3, 257), (0.04, 381, 715, 5, 308), (0.05, 379, 683, 7, 340), (0.06, 377, 661, 9, 362), (0.07, 372, 640, 14, 383), (0.08, 371, 613, 15, 410), (0.09, 369, 580, 17, 443), (0.1, 366, 556, 20, 467), (0.11, 365, 528, 21, 495), (0.12, 365, 509, 21, 514), (0.13, 360, 477, 26, 546), (0.14, 355, 453, 31, 570), (0.15, 351, 435, 35, 588), (0.16, 347, 419, 39, 604), (0.17, 346, 401, 40, 622), (0.18, 344, 384, 42, 639), (0.19, 338, 369, 48, 654), (0.2, 333, 356, 53, 667), (0.21, 329, 341, 57, 682), (0.22, 323, 322, 63, 701), (0.23, 320, 313, 66, 710), (0.24, 316, 304, 70, 719), (0.25, 309, 291, 77, 732), (0.26, 304, 281, 82, 742), (0.27, 303, 270, 83, 753), (0.28, 295, 256, 91, 767), (0.29, 291, 244, 95, 779), (0.3, 284, 236, 102, 787), (0.31, 280, 230, 106, 793), (0.32, 278, 225, 108, 798), (0.33, 276, 221, 110, 802), (0.34, 274, 212, 112, 811), (0.35000000000000003, 272, 207, 114, 816), (0.36, 267, 201, 119, 822), (0.37, 265, 197, 121, 826), (0.38, 260, 185, 126, 838), (0.39, 253, 179, 133, 844), (0.4, 249, 166, 137, 857), (0.41000000000000003, 246, 159, 140, 864), (0.42, 243, 158, 143, 865), (0.43, 241, 150, 145, 873), (0.44, 234, 147, 152, 876), (0.45, 230, 135, 156, 888), (0.46, 224, 125, 162, 898), (0.47000000000000003, 218, 120, 168, 903), (0.48, 217, 115, 169, 908), (0.49, 213, 110, 173, 913), (0.5, 210, 101, 176, 922), (0.51, 207, 99, 179, 924), (0.52, 204, 93, 182, 930), (0.53, 196, 91, 190, 932), (0.54, 194, 86, 192, 937), (0.55, 185, 79, 201, 944), (0.56, 182, 76, 204, 947), (0.5700000000000001, 176, 68, 210, 955), (0.58, 171, 61, 215, 962), (0.59, 163, 59, 223, 964), (0.6, 151, 53, 235, 970), (0.61, 145, 49, 241, 974), (0.62, 141, 46, 245, 977), (0.63, 133, 40, 253, 983), (0.64, 125, 37, 261, 986), (0.65, 119, 34, 267, 989), (0.66, 114, 31, 272, 992), (0.67, 105, 29, 281, 994), (0.68, 94, 26, 292, 997), (0.6900000000000001, 88, 25, 298, 998), (0.7000000000000001, 76, 20, 310, 1003), (0.71, 63, 14, 323, 1009), (0.72, 57, 11, 329, 1012), (0.73, 47, 10, 339, 1013), (0.74, 41, 8, 345, 1015), (0.75, 33, 7, 353, 1016), (0.76, 30, 6, 356, 1017), (0.77, 25, 5, 361, 1018), (0.78, 19, 3, 367, 1020), (0.79, 15, 2, 371, 1021), (0.8, 13, 2, 373, 1021), (0.81, 6, 0, 380, 1023), (0.8200000000000001, 5, 0, 381, 1023), (0.8300000000000001, 3, 0, 383, 1023), (0.84, 0, 0, 386, 1023), (0.85, 0, 0, 386, 1023), (0.86, 0, 0, 386, 1023), (0.87, 0, 0, 386, 1023), (0.88, 0, 0, 386, 1023), (0.89, 0, 0, 386, 1023), (0.9, 0, 0, 386, 1023), (0.91, 0, 0, 386, 1023), (0.92, 0, 0, 386, 1023), (0.93, 0, 0, 386, 1023), (0.9400000000000001, 0, 0, 386, 1023), (0.9500000000000001, 0, 0, 386, 1023), (0.96, 0, 0, 386, 1023), (0.97, 0, 0, 386, 1023), (0.98, 0, 0, 386, 1023), (0.99, 0, 0, 386, 1023), (1.0, 0, 0, 386, 1023)]
df_scores = pd.DataFrame(scores)
df_scores
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0.00 | 386 | 1023 | 0 | 0 |
1 | 0.01 | 385 | 914 | 1 | 109 |
2 | 0.02 | 384 | 830 | 2 | 193 |
3 | 0.03 | 383 | 766 | 3 | 257 |
4 | 0.04 | 381 | 715 | 5 | 308 |
… | … | … | … | … | … |
96 | 0.96 | 0 | 0 | 386 | 1023 |
97 | 0.97 | 0 | 0 | 386 | 1023 |
98 | 0.98 | 0 | 0 | 386 | 1023 |
99 | 0.99 | 0 | 0 | 386 | 1023 |
100 | 1.00 | 0 | 0 | 386 | 1023 |
101 rows × 5 columns
columns = ['threshold', 'tp', 'fp', 'fn', 'tn']
df_scores = pd.DataFrame(scores, columns = columns)
df_scores[::10] # looking at each 10th record
threshold | tp | fp | fn | tn | |
---|---|---|---|---|---|
0 | 0.0 | 386 | 1023 | 0 | 0 |
10 | 0.1 | 366 | 556 | 20 | 467 |
20 | 0.2 | 333 | 356 | 53 | 667 |
30 | 0.3 | 284 | 236 | 102 | 787 |
40 | 0.4 | 249 | 166 | 137 | 857 |
50 | 0.5 | 210 | 101 | 176 | 922 |
60 | 0.6 | 151 | 53 | 235 | 970 |
70 | 0.7 | 76 | 20 | 310 | 1003 |
80 | 0.8 | 13 | 2 | 373 | 1021 |
90 | 0.9 | 0 | 0 | 386 | 1023 |
100 | 1.0 | 0 | 0 | 386 | 1023 |
df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)
df_scores[::10]
threshold | tp | fp | fn | tn | tpr | fpr | |
---|---|---|---|---|---|---|---|
0 | 0.0 | 386 | 1023 | 0 | 0 | 1.000000 | 1.000000 |
10 | 0.1 | 366 | 556 | 20 | 467 | 0.948187 | 0.543500 |
20 | 0.2 | 333 | 356 | 53 | 667 | 0.862694 | 0.347996 |
30 | 0.3 | 284 | 236 | 102 | 787 | 0.735751 | 0.230694 |
40 | 0.4 | 249 | 166 | 137 | 857 | 0.645078 | 0.162268 |
50 | 0.5 | 210 | 101 | 176 | 922 | 0.544041 | 0.098729 |
60 | 0.6 | 151 | 53 | 235 | 970 | 0.391192 | 0.051808 |
70 | 0.7 | 76 | 20 | 310 | 1003 | 0.196891 | 0.019550 |
80 | 0.8 | 13 | 2 | 373 | 1021 | 0.033679 | 0.001955 |
90 | 0.9 | 0 | 0 | 386 | 1023 | 0.000000 | 0.000000 |
100 | 1.0 | 0 | 0 | 386 | 1023 | 0.000000 | 0.000000 |
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930d5ff880>
Random model¶
np.random.seed(1)
y_rand = np.random.uniform(0, 1, size=len(y_val))
y_rand.round(3)
array([0.417, 0.72 , 0. , ..., 0.774, 0.334, 0.089])
((y_rand >= 0.5) == y_val).mean()
0.5017743080198722
def tpr_fpr_dataframe(y_val, y_pred):
    scores = []

    thresholds = np.linspace(0, 1, 101)

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predict_positive = (y_pred >= t)
        predict_negative = (y_pred < t)

        tp = (predict_positive & actual_positive).sum()
        tn = (predict_negative & actual_negative).sum()

        fp = (predict_positive & actual_negative).sum()
        fn = (predict_negative & actual_positive).sum()

        scores.append((t, tp, fp, fn, tn))

    columns = ['threshold', 'tp', 'fp', 'fn', 'tn']
    df_scores = pd.DataFrame(scores, columns=columns)

    df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
    df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)

    return df_scores
df_rand = tpr_fpr_dataframe(y_val, y_rand)
df_rand[::10]
plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR')
plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930d5e7640>
Ideal model¶
num_neg = (y_val == 0).sum()
num_pos = (y_val == 1).sum()
num_neg, num_pos
(1023, 386)
The np.repeat function creates an array with num_neg zeros followed by num_pos ones, i.e. the targets ordered so that all non-churners come before all churners – exactly the ordering an ideal model's scores would produce.
y_ideal = np.repeat([0, 1], [num_neg, num_pos])
y_ideal
array([0, 0, 0, ..., 1, 1, 1])
y_ideal_pred = np.linspace(0, 1, len(y_val))
1 - y_val.mean()
0.7260468417317246
((y_ideal_pred >= 0.726) == y_ideal).mean()
1.0
df_ideal = tpr_fpr_dataframe(y_ideal, y_ideal_pred)
plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930bcdddb0>
Putting it all together¶
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR_scores')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR_scores')
plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR_ideal')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR_ideal')
#plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR_rand')
#plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR_rand')
plt.legend()
<matplotlib.legend.Legend at 0x293090417b0>
plt.figure(figsize=(5, 5))
plt.plot(df_scores.fpr, df_scores.tpr, label='Model')
#plt.plot(df_rand.fpr, df_rand.tpr, label='random')
#plt.plot(df_ideal.fpr, df_ideal.tpr, label='ideal')
plt.plot([0, 1], [0, 1], label='Random', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
<matplotlib.legend.Legend at 0x2930bf91240>
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr, label='Model')
plt.plot([0, 1], [0, 1], label='Random', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
<matplotlib.legend.Legend at 0x2930e905ae0>
4.6 ROC AUC¶
The area under the ROC curve (AUC) can tell us how good our model is with a single value. The AUROC of a random model is 0.5, while for an ideal one it is 1.
In other words, AUC can be interpreted as the probability that a randomly selected positive example has a greater score than a randomly selected negative example.
Classes and methods:
- auc(x, y) – function from sklearn.metrics that calculates the area under a curve given its x and y coordinates; for a ROC curve, x is the false positive rate and y is the true positive rate
- roc_auc_score(x, y) – function from sklearn.metrics that calculates the area under the ROC curve directly, given the target values x and the predicted scores y
from sklearn.metrics import auc
auc(fpr, tpr)
0.8438302463039216
auc(df_scores.fpr, df_scores.tpr)
0.8438365773732646
auc(df_ideal.fpr, df_ideal.tpr)
0.9999430203759136
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
auc(fpr, tpr)
0.8438302463039216
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)
0.8438302463039216
neg = y_pred[y_val == 0]
pos = y_pred[y_val == 1]
neg
array([0.00899416, 0.20483208, 0.21257239, ..., 0.10764076, 0.31400436, 0.13641188])
import random
Below is a simulation of the probabilistic interpretation behind roc_auc_score: we repeatedly draw one random positive and one random negative example and check how often the positive one gets the higher score.
n = 1000000
success = 0
for i in range(n):
    pos_ind = random.randint(0, len(pos) - 1)
    neg_ind = random.randint(0, len(neg) - 1)

    if pos[pos_ind] > neg[neg_ind]:
        success = success + 1
success / n
0.843749
We can also do this with Numpy
n = 1000000
np.random.seed(1)  # seed NumPy's RNG, since np.random.randint is used below
pos_ind = np.random.randint(0, len(pos), size=n)
neg_ind = np.random.randint(0, len(neg), size=n)
(pos[pos_ind] > neg[neg_ind]).mean()
0.843908
4.7 Cross-Validation¶
Extra resources
In the lesson we talked about iterators and generators in Python. You can read more about them here:
- https://anandology.com/python-practice-book/iterators.html
- https://www.google.com/search?q=python+iterators+and+generators
Notes
Cross-validation refers to evaluating the same model on different subsets of a dataset and then looking at the average of the evaluation metric and its spread across those subsets. This method is applied in the parameter tuning step, which is the process of selecting the best parameters.
In this algorithm, the full training dataset is divided into k partitions; we train the model on k-1 partitions and evaluate it on the remaining one. We end up evaluating the model on all k folds, and we calculate the average evaluation metric over the folds.
In general, if the dataset is large, we should use the hold-out validation strategy. On the other hand, if the dataset is small or we want to know the standard deviation of the model's performance across different folds, we can use the cross-validation approach.
Libraries, classes and methods:
- KFold(n_splits=k, shuffle=s, random_state=x) – sklearn.model_selection class for cross-validation with k folds, a boolean s that controls shuffling, and a random state x
- KFold.split(x) – method that yields the train/validation index splits of the dataset x, according to the settings chosen when the KFold object was constructed
- tqdm() – library for showing the progress of each iteration in a for loop
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model
dv, model = train(df_train, y_train, C=0.001)
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred
y_pred = predict(df_val, dv, model)
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
train_idx, val_idx = next(kfold.split(df_full_train))
tqdm is a package that displays a progress bar while a loop is running.
import sys
!conda install --yes --prefix {sys.prefix} tqdm
from tqdm.auto import tqdm
n_splits = 5
for C in tqdm([0.001, 0.01, 0.1, 0.5, 1, 5, 10]):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

    scores = []

    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))
0%| | 0/7 [00:00<?, ?it/s]
C=0.001 0.825 +- 0.009
C=0.01 0.840 +- 0.009
C=0.1 0.841 +- 0.008
C=0.5 0.841 +- 0.007
C=1 0.841 +- 0.008
C=5 0.840 +- 0.008
C=10 0.841 +- 0.007
Above we see the scores: C=0.001 gives the lowest AUC, and all the other values are very close to each other. Since C=1.0 is the default, we will just use that. Because the scores are so similar across folds, we can go ahead and train the final model on the full training set and evaluate it on the test set.
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)
auc = roc_auc_score(y_test, y_pred)
auc
0.8572386167896259
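As an aside (not covered in the lesson), the K-Fold loop above can also be expressed more compactly with scikit-learn's Pipeline and cross_val_score. This is only a sketch, and it assumes that passing the feature dictionaries as a plain list works as the input to cross_val_score:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# chain DictVectorizer and LogisticRegression so vectorization is refit on every fold
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(C=1.0, max_iter=1000)
)

full_train_dicts = df_full_train[categorical + numerical].to_dict(orient='records')
cv_auc = cross_val_score(pipeline, full_train_dicts, df_full_train.churn.values,
                         cv=5, scoring='roc_auc')
cv_auc.mean(), cv_auc.std()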
4.8 Summary¶
General definitions:
- Metric: A single number that describes the performance of a model
- Accuracy: Fraction of correct answers; sometimes misleading
- Precision and recall are less misleading when we have class imbalance
- ROC Curve: A way to evaluate the performance at all thresholds; okay to use with imbalance
- K-Fold CV: More reliable estimate for performance (mean + std)
In brief, this week was about different metrics to evaluate a binary classifier. These measures included accuracy, the confusion table, precision, recall, ROC curves (TPR, FPR, random model, and ideal model), and AUROC. We also covered cross-validation as a different way to estimate the performance of the model and to do parameter tuning.
4.9 Explore more¶
- Check the precision and recall of the dummy classifier that always predicts "FALSE"
- F1 score = 2 P R / (P + R) (a small sketch of this computation follows this list)
- Evaluate precision and recall at different thresholds, plot P vs R – this way you’ll get the precision/recall curve (similar to ROC curve)
- Area under the PR curve is also a useful metric
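For the F1 item, a minimal sketch using the precision p and recall r computed in section 4.4 at the 0.5 threshold:
# F1 is the harmonic mean of precision and recall
f1 = 2 * p * r / (p + r)
f1  # roughly 0.60 for p ≈ 0.675 and r ≈ 0.544
# sklearn.metrics.f1_score(y_true, y_pred) computes the same value directly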
Other projects
- Calculate the metrics for the suggested datasets from the previous week