You have created a well-tuned, well-performing model. Now what? People can’t access it while it sits on your PC in a Jupyter Notebook. It has to be deployed before anyone can use it. That is what this week was all about: deploying machine learning models.
Let’s first import all of the libraries we will need to create and train the model.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import urllib.request
Now we can import our data. Alexey Grigorev of DataTalks.Club hosts the dataset on his GitHub account.
url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
filename = 'data-week-3.csv'
df = pd.read_csv(url)
Now that we have our data, let’s clean it up. First, we lowercase the column names and replace spaces with ‘_’. We then do the same for the values in the categorical (string) columns. During our exploration, we noticed that the ‘totalcharges’ feature contained entries that are not numbers. These become ‘NaN’ when the column is converted to a numeric type, which we can’t have, so we fill them with 0. The last step is to turn our target variable, ‘churn’, into an integer: we compare it with ‘yes’ and convert the result with ‘astype(int)’.
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn = (df.churn == 'yes').astype(int)
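To see what the ‘errors='coerce'’ step does, here is a minimal sketch on a made-up three-row column. The values and the ‘toy’ name are assumptions for illustration only; the ‘_’ entry stands in for a blank field after the space replacement above.

```python
import pandas as pd

# Hypothetical toy column: '_' is what a blank entry looks like
# after spaces have been replaced with underscores.
toy = pd.DataFrame({'totalcharges': ['29.85', '_', '108.15']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy['totalcharges'] = pd.to_numeric(toy['totalcharges'], errors='coerce')
n_missing = int(toy['totalcharges'].isna().sum())

# Fill the missing values with 0, as in the post
toy['totalcharges'] = toy['totalcharges'].fillna(0)
print(n_missing, toy['totalcharges'].tolist())
```

Filling with 0 is a simple choice that keeps every row; whether it is the best choice depends on why the values were missing in the first place.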
It is time to split our data into train, validation, and test datasets using ‘train_test_split’. First, we split the data 80%/20% and assign the two parts to df_train_full and df_test, respectively. Second, we split df_train_full 67%/33% into df_train and df_val. Since this was a training class, random_state was fixed so that we were all working with the same splits. We also reset the index on each dataset with ‘reset_index’; this is just for appearances and isn’t required. We then assign the target variable, ‘churn’, to y_train and y_val and delete it from the train and validation sets.
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train_full = df_train_full.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
del df_train['churn']
del df_val['churn']
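Because the second split is taken from the remaining 80%, the validation set is not 33% of the whole dataset. The short calculation below works out the overall shares; the variable names are my own and are not part of the course code.

```python
# Fractions passed to the two train_test_split calls
test_frac = 0.2          # first split: held-out test set
val_frac_of_rest = 0.33  # second split: validation share of the remaining data

rest = 1 - test_frac
val_frac = rest * val_frac_of_rest
train_frac = rest * (1 - val_frac_of_rest)

print(f"train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}")
```

So roughly 54% of the rows end up in training, 26% in validation, and 20% in test. The exact row counts can differ by one because the splits are rounded to whole rows.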
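We will assign our two types of features to two variables here. ‘categorical’ lists the features that are treated as categories, such as ‘gender’ and ‘contract’. We don’t map their values to integers by hand. The DictVectorizer used in the next step one-hot encodes string values, so ‘gender’ becomes the two columns ‘gender=female’ and ‘gender=male’, while features that already hold numbers, such as the 0/1 ‘seniorcitizen’ flag, are passed through unchanged. The remaining features are numerical and go into the ‘numerical’ variable.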
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
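Here is a small sketch of how ‘DictVectorizer’ treats string and numeric values differently. The two rows are made up, and I am assuming ‘seniorcitizen’ is stored as 0/1 rather than as text.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical customers, already converted to dictionaries
rows = [
    {'gender': 'female', 'seniorcitizen': 0, 'tenure': 1},
    {'gender': 'male', 'seniorcitizen': 1, 'tenure': 34},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(rows)

# String values get one column per category; numbers keep a single column
print(dv.feature_names_)
print(X)
```

The string feature is expanded into ‘gender=female’ and ‘gender=male’, while the numeric features each stay as one column with their original values.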
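Let’s create a function, ‘train’, that turns the categorical and numerical features of each row into a dictionary and assigns the resulting list of dictionaries to the variable ‘cat’. Next, we create a ‘DictVectorizer’, which turns those dictionaries into a numeric matrix, and assign it to ‘dv’. The ‘fit’ method learns which columns the matrix needs from the dictionaries, and ‘transform’ then builds the matrix ‘X’ that the logistic regression model is trained on. A second function, ‘predict’, reuses the fitted ‘dv’ and the model to return the predicted probability of churn.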
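def train(df, y, C=1.0):
    # 'records' is the documented name for one dict per row;
    # the 'rows' spelling is rejected by recent pandas versions
    cat = df[categorical + numerical].to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    dv.fit(cat)
    X = dv.transform(cat)
    model = LogisticRegression(solver='liblinear', C=C)
    model.fit(X, y)
    return dv, model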
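def predict(df, dv, model):
    # Use 'records' here as well, and only transform: dv was fitted in train()
    cat = df[categorical + numerical].to_dict(orient='records')
    X = dv.transform(cat)
    # Column 1 holds the probability of the positive class (churn)
    y_pred = model.predict_proba(X)[:, 1]
    return y_pred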
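To show how the two functions fit together, here is a self-contained sketch on a handful of made-up rows. The shortened feature lists, the toy DataFrames, and the labels are all assumptions for illustration and are not taken from the Telco dataset; the functions are repeated so the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical, shortened feature lists for this sketch only
categorical = ['contract']
numerical = ['tenure', 'monthlycharges']

def train(df, y, C=1.0):
    records = df[categorical + numerical].to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    X = dv.fit_transform(records)
    model = LogisticRegression(solver='liblinear', C=C)
    model.fit(X, y)
    return dv, model

def predict(df, dv, model):
    records = df[categorical + numerical].to_dict(orient='records')
    X = dv.transform(records)
    return model.predict_proba(X)[:, 1]

# Made-up training rows and labels (1 = churned)
toy_train = pd.DataFrame({
    'contract': ['month-to-month', 'two_year', 'month-to-month',
                 'one_year', 'month-to-month', 'two_year'],
    'tenure': [1, 60, 3, 24, 2, 48],
    'monthlycharges': [70.0, 20.0, 85.5, 55.0, 90.0, 25.0],
})
y_toy_train = [1, 0, 1, 0, 1, 0]

# Made-up validation rows
toy_val = pd.DataFrame({
    'contract': ['month-to-month', 'two_year'],
    'tenure': [2, 50],
    'monthlycharges': [80.0, 22.0],
})
y_toy_val = [1, 0]

dv, model = train(toy_train, y_toy_train, C=1.0)
y_pred = predict(toy_val, dv, model)
auc = roc_auc_score(y_toy_val, y_pred)
print(y_pred, auc)
```

With the real data, the same three calls would take df_train and y_train for training, and df_val and y_val for scoring.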