The post Deploying Machine Learning Models appeared first on Don's Machine Learning.
]]>Let’s first import all of the libraries we will need to create and train the model.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import urllib.request
Now we can import our data. Alexey Grigorev, from Data Talks Club, hosts the data on his GitHub account.
url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
filename = 'data-week-3.csv'
df = pd.read_csv(url)
Now that we have our data, let’s clean it up. First, we lowercase the column names and replace spaces with ‘_’. We then do the same for the values of the categorical (string) columns. During our exploration, we noticed that the ‘totalcharges’ feature contained values that cannot be parsed as numbers; ‘pd.to_numeric’ with ‘errors='coerce'’ turns them into NaN, and we fill those with 0. The last step is to turn our target variable, ‘churn’, into an integer: the comparison ‘df.churn == 'yes'’ gives True/False, and ‘astype(int)’ converts that to 1/0.
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn = (df.churn == 'yes').astype(int)
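To see what the ‘errors='coerce'’ step does, here is a minimal sketch on a made-up three-value Series. The values are illustrative, not taken from the dataset, and the sketch assumes the problem entries are blank strings.

```python
import pandas as pd

# Hypothetical stand-in for the 'totalcharges' column: one entry is a blank string.
raw = pd.Series(['29.85', ' ', '108.15'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(raw, errors='coerce')

# Replace the resulting NaN with 0, mirroring the fillna(0) call above.
cleaned = as_numbers.fillna(0)

print(cleaned.tolist())
```

Without ‘errors='coerce'’, the same call would raise an error on the blank entry, which is why the conversion and the fill are done as two separate steps.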
It is time to split our data into train, validation, and test datasets using ‘train_test_split’. First, we split the data 80%/20% and assign the parts to df_train_full and df_test, respectively. Second, we split df_train_full 67%/33% into df_train and df_val. Since this was a training class, random_state was fixed so that we were all working with the same splits. We also reset the index on each dataset with ‘.reset_index’; this is just for appearances and isn’t required. We then assign the target variable, ‘churn’, to y_train and y_val, and delete it from the train and validation sets.
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train_full = df_train_full.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
del df_train['churn']
del df_val['churn']
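Because the second split is applied to the 80% that remains after the first, the final shares of the full dataset are not 67% and 33%. The short calculation below is a sketch of that arithmetic only; the variable names are made up for illustration, and exact row counts will differ slightly because of rounding.

```python
# Fractions passed as test_size in the two train_test_split calls above.
test_fraction = 0.2          # first split: held-out test set
val_fraction_of_rest = 0.33  # second split: validation share of the remaining 80%

full_train_fraction = 1 - test_fraction
val_fraction = full_train_fraction * val_fraction_of_rest
train_fraction = full_train_fraction - val_fraction

print(f"train: {train_fraction:.3f}, val: {val_fraction:.3f}, test: {test_fraction:.3f}")
# prints "train: 0.536, val: 0.264, test: 0.200"
```

So roughly 54% of the rows end up in df_train, 26% in df_val, and 20% in df_test.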
We will assign our features to two variables here. ‘categorical’ lists the features that take one of a fixed set of values, e.g. ‘gender’ is male or female and ‘partner’ is yes or no; these are encoded as numbers later by the DictVectorizer. The remaining features are already numbers, so they go into ‘numerical’.
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
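Let’s create a function, ‘train’, that turns the categorical and numerical features into a list of dictionaries, one per row, and assigns it to the variable ‘cat’. Next, we create a ‘DictVectorizer’ and assign it to ‘dv’; it turns those dictionaries into a numeric feature matrix. The ‘fit’ method learns which columns to create from the data, and ‘transform’ produces the matrix X on which we train a logistic regression model. A second function, ‘predict’, applies the same fitted ‘dv’ to new data and returns the predicted probability of churn.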
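def train(df, y, C=1.0):
    # One dict per row; 'records' is the full name of the orient that
    # older pandas versions also accepted as 'rows'
    cat = df[categorical + numerical].to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    dv.fit(cat)
    X = dv.transform(cat)
    # C is the inverse regularization strength
    model = LogisticRegression(solver='liblinear', C=C)
    model.fit(X, y)
    return dv, model

def predict(df, dv, model):
    cat = df[categorical + numerical].to_dict(orient='records')
    # Reuse the vectorizer fitted in train(); do not refit on new data
    X = dv.transform(cat)
    # Column 1 holds the probability of the positive class (churn = 1)
    y_pred = model.predict_proba(X)[:, 1]
    return y_pred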
]]>