2. Machine Learning for Regression¶
- 2.1 Car price prediction project
- 2.2 Data preparation
- 2.3 Exploratory data analysis
- 2.4 Setting up the validation framework
- 2.5 Linear regression
- 2.6 Linear regression: vector form
- 2.7 Training linear regression: Normal equation
- 2.8 Baseline model for car price prediction project
- 2.9 Root mean squared error
- 2.10 Using RMSE on validation data
- 2.11 Feature engineering
- 2.12 Categorical variables
- 2.13 Regularization
- 2.14 Tuning the model
- 2.15 Using the model
- 2.16 Car price prediction project summary
2.1 Car price prediction project¶
Project plan:
- Prepare data and Exploratory data analysis (EDA)
- Use linear regression for predicting price
- Understanding the internals of linear regression
- Evaluating the model with RMSE
- Feature engineering
- Regularization
- Using the model
First we import the libraries for manipulating data. Pandas is a Python package used to analyze and manipulate data. NumPy is a Python package used to work with arrays.
import pandas as pd
import numpy as np
2.2 Data preparation¶
Pandas attributes and methods:
- pd.read_csv() – read CSV files
- df.head() – take a look at the first rows of the dataframe
- df.columns – retrieve column names of a dataframe
- df.columns.str.lower() – lowercase all the letters
- df.columns.str.replace(' ', '_') – replace the space separator
- df.dtypes – retrieve data types of all features
- df.index – retrieve indices of a dataframe
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'
We download the data needed for this regression with urllib (an alternative to using wget).
import urllib.request
url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'
filename = 'data.csv'
urllib.request.urlretrieve(url, filename)
('data.csv', <http.client.HTTPMessage at 0x16e163eb730>)
Load the file using the pandas read_csv function.
df = pd.read_csv('data.csv')
df
Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
11909 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 46120 |
11910 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 56670 |
11911 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50620 |
11912 | Acura | ZDX | 2013 | premium unleaded (recommended) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50920 |
11913 | Lincoln | Zephyr | 2006 | regular unleaded | 221.0 | 6.0 | AUTOMATIC | front wheel drive | 4.0 | Luxury | Midsize | Sedan | 26 | 17 | 61 | 28995 |
11914 rows × 16 columns
The df.columns attribute from pandas returns the column names of the dataframe.
df.columns
Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Driven_Wheels', 'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style', 'highway MPG', 'city mpg', 'Popularity', 'MSRP'], dtype='object')
We want to clean up our column names: remove capitals and replace spaces with underscores. The operations can be chained together and reassigned to df.columns. The str.lower() method converts the selected strings to all lower case, and the str.replace() method replaces specific characters in a string; here we replace the spaces in column names with underscores so that we can use dot notation (attribute access such as df.engine_hp), which does not work on column names containing spaces.
df.columns = df.columns.str.lower().str.replace(' ', '_')
Now when we view the dataframe we can see that the column names no longer contain spaces and everything is lower case.
df
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
11909 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 46120 |
11910 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 56670 |
11911 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50620 |
11912 | Acura | ZDX | 2013 | premium unleaded (recommended) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50920 |
11913 | Lincoln | Zephyr | 2006 | regular unleaded | 221.0 | 6.0 | AUTOMATIC | front wheel drive | 4.0 | Luxury | Midsize | Sedan | 26 | 17 | 61 | 28995 |
11914 rows × 16 columns
The df.dtypes attribute shows us what type of data is contained in each column. We are particularly interested in the "object" columns as they contain data in a string format.
df.dtypes
make object model object year int64 engine_fuel_type object engine_hp float64 engine_cylinders float64 transmission_type object driven_wheels object number_of_doors float64 market_category object vehicle_size object vehicle_style object highway_mpg int64 city_mpg int64 popularity int64 msrp int64 dtype: object
We can see in the dataframe that the data is also inconsistent in its formatting. Let's clean that up, starting by identifying which columns contain strings, which is indicated by the data type object.
df.dtypes == 'object'
make True model True year False engine_fuel_type True engine_hp False engine_cylinders False transmission_type True driven_wheels True number_of_doors False market_category True vehicle_size True vehicle_style True highway_mpg False city_mpg False popularity False msrp False dtype: bool
Now we want to select only those columns (or "features") that are of the data type 'object'.
df.dtypes[df.dtypes == 'object']
make object model object engine_fuel_type object transmission_type object driven_wheels object market_category object vehicle_size object vehicle_style object dtype: object
Now that we know which features are objects, we don't need the data types shown in the output; we can use .index to retrieve just the column names.
df.dtypes[df.dtypes == 'object'].index
Index(['make', 'model', 'engine_fuel_type', 'transmission_type', 'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style'], dtype='object')
Let's transform those "object" column names into a Python list and assign it to the variable named "strings".
strings = list(df.dtypes[df.dtypes == 'object'].index)
Now we can write a short loop that iterates through each column listed in "strings" and applies the .str.lower() and .str.replace() methods to the data contained in that column. When we view the dataframe again we can see that all the 'object' data has been transformed to remove spaces and capitalization.
for col in strings:
df[col] = df[col].str.lower().str.replace(' ', '_')
df.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | 2011 | premium_unleaded_(required) | 335.0 | 6.0 | manual | rear_wheel_drive | 2.0 | factory_tuner,luxury,high-performance | compact | coupe | 26 | 19 | 3916 | 46135 |
1 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 28 | 19 | 3916 | 40650 |
2 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 36350 |
3 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 18 | 3916 | 29450 |
4 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 18 | 3916 | 34500 |
2.3 Exploratory data analysis¶
Pandas attributes and methods:
- df[col].unique() – returns a list of unique values in the series
- df[col].nunique() – returns the number of unique values in the series
- df.isnull().sum() – returns the number of null values in each column of the dataframe
Matplotlib and seaborn methods:
- %matplotlib inline – ensure that plots are displayed in Jupyter Notebook's cells
- sns.histplot() – show the histogram of a series
Numpy methods:
- np.log1p() – adds one to each value and then applies the log transformation, i.e. log(x + 1)
Long-tail distributions usually confuse the ML models, so the recommendation is to transform the target variable distribution to a normal one whenever possible.
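As a quick aside (a minimal sketch, not from the original notebook), np.log1p computes log(x + 1) and np.expm1 undoes it, so the transformation can always be reversed:

x = np.array([0, 100, 1_000_000])      # np was imported above
np.log1p(x)                            # array([ 0.        ,  4.61512052, 13.81551156])
np.expm1(np.log1p(x))                  # recovers array([0., 100., 1000000.]) up to floating-point error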
df
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | 2011 | premium_unleaded_(required) | 335.0 | 6.0 | manual | rear_wheel_drive | 2.0 | factory_tuner,luxury,high-performance | compact | coupe | 26 | 19 | 3916 | 46135 |
1 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 28 | 19 | 3916 | 40650 |
2 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 36350 |
3 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 18 | 3916 | 29450 |
4 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 18 | 3916 | 34500 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
11909 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 46120 |
11910 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 56670 |
11911 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50620 |
11912 | acura | zdx | 2013 | premium_unleaded_(recommended) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50920 |
11913 | lincoln | zephyr | 2006 | regular_unleaded | 221.0 | 6.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | sedan | 26 | 17 | 61 | 28995 |
11914 rows × 16 columns
df.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp', 'engine_cylinders', 'transmission_type', 'driven_wheels', 'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style', 'highway_mpg', 'city_mpg', 'popularity', 'msrp'], dtype='object')
We want to learn more about the data that we are using. We can use some simple functions to more fully understand what the data contains.
for col in df.columns:
print(col)
print(df[col].head())
print()
make 0 bmw 1 bmw 2 bmw 3 bmw 4 bmw Name: make, dtype: object model 0 1_series_m 1 1_series 2 1_series 3 1_series 4 1_series Name: model, dtype: object year 0 2011 1 2011 2 2011 3 2011 4 2011 Name: year, dtype: int64 engine_fuel_type 0 premium_unleaded_(required) 1 premium_unleaded_(required) 2 premium_unleaded_(required) 3 premium_unleaded_(required) 4 premium_unleaded_(required) Name: engine_fuel_type, dtype: object engine_hp 0 335.0 1 300.0 2 300.0 3 230.0 4 230.0 Name: engine_hp, dtype: float64 engine_cylinders 0 6.0 1 6.0 2 6.0 3 6.0 4 6.0 Name: engine_cylinders, dtype: float64 transmission_type 0 manual 1 manual 2 manual 3 manual 4 manual Name: transmission_type, dtype: object driven_wheels 0 rear_wheel_drive 1 rear_wheel_drive 2 rear_wheel_drive 3 rear_wheel_drive 4 rear_wheel_drive Name: driven_wheels, dtype: object number_of_doors 0 2.0 1 2.0 2 2.0 3 2.0 4 2.0 Name: number_of_doors, dtype: float64 market_category 0 factory_tuner,luxury,high-performance 1 luxury,performance 2 luxury,high-performance 3 luxury,performance 4 luxury Name: market_category, dtype: object vehicle_size 0 compact 1 compact 2 compact 3 compact 4 compact Name: vehicle_size, dtype: object vehicle_style 0 coupe 1 convertible 2 coupe 3 coupe 4 convertible Name: vehicle_style, dtype: object highway_mpg 0 26 1 28 2 28 3 28 4 28 Name: highway_mpg, dtype: int64 city_mpg 0 19 1 19 2 20 3 18 4 18 Name: city_mpg, dtype: int64 popularity 0 3916 1 3916 2 3916 3 3916 4 3916 Name: popularity, dtype: int64 msrp 0 46135 1 40650 2 36350 3 29450 4 34500 Name: msrp, dtype: int64
The above doesn't give us much detail; for instance, under 'make' it only shows BMWs. We can use the unique and nunique functions to get a better understanding of what our data contains. The loop below prints the first five unique items in each column and counts the number of unique items in each column. For instance, we can see there are 48 different car manufacturers in the dataset.
for col in df.columns:
print(col) # prints the col name
print(df[col].unique()[:5]) # prints the first 5 unique values in the col
print(df[col].nunique()) # calculates the number of unique values in each col
print()
make ['bmw' 'audi' 'fiat' 'mercedes-benz' 'chrysler'] 48 model ['1_series_m' '1_series' '100' '124_spider' '190-class'] 914 year [2011 2012 2013 1992 1993] 28 engine_fuel_type ['premium_unleaded_(required)' 'regular_unleaded' 'premium_unleaded_(recommended)' 'flex-fuel_(unleaded/e85)' 'diesel'] 10 engine_hp [335. 300. 230. 320. 172.] 356 engine_cylinders [ 6. 4. 5. 8. 12.] 9 transmission_type ['manual' 'automatic' 'automated_manual' 'direct_drive' 'unknown'] 5 driven_wheels ['rear_wheel_drive' 'front_wheel_drive' 'all_wheel_drive' 'four_wheel_drive'] 4 number_of_doors [ 2. 4. 3. nan] 3 market_category ['factory_tuner,luxury,high-performance' 'luxury,performance' 'luxury,high-performance' 'luxury' 'performance'] 71 vehicle_size ['compact' 'midsize' 'large'] 3 vehicle_style ['coupe' 'convertible' 'sedan' 'wagon' '4dr_hatchback'] 16 highway_mpg [26 28 27 25 24] 59 city_mpg [19 20 18 17 16] 69 popularity [3916 3105 819 617 1013] 48 msrp [46135 40650 36350 29450 34500] 6049
Let's see what the distribution of prices looks like; we can use visualization to do this. We will import Matplotlib and Seaborn. Seaborn is a package built on top of Matplotlib that provides enhanced data visualizations. The %matplotlib inline statement tells Jupyter Notebook to render plots inside the notebook cells.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.histplot(df.msrp)
<AxesSubplot:xlabel='msrp', ylabel='Count'>
It isn't a very good plot right off the bat; we can increase its readability.
sns.histplot(df.msrp, bins=50) # bins allow us to group the data points into, well, bins
<AxesSubplot:xlabel='msrp', ylabel='Count'>
That is a very long tail, with most of the car prices grouped near zero and some cars valued at around 2 million. We will filter out some of the high-price outliers to, perhaps, get a better representation of the data.
sns.histplot(df.msrp[df.msrp < 100000], bins=50)
<AxesSubplot:xlabel='msrp', ylabel='Count'>
Our data has that very long tail, which will confuse our model, so we will have to change this. Converting the prices to a logarithmic scale will group them closer together. A standard practice when applying the log transform is to add 1 to each data point first; this avoids errors in the event that one of the points is zero. For now we are doing this just to plot the data and see that it looks more like a normal distribution. Later we will apply this log transformation to the "msrp" column of our dataset.
price_logs = np.log1p(df.msrp)
price_logs
0 10.739349 1 10.612779 2 10.500977 3 10.290483 4 10.448744 ... 11909 10.739024 11910 10.945018 11911 10.832122 11912 10.838031 11913 10.274913 Name: msrp, Length: 11914, dtype: float64
Now we have our prices close together, our graph should look better.
sns.histplot(price_logs, bins=50)
<AxesSubplot:xlabel='msrp', ylabel='Count'>
We have eliminated that long tail off to the right and now our data looks more like a normal distribution.
Missing values¶
df.isnull().sum()
make 0 model 0 year 0 engine_fuel_type 3 engine_hp 69 engine_cylinders 30 transmission_type 0 driven_wheels 0 number_of_doors 6 market_category 3742 vehicle_size 0 vehicle_style 0 highway_mpg 0 city_mpg 0 popularity 0 msrp 0 dtype: int64
2.4 Setting up the validation framework¶
In general, the dataset is split into three parts: training, validation, and test. For each partition, we need to obtain a feature matrix (X) and a y vector of targets. First, the sizes of the partitions are calculated; then the records are shuffled so that each partition contains a non-sequential cross-section of the dataset, and the partitions are created from the shuffled indices.
Pandas attributes and methods:
- df.iloc[] – returns subsets of records of a dataframe, selected by numerical indices
- df.reset_index() – resets the indices of a dataframe
- del df[col] – deletes a column (here, the target variable)
Numpy methods:
- np.arange() – returns an array of sequential numbers
- np.random.shuffle() – shuffles an array in place
- np.random.seed() – sets a seed for reproducibility
Let's set up the split.
len(df)
11914
We have 11,914 records that need to be separated into training, validation and testing sets. Let's calculate how many records will be in each set using len() and some math.
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test
n_train, n_val, n_test
(7150, 2382, 2382)
So our training set will contain 7,150 records, and the validation and test sets will contain 2,382 each.
df.iloc[:10] # first 10 records
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | 2011 | premium_unleaded_(required) | 335.0 | 6.0 | manual | rear_wheel_drive | 2.0 | factory_tuner,luxury,high-performance | compact | coupe | 26 | 19 | 3916 | 46135 |
1 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 28 | 19 | 3916 | 40650 |
2 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 36350 |
3 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 18 | 3916 | 29450 |
4 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 18 | 3916 | 34500 |
5 | bmw | 1_series | 2012 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 18 | 3916 | 31200 |
6 | bmw | 1_series | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 26 | 17 | 3916 | 44100 |
7 | bmw | 1_series | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 39300 |
8 | bmw | 1_series | 2012 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 18 | 3916 | 36900 |
9 | bmw | 1_series | 2013 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 27 | 18 | 3916 | 37200 |
df.iloc[10:20] # second 10 records
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | bmw | 1_series | 2013 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 39600 |
11 | bmw | 1_series | 2013 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 19 | 3916 | 31500 |
12 | bmw | 1_series | 2013 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 28 | 19 | 3916 | 44400 |
13 | bmw | 1_series | 2013 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 19 | 3916 | 37200 |
14 | bmw | 1_series | 2013 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 19 | 3916 | 31500 |
15 | bmw | 1_series | 2013 | premium_unleaded_(required) | 320.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | convertible | 25 | 18 | 3916 | 48250 |
16 | bmw | 1_series | 2013 | premium_unleaded_(required) | 320.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 43550 |
17 | audi | 100 | 1992 | regular_unleaded | 172.0 | 6.0 | manual | front_wheel_drive | 4.0 | luxury | midsize | sedan | 24 | 17 | 3105 | 2000 |
18 | audi | 100 | 1992 | regular_unleaded | 172.0 | 6.0 | manual | front_wheel_drive | 4.0 | luxury | midsize | sedan | 24 | 17 | 3105 | 2000 |
19 | audi | 100 | 1992 | regular_unleaded | 172.0 | 6.0 | automatic | all_wheel_drive | 4.0 | luxury | midsize | wagon | 20 | 16 | 3105 | 2000 |
df.iloc[11910:] #last 4 records because the dataset has 11,914 records
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11910 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 56670 |
11911 | acura | zdx | 2012 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50620 |
11912 | acura | zdx | 2013 | premium_unleaded_(recommended) | 300.0 | 6.0 | automatic | all_wheel_drive | 4.0 | crossover,hatchback,luxury | midsize | 4dr_hatchback | 23 | 16 | 204 | 50920 |
11913 | lincoln | zephyr | 2006 | regular_unleaded | 221.0 | 6.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | sedan | 26 | 17 | 61 | 28995 |
Now we could just grab our records using .iloc
and the new variables we just set up.
df_train = df.iloc[:n_train]
df_val = df.iloc[n_train:n_train+n_val]
df_test = df.iloc[n_train+n_val:]
The problem with splitting our records like this is that each partition isn't a representative cross-section of the data. For example, df_train will contain all of the BMW vehicles. So we need to shuffle the records around.
df_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bmw | 1_series_m | 2011 | premium_unleaded_(required) | 335.0 | 6.0 | manual | rear_wheel_drive | 2.0 | factory_tuner,luxury,high-performance | compact | coupe | 26 | 19 | 3916 | 46135 |
1 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | convertible | 28 | 19 | 3916 | 40650 |
2 | bmw | 1_series | 2011 | premium_unleaded_(required) | 300.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,high-performance | compact | coupe | 28 | 20 | 3916 | 36350 |
3 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury,performance | compact | coupe | 28 | 18 | 3916 | 29450 |
4 | bmw | 1_series | 2011 | premium_unleaded_(required) | 230.0 | 6.0 | manual | rear_wheel_drive | 2.0 | luxury | compact | convertible | 28 | 18 | 3916 | 34500 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7145 | mazda | navajo | 1994 | regular_unleaded | 160.0 | 6.0 | manual | four_wheel_drive | 2.0 | NaN | compact | 2dr_suv | 18 | 14 | 586 | 2000 |
7146 | mazda | navajo | 1994 | regular_unleaded | 160.0 | 6.0 | manual | four_wheel_drive | 2.0 | NaN | compact | 2dr_suv | 18 | 14 | 586 | 2000 |
7147 | lincoln | navigator | 2015 | regular_unleaded | 365.0 | 6.0 | automatic | four_wheel_drive | 4.0 | luxury | large | 4dr_suv | 20 | 15 | 61 | 65055 |
7148 | lincoln | navigator | 2015 | regular_unleaded | 365.0 | 6.0 | automatic | four_wheel_drive | 4.0 | luxury | large | 4dr_suv | 19 | 15 | 61 | 67220 |
7149 | lincoln | navigator | 2015 | regular_unleaded | 365.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | luxury | large | 4dr_suv | 22 | 16 | 61 | 61480 |
7150 rows × 16 columns
We use the np.arange function to create an array of sequential indices (0 to n-1) and assign it to the variable "idx".
idx = np.arange(n)
idx
array([ 0, 1, 2, ..., 11911, 11912, 11913])
Then we can use np.random.shuffle() to shuffle the indices so that, for example, all the BMWs don't end up in the training set.
Since I am doing this as part of the ml-zoomcamp class, we want to have the same shuffle as our instructor. We can do this with the np.random.seed() method, and then we do the shuffle.
np.random.seed(2)
np.random.shuffle(idx)
df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train+n_val]]
df_test = df.iloc[idx[n_train+n_val:]]
df_train.head()
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 14410 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 19685 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 19795 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 2000 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 56260 |
len(df_train), len(df_val), len(df_test)
(7150, 2382, 2382)
df_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 14410 |
6720 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 19685 |
5878 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 19795 |
11190 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 2000 |
4554 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 56260 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
434 | bmw | 4_series | 2015 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | luxury,performance | midsize | convertible | 31 | 20 | 3916 | 54900 |
1902 | volkswagen | beetle | 2015 | premium_unleaded_(recommended) | 210.0 | 4.0 | automated_manual | front_wheel_drive | 2.0 | hatchback,performance | compact | 2dr_hatchback | 30 | 24 | 873 | 29215 |
9334 | gmc | sierra_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 22 | 17 | 549 | 34675 |
5284 | rolls-royce | ghost | 2014 | premium_unleaded_(required) | 563.0 | 12.0 | automatic | rear_wheel_drive | 4.0 | exotic,luxury,performance | large | sedan | 21 | 13 | 86 | 303300 |
2420 | volkswagen | cc | 2017 | premium_unleaded_(recommended) | 200.0 | 4.0 | automated_manual | front_wheel_drive | 4.0 | performance | midsize | sedan | 31 | 22 | 873 | 37820 |
7150 rows × 16 columns
Now that we have a randomly shuffled dataframe we want to reset the index so that it is in numerical order.
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
Before we start the regression problem we need to apply the log function to the “msrp” column so that the data doesn’t have that long tail.
df_train.msrp
0 14410 1 19685 2 19795 3 2000 4 56260 ... 7145 54900 7146 29215 7147 34675 7148 303300 7149 37820 Name: msrp, Length: 7150, dtype: int64
We will use the np.log1p
function to perform this on the data.
np.log1p(df_train.msrp)
0 9.575747 1 9.887663 2 9.893235 3 7.601402 4 10.937757 ... 7145 10.913287 7146 10.282472 7147 10.453803 7148 12.622481 7149 10.540620 Name: msrp, Length: 7150, dtype: float64
The data above is a series; we will convert it to a NumPy array (using .values), assign it to the "y_train" variable, and then do the same for the other sets as well.
y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)
None of our df sets (df_train, df_val, df_test) needs the msrp column, because that is what we are trying to predict. We will delete that column from each of them.
del df_train['msrp']
del df_val['msrp']
del df_test['msrp']
len(y_train)
7150
2.5 Linear regression¶
Linear regression is a model for solving regression tasks, in which the objective is to fit a line to the data and make predictions on new values. The input of this model is the feature matrix, and the output is a vector of predictions that tries to be as close as possible to the actual y values. The LR formula is the sum of the bias term (w0), which represents the prediction when we have no information about the features, and each of the feature values times its corresponding weight. We need to make sure that the result is shown on the untransformed scale.
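Written as a formula (a sketch using the course notation, where g is the model, x_i is a feature vector with n features and the w_j are the weights):

g(x_i) = w_0 + \sum_{j=1}^{n} w_j x_{ij}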
So our model is g; df_train, df_val and df_test are our feature matrices, and our targets are the variables we created: y_train, y_val and y_test.
df_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7145 | bmw | 4_series | 2015 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | luxury,performance | midsize | convertible | 31 | 20 | 3916 |
7146 | volkswagen | beetle | 2015 | premium_unleaded_(recommended) | 210.0 | 4.0 | automated_manual | front_wheel_drive | 2.0 | hatchback,performance | compact | 2dr_hatchback | 30 | 24 | 873 |
7147 | gmc | sierra_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 22 | 17 | 549 |
7148 | rolls-royce | ghost | 2014 | premium_unleaded_(required) | 563.0 | 12.0 | automatic | rear_wheel_drive | 4.0 | exotic,luxury,performance | large | sedan | 21 | 13 | 86 |
7149 | volkswagen | cc | 2017 | premium_unleaded_(recommended) | 200.0 | 4.0 | automated_manual | front_wheel_drive | 4.0 | performance | midsize | sedan | 31 | 22 | 873 |
7150 rows × 15 columns
We can look at different records in our dataset; below we are looking at the row at position 10 using the .iloc[] method.
df_train.iloc[10]
make rolls-royce model phantom_drophead_coupe year 2015 engine_fuel_type premium_unleaded_(required) engine_hp 453.0 engine_cylinders 12.0 transmission_type automatic driven_wheels rear_wheel_drive number_of_doors 2.0 market_category exotic,luxury,performance vehicle_size large vehicle_style convertible highway_mpg 19 city_mpg 11 popularity 86 Name: 10, dtype: object
We are going to choose engine_hp, city_mpg and popularity.
xi = [453, 11, 86]
# this is our feature set for our example
Now we will write a short function to predict a price based on the feature set we chose.
def g(xi):
# do something
return 10000
g(xi)
10000
w0 = 0
w = [1, 1, 1]
Each feature in xi will be assigned a weight that either increases or decreases its effect on the price of a vehicle.
In Python we start indexing at 0, so the indices run from 0 to 2 for our 3 features.
def linear_regression(xi):
n = len(xi)
pred = w0
for j in range(n):
pred = pred + w[j] * xi[j]
return pred
linear_regression(xi)
550
So our prediction was 550 for the price of the vehicle, on the log(price + 1) scale. It isn't correct because w0 and w aren't the right values yet; finding them is where machine learning comes in, but let's keep working on this solution. Keep in mind the values we are getting are log prices, so we have to back-transform them. Because we used np.log1p(), which adds a 1, we can use np.expm1 to undo the log and subtract the 1.
xi = [453, 11, 86] # features
w0 = 7.17 # bias term
w = [0.01, 0.04, 0.002] # weights
def linear_regression(xi):
n = len(xi)
pred = w0
for j in range(n):
pred = pred + w[j] * xi[j]
return pred
linear_regression(xi)
12.312
Back-transform the log.
np.expm1(linear_regression(xi))
222347.2221101062
The value above is our first attempt at predicting the price of the Rolls-Royce: about 222,347.
2.6 Linear regression: vector form¶
The LR formula can be expressed as the dot product between features and weights. The feature vector includes the bias term, paired with an x value of one. When all the records are included, the LR predictions can be calculated as the dot product between the feature matrix and the vector of weights, obtaining the y vector of predictions.
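In vector form this is the dot product of the feature vector (with a 1 prepended for the bias term) and the weight vector; stacking all records into a feature matrix X gives the whole prediction vector at once (a sketch in the course notation):

g(x_i) = x_i^T w, \qquad \hat{y} = X w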
def dot(xi, w):
n = len(xi)
res = 0.0
for j in range(n):
res = res + xi[j] * w[j]
return res
w_new = [w0] + w
w_new
[7.17, 0.01, 0.04, 0.002]
def linear_regression(xi):
xi = [1] + xi
return dot(xi, w_new)
linear_regression(xi)
12.312
Now we will make predictions on several records at once (making some numbers up just for the example).
w0 = 7.17 # bias term
w = [0.01, 0.04, 0.002] # weights
w_new = [w0] + w
x1 = [1, 148, 24, 1385]
x2 = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]
X = [x1, x2, x10]
X = np.array(X)
X
array([[ 1, 148, 24, 1385], [ 1, 132, 25, 2031], [ 1, 453, 11, 86]])
def linear_regression(X):
return X.dot(w_new)
linear_regression(X)
array([12.38 , 13.552, 12.312])
2.7 Training linear regression: Normal equation¶
Obtaining predictions as close as possible to the y target values requires calculating the weights from the general LR equation. The feature matrix does not have an inverse because it is not square, so we instead obtain an approximate solution using the Gram matrix (the product of the transposed feature matrix and the feature matrix). The vector of weights or coefficients obtained with this formula is the closest possible solution to the LR system.
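The normal equation itself, which the code in this section implements (X is the feature matrix with a column of ones for the bias, and X^T X is the Gram matrix):

w = (X^T X)^{-1} X^T y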
def train_linear_regression(X, y):
pass
X = [[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 86],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86]
]
X = np.array(X)
X
array([[ 148, 24, 1385], [ 132, 25, 2031], [ 453, 11, 86], [ 158, 24, 185], [ 172, 25, 201], [ 413, 11, 86], [ 38, 54, 185], [ 142, 25, 431], [ 453, 31, 86]])
We need to add a bias term to the matrix, which will be a column of 1's. We take the row count from the shape of the array above and pass it to np.ones to create a vector of 1's matching the number of rows, and then prepend that vector as the first column of the matrix.
ones = np.ones(X.shape[0])
ones
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
X = np.column_stack([ones, X])
We need to create a y vector.
y = [100, 200, 150, 250, 100, 200, 150, 250, 120]
XTX = X.T.dot(X) # this is our Gram matrix
XTX
array([[9.000000e+00, 2.109000e+03, 2.300000e+02, 4.676000e+03], [2.109000e+03, 6.964710e+05, 4.411500e+04, 7.185400e+05], [2.300000e+02, 4.411500e+04, 7.146000e+03, 1.188030e+05], [4.676000e+03, 7.185400e+05, 1.188030e+05, 6.359986e+06]])
We now compute the inverse of XTX. Multiplying XTX by its inverse should give us the identity matrix.
XTX_inv = np.linalg.inv(XTX)
We can take the dot product of XTX and XTX_inv to check that we indeed get the identity matrix (up to rounding).
XTX.dot(XTX_inv).round(1)
array([[ 1., -0., 0., 0.], [ 0., 1., 0., -0.], [ 0., 0., 1., 0.], [ 0., -0., 0., 1.]])
w_full = XTX_inv.dot(X.T).dot(y)
w_full
array([ 3.00067767e+02, -2.27742529e-01, -2.57694130e+00, -2.30120640e-02])
w0 = w_full[0]
w = w_full[1:]
w0, w
(300.06776692555593, array([-0.22774253, -2.5769413 , -0.02301206]))
def train_linear_regression(X, y):
ones = np.ones(X.shape[0])
X = np.column_stack([ones, X])
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
return w_full[0], w_full[1:]
train_linear_regression(X, y)
(300.06776692555593, array([-0.22774253, -2.5769413 , -0.02301206]))
2.8 Baseline model for car price prediction project¶
The LR model obtained in the previous section was applied to the car price dataset. For this model, only the numerical variables were considered. The training data was pre-processed by replacing the NaN values with 0, so that these values were effectively ignored by the model. Then the model was trained, which allowed us to make predictions on new data. Finally, the distributions of the y target variable and the predictions were compared by plotting their histograms.
df_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7145 | bmw | 4_series | 2015 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | luxury,performance | midsize | convertible | 31 | 20 | 3916 |
7146 | volkswagen | beetle | 2015 | premium_unleaded_(recommended) | 210.0 | 4.0 | automated_manual | front_wheel_drive | 2.0 | hatchback,performance | compact | 2dr_hatchback | 30 | 24 | 873 |
7147 | gmc | sierra_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 22 | 17 | 549 |
7148 | rolls-royce | ghost | 2014 | premium_unleaded_(required) | 563.0 | 12.0 | automatic | rear_wheel_drive | 4.0 | exotic,luxury,performance | large | sedan | 21 | 13 | 86 |
7149 | volkswagen | cc | 2017 | premium_unleaded_(recommended) | 200.0 | 4.0 | automated_manual | front_wheel_drive | 4.0 | performance | midsize | sedan | 31 | 22 | 873 |
7150 rows × 15 columns
df_train.dtypes
make object model object year int64 engine_fuel_type object engine_hp float64 engine_cylinders float64 transmission_type object driven_wheels object number_of_doors float64 market_category object vehicle_size object vehicle_style object highway_mpg int64 city_mpg int64 popularity int64 dtype: object
We are going to build a model using the features engine_hp, engine_cylinders, highway_mpg, city_mpg, and popularity.
df_train.columns
Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp', 'engine_cylinders', 'transmission_type', 'driven_wheels', 'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style', 'highway_mpg', 'city_mpg', 'popularity'], dtype='object')
We select the columns we want to build the model on and assign them to the variable 'base'.
base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
'city_mpg', 'popularity']
df_train[base]
engine_hp | engine_cylinders | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|
0 | 148.0 | 4.0 | 33 | 24 | 1385 |
1 | 132.0 | 4.0 | 32 | 25 | 2031 |
2 | 148.0 | 4.0 | 37 | 28 | 640 |
3 | 90.0 | 4.0 | 18 | 16 | 873 |
4 | 385.0 | 8.0 | 21 | 15 | 5657 |
… | … | … | … | … | … |
7145 | 300.0 | 6.0 | 31 | 20 | 3916 |
7146 | 210.0 | 4.0 | 30 | 24 | 873 |
7147 | 285.0 | 6.0 | 22 | 17 | 549 |
7148 | 563.0 | 12.0 | 21 | 13 | 86 |
7149 | 200.0 | 4.0 | 31 | 22 | 873 |
7150 rows × 5 columns
Next, we need to extract the values from our columns to use in our model and we assign those values to the variable ‘X_train’.
X_train = df_train[base].values
X_train
array([[ 148., 4., 33., 24., 1385.], [ 132., 4., 32., 25., 2031.], [ 148., 4., 37., 28., 640.], ..., [ 285., 6., 22., 17., 549.], [ 563., 12., 21., 13., 86.], [ 200., 4., 31., 22., 873.]])
We have already created our ‘y’ with the MSRP column that we removed from ‘df_train’ earlier.
y_train
array([ 9.57574708, 9.887663 , 9.89323518, ..., 10.45380308, 12.62248099, 10.54061978])
With the ‘X_train’ and ‘y_train’ variables we can insert these into the train_linear_regression
function we created earlier.
train_linear_regression(X_train, y_train)
(nan, array([nan, nan, nan, nan, nan]))
As we can see above we have some 'nan' values, which we can't have. We noticed earlier that we had some null values in the dataset; we have to fill those in with some values.
df_train[base].isnull().sum()
engine_hp 40 engine_cylinders 14 highway_mpg 0 city_mpg 0 popularity 0 dtype: int64
As we can see above we have 40 empty entries for ‘engine_hp’ and 14 for ‘engine_cylinders’. There are various ways to fill in these values. We can use the median or mean values of the columns or fill them with a 0.
X_train = df_train[base].fillna(0).values
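As an aside (a sketch of an alternative, not used in the rest of this notebook), we could fill the missing values with each column's mean instead of 0:

X_train_alt = df_train[base].fillna(df_train[base].mean()).values   # hypothetical alternative, not used below

Filling with 0 effectively makes the model ignore those feature values, while filling with the mean keeps them at a typical level; here we stick with 0 for simplicity.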
Now that we have filled in the nan values we can see that our function now works properly.
train_linear_regression(X_train, y_train)
(7.927257388070117, array([ 9.70589522e-03, -1.59103494e-01, 1.43792133e-02, 1.49441072e-02, -9.06908672e-06]))
Let’s assign the model to our bias term and w.
w0, w = train_linear_regression(X_train, y_train)
Now we can do the matrix-vector multiplication to get our predictions.
w0 + X_train.dot(w)
array([ 9.54792783, 9.38733977, 9.67197758, ..., 10.30423015, 11.9778914 , 9.99863111])
Let’s assign our predictions to the variable ‘y_pred’.
y_pred = w0 + X_train.dot(w)
Let's plot our 'y_pred' values to see how they compare to 'y_train', the actual (log-transformed) MSRPs from our data, also known as the target variable. We can assign a different color to each; the 'alpha' argument changes the transparency so that we can see both, and the 'bins' argument breaks the data into 50 segments.
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_train, color='blue', alpha=0.5, bins=50)
<AxesSubplot:ylabel='Count'>
We can see in the plot that our predicted prices tend to be lower than the actual prices, and the peak of the predicted distribution is also shifted low. Even though this model isn't perfect, we can use it as a baseline to evaluate other models against.
2.9 Root mean squared error¶
The RMSE is a measure of the error associated with a model for regression tasks. The video explained the RMSE formula in detail and implemented it in Python. This is how we will quantify the accuracy of the previously built model.
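The formula implemented below is:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

where y_i are the actual values, \hat{y}_i the predictions and n the number of records.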
def rmse(y, y_pred):
se = (y - y_pred) ** 2
mse = se.mean()
return np.sqrt(mse)
rmse(y_train, y_pred)
0.7554192603920132
2.10 Using RMSE on validation data¶
We calculate the RMSE on the validation partition of the car price dataset; this gives us a metric to evaluate the model's performance on data it was not trained on. The code below is what we used to train our model on X_train; now we wrap the preparation steps in a function so we can prepare the validation set we created earlier in this notebook the same way, and evaluate the model on it.
base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
'city_mpg', 'popularity']
X_train = df_train[base].fillna(0).values
w0, w = train_linear_regression(X_train, y_train)
y_pred = w0 + X_train.dot(w)
def prepare_X(df):
df_num = df[base]
df_num = df_num.fillna(0)
X = df_num.values
return X
# Train the model on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.7616530991301601
2.11 Feature engineering¶
The feature 'age' of the car was added to the dataset, obtained by subtracting each car's year from the maximum year in the dataset. This new feature improved the model's performance, as measured by the RMSE and by comparing the distributions of the y target variable and the predictions.
df_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7145 | bmw | 4_series | 2015 | premium_unleaded_(required) | 300.0 | 6.0 | automatic | rear_wheel_drive | 2.0 | luxury,performance | midsize | convertible | 31 | 20 | 3916 |
7146 | volkswagen | beetle | 2015 | premium_unleaded_(recommended) | 210.0 | 4.0 | automated_manual | front_wheel_drive | 2.0 | hatchback,performance | compact | 2dr_hatchback | 30 | 24 | 873 |
7147 | gmc | sierra_1500 | 2015 | flex-fuel_(unleaded/e85) | 285.0 | 6.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | extended_cab_pickup | 22 | 17 | 549 |
7148 | rolls-royce | ghost | 2014 | premium_unleaded_(required) | 563.0 | 12.0 | automatic | rear_wheel_drive | 4.0 | exotic,luxury,performance | large | sedan | 21 | 13 | 86 |
7149 | volkswagen | cc | 2017 | premium_unleaded_(recommended) | 200.0 | 4.0 | automated_manual | front_wheel_drive | 4.0 | performance | midsize | sedan | 31 | 22 | 873 |
7150 rows × 15 columns
The age of a car is a good predictor of price: the older the car, the lower the price. We can calculate the age of a car by subtracting its year from the max() year in the dataset.
df_train.year.max()
2017
2017 - df_train.year
0 9 1 5 2 1 3 26 4 0 .. 7145 2 7146 2 7147 2 7148 3 7149 0 Name: year, Length: 7150, dtype: int64
We can use the function we created earlier and modify it to add the column ‘age’ as one of the features in our dataset.
def prepare_X(df):
df = df.copy() # create a copy so we are not modifying the original
df['age'] = 2017 - df.year # here we create the 'age' column.
features = base + ['age']# here we add the 'age' column to our base features.
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
X_train = prepare_X(df_train)
X_train
array([[1.480e+02, 4.000e+00, 3.300e+01, 2.400e+01, 1.385e+03, 9.000e+00], [1.320e+02, 4.000e+00, 3.200e+01, 2.500e+01, 2.031e+03, 5.000e+00], [1.480e+02, 4.000e+00, 3.700e+01, 2.800e+01, 6.400e+02, 1.000e+00], ..., [2.850e+02, 6.000e+00, 2.200e+01, 1.700e+01, 5.490e+02, 2.000e+00], [5.630e+02, 1.200e+01, 2.100e+01, 1.300e+01, 8.600e+01, 3.000e+00], [2.000e+02, 4.000e+00, 3.100e+01, 2.200e+01, 8.730e+02, 0.000e+00]])
After we run our predictions again we can see that the validation RMSE decreased from about 0.7617 down to about 0.5172. The smaller the RMSE, the closer our predictions are to the known prices.
# Train the model on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.5172055461058335
Visually we can see the graph is closer to the known prices.
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_val, color='blue', alpha=0.5, bins=50)
<AxesSubplot:ylabel='Count'>
We are getting closer; there looks to be an issue with some of the lower-priced cars.
2.12 Categorical variables¶
Categorical variables are typically strings, and pandas identifies them as object types. These variables need to be converted to a numerical form because ML models can interpret only numerical features. It is possible to incorporate only certain categories from a feature, not necessarily all of them. This transformation from categorical to numerical variables is known as one-hot encoding.
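As an aside (a sketch, not how this notebook proceeds), pandas has a built-in helper for one-hot encoding, pd.get_dummies; the exact column names it produces will differ slightly from ours:

pd.get_dummies(df_train.number_of_doors, prefix='num_doors').head()   # hypothetical alternative

Below, we build the same kind of binary columns by hand, starting with the number_of_doors feature.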
for v in [2, 3, 4]:
    # '%s' is a placeholder for string formatting, and '% v' substitutes
    # the current value of v into it.
    # The line below creates a new column (feature) named num_doors_2, _3 or _4
    # containing a binary flag for whether a record has that many doors:
    # a record with 3 doors gets a 0 in the _2 column, a 1 in the _3 column
    # and a 0 in the _4 column, and the boolean is then converted to an integer.
df_train['num_doors_%s' % v] = (df_train.number_of_doors == v).astype('int')
def prepare_X(df):
df = df.copy() # create a copy so we are not modifying the original
features = base.copy()
df['age'] = 2017 - df.year # here we create the 'age' column.
features.append('age')
for v in [2, 3, 4]:
        # one-hot encode number_of_doors (see the comments in the earlier cell)
df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
features.append('num_doors_%s' % v)
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
prepare_X(df_train)
array([[148., 4., 33., ..., 1., 0., 0.], [132., 4., 32., ..., 0., 0., 1.], [148., 4., 37., ..., 0., 0., 1.], ..., [285., 6., 22., ..., 0., 0., 1.], [563., 12., 21., ..., 0., 0., 1.], [200., 4., 31., ..., 0., 0., 1.]])
Above we can see that our ‘prepare’ function added our categorical columns for number of doors.
# Train the model on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.5157995641502352
There is a very slight improvement using doors as a feature. Let's look at adding the car make as a feature.
makes = list(df.make.value_counts().head().index)
makes
['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
def prepare_X(df):
df = df.copy() # create a copy so we are not modifying the original
features = base.copy()
df['age'] = 2017 - df.year # here we create the 'age' column.
features.append('age')
for v in [2, 3, 4]:
        # one-hot encode number_of_doors (see the comments in the earlier cell)
df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
features.append('num_doors_%s' % v)
for v in makes:
df['make_%s' % v] = (df.make == v).astype('int')
features.append('make_%s' % v)
df_num = df[features]
df_num = df_num.fillna(0)
X = df_num.values
return X
# Predicting on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.5076038849557034
As we can see, adding the make produced another slight improvement, lowering the validation RMSE a little further.
df_train.dtypes
make                  object
model                 object
year                   int64
engine_fuel_type      object
engine_hp            float64
engine_cylinders     float64
transmission_type     object
driven_wheels         object
number_of_doors      float64
market_category       object
vehicle_size          object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
popularity             int64
num_doors_2            int32
num_doors_3            int32
num_doors_4            int32
dtype: object
categorical_variables = [
'make', 'engine_fuel_type', 'transmission_type', 'driven_wheels',
'market_category', 'vehicle_size', 'vehicle_style'
]
categories = {}
for c in categorical_variables:
categories[c] = list(df[c].value_counts().head().index)
categories
{'make': ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge'],
 'engine_fuel_type': ['regular_unleaded', 'premium_unleaded_(required)', 'premium_unleaded_(recommended)', 'flex-fuel_(unleaded/e85)', 'diesel'],
 'transmission_type': ['automatic', 'manual', 'automated_manual', 'direct_drive', 'unknown'],
 'driven_wheels': ['front_wheel_drive', 'rear_wheel_drive', 'all_wheel_drive', 'four_wheel_drive'],
 'market_category': ['crossover', 'flex_fuel', 'luxury', 'luxury,performance', 'hatchback'],
 'vehicle_size': ['compact', 'midsize', 'large'],
 'vehicle_style': ['sedan', '4dr_suv', 'coupe', 'convertible', '4dr_hatchback']}
def prepare_X(df):
    df = df.copy()              # work on a copy so the original dataframe is not modified
    features = base.copy()

    df['age'] = 2017 - df.year  # create the 'age' feature
    features.append('age')

    for v in [2, 3, 4]:
        # one-hot encode the number of doors
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)

    for c, values in categories.items():
        # one-hot encode the (up to five) most frequent values of each categorical variable
        for v in values:
            df['%s_%s' % (c, v)] = (df[c] == v).astype('int')
            features.append('%s_%s' % (c, v))

    df_num = df[features]
    df_num = df_num.fillna(0)   # replace missing values with 0
    X = df_num.values
    return X
prepare_X(df_train)
array([[148., 4., 33., ..., 1., 0., 0.], [132., 4., 32., ..., 0., 0., 1.], [148., 4., 37., ..., 0., 0., 1.], ..., [285., 6., 22., ..., 0., 0., 0.], [563., 12., 21., ..., 0., 0., 0.], [200., 4., 31., ..., 0., 0., 0.]])
# Predicting on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
30.95303534636814
w0, w
(1.0988239383087294e+16, array([ 2.57718953e-01, -1.19934338e+01, 3.82516348e-01, 3.02111785e+00, -1.53845377e-03, 1.21209125e+00, 1.91949857e+03, 1.93164538e+03, 1.90555760e+03, -3.69174147e+00, 8.68736088e-01, 7.64802567e+00, -8.43561473e+00, 2.31619373e+00, 1.31628092e+02, 1.15952480e+02, 1.21941647e+02, 1.29232893e+02, 1.15296794e+02, -8.45903444e+15, -8.45903444e+15, -8.45903444e+15, -8.45903444e+15, -8.45903444e+15, -2.52920494e+15, -2.52920494e+15, -2.52920494e+15, -2.52920494e+15, 3.10643155e+00, 4.16776654e+00, -4.10798492e-01, -4.66244197e+00, -1.26472134e+01, 9.18260122e+00, 1.26076280e+01, 1.73773329e+01, -4.85492239e-02, 5.44797374e-02, 1.78241160e-01, 3.41906701e-01, -1.64412078e-01]))
2.13 Regularization¶
If the feature matrix has duplicated columns (or columns that are linear combinations of each other), the Gram matrix XTX = X.T.dot(X) does not have an inverse. In practice the columns are often only approximately duplicated, so the inverse can still be computed, but it is numerically unstable: the weights associated with those columns become extremely large, which hurts model performance. This is exactly what happened above, where the intercept and several weights blew up to values around 10^16 and the RMSE jumped to about 31.
To solve this issue, we add a small number r to the diagonal of XTX before inverting it. This is known as regularization (it is the idea behind ridge regression): the slightly larger diagonal makes the matrix well conditioned and invertible, and it keeps the weights from growing too large. The regularization value r is a parameter of the model, and after applying regularization the model performance improves.
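To see why duplicated columns break the normal equation, here is a tiny numerical illustration (not part of the original notebook): two identical columns make XTX singular, and adding a small value to its diagonal makes it invertible again.

# Tiny illustration of why regularization helps (np is numpy, already imported above)
X_dup = np.array([
    [1.0, 1.0, 2.0],
    [2.0, 2.0, 1.0],
    [3.0, 3.0, 5.0],
])                                   # the first two columns are identical

XTX = X_dup.T.dot(X_dup)
np.linalg.det(XTX)                   # essentially zero: the matrix is singular

XTX_reg = XTX + 0.01 * np.eye(3)     # add a small value r to the diagonal
np.linalg.inv(XTX_reg)               # now the inverse exists and is well behaved

Below we add this regularization to the training function.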
def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])        # prepend the bias column of ones

    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])  # regularization: add r to the diagonal
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]          # bias term, remaining weights
# Predicting on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=0.01)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
0.45652199012705297
2.14 Tuning the model¶
Tuning consists of finding the regularization value that gives the lowest RMSE on the validation partition. Once the best value is found, the model is retrained with that regularization parameter.
for r in [0.0, 0.00001, 0.0001, 0.001, 0.1, 1, 10]:
# Predicting on the training data set
X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=r)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
score = rmse(y_val, y_pred)
print(r, w0, score)
0.0 1.0988239383087294e+16 30.95303534636814
1e-05 9.263833450169736 0.45651701545049306
0.0001 6.330946383506262 0.45651706300621603
0.001 6.285522284126121 0.456517508702663
0.1 6.191208657252675 0.45656927630429756
1 5.634896667950106 0.45722043179967253
10 4.283980108969658 0.47014569320991684
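Instead of reading the best value off the printout, we could also pick it programmatically. Here is a small sketch that reuses the variables already defined above:

# Sketch: collect (r, rmse) pairs and keep the r with the lowest validation RMSE
scores = {}
for r in [0.0, 0.00001, 0.0001, 0.001, 0.1, 1, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)
    y_pred = w0 + X_val.dot(w)
    scores[r] = rmse(y_val, y_pred)

best_r = min(scores, key=scores.get)   # r with the lowest validation RMSE

The scores for the small non-zero values of r are essentially tied, so choosing r = 0.001, as we do below, is a reasonable compromise between a low RMSE and stable weights.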
r = 0.001
X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=r)
# Predicting on the validation data set
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
score = rmse(y_val, y_pred)
print(r, w0, score)
0.001 6.285522284126121 0.456517508702663
2.15 Using the model¶
After finding the best model and its parameters, we train it on the combined training and validation partitions and calculate the final evaluation on the test partition. Finally, we use the model to predict the price of new cars.
Now that we have validated our model with the validation set, the validation data can be merged back into the training data, giving us one large training set that comprises 80% of the original dataset.
df_full_train = pd.concat([df_train, df_val])
df_full_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | num_doors_2 | num_doors_3 | num_doors_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 1.0 | 0.0 | 0.0 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 0.0 | 0.0 | 1.0 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 0.0 | 0.0 | 1.0 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 0.0 | 1.0 | 0.0 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 0.0 | 0.0 | 1.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2377 | volvo | v60 | 2015 | regular_unleaded | 240.0 | 4.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | wagon | 37 | 25 | 870 | NaN | NaN | NaN |
2378 | maserati | granturismo_convertible | 2015 | premium_unleaded_(required) | 444.0 | 8.0 | automatic | rear_wheel_drive | 2.0 | exotic,luxury,high-performance | midsize | convertible | 20 | 13 | 238 | NaN | NaN | NaN |
2379 | cadillac | escalade_hybrid | 2013 | regular_unleaded | 332.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | luxury,hybrid | large | 4dr_suv | 23 | 20 | 1624 | NaN | NaN | NaN |
2380 | mitsubishi | lancer | 2016 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 4.0 | NaN | compact | sedan | 34 | 24 | 436 | NaN | NaN | NaN |
2381 | kia | sorento | 2015 | regular_unleaded | 290.0 | 6.0 | automatic | front_wheel_drive | 4.0 | crossover | midsize | 4dr_suv | 25 | 18 | 1720 | NaN | NaN | NaN |
9532 rows × 18 columns
Above we can see that there are 9532 rows, but the index still carries over from the original train and validation dataframes, so we reset it.
df_full_train = df_full_train.reset_index(drop=True)
df_full_train
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | num_doors_2 | num_doors_3 | num_doors_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 1.0 | 0.0 | 0.0 |
1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 0.0 | 0.0 | 1.0 |
2 | subaru | impreza | 2016 | regular_unleaded | 148.0 | 4.0 | automatic | all_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 37 | 28 | 640 | 0.0 | 0.0 | 1.0 |
3 | volkswagen | vanagon | 1991 | regular_unleaded | 90.0 | 4.0 | manual | rear_wheel_drive | 3.0 | NaN | large | passenger_minivan | 18 | 16 | 873 | 0.0 | 1.0 | 0.0 |
4 | ford | f-150 | 2017 | flex-fuel_(unleaded/e85) | 385.0 | 8.0 | automatic | four_wheel_drive | 4.0 | flex_fuel | large | crew_cab_pickup | 21 | 15 | 5657 | 0.0 | 0.0 | 1.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
9527 | volvo | v60 | 2015 | regular_unleaded | 240.0 | 4.0 | automatic | front_wheel_drive | 4.0 | luxury | midsize | wagon | 37 | 25 | 870 | NaN | NaN | NaN |
9528 | maserati | granturismo_convertible | 2015 | premium_unleaded_(required) | 444.0 | 8.0 | automatic | rear_wheel_drive | 2.0 | exotic,luxury,high-performance | midsize | convertible | 20 | 13 | 238 | NaN | NaN | NaN |
9529 | cadillac | escalade_hybrid | 2013 | regular_unleaded | 332.0 | 8.0 | automatic | rear_wheel_drive | 4.0 | luxury,hybrid | large | 4dr_suv | 23 | 20 | 1624 | NaN | NaN | NaN |
9530 | mitsubishi | lancer | 2016 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 4.0 | NaN | compact | sedan | 34 | 24 | 436 | NaN | NaN | NaN |
9531 | kia | sorento | 2015 | regular_unleaded | 290.0 | 6.0 | automatic | front_wheel_drive | 4.0 | crossover | midsize | 4dr_suv | 25 | 18 | 1720 | NaN | NaN | NaN |
9532 rows × 18 columns
Now that we have concatenated the train and validation sets and reset the index, we can use our previously built prepare_X function to prepare the full training set.
X_full_train = prepare_X(df_full_train)
X_full_train
array([[148., 4., 33., ..., 1., 0., 0.], [132., 4., 32., ..., 0., 0., 1.], [148., 4., 37., ..., 0., 0., 1.], ..., [332., 8., 23., ..., 0., 0., 0.], [148., 4., 34., ..., 0., 0., 0.], [290., 6., 25., ..., 0., 0., 0.]])
We also need to concatenate the y’s together.
y_full_train = np.concatenate([y_train, y_val])
y_full_train
array([ 9.57574708, 9.887663 , 9.89323518, ..., 11.21756062, 9.77542688, 10.1924563 ])
w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)
w0, w
(6.321897140639891, array([ 1.52506334e-03, 1.18188694e-01, -6.66105724e-03, -5.33414117e-03, -4.87603196e-05, -9.69091849e-02, -7.92623108e-01, -8.90864322e-01, -6.35103033e-01, -4.14339218e-02, 1.75560737e-01, -5.78067084e-04, -1.00563873e-01, -9.27549683e-02, -4.66859089e-01, 7.98659955e-02, -3.16047638e-01, -5.51981604e-01, -7.89525255e-02, 1.09536726e+00, 9.20059720e-01, 1.14963711e+00, 2.65277321e+00, 5.09996289e-01, 1.62933899e+00, 1.53004304e+00, 1.61722175e+00, 1.54522114e+00, -9.70559788e-02, 3.73062078e-02, -5.81767461e-02, -2.35940808e-02, -1.19357029e-02, 2.18895262e+00, 2.07458271e+00, 2.05916687e+00, -5.00802769e-02, 5.62184639e-02, 1.84794024e-01, 3.32646151e-01, -1.58817038e-01]))
X_test = prepare_X(df_test)
y_pred = w0 + X_test.dot(w)
score = rmse(y_test, y_pred)
score
0.45177493042600725
Now that our model has been trained, we want to try it on one of the cars in our dataset. We pick the car at index 20 in the test set and convert its data into a dictionary, just as a website might send it.
car = df_test.iloc[20].to_dict()
car
{'make': 'toyota', 'model': 'sienna', 'year': 2015, 'engine_fuel_type': 'regular_unleaded', 'engine_hp': 266.0, 'engine_cylinders': 6.0, 'transmission_type': 'automatic', 'driven_wheels': 'front_wheel_drive', 'number_of_doors': 4.0, 'market_category': nan, 'vehicle_size': 'large', 'vehicle_style': 'passenger_minivan', 'highway_mpg': 25, 'city_mpg': 18, 'popularity': 2031}
Since prepare_X expects a dataframe, we convert the dictionary sent from the website into a one-row dataframe.
df_small = pd.DataFrame([car])
df_small
make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | toyota | sienna | 2015 | regular_unleaded | 266.0 | 6.0 | automatic | front_wheel_drive | 4.0 | NaN | large | passenger_minivan | 25 | 18 | 2031 |
Now we use our prepare_X function to build the feature matrix for this single car.
X_small = prepare_X(df_small)
X_small
array([[2.660e+02, 6.000e+00, 2.500e+01, 1.800e+01, 2.031e+03, 2.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]])
With the feature matrix prepared, we can compute the prediction our model makes.
y_pred = w0 + X_small.dot(w)
y_pred = y_pred[0]
y_pred
10.462651726547548
Remember, the target was log-transformed with log1p, so we apply expm1 here to undo it and get a dollar amount.
np.expm1(y_pred)
34983.19708133757
Let’s apply the same transformation to the actual value, and we can see that our prediction is very close to the real price.
np.expm1(y_test[20])
35000.00000000001
2.16 Car price prediction project summary¶
In summary, this session covered data preparation, exploratory data analysis, the validation framework, the linear regression model in vector form, the normal equation, a baseline model, root mean squared error, feature engineering, regularization, model tuning, and using the best model on new data. All of these concepts were illustrated with the car price prediction problem.