Linear Regression - My ML Learning Journey 001

Introduction

Hi guys, welcome to my blog! I am Sivvie, and it is an honour to meet you here! In this blog post I will share my process and learning journey for submitting my first ever Kaggle entry, on the topic of Linear Regression in Machine Learning. It is not a tutorial, as I do not possess that kind of knowledge; I merely want to record my learning process for this brand new topic!

The Kaggle competition that I picked is the Beginner Competition which is right over here: House Prices - Advanced Regression Techniques | Kaggle

After entering the competition and downloading the CSV data files to my local machine, here is where it started 😆

Import Data

First of all I imported the basics into my local Jupyter Notebook.

import pandas as pd # For reading the CSV files
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns # For data visualization
import numpy as np # For math operations

These are the packages that first came to my mind when I started:

  • pandas - for reading the CSV data files
  • matplotlib.pyplot - for data visualization such as bar graphs or scatter plots
  • seaborn - for data visualization such as bar plots
  • numpy - for mathematical operations

If there is an error when importing these packages, please make sure to pip install them into your local environment (e.g. pip install pandas matplotlib seaborn numpy).

After that, I was ready to take a look at the data:

train_data = pd.read_csv('local_path\\train.csv')  
test_data = pd.read_csv('local_path\\test.csv')  
train_data.head()

Running the cell, the .head() method gave me the first 5 rows of the data, which look like this:

[Image: first five rows of train_data]

Running another cell with train_data.shape gave me the shape of the data, which looks like this: (1460, 81). This tells me the data has 1460 rows and 81 columns, which I interpreted as 79 features after excluding Id and SalePrice (the output).
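For reference, that check is just one line in its own cell:

train_data.shape  # (1460, 81): 1460 rows, 81 columns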

Data Visualization

Then I wanted to see what SalePrice looks like in a histogram, so I plotted it using the code below:

plt.hist(train_data.SalePrice, color='blue')  
plt.show()

It shows something like this:

[Image: histogram of SalePrice]

So most of the SalePrice values fall around $150,000-$200,000. I also learned a trick of applying np.log() to reduce skew in the distribution. This graph does not look that skewed to me, but when I tried it, I think it does look better distributed? More like a normal distribution.

plt.hist(np.log(train_data.SalePrice), color='blue')  
plt.show()

[Image: histogram of np.log(SalePrice)]

Next, I wanted to find which features affect the output the most, so I ran the cell below:

corr = train_data.corr()  
print(corr['SalePrice'].sort_values(ascending=False))

And this gave me something like the following (I only show the features with a correlation above 0.5 here):

Feature         Correlation
SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101

Now I know the feature that affects SalePrice the most is OverallQual, followed by GrLivArea and GarageCars.

If I plot them as graphs, they look like this:

overall_quality = train_data.pivot_table(index='OverallQual', values='SalePrice')  
overall_quality.plot(kind='bar', color='green')  
plt.xlabel('Overall Quality')  
plt.ylabel('Sale Price')  
plt.xticks(rotation=0)  
plt.show()

[Image: bar chart of average SalePrice by OverallQual]

This shows very clearly that the higher the overall quality of the house, the higher the sale price. I thought the same would happen with GrLivArea, but it was not quite what I expected. When I applied the same code to GrLivArea it gave me something like this:

[Image: bar chart of average SalePrice by GrLivArea]

That is not quite as consistent as OverallQual, so I looked around the internet and then decided to try out a scatter plot with the code cell below:

plt.scatter(x=train_data['GrLivArea'], y=train_data['SalePrice'])  
plt.ylabel('Sale Price')  
plt.xlabel('GrLivArea')  
plt.show()

[Image: scatter plot of GrLivArea vs SalePrice]

This gave much better insight into where most of the ground living area falls relative to Sale Price, along with some outliers, so I will probably have to do something about the outliers later. Next I decided to continue with the third feature, GarageCars, which is a discrete-valued feature, so I figured I probably could not plot it the same way as the other two. For GarageCars I used something called seaborn's barplot:

plt.figure(figsize=(10,5))  
sns.barplot(x=train_data['GarageCars'],y=train_data['SalePrice'],palette='Reds')  
plt.ylabel('Sale Price')  
plt.show()

[Image: bar plot of SalePrice by GarageCars]

I tried a couple more features while toying around with both matplotlib and seaborn for data visualization, so that I could get insights into the relationship between the features and the ultimate output, Sale Price (see the small sketch below). I also found that some features have NaN values, which need to be eliminated.
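As a rough sketch (not my exact code), repeating the scatter plot for a few more of the highly correlated features from the table above could look something like this:

# Sketch: loop over a few more of the highly correlated features and plot each against SalePrice
for feature in ['TotalBsmtSF', 'GarageArea', '1stFlrSF']:
    plt.scatter(x=train_data[feature], y=train_data['SalePrice'], alpha=0.5)
    plt.xlabel(feature)
    plt.ylabel('Sale Price')
    plt.show()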

Data Pre-Processing

The next step is to eliminate the NaN values from the data set, so I ran the code below to check the number of NaN values for every feature:

train_data.isna().sum().sort_values(ascending=False)

Feature         Number of NaN values
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu     690
LotFrontage     259
GarageYrBlt     81
GarageCond      81
GarageType      81
GarageFinish    81
GarageQual      81
BsmtFinType2    38
BsmtExposure    38
BsmtQual        37
BsmtCond        37
BsmtFinType1    37
MasVnrArea      8
MasVnrType      8

That looks like a lot of NaN values to me, so again I looked around for good ways to eliminate them, and I went with this to try out:

train_data = train_data.select_dtypes(include=[np.number]).interpolate().dropna()  
train_data.isna().sum().sort_values(ascending=False)

The table now shows 0 for everything, so I presumed the NaN values have been eliminated. The next step for me is to eliminate the outliers. Again I searched around the internet, and I went with IsolationForest, which seemed interesting.

from sklearn.ensemble import IsolationForest  

isolation_forest = IsolationForest(contamination=0.1)  
isolated_result = isolation_forest.fit_predict(train_data)  

# fit_predict returns -1 for outliers and 1 for normal rows, so keep only the normal rows
is_inlier = isolated_result != -1  
train_data = train_data[is_inlier]
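A quick sanity check (as a sketch) that rows were actually dropped:

train_data.shape  # roughly 10% fewer rows than before, since contamination=0.1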

Model Training

I guess I can start to prepare for model training now! First of all I defined the X and y, where y is the Sale Price and X is the features, excluding the output and the Id. I defined y as the log of SalePrice so that it scales better, I guess:

y = np.log(train_data.SalePrice)  
X = train_data.drop(['SalePrice', 'Id'], axis=1)

Then I used something called train_test_split from scikit-learn to separate the training data into two parts, train and test (70% for training and 30% for testing):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

Now that I have train and test sets for both X and y, let's train the LinearRegression model!

from sklearn import linear_model  
from sklearn.metrics import mean_squared_error  

regressor = linear_model.LinearRegression()  
model = regressor.fit(X_train, y_train)

After I have the trained model, the next step is to evaluate its performance using the test data.

Model Evaluation

I used the R^2 value and the Mean Squared Error (MSE) value for evaluation: the R-squared value explains how well the model fits the data (between 0 and 1, the higher the better), and the mean squared error measures the average squared prediction error (the lower the better; unlike R^2 it is not bounded between 0 and 1).
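For reference, these are the standard definitions, where y_i is the actual value, ŷ_i is the prediction, and ȳ is the mean of the actual values:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$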

print(f"The R^2 value is {model.score(X_test, y_test)}")  
predictions = model.predict(X_test)  
print(f"The MSE value is {mean_squared_error(y_test, predictions)}")

The result gave me:

[Image: printed R^2 and MSE values]

This is not considered very good, I guess, but for my first try I felt... pretty satisfied? So I decided to go with this first, without any further tweaks. I also plotted a scatter plot to try and see how the predictions performed:

actual = y_test  
plt.scatter(predictions, actual, alpha=0.5, color='g')  
plt.xlabel('Predicted Price')  
plt.ylabel('Actual Price')  
plt.title('House Price Prediction By Linear Regression')  
plt.show()

[Image: scatter plot of predicted vs actual prices]

Conclusion

Voila! I think I finished my first try at a Linear Regression problem! I then made a submission to Kaggle after generating the predictions for test_data for the submission file. I placed around the middle of the leaderboard, so I guess there is much to improve. Looking back at the code, I also noticed that in the end I probably only trained the model with the numeric features; maybe next time, as an improvement, I can add in the non-numeric (categorical) features, which is something new to learn!
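For reference, here is a minimal sketch of how that submission file could be created (this is a reconstruction, not the exact code I ran; it assumes the same numeric-only preprocessing that was applied to train_data):

# Apply the same numeric-only treatment to test_data as was used for train_data
test_features = test_data.select_dtypes(include=[np.number]).interpolate()
test_features = test_features.fillna(test_features.mean())  # in case any NaN survives interpolation

test_X = test_features.drop(['Id'], axis=1)            # drop Id, same as for X
test_predictions = np.exp(model.predict(test_X))        # undo the np.log() applied to SalePrice

submission = pd.DataFrame({'Id': test_features['Id'], 'SalePrice': test_predictions})
submission.to_csv('submission.csv', index=False)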

Thank you for reading, and if you have any suggestions for improving this practice, please do not hesitate to comment! I have a lot to learn and I hope I can improve more and more with practice!
