House Prices Prediction Using Deep Learning
Keras-Regression vs Multiple Linear Regression
In this tutorial, we’re going to create a model to predict house prices based on various features of the homes.
Problem Statement
The goal of this statistical analysis is to understand the relationship between house features and price, and how these variables can be used to predict a house’s price.
Objective
- Predict house prices
- Compare two different models in terms of how well they minimize the difference between predicted and actual prices
Data used: the Kaggle kc_house dataset
GitHub: you can find my source code here
Step 1: Exploratory Data Analysis (EDA)
First, let’s import the data and have a look at what kind of data we are dealing with:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# import data
Data = pd.read_csv('kc_house_data.csv')
Data.head(5).T

# get some information about our dataset
Data.info()
Data.describe().transpose()
- Date: date the house was sold
- Price: price (the prediction target)
- Bedrooms: number of bedrooms
- Bathrooms: number of bathrooms
- Sqft_Living: square footage of the home
- Sqft_Lot: square footage of the lot
- Floors: total floors (levels) in the house
- Waterfront: whether the house has a waterfront view
- View: has been viewed
- Condition: how good the overall condition is
- Grade: grade given to the housing unit, based on the King County grading system
- Sqft_Above: square footage of the house apart from the basement
- Sqft_Basement: square footage of the basement
- Yr_Built: year built
- Yr_Renovated: year when the house was renovated
- Zipcode: ZIP code
- Lat: latitude coordinate
- Long: longitude coordinate
- Sqft_Living15: living room area in 2015 (implies some renovations)
- Sqft_Lot15: lot size area in 2015 (implies some renovations)

Let’s plot a couple of features to get a better feel for the data.
# visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(Data['price'])
fig.add_subplot(2,1,2)
sns.boxplot(Data['price'])
plt.tight_layout()

# visualizing square footage of home, lot, above-ground and basement vs. price
fig = plt.figure(figsize=(16,5))
fig.add_subplot(2,2,1)
sns.scatterplot(Data['sqft_above'], Data['price'])
fig.add_subplot(2,2,2)
sns.scatterplot(Data['sqft_lot'], Data['price'])
fig.add_subplot(2,2,3)
sns.scatterplot(Data['sqft_living'], Data['price'])
fig.add_subplot(2,2,4)
sns.scatterplot(Data['sqft_basement'], Data['price'])

# visualizing counts of bedrooms, floors, bathrooms and grade
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(Data['bedrooms'])
fig.add_subplot(2,2,2)
sns.countplot(Data['floors'])
fig.add_subplot(2,2,3)
sns.countplot(Data['bathrooms'])
fig.add_subplot(2,2,4)
sns.countplot(Data['grade'])
plt.tight_layout()
The distribution plot of price shows that most prices fall between 0 and around $1M, with a few outliers close to $8M (fancy houses). It would make sense to drop those outliers in our analysis.
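The rest of this walkthrough keeps the full dataset, but as a minimal sketch (not part of the original code) of how those outliers could be dropped, with a $3M cutoff chosen purely for illustration:

# optional: drop extreme price outliers (the $3M cutoff is an illustrative choice)
Data_filtered = Data[Data['price'] < 3_000_000]
print('Rows kept:', len(Data_filtered), 'of', len(Data))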
It is quite useful to have a quick overview of how the different features are distributed against house price.
Next, I’m breaking the date column down into years and months to see how the house price changes over time.
# let's break the date down into years and months
Data['date'] = pd.to_datetime(Data['date'])
Data['month'] = Data['date'].apply(lambda date: date.month)
Data['year'] = Data['date'].apply(lambda date: date.year)

# visualizing house price vs. months and years
fig = plt.figure(figsize=(16,5))
fig.add_subplot(1,2,1)
Data.groupby('month').mean()['price'].plot()
fig.add_subplot(1,2,2)
Data.groupby('year').mean()['price'].plot()
Let’s check if we have any null values and also drop some columns that we do not need.
# check if there are any null values
Data.isnull().sum()

# drop some unnecessary columns
Data = Data.drop('date', axis=1)
Data = Data.drop('id', axis=1)
Data = Data.drop('zipcode', axis=1)
Step 2: Train Test Split and Scaling
The data is divided into a train set and a test set. We use the train set to let the algorithm learn the data’s behavior, and then check the accuracy of our model on the test set.
# assign the features to X and the target (price) to y
X = Data.drop('price', axis=1).values
y = Data['price'].values

# splitting into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
Feature scaling lets us see all the variables through the same lens (the same scale), and it also helps our models learn faster.
# standardization scaler - fit & transform on the train set, transform only on the test set
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(float))
X_test = s_scaler.transform(X_test.astype(float))
Step 3: Model Selection and Evaluation
Model 1: Multiple Linear Regression
Multiple Linear Regression is an extension of Simple Linear Regression (read more here) and assumes that there is a linear relationship between a dependent variable Y and the independent variables X.
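Concretely, the model takes the standard form (standard notation, not from the original post):

Y = β0 + β1·X1 + β2·X2 + … + βn·Xn + ε

where each βi is a coefficient the model learns from the data and ε is the error term; fitting the model amounts to finding the β values that minimize the squared differences between predicted and actual prices.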
Let’s set up and train our regression model:
# Multiple Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)

# predicting the test set results
y_pred = regressor.predict(X_test)

# put the coefficients into a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, Data.drop('price', axis=1).columns, columns=['Coefficient'])
coeff_df
By visualizing the residuals, we can see that they are roughly normally distributed (consistent with a linear relationship between the features and the dependent variable).
# visualizing residuals
fig = plt.figure(figsize=(10,5))
residuals = (y_test - y_pred)
sns.distplot(residuals)
Let’s compare the actual values with the predicted values to measure how far our predictions are from the real house prices.
# compare actual output values with predicted values
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)
df1

# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:', metrics.explained_variance_score(y_test, y_pred))
Model 2: Keras Regression
Let’s create a baseline neural network model for the regression problem, starting by importing all of the needed functions and objects.
# Creating a Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
Since we have 19 features, let’s start with 19 neurons per layer: 4 hidden layers plus 1 output layer that predicts the house price.
The Adam optimization algorithm is used to minimize the loss function (mean squared error).
# having 19 neurons per layer is based on the number of available features
model = Sequential()
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(1))

model.compile(optimizer='Adam', loss='mse')
Then we train the model for 400 epochs, recording the training and validation loss in the history object. To keep track of how well the model is performing, the loss is calculated on both the train and test data at each epoch.
model.fit(x=X_train, y=y_train, validation_data=(X_test,y_test), batch_size=128, epochs=400)
model.summary()

loss_df = pd.DataFrame(model.history.history)
loss_df.plot(figsize=(12,8))
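Rather than always running the full 400 epochs, one optional variation (not part of the original walkthrough) is to stop training automatically once the validation loss stops improving, using Keras’s EarlyStopping callback; the patience value below is an arbitrary choice:

from tensorflow.keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 25 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)
model.fit(x=X_train, y=y_train, validation_data=(X_test,y_test),
          batch_size=128, epochs=400, callbacks=[early_stop])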
Evaluation on Test Data
# flatten the (n,1) Keras output so it matches the shape of y_test
y_pred = model.predict(X_test).flatten()

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:', metrics.explained_variance_score(y_test, y_pred))

# visualizing our predictions against the actual values
fig = plt.figure(figsize=(10,5))
plt.scatter(y_test, y_pred)
# perfect predictions would fall on this line
plt.plot(y_test, y_test, 'r')

# visualizing residuals
fig = plt.figure(figsize=(10,5))
residuals = (y_test - y_pred)
sns.distplot(residuals)
Keras Regression vs Multiple Linear Regression!
We made it! We have predicted house prices using two different ML algorithms.
The explained variance score of our Multiple Linear Regression model is around 69%, so that model had room for improvement. With the Keras regression model, we got a score of ~81%.
Also, notice that the RMSE is lower for the Keras regression model, which shows that its predictions are closer to the actual prices.
Unsurprisingly, this score could be improved further through feature selection or by trying other regression models.
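As one illustration of such an alternative (a sketch, not part of the original tutorial), a random forest regressor from scikit-learn could be dropped into the same train/test split; n_estimators=100 and the other defaults below are arbitrary starting points rather than tuned values:

# a minimal sketch of an alternative model, reusing the scaled split from above
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=101)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, rf_pred)))
print('VarScore:', metrics.explained_variance_score(y_test, rf_pred))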
Thank you for reading. As always, feedback is welcome!