House Prices Prediction Using Deep Learning
Keras-Regression vs Multiple Linear Regression
In this tutorial, we’re going to create a model to predict house prices 🏡 based on various factors across different markets.
Problem Statement
The goal of this statistical analysis is to understand how house features relate to, and can be used to predict, the house price.
Objective
- Predict the house price
- Compare two different models in terms of minimizing the difference between predicted and actual price
Data used: Kaggle-kc_house Dataset
GitHub: you can find my source code here
Step 1: Exploratory Data Analysis (EDA)
First, let’s import the data and have a look to see what kind of data we are dealing with:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# import data
Data = pd.read_csv('kc_house_data.csv')
Data.head(5).T

# get some information about our dataset
Data.info()
Data.describe().transpose()
- Date: date the house was sold
- Price: sale price (the prediction target)
- Bedrooms: number of bedrooms
- Bathrooms: number of bathrooms
- Sqft_Living: square footage of the home
- Sqft_Lot: square footage of the lot
- Floors: total floors (levels) in the house
- Waterfront: whether the house has a view to a waterfront
- View: has been viewed
- Condition: how good the condition is (overall)
- Grade: grade given to the housing unit, based on the King County grading system
- Sqft_Above: square footage of the house apart from the basement
- Sqft_Basement: square footage of the basement
- Yr_Built: year the house was built
- Yr_Renovated: year the house was renovated
- Zipcode: zip code
- Lat: latitude coordinate
- Long: longitude coordinate
- Sqft_Living15: living area in 2015 (implies some renovations)
- Sqft_Lot15: lot area in 2015 (implies some renovations)

Let’s plot a couple of features to get a better feel for the data:
# visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(Data['price'])
fig.add_subplot(2,1,2)
sns.boxplot(x=Data['price'])
plt.tight_layout()

# visualizing square footage of (home, lot, above and basement)
fig = plt.figure(figsize=(16,5))
fig.add_subplot(2,2,1)
sns.scatterplot(x=Data['sqft_above'], y=Data['price'])
fig.add_subplot(2,2,2)
sns.scatterplot(x=Data['sqft_lot'], y=Data['price'])
fig.add_subplot(2,2,3)
sns.scatterplot(x=Data['sqft_living'], y=Data['price'])
fig.add_subplot(2,2,4)
sns.scatterplot(x=Data['sqft_basement'], y=Data['price'])

# visualizing bedrooms, bathrooms, floors, grade
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(x=Data['bedrooms'])
fig.add_subplot(2,2,2)
sns.countplot(x=Data['floors'])
fig.add_subplot(2,2,3)
sns.countplot(x=Data['bathrooms'])
fig.add_subplot(2,2,4)
sns.countplot(x=Data['grade'])
plt.tight_layout()
The distribution plot of price shows that most prices fall between 0 and roughly $1M, with a few outliers close to $8 million (fancy houses 😉). It would make sense to drop those outliers in our analysis.
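The analysis below keeps all rows, but as a minimal sketch of how one might drop those outliers (the $3M cutoff here is an illustrative assumption, not a value from this tutorial):

# hypothetical cutoff: keep houses priced under $3M to trim the long right tail
Data = Data[Data['price'] < 3_000_000]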
It is quite useful to have a quick overview of how the different feature distributions relate to house price.
Here, I’m breaking the date column down into years and months to see how the house price changes over time.
# let's break date into years and months
Data['date'] = pd.to_datetime(Data['date'])
Data['month'] = Data['date'].apply(lambda date: date.month)
Data['year'] = Data['date'].apply(lambda date: date.year)

# data visualization: house price vs months and years
fig = plt.figure(figsize=(16,5))
fig.add_subplot(1,2,1)
Data.groupby('month')['price'].mean().plot()
fig.add_subplot(1,2,2)
Data.groupby('year')['price'].mean().plot()
Let’s check whether we have any null values and also drop some columns that we do not need.
# check if there are any null values
Data.isnull().sum()

# drop some unnecessary columns
Data = Data.drop('date', axis=1)
Data = Data.drop('id', axis=1)
Data = Data.drop('zipcode', axis=1)
Step 2: Train Test Split and Scaling
Data is divided into a train set and a test set. We use the train set to make the algorithm learn the data’s behavior and then check the accuracy of our model on the test set. Let’s separate the features (X) from the target (y):
X = Data.drop('price', axis=1).values
y = Data['price'].values

# splitting train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
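As a quick sanity check (my addition, not in the original post), we can confirm the split sizes:

# roughly 67% of the rows go to training and 33% to testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)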
Feature scaling will help us see all the variables from the same lens (same scale); it will also help our models learn faster.
# standardization scaler - fit & transform on train, transform only on test
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(float))
X_test = s_scaler.transform(X_test.astype(float))
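As a small sanity check (again my addition), the standardized training features should now have mean ≈ 0 and standard deviation ≈ 1:

# each column of the scaled training set should be centered at 0 with unit variance
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))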
Step 3: Model Selection and Evaluation
💡 Model 1: Multiple Linear Regression
Multiple Linear Regression is an extension of Simple Linear Regression (read more here) and assumes that there is a linear relationship between a dependent variable Y and independent variables X.
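In equation form, the model fits one coefficient per feature:

y = β0 + β1·x1 + β2·x2 + … + β19·x19 + ε

where y is the price, x1…x19 are our 19 standardized features, β0 is the intercept, and ε is the error term.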
Let’s train the regression model:
# Multiple Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)

# predicting the test set result
y_pred = regressor.predict(X_test)

# put results in a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, Data.drop('price', axis=1).columns, columns=['Coefficient'])
coeff_df
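Because the features were standardized, the coefficient magnitudes are roughly comparable; as an extra step (my addition), we can sort them to see which features move the price most:

# largest positive coefficients first, largest negative last
coeff_df.sort_values('Coefficient', ascending=False)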
By visualizing the residuals, we can see that they are approximately normally distributed (evidence of a linear relationship with the dependent variable).
# visualizing residuals
fig = plt.figure(figsize=(10,5))
residuals = (y_test - y_pred)
sns.distplot(residuals)
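For a number to go with the plot (my addition, not from the original tutorial), one can check the skewness of the residuals; values near 0 suggest a roughly symmetric, normal-like distribution:

# skewness near 0 indicates an approximately symmetric residual distribution
print(pd.Series(residuals).skew())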
Let’s compare the actual and predicted values to measure how far our predictions are from the real house prices.
# compare actual output values with predicted values
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)
df1

# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:', metrics.explained_variance_score(y_test, y_pred))
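To put the RMSE in context (a rough, informal gauge I’m adding here, not part of the original tutorial), we can express it as a share of the average sale price in the test set:

# RMSE relative to the mean house price; smaller is better
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE / mean price:', round(rmse / y_test.mean(), 2))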
💡 Model 2: Keras Regression
Let’s create a baseline neural network model for the regression problem, starting with all of the needed functions and objects.
# Creating a Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
Since we have 19 features, let’s start with 19 neurons in each layer, using 4 hidden layers and 1 output layer to predict the house price.
Also, the Adam optimization algorithm is used to minimize the loss function (mean squared error).
# having 19 neurons per layer is based on the number of available features
model = Sequential()
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(1))

model.compile(optimizer='Adam', loss='mse')
Then we train the model for 400 epochs, recording the training and validation loss in the history object at each epoch. To keep track of how well the model is performing, it evaluates the loss on both the train and test data.
model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), batch_size=128, epochs=400)
model.summary()

loss_df = pd.DataFrame(model.history.history)
loss_df.plot(figsize=(12,8))
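One common refinement, not used in the original run, is early stopping, so training halts once the validation loss stops improving. A minimal sketch (the patience value of 25 is an assumption):

from tensorflow.keras.callbacks import EarlyStopping

# stop training when validation loss has not improved for 25 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)
model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
          batch_size=128, epochs=400, callbacks=[early_stop])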
Evaluation on Test Data
y_pred = model.predict(X_test).flatten()  # predict() returns an (n, 1) array; flatten it to 1-D

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:', metrics.explained_variance_score(y_test, y_pred))

# visualizing our predictions
fig = plt.figure(figsize=(10,5))
plt.scatter(y_test, y_pred)
# perfect predictions would fall on this line
plt.plot(y_test, y_test, 'r')

# visualizing residuals
fig = plt.figure(figsize=(10,5))
residuals = (y_test - y_pred)
sns.distplot(residuals)
Keras Regression vs Multiple Linear Regression!
We made it! 💪
We have predicted the house price using two different ML algorithms.
The explained variance score of our multiple linear regression is around 69%, so this model had room for improvement. We then reached ~81% with the Keras regression model.
Also, notice that the RMSE (loss function) is lower for the Keras regression model, which shows that its predictions are closer to the actual prices.
Unsurprisingly, these scores can be improved further through feature selection or by using other regression models.
Thank you for reading. As always, feedback is welcome!