Building a housing recommendation engine

So far, we have engineered location-specific features for our data set, explicitly defined weights for the different attributes, and arrived at a rank. Instead, we could simply like and dislike a few houses and let a machine learning model infer our preferences from that. That is what this notebook tries to do.

Since it is time consuming to like and dislike a large number of properties, we pick the top 50 houses from our previous ranking and like them all. We dislike the remaining ones.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
import seaborn as sns

from arcgis.gis import GIS
from arcgis.features import Feature, FeatureLayer, FeatureSet, GeoAccessor, GeoSeriesAccessor

Read ranked dataset

In [2]:

prop_df = pd.read_csv('resources/houses_ranked.csv')
prop_df = pd.DataFrame.spatial.from_xy(prop_df, 'LONGITUDE','LATITUDE')

Generate preference column

We will pick the top 50 records and provide a positive preference to them. Then we will drop the score and rank columns and let the machine learning algorithm learn our preferences.

In [3]:

prop_df.columns

Out[3]:

Index(['Unnamed: 0', 'SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE',
       'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration', 'scores_scaled',
       'rank'],
      dtype='object')

In [4]:

prop_df.shape

Out[4]:

(331, 38)

Generate a preference list that is 331 records long, one entry per row of the DataFrame. The list holds 1 for the first 50 records and 0 for the rest.

In [5]:

preference_list = [1]*50
preference_list.extend([0]*(331-50))
len(preference_list)

Out[5]:

331

In [6]:

prop_df['favorite'] = preference_list
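
The hard-coded 331 in the cells above ties the labels to this particular file. As a minimal sketch, assuming the rows are already sorted best-first (as the text above implies), the same labels can be derived from the row count. `make_preference_labels` and `top_n` are hypothetical names used only for illustration:

```python
def make_preference_labels(n_rows, top_n=50):
    """Return 1 for the first `top_n` positions and 0 for the rest."""
    if top_n > n_rows:
        raise ValueError("top_n cannot exceed the number of rows")
    return [1 if i < top_n else 0 for i in range(n_rows)]


# 331 is the row count reported by prop_df.shape above
preference_list = make_preference_labels(331, top_n=50)
print(len(preference_list), sum(preference_list))
```

In the notebook itself, `n_rows` would be `len(prop_df)`, so the cell keeps working if the ranked file grows or shrinks.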

Drop `rank`, `scores_scaled` columns from DataFrame

In [7]:

prop_df.drop(columns=['Unnamed: 0','scores_scaled','rank'], inplace=True)
prop_df.head()

Out[7]:

	SALE TYPE	PROPERTY TYPE	ADDRESS	CITY	STATE	ZIP	PRICE	BEDS	BATHS	LOCATION	...	coffee_count	bars_count	gas_count	shops_count	travel_count	parks_count	edu_count	commute_length	commute_duration	favorite
0	MLS Listing	Single Family Residential	15986 SE Spokane Ct. Ave	Portland	OR	97236.0	543900.0	4.0	3.5	Portland Southeast	...	50	2	34	46	50	50	50	5.796321	16.509734	1
1	MLS Listing	Multi-Family (2-4 Unit)	SE Henderson St	Portland	OR	97206.0	625000.0	6.0	6.0	LENTS	...	50	1	50	40	50	50	50	8.380589	23.087985	1
2	MLS Listing	Single Family Residential	8268 SE Yamhill St	Portland	OR	97216.0	550000.0	7.0	4.0	Portland Southeast	...	50	1	50	43	50	50	50	6.330796	16.910622	1
3	MLS Listing	Single Family Residential	6311 SE Tenino St	Portland	OR	97206.0	479900.0	4.0	2.5	Portland Southeast	...	50	2	44	48	50	50	50	7.299694	20.389635	1
4	MLS Listing	Multi-Family (2-4 Unit)	2028 SE Harold St	Portland	OR	97202.0	699900.0	5.0	4.0	SELLWOOD - WEST MORELAND	...	50	2	40	45	50	50	50	3.710354	11.486135	1

5 rows × 36 columns

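One hot encoding

Rather than one-hot encoding the categorical columns, we drop them here, along with other columns that don't really determine a buyer's preference. Only numeric features remain.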

In [8]:

prop_df.columns

Out[8]:

Index(['SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
       'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration', 'favorite'],
      dtype='object')

In [11]:

train_df = prop_df.drop(columns=['SALE TYPE','PROPERTY TYPE','ADDRESS', 'CITY', 'STATE', 'ZIP','LOCATION',
                                 'DAYS ON MARKET','PRICE PER SQFT','STATUS',
                                 'URL', 'SOURCE', 'MLS', 'SHAPE','LATITUDE', 'LONGITUDE'])
train_df.head()

Out[11]:

	PRICE	BEDS	BATHS	SQUARE FEET	LOT SIZE	YEAR BUILT	HOA PER MONTH	grocery_count	restaurant_count	hospitals_count	coffee_count	bars_count	gas_count	shops_count	travel_count	parks_count	edu_count	commute_length	commute_duration	favorite
0	543900.0	4.0	3.5	3178.0	6969.0	2018.0	50.0	20	50	6	50	2	34	46	50	50	50	5.796321	16.509734	1
1	625000.0	6.0	6.0	2844.0	6969.0	2018.0	0.0	20	50	6	50	1	50	40	50	50	50	8.380589	23.087985	1
2	550000.0	7.0	4.0	3038.0	6969.0	2018.0	0.0	20	50	4	50	1	50	43	50	50	50	6.330796	16.910622	1
3	479900.0	4.0	2.5	2029.0	3920.0	2018.0	0.0	20	50	6	50	2	44	48	50	50	50	7.299694	20.389635	1
4	699900.0	5.0	4.0	2582.0	6969.0	2016.0	0.0	20	50	8	50	2	40	45	50	50	50	3.710354	11.486135	1

Scale numeric columns

As we did earlier, we use scikit-learn's MinMaxScaler, which rescales each numeric column to the range 0 to 1.

In [12]:

from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()

In [13]:

columns_to_scale = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'HOA PER MONTH','grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration']

In [15]:

scaled_array = mm_scaler.fit_transform(train_df[columns_to_scale])
prop_scaled = pd.DataFrame(scaled_array, columns=columns_to_scale)
prop_scaled.head()

Out[15]:

	PRICE	BEDS	BATHS	SQUARE FEET	LOT SIZE	YEAR BUILT	HOA PER MONTH	grocery_count	restaurant_count	hospitals_count	coffee_count	bars_count	gas_count	shops_count	travel_count	parks_count	edu_count	commute_length	commute_duration
0	0.572446	0.333333	0.3	0.448684	0.100778	0.947368	0.25	1.0	1.0	0.500000	1.0	1.0	0.68	0.92	1.0	1.0	1.0	0.002085	0.004983
1	0.794577	0.666667	0.8	0.321251	0.100778	0.947368	0.00	1.0	1.0	0.500000	1.0	0.5	1.00	0.80	1.0	1.0	1.0	0.003189	0.008123
2	0.589154	0.833333	0.4	0.395269	0.100778	0.947368	0.00	1.0	1.0	0.333333	1.0	0.5	1.00	0.86	1.0	1.0	1.0	0.002313	0.005175
3	0.397151	0.333333	0.1	0.010301	0.046518	0.947368	0.00	1.0	1.0	0.500000	1.0	1.0	0.88	0.96	1.0	1.0	1.0	0.002727	0.006835
4	0.999726	0.500000	0.4	0.221290	0.100778	0.842105	0.00	1.0	1.0	0.666667	1.0	1.0	0.80	0.90	1.0	1.0	1.0	0.001193	0.002586
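
To make the scaled values above easier to read, here is a minimal pure-Python sketch of min-max scaling, x' = (x - min) / (max - min). It is not the scikit-learn implementation, `min_max_scale` is a hypothetical helper, and the BEDS bounds of 2 and 8 are an assumption inferred from the displayed output rather than read from the data:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical BEDS column. The bounds 2 and 8 are inferred from the scaled
# output above (4 beds -> 0.333333, 6 beds -> 0.666667), not taken from the file.
beds = [2, 4, 5, 6, 7, 8]
scaled_beds = [round(v, 6) for v in min_max_scale(beds)]
print(scaled_beds)
```

Under that assumption, 4, 6, 7 and 5 bedrooms map to 0.333333, 0.666667, 0.833333 and 0.5, which matches the BEDS column in the first five scaled rows.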

Split dataset into training and test

In [29]:

prop_scaled.columns

Out[29]:

Index(['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT',
       'HOA PER MONTH', 'grocery_count', 'restaurant_count', 'hospitals_count',
       'coffee_count', 'bars_count', 'gas_count', 'shops_count',
       'travel_count', 'parks_count', 'edu_count', 'commute_length',
       'commute_duration'],
      dtype='object')

In [30]:

prop_scaled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 19 columns):
PRICE               331 non-null float64
BEDS                331 non-null float64
BATHS               331 non-null float64
SQUARE FEET         331 non-null float64
LOT SIZE            331 non-null float64
YEAR BUILT          331 non-null float64
HOA PER MONTH       331 non-null float64
grocery_count       331 non-null float64
restaurant_count    331 non-null float64
hospitals_count     331 non-null float64
coffee_count        331 non-null float64
bars_count          331 non-null float64
gas_count           331 non-null float64
shops_count         331 non-null float64
travel_count        331 non-null float64
parks_count         331 non-null float64
edu_count           331 non-null float64
commute_length      331 non-null float64
commute_duration    331 non-null float64
dtypes: float64(19)
memory usage: 49.2 KB

In [31]:

X = prop_scaled
y = train_df['favorite']

In [32]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33)

(len(X_train), len(X_test))

Out[32]:

(221, 110)
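
Only 50 of the 331 records are favorites, so a purely random split can leave the test set with more or fewer favorites than the overall ratio suggests (the run in this notebook ended up with 21 favorites among its 110 test rows, against roughly 17 expected). scikit-learn's `train_test_split` accepts a `stratify` argument that holds the class ratio fixed. The pure-Python sketch below only illustrates that idea; `stratified_split` is a hypothetical helper and not part of the notebook:

```python
import random


def stratified_split(labels, test_size=0.33, seed=0):
    """Split row indices so each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then cut off its test share
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same label layout as the notebook: 50 favorites followed by 281 non-favorites
labels = [1] * 50 + [0] * 281
train_idx, test_idx = stratified_split(labels)
test_favorites = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_favorites)
```

In the notebook, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.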

Logistic Regression

In [33]:

from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(verbose=1)

In [34]:

log_model.fit(X_train, y_train)

[LibLinear]

Out[34]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=1, warm_start=False)

In [35]:

test_predictions = log_model.predict(X_test)

In [36]:

test_predictions

Out[36]:

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

Model evaluation

In [37]:

from sklearn.metrics import classification_report, confusion_matrix

In [39]:

from pprint import pprint
pprint(classification_report(y_test, test_predictions, target_names=['not fav','fav']))

('             precision    recall  f1-score   support\n'
 '\n'
 '    not fav       0.94      0.98      0.96        89\n'
 '        fav       0.88      0.71      0.79        21\n'
 '\n'
 'avg / total       0.93      0.93      0.92       110\n')

In [40]:

tn, fp, fn, tp = confusion_matrix(y_test, test_predictions).ravel()
tn, fp, fn, tp

Out[40]:

(87, 2, 6, 15)
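
The four counts above are enough to reproduce the 'fav' row of the classification report by hand. The following is a small sketch using the standard definitions of precision, recall, F1 and accuracy; `binary_metrics` is a hypothetical helper name:

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the houses predicted 'fav', how many really are
    recall = tp / (tp + fn)      # of the real favorites, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts taken from the confusion matrix output above
metrics = binary_metrics(tn=87, fp=2, fn=6, tp=15)
rounded = {name: round(value, 2) for name, value in metrics.items()}
print(rounded)
```

The rounded precision (0.88), recall (0.71) and F1 (0.79) agree with the 'fav' row of the report. The model misses 6 of the 21 favorites in the test set, so recall is the weaker figure here.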

Model inference

In [41]:

coeff = log_model.coef_.round(5).tolist()[0]
list(zip(X_train.columns, coeff))

Out[41]:

[('PRICE', -0.4817),
 ('BEDS', 0.56799),
 ('BATHS', 0.65258),
 ('SQUARE FEET', 0.09618),
 ('LOT SIZE', -0.10108),
 ('YEAR BUILT', 0.86107),
 ('HOA PER MONTH', 0.02129),
 ('grocery_count', -0.7736),
 ('restaurant_count', -1.22493),
 ('hospitals_count', 1.38967),
 ('coffee_count', 1.27494),
 ('bars_count', 2.9728),
 ('gas_count', 1.16501),
 ('shops_count', -0.71489),
 ('travel_count', 0.24195),
 ('parks_count', -1.02031),
 ('edu_count', -0.45057),
 ('commute_length', -0.58029),
 ('commute_duration', -0.59949)]

In [44]:

log_model.intercept_

Out[44]:

array([-1.46905595])
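
To show how the coefficients and the intercept combine into a prediction, the sketch below applies the logistic function to the first scaled row shown earlier. It is a hand calculation for illustration, not a replacement for `log_model.predict_proba`. The values are copied from the rounded outputs above, so the result is only approximate, and `predict_probability` is a hypothetical helper:

```python
import math

# Coefficients in column order, copied from the output above
coefficients = [-0.4817, 0.56799, 0.65258, 0.09618, -0.10108, 0.86107, 0.02129,
                -0.7736, -1.22493, 1.38967, 1.27494, 2.9728, 1.16501, -0.71489,
                0.24195, -1.02031, -0.45057, -0.58029, -0.59949]
intercept = -1.46905595

# Row 0 of prop_scaled, copied from the scaled table earlier in the notebook
first_row = [0.572446, 0.333333, 0.3, 0.448684, 0.100778, 0.947368, 0.25,
             1.0, 1.0, 0.5, 1.0, 1.0, 0.68, 0.92, 1.0, 1.0, 1.0,
             0.002085, 0.004983]


def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum plus the intercept."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


p_fav = predict_probability(first_row, coefficients, intercept)
print(round(p_fav, 3))
```

For this row the probability comes out above 0.5, which agrees with its `favorite` label of 1. Since every feature was scaled to the same 0 to 1 range, the coefficients can be compared with one another, though with only 221 training rows they should be read with caution.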

Conclusion

Recommendation engines

From the example above, we could build a recommendation engine that runs periodically on a newer set of properties and determines which ones are worth your time (the ones it predicts you would 'like'). The ML model's weights appear similar to what we defined manually, although in some cases they are way off.

This type of recommendation is called 'content-based filtering', and for it to work well we need a really large training set. In reality, nobody can sit and label that many properties. In practice, another type of recommendation called 'community-based filtering', more commonly known as collaborative filtering, is used. Based on the features identified for the properties, it tries to find similarity between buyers, pools the training sets of all similar buyers into one really large training set, and learns from that.
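
The pooling idea can be sketched in a few lines. This is a deliberately simplified illustration, assuming every buyer has labeled the same small set of houses. The buyers, their labels, the 0.6 threshold and the helper names are all hypothetical, and cosine similarity is just one possible similarity measure:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two like/dislike vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical data: like (1) / dislike (0) labels over the same six houses
buyer_labels = {
    "me":      [1, 1, 0, 0, 1, 0],
    "buyer_a": [1, 1, 0, 0, 0, 0],
    "buyer_b": [0, 0, 1, 1, 0, 1],
    "buyer_c": [1, 0, 0, 0, 1, 0],
}


def similar_buyers(target, labels_by_buyer, threshold=0.6):
    """Names of buyers whose labels resemble the target's, sorted alphabetically."""
    target_vec = labels_by_buyer[target]
    return sorted(
        name for name, vec in labels_by_buyer.items()
        if name != target and cosine_similarity(target_vec, vec) >= threshold
    )


peers = similar_buyers("me", buyer_labels)

# Pool (house index, label) pairs from the target and its peers into one training set
pooled = [(house, label)
          for name in ["me"] + peers
          for house, label in enumerate(buyer_labels[name])]
print(peers, len(pooled))
```

The pooled list is three times the size of a single buyer's labels. A model like the logistic regression above could then be trained on it, with each house index replaced by that house's feature row.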

Overall

In this set of notebooks, we observed how data science and machine learning approaches can be employed in the real estate industry. Buying a house is a very personal process; however, many of the decisions are heavily influenced by the location of the house. We showed how Python libraries such as Pandas can be used to statistically analyze the properties. We also showed how the ArcGIS API for Python adds spatial capabilities to Pandas, allowing us to perform spatial data analysis. We enriched the data with information on access to different facilities and used that to compare, score and rank the properties. The shortlist we arrived at can be used for field visits.

We conclude with a forward-looking approach that turns this into a recommendation engine, and we suggest scope for future work in this area.