
Building a housing recommendation engine

So far, we have engineered our dataset with location-specific features, explicitly defined weights for the different attributes, and arrived at a rank. Instead, we could simply like and dislike a few houses and let a machine learning model infer our preferences from those choices. That is what this notebook does.

Since it is time consuming to like and dislike a large number of properties, we pick the top 50 properties from our previous ranking and like them all, and dislike the rest.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
import seaborn as sns

from arcgis.gis import GIS
from arcgis.features import Feature, FeatureLayer, FeatureSet, GeoAccessor, GeoSeriesAccessor

Read ranked dataset

In [2]:
prop_df = pd.read_csv('resources/houses_ranked.csv')
prop_df = pd.DataFrame.spatial.from_xy(prop_df, 'LONGITUDE','LATITUDE')

Generate preference column

We will pick the top 50 records and assign them a positive preference. Then we will drop the score and rank columns and let the machine learning algorithm learn our preferences.

In [3]:
prop_df.columns
Out[3]:
Index(['Unnamed: 0', 'SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE',
       'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration', 'scores_scaled',
       'rank'],
      dtype='object')
In [4]:
prop_df.shape
Out[4]:
(331, 38)

Generate a preference list that is 331 records long: 1 for the first 50 records, followed by 0 for the remaining 281.

In [5]:
preference_list = [1]*50
preference_list.extend([0]*(331-50))
len(preference_list)
Out[5]:
331
In [6]:
prop_df['favorite'] = preference_list
Drop the Unnamed: 0, scores_scaled, and rank columns from the DataFrame
In [7]:
prop_df.drop(columns=['Unnamed: 0','scores_scaled','rank'], inplace=True)
prop_df.head()
Out[7]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration favorite
0 MLS Listing Single Family Residential 15986 SE Spokane Ct. Ave Portland OR 97236.0 543900.0 4.0 3.5 Portland Southeast 50 2 34 46 50 50 50 5.796321 16.509734 1
1 MLS Listing Multi-Family (2-4 Unit) SE Henderson St Portland OR 97206.0 625000.0 6.0 6.0 LENTS 50 1 50 40 50 50 50 8.380589 23.087985 1
2 MLS Listing Single Family Residential 8268 SE Yamhill St Portland OR 97216.0 550000.0 7.0 4.0 Portland Southeast 50 1 50 43 50 50 50 6.330796 16.910622 1
3 MLS Listing Single Family Residential 6311 SE Tenino St Portland OR 97206.0 479900.0 4.0 2.5 Portland Southeast 50 2 44 48 50 50 50 7.299694 20.389635 1
4 MLS Listing Multi-Family (2-4 Unit) 2028 SE Harold St Portland OR 97202.0 699900.0 5.0 4.0 SELLWOOD - WEST MORELAND 50 2 40 45 50 50 50 3.710354 11.486135 1

5 rows × 36 columns

One hot encoding

Instead of one-hot encoding the categorical columns (most of them, such as ADDRESS, are nearly unique per property), we simply drop them, along with other columns that don't really determine a buyer's preference.

In [8]:
prop_df.columns
Out[8]:
Index(['SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
       'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration', 'favorite'],
      dtype='object')
In [11]:
train_df = prop_df.drop(columns=['SALE TYPE','PROPERTY TYPE','ADDRESS', 'CITY', 'STATE', 'ZIP','LOCATION', 
                                'DAYS ON MARKET','PRICE PER SQFT','STATUS',
                                 'URL', 'SOURCE', 'MLS', 'SHAPE','LATITUDE', 'LONGITUDE'])
train_df.head()
Out[11]:
PRICE BEDS BATHS SQUARE FEET LOT SIZE YEAR BUILT HOA PER MONTH grocery_count restaurant_count hospitals_count coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration favorite
0 543900.0 4.0 3.5 3178.0 6969.0 2018.0 50.0 20 50 6 50 2 34 46 50 50 50 5.796321 16.509734 1
1 625000.0 6.0 6.0 2844.0 6969.0 2018.0 0.0 20 50 6 50 1 50 40 50 50 50 8.380589 23.087985 1
2 550000.0 7.0 4.0 3038.0 6969.0 2018.0 0.0 20 50 4 50 1 50 43 50 50 50 6.330796 16.910622 1
3 479900.0 4.0 2.5 2029.0 3920.0 2018.0 0.0 20 50 6 50 2 44 48 50 50 50 7.299694 20.389635 1
4 699900.0 5.0 4.0 2582.0 6969.0 2016.0 0.0 20 50 8 50 2 40 45 50 50 50 3.710354 11.486135 1

Scale numeric columns

We use the same MinMaxScaler we used earlier to scale the data.

In [12]:
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
In [13]:
columns_to_scale = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'HOA PER MONTH','grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration']
In [15]:
scaled_array = mm_scaler.fit_transform(train_df[columns_to_scale])
prop_scaled = pd.DataFrame(scaled_array, columns=columns_to_scale)
prop_scaled.head()
Out[15]:
PRICE BEDS BATHS SQUARE FEET LOT SIZE YEAR BUILT HOA PER MONTH grocery_count restaurant_count hospitals_count coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration
0 0.572446 0.333333 0.3 0.448684 0.100778 0.947368 0.25 1.0 1.0 0.500000 1.0 1.0 0.68 0.92 1.0 1.0 1.0 0.002085 0.004983
1 0.794577 0.666667 0.8 0.321251 0.100778 0.947368 0.00 1.0 1.0 0.500000 1.0 0.5 1.00 0.80 1.0 1.0 1.0 0.003189 0.008123
2 0.589154 0.833333 0.4 0.395269 0.100778 0.947368 0.00 1.0 1.0 0.333333 1.0 0.5 1.00 0.86 1.0 1.0 1.0 0.002313 0.005175
3 0.397151 0.333333 0.1 0.010301 0.046518 0.947368 0.00 1.0 1.0 0.500000 1.0 1.0 0.88 0.96 1.0 1.0 1.0 0.002727 0.006835
4 0.999726 0.500000 0.4 0.221290 0.100778 0.842105 0.00 1.0 1.0 0.666667 1.0 1.0 0.80 0.90 1.0 1.0 1.0 0.001193 0.002586
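
As a sanity check, we can reproduce the scaler's output by hand: MinMaxScaler rescales each column as (x - min) / (max - min), so every value lands in the [0, 1] range. A minimal verification sketch, using the train_df and prop_scaled frames from above:

import numpy as np

# manually rescale one column the way MinMaxScaler does
price = train_df['PRICE']
manual = (price - price.min()) / (price.max() - price.min())

# the hand-computed values should match the scaler's output
assert np.allclose(manual.values, prop_scaled['PRICE'].values)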

Split dataset into training and test

In [29]:
prop_scaled.columns
Out[29]:
Index(['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT',
       'HOA PER MONTH', 'grocery_count', 'restaurant_count', 'hospitals_count',
       'coffee_count', 'bars_count', 'gas_count', 'shops_count',
       'travel_count', 'parks_count', 'edu_count', 'commute_length',
       'commute_duration'],
      dtype='object')
In [30]:
prop_scaled.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 19 columns):
PRICE               331 non-null float64
BEDS                331 non-null float64
BATHS               331 non-null float64
SQUARE FEET         331 non-null float64
LOT SIZE            331 non-null float64
YEAR BUILT          331 non-null float64
HOA PER MONTH       331 non-null float64
grocery_count       331 non-null float64
restaurant_count    331 non-null float64
hospitals_count     331 non-null float64
coffee_count        331 non-null float64
bars_count          331 non-null float64
gas_count           331 non-null float64
shops_count         331 non-null float64
travel_count        331 non-null float64
parks_count         331 non-null float64
edu_count           331 non-null float64
commute_length      331 non-null float64
commute_duration    331 non-null float64
dtypes: float64(19)
memory usage: 49.2 KB
In [31]:
X = prop_scaled
y = train_df['favorite']
In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33)

(len(X_train), len(X_test))
Out[32]:
(221, 110)
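
Note that only about 15% of the labels are positive (50 of 331), so a purely random split can leave the training and test sets with noticeably different class balances. An optional refinement, not what was run above, is to stratify the split on the label and fix a random seed for reproducibility:

# stratify preserves the ~15% favorite ratio in both splits;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)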

Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(verbose=1)
In [34]:
log_model.fit(X_train, y_train)
[LibLinear]
Out[34]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=1, warm_start=False)
In [35]:
test_predictions = log_model.predict(X_test)
In [36]:
test_predictions
Out[36]:
array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

Model evaluation

In [37]:
from sklearn.metrics import classification_report, confusion_matrix
In [39]:
print(classification_report(y_test, test_predictions, target_names=['not fav','fav']))
             precision    recall  f1-score   support

    not fav       0.94      0.98      0.96        89
        fav       0.88      0.71      0.79        21

avg / total       0.93      0.93      0.92       110
In [40]:
tn, fp, fn, tp = confusion_matrix(y_test, test_predictions).ravel()
tn, fp, fn, tp
Out[40]:
(87, 2, 6, 15)
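
These four counts reproduce the per-class metrics in the classification report. For the 'fav' class, precision is tp / (tp + fp) = 15 / 17 ≈ 0.88 and recall is tp / (tp + fn) = 15 / 21 ≈ 0.71, matching the report above:

# precision: of the properties predicted as favorites, how many truly were
precision_fav = tp / float(tp + fp)   # 15 / 17 ≈ 0.88

# recall: of the true favorites, how many the model found
recall_fav = tp / float(tp + fn)      # 15 / 21 ≈ 0.71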

Model inference

In [41]:
coeff = log_model.coef_.round(5).tolist()[0]
list(zip(X_train.columns, coeff))
Out[41]:
[('PRICE', -0.4817),
 ('BEDS', 0.56799),
 ('BATHS', 0.65258),
 ('SQUARE FEET', 0.09618),
 ('LOT SIZE', -0.10108),
 ('YEAR BUILT', 0.86107),
 ('HOA PER MONTH', 0.02129),
 ('grocery_count', -0.7736),
 ('restaurant_count', -1.22493),
 ('hospitals_count', 1.38967),
 ('coffee_count', 1.27494),
 ('bars_count', 2.9728),
 ('gas_count', 1.16501),
 ('shops_count', -0.71489),
 ('travel_count', 0.24195),
 ('parks_count', -1.02031),
 ('edu_count', -0.45057),
 ('commute_length', -0.58029),
 ('commute_duration', -0.59949)]
In [44]:
log_model.intercept_
Out[44]:
array([-1.46905595])
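
The model predicts P(favorite) = 1 / (1 + exp(-(w · x + b))), so features with positive coefficients push a property toward 'favorite' and negative ones push it away. A quick, optional way to see which attributes dominate is to sort and plot the coefficients, reusing the pandas and matplotlib imports from the top of the notebook:

# sort the learned weights and plot them for easier comparison
coef_series = pd.Series(coeff, index=X_train.columns).sort_values()
coef_series.plot(kind='barh', figsize=(8, 8), title='Learned feature weights')
plt.show()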

Conclusion

Recommendation engines

From the example above, we could build a recommendation engine that runs periodically on newer sets of properties and determines which ones are worth your time (the ones it predicts you would 'like'). The ML model's weights appear similar to the ones we defined manually, though in some cases they diverge considerably.
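
A minimal sketch of that periodic scoring step, reusing the fitted mm_scaler and log_model from above; the new_listings.csv file name is hypothetical:

# score a fresh batch of listings with the already-fitted scaler and model
new_df = pd.read_csv('resources/new_listings.csv')           # hypothetical file with the same columns
new_scaled = mm_scaler.transform(new_df[columns_to_scale])   # transform only; do not refit

# predict_proba()[:, 1] is the probability of class 1 ('favorite')
new_df['like_probability'] = log_model.predict_proba(new_scaled)[:, 1]
shortlist = new_df.sort_values('like_probability', ascending=False).head(10)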

This type of recommendation is called 'content-based filtering', and for it to work well, we need a really large training set. In reality, no single buyer can sit and label that many properties. In practice, another approach called 'collaborative filtering' is used: based on the features identified for the properties, it finds similarities between buyers, pools the labeled examples of all similar buyers into one large training set, and learns from that, as the toy sketch below illustrates.
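
A toy illustration of the pooling idea; the like/dislike vectors below are entirely hypothetical and exist only to show the mechanics:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical like (1) / dislike (0) vectors for four buyers over six shared properties
ratings = np.array([[1, 0, 1, 1, 0, 0],   # buyer A (us)
                    [1, 0, 1, 0, 0, 0],   # buyer B
                    [0, 1, 0, 0, 1, 1],   # buyer C
                    [1, 0, 1, 1, 0, 1]])  # buyer D

# similarity of every buyer to buyer A; here B and D resemble A, C does not
similarity_to_a = cosine_similarity(ratings[[0]], ratings)[0]

# pool the labels of buyers whose taste is close enough to ours
similar_buyers = similarity_to_a > 0.7
pooled = ratings[similar_buyers]  # a larger effective training set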

Overall

In this set of notebooks, we saw how data science and machine learning approaches can be employed in the real estate industry. Buying a house is a very personal process; however, many decisions are heavily influenced by the location of the house. We showed how Python libraries such as Pandas can be used to statistically analyze the properties, and how the ArcGIS API for Python adds spatial capabilities to Pandas, allowing us to perform spatial data analysis. We enriched the data with information on access to different facilities and used it to compare, score, and rank the properties. The shortlist we arrived at can be used for field visits.

We conclude with a forward-looking approach that turns this analysis into a recommendation engine, and suggest scope for future work in this area.