Building a housing recommendation engine¶
So far, we have engineered location-specific features for our data set, explicitly defined weights for different attributes, and arrived at a rank. Instead, we could simply like and dislike a few houses and let a machine learning model infer our preferences from those choices. That is what this notebook tries to do.
Since it is time consuming to like and dislike a large number of properties, we pick the top 50 properties from our previous rank and like them all. We dislike the remaining ones.
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
import seaborn as sns
from arcgis.gis import GIS
from arcgis.features import Feature, FeatureLayer, FeatureSet, GeoAccessor, GeoSeriesAccessor
Read ranked dataset¶
prop_df = pd.read_csv('resources/houses_ranked.csv')
prop_df = pd.DataFrame.spatial.from_xy(prop_df, 'LONGITUDE','LATITUDE')
Generate preference column¶
We will pick the top 50 records and provide a positive preference to them. Then we will drop the score and rank columns and let the machine learning algorithm learn our preferences.
Index(['Unnamed: 0', 'SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH', 'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE', 'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count', 'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count', 'edu_count', 'commute_length', 'commute_duration', 'scores_scaled', 'rank'], dtype='object')
(331, 38)
Generate a preference list that is 331 records long. This list has 1 for the first 50 records, followed by 0 for the remaining records.
preference_list = [1]*50 + [0]*281  # 50 likes followed by 281 dislikes (331 records in total)
prop_df['favorite'] = preference_list
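As an alternative to building the list by hand, the label could be derived from the rank column before it is dropped. The following is a minimal sketch, assuming that rank starts at 1 for the best property; the small `demo_df` frame is hypothetical and only stands in for `prop_df`.

```python
import pandas as pd

# Hypothetical stand-in for prop_df: 331 ranked properties.
# Assumption: rank 1 is the best property.
demo_df = pd.DataFrame({'rank': range(1, 332)})

# Like the top 50 properties (1) and dislike the rest (0).
top_n = 50
demo_df['favorite'] = (demo_df['rank'] <= top_n).astype(int)

print(demo_df['favorite'].value_counts().to_dict())
```

Deriving the label from the rank avoids hard-coding the number of rows, so the same line keeps working if the ranked file grows.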
Drop rank, scores_scaled columns from DataFrame¶
prop_df.drop(columns=['Unnamed: 0','scores_scaled','rank'], inplace=True)
| SALE TYPE | PROPERTY TYPE | ADDRESS | CITY | STATE | ZIP | PRICE | BEDS | BATHS | LOCATION | ... | coffee_count | bars_count | gas_count | shops_count | travel_count | parks_count | edu_count | commute_length | commute_duration | favorite |
0 | MLS Listing | Single Family Residential | 15986 SE Spokane Ct. Ave | Portland | OR | 97236.0 | 543900.0 | 4.0 | 3.5 | Portland Southeast | ... | 50 | 2 | 34 | 46 | 50 | 50 | 50 | 5.796321 | 16.509734 | 1 |
1 | MLS Listing | Multi-Family (2-4 Unit) | SE Henderson St | Portland | OR | 97206.0 | 625000.0 | 6.0 | 6.0 | LENTS | ... | 50 | 1 | 50 | 40 | 50 | 50 | 50 | 8.380589 | 23.087985 | 1 |
2 | MLS Listing | Single Family Residential | 8268 SE Yamhill St | Portland | OR | 97216.0 | 550000.0 | 7.0 | 4.0 | Portland Southeast | ... | 50 | 1 | 50 | 43 | 50 | 50 | 50 | 6.330796 | 16.910622 | 1 |
3 | MLS Listing | Single Family Residential | 6311 SE Tenino St | Portland | OR | 97206.0 | 479900.0 | 4.0 | 2.5 | Portland Southeast | ... | 50 | 2 | 44 | 48 | 50 | 50 | 50 | 7.299694 | 20.389635 | 1 |
4 | MLS Listing | Multi-Family (2-4 Unit) | 2028 SE Harold St | Portland | OR | 97202.0 | 699900.0 | 5.0 | 4.0 | SELLWOOD - WEST MORELAND | ... | 50 | 2 | 40 | 45 | 50 | 50 | 50 | 3.710354 | 11.486135 | 1 |
5 rows × 36 columns
One hot encoding¶
We drop more columns that do not really determine a buyer's preference, such as identifiers, free-text fields, and listing metadata.
Index(['SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH', 'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE', 'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count', 'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count', 'edu_count', 'commute_length', 'commute_duration', 'favorite'], dtype='object')
train_df = prop_df.drop(columns=['SALE TYPE','PROPERTY TYPE','ADDRESS', 'CITY', 'STATE', 'ZIP','LOCATION',
                                 'DAYS ON MARKET','PRICE PER SQFT','STATUS','URL','SOURCE','MLS','LATITUDE','LONGITUDE','SHAPE'])
| PRICE | BEDS | BATHS | SQUARE FEET | LOT SIZE | YEAR BUILT | HOA PER MONTH | grocery_count | restaurant_count | hospitals_count | coffee_count | bars_count | gas_count | shops_count | travel_count | parks_count | edu_count | commute_length | commute_duration | favorite |
0 | 543900.0 | 4.0 | 3.5 | 3178.0 | 6969.0 | 2018.0 | 50.0 | 20 | 50 | 6 | 50 | 2 | 34 | 46 | 50 | 50 | 50 | 5.796321 | 16.509734 | 1 |
1 | 625000.0 | 6.0 | 6.0 | 2844.0 | 6969.0 | 2018.0 | 0.0 | 20 | 50 | 6 | 50 | 1 | 50 | 40 | 50 | 50 | 50 | 8.380589 | 23.087985 | 1 |
2 | 550000.0 | 7.0 | 4.0 | 3038.0 | 6969.0 | 2018.0 | 0.0 | 20 | 50 | 4 | 50 | 1 | 50 | 43 | 50 | 50 | 50 | 6.330796 | 16.910622 | 1 |
3 | 479900.0 | 4.0 | 2.5 | 2029.0 | 3920.0 | 2018.0 | 0.0 | 20 | 50 | 6 | 50 | 2 | 44 | 48 | 50 | 50 | 50 | 7.299694 | 20.389635 | 1 |
4 | 699900.0 | 5.0 | 4.0 | 2582.0 | 6969.0 | 2016.0 | 0.0 | 20 | 50 | 8 | 50 | 2 | 40 | 45 | 50 | 50 | 50 | 3.710354 | 11.486135 | 1 |
Scale numeric columns¶
We use the same MinMaxScaler we used earlier to scale the data.
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
columns_to_scale = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
'YEAR BUILT', 'HOA PER MONTH','grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
'edu_count', 'commute_length', 'commute_duration']
scaled_array = mm_scaler.fit_transform(train_df[columns_to_scale])
prop_scaled = pd.DataFrame(scaled_array, columns=columns_to_scale)
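To make the scaling step concrete, the sketch below compares a manual min-max computation, (x - min) / (max - min) per column, with the output of MinMaxScaler at its default (0, 1) range. The three prices are made-up values used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices, for illustration only.
prices = np.array([[300000.0], [450000.0], [600000.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
manual = (prices - col_min) / (col_max - col_min)

# The same transformation using scikit-learn's default feature_range of (0, 1).
scaled = MinMaxScaler().fit_transform(prices)

print(manual.ravel())
print(scaled.ravel())
```

Because every column ends up in the same 0 to 1 range, a column with large raw values such as PRICE does not dominate columns with small raw values such as BEDS.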
| PRICE | BEDS | BATHS | SQUARE FEET | LOT SIZE | YEAR BUILT | HOA PER MONTH | grocery_count | restaurant_count | hospitals_count | coffee_count | bars_count | gas_count | shops_count | travel_count | parks_count | edu_count | commute_length | commute_duration |
0 | 0.572446 | 0.333333 | 0.3 | 0.448684 | 0.100778 | 0.947368 | 0.25 | 1.0 | 1.0 | 0.500000 | 1.0 | 1.0 | 0.68 | 0.92 | 1.0 | 1.0 | 1.0 | 0.002085 | 0.004983 |
1 | 0.794577 | 0.666667 | 0.8 | 0.321251 | 0.100778 | 0.947368 | 0.00 | 1.0 | 1.0 | 0.500000 | 1.0 | 0.5 | 1.00 | 0.80 | 1.0 | 1.0 | 1.0 | 0.003189 | 0.008123 |
2 | 0.589154 | 0.833333 | 0.4 | 0.395269 | 0.100778 | 0.947368 | 0.00 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.5 | 1.00 | 0.86 | 1.0 | 1.0 | 1.0 | 0.002313 | 0.005175 |
3 | 0.397151 | 0.333333 | 0.1 | 0.010301 | 0.046518 | 0.947368 | 0.00 | 1.0 | 1.0 | 0.500000 | 1.0 | 1.0 | 0.88 | 0.96 | 1.0 | 1.0 | 1.0 | 0.002727 | 0.006835 |
4 | 0.999726 | 0.500000 | 0.4 | 0.221290 | 0.100778 | 0.842105 | 0.00 | 1.0 | 1.0 | 0.666667 | 1.0 | 1.0 | 0.80 | 0.90 | 1.0 | 1.0 | 1.0 | 0.001193 | 0.002586 |
Split dataset into training and test¶
Index(['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'HOA PER MONTH', 'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count', 'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count', 'edu_count', 'commute_length', 'commute_duration'], dtype='object')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 331 entries, 0 to 330 Data columns (total 19 columns): PRICE 331 non-null float64 BEDS 331 non-null float64 BATHS 331 non-null float64 SQUARE FEET 331 non-null float64 LOT SIZE 331 non-null float64 YEAR BUILT 331 non-null float64 HOA PER MONTH 331 non-null float64 grocery_count 331 non-null float64 restaurant_count 331 non-null float64 hospitals_count 331 non-null float64 coffee_count 331 non-null float64 bars_count 331 non-null float64 gas_count 331 non-null float64 shops_count 331 non-null float64 travel_count 331 non-null float64 parks_count 331 non-null float64 edu_count 331 non-null float64 commute_length 331 non-null float64 commute_duration 331 non-null float64 dtypes: float64(19) memory usage: 49.2 KB
X = prop_scaled
y = train_df['favorite']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33)
(len(X_train), len(X_test))
(221, 110)
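Only 50 of the 331 records are favorites, so a purely random split can leave the test set with an unrepresentative share of them. One possible refinement, shown here as a sketch on hypothetical data with the same class balance, is to pass `stratify` and a fixed `random_state` to `train_test_split` so that the split is reproducible and keeps the class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same class balance as the notebook:
# 50 likes and 281 dislikes, with 3 made-up feature columns.
rng = np.random.default_rng(0)
X_demo = rng.random((331, 3))
y_demo = np.array([1] * 50 + [0] * 281)

# stratify keeps the like/dislike ratio similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42)

print(len(X_tr), len(X_te))
print(y_tr.mean().round(3), y_te.mean().round(3))
```

The value 42 is arbitrary; any fixed integer gives a repeatable split.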
Logistic Regression¶
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(verbose=1)
log_model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=1, warm_start=False)
test_predictions = log_model.predict(X_test)
array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
Model evaluation¶
from sklearn.metrics import classification_report, confusion_matrix
from pprint import pprint
pprint(classification_report(y_test, test_predictions, target_names=['not fav','fav']))
(' precision recall f1-score support\n' '\n' ' not fav 0.94 0.98 0.96 89\n' ' fav 0.88 0.71 0.79 21\n' '\n' 'avg / total 0.93 0.93 0.92 110\n')
tn, fp, fn, tp = confusion_matrix(y_test, test_predictions).ravel()
tn, fp, fn, tp
(87, 2, 6, 15)
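The precision, recall, and F1 values reported for the 'fav' class can be reproduced by hand from the four confusion-matrix counts. The short worked example below uses the counts printed above.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 87, 2, 6, 15

precision = tp / (tp + fp)   # of the houses predicted 'fav', how many really were
recall = tp / (tp + fn)      # of the true 'fav' houses, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.88 0.71 0.79 0.93
```

The recall of 0.71 means the model misses roughly 3 in 10 of the houses we liked, which matters more for a recommender than the high overall accuracy.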
Model inference¶
coeff = log_model.coef_.round(5).tolist()[0]
list(zip(X_train.columns, coeff))
[('PRICE', -0.4817), ('BEDS', 0.56799), ('BATHS', 0.65258), ('SQUARE FEET', 0.09618), ('LOT SIZE', -0.10108), ('YEAR BUILT', 0.86107), ('HOA PER MONTH', 0.02129), ('grocery_count', -0.7736), ('restaurant_count', -1.22493), ('hospitals_count', 1.38967), ('coffee_count', 1.27494), ('bars_count', 2.9728), ('gas_count', 1.16501), ('shops_count', -0.71489), ('travel_count', 0.24195), ('parks_count', -1.02031), ('edu_count', -0.45057), ('commute_length', -0.58029), ('commute_duration', -0.59949)]
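The coefficient list is easier to read when sorted by magnitude. The sketch below reuses the values printed above and ranks the features by absolute weight. Since all inputs were min-max scaled, the magnitudes are roughly comparable, although this is only an informal reading of feature importance.

```python
# Coefficients copied from the output above.
coefficients = [
    ('PRICE', -0.4817), ('BEDS', 0.56799), ('BATHS', 0.65258),
    ('SQUARE FEET', 0.09618), ('LOT SIZE', -0.10108), ('YEAR BUILT', 0.86107),
    ('HOA PER MONTH', 0.02129), ('grocery_count', -0.7736),
    ('restaurant_count', -1.22493), ('hospitals_count', 1.38967),
    ('coffee_count', 1.27494), ('bars_count', 2.9728), ('gas_count', 1.16501),
    ('shops_count', -0.71489), ('travel_count', 0.24195),
    ('parks_count', -1.02031), ('edu_count', -0.45057),
    ('commute_length', -0.58029), ('commute_duration', -0.59949),
]

# Sort by absolute value so strong negative weights rank alongside strong positive ones.
ranked = sorted(coefficients, key=lambda pair: abs(pair[1]), reverse=True)

for name, weight in ranked[:5]:
    print(f'{name:18s} {weight:+.3f}')
```

A positive weight pushes a property toward 'fav' and a negative weight pushes it away; here bars_count has by far the largest weight.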
Recommendation engines¶
From the example above, we could build a recommendation engine that runs periodically on a newer set of properties and determines which ones are worth your time (the ones it predicts you would 'like'). Some of the ML model's weights appear similar to what we defined manually; in other cases, they are way off.
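Such a periodic run could look roughly like the sketch below. The `recommend` helper, the toy `history` data, and the two-column feature list are all hypothetical; the point is only that new listings are transformed with the scaler fitted on the training data and then ordered by the predicted probability of a 'like'.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def recommend(model, scaler, new_listings, feature_columns, top_n=5):
    """Hypothetical helper: rank new listings by predicted probability of a 'like'."""
    # Reuse the scaler fitted on the training data; do not refit it on new data.
    scaled = scaler.transform(new_listings[feature_columns])
    # Column 1 of predict_proba holds the probability of class 1 ('fav').
    like_probability = model.predict_proba(scaled)[:, 1]
    scored = new_listings.assign(like_probability=like_probability)
    return scored.sort_values('like_probability', ascending=False).head(top_n)

# Toy training history (made up): cheaper houses with more bedrooms are liked.
rng = np.random.default_rng(0)
history = pd.DataFrame({'PRICE': rng.uniform(3e5, 8e5, 200),
                        'BEDS': rng.integers(1, 7, 200).astype(float)})
liked = ((history['BEDS'] >= 4) & (history['PRICE'] < 6e5)).astype(int)

features = ['PRICE', 'BEDS']
scaler = MinMaxScaler().fit(history[features])
model = LogisticRegression().fit(scaler.transform(history[features]), liked)

# A new batch of listings (also made up) to score.
new_listings = pd.DataFrame({'PRICE': [350000.0, 750000.0, 500000.0],
                             'BEDS': [5.0, 2.0, 4.0]})
result = recommend(model, scaler, new_listings, features, top_n=2)
print(result)
```

Ranking by probability rather than by the hard 0/1 prediction gives an ordered shortlist even when few listings cross the default 0.5 threshold.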
This type of recommendation is called 'content based filtering', and for it to work well we need a really large training set. In reality, nobody can sit and generate such a large set by hand. In practice, another type of recommendation called 'community based filtering' (more widely known as collaborative filtering) is used. Based on the features identified for the properties, it finds similarities between buyers, pools the training sets of all similar buyers into one really large training set, and learns from that.
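The idea of finding similar buyers can be illustrated with a very small sketch. It assumes, purely for illustration, that several buyers have voted on the same six properties, and it uses cosine similarity with an arbitrary threshold of 0.5 to decide whose votes to pool. The vote matrix and the threshold are hypothetical, and production recommenders use far more elaborate methods.

```python
import numpy as np

# Hypothetical like (1) / dislike (0) votes: rows are buyers, columns are 6 properties.
votes = np.array([
    [1, 1, 0, 0, 1, 0],   # buyer 0: the buyer we want recommendations for
    [1, 1, 0, 0, 0, 0],   # buyer 1
    [0, 0, 1, 1, 0, 1],   # buyer 2
    [1, 0, 0, 0, 1, 0],   # buyer 3
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two vote vectors (1.0 means identical direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
similarity = np.array([cosine_similarity(votes[target], row) for row in votes])

# Pool the buyers whose taste is close to the target's.
# The 0.5 threshold is an arbitrary assumption.
similar_buyers = [i for i, s in enumerate(similarity) if s >= 0.5]

print(similarity.round(3))
print(similar_buyers)
```

The labeled properties of the pooled buyers could then be combined into one larger training set for a classifier like the one above.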
In this set of notebooks, we observed how data science and machine learning approaches can be employed in the real estate industry. Buying a house is a very personal process; however, many decisions are heavily influenced by the location of the house. We showed how Python libraries such as Pandas can be used to statistically analyze the properties. We also showed how the ArcGIS API for Python adds spatial capabilities to Pandas, allowing you to perform spatial data analysis. We enriched the data with information on access to different facilities and used that to compare, score and rank the properties. The shortlist we arrived at can be used for field visits.
We conclude with a forward-looking approach that turns this workflow into a recommendation engine, and we suggest scope for future work in this area.