05-rank-properties-using-features

Score and rank properties using intrinsic and spatial features

So far we have shortlisted properties based on their intrinsic properties such as number of rooms, price, HoA etc. We have enriched them with spatial attributes such as access to facilities and distance to a point of interest (such as work). In this notebook, we weigh and sum these attributes to produce a score for each property. We finally sort them to pick the top few.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
import seaborn as sns

from arcgis.gis import GIS
from arcgis.features import Feature, FeatureLayer, FeatureSet, GeoAccessor, GeoSeriesAccessor
In [2]:
gis = GIS(profile='')

Read spatially enriched properties

In [3]:
prop_df = pd.read_csv('resources/houses_facility_counts.csv')
prop_df = pd.DataFrame.spatial.from_xy(prop_df, 'LONGITUDE','LATITUDE')
In [19]:
prop_df.head()
Out[19]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION hospitals_count coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration
0 MLS Listing Single Family Residential 144 SE 191st Pl Portland OR 97233.0 334900.0 4.0 2.5 Gresham, Sandy, Troutdale, Corbett 0 19 0 28 41 50 50 50 13.978589 25.621265
1 MLS Listing Single Family Residential 20028 NE Hoyt St Portland OR 97230.0 349950.0 3.0 2.5 Reeds Place 2 23 0 31 38 50 50 50 13.913465 24.726944
2 MLS Listing Single Family Residential 711 SE 160th Ave Portland OR 97233.0 358999.0 4.0 2.5 Portland Southeast 0 16 0 34 47 50 50 50 14.644863 27.830024
3 MLS Listing Single Family Residential 5536 SE 142nd Pl Portland OR 97236.0 369800.0 4.0 2.5 Portland Southeast 2 18 0 29 45 48 50 50 13.097669 25.943438
4 MLS Listing Single Family Residential 15707 SE Flavel Dr Portland OR 97236.0 379900.0 3.0 2.0 Portland Southeast 0 13 0 17 44 10 50 41 14.254413 28.403598

5 rows × 35 columns

In [20]:
prop_df.columns
Out[20]:
Index(['SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
       'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration'],
      dtype='object')

Drop unnecessary columns

In [4]:
try:
    prop_df.drop(columns=['Unnamed: 0'], inplace=True)
    prop_df.drop(columns=['Unnamed: 0.1'], inplace=True)
except:
    pass

Apply weights

In the following section, we determine the relative importance of each attribute and set weights accordingly

In [5]:
def set_scores(row):
    score = ((row['PRICE']*-1.5) + # penalize by 1.5 times
             (row['BEDS']*1)+
             (row['BATHS']*1)+
             (row['SQUARE FEET']*1)+
             (row['LOT SIZE']*1)+
             (row['YEAR BUILT']*1)+
             (row['HOA PER MONTH']*-1)+  # penalize by 1 times
             (row['grocery_count']*1)+
             (row['restaurant_count']*1)+
             (row['hospitals_count']*1.5)+  # reward by 1.5 times
             (row['coffee_count']*1)+
             (row['bars_count']*1)+
             (row['shops_count']*1)+
             (row['travel_count']*1.5)+  # reward by 1.5 times
             (row['parks_count']*1)+
             (row['edu_count']*1)+
             (row['commute_length']*-1)+  # penalize by 1 times
             (row['commute_duration']*-2)  # penalize by 2 times
            )
    return score

Generate scores for all properties

We calculate the score for each property using the formula we defined in previous section

In [29]:
%%time
prop_df['scores'] = prop_df.apply(set_scores, axis=1)
CPU times: user 78 ms, sys: 1.98 ms, total: 80 ms
Wall time: 101 ms
In [30]:
prop_df.head()
Out[30]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration scores
0 MLS Listing Single Family Residential 144 SE 191st Pl Portland OR 97233.0 334900.0 4.0 2.5 Gresham, Sandy, Troutdale, Corbett 19 0 28 41 50 50 50 13.978589 25.621265 -495439.721119
1 MLS Listing Single Family Residential 20028 NE Hoyt St Portland OR 97230.0 349950.0 3.0 2.5 Reeds Place 23 0 31 38 50 50 50 13.913465 24.726944 -516614.867353
2 MLS Listing Single Family Residential 711 SE 160th Ave Portland OR 97233.0 358999.0 4.0 2.5 Portland Southeast 16 0 34 47 50 50 50 14.644863 27.830024 -528948.304911
3 MLS Listing Single Family Residential 5536 SE 142nd Pl Portland OR 97236.0 369800.0 4.0 2.5 Portland Southeast 18 0 29 45 48 50 50 13.097669 25.943438 -545169.484546
4 MLS Listing Single Family Residential 15707 SE Flavel Dr Portland OR 97236.0 379900.0 3.0 2.0 Portland Southeast 13 0 17 44 10 50 41 14.254413 28.403598 -559115.061609

5 rows × 36 columns

In [39]:
prop_df['scores'].hist(bins=50);   # trailing ; suppresses matplotlib's prints
In [37]:
sns.jointplot('PRICE','scores', data=prop_df);
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

From the chart above, it appears numeric columns with large values (such as Price) weild an undue importance on our model. To rectify this, we need to scale all numeric columns and normalize them with each other. The section below performs this and then tries to generate scores based on the scaled values.

Write unscaled data to disk

In [38]:
prop_df.to_csv('resources/houses_scores_unscaled.csv')

Scale data

Although price plays an important role while selecting properties, the range and valuds of the PRICE column is much larger than any other. Thus, it influences more than its fair share. To resolve this, we scale the data using different techniques.

In [6]:
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
In [7]:
prop_df.columns
Out[7]:
Index(['SALE TYPE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
       'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', 'PRICE PER SQFT', 'HOA PER MONTH',
       'STATUS', 'URL', 'SOURCE', 'MLS', 'LATITUDE', 'LONGITUDE', 'SHAPE',
       'grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration'],
      dtype='object')
In [8]:
columns_to_scale = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'HOA PER MONTH','grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration']
In [9]:
scaled_array = mm_scaler.fit_transform(prop_df[columns_to_scale])
prop_scaled = pd.DataFrame(scaled_array, columns=columns_to_scale)
prop_scaled.head()
Out[9]:
PRICE BEDS BATHS SQUARE FEET LOT SIZE YEAR BUILT HOA PER MONTH grocery_count restaurant_count hospitals_count coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration
0 0.000000 0.333333 0.1 0.016024 0.023259 0.368421 0.00 1.0 1.0 0.000000 0.38 0.0 0.56 0.82 1.00 1.0 1.00 0.005581 0.009332
1 0.041222 0.166667 0.1 0.049981 0.046518 0.315789 0.00 1.0 1.0 0.166667 0.46 0.0 0.62 0.76 1.00 1.0 1.00 0.005553 0.008905
2 0.066007 0.333333 0.1 0.028233 0.069777 0.157895 0.00 1.0 1.0 0.000000 0.32 0.0 0.68 0.94 1.00 1.0 1.00 0.005865 0.010386
3 0.095590 0.333333 0.1 0.014498 0.069777 0.736842 0.00 1.0 1.0 0.166667 0.36 0.0 0.58 0.90 0.96 1.0 1.00 0.005204 0.009486
4 0.123254 0.166667 0.0 0.025181 0.093036 0.000000 0.17 1.0 1.0 0.000000 0.26 0.0 0.34 0.88 0.20 1.0 0.82 0.005698 0.010660

Apply scores on scaled data

In [10]:
prop_scaled['scores_scaled'] = prop_scaled.apply(set_scores, axis=1)

Check the influence of price on scores

In [12]:
sns.distplot(prop_scaled['scores_scaled'], bins=50)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a216e6198>
In [13]:
sns.jointplot(x='PRICE', y='scores_scaled', data=prop_scaled)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[13]:
<seaborn.axisgrid.JointGrid at 0x1a206eab00>

After scaling, the scatter plot of PRICE VS SCORES appears random without much correlation. The histogram of scores also looks normally distributed. This is important as we want to consider all aspects of a house and weigh them according to their relative importance to the buyer. It is not much of a model if the scores are entirely based off one attribute (such as PRICE).

Inverse transform data

Since we made a copy of the DataFrame when we scaled, we don’t need to inverse transfrom the scaled data. We can simply copy the scaled_scores from this DataFrame and apply that to the original DataFrame

In [13]:
prop_df['scores_scaled'] = prop_scaled['scores_scaled']
prop_df.head()
Out[13]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration scores_scaled
0 MLS Listing Single Family Residential 144 SE 191st Pl Portland OR 97233.0 334900.0 4.0 2.5 Gresham, Sandy, Troutdale, Corbett 19 0 28 41 50 50 50 13.978589 25.621265 7.516793
1 MLS Listing Single Family Residential 20028 NE Hoyt St Portland OR 97230.0 349950.0 3.0 2.5 Reeds Place 23 0 31 38 50 50 50 13.913465 24.726944 7.563760
2 MLS Listing Single Family Residential 711 SE 160th Ave Portland OR 97233.0 358999.0 4.0 2.5 Portland Southeast 16 0 34 47 50 50 50 14.644863 27.830024 7.323591
3 MLS Listing Single Family Residential 5536 SE 142nd Pl Portland OR 97236.0 369800.0 4.0 2.5 Portland Southeast 18 0 29 45 48 50 50 13.097669 25.943438 8.036890
4 MLS Listing Single Family Residential 15707 SE Flavel Dr Portland OR 97236.0 379900.0 3.0 2.0 Portland Southeast 13 0 17 44 10 50 41 14.254413 28.403598 5.162985

5 rows × 36 columns

In [14]:
sns.jointplot('PRICE', 'scores_scaled', data=prop_df)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[14]:
<seaborn.axisgrid.JointGrid at 0x1a21a4d7b8>

Sort by scores

In [15]:
prop_df_sorted = prop_df.sort_values(by='scores_scaled', ascending=False)
prop_df_sorted.head(3)
Out[15]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration scores_scaled
87 MLS Listing Single Family Residential 15986 SE Spokane Ct. Ave Portland OR 97236.0 543900.0 4.0 3.5 Portland Southeast 50 2 34 46 50 50 50 5.796321 16.509734 10.179443
170 MLS Listing Multi-Family (2-4 Unit) SE Henderson St Portland OR 97206.0 625000.0 6.0 6.0 LENTS 50 1 50 40 50 50 50 8.380589 23.087985 10.174764
101 MLS Listing Single Family Residential 8268 SE Yamhill St Portland OR 97216.0 550000.0 7.0 4.0 Portland Southeast 50 1 50 43 50 50 50 6.330796 16.910622 10.140356

3 rows × 36 columns

In [16]:
prop_df_sorted.reset_index(drop=True, inplace=True)
prop_df_sorted.head()
Out[16]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION coffee_count bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration scores_scaled
0 MLS Listing Single Family Residential 15986 SE Spokane Ct. Ave Portland OR 97236.0 543900.0 4.0 3.5 Portland Southeast 50 2 34 46 50 50 50 5.796321 16.509734 10.179443
1 MLS Listing Multi-Family (2-4 Unit) SE Henderson St Portland OR 97206.0 625000.0 6.0 6.0 LENTS 50 1 50 40 50 50 50 8.380589 23.087985 10.174764
2 MLS Listing Single Family Residential 8268 SE Yamhill St Portland OR 97216.0 550000.0 7.0 4.0 Portland Southeast 50 1 50 43 50 50 50 6.330796 16.910622 10.140356
3 MLS Listing Single Family Residential 6311 SE Tenino St Portland OR 97206.0 479900.0 4.0 2.5 Portland Southeast 50 2 44 48 50 50 50 7.299694 20.389635 10.035397
4 MLS Listing Multi-Family (2-4 Unit) 2028 SE Harold St Portland OR 97202.0 699900.0 5.0 4.0 SELLWOOD - WEST MORELAND 50 2 40 45 50 50 50 3.710354 11.486135 9.958218

5 rows × 36 columns

Apply a rank for each property based on the scores

In [18]:
prop_df_sorted['rank'] = prop_df_sorted.index
prop_df_sorted.head(10)
Out[18]:
SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS BATHS LOCATION bars_count gas_count shops_count travel_count parks_count edu_count commute_length commute_duration scores_scaled rank
0 MLS Listing Single Family Residential 15986 SE Spokane Ct. Ave Portland OR 97236.0 543900.0 4.0 3.5 Portland Southeast 2 34 46 50 50 50 5.796321 16.509734 10.179443 0
1 MLS Listing Multi-Family (2-4 Unit) SE Henderson St Portland OR 97206.0 625000.0 6.0 6.0 LENTS 1 50 40 50 50 50 8.380589 23.087985 10.174764 1
2 MLS Listing Single Family Residential 8268 SE Yamhill St Portland OR 97216.0 550000.0 7.0 4.0 Portland Southeast 1 50 43 50 50 50 6.330796 16.910622 10.140356 2
3 MLS Listing Single Family Residential 6311 SE Tenino St Portland OR 97206.0 479900.0 4.0 2.5 Portland Southeast 2 44 48 50 50 50 7.299694 20.389635 10.035397 3
4 MLS Listing Multi-Family (2-4 Unit) 2028 SE Harold St Portland OR 97202.0 699900.0 5.0 4.0 SELLWOOD - WEST MORELAND 2 40 45 50 50 50 3.710354 11.486135 9.958218 4
5 New Construction Home Single Family Residential 16010 SE Spokane Ct Portland OR 97236.0 518750.0 3.0 2.5 Peach Tree Meadows 2 34 46 50 50 50 5.796321 16.509734 9.929172 5
6 MLS Listing Multi-Family (2-4 Unit) 5850 SE 86th Ave Portland OR 97266.0 699000.0 6.0 6.0 LENTS 1 50 38 50 50 50 10.899918 21.144505 9.924611 6
7 MLS Listing Townhouse 2022 SE Harold St Portland OR 97202.0 699900.0 5.0 4.0 SELLWOOD - WEST MORELAND 2 40 45 50 50 50 3.713178 11.513058 9.880673 7
8 MLS Listing Single Family Residential 15965 SE Spokane Ct Portland OR 97236.0 488750.0 3.0 2.5 Portland Southeast 2 34 46 50 50 50 5.796321 16.509734 9.794684 8
9 MLS Listing Single Family Residential 16136 SE Tenino St Portland OR 97236.0 479900.0 3.0 2.5 Portland Southeast 2 36 46 50 50 50 4.998670 13.217221 9.761623 9

10 rows × 37 columns

Visualize the top 50 properties

In [29]:
top_10_map = gis.map('Portland, OR')
top_10_map.basemap = 'gray'
top_10_map

In [39]:
prop_df_sorted.head(50).spatial.plot(map_widget = top_10_map, 
                      renderer_type='c',
                     method='esriClassifyNaturalBreaks',  # classification scheme
                     class_count=10,  # between 1900 - 2000, each decade in a class
                     col='scores_scaled',
                     cmap='autumn',  # matplotlib color map
                     alpha=0.7,
                     outline_color=[0,0,0,0])
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:861: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data[col] = self._data[col]
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:1968: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data['OBJECTID'] = list(range(1, self._data.shape[0] + 1))
Out[39]:
True

Investigate statistical distribution facilities in the top 50 shortlist

In [19]:
interesting_columns=['grocery_count', 'restaurant_count', 'hospitals_count', 'coffee_count',
       'bars_count', 'gas_count', 'shops_count', 'travel_count', 'parks_count',
       'edu_count', 'commute_length', 'commute_duration','PRICE','BEDS','BATHS','YEAR BUILT','SQUARE FEET']

ax_list = prop_df_sorted[interesting_columns].iloc[:49].hist(bins=25, layout=(5,4), figsize=(15,15))
plt.tight_layout()

How is rank correlated with location properties of houses?

In the charts below we attempt to find if there exists a correlation between different facilities.

In [54]:
column_set1 = ['hospitals_count','commute_length','commute_duration',
               'gas_count', 'shops_count', 'rank']
column_set2 = ['rank','PRICE','BEDS','BATHS','YEAR BUILT','SQUARE FEET']

grd = sns.pairplot(data=prop_df_sorted[column_set1].iloc[:49])
plt.title('How is rank correlated with location properties of a house?')
Out[54]:
Text(0.5,1,'How is rank correlated with location properties of a house?')

How is rank correlated with intrinsic properties of houses?

In [55]:
grd = sns.pairplot(data=prop_df_sorted[column_set2].iloc[:49])
plt.title('How is rank correlated with intrinsic properties of a house?')
Out[55]:
Text(0.5,1,'How is rank correlated with intrinsic properties of a house?')

Save to disk

In [40]:
prop_df_sorted.to_csv('resources/houses_ranked.csv')
prop_df_sorted.spatial.to_featureclass('resources/shp/houses_ranked.shp')
Out[40]:
'resources/shp/houses_ranked.shp'