Scikit-learn syntax¶
Library constructs¶
Estimator¶
Every algorithm is exposed via an Estimator, which can be imported as
from sklearn.<family> import <model>
For example, for linear regression:
from sklearn.linear_model import LinearRegression
lm_model = LinearRegression(<estimator parameters>)
Estimator parameters are provided as arguments when you instantiate an Estimator; scikit-learn provides sensible defaults for all of them.
In Scikit-learn, Estimators are designed around these principles:
- consistency: all estimators share a common interface
- inspection: the hyperparameters you set when you instantiate an estimator are available for inspection as properties of that object (see the sketch after this list)
- limited hierarchy: only the algorithms are represented as Python objects; training data, results, and parameter names use standard Python or NumPy / Pandas types
- composition: many workflows can be achieved as a series of more fundamental algorithms
- sensible defaults: you guessed it.
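A minimal sketch of the inspection principle; LinearRegression and fit_intercept are just illustrative choices here:
from sklearn.linear_model import LinearRegression
lm_model = LinearRegression()
print(lm_model.get_params())   # all hyperparameters as a dict
print(lm_model.fit_intercept)  # each one is also a plain attribute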
General steps when using Scikit-learn¶
- choose a class of model
- instantiate a model from the class by specifying hyperparameters to its constructor
- arrange data into X and y and split them for training and testing
- fit / learn the model on training data by calling the fit() method
- predict new values by calling the predict() method
- evaluate results (see the end-to-end sketch below)
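Putting these steps together, here is a minimal end-to-end sketch; the synthetic data is an assumption, purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 1))                              # synthetic feature
y = 3 * X.ravel() + rng.normal(scale=0.1, size=100)   # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression()           # choose and instantiate
model.fit(X_train, y_train)          # learn from training data
y_pred = model.predict(X_test)       # predict new values
print(model.score(X_test, y_test))   # evaluate (R² for a regressor)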
Train-test split¶
To split the input data into train and test sets, use
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
which yields a 70% train / 30% test split. The function splits the independent and dependent attributes together, keeping the rows aligned, so that predictions on the test set can be validated against the corresponding true values.
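For a reproducible split, train_test_split also accepts a random_state seed:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)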
Training¶
The general syntax is model.fit(independent_train, dependent_train). Thus
lm_model.fit(x_train, y_train)
In the case of unsupervised models there is no dependent variable to supply, so fit() takes only the training features:
model.fit(x_train)
In Scikit-learn, by convention all model parameters that were learned during the fit() process have a trailing underscore; for example, in this linear model we have lm_model.coef_ and lm_model.intercept_.
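A quick sketch of inspecting these learned parameters (assuming lm_model was fitted as above):
print(lm_model.coef_)       # learned coefficient(s)
print(lm_model.intercept_)  # learned intercept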
Training score¶
The model.score() method returns a single number indicating how well the model fits the given data: for classifiers it is the mean accuracy (between 0 and 1), and for regressors it is the R² score (at most 1, and possibly negative for a very poor fit). Comparing the score on training and test data is useful for spotting underfitting and overfitting.
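For example, comparing the training and test scores (variables as in the earlier sections) gives a rough read on over- or underfitting:
print(lm_model.score(x_train, y_train))  # fit on data the model has seen
print(lm_model.score(x_test, y_test))    # generalization to held-out data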
Prediction¶
Use model.predict(<independent_test data>). Thus, for linear regression,
y_predicted = lm_model.predict(x_test)
Prediction probabilities¶
For classification problems you also get a model.predict_proba() method, which returns the probability of each class; model.predict() returns the class with the highest probability.
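A minimal sketch with a classifier; LogisticRegression is an assumption chosen for illustration, and y_train must hold class labels here:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train, y_train)           # y_train: class labels
probs = clf.predict_proba(x_test)   # one probability per class, per row
labels = clf.predict(x_test)        # class with the highest probability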
Transformation¶
Relevant in unsupervised models, model.transform() is used to transform input data to a new basis. Some models combine fitting and transformation into one step with the model.fit_transform() method.
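A short sketch using PCA as one example of a transformer; PCA and n_components=2 are assumptions for illustration:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)                # project onto a 2-dimensional basis
x_reduced = pca.fit_transform(x_train)   # fit and transform in one step
x_test_reduced = pca.transform(x_test)   # reuse the learned basis on new data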
Validation¶
You can obtain the MAE (Mean Absolute Error), MSE (Mean Squared Error), and RMSE (Root Mean Squared Error) from the metrics module.
from sklearn import metrics
import numpy as np
mae = metrics.mean_absolute_error(y_test, y_predicted)
mse = metrics.mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE