Every algorithm is exposed via an Estimator, which can be imported as

    from sklearn.<family> import <model>

For example, for linear regression:

    from sklearn.linear_model import LinearRegression
    lm_model = LinearRegression(<estimator parameters>)
Estimator parameters are provided as arguments when you instantiate an Estimator. Sklearn provides good defaults.
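As a minimal sketch of what passing estimator parameters looks like, the snippet below creates one model with the defaults and one that overrides a single parameter. The variable names `default_model` and `no_intercept_model` are illustrative only.

```python
from sklearn.linear_model import LinearRegression

# Rely entirely on the library defaults
default_model = LinearRegression()

# Override a single hyperparameter at construction time
no_intercept_model = LinearRegression(fit_intercept=False)

print(default_model.fit_intercept)       # the default value
print(no_intercept_model.fit_intercept)  # the value passed in
```

Nothing is trained at this point; instantiation only records the settings.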
In Scikit-learn, Estimators are designed around the following principles:
- consistency: all estimators share a common interface
- inspection: the hyperparameters you set when you instantiate an estimator are available for inspection as attributes of that object
- limited hierarchy: only the algorithms are represented as Python objects. Training data, results, and parameter names use standard Python, NumPy, or Pandas types
- composition: many workflows can be expressed as a sequence of more fundamental algorithms
- sensible defaults: you guessed it.
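Two of these principles, inspection and composition, can be sketched in a few lines. The choice of `Ridge`, `StandardScaler`, and the `alpha=0.5` value is an assumption made purely for illustration.

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Inspection: hyperparameters stay readable on the object
ridge = Ridge(alpha=0.5)
params = ridge.get_params()
print(params["alpha"])

# Composition: chain a preprocessing step and a model into one estimator
pipeline = make_pipeline(StandardScaler(), LinearRegression())
step_names = [name for name, _ in pipeline.steps]
print(step_names)
```

The pipeline itself exposes the same fit() and predict() interface as a single estimator, which is the consistency principle at work.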
A typical workflow with an Estimator looks like this:
- choose a class of model
- instantiate a model from the class by specifying hyperparameters to its constructor
- arrange data into features x and target y, and split them for training and testing
- fit / learn the model on training data by calling fit()
- predict new values by calling the predict() method
- evaluate results
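The steps above can be sketched end to end as follows. The data is synthetic and generated only for this example (a noisy line with slope 3 and intercept 2), and `random_state=0` is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# 1-2. choose a model class and instantiate it
lm_model = LinearRegression()

# 3. split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# 4. fit on the training data
lm_model.fit(x_train, y_train)

# 5. predict on unseen data
y_predicted = lm_model.predict(x_test)

# 6. evaluate
mse = mean_squared_error(y_test, y_predicted)
print(mse)
```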
To split the input data into train and test sets, use

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

to split it into 70% training and 30% test data. This function splits the independent and dependent attributes using the same row selection, so predictions made from x_test can be validated against y_test.
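A small sketch makes the row alignment visible. The toy arrays are assumptions built so that each row of `x` starts with twice its `y` value, and `random_state=42` is added only to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of x is [2i, 2i + 1], and y[i] is i
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(x_train.shape, x_test.shape)

# Rows are shuffled, but x and y stay paired
print(x_test[:, 0], 2 * y_test)
```

Without `random_state`, each call produces a different shuffle.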
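The general syntax is model.fit(independent_train, dependent_train). Thus, for the linear regression example, the training split (x_train and y_train) is what gets passed to fit().
Unsupervised models have only input data and no dependent attribute to learn from. Hence fit() is called with the independent data alone.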
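As an illustration of fitting without a dependent attribute, here is a minimal sketch using `KMeans`. The choice of clustering algorithm, the toy points, and the `n_clusters`, `n_init`, and `random_state` values are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (toy data)
x = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# fit() receives only x; there is no y to pass
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(x)

labels = kmeans.labels_
print(labels)
```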
In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have coef_ and intercept_.
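A short sketch of reading the learned attributes after fitting. The data is an assumption chosen to lie exactly on the line y = 2x + 1 so that the fitted values are easy to recognise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points that lie exactly on y = 2x + 1
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x[:, 0] + 1.0

lm_model = LinearRegression().fit(x, y)

# Learned parameters carry a trailing underscore
print(lm_model.coef_)       # slope, one entry per feature
print(lm_model.intercept_)  # intercept
```

These attributes do not exist before fit() has been called.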
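The model.score() method returns a single score for the data you pass to it: mean accuracy between 0 and 1 for classifiers, and R^2 for regressors, which is at most 1 and can be negative for a very poor fit. Scoring the training data shows how well the model fitted it, and comparing that with the score on the test data helps reveal underfitting and overfitting.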
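To illustrate how the score can expose overfitting, the sketch below fits an unconstrained decision tree to noisy synthetic data and scores it on both splits. The model choice, the sine-shaped data, and the random seeds are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorise the training set
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

train_score = deep_tree.score(x_train, y_train)
test_score = deep_tree.score(x_test, y_test)
print(train_score, test_score)
```

A large gap between the two scores suggests the model has fitted the noise in the training data rather than the underlying pattern.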
To predict, the general syntax is model.predict(<independent_test data>). Thus, for linear regression,

    y_predicted = lm_model.predict(x_test)
In the case of classification problems, many classifiers also provide a model.predict_proba() method, which returns the probability of each class for every sample. model.predict() returns the class with the highest probability.
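The relationship between the two methods can be sketched with a small classifier. `LogisticRegression` and the one-dimensional toy data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: small values belong to class 0, large values to class 1
x = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(x, y)

x_new = np.array([[1.5], [8.5]])
probabilities = clf.predict_proba(x_new)  # one column per class
labels = clf.predict(x_new)

print(probabilities)
print(labels)

# The predicted label is the class whose column holds the largest probability
print(clf.classes_[probabilities.argmax(axis=1)])
```

The column order of `predict_proba()` follows `clf.classes_`.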
Relevant in unsupervised models, model.transform() is used to transform input data to a new basis. Some models combine the fitting and transformation in one step using the fit_transform() method.
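A minimal sketch of the one-step form, using `PCA` as an example of a model that projects data onto a new basis. The random input data and `n_components=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 3-feature data (assumption for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))

pca = PCA(n_components=2)

# Fit the model and project the data in a single call
x_reduced = pca.fit_transform(x)
print(x_reduced.shape)

# Once fitted, transform() applies the same projection
x_again = pca.transform(x)
```

Calling `transform()` on new data afterwards reuses the basis learned from the training data.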
You can obtain the MAE (Mean Absolute Error), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) from the sklearn.metrics module:

    from sklearn import metrics
    import numpy as np
    metrics.mean_absolute_error(y_test, y_predicted)
    metrics.mean_squared_error(y_test, y_predicted)
    np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
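For a self-contained check of these three metrics, the sketch below uses four hand-picked values (an assumption for illustration) so the results can be verified by hand: the absolute errors are 0.5, 0.5, 0 and 1.

```python
import numpy as np
from sklearn import metrics

# Hand-picked values so the arithmetic is easy to follow
y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_predicted = np.array([2.5, 0.0, 2.0, 8.0])

mae = metrics.mean_absolute_error(y_test, y_predicted)  # (0.5 + 0.5 + 0 + 1) / 4
mse = metrics.mean_squared_error(y_test, y_predicted)   # (0.25 + 0.25 + 0 + 1) / 4
rmse = np.sqrt(mse)

print(mae, mse, rmse)
```

RMSE is in the same units as the target, which often makes it easier to interpret than MSE.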