Building data science projects using the Azure ML stack
This wiki covers the steps involved in building a data science project using the Azure Machine Learning Workbench product. It also covers productionizing the model as a web service and accessing it over HTTP through its REST API.
Azure ML consists of two major parts:
- Azure ML portal
- Azure ML Workbench
Azure ML Workbench
Azure ML lets you turn your machine learning projects into production code with the scalability and availability of its cloud infrastructure. As of this writing, Azure does not itself offer ML as a Service, but it allows you to turn your own projects into one.
The Workbench is a desktop application + Python libraries with which you build your ML experiments, taking them from data cleaning through model building to refinement. It provides frameworks to
- access datasets
- build models with defined inputs and outputs
- refine models by tuning hyperparameters and export model parameters and learning artifacts
The Workbench keeps a run history that lets you visualize run results as a time series and see the effect of each run.
Runtime environments
The Workbench supports at least three different runtimes:
- `local` - your local Anaconda Python environment (or R env)
- `docker-python` - encapsulates the `local` environment as a Docker container and runs it in that container. The container can run locally on the host OS, or on a remote machine (such as a VM or an HDInsight cluster).
- `docker-pyspark` - similar, but useful for distributed computation using Spark.
Deployment environments
The output from the Workbench is your Docker image. Azure ML helps you deploy this image on Azure Container Service and get a REST API for predictions and inference in production.
The ML model consists of:
- a model file (or a directory of such files)
- a Python file implementing the model scoring function
- a Conda dependency file (.yml)
- a runtime environment file
- a schema file for the REST API parameters
- a manifest file (auto-generated) for building the Docker image
The Azure ML Model Management console allows you to register and track your models like a version control system.
Azure ML portal
The Azure dashboard contains all Azure sub-products, including the ML services. Sign in with a free Outlook account or an Office 365 account and you get $200 in free credits.
In the console, search for Machine Learning Experimentation (in preview at the time of this article). This service provides the environment for the rest of this article.
Walkthrough
An important practice is to write your Python code (in scripts and notebooks) using the `azureml` Python library for data access, logging, and printing. This is the hook that lets the Workbench UX interpret model refinement runs and display them in the dashboard.
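For example, values logged through `azureml` show up in the run history charts. A minimal sketch (the metric names and values here are illustrative):

```python
# Minimal sketch: log a hyperparameter and a result metric so the
# Workbench run history can chart them across runs. Names and values
# are illustrative assumptions.
from azureml.logging import get_azureml_logger

run_logger = get_azureml_logger()
run_logger.log('regularization_rate', 0.01)
run_logger.log('accuracy', 0.95)
```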
Initialization
- Log into the ML portal and create a new experiment.
- Install the Workbench. Note: you need macOS, or Windows 10 or higher, for Docker to run. Then sign into the Workbench using the same account; your experimentation account shows up in the Workbench.
- Create a new project, picking from a template if necessary.
Model development
Your Azure ML projects are primarily `git` repositories. When you open an existing template project, the home tab renders the `readme.md` file.
- Add datasets: click the data tab and the plus sign, then use the wizard to add data sources. If the data is tabular, the Workbench attempts to display it as a table and provides some preliminary statistics.
The ML Workbench keeps a record of all your EDA and data wrangling steps, to aid reproducibility. Hence it stores a `.dsource` file to record how data is input and a `.dprep` file to record the transformations performed on the data.
- A `.dsource` file gets created that contains connections to your dataset and how your file is read.
- Any data preparation performed by using the built-in prepare tool is stored in a `.dprep` file.
At any time, you can do the same actions in Python, or turn the UX actions into Python code by right-clicking the `.dprep` or `.dsource` file and generating a code file. Sample code that reads from the `.dsource` file and returns a pandas DataFrame is below; the script then proceeds to run the data preparation using the `.dprep` file.
```python
from azureml.dataprep import datasource
from azureml.dataprep import package

# Load the raw data source as a pandas DataFrame.
df = datasource.load_datasource('iris.dsource')
# df.head(10)

# column-prep.dprep is my data prep file name.
df = package.run('column-prep.dprep', dataflow_idx=0)
df.head(10)
```
- Serialize your model and its weights using the `pickle` library:

```python
import pickle

# Persist the trained classifier to the outputs folder.
with open('./outputs/model.pkl', 'wb') as mh:
    pickle.dump(classifier, mh)
```
- Save your plots as `.png` files into the `./outputs` folder. The Workbench picks these up automatically. For each run, the Workbench stores the outputs within a run-numbered folder; from the run dashboard, you can later download a file for any run. A sketch of saving a plot there is below.
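A minimal sketch, assuming matplotlib (the figure contents and file name are illustrative):

```python
# Minimal sketch: save a figure into ./outputs so the Workbench
# attaches it to the run. Assumes matplotlib; names are illustrative.
import matplotlib
matplotlib.use('Agg')  # headless backend for containerized runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0.9, 0.6, 0.4, 0.3])  # e.g. a loss curve
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
fig.savefig('./outputs/loss.png')
```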
- Choose the runtime as `docker-python`. The Docker base image is specified in `aml_config/docker.compute`, and the run configuration and Python dependencies in the `docker-python.runconfig` file. The Workbench first builds a Docker image from the base image, installs the dependencies, and then runs your file.
Operationalization
Once the model is developed, you need to create a schema file that describes the inputs and outputs of the prediction service. The service is a Python file that reads the pickled model file, accepts inputs, performs predictions, and returns the results.
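A minimal sketch of such a scoring file, assuming a scikit-learn style model exposed through the common `init()`/`run()` pattern (the file and variable names are assumptions):

```python
# score.py - minimal sketch of a scoring file; names are illustrative.
import pickle

model = None

def init():
    # Called once when the web service starts: load the pickled model.
    global model
    with open('model.pkl', 'rb') as mh:
        model = pickle.load(mh)

def run(input_df):
    # Called per request: input_df matches the declared schema.
    prediction = model.predict(input_df)
    return prediction.tolist()
```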
The `azureml` Python library has methods such as `azureml.api.realtime.services.generate_schema()` to generate the schema. To specify the data types, use `azureml.api.schema.dataTypes.DataTypes`, and use `azureml.api.schema.sampleDefinition.SampleDefinition` to specify a sample input. Documentation on these is very thin, and much is left to user experimentation.
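Since the documentation is thin, the following is only a sketch of how these pieces might fit together; the sample DataFrame, its columns, and the output path are illustrative, and `run` refers to the scoring function above:

```python
# Sketch: generate the REST schema from a sample pandas DataFrame.
# sample_df and the output path are illustrative assumptions.
import pandas as pd
from azureml.api.schema.dataTypes import DataTypes
from azureml.api.schema.sampleDefinition import SampleDefinition
from azureml.api.realtime.services import generate_schema

sample_df = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]],
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w'])

inputs = {'input_df': SampleDefinition(DataTypes.PANDAS, sample_df)}
generate_schema(run_func=run, inputs=inputs,
                filepath='./outputs/service_schema.json')
```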
Notes
- It appears that using the Workbench UX to read and clean data into DataFrames is optional; you can just as well do it all in Python in the standard way and create a DataFrame from a .csv file.
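For example (the file path is an illustrative assumption):

```python
# Sketch: bypass .dsource/.dprep and read the data directly.
import pandas as pd

df = pd.read_csv('./data/iris.csv')
```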
- You accept script parameters as command-line arguments. The Workbench UX provides a generic arguments text box into which you can type the values when running from the UX.
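A minimal sketch using `argparse` (the parameter name and default are illustrative):

```python
# Sketch: read a hyperparameter from the command line.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--reg', type=float, default=0.01,
                    help='regularization rate')  # hypothetical parameter
args = parser.parse_args()
print('regularization rate =', args.reg)
```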
- Text you want persisted in the logs should be sent to the Azure ML logger:

```python
from azureml.logging import get_azureml_logger

run_logger = get_azureml_logger()
run_logger.log('message', 'text')  # the logger records name/value pairs
```
- Write your scripts such that there is one main Python file, which in turn calls the other files that are necessary.
- The `control_log` file contains the running log of the job, with details injected by the Workbench; the `driver_log` file contains the print statements.
Sources
- Azure ML help