Building data science projects using Azure-ML stack
This wiki covers the steps involved in building a data science project using Azure Machine Learning Workbench product. This also covers the steps involved in productionizing the model as a web service and accessing it over HTTP using its REST API.
Azure ML consists of two major parts - Azure ML portal - Azure ML Workbench
Azure ML Workbench
Azure ML provides a cloud infrastructure to turn your machine learning projects into production code with the scalability and availability of its cloud infrastructure. As of this blog, Azure itself does not have ML as Service, but it allows you to turn your projects into one.
The Workbench is a desktop application + Python libraries with which you build your ML experiments, taking it from data cleaning, model building to refinement. It provides frameworks to
- access datasets
- build models with defined inputs and outputs
- refine models by tuning hyperparameters and export model parameters and learning artifacts
The Workbench has a run history that allows you to visualize the run results as a time-series and helps you visualize the effects of each run.
The Workbench supports at-least 3 different runtimes.
local- This is your local anaconda Python environment (or R env)
docker-python- This encapsulates the
localenvironment as a Docker container and runs it with this container. This container could run locally, on host OS, or can run on a remote machine as well (such as a VM or HDInsights cluster).
docker-pyspark- similar, but is useful for distributed computation using Spark.
The output from Workbench is your Docker image. Azure ML helps you to deploy this image on Azure Container Services and get a
REST API for predictions and inference during production.
The ML model consists of - model file (or dir of such files) - Python file implementing model scoring function - Conda dependency file (.yml) - runtime environment file - schema file for REST API parameters - manifest file (auto generated) for building Docker image
Azure ML Model Management console allows your register and track your models like a version control system.
Azure ML portal
Azure dashboard contains all Azure sub products, including the ML services. Sign in with free Outlook account or Office 365 account and you get
$200 in free credits.
In the console, search for Machine learning experimentation (which is in preview at the time of this article). This service / dashboard provides an environment for the rest of this article.
An important process is to write the Python code (in scripts and notebooks) by using
azureml Python library for data access, logging and printing. This is the hook for the Workbench UX to interpret model refinement and display them in the dashboard.
- Log into the ML portal, create a new experiment. Help image
- Install Workbench. Note: You need macOS, Win 10 or higher for Docker to run. Then sign into the workbench using the same account. Your experimentation account shows up on Workbench. Help image
- Create a new project, pick from a template if necessary. Help image
Your Azure ML projects are primarily
git repositories. When you open an existing template project, the
home tab renders the
- Add datasets - Click on the data tab and plus sign. Use wizard to add data sources. If data is tabular, the Workbench attempts to display it as a table and provide some preliminary statistics.
ML Workbench wants to keep a record of all your EDA and data wrangling steps, to aid reproducibility. Hence it stores a
.dsource to record how data is input and a
.dprep to record the transformations performed on the data.
.dsourcefile gets created that contains connections to your dataset and how your file is read.
Any data preparation performed by using the built-in prepare tool is stored in
At any time, you can do the same actions in Python, or turn the UX actions into Python code by right clicking the
.dsource file and generating code file. The sample code that reads from
.dsource and returns a pandas DataFrame is below. Script proceeds to run data preparation using
from azureml.dataprep import datasource from azureml.dataprep import package df = datasource.load_datasource('iris.dsource') # df.head(10) # column-prep.dprep is my data prep file name. df = package.run('column-prep.dprep', dataflow_idx=0) df.head(10)
- Serialize your model and its weights using
with open('./outputs/model.pkl', 'wb') as mh: pickle.dump(classifier, mh)
Save your plots as
.pngfiles into the
./outputsfolder. The Workbench picks this up automatically. For each run, the Workbench stores the outputs within the run numbered folder. From the run dashboard, you can download a file later for any run.
Choose runtime as
docker-python. The Docker base image is specified in the
aml_config/docker.computeand run configurations, Python dependencies in
docker-python.runconfigfiles. Workbench first builds a Docker image from base image, installs dependencies and then runs the file.
Once model is developed, you need to create a schema file that contains the inputs and output of the prediction service. This service is a Python file that reads the pickled model file, accepts inputs, performs predictions and returns the results.
azureml Python library has methods such as
azureml.api.realtime.services.generate_schema() to generate schema. To specify the data types, use
azureml.api.schema.sampleDefinition.SampleDefinition to specify a sample input. Documentation on these is very thin and is left to user experimentation.
- It appears that using Workbench UX to read and clean data into DataFrames is optional. You can as well do all in Python in standard way and create a DF from a .csv file.
- You accept script parameters as command line arguments. The Workbench UX provides a generic args text box into which you can type the values when running from UX.
- The text you want persisted in the logs should be sent to Azure ML logger
from azureml.logging import get_azureml_logger run_logger = get_azureml_logger() run_logger.log('text')
Write your scripts such that there is
1main Python file which in turn calls other files that are necessary.
control_logfile contains the running log of the job with details injected by Workbench
driver_logfile contains the print statements
Sources - Azure ML help