autogen/notebook/integrate_azureml.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved. \n",
    "\n",
    "Licensed under the MIT License.\n",
    "\n",
    "# Run FLAML in AzureML\n",
    "\n",
    "\n",
    "## 1. Introduction\n",
    "\n",
    "FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models \n",
    "with low computational cost. It is fast and economical. The simple and lightweight design makes it easy \n",
    "to use and extend, such as adding new learners. FLAML can \n",
    "- serve as an economical AutoML engine,\n",
    "- be used as a fast hyperparameter tuning tool, or \n",
    "- be embedded in self-tuning software that requires low latency & resource in repetitive\n",
    "   tuning tasks.\n",
    "\n",
    "In this notebook, we use one real data example (binary classification) to showcase how to use FLAML library together with AzureML.\n",
    "\n",
    "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [azureml] option:\n",
    "```bash\n",
    "pip install flaml[azureml]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install flaml[azureml]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enable mlflow in AzureML workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mlflow\n",
    "from azureml.core import Workspace\n",
    "\n",
    "ws = Workspace.from_config()\n",
    "mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 2. Classification Example\n",
    "### Load data and preprocess\n",
    "\n",
    "Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from flaml.data import load_openml_dataset\n",
    "X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Run FLAML\n",
    "In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "''' import AutoML class from flaml package '''\n",
    "from flaml import AutoML\n",
    "automl = AutoML()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "settings = {\n",
    "    \"time_budget\": 60,  # total running time in seconds\n",
    "    \"metric\": 'accuracy',  # can be: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr',\n",
    "                           # 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'\n",
    "    \"estimator_list\": ['lgbm', 'rf', 'xgboost'],  # list of ML learners\n",
    "    \"task\": 'classification',  # task type    \n",
    "    \"sample\": False,  # whether to subsample training data\n",
    "    \"log_file_name\": 'airlines_experiment.log',  # flaml log file\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "experiment = mlflow.set_experiment(\"flaml\")\n",
    "with mlflow.start_run() as run:\n",
    "    automl.fit(X_train=X_train, y_train=y_train, **settings)\n",
    "    # log the model\n",
    "    mlflow.sklearn.log_model(automl, \"automl\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl = mlflow.sklearn.load_model(f\"{run.info.artifact_uri}/automl\")\n",
    "print(automl.predict_proba(X_test))\n",
    "print(automl.predict(X_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Retrieve logs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "mlflow.search_runs(experiment_ids=[experiment.experiment_id], filter_string=\"params.learner = 'xgboost'\")"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "0cfea3304185a9579d09e0953576b57c8581e46e6ebc6dfeb681bc5a511f7544"
  },
  "kernelspec": {
   "display_name": "Python 3.8.0 64-bit ('blend': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}