How to deploy machine learning models into production




Hey, hi everyone, welcome to the talk "How

to deploy machine learning models into

production". Thanks for stopping by. I'm

actually really excited to talk about

this topic, because I think it's very

important that we talk about the

second step of the machine learning flow

that everyone knows, which is deploying

machine learning models to production,

because I believe

that's where the real value of data

science comes in, right? So in the whole

cycle till this point you have been

investing, and now you come to the point

where you deploy the model somewhere so

that it really brings value to the enterprise. And

strangely enough, not a lot of enterprises

talk about it, for several reasons: maybe

the way they do it is too complex, maybe it's

a trade secret, or maybe it's just

outright too embarrassing because they

do it in such a bad way. So my

motivation with this talk is to give

you some insight and also some food for

thought, and some, let's say, tracks to

follow up on if you are on that path as

well. Before I go into details, I'll briefly

introduce myself my name is summit quill

I'm a software engineer at the IBM Research

and Development labs in southern Germany. If

you're on Twitter and that kind of stuff,

you can follow me on Twitter. I'm pretty

responsive, so if you have any questions

just write to me. In my day-to-day life I do

the development work for the product

called Watson studio but I'm also

involved in some client engagements for

IBM. So that's the agenda for the

talk. I'm going to talk about the motivation,

that is, why I am talking about this and why

I think it's important; what a Python

model actually is (we all say model,

model, model, but what kind of thing

is it exactly?); what different

production environments look

like; and then I will try to demo that in

the standard production environments,

which are Cloud Foundry, Docker, Kubernetes,

or managed services. So I think

all of us have seen this picture or some

variant of this picture many times so

this picture basically does nothing but

give an overview of what a machine

learning flow, or data science

flow, looks like. It starts with data

acquisition and business understanding, where

you pose the problem that you're trying to

solve with your machine learning model,

then data understanding, data preparation,

modeling and evaluation, and eventually

there may or may not come the step of

deployment. But surprise, surprise, the

work does not end there, right? If I

were a data scientist, I would be really,

really sad if I came to the office every

day for two months to model, did

hyperparameter optimization, had a final set of

parameters which gives perfect

results, but it just dies there because

nobody deployed it. That would be

really sad for me, and I think that's

where the ops part of machine learning

comes in, which is, I would say, an even

newer field than machine

learning itself. So I would really like

to think of the rest of it as

development, and the ops part comes in

when it comes to deployment. So,

as I said, I am also pretty often talking to

IBM clients, and the pattern that

we see they have been

employing is that there's a data

scientist who has been working for

example let's say a Python model he did

some work for weeks or months or

whatever their cycle is he thinks that

he has a good enough model and then he

just throws it over the wall to the

application development team IT

department whatever and then their job

is to now develop an application with

that which can be used. It's fine if

it works for enterprises, but I see some

flaws with this technique. I

mean, there are a lot of gaps and

hurdles if there's such a huge fence

between the data science team and the

development team. For example,

let's say after a week of work Mike, the data scientist,

decides that he has found a new

amazing feature for his machine learning

model which improves the performance, but

Deb, the developer, is still relying on the old data

set, or the old information she had:

"okay, I need so many parameters to train

this model". So my point here is

that it's better if you slowly

move towards the second model, in which

Mike the data scientist and Deb the developer

are working in unison. Mike

develops a model, maybe in Python or whatever,

and then he works closely with

Deb, the developer, to

combine that and deploy it into

production, where you're supposed to

bring the value of the machine learning model.


So I'll just set the stage

for the presentation. I'm concentrating

on production environments in the cloud,

because that's what all the cool kids do.

Python is the primary programming language

here. Microservices-based architectures,

no monolithic structures. Traditional

machine learning algorithms no deep

learning that's just for the fact

because it's easier to deploy

traditional machine learning algorithms

because they are smaller in size usually

and data science team and software

development team are working in unison

Before I go into the details, I'll just give

you a quick glimpse of the use case

that we'll be using. It's

actually a real-life data set

from a credit card fraud detection company.

What they did is they took close to 300,000

credit card

transactions. They had I don't know how

many hundred parameters; they did

principal component analysis and reduced

that to 31 columns. These are

various aspects of a credit card

transaction. This is sensitive information in

their system, so they anonymized the data,

so we don't know what these columns mean,

but two or three of them are

tangible: the time of the

transaction, the amount of

the transaction, and the class. If

it's a zero it's a non-fraud transaction,

if it's a one it's a fraud transaction.

Okay, so that's the data set the data

scientist got. He chose Python as his

language of choice, so he started working;

for example in this case he's using a

random forest classifier so he did the

the standard workflow of ingesting the

data cleaning the data eventually

modeling it using a random forest

classifier. So this side is the

development part for the data scientist,

and that's where my production environment

comes in, which is, let's say, my

end goal, where I want to deploy this

model. Now how do I get there? There are

several ways.

The first one is unfortunately the most

common that we have seen till now. A

data scientist builds a model,

and then, on the production side,

you think about Java applications, Node

applications, C/C++ applications. So how do

they communicate with each other? One

very bad way of doing that is

that you translate those models

from Python into C/C++ and then redeploy them,

which is okay maybe for performance

reasons, but it's really bad for time

to market, right?

If you have to rewrite

everything, it's at least a gap of one or two

weeks, and it's possible in

the fast-moving world that your model has

lost the value it had. Then there's PMML: you

can translate the type of

the model, but they are not meant to be

very stable.

The third way, which I really like a lot,

is that you just take your Python model, you

serialize it, and you deploy it as a

Python application serving a REST API.

Why a REST API? Because if your

application can talk HTTP, it can talk

to this application. So

this was the side of the Python machine

learning model. What might a production

environment look like? I know

every production environment is

different, but it's one or another

flavor of this one. So we are here

with the serialized object,

which is my Python model. I will go into

the details of what exactly that is, but

it's basically nothing, it's

just some bytes, it's a blob, and

you want to save that

into a database. Why would you save a

model into database because you want to

keep lineage what model was deployed in

my production environment what time so

for example, let's say on 28 June my model

made a prediction. If I want to look at

that a week later, I want to know

which model, which version of the model,

was deployed and why it predicted what it did. I

don't just want to save a blob of data; I

also want to augment

that information with the version of the

model, the name of the model, the machine

learning framework that I used for it, and

what the performance of my model was.
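The lineage idea above can be sketched as a small record type plus a lookup that answers "which version was live on a given date". This is only an illustration under assumptions: the field names, the version history, the metric values and the year are hypothetical, and a real setup would store these records as documents in the database next to the model blob.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ModelRecord:
    """Metadata kept next to the serialized model (field names are illustrative)."""
    name: str
    version: str
    framework: str
    metric_name: str
    metric_value: float
    deployed_at: datetime
    blob: bytes  # the serialized model itself


def model_deployed_at(records: List[ModelRecord], when: datetime) -> Optional[ModelRecord]:
    """Return the latest record deployed at or before `when`, or None."""
    candidates = [r for r in records if r.deployed_at <= when]
    return max(candidates, key=lambda r: r.deployed_at) if candidates else None


# Hypothetical deployment history; the blobs are placeholders, not real models.
records = [
    ModelRecord("fraud-detection", "1", "scikit-learn", "auc", 0.78, datetime(2018, 6, 1), b"blob-v1"),
    ModelRecord("fraud-detection", "2", "scikit-learn", "auc", 0.81, datetime(2018, 6, 20), b"blob-v2"),
    ModelRecord("fraud-detection", "3", "scikit-learn", "auc", 0.83, datetime(2018, 7, 2), b"blob-v3"),
]

# Which model answered a prediction made on 28 June?
served = model_deployed_at(records, datetime(2018, 6, 28))
print(served.name, served.version, served.metric_value)
```

Keeping the deployment timestamp in the record is what makes the "one week later" question answerable without guessing.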

I saved that model. Now what would my,

let's say, machine learning app

look like? This is

my app, the lower block. My logic

is my model, so I connect to a

database and download my model from there.

Depending upon whether I'm using Python,

R, Scala, PySpark, whatever, I install

that framework; imagine a Docker

container here. Since I'm serving it

as a REST API endpoint,

I need some sort of web server. It could

be Flask, it could be Tornado, anything of

your choice. You load the machine

learning model and create a route for your

application, which is like, okay, "predict",

for example. You prepare the data coming

in. In the modeling stage

you used all those 300,000 transactions that

you had to build your model. Now in the

predict stage, what happens? There's a new

request coming in, for example from your

bank portal, which says there's a new

credit card transaction with so many

features, can you tell me if it's a fraud

or not. So that's your prepare step. You

run the prediction, do some

post-processing, and return the result to

this REST API endpoint. So let's put some

real-life context into that. Imagine any

e-commerce company of your choice; I'm

sure it was not Amazon in your mind. You

put your e-commerce company into the

picture. There are customers. Let's

imagine this is solved by microservices,

so one microservice is serving the UI,

one is handling the basket,

one is taking care of the orders,

whatever. Then comes a new order, right?

The platform now wants

to know if this order, this

transaction, is a fraud or not. What

will it do? It will make a REST

call to the machine learning model,

asking it: tell me if it's a fraud or not.
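From the calling microservice's side, that REST call might look roughly like the sketch below. The URL, the JSON shape (`{"features": [...]}`) and the timeout are assumptions made for illustration; the talk does not specify them. The request is only built here, not sent, so the snippet runs without a live service.

```python
import json
import urllib.request

# Hypothetical scoring endpoint; replace with the real route of the model service.
SCORING_URL = "https://fraud-detection-api.example.com/api/predict"


def build_scoring_request(features, url=SCORING_URL):
    """Build a JSON POST request carrying one transaction's features."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(features, url=SCORING_URL, timeout_seconds=0.5):
    """Send the request and decode the JSON answer (needs a running service)."""
    request = build_scoring_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative, shortened feature vector.
request = build_scoring_request([0.0, -1.36, 149.62])
print(request.get_method(), request.full_url)
```

Because the contract is plain HTTP and JSON, the caller can be written in Java, Node or anything else; the short timeout is one way to make a response-time requirement explicit on the client.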

Now come the serious questions of

this whole business, right? Since now

your machine learning model is such a tightly

coupled part of your system, you have to

put some requirements on this machine

learning model. It's not a toy use

case anymore. So you have to ask, okay,

what is your maximum response time? If

it's a credit card transaction, you

cannot say "machine learning model, give

me something in five minutes". No, it

doesn't work like that. It has to be

milliseconds, depending upon your

requirements. Availability: it should be

available all the time, so if your

platform is 24/7, your model should be

24/7 as well. Quality, or confidence of

prediction: if your machine learning

model says "I'm 40% sure this is not a

fraud",

what do I learn from that, right? 40

percent sure it's not a fraud; I might as

well flip a coin at 50%.

So just give me something

which is more than rubbish knowledge.
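One way to act on that requirement is to accept a prediction only when the model is confident enough in either direction and to route everything else to a fallback. The sketch below is an assumption, not something shown in the talk: the function name, the 0.8 threshold and the "needs review" fallback are all hypothetical.

```python
def decide(p_fraud: float, min_confidence: float = 0.8) -> str:
    """Turn a fraud probability into a decision, refusing to guess when unsure."""
    if not 0.0 <= p_fraud <= 1.0:
        raise ValueError("p_fraud must be a probability between 0 and 1")
    if p_fraud >= min_confidence:
        return "fraud"
    if (1.0 - p_fraud) >= min_confidence:
        return "not fraud"
    # Neither class reaches the required confidence.
    return "needs review"


for p in (0.05, 0.6, 0.95):
    print(p, decide(p))
```

The threshold itself is a business decision; the point is only that the serving code states it explicitly instead of passing along near coin-flip answers.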

What's your maximum retraining time?

Modeling is an iterative process, if you

remember the slide from before.

The model which is relevant today might

not be relevant tomorrow. So what's the

maximum retraining time for your model, and

does that mean downtime, right? If

I'm retraining the model, does it mean

that I cannot use my platform anymore?

Things like that. So impose those kinds of

requirements.

Deployment with Cloud Foundry.

How many of you are aware of Cloud

Foundry already? Good, a good number. For

the people who are not aware of Cloud

Foundry: Cloud Foundry is a standard

platform as a service.

it's kind of a software stack which you

can deploy on your data center it was

developed by VMware originally, and I

think now it's owned by Pivotal Software.

What Cloud Foundry does is make

the life of data scientists and

application developers much

easier. It takes care of the application

lifecycle management how many instances

do you want if something goes wrong it

says ok one application died I want to

start a new one and it takes care of

logging monitoring routing things like

that. So in the context of a Cloud Foundry

application, what would it look like?

Cloud Foundry has the concept of a

buildpack; imagine this as something like a Docker

image. With my buildpack in place,

I would say: connect to my database, load

my model, and do the same thing. Since

the example I'm using is a Python

machine learning use case with scikit-learn,

I import the scikit-learn package in my

application. I import Flask; Flask

is like a standard web

server. You create the route, you prepare the

scoring data, run the prediction, and serve it

as a REST API. Now the questions

which are very specific to the

production environment you're using: how do

you configure your app? You do not want

to store your credentials in the

application; maybe you call a GitHub repo

to get the credentials for connecting to

Cloudant, your database of choice. How

many instances do you want? Cloud

Foundry also takes care of load

balancing.

If your app at first gets, like,

five requests per second and next 8,000

requests per second, you can set it to

scale to the number of instances you want.

Zero-downtime deployments: how do you

deploy a new version of your application,

map and unmap routes, things like that.

Crash recovery, workload scaling. So

depending upon those questions, you

decide on those things. Yeah, I was

talking about buildpacks. Just imagine

these as different kinds of pre-built

Docker images, as you would use with Kubernetes, for

example. Since it is a Python

application, I will choose the Python

buildpack. I will go into the details of that

in the demo, but just keep in

mind that, for example, I developed this

Cloud Foundry app, which is in my project

called fraud detection API. From the CLI I

do cf push, which basically means push my

application. I think in Kubernetes

it would be kubectl apply, or

something like that. Depending upon where

my configuration comes from, I set the

environment variables for my credentials,

things like that, and which

Cloudant database holds my model. Now

my app is up and running, and

any application practically

in the world, if I do not have

secure authentication on it, can

create a POST request and get the result

back. Now let me go quickly into the

application so you have a better idea of what

I'm talking about.

So this is actually a very, very lean

application. End to end I think it's not

more than a hundred lines of code,

which you can't see. Okay,

let me make it a bit bigger. It's too big.

Can everyone read it? Perfect.

There's actually not very much going on.

There's just one Python script; it's

called hello.py, very innovative of

me. It imports from Cloudant. Just

give me a quick second actually I just

realized I should give you more context

into what the use case is

Yeah, just to give you more insight into

what my fraud detection use case is.

These lines are not important. Just to

give you an idea, this is a real

transaction from that credit card data set. It

says at this time there were so many

features, the amount was 149.62

dollars, euros, whatever, and the

class was zero, so it's a non-fraud case.

Then comes the step of preparing the

data. You divide your set into a training

set and a test set. In this case we are just

using a very simple random forest

classifier. You train the model, which

took like 1.6 seconds, all

in all 1.2 seconds, and you make

predictions with it. This is the standard

confusion matrix. If the weights are

higher on this diagonal, your model

is good. But notice here that

out of 284,000

transactions, I think there are just 500

which are fraud cases, so it's a very

skewed problem. So one way of identifying

if your model is doing well or badly is

precision and recall. Precision

simply means: if your model says it's

a fraud, how sure can you be that it's

actually a fraud? And recall is how

many of the fraud cases it can actually catch.
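To make the two definitions concrete, here is a small worked example computed from confusion-matrix counts. The counts are hypothetical (they are not the numbers from the notebook in the talk); they only mimic a heavily skewed test set.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical confusion-matrix counts for a skewed test set of 57,000 rows.
tn, fp, fn, tp = 56_850, 10, 30, 110
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision, recall = precision_recall(tp, fp, fn)

# A "model" that always answers non-fraud is almost as accurate,
# yet it catches no fraud at all, which is why accuracy misleads here.
baseline_accuracy = (tn + fp) / total
baseline_precision, baseline_recall = precision_recall(tp=0, fp=0, fn=tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-non-fraud: accuracy={baseline_accuracy:.4f} recall={baseline_recall:.4f}")
```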

So this was done by a colleague of mine.

He's a real-life data

scientist; he's never happy with the

models he has, so he's always optimizing

them: defining the grid for hyperparameter

optimization, doing different

training and validation sets with

cross-validation, evaluating the best model.

I think he trained something like 100

models before he came to his model

called RF best, the best random forest model,

and at this point he saved it. So

I'm saving this to a Cloudant

database. It could be any database of your choice.

Cloudant is a NoSQL database, and

since it's a blob, this is a very,

I think, fitting use case for Cloudant, which is

basically CouchDB. I put my credentials

in there, import the Cloudant library,

create a connection, create a

models database,

save the document there, name it random

forest model with this ID, and this is my

document,

which has this blob attachment. Now let

me quickly go to the code. Yep,

so I import the Cloudant library so I can

download that model in my

application. I import pandas, and Flask as

my web server. I start the web

application here; it's basically Flask.

Credentials in Cloud Foundry are called

VCAP services. If it's running in Cloud

Foundry, it will take the credentials

from that environment. Since I'm doing it

locally at the moment, if I have a

vcap-local.json, it will read the

credentials, the username, the password, the

database name, create a connection to the

database and download the model.
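The credential lookup described here can be sketched as follows. `VCAP_SERVICES` is the environment variable Cloud Foundry sets for bound services; the service label `cloudantNoSQLDB`, the local file name and its `{"services": {...}}` layout are assumptions modeled on common IBM sample apps, not taken from the talk's source code.

```python
import json
import os


def load_credentials(service_label="cloudantNoSQLDB",
                     local_file="vcap-local.json",
                     environ=os.environ):
    """Return the credentials dict of the first bound service instance, or None."""
    if "VCAP_SERVICES" in environ:
        # Running on Cloud Foundry: bound services arrive as JSON in the environment.
        services = json.loads(environ["VCAP_SERVICES"])
    elif os.path.isfile(local_file):
        # Running locally: assumed layout is {"services": {<label>: [...]}}.
        with open(local_file) as handle:
            services = json.load(handle)["services"]
    else:
        return None
    instances = services.get(service_label, [])
    return instances[0]["credentials"] if instances else None


# Demo with a fake environment so the sketch runs anywhere (values are made up).
fake_env = {
    "VCAP_SERVICES": json.dumps({
        "cloudantNoSQLDB": [
            {"credentials": {"username": "demo-user",
                             "password": "demo-password",
                             "host": "demo.example.com"}}
        ]
    })
}
creds = load_credentials(environ=fake_env)
print(creds["username"], creds["host"])
```

Passing `environ` as a parameter keeps the same function usable on the platform, on a laptop and in tests, and keeps secrets out of the application code.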

That's it. It just says get the attachment,

the model pickle. In case you are not aware

of pickle: in Python

you create a model, and if you want to

serialize that object, you pickle

it. It just converts that Python object

into a serialized object.
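A minimal round trip shows what "pickling" means in practice. To stay dependency-free, this sketch uses a `statistics.NormalDist` fitted to a few made-up amounts as a stand-in for the trained model; it is not the random forest from the talk, but a scikit-learn estimator is serialized with the same two calls.

```python
import pickle
from statistics import NormalDist

# Stand-in for a trained model: a distribution fitted to made-up amounts.
amounts = [12.5, 40.0, 149.62, 8.99, 75.3, 23.0]
model = NormalDist.from_samples(amounts)

blob = pickle.dumps(model)      # Python object -> bytes (the "blob")
restored = pickle.loads(blob)   # bytes -> an equivalent Python object

print(type(blob).__name__, len(blob))
print(restored == model)
```

The bytes in `blob` are what would be stored as the database attachment. Note that unpickling executes code from the byte stream, so only load pickles from a source you trust.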

There's actually nothing much happening.

I create just one route which is serving

an index.html. It's a standard HTML file

which has a form, and which in turn makes

a call to the API predict endpoint. What

this API predict endpoint does is it

says, okay, incoming request for scoring;

it reads your data from the POST request

and calls model.predict. So

since your model is now loaded as model,

it does a prediction on this model

object and just returns that as the

response to the request. Okay.
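The shape of such a predict endpoint can be sketched without any third-party packages. The talk's app uses Flask; to keep this example self-contained it uses a plain WSGI callable from the standard library instead, and a hard-coded rule stands in for the loaded model. The route name, the JSON shape and the rule are assumptions for illustration only.

```python
import io
import json


def model_predict(features):
    """Stand-in for model.predict on one row (a made-up rule, not a trained model)."""
    amount = features[-1]
    return 1 if amount > 1000.0 else 0


def app(environ, start_response):
    """WSGI application exposing POST /api/predict."""
    if environ.get("PATH_INFO") != "/api/predict" or environ.get("REQUEST_METHOD") != "POST":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    features = [float(x) for x in payload["features"]]             # prepare
    prediction = model_predict(features)                            # predict
    body = json.dumps({"prediction": prediction}).encode("utf-8")   # post-process
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call_app(path, payload):
    """Invoke the app in-process, the way a WSGI server would."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


print(call_app("/api/predict", {"features": [0.0, -1.36, 149.62]}))
```

To serve it for real, the same `app` object can be handed to `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Input validation and authentication are deliberately left out of this sketch.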

now let me go ahead and show that to you

how would that work

Let me also make that a bit

bigger.

Okay, so here you see my file

called hello.py. Since I'm now in

my local environment, I would just say

python

hello.py. It says "found local VCAP

services", since I'm running in my

local environment. Now it just started a

web server which is listening on port

8000. The browser window, yep. So this is

basically nothing, just a page;

there's a UI guy in my team who's very

fond of doing fancy stuff, so in

five minutes he built this form for me.

now I want to score this so what do I do

for that let me just get the body


I just want to get the request body.

Where is it? I've got it here, one second.

Yep, so this is my request body. So

let's imagine that a new request came

in to find out if it's a fraud or not a

fraud. It gave me those 31 features that

are required by the model. I just say, okay,

submit this. If my application is running,

it says 0, which means it's not

a fraud case, and it responded to the API

call. If we look at the logs of the

application, which at the moment is

running: okay, incoming request for scoring,

if everyone can see it,

these are the features I got for this

request, and then I responded with

prediction 0, so I think it's a

non-fraud. So

this was my local environment that's how

basically also every application

developer or software data scientist

work they do everything locally and now

if they feel confident about it they

want to push it to production now let's

go to the second stage of that which is

let me shut down the server yeah that is

my Cloud Foundry application

Now that also actually looks pretty lean.

The only thing that I need to deploy

my application is to say

that it's a Python application, I want to

name it fraud detection API, I want to

host it at a host like fraud

detection API v1, version 1, and I think

my application is not so famous yet, so I

just give it one gigabyte of memory. Now

I go back to my terminal and I say cf

push, which works out: okay, I

already know that you have a manifest

file, I'm using the route fraud detection

API on bluemix.net, which is the IBM

Cloud domain name. Since I already have

an app running with that, it says I will

stop that app because I'm updating this

one, and it will deploy the new app. It says,

okay, I need a Python buildpack for

this one, and it will download the Python

buildpack into your Warden container. So

in the meantime, while this is being deployed,

I'll tell you briefly about Warden

containers. Everybody knows about

Docker containers because they are

famously used in Kubernetes. Cloud

Foundry works with Warden containers.

They're very similar, but there are some

differences between Warden containers and

Docker containers. Docker containers

support various kinds of file

systems; Warden containers support fewer

of them. They work with buildpacks.

I think recently

they can also do multiple ports, but

Warden containers have traditionally

only been able to open one port per

application. Now it's just doing that for

my app; it's doing all the

Python stuff which is required, so all

the packages that I listed there, it will

download those packages and install them.


I hope it finishes fast

in the mean time let me just copy a copy

this URL so I can use it okay so this

does not exist yes because the

application is still being deployed so

I'm on conference Wi-Fi, so this might

take a few seconds. It successfully

destroyed the container. I already had

the application running, so it says, okay,

since you're updating it and you didn't

define a replication strategy, it will

kill the existing container. And now,

since I said I just want one instance

of this because my application is not so

famous yet,

it says it's trying to start this container.

Hopefully it will successfully start

it. Yep, it started. So now let's go to

the browser and open my fraud

detection API. It's there. I don't just

want to show you this page; I also

want to tell you that, since it's now a

REST API on which I did not put any

authentication yet, basically anyone in

the room, if you copy this, can actually

use it: use this POST body and you will

get a response. So let me just copy that

POST body to make sure it's

actually running. Let me get it from here.

Yep, so my app is up and running there. So,

as I said, a production environment can

have different flavors. In your company

it's maybe different, or in your local

environment it's different, but the main

themes and features here remain the same.

Now, since a lot of you are also using

Kubernetes, let me quickly tell you

how it might look if I want to do

it with Kubernetes.

Basically the only difference is this

file, and I'm not kidding, it's just this

Dockerfile. What I do in this

Dockerfile is use this base image,

Python 3.6.3. The community releases

those images; I'm just using this image,

which already has all the things I need.

I copy my application into the working

directory,

I copy the requirements file, I just

run pip install, copy my static files,

and I say, as an entry point, when my

container comes up, execute the script,

which basically at the end does the same thing.

So if I have minikube running... I

had it, but it died on me like 15 minutes

ago.

Let me just try if it's running. Yeah, it

takes too much time, so I think it's not

worth it. I'll go back to the

presentation. Okay, what have we got here?

So I did the Cloud Foundry demo. With

Kubernetes, as you now know, it's basically

the same: the right side of this

slide looks exactly the same.

The differences come from the

production environment that you are

using at the moment. So if your

enterprise, your company, or you yourself feel

more comfortable with Kubernetes

because you have experience with it,

just keep on doing that. It has a few

added benefits: you have much more

control over your environment if you need

it. On Kubernetes you are actually

dealing with OS-level dependencies,

so it's easier for you to manage. If

it's a deep learning model, it's easier

to add GPU nodes. Why? Because you

can just use a CUDA image and

then you have Docker running with GPU

support. Then: how do you manage crash

recovery, routing,

workload scaling, things like that.

The basic difference would just be

that instead of getting everything from

a buildpack, you would be getting the image that you

created from your repository. You deploy that into your Docker

environment, where it will then run as a pod.

You can define your deployment strategy and

replica strategy: I want to have three

pods running all the time. The good thing

about it is that it's easier to manage the

downtime of the application, because at

some point in time you just want to

deploy a new version. What it will do,

if you have three instances running, is

kill one and deploy a new one,

kill the second and deploy the second new one,

so your app is in some sense

running all the time.
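The kill-one, start-one sequence just described can be simulated in a few lines. This is a simplified model of the idea, not the Kubernetes API or its actual scheduling logic; the function name and version labels are made up.

```python
def rolling_update(instances, new_version):
    """Replace instances one at a time, yielding who is serving after each step."""
    current = list(instances)
    for i in range(len(current)):
        # Instance i is taken down; the others keep serving.
        yield current[:i] + current[i + 1:]
        # Its replacement comes up with the new version.
        current[i] = new_version
        yield list(current)


steps = list(rolling_update(["v1", "v1", "v1"], "v2"))
for step in steps:
    print(step)

# With three replicas, at least two are serving at every step.
min_serving = min(len(step) for step in steps)
print("minimum serving:", min_serving)
```

With a single instance, as in the Cloud Foundry demo earlier, the first step of the same function is an empty list, which is exactly the short outage seen while the container was being replaced.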

So that was the Kubernetes part. Next,

deployment with managed services. This

is the point: if you feel that was still

a very involved process, or if you want

your data scientist to

do that, for him it could be a bit

daunting, maybe, in the beginning.

There are actually managed services

which do that for you out of the box.

What happens in those cases is you have a

machine learning model, and this managed

service provides you some kind of, let's

say, SDK or API, a Python client, some

kind of client which offers a kind of

one-click deployment. So you have this

model, you say "I want to deploy this", and it

will deploy this model, take care of

creating the container for you, managing

the versions of it, the evaluation

metrics for it, things like that. And

you can say: my model is

running; if the confidence

score goes below 80 percent, retrain it

and then redeploy the app. Things like

that. You can do it

with, I just know of these, you can do it

with any cloud provider, I think, possibly:

SageMaker from Amazon, Azure, IBM Watson

Machine Learning. Since I am an IBM

employee, I have to talk about it. I don't

have to talk about it, but I have access

to the tools and it's free for me, so I

did it with Watson Machine Learning. Let

me quickly show you what it looks

like there.

Yeah, how it looks with a managed

service is that I create an instance

of the managed service, which gives me

credentials, so to say the entitlement to

do this. There's this

client called Watson Machine Learning

client. You can save your model. I say the

author of this model is me, that's my

email ID,

that's my fraud detection model, which

has an AUC, area under the curve, score of

0.81. Publish this model and store

it in my repository.

It says: published the model, this is a

fraud detection model,

initializing deployment, success. At the

end it just gives you the scoring

endpoint. So the things that I did manually

using a Docker image or a Python

buildpack, it will do for you out of the

box. It will tell you: this is your

endpoint, just go ahead and start adding

it to your applications. How the scoring

looks is basically the same.

I just take the scoring payload, I say

client.deployments, so take my

deployment and score this. It will just

say, okay, I think it's not a

fraud, and this is the confidence that

I have in that. Back to the presentation.

Correct. So, what I wanted to achieve: I

hope you are a bit motivated, or at least

informed about how you can do it, or at

least have some things to Google

about how you would go about it. I showed

you deployment with Cloud Foundry,

partially also deployment with Kubernetes,

and

deployment with managed services using

IBM Watson. All the things are there in the

slides. If you're getting them

from the conference website, you have

some links where you can learn more

about the three platforms that I talked

about. And in the end, thank you for your

attention.

Thank you very much. I'll take any

questions. Yep, sure.

So at Ford we're a huge PCF shop, so this

is awesome. Thank you very much for the

presentation. A couple of quick

integration questions.

Have you tried leveraging Kafka, using

Spring Cloud Stream with PCF, so that

kind of integration point? And then, we

see a lot of PySpark, and I know that

the Python Spark runtime is kind of a

dependency, so using PySpark inside of a

PCF container? Yeah, exactly. Well,

thanks for the question.

I'll take the second question first,

because I know it better. With PySpark

it's actually not very different. What

happens is, when people think about Spark,

they basically think about very huge

use cases. But what's important to

notice here is that my training

was done on a different cluster. Now I'm

just taking care of scoring, and for

scoring you just need the Python

pipeline. For that you just download

a Python Docker image with

PySpark in it, and in your scoring

endpoint you initiate the pipeline. So

that's actually not very different. So I

have done the PySpark case. About

the Kafka use case,

I have not done it myself, but I can

imagine how it would look,

but we didn't talk about it here. More

questions? Yeah. Yep, when you serialize

the model, how many bytes is that? This

one was not much, I think it was a few

kilobytes. Okay. If it's a deep learning

model it can grow a bit more. Okay.

Do you think it's better to

persist that byte stream in the

database, or to persist something like a file path to

the model? That's a good question, but the

thing is, with

things like Docker and Cloud Foundry you

want to keep your application as

stateless as possible. You do not want

to tie this application to a local

file system all the time. That's why

you want to communicate with something

outside, so you can kill this container

at any time without thinking about it,

and then you just say, okay, I have a new

application, I just download the model from my

database. Yep. It also depends upon the

size of your model. A few databases

start having problems if it goes

above 4 gigabytes; 4 gigabytes for a

single model, and that's a problem.

Then, that's my email ID. If anyone has

questions, just write to me. I really like to

talk about this stuff. Perfect, thank you

for your attention, everyone.