1. Build Your First Machine Learning Model

Today you and I are going to build a simple neural network together. We're going to understand, line by line, what every single line does, and we're going to do something that no other video does but everybody working on neural networks does: we're going to debug that model together, find an issue with it, and make it work.

For your first neural network, it's very traditional to use a dataset called MNIST, which is a handwriting dataset. Our goal here is to look at digits that people actually wrote down and have our classifier try to figure out, from those pixels, what number the person wrote. I'm a huge believer in learning by doing, so I strongly recommend opening up a terminal of your own at this point and following along.

First, open up a terminal and type git clone https://github.com/lukas/ml-class. This will put all the files I use in all the classes into a directory for you called ml-class, and by the way, you're welcome to modify these files in any way you like. Now type cd ml-class and then pip install -r requirements.txt; this installs all the necessary Python packages. Sometimes students have trouble here, and usually it comes from needing to install a newer version of Python. Now type cd videos/intro to go into the directory for this class.

There's actually one more step, to track your work online: type wandb signup. Just for reference, W&B (Weights & Biases) is a product that I built. You can create a free account, and W&B will let you see your progress as you build the models. Since I built this product, we're going to use it heavily during all these videos.

There's one more thing we should always do before we start, and that is to look at our data. You can open up the file mnist.png, which I put in this directory, to see some examples of the digits we're going to classify. MNIST is a famous dataset of handwritten digits from 0 to 9, and our goal here is to produce a model that takes a single 28x28 image and outputs which digit was written. This is a super practical task, generally known as optical character recognition.

There are lots of ways we could approach this problem. Today we're going to use a type of machine learning called a neural network, and we're going to start with the most basic type of neural network, which is called a perceptron. I had to introduce a bunch of scary-sounding terms here, so don't worry if they're confusing; they'll become second nature as soon as we start to work with them. Generally, perceptrons take in an array or list of numbers and output a single number. Just for now, let's slightly simplify our problem to detecting fives in our data. Here the input numbers are pixel values, and we want the output of our perceptron to be a 1 if the digit is a 5 and a 0 if the digit happens to be any other number.

A little history:

perceptrons were actually invented in 1957 by a psychologist named Frank Rosenblatt, and they were originally designed for a very similar image recognition task to the one we're doing right now. I really like to imagine the machine that Rosenblatt actually built: he had an array of light sensors that would look at the image, he would pass the output of each light sensor through a system of dials he built, and the output would be a light bulb either lighting up or staying dark. At first the weights, a.k.a. the knobs, were set randomly. He would put in a picture of a 5, and the "5" sensor probably wouldn't light up. Since the picture actually is of a 5 and he wants the sensor to light up, he would try to get the output to be high. How does he do that? He turns a random knob a little bit and sees whether the light gets brighter. If it gets brighter, he turns it some more; if it gets dimmer, he tries turning the knob in the other direction, and maybe then it gets brighter. Then he moves on to the next knob and does the same thing. Once he's gone through all the knobs, he puts in another picture. If this new picture is not of a 5, he now doesn't want the sensor to light up, so he turns the knobs to make the light dimmer for this picture. Changing the knobs might mess up the settings for the first picture, but we don't worry about that for now. He keeps walking through all of his pictures, turning the knobs, until he's looked at every single image of either a 5 or not a 5. At this point he's done what's known in neural network training as an epoch. Once he's gone through every single picture, he starts back over at the beginning for a new epoch.

These days we usually think of perceptrons as an algorithm more than a machine, so let's talk about the algorithm mathematically.

It's actually super simple. First, we take in a set of input numbers, which in our case happen to be pixel values. Then we flatten the pixel values into one long, fixed-length array. Then we multiply each input by a corresponding number known as a weight, and we add up the results of the multiplications. These weights are those knobs, and they're somehow learned, but we'll go into more about that later. Sometimes we apply what's called an activation function to the output, but we don't need to worry about that yet. Finally, we can interpret the output number however we want, but in this case we've agreed that a 1 means the input is a 5 and a 0 means the input is not a 5.

Let's do this on a very small, very specific example. Imagine for simplicity that our input images were 2x2 images, so we only have four pixel values, where 0 corresponds to a black pixel and 255 corresponds to a white pixel. We flatten the image out into an array of length 4, and our weights happen to be four numbers, starting with 0.12. Where do those weights come from? We set them randomly, but we're then going to learn them when we do our training. So finally, our weighted sum in this case is 2.55.
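The worked example can be sketched in a few lines of Python. Note that the pixel values and three of the four weights below are made up for illustration; only the 0.12 comes from the example above:

```python
import numpy as np

# A hypothetical flattened 2x2 image: 0 = black, 255 = white.
pixels = np.array([0, 255, 0, 0])

# Illustrative weights -- in practice these start random and get learned.
weights = np.array([0.12, 0.01, -0.03, 0.40])

# The perceptron's output is just the weighted sum of the inputs.
weighted_sum = np.dot(pixels, weights)
print(weighted_sum)  # ~2.55
```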

If you're like me, this all becomes much clearer by hacking on it, so let's go to the code. Go back into your directory and open up perceptron-single.py in your text editor, and let's walk through it. The first couple of lines just import the library Keras, which is a fantastic library for building neural networks that we're going to use extensively. The next few lines set up W&B, which we're going to use for looking at our results.

Line 16 is the first interesting line: it uses a special Keras function to load the MNIST data into four datasets. With a normal dataset we'd have to download it and load it ourselves, but MNIST is such a famous, well-known dataset that it's built into the Keras library itself. We're going to use a common notation in machine learning where X stands for the inputs and y stands for the outputs. Here X_train is a list of 60,000 28x28 images; another way of looking at it is that X_train is a 60,000x28x28 array of integers from 0 to 255, where 0 is the darkest black and 255 is the brightest white. y_train is a list of 60,000 labels, which in the case of MNIST are digits between 0 and 9. X_test is a held-out set of 10,000 images that we're going to use to test our algorithm once we've trained it, and y_test is 10,000 more labels that correspond to the images in X_test.

Just for now, we're only going to classify fives versus not-fives, so we have to transform our output data accordingly. Lines 18 and 19 create two new variables, is_5_train and is_5_test, that correspond to exactly whether or not each image is a five. Lines 22 and 23 calculate the image width and the image height using an incredibly useful attribute: shape. X_train.shape gives us the dimensions of our X_train variable, which, as we mentioned earlier, is 60,000x28x28, so X_train.shape[1] and X_train.shape[2] are both 28, since our images are in fact 28x28-pixel squares.
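Here's a rough sketch of what lines 16 through 23 are doing. The arrays below are random stand-ins with the same shapes and value ranges as the real data; the actual script gets them from Keras's built-in MNIST loader:

```python
import numpy as np

# Stand-ins for what keras.datasets.mnist.load_data() returns:
# 60,000 28x28 images of ints 0-255, plus 60,000 digit labels 0-9.
X_train = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
y_train = np.random.randint(0, 10, size=(60000,))

# Lines 18-19: turn the 0-9 labels into a binary "is it a 5?" target.
is_5_train = (y_train == 5).astype(int)

# Lines 22-23: read the image size off the array's shape.
img_width, img_height = X_train.shape[1], X_train.shape[2]
print(X_train.shape, img_width, img_height)  # (60000, 28, 28) 28 28
```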

Now for the important part, where we build our first neural network. Line 26 sets up our network to be Sequential. Keras actually has different ways of defining neural networks, and Sequential is the simplest and most important, so we're going to use it; it means that our network is defined as a series of steps. Line 27 defines the first step in our network, which simply flattens our data from a 28x28 two-dimensional array to a single 784-length one-dimensional array. We tell the Flatten layer that the input will always be 28x28, and this is a fundamental constraint of neural networks: the input size always has to be the same. If we ever have a different-size image, we're going to have to crop it or resize it before we feed it into our network. Line 28 adds a single perceptron to our network. The layer is called Dense because every input is connected to every output; this will make more sense later as our networks get more complicated. Our simple network outputs one single number, which is where that number 1 in line 28 comes from.
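Reconstructed from the walkthrough (so not necessarily verbatim from perceptron-single.py), lines 26 through 28 look roughly like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()                      # line 26: a simple series of steps
model.add(Flatten(input_shape=(28, 28)))  # line 27: 28x28 image -> 784-long array
model.add(Dense(1))                       # line 28: one perceptron, one output number
```

Note that the single Dense(1) layer holds 785 parameters: one weight per pixel plus a bias term, so one knob per pixel and one extra.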

Now, line 29 sets up our network for training, but this really requires some explanation. Our network looks like this: we are multiplying our pixels by a set of weights and adding them up, hoping that we output a 1 when the input is a 5 and a 0 in all the other cases. The only thing we can change about our algorithm is the weights, so how do we find a good set of weights that will output a 1 when our pixels correspond to an image of a 5 and a 0 otherwise?

We talked earlier about how Rosenblatt, who invented perceptrons, trained his algorithm by turning knobs himself, and I think that's in fact an excellent intuition for how all these algorithms work. However, if we actually had to try turning each knob and testing how well the network did, it would take forever to train larger neural networks on lots of data. Luckily, there's a computational method for neural networks that means we don't actually have to turn every knob. In general, searching over a large number of knobs, or parameters, for an optimal setting is an optimization problem, usually solved with what's known as gradient descent, and mathematicians have been thinking about this class of problems for hundreds of years. In the case of neural networks, there's a special optimization called backpropagation, which helps us calculate exactly what's going to happen when we turn each knob, very efficiently. This optimization is really important because we're going to change these knobs, or weights, quite a bit. Since every machine learning library, including Keras, has a large array of excellent gradient descent algorithms built in, we're not going to go into exactly how gradient descent and backpropagation work beyond this intuition, but there's a ton of material online if you're interested in that sort of thing. My favorite resource for this is the 3Blue1Brown series on neural networks, which covers a lot more theory than these lessons and is also a lot of fun.

OK, let's go back to the code.

There are two things we actually do need to define to make gradient descent work. The first is the loss function, which is basically how much we don't like our output, or how different our output is from the output that we wanted. The simplest loss function is mean absolute error, which is just how different your output number is from the output number that you wanted. A fancier loss function is mean squared error, which is how different your output number is from what you wanted, squared; we'll use that for now. Here MSE is short for mean squared error.
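To make that concrete, here's a quick numeric example of both loss functions; the prediction numbers are made up:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])  # the outputs we wanted
y_pred = np.array([0.8, 0.1, 0.3])  # made-up model outputs

mae = np.mean(np.abs(y_pred - y_true))  # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)   # mean squared error: same idea, squared
print(mae, mse)  # 0.2 and ~0.0467
```

Squaring the differences means big mistakes are punished much more heavily than small ones.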

The second thing we have to specify is which gradient descent algorithm to use. One really crucial and sometimes hard-to-set parameter in gradient descent is called the learning rate, and this is basically how much you change the weights every time you feed in a new image from your dataset. Change the weights too slowly and it'll take you forever to find good weights; change them too fast and you might jump right over a good set of weights. The right learning rate can really depend on the problem and other factors. There are many choices of algorithms in Keras, but I almost always use the Adam gradient descent function, because I don't have to specify the learning rate and it can adapt to a wide range of cases; that's what optimizer='adam' means. The final thing I do is set the metric to accuracy. This doesn't change the algorithm itself; what it does is make Keras output the accuracy of our algorithm as the algorithm learns.

The last line in our code does the actual training. We call fit on the model, which makes it look for the best set of weights given the input training data X_train and the output that we want, which is is_5_train. We also ask our model to print out the accuracy on a held-out validation set in addition to the training set that it was trained on. By default, Keras will do the training for one epoch, meaning that it will look at each input exactly one time, but here we set epochs to config.epochs, which we earlier set to 10, so it'll actually go over each training data point ten times.
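Putting the compile and fit lines together, here's a sketch, run on tiny random stand-in data so it finishes quickly; the real script trains on X_train and is_5_train with the MNIST test set as validation data:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(1))

# Line 29: mean squared error loss, Adam optimizer, report accuracy.
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

# Stand-in data: 20 random "images" and random 0/1 labels.
X_fake = np.random.randint(0, 256, size=(20, 28, 28)).astype('float32')
y_fake = np.random.randint(0, 2, size=(20,))

# The fit call: epochs=10 means each example is seen ten times.
history = model.fit(X_fake, y_fake, epochs=10, validation_split=0.2, verbose=0)
```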

Now, this model is super simple, but it will run. Go back into your terminal and type python perceptron-single.py, and this will train the model. You can follow the W&B link it prints to see a chart of the accuracy as the model trains.
And wow, the model's accuracy looks terrible. I would expect that just guessing randomly would give an accuracy of 50%, and always guessing "not 5" would give an accuracy of 90%. One really frustrating thing about working with neural networks is that they never give you helpful errors; what they do instead is just not work that well. Most tutorials skip over the debugging part, but I think debugging is where the learning happens, so let's try to fix this model together. If you really want to learn, maybe pause here for a minute, dive into the Keras docs, and see if you can debug this yourself.

First, let's see if this model gets better with more training. We can set the number of epochs to a hundred and wait a while.
OK, so we're back, and we've trained for a hundred epochs. You can see the loss number really isn't going down at all, and the accuracy is not going up, so this model is definitely not improving. The next thing we need to test is whether the model can learn to fit a tiny subset of the data. For time's sake, we're going to set the epochs back to ten, but the important thing is that we're going to set the input data to be just twenty images: we subset X_train to just the first 20 examples, and then we subset our output data to just the first 20 labels. A reasonable neural network should be able to get a hundred percent accuracy on the first 20 images just by memorizing them.
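The subsetting itself is just array slicing. A minimal sketch, with stand-in arrays of the right shapes:

```python
import numpy as np

# Stand-ins shaped like the real training data.
X_train = np.zeros((60000, 28, 28))
is_5_train = np.zeros(60000)

# Keep only the first 20 examples -- a tiny set a healthy model can memorize.
X_tiny = X_train[:20]
y_tiny = is_5_train[:20]
print(X_tiny.shape, y_tiny.shape)  # (20, 28, 28) (20,)
```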
So we've run this neural network, we pull open the output in Weights & Biases (W&B), and we see that the accuracy is still not good, even when we constrain the input to just the first 20 images. Something is really broken. The next thing to try is to look at what the model is outputting on just a couple of test images. X_test is a set of test images, so we can call print(model.predict(X_test[:10])), and this will show us the numbers actually coming out of the model on the first 10 images.

And this is pretty weird: we were expecting numbers between 0 and 1, but we're getting numbers that are wildly negative and numbers that are above 200. Remember, nothing ever said that a neural network had to output a number between 0 and 1. We'll get into this more deeply in later episodes, but remember how we said you can optionally apply an activation function to the weighted sum of the perceptron? Well, here's a really good reason why: we want to force our output to be between 0 and 1. A common activation function, maybe the most common activation function, is called a sigmoid (for you math nerds, it's also known as the logistic function). For you non-math nerds, the important thing to know about a sigmoid is that big negative numbers are turned into values near zero, big positive numbers are turned into values near one, and numbers near zero are turned into something in between. So no matter what your weighted sum is, the output after taking the sigmoid is going to be between zero and one. We can add a sigmoid by going back into our code and adding just a single argument to our model: activation='sigmoid'.
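If you want to see the squashing for yourself, the sigmoid is one line of numpy:

```python
import numpy as np

def sigmoid(x):
    # Maps any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-200.0))  # very close to 0
print(sigmoid(0.0))     # exactly 0.5
print(sigmoid(243.0))   # very close to 1
```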

Now, on the small dataset, the loss is going down and the accuracy is going up, which is a really good sign. So let's go back in, train the model on the complete set of training data again, and see how well it does. Now our accuracy is closer to 91%, so we're doing better than random.

As an aside, whenever you're building machine learning models, you basically have three ways to make your models better: you can improve your algorithm, like we're doing now, which tends to be really hard, as you can see; you can improve your data preparation, which is also pretty hard, and we'll talk about that more later; or you can add more training data. Adding data might seem hard too, since labeling 60,000 images probably took a long time, but I actually started a company called Figure Eight that will do this for you, and you should check it out if you need more labeled data.

This model is working reasonably well, but every time you run it you're going to see different scores, so when you train this model you may get a different accuracy number than I'm getting. We'll get into how to fix that, and how to build much more complicated models, in the next video.
What have we covered today? We built our first neural network and, more importantly, we debugged our first neural network; get used to that if you really want to build neural networks for your job. We also talked about Keras, which is the most important neural network framework, not just for beginners but for people who build neural networks professionally, and we talked about loss functions, weights, and a little bit about backpropagation.

In the next section, we're going to take this neural network and build what's known as a multilayer perceptron, which is a more complicated neural network that can do more things. Later on, we're going to build a convolutional neural network, which is maybe the really fancy kind that you hear about in a lot of papers and in the news all the time. We're going to keep creating these videos, so you should probably subscribe so that you're the first to know when a new video comes out.