## Bootstrapping and Resampling in Statistics with Example | Statistics Tutorial #12 | MarinStatsLectures

So we're going to talk a little bit about what a bootstrap approach is, and why we might want to use one. When doing statistical inference, we generally rely on the sampling distribution as well as a standard error. We're going to do all of this in the context of one numeric variable and estimating a mean. First,

let's talk about what we've learned so far: the parametric approach, sometimes called the large-sample approach. We have the entire population in front of us; we draw a sample of n observations and use that to calculate our estimate, here the sample mean. Now, in reality we've only observed this one sample and gotten this one estimate, but we have this idea of a sampling distribution: the theoretical set of all the possible estimates we could get. So imagine taking a sample of size n from the population again and getting another estimate, then taking another sample of size n and getting another estimate.

We do this over and over again (or rather, we only imagine doing it), and large-sample theory tells us that the distribution of all these possible estimates, which we call the sampling distribution, is going to be approximately normal. In other words, the histogram, or distribution, of all these imaginary different estimates we could have gotten will, according to mathematical theory, be approximately normally distributed under a few conditions, the main one being a large sample size. And mathematical theory also tells us that the standard deviation of all these estimates, over this imaginary set of all the estimates we could get, happens to equal the standard deviation of the individuals divided by the square root of the sample size; or, in the case of a sample of data, the sample standard deviation divided by the square root of the sample size.

So all of these results come from mathematical theory: for this imaginary set of all the possible estimates we could get, theory tells us they'll be approximately normally distributed under a few conditions, the main one being a large sample size; and the standard deviation of all these estimates (on average, how far the estimates vary from the true mean) is the sample standard deviation divided by the square root of the sample size.

We're going to get to talking about the bootstrap approach. In a moment we'll get to exactly what it is, but first let's build up why we might use it. There are two main reasons for considering a bootstrap approach rather than the large-sample theory approach. The first: what if we don't have a large sample? If the sample size isn't large and we can't assume that the sampling distribution is approximately normal, then what do we do? The second reason, and maybe the more useful one, is that sometimes getting the standard deviation of the estimate, what we're going to call the standard error, might be difficult.

In this case we're dealing with an estimate that's just a mean. A mean is a pretty simple estimate, and the theory has been built for us, so we know that the standard deviation of the mean, the standard error, is approximately the sample standard deviation divided by the square root of the sample size. But what if we were looking at a different estimate? Suppose we were interested in estimating the range from the 80th to the 90th percentile; the distance between the 80th and 90th percentiles is our estimate. We can collect some data, calculate the 90th and 80th percentiles, and find their range. But we know that's just an estimate: a different set of data would give a slightly different estimate, and calculating the standard error for the 90th minus 80th percentile might not be so straightforward. In other cases, the estimate we're looking at might be some composite measure made up of multiple items: we may take multiple measurements and combine them to come up with a composite measure. Again, that's just an estimate (different data would give a slightly different estimate), and working out the standard error for this composite measure might be quite difficult, or impossible, mathematically. That's where we can try a bootstrapping approach. So let's start to build up what that is, exactly.
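To preview how resampling sidesteps that difficulty, here is a sketch in Python of a bootstrap standard error for the 90th-minus-80th-percentile range. The dataset, its distribution, and the choice of B are all made up for illustration; the percentile cut points come from the standard library's `statistics.quantiles`:

```python
import random
import statistics

random.seed(1)

# Hypothetical observed sample (any reasonably sized numeric dataset would do)
data = [random.gauss(100, 15) for _ in range(200)]

def pct_range(x):
    """Distance between the 90th and 80th percentiles of x."""
    deciles = statistics.quantiles(x, n=10)  # cut points at 10%, 20%, ..., 90%
    return deciles[8] - deciles[7]           # 90th minus 80th percentile

# Bootstrap: resample with replacement B times, recomputing the statistic each time
B = 10_000
boot_estimates = [
    pct_range(random.choices(data, k=len(data)))  # resample with replacement
    for _ in range(B)
]

# The standard deviation of the bootstrap estimates is the bootstrap standard error
boot_se = statistics.stdev(boot_estimates)
print(round(pct_range(data), 2), round(boot_se, 2))
```

Notice that nothing here depends on having a formula for the standard error of this statistic; the same loop works for any estimate you can compute from a resample.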

First I'm going to do it in a picture, then I'll try to explain it through words, and then we'll do it looking at this example here. We have the same idea: here's the population, all the individuals in the population, and we're going to take a sample out of it. We reach into this population, pull out n individuals, and get a sample (I'll draw that in here), and that ends up giving us a sample mean, the same as above. But rather than using theory to tell us the theoretical distribution of all these possible estimates, what we're going to do is create that distribution through resampling. If we think that our sample is representative of the population, which it should be (and note that assuming the sample represents the population is built in whether we're taking a parametric approach or a bootstrap approach), then what we can do is reach into our sample and take what we're going to call a resample. I'll expand on exactly what we mean by that in a little bit, but we reach into our sample and try to generate a new sample from it, and we're going to do it with replacement.

Doing that is going to give us an estimate, which I'm going to label with a star to indicate it's a bootstrap estimate. Let's do that again: we reach into our sample, generate a new resample, again with replacement, and get another estimate. And we're going to repeat this. Remember, in the theoretical approach this was imagined an infinite number of times, giving all possible estimates; here we're going to repeat it B times (we'll talk a little bit more about how many times we should do that). So we resample again, with replacement, and this is going to give us our Bth estimate. Rather than relying on theory to tell us what the set of all possible estimates looks like, we're going to try to generate that

through resampling. By taking resamples of our observed data, we try to mimic this idea of getting new sample estimates. If we look at the distribution of all these estimates, that gives us what we call the bootstrap sampling distribution: again, rather than theory telling us what these would look like under certain conditions, we try to generate all possible samples using this resampling approach, and the distribution of all of those gives us the bootstrap sampling distribution. The standard deviation of all the bootstrap estimates we've got is going to be what we call our bootstrap standard error of the mean. And I should label that here: one is the standard error of the mean obtained through theory, the other through a resampling approach. Now let's talk a

little bit about how this bootstrapping, or resampling, is actually done. What we're going to do is reach into our sample, and we'll use this little toy example in a moment to talk our way through it. You'll see I've made it small, just five observations, with numbers that are easy to work with. We reach into our sample and resample, with replacement, a sample of the same size. In this example I have five observations, so what I'm going to do is randomly select an observation, put it back in the pool, then randomly select another, put it back in the pool, and do that until I get five observations. Then, for those five, I'm going to calculate the mean, calculate my estimate. I'm going to repeat that approach B

times. Now, the number of times you repeat this approach is up to you; different guidelines exist. In the past, at least a thousand used to be roughly the minimum suggestion; it's gone up a bit, and I would say at least ten thousand or more. Really, it doesn't matter: you can do this as many times as you want, and the only real limitation is time or computing power. An important note is that increasing B, the number of resamples you take, can't increase the amount of information in your data. Taking this example, we only had five observations; if we were to do the resampling one billion times, that's not going to be more useful than only doing it 10,000 times. We have five observations, and that's the amount of information we have. Increasing B just gets you a slightly better estimate of the sampling distribution: hopefully you can imagine that if we took a million resamples instead of 10,000, we'd get closer to all possible estimates and a slightly more reliable estimate of the standard error. It doesn't increase the information; it just gives us a more stable estimate. So let's work our way through

this toy example and see how bootstrapping works. First I'm going to take resample number one; I'll abbreviate it RS #1. What I'm going to do is reach in and randomly select an observation: suppose we end up with 75. Then we put it back in the pool of observations and randomly select another: suppose we end up with 90. Put it back in the pool, select another, and end up with 80; put it back in the pool, end up with 90 again; put it back in the pool, and end up with 85. So our first bootstrap estimate came out to a mean of 84.

I guess it's worth mentioning explicitly here that when we do this resampling, or bootstrap, approach, we can get the same observation multiple times, and we can also have certain observations not appear at all in a given resample. Hopefully that makes sense: if every resample were just our exact data, every time, that wouldn't be anything useful.

So let's imagine doing that again: resample number 2. We reach into this population and end up with 85 (sorry, not the population: this sample, our sample of data). Put it back in, reach in, and end up with 60; randomly select another, 75; then 85; and then we get 60 again. We find this second estimate comes out to be 73. Repeat this over and over, and let's just go up to the last one, resample number B. In our last one we end up with 90, then 80, then 85, 85 again, and 60, and these give us a sample mean of 80.

Now, if we were to go through and make a histogram of all these estimates, that's going to be our bootstrap sampling distribution. If we were to calculate the standard deviation of all these estimates, that's going to give us our estimate of the bootstrap standard error. You'll notice, if you go through and work through just the ones shown here, that the standard deviation of these bootstrap estimates comes out to be 5.57, which is reasonably close to the theoretical standard error; and I only did it for these three estimates. You can imagine that if we had 10,000 of these, the standard deviation of all the estimates would come out amazingly close to the theoretical value. So, while it might not be intuitive, the results we get from a bootstrapping approach are nearly identical to what we get through large-sample theory.
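Those numbers can be checked directly. The Python sketch below first reproduces the 5.57 from the three resample means worked out above, then runs a full bootstrap with B = 10,000 and compares it against the theoretical s / sqrt(n). It assumes the five toy observations are 60, 75, 80, 85, 90, the values that appear in the resamples:

```python
import random
import statistics

# The three bootstrap estimates worked out above (resamples 1, 2, and B)
shown = [84, 73, 80]
print(round(statistics.stdev(shown), 2))  # -> 5.57

# Full bootstrap on the toy sample (assumed to be these five values)
random.seed(42)
sample = [60, 75, 80, 85, 90]
n = len(sample)

B = 10_000
boot_means = []
for _ in range(B):
    # Resample n observations WITH replacement: each draw goes back in the pool
    resample = random.choices(sample, k=n)
    boot_means.append(statistics.mean(resample))

boot_se = statistics.stdev(boot_means)           # bootstrap standard error
theory_se = statistics.stdev(sample) / n ** 0.5  # large-sample s / sqrt(n)
print(round(boot_se, 2), round(theory_se, 2))
```

With only n = 5 the bootstrap standard error tends to run a little below s / sqrt(n), since resampling draws from the observed data treated as the whole population (a divide-by-n rather than n - 1 spread); the two agree closely for larger samples.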

But the advantage of the bootstrap is that it works even when those assumptions aren't met: where the large-sample approach can't be used, the bootstrapping approach will still work. We're going to have some separate videos that run through examples of these: some that generate a bootstrap sampling distribution, calculate the standard error, and compare them to the theoretical results to see how similar they come out to be, and some looking at constructing confidence intervals or testing hypotheses using a bootstrapping approach. One final thing I want to leave you with, because a lot of people are going to have this question (I know when I first encountered this material, I had it): you might be thinking, doesn't this bootstrapping approach depend too much on the observed data?

For example, you might be thinking: what if we got a really extreme value here, some really extreme value instead of the 90? Isn't that going to affect the bootstrapping approach, with that observation showing up in a lot of resamples and skewing things? Well, that is true, but think of the large-sample approach we've seen, in the context of a confidence interval: we go from our estimate plus or minus t standard errors. This approach also depends very much on our observed data. What's going to happen with that outlier? It's going to skew the mean, and it's going to inflate the standard deviation. So while it might be tempting to think that some extreme outlying value will show up quite often in resamples and skew things, the bootstrapping approach relies only as much on your observed data as these large-sample approaches do.

Bootstrapping is an amazingly powerful tool. Part of the reason it's been a bit slower to catch on is that, in the academic world, it's a fairly recent development (I think it came in sometime in the 1980s), and it really depends on computing power: before we had large computing power, we weren't able to take our sample of data and resample it over and over again; it was too time-consuming. But it's an amazingly powerful tool and well worth exploring. Subscribe to our channel and stick around: there's lots more.