Comparing the efficiency of SRSWOR and SRSWR with the help of R Programming
~Ronit Bhowmick
Christ University, Bengaluru.
Which sampling technique is a better population estimator: SRSWOR or SRSWR?
I have tried to work through the answer step by step, so that even someone with no background in statistics can get a basic but broad picture of the topic.
Here is how I plan to divide the topic: the definition of simple random sampling; its two branches, with a numerical example; the definitions of sampling without replacement and sampling with replacement; the difference between SRSWR and SRSWOR; notation and basic formulas; and finally the R analysis that settles the comparison.
The definition:
Simple random sampling (SRS) is a sampling technique in which every item in the population has an equal chance of being selected for the sample. It is a fundamental sampling method and can easily be a component of a more complex sampling method.
Simple random sampling is a sampling method in which all of the elements in the population and, consequently, all of the units in the sampling frame have the same probability of being selected for the sample. It would be along the lines of having a fair raffle among every individual in the population: we give everyone raffle tickets with unique sequential numbers, put them all in a basket, and draw numbers from the basket at random. The individuals whose numbers are selected become our sample.
From a more mathematical point of view, SRS is a special case of random sampling. If each unit of the population has a known (equal or unequal) probability of selection, the sample is called a random sample; if each unit has an equal probability of selection, the sample is called a simple random sample. At every draw, the units available for selection are equally likely to be chosen. When the first unit is selected, all N units of the population have an equal chance of selection, which is 1/N. If the selected unit is not returned to the population, each of the remaining (N–1) units has a 1/(N–1) chance of selection at the second draw; if it is returned, every unit again has a 1/N chance.
The two branches of Simple Random Sampling:
Depending on whether or not the individuals in the population can be selected for the sample more than once, we distinguish between SRS with replacement and SRS without replacement. If we sample with replacement, the fact that an individual was randomly selected for our sample does not prevent that same individual from being chosen again in the next selection. This would be the equivalent of putting the raffle ticket back in the basket after every draw. If, on the other hand, we choose to sample without replacement, an individual selected for the sample is not eligible for the next drawing in the raffle.
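The raffle picture maps directly onto R's `sample()` function and its `replace` argument. The following is only a minimal sketch: the basket of 10 tickets and the sample size of 5 are made-up numbers for illustration.

```r
# Hypothetical raffle: tickets numbered 1 to 10 (sizes chosen only for illustration).
set.seed(1)  # fixed seed so the draws are repeatable
tickets <- 1:10

# With replacement: each ticket goes back in the basket, so repeats can occur.
with_repl <- sample(tickets, size = 5, replace = TRUE)

# Without replacement: a drawn ticket leaves the raffle, so repeats are impossible.
without_repl <- sample(tickets, size = 5, replace = FALSE)

print(with_repl)
print(without_repl)
print(anyDuplicated(without_repl))  # 0 means no ticket appears twice
```

Asking for more than 10 tickets with `replace = FALSE` would raise an error, which is the code-level version of running out of tickets in the basket.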
To replace, or not to replace? That is the question. It’s a simple math problem. In terms of both estimation precision and minimum sample size required to obtain a given level of precision, we can firmly conclude that simple random sampling without replacement is more efficient.
Suppose we randomly draw a blue ball from a bag that contains N red and blue balls. The question that arises in sampling is what to do with a ball once it has been selected. There are two options: we can put it back into the bag we are sampling from, or we can leave it out.
These options lead to two different situations. With replacement, the same blue ball can be randomly chosen a second time. Without replacement, it is impossible to pick the same blue ball twice. This difference affects the calculation of probabilities for the resulting samples.
Effect on Probabilities: To see how the handling of replacement affects the calculation of probabilities, consider the following example question. What is the probability of drawing two aces from a standard deck of cards? This question is ambiguous. What happens once we draw the first card? Do we put it back into the deck, or do we leave it out?
We start by calculating the probability with replacement. There are four aces and 52 cards in total, so the probability of drawing one ace is 4/52. If we replace this card and draw again, the probability is again 4/52. These events are independent, so we multiply the probabilities: (4/52) x (4/52) = 1/169, or approximately 0.592%.
Now we compare this with the same situation, except that we do not replace the cards. The probability of drawing an ace on the first draw is still 4/52. For the second card, we assume that an ace has already been drawn, so we must now calculate a conditional probability. In other words, we need the probability of drawing a second ace, given that the first card was an ace.
There are now three aces remaining out of a total of 51 cards. So, the conditional probability of a second ace after drawing an ace is 3/51. The probability of drawing two aces without replacement is (4/52) x (3/51) = 1/221, or about 0.452%.
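The two ace calculations are easy to reproduce in R. This sketch multiplies the probabilities by hand and then cross-checks them against the binomial distribution (`dbinom`, draws with replacement) and the hypergeometric distribution (`dhyper`, draws without replacement); the variable names are my own.

```r
aces  <- 4
cards <- 52

# With replacement: two independent draws, each with probability 4/52.
p_with <- (aces / cards) * (aces / cards)

# Without replacement: the second draw is conditional on the first ace being gone.
p_without <- (aces / cards) * ((aces - 1) / (cards - 1))

# Cross-checks with the built-in distributions.
p_with_binom    <- dbinom(2, size = 2, prob = aces / cards)
p_without_hyper <- dhyper(2, m = aces, n = cards - aces, k = 2)

print(c(with = p_with, without = p_without))      # 1/169 and 1/221
print(c(binomial = p_with_binom, hypergeometric = p_without_hyper))
```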
We see directly from the problem above that what we choose to do about replacement has a bearing on the values of the probabilities. It can change these values noticeably.
Population Sizes: There are some situations where sampling with or without replacement does not substantially change any probabilities. Suppose that we are randomly choosing two people from a city with a population of 50,000, of whom 30,000 are female.
If we sample with replacement, then the probability of choosing a female on the first selection is 30000/50000 = 60%. The probability of a female on the second selection is still 60%. The probability of both people being female is 0.6 x 0.6 = 0.36.
If we sample without replacement, then the first probability is unaffected. The second probability is now 29999/49999 = 0.5999919998..., which is extremely close to 60%. The probability that both are female is 0.6 x 0.5999919998 = 0.359995.
The probabilities are technically different; however, they are close enough to be nearly indistinguishable. For this reason, even though we often sample without replacement, we treat the selection of each individual as if it were independent of the other individuals in the sample.
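The same arithmetic in R makes it easy to see how small the gap is for a population of this size. A quick sketch using the figures from the example (variable names are mine):

```r
N       <- 50000   # city population
females <- 30000   # number of females in the city

# Both selected people are female, sampling with replacement.
p_with <- (females / N)^2

# Both selected people are female, sampling without replacement.
p_without <- (females / N) * ((females - 1) / (N - 1))

print(c(with = p_with, without = p_without))
print(p_with - p_without)  # the gap is only a few millionths
```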
Other Applications:
There are other instances where we need to consider whether to sample with or without replacement. An example of this is bootstrapping, a statistical technique that falls under the heading of resampling techniques.
In bootstrapping we start with a statistical sample of a population. We then use computer software to compute bootstrap samples. In other words, the computer resamples with replacement from the initial sample.
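As a rough illustration of that resampling step, here is a minimal bootstrap sketch in R. The eight data values and the choice of 2000 resamples are invented for the example; the statistic being bootstrapped is the sample mean.

```r
set.seed(42)  # fixed seed so the resamples are repeatable

# Hypothetical initial sample (values made up for illustration).
x <- c(12, 15, 14, 18, 13, 17, 16, 14)
B <- 2000  # number of bootstrap resamples (an arbitrary choice)

# Each bootstrap sample is drawn WITH replacement from x and has the same size as x.
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

print(mean(x))         # mean of the original sample
print(sd(boot_means))  # bootstrap estimate of the standard error of the mean
```

Resampling without replacement at the full sample size would simply return a reshuffled copy of `x` every time, which is why the bootstrap needs replacement.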
Sampling without replacement (SRSWOR):
Consider a population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number, so the whole population has seven sacks. If I sample two without replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I pick another. At this point, there are only six possibilities: 12, 13, 15, 16, 17, and 18, each with a 1/6 probability of being chosen. So there are only 42 different possibilities here (assuming that we distinguish between the first and the second). They are: (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,14), (13,15), etc.
Sampling with replacement (SRSWR):
Consider the same population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number, so the whole population has seven sacks. If I sample two with replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I replace it. Then I pick another. Every one of them still has a 1/7 probability of being chosen. And there are exactly 49 different possibilities here (assuming we distinguish between the first and second). They are: (12,12), (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,13), (13,14), etc.
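The counts of 49 and 42 ordered pairs can be confirmed by listing every possibility. A small sketch with `expand.grid()`; the data frame and column names are my own.

```r
sacks <- 12:18  # the seven sacks, one for each potato count

# Every ordered (first, second) pair: the possibilities WITH replacement.
with_pairs <- expand.grid(first = sacks, second = sacks)

# Drop pairs that reuse the same sack: the possibilities WITHOUT replacement.
without_pairs <- with_pairs[with_pairs$first != with_pairs$second, ]

print(nrow(with_pairs))     # 49
print(nrow(without_pairs))  # 42
```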
What's the Difference?
When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first one doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero.
In sampling without replacement, the two sample values aren't independent. Practically, this means that what we got for the first one affects what we can get for the second one. Mathematically, this means that the covariance between the two isn't zero. That complicates the computations. In particular, if we have an SRS (simple random sample) without replacement from a population with variance $\sigma^2$, then the covariance of two different sample values is $-\sigma^2/(N-1)$, where N is the population size.
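The covariance result, $-\sigma^2/(N-1)$ without replacement and zero with replacement, can be checked by brute force on the seven potato sacks. This sketch treats every ordered pair as equally likely and takes $\sigma^2$ to be the population variance with divisor N; the variable names are my own.

```r
sacks  <- 12:18
N      <- length(sacks)
mu     <- mean(sacks)
sigma2 <- mean((sacks - mu)^2)  # population variance with divisor N (equals 4 here)

pairs_wr  <- expand.grid(first = sacks, second = sacks)   # 49 equally likely pairs
pairs_wor <- pairs_wr[pairs_wr$first != pairs_wr$second, ]  # 42 equally likely pairs

# Covariance = average product of deviations from the population mean.
cov_wr  <- mean((pairs_wr$first  - mu) * (pairs_wr$second  - mu))
cov_wor <- mean((pairs_wor$first - mu) * (pairs_wor$second - mu))

print(c(with = cov_wr, without = cov_wor, formula = -sigma2 / (N - 1)))
```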
Population size
When we sample without replacement and get a non-zero covariance, the covariance depends on the population size. If the population is very large, this covariance is very close to zero. In that case, sampling with replacement isn't much different from sampling without replacement. In some discussions, people describe this difference as sampling from an infinite population (sampling with replacement) versus sampling from a finite population (without replacement).
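To get a feel for how quickly the dependence fades, divide the without-replacement covariance $-\sigma^2/(N-1)$ by $\sigma^2$: the correlation between two draws is $-1/(N-1)$, whatever the population values are. The population sizes below are only illustrative.

```r
# Correlation between two draws without replacement: -1 / (N - 1).
pop_sizes <- c(7, 32, 1000, 50000)  # illustrative population sizes
rho <- -1 / (pop_sizes - 1)

print(data.frame(N = pop_sizes, correlation = rho))
```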
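Basic Notations: Assume that we have a population of size N. The values of the population are numbers $x_1, x_2, \dots, x_N$. When we take a sample, it is a simple random sample (SRS) of size n, where $n \le N$; the sample values are written $X_1, X_2, \dots, X_n$.
Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$, where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the population mean.
Unbiased estimator of the population mean (sample mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$
If the individual values of the population are "successes" or "failures", we code those as 1 or 0, respectively. Then the parameter of interest is usually called the population proportion, $p$, even though, strictly speaking, it is also the population mean.
Population standard deviation: $\sigma = \sqrt{p(1-p)}$
Unbiased estimator of the population proportion (sample proportion): $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$, the fraction of successes in the sample.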
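If we assume the simple random sampling is with replacement, then the sample values are independent, so the covariance between any two different sample values is zero. This fact is used to derive these formulas for the standard deviation of the estimator and the estimated standard deviation of the estimator. The first two columns are the parameter and the statistic which is the unbiased estimator of that parameter. Here $\sigma$ is the population standard deviation and $s$ is the sample standard deviation computed from the n sample values $X_i$.
| parameter | estimator | standard deviation of the estimator | usual estimator of the standard deviation of the estimator |
|---|---|---|---|
| $\mu$ | $\bar{x}$ | $\frac{\sigma}{\sqrt{n}}$ | $\frac{s}{\sqrt{n}}$, where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{x})^2}$ |
| $p$ | $\hat{p}$ | $\sqrt{\frac{p(1-p)}{n}}$ | $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ |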
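If we assume the simple random sampling is without replacement, then the sample values are not independent, so the covariance between any two different sample values is not zero. In fact, one can show that
Covariance between two different sample values: $\mathrm{Cov}(X_i, X_j) = -\frac{\sigma^2}{N-1}$ for $i \ne j$, where $X_i$ and $X_j$ are the i-th and j-th sample values and $\sigma^2$ is the population variance.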
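This fact is used to derive these formulas for the standard deviation of the estimator and the estimated standard deviation of the estimator. The first two columns are the parameter and the statistic which is the unbiased estimator of that parameter.
| parameter | estimator | standard deviation of the estimator | estimator of the standard deviation of the estimator |
|---|---|---|---|
| $\mu$ | $\bar{x}$ | $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ | $\frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N}}$, where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{x})^2}$ |
| $p$ | $\hat{p}$ | $\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}}$ | $\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\sqrt{\frac{N-n}{N}}$ |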
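For readers who want to see where the without-replacement formula for the mean comes from, here is a short derivation sketch. It assumes only that each sample value $X_i$ has variance $\sigma^2$ and that any two different sample values have covariance $-\sigma^2/(N-1)$; the $X_i$ notation for sample values is my own shorthand.

```latex
\begin{aligned}
\operatorname{Var}(\bar{x})
  &= \frac{1}{n^2}\left[\sum_{i=1}^{n}\operatorname{Var}(X_i)
     + \sum_{i \ne j}\operatorname{Cov}(X_i, X_j)\right] \\
  &= \frac{1}{n^2}\left[n\sigma^2 - n(n-1)\,\frac{\sigma^2}{N-1}\right] \\
  &= \frac{\sigma^2}{n}\left[1 - \frac{n-1}{N-1}\right]
   = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.
\end{aligned}
```

Taking the square root gives the standard deviation of the sample mean without replacement. Setting the covariance term to zero instead gives $\sigma^2/n$, the with-replacement result, so the factor $(N-n)/(N-1)$ is exactly what sampling without replacement gains whenever $n > 1$.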
The R Codes for performing SRSWOR and SRSWR:
The aim here is to analyse R's built-in dataset mtcars with respect to one of its variables, mpg, which stands for miles per (US) gallon. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Objectives:
1. To show that the sample mean is an unbiased estimator of the population mean.
2. To verify that the variance of the estimate under SRSWOR is less than under SRSWR.
Analysis using R Programming:
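The original listing is not reproduced here, so what follows is only a sketch of one way to carry out the two objectives, not necessarily the code that was run. It treats the 32 mpg values as the whole population. The sample size n = 15 is an assumption: it is not stated in the text, but it is the value for which the standard formulas give the two variances quoted in the interpretation. The seed and the number of simulated samples are arbitrary choices.

```r
# Population: the 32 mpg values in the built-in mtcars dataset.
y <- mtcars$mpg
N <- length(y)  # population size (32)
n <- 15         # ASSUMED sample size, not stated in the text

pop_mean <- mean(y)
S2       <- var(y)            # population mean square (divisor N - 1)
sigma2   <- S2 * (N - 1) / N  # population variance (divisor N)

# Theoretical variance of the sample mean under each scheme.
var_wr  <- sigma2 / n                # SRSWR
var_wor <- (N - n) / (N * n) * S2    # SRSWOR

# Simulation check: draw many samples under each scheme (seed and count are arbitrary).
set.seed(2024)
reps <- 20000
means_wr  <- replicate(reps, mean(sample(y, n, replace = TRUE)))
means_wor <- replicate(reps, mean(sample(y, n, replace = FALSE)))

# Objective 1: the averages of the sample means should sit close to the population mean.
print(round(c(population = pop_mean, srswr = mean(means_wr), srswor = mean(means_wor)), 3))

# Objective 2: variance of the sample mean, theory versus simulation.
print(round(c(theory_srswr = var_wr, theory_srswor = var_wor,
              sim_srswr = var(means_wr), sim_srswor = var(means_wor)), 4))
```

The simulated variances will differ slightly from the theoretical ones from run to run; the theoretical values do not depend on the seed.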
Interpretation:
We can see that SRSWOR (variance of the sample mean 1.286479) is more efficient than SRSWR (2.345932), as the variance under SRSWR is higher.
This result illustrates that SRSWOR is better at estimating the population parameter and gives more precise results than SRSWR. Here I have used the standard mtcars dataset, which is readily available to anyone who has R installed. But SRSWOR is more efficient (more precise) for any finite population of size N, as long as more than one unit is sampled.
References:
Simple Random Sampling: Definition and Examples (questionpro.com)
Random sampling: simple random sampling (netquest.com)
Sampling With or Without Replacement (thoughtco.com)