Comparing the efficiency of SRSWOR and SRSWR with the help of R Programming

~Ronit Bhowmick

Christ University, Bengaluru.

Which sampling technique is a better population estimator, SRSWOR, or SRSWR?

I have tried to lay out the answer step by step, so that even a reader with no background in statistics can get a basic but broad picture of the topic.

Here is how the topic is divided: the definition of simple random sampling; its two branches, with a numerical example; the definitions of sampling without replacement and sampling with replacement; the difference between SRSWR and SRSWOR; notation and basic formulas; and finally the R code that demonstrates the difference.

The definition:

Simple random sampling (SRS) is defined as a sampling technique in which every item in the population has an equal chance of being selected for the sample. Simple random sampling is a fundamental sampling method and can easily be a component of a more complex sampling method.

Simple random sampling is a sampling method in which all of the elements in the population and, consequently, all of the units in the sampling frame have the same probability of being selected for the sample. It would be along the lines of having a fair raffle among every individual in the population: we give everyone raffle tickets with unique sequential numbers, put them all in a basket, and draw numbers from the basket at random. The individuals whose numbers are selected become our sample.

From a more mathematical point of view, we may define SRS as a special case of random sampling. If each unit of the population has a known (equal or unequal) probability of selection in the sample, the sample is called a random sample. If each unit of the population has an equal probability of being selected for the sample, the sample obtained is called a simple random sample. At every draw, the units available for selection are equally likely to be chosen. When the first unit is selected, all N units of the population have an equal chance of selection, which is 1/N. If the selected unit is not put back, then at the second draw each of the remaining (N–1) units has a 1/(N–1) chance of selection; if it is put back, every unit again has a 1/N chance.

The two branches of Simple Random Sampling:

Depending on whether or not the individuals in the population can be selected for the sample more than once, we distinguish between SRS with replacement and SRS without replacement. If we sample with replacement, the fact that an individual was randomly selected for our sample does not prevent that same individual from being chosen again in the next selection. This would be the equivalent of putting the raffle ticket back in the basket after every draw. If, on the other hand, we choose to sample without replacement, an individual selected for the sample is not eligible for the next drawing in the raffle.
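In R, both branches can be tried with the base function sample(), whose replace argument switches between them. The following is a minimal sketch; the ten raffle tickets, the sample size of 5, and the seed are illustrative assumptions and not part of the analysis later in this article.

```r
# Hypothetical raffle: tickets numbered 1 to 10 (illustrative population)
tickets <- 1:10

set.seed(1)  # arbitrary seed, only so the draws can be repeated

# SRSWOR: a drawn ticket is not returned to the basket
draw_wor <- sample(tickets, size = 5, replace = FALSE)

# SRSWR: each ticket goes back before the next draw, so repeats are possible
draw_wr <- sample(tickets, size = 5, replace = TRUE)

print(draw_wor)
print(draw_wr)

# Without replacement there can never be a repeated ticket
anyDuplicated(draw_wor) == 0
```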

To replace, or not to replace? That is the question, and it comes down to simple math. In terms of both the precision of the estimate and the minimum sample size required to obtain a given level of precision, we can firmly conclude that simple random sampling without replacement is more efficient.

Suppose we randomly select a blue ball from a bag that contains N red and blue balls. The question that arises in sampling is what to do with the ball once it has been selected. There are two options: we can put it back into the bag that we are sampling from, or we can leave it out.

We can very easily see that these lead to two different situations. In the first option, replacement leaves open the possibility that the blue ball is randomly chosen a second time. For the second option, if we are working without replacement, then it is impossible to pick the same blue ball twice. We will see that this difference will affect the calculation of probabilities related to these samples.

Effect on Probabilities: To see how the way we handle replacement affects the calculation of probabilities, consider the following example question. What is the probability of drawing two aces from a standard deck of cards? This question is ambiguous. What happens once we draw the first card? Do we put it back into the deck, or do we leave it out?

We start with calculating the probability with replacement. There are four aces and 52 cards total, so the probability of drawing one ace is 4/52. If we replace this card and draw again, then the probability is again 4/52. These events are independent, so we multiply the probabilities (4/52) x (4/52) = 1/169, or approximately 0.592%.

Now we will compare this to the same situation, with the exception that we do not replace the cards. The probability of drawing an ace on the first draw is still 4/52. For the second card, we assume that an ace has already been drawn. We must now calculate a conditional probability. In other words, we need to know the probability of drawing a second ace, given that the first card was an ace.

There are now three aces remaining out of a total of 51 cards. So, the conditional probability of a second ace after drawing an ace is 3/51. The probability of drawing two aces without replacement is (4/52) x (3/51) = 1/221, or about 0.452%.
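The two calculations can be reproduced in a few lines of R. This is only a sketch of the arithmetic above; the variable names are my own.

```r
# Probability of drawing two aces from a standard 52-card deck
aces <- 4
cards <- 52

# With replacement: the two draws are independent
p_with <- (aces / cards) * (aces / cards)

# Without replacement: the second draw is conditional on the first ace,
# so one ace and one card are gone
p_without <- (aces / cards) * ((aces - 1) / (cards - 1))

# Show both probabilities as percentages
round(100 * c(with = p_with, without = p_without), 3)
```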

We see directly from the problem above that what we choose to do with replacement has bearing on the values of probabilities. It can significantly change these values.

Population Sizes: There are some situations where sampling with or without replacement does not substantially change any probabilities. Suppose that we are randomly choosing two people from a city with a population of 50,000, of whom 30,000 are female.

If we sample with replacement, then the probability of choosing a female on the first selection is given by 30000/50000 = 60%. The probability of a female on the second selection is still 60%. The probability of both people being female is 0.6 x 0.6 = 0.36.

If we sample without replacement then the first probability is unaffected. The second probability is now 29999/49999 = 0.5999919998..., which is extremely close to 60%. The probability that both are female is 0.6 x 0.5999919998 = 0.359995.

The probabilities are technically different; however, they are close enough to be nearly indistinguishable. For this reason, many times even though we sample without replacement, we treat the selection of each individual as if they are independent of the other individuals in the sample.
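The same comparison for the city example can be sketched in R to show how small the gap becomes when the population is large. The variable names are illustrative.

```r
# Probability that both of two sampled people are female
N <- 50000        # city population
females <- 30000  # number of women in the city

# With replacement: the second draw has the same probability as the first
p_with <- (females / N)^2

# Without replacement: one woman and one person fewer at the second draw
p_without <- (females / N) * ((females - 1) / (N - 1))

# The gap is tiny compared with the probabilities themselves
c(with = p_with, without = p_without, gap = p_with - p_without)
```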

Other Applications:

There are other instances where we need to consider whether to sample with or without replacement. An example of this is bootstrapping. This statistical technique falls under the heading of a resampling technique.

In bootstrapping we start with a statistical sample of a population. We then use computer software to compute bootstrap samples. In other words, the computer resamples with replacement from the initial sample.
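A minimal sketch of this idea in R is given below. The eight sample values, the number of resamples, and the seed are all made up for illustration; the only point is that each bootstrap sample is drawn with replacement from the initial sample.

```r
# A small made-up sample (illustrative values only)
x <- c(12, 15, 14, 18, 13, 17, 16, 14)

set.seed(42)   # arbitrary seed for repeatability
B <- 2000      # number of bootstrap resamples (an assumed choice)

# Each bootstrap sample has the same size as x and is drawn with replacement
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Bootstrap estimate of the standard error of the sample mean
sd(boot_means)
```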

Sampling without replacement (SRSWOR):

Consider a population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number, so the whole population has seven sacks. If I sample two without replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I pick another. At this point, there are only six possibilities: 12, 13, 15, 16, 17, and 18, each with probability 1/6. So there are only 7 x 6 = 42 different possibilities here (assuming that we distinguish between the first and the second). They are: (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,14), (13,15), etc.
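The count of 42 ordered pairs can be checked by listing them in R. This is a small sketch using expand.grid(); the object names are my own.

```r
# The seven potato sacks, one with each count
sacks <- 12:18

# All ordered (first, second) combinations of two sacks
all_pairs <- expand.grid(first = sacks, second = sacks)

# Without replacement the same sack cannot be picked twice
pairs_wor <- all_pairs[all_pairs$first != all_pairs$second, ]

# Number of possible ordered samples of size two
nrow(pairs_wor)
```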

Sampling with replacement (SRSWR):

Consider the same population of potato sacks: each has either 12, 13, 14, 15, 16, 17, or 18 potatoes, all the values are equally likely, and there is exactly one sack with each number, so the whole population has seven sacks. If I sample two with replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I replace it. Then I pick another. Every one of them still has a 1/7 probability of being chosen. So there are exactly 7 x 7 = 49 different possibilities here (assuming we distinguish between the first and second). They are: (12,12), (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,13), (13,14), etc.
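Listing the 49 ordered pairs in R also gives a first look at unbiasedness: since every pair is equally likely, the average of the 49 sample means should equal the population mean. The sketch below assumes nothing beyond the seven sacks described above.

```r
# The seven potato sacks, one with each count
sacks <- 12:18

# With replacement, every ordered pair is possible, repeats included
pairs_wr <- expand.grid(first = sacks, second = sacks)
nrow(pairs_wr)

# Mean of each possible sample of size two
pair_means <- rowMeans(pairs_wr)

# Average of all sample means compared with the population mean
c(mean_of_sample_means = mean(pair_means), population_mean = mean(sacks))
```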

What's the Difference?

When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first one doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero.

In sampling without replacement, the two sample values aren't independent. Practically, this means that what we got for the first one affects what we can get for the second one. Mathematically, this means that the covariance between the two isn't zero. That complicates the computations. In particular, if we have an SRS (simple random sample) without replacement, from a population with variance $\sigma^2$, then the covariance of two of the different sample values is $-\sigma^2/(N-1)$, where N is the population size.
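The covariance claim can be checked by enumeration on the potato-sack population. The sketch below assumes the population variance is defined with divisor N, and compares the covariance over all equally likely ordered pairs with the value $-\sigma^2/(N-1)$.

```r
# The seven potato sacks, one with each count
sacks <- 12:18
N <- length(sacks)
mu <- mean(sacks)
sigma2 <- mean((sacks - mu)^2)   # population variance with divisor N

all_pairs <- expand.grid(first = sacks, second = sacks)

# Covariance between first and second draw over equally likely pairs
# (both draws have mean mu in either scheme)
pair_cov <- function(p) mean((p$first - mu) * (p$second - mu))

cov_wr  <- pair_cov(all_pairs)                                       # with replacement
cov_wor <- pair_cov(all_pairs[all_pairs$first != all_pairs$second, ]) # without replacement

c(with = cov_wr, without = cov_wor, formula = -sigma2 / (N - 1))
```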

Population size

When we sample without replacement, and get a non-zero covariance, the covariance depends on the population size. If the population is very large, this covariance is very close to zero. In that case, sampling with replacement isn't much different from sampling without replacement. In some discussions, people describe this difference as sampling from an infinite population (sampling with replacement) versus sampling from a finite population (without replacement).

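Basic Notations: Assume that we have a population of size N, and that the values of the population are numbers $x_1, \dots, x_N$. When we take a sample, it is a simple random sample (SRS) of size n, where $n \le N$.

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

Unbiased estimator of the population mean (sample mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where the sum runs over the n sampled values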
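These quantities are easy to compute in R, with one caveat: R's var() and sd() divide by n - 1, so they have to be rescaled when the data are the whole population. The sketch below uses the seven potato sacks as the population; the sample size of 3 and the seed are assumptions made only for illustration.

```r
# The potato-sack population used earlier
population <- 12:18
N <- length(population)

# Population mean
mu <- mean(population)

# Population standard deviation (divisor N); var() uses N - 1, so rescale
sigma <- sqrt(var(population) * (N - 1) / N)

set.seed(7)                         # arbitrary seed
x <- sample(population, size = 3)   # SRSWOR of assumed size n = 3

# Sample mean, the unbiased estimator of mu
x_bar <- mean(x)

c(mu = mu, sigma = sigma, x_bar = x_bar)
```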

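If the individual values of the population are "successes" or "failures", we code those as 1 or 0, respectively. Then the parameter of interest is usually called the population proportion, even though, strictly speaking, it is also the population mean.

Population proportion: $p = \frac{1}{N}\sum_{i=1}^{N} x_i$, the number of successes in the population divided by N

Population standard deviation: $\sigma = \sqrt{p(1-p)}$

Unbiased estimator of the population proportion (sample proportion): $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the number of successes in the sample divided by n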

If we assume the simple random sampling is with replacement, then the sample values are independent, so the covariance between any two different sample values is zero. This fact is used to derive the following formulas for the standard deviation of the estimator and the estimated standard deviation of the estimator. In each row, the first two entries are the parameter and the statistic which is the unbiased estimator of that parameter.

Parameter | Statistic | Standard deviation of the estimator | Usual estimator of the standard deviation of the estimator

$\mu$ | $\bar{x}$ | $\sigma/\sqrt{n}$ | $s/\sqrt{n}$

$p$ | $\hat{p}$ | $\sqrt{p(1-p)/n}$ | $\sqrt{\hat{p}(1-\hat{p})/n}$

where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

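If we assume the simple random sampling is without replacement, then the sample values are not independent, so the covariance between any two different sample values is not zero. In fact, one can show that

Covariance between two different sample values: $\operatorname{Cov}(x_i, x_j) = -\frac{\sigma^2}{N-1}$ for $i \ne j$

This fact is used to derive the following formulas for the standard deviation of the estimator and the estimated standard deviation of the estimator. In each row, the first two entries are the parameter and the statistic which is the unbiased estimator of that parameter.

Parameter | Statistic | Standard deviation of the estimator | Estimator of the standard deviation of the estimator

$\mu$ | $\bar{x}$ | $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ | $\frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N}}$

$p$ | $\hat{p}$ | $\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}}$ | $\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\sqrt{\frac{N-n}{N}}$

where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$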
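As a sketch of how the non-zero covariance leads to the finite-population factor, the variance of the sample mean under SRSWOR can be expanded as below. The notation assumes that $\sigma^2$ is the population variance with divisor $N$, that each sampled value has variance $\sigma^2$, and that $\operatorname{Cov}(x_i, x_j) = -\sigma^2/(N-1)$ for $i \ne j$.

```latex
\begin{aligned}
\operatorname{Var}(\bar{x})
  &= \frac{1}{n^2}\left[\sum_{i=1}^{n}\operatorname{Var}(x_i)
     + \sum_{i \ne j}\operatorname{Cov}(x_i, x_j)\right] \\
  &= \frac{1}{n^2}\left[n\sigma^2 - n(n-1)\frac{\sigma^2}{N-1}\right] \\
  &= \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}
\end{aligned}
```

Setting the covariance to zero instead gives $\sigma^2/n$, which is the with-replacement case. Since $(N-n)/(N-1) < 1$ whenever $n > 1$, the without-replacement variance is the smaller of the two.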

The R Codes for performing SRSWOR and SRSWR:

The aim here is to analyse R's built-in dataset mtcars with respect to one of its variables, mpg, which stands for miles per (US) gallon. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Objectives:

1. To show that the sample mean is an unbiased estimator of the population mean.

2. To verify that the variance of the sample mean under SRSWOR is less than under SRSWR.

Analysis using R Programming:
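The following is a sketch of how the two objectives can be checked in R, treating the 32 mpg values as the whole population. It is a reconstruction under stated assumptions, not necessarily the original script: the sample size n = 15 is assumed because, with it, the theoretical formulas give variances that agree with the figures quoted in the interpretation, and the seed and the number of repeated samples are arbitrary choices.

```r
# Treat the 32 mpg values in mtcars as the whole population
mpg <- mtcars$mpg
N <- length(mpg)
n <- 15                        # assumed sample size (see the note above)
mu <- mean(mpg)
sigma2 <- mean((mpg - mu)^2)   # population variance, divisor N

# Objective 1: average many sample means and compare with mu
set.seed(2024)   # arbitrary seed
reps <- 20000    # assumed number of repeated samples
means_wor <- replicate(reps, mean(sample(mpg, n, replace = FALSE)))
means_wr  <- replicate(reps, mean(sample(mpg, n, replace = TRUE)))
c(population = mu, srswor = mean(means_wor), srswr = mean(means_wr))

# Objective 2: theoretical variance of the sample mean under each scheme
var_wr  <- sigma2 / n
var_wor <- (sigma2 / n) * (N - n) / (N - 1)
c(srswr = var_wr, srswor = var_wor)

# Simulated variances, for comparison with the theoretical ones
c(srswr = var(means_wr), srswor = var(means_wor))
```

The simulated averages should sit close to the population mean for both schemes, and the simulated variances should be close to, though not exactly equal to, the theoretical ones.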

Interpretation:

We can see that SRSWOR (variance 1.286479) is more efficient than SRSWR (variance 2.345932), since the variance of the sample mean is higher under SRSWR. This result illustrates that SRSWOR estimates the population mean more precisely than SRSWR. I have used the standard mtcars dataset because it ships with R and is therefore available to anyone who has R installed. The property itself is not specific to this dataset: SRSWOR is at least as precise as SRSWR for any finite population of size N, and strictly more precise whenever the sample size is greater than 1.
