Stratified Sampling - Neyman Allocation
-Indumathi S 2048123
Introduction:
In real life, data are often not homogeneous, and in such cases stratified sampling is appropriate. There are also several ways to allocate the sample among the strata. This post covers stratified sampling and Neyman allocation, one of the most important allocation methods, and illustrates them with a real-life application and an analysis in R.
Stratified Sampling:
Stratified sampling is the method of dividing a heterogeneous population into
homogeneous groups (strata) with respect to the characteristic under study and
then selecting a sample from each group separately. Doing so can greatly
increase the efficiency of the estimates. Stratified sampling is commonly used
in large-scale surveys such as voter surveys and house price surveys.
A population of N units is stratified into k strata, the ith stratum having Ni
units. The strata are non-overlapping and together comprise the whole
population, so that N1 + N2 + … + Nk = N.
A sample is drawn from each stratum independently, the sample size within the
ith stratum being ni, such that n1 + n2 + … + nk = n. Taking samples in this
way is called stratified sampling. If the sample in each stratum is selected by
simple random sampling, the procedure is called stratified random sampling.
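As a minimal sketch of the procedure just described, the following R code builds a small hypothetical population with three strata, checks that the stratum sizes add up to N, and draws a simple random sample without replacement from each stratum. The data frame `pop`, its columns and the sample sizes are made up for illustration only.

```r
# Hypothetical population with 3 non-overlapping strata (illustration only)
set.seed(1)
pop <- data.frame(
  stratum = rep(c("A", "B", "C"), times = c(50, 30, 20)),
  y       = c(rnorm(50, 10, 1), rnorm(30, 20, 2), rnorm(20, 40, 5))
)

Nh <- table(pop$stratum)          # stratum sizes N1, ..., Nk
stopifnot(sum(Nh) == nrow(pop))   # N1 + N2 + ... + Nk = N

# Assumed sample sizes n1, ..., nk with n = 20
nh <- c(A = 10, B = 6, C = 4)

# Simple random sampling without replacement inside each stratum
strata_list <- split(pop, pop$stratum)
samples <- lapply(names(nh), function(s) {
  d <- strata_list[[s]]
  d[sample(nrow(d), nh[[s]]), ]
})
strat_sample <- do.call(rbind, samples)

table(strat_sample$stratum)       # realised sample size per stratum
```

Each stratum is sampled independently, so the realised sample sizes equal the chosen n1, …, nk exactly.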
Neyman Allocation (Optimal Allocation):
In stratified sampling we must carefully consider how to form the strata, which
sampling procedure to use within each stratum, and how to allocate the sample
size among the strata. A sampling allocation is a rule for deciding how many
units to draw from each stratum.
Neyman allocation is one of the most important sampling allocations. It is the special case of optimal allocation that applies when the sampling cost per unit is approximately the same in every stratum and the total sample size n is fixed; it is also called minimum variance allocation. The allocation is based on both the stratum size Nh and the stratum variation, measured by the stratum standard deviation Sh. The sample size allocated to stratum h is
$$n_h = n \, \frac{N_h S_h}{\sum_{j=1}^{k} N_j S_j}$$
Under Neyman allocation, nh is therefore proportional to Nh*Sh, so larger and more variable strata receive a larger share of the sample. If the standard deviations and the costs are equal in all strata, optimal allocation is the same as proportional allocation.
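The proportionality of nh to Nh*Sh can be sketched in a few lines of base R. The function name `neyman_alloc` and the stratum values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```r
# Neyman allocation: n_h = n * N_h * S_h / sum(N_h * S_h)
neyman_alloc <- function(n, Nh, Sh) {
  n * Nh * Sh / sum(Nh * Sh)
}

# Hypothetical strata (illustration only)
Nh <- c(100, 300, 600)   # stratum sizes
Sh <- c(20, 10, 5)       # stratum standard deviations

alloc_neyman <- neyman_alloc(n = 80, Nh = Nh, Sh = Sh)
alloc_neyman             # N_h * S_h = 2000, 3000, 3000, so n_h = 20, 30, 30

# With equal standard deviations the allocation is proportional to N_h
alloc_equal <- neyman_alloc(n = 80, Nh = Nh, Sh = c(7, 7, 7))
alloc_prop  <- 80 * Nh / sum(Nh)
alloc_equal              # 8, 24, 48, the same as alloc_prop
```

In practice the allocated sizes are rounded to whole numbers, and an allocation larger than the stratum size would have to be capped at Nh.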
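The minimum variance for fixed n is obtained by substituting the Neyman value
of nh into the variance of the stratified sample mean (simple random sampling
within each stratum):
$$V_{min}(\bar{y}_{st}) = \frac{1}{N^2}\left[\frac{1}{n}\left(\sum_{h=1}^{k} N_h S_h\right)^2 - \sum_{h=1}^{k} N_h S_h^2\right]$$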
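The substitution step can be checked numerically. The sketch below, with hypothetical stratum values and helper names (`var_strat`, `var_neyman`), evaluates the general variance of the stratified mean at the Neyman sample sizes and compares it with the closed-form minimum variance for fixed n.

```r
# General variance of the stratified mean for any allocation nh:
# V = sum( W_h^2 * S_h^2 * (1/n_h - 1/N_h) ), with W_h = N_h / N
var_strat <- function(Nh, Sh, nh) {
  Wh <- Nh / sum(Nh)
  sum(Wh^2 * Sh^2 * (1 / nh - 1 / Nh))
}

# Closed-form minimum variance under Neyman allocation with fixed n
var_neyman <- function(n, Nh, Sh) {
  N <- sum(Nh)
  (1 / N^2) * ((1 / n) * sum(Nh * Sh)^2 - sum(Nh * Sh^2))
}

# Hypothetical strata (illustration only)
Nh <- c(100, 300, 600)
Sh <- c(20, 10, 5)
n  <- 80

nh <- n * Nh * Sh / sum(Nh * Sh)        # Neyman sample sizes: 20, 30, 30
v_general <- var_strat(Nh, Sh, nh)      # substitute nh into the general formula
v_closed  <- var_neyman(n, Nh, Sh)      # closed form
c(general = v_general, closed = v_closed)   # both equal 0.715
```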
Application in Real life:
Stratified random sampling is commonly used to estimate the abundance indices
of fish populations in multispecies surveys, and Neyman allocation for
stratified random sampling is often applied in fishery surveys.
Fishery Independent Survey:
Fishery independent surveys are used to collect high-quality biological and
ecological data to support fisheries management. Surveys of this type give
useful measures of population rates, relative abundance, and the sex and size
composition of a wide range of species.
Stratified sampling designs are widely used in fishery surveys. In some cases,
however, the sample size is relatively small because of limits on survey cost
or other factors, so the method of allocating sampling effort among strata
plays a major role. Generally, four sampling designs are used in fishery
independent surveys, and one of them is the stratified sampling design with
Neyman allocation. Neyman allocation is used here to obtain high precision,
that is, the minimum variance of the estimate.
Data Description and Explanation:
We now illustrate the sampling technique with R code. The dataset was collected
from Kaggle. It describes the fish found at Lake Powell: the species, the
equipment (gear) used, the date, the location, and the total length and weight
of each fish.
We want to estimate the average total length of the fish for each species with
minimum variance of the estimate. The total length of a fish is the length
measured from the tip of the snout to the tip of the longer lobe of the caudal
fin, usually measured with the lobes compressed along the midline.
The dataset contains several species, so the population is heterogeneous. If we
estimated the average total length without stratification, the variance would
tend to be high and the estimate less precise. Therefore, following the idea of
stratified random sampling, we divide the population into homogeneous subgroups
by species and draw a random sample from each stratum using Neyman allocation,
which gives the minimum variance of the estimate.
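The claim that stratification improves precision for a heterogeneous population can be illustrated with a small simulation. This is only a sketch on made-up data: the two "species", their length distributions and the sample sizes are assumptions, not values from the Lake Powell dataset.

```r
set.seed(42)
# Hypothetical heterogeneous population: two groups with very different lengths
pop <- data.frame(
  species = rep(c("short", "long"), times = c(600, 400)),
  TL      = c(rnorm(600, mean = 200, sd = 20), rnorm(400, mean = 450, sd = 40))
)
N  <- nrow(pop)
Nh <- c(short = 600, long = 400)   # stratum sizes
nh <- c(short = 30, long = 20)     # sample sizes per stratum, n = 50

# Mean of one simple random sample of size 50 from the whole population
one_srs <- function() mean(sample(pop$TL, 50))

# Stratified estimate: stratum sample means weighted by N_h / N
one_strat <- function() {
  m <- sapply(names(Nh), function(s) {
    mean(sample(pop$TL[pop$species == s], nh[[s]]))
  })
  sum(Nh / N * m)
}

# Empirical variance of each estimator over repeated sampling
v_srs   <- var(replicate(2000, one_srs()))
v_strat <- var(replicate(2000, one_strat()))
c(srs = v_srs, stratified = v_strat)
```

Because most of the variation in this toy population is between the two groups, the stratified estimator varies far less from sample to sample than the unstratified one.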
Analysis
library(rmarkdown) #library() attaches one package per call
library(knitr)
## Warning: package 'rmarkdown' was built under R version 3.6.3
library(readxl)
## Warning: package 'readxl' was built under R version 3.6.3
library(samplingbook)
## Warning: package 'samplingbook' was built under R version 3.6.3
## Loading required package: pps
## Loading required package: sampling
## Warning: package 'sampling' was built under R version 3.6.3
## Loading required package: survey
## Warning: package 'survey' was built under R version 3.6.3
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survival'
## The following objects are masked from 'package:sampling':
##
##     cluster, strata
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
##     dotchart
data <- read_excel("D:/Indu/M.Sc/sample survey design/fish data1.xlsx")
head(data)
## # A tibble: 6 x 13
##   FISH_ID DATE  TREND GEAR  Species    TL    WT   KTL    Wr AGE_STRUCTURE
##     <dbl> <chr> <lgl> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <lgl>
## 1  1.97e7 26179 TRUE  GN    LMB       263   250  1.37     0 FALSE
## 2  1.97e7 26179 TRUE  GN    LMB       348   700  1.66     0 FALSE
## 3  1.97e7 26179 TRUE  GN    LMB       332   555  1.51     0 FALSE
## 4  1.97e7 26179 TRUE  GN    LMB       350   720  1.67     0 FALSE
## 5  1.97e7 26179 TRUE  GN    LMB       300   455  1.68     0 FALSE
## 6  1.97e7 26179 TRUE  GN    LMB       320   505  1.54     0 FALSE
## # ... with 3 more variables: STOMACH <lgl>, GONADS <lgl>, SITE <chr>
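#Dividing the data into strata
#Stratum-1:
stratum1<- data[data$Species=="GSF",]
#Here, Stratum1 is taken from the population whose species is GSF
N1<- sum(stratum1$Species=="GSF")
N1 #Size of stratum1
## [1] 15
mean1<- mean(stratum1$TL) #TL (total length) is used here, as in the other strata
mean1 #Mean of total length of the fishes in the stratum1
S1<- sqrt(var(stratum1$TL))
S1 #Standard deviation of total length of the fishes in the stratum1
#The values printed here in the original run (mean 129.1333, SD 147.265) were computed from the weight column WT; the stratasamp allocation and var_st outputs shown further down correspond to that WT-based S1.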
#Stratum-2:
stratum2<- data[data$Species=="WAE",]
#Here, Stratum2 is taken from the population whose species is WAE
N2<- sum(stratum2$Species=="WAE")
N2 #Size of stratum2
## [1] 112
mean2<- mean(stratum2$TL)
mean2 #Mean of total length of the fishes in the stratum2
## [1] 455.5446
S2<- sqrt(var(stratum2$TL))
S2 #Standard deviation of total length of the fishes in the stratum2
## [1] 96.49991
#Stratum-3:
stratum3<- data[data$Species=="RBT",]
#Here, Stratum3 is taken from the population whose species is RBT
N3<- sum(stratum3$Species=="RBT")
N3 #Size of stratum3
## [1] 29
mean3<- mean(stratum3$TL)
mean3 #Mean of total length of the fishes in the stratum3
## [1] 460.1379
S3<- sqrt(var(stratum3$TL))
S3 #Standard deviation of total length of the fishes in the stratum3
## [1] 116.163
#Stratum-4:
stratum4<- data[data$Species=="CRP",]
#Here, Stratum4 is taken from the population whose species is CRP
N4<- sum(stratum4$Species=="CRP")
N4 #Size of stratum4
## [1] 70
mean4<- mean(stratum4$TL)
mean4 #Mean of total length of the fishes in the stratum4
## [1] 348.8429
S4<- sqrt(var(stratum4$TL))
S4 #Standard deviation of total length of the fishes in the stratum4
## [1] 30.49145
#Stratum-5:
stratum5<- data[data$Species=="LMB",]
#Here, Stratum5 is taken from the population whose species is LMB
N5<- sum(stratum5$Species=="LMB")
N5 #Size of stratum5
## [1] 637
mean5<- mean(stratum5$TL)
mean5 #Mean of total length of the fishes in the stratum5
## [1] 341.471
S5<- sqrt(var(stratum5$TL))
S5 #Standard deviation of total length of the fishes in the stratum5
## [1] 54.00883
N= N1+N2+N3+N4+N5
N #Population Size
## [1] 863
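The per-stratum sizes, means and standard deviations computed one stratum at a time above can also be obtained in a single pass with `tapply`, which applies the same column to every stratum. The sketch below uses a small made-up data frame `fish` with the same column names (`Species`, `TL`) because the original Excel file is not available here; the values are hypothetical.

```r
# Small made-up data frame with the same column names as the fish data
fish <- data.frame(
  Species = c("GSF", "GSF", "WAE", "WAE", "WAE", "LMB", "LMB", "LMB", "LMB"),
  TL      = c(150, 170, 430, 460, 480, 300, 320, 350, 330)
)

Nh    <- tapply(fish$TL, fish$Species, length)  # stratum sizes
meanh <- tapply(fish$TL, fish$Species, mean)    # stratum means of TL
Sh    <- tapply(fish$TL, fish$Species, sd)      # stratum standard deviations of TL

# One row per stratum, in alphabetical order of species
summary_tab <- data.frame(
  Species = names(Nh),
  Nh      = as.vector(Nh),
  mean    = as.vector(meanh),
  S       = as.vector(Sh)
)
summary_tab
sum(Nh)                                         # population size N
```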
#The sample of size n=200 has to be drawn using optimum allocation, i.e. Neyman allocation
sample_size <- stratasamp(n=200, Nh=c(N1,N2,N3,N4,N5), Sh=c(S1,S2,S3,S4,S5),type = "opt")
sample_size
##
## Stratum   1  2  3 4   5
## Size      8 41 13 8 130
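As a cross-check, the allocation printed by `stratasamp` can be reproduced by hand from the Neyman formula nh = n*Nh*Sh / sum(Nh*Sh). The sketch below simply plugs in the stratum sizes and the standard deviations from the run above (with S1 = 147.265) and rounds the result.

```r
# Stratum sizes and standard deviations as reported in the run above
Nh <- c(15, 112, 29, 70, 637)
Sh <- c(147.265, 96.49991, 116.163, 30.49145, 54.00883)
n  <- 200

nh_exact <- n * Nh * Sh / sum(Nh * Sh)   # Neyman allocation before rounding
round(nh_exact, 2)
nh_round <- round(nh_exact)
nh_round                                 # 8 41 13 8 130, matching stratasamp
sum(nh_round)                            # 200
```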
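#Sample allocations are obtained by Neyman allocation using the stratasamp function.
#Sample sizes used for each stratum in the rest of the analysis (they sum to n=200; note that n1, n2 and n5 differ from the allocation printed above):
n1=3
n2=42
n3=13
n4=8
n5=134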
#Collect a random sample of the size specified above from each stratum:
sample1= stratum1[sample(1:nrow(stratum1),3,replace = FALSE),] #sample has been drawn from stratum1
sample2= stratum2[sample(1:nrow(stratum2),42,replace = FALSE),] #sample has been drawn from stratum2
sample3= stratum3[sample(1:nrow(stratum3),13,replace = FALSE),] #sample has been drawn from stratum3
sample4= stratum4[sample(1:nrow(stratum4),8,replace = FALSE),] #sample has been drawn from stratum4
sample5= stratum5[sample(1:nrow(stratum5),134,replace = FALSE),] #sample has been drawn from stratum5
#Total sample:
total_sampled_data=rbind(sample1, sample2, sample3, sample4, sample5)
#Total sample is collected by stratified random sampling design using Neyman allocation
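nh=as.vector(table(total_sampled_data$Species))
nh #Sample size in each stratum (table() sorts the species alphabetically: CRP, GSF, LMB, RBT, WAE)
## [1] 8 3 134 13 42
wh=nh/sum(nh)
wh #Sample proportions nh/n, used below as the stratum weights; the population weights of the strata would be Nh/N
## [1] 0.040 0.015 0.670 0.065 0.210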
#Estimation of mean of total length of the fishes using Neyman allocation:
stratamean(y=total_sampled_data$TL, h=as.vector(total_sampled_data$Species), wh=wh, eae=TRUE)
##             Mean        SE      CIu      CIo
## CRP     348.5000  7.017834 334.7453 362.2547
## GSF     243.6667 40.522970 164.2431 323.0902
## LMB     346.6940  3.979118 338.8951 354.4930
## RBT     443.9231 28.612231 387.8441 500.0020
## WAE     444.5476 16.087240 413.0172 476.0780
## overall 372.0900  4.735796 362.8080 381.3720
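The "overall" row is a weighted average of the stratum means, so it depends on which weights are supplied. The sketch below recomputes it from the printed stratum means, once with the sample proportions nh/n that were passed as `wh`, and once with the population shares Nh/N that the stratified estimator of the population mean normally uses. Under Neyman allocation the two sets of weights differ, so the two overall figures differ as well. The vector names are chosen here for illustration.

```r
# Stratum means from the stratamean output above (order: CRP, GSF, LMB, RBT, WAE)
ybar_h <- c(CRP = 348.5000, GSF = 243.6667, LMB = 346.6940,
            RBT = 443.9231, WAE = 444.5476)
nh <- c(CRP = 8,  GSF = 3,  LMB = 134, RBT = 13, WAE = 42)   # sample sizes
Nh <- c(CRP = 70, GSF = 15, LMB = 637, RBT = 29, WAE = 112)  # stratum sizes

overall_sample_w <- sum(nh / sum(nh) * ybar_h)  # weights nh/n
overall_pop_w    <- sum(Nh / sum(Nh) * ybar_h)  # weights Nh/N

overall_sample_w   # about 372.09, the "overall" row above
overall_pop_w      # about 361.02
```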
#Variance of the estimate using stratified random sampling with Neyman allocation:
n=n1+n2+n3+n4+n5 #Total sample size
Nh= c(N1,N2,N3,N4,N5) #Vector of strata sizes
Sh= c(S1,S2,S3,S4,S5) #Vector of standard deviations
var_st <-(1/N^2)*(((1/n)*(sum(Sh*Nh))^2)-(sum(Nh*Sh^2)))
var_st #variance of estimate
## [1] 13.85909
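To put this variance in context, it can be compared with the variance that proportional allocation would give for the same n. The sketch below plugs the stratum sizes and standard deviations from the run above (with S1 = 147.265) into both formulas; the proportional-allocation formula, (1/n - 1/N) * sum(Wh * Sh^2) with Wh = Nh/N, is the standard textbook expression and is not part of the original analysis.

```r
# Stratum sizes and standard deviations as reported in the run above
Nh <- c(15, 112, 29, 70, 637)
Sh <- c(147.265, 96.49991, 116.163, 30.49145, 54.00883)
n  <- 200
N  <- sum(Nh)

# Minimum variance under Neyman allocation (same formula as var_st)
var_neyman <- (1 / N^2) * ((1 / n) * sum(Nh * Sh)^2 - sum(Nh * Sh^2))

# Variance under proportional allocation, nh = n * Nh / N
var_prop <- (1 / n - 1 / N) * sum((Nh / N) * Sh^2)

c(neyman = var_neyman, proportional = var_prop)   # neyman is about 13.86
sqrt(var_neyman)   # standard error of the estimated mean under Neyman allocation
```

With these inputs the Neyman variance is smaller than the proportional one, which is what the minimum variance property predicts.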
Interpretation:
Here we have obtained the average total length of the fish for each species,
together with confidence intervals and the variance of the estimate. In the
output above, the estimated average total length is 348.50 for CRP, 243.67 for
GSF, 346.69 for LMB, 443.92 for RBT and 444.55 for WAE, and the overall
estimate is 372.09. Total length is in the units recorded in the dataset (the
magnitudes suggest millimetres rather than centimetres). Because the samples
are drawn at random and no seed is set, these figures change slightly from run
to run.
From the results, we can see that GSF fish are the shortest and that RBT and
WAE fish are the longest. In the same way we can estimate weight and other
biological characteristics of the fish by species. This sampling technique is
therefore very useful for analysing the biological aspects of fish in fishery
surveys. For a fixed sample size, Neyman allocation gives the minimum variance
of the estimate and hence higher precision, which is why we used it. The
variance of the estimated average total length under Neyman allocation is
13.85909 (var_st in the output above).
Conclusion:
The significance of Neyman allocation for stratified random sampling and one of
its real-life applications have been explained with an analysis in R.