Stratified Sampling - Neyman Allocation


                                                                                   -Indumathi S  2048123

                                                                                                              

Introduction:

         Real-world data are often not homogeneous, and in such cases stratified sampling is the appropriate design. Within stratified sampling there are several ways to allocate the sample among the strata. This post covers stratified sampling and Neyman allocation, one of the most important allocation methods, and illustrates it with a real-life application and an analysis in R.

 Stratified Sampling:

          When a population is heterogeneous, the efficiency of an estimate can be increased greatly by dividing the population into homogeneous groups (strata) with respect to the characteristic under study and then selecting a sample from each group separately. This method is called stratified sampling. It is commonly used in large-scale surveys such as voter surveys and house price surveys.

         The population of N units is divided into k strata, the ith stratum having Ni units. The strata are non-overlapping and together comprise the whole population, so that N1 + N2 + … + Nk = N.

         A sample is drawn from each stratum independently, the sample size within the ith stratum being ni, such that n1 + n2 + … + nk = n. Taking samples in this way is called stratified sampling. If the sample within each stratum is selected by simple random sampling, the procedure is called stratified random sampling.
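The procedure can be sketched in R on a made-up population. The data frame `pop`, its columns `stratum` and `y`, and the per-stratum sample sizes `n_h` are illustrative assumptions, not part of the fish dataset analysed later.

```r
# Hypothetical population with k = 3 non-overlapping strata (N = 10 + 20 + 30 = 60)
set.seed(1)
pop <- data.frame(
  stratum = rep(c("A", "B", "C"), times = c(10, 20, 30)),
  y       = c(rnorm(10, 50, 5), rnorm(20, 100, 10), rnorm(30, 200, 20))
)

# Assumed sample sizes n_h for each stratum (n = 3 + 5 + 7 = 15)
n_h <- c(A = 3, B = 5, C = 7)

# Draw a simple random sample without replacement inside each stratum, independently
groups  <- split(pop, pop$stratum)
samples <- lapply(names(groups), function(s) {
  g <- groups[[s]]
  g[sample(nrow(g), n_h[[s]]), ]
})
strat_sample <- do.call(rbind, samples)

table(strat_sample$stratum)  # sample sizes achieved per stratum
```

Because each stratum is sampled separately, every stratum is guaranteed to appear in the sample with exactly the size chosen for it, which simple random sampling from the whole population does not guarantee.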

 Neyman Allocation (Optimal Allocation):

         For stratified sampling, we must carefully consider how to form the strata, which sampling procedure to use within each stratum, and how to allocate the sample size among the strata. A sampling allocation is a rule for deciding how many units of the total sample are drawn from each stratum.

      Neyman allocation is one of the most important sampling allocations. It is a special case of optimal allocation used when the sampling costs in the strata are approximately equal, and it is also called minimum variance allocation. The sample is allocated among the strata according to both the stratum size and the stratum variability. It is assumed that the sampling cost per unit is the same in every stratum and that the total sample size n is fixed. The sample sizes are allocated by

$$ n_h = n \, \frac{N_h S_h}{\sum_{i=1}^{k} N_i S_i}, \qquad h = 1, 2, \dots, k $$

      where Nh is the size of stratum h and Sh is its standard deviation.

            Under Neyman allocation, nh is proportional to Nh*Sh. If the standard deviations (and the costs) are equal in all strata, Neyman allocation is the same as proportional allocation.
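Since nh is proportional to Nh*Sh and the nh must sum to n, the remark about equal variances can be checked in one line. Here S denotes the common stratum standard deviation, a symbol introduced only for this illustration:

```latex
n_h = n\,\frac{N_h S_h}{\sum_{i=1}^{k} N_i S_i}
    = n\,\frac{N_h S}{S \sum_{i=1}^{k} N_i}
    = n\,\frac{N_h}{N}
```

The right-hand side is the proportional allocation, so the two rules coincide whenever all strata are equally variable.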

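      The minimum variance for fixed n is obtained by substituting the Neyman value of nh into the variance of the estimated mean under stratified random sampling:

$$ V_{\min}(\bar{y}_{st}) = \frac{1}{N^2} \left[ \frac{1}{n} \Big( \sum_{h=1}^{k} N_h S_h \Big)^2 - \sum_{h=1}^{k} N_h S_h^2 \right] $$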
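The substitution can be written out step by step. This sketch assumes the usual variance of the stratified mean under simple random sampling without replacement within strata, with W_h = N_h/N, and the Neyman allocation n_h = n N_h S_h / (sum of N_i S_i):

```latex
\begin{aligned}
V(\bar{y}_{st}) &= \sum_{h=1}^{k} W_h^2 S_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right),
  \qquad W_h = \frac{N_h}{N}, \\
\sum_{h=1}^{k} \frac{W_h^2 S_h^2}{n_h}
  &= \frac{1}{N^2} \sum_{h=1}^{k} N_h^2 S_h^2 \cdot \frac{\sum_{i=1}^{k} N_i S_i}{n \, N_h S_h}
   = \frac{1}{n N^2} \Big( \sum_{h=1}^{k} N_h S_h \Big)^2, \\
\sum_{h=1}^{k} \frac{W_h^2 S_h^2}{N_h}
  &= \frac{1}{N^2} \sum_{h=1}^{k} N_h S_h^2, \\
V_{\min}(\bar{y}_{st})
  &= \frac{1}{N^2} \left[ \frac{1}{n} \Big( \sum_{h=1}^{k} N_h S_h \Big)^2 - \sum_{h=1}^{k} N_h S_h^2 \right].
\end{aligned}
```

The last line has the same form as the `var_st` expression computed in the R analysis.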

 

                                          

 

Application in Real life:

            Stratified random sampling is commonly used to estimate the abundance indices of fish populations in multispecies surveys. Neyman allocation for stratified random sampling is often used in fishery surveys.

Fishery Independent Survey:

            Fishery-independent surveys are used to collect high-quality biological and ecological data to support fisheries management. Surveys of this type provide useful measures of population rates, relative abundance, and the sex and size composition of a wide range of species.

         Stratified sampling designs are widely used in fishery surveys. In some cases, however, the sample size is relatively small because of limited survey budgets or other factors, so the way sampling effort is allocated among the strata plays a major role. Generally, four sampling designs are used in fishery-independent surveys, one of which is the stratified sampling design with Neyman allocation. Neyman allocation is used here to obtain high precision, that is, minimum variance for a given sample size.

Data Description and Explanation:

        We now illustrate the sampling technique with R code, using a dataset obtained from Kaggle. The data describe fish caught at Lake Powell: the species, the equipment (gear) used, the date, the location, and the total length and weight of each fish.

         We want to estimate the average total length of the fish for each species with minimum variance of the estimate. The total length of a fish is the length measured from the tip of the snout to the tip of the longer lobe of the caudal fin, usually with the lobes compressed along the midline.

         The dataset contains several species, so the population is heterogeneous. If we estimated the average total length without stratification, the variance would tend to be high and the estimate less precise. Using stratified random sampling, we therefore divide the population into homogeneous subgroups by species and draw a random sample from each stratum using Neyman allocation, which gives the minimum variance of the estimate for a fixed sample size.

Analysis:

library(rmarkdown)
library(knitr)

## Warning: package 'rmarkdown' was built under R version 3.6.3

library(readxl)

## Warning: package 'readxl' was built under R version 3.6.3

library(samplingbook)

## Warning: package 'samplingbook' was built under R version 3.6.3

## Loading required package: pps

## Loading required package: sampling

## Warning: package 'sampling' was built under R version 3.6.3

## Loading required package: survey

## Warning: package 'survey' was built under R version 3.6.3

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

##
## Attaching package: 'survival'

## The following objects are masked from 'package:sampling':
##
##     cluster, strata

##
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
##
##     dotchart


                    



data <- read_excel("D:/Indu/M.Sc/sample survey design/fish data1.xlsx")
head(data)

## # A tibble: 6 x 13
##   FISH_ID DATE  TREND GEAR  Species    TL    WT   KTL    Wr AGE_STRUCTURE
##     <dbl> <chr> <lgl> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <lgl>       
## 1  1.97e7 26179 TRUE  GN    LMB       263   250  1.37     0 FALSE       
## 2  1.97e7 26179 TRUE  GN    LMB       348   700  1.66     0 FALSE       
## 3  1.97e7 26179 TRUE  GN    LMB       332   555  1.51     0 FALSE       
## 4  1.97e7 26179 TRUE  GN    LMB       350   720  1.67     0 FALSE       
## 5  1.97e7 26179 TRUE  GN    LMB       300   455  1.68     0 FALSE       
## 6  1.97e7 26179 TRUE  GN    LMB       320   505  1.54     0 FALSE       
## # ... with 3 more variables: STOMACH <lgl>, GONADS <lgl>, SITE <chr>

#Dividing the data into strata

#Stratum-1:

stratum1<- data[data$Species=="GSF",]
#Here, Stratum1 is taken from the population whose species is GSF

N1<- sum(stratum1$Species=="GSF")
N1     #Size of stratum1

## [1] 15

mean1<- mean(stratum1$WT)
mean1   #Mean of the WT (weight) column in stratum1; strata 2 to 5 use TL (total length), so stratum1$TL is needed here for the mean total length

## [1] 129.1333

S1<- sqrt(var(stratum1$WT))
S1      #Standard deviation of the WT (weight) column in stratum1; use stratum1$TL to match strata 2 to 5

## [1] 147.265

                               

#Stratum-2:

stratum2<- data[data$Species=="WAE",]
#Here, Stratum2 is taken from the population whose species is WAE

N2<- sum(stratum2$Species=="WAE")
N2     #Size of stratum2

## [1] 112

mean2<- mean(stratum2$TL)
mean2          #Mean of total length of the fishes in the stratum2

## [1] 455.5446

S2<- sqrt(var(stratum2$TL))
S2        #Standard deviation of the fishes in the stratum2

## [1] 96.49991


#Stratum-3:

stratum3<- data[data$Species=="RBT",]
#Here, Stratum3 is taken from the population whose species is RBT

N3<- sum(stratum3$Species=="RBT")
N3        #Size of stratum3

## [1] 29

mean3<- mean(stratum3$TL)
mean3      #Mean of total length of the fishes in the stratum3

## [1] 460.1379

S3<- sqrt(var(stratum3$TL))
S3         #Standard deviation of the fishes in the stratum3

## [1] 116.163

              



#Stratum-4:

stratum4<- data[data$Species=="CRP",]
#Here, Stratum4 is taken from the population whose species is CRP

N4<- sum(stratum4$Species=="CRP")
N4         #Size of stratum4

## [1] 70

mean4<- mean(stratum4$TL)
mean4       #Mean of total length of the fishes in the stratum4

## [1] 348.8429

S4<- sqrt(var(stratum4$TL))
S4     #Standard deviation of the fishes in the stratum4

## [1] 30.49145


#Stratum-5:

stratum5<- data[data$Species=="LMB",]
#Here, Stratum5 is taken from the population whose species is LMB

N5<- sum(stratum5$Species=="LMB")
N5       #Size of stratum5

## [1] 637

mean5<- mean(stratum5$TL)
mean5        #Mean of total length of the fishes in the stratum5

## [1] 341.471

S5<- sqrt(var(stratum5$TL))
S5     #Standard deviation of the fishes in the stratum5

## [1] 54.00883

N= N1+N2+N3+N4+N5
N   #Population Size

## [1] 863

                      



#A sample of size n=200 is to be drawn using optimum allocation, i.e. Neyman allocation

sample_size <- stratasamp(n=200, Nh=c(N1,N2,N3,N4,N5), Sh=c(S1,S2,S3,S4,S5),type = "opt")
sample_size

##                     
## Stratum 1  2  3 4   5
## Size    8 41 13 8 130
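As a check, the allocation reported by stratasamp can be reproduced by hand from nh = n * Nh * Sh / sum(Nh * Sh). This is a minimal sketch using the stratum sizes and standard deviations printed above; the names `Nh_chk`, `Sh_chk`, `n_total` and `alloc` are introduced here only for illustration, and plain rounding is assumed (it is not guaranteed to sum to n in general).

```r
# Stratum sizes and standard deviations, copied from the output printed above
Nh_chk  <- c(15, 112, 29, 70, 637)
Sh_chk  <- c(147.265, 96.49991, 116.163, 30.49145, 54.00883)
n_total <- 200

# Neyman allocation: n_h proportional to N_h * S_h, scaled to sum to n_total
alloc <- n_total * Nh_chk * Sh_chk / sum(Nh_chk * Sh_chk)

round(alloc, 2)  # unrounded allocation per stratum
round(alloc)     # rounded to whole fish
```

With these inputs the rounded values agree with the stratasamp output shown above.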

#Sample allocations are obtained by Neyman allocation using the stratasamp function.

#Sample sizes used for each stratum (these differ from the stratasamp output above, which gives 8, 41, 13, 8, 130):
n1=3
n2=42
n3=13
n4=8
n5=134

                  



#Collect a random sample of size specified above from each stratum:

sample1 <- stratum1[sample(nrow(stratum1), n1, replace = FALSE), ]  #sample drawn from stratum1

sample2 <- stratum2[sample(nrow(stratum2), n2, replace = FALSE), ]  #sample drawn from stratum2

sample3 <- stratum3[sample(nrow(stratum3), n3, replace = FALSE), ]  #sample drawn from stratum3

sample4 <- stratum4[sample(nrow(stratum4), n4, replace = FALSE), ]  #sample drawn from stratum4

sample5 <- stratum5[sample(nrow(stratum5), n5, replace = FALSE), ]  #sample drawn from stratum5

#Total sample:
total_sampled_data=rbind(sample1, sample2, sample3, sample4, sample5)
   #Total sample is collected by stratified random sampling design using Neyman allocation

nh=as.vector(table(total_sampled_data$Species))
nh  #Sample sizes per stratum, in alphabetical order of species (CRP, GSF, LMB, RBT, WAE)

## [1]   8   3 134  13  42

wh=nh/sum(nh)
wh  #Sample shares nh/n used as weights; stratum weights are normally the population shares Nh/N

## [1] 0.040 0.015 0.670 0.065 0.210

#Estimation of mean of total length of the fishes using Neyman allocation:

stratamean(y=total_sampled_data$TL, h=as.vector(total_sampled_data$Species), wh=wh,  eae=TRUE)

##             Mean        SE      CIu      CIo
## CRP     348.5000  7.017834 334.7453 362.2547
## GSF     243.6667 40.522970 164.2431 323.0902
## LMB     346.6940  3.979118 338.8951 354.4930
## RBT     443.9231 28.612231 387.8441 500.0020
## WAE     444.5476 16.087240 413.0172 476.0780
## overall 372.0900  4.735796 362.8080 381.3720
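For reference, a stratified estimate of the overall mean normally weights each stratum mean by its population share Wh = Nh/N rather than by its sample share nh/n, and under Neyman allocation the two differ. The sketch below recomputes the overall figure both ways from the stratum means printed above. The vector names are illustrative, and the strata are listed in the alphabetical order used by the output.

```r
# Stratum means copied from the stratamean output above (alphabetical order)
ybar_h <- c(CRP = 348.5000, GSF = 243.6667, LMB = 346.6940,
            RBT = 443.9231, WAE = 444.5476)

# Population stratum sizes N_h and sample sizes n_h in the same order
N_h <- c(CRP = 70, GSF = 15, LMB = 637, RBT = 29, WAE = 112)
n_h <- c(CRP = 8,  GSF = 3,  LMB = 134, RBT = 13, WAE = 42)

# Overall mean with sample shares n_h/n (the weights passed to stratamean above)
mean_sample_w <- sum(n_h / sum(n_h) * ybar_h)

# Overall mean with population shares W_h = N_h/N
mean_pop_w <- sum(N_h / sum(N_h) * ybar_h)

c(sample_weights = mean_sample_w, population_weights = mean_pop_w)
```

The first value reproduces the "overall" row above (372.09). The second is lower, about 361, because LMB, the largest stratum with one of the smaller means, makes up a larger share of the population than of the sample.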

             



#Variance of the estimate using stratified random sampling with Neyman allocation:
n=n1+n2+n3+n4+n5  #Total sample size
Nh= c(N1,N2,N3,N4,N5)  #Vector of Strata sizes
Sh= c(S1,S2,S3,S4,S5)  #vector of standard deviation

var_st <-(1/N^2)*(((1/n)*(sum(Sh*Nh))^2)-(sum(Nh*Sh^2)))
var_st  #variance of estimate

## [1] 13.85909
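To see what Neyman allocation gains, the same variance can be compared with the one under proportional allocation, nh = n * Nh / N, for which the variance is (1/n - 1/N) * sum(Wh * Sh^2). This is a minimal sketch reusing the Nh and Sh values printed earlier; `Nh_v`, `Sh_v`, `v_neyman` and `v_prop` are names introduced here for illustration.

```r
# Stratum sizes and standard deviations, copied from the output printed above
Nh_v <- c(15, 112, 29, 70, 637)
Sh_v <- c(147.265, 96.49991, 116.163, 30.49145, 54.00883)
N_v  <- sum(Nh_v)     # population size
n_v  <- 200           # total sample size
Wh_v <- Nh_v / N_v    # population stratum weights

# Variance of the stratified mean under Neyman allocation (same formula as var_st)
v_neyman <- (1 / N_v^2) * (sum(Nh_v * Sh_v)^2 / n_v - sum(Nh_v * Sh_v^2))

# Variance of the stratified mean under proportional allocation
v_prop <- (1 / n_v - 1 / N_v) * sum(Wh_v * Sh_v^2)

c(neyman = v_neyman, proportional = v_prop)
```

With these inputs the Neyman figure agrees with the value printed above, and the proportional-allocation variance comes out larger (about 16.4), which is the gain in precision the allocation is chosen for.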



 

Interpretation:

        Here we have obtained the average total length of the fish for each species, together with confidence intervals and the variance of the estimate. From the output above, the estimated average total length is 348.5000 for CRP, 243.6667 for GSF, 346.6940 for LMB, 443.9231 for RBT and 444.5476 for WAE, and the overall estimate for all fish in the population is 372.0900. These values are in the units of the TL column (the magnitudes suggest millimetres rather than centimetres). Because the samples are drawn at random without a fixed seed, the figures change from run to run.

         From these results, GSF are the shortest species, while RBT and WAE are the longest, with nearly equal sample means. Weight and other biological characteristics can be analysed by species in the same way. This sampling technique is therefore very useful for analysing the biological aspects of fish in fishery surveys. Neyman allocation gives the minimum variance of the estimate for a fixed sample size, and hence higher precision, which is why we used it. The minimum variance of the estimated average total length computed above is 13.85909.

 Conclusion:

      This post has explained the significance of Neyman allocation for stratified random sampling and one of its real-life applications, with an analysis in R.





