Optimal Stratification of Univariate Populations
Name: Nayana B Menon
Reg.no: 2048128
Stratification reduces the variance of sample estimates of population
parameters by creating homogeneous strata. Surveyors often stratify the
population using the most convenient variables, such as age, sex or region.
Such convenient choices often do not produce internally homogeneous strata, so
the precision of the estimates for the variables of interest could be improved
further. Stratification in univariate populations has been explored by numerous
researchers, many of whom have proposed competing algorithms that help
surveyors determine efficient stratum boundaries.
Optimum Stratum boundaries: The method of choosing the boundaries that make
the strata as internally homogeneous as possible is known as optimum
stratification. To achieve this, the strata should be constructed so that the
stratum variances of the characteristic under study are as small as possible.
If the frequency distribution of the study variable y is known, the optimum
strata boundaries (OSB) can be obtained by cutting the range of the
distribution at suitable points. If the frequency distribution of y is unknown,
it may be approximated from past experience or from prior knowledge obtained in
a recent study. Many skewed populations have, or may be assumed to have
approximately, a log-normal frequency distribution.
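As a rough illustration of approximating a frequency distribution by a log-normal, the sketch below estimates the two log-normal parameters by maximum likelihood in base R. The population is simulated, and the names `x`, `meanlog_hat` and `sdlog_hat` as well as the parameter values are assumptions made only for this example.

```r
# Illustrative only: simulate a skewed population (arbitrary parameters).
set.seed(1)
x <- rlnorm(5000, meanlog = 1.5, sdlog = 0.1)

# Maximum-likelihood estimates for a log-normal distribution:
# the mean and the (divide-by-n) standard deviation of log(x).
log_x <- log(x)
meanlog_hat <- mean(log_x)
sdlog_hat <- sqrt(mean((log_x - meanlog_hat)^2))

# Estimated parameters of the approximating distribution
c(meanlog = meanlog_hat, sdlog = sdlog_hat)
```

With real survey data, `x` would be replaced by the observed study variable, and the fit should be checked (for example with a histogram) before it is used to place boundaries.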
Optimum Sample size: When you perform a survey, the intention is to get a
representative picture of a number of variables or statements within a certain
target group or population. For practical reasons (the population is too
large, or a full count is too expensive or too time-consuming), it is often
difficult to survey the total population. In that case a sample is used. This
is a selection of respondents chosen in such a way that they represent the
total population as well as possible. It is very important to use a correct
sample size. When your sample is too big, money and time are wasted. On the
other hand, when it is too small, your estimates will be too imprecise to
support reliable conclusions.
Determining the optimum stratum boundaries and determining the optimum sample
sizes to be selected from each stratum are two inherent optimization problems
in optimal stratification. Once the optimum stratum boundaries have been
determined, the optimum sample sizes can easily be computed using a particular
sample allocation method. When stratification is based on a single study
variable (y), its distribution can be utilized as the best characteristic to
determine the optimum stratum boundaries, i.e., by cutting the range of the
distribution at suitable points. The basic consideration involved in
determining optimum stratum boundaries is that the strata should be as
internally homogeneous as possible. Thus, in order to achieve maximum
precision, the stratum variances should be as small as possible (Cochran,
1977).
Formulation of the Univariate Stratification Problem
Let the target population of the variable under study be stratified into L
strata, where the estimation of the mean of this study variable (y) is of
interest. If a simple random sample of size $n_h$ is to be drawn from the hth
stratum with sample mean $\bar{y}_h$, then the stratified sample mean,
$\bar{y}_{st}$, is given by:

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h$$

where $W_h$ (the stratum weight) is the proportion of the population contained
in the hth stratum. When the finite population correction factors are ignored,
under the Neyman (1934) allocation, the variance of $\bar{y}_{st}$ is given
by:

$$\mathrm{Var}(\bar{y}_{st}) = \frac{1}{n} \left( \sum_{h=1}^{L} W_h S_h \right)^2$$

where $S_h^2$ is the stratum variance for the study variable in the hth
stratum ($h = 1, 2, \ldots, L$) and n is the preassigned total sample size.
For a fixed sample size n, minimizing the right-hand side of this variance
expression is equivalent to minimizing

$$\sum_{h=1}^{L} W_h S_h$$

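Because $W_h$ and $S_h$ are determined by the frequency distribution of y
between the boundary points $y_{h-1}$ and $y_h$, the objective function could
be expressed as a function of these boundary points only. Further, define
$l_h = y_h - y_{h-1}$; $h = 1, 2, \ldots, L$, where $l_h \ge 0$ denotes the
range or width of the hth stratum. With $y_0 = a$ and $y_L = b$ the smallest
and largest values of y, the range of the distribution, $d = b - a$, is
expressed as a function of the stratum widths as:

$$\sum_{h=1}^{L} l_h = d$$
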
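The hth stratification point $y_h$; $h = 1, 2, \ldots, L$ is then expressed as
$y_h = y_{h-1} + l_h$, and the problem can be treated as an equivalent problem
of determining the optimum strata widths (OSW), $l_1, l_2, \ldots, l_L$. Due
to the special nature of the functions, each term $W_h S_h$ may be treated as
a function of $l_h$ alone, and the problem can be expressed as:

$$\text{Minimize } \sum_{h=1}^{L} W_h S_h \quad \text{subject to } \sum_{h=1}^{L} l_h = d, \quad l_h \ge 0, \; h = 1, 2, \ldots, L$$
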
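The objective described in this section, the sum of $W_h S_h$ over the strata, can be evaluated directly on data. The sketch below is a plain grid search over candidate boundaries for two strata on a simulated population. It is not the mathematical programming approach used by the package, and `strata_objective`, `candidates` and `best_cut` are hypothetical names introduced only for illustration.

```r
# Objective sum(W_h * S_h) for a vector of interior boundary points.
# Stratum h is the interval (y_{h-1}, y_h], matching cut()'s default.
strata_objective <- function(y, cuts) {
  breaks <- c(-Inf, sort(cuts), Inf)
  stratum <- cut(y, breaks = breaks)
  Nh <- tapply(y, stratum, length)   # stratum sizes
  Sh <- tapply(y, stratum, sd)       # stratum standard deviations
  Wh <- Nh / length(y)               # stratum weights
  sum(Wh * Sh)
}

# Hypothetical skewed population (arbitrary parameters).
set.seed(1)
y <- rlnorm(1000, meanlog = 1.5, sdlog = 0.1)

# Candidate boundaries: interior quantiles, so no stratum is empty.
candidates <- quantile(y, probs = seq(0.05, 0.95, by = 0.01))
objective <- sapply(candidates, function(cp) strata_objective(y, cp))

# Boundary with the smallest objective among the candidates.
best_cut <- candidates[which.min(objective)]
best_cut
```

A grid search like this is only practical for a small number of strata, since the number of boundary combinations grows quickly with L.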
Optimal Stratification of Univariate Populations – R Programming:
With the simulated data sets, the number of strata (h), the fixed sample size
(n) and the population size (N) were used as the input arguments to the
strata.dp() function in the package. When executed, the package outputs the
OSB and the optimum sample sizes (OSS), amongst other quantities such as the
stratum weight ($W_h$) and the stratum variance ($S_h^2$).
The stratifyR Package
Under the proposed method, in order to construct optimum stratum boundaries
and optimum sample sizes for a given population, its best-fit frequency
distribution needs to be estimated. The problem of OSB is then formulated as a
mathematical programming problem (MPP), where the objective function is
minimised on the range of the data set subject to the constraints. Both the
estimation of the distribution and the MPP formulation (for the indicated
distributions) are implemented in the stratifyR package.
We use an example to illustrate this in R:
SYNTAX:
strata.data(data, h, n, cost=FALSE, ch=NULL)
data - A vector: data containing every unit of the survey population
h - A numeric: number of strata to be sampled. The default is 2
n - A numeric: fixed total sample size
cost - A logical: stratum cost. Default cost=FALSE.
ch - A numeric: denotes a vector of stratum costs. Default ch=NULL.
To show the application of the strata.data() function, an example of the
command used and its output from the package is given below. The problem uses
the ‘mag’ variable from the ‘quakes’ data (with a population of N=1000)
available from the datasets package in R. To construct a 2-strata solution
with a fixed sample size of n=300, we use the following code:
data(quakes)
head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12
mag <- quakes$mag
length(mag)
## [1] 1000
hist(mag)
library(stratifyR)
## Warning: package 'stratifyR' was built under R version 3.6.3
## Loading required package: fitdistrplus
## Warning: package 'fitdistrplus' was built under R version 3.6.3
## Loading required package: MASS
## Loading required package: survival
## Loading required package: zipfR
## Warning: package 'zipfR' was built under R version 3.6.3
## Loading required package: actuar
## Warning: package 'actuar' was built under R version 3.6.3
##
## Attaching package: 'actuar'
## The following object is masked from 'package:grDevices':
##
##     cm
## Loading required package: triangle
## Warning: package 'triangle' was built under R version 3.6.3
## Loading required package: mc2d
## Warning: package 'mc2d' was built under R version 3.6.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.6.3
##
## Attaching package: 'mc2d'
## The following objects are masked from 'package:base':
##
##     pmax, pmin
res <- strata.data(mag, h = 2, n=300)
## The program is running, it'll take some time!
summary(res)
## _____________________________________________
## Optimum Strata Boundaries for h = 2
## Data Range: [4, 6.4] with d = 2.4
## Best-fit Frequency Distribution: lnorm
## Parameter estimate(s):
##    meanlog      sdlog
## 1.52681032 0.08503554
## ____________________________________________________
## Strata  OSB   Wh   Vh  WhSh  nh   Nh   fh
## 1      4.68 0.58 0.03 0.109 140  585 0.24
## 2       6.4 0.42 0.09 0.124 160  415 0.38
## Total       1.00 0.12 0.233 300 1000 0.30
## ____________________________________________________
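The stratum sample sizes in the summary can be checked by hand. Under Neyman allocation, each stratum receives a share of n proportional to its WhSh value. The short sketch below uses the rounded WhSh figures printed above, so it is an approximate check rather than a reproduction of the package's internal computation. The variable names are chosen only for this example.

```r
# Values copied from the WhSh column of the summary output.
WhSh <- c(0.109, 0.124)
n <- 300  # fixed total sample size used in the call

# Neyman allocation: n_h = n * W_h S_h / sum(W_h S_h)
nh <- n * WhSh / sum(WhSh)
round(nh)  # 140 160, matching the nh column
```

The stratum with the larger WhSh value receives the larger share of the sample, even though it contains fewer population units (415 against 585).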