Alternative Estimation Method for a Three-Stage Cluster Sampling in Finite Population

This research investigates the use of a three-stage cluster sampling design in estimating a population total. We focus on a special design in which a certain number of visits is considered for estimating the population size and a weighting factor is introduced. In particular, we derive a new estimation method for a three-stage sampling design and compare the newly proposed estimator with some of the existing estimators for such a design. Eight (8) data sets were used. The first four (4) data sets were obtained from [1], [2], [3] and [4] respectively, while the other four (4) data sets give the number of diabetic patients in Niger State, Nigeria, for the years 2005, 2006, 2007 and 2008 respectively. The computation was done with software developed in the Microsoft Visual C++ programming language. All the estimates obtained show that the newly proposed three-stage cluster sampling estimator performs better than the existing estimators considered.


Introduction
In a census, each unit (such as a person, household or local government area) is enumerated, whereas in a sample survey only a sample of units is enumerated and the information provided by the sample is used to make estimates relating to all units [5], [6]. In designing a study, it can be advantageous to sample units in more than one stage. The criteria for selecting a unit at a given stage typically depend on attributes observed in the previous stages [7]. In multistage sampling, the researcher divides the population into clusters, samples the clusters, and then resamples, repeating the process until the ultimate sampling units are selected at the last of the hierarchical levels [8]. If, after selecting a sample of primary units, a sample of secondary units is selected from each of the selected primary units, the design is referred to as two-stage sampling. If in turn a sample of tertiary units is selected from each selected secondary unit, the design is three-stage sampling [9].
The aim of this paper is to develop a new estimator for the three-stage cluster sampling scheme and to compare it with seven existing conventional estimators.

Methodology
Subsampling has a great variety of applications [3], and the reason for multistage sampling is administrative convenience [10]. The process of subsampling can be carried to a third stage by sampling the subunits instead of enumerating them completely [11]. In a comparison of multistage cluster sampling with simple random sampling, multistage cluster sampling was observed to be better in terms of efficiency [12]. Multistage sampling makes fieldwork and supervision relatively easy [4]. Multistage sampling is more efficient than single-stage cluster sampling [13], and references have been made to the use of sampling in three or more stages [9].
Let N denote the number of primary units in the population and n the number of primary units in the sample.
Let $y_{ij}$ denote the value of the variable of interest for the j-th secondary unit in the i-th primary unit. The total of the y-values in the i-th primary unit is the sum of $y_{ij}$ over its secondary units; accordingly, the population total in a two-stage design is the sum of these primary-unit totals over all N primary units. Expression (5) may be written in three components, where ">1" is the symbol representing all stages of sampling after the first [3].

Proposed Three-Stage Cluster Sampling Design
To estimate the population size at different hospitals using three-stage sampling, an unbiased estimator of the population total can be derived as follows. In a three-stage sampling design without replacement, supported by [3], [4] and [14], a sample of primary units is selected, then a sample of secondary units is chosen from each of the selected primary units and, finally, a sample of tertiary units is chosen from each selected secondary unit. For instance, the state consists of a number of local government areas, out of which a simple random sample of n local government areas is selected. Each local government area consists of a number of cities, out of which a simple random sample of cities is selected without replacement. Finally, from each selected city, which contains a number of hospitals, a sample of hospitals is selected at random without replacement and the number of diabetic patients in each selected hospital is recorded. The sample sizes are therefore: n, the number of primary units (local government areas) sampled without replacement; the number of secondary units (cities) selected without replacement from the i-th sampled primary unit (local government area); and the number of tertiary units (hospitals) selected from the j-th secondary unit (city) in the i-th primary unit (local government area). An unbiased estimator of the population total at the j-th secondary unit in the i-th primary unit in the sample is given by equation (8), which uses the known sampling fraction for tertiary units in the j-th secondary unit of the i-th primary unit. The quantity observed is the number of individuals (tertiary units) in the sample from the j-th secondary unit of the i-th primary unit who are undergoing treatment for diabetes. An unbiased estimator of the population total in the i-th primary unit in the sample is given by equation (9). Finally, an unbiased estimator of the population total of diabetic patients undergoing treatment in all the hospitals of the secondary units (cities) within the primary units (local government areas) is given by equation (10).
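As a point of reference for the design just described, the conventional (unweighted) three-stage expansion estimator under simple random sampling without replacement at every stage can be sketched in a few lines. This is not the proposed estimator of equation (10), since the weighting factor of the proposed method is not used here. The function name and the dictionary keys "M", "K", "secondaries" and "y" are hypothetical names chosen only for illustration.

```python
from typing import Sequence


def three_stage_total(N: int, primaries: Sequence[dict]) -> float:
    """Conventional three-stage expansion estimate of a population total.

    N is the number of primary units in the population. Each sampled
    primary unit is a dict with (hypothetical) keys:
      "M": number of secondary units in that primary unit
      "secondaries": sampled secondary units, each a dict with
          "K": number of tertiary units in that secondary unit
          "y": observed values for the sampled tertiary units
    """
    n = len(primaries)
    total = 0.0
    for p in primaries:
        m = len(p["secondaries"])
        primary_est = 0.0
        for s in p["secondaries"]:
            k = len(s["y"])
            # Expand the tertiary-sample sum to the whole secondary unit.
            primary_est += s["K"] / k * sum(s["y"])
        # Expand the secondary-sample sum to the whole primary unit.
        total += p["M"] / m * primary_est
    # Expand the primary-sample sum to the whole population.
    return N / n * total
```

Each level multiplies a sample sum by the inverse of its sampling fraction, which is what makes the conventional estimator unbiased under this design; any alternative estimator can be compared against this baseline on the same sample.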

Proof:
The expectation of the estimator given by equation (8), conditional on samples 1 and 2 of primary units and secondary units respectively, equals the true total of the variable of interest in the corresponding secondary unit of each primary unit [15]; this is equation (11). Similarly, the expectation of the estimator given by equation (9), conditional on sample 1 of primary units, equals the true total of the variable of interest in the corresponding primary unit; this is equation (12).
The expected value of the proposed estimator given by equation (10) is then obtained over all possible samples of primary units, where 1 and 2 denote the samples of primary units and secondary units respectively.
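The stage-by-stage conditioning argument can be checked exactly on a small example. The sketch below does so for the conventional expansion estimator (not the proposed weighted estimator), enumerating every simple random sample without replacement at each stage and using linearity of expectation given the earlier stages. The toy population and the sample sizes n, m and k are assumptions made purely for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population: population[i][j] lists the y-values of the
# tertiary units in secondary unit j of primary unit i.
population = [
    [[3, 5, 2], [4, 1]],
    [[6, 2], [7, 3, 1], [2, 2]],
    [[5, 4], [1, 8]],
]
n, m, k = 2, 1, 1  # assumed sample sizes at stages 1, 2 and 3


def mean_over_samples(units, size, value):
    """Average value(sample) over all equally likely SRSWOR samples."""
    samples = list(combinations(units, size))
    return sum(value(s) for s in samples) / Fraction(len(samples))


def expected_secondary(ys):
    # Expectation over tertiary samples of (K/k) * sum of sampled y-values.
    return mean_over_samples(ys, k, lambda s: Fraction(len(ys), k) * sum(s))


def expected_primary(secs):
    # Expectation over secondary samples of (M/m) * sum of secondary estimates.
    return mean_over_samples(
        secs, m,
        lambda s: Fraction(len(secs), m) * sum(expected_secondary(ys) for ys in s),
    )


def expected_total(pop):
    # Expectation over primary samples of (N/n) * sum of primary estimates.
    return mean_over_samples(
        pop, n,
        lambda s: Fraction(len(pop), n) * sum(expected_primary(p) for p in s),
    )


true_total = sum(y for p in population for sec in p for y in sec)
```

Comparing `expected_total(population)` with `true_total` confirms unbiasedness for this toy case; exact fractions are used so that the comparison does not depend on floating-point rounding.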
Hence, the variance of the newly proposed estimator of the population total is derived as follows. In line with [3] and [14], we use the decomposition in equation (15). Because of the simple random sampling of primary units and secondary units without replacement at the first and second stages respectively, the first term to the right of the equality in equation (15) is given by equation (16) and the second term by equation (17). Equations (16) and (17) give equation (18). We note that the first term to the right of the equality in equation (18) is the variance that would be obtained if every tertiary unit in a selected secondary unit and every secondary unit in a selected primary unit were observed, that is, if the totals of all selected primary units were known. The second term contains the variance that would be obtained if every tertiary unit in a selected secondary unit were observed, that is, if the totals of all selected secondary units in all selected primary units were known. The third term contains the variance due to estimating the secondary-unit totals from a subsample of tertiary units within the selected secondary units. An unbiased estimator of the variance given in equation (18) is obtained by replacing the population variances with the sample variances.
This estimator is then compared with seven conventional three-stage cluster sampling design estimators.
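For orientation, the general identity behind a three-term variance of this kind is the law of total variance applied stage by stage. It is written here in assumed notation that need not match the paper's own symbols: $\hat{\tau}$ is any three-stage estimator, and $E_s$ and $V_s$ denote expectation and variance over stage $s$, conditional on the earlier stages.

```latex
\operatorname{Var}(\hat{\tau})
  = V_1\!\left[ E_2 E_3(\hat{\tau}) \right]
  + E_1\!\left[ V_2\!\left( E_3(\hat{\tau}) \right) \right]
  + E_1 E_2\!\left[ V_3(\hat{\tau}) \right]
```

The three terms correspond, in order, to variation between primary units, between secondary units within primary units, and between tertiary units within secondary units.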

Data Used for this Study
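Eight (8) data sets are used in this paper. The first four (4) data sets were obtained from [1], [2], [3] and [4] respectively. The second four (4) data sets represent the number of diabetic patients in Niger State, Nigeria, for the years 2005, 2006, 2007 and 2008 respectively.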

Results
The estimates obtained with the aid of software developed in the Microsoft Visual C++ programming language [18] are given in Tables 1-12 for the illustrated and the real-life data respectively.

Discussion of Results
The estimation method given in equation (10) was applied to four different illustrated data sets (Cases I-IV) and four real-life data sets (Populations 1-4). The population totals obtained for the illustrated data are given in Table 1, while those obtained for the real-life data are given in Table 2. Table 3 gives the biases of the estimated population totals for the illustrated data; for our own estimator these are 11, 219, 1 and 112 for Cases I-IV respectively. Table 4 gives the biases for the four real-life data sets as 112, 104, 103 and 107 respectively. This implies that our own estimator has the least bias for both kinds of data. The confidence intervals of the estimated population totals in Table 1 are given in Table 9 for α = 5%, and those of the estimated population totals in Table 2 are given in Table 10 for α = 5%; all the estimated population totals fall within the computed intervals, as expected. For our own estimator, Table 11 gives the coefficients of variation of the estimated population totals for the illustrated data as 3.98%, 0.31%, 2.51% and 0.27% for Cases I-IV respectively, while Table 12 gives those for the real-life data sets as 0.34%, 0.33%, 0.35% and 0.37% respectively. The newly proposed three-stage cluster estimator therefore has the least coefficient of variation and is preferred.
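The comparison criteria used above (bias, confidence interval and coefficient of variation) can be computed from an estimate and its estimated variance as in the following sketch. The function names are hypothetical, and the normal-approximation interval with z = 1.96 for α = 5% is an assumption, since the interval formula is not stated in the text.

```python
import math


def bias(estimate: float, true_total: float) -> float:
    """Difference between the estimated and the true population total."""
    return estimate - true_total


def coefficient_of_variation(estimate: float, variance: float) -> float:
    """Standard error as a percentage of the estimate."""
    return 100.0 * math.sqrt(variance) / estimate


def normal_confidence_interval(estimate: float, variance: float, z: float = 1.96):
    """Assumed normal-approximation interval: estimate +/- z * standard error."""
    se = math.sqrt(variance)
    return estimate - z * se, estimate + z * se
```

An estimator with a smaller coefficient of variation gives a narrower interval around the same estimate, which is the sense in which it is preferred in the comparison.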

Conclusions
The alternative estimation method allows the use of a certain number of visits to the venues (hospitals) within the clusters (cities). A more precise (minimum mean square error) estimate was obtained, and the estimates presented indicate that a substantial reduction in variance was achieved through the use of the newly proposed estimator. We also observed that, irrespective of the data considered, the variance of the newly proposed estimator is always less than those of the existing estimators in three-stage cluster sampling designs. The newly proposed estimator is preferred to the existing estimators considered in this study and is therefore recommended.