Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method

The main aim of this paper is to examine the efficiency of the Genetic Algorithm (GA) of Keskintürk and Er (2007) [1], Kozak's (2004) Random Search [2] and Lavallée and Hidiroglou's (1988) Iterative Algorithm [3] in determining the stratum boundaries that minimize the variance of the estimate. The initial boundaries of these algorithms are normally obtained randomly. Here, the aim is to reach better results in a shorter time by using initial boundaries obtained from Gunning and Horgan's (2004) geometric method [4] instead of random initial boundaries. The three algorithms are applied to various populations with both random and geometric initial boundaries, and their performances are compared. In the stratification of 11 heterogeneous populations with different properties, higher variances of the estimates or infeasible solutions are observed when the initial boundaries are obtained with the geometric method.


Introduction
In stratified sampling, in order to gain more precision than other methods of sampling, a heterogeneous population is divided into subpopulations, each of which is internally homogeneous. The main problem arising in stratified sampling is therefore to obtain the optimum stratum boundaries. Several numerical and computational methods have been developed for this purpose. Some apply to highly skewed populations and some apply to any kind of population. An early and very simple method is the cumulative square root of the frequency method (cum√f) of Dalenius and Hodges (1959) [5]. More recently, the Lavallée and Hidiroglou algorithm [3] and Gunning and Horgan's (2004) geometric method [4] have been proposed for highly skewed populations, whereas Kozak's (2004) random search method [2] and Keskintürk and Er's (2007) genetic algorithm (GA) method [1] have been proposed for populations that need not be skewed. Very recently, Brito et al. [6] proposed an exact algorithm, which they called StratPath, for the stratification problem with proportional allocation only, based on the concept of minimum path in graphs. Moreover, an iterated local search method has been developed to solve the stratification problem for variables with any distribution under Neyman allocation [7]. All these methods aim to achieve the optimum boundaries that maximise the level of precision or, equivalently, minimise the variance of the estimate or the sample size required to reach a given level of precision. Some of them are available in the stratification package for the statistical programming environment R [8], freely available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=stratification.
The main aim of this research is to compare the efficiency ratios of the Lavallée and Hidiroglou iterative method, Kozak's random search method and Keskintürk and Er's genetic algorithm approach when the initial boundaries are obtained either randomly or from the geometric method of Gunning and Horgan, and to examine the performances of the three methods. The predetermined total sample size (n) is allocated using Neyman's [9] optimum allocation method. The paper is structured as follows: in the second section, the exact solution of Dalenius [10] and the methods developed to solve the Dalenius equations approximately are briefly explained. In the third section, the results obtained with Lavallée and Hidiroglou's iterative method, Kozak's random search method and Keskintürk and Er's genetic algorithm are given for both random and geometric initial boundaries, and the performances of the algorithms are compared.

Dalenius (1950) [10] considers a density $f(x)$ for the stratification variable $x$. The range $(x_{\max} - x_{\min})$ of $x$ is divided into $L$ parts at points $b_1 < b_2 < \dots < b_{L-1}$, each part corresponding to a stratum. When a sample of $n_h$ units is drawn from the $N_h$ units of stratum $h$, with $\sum_{h=1}^{L} n_h = n$, the population mean is estimated, following Cochran [11], as

$$\bar{x}_{st} = \sum_{h=1}^{L} W_h \bar{x}_h$$

where for the $h$th stratum $W_h$, $\mu_h$ and $\bar{x}_h$ are calculated as follows [11]:

$$W_h = \frac{N_h}{N}, \qquad \mu_h = \frac{1}{N_h} \sum_{i=1}^{N_h} x_{hi}, \qquad \bar{x}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} x_{hi}.$$

The estimate of the mean $\bar{x}_{st}$ has a variance of

$$V(\bar{x}_{st}) = \sum_{h=1}^{L} W_h^2 \frac{\sigma_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right)$$

where the true variance is

$$\sigma_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (x_{hi} - \mu_h)^2.$$

If the sampling fractions $n_h / N_h$ are negligible, the variance can be written in short as

$$V(\bar{x}_{st}) = \sum_{h=1}^{L} \frac{W_h^2 \sigma_h^2}{n_h}.$$

It is well known that this variance of the estimate is minimum when the total sample size $n$ is allocated using Neyman's optimum allocation method [9]:

$$n_h = n \, \frac{W_h \sigma_h}{\sum_{k=1}^{L} W_k \sigma_k}.$$

Substituting this allocation into the short form gives $V(\bar{x}_{st}) = \frac{1}{n} \left( \sum_{h=1}^{L} W_h \sigma_h \right)^2$. Therefore the variance of the estimate is a function of the boundaries $b_h$ alone, through $W_h$ and $\sigma_h$. Because every term changes whenever a boundary moves, it is very difficult to find the boundaries that minimise the variance of the estimate.
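The dependence of the variance on the boundaries can be illustrated with a short sketch. It evaluates the variance of the stratified mean under Neyman allocation, ignoring the finite population correction, in which case the variance reduces to $(\sum_h W_h \sigma_h)^2 / n$. The function name `neyman_variance` is hypothetical, and several choices are assumptions made here rather than taken from the paper: a unit equal to a boundary is assigned to the lower stratum, $\sigma_h$ uses the $N_h - 1$ divisor, sample sizes are left unrounded, and no take-all stratum is enforced.

```python
import numpy as np


def neyman_variance(x, boundaries, n):
    """Variance of the stratified mean under Neyman allocation (fpc ignored).

    x          : 1-D array of the stratification variable
    boundaries : increasing interior cut points b_1 < ... < b_{L-1}
    n          : total sample size
    Returns (variance, n_h), where n_h are the unrounded Neyman sample sizes.
    """
    x = np.asarray(x, dtype=float)
    cuts = np.asarray(boundaries, dtype=float)
    # Stratum index of each unit: b_{h-1} < x <= b_h (assumed convention).
    idx = np.searchsorted(cuts, x, side="left")
    L = cuts.size + 1
    W = np.empty(L)
    sigma = np.empty(L)
    for h in range(L):
        xs = x[idx == h]
        if xs.size < 2:
            raise ValueError(f"stratum {h + 1} has fewer than 2 units")
        W[h] = xs.size / x.size
        sigma[h] = xs.std(ddof=1)  # N_h - 1 divisor (assumption)
    total = np.sum(W * sigma)
    n_h = n * W * sigma / total    # Neyman allocation
    variance = total ** 2 / n      # equals sum(W**2 * sigma**2 / n_h)
    return variance, n_h


# Toy data, not one of the populations used in the paper.
v, n_h = neyman_variance([1, 2, 3, 4, 10, 20, 30, 40], boundaries=[5], n=4)
```

Moving the single boundary in the toy call changes every $W_h$ and $\sigma_h$ at once, which is why the minimisation has no simple closed form.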
Dalenius (1950) [10] has shown that the variance of the estimate obtained with Neyman's optimum allocation method is minimum when the stratum boundaries satisfy the following equations:

$$\frac{\sigma_h^2 + (b_h - \mu_h)^2}{\sigma_h} = \frac{\sigma_{h+1}^2 + (b_h - \mu_{h+1})^2}{\sigma_{h+1}}, \qquad h = 1, \dots, L-1.$$

It is very difficult to find the stratum boundaries $b_h$ that satisfy these equations, known as the Dalenius equations, since they include $\sigma_h^2$ and $\mu_h$, both of which vary with the stratum boundaries $b_h$. As a result, many approximations and algorithms have been proposed for solving the Dalenius equations. The most widely known simple method among them is the cumulative square root of the frequency method (cum√f) of Dalenius and Hodges (1959) [5]. Later, Lavallée and Hidiroglou's iterative approach (1988) [3], Gunning and Horgan's geometric method (2004) [4], Kozak's random search method (2004) [2] and Keskintürk and Er's genetic algorithm method (2007) [1] were developed in order to find the stratum boundaries. Among these methods, the geometric method is the simplest and does not involve any complex algorithm. Therefore, the main aim of this paper is to set the initial boundaries of the other algorithms with the geometric method and to compare the efficiencies of the algorithms when the initial boundaries are obtained with and without the geometric method, since it is believed that these algorithms would reach the solution in a shorter time if they started searching the solution space from a reasonable point. The details of these approaches and algorithms can be found in the original papers of Dalenius and Hodges (1959) [5], Gunning and Horgan (2004) [4], Kozak (2004) [2] and Keskintürk and Er (2007) [1]. All of these methods can be applied in the R statistical environment using the stratification [12] and GA4Stratification [13] packages, but the GA results given in this study are obtained in Matlab 7.0, since the package offers no option for setting non-random initial boundaries.
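For orientation, the geometric method of Gunning and Horgan places the boundaries in geometric progression between the minimum $a$ and the maximum $b$ of the stratification variable: $b_h = a r^h$ with $r = (b/a)^{1/L}$ for $h = 1, \dots, L-1$. The following is a minimal sketch of that rule. The function name `geometric_boundaries` is hypothetical, and the check for a strictly positive minimum is a precaution added here.

```python
import numpy as np


def geometric_boundaries(x, L):
    """Interior boundaries b_h = a * r**h, h = 1..L-1, with r = (b / a)**(1 / L)."""
    x = np.asarray(x, dtype=float)
    a, b = x.min(), x.max()
    if a <= 0:
        raise ValueError("a geometric progression needs a strictly positive minimum")
    r = (b / a) ** (1.0 / L)
    return a * r ** np.arange(1, L)


# Toy data: minimum 1, maximum 1000, three strata, so r = 10.
bounds = geometric_boundaries([1, 50, 1000], L=3)  # approximately [10, 100]
```

Because the boundaries depend only on the two extreme values, a minimum close to zero or an extreme maximum inflates $r$ and can leave some strata with very few units. This is one plausible route to the infeasible starting points discussed later in the paper.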

Populations for Stratification
In this paper, populations with different skewness, kurtosis, mean, standard deviation and size are used for stratification. The populations are those available in the R packages stratification [12] and GA4Stratification [13]. Each population is divided into 3, 4, 5 and 6 strata, and the boundaries are obtained using the Lavallée and Hidiroglou, Kozak and GA methods with random and geometric initial boundaries. The boxplots of the populations are displayed in Figures 1-3, and the summary statistics of the populations are given in Table 2.
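To make the role of the initial boundaries concrete, the sketch below implements a deliberately simplified random search, written in the spirit of Kozak's method but not a reproduction of the published algorithm. It moves one boundary at a time by a small random step and keeps the move only if $\sum_h W_h \sigma_h$ decreases, which under Neyman allocation without the finite population correction is equivalent to decreasing the variance. All names and parameters (`sum_w_sigma`, `random_search`, `max_step`, `iters`, the two-units-per-stratum rule) are assumptions made for illustration.

```python
import numpy as np


def sum_w_sigma(x_sorted, cuts):
    """Sum of W_h * sigma_h for strata defined by index cut points into sorted data.

    Returns inf when any stratum would hold fewer than 2 units (infeasible).
    """
    edges = np.concatenate(([0], cuts, [x_sorted.size]))
    if np.any(np.diff(edges) < 2):
        return np.inf
    return sum(
        (hi - lo) / x_sorted.size * x_sorted[lo:hi].std(ddof=1)
        for lo, hi in zip(edges[:-1], edges[1:])
    )


def random_search(x, init_cuts, max_step=3, iters=2000, seed=0):
    """Perturb one cut point at a time; accept the move only if it improves."""
    rng = np.random.default_rng(seed)
    x_sorted = np.sort(np.asarray(x, dtype=float))
    cuts = np.sort(np.asarray(init_cuts, dtype=int))
    best = sum_w_sigma(x_sorted, cuts)
    for _ in range(iters):
        cand = cuts.copy()
        k = rng.integers(cand.size)                    # boundary to move
        step = rng.integers(1, max_step + 1) * rng.choice([-1, 1])
        cand[k] += step
        cand.sort()
        value = sum_w_sigma(x_sorted, cand)
        if value < best:                               # greedy acceptance
            cuts, best = cand, value
    return cuts, best


# Synthetic skewed data, not one of the populations used in the paper.
x = np.random.default_rng(1).lognormal(size=300)
cuts, best = random_search(x, init_cuts=[100, 200])
```

The starting point enters only through `init_cuts`. Boundary values from another method can be converted to cut indices with `np.searchsorted(np.sort(x), values, side="right")`, so random and geometric starts can be compared under the same search budget.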
Referring to the descriptive statistics in Table 2 and the boxplots in Figures 1-3, we see that the populations to be stratified are highly heterogeneous, which makes stratified sampling efficient to use. For comparison, the initial boundaries are obtained both randomly and with the geometric method. The populations are divided into 3, 4, 5 and 6 strata, and the total sample size is set to 100 for Pop1-Pop11. For the genetic algorithm, the number of iterations is set to 10000, the GA population size to 35, the crossover rate to 0.99 and the mutation rate to 0.15. For the efficiency (eff) comparisons, the ratios of the variances of the estimates, or equivalently the ratios of the squared coefficients of variation (CV), are calculated and given in Appendix 1. Since Lavallée and Hidiroglou's (LH) method is based on sampling all of the elements in the last stratum (take-all top stratum), the efficiency ratios involving LH are calculated only if the GA and Kozak's methods also provide a take-all top stratum solution. For those situations where only part of the last stratum is sampled, only the efficiency ratio between the GA and Kozak's method ($\mathit{eff}_{GA/Kozak}$) is calculated. From the efficiency and CV ratios given in Table 3 in Appendix 1 and from the stratum and sample sizes given in Table 5 in Appendix 2, it can be seen that the algorithms compared in this paper provide very close results and that the stratum boundaries are very close to each other when the initial boundaries are set randomly. The summary of the results in Table 1 shows that the number of cases where the GA or Kozak's method is better than the other does not differ much, and that the gains in efficiency are close to each other.
On the other hand, the results differ, with higher coefficients of variation, when the initial boundaries are obtained with the geometric method (Table 4). Moreover, with geometric initial boundaries, many infeasible or nonconverged results are obtained. For example, Table 4 shows that the CV for the GA increases in 32 of the 44 cases. Some of these increases result from a nonconverged or an infeasible solution. In only 4 cases is there a gain in efficiency, with the CV decreasing in absolute terms by between 0.01‰ (from 0.01437 to 0.01436 for H=5 strata for Pop3-UScolleges) and 0.186% (from 0.02485 to 0.02299 for H=5 strata for Pop8-MRTS), which can be regarded as a very minor gain. The results for LH and Kozak's method are more or less the same as those for the GA: with geometric initial boundaries, each of the two methods shows an efficiency gain in only 5 cases, and these gains are again minor. For these reasons, Lavallée and Hidiroglou's iterative method, Kozak's random search method and Keskintürk and Er's genetic algorithm give more efficient results when the initial boundaries are set randomly, in keeping with their nature. It can therefore be concluded that starting with geometric initial boundaries contributes little to the efficiency ratios or to the stratum boundaries of the computational methods. As proposed by Horgan (2011) [14], some modifications should be applied before utilising the geometric method in order to obtain feasible solutions for some data sets. Horgan (2011) [14] suggests that, if there are extreme outliers, the data should be analysed before applying the stratified sampling scheme. In this paper the revised version of the geometric method is not applied, since the algorithms examined here already give good results with random initial boundaries.
Furthermore, any researcher who wants to use geometric initial boundaries for data sets with extreme outliers should use the modified version of the geometric method.
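The size of the reported gains can be checked with a few lines of arithmetic. The sketch below takes the two CV pairs quoted above for the GA and returns the absolute change in CV (in percentage points) together with the ratio of squared CVs. The function name `cv_change` and the orientation of the ratio (geometric start over random start, so values below 1 favour the geometric start) are assumptions made here. The efficiency ratios tabulated in the appendices may be defined the other way round.

```python
def cv_change(cv_random, cv_geometric):
    """Absolute CV change in percentage points and ratio of squared CVs.

    The ratio is oriented geometric / random (an assumption of this sketch).
    """
    diff_points = (cv_random - cv_geometric) * 100
    ratio = (cv_geometric / cv_random) ** 2
    return diff_points, ratio


# CV values quoted in the text for the GA with H=5 strata.
pop8_diff, pop8_ratio = cv_change(0.02485, 0.02299)  # Pop8-MRTS
pop3_diff, pop3_ratio = cv_change(0.01437, 0.01436)  # Pop3-UScolleges
print(round(pop8_diff, 3), round(pop8_ratio, 4))
print(round(pop3_diff, 3), round(pop3_ratio, 4))
```

For Pop8 the change is 0.186 percentage points and the squared-CV ratio is about 0.856, while for Pop3 the change is 0.001 percentage points, that is, 0.01‰.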

Conclusions
Stratified sampling is a sampling methodology used for heterogeneous populations in order to gain more precision than other methods of sampling. This paper examines the change in the efficiency ratios and stratum boundaries obtained with the methods of Lavallée and Hidiroglou [3], Kozak [2] and Keskintürk and Er (2007) [1] when the initial boundaries are obtained with the geometric method. In the stratification of 11 heterogeneous populations with different properties, higher variances of the estimates or infeasible solutions are observed. As a result, researchers should be much more rigorous when using the geometric method for the initial boundaries in algorithmic methods, or else use the modified version of the geometric method when the data contain very extreme values.