2.6 Statistical selection procedures

2.6.1 Introduction

Suppose that a breeder is comparing a number $k$ ($k \ge 2$) of potential oil palm progenies. A progeny is characterized by its expected yield $\mu$ per plot of constant size. The goal of the breeder is to select one or more good progenies or, more precisely, ultimately to select the best progeny, where the best progeny is defined as the one with the largest expected yield per plot. The statistical approach to searching for the best progeny is termed Statistical Selection. Two basic approaches to Statistical Selection have been developed in the literature. One was developed by Bechhofer (1954). The second has been thoroughly investigated by Gupta (1956, 1965). For a review see Van der Laan and Verdooren (1989, 1990). In sections 2.6.2 and 2.6.3 the theoretical background of the two approaches is outlined for the interested reader. In section 2.6.5 these approaches are illustrated by a practical example with oil palm.

2.6.2 Indifference Zone approach of selection

In this section we describe Bechhofer's Indifference Zone approach. Assume that $k$ (fixed, $k \ge 2$) varieties, denoted by $V_1, V_2, \ldots, V_k$, are given. The experimental design can be either a completely randomized design with $n$ plots for every variety or a randomized complete block design with $n$ blocks, each of block size $k$, with the plots in a block randomly assigned to the $k$ varieties. From the observations $X_{ij}$ of the $k$ varieties, we calculate the $k$ sample means

$$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}, \qquad i = 1, 2, \ldots, k,$$
and these sample means are each based on an equal number $n$ of independent and Normally distributed random observations with expectation $\mu_i$ and common variance $\sigma^2$. For simplicity we first assume that the common variance $\sigma^2$ is known. The parameters $\mu_i$ are ranked and denoted by $\mu_{[1]} \le \mu_{[2]} \le \ldots \le \mu_{[k]}$. In the same way the ranked sample means are denoted by $\bar{x}_{[1]} \le \bar{x}_{[2]} \le \ldots \le \bar{x}_{[k]}$. The variety associated with $\mu_{[i]}$ is denoted by $V_{(i)}$. We then define the variety $V_{(k)}$ (with associated response $\bar{x}_{(k)}$), corresponding to $\mu_{[k]}$, as the best variety. If there is more than one contender because of ties, it is assumed that one of them is appropriately tagged. The goal is to select the variety associated with $\mu_{[k]}$, that is, the best variety. We define $\delta_{k,k-1} = \mu_{[k]} - \mu_{[k-1]}$. The selection rule $R$ is as follows. Select $V_i$ if and only if

$$\bar{x}_i = \bar{x}_{[k]},$$

that is, select the variety with the largest sample mean.
In this context a correct selection (CS) means that the best variety is selected. The following probability condition for CS under the selection rule $R$ must be fulfilled: $P(\mathrm{CS} \mid R) \ge P^*$ if $\delta_{k,k-1} \ge \delta^*$, with $k^{-1} < P^* < 1$. Thus the probability of selecting the best variety is at least $P^*$, provided the best variety is at least $\delta^*$ away from the second best variety. This minimal probability $P^*$ can only be guaranteed if the common sample size $n$ is large enough. The minimum of $P(\mathrm{CS} \mid R)$ is attained for the so-called Least Favourable Configuration (LFC), given by $\mu_{[1]} = \mu_{[2]} = \ldots = \mu_{[k-1]} = \mu_{[k]} - \delta^*$. One can prove that the probability of correct selection for the LFC is equal to

$$P(\mathrm{CS} \mid \mathrm{LFC}) = \int_{-\infty}^{\infty} \Phi^{k-1}(x + \tau)\, d\Phi(x),$$

where $\Phi(\cdot)$ is the standard Normal cumulative distribution function and $\tau = \delta^* \sqrt{n} / \sigma$. Tables for $\tau$, and thus for $n$, have been constructed. Setting this probability equal to $P^*$ gives the required common sample size

$$n = \left( \frac{\tau \sigma}{\delta^*} \right)^2,$$

rounded up to the next integer, where the quantity $\tau$ can be found for various values of $P^*$ and $k$ in, for instance, Gibbons et al. (1977) tables A1 and A2, Bechhofer (1954), Gupta (1963), Gupta et al. (1973) and Butler & Butler (1987). The conclusion deduced from the statistical selection procedure $R$ can also be formulated as follows. With the chosen minimal $n$ it can be guaranteed with minimal probability $P^*$ that the selected variety is less than $\delta^*$ away from the best variety.

The Indifference Zone approach is important in designing an experiment. It provides a value for the common sample size $n$ needed to meet certain probability requirements. In practice $k$, the number of varieties, should not be large, otherwise the total number of plots ($kn$) will be too large. When the Indifference Zone approach is used with an unknown common variance $\sigma^2$, a two-stage procedure is necessary, which is less attractive in practice. The first stage is needed to obtain an estimate of $\sigma^2$. In the second stage the estimate $s^2$ is used to obtain the required size of the second sample.

2.6.3 Subset Selection approach of selection

The Subset Selection approach of Gupta aims to select a subset of the $k$ varieties considered in section 2.6.2 that includes the best variety with a certain confidence. The size of the subset (that is, the number of selected varieties) is not fixed beforehand and depends, among other things, on the sample means, the variance $\sigma^2$ and the common sample size $n$. Obviously, we want a selection rule that makes the expected number of varieties in the subset as small as possible. It is not necessary to determine the common sample size $n$ at the start of the experiment. The experimental design can be a completely randomized design with $n$ plots for each variety or a randomized complete block design with $n$ blocks, each of block size $k$, with the plots in a block randomly assigned to the $k$ varieties.
As in section 2.6.2, the sample mean $\bar{x}_i$ of variety $V_i$ is based on an equal number $n$ of independent and Normally distributed random observations with expectation $\mu_i$ and common known variance $\sigma^2$. The rule $R$ can be described as follows. Select $V_i$ in the subset if and only if

$$\bar{x}_i \ge \bar{x}_{[k]} - \tau \frac{\sigma}{\sqrt{n}},$$

where $\tau > 0$ must be determined such that the probability requirement for a correct selection (CS) with this selection rule $R$, $P(\mathrm{CS} \mid R) \ge P^*$, is met for all possible values of the parameters $\mu_i$. In this context a correct selection means that the best variety belongs to the selected subset. It can be proved that the Least Favourable Configuration (LFC) is the limiting situation in which $\mu_{[1]}, \ldots, \mu_{[k-1]}$ are all equal to $\mu_{[k]}$. It can be proved that under the LFC

$$P(\mathrm{CS} \mid R) = \int_{-\infty}^{\infty} \Phi^{k-1}(x + \tau)\, d\Phi(x),$$

so that $\tau$ follows from setting this integral equal to $P^*$. Values of $\tau$ can be found in the tables mentioned in 2.6.2. The size of the subset reflects the confidence in choosing the best variety. A large subset means that either the varieties are close together or the sample sizes are small, or both.

In the situation of Normal means $\mu_i$ with the common variance $\sigma^2$ unknown, a single-stage procedure can be used for the Subset Selection: select variety $V_i$ if and only if

$$\bar{x}_i \ge \bar{x}_{[k]} - \tau \frac{s}{\sqrt{n}},$$

where $s^2$ is the unbiased estimator of $\sigma^2$ based on $v$ degrees of freedom and $\tau = \tau(k, v, P^*)$. Values of the constant $\tau$, or of $h$ with $\tau = h\sqrt{2}$, can be found in the references mentioned in 2.6.2, with the exception of Butler & Butler (1987), which gives only values of $\tau(k, \infty, P^*)$. The constant $h$ satisfies

$$\int_{0}^{\infty} \int_{-\infty}^{\infty} \Phi^{k-1}\left(x + h\sqrt{2}\, w\right) d\Phi(x)\, dG(w) = P^*,$$

where $G(w)$ is the distribution function of $w = s/\sigma$ and $P^*$ is the desired confidence. Values of $h$ are given in, for example, table A4 of Gibbons et al. (1977) and in Bechhofer and Dunnett (1988).

2.6.4 Comparison of the two approaches

The Subset Selection approach has certain advantages in practice. One is that it can be used as a screening procedure. Even when the ultimate goal of the breeder is to choose the best variety, the Subset Selection approach can be applied to eliminate inferior varieties.
This is in practice an interesting feature, especially when the number of potential varieties is large, as is usually the case in oil palm parental testing programs. The Indifference Zone approach aims to indicate the best variety, whereas the Subset Selection approach in general selects more than one variety and so provides less precise information. However, more precise information has to be paid for by structuring the problem in more detail. Using the Indifference Zone approach we must define $\delta_{k,k-1}$, which is a measure of the distance between the best variety and the second best variety, and we must specify $\delta^*$, which in practice is sometimes awkward.

The Indifference Zone approach is very useful at the experimental design stage for determining the required common sample size $n$. The design aspect is an integral part of the Indifference Zone methodology. Determination of the required sample size is the central point, rather than the analysis of the samples obtained. The Subset Selection approach can be used without planning the sample size in advance. This enables the breeder to analyze the data when the experiment has already been done and the sample size is not adequate for the Indifference Zone approach. In this sense the Subset Selection approach is more flexible. Regardless of the value of the common sample size $n$, the Subset Selection approach can be applied. However, the size of the subset increases as $n$ decreases. That is the price one has to pay for small sample sizes. The Subset Selection approach is recommended as a method for oil palm progeny tests.
In Papua New Guinea, oil palm cultivation started on a commercial scale in 1968. In 1976, about 12,000 ha were planted. To guide the oil palm cultivation, the Dami Oil Palm Research Station was founded at Kimbe, West New Britain, Papua New Guinea. At this station a dura x pisifera progeny trial was established in 1968. In this experiment nine ex-AVROS pisifera were crossed with four selected Deli dura palms to get 15 families. These fifteen families were arranged in five randomized complete blocks with sixteen (4 x 4) palms per plot at a 9 m triangular spacing. For this example we have taken only the ten families that remain in four complete blocks; the other families were discarded in several blocks because of disease.

The average fresh fruit bunch yield $y$ (in kg/palm) over the years 1972-1977 of the four inner palms per plot was analyzed. Further, samples of leaf 17 were taken from all inner palms in 1973, bulked per plot, and analyzed at Banting Oil Palm Research Station (O.P.R.S.) in Malaysia. The Magnesium content $x$ (in %) in leaf 17 was determined. Breure (1987) arrived at the conclusion that this % Mg has a good correlation ($r = 0.70$) with the yield of oil for the first five years of production (1972-1976); hence the % Mg can be used to indicate good families for oil yield. Because the % Mg determination has been done with the same procedure at Banting O.P.R.S. for a long period, the standard deviation $\sigma$ of the % Mg determination can be regarded as known and equal to 0.0186. The average % Mg for the 10 families over the four blocks was as follows.
Following Bechhofer's procedure, the Least Favourable Configuration (LFC) is given by $\mu_{[1]} = \mu_{[2]} = \ldots = \mu_{[9]} = \mu_{[10]} - \delta^*$. When the minimum probability of correct selection $P^*$ and the common sample size $n$ are given, $\delta^*$ can be determined. From table A1 of Gibbons et al. (1977), we find the following values for $\tau$ with $k = 10$ populations for various values of $P^*$:
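These constants can also be recomputed rather than looked up. The Python sketch below integrates the LFC probability $\int \Phi^{k-1}(x + \tau)\, d\Phi(x)$ by Simpson's rule and solves for $\tau$ by bisection. The function names, the integration range of plus or minus 8 and the step count are illustrative assumptions, not part of the original procedure, and the printed values should be checked against the published tables before use.

```python
import math


def phi(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_correct_selection(tau, k, lo=-8.0, hi=8.0, steps=2000):
    """P(CS) under the LFC: integral of Phi(x + tau)**(k - 1) * phi(x) dx.

    Composite Simpson's rule; `steps` must be even.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        if i == 0 or i == steps:
            weight = 1
        elif i % 2 == 1:
            weight = 4
        else:
            weight = 2
        total += weight * Phi(x + tau) ** (k - 1) * phi(x)
    return total * h / 3.0


def solve_tau(k, p_star, tol=1e-6):
    """Bisection for tau >= 0 with P(CS | LFC) = p_star (needs 1/k < p_star < 1)."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_correct_selection(mid, k) < p_star:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# k = 10 families, as in the example; P* values as used in the text.
for p_star in (0.75, 0.90, 0.95, 0.99):
    print(p_star, round(solve_tau(10, p_star), 3))
```

Two simple checks on such a routine: at $\tau = 0$ the probability must equal $1/k$, and for $k = 2$ the solution must equal $\sqrt{2}$ times the standard Normal quantile of $P^*$.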
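From the formula $\delta^* = \tau \sigma / \sqrt{n}$ we find in this case, with $\sigma = 0.0186$ and $n = 4$, the values of

$$\delta^* = \tau \cdot \frac{0.0186}{\sqrt{4}} = 0.0093\, \tau$$

for the values of $P^*$ as:

P*:   0.75    0.90    0.95    0.99
δ*:   0.021   0.028   0.032   0.039

Alternatively, we can determine the number $n$ of complete blocks needed to attain $\delta^* = 0.02$ or $0.01$ for the different values of $P^*$ from

$$n = \left( \frac{\tau \sigma}{\delta^*} \right)^2,$$

rounded up to the next integer. In our case we must calculate

$$n = \left( \frac{0.0186\, \tau}{\delta^*} \right)^2$$

and the results for $n$ are: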
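The two relations used here, $\delta^* = \tau \sigma / \sqrt{n}$ and $n = (\tau \sigma / \delta^*)^2$ rounded up, are easy to evaluate directly. The sketch below does so in Python. The function names are hypothetical, and the value $\tau = 2.98$ for $k = 10$, $P^* = 0.90$ is an assumed approximate constant chosen to be consistent with $\delta^* = 0.028$ for four blocks; the tabulated value should be used in practice.

```python
import math


def min_detectable_difference(tau, sigma, n):
    """delta* = tau * sigma / sqrt(n) for a given number n of blocks."""
    return tau * sigma / math.sqrt(n)


def blocks_needed(tau, sigma, delta_star):
    """Smallest integer n with n >= (tau * sigma / delta_star) ** 2."""
    return math.ceil((tau * sigma / delta_star) ** 2)


SIGMA = 0.0186  # known standard deviation of the % Mg determination (from the text)
TAU_090 = 2.98  # ASSUMED approximate tau for k = 10, P* = 0.90 (not taken from the table)

print(round(min_detectable_difference(TAU_090, SIGMA, 4), 3))
for delta_star in (0.02, 0.01):
    print(delta_star, blocks_needed(TAU_090, SIGMA, delta_star))
```

With this assumed $\tau$ the sketch gives 8 blocks for $\delta^* = 0.02$ and 31 blocks for $\delta^* = 0.01$; halving $\delta^*$ roughly quadruples the number of blocks, because $n$ grows with $1/\delta^{*2}$.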
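When we apply Gupta's Subset Selection procedure to find the subset which contains the best family with a minimum probability of correct selection $P^*$, we must take those families $V_i$ for which

$$\bar{x}_i \ge \bar{x}_{[10]} - \tau \frac{\sigma}{\sqrt{n}} = \bar{x}_{[10]} - 0.0093\, \tau.$$

The results are: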
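The rule is mechanical once the family means are at hand. The Python sketch below applies it for a known $\sigma$. The family labels and means are hypothetical and are NOT the trial data, the function name is illustrative, and the two $\tau$ values (2.98 and 4.24) are assumed approximations for $k = 10$ at $P^* = 0.90$ and $P^* = 0.99$.

```python
import math


def gupta_subset(means, tau, sigma, n):
    """Labels whose mean lies within tau * sigma / sqrt(n) of the largest mean."""
    allowance = tau * sigma / math.sqrt(n)
    best = max(means.values())
    return sorted(label for label, m in means.items() if m >= best - allowance)


# HYPOTHETICAL family means (% Mg), for illustration only.
example_means = {"A": 0.250, "B": 0.240, "C": 0.225, "D": 0.215}

# ASSUMED approximate tau values; use tabulated constants in practice.
print(gupta_subset(example_means, tau=2.98, sigma=0.0186, n=4))
print(gupta_subset(example_means, tau=4.24, sigma=0.0186, n=4))
```

In this made-up case the larger $\tau$ widens the allowance enough to admit family D, which mirrors the way the subset grows when $P^*$ is raised.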
Conclusion for % Mg

If we want to indicate the families which give the best Magnesium content in leaf 17 across the first five years of production (and hence a good oil yield), we can take the families 10, 8, 3 and 2 with a minimal probability of correct selection $P^*$ of 0.90. When we increase $P^*$ to 0.99, the subset must be extended with the families 7, 1 and 5.

Now we want to use Gupta's Subset Selection procedure to find the minimum subset of families which contains the best family with a minimum probability of correct selection $P^*$ for the average fresh fruit bunch yield $y$. The Analysis of Variance procedure gives as an estimate of the variance $\sigma^2$ a Mean Square Error of 6308.488 based on $v = 27$ degrees of freedom. For the yield $y_{ij}$ of family $i$ in block $j$ we use the following model: $Y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}$ ($i = 1, 2, \ldots, 10$ and $j = 1, 2, 3, 4$). The least squares mean $\mu + \alpha_i + \bar{\beta}$ for the family $V_i$ is estimated by the family mean $\bar{y}_i$. The results were as follows:
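From table A4 of Gibbons et al. (1977), which gives $h$-values, we derive the following values of $\tau = h\sqrt{2}$ for $k = 10$, $v = 27$, and for $P^* = 0.95$ and $P^* = 0.99$ (we must interpolate in $1/v$ between $v = 25$ and $v = 30$ to find the value of $h$):

P* = 0.95, τ = 2.55√2 = 3.606
P* = 0.99, τ = 3.27√2 = 4.624

Hence family $V_i$ is selected if and only if

$$\bar{y}_i \ge \bar{y}_{[10]} - \tau \sqrt{\frac{6308.488}{4}},$$

that is, if its mean lies within about 143.2 ($P^* = 0.95$) or 183.6 ($P^* = 0.99$) of the largest family mean.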
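The allowance $\tau s / \sqrt{n}$ behind this rule can be recomputed from the quantities quoted in the text. The short Python sketch below does this; the function name is hypothetical, and the inputs are the Mean Square Error, the number of blocks and the two rounded $\tau$ values given above.

```python
import math


def subset_allowance(tau, mse, n):
    """Allowance tau * s / sqrt(n), with s**2 estimated by the Mean Square Error."""
    return tau * math.sqrt(mse / n)


MSE = 6308.488  # Mean Square Error from the analysis of variance (27 d.f.)
N_BLOCKS = 4    # complete blocks used in the example

# tau values as quoted in the text (h * sqrt(2), rounded to three decimals).
for p_star, tau in ((0.95, 3.606), (0.99, 4.624)):
    print(p_star, round(subset_allowance(tau, MSE, N_BLOCKS), 1))
```

A family stays in the subset when its mean is no further than this allowance below the largest family mean, so more blocks (larger $n$) or a smaller error variance shrink the subset.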
Conclusion for fresh fruit bunch yield

We conclude that the most promising families for the fresh fruit bunch yield are the families 1, 3, 8, 9, 2, 4 and 7. The probability that this selection procedure includes the best of the 10 tested families in the subset is at least $P^* = 0.95$. When we increase $P^*$ to 0.99, family 10 must also be included in this subset. The subsets can be reduced by using more replications in forthcoming experiments.

2.6.6 Selection trials with incomplete block designs

When progenies are compared in an incomplete block design, the following approximation can be used for the Subset Selection procedure of Gupta. Instead of the selection rule in section 2.6.3, "Select variety $V_i$ if and only if $\bar{x}_i \ge \bar{x}_{[k]} - \tau s / \sqrt{n}$", we now use the selection rule "Select variety $V_i$ if and only if $\mathrm{LSM}(V_i) \ge \mathrm{LSM}(V_{[k]}) - S \tau$", where $S$ is $1/\sqrt{2}$ times the average standard error of the differences between pairs of progenies, $\mathrm{LSM}(V_i)$ is the Least Squares Mean of variety $V_i$ and $\mathrm{LSM}(V_{[k]})$ is the Least Squares Mean of the variety with the largest LSM (see section 2.2 or section 2.5 for the definition of Least Squares Mean (LSM)). A more accurate (but more elaborate) method can be found in Dourleijn (1993, 1995).
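The approximate rule for incomplete block designs can be sketched in the same way. In the Python example below, the least squares means, the average standard error of a difference and the value of $\tau$ are all hypothetical numbers chosen for illustration, and the function name is not taken from any statistical package.

```python
import math


def subset_from_lsmeans(lsmeans, avg_sed, tau):
    """Select varieties with LSM >= largest LSM - S * tau, where S = avg_sed / sqrt(2)."""
    s_factor = avg_sed / math.sqrt(2.0)
    threshold = max(lsmeans.values()) - s_factor * tau
    return sorted(v for v, lsm in lsmeans.items() if lsm >= threshold)


# HYPOTHETICAL least squares means and average SED, for illustration only.
lsmeans = {"P1": 182.0, "P2": 175.5, "P3": 160.0, "P4": 140.0}

# tau = 3.0 is an ASSUMED constant; in practice take tau(k, v, P*) from the tables.
print(subset_from_lsmeans(lsmeans, avg_sed=12.0, tau=3.0))
```

Dividing the average standard error of a difference by $\sqrt{2}$ puts $S$ on the same scale as $s / \sqrt{n}$ in a complete block design, so the same tabulated $\tau$ applies.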