The Graduate School at our university requested a new system for judging and scoring posters in their annual poster competition. Every year over 200 students enter their posters in the contest. Faculty members, post-docs, and graduate students can sign up as judges; there are usually around 100. Each judge is assigned five posters to review and score, and each poster must have at least one faculty judge. A logistical complication in organizing the event is that some judges who originally signed up do not show up, while others show up without signing up in advance, so the organizers only learn the exact number of judges on the day of the event. Moreover, since a different set of judges reviews each poster, the organizers realize that simply averaging the scores each poster receives does not provide a fair evaluation. They therefore sought a way to make the judge assignments in advance and a method to “fairly” evaluate and score the posters.
2.1 The challenge: no BIBD exists for this situation
We propose framing the judge assignment as a design problem, where the posters are the treatments and the judges are the blocks. This creates an incomplete block design (IBD) since every judge will only review and score a subset of five of the posters. The optimal design of this type would be a Balanced Incomplete Block Design (BIBD) (see, e.g., Montgomery (2009)), in which every poster is judged an equal number of times, and every pair of posters is judged by the same judge an equal number of times. The following two relationships summarize the above restrictions:

$$ bk = ar \qquad (1) $$

$$ \lambda(a - 1) = r(k - 1) \qquad (2) $$
where $b$ is the number of blocks (judges), $r$ is the number of replicates (reviews per poster), $k$ is the block size (number of reviews per judge), $a$ is the number of treatment levels (posters), and $\lambda$ is the number of blocks in which each pair of treatments appears together (the number of times each pair of posters is reviewed by the same judge). Suppose we have $a = 201$ posters and each judge reviews $k = 5$ posters. From Equation 2, even with the minimum $\lambda = 1$, a BIBD would require each poster to be reviewed by 50 judges ($r = \lambda(a-1)/(k-1) = 200/4 = 50$). From Equation 1, we would need $b = ar/k = 201 \times 50/5 = 2010$ judges, which is not feasible. Moreover, the number of judges, i.e. blocks, available is unknown in advance.
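The two counting identities make the infeasibility easy to check in a few lines. The sketch below is ours (the function name is not from the paper); it simply solves Equations 2 and 1 for $r$ and $b$:

```python
# Sketch of the BIBD counting identities from Equations 1 and 2.
# Symbols follow the text: a posters (treatments), k posters per judge
# (block size), lambda_ pair co-occurrences, r reviews per poster,
# b judges (blocks).

def bibd_requirements(a, k, lambda_):
    """Return the (r, b) forced by lambda*(a - 1) = r*(k - 1) and b*k = a*r."""
    r = lambda_ * (a - 1) / (k - 1)  # Equation 2 solved for r
    b = a * r / k                    # Equation 1 solved for b
    return r, b

r, b = bibd_requirements(a=201, k=5, lambda_=1)
print(r, b)  # 50.0 2010.0: every poster needs 50 reviews and 2010 judges
```

As a sanity check, `bibd_requirements(7, 3, 1)` returns `(3.0, 7.0)`, the classical $(7,3,1)$ design.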
There are alternatives with good properties when balance is not achievable, such as partially balanced incomplete block designs (Bose and Shimamoto (1952)) or the alpha type of resolvable incomplete block designs (Patterson and Williams (1976)). In our application, since all the posters should communicate research to a general audience, they cannot be partitioned into pre-determined categories (classes); therefore a partially balanced incomplete block design, which requires pre-determined classes, is not appropriate. Because the number of judges (blocks) is unknown, we cannot partition the blocks into sets of complete replicates, so the design is not resolvable (Yates (1936)), and the alpha type of resolvable design is not suitable either. An extensive search of the literature did not reveal an exact match to an existing design. Motivated by the lack of guidelines on how judges should be assigned or how final scores should be calculated for competitions, in the following sections we propose two designs that can be applied to a general judging system, the corresponding algorithm for implementation, and statistical methods to calculate the final scores.
2.2 Goals of the Designs
The goals of the judge assignment scheme can be summarized as follows:
1. Each judge reviews a fixed number ($k$) of distinct posters (five in this case).
2. There are two types of judges, faculty and non-faculty. Each poster needs to be reviewed by at least one faculty judge.
3. Regardless of how many judges ($b$) eventually show up, the assignment of judges must be connected. A connected design means that every poster can be connected to every other poster by a chain of pairs within blocks (Bose (1949)). The design and any initial subset of assignments, defined as the first $m$ assignments for all $m \le b$, need to be connected. A connected design provides more precise estimates of the posters’ true quality scores.
4. The number of times each poster is judged ($r_i$) should not differ by more than one, so that we get nearly equal replicates.
5. (optional) Each pair of posters is reviewed by the same judge at most once ($\lambda_{ij} \le 1$).
We can achieve connectivity (#3) if, for each new judge assignment, one of the posters assigned to the judge comes from those previously reviewed. Given the total number of posters ($a$) and the number of posters each judge reviews ($k = 5$ in this case), we can calculate the minimum number of judges required to guarantee connectivity, $b_{\min}$, as the smallest integer satisfying $k + (b_{\min} - 1)(k - 1) \ge a$.
We can give the first $b_{\min}$ judge assignments to faculty judges to satisfy requirement #2 above. To satisfy #4, the maximum number of times each poster can be reviewed by any judge is the smallest integer $\ge bk/a$. Denote this maximum review number by $r_{\max} = \lceil bk/a \rceil$.
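Both quantities can be computed directly. A minimal sketch (the function names are ours, not from the paper):

```python
import math

def min_judges(a, k):
    # The first judge covers k posters; each later judge reuses one reviewed
    # poster and adds at most k - 1 new ones, so covering all a posters in a
    # connected way needs the smallest b with k + (b - 1)*(k - 1) >= a.
    return 1 + math.ceil((a - k) / (k - 1))

def max_reviews(a, k, b):
    # b judges produce b*k reviews over a posters; near-equal replication
    # caps any single poster at ceil(b*k / a) reviews.
    return math.ceil(b * k / a)

print(min_judges(201, 5))         # 50
print(max_reviews(201, 5, 100))   # 3
```

With 201 posters and 100 judges, for example, 50 faculty judges guarantee connectivity and no poster should be reviewed more than 3 times.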
2.3 The Algorithm
To generate any desired number of judge assignments with the above goals in mind, we propose two designs. The first design satisfies goals #1 to #5, and is referred to as the type 1 near-balanced design (NB1). The following algorithm implements the NB1 design:
1. Randomly select $k = 5$ posters from the pool for the first judge. Keep track of the number of times each poster $i$ gets reviewed ($r_i$).
2. From the second judge assignment to the $b_{\min}$th assignment, the first review of each judge is always selected from the posters that have been previously reviewed, with the selection probability proportional to $r_{\max} - r_i$ (so less-reviewed posters are favored).
3. The other four reviews are randomly selected from the set of least-reviewed posters. To be specific, first randomly select from the posters that have not been reviewed (those with $r_i = 0$); stop if we get four. If not, move on to the posters that have been reviewed once ($r_i = 1$) and randomly select the remaining needed number from this subset. Continue in the same fashion, if necessary, to those with two reviews, and so on.
4. After each new random assignment is generated, check how many times each pair of posters is reviewed by the same judge. If any pair appears more than once, replace the last assignment with a new one until $\lambda_{ij} \le 1$ for all pairs.
5. From the $(b_{\min} + 1)$th assignment on, generate all five reviews according to Steps 3 and 4 until the desired number of judge assignments has been made.
6. To keep the algorithm efficient, if any assignment takes more than a certain number of attempts (e.g., 500) because of the check in Step 4, which suggests there may be no solution given the existing assignments, the algorithm stops and starts over from the beginning.
If we drop Step 4 from the above algorithm, we satisfy goals #1 through #4. We call this design the type 2 near-balanced design (NB2). NB2 balances the number of reviews of each poster, while NB1 further balances the number of reviews of each pair of posters.
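The steps above can be sketched in code. This is our own illustrative implementation, not the authors' released version: the exact selection weight in Step 2 and the restart limits are assumptions, and `pair_check=True` versus `False` stands in for NB1 versus NB2.

```python
import math
import random
from itertools import combinations

def near_balanced(a, n_judges, k=5, pair_check=False,
                  max_tries=500, max_restarts=50, seed=0):
    """Sequential judge assignments; pair_check=True sketches NB1, False NB2."""
    b_min = 1 + math.ceil((a - k) / (k - 1))  # judges needed for connectivity

    def attempt(rng):
        reviews = [0] * a     # r_i: reviews so far for each poster
        pairs = set()         # poster pairs already sharing a judge
        assignments = []

        def fill_least_reviewed(exclude, need):
            # Step 3: draw from the least-reviewed posters, level by level.
            picked, level = [], 0
            while len(picked) < need:
                pool = [p for p in range(a) if reviews[p] == level
                        and p not in exclude and p not in picked]
                rng.shuffle(pool)
                picked += pool[:need - len(picked)]
                level += 1
            return picked

        for j in range(n_judges):
            for _ in range(max_tries):
                if j == 0:
                    block = rng.sample(range(a), k)           # Step 1
                elif j < b_min:
                    # Step 2: reuse one reviewed poster so every prefix of
                    # the design stays connected (weighting is our assumption).
                    seen = [p for p in range(a) if reviews[p] > 0]
                    w = [max(reviews) + 1 - reviews[p] for p in seen]
                    first = rng.choices(seen, weights=w)[0]
                    block = [first] + fill_least_reviewed({first}, k - 1)
                else:
                    block = fill_least_reviewed(set(), k)     # Step 5
                # Step 4 (NB1 only): each pair shares a judge at most once.
                if not pair_check or all(frozenset(pr) not in pairs
                                         for pr in combinations(block, 2)):
                    break
            else:
                return None    # Step 6: give up and restart from scratch
            for p in block:
                reviews[p] += 1
            pairs.update(frozenset(pr) for pr in combinations(block, 2))
            assignments.append(block)
        return assignments

    for restart in range(max_restarts):
        result = attempt(random.Random(seed + restart))
        if result is not None:
            return result
    raise RuntimeError("no design found; try more restarts")

blocks = near_balanced(a=201, n_judges=100, k=5, pair_check=True)
print(len(blocks), len(blocks[0]))  # 100 5
```

Because the blocks are generated sequentially, any prefix of the returned list keeps the stated properties, which is what lets the organizers hand out assignments in order on the day of the event.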
Once the number of posters is known, we can sequentially generate any number of judge assignments using the above scheme, preferably more than needed in case extra judges show up without signing up. Since the assignments are generated sequentially, any initial subset of them automatically satisfies the requirements. Therefore, no matter how many judges actually show up, as long as the assignments are handed out in the order generated, the design remains efficient. However, we need a minimum number of faculty judges (at least $b_{\min}$). If this condition is not met, we would have to drop requirement #2 above and give the faculty assignments to general judges. If more faculty show up, they are treated as general judges.
The above algorithm runs very fast on a personal computer. With around 200 posters and about 100 judges, the computational time ranges from 0.2 seconds (NB2) to 5 seconds (NB1). In other settings, NB1 can potentially take much longer because of the check step.
2.4 The Statistical Model
For an IBD, the simple average of the review scores is not a valid way to estimate and compare all treatments because of the judge differences. Instead, we should estimate the population marginal means (Searle et al. (1980)), which take into consideration and adjust for the judge effects. The statistical model for estimating the population marginal means is

$$ y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}, \qquad (3) $$
where $y_{ij}$ is the score that poster $i$ gets from judge $j$, $\mu$ is an unknown overall mean parameter, $\tau_i$ is the $i$th poster effect, and $\beta_j$ is the $j$th judge effect. The judge factor can be modeled as a fixed or random factor; see Montgomery (2009) for a detailed description of the intrablock analysis (fixed effects) and interblock analysis (random effects). The error term $\varepsilon_{ij}$ comes from two sources: the random scoring error and the poster-judge interaction effects, i.e., how the criterion of a particular judge interacts with the quality characteristics of each poster. As discussed in Nelder (1977) and Robinson (1991), the choice of model is determined by the nature of the questions and the properties of the blocking factors. In our example, we treat the judges as a random blocking factor for several reasons. First, there are many judges (over 100), and the population marginal mean estimates are more accurate, with a smaller standard error, unless the block variance is much larger than the error variance (Robinson (1991)). Secondly, we are not interested in estimating the judge effects but simply in accounting for them. Thirdly, the judges are only a sample of faculty and post-docs on campus. Finally, even though we propose a design that guarantees connectivity, in reality there is a chance that the assignments end up disconnected due to administrative mistakes. In this case, the fixed effects model will fail, but the random effects model will still provide estimates of the poster marginal means.
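To see concretely why adjusting for judge effects matters, the toy example below (our own, using the paper's variance scales as assumptions) compares raw per-poster averages with adjusted means from a fixed-effects least-squares fit; it is a simplified stand-in for the mixed model, since with an additive model the population marginal mean of poster $i$ is $\mu + \tau_i$ plus the average judge effect.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, k = 30, 40, 5                     # posters, judges, reviews per judge
tau = rng.normal(0, 7, a)               # poster effects (sd 7, as in the text)
beta = rng.normal(0, 6, b)              # judge (block) effects (sd 6)

# Simulate reviews under the model y_ij = mu + tau_i + beta_j + eps_ij.
rows = []
for j in range(b):
    for i in rng.choice(a, size=k, replace=False):
        rows.append((i, j, 80 + tau[i] + beta[j] + rng.normal(0, 7)))
posters, judges, y = (np.array(c) for c in zip(*rows))

# Least-squares fit on dummy codes; the adjusted (marginal) mean of poster i
# is mu + tau_i + the average of the judge effects.
X = np.zeros((len(y), 1 + a + b))
X[:, 0] = 1.0
X[np.arange(len(y)), 1 + posters] = 1.0
X[np.arange(len(y)), 1 + a + judges] = 1.0
coef = np.linalg.lstsq(X, y, rcond=None)[0]
adjusted = coef[0] + coef[1:1 + a] + coef[1 + a:].mean()

raw = np.array([y[posters == i].mean() for i in range(a)])
true = 80 + tau
print("raw mean abs error:     ", np.abs(raw - true).mean())
print("adjusted mean abs error:", np.abs(adjusted - true).mean())
```

The raw averages absorb whichever harsh or lenient judges a poster happened to draw; the adjusted means remove that block noise, which is exactly what the population marginal means do in the paper's mixed model.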
3 Simulation Study
To evaluate the performance of the proposed designs, we conduct simulation studies in a setting that mimics the real competition. We set the number of posters to 200, the number of judges to 100, and the number of awards to 30. From past data, we find that the poster mean score is around 80, the poster standard deviation is about 7, the judge standard deviation is about 6, and the standard deviation of the random errors is about 7. We consider three designs: NB1, NB2, and a random design. In the random design, each of the first several judges randomly selects 5 posters from the pool of unreviewed posters until every poster has been reviewed at least once; each subsequent judge then randomly selects 5 from all of the posters. This random design only guarantees that each poster is reviewed at least once, without the other properties of NB1 and NB2.
In each iteration of the simulation, we generate a matrix of scores according to Equation 3 with the parameters $\mu = 80$, $\sigma_\tau = 7$, $\sigma_\beta = 6$, and $\sigma_\varepsilon = 7$. The score matrix is held fixed within each iteration. We define the “true score” of poster $i$ as $\mu + \tau_i$, since we are interested in how an infinite number of judges from a hypothetical population would score the posters rather than how the actual judges in the sample would judge them (the latter would add the average judge effect to the poster score). The true poster scores are held constant across the three designs within each iteration. The three designs select three different subsets of poster-judge combinations as their corresponding design matrices. We fit the random effects model to each of the three subsets and estimate the population marginal means of the posters. We run 1000 such iterations. Out of the 1000 iterations, the random design generates 27 assignments that are disconnected, so a fixed effects model would fail to produce poster score estimates in those runs. This supports both the use of the random effects model and the near-balanced designs, which guarantee connectivity.
Since the ultimate goal of the poster competition is to select the best posters for awards, we use the probability of the “truly best posters” being selected as our primary evaluation statistic. The “truly best posters” are defined as the top 30 posters ranked by true poster score. Figure 1 examines, for each design, the proportion of the 30 truly best posters judged to be in the top 30. Each histogram in the top row displays the empirical distribution of this proportion across the 1000 simulation runs for a given design. The median probability is 0.6 for all three designs, which implies that on average 18 of the 30 truly best posters would win an award when awards are given to the 30 posters judged best. The 97.5% quantile is 0.733 for all three designs, and the 2.5% quantile is 0.5, 0.467, and 0.433 for NB1, NB2, and random, respectively. Thus, we estimate that for 95% of poster competitions matching our assumptions, 15-22 top posters are awarded under NB1, 14-22 under NB2, and 13-22 under the random design. Note that the maximum probability is 0.833 for NB1 and NB2 and 0.8 for random, which implies that even in the best-case scenario 5 of the best posters go unawarded. On average each poster gets only 2.5 reviews due to the limited number of reviews per judge, so the estimated scores and ranks are highly variable. We think the fact that on average we only award 60% of the truly best posters, and never get all 30 completely correct, is a common limitation of most judging systems.
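The primary evaluation statistic is simple to compute from true and estimated scores. A short sketch (the function and noise level are ours, for illustration):

```python
import numpy as np

# Proportion of the truly best posters that land in the judged top n_awards.
def top_overlap(true_scores, est_scores, n_awards=30):
    true_top = set(np.argsort(true_scores)[-n_awards:])
    est_top = set(np.argsort(est_scores)[-n_awards:])
    return len(true_top & est_top) / n_awards

rng = np.random.default_rng(0)
true = 80 + rng.normal(0, 7, 200)      # true poster scores
est = true + rng.normal(0, 3, 200)     # noisy estimates (assumed noise level)
print(top_overlap(true, est))          # a proportion between 0 and 1
```

Even modest estimation noise relative to the 7-point spread of true quality keeps this proportion well below 1, which is the structural limitation the text describes.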
We further examine the differences among the three designs. In each iteration, we calculate the difference in the winning probabilities for all three pairs of designs. The second row of Figure 1 displays the distribution of these differences. The dashed vertical lines indicate the mean difference and the 95% confidence interval of the mean, with a solid vertical line at 0 as the reference. The mean differences between NB1 and random and between NB2 and random are both about 0.025 and significantly greater than 0. The median difference is about 0.033, so the NB designs correctly award about 1 more poster on average. The relatively small difference is mainly due to this specific combination of poster number, judge number, and reviews per judge: in the NB designs, half of the posters get 3 reviews and the other half get 2, while in the random design most posters still get 2 or more reviews and only a small number get the extreme of a single review. In addition, the judges are sampled from a common distribution, so they are relatively similar. Nonetheless, the difference is statistically significant, so it is not due to Monte Carlo error. Taking into consideration that the average winning probability is not high, we believe a difference of 1 poster is scientifically meaningful. NB1 and NB2 appear equivalent in selecting the best posters; given the small difference in the number of reviews per poster, the additional pair balance does not seem to increase the winning chance of the truly best posters.
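The paired comparison behind these intervals can be sketched as follows; the helper and the simulated per-iteration differences are ours, using a normal approximation for the 95% CI of the mean difference:

```python
import numpy as np

# Mean of paired per-iteration differences with a normal-approximation 95% CI.
def mean_diff_ci(diffs, z=1.96):
    d = np.asarray(diffs, dtype=float)
    m = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))   # standard error of the mean
    return m, (m - z * se, m + z * se)

rng = np.random.default_rng(0)
diffs = rng.normal(0.025, 0.05, 1000)      # e.g. NB1 minus random, per run
m, (lo, hi) = mean_diff_ci(diffs)
print(lo > 0)  # a CI excluding 0 indicates a significant mean difference
```

Because the same true scores are used for all designs within an iteration, differencing within iterations removes the shared simulation noise, which is what makes a mean difference as small as 0.025 detectable over 1000 runs.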
In addition, we look at three other measures of the accuracy and efficiency of the designs: (1) the median absolute difference between the estimated and true ranks of the truly best posters (top 30), (2) the mean absolute difference between the estimated and true scores of the truly best posters, and (3) the mean standard error of the population marginal mean estimator. As with the winning probability, we take the difference between NB1 and random, and between NB1 and NB2, for each of the three measures in each iteration and summarize the distribution of the differences in Figure 2. Smaller values are better for all three measures. The differences between NB1 and random are all significantly smaller than 0, indicating that NB1 is a better design than random assignment. The most obvious effect of a near-balanced design is in reducing the standard error of the estimates, which in turn improves estimation and ranking accuracy. Accordingly, the most significant difference between NB1 and random is in the standard error (third plot in the top row). As with the winning probability, NB1 and NB2 are not significantly different on these three measures.
Motivated by a real example, we propose an algorithm to create a near-balanced incomplete block design that satisfies several conditions, and we use a mixed effects model to estimate the poster scores. We propose two versions of the NB design: one balancing the number of reviews per poster (NB2) and the other additionally balancing the reviews per pair of posters (NB1). We evaluate the accuracy and efficiency of the proposed designs and demonstrate the benefits of balancing the number of reviews per poster over a random assignment. The simulation setup mimics the historical data in our example. Even though NB1 should theoretically outperform NB2 due to the additional balance, under this specific setting the two versions showed no significant differences on the four performance measures we considered, which implies that balancing the reviews per pair of posters adds little. In other settings, for example when the poster, judge, and random error standard deviations are all smaller, NB1 may perform slightly better than NB2 (see Figures 3 and 4 in the Appendix). In general, NB1 takes longer to run than NB2 and might not be feasible for some combinations of judge, poster, and review numbers. Therefore, for simplicity and consistency, we recommend NB2 to administrative practitioners.
We also discuss a general limitation of such judging systems by examining the proportion of the truly best posters receiving an award. We see that on average, only 60% of the truly best posters will win an award. This is largely due to the limited reviews per judge and the assumption that the truly best 30 posters are not dramatically different from many of the other posters in our simulation study. Hopefully, this result will also offer some solace to those who don’t win anything. Our recommendation has been accepted by the Graduate School for future poster competitions.
The design proposed in this article can be generalized to many other settings such as consumer scoring surveys, website and customer service feedback surveys, and food tasting trials with a large number of samples. The code to generate the proposed design is available at: https://psu.box.com/s/av1kekt1butboo3a2jrk79tmqqcr6myh.
- Bose, R. (1949). Least square aspects of analysis of variance. Institute of Statistics Mimeo Series 9. University of North Carolina, Chapel Hill.
- Bose, R. C. and T. Shimamoto (1952). Classification and analysis of partially balanced incomplete block designs with two associate classes. Journal of the American Statistical Association 47(258), 151–184.
- Montgomery, D. C. (2009). Design and Analysis of Experiments (7th ed.). John Wiley & Sons, Inc.
- Nelder, J. A. (1977). A reformulation of linear models (with discussion). Journal of the Royal Statistical Society, Series A 140(1), 48–76.
- Patterson, H. D. and E. R. Williams (1976). A new class of resolvable incomplete block designs. Biometrika 63(1), 83–96.
- Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random effects. Statistical Science 6(1), 15–51.
- Searle, S., F. Speed, and G. Milliken (1980). Population marginal means in the linear model: An alternative to least squares means. The American Statistician 34(4), 216–221.
- Yates, F. (1936). A new method of arranging variety trials involving a large number of varieties. Journal of Agricultural Science 26, 424–455.