Crowdsourcing as a Novel Method to Evaluate Aesthetic Outcomes of Treatment for Unilateral Cleft Lip

We applaud the authors for their article.1 They successfully used a crowd to rate large numbers of images for aesthetics quickly, with ratings comparable to those of professionals. However, we urge caution, as there is no clear evidence that reliability has improved or that this technology helps establish a valid assessment.
Crowdsourcing 250 raters for each survey seems impressive, but the character of these Web site–recruited, anonymized lay raters could improve or reduce the validity of any scoring system, and we know little about the intrarater reliability of any individual rater. Nasal ranking was performed on full anteroposterior and profile views. Deall et al.2 demonstrated that lip scores have better reliability than isolated nose scores and dominate whole-image scores. The original Asher-McDade3 scoring does not show the lower lip. Failure to crop out the lower lip means that the effect of class III occlusion could be confounding (Fig. 1). Raters who can see the lip while evaluating nasal features may be subject to a perception bias, and the influence of the whole image on the rating of an individual component may be underestimated.
The authors used Pearson correlation as their main statistical tool; this measures interclass (not intraclass) reliability, and it is unclear whether any statistic was applied specifically to intraclass reliability. The ordinal five-category Asher-McDade scores were analyzed as an aggregated total or mean, which dilutes assessments toward the middle ranks; when the full range of the scale is not used, agreement will appear inflated. To study reliability and validity simultaneously while accounting for the correlated data, alternative statistical methods are proposed by Deall et al.2 and Bella et al.4
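To illustrate the distinction, a minimal sketch in Python (using hypothetical ratings and the Shrout and Fleiss ICC(2,1) formula, not the authors' data) shows how a perfect Pearson correlation can coexist with a markedly lower intraclass correlation when one set of scores is systematically shifted:

    import numpy as np

    # Hypothetical scores: six images rated twice on a five-point, Asher-McDade-style scale;
    # the second pass is consistently one point higher than the first.
    first  = np.array([1, 2, 3, 2, 4, 1], dtype=float)
    second = np.array([2, 3, 4, 3, 5, 2], dtype=float)

    # Pearson (interclass) correlation: blind to the systematic one-point shift.
    pearson_r = np.corrcoef(first, second)[0, 1]

    # ICC(2,1), absolute agreement, from two-way ANOVA mean squares (Shrout and Fleiss).
    ratings = np.column_stack([first, second])          # n targets x k occasions
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)   # between images
    ms_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)   # between occasions
    resid = ratings - ratings.mean(axis=1, keepdims=True) - ratings.mean(axis=0, keepdims=True) + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    icc21 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

    print(f"Pearson r = {pearson_r:.2f}, ICC(2,1) = {icc21:.2f}")   # 1.00 versus roughly 0.73

In this toy example the two passes never disagree about which image is better, yet they never assign the same score; Pearson correlation alone cannot reveal that.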
This article describes higher correlations than those reported in the literature for Asher-McDade outcomes (interrater and intrarater reliability of approximately 0.60 and 0.70, respectively3–6); this needs more explanation. Rater variability should not be overlooked: two of the six surgeons showed lower reliability (0.64 to 0.66) than the majority (0.80 to 0.90; Table 4). Some raters never “see” a good result, whereas others never see a poor result.4 Simply increasing the number of raters does not improve reliability unless “bad” scorers have been excluded.
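The point about panel size can be illustrated with a small simulation (hypothetical noise levels and panel sizes, not the authors' raters): averaging over many raters does not recover reliability when much of the panel contributes little more than noise.

    import numpy as np

    rng = np.random.default_rng(0)
    n_images = 300
    truth = rng.normal(size=n_images)        # latent aesthetic quality of each image

    def panel_mean(n_good, n_bad, sd_good=1.0, sd_bad=6.0):
        """Mean score across a panel mixing low-noise ('good') and high-noise ('bad') raters."""
        good = truth[:, None] + rng.normal(scale=sd_good, size=(n_images, n_good))
        bad = truth[:, None] + rng.normal(scale=sd_bad, size=(n_images, n_bad))
        return np.hstack([good, bad]).mean(axis=1)

    small_good = panel_mean(n_good=5, n_bad=0)     # small panel of adequate raters
    large_mixed = panel_mean(n_good=5, n_bad=45)   # ten times larger, dominated by noisy raters

    print(f"5 good raters:       r with truth = {np.corrcoef(small_good, truth)[0, 1]:.2f}")
    print(f"5 good + 45 noisy:   r with truth = {np.corrcoef(large_mixed, truth)[0, 1]:.2f}")

Under these assumptions, the five adequate raters alone track the latent quality more closely than the panel of fifty; the additional raters add volume but little signal.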
Details of image selection were lacking, and the number of images does not guarantee a full spectrum of outcomes. It would have been instructive to summarize the 50 images against the Asher-McDade scales. Because only a small proportion (13 images) of the original 50 was selected for surgeon review, substantial selection bias may have been introduced.
The ultimate goal of facial aesthetic measures is to produce an absolute rating scale. Although Elo ranking elicited a relative ranking, it does not allow comparison of one center or technique with another unless they are in the same sample and scored together. The inclusion of four noncleft images was reassuring, as raters perceived these to be aesthetically pleasing.
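For readers unfamiliar with Elo scoring, a minimal sketch (generic constants, not the authors' implementation) makes the point explicit: each update depends only on the opponent drawn from the same pool, so the resulting numbers have no absolute meaning outside that pool.

    def elo_update(rating_a, rating_b, a_won, k=32.0):
        """Update two Elo ratings after a single pairwise preference judgment."""
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        delta = k * ((1.0 if a_won else 0.0) - expected_a)
        return rating_a + delta, rating_b - delta

    # Two images start at the same default rating; a rater prefers image A.
    a, b = elo_update(1500.0, 1500.0, a_won=True)
    print(a, b)   # 1516.0 and 1484.0 -- interpretable only relative to this comparison pool

Because each comparison is zero-sum within its own pool, two centers ranked in separate pools would both remain centered on the default rating, and their scores could not be compared directly.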
In conclusion, we believe that crowdsourcing has the potential to deliver rapid results and provide insights toward developing a subjective rating scheme. However, the current evaluation again captures the relative difference between images but does not make the absolute difference any more quantifiable.