The reliability of measurement of how young children spend their time has traditionally been computed in terms of interobserver exact agreement. This study sought to apply generalizability theory to the measurement of engagement. Forty-seven young children, 15 of whom had disabilities, were observed four times in their child care setting. Types and levels of engagement were coded by three raters. Using ANOVA procedures for determining the relative contribution of different sources of error, a fully crossed (with subjects) two-facet (sessions, raters) generalizability (G) study was employed. The nine outcome measures were four types and five levels of engagement. Results showed that raters accounted for less than 2% of the variance in the error of the scores, while sessions accounted for most of the variance other than between-subject variance. The outcome measures varied in how robust their reliability was. The G study was followed by a decision (D) study to determine the levels of the facets that would be required to achieve a generalizability coefficient of .80. The number of sessions could be realistically increased to achieve this aim, but the increase in “dependability” achieved with the addition of each rater was minimal. Conclusions are drawn about the importance of assessing more than a single source of error (raters) in observational research. Overall, molecular methods were determined to be relatively unstable for measuring the molar construct of engagement, but certain engagement outcomes were stable across sessions and raters.
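To make the D-study logic concrete, the following is a minimal sketch, assuming the standard relative generalizability coefficient for a fully crossed subjects × sessions × raters design: subject variance divided by subject variance plus the interaction variances, each divided by the number of sessions, raters, or both. The function names and the variance components used in the demonstration are hypothetical placeholders chosen for illustration; they are not results from this study.

```python
from typing import Optional


def g_coefficient(var_p: float, var_ps: float, var_pr: float, var_psr_e: float,
                  n_sessions: int, n_raters: int) -> float:
    """Relative G coefficient for a crossed subjects x sessions x raters design.

    var_p      subject (universe-score) variance
    var_ps     subject-by-session interaction variance
    var_pr     subject-by-rater interaction variance
    var_psr_e  three-way interaction confounded with residual error
    """
    relative_error = (var_ps / n_sessions
                      + var_pr / n_raters
                      + var_psr_e / (n_sessions * n_raters))
    return var_p / (var_p + relative_error)


def sessions_needed(var_p: float, var_ps: float, var_pr: float, var_psr_e: float,
                    n_raters: int, target: float = 0.80,
                    max_sessions: int = 100) -> Optional[int]:
    """Smallest number of sessions that reaches the target coefficient.

    Returns None if the target is not reached within max_sessions.
    """
    for n_sessions in range(1, max_sessions + 1):
        g = g_coefficient(var_p, var_ps, var_pr, var_psr_e, n_sessions, n_raters)
        if g >= target:
            return n_sessions
    return None


# Hypothetical variance components (NOT taken from the study): a large
# session-related component and a very small rater-related component.
EXAMPLE = dict(var_p=0.40, var_ps=0.40, var_pr=0.01, var_psr_e=0.19)

if __name__ == "__main__":
    for raters in (1, 2, 3):
        n = sessions_needed(n_raters=raters, **EXAMPLE)
        print(f"raters={raters}: sessions needed for G >= .80 -> {n}")
```

With components of this shape, most of the relative error is divided by the number of sessions, so adding sessions raises the coefficient far more than adding raters does. That is the same qualitative pattern the abstract reports, though the actual numbers depend entirely on the estimated components.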