On the instability of teacher effectiveness measures, by Morgan Polikoff

Image by Flickr user Colin Harris


One of the most important policy innovations of the last few years has been the adoption and implementation of new multiple-measure teacher evaluation systems. These systems, encouraged by the Obama administration’s Race to the Top and No Child Left Behind Waiver programs, use measures of student learning alongside other measures of teachers’ classroom performance to make formative and summative judgments about individual teachers.

By far the most controversial portion of these systems has been the student achievement component. Value-added models (VAMs), which use students’ prior achievement histories and, sometimes, demographic characteristics to estimate teachers’ impact on student achievement, have been criticized along a number of dimensions. The most fundamental objection has been that VAMs are unstable from year to year; this instability, it is argued, all but invalidates their potential use for high-stakes evaluation.
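To make the mechanics concrete, here is a minimal sketch of the logic behind a simple VAM, assuming a pandas DataFrame with hypothetical columns for current score, prior score, and teacher. The models actually used in evaluation systems are far more elaborate (multiple prior years of scores, demographic controls, shrinkage), so this is an illustration of the idea, not the method critiqued above.

```python
import pandas as pd
import statsmodels.api as sm

def simple_vam(df: pd.DataFrame) -> pd.Series:
    """Crude value-added estimates: regress current scores on prior
    scores, then average the residuals within each teacher.
    Columns ('score', 'prior_score', 'teacher_id') are hypothetical."""
    X = sm.add_constant(df[["prior_score"]])
    resid = sm.OLS(df["score"], X).fit().resid
    # A teacher's mean residual serves as the value-added estimate:
    # how far the teacher's students land, on average, from where
    # their prior scores predicted.
    return resid.groupby(df["teacher_id"]).mean()
```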

While the (in)stability of VAMs is by now well documented, thanks to decades of research, much less is known about the technical properties of the classroom observation and student survey portions of new evaluations. Presumably, if the stability results look roughly similar (year-to-year correlations in the .2 to .5 range), the same validity concerns should apply. My paper in the American Journal of Education, The Stability of Observational and Student Survey Measures of Teaching Effectiveness, uses data from the Bill and Melinda Gates Foundation’s Measures of Effective Teaching study to investigate this issue, examining the year-to-year stability of several well-known and widely used observational and student survey measures (the Framework for Teaching, the Classroom Assessment Scoring System, the Protocol for Language Arts Teaching Observations, the Mathematical Quality of Instruction instrument, and the Tripod student survey).

The results show that the year-to-year stability at the total score level for these observational and student survey measures is only slightly better than that of VAMs – typically in the .45 to .55 range. When subscales are examined—important because subscale scores are likely more useful to teachers from a formative standpoint—the results are weaker still.
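The stability statistic in question is simply the correlation between the same teachers’ scores in two adjacent years. A minimal sketch, assuming two arrays of scores matched by teacher (hypothetical inputs):

```python
import numpy as np

def year_to_year_stability(year1: np.ndarray, year2: np.ndarray) -> float:
    """Pearson correlation between the same teachers' scores in two
    consecutive years; values near .5 match what the paper reports
    for total scores on these instruments."""
    return float(np.corrcoef(year1, year2)[0, 1])
```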

Next, I sought to investigate the extent to which instability varied based on the characteristics of teachers or classrooms. I did find limited evidence that instability on certain instruments might be more of an issue for teachers in elementary schools than in middle schools. However, I found no evidence that instability was explained by year-to-year variations in the characteristics of students.

Finally, to help make sense of these findings, I presented year-to-year reclassification rates for each of the studied instruments. Reclassification was studied using two approaches: a norm-referenced approach, in which teachers were sorted into quintiles and followed into the next year, and a criterion-referenced approach, in which teachers were classified as above or below some performance cut in year one and followed into year two. The analysis was illuminating: reclassification was more of a problem under the criterion-referenced approach (the one common in new state systems), whereas under the norm-referenced approach, teachers rated high- or low-performing in one year were relatively unlikely to be rated the opposite the next year.
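For concreteness, here is a sketch of the two reclassification approaches, assuming two pandas Series of teacher scores matched by teacher across years; the function names and the cut score are hypothetical illustrations, not the paper’s exact procedure.

```python
import pandas as pd

def norm_referenced_shift(y1: pd.Series, y2: pd.Series) -> pd.DataFrame:
    """Sort teachers into quintiles in each year and cross-tabulate:
    cell (i, j) gives the share of year-1 quintile i that lands in
    year-2 quintile j."""
    q1 = pd.qcut(y1, 5, labels=list(range(1, 6)))
    q2 = pd.qcut(y2, 5, labels=list(range(1, 6)))
    return pd.crosstab(q1, q2, normalize="index")

def criterion_referenced_reclassification(y1: pd.Series, y2: pd.Series,
                                          cut: float) -> float:
    """Share of teachers who flip across a fixed performance cut
    (above vs. below) between year one and year two; 'cut' is a
    hypothetical threshold, not one from the paper."""
    return float(((y1 >= cut) != (y2 >= cut)).mean())
```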

Overall, whether these results are concerning or heartening probably depends on where you stand with regard to the evidence we already have. For those who view the instability of VAMs as a fatal flaw limiting their utility for high- or low-stakes decisions about teachers, these results suggest that the same concerns may apply to observations and student survey measures. That is, unless one imagines that the cutoff for “too unstable to be useful” lies between .5 and .55, the instability concerns that some have with regard to VAMs likely also apply to these observational and student survey measures.

Alternatively, for those who argue that VAMs provide useful evidence about teacher effectiveness and that the year-to-year stability concerns are overblown, these results suggest that observational and student survey components also appear to be capturing some stable element of teacher effectiveness. This agrees with findings presented in the main Measures of Effective Teaching reports, which showed that the stability of a composite of measures was greater than the stability of the individual component measures.

While these results provide some of the first large-scale evidence on the stability of the non-VAM components of new teacher evaluation systems, they are limited in several ways, and those limitations demand further investigation. The most important limitation is that the data used here come from a research study, not from an actual system implemented in a district or state. As data are collected from new evaluation systems, it is imperative that districts and states engage in this kind of analysis to understand the properties of their systems. This is true not only for stability but also for issues of bias (another claim commonly leveled against VAMs that may well apply to observational and student survey measures). Simply put, more evidence is needed from real-life implementation of these systems.

The research also points to the need for consensus about whether the instability of measures of teaching performance is a problem and, if so, what level of stability is needed to make either high- or low-stakes decisions about teachers. Perhaps none of the measures will have the technical properties desired by opponents of new evaluation systems. Or perhaps all of the measures provide useful information and can be used thoughtfully to help improve teaching and learning in U.S. schools.


Morgan S. Polikoff is an assistant professor of education at the University of Southern California Rossier School of Education. He researches the design, implementation, and effects of standards, assessment, and accountability policies.
