Teachers have an undeniable impact on both the short-term and long-term success of the students they instruct. Unsurprisingly, several recent federal and state educational initiatives have identified increasing teacher effectiveness as a key lever for improving student outcomes, and these initiatives have prompted revisions to teacher evaluation systems across the country. Most of the debate around how to measure effectiveness for these systems has centered on teachers’ impacts on students’ standardized test performance. However, with continued concerns over the unintended (and sometimes detrimental) consequences of expanded standardized testing, policies have reemphasized more traditional, widely applicable approaches to measuring teacher quality: specifically, the quality of teachers’ instruction, assessed through classroom observation. To date, however, research has yielded largely mixed evidence on the strength of the relationship between classroom observation scores and student test performance, with little investigation into why such variability exists. Our study, “Relationships between Observations of Elementary Mathematics Instruction and Student Achievement: Exploring Variability across Districts,” explores a potential explanation for why the relationship between classroom observation scores and student test scores differs across contexts.
Generating and testing hypotheses as to why classroom observation and student test performance may (or may not) identify the same teachers as effective is important for several reasons. First, evidence of a strong relationship supports the theoretical and logical expectation that better teaching leads to improved student learning. In practice, weak observed relationships might impede the efficacy of teacher policies. For example, personnel decisions, such as determining which teachers to target for merit pay bonuses or additional professional development, become more complicated if classroom observation and test scores do not identify the same teachers as effective. With weak alignment, teachers might also receive conflicting feedback on how best to improve their practice, especially if the types of teaching and learning valued by observations and tests diverge. It is this possibility that our study investigates. Specifically, we explore whether the strength of the relationship between teachers’ classroom observation scores and their students’ mathematics achievement outcomes can be attributed to (mis)alignment between the types of instructional practices valued in classroom observations and the skills expected of students on standardized tests.
To do so, we leverage data from almost 300 teachers and 7,000 students across five school districts in four states. These data include teachers’ performance on an observation instrument designed to measure mathematical inquiry-oriented instruction and activities in the classroom. This instrument, the Mathematical Quality of Instruction (MQI), assesses, for example, teachers’ ability to link and connect mathematical representations, effectively remediate students’ mathematical mistakes, and incorporate cognitively demanding mathematical tasks into instruction. Our data also contain students’ performance on two types of mathematics assessments: their state standardized tests and a researcher-developed test. Using subsets of items, we performed a formal coding analysis to determine each test’s overall demand based on item formats (i.e., multiple choice, short answer, and open-ended) and the alignment of items to the MQI instrument (based on the conceptual demand of test items). Our expectation was that, in contexts with more demanding standardized tests, teachers identified as more effective in terms of their students’ achievement outcomes would also be identified as more effective through their performance on the MQI, which similarly emphasizes instruction that develops students’ conceptual understanding.
“[T]he strength of the relationship between teachers’ classroom observation scores and their students’ test performance may in part be attributable to the sensitivity of the skills assessed during observation to those assessed on the test.”
In only two of the five school districts did we find significant positive relationships between teachers’ MQI scores and their students’ performance on the state standardized mathematics exams. Notably, these two districts were in the same state, meaning that students across both contexts had taken the same test, which our formal coding analysis had rated as high in overall demand. The tests students took in the other three districts contained items that were less aligned to the MQI and often relied on less demanding item formats (i.e., multiple choice). However, when we considered students’ performance on the researcher-developed mathematics test, which was administered across all five districts, the strength of the relationship between teachers’ MQI scores and their students’ achievement outcomes did not differ significantly across contexts.
Our results thus provided suggestive evidence that the strength of the relationship between teachers’ classroom observation scores and their students’ test performance may in part be attributable to the sensitivity of the skills assessed during observation to those assessed on the test. Where alignment is stronger, stronger relationships may be observed. This finding has implications for the development of teacher policies. Many states and districts are considering changes to curriculum standards, which will likely affect the standardized tests that students take. Increasing the rigor of standards and, subsequently, the demand of tests will likely increase the proportion of cases in which classroom observation scores and student achievement results identify similar teachers as effective. On the other hand, in states and districts where the demand of tests remains low, relationships between student achievement and teacher performance on observation instruments that reward conceptually oriented instruction may remain weak. These results suggest that districts may need to examine their classroom observation instruments for alignment with their student achievement tests to ensure that teachers receive consistent feedback and support for instructional improvement.
KATHLEEN LYNCH is a doctoral student at the Harvard Graduate School of Education. Her research interests include education policy and strategies to reduce educational inequality, particularly in mathematics.
MARK CHIN is a doctoral student in education policy and program evaluation at the Harvard Graduate School of Education. His research interests center on how race, racism, and assimilative pressures affect the experiences and outcomes of students of color and early generation immigrant students in US K–12 contexts.
DAVID BLAZAR is an assistant professor at the University of Maryland College Park. His research examines factors that affect teacher and teaching quality, with a focus on professional learning, the organizational context of schools and districts, and accountability policy.