Over the last few days, thousands of Georgia teachers have read and shared the AJC Get Schooled blog and Facebook posts about the state’s new evaluation system. Teachers are upset that the system could base half of their performance ratings on student test scores. (Look for an AJC news story about the revolt brewing among teachers over this issue.)
The federal Department of Education endorsed using scores to evaluate teachers, making it a condition of a Race to the Top grant. As of 2014, 40 states were using or piloting programs to evaluate teachers in part based on growth in student learning as reflected by test scores.
The consequences of unsatisfactory evaluations could include frozen salaries, remediation or dismissal, while good evaluations could bring a bonus, a salary jump, or tenure. Georgia is unusual in counting student growth for 50 percent of a teacher’s performance rating; most states count it for 20 or 30 percent.
Along with unhappy teachers, I am hearing from puzzled readers asking why it’s unfair to judge teachers on how much growth their students show on tests. A new federal study suggests why.
The Nevada Department of Education asked the U.S. DOE’s Institute of Education Sciences to investigate how stable teacher-level growth scores are from year to year. The Regional Educational Laboratory analyzed three years of math and reading score data for about 370 elementary and middle school teachers from Nevada’s second-largest school district.
Here is what researchers said:
This study examines one overarching research question: How stable over years are annual teacher-level growth scores, derived by applying the student growth percentile model to student scores from Nevada’s Criterion-Referenced Tests in math and reading? In other words, how likely is it that the same score would be obtained in different years?
In math, half the variance in teacher scores in any given year was attributable to differences among teachers, and half was random or unstable. In reading, the proportion of the variance attributable to differences among teachers was .41, and .59 was random or unstable.
More stable measures of effectiveness can be constructed by averaging multiple years of growth scores for a teacher. For example, when effectiveness is computed as an average of annual scores for three years, the proportion of the variance in teacher scores attributable to differences among teachers is .75 in math and .68 in reading.
These estimates do not meet the .85 level of reliability traditionally desired in scores used for high-stakes decisions about individuals (Haertel, 2013; Wasserman & Bracken, 2003). States that are considering the student growth percentile model for teacher accountability may want to be cautious about using the scores for high-stakes decisions.
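For readers who want to check the arithmetic, here is a minimal sketch. The excerpt above does not say how the three-year figures were computed, so the use of the Spearman-Brown formula (a standard psychometric formula for the reliability of an average of repeated measurements) is an assumption here; it does reproduce the .75 and .68 figures from the single-year values of .50 and .41. The function names are hypothetical, chosen for illustration.

```python
import math


def averaged_reliability(single_year: float, years: int) -> float:
    """Spearman-Brown formula: reliability of the mean of `years` annual scores.

    Assumes (as an illustration, not the study's stated method) that every
    year's score is equally reliable.
    """
    return years * single_year / (1 + (years - 1) * single_year)


def years_needed(single_year: float, target: float = 0.85) -> int:
    """Smallest number of averaged years whose reliability reaches `target`.

    Obtained by solving the Spearman-Brown formula for the number of years.
    """
    return math.ceil(target * (1 - single_year) / (single_year * (1 - target)))


# Single-year reliabilities reported in the study excerpt above.
for subject, r in [("math", 0.50), ("reading", 0.41)]:
    print(subject, round(averaged_reliability(r, 3), 2), years_needed(r))
```

Under these assumptions, reaching the .85 level would take about six years of math scores or nine years of reading scores for each teacher. That projection is only as good as the assumption that a teacher's underlying effectiveness stays constant from year to year.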
The study concludes:
This study finds half or more of the variance in teacher scores from the model is due to random or otherwise unstable sources rather than to reliable information that could predict future performance. Even when derived by averaging several years of teacher scores, effectiveness estimates are unlikely to provide a level of reliability desired in scores used for high-stakes decisions, such as tenure or dismissal. Thus, states may want to be cautious in using student growth percentile scores for teacher evaluation.
The conclusion that growth scores alone may not be sufficiently stable to support high stakes decisions suggests the need to examine measures of teacher effectiveness and their interpretation in evaluation systems. The growth score may not be a sound measure of a teacher’s effectiveness, or the magnitude of a teacher’s effect on student learning may not be as predictable a trait of the teacher as many evaluation systems assume it is. Rather, a teacher’s effectiveness may depend in part on features of the teacher’s students—that is, the collection of students in any given year, which change from one year to the next. Growth measures may need to be thought of differently—considered a measure that is associated with a particular combination of teacher and students rather than one that is attributable to the teacher alone.
Thus, as states examine properties of their estimates of teacher effectiveness and decision makers weigh how to incorporate teacher-level growth scores in teacher accountability policy, they may want to exercise caution and further investigate whether teacher-level growth scores are sufficiently stable for use in high-stakes decisions. Many educator evaluation models include multiple measures such as teacher observations, surveys, or additional student outcomes. So policymakers may want to consider the stability of those other measures and examine the reliability of different combinations of measures and the weight assigned to different measures.