Posted earlier today at Substance News:
Evidence Presented in the Case Against Growth Models for High Stakes Purposes
Denise Wilburn and Jim Horn
The following article quotes liberally from The Mismeasure of Education and offers an overview of the research-based critiques of value-added modeling, or growth models. We offer it here to Chicago educators [and educators everywhere] with the hope that it may serve to inspire and inform the restoration of fairness, reliability, and validity to the teacher evaluation process and the assessment of children and schools.
In the fall of 2009, before the final guidance was issued to all the cash-strapped states lined up for some of the $3.4 billion in Race to the Top grants, the Board on Testing and Assessment (BOTA) issued a 17-page letter to Arne Duncan conveying the National Research Council's response to the RTTT draft plan. BOTA cited reasons to applaud the DOEd's efforts, but the main purpose of the letter was to voice, in unequivocal language, the NRC's concern regarding the use of value-added measures, or growth models, for high-stakes purposes, specifically related to the evaluation of teachers:
BOTA has
significant concerns that the Department’s proposal places too much emphasis on
measures of growth in student achievement (1) that have not yet been adequately
studied for the purposes of evaluating teachers and principals and (2) that
face substantial practical barriers to being successfully deployed in an
operational personnel system that is fair, reliable, and valid (p. 8).
In 1992, when Dr. William Sanders sold value-added measurement to Tennessee politicians as the most reliable, valid, and fair way to measure student academic growth and the impact that teachers, schools, and districts have on student achievement, the idea seemed reasonable, fairer, and even scientific to some. But since 1992, leading statisticians and
testing experts who have scrutinized value-added models have concluded that
these assessment systems for measuring test score growth do not meet the
reliability, validity, and fairness standards established by respected national
organizations, such as the American Educational Research
Association, the American
Psychological Association, and the National Council on Measurement in Education
(Amrein-Beardsley, 2008). Nonetheless,
value-added modeling for high-stakes decision making now consumes significant
portions of state education budgets, even though there is no oversight agency to
make sure that students and teachers are protected:
Who protects [students and teachers in America’s schools]
from assessment models that could do as much
harm as good? Who protects their
well-being and ensures that assessment models are safe, wholesome, and effective? Who guarantees that assessment models honestly and accurately inform the
public about student progress
and teacher effectiveness? Who regulates the assessment industry? (Amrein-Beardsley, 2008, p. 72)
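For readers unfamiliar with the mechanics at issue, the short sketch below, which is ours and not Sanders', illustrates the basic logic of a value-added estimate in deliberately simplified form: predict each student's current score from a prior score, then treat the average residual among a teacher's students as that teacher's "effect." The actual TVAAS/EVAAS is a far more elaborate, proprietary mixed model, and all data and names in the sketch are hypothetical.

```python
# Simplified illustration of value-added logic (not the proprietary TVAAS/EVAAS model).
# Predict each student's current score from last year's score; the average residual
# among a teacher's students becomes that teacher's crude "value-added" estimate.
import numpy as np

rng = np.random.default_rng(0)

def toy_value_added(prior_scores, current_scores, teacher_ids):
    # Simple linear prediction of current score from prior score.
    slope, intercept = np.polyfit(prior_scores, current_scores, 1)
    residuals = current_scores - (intercept + slope * prior_scores)
    # Average residual per teacher = the toy "teacher effect".
    return {t: residuals[teacher_ids == t].mean() for t in np.unique(teacher_ids)}

# Hypothetical data: 3 teachers, 25 students each, with assumed "true" effects.
teachers = np.repeat(np.array(["A", "B", "C"]), 25)
prior = rng.normal(50, 10, size=75)
true_effect = np.array([{"A": 2.0, "B": 0.0, "C": -2.0}[t] for t in teachers])
current = prior + true_effect + rng.normal(0, 8, size=75)

print(toy_value_added(prior, current, teachers))
```

Even in this toy version, the estimated "effect" is simply whatever the prediction misses, which is why the critiques that follow focus on what the tests measure, how students are assigned to teachers, and how noisy the residuals are.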
If value-added measures do not meet the highest standards established for reliable, valid, and fair measurement, then the high-stakes decisions based on those measures are also unreliable, invalid, and unfair. Therefore, legislators, policymakers, and administrators who require high-stakes decisions based on value-added measures are equally culpable and liable for the wrongful termination of educators mismeasured with metrics derived from tests that were never intended to do anything but give a ballpark idea of how students are progressing on academic benchmarks for subject-matter concepts at each grade level.
Is
value-added measurement reliable? Is it consistent in its measurement and free
of measurement errors from year to year?
As early as 1995, the Tennessee Office of Education Accountability (OEA)
reported “unexplained variability” in Tennessee’s value-added scores and called
for an outside evaluation of all components of the Tennessee Value-Added
Assessment System (TVAAS), including the achievement tests used in calculating
the value-added scores. The outside evaluators, Bock, Wolfe &
Fisher (1996), questioned the reliability of the system for high-stakes
decisions based on how the achievement tests were constructed. These experts recognized that test makers are
engaged in a very difficult and imprecise science when they attempt to rank
order learning concepts by difficulty.
This difficulty is compounded when they
attempt to link those concepts across a grade-level continuum so that students
successfully build subject matter knowledge and skill from grade to grade. An extra layer of content design imprecision
is added when test makers create multiple forms of the test at each grade level
to represent those rank-ordered test items.
Bock, Wolfe, & Fisher found that variation in test construction was,
in part, responsible for the “unexplained variability” in Tennessee’s state
test results.
Other highly respected researchers
(Ballou, 2002; Lockwood, 2006; McCaffrey & Lockwood, 2008; Briggs, Weeks
& Wiley, 2008) have weighed in on the issue of reliability of value-added
measures based on questionable achievement test construction. As an invited speaker to the National
Research Council workshop on value-added methodology and accountability in
2010, Ballou pointedly went to the heart of the test quality matter when he
acknowledged the “most neglected” question among economists concerned with
accountability measures:
The
question of what achievement tests measure and how they measure it is probably
the [issue] most neglected by economists…. If tests do not cover enough of what
teachers actually teach (a common complaint), the most sophisticated
statistical analysis in the world still will not yield good estimates of
value-added unless it is appropriate to attach zero weight to learning that is
not covered by the test. (National Research Council and National Academy of
Education, 2010, p. 27).
In addition to these test issues, the reliability of teacher effect estimates in high-stakes applications is compromised by a number of other recurring problems:
1) the timing of the test administration and
summer learning loss (Papay, 2011);
2) missing student data (Bock & Wolfe, 1996; Fisher, 1996; McCaffrey et al., 2003; Braun, 2005; National Research Council, 2010);
3) student data poorly linked to teachers (Dunn, Kadane & Garrow, 2003; Baker et al., 2010);
4) inadequate sample size of students due to
classroom arrangements or other school logistical and demographic issues
(Ballou, 2005; McCaffrey, Sass, Lockwood, & Mihaly, 2009).
Growth models such as the Sanders VAM use
multiple years of data in order to reduce the degree of potential error in
gauging teacher effect. Sanders justifies this practice by claiming that a teacher's effect on her students' learning will persist into the future and, therefore, can be measured with consistency.
However, research conducted by McCaffrey, Lockwood, Koretz, Louis, & Hamilton (2004) and subsequently by Jacob, Lefgren, and Sims (2008) shatters this bedrock assumption. These
researchers found that “only about one-fifth of the test score gain from a high
value-added teacher remains after a single year…. After two years, about
one-eighth of the original gain persists” (p. 33).
Too many uncontrolled factors impact the
stability and sensitivity of value-added measurement for making high-stakes
personnel decisions for teachers. In fact, Schochet and Chiang (2010) found that the error rate for distinguishing individual teachers from average teaching performance using three years of data was about 26 percent. They concluded that
more
than 1 in 4 teachers who are truly average in performance will be erroneously
identified for special treatment, and more than 1 in 4 teachers who differ from
average performance by 3 months of student learning in math or 4 months in
reading will be overlooked (p. 35).
Schochet and Chiang also found that reducing the error rate in teachers' effect scores from 26 percent to 12 percent would require ten years of data for each teacher (p. 35). When we consider
the stakes for altering school communities and students’ and teachers’ lives,
the utter impracticality of reducing error to what may be argued as an
acceptable level makes value-added modeling for high-stakes decisions simply
unacceptable.
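The logic behind these error rates can be illustrated with a small Monte Carlo sketch of our own. This is not Schochet and Chiang's model, and the signal and noise values are hypothetical, chosen only to make the pattern visible: when year-to-year noise is large relative to the true differences among teachers, even a multi-year average flags a sizable share of genuinely average teachers.

```python
# Illustrative Monte Carlo (not Schochet & Chiang's model): how often is a truly
# average teacher flagged as above or below average when noisy yearly estimates
# are averaged over different numbers of years? All values are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 100_000
signal_sd = 1.0   # assumed spread of true teacher effects
noise_sd = 2.0    # assumed year-to-year noise in each yearly estimate

for years in (1, 3, 10):
    # Truly average teachers: true effect = 0, plus noise each year.
    yearly_estimates = rng.normal(0, noise_sd, (n_teachers, years))
    averaged = yearly_estimates.mean(axis=1)
    # Flag any teacher whose averaged estimate falls outside +/- one signal SD.
    flagged = np.mean(np.abs(averaged) > signal_sd)
    print(f"{years:2d} year(s) of data: {flagged:.0%} of truly average teachers flagged")
```

Averaging more years of data does shrink the error, but only slowly, which is precisely the impracticality noted above.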
Beyond the reliability problems summarized above, growth models also have serious validity problems caused by the nonrandom placement of students from classroom to classroom, from school to school within districts, or from system to system. Nonrandom placement of students further erodes Sanders' causal claims for teacher effects on achievement, as well as his claim that the impact of student characteristics on student achievement is irrelevant. For a teacher's value-added scores to be valid, she must have “an equal chance of being assigned any of the students in the district of the appropriate grade and subject”; otherwise, “a teacher might be disadvantaged [her scores might be biased] by placement in a school serving a particular population” year after year (Ballou, 2005, p. 5). In Tennessee, no educational policy or administrative rule requires schools to randomly assign teachers or students to classrooms, so teachers and students have no equal chance of random placement within a school, within a district, or across the state.
To
underscore the effect of non-random placement of disadvantaged students on
teacher effect estimates, Kupermintz (2003) found in reexamining Tennessee
value-added data that “schools with more than 90% minority
enrollment tend to exhibit lower cumulative average gains” and school systems’
data showed “even stronger relations between average gains and the percentage
of students eligible for free or reduced-price lunch” (p. 295).
Value-added
models like TVAAS that assume random assignment of students to teachers’
classrooms “yield misleading [teacher effect] estimates, and policies that use
these estimates in hiring, firing, and compensation decisions may reward and
punish teachers for the students they are assigned as much as for their actual
effectiveness in the classroom” (Rothstein, 2010, p. 177).
In
his value-added primer from 2005, Henry Braun (2005) stated clearly and boldly
that the “fundamental concern is that, if making
causal attributions (of teacher effect on student achievement performance) is
the goal, then no statistical model, however complex, and no method of
analysis, however sophisticated, can fully compensate for the lack of
randomization” (p. 8).
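A small simulation of our own, with hypothetical numbers rather than Rothstein's or Braun's analyses, illustrates the mechanism they describe: when students with an unmeasured disadvantage are concentrated in one classroom and the model ignores that factor, two equally effective teachers receive very different estimates.

```python
# Illustrative sketch (hypothetical numbers): nonrandom assignment of students with
# an unmeasured disadvantage biases a simple gain-based "teacher effect" estimate,
# even though both teachers here are equally effective.
import numpy as np

rng = np.random.default_rng(3)
n_students = 30
true_teacher_effect = 2.0    # both teachers add the same (hypothetical) gain
unmeasured_penalty = -3.0    # factor affecting growth that the model never sees

def estimated_effect(share_disadvantaged):
    disadvantaged = rng.random(n_students) < share_disadvantaged
    gains = true_teacher_effect + unmeasured_penalty * disadvantaged + rng.normal(0, 2, n_students)
    return gains.mean()  # the naive estimate attributes the whole average gain to the teacher

# Teacher A is assigned 10% disadvantaged students; teacher B is assigned 80%.
print("Estimated effect, teacher A:", round(estimated_effect(0.10), 2))
print("Estimated effect, teacher B:", round(estimated_effect(0.80), 2))
```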
Any system of assessment that claims to
measure teacher and school effectiveness must be fair in its application to all
teachers and to all schools. Because teaching is a contextually embedded, nonlinear activity that cannot be accurately assessed with a linear, context-independent value-added model, it is unfair to use such a model at this time. The consensus among VAM researchers is against the use of growth models for high-stakes purposes. Any assessment system that can misidentify 26
percent or more of the teachers as above or below average when they are neither
is unfair when used for decisions of dismissal, merit pay, granting or revoking
tenure, closing a school, retaining students, or withholding resources for poor
performance.
When almost two-thirds of teachers (those who do not teach subjects in which standardized tests are administered) are rated on the test score gains of other teachers in their schools, the assessment system has produced unfair and unequal treatment (Gonzalez, 2012).
When the
assessment system intensifies teaching to the test, narrowing of curriculum,
avoidance of the neediest students, reduction of teacher collaboration, or the
widespread demoralization of teachers (Baker, E. et al, 2010), then it has
unfair and regressive effects.
Any assessment
system whose proprietary status limits access by the scholarly community to
validate its findings and interpretations is antithetical to the review process
upon which knowledge claims are based.
An unfair assessment system is unacceptable for high stakes
decision-making.
In August 2013, the Tennessee State Board of Education adopted a new teacher licensing policy that ties teacher license renewal to value-added scores.
However, implementation of this policy was delayed by an important presentation made public by the Tennessee Education Association. In the presentation, TEA attorney Rick Colbert, drawing on value-added data that individual teachers shared for additional analysis, demonstrated that 43 percent of the teachers who would have lost their licenses because of declining value-added scores in one year had higher scores the following year, with 20 percent of those teachers scoring high enough the following year to retain their licenses. The presentation may be viewed on YouTube:
http://www.youtube.com/watch?v=l1BWGiqhHac
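The kind of year-to-year reversal the TEA documented is what noisy estimates would be expected to produce. The sketch below uses hypothetical numbers of our own, not the TEA's data: when value-added scores are only modestly correlated from one year to the next, most teachers flagged in the bottom group in one year land above the cutoff the following year.

```python
# Illustrative sketch (hypothetical numbers, not the TEA's data): with only a modest
# year-to-year correlation in value-added scores, many teachers flagged in the bottom
# decile in year 1 score above that cutoff in year 2.
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 100_000
r = 0.3  # assumed year-to-year correlation of teachers' scores

year1 = rng.standard_normal(n_teachers)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n_teachers)

flagged = year1 <= np.quantile(year1, 0.10)          # bottom 10% in year 1
recovered = np.mean(year2[flagged] > np.quantile(year2, 0.10))
print(f"Of teachers flagged in year 1, {recovered:.0%} score above the bottom 10% in year 2")
```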
After 20 years of using value-added assessment in Tennessee, student achievement does not reflect any added value from the expensive investment in the value-added assessment system. With $326,000,000 spent on assessment, the TVAAS, and other accountability-related costs since 1992, the state's student achievement levels remain in the bottom quarter nationally (SCORE, 2010, p. 7). Tennessee received a D on K–12 achievement when compared to other states based on NAEP achievement levels and gains, poverty gaps, graduation rates, and Advanced Placement test scores (Quality Counts, 2011, p. 46). The Public Education Finances reports (U.S. Census Bureau) rank Tennessee's per-pupil spending 47th for both 1992 and 2009. When state legislators and policymakers were led to believe in 1992 that the teacher is the single most important factor in improving student academic performance, they found reason to lower education spending as a priority while increasing accountability.
Finally, the evidence from twenty years of review and analysis by leading national experts in educational measurement and accountability leads to the same conclusion when we try to answer Dr. Sanders' original question: Can student test data be used to determine teacher effectiveness? The answer: No, not with enough certainty to make high-stakes personnel decisions. In turn, we can ask the larger social science question (Flyvbjerg, 2001): Is the use of value-added modeling and high-stakes testing a desirable social policy for improving learning conditions and learning for all students? The answer must be an unequivocal “no,” and it must remain so until assessments measure various levels of learning at the highest levels of reliability and validity, and with the conscious purpose of equality of educational opportunity for all students.
We have wasted much time,
money, and effort to find out what we already
knew: effective teachers and schools make a difference in
student learning and students’ lives. What the TVAAS and the EVAAS do not tell us, and what supporters of growth models seem oddly uncurious to know, is what, how, and why teachers make a difference. While test data and value-added analysis may highlight strengths and/or areas of needed intervention in school programs or subgroups of
the student population, we can only
know the “what,” “how”
and “why” of effective teaching through
careful observation by knowledgeable observers in classrooms where effective teachers engage
students in varied
levels of learning across multiple contexts. And while this kind
of knowing may be too much to ask of any set of algorithms developed so far for deployment in
schools, it is not at all alien to great educators who
have been asking these
questions and doing this kind of knowledge sharing
since Socrates, at least.
References
Amrein-Beardsley, A. (2008). Methodological concerns
about the Education Value-Added Assessment System. Educational
Researcher, 37(2), 65-75. doi: 10.3102/0013189X08316420
Baker, A., Xu, D., & Detch, E. (1995). The measure of education: A review of the Tennessee value added assessment system. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.
Baker, E. L.,
Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L.,
Ravitch, D., Rothstein, R., Shavelson, R. J., & Shephard, L. A. (2010, August
29). Problems with the use of student
test scores to evaluate teachers (Briefing Paper #278). Washington,
DC: Economic Policy Institute.
Ballou, D. (2002). Sizing up test scores. Education Next. Retrieved from www.educationnext.org
Ballou, D. (2005).
Value-added assessment: Lessons
from Tennessee. Retrieved from http://dpi.state.nc.us/docs/superintendents/quarterly/2010-11/20100928/ballou-lessons.pdf
Bock, R., & Wolfe, R. (1996, Jan. 23). Audit and review of the Tennessee value-added assessment system (TVAAS): Preliminary report. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.
Braun, H. I.
(2005). Using student progress to
evaluate teachers (Policy Information Perspective). Retrieved from
Educational Testing Service, Policy Information Center website: http://www.ets.org/Media/Research/pdf/PICVAM.pdf
Briggs, D. C.,
Weeks, J. P. & Wiley, E. (2008,
April). The sensitivity of value-added modeling to the creation of a vertical
scale score. Paper presented at the
National Conference on Value-Added Modeling, Madison, WI. Retrieved from http://academiclanguag.wceruw.org/news/events/VAM%20Conference%20Final%20Papers/SensitivityOfVAM_BriggsWeeksWiley.pdf
Dunn, M.,
Kadane, J., & Garrow, J. (2003). Comparing harm done by mobility and class
absence: Missing students and missing data. Journal of Educational and
Behavioral Statistics, 28, 269–288.
Fisher, T. (1996, January). A review and analysis of the Tennessee
value-added assessment system.
Nashville, TN: Tennessee
Comptroller of the Treasury, Office of Education
Accountability Report.
Flyvbjerg, B. (2001).
Making social science matter: Why
social inquiry fails and how to make it succeed again. Cambridge: Cambridge University Press.
Gonzalez, T. (2012, July 17). TN education reform hits bump in teacher evaluation. The Tennessean.
Jacob, B. A., Lefgren, L., & Sims, D. P. (2008, June). The persistence of teacher-induced learning gains (Working Paper 14065). Retrieved from the National Bureau of Economic Research website: http://www.nber.org/papers/w14065
Kupermintz, H.
(2003). Teacher effects and teacher effectiveness: A validity investigation of
the Tennessee Value Added Assessment System. Educational Evaluation and Policy Analysis, 25(3), 287-298.
Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V., & Martinez, F. (2006). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Retrieved from The Rand Corporation website: http://www.rand.org/content/dam/rand/pubs/reports/2009/RAND_RP1269.pdf
McCaffrey, D. F., & Lockwood, J. R. (2008, November). Value-added models: Analytic issues. Paper presented at the National Research Council and National Academy of Education Board on Testing and Assessment Workshop on Value-Added Modeling, Washington, DC.
McCaffrey, D. F.,
Lockwood, J. R., Koretz, D. M. & Hamilton, L. S. (2003). Evaluating value-added models for teacher
accountability. Retrieved from The Rand
Corporation website: http://www.rand.org/pubs/monographs/MG158.html
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67-101.
McCaffrey, D. F.,
Sass, T. R., Lockwood, J. R. & Mihaly, K. (2009). The intertemporal
variability of teacher effect estimates. Education
Finance and Policy, 4(4), 572-606.
National Academy of Sciences. (2009). Letter report to the U.S. Department of Education on the Race to the Top fund. Washington, DC: The National Academies Press. Retrieved from http://www.nap.edu/catalog.php?record_id=12780
National Research
Council and National Academy of Education. (2010). Getting Value Out of
Value-Added: Report of a Workshop. Committee on Value-Added Methodology for
Instructional Improvement, Program Evaluation, and Educational Accountability,
Henry Braun, Naomi Chudowsky, and Judith Koenig, Editors. Center for Education,
Division of Behavioral and Social Sciences and Education. Washington, DC: The
National Academies Press.
Papay, J. (2011). Different
tests, different answers: The stability
of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1),163-193.
Quality counts, 2011:
Uncertain forecast. (2011, January 13). Education Week. Retrieved from
http://www.edweek.org/ew/toc/2011/01/13/index.html
Rothstein, J. (2010). Teacher quality in educational
production: Tracking, decay, and student achievement. The
Quarterly Journal of Economics, 125(1),
175-214.
Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
State Collaborative on Reforming
Education. (2010). The state of education
in Tennessee (Annual Report). Retrieved from http://www.tnscore.org/wp-content/uploads/2010/06/Score-2010-Annual-Report-Full.pdf
U.S. Census Bureau. (2011). Public education finances: 2009 (G09-ASPEF). Washington, DC: U.S. Government Printing Office.