With numerous court cases filed to challenge the fairness, reliability, and/or validity of the various mis-applications of value-added modeling (VAM) to reward and punish teachers, it is appropriate to consider the huge body of evidence that may be shaped to drive a stake through the heart of the monstrous chimera that the tobacco-chewing Bill Sanders conjured just over 30 years ago from his ag desk at the University of Tennessee. The following excerpt from The Mismeasure of Education considers the reliability issue, alone. The other elements of the VAM critique comprise the remainder of Part 3 of the book.
. . . . In 1983, William
Sanders was a station statistician at the UT Agricultural Campus and an adjunct
professor in UT’s College of Business.
Based on his own experimentation with modeling the growth of farm animals
and crops, Sanders proposed a hypothetical and reductive question: Can student achievement growth data be used
to determine teacher effectiveness? He
then built a statistical model, ran the data of 6,890 Knox County students
through his model and answered his own question with an unequivocal
affirmative. Proceeding from this single study, Sanders made claims that went beyond the
customary correlational relationship between or among variables that
statisticians find as patterns or trends in data. He pronounced not only that teachers
contributed to the rate of student growth, but that teacher effectiveness was
in fact the most important variable in the rate of student growth:
If the purpose of educational
evaluation is to improve the educational process, and if such improvement is
characterized by improved academic growth of students, then the inclusion of
measures of the effectiveness of schools, school systems, and teachers in
facilitating such growth is essential if the purpose is to be realized. Of
these three, determining the effectiveness of individual teachers holds the
most promise because, again and again, findings from TVAAS research show
teacher effectiveness to be the most important factor in the academic growth of
students. (Sanders & Horn, 1998, p. 3).
This provocative pronouncement sent
other statisticians, psychometricians, mathematicians, economists, and
educational researchers into a bustle of activity to examine Sanders’
statistical model and his various claims based on that model, which became the Tennessee
Value-Added Assessment System (TVAAS) and which is now marketed by SAS as the
Education Value-Added Assessment System (SAS® EVAAS®). In sharing the findings of that body of
research, it becomes clear that using TVAAS to advance education policy is 1)
undesirable when examined against high standards of reliability, validity, and
fairness, and 2) counterproductive in reaching high levels of academic
achievement.
At first glance,
Sanders’ question seems reasonable and his answer logical. Why entrust students to teachers for seven or
more hours a day and expect children to grow academically, if teachers do not
contribute significantly to student learning?
In the Aristotelian tradition of logical conclusions “if a=b and b=c,
then a=c,” Sanders’ argument goes something like this:
a) if student test
scores are indicative of academic growth, and
b) academic growth
is indicative of teacher effectiveness, then
c) test scores are
indicative of teacher effectiveness; and therefore, they should be used in
teacher evaluation.
If test scores are improving, then
academic growth has increased and teacher effectiveness is greater. Conversely, if test scores are not
improving, academic growth has decreased and teacher effectiveness is diminished.
When
Sanders made his causal pronouncements, he made assumptions that, in turn, made
the teaching and learning context largely irrelevant. Using complex statistical methods previously
applied to business and agriculture to study inputs and outputs of systems,
Sanders developed formulae that eliminated student background, educational
resources, district curricula and adopted instructional practices, and the
learning environment of the classroom as variables that impact student
learning. Sanders expressed his
rationale for attempting to isolate the teacher effect from the myriad of effects
on student learning in an online listserv discussion with Gene Glass and others
in 1994:
The advantage of following growth over time is
that the child serves as his or her own “control.” Ability, race, and many other factors that
have been impossible to partition from educational effects in the past are
stable throughout the life of the child.
[http://gvglass.info/TVAAS/]
What we know, of course, is that a child’s economic, social, and familial conditions can
and do change depending on the larger contexts of national recessions,
mobility, divorce, crime, and changing education policies. It stands to reason, then, that statistical
models such as TVAAS must be held to higher standards of proof when they are used
to make high-stakes declarations of causation that simultaneously render contexts irrelevant.
These standards are shaped by the following questions: Are the value-added assessment findings
reliable? Are the findings valid? Are
the findings fair? Based on critical
reviews of leading statisticians, mathematicians, psychometricians, economists,
and education researchers, the Sanders Model does not meet these standards of
proof when used for high-stakes purposes, with the most egregious shortcomings
apparent when used for the dual purposes of diagnostics and evaluation. Briefly, there are three reasons the Sanders
Model falls short: Sanders assumes 1) that tests and test scores are a reliable
measure of student learning; 2) that characteristics of students, classrooms,
schools, school systems, and neighborhoods can be made irrelevant by comparing
a student’s test scores from year to year; 3) that value-added modeling can
capture the expertise of teachers fairly—just as fairly as Harrington Emerson
thought he could apply Frederick Taylor’s (1911) Principles of Scientific Management to the inefficiency of American
high schools in 1912 (Callahan, 1962).
As explained in Part II, the TVAAS is simply the latest iteration of
business efficiency formulas misapplied to educational settings in hopes of
producing a standard product in a cost-effective fashion. Sanders made this clear in 1994 while
distinguishing TVAAS from other teacher evaluation formats:
TVAAS is product oriented. We look at whether the child learns—not at
everything s/he learns, but at a portion that is assessed along the articulated
curriculum, a portion each parent is entitled to expect an adequately
instructed child will learn in the course of a year ([http://gvglass.info/TVAAS/]).
This quote is even more telling for what
remains implicit, rather than expressed: “the portion that is assessed” is the
portion that can be reduced to the standardized test format, which is required
for Dr. Sanders to be able to perform his statistical alchemy to begin
with. Not only does Dr. Sanders claim to
speak here for the millions of parents who may have greater expectations for
their children’s learning than those reflected in Dr. Sanders’ minimalist
expectations, but he also tips his hand as to a deeper accountability and
efficiency motive that is exposed by his concern for the “adequately instructed
child.” Once the other contextual
factors (resources, poverty levels, parenting, social and cultural capital,
leadership, etc.) have been excised from Dr. Sanders’ formulae, the only
remaining contextual factor (the teacher) must absorb the full weight of any
causative change. Such contextual
cleansing may make for beautiful statistical results, but it performs a
devastating reduction of what is considered learning in schools, all the while
admitting no concern for either what is taught or how it is taught.
All these basic shortcomings are reinforced by assuming that learning is
linear, a demonstrably false assumption that will be discussed later in this
section.
While there were distinct statistical
and psychometric challenges to the efficacy of TVAAS in the educational
measurement and evaluation research literature from 1994 to 2012, one
overarching theme ran through them all:
value-added modeling at the teacher-effect level is not stable enough to
determine individual teacher contributions to student academic performance,
especially as it is related to personnel decisions, i.e., evaluation,
performance pay, tenure, hiring, or dismissal decisions. As early as 1995, scholars (Baker, Xu, &
Detch, 1995) offered a strong warning that the use of TVAAS for high-stakes purposes
might create unintended consequences for both teachers and students, such as
teaching to the test (thus narrowing the curriculum), teaching test skills instead
of academic skills, over-enrolling students in special education since special
education scores were not counted in TVAAS calculations, cheating to raise test
scores, and using poor test performance to hurt teachers professionally. By 2011, researchers (Corcoran, Jennings, &
Beveridge, 2011) had offered empirical evidence that not all teachers teach
to the test, but when they do, student learning depreciates more quickly than
when teachers teach to general knowledge domains and expect students to master
concepts and apply skills.
In the following
examination of these three standards of proof (reliability, validity, and
fairness), we summarize the findings of national experts in statistical
modeling, value-added assessment, education policy, and accountability
practices. Taken together, they provide
irrefutable evidence that Sanders fails to meet these standards by using TVAAS
for high-stakes decision-making such as reducing resources to schools, closing
low-scoring schools, or sanctioning and/or rewarding teachers. Most important, however, is the
Sanders-sanctioned myth that if students are making some yearly growth on tests
that were constructed for diagnostic rather than evaluative purposes, then
those students will have received an education sufficient for a successful
life, economically, socially, personally.
The
Tennessee Value-Added Assessment Model—Reliability Issues
The Tennessee Comprehensive
Assessment Program (TCAP) achievement test is a standardized, multiple-choice
test composed of criterion-referenced items administered in grades 3-8.
It is purported to measure
student mastery of general academic concepts and skills as well as the specific
objectives of the Tennessee learning standards. Controversy surrounding the use of achievement
tests stems from the degree of test reliability needed for the high stakes
purposes for which they are used. To
achieve reliability, achievement test scores must be consistent over repeated
test measurements and free of errors of measurement (RAND, 2010). The degree of reliability is affected by test
construction (such as vertical scaling and test equating) and by test use (such as
diagnostic versus evaluative purposes).
As early as 1995, the
Tennessee Office of Education Accountability (OEA) reported “unexplained
variability” in the value-added scores and called for an outside evaluation of
all components of the TVAAS that included the tests used in calculating the
value-added scores (p. iv). A three-person outside evaluation team included R.
Darrell Bock, a distinguished professor in design and analysis of educational
assessment and professor at the University of Chicago; Richard Wolfe, head of
computing for the Ontario Institute for Studies in Education; and Thomas H. Fisher,
Director of Student Assessment Services for the Florida Department of
Education. The outside evaluation team
investigated the 1995 OEA concern over the achievement tests used by TVAAS. Of
particular interest to the TVAAS evaluators were the construction of equal
interval and vertical scales and the process of test equating, concepts
explained in lay terms below.
Bock and Wolfe found the
scaling properties acceptable for the purpose of determining student academic
gain scores from year to year, but unacceptable for determining district,
school, and teacher effect scores (p. 32). All three evaluators had concerns
about test equating. Fisher’s (1996) concerns focused on the testing
contractor, CTB/McGraw-Hill, having the sole responsibility for developing
multiple test forms of equal difficulty at each grade level, stating that
“[t]est equating is a procedure in which there are many decisions not only
about initial test content but also about the statistical procedure used. If care is not exercised, the content design
will change over time and the equating linkages will drift” (p. 23). And indeed, Bock and Wolfe found in their
examination of Tennessee’s equated test forms that test form difficulty (due
to item selection) created unexpected variation in gain scores at some grade
levels. Bock and Wolfe also emphasized
the importance of how the scale scores, used in calculating value-added gain
scores, are derived (pp. 12-13). Why are the scaling properties of tests
important?
Achievement tests are measurement
tools designed to determine where on a continuum of learning a student’s
performance falls. The equal interval
scale of a test is the continuum of knowledge and skills divided into equal
units of “learning” value. If one thinks
of measuring learning along a number line ranging from 1 to 100, the assumption
of equal intervals of learning would be that the same “amount” of learning
occurs whether the student’s scores move from 1 to 2 or from 50 to 51 or from
98 to 99 on the number line or measurement scale. The leap of faith here is that the student
whose score moves from 1 to 10 at the less difficult end of the testing continuum has
learned a greater amount than the student who increases his or her score by
fewer intervals, say 95 to 99, at the most difficult end of the continuum. Dale Ballou (2002), an economics professor at
Vanderbilt University who collaborated with Sanders during the 1990s, has
maintained that the equal intervals used to measure student ability are really
measuring the ordered difficulty of test items, with the ordering of difficulty
determined by the test constructor (p. 15).
If, for example, a statistics and probability question requiring students to
make a prediction based on various representations is placed on a third grade
test, it is considered more difficult than an item requiring the student to
simply add or subtract. It is difficult
to say who has learned more: the gifted third grade student who answers the
statistical question correctly but appears to make less progress, or the student who answers
all the calculation questions correctly and appears from the test score to make
more progress. Therefore, student
ability is inferred from test scores and not truly observed, thus making
value-added teacher effect estimates better or worse depending on how these
equal interval scales are designed for consistently measuring units of
“learning” at every grade level over time.
Differences in units of measurement from scale to scale yield
differences in teacher effect scores, even though the selection and use of the
scales are beyond the control of any teacher.
Ballou has concluded that a built-in imprecision in scales leads to
quite arbitrary results, and that “our
efforts to determine which students gain more than others—and thus which
teachers and schools are more effective—turn out to depend on conventions
(arbitrary choices) that make some educators look better than others” (p. 15).
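Ballou’s point about arbitrary conventions can be illustrated with a small, purely hypothetical calculation (ours, not Ballou’s or Sanders’; the scores and the two scale transformations below are invented for the sketch). Two equally monotone rescalings of the same raw results disagree about which student “gained more,” and any teacher ranking built on those gains inherits that arbitrary choice.

```python
# A hypothetical illustration (not Sanders' model or any real TCAP scale): two
# monotone rescalings of the same raw pre/post scores disagree about which
# student "gained more," so teacher rankings built on such gains inherit the
# arbitrary choice of scale.

def linear_scale(raw):
    """Scale 1: every raw point counts as one equal unit of 'learning'."""
    return raw

def top_stretched_scale(raw):
    """Scale 2: also monotone, but it stretches the difficult end of the test."""
    return raw ** 2 / 100.0

students = {
    "Student A (low start)": (20, 60),   # (raw pre-test, raw post-test)
    "Student B (high start)": (70, 95),
}

for scale_name, scale in [("linear", linear_scale),
                          ("top-stretched", top_stretched_scale)]:
    gains = {name: scale(post) - scale(pre)
             for name, (pre, post) in students.items()}
    leader = max(gains, key=gains.get)
    print(f"{scale_name:13s} scale -> "
          + ", ".join(f"{name}: gain {gain:5.1f}" for name, gain in gains.items())
          + f" | larger gain: {leader}")
```

Nothing about either student’s learning changes between the two runs; only the choice of scale does, yet the answer to “who gained more” flips.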
The vertical scaling of an achievement test is based on the measurement of
increasingly difficult test items from year to year on the same academic
content and skills. Vertical scaling is important to Sanders’ TVAAS model
because he uses student test scores over multiple years in estimating teacher
effectiveness. Therefore, the content and skills at one grade level must be
linked to the content and skills at the next grade level in order to measure
changes in student performance on increasingly difficult or more complex
concepts and skills from third grade through eighth grade. Problems arise, however, when there is a shift
in the learning progression of content and skills. For example, third grade reading may focus on
types and characteristics of words and the retelling of narratives, fifth grade
on types and characteristics of literary genres and interpretation of non-fiction
texts, and eighth grade on evaluation of texts for symbolic meaning, bias, and
connections to other academic subjects such as history or science. While the content and skills are related from
grade to grade, there may not be sufficient linkage between content and
skills or consistency in the degree of difficulty across grades and subjects to
render accurate performance portraits of students and the resulting teacher
effect estimates, even if one has faith that test results can mirror teacher efforts
in the best of all possible worlds: “Shifts in the mix of
constructs across grades can distort test score gains, invalidate assumptions
of perfect persistence of teacher effects and the use of gain scores to measure
growth, and bias VAM [value-added model] estimates” (McCaffrey & Lockwood,
2008, p. 9).
The same kinds of inconsistencies can occur when using the same tests for
other high-stakes measures such as school effectiveness. Using the Sanders Model and eight different
vertical scales for the same CTB/McGraw Hill tests at consecutive grade levels,
Briggs, Weeks, and Wiley (2008) found that
the numbers of schools that
could be reliably classified as effective, average or ineffective was somewhat
sensitive to the choice of the underlying vertical scale. When VAMs are being
used for the purposes of high-stakes accountability decisions, this sensitivity
is most likely to be problematic (p. 26).
Lockwood (2006) and his colleagues at RAND found that variation within
teachers’ effect scores persisted, even when the internal consistency
reliability between the procedures subtest and the problem-solving subtest from
the same mathematics test was high. In
fact, there was greater variation from one subtest to the next than there was
in the overall variation among teachers (p. 14). The authors cited the source of this
variation as “the content mix of the test” (p. 17), which simply means that
test construction and scaling are imperfect enough to warrant great care and
prudence when applying even the most perfect statistical treatments under the
most controlled conditions.
For students to show progress
in a specific academic subject (low-stakes) and for Sanders to isolate the
teacher effect based on student progress (high-stakes), tests require higher
degrees of reliability in equal interval scaling, vertical scaling, and test
equating. Tests are designed and
constructed to do a number of things, from linking concepts and skills for
annual diagnostic purposes to determining student mastery of assigned standards
of learning. They are not, however,
designed or constructed to reliably fulfill the value-added modeling demands
placed on them. Though teachers cannot
control the reliability of test scaling or the test item selection that
represents what they teach, they can control their teaching to the learning
objectives of the standards most likely to be on tests at their grade
levels—those learning objectives that lend themselves easily to multiple-choice
tests. For example, an eighth grade teacher
might have students identify bias in different reading selections, easily
tested in a multiple choice format, instead of studying the effect of reporting
bias in the news and research literature on current political issues and policy
decisions in the students’ community.
Or if she does teach the latter lesson, she and her students get no
credit on a multiple-choice test for their true level of expertise in teaching
and understanding the concept of bias.
In fact, if she spends the time to examine reporting bias as part of the
student’s social and political environment at the expense of another test
objective, student scores and her resulting value-added effect designation may
suffer. This is an unintended
consequence of high-stakes testing and a survival strategy for teachers whose
position and salary are bound to policies and practices that focus on high test
scores. As an invited speaker to the National Research
Council workshop on value-added methodology and accountability, Ballou
pointedly went to the heart of the matter when he acknowledged the “most
neglected” question among economists concerned with accountability measures:
The question of what
achievement tests measure and how they measure it is probably the [issue] most
neglected by economists…. If tests do not cover enough of what teachers
actually teach (a common complaint), the most sophisticated statistical
analysis in the world still will not yield good estimates of value-added unless
it is appropriate to attach zero weight to learning that is not covered by the
test. (Braun, Chudowsky, & Koenig, 2010, p. 27).
In addition to
these scaling issues, the reliability of the teacher effect estimates is a
problem in high-stakes applications when compromised by the timing of the test
administration, summer learning loss, missing student data, and inadequate
sample size of students due to classroom arrangements or other school
logistical and demographic issues.
Achievement
tests used for value-added modeling are generally administered once a
year. Scores from these tests are then
compared in one of three ways: (1) from
spring to spring, (2) from fall to spring, or (3) from fall to fall. Spring to spring and fall to fall schedules
introduce what has become known as summer learning loss—what students forget
during summer vacation. This loss is
different for different students depending on what learning opportunities they
have or do not have during the summer, e. g., summer tutoring programs, camps,
family vacations, access to books and computers. What John Papay (2011) found in comparing
different test administration schedules was that “summer learning loss (or
gain) may produce important differences in teacher effect” and that even “using the same test but
varying the timing of the baseline and outcome measure introduces a great deal
of instability to teacher rankings” (p. 187). Papay warned policymakers and practitioners
wishing to use value-added estimates for high-stakes decision making that
they “must think carefully about the consequences of these differences,
recognizing that even decisions seemingly as arbitrary as when to schedule the
test within the school year will likely produce variation in teacher
effectiveness estimates” (p. 188).
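Papay’s warning can be made concrete with a deliberately simplified sketch (our invention, not Papay’s analysis; the classroom gain and summer-change numbers are assumed). Two teachers who add exactly the same classroom learning receive different “gains” once the testing window is allowed to sweep the summer into the measure.

```python
# A deliberately simplified sketch (our invention, not Papay's analysis). Both
# teachers contribute the same classroom learning; only their students' summers
# differ. A spring-to-spring window folds the summer into the "teacher" gain,
# while a fall-to-spring window does not.

classroom_gain = 10.0                     # identical true contribution (assumed)
summer_change = {"Teacher 1": +2.0,       # students with camps, tutoring, books
                 "Teacher 2": -4.0}       # students with summer learning loss

for teacher, summer in summer_change.items():
    spring_to_spring = summer + classroom_gain   # prior summer swept into the gain
    fall_to_spring = classroom_gain              # window excludes the summer
    print(f"{teacher}: spring-to-spring gain = {spring_to_spring:+.1f}, "
          f"fall-to-spring gain = {fall_to_spring:+.1f}")
```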
In addition to
the test schedule problem for pre/post test administration, achievement tests
are usually administered before an entire school year is completed, meaning the
students’ achievement test scores impact two teachers’ effect scores each year
instead of just one. Because he uses multiple years of student data to estimate teacher effect scores, Sanders has
remained unconcerned with this issue; the persistence of the teacher
effect on student performance is simply an assumption of his model. Ballou (2005) described Sanders’ assumption
in the following way “. . . teacher effects are layered over time (the effect of the fourth
grade teacher persists into fifth grade, the effects of the fourth and fifth
grade teachers persist into sixth grade, etc.)” (p. 6). However, the possible “contamination” of
other teachers’ influence on an individual teacher’s effect estimate was noted
in the first outside evaluation of TVAAS by Bock and Wolfe
(1996), who questioned the three years of data that Sanders used in his
model. Bock and Wolfe agreed that three
years of data would help stabilize the estimated gain scores, but they were
concerned, nonetheless, that “the sensitivity of the estimate as an indicator
of a specific teacher’s performance would be blunted” (p. 21). Fourteen years after Bock and Wolfe’s neglected
warning, the empirical research presented in a study completed for the U.S.
Department of Education’s Institute of Education Sciences (Schochet &
Chiang, 2010) found that the sensitivity of the estimate of a specific
teacher’s effect was, indeed, blunted.
They found, in fact, that the error rates for distinguishing teachers
from average teaching performance using three years of data were about 26
percent. They concluded
more than 1 in 4
teachers who are truly average in performance will be erroneously identified
for special treatment, and more than 1 in 4 teachers who differ from average
performance by 3 months of student learning in math or 4 months in reading will
be overlooked (p. 35).
Schochet and Chiang (2010) also found that to reduce the effect of test
measurement errors to 12 percent of the variance in teachers’ effect scores
would take 10 years of data for each teacher (p. 35), an utter impracticality
when using value-added modeling for high-stakes decisions that alter school
communities and students’ and teachers’ lives.
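To see why error rates of this magnitude are not a correctable glitch, consider a minimal simulation with invented noise levels (an illustration of the general statistical problem, not a reproduction of Schochet and Chiang’s analysis). When the estimation error in a teacher’s effect score is of the same order as the spread of true teacher effects, chance alone flags a sizable share of genuinely average teachers.

```python
# A minimal simulation with invented parameters (not Schochet & Chiang's actual
# error analysis). When the estimation noise in a teacher's effect score is of
# the same order as the spread of true teacher effects, chance alone flags a
# sizable share of truly average teachers as different from average.

import random

random.seed(0)
true_sd = 1.0         # spread of true teacher effects (arbitrary units, assumed)
noise_sd = 0.8        # estimation error left after several years of data (assumed)
flag_cutoff = 1.0     # flag teachers whose estimate is this far from the average
n_teachers = 100_000

average_teachers = 0
false_flags = 0
for _ in range(n_teachers):
    true_effect = random.gauss(0, true_sd)
    estimate = true_effect + random.gauss(0, noise_sd)
    if abs(true_effect) < 0.1 * true_sd:      # a "truly average" teacher
        average_teachers += 1
        if abs(estimate) > flag_cutoff:       # erroneously singled out
            false_flags += 1

print(f"Truly average teachers flagged anyway: {false_flags / average_teachers:.0%}")
```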
McCaffrey, Lockwood, Koretz, Louis, & Hamilton (2004) challenged
Sanders’ assumption of the persistence of a teacher’s effect on future student
performance. In noting the “decaying
effects” that are common in social science research, they concluded the Sanders
claim of teacher effect immutability over time “is not empirically or
theoretically justified and seems on its face not to be entirely plausible” (p.
94). In fact, in earlier research, McCaffrey and
his colleagues at RAND (2003) developed a value-added model that allowed for
the “estimation of the strength of the persistence of teacher effects in later
years” (p. 59) and found that “teacher effects dampen [decay] very quickly” (p.
81). As a result, they called for more
research concerning the assumption of persistence. Mariano, McCaffrey, and Lockwood’s (2010)
research concerning the persistence of teacher effect showed that “complete
persistence of teacher effects across future years is not supported by data” (in
Lipscomb et al, 2010, p. A14). Using
statistical methods to measure teacher persistence effect on student
performance across multiple years in math and reading, Jacob, Lefgren, and Sim
(2010) determined that “only about one-fifth of the test score gain from a high
value-added teacher remains after a single year…. After two years, about
one-eighth of the original gain persists” (p. 33). They went on to say that “if value-added test
score gains do not persist over time, adding up consecutive gains [over
multiple years] does not correctly account for the benefits of higher
value-added teachers” (p. 33). In light
of these more recent research studies, Sanders’ unwavering claims have proven
more persistent than the teacher effect persistence that he claims. Given the mounting body of research
that, at a minimum, acknowledges deep uncertainty regarding the persistence of
teacher effect, the claim by Sanders and Rivers (1996) that the “residual effects of both very
effective and ineffective teachers were measurable two years later, regardless
of the effectiveness of teachers in later grades,” (p. 6) clearly needs to be
reexamined and explicated further.
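The arithmetic of the persistence dispute is easy to lay out. The sketch below is our framing, using only the fractions quoted above from Jacob, Lefgren, and Sim; it contrasts the layered, complete-persistence assumption with the empirically observed decay of a one-time gain.

```python
# A back-of-the-envelope comparison (our framing, using only the fractions quoted
# above from Jacob, Lefgren, and Sim). A one-time gain of 1.0 attributed to a
# teacher is credited in full every later year under the complete-persistence
# (layered) assumption, while the empirical estimates leave only a fraction.

initial_gain = 1.0
fraction_remaining = {0: 1.0, 1: 1 / 5, 2: 1 / 8}   # years later -> share persisting

for years_later, fraction in fraction_remaining.items():
    layered = initial_gain                  # complete-persistence assumption
    decayed = initial_gain * fraction       # what the empirical work suggests
    print(f"{years_later} year(s) later: layered model credits {layered:.2f}, "
          f"empirical decay leaves {decayed:.2f}")
```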
By claiming the persistence of a teacher’s influence on student
performance, Sanders is able to assume that access to three years of data lessens
the statistical noise created by missing student test scores, socioeconomic
status, or other factors that affect teacher effect scores (Sanders, Wright,
Rivers, & Leandro, 2009). Missing
data was a primary issue in the 1996 evaluation of TVAAS by Bock, Wolfe and
Fisher. Their examination of the data quality showed that missing data could
cause distortion to the TVAAS results and the “linkage from students to
teachers is never higher than about 85 percent, and worse in grades 7-8,
especially in reading” (p. 18). It is important, of course, that test scores of
every student in every classroom in every school are accounted for and attached
or linked to the correct teacher when computing teacher effect scores. Poor linking commonly occurs in schools, however,
due to student absences, students being pulled out of class for special
education, student mobility, and team teaching arrangements, just to name a few
(Baker et al, 2010). Missing or faulty
data contributes to teachers having incomplete sets of data points (student
test scores) and “can have a negative impact on the precision and stability of
value-added estimates and can contribute to bias” (Braun, Chudowsky &
Koenig, 2010, p. 46). A small set of
student test scores, for example, can be impacted by an overrepresentation of a
subgroup of students (i.e., socioeconomic status or disabilities), or that same
set of scores may be significantly impacted by a single student with very
different scores, high or low, from the other students in a class.
In addition to assuming multiple years of data will increase the number of
data points per teacher sufficiently, Sanders assumes that missing data are
random, but this is doubtful as students whose test data are missing are most
often low-scoring students (McCaffrey et al, 2003, p. 83) who missed school or moved, entered personal data incorrectly, or took the
test under irregular circumstances (e.g., special education modifications,
make-up exams) and may have been improperly matched to teachers (Dunn, Kadane,
& Garrow, 2003). Sanders attempts to capture all
available data for teacher effect estimates by recalculating those estimates to include missing data from previous years that is
eventually matched properly to the correct teacher (Eckert & Dabrowski,
2010).[i] This causes variability in past effect scores
for the same teacher and increases skepticism about TVAAS accuracy, especially
when such retrospective revisions come too late to alter teacher evaluation
decisions based on the earlier version of scores. In his primer on value-added modeling, Braun
(2005) pointed out that the Sanders claim that multiple years of data can
resolve the impact of missing data, “required empirical validation” (p. 13). No such validation has been forthcoming from
the Sanders Team.
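The consequence of non-random missing data is simple to demonstrate with a hypothetical classroom (the gain scores below are invented, and this is not Sanders’ imputation procedure): when the missing scores belong disproportionately to low-gaining students, the surviving average is biased, not merely noisier.

```python
# A hypothetical classroom (invented gain scores; not Sanders' imputation
# procedure). When the missing scores belong mostly to low-gaining students,
# the class-average gain computed from the remaining students is biased,
# not merely noisier.

full_class_gains = [2, 3, 4, 5, 6, 14, 15, 16, 17, 18]      # every student's gain
observed_gains = [g for g in full_class_gains if g >= 10]   # low gainers went missing

true_mean = sum(full_class_gains) / len(full_class_gains)
observed_mean = sum(observed_gains) / len(observed_gains)
print(f"True class-mean gain: {true_mean:.1f}; "
      f"mean after non-random attrition: {observed_mean:.1f}")
```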
Ballou (2005) explained that imprecision arises when teacher effect scores
are based on too few data points (student test scores linked to a particular
teacher). The number of data points can
be too few based on the number of years the data are collected, whether one,
two, or three years. The number of data
points can be reduced, too, by small class size, missing student data, classes
with a large percentage of special education students whose scores do not count
in teacher effect data, or students who have not attended a teacher’s class for
at least 150 days of instruction, and by shifting teaching assignments in
grade level or subject area.
Data points may be reduced, too, by team teaching situations whereby
only one teacher on the team is linked to student data. Ballou’s (2005)
research indicated further that the “imprecision in estimated effectiveness due
to a changing mix of students would still produce considerable instability in
the rank-ordering of teachers [from least effective to most effective]” (p.
18). Ballou recommended adjusting TVAAS
to account for all of a teacher’s grade level and subject area data, as too few
teachers teach the same grade level and same subject over a three-year period
(p. 23). Ballou concluded, too, that
only one year of student test data makes teacher effect data too imprecise to
be meaningful or fair for its use in teacher evaluation.
In 2009, McCaffrey, Sass, Lockwood and Mihaly published their research
concerning the year-to-year variability in value-added measures applied to
teachers assigned
small numbers of students, such as special education teachers, since special
education students are often exempted from taking the tests. Such a small set of student scores is easily
skewed by extremely high or extremely low scores, resulting in extremes in
teacher value-added scores, “so rewarding or penalizing the top or bottom
performers would emphasize these teachers and limit the efficacy of policies
designed to identify teachers whose performance is truly exceptional” (p.
601). Even though using multiple years
of data helps reduce the variability in teacher effect scores, “one must
recognize that even when multiyear estimates of teacher effectiveness are
derived from samples of teachers with large numbers of students per year, there
will still be considerable variability over time” (p. 601).
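The class-size problem that McCaffrey and colleagues describe follows directly from elementary sampling variability, as the hypothetical numbers below illustrate (the spread of individual student gains is assumed, not drawn from their data): the chance swing in a class-average gain shrinks only with the square root of the number of tested students, so the smallest classes produce the most extreme value-added scores.

```python
# A simple sampling-variability illustration (the spread of individual student
# gains is assumed, not taken from McCaffrey et al.). The chance swing in a
# class-average gain shrinks only with the square root of the number of tested
# students, so the smallest classes yield the most extreme value-added scores.

import math

student_gain_sd = 20.0   # spread of individual student gain scores (assumed)

for class_size in [5, 10, 25, 60]:
    se_of_class_mean = student_gain_sd / math.sqrt(class_size)
    print(f"{class_size:3d} tested students -> chance swing in the class-mean "
          f"gain of roughly ±{se_of_class_mean:.1f} points")
```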
With these unresolved issues and deep skepticism related to test
reliability, Sanders’ logic in justifying the use of value-added modeling for
teacher accountability weakens, as in our slightly modified syllogism:
a) if student test scores are unreliable measures
of student growth and
b) unreliable
measures of student growth are the basis for calculating teacher effectiveness,
then
c)
test scores are unreliable measures for calculating teacher
effectiveness, at least for high-stakes decisions concerning teachers’
livelihoods and schools’ existence.
[i] “[The
current] year’s estimates of previous years’ gains may have changed as a result
of incorporating the most recent student data. Re-estimating all years in the
current year with the newest data available provides the most precise and
reliable information for any year and subject/grade combination. Find district
and school information at the following: TVAAS Public
https://tvaas.sas.com/evaas/public_welcome.jsp, TVAAS Restricted: https://tvaas.sas.com/evaas/login.jsp”
(Eckert & Dabrowski, 2010, p. 90).
The false assumption the 'reformers' make is that correlation equals causation. VAM scores can vary wildly from one year to the next because the scores do not measure the value of the teachers. Instead, the scores more likely reflect the socio-economic levels of the students, which may vary from year to year. VAM is junk science based on wrong assumptions, and it has been debunked by the American Statistical Association.