Sunday, March 27, 2011

The Crucial Missing Value in Using Value Added Modeling (VAM) for Teacher Evaluation

I was at a social justice conference on Saturday and heard a great deal of concern related to the ethical breach created when children and teachers are evaluated using a standardized test.  What was missing in the discussion, however, was an understanding of how the "new and improved" version of pre- and post-testing represents just another in a long line of pseudo-scientific impositions, the kind that eventually end up in the dustbin of technocratic ed interventions. 

The excerpted piece below, which appears in Change, helps add some much-needed technical clarity to a policy that is dangerous for many reasons beyond the ones laid out here, which rest on technical non-feasibility and the inherent messiness of applying to classrooms a method that would be much easier to apply to, let's say, an agricultural experiment (perhaps that is where the mad Dr. Sanders should return). 

For even though Dr. Wainer below lays out three big reasons that this current use of pseudoscience is not ready for prime time, he does not even touch upon the lack of validity and reliability of the junk tests that the testing corporations have sold to state and local school systems, and he does not delve into the ethical and philosophical barrel of snakes into which the student-teacher relationship has been cast when student performance on a test becomes the principal criterion used to determine whether a teacher can keep her job. 
This essay was extracted from Howard Wainer’s book Uneducated Guesses: A Leisurely Tour of Some Educational Policies That Only Make Sense If You Say Them Fast, to be published this year by Princeton University Press.
Race to the Top, the Obama administration’s program to help reform American education, has much to recommend it—not the least of which is the infusion of much needed money. So it came as no surprise to anyone that resource-starved states rushed headlong to submit modified education programs that would qualify them for some of the windfall. A required aspect of all such reforms is the use of student performance data to judge the quality of districts and teachers. This is a fine idea, not original to Race to the Top.

The late Marvin Bressler (1923–2010), Princeton University’s renowned educational sociologist, said the following in a 1991 interview:
Some professors are justly renowned for their bravura performances as Grand Expositor on the podium, Agent Provocateur in the preceptorial, or Kindly Old Mentor in the corridors. These familiar roles in the standard faculty repertoire, however, should not be mistaken for teaching, except as they are validated by the transformation of the minds and persons of the intended audience.
But how are we to measure the extent to which “the minds and persons” of students are transformed? And how much of any transformation observed do we assign to the teacher as the causal agent? These are thorny issues indeed. The beginning of a solution has been proposed in the guise of what are generally called “value-added models,” or VAMs for short. These models try to partition the change in student test scores among the student, the school, and the teacher. Although these models are still very much in the experimental stage, they have been seized upon as “the solution” by many states and thence included as a key element in their reform proposals. Their use in the evaluation of teachers is especially problematic.

Let me describe what I see as three especially difficult problems in the hopes that I might instigate some of my readers to try to make some progress toward their solution.

Problem 1—Causal Inference

One principal goal of VAMs is to estimate each teacher’s effect on his/her students. This probably would not be too difficult to do if our goal were just descriptive (e.g., Freddy’s math score went up 10 points while Ms. Jones was his teacher). But, description is only a very small step if this is to be used to evaluate teachers. We must have a causal connection. Surely no one would credit Ms. Jones if the claim was “Freddy grew four inches while Ms. Jones was his teacher,” although it, too, might be descriptively correct. How are we to go from description to causation? A good beginning would be to know how much of a gain Freddy would have made with some other teacher. But alas, Freddy didn’t have any other teacher. He had Ms. Jones. The problem of the counterfactual plagues all of causal inference.

We would have a stronger claim for causation if we could randomly assign students to teachers and thence compare the average gain of one teacher with that of another. But, students are not assigned randomly. And even if they were, it would be difficult to contain post-assignment shifting. Also, randomization doesn’t cure all ills in very finite samples. The VAM parameter that is called “teacher effect” is actually misnamed; it should more properly be called “classroom effect.” This change in nomenclature makes explicit that certain aspects of what goes on in the classroom affect student learning but are not completely under the teacher’s control. For example, suppose there is one 4th grader whose lack of bladder control regularly disrupts the class. Even if his class assignment is random, it still does not allow fair causal comparisons.

And so if VAMs are to be usable, we must utilize all the tools of observational studies to make the assumptions required for causal inference less than heroic.

Problem 2—The Load VAM Places on Tests

VAMs have been mandated for use in teacher evaluation from kindergarten through 12th grade. This leads through dangerous waters. It may be possible for test scores on the same subject, within the same grade, to be scaled so that a 10-point gain from a score of, say, 40 to 50 has the same meaning as a similar gain from 80 to 90. It will take some doing, but I believe that current psychometric technology may be able to handle it. I am less sanguine about being able to do this across years. Thus, while we may be able to make comparisons between two 4th-grade teachers with respect to the gains their students have made in math, I am not sure how well we could do if we were comparing a 2nd-grade teacher and a 6th-grade one. Surely a 10-point gain on the tests that were properly aimed for these two distinct student populations would have little in common. In fact, even the topics covered on the two math tests are certain to be wildly different.

If these difficulties emerge on the same subject in elementary school, the problems of comparing teachers in high school seem insurmountable. Is a 10-point gain on a French test equal to a 10-point gain in physics? Even cutting-edge psychometrics has no answers for this. Are you better at French than I am in physics? Was Mozart a better composer than Babe Ruth was a hitter? Such questions are not impossible to think about—Mozart was a better composer than I am a hitter—but only for very great differences. Differences among teachers are usually much more subtle. What can we do to make some gains on this topic?

Problem 3—Missing Data

Missing data are always a huge problem in practical situations, and here they are made even more critical by problems with the stability of VAM parameter estimates. The sample size available for the estimation of a teacher effect is typically about 30. This has not yielded stable estimates. One VAM study showed that only about 20% of teachers in the top quintile one year were in the top quintile the next. This result can be interpreted in either of two ways:
  • 1. The teacher effect estimates aren’t much better than random numbers.
  • 2. Teacher quality is ephemeral and so a very good teacher one year can be awful the next.
If we opt for (1), we must discard VAM as too inaccurate for serious use. If we opt for (2), the underlying idea behind VAM (that being a good teacher is a relatively stable characteristic we wish to reward) is not true. In either case, VAM is in trouble.
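It is worth noting that roughly 20% is just the persistence we would expect by chance alone if the estimates carried no stable teacher signal. The small simulation below is my own illustration, not part of Wainer's essay; the number of teachers, the class size of 30, and the variance figures are all assumed for the sake of the sketch.

import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 1000, 30   # hypothetical counts; about 30 students per estimate

def persistence(signal_sd, noise_sd=1.0):
    """Fraction of year-1 top-quintile teachers who repeat in year 2."""
    true = rng.normal(0, signal_sd, n_teachers)    # stable teacher effects
    se = noise_sd / np.sqrt(n_students)            # sampling error of a class mean
    year1 = true + rng.normal(0, se, n_teachers)
    year2 = true + rng.normal(0, se, n_teachers)
    top1 = year1 >= np.quantile(year1, 0.8)
    top2 = year2 >= np.quantile(year2, 0.8)
    return (top1 & top2).sum() / top1.sum()

print(f"No stable teacher signal: {persistence(0.0):.0%}")   # about 20%, i.e., chance
print(f"Strong stable signal:     {persistence(0.5):.0%}")   # well above 20%

If being a good teacher were a stable trait that the estimates captured, something like the second figure is what we would expect to see; the reported figure sits at the chance benchmark.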

Current belief is that the problem is (1) and we must try to stabilize the estimates by increasing the sample size. This can be done in lots of ways. Four that come to mind are:
  • (a) Increasing class size to 300 or so
  • (b) Collapsing across time
  • (c) Collapsing across teachers
  • (d) Using some sort of empirical Bayes trick, gaining stability by borrowing strength from both other teachers and other time periods
Option (a), despite its appeal to lunatic cost-cutters, violates all we know about the importance of small class sizes, especially in elementary education. Option (c) seems at odds with the notion of trying to estimate a teacher effect, and it would be tough to explain to a teacher that her rating was lowered this year because some of the other teachers in her school had not performed up to par. Option (d) is a technical solution that has much appeal to me, but I don’t know how much work has been done to measure its efficacy. Option (b) is the one that has been chosen in Tennessee, the state that has pioneered VAM, and has thence been adopted more-or-less pro forma by the other states in which VAMs have been mandated. But, requiring longitudinal data increases data-gathering costs and the amount of missing data.
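Before turning to the missing data that option (b) forces us to confront, it may help to give a concrete picture of option (d). The sketch below is my own illustration, not anything Wainer specifies; the variance-component estimates are the crudest possible, and real implementations use far more elaborate mixed models. The idea is simply that each teacher's raw mean gain is pulled toward the district mean in proportion to how unreliable it is.

import numpy as np

def shrink_teacher_effects(class_gains):
    """Empirical Bayes shrinkage of per-teacher mean gains toward the grand mean.

    class_gains: list of 1-D arrays, one array of student gain scores per teacher.
    Returns (raw_means, shrunken_means).
    """
    raw = np.array([g.mean() for g in class_gains])
    ns = np.array([len(g) for g in class_gains])
    grand = raw.mean()

    # Pooled within-class (student-level) variance.
    sigma2 = np.mean([g.var(ddof=1) for g in class_gains])
    # Method-of-moments estimate of the between-teacher variance.
    tau2 = max(raw.var(ddof=1) - np.mean(sigma2 / ns), 0.0)

    # Shrinkage weight: noisy, small-class estimates move furthest toward the mean.
    weight = tau2 / (tau2 + sigma2 / ns)
    return raw, grand + weight * (raw - grand)

# Example with fabricated data: 40 classes of 25 students each.
rng = np.random.default_rng(1)
classes = [rng.normal(rng.normal(0.0, 3.0), 10.0, size=25) for _ in range(40)]
raw, shrunk = shrink_teacher_effects(classes)
print(f"spread of raw estimates:      {raw.std():.2f}")
print(f"spread of shrunken estimates: {shrunk.std():.2f}")

The design choice doing the work is the shrinkage weight: a teacher whose estimate rests on few students or noisy scores is judged mostly by the district average, so a single unlucky class moves her rating less.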

What data are missing? Typically, test scores, but also sometimes things like the connection between student and teacher. But, let’s just focus on missing test scores. The essence of VAM is the adjusted difference between pretest and posttest scores (often given at either end of the school year, but sometimes there is just one test given in a year and the pretest score is the previous year’s post score). The pre-score can be missing, the post-score can be missing, or both can be missing. High student mobility increases the likelihood of missing data. Inner-city schools have higher mobility than suburban schools. Because it is unlikely that missingness is unrelated to student performance, it is unrealistic to assume that we can ignore missing data and just average around them. Yet, often this is just what is done.
If a student’s pre-test score is missing, we cannot compute the change unless we do something. What is often done is to impute, for the missing score, the mean pre-score of the students who have one (in that school and that grade). This has the advantage of allowing us to compute a change score, and the mean scores for the actual data and the augmented data (augmented with the imputed scores) will be the same. This sounds like a plausible strategy, but only if you say it fast. It assumes that the students who are missing scores are just like the ones who have complete data. This is unlikely to be true.

It isn’t hard to imagine how a principal under pressure to show big gains could easily game the system. For example, the principal could arrange a field trip for the best students on the day that the pre-test is to be given. Those students would have the average of all those left behind imputed as their score. Then, at the end of the year when the post-test is to be given, there is a parallel field trip for the worst students. Their missing data will be imputed from the average of those who remain. The gain scores could thus be manipulated, and the size of the manipulation is directly related to the academic diversity of the student population—the more diverse, the greater the possible gain. Obviously, a better method for dealing with missing data must be found.
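A small simulation makes the point about diversity concrete. It is my own illustration; every number in it is made up, and the imputation rule is the simple mean imputation described above.

import numpy as np

rng = np.random.default_rng(2)

def apparent_gain(ability_sd, n=100, true_gain=5.0, frac_away=0.1):
    """Mean gain under mean imputation when the strongest students miss the
    pre-test and the weakest students miss the post-test."""
    ability = rng.normal(500, ability_sd, n)
    pre = ability + rng.normal(0, 5, n)
    post = pre + true_gain + rng.normal(0, 5, n)

    k = int(frac_away * n)
    best = np.argsort(ability)[-k:]    # "field trip" on pre-test day
    worst = np.argsort(ability)[:k]    # "field trip" on post-test day

    pre_imp, post_imp = pre.copy(), post.copy()
    pre_imp[best] = np.delete(pre, best).mean()     # impute the mean of those present
    post_imp[worst] = np.delete(post, worst).mean()

    return (post_imp - pre_imp).mean()

for sd in (10, 30, 60):   # increasingly diverse student bodies
    print(f"ability SD {sd:>2}: true gain 5.0, apparent gain {apparent_gain(sd):.1f}")

By construction the true mean gain is 5 points in every case, yet the apparent gain is inflated each time, and the inflation grows with the spread of student ability, exactly the pattern described above.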

Concluding Remarks

There is substantial evidence that the quality of teachers is of great importance in children’s education. We must remember, however, the lessons brought home to us in the Coleman report (and replicated many times in the almost half-century since) that the effects of home life dwarf teacher effects, whatever they are. If a classroom is made up of students whose home life is filled with the richness of learning, even an ordinary teacher can have remarkable results. But, conversely, if the children’s homes reflect chronic lack, and the life of the mind is largely absent, the teacher’s task is made insuperably more difficult.

Value-added models represent the beginning of an attempt to help us find, and thence reward, the most gifted teachers. But, despite substantial efforts, these models are still not ready for full-scale implementation. I have tried to describe what I believe are the biggest challenges facing the developers of this methodology. I do this in the hope that once the problems are made explicit, others will add the beauty of their minds to the labor of mine and we may make some progress. But we must be quick, because the pressures of contemporary politics allow little time for extended reflection.

Marcel Proust likened aging to being “perched upon living stilts that keep on growing.” We can see farther, but passage is increasingly wobbly. This essay exemplifies Proust’s metaphor.

Some Hows and Whys of VAM

The principal claim made by the developers of VAM—William L. Sanders, Arnold M. Saxton, and Sandra P. Horn—is that through the analysis of changes in student test scores from one year to the next, they can objectively isolate the contributions of teachers and schools to student learning. If this claim proves to be true, VAM could become a powerful tool for both teachers’ professional development and teachers’ evaluation.
This approach represents an important divergence from the path specified by the “adequate yearly progress” provisions of the No Child Left Behind Act, for it focuses on the gain each student makes, rather than the proportion of students who attain some particular standard. VAM’s attention to individual students’ longitudinal data to measure their progress seems filled with common sense and fairness.

There are many models that fall under the general heading of VAM. One of the most widely used was developed and programmed by William Sanders and his colleagues. It was developed for use in Tennessee and has been in place there for more than a decade under the name Tennessee Value-Added Assessment System. It also has been called the “layered model” because of the way each of its annual component pieces is layered on top of another.

The model begins by representing a student’s test score in the first year, y1, as the sum of the district’s average for that grade, subject, and year, say μ1; the incremental contribution of the teacher, say θ1; and systematic and unsystematic errors, say ε1. When these pieces are put together, we obtain a simple equation for the first year:
y1 = μ1 + θ1 + ε1    (1)
or

Student’s score (1) = district average (1) + teacher effect (1) + error (1)
There are similar equations for the second, third, fourth, and fifth years, and it is instructive to look at the second year’s equation, which looks like the first except it contains a term for the teacher’s effect from the previous year:
y2 = μ2 + θ1 + θ2 + ε2    (2)
or
Student’s score (2) = district average (2) + teacher effect (1) + teacher effect (2) + error (2)

To assess the value added (y2 – y1), we merely subtract equation (1) from equation (2) and note that the effect of the teacher from the first year has conveniently dropped out. While this is statistically convenient, because it leaves us with fewer parameters to estimate, does it make sense? Some have argued that although a teacher’s effect lingers beyond the year the student had her/him, that effect is likely to shrink with time.
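Writing out the subtraction shows what is being assumed:

y2 – y1 = (μ2 – μ1) + θ2 + (ε2 – ε1)

The first-year effect θ1 cancels only because the model carries it forward at full strength into the second year; if it faded, part of it would remain in the gain.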

Although a model that lets the prior teacher’s effect shrink is less convenient to estimate, it mirrors reality more closely. But, not surprisingly, the estimate of the size of a teacher’s effect varies depending on the choice of model. How large this choice-of-model effect is, relative to the size of the “teacher effect,” is yet to be determined. Obviously, if it is large, it diminishes the practicality of the methodology.
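To make the choice-of-model effect concrete, the sketch below, my own illustration with made-up numbers, generates second-year scores under a partial-persistence model (last year's teacher effect fades by half) and then analyzes them as though the layered model were true.

import numpy as np

rng = np.random.default_rng(3)

lam = 0.5                    # assumed fading of last year's teacher effect
mu1, mu2 = 50.0, 55.0        # district averages
theta1, theta2 = 3.0, 1.0    # true teacher effects, years 1 and 2
n = 100_000                  # many students, only to average away student-level noise

e1 = rng.normal(0, 8.0, n)
e2 = rng.normal(0, 8.0, n)
y1 = mu1 + theta1 + e1
y2 = mu2 + lam * theta1 + theta2 + e2    # data follow the partial-persistence model

mean_gain = (y2 - y1).mean()

# The layered model attributes everything in the gain beyond the change in
# district averages to this year's teacher, so the fading of theta1 is
# charged to her account.
theta2_layered = mean_gain - (mu2 - mu1)

print(f"true theta2:            {theta2:.2f}")
print(f"layered-model estimate: {theta2_layered:.2f}")   # about theta2 + (lam - 1) * theta1

Here the layered model's estimate lands well below the true value: with a strong prior-year teacher whose effect fades, the bias runs against the current teacher.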

Recent research from the Rand Corporation, comparing the layered model with one that estimates the size of the change in a teacher’s effect from one year to the next, suggests that almost half of the teacher effect is accounted for by the choice of model.

One cannot partition student effect from teacher effect without information about how the same students perform with other teachers. In practice, using longitudinal data and obtaining measures of student performance in other years can resolve this issue. The decade of Tennessee’s experience with VAM led to a requirement of at least three years’ data. This requirement raises concerns when (i) data are missing and (ii) the meaning of what is being tested changes with time. 

Further Reading

Ballou, D., W. Sanders, and P. Wright. 2004. Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics 29(1):37–65.

Braun, H. I., and H. Wainer. 2007. Value-added assessment. In Handbook of statistics (volume 27) psychometrics, ed. C. R. Rao and S. Sinharay, 867–892. Amsterdam: Elsevier Science.

Bressler, M. 1992. A teacher reflects. Princeton Alumni Weekly 93(5):11–14. 

Coleman, J. S., et al. 1966. Equality of educational opportunity. Washington, DC: U.S. Office of Education. 

Mariano, L. T., D. F. McCaffrey, and J. R. Lockwood. 2010. A model for teacher effects from longitudinal data without assuming vertical scaling. Journal of Educational and Behavioral Statistics 35:253–279.

National Research Council. 2010. Getting value out of value-added. H. Braun, N. Chudowsky, and J. Koenig (eds.). Washington, DC: National Academy Press.

Rubin, D. B., E. A. Stuart, and E. L. Zanutto. 2004. A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics 29(1):103–116.

Sanders, W. L., A. M. Saxton, and S. P. Horn. 1997. The Tennessee value-added educational assessment system (TVAAS): A quantitative, outcomes-based approach to educational assessment. In Grading teachers, grading schools: Is student achievement a valid evaluation measure?, ed. J. Millman, 137–162. Thousand Oaks, California: Corwin Press, Inc.
