Wednesday, April 9, 2008

Use of test data

Today, the Brian Lehrer show attempted to address the question of the proper use of student achievement data in making tenure decisions for teachers. This was prompted by a NYT story from today's paper. 

Two important quotes from the story came from state legislation.
  1. Original, but scrapped language: "That section said teachers would be evaluated for tenure based on, among other things, an 'evaluation of the extent to which the teacher successfully utilized analysis of available student performance data.'"
  2. Final (?) language: "'The teacher shall not be granted or denied tenure based on student performance data.'"
I understand that on the surface this change seems a travesty, but it is not. Let me explain. 

To help you to understand the issue, I am going to give some other measurement examples that are likely more familiar to you.
  • Kitchen measuring cups come in two basic varieties: liquid (the transparent ones you read from the outside) and solid. You can use solid measuring cups for liquids, but it'll be a little inaccurate. At the same time, however, you really cannot use the liquid measuring cups for solids with any real accuracy. Note that even though the context (i.e. the kitchen) and the purpose (measuring and cooking food) are the same, different tools are needed to measure the different substances well. Especially note the asymmetry in substitution, in that one can come close to replacing the other, but not the other way around. Small changes in what you are trying to measure can necessitate a whole different tool to measure it well.
  • Most of us have a simple device to measure our own size in our homes: the bathroom scale. However, it only measures one aspect of size well; while it is a good tool to measure weight, it is a horrible tool to measure height. It's really not a good tool for width or depth, either. Or waist size. Or any number of other aspects of size. Of course, you could use it to measure height, a little bit. If you know that the person weighs 32 lbs, you've got a clue as to how tall they are. If you know that they weigh 180 lbs, you again have a clue. But in either case, you could be far off, depending on whether the person is thin or fat. And, of course, that assumes that you are measuring a person, instead of a dog or a box of books. Similar constructs, or different aspects of the same construct (e.g. size), are not necessarily measured the same way.
  • That bathroom scale has some other limitations. It assumes that it is being used in a particular context. If you take it to the moon to measure my identical twin, you will get quite a different answer. It is calibrated to earth gravity. Of course, that's usually not a problem. But if you work for NASA, it's something to keep in mind. Furthermore, while you can use your doctor's fancy scale on the moon, even it won't work in space. Context matters.
  • Bringing it back to the kitchen, some of us have food scales in our kitchens. We can't use them to measure people, and we can't use a bathroom scale to measure food for cooking. Though they each measure the same thing (i.e. weight), they are useful for different ranges and precision. My kitchen scale is 160 times as precise as my bathroom scale (1/20 ounce vs. 1/2 lb), but only goes up to 5 lbs. Precision matters, as does the range over which a tool can be accurate.
  • Another kitchen example is measuring flour. The best cooks measure it by mass rather than by volume, because flour volume is notoriously unreliable. (Think about sifted flour vs. unsifted, for example.) But flour is sensitive to moisture in the air. In a more humid environment, you use a little more flour, as whatever you measure will include some moisture that the flour absorbed from the air, and in a less humid environment, you need to use a little less flour. Environment matters.
  • One last kitchen example. We presume that if water is boiling, it is 212 degrees Fahrenheit (100 Celsius, of course). But at higher elevations this is no longer true, because of differences in air pressure. So, in Denver, the boiling point is 201F/94C. If you don't take this into account when calibrating your kitchen thermometer, your meat will be underdone and your cheesecake will be runny. Calibration is critically important.
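For anyone who wants to check the "160 times" figure in the food scale example, here is a small sketch of the arithmetic. It assumes that "precision" means the smallest increment each scale displays; the variable names are mine.

```python
from fractions import Fraction

OUNCES_PER_POUND = 16

# Smallest increment each scale can show, expressed in ounces.
kitchen_step_oz = Fraction(1, 20)                     # 1/20 ounce
bathroom_step_oz = Fraction(1, 2) * OUNCES_PER_POUND  # 1/2 lb = 8 ounces

# How many kitchen-scale increments fit inside one bathroom-scale increment.
precision_ratio = bathroom_step_oz / kitchen_step_oz
print(precision_ratio)  # 160
```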
All of these issues are apparent in achievement tests for students, the ones we use in our schools for NCLB and many other purposes, too.

There are many problems with our tests, and none of them are new. E.F. Lindquist wrote about them in 1951 in a book called Educational Measurement. Now let's be clear here, he was no opponent of standardized testing. I mean, this is the guy who invented the scantron machine! But he was concerned that people would misuse test results, or misunderstand their meaning. 

One of his biggest concerns, and I think that this one underlies virtually all of the big problems with testing today, is that people can get confused between the items being tested and the "construct" that the items are supposed to stand for. 

For example, I might give an essay test on a particular novel. As the teacher, I act under the assumption that the scores students get on the test represent students' writing ability, understanding of the novel or even ability to analyze literature in general. Maybe that's a fair assumption. Or, maybe if I used different questions, some students would do better on the test, and some would do worse. Perhaps the boys would do better if I asked about some aspects of the novel, and the girls better if I asked about other aspects. Perhaps some questions might be easier for students who had read a particular other novel in another class the previous year, and those who were not in the same class last year -- and did not read that novel -- would do worse.

Technically, this is called "person x item interaction," meaning some students do better on some items (i.e. questions), and some do better on others. This is not to say simply that some items are harder than others, but rather that different items are harder for different students. And when we confuse their performance on those items with their underlying ability, we are making a big mistake.
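To make the idea concrete, here is a toy sketch with made-up numbers. The students, questions and scores are all hypothetical, and subtracting the student and item averages is just one simple way to illustrate the interaction, not how psychometricians would formally estimate it.

```python
# Hypothetical scores (0-10) for three students on two essay questions.
scores = {
    "Student A": {"Q1": 9, "Q2": 5},
    "Student B": {"Q1": 5, "Q2": 9},
    "Student C": {"Q1": 7, "Q2": 7},
}

students = list(scores)
items = ["Q1", "Q2"]

grand_mean = sum(scores[s][q] for s in students for q in items) / (
    len(students) * len(items)
)
# Each student's overall level and each question's overall difficulty.
student_mean = {s: sum(scores[s][q] for q in items) / len(items) for s in students}
item_mean = {q: sum(scores[s][q] for s in students) / len(students) for q in items}

# What is left after removing the student's overall level and the
# question's overall difficulty is the person x item interaction.
interaction = {
    (s, q): scores[s][q] - student_mean[s] - item_mean[q] + grand_mean
    for s in students
    for q in items
}

for (s, q), value in interaction.items():
    print(f"{s} on {q}: {value:+.1f}")
```

In this made-up data every student has the same average and both questions are equally hard overall, yet Student A looks strong on Q1 and weak on Q2 while Student B is the reverse. Ask only Q1 and you would rank A above B; ask only Q2 and you would conclude the opposite.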

I won't get into the ways that our current testing system makes that problem more likely, at least not today. But clearly taking a test designed for one purpose and using it for another makes that far more likely. 


Now, to get back to the particular issues of the day. First, the change in the legislative language and the question of whether "performance data" (i.e. test scores) should be used to make tenure decisions. 

Our current tests are usually, at best, designed to measure current performance levels. But even Brian acknowledged that going by performance level would not be fair, because teachers working in low performing schools should not be punished for their commitment to work with the most needy students.

So, measure improvement in scores, year to year, right? Well, the tests are usually not designed to do that, and rarely do it well. Moreover, even if they were, recent studies have shown that lower income students lose more ground over the summer than higher income students. It's easy to imagine why, as higher income students are more likely to attend richer summer programs that build on their learning in schools. This is entirely beyond the control of schools and teachers, but if we measured year-over-year growth, it would make teachers in low income schools look weaker than those in higher income schools.
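Here is a toy sketch of how that plays out. The numbers are invented for illustration and do not come from any study; the only assumption is that the test is given once each spring, so the summer change and the school-year gain get lumped together.

```python
# Hypothetical scale-score changes; none of these numbers come from a real study.
classes = {
    "higher income school": {"summer_change": 0, "school_year_gain": 10},
    "lower income school": {"summer_change": -4, "school_year_gain": 10},
}


def spring_to_spring_growth(record):
    # A test given once each spring cannot separate what happened over
    # the summer from what happened in the classroom.
    return record["summer_change"] + record["school_year_gain"]


growth = {name: spring_to_spring_growth(record) for name, record in classes.items()}

for name, value in growth.items():
    print(f"{name}: measured growth = {value}")
```

Both hypothetical teachers produce the same 10 points of learning during the school year, but the measured growth comes out as 10 for one and 6 for the other.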

There are issues with using the same tests for low performing schools and Stuyvesant High School.

Should we expect students' math and reading scores to go up by the same amount in 7th grade as in 3rd grade, or are some years more critical than others?

What about the subjects for which we don't have high stakes tests? Do students perform differently on high stakes tests than low- or no-stakes tests?

These problems go on and on. Sure, there are answers to most of them, but they usually require more costly tests and analysis procedures. And some problems do not have answers.

But the real point, the biggest problem, and what I wanted to mention on Brian's show today, is that these tests are not known to be "instructionally sensitive." That is, none of them have been designed to differentiate good instruction from bad instruction. None of them have been validated for that. None of them have even been checked for that. We have no reason to believe that these tests -- or the individual items on the tests -- are capable of providing information that would allow us to identify good teaching or good teachers. Heck, we don't even yet know how to design items that are instructionally sensitive. This is what James Popham was talking about two weeks ago at the annual meeting of the American Educational Research Association. (He is a former president of the association.)

Sure, if all you have is a hammer, then everything looks like a nail. But that'll break the lightbulb and it's not going to do anything useful when you are trying to change a flat tire. And that is what we are talking about here. We do not have the tool to accomplish that, and until we do we need to back off.

If the state of New York wants to invest a couple of million dollars in such a research effort, out of the $20 billion it spends on education each year, perhaps that would be a good idea. The Department of Defense spends billions of dollars each year on the next generation of weapons and equipment, paying defense contractors to invent/develop them before using them in the field.


None of the previous section addresses the original language in the legislation and why it might be bad. But there are problems there, too, even though it doesn't call for teachers to be evaluated on student performance.

I am all for teachers using performance data to help guide their teaching. I think that teacher education programs should teach pre-service teachers how to make sense of such data. But I'm not convinced that it should yet be a factor in tenure decisions.

First, we better make damn sure that the tests are good and actually provide meaningful measures of student performance before we demand teachers make use of them day to day. The original language calls for teaching to the test. That is what it means. It means that teachers should take test data to guide their instruction so that students do better on the next test. 

At its worst, it means teaching how to take these tests, or even how to answer the kinds of problems that appear on the test, rather than focusing on the core lessons of the topic. It means narrowing application of material to how it appears on tests, rather than real world use that might not be able to appear on the test.

It means that the tests -- regardless of their quality -- will drive instruction, rather than the tests providing information about student performance, or even about instruction. It confuses the cart and the horse.

And then there's the bottom line. It is principals who are responsible for evaluating teachers. This language is about what principals should consider. Unfortunately, principals do not know how to do this stuff, either. Moreover, no one teaches principals the dangers and concerns in depending on potentially problematic tests that might be used inappropriately to draw conclusions that the tests are not capable of supporting. I do not blame principals for this; rather, I look to their preparation programs and their districts, both of which have failed to teach them this valuable material. But if they do not really understand it, why would anyone put them in a position to evaluate it?


But I am not just a hater. I am happy to recommend resources that might help principals and others learn more about this.

DataWise is a book about using assessment data to guide instruction. It addresses the immediate problem at hand, how principals and teachers can use test results.

More importantly, Measuring Up: What Educational Testing Really Tells Us is a brand new book about tests, testing and educational measurement in general. It is written for a lay audience, without all of the complex mathematics and statistics that underlie testing.

The credibility and expertise of the authors of both books are beyond reproach. These are not one-sided political screeds by any means. 
