We're Getting Odd Reading Results from Our Progress Monitoring Tests

  • Assessment
  • 24 September, 2017
  • 6 Comments

Teacher question:

  We are having an interesting conversation in our district. We currently give AIMSweb as a screening probe three times a year. One of the school psychologists pointed out that for the last several years the first graders seem to do better in the fall than in the spring on nonsense word fluency. When we look at measures of comprehension and fluency using other measures we do not see a decline. Is there any research out there that might help us understand what we are seeing and whether or not this is a serious issue?

 

Shanahan responds:

  What you describe is a common experience with AIMSweb and other progress monitoring tests. And, the more often you re-test, the more often you’ll see the problem. (Thank goodness you are only trying to test the kids three times a year.)

  I could find no studies on the nonsense word portion of AIMSweb. But every test has a standard error of measurement (SEM).

  The standard error gives an estimate of how much test scores will vary if the test is given repeatedly. Tests aren’t perfect, so if someone were to take the same test two days in a row, the score would not be likely to be the same.

But how much could someone learn (or forget) in one day? That's the point.

  SEM tells you how much change the test score is likely to undergo even if there were no significant opportunity for learning or forgetting. It is not a real change in reading ability, but variance due to the imprecision of the measurement.
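For readers who want to see the arithmetic, the SEM is conventionally estimated from a test's reliability coefficient and the standard deviation of its scores. Here is a minimal sketch of that classical-test-theory formula; the numbers are hypothetical, not AIMSweb's actual statistics:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical values: a score standard deviation of 12 and a
# reliability coefficient of .85
sem = standard_error_of_measurement(sd=12.0, reliability=0.85)

print(round(sem, 2))         # the SEM, in score points
print(round(1.96 * sem, 2))  # half-width of a 95% confidence band around one score
```

The higher the test's reliability, the smaller the SEM; a perfectly reliable test (reliability = 1.0) would have an SEM of zero.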

  Schools tend to pay a lot of attention to the standard error with their state test scores (the so-called “wings” around your school or district average scores). If your school gets 500 in reading on the state test, but the standard error is + or – 5, then we can’t be sure that you did any better than the schools that got 495s, 496s, 497s, 498s, and 499s. Your score was higher, but within the standard error we can’t tell whether your kids actually outperformed those schools.

  When you calculate the SEM for a school or district score, it will tend to be small because of the large numbers of students whose scores are being averaged. However, when you are looking at an individual’s score, such as when you are trying to find out how much improvement there has been since the last time you tested, SEMs can get a lot bigger.   

  Unfortunately, schools pay less attention to SEMs with screening or progress monitoring tests than they do with accountability tests.

  Nevertheless, AIMSweb has a standard error of measurement. So do all the other screeners out there.

  That means when you give such tests repeatedly over short periods of time (say, at intervals of less than 15 weeks), you’ll end up with unreliability affecting some percentage of the students’ scores.

  I’d love to blame AIMSweb for being particularly bad as a predictor test. That would sure make it easy to address your problem: “Lady, you bought the wrong test. Buy the XYZ Reading Screener and everything is going to be fine. You’ll see.”

  In fact, studies suggest—at least with oral reading fluency—that if anything AIMSweb has particularly small standard errors of measurement (Ardoin & Christ, 2009).

  But even with that, you’ll still find changes in scores that make no sense. Say John got a 49 when you tested him early in the school year. I couldn’t find an SEM for the AIMSweb nonsense word test, but let’s say that to be 95% certain one score is really higher than another, the two scores would have to differ by more than 10 points. Thus, if on retesting you find that his score is 45, it looks like a decline, but what it really means is that John’s score isn’t measurably different than before.

  Teachers usually like knowing that: what looked to be a decline is just test noise.

  They usually aren’t quite as happy with the idea that if John goes from 49 to 58 on that test, the change is too small to conclude that any real progress was made. Changes that fall within the standard error of measurement are not actually changes at all.

  Since I can’t recommend shifting to some other comparable measure (e.g., DIBELS, PALS, CBM) that would necessarily be any more precise, I think what you are doing—comparing the results with those derived from other measures—is the best antidote.

  If you see a decline in AIMSweb scores, but no comparable decline in other tests that you are giving, I’d conclude that there was probably not a real decline. I would then monitor that student more closely during instruction just to be sure.

  On the other hand, if the score decline is confirmed by your other tests, then I would try to address the problem through instruction—giving the youngster greater help with the skill in question.

  Contact your test publisher and ask for the test’s standard errors of measurement. Those statistics will help you to better interpret these test scores. In fact, without that kind of information I'm not sure how you are making sense of these data. 

  The problem here: You are expecting too high a degree of accuracy from your testing regime. Give the tests. Use the tests. But don’t trust these changes, up or down, to be any more accurate than the standard errors suggest they can be.

Reference

Ardoin, S.P., & Christ, T.J. (2009). Curriculum-based measurement of oral reading: Standard errors associated with progress monitoring outcomes from DIBELS, AIMSweb, and an experimental passages set. School Psychology Review, 38, 266-283.

 

Comments


Karen
Sep 26, 2017 03:08 PM

Our district currently uses both Aimsweb and STAR (Renaissance Learning) at the grade school and just STAR at the middle school (where I currently teach). One thing we've noticed (especially in grades 5 and up) is that our scores are pretty consistent for our highest and lowest readers, but many of our middle readers have score graphs that look like roller coasters. In October, their scores identify them as well above grade level, but in December, they're identified as in need of intervention (but we don't see drops in academic performance). I'm sure some of this is motivation, but it also makes us question the validity of the test. STAR is a computer-adaptive test, so I wonder: if kids miss questions early on in the test, does that make it harder for them to score well? What are your thoughts on this? I've wondered about asking if we could switch from STAR to Aimsweb at the middle school level because STAR just seems so inconsistent.

Tim Shanahan
Sep 27, 2017 02:29 AM

Karen-
Comprehension tests definitely depend on the kids trying. If the scores are as frequently inconsistent as you describe, they cannot possibly be useful. Either find some ways to get the kids to try (sometimes this takes no more than a pep talk) or, as you say, switch to another measure.

Anonymous
Sep 29, 2017 09:43 AM

Thanks for your insights on this topic.

Our school is progress monitoring kindergarten students showing risk on Aimsweb weekly.
Are you suggesting that this practice is unreliable and should be discontinued?



Tim Shanahan
Sep 29, 2017 11:13 AM

Anonymous--
Indeed, I am. That amount of testing is not justified. The standard error of the test is large enough that children cannot be expected to make that amount of gain in a week. You can't tell if the changes that you see are due to learning or to unreliability of measurement. No research supports the practice of such frequent testing. Use this time to teach kids to read.

Erin Gaston
Oct 03, 2017 12:57 PM

I was looking at these roller coaster trends as well as "drops" in scores from fall to spring on our CBM measures, especially in our 5th grade class district-wide. About 90% of the students did worse in spring than fall. I figured 90% of our 5th graders were unlikely to perform worse on a measure than the last time after a year of instruction, so I took the test myself and asked 3 other teachers to take it as well. We found that the test was incredibly poorly written. Some questions had multiple correct answers, with direct textual evidence, yet only one was deemed correct. One CBM's correct answer for one question was completely wrong. Finally, the test passage itself was incredibly prejudicial in its content. We were appalled at the "moral" of the story, which was that you should be grateful for your food and obey your parents or you might end up like the homeless people in the shelter where the kid was forced to volunteer.

My concern with some of these adaptive and computer-based measures is that we're not allowed to see them in full. We've been using this particular measure for years, yet no teacher had ever analyzed it, even though it is readily available to us in full. Yet we send kids off to take tests we can't see or analyze and assume that the results are valid or worthy. It's made very clear to us during state testing (which I know is different from the topic here) that we can lose our licenses for reading questions on the kids' screens, making note of the types of questions being asked, or discussing the material on the measures. It defies logic, in my opinion.

Mark R Shinn, Ph.D.
Nov 06, 2017 08:23 PM

Note that I am a consultant for Pearson's aimsweb product. I'd like to think it's because I know something about assessment, including basic skill progress monitoring and screening. Note that this question is about NWF. I have had long-standing concerns about it, largely because, like other tests subject to Campbell's law, it is vulnerable to "corruption pressure." That is, what gets measured and what takes on importance has a tendency to get corrupted. With respect to NWF, too often I've seen flashcard drill with nonsense words--in the absence of phonics instruction--and that bothers me. The nonsense words get treated as another sort of sight word. Additionally, I've found that Letter Sounds correlates with other reading measures as well as or better than NWF--and if, in the absence of phonics instruction, teachers teach Letter Sounds, I'm not as uncomfortable with that corruption.

My hypothesis is that the declines in NWF reflect the fact that there is little utility in repeatedly administering NWF to Grade 1 students. It is plausible that NWF ceilings out--students should obtain fairly high NWF scores in the FALL of Grade 1. At the least, I'd like them to be reading authentic, highly decodable text pretty well if they've had any reading instruction in K. But IF I use NWF, it is as a Fall Grade 1 screener. That's it. I wouldn't repeat it in Grade 1 Winter and certainly not in Grade 1 Spring. Shouldn't students be showing growth--and dramatic growth--in reading "real words" in connected text? Another plausible hypothesis is motivation. I would hope that kids might be more interested in reading pseudo-words when they can't read real words; once they can read real words, what's the point of reading nonsense words?

One of my pet peeves--especially with young students--is that too many educators think a screener needs to be repeated as a progress monitoring test. We know the most about a simple, short, scientifically sound measure of oral reading. THAT should be assessed regularly--but not frequently--for typically developing kids, and more frequently for students who are receiving intensive intervention. But as with Letter Names and PSF--which I find more useful as a longer diagnostic test than as a screener or PM test--I don't recommend using NWF as a progress monitoring test. Falling in love with 3 times per year without thinking only diverts time away from instruction and inadvertently turns some educators off on data.

What are your thoughts?
