Teacher question:
We are having an interesting conversation in our district. We currently give AIMSweb as a screening probe three times a year. One of the school psychologists pointed out that for the last several years the first graders seem to do better in the fall than in the spring on nonsense word fluency. When we look at measures of comprehension and fluency using other measures, we do not see a decline. Is there any research out there that might help us understand what we are seeing and whether or not this is a serious issue?
Shanahan responds:
What you describe is a common experience with AIMSweb and other progress monitoring tests. And, the more often you re-test, the more often you’ll see the problem. (Thank goodness you are only trying to test the kids three times a year.)
I could find no studies on the nonsense word portion of AIMSweb. But every test has a standard error of measurement (SEM).
The standard error gives an estimate of how much test scores will vary if the test is given repeatedly. Tests aren’t perfect, so if someone were to take the same test two days in a row, the score would not be likely to be the same.
But how much could someone learn (or forget) in one day? That is exactly the point.
SEM tells you how much change the test score is likely to undergo even if there were no significant opportunity for learning or forgetting. It is not a real change in reading ability, but variance due to the imprecision of the measurement.
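If it helps to see the arithmetic, here is a minimal sketch (in Python, with made-up numbers rather than any published AIMSweb statistics) of how an SEM is typically derived from a test's reliability and how it turns into a band around an observed score:

```python
import math

# Hypothetical values for illustration only -- not AIMSweb statistics.
score_sd = 12.0       # standard deviation of scores on the test
reliability = 0.90    # test-retest or alternate-form reliability

# Classical test theory: SEM = SD * sqrt(1 - reliability)
sem = score_sd * math.sqrt(1 - reliability)

# A 95% confidence band is roughly +/- 1.96 SEMs around the observed score.
observed = 49
half_band = 1.96 * sem
print(f"SEM = {sem:.1f}")
print(f"95% band around {observed}: {observed - half_band:.1f} to {observed + half_band:.1f}")
```

The less reliable the test, the wider that band gets, and the more "change" you will see that is really just measurement noise.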
Schools tend to pay a lot of attention to the standard error with their state test scores (the so-called “wings” around your school or district average scores). If your school gets 500 in reading on the state test, but the standard error is + or – 5, then we can’t be sure that you did any better than the schools that got 495s, 496s, 497s, 498s, and 499s. Your score was higher, but because those schools’ scores fall within the standard error, we can’t tell whether your kids actually outperformed theirs.
When you calculate the SEM for a school or district score, it will tend to be small because of the large numbers of students whose scores are being averaged. However, when you are looking at an individual’s score, such as when you are trying to find out how much improvement there has been since the last time you tested, SEMs can get a lot bigger.
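A rough way to see why averaging helps: with independent measurement errors, the error around a group average shrinks with the square root of the number of students being averaged. The sketch below reuses the hypothetical individual SEM from the example above, not an AIMSweb figure.

```python
import math

# Hypothetical individual SEM (from the earlier sketch), not an AIMSweb statistic.
sem_individual = 3.8

# The error around an average of n students shrinks by a factor of sqrt(n).
for n in (1, 25, 100, 400):
    error_of_mean = sem_individual / math.sqrt(n)
    print(f"averaging {n:3d} students: error of the mean is about {error_of_mean:.2f} points")
```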
Unfortunately, schools pay less attention to SEMs with screening or progress monitoring tests than they do with accountability tests.
Nevertheless, AIMSweb has a standard error of measurement. So do all the other screeners out there.
That means when you give such tests repeatedly over short periods of time (say less than every 15 weeks), you’ll end up with unreliability affecting some percentage of the students’ scores.
I’d love to blame AIMSweb for being particularly bad as a predictor test. That would sure make it easy to address your problem: “Lady, you bought the wrong test. Buy the XYZ Reading Screener and everything is going to be fine. You’ll see.”
In fact, studies suggest—at least with oral reading fluency—that if anything AIMSweb has particularly small standard errors of measurement (Ardoin & Christ, 2009).
But even with that, you’ll still find changes in scores that make no sense. Say John got a 49 when you tested him early in the school year. I couldn’t find an SEM for the AIMSweb nonsense word test, but let’s say that to be 95% certain one score is really different from another, the two scores would have to differ by more than 10 points. Thus, if on retesting you find that his score is 45, it looks like a decline, but what it really means is that John’s score isn’t any different than before.
Teachers usually like knowing that; what looked to be a decline is just test noise.
They usually aren’t quite as happy with the idea that if John goes from 49 to 58 on that test, the change is still too small to conclude that any real progress was made. Changes that are within the standard error of measurement are not actually changes at all.
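Here is a minimal sketch of that decision rule, using the hypothetical 10-point margin from the example above rather than a published AIMSweb SEM:

```python
def real_change(fall_score: float, later_score: float, band: float = 10.0) -> str:
    """Label a score change as a likely real change or as measurement noise.

    `band` is the margin needed for 95% confidence that two scores differ;
    the 10-point value here is the hypothetical figure from the post, not a
    published AIMSweb statistic.
    """
    change = later_score - fall_score
    if abs(change) <= band:
        return f"change of {change:+.0f} is within +/-{band:.0f}: treat as noise"
    return f"change of {change:+.0f} exceeds +/-{band:.0f}: likely a real change"

print(real_change(49, 45))   # apparent decline, but within the error band
print(real_change(49, 58))   # apparent gain, also within the error band
print(real_change(49, 62))   # large enough to take seriously
```

In practice, you would get the actual margin from the publisher’s SEM, as suggested further below.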
Since I can’t recommend shifting to some other comparable measure (e.g., DIBELS, PALS, CBM) that would necessarily be any more precise, I think what you are doing—comparing the results with those derived from other measures—is the best antidote.
If you see a decline in AIMSweb scores, but no comparable decline in the other tests that you are giving, I’d conclude that there was probably not a real decline. I would then monitor that student more closely during instruction just to be sure.
On the other hand, if the score decline is confirmed by your other tests, then I would try to address the problem through instruction—giving the youngster greater help with the skill in question.
Contact your test publisher and ask for the test’s standard errors of measurement. Those statistics will help you to better interpret these test scores. In fact, without that kind of information I'm not sure how you are making sense of these data.
The problem here: You are expecting too high a degree of accuracy from your testing regime. Give the tests. Use the tests. But don’t trust these changes, up or down, to always be accurate—at least no more accurate than the standard errors suggest that they should be.
Reference
Ardoin, S.P., & Christ, T.J. (2009). Curriculum-based measurement of oral reading: Standard errors associated with progress monitoring outcomes from DIBELS, AIMSweb, and an experimental passages set. School Psychology Review, 38, 266-283.
Our district currently uses both Aimsweb and STAR (Renaissance Learning) at the grade school and just STAR at the middle school (where I currently teach). One thing we've noticed (especially in grades 5 and up) is that our scores are pretty consistent for our highest and lowest readers, but many of our middle readers have score graphs that look like roller coasters. In October, their scores identify them as well above grade level, but in December, they're identified as in need of intervention (but we don't see drops in academic performance). I'm sure some of this is motivation, but it also makes us question the validity of the test. STAR is a computer adaptive test, so I wonder if kids miss questions early on in the test if it makes it harder for them to score well? What are your thoughts on this? I've wondered about asking if we could switch from STAR to Aimsweb at the middle school level because STAR just seems so inconsistent.
Karen-
Comprehension tests definitely depend on the kids trying. If the scores are as frequently inconsistent as you describe, they cannot possibly be useful. Either find some ways to get the kids to try (sometimes this takes no more than a pep talk) or, as you say, switch to another measure.
Thanks for your insights on this topic.
Our school is progress monitoring kindergarten students showing risk on Aimsweb weekly.
Are you suggesting that this practice is unreliable and should be discontinued?
Anonymous--
Indeed, I am. That amount of testing is not justified. The standard error of the test is larger than the amount of gain children can be expected to make in a week, so you can't tell whether the changes you see are due to learning or to unreliability of measurement. No research supports the practice of such frequent testing. Use this time to teach kids to read.