Chris Domaleski had a problem and its name was Andrew Lloyd Webber.
Question 42 on Georgia’s sixth-grade social studies test had asked whether Webber was a playwright, painter, sculptor or athlete.
The famous composer of Broadway musicals, however, was none of those things. But what should Domaleski, the state’s testing director, do?
Testing was over. Scrapping the question would delay test results at least 10 days, inviting complaints about one of the state’s most politically sensitive undertakings. Rushing a re-scoring would also heighten the chance of error. Yet counting the question would mean penalizing tens of thousands of students for someone else’s mistake.
Domaleski’s predicament illustrates the cascade of problems flawed questions cause when they slip past layers of review and appear on standardized exams.
Quality-control breakdowns have become near commonplace on the state tests taken in public schools across the country, The Atlanta Journal-Constitution found. Faulty tests undermine reforms seeking to rescue American schools and risk harming them instead. The exams have grown so critical that test companies’ errors can — and have — cost children something as valuable as a diploma.
In a year-long national investigation, the newspaper examined thousands of pages of test-related documents from government agencies – including statistical analyses of questions, correspondence with contractors, internal reports and audits.
The examination scrutinized more than 100 testing failures and reviewed statistics on each of nearly 93,000 test questions given to students nationwide.
The reporting revealed vulnerabilities at every step of the testing process. It exposed significant cracks in a cornerstone of one of the most sweeping pieces of federal legislation to target American schools: The No Child Left Behind Act of 2001.
Test-based accountability began its march across the nation’s classrooms more than a decade ago. Yet no one in that time, the newspaper found, has held the tests themselves accountable.
While lawmakers pumped up the repercussions of lagging scores, schools opened exam booklets to find whole pages missing. Answer-sheet scanners malfunctioned. Kids puzzled over nonsensical questions. Results were miscalculated, again and again.
Most tests are fine, test company executives and state officials point out. They also say the field is working to improve itself.
Yet mishaps continue to disrupt tests and distort scores, the newspaper found, and damage-control judgment calls like Domaleski’s lurk at every turn. The vast majority of states have experienced testing problems – some repeatedly.
“If someone hasn’t had an error,” said Robert Lee, chief analyst of Massachusetts’s testing program, “either they’re extremely lucky or they haven’t reported it.”
Even as testing companies received public floggings for errors, lawmakers and education officials failed to address why the tests were derailing or how government contributed to breakdowns.
Some industry executives acknowledge their immense challenges, which include an unprecedented volume of test-takers and demanding federal and state timelines for reporting scores.
Those deadlines have sometimes left testing contractors without enough time to figure out why something didn’t look right, said Stuart Kahl, co-founder of testing company Measured Progress.
“So we had to go for it,” he said. “That’s not a situation you want to be in.”
At the same time, cash-strapped states have struggled to hire and retain staff to provide oversight. Too often, they leave contractors to police themselves, some experts say.
“To be asking testing contractors to ensure the quality of their own work is to put the testing companies in an unfair position,” said Thomas Toch, of the Carnegie Foundation for the Advancement of Teaching, “and to not provide taxpayers and the public with the level of scrutiny they deserve.”
The consequences for kids — nearly all the 22 million in grades 3 through 8 alone take the tests yearly — have never been greater.
Tests drive decisions about who wins a scholarship, enters a coveted gifted program, attends a magnet school, or moves to the next grade. Teachers and principals lose their jobs because of bad scores. A school tarred with them can attract a state takeover.
Jake Crosby worried his future hung in the balance when he received a lower score than he had expected on Connecticut’s high school reading test in 2005.
“I literally spent the entire year after this thinking, ‘Should I retake this portion of this test?’” Crosby recalled. “You’re thinking ‘What are colleges going to think about this? How will they perceive this?’”
He was stunned to learn as a junior the following spring that contractor CTB/McGraw-Hill had misreported his score – and that of 354 other students. He had done well on the test. His confidence in the exams was shaken.
“I am sitting there thinking,” he said, “…it could happen once, it could happen again.”
Advocates of test-based reform say the exams offer scientific precision. But records show thousands of students have faced questions that failed to meet basic industry standards.
Some questions had no right answer option, or more than one right answer. Wording was unclear, or covered material never taught. A few questions have bordered on bizarre – such as the now-infamous passage on a New York test about a race between a hare and a pineapple.
Each testing season, new complaints crop up. Writing good questions, it turns out, is deceptively difficult.
“It can take an expert all day to write a few items correctly,” Lee said. “Unfortunately, a lot of people try to write a lot of items incorrectly.”
The question-creation process begins with states’ blueprints for what teachers should teach. Testing contractors turn the guides into a battery of questions that aim to measure what students know.
Contractors either write questions themselves or hire freelancers who are paid by the hour or item. Advertisements on job boards such as Craigslist seek “test development” writers to meet what has become an almost insatiable appetite for new items.
States and test companies say they typically submit potential test questions to multiple reviews by committees of staff and educators. They look for problems such as bias, a lack of clarity or inaccuracy. Editing is extensive.
But records reviewed by the newspaper show quality checks meant to keep bad questions off students’ desks continue to falter.
In 2008 alone, records show, Mississippi dropped five questions and Georgia dropped three because of flaws such as no right answer that were discovered after testing ended.
The following year, quick work by Mississippi staff narrowly averted the need to remove another nine items before testing began. The contractor hadn’t even fixed a typo the state had flagged the year before.
Overall, the newspaper found potential problems with blocks of questions on nearly one in 10 tests given across the country in recent years.
In reality, no rules govern how extensively states must review questions. Some states, facing time and money constraints, have scaled back, records show. Gathering teacher feedback can be costly: Texas, for instance, shaved $2.74 million off its testing contract by curtailing educator reviews of questions after pilot, or “field,” tests were conducted.
Even with multiple reviews, errors can be introduced during edits of words or graphics. Sometimes, the supposed content experts who review questions don’t have the right expertise – or are ignored. More fundamentally, questions may not reflect what’s taught in schools.
Multiple failures occurred before the first Georgia student pondered the Andrew Lloyd Webber question, e-mails show.
For one, the answer to the question didn’t focus on what teachers following the state’s standards had taught – that Webber’s contribution to the arts was in the area of music.
And a teacher review of the question didn’t help: In fact, it appeared to make matters worse.
“This is an item that was rewritten by teachers at the Jan. 07 item review,” a CTB employee wrote in a May 2008 e-mail to Georgia officials defending the question. “Perhaps teachers thought composer was too difficult a word for their students … and they thought what their students would know is that Webber created musical theater/plays, which is why they chose the word playwright.”
After learning of the problem, Domaleski, the former testing director, ordered the question dropped from scoring in a terse e-mail. But he relented after further discussion with CTB. It would count, giving credit to those who chose “playwright.” The nearly two of every three test-takers who didn’t would lose points for answering a bad question wrong.
“Although I do not feel it is an example of the high quality of items we expect from CTB McGraw Hill, I am influenced by information that suggests there is a continuum of defensible positions about this item…” Domaleski wrote in an e-mail to the contractor. “That said, I don’t want that item to show up again.”
Domaleski now says he made the best call he could amidst competing pressures. But, he says, he doesn’t defend the question or the decision.
CTB declined to discuss the matter.
The debate over the question proved a harbinger of trouble. A week later, score projections signaled massive failures on that test and a second. Barely a quarter of students passed social studies in either sixth or seventh grade.
The tests were new and students and teachers were apparently surprised by what was on them. The state had ignored abysmal statistics from field tests the year before and pushed ahead. Many educators felt betrayed.
“It appears that no one is listening at all!” former Dougherty County Superintendent Sally Whatley wrote in an e-mail after what she called an “EXTREMELY frustrating” conference call with the state.
“To consider that the Social Studies tests are valid,” she wrote, “is absurd!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!”
Two days later, state officials announced they would throw out the results of both tests, saying the state had failed to prepare teachers adequately. The time and public money spent on the exams was all for naught.
Poorly worded questions have continued to bedevil states since they ramped up testing programs in 2006 to satisfy No Child Left Behind mandates.
Just last year, national headlines carried news of the controversy in New York over the pineapple reading passage. Parents and teachers called it absurd. Contractors had made other errors, too, affecting the scoring of more than a dozen additional questions.
But while state testing officials often scurry to minimize the harm flawed questions do, in some cases complaints about them meet resistance.
Jonathan Halabi, a math teacher at a specialized public high school in the Bronx, said teachers have regularly complained to the state about the accuracy and clarity of the state’s math “Regents” exams. Students must pass one test in that subject to graduate.
“We open up the exams and start calling,” said Halabi, who estimated that teachers find poor questions on at least one in three exams. State officials, however, he said, “don’t want to back down.”
In an e-mailed response, state education spokesman Jonathan Burman said mistakes happen less frequently than Halabi alleged. “All calls are reviewed with care, diligence, and urgency,” he wrote.
A scientist who found numerous errors in Florida’s guide for science test-question writers said he found state officials there initially recalcitrant. He prepared for battle.
Robert Krampf, who runs the website “The Happy Scientist,” said he was astounded last year when he read incorrect sample questions and a “Science Item Writer Glossary” rife with errors.
The glossary defined “germination” as “the process by which plants begin to grow from seed to spore or from seed to bud.” Krampf noted in an e-mail to state officials: “There are NO plants that grow from seed to spore.”
The glossary called a “predator” “an organism that obtains nutrients from other organisms.” Krampf wrote: “According to that definition, cows, grass and mushrooms are all predators.”
A math and science test coordinator wrote back that the state would “take a look” at Krampf’s concerns. But the official also defended the document. The glossary had been “checked against several sources,” he wrote, and reviewed by science teachers.
Krampf was unsatisfied. He kept pushing until state officials agreed to correct the guidelines.
But he said he couldn’t get a complete answer when he asked about a problematic question he found on an actual science test from 2007 – one of the few old tests Florida has released.
Florida officials said in a written statement that they stand behind their processes for developing tests but are open to refinements as needed. “We really did bend over backwards to accommodate Mr. Krampf,” said Cheryl Etters, a department spokeswoman.
Most questions may be adequate, as industry representatives claim. But for hundreds, if not thousands, of students across the country each year, just one or two questions will make the difference between passing and failing a critical exam.
Parents and educators are hard-pressed to check the quality of what’s on the tests, since secrecy enshrouds them.
Yet contractors typically calculate statistics to track patterns in student responses. Such patterns can reveal flaws potentially as serious as a question with two right answers.
Statistical analyses of more than 1,700 tests given over two years in 42 states and Washington, D.C., suggest a patchwork of problem questions stretching from coast to coast.
For almost 9 percent of tests, one in every 10 questions or more showed signs of potential flaws, technical reports examined by the AJC revealed. Most states gave at least one test in recent years with blocks of suspect questions – threatening the tests’ overall quality and raising questions about fairness.
In a few cases, the review found, states gave exams rife with questionable questions year after year.
Some states are vigilant and drop flawed questions discovered after testing – though doing so risks delaying scoring. Other states appear to ignore the problems.
“I think that’s just the bottom line,” said Matthew Johnson, a professor at Teachers College, Columbia University, in New York City who advised the AJC on this project. In some states, he said, “there is no quality control, or very little.”
About 45 percent of the exams given over two years in West Virginia were thick with potentially poor questions, the statistics showed.
The lapses were revealed through an industry metric commonly called “discrimination.” A question’s discrimination considers whether students who did well on the rest of the test are clearly more likely to get a question right. If not, and if low-performers aren’t clearly less likely to answer correctly, that could signal a problem.
Such questions should always be reviewed and, in many cases, revised or thrown out, said testing experts such as Johnson.
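The discrimination check described above can be sketched as a short computation. This is a simplified illustration, not any contractor’s actual pipeline: it computes the corrected point-biserial correlation between whether each student answered an item correctly and that student’s score on the rest of the test. A healthy item correlates positively with the rest-score; a negative value is the “nonsensical” pattern experts describe.

```python
from statistics import mean, pstdev

def discrimination(item_correct, total_scores):
    """Point-biserial discrimination: correlation between getting
    this item right (1) or wrong (0) and the score on the rest of
    the test (the "corrected" item-total correlation)."""
    # Subtract the item itself from each student's total score
    rest = [t - c for t, c in zip(total_scores, item_correct)]
    mx, my = mean(item_correct), mean(rest)
    sx, sy = pstdev(item_correct), pstdev(rest)
    if sx == 0 or sy == 0:
        return 0.0  # item answered uniformly; correlation undefined
    cov = mean((c - mx) * (r - my) for c, r in zip(item_correct, rest))
    return cov / (sx * sy)

# A healthy item: the higher scorers tend to get it right (positive)
good = discrimination([0, 0, 1, 1, 1], [1, 2, 3, 4, 5])
# A flawed item: the lowest scorers get it right instead (negative)
bad = discrimination([1, 1, 0, 0, 0], [1, 2, 3, 4, 5])
print(round(good, 2), round(bad, 2))
```

With real data the same calculation runs over thousands of students per item, but the sign and size of the result carry the same meaning.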
In West Virginia’s case, the most alarming sort of discrimination statistic – one that is negative – showed up repeatedly, state records reveal. The statistic showed that the lowest performing students were more likely to get questions right than top performers.
Eleven questions with negative discriminations appeared on tests in the 2011 and 2012 school years. Such a statistic is “nonsensical,” said Gary Cook, a former testing company executive and state testing director now at the University of Wisconsin.
In a written response, West Virginia officials defended their tests, which they said have grown more difficult in recent years. They said they take precautions to make sure answers are coded properly and questions match curriculum. They continually review items.
But they did not explain why their state gave so many more tests full of poor-discrimination questions than other states.
Even some states with aggressive test-driven accountability systems had problem tests, the analyses showed. In Louisiana, one in five of the state’s tests in 2011 and 2012 had high levels of questions with poor statistics.
Most concerning were the results of the state’s high-stakes graduation tests. More than one of every 10 questions was likely problematic on high school English tests in 2011 and 2009, and science tests in 2009 and 2010.
Parents C.C. Campbell-Rock and her husband, Raymond Rock, said they weren’t surprised.
The New Orleans couple was among a group that filed an unsuccessful lawsuit more than a decade ago to stop the state’s high-stakes testing program. They accused the state of failing to teach students the material it then punished them for not knowing.
“The questions were weird,” Campbell-Rock said. “We found the curriculum our children were being taught did not match the curriculum of the new high-stakes test.”
In written responses to questions, Louisiana officials said they carefully vet test questions. They said they follow industry guidelines to screen out bad questions and conduct multiple reviews.
Officials acknowledged the state in the past offered school districts guidance on curriculum that did not always align with the tests. But they said those problems have been fixed.
In Kansas, officials told the newspaper they had not run discriminations in years. Their reason: They wanted to keep the test the same to make it easier to compare results from year to year.
Johnson scoffs at that explanation.
“To say ‘I don’t have to look at them at all’ is silly,” he said. “Because items that were great five years ago or eight years ago may not be so great anymore.”
Ultimately, states and testing companies have little excuse for failing to weed out flawed questions, Johnson said. Analysis that can quickly highlight problems is cheap and straightforward.
“This is so easy to do,” he said.
Connecticut’s former Education Commissioner, Betty Sternberg, said catching errors may be hard for states facing staffing shortages and other obstacles. But it’s indispensable.
“If you’re going to do something like this that affects kids and districts,” she said, “you’d darn well better invest in the quality control first.”
This newspaper had reported extensively on cheating, but hadn’t dug deeply into a more basic question: Just how good are the exams used to make critical decisions in schools?
No one, it turned out, had documented testing errors’ scope, causes or consequences since the 2001 No Child Left Behind Act.
The newspaper requested documents on testing from all 50 states and the District of Columbia, conducted scores of interviews and reviewed news stories and federal reports.
The newspaper also reviewed the statistics for more than 90,000 test questions given on roughly 1,700 tests in 42 states and Washington, D.C. Reporter Heather Vogell worked with Teachers College testing expert Matthew Johnson, who is also the editor of the Journal of Educational and Behavioral Statistics.
The newspaper calculated the percentage of multiple-choice questions on each test with statistics below industry standards, using the metric “discrimination” (typically the “point biserial” or “item-total correlation”). The calculation gauges how well a question distinguishes between students who know the subject matter tested and those who don’t.
Low-discrimination questions should often be revised or thrown out, experts say. At best, they provide educators scant information because they fail to separate better performers from worse ones. They can also signal a problem such as two right answer options.
A few low-discrimination questions are sometimes necessary to cover key topics, but blocks aren’t good practice. Most states had at least one exam with high numbers of suspect questions. Some had more.
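The screening step the analysis describes can be sketched in a few lines. The 0.15 discrimination floor and the one-in-ten flag level here are illustrative assumptions, not the newspaper’s published thresholds:

```python
# Flag tests where too many items fall below a discrimination floor.
# The 0.15 cutoff and the 10% flag level are illustrative assumptions;
# the actual thresholds used in the analysis aren't specified here.
LOW_DISC = 0.15

def suspect_share(discriminations):
    """Fraction of a test's items with sub-standard discrimination."""
    low = sum(1 for d in discriminations if d < LOW_DISC)
    return low / len(discriminations)

# Hypothetical per-item discrimination values for one test
test_items = [0.42, 0.31, 0.08, -0.05, 0.27, 0.38, 0.12, 0.45, 0.33, 0.29]
share = suspect_share(test_items)
print(f"{share:.0%} of items below the floor")  # 3 of 10 items here
if share >= 0.10:
    print("Flag: one in ten or more items shows signs of potential flaws")
```

Repeating this per test, per state, yields the kind of tally reported above, such as the share of a state’s exams that were “thick with potentially poor questions.”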
Vogell received the prestigious Spencer Education Fellowship at the Columbia University Graduate School of Journalism in New York City, where she did most of the reporting. Also contributing to the data review was Kate Fink, a Ph.D. candidate at the school.