Of That

Brandt Redd on Education, Technology, Energy, and Trust

14 September 2018

Quality Assessment Part 5: Blueprints and Computerized-Adaptive Testing

This is part 5 of a 9-part series on building high-quality assessments.

[Image: Arrows in a tree formation]

Molly is a 6th grade student who is already behind in math. Near the end of the school year she takes her state's annual achievement tests in mathematics and English Language Arts. Already anxious when she sits down to the test, her fears are confirmed by the first question, which asks her to divide 3/5 by 7/8. Though her class spent several days on this during the year, she doesn't recall how to divide one fraction by another. As she progresses through the test she is able to answer a few questions but resorts to guessing on all too many. After twenty minutes of this she gives up and guesses on the rest. When her test results are returned a month later she gets the same rating as in the three previous years: "Needs Improvement." Perpetually behind, she decides that she is "just not good at math."

Molly is fictional but she represents thousands of students across the U.S. and around the world.

Let's try another scenario. In this case, Molly is given a Computerized-Adaptive Test (CAT). When she gets the first question wrong, the testing engine picks an easier question which she knows how to answer. Gaining confidence, she applies herself to the next question, which she also knows how to answer. The system presents easier and harder questions as it works to pinpoint her skill level within a spectrum extending back to 4th grade and ahead to 8th grade. When her score report comes, she has a scale score of 2505, which is below the 6th grade standard of 2552. The report shows her previous year's score of 2423, which was well below standard for Grade 5. The summary says that, while Molly is still behind, she has achieved significantly more than a year's progress in the past year of school; much like this example of a California report.

Computerized-Adaptive Testing

A fixed-form Item Response Theory test presents a set of questions at a variety of skill levels centered on the standard for proficiency for the grade or course. Such tests result in a scale score, which indicates the student's proficiency level, and a standard error which indicates a confidence level of the scale score. A simplified explanation is that the student's actual skill level should be within the range of the scale score plus or minus the standard error. Because a fixed-form test is optimized for the mean, the standard error is greater the further the student is from the target proficiency for that test.
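The relationship between skill level, item difficulty, and the probability of a correct answer can be made concrete with an item response function. Here is a minimal sketch using the common two-parameter logistic (2PL) model (an assumption for illustration; the specific IRT model used by a testing program may differ):

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability a student with ability theta answers correctly.
    a = discrimination, b = difficulty (on the same scale as theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student whose ability exactly matches an item's difficulty has a 50% chance:
print(round(p_correct(0.0, a=1.0, b=0.0), 2))   # 0.5
# The same student is likely to answer a much easier item correctly:
print(round(p_correct(0.0, a=1.0, b=-2.0), 2))  # 0.88
```

A fixed-form test sums evidence from items clustered near the proficiency standard, which is why its standard error grows for students far from that target.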

Computerized Adaptive Tests (CAT) start with a large pool of assessment items. Smarter Balanced uses a pool of 1,200-1,800 items for a 40-item test. Each question is calibrated according to its difficulty within the range of the test. The test administration starts with a question near the middle of the range. From then on, the adaptive algorithm tracks the student's performance on prior items and selects the questions most likely to discover and increase confidence in the student's skill level.
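The selection step can be sketched with the widely used "maximum information" rule: pick the item whose Fisher information is highest at the current ability estimate. This is a simplified sketch (2PL model assumed; operational algorithms also apply blueprint and exposure constraints):

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: highest where difficulty matches ability."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta_estimate, pool):
    """Maximum-information rule: pick the pool item most informative at the estimate."""
    return max(pool, key=lambda item: item_information(theta_estimate, item["a"], item["b"]))

pool = [
    {"id": "easy",   "a": 1.0, "b": -1.5},
    {"id": "medium", "a": 1.0, "b": 0.1},
    {"id": "hard",   "a": 1.0, "b": 1.8},
]
print(next_item(0.0, pool)["id"])  # "medium" -- its difficulty is closest to the estimate
```

Because information peaks where difficulty matches ability, the engine naturally converges on questions near the student's skill level.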

A stage-adaptive or multistage test is similar except that groups of questions are selected together.

CAT tests have three important advantages over fixed-form tests:

  • The test can measure student skill across a wider range while maintaining a small standard error.
  • Fewer questions are required to assess the student's skill level.
  • Students may have a more rewarding experience as the testing engine offers more questions near their skill level.

When you combine more accurate results with a broader measured range and then use the same test family over time, you can reliably measure student growth.

Test Blueprints

As I described in Part 2 and Part 3 of this series, each assessment item is designed to measure one or two specific skills. A test blueprint indicates what skills are to be measured in a particular test and how many items of which types should be used to measure each skill.

As an example, here's the blueprint for the Smarter Balanced Interim Assessment Block (IAB) for "Grade 3 Brief Writes":

Block 3: Brief Writes
Claim     Target                                  Items   Total Items
Writing   1a. Write Brief Texts (Narrative)       4       6
          3a. Write Brief Texts (Informational)   1
          6a. Write Brief Texts (Opinion)         1

This blueprint, for a relatively short fixed-form test, indicates a total of six items spread across one claim and three targets. For more examples, you can check out the Smarter Balanced Test Blueprints. The Summative Tests, which are used to measure achievement at the end of each year, have the most items and represent the broadest range of skills to be measured.

When developing a fixed-form test, the test producer will select a set of items that meets the requirements of the blueprint and represents an appropriate mix of difficulty levels.

For CAT tests it's more complicated. The test producer must select a much larger pool of items than will be presented to the student. A minimum is five to ten items in the pool for each item to be presented to the student. For summative tests, Smarter Balanced uses a ratio averaging around 25 to 1. These items should represent the skills to be measured in approximately the same ratios as they are represented in the blueprint. And they should represent difficulty levels across the range of skill to be measured. (Difficulty level is represented by the IRT b parameter of each item.)

As the student progresses through the test, the CAT algorithm selects the next item to be presented. In doing so, it takes into account three factors:

  1. Information it has determined about the student's skill level so far,
  2. How much of the blueprint has been covered so far and what it has yet to cover, and
  3. The pool of items it has to select from.

From those criteria it selects an item that will advance coverage of the blueprint and will improve measurement of the student's skill level.
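A toy version of such a blueprint-aware selection step might look like the following. The "most under-covered target first" rule and the target/quota structure are illustrative simplifications, not the Smarter Balanced algorithm:

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, item):
    """Fisher information of a 2PL item at the current ability estimate."""
    p = p_correct(theta, item["a"], item["b"])
    return item["a"] ** 2 * p * (1.0 - p)

def select_next(theta, pool, administered, blueprint):
    """Pick the most informative unused item from the target furthest behind its quota.
    Assumes the blueprint is not yet fully covered; real engines also handle completion."""
    counts = {target: 0 for target in blueprint}
    for item in administered:
        counts[item["target"]] += 1
    open_targets = sorted(
        (t for t in blueprint if counts[t] < blueprint[t]),
        key=lambda t: counts[t] - blueprint[t],   # most under-covered first
    )
    candidates = [i for i in pool
                  if i not in administered and i["target"] == open_targets[0]]
    return max(candidates, key=lambda i: information(theta, i))

pool = [
    {"id": "w1", "target": "1a", "a": 1.0, "b": -1.0},
    {"id": "w2", "target": "1a", "a": 1.0, "b": 0.2},
    {"id": "w3", "target": "3a", "a": 1.0, "b": 0.0},
]
blueprint = {"1a": 2, "3a": 1}   # items required per target
print(select_next(0.0, pool, [], blueprint)["id"])  # "w2": target 1a is furthest behind
```

Once a target's quota is met, its items drop out of consideration, so the blueprint is guaranteed to be satisfied by the end of the test.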

Data Form

To present a CAT assessment the test engine needs three sets of data:

  • The Test Blueprint
  • A Catalog of all items in the pool. The entry for each item must specify its alignment to the test blueprint (which is equivalent to its alignment to standards), and its IRT Parameters.
  • The Test Items themselves.
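As an illustration, a catalog entry might carry the alignment and IRT parameters like this. The field names here are hypothetical, not from a published schema:

```python
# Hypothetical catalog entry for one item in a CAT pool.
# Field names and the identifier scheme are illustrative assumptions.
catalog_entry = {
    "item_id": "item-200-12345",                    # assumed identifier scheme
    "claim": "Writing",                             # alignment to the blueprint
    "target": "1a. Write Brief Texts (Narrative)",
    "irt": {"model": "2PL", "a": 0.85, "b": -0.4},  # calibration parameters
}

def is_usable(entry):
    """A CAT engine needs both the blueprint alignment and the IRT parameters."""
    return all(k in entry for k in ("item_id", "claim", "target", "irt")) \
        and all(k in entry["irt"] for k in ("a", "b"))

print(is_usable(catalog_entry))  # True
```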

Part 3 of this series describes formats for the items. The item metadata should include the alignment and IRT information. The manifest portion of IMS Content Packaging is one format for storing and transmitting item metadata.

To date, there is no standard or commonly-used data format for test blueprints. Smarter Balanced has published open specifications for its Assessment Packages. Of those, the Test Administration Package format includes the test blueprint and the item catalog. IMS CASE is designed for representing achievement standards but it may also be applicable to test blueprints.

IMS Global has formed an "IMS CAT Task Force" which is working on interoperable standards for Computerized Adaptive Testing. They anticipate releasing specifications later in 2018.

Quality Factors

A CAT Simulation is used to measure the quality of a Computerized Adaptive Test. These simulations use a set of a few thousand simulated students, each assigned a particular skill level. The system then simulates each student taking the test. For each item, the item characteristic function is used to determine whether a student at that skill level is likely to answer correctly. The adaptive algorithm uses those results to determine which item to present next.
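A minimal simulation along these lines might look like the following sketch, assuming a 2PL item characteristic function and a simple grid-based ability estimate (real simulation tools use far larger populations and more sophisticated estimation):

```python
import math
import random

def p_correct(theta, a, b):
    """2PL item characteristic function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_student(true_theta, pool, test_length, rng):
    """Administer test_length items adaptively; return the final ability estimate."""
    grid = [g / 10.0 for g in range(-40, 41)]   # ability grid from -4 to +4
    posterior = [1.0] * len(grid)               # flat prior over the grid
    remaining = list(pool)
    for _ in range(test_length):
        theta_hat = sum(g * w for g, w in zip(grid, posterior)) / sum(posterior)
        # Pick the most informative remaining item at the current estimate.
        item = max(remaining, key=lambda it: (it["a"] ** 2)
                   * p_correct(theta_hat, it["a"], it["b"])
                   * (1 - p_correct(theta_hat, it["a"], it["b"])))
        remaining.remove(item)
        # Use the item characteristic function to draw a simulated response.
        correct = rng.random() < p_correct(true_theta, item["a"], item["b"])
        # Bayesian update of the posterior over the grid.
        for i, g in enumerate(grid):
            p = p_correct(g, item["a"], item["b"])
            posterior[i] *= p if correct else (1 - p)
    return sum(g * w for g, w in zip(grid, posterior)) / sum(posterior)

rng = random.Random(42)
pool = [{"a": 1.0, "b": rng.uniform(-3, 3)} for _ in range(200)]
# Fifty replications for a student whose true skill level is 1.0:
estimates = [simulate_student(1.0, pool, 30, rng) for _ in range(50)]
mean_estimate = sum(estimates) / len(estimates)
```

Comparing `mean_estimate` (and the spread of `estimates`) against the known `true_theta` is exactly the kind of check the simulation results support.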

The results of the simulation show how well the CAT measured the simulated students by comparing each test score against the known skill level of the corresponding student. They are also used to ensure that the item pool has sufficient coverage, that the CAT algorithm satisfies the blueprint, and to find out which items get the most exposure. This feedback is used to tune the item pool and the configuration of the CAT algorithm to achieve optimal results across the simulated population of students.

To build a high-quality CAT assessment:

  • Build a large item pool with items of difficulty levels spanning the range to be measured.
  • Design a test blueprint that focuses on the skills to be measured and correlates with the overall score and the subscores to be reported.
  • Ensure that the adaptive algorithm effectively covers the blueprint and also focuses in on each student's skill level.
  • Perform CAT simulations to tune the effectiveness of the item pool, blueprint, and CAT algorithm.


Computerized adaptive testing offers significant benefits to students by delivering more accurate measures with a shorter, more satisfying test. CAT is best suited to larger tests with 35 or more questions spread across a broad blueprint. Shorter tests, focused on mastery of one or two specific skills, may be better served by conventional fixed-form tests.
