*This is part 4 of a 10-part series on building high-quality assessments.*

- *Part 1: Quality Factors*
- *Part 2: Standards and Content Specifications*
- *Part 3: Items and Item Specifications*
- ***Part 4: Item Response Theory, Field Testing, and Metadata***
- *Part 5: Blueprints and Computerized-Adaptive Testing*
- *Part 6: Achievement Levels and Standard Setting*
- *Part 7: Securing the Test*
- *Part 8: Test Reports*
- *Part 9: Frontiers*
- *Part 10: Scoring Tests*

Consider a math quiz with the following two items:

**Item A:** x = 5 - 2

What is the value of x?

**Item B:** x² - 6x + 9 = 0

What is the value of x?

George gets item A correct but gets the wrong answer for item B. Sally has the wrong answer for A but answers B correctly. Using traditional scoring, George and Sally each get 50%.

A more sophisticated quiz might assign 2 points to item A and 6 points to item B (recognizing that B is harder than A). Under such a scoring system, George would get 25% and Sally would get 75%.
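The arithmetic behind those two scoring schemes can be sketched in a few lines. The function name `percent_score` and the dictionaries are just illustrative names of mine; the point values are the ones described above.

```python
def percent_score(responses, points):
    """Percent of available points earned, given which items were answered correctly."""
    earned = sum(points[item] for item, correct in responses.items() if correct)
    return 100 * earned / sum(points.values())

george = {"A": True, "B": False}
sally = {"A": False, "B": True}

equal_points = {"A": 1, "B": 1}      # traditional scoring
weighted_points = {"A": 2, "B": 6}   # item B counts for more

print(percent_score(george, equal_points), percent_score(sally, equal_points))        # 50.0 50.0
print(percent_score(george, weighted_points), percent_score(sally, weighted_points))  # 25.0 75.0
```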

But the score is still short on meaning. George scored 25% of what? Sally scored 75% of what?

An even more sophisticated model should acknowledge that knowing how to solve quadratics (item B) is evidence that the student can also perform subtraction (item A). Such a model would position George somewhere between first grade (single-digit subtraction) and High School (solving quadratics). That same model would indicate that Sally either guessed correctly on item B or made a mistake on item A that's not representative of her skill. Due to the conflicting evidence, we are less sure about Sally's skill level than George's. For both students, more items would be required to gain greater confidence in their skill levels.
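Here is a rough sketch of that reasoning, using a logistic response model of the kind this post goes on to describe. Everything numeric is an assumption of mine for illustration: item A is given a difficulty of -3, item B a difficulty of 2, both get the same discrimination, and guessing is ignored. The code searches a grid of skill values for the one that best explains each student's two responses.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response for skill theta (c=0 means no guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters: A is very easy, B is hard.
ITEM_A = {"a": 1.0, "b": -3.0}
ITEM_B = {"a": 1.0, "b": 2.0}

def pattern_likelihood(theta, a_correct, b_correct):
    """Likelihood of one student's pair of responses at a given theta."""
    pa = p_correct(theta, **ITEM_A)
    pb = p_correct(theta, **ITEM_B)
    return (pa if a_correct else 1 - pa) * (pb if b_correct else 1 - pb)

thetas = [i / 10 for i in range(-60, 61)]  # grid from -6.0 to 6.0

def best(a_correct, b_correct):
    """Return (theta, likelihood) at the grid point with the highest likelihood."""
    candidates = ((t, pattern_likelihood(t, a_correct, b_correct)) for t in thetas)
    return max(candidates, key=lambda pair: pair[1])

george_theta, george_like = best(True, False)
sally_theta, sally_like = best(False, True)
print("George:", george_theta, george_like)
print("Sally: ", sally_theta, sally_like)
```

Under these simplified assumptions, George's most plausible skill lands between the two item difficulties and his responses are well explained there. Sally's pattern is far less likely at every skill value, which is the model's way of flagging a probable lucky guess or careless slip.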

## Item Response Theory

Item Response Theory or IRT is a statistical method for describing how student performance on assessment items relates to their skill in the area the item was designed to measure.

The "three parameter logistic model" (3PL) for IRT describes the probability that a student of a certain skill level will answer the item correctly. Student proficiency is represented by θ (theta) and the three *item* parameters are a, b, and c. They represent the following factors:

- *a* = Discrimination. This value indicates how well the item discriminates between proficient students and those who have not yet learned the skill.
- *b* = Difficulty. This value indicates how difficult the item is for a student to answer correctly.
- *c* = Guessing. The probability that a student will answer correctly by guessing. For a four-option multiple-choice question, this would be 0.25 because the student has a one-in-four chance of guessing the right answer.

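From these parameters we can create an *item characteristic curve*. In its common form (some presentations add a scaling constant inside the exponent), the formula is as follows:

$$P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}$$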

This is much easier to understand in graph form. So I loaded it into the Desmos graphing calculator.

The vertical (y) axis indicates the probability that a student will answer the item correctly. The horizontal (x) axis is student proficiency (represented by θ in the equation). You can move the sliders to change the *a*, *b*, and *c* parameters and see how different items would be represented in an item characteristic curve.
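If you prefer code to sliders, the same curve can be tabulated with a short function. This is a minimal sketch that assumes the common form of the 3PL model, P(θ) = c + (1 - c) / (1 + e^(-a(θ - b))); the function name `icc_3pl` and the sample parameter values are mine.

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve.

    theta: student proficiency
    a: discrimination, b: difficulty, c: guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Tabulate the curve for a hypothetical four-option multiple-choice item.
for theta in (-3, -2, -1, 0, 1, 2, 3):
    p = icc_3pl(theta, a=1.0, b=0.0, c=0.25)
    print(f"theta={theta:+d}  P(correct)={p:.3f}")
```

Raising *a* makes the curve steeper around *b*, raising *b* shifts the curve to the right, and *c* sets the floor that the curve approaches for students with very low proficiency.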

In addition to this "three-parameter" model, there are other IRT models, but they all follow the same basic premise: The function represents the probability that a student of a given skill (represented by θ, theta) will answer the question correctly. At least one parameter of the function represents the difficulty of the question. For items scored on a multi-point scale, there are difficulty parameters (typically d1, d2, etc.) representing the difficulty thresholds for each point value.

## Scale Scores

The difficulty parameter *b* and the student skill value θ are on the same *logistic* scale and center on the skill level being measured. For example, if an item is written for grade 5 math, a *b* parameter of 0 means that the average 5th grade student should be able to answer the question correctly about 50% of the time, setting aside any boost from guessing. (When *c* is greater than 0, the probability at θ = *b* is halfway between *c* and 1.)

Most assessments convert from this *theta score* into a *scale score* which is a consistent score reported to educators, students, and parents. For Smarter Balanced, the scale score ranges from 2000 to 3000 and represents skill levels from Kindergarten to High School Graduation. *Theta scores* are converted to *scale scores* using a polynomial function.
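As a rough sketch of such a conversion, here is a polynomial evaluated at θ and held within the reported range. The coefficients below are made up for illustration and are not Smarter Balanced's actual values, and clamping to the 2000 to 3000 range is my assumption about how out-of-range results might be handled.

```python
def theta_to_scale(theta, coefficients=(2500.0, 80.0), low=2000, high=3000):
    """Convert a theta score to a scale score.

    coefficients: hypothetical polynomial coefficients, lowest power first
                  (here 2500 + 80 * theta, a first-degree polynomial).
    low, high: bounds of the reporting scale; results are clamped to them.
    """
    raw = sum(coef * theta ** power for power, coef in enumerate(coefficients))
    return round(min(max(raw, low), high))

for theta in (-10.0, -1.0, 0.0, 1.5):
    print(theta, theta_to_scale(theta))
```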

## Field Testing

So how do we come up with the *a*, *b*, and *c* parameters for a particular item? Based on the item type and potential responses we can predict *c* (guessing) fairly well, but our experience at Smarter Balanced has shown that authors are not very good at predicting *b* (difficulty) or *a* (discrimination). To get an objective measure of these values we use a *field test*.

In Spring 2014 Smarter Balanced held a field test in which 4.2 million students completed a test - typically in either English Language Arts or Mathematics. Some students took both. For the participating schools and students, this was a practice test - gaining experience in administering and taking tests. Since the items were not yet calibrated, we could not reliably score the tests. For Smarter Balanced it offered critical data on more than 19,000 test items. For each item we gained more than 10,000 scored responses from students representing the target grades across all demographics.

Psychometricians used these data, from students taking the test, to calculate the parameters (*a*, *b*, and *c*) for each item in the field test. The process of calculating IRT parameters from field test data is called *calibration*. Once items were calibrated we examined the parameters and the data to determine which items would be approved for use in tests. For example, if *a* is too low then the question likely has a flaw. It may not measure the right skill or the answer key may be incorrect. Likewise, if the *b* parameter is different across demographic groups then the item may be sensitive to gender, cultural, or ethnic bias. Items from the field test that met statistical standards were approved and became the initial bank of items from which Smarter Balanced produces tests.
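To give a feel for what calibration does, here is a toy version for a single item. It is a sketch under heavy assumptions and not the procedure psychometricians actually use: the responses are simulated, each student's θ is treated as already known (operational calibration has to estimate student proficiency and item parameters together), *c* is fixed in advance, and the best *a* and *b* are found by a simple grid search over the likelihood.

```python
import math
import random

def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(a, b, c, data):
    """Log-likelihood of (theta, correct) observations for one item."""
    total = 0.0
    for theta, correct in data:
        p = p_correct(theta, a, b, c)
        total += math.log(p if correct else 1.0 - p)
    return total

def calibrate(data, c):
    """Grid-search the (a, b) pair that best explains the responses, with c fixed."""
    a_grid = [i / 5 for i in range(2, 13)]      # 0.4 to 2.4 in steps of 0.2
    b_grid = [i / 10 for i in range(-20, 21)]   # -2.0 to 2.0 in steps of 0.1
    pairs = ((a, b) for a in a_grid for b in b_grid)
    return max(pairs, key=lambda ab: log_likelihood(ab[0], ab[1], c, data))

# Simulate a field test for one item with known "true" parameters.
random.seed(0)
TRUE_A, TRUE_B, C = 1.2, 0.5, 0.25
data = []
for _ in range(2000):
    theta = random.gauss(0.0, 1.0)
    data.append((theta, random.random() < p_correct(theta, TRUE_A, TRUE_B, C)))

a_hat, b_hat = calibrate(data, C)
print("estimated a:", a_hat, "estimated b:", b_hat)
```

With a couple of thousand simulated responses the estimates should land near the true values. With only a few dozen they wander considerably, which is one reason field tests need so many students per item.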

Each year Smarter Balanced does an *embedded field test*. Each test that a student takes has a few new "field test" items included. These items do not contribute to the student's test score. Rather, the students' scored responses are used to calibrate the items. This way the test item bank is being constantly renewed. Other organizations like ACT and SAT follow the same practice of embedding field test questions in regular tests.
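The bookkeeping for an embedded field test might look something like the following sketch. The record layout and the `field_test` flag are hypothetical, and percent correct stands in for the real IRT-based scoring just to keep the example short.

```python
def score_test(responses):
    """Score operational items only and set field-test responses aside.

    Returns (percent correct on operational items, field-test responses
    to be passed along for calibration).
    """
    operational = [r for r in responses if not r["field_test"]]
    calibration_data = [r for r in responses if r["field_test"]]
    percent = 100 * sum(r["correct"] for r in operational) / len(operational)
    return percent, calibration_data

responses = [
    {"item": "op-1", "field_test": False, "correct": True},
    {"item": "op-2", "field_test": False, "correct": False},
    {"item": "op-3", "field_test": False, "correct": True},
    {"item": "op-4", "field_test": False, "correct": True},
    {"item": "ft-1", "field_test": True, "correct": False},  # not counted in the score
]

percent, calibration_data = score_test(responses)
print(percent, [r["item"] for r in calibration_data])
```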

To understand more about IRT, I recommend A Simple Guide to IRT and Rasch Modeling by Ho Yu.

## Item Metadata

The IRT parameters, alignment to standards, and other critical information are collected as metadata about each item. In most cases, metadata is represented as a set of name-value pairs. There are many formats for representing metadata and also many dictionaries of field definitions. Smarter Balanced uses the metadata structure from IMS Content Packaging and draws field definitions from The Learning Resource Metadata Initiative (LRMI), from Schema.org, and from Common Education Data Standards (CEDS).

Here are some of the most critical metadata elements for assessment items with links to their definitions in those standards:

- Identifier: A number that uniquely identifies this item.
- PrimaryStandard: An identifier of the principal skill the item is intended to measure. The skill would be described in an Achievement Standard or Content Specification.
- SecondaryStandard: Optional identifiers of additional Achievement Standards or Content Specifications that the item measures.
- InteractionType: The type of interaction (multiple choice, matching, short answer, essay, etc.).
- IRT Parameters: The *a*, *b*, and *c* parameters or another parameter set for the Item Response Theory function.
- History: A record of when and how the item has been used to estimate how much it has been exposed.
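As name-value pairs, the metadata for one item might look like the sketch below. The field names follow the list above, but every value is invented, the nested layout is my own choice and not the IMS Content Packaging format, and which fields count as required is an assumption for the sake of the example.

```python
# Hypothetical metadata record for a single item.
item_metadata = {
    "Identifier": 12345,
    "PrimaryStandard": "EXAMPLE-STD-1",
    "SecondaryStandard": ["EXAMPLE-STD-2"],
    "InteractionType": "multiple choice",
    "IRTParameters": {"model": "3PL", "a": 1.1, "b": -0.4, "c": 0.25},
    "History": [{"date": "2014-04-15", "use": "field test"}],
}

# Assumed set of fields an item needs before it can be used operationally.
REQUIRED_FIELDS = ("Identifier", "PrimaryStandard", "InteractionType", "IRTParameters")

def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or empty."""
    return [name for name in required if metadata.get(name) in (None, "", [], {})]

print(missing_fields(item_metadata))
print(missing_fields({"Identifier": 67890, "InteractionType": "essay"}))
```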

## Quality Factors

States, schools, assessment consortia, and assessment companies all maintain banks of assessment items from which they construct their assessments. There are a number of efforts underway to pool resources from multiple entities into large, joint item banks. The value of items in any such bank is **multiplied tenfold** if the items have consistent and reliable metadata regarding **alignment to standards** and **IRT parameters**.

Here are factors to consider related to IRT Calibration and Metadata:

- Are all items field-tested and calibrated before they are used in an operational test?
- Is alignment to standards and content specifications an integral part of item writing?
- Are the identifiers used to record alignment consistent across the entire item bank?
- Is field testing an integral part of the assessment design?
- Are IRT parameters consistent and comparable across the entire bank?
- When sharing items or an item bank across multiple organizations, do all participants agree to contribute data (field testing and operational use) back to the bank?
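A few of these questions can be checked mechanically once the metadata is in place. The sketch below audits a small, made-up item bank for items with no calibration, items with no alignment, and a mix of IRT models. The record layout and key names are hypothetical, and a real audit would cover far more than this.

```python
def audit_bank(items):
    """Return a dict mapping each issue to the item ids (or model names) involved."""
    issues = {"uncalibrated": [], "unaligned": []}
    for item in items:
        if not item.get("irt"):
            issues["uncalibrated"].append(item["id"])
        if not item.get("primary_standard"):
            issues["unaligned"].append(item["id"])
    # Parameters from different models are not directly comparable.
    models = {item["irt"]["model"] for item in items if item.get("irt")}
    issues["mixed_models"] = sorted(models) if len(models) > 1 else []
    return issues

bank = [
    {"id": 1, "primary_standard": "EXAMPLE-STD-1", "irt": {"model": "3PL", "a": 1.0, "b": 0.2, "c": 0.25}},
    {"id": 2, "primary_standard": "EXAMPLE-STD-2", "irt": None},
    {"id": 3, "primary_standard": None, "irt": {"model": "2PL", "a": 0.8, "b": -1.1}},
]

print(audit_bank(bank))
```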

## Wrapup

Field testing can be expensive, inconvenient, or both. But without actual data from student performance we have no objective evidence that a particular assessment item measures what it's intended to measure at the expected level of difficulty.

The challenges around field testing, combined with the lack of training in IRT and related psychometrics, have kept these measures from being used in anything other than large-scale, high-stakes tests. Nevertheless, it's concerning to me that final exams and midterms of great consequence are rarely, if ever, calibrated and validated. Greater collaboration among institutions, among curriculum developers, or both could achieve sufficient scale for calibrated tests to become more common.
