Of That

Brandt Redd on Education, Technology, Energy, and Trust

23 August 2018

Quality Assessment Part 3: Items and Item Specifications

This is part 3 of a multi-part series on building high-quality assessments.
Transparent cylindrical vessel with wires leading to an electric spark inside.

Some years ago I remember reading my middle school science textbook. The book was attempting to describe the difference between a mixture and a compound. It explained that water is a compound of two parts hydrogen and one part oxygen. However, if you mix two parts hydrogen and one part oxygen in a container, you will simply have a container with a mixture of the two gasses; they will not spontaneously combine to form water.

So far, so good. Next, the book said that if you introduced an electric spark in the mixed gasses you would, "start to see drops of water appear on the inside surface of the container as the gasses react to form water." This was accompanied by an image of a container with wires and an electric spark.

I suppose the book was technically correct; that is what would happen if the container was strong enough to contain the violent explosion. But, even as a middle school student, I wondered how the dangerously misleading passage got written and how it survived the review process.

The writing and review of assessments requires at least as much rigor as the writing of textbooks. An error on an assessment item affects the evaluation of all students who take the test.

Items

In the parlance of the assessment industry, test questions are called items. The latter term is intended to include more complex interactions than just answering questions.

Stimuli and Performance Tasks

Oftentimes, an item is based on a stimulus or passage that sets up the question. It may be an article, short story, or description of a math or science problem. The stimulus is usually associated with three to five items. When presented by computer, the stimulus and the associated items are usually presented on one split screen so that the student can refer to the stimulus while responding to the items.

Sometimes, item authors will write the stimulus; this is frequently the case for mathematics stimuli as they set up a story problem. But the best items draw on professionally-written passages. To facilitate this, the Copyright Clearance Center has set up the Student Assessment License as a means to license copyrighted materials for use in student assessment.

A performance task is a larger-scale activity intended to allow the student to demonstrate a set of related skills. Typically, it begins with a stimulus followed by a set of ordered items. The items build on each other, usually finishing with an essay that asks the student to draw conclusions from the available information. For Smarter Balanced this pattern (stimulus, multiple items, essay) is consistent across English Language Arts and Mathematics.

Prompt or Stem

The prompt, sometimes called a stem, is the request for the student to do something. A prompt might be as simple as, "What is the sum of 24 and 62?" Or it might be as complex as, "Write an essay comparing the views of the philosophers Voltaire and Kant regarding enlightenment. Include quotes from each that relate to your argument." Regardless, the prompt must provide the required information and clearly describe both what the student is to do and how they are to express their response.

Interaction or Response Types

The response is a student's answer to the prompt. Two general categories of items are selected response and constructed response. Selected response items require the student to select one or more alternatives from a set of pre-composed responses. Multiple choice is the most common selected response type but others include multi-select (in which more than one response may be correct), matching, true/false, and others.

Multiple choice items are particularly popular due to the ease of recording and scoring student responses. For multiple choice items, alternatives are the responses that a student may select from, distractors are the incorrect responses, and the answer is the correct response.
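Here's a minimal sketch of that vocabulary as a data structure. The class and field names are my own illustration, not part of any assessment standard; the point is simply that the distractors are whatever remains of the alternatives once the answer is removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical model of a multiple choice item (names are illustrative)."""
    prompt: str
    alternatives: dict[str, str]  # letter -> response text shown to the student
    answer: str                   # letter of the correct alternative

    def __post_init__(self) -> None:
        # The answer has to be one of the alternatives the student can select.
        if self.answer not in self.alternatives:
            raise ValueError("answer must be one of the alternatives")

    @property
    def distractors(self) -> dict[str, str]:
        """The incorrect alternatives: everything except the answer."""
        return {k: v for k, v in self.alternatives.items() if k != self.answer}


item = MultipleChoiceItem(
    prompt="What is the sum of 24 and 62?",
    alternatives={"A": "76", "B": "86", "C": "96", "D": "1488"},
    answer="B",
)
print(sorted(item.distractors))  # ['A', 'C', 'D']
```

Note that the distractors in this made-up item are not random: each one corresponds to a plausible mistake (a carrying error, or multiplying instead of adding).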

The most common constructed response item types are short answer and essay. In each case, the student is expected to write their answer. The difference is the length of the answer; short answer is usually a word or phrase while essay is a composition of multiple sentences or paragraphs. A variation of short answer may have a student enter a mathematical formula. Constructed responses may also have students plot information on a graph or arrange objects into a particular configuration.

Technology-Enhanced items are another commonly-used category. These items are delivered by computer and include simulations, composition tools, and other creative interactions. However, all technology-enhanced items can still be categorized as either selected response or constructed response.

Scoring Methods

There are two general ways of scoring items: deterministic scoring and probabilistic scoring.

Deterministic scoring is indicated when a student's response may be unequivocally determined to be correct or incorrect. When a response is scored on multiple factors there may be partial credit for the factors the student addressed correctly. Deterministic scoring is most often associated with selected response items but many constructed response items may also be deterministically scored when the factors of correctness are sufficiently precise, such as a numeric answer or a single word for a fill-in-the-blank question. When answers are collected by computer or are easily entered into a computer, deterministic scoring is almost always done by computer.
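As a rough illustration of deterministic scoring with partial credit, the sketch below awards one point for each factor that matches the key. The function name, the factor names, and the normalization (trimming whitespace and ignoring case) are assumptions for the example; a real scoring key would spell out exactly which variations are acceptable.

```python
def score_deterministic(response: dict[str, str], key: dict[str, str]) -> int:
    """Award one point per factor whose response matches the key.

    Matching ignores case and surrounding whitespace, which is an
    assumption of this sketch rather than a general rule.
    """
    return sum(
        1
        for factor, expected in key.items()
        if response.get(factor, "").strip().lower() == expected.lower()
    )


# Hypothetical two-factor key: a numeric answer and its unit.
key = {"sum": "86", "unit": "apples"}

print(score_deterministic({"sum": "86", "unit": "Apples "}, key))  # 2
print(score_deterministic({"sum": "85", "unit": "apples"}, key))   # 1
print(score_deterministic({}, key))                                # 0
```

The same response always receives the same score, which is what makes this kind of scoring easy to hand to a computer.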

Probabilistic scoring is indicated when the quality of a student's answer must be judged on a scale. This is most often associated with essay type questions but may also apply to other constructed response forms. When handled well, a probabilistic score may include a confidence level — how confident is the scoring person or system that the score is correct.

Probabilistic scoring may be done by humans (e.g. judging the quality of an essay) or by computer. When done by computer, Artificial Intelligence techniques are frequently used with different degrees of reliability depending on the question type and the quality of the AI.
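To make the idea of a confidence level concrete, here's a deliberately simple sketch. It assumes several independent raters (human or machine) each assign a rubric score, reports the most common score, and uses the fraction of raters who agree as a crude stand-in for confidence. This is my own illustration, not how any particular scoring system computes confidence.

```python
from collections import Counter


def consensus_score(ratings: list[int]) -> tuple[int, float]:
    """Return (score, confidence) from a list of independent ratings.

    The score is the most common rating; ties go to the lower score.
    Confidence is the share of raters who gave that score.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    top = max(counts.values())
    score = min(s for s, c in counts.items() if c == top)
    return score, top / len(ratings)


print(consensus_score([2, 2, 2, 2]))  # (2, 1.0)
print(consensus_score([3, 4]))        # (3, 0.5)
```

A low confidence value is a useful signal in its own right: it can route the response to an additional rater rather than silently reporting a shaky score.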

Answer Keys and Rubrics

The answer key is the information needed to score a selected-response item. For multiple choice questions, it's simply the letter of the correct answer. A machine scoring key or machine rubric is an answer key coded in such a way that a computer can perform the scoring.

The rubric is a scoring guide used to evaluate the quality of student responses. For constructed response items the rubric will indicate which factors should be evaluated in the response and what scores should be assigned to each factor. Selected response items may also have a rubric which, in addition to indicating which response is correct, would also give an explanation about why that response is correct and why each distractor is incorrect.

Item Specifications

An item specification describes the skills to be measured and the interaction type to be used. It serves as both a template and a guide for item authors.

The skills should be expressed as references to the Content Specification and associated Competency Standards (see Part 2 of this series). A consistent identifier scheme for the Content Specification and Standards greatly facilitates this. However, to assist item authors, the specification often quotes relevant parts of the specification and standards verbatim.

If the item requires a stimulus, the specification should describe the nature of the stimulus. For ELA, that would include the type of passage (article, short-story, essay, etc.), the length, and the reading difficulty or text complexity level. In mathematics, the stimulus might include a diagram for Geometry, a graph for data analysis, or a story problem.

The task model describes the structure of the prompt and the interaction type the student will use to compose their response. For a multiple choice item, the task model would indicate the type of question to be posed, sometimes with sample text. That would be followed by the number of multiple choice options to be presented, the structure for the correct answer, and guidelines for composing appropriate distractors. Task models for constructed response would include the types of information to be provided and how the student should express their response.

The item specification concludes with guidelines about how the item will be scored including how to compose the rubric and scoring key. The rubric and scoring key focus on what evidence is required to demonstrate the student's skill and how that evidence is detected.

Smarter Balanced content specifications include references to the Depth of Knowledge that should be measured by the item, and guidelines on how to make the items accessible to students with disabilities. Smarter Balanced also publishes specifications for full performance tasks.

Data Form for Item Specifications

Like Content Specifications, Item Specifications have traditionally been published in document form. When offered online they are typically in PDF format. As with Content Specifications, there are great benefits to be achieved by publishing item specifications in a structured data form. Doing so can integrate the item specification into the item authoring system, presenting a template for the item with pre-filled content-specification alignment metadata, a pre-selected interaction type, and guidelines about stimulus and prompt alongside the places where the author is to fill in the information.
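Here's a minimal sketch of how a structured specification could pre-fill an authoring template. The class, its fields, and the guideline text are hypothetical and much simpler than a real specification; this is not the CASE schema. The alignment identifier is the Smarter Balanced example discussed in Part 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemSpecification:
    """Hypothetical structured item specification (not the CASE schema)."""
    alignment: tuple[str, ...]  # content-specification identifiers
    interaction_type: str
    stimulus_guidelines: str
    prompt_guidelines: str

    def new_item_template(self) -> dict:
        """Return a blank item with the specification's metadata pre-filled."""
        return {
            "alignment": list(self.alignment),
            "interaction_type": self.interaction_type,
            "guidelines": {
                "stimulus": self.stimulus_guidelines,
                "prompt": self.prompt_guidelines,
            },
            "stimulus": "",  # left for the item author
            "prompt": "",    # left for the item author
        }


spec = ItemSpecification(
    alignment=("M.G5.C1OA.TA.5.OA.A.1",),
    interaction_type="multiple-choice",
    stimulus_guidelines="Short story problem; keep the reading load low.",
    prompt_guidelines="Ask for the value of one numerical expression.",
)
template = spec.new_item_template()
print(template["interaction_type"])  # multiple-choice
```

The author starts from `template` with the alignment and interaction type already set, and only the stimulus and prompt left to write.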

Smarter Balanced has selected the IMS CASE format for publishing item specifications in structured form. This is the same data format we used for the content specifications.

Data Form for Items

The only standardized format for assessment items in general use is IMS Question and Test Interoperability (QTI). It's a large standard with many features. Some organizations have chosen to implement a custom subset of QTI features known as a "profile." The soon-to-be-released QTI 3.0 aims to reduce divergence among profiles.

A few organizations, including Smarter Balanced and CoreSpring, have been collaborating on the Portable Interactions and Elements (PIE) concept. This is a framework for packaging custom interaction types using Web Components. If successful, this will simplify the player software and support publishing of custom interaction types.

Quality Factors

A good item specification will likely be much longer than the items it describes. As a result, producing an item specification also consumes a lot more work than writing any single item. But, since each item specification will result in dozens or hundreds of items, the effort of writing good item specifications pays huge dividends in terms of the quality of the resulting assessment.

  • Start with good-quality standards and content specifications.
  • Create task models that are authentic to the skills being measured. The task that the student is asked to perform should be as similar as possible to how they would manifest the measured skill in the real world.
  • Choose or write high-quality stimuli. For language arts items, the stimulus should demand the skills being measured. For non-language-arts items, the stimulus should be clear and concise so as to reduce sensitivity to student reading skill level.
  • Choose or create interaction types that are inherently accessible to students with disabilities.
  • Ensure that the correct answer is clear and unambiguous to a person who possesses the skills being measured.
  • Train item authors in the process of item writing. Sensitize them to common pitfalls such as using terms that may not be familiar to students of diverse ethnic backgrounds.
  • Use copy editors to ensure that language use is consistent, parallel in structure, and that expectations are clear.
  • Develop a review, feedback, and revision process for items before they are accepted.
  • Write specific quality criteria for reviewing items. Set up a review process in which reviewers apply the quality criteria and evaluate the match to the item specification.

Wrapup

Most tests and quizzes we take, whether in K-12 or college, are composed one question at a time based on the skills taught in the previous unit or course. Item specifications are rarely developed or consulted in these conditions and even the learning objectives may be somewhat vague. Furthermore, there is little third-party review of such assessments. Considering the effort students go through to prepare for and take an exam, not to mention the consequences associated with their performance on those exams, it seems like institutions should do a better job.

Starting from an item specification is easier and produces better results than writing an item from scratch. The challenge is producing the item specifications themselves, which is quite demanding. Just as achievement standards are developed at state or multi-state scale, so also could item specifications be jointly developed and shared broadly. As shown in the links above, Smarter Balanced has published its item specifications and many other organizations do the same. Developing and sharing item specifications will result in better quality assessments at all levels from daily quizzes to annual achievement tests.

11 August 2018

Quality Assessment Part 2: Standards and Content Specifications

Mountain with Flag on Summit

In Part 1 of this series I introduced the factors that distinguish a high quality assessment from other assessments. The balance of this series will discuss the process of constructing an assessment and the factors that make them high quality. Today I'm writing about Achievement Standards and the Content Specification.

Some years ago my sister was in middle school and I had just finished my freshman year at college. My sister's English teacher kept assigning word search puzzles and she hated them. The family had just purchased an Apple II clone and so I wrote a program to solve word searches for my sister. I'm not sure what skills her teacher was trying to develop with the puzzles; but I developed programming skills and my sister learned to operate a computer. Both skill sets have served us later in life.

Alignment to Standards

The first step in building any assessment, from a quiz to a major exam, should be to determine what you are trying to measure. In the case of academic assessments, we measure skills, also known as competencies. State standards are descriptions of specific competencies that a student should have achieved by the end of the year. They are organized by subject and grade. State summative tests indicate student achievement by measuring how close each student is to the state standards — typically at the close of the school year. Interim tests can be used during the school year to measure progress and to offer more detailed focus on specific skill areas.

At the higher education level, Colleges and Universities set Learning Objectives for each course. A common practice is to use the term, "competencies" as a generic reference to state standards and college learning objectives and I'll follow that pattern here.

The Smarter Balanced Assessment Consortium, where I have been working, measures student performance relative to the Common Core State Standards. Choosing standards that have been adopted by multiple states enables us to write one assessment that meets the needs of all our member states and territories.

The Content Specification

The content specification is a restatement of competencies organized in a way that facilitates assessment. Related skills are clustered together so that performance measures on related tasks may be aggregated. For example, Smarter Balanced collects skill measures associated with "Reading Literary Texts" and "Reading Informational Texts" together into a general evaluation of "Reading". In contrast, a curriculum might cluster "Reading Literary Texts" with "Creative Writing" because synergies occur when you teach those skills together.

The Smarter Balanced content specification follows a hierarchy of Subject, Grade, Claim, and Target. In Mathematics, the four claims are:

  1. Concepts and Procedures
  2. Problem Solving
  3. Communicating Reasoning
  4. Modeling and Data Analysis

In English Language Arts, the four claims are:

  1. Reading
  2. Writing
  3. Speaking & Listening
  4. Research & Inquiry

These same four claims are repeated in each grade but the expected skill level increases. That increase in skill is represented by the targets assigned to the claims at each grade level. In English Reading (Claim 1), the complexity of the text presented to the student increases and the information the student is expected to draw from the text is increasingly demanding. Likewise, in Math Claim 1 (Concepts and Procedures) the targets progress from simple arithmetic in lower grades to Geometry and Trigonometry in High School.

Data Form

Typical practice is for states to publish their standards as documents. When published online they have been published as PDF files. Such documents are human readable but they lack the structure needed for data systems to facilitate access. In many cases they also lack identifiers that are required when referencing standards or content specifications.

Most departments within colleges and universities will develop a set of learning objectives for each course. Oftentimes a state college system will develop statewide objectives. While these objectives are used internally for course design, there's little consistency in publishing the objectives. Some institutions publish all of their objectives while others keep them as internal documents. The Temple University College of Liberal Arts offers an example of publicly published learning objectives in HTML form.

In August 2017, IMS Global published the Competencies & Academic Standards Exchange (CASE) data standard. It is a vendor-independent format for publishing achievement standards suitable for course learning objectives, state standards, content specifications, and many other competency frameworks.

Public Consulting Group, in partnership with several organizations, built OpenSALT, an open source "Standards Alignment Tool" as a reference implementation of CASE.

Here's an example. Smarter Balanced originally published its content specifications in PDF form. The latest versions, from July of 2017, are available on the Development and Design page of their website. These documents have complete information but they do not offer any computer-readable structure.

"Boring" PDF form of Smarter Balanced Content Specifications:

In Spring 2018, Smarter Balanced published the same specifications, in CASE format, using the OpenSALT tool. The structure of the format lets you navigate the hierarchy of the specifications. The CASE format also supports cross-references between publications. In this case, Smarter Balanced also published a rendering of the Common Core State Standards in CASE format to facilitate references from the content specifications to the corresponding Common Core standards.

"Cool" CASE form of Smarter Balanced Content Specifications and CCSS:

I hope you agree that the Standards and Content Specifications are significantly easier to navigate in their structured form. Smarter Balanced is presently working on a "Content Specification Explorer" which will offer a friendlier user interface on the structured CASE data.

Identifiers

Regardless of how they are published, use of standards is greatly facilitated if an identifier is assigned to each competency. There are two general categories of identifiers. Opaque identifiers carry no meaning; they are just a number. Often they are "Universally Unique IDs" (UUIDs), which are generated using an algorithm to ensure that the identifier is not used anywhere else in the world. Any meaning of the identifier is by virtue of the record to which it is assigned. "Human Readable" identifiers are constructed to have a meaningful structure to a human reader. There are good justifications for each approach.

The Common Core State Standards assigned both types of identifier to each standard. Smarter Balanced has followed a similar practice in the identifiers for our Content Specification.

Common Core State Standards Example:

  • Opaque Identifier: DB7A9168437744809096645140085C00
  • Human Readable Identifier: CCSS.Math.Content.5.OA.A.1
  • URL: http://corestandards.org/Math/Content/5/OA/A/1/
  • Statement: Use parentheses, brackets, or braces in numerical expressions, and evaluate expressions with these symbols.
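The opaque identifier in this example has the shape of a UUID written as 32 hexadecimal digits without dashes. As a small illustration, Python's standard `uuid` module can generate a new identifier in the same style and parse the one shown above. The variable names are mine, and I'm only assuming the published identifiers follow this convention because the example does.

```python
import uuid

# Generate a new opaque identifier in the same 32-hex-digit, uppercase style.
new_id = uuid.uuid4().hex.upper()
print(len(new_id))  # 32

# The CCSS example parses as a UUID; Python accepts the undashed form.
ccss_id = uuid.UUID("DB7A9168437744809096645140085C00")
print(ccss_id)  # db7a9168-4377-4480-9096-645140085c00
```

Nothing about the subject, grade, or standard can be recovered from the identifier itself; that information lives only in the record it points to.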

Smarter Balanced Content Specification Target Example:

You'll notice that the Smarter Balanced Content Specification target is a copy of the corresponding Common Core State Standard. The CASE representation includes an "Exact Match Of" cross-reference from the content specification to the corresponding standard to show that's the case.

Smarter Balanced has published a specification for its human-readable Content Specification Identifiers. Here's the interpretation of "M.G5.C1OA.TA.5.OA.A.1":

  • M Math
  • G5 Grade 5
  • C1 Claim 1
  • OA Domain OA (Operations & Algebraic Thinking)
  • TA Target A
  • 5.OA.A.1 CCSS Standard 5.OA.A.1
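Because the identifier has a regular structure, software can take it apart as well as people can. The sketch below is based only on the single example above, not on the published identifier specification, so the pattern and field names are assumptions and other identifiers may not fit it.

```python
import re
from typing import NamedTuple


class TargetId(NamedTuple):
    subject: str   # "M" for Math
    grade: str     # kept as text; this sketch does not assume a numeric grade
    claim: int
    domain: str
    target: str
    standard: str  # the CCSS standard the target references


# Pattern inferred from "M.G5.C1OA.TA.5.OA.A.1" (an assumption, see above).
_PATTERN = re.compile(
    r"^(?P<subject>[A-Z]+)"
    r"\.G(?P<grade>\w+)"
    r"\.C(?P<claim>\d+)(?P<domain>[A-Z]*)"
    r"\.T(?P<target>[A-Z]+)"
    r"\.(?P<standard>.+)$"
)


def parse_target_id(identifier: str) -> TargetId:
    """Split a human-readable target identifier into its parts."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    parts = match.groupdict()
    return TargetId(
        subject=parts["subject"],
        grade=parts["grade"],
        claim=int(parts["claim"]),
        domain=parts["domain"],
        target=parts["target"],
        standard=parts["standard"],
    )


parsed = parse_target_id("M.G5.C1OA.TA.5.OA.A.1")
print(parsed.claim, parsed.domain, parsed.standard)  # 1 OA 5.OA.A.1
```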

Quality Factors

The design of any educational activity should begin with a set of learning objectives. State Standards offer a template for curricula, lesson plans, assessments, supplemental materials, games and more. At the higher education level, Colleges and Universities set learning objectives for each course that serve a similar purpose. The quality of the achievement standards will have a fundamental impact on the quality of the related learning activities.

Factors to consider when selecting or building standards or learning objectives include the following:

  • Are the competencies relevant to the discipline being taught?
  • Are the competencies parallel in construction, describing skills at a similar grain size?
  • Are the skills ordered in a natural learning progression?
  • Are related skills, such as reading and writing, taught together in a coordinated fashion?
  • Is the amount of material covered by the competencies appropriate for the amount of time that will be allocated for learning?

The Development Process and the Standards-Setting Criteria used by the authors of the Common Core State Standards offer some insight into how they sought to develop high quality standards.

Factors to consider when developing an assessment content specification include the following:

  • Does the specification reference an existing standard or competency set?
  • Are the competencies described in such a way that they can be measured?
  • Is the grain size (the amount of knowledge involved) for each competency optimal for construction of test questions?
  • Are the competencies organized so that related skills are clustered together?
  • Does the content standard factor in dependencies between competencies? For example, performing long division is evidence that an individual is also competent at multiplication.
  • Is the organization of the competencies, typically into a hierarchy, consistent and easy to navigate?
  • Does the competency set lend itself to reporting skills at multiple levels? For example, Smarter Balanced reports an overall ELA score and then subscores for each claim: Reading, Writing, Speaking & Listening, and Research & Inquiry.
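The last point, reporting at multiple levels, is easy to picture when each scored item carries the claim it measures. Here's a simplified sketch that rolls item results up into a per-claim fraction and an overall fraction. Real assessments report calibrated scale scores rather than raw fractions, and the function name and sample data here are made up for illustration.

```python
from collections import defaultdict


def report(results: list[tuple[str, int, int]]) -> dict[str, float]:
    """Aggregate (claim, points_earned, points_possible) rows.

    Returns the fraction of points earned for each claim plus an
    "Overall" entry across all claims.
    """
    earned: dict[str, int] = defaultdict(int)
    possible: dict[str, int] = defaultdict(int)
    for claim, points_earned, points_possible in results:
        earned[claim] += points_earned
        possible[claim] += points_possible
    scores = {claim: earned[claim] / possible[claim] for claim in earned}
    scores["Overall"] = sum(earned.values()) / sum(possible.values())
    return scores


# Hypothetical results for one student on four ELA items.
scores = report([
    ("Reading", 1, 1),
    ("Reading", 0, 1),
    ("Writing", 3, 4),
    ("Research & Inquiry", 1, 1),
])
print(scores["Reading"], scores["Writing"])  # 0.5 0.75
```

This only works if the content specification clusters related skills under a consistent hierarchy, which is why that organization shows up in the list above.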

Wrapup

Compared with curricula, standards and content specifications are relatively short documents. The Common Core State Standards total 160 pages, much less than the textbook for a single grade. But standards have a disproportionate impact on all learning activities within the state, college, or class where they are used. Careful attention to the selection or construction of standards is a high-impact effort.

02 August 2018

Quality Assessment Part 1: Quality Factors

Flask

As I wrap up my service at the Smarter Balanced Assessment Consortium I am reflecting on what we've accomplished over the last 5+ years. We've assembled a full suite of assessments; we built an open source platform for assessment delivery; and multiple organizations have endorsed Smarter Balanced as more rigorous and better aligned to state standards than prior state assessments.

So, what are the characteristics of a high-quality assessment? How do you go about constructing such an assessment? And what distinguishes an assessment like Smarter Balanced from a typical quiz or exam that you might have in class?

That will be the subject of this series of posts. Starting from the achievement standards that guide construction of both curriculum and assessment I will walk through the process Smarter Balanced and other organizations use to create standardized assessments and then indicate the extra effort required to make them both standardized and high quality.

But, to start with, we must define what quality means — at least in the context of an assessment.

Goal of a Quality Assessment

Nearly a year ago the Smarter Balanced member states released test scores for 2017. In most states the results were flat — with little or no improvement from 2016. It was a bit disappointing but what surprised me at the time was the criticism directed at the test. "The test must be flawed," certain critics said, "because it didn't show improvement."

This seemed like a strange criticism to direct at the measurement instrument. If you stick your hand in an oven and it doesn't feel warm do you wonder why your hand is numb or do you check the oven to see if it is working? Both are possibilities but I expect you would check the oven first.

The more I thought about it, however, the more I realized that the critics have a point. Our purpose in deploying assessments is to improve student learning, not just to passively measure learning. The assessment is a critical part of the educational feedback loop.

Smarter Balanced commissioned an independent study and confirmed that the testing instrument is working properly. Nevertheless, there are more things that the assessment system can do to support better learning.

Features of a Quality Assessment

So, we define a quality assessment as one that consistently contributes to better student learning. What are the features of an assessment that does this?

  • Valid: The test must measure the skills it is intended to measure. That requires us to start with a taxonomy of skills, typically called achievement standards or state standards. The quality of the standards also matters, of course, but that's the subject of a different blog post. A valid test should be relatively insensitive to skills or characteristics it is not intended to measure. For example, it should be free of ethnic or cultural bias.
  • Reliable: The test should consistently return the same results for students of the same skill level. Since repeated tests may not be composed of the same questions, the measures must be calibrated to ensure they return consistent results. And the test must accurately measure growth of a student when multiple tests are given over an interval of time.
  • Timely: Assessment results must be provided in time to guide future learning activities. Summative assessments, big tests near the end of the school year, are useful but they must be augmented with interim assessments and formative activities that happen at strategic times during the school year.
  • Informative: If an assessment is to support improved learning, the information it offers must be useful for guiding the next steps in a student's learning journey.
  • Rewarding: Test anxiety has been the downfall of many well-intentioned assessment programs. Not only does anxiety interfere with the reliability of results but inappropriate consequences to teachers can encourage poor instructional practice. By its nature, the testing process is demanding of students. Upon completion, their effort should be rewarded with a feeling that they've achieved something important.

Watch This Space

In the coming weeks, I will describe the processes that go into constructing quality assessments. Because I'm a technology person, I'll include discussions of how data and technology standards support the work.