Of That

Brandt Redd on Education, Technology, Energy, and Trust

06 November 2018

Quality Assessment Part 8: Test Reports

This is part 8 of a 10-part series on building high-quality assessments.

[Image: Bicycle]

Since pretty much the first Tour de France, cyclists have assumed that narrow tires and higher pressures would make for a faster bike. As tire technology improved to handle higher pressures in tighter spaces, the consensus standard became 23mm width and 115 psi. And that standard held for decades, despite science that said otherwise.

Doing the math indicates that a wider tire will have a shorter footprint, and a shorter footprint loses less energy to bumps in the road. The math was confirmed in laboratory tests, and the automotive industry has applied this information for a long time. But tradition held in the Tour de France and other bicycle races until a couple of teams began experimenting with wider tires. In 2012, Velonews published a laboratory comparison of tire widths, and by 2018 the average width had moved up to 25mm, with some riders going as wide as 30mm.

While laboratory tests still confirm that higher pressure results in lower rolling resistance, high pressure also results in a rougher ride and greater fatigue for the rider. So teams are also experimenting with lower pressures adapted to the terrain being ridden and they find that the optimum pressure isn't necessarily the highest that the tire material can withstand.

You can build the best and most accurate student assessment ever. You can administer it properly with the right conditions. But if no one pays attention to the results, or if the reports don't influence educational decisions, then all of that effort will be for naught. Even worse, correct data may be interpreted in misleading ways. Like the tire width data, the information may be there but it still must be applied.

Reporting Test Results

Assuming you have reliable test results (the subjects of the preceding parts in this series), there are four key elements that must be applied before student learning will improve:

  • Delivery: Students, parents, and educators must be able to access the test data.
  • Explanation: They must be able to interpret the data — understand what it means.
  • Application: The student, and those advising the student, must be able to make informed decisions about learning activities based on assessment results.
  • Integration: Educators should correlate the test results with other information they have about the student.

Delivery

Most online assessment systems are paired with online reporting systems. Administrators are able to see reports for districts, schools, and grades, sifting and sorting the data according to demographic groups. This may be used to hold institutions accountable and to direct Title 1 funds. Parents and other interested parties can access public reports like this one for California containing similar information.

Proper interpretation of individual student reports has greater potential to improve learning than the school, district, and state-level reports. Teachers have access to reports for students in their classes and parents receive reports for their children at least once a year. But teachers may not be trained to apply the data, or parents may not know how to interpret the test results.

Part of delivery is designing reports so that the information is clear and the correct interpretation is the most natural. The design that seems obvious to experts in the field, well-versed in statistical methods, may not be the best one for parents and educators.

The best reports are designed using a lot of consumer feedback. The designers use focus groups and usability tests to find out what works best. In a typical trial, a parent or educator would be given a sample report and asked to interpret it. The degree to which they match the desired interpretation is an evaluation of the quality of the report.

Explanation

Even the best-designed reports will likely benefit from an interpretation guide. A good example is the Online Reporting Guide deployed by four western states. The individual student reports in these states are delivered to parents on paper. But the online guide provides interpretation and guidance to parents that would be hard to achieve in paper form.

Online reports should be rich with explanations, links, tooltips, and other tools to help users understand what each element means and how it should be interpreted. Graphs and charts should be well-labeled and designed as a natural representation of the underlying data.

An important advantage of online reporting is that it can facilitate exploration of the data. For example, a teacher might be viewing an online report of an interim test. They notice that a cluster of students all got a lower score. Clicking on the scores reveals a more detailed chart that shows how the students performed on each question. They might see that the students in the cluster all missed the same question. From there, they could examine the students' responses to that question to gain insight into their misunderstanding. When done properly, such an analysis would only take a few minutes and could inform a future review period.

Application

Ultimately, all of this effort should result in good decisions being made by the student and by others on their behalf. Closing the feedback loop in this way consistently results in improved student learning.

In part 2 of this series, I wrote that assessment design starts with a set of defined skills, also known as competencies or learning objectives. This alignment can facilitate guided application of test results. When test questions are aligned to the same skills as the curriculum, then students and educators can easily locate the learning resources that are best suited to student needs.

Integration

The best schools and teachers use multiple measures of student performance to inform their educational decisions. In an ideal scenario, all measures (test results, homework, attendance, projects, etc.) would be integrated into a single dashboard. Organizations like The Ed-Fi Alliance are pursuing this but it's proving to be quite a challenge.

An intermediate goal is for the measures to be reported in consistent ways. For example, measures related to student skill should be correlated to the state standards. This will help teachers find correlations (or lack thereof) between the different measures.

Quality Factors

  • Make the reports, or the reporting system, available and convenient for students, parents, and educators to use.
  • Ensure that reports are easy to understand and that they naturally lead to the right interpretations. Use focus groups and usability testing to refine the reports.
  • Actively connect test results to learning resources.
  • Support integration of multiple measures.

Wrapup

Every educational program, activity, or material should be considered in terms of its impact on student learning. Effective reporting that informs educational decisions makes the considerable investment in developing and administering a test worthwhile.

16 October 2018

Quality Assessment Part 7: Securing the Test

This is part 7 of a 10-part series on building high-quality assessments.

[Image: A shield]

Each spring, millions of students in the United States take their annual achievement tests. Despite proctoring, some fraction of those students carry in a phone or some other sort of camera, take pictures of test questions, and post them on social media. Concurrently, testing companies hire a few hundred people to scan social media sites for inappropriately shared test content and send takedown notices to site operators.

Proctoring, secure browsers, and scanning social media sites are parts of a multifaceted effort to secure tests from inappropriate access. If students have prior access to test content, the theory goes, then they will memorize answers to questions rather than study the principles of the subject. The high-stakes nature of the tests creates incentive for cheating.

Secure Browsers

Most computer-administered tests today are given over the world-wide web. But if students were given unfettered access to the web, or even to their local computer, they could look up answers online, share screen-captures of test questions, access an unauthorized calculator, share answers using chats, or even videoconference with someone who can help with the test. To prevent this, test delivery providers use a secure browser, also known as a lockdown browser. Such a browser is configured so it will only access the designated testing website and it takes over the computer - preventing access to other applications for the duration of the test. It also checks to ensure that no unauthorized applications are already running, such as screen grabbers or conferencing software.

Secure browsers are inherently difficult to build and maintain. That's because operating systems are designed to support multiple concurrent applications and to support convenient switching among applications. In one case, the operating system vendor added a dictionary feature — users could tap any word on the screen and get a dictionary definition of that word. This, of course, interfered with vocabulary-related questions on the test. In this, and many other cases, testing companies have had to work directly with operating system manufacturers to develop special features required to enable secure browsing.

Secure browsers must communicate with testing servers. The server must detect that a secure browser is in use before delivering a test and it also supplies the secure browser with lists of authorized applications that can be run concurrently (such as assistive technology). To date, most testing services develop their own secure browsers. So, if a school or district uses tests from multiple vendors, they must install multiple secure browsers.
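To make that handshake concrete, here is a minimal Python sketch of the kind of check a test server might perform before delivering a test. This is not the actual Universal Secure Browser Protocol; the header name, browser identifier, and allow-list below are invented for illustration.

```python
# Hypothetical sketch of a server-side secure-browser check (not the actual
# Universal Secure Browser Protocol). The header name, browser id, and
# allow-list are invented for illustration.

APPROVED_BROWSERS = {"ExampleSecureBrowser/12.1"}          # assumed registry of vetted browsers
AUTHORIZED_APPS = ["ScreenReaderX", "ScreenMagnifierY"]    # assumed assistive-technology allow-list

def begin_test_session(request_headers: dict) -> dict:
    """Refuse to deliver a test unless the client identifies as a secure browser."""
    browser_id = request_headers.get("X-Secure-Browser-Id", "")
    if browser_id not in APPROVED_BROWSERS:
        return {"status": "denied", "reason": "secure browser required"}
    # On success, tell the browser which concurrent applications it may allow.
    return {"status": "ok", "authorized_apps": AUTHORIZED_APPS}

print(begin_test_session({"X-Secure-Browser-Id": "ExampleSecureBrowser/12.1"}))
```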

To encourage a more universal solution, Smarter Balanced commissioned a Universal Secure Browser Protocol that would allow browsers and servers from different companies to work effectively together. They also commissioned and host a Browser Implementation Readiness Test (BIRT) that can be used to verify that a browser implements the required protocols as well as the basic HTML5 requirements. So far, Microsoft has implemented its Take a Test feature in Windows 10, which satisfies secure browser requirements, and Smarter Balanced has released into open source a set of secure browsers for Windows, MacOS, iOS (iPad), Chrome OS (ChromeBook), Android, and Linux. Nevertheless, most testing companies continue to develop their own solutions.

Large Item Pools - An Alternative Approach

Could there be an alternative to all of this security effort? Deploying secure browsers on thousands of computers is expensive and inconvenient. Proctoring and social media policing cost a lot of time and money. And conspiracy theorists ask if the testing companies have something to hide in their tests.

Computerized-adaptive testing opens one possibility. If the pool of questions is big enough, the probability that a student encounters a question they have previously studied will be small enough that it won't significantly impact the test result. With a large enough pool, you could publish all questions for public review and still maintain a valid and rigorous test. I once asked a psychometrician how large the pool would have to be for this. He estimated about 200 questions in the pool for each one that appears on the test. Smarter Balanced presently uses a 20-to-1 ratio. Another benefit of such a large item pool is that students can retake the test and still get a valid result.
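As a rough illustration of why the ratio matters, here is a back-of-the-envelope sketch. It assumes, purely for simplicity, that the engine draws a student's items uniformly from the pool; the 20:1 and 200:1 ratios come from the discussion above, while the illustrative 40-item test length and the number of leaked items a student might have studied are hypothetical.

```python
# Simplified exposure model: if a test of n items is drawn uniformly from a pool
# of n * ratio items, and a student has studied k leaked items, the expected
# number of previously seen items on the test is n * k / pool_size.

def expected_overlap(test_length: int, pool_ratio: int, items_studied: int) -> float:
    pool_size = test_length * pool_ratio
    return test_length * min(items_studied, pool_size) / pool_size

test_length = 40
for ratio in (20, 200):
    overlap = expected_overlap(test_length, ratio, items_studied=100)
    print(f"{ratio}:1 pool -> about {overlap:.1f} of {test_length} items previously seen")
```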

Even with a large item pool, you would still need to use a secure browser and proctoring to prevent students from getting help from social media. That is, unless we can change incentives to the point that students are more interested in an accurate evaluation than they are in getting a top score.

Quality Factors

The goal of test security is to maintain the validity of test results by ensuring that students do not have access to questions in advance of the test and that they cannot obtain unauthorized assistance during the test. The following practices contribute to a valid and reliable test:

  • For computerized-adaptive tests, have a large item pool thereby reducing the impact of any item exposure and potentially allowing for retakes.
  • For fixed-form tests, develop multiple forms. As with a large item pool, multiple forms let you switch forms in the event that an item is exposed and also allow for retakes.
  • For online tests, use secure browser technology to prevent unauthorized use of the computer during the test.
  • Monitor social media for people posting test content.
  • Have trained proctors monitor testing conditions.
  • Consider social changes, related to how test results are used, that would better align student motivation toward valid test results.

Wrapup

The purpose of Test Security is to ensure that test results are a valid measure of student skill and that they are comparable to other students' results on the same test. Current best practices include securing the browser, effective proctoring, and monitoring social media. Potential alternatives include larger test item banks and better alignment of student and institutional motivations.

05 October 2018

Quality Assessment Part 6: Achievement Levels and Standard Setting

This is part 6 of a 10-part series on building high-quality assessments.

[Image: Two mountains, one with a flag on top]

If you have a child in U.S. public school, chances are that they took a state achievement test this past spring and sometime this summer you received a report on how they performed on that test. That report probably looks something like this sample of a California Student Score Report. It shows that "Matthew" achieved a score of 2503 in English Language Arts/Literacy and 2530 in Mathematics. Both scores are described as "Standard Met (Level 3)". Notably, in prior years Matthew was in the "Standard Nearly Met" category so his performance has improved.

The California School Dashboard offers reports of school performance according to multiple factors. For example, the Detailed Report for Castle View Elementary includes a graph of "Assessment Performance Results: Distance from Level 3".

[Line graph: performance of Lake Matthews Elementary on the English and Math tests for 2015, 2016, and 2017. In all three years, the school scored between 14 and 21 points above proficiency in math and between 22 and 40 points above proficiency in English.]

To prepare this graph, they take the average difference between students' scale scores and the Level 3 standard for proficiency in the grade in which they were tested. For each grade and subject, California and Smarter Balanced use four achievement levels, each assigned to a range of scores. Here are the achievement levels for 5th grade Math (see this page for all ranges).

Level     Range               Descriptor
Level 1   Less than 2455      Standard Not Met
Level 2   2455 to 2527        Standard Nearly Met
Level 3   2528 to 2578        Standard Met
Level 4   Greater than 2578   Standard Exceeded

So, for Matthew and his fellow 5th graders, the Math standard for proficiency, or "Level 3" score, is 2528. Students at Lake Matthews Elementary, on average, exceeded the Math standard by 14.4 points on the 2017 tests.
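The mapping from scale score to achievement level is mechanical once the cut scores are set. Here is a minimal Python sketch using the Grade 5 Math cut scores from the table above; the function name and label strings are just for illustration.

```python
# Grade 5 Math cut scores from the table above: 2455, 2528, and 2579
# (2579 being the lowest score in the "Standard Exceeded" range).
GRADE5_MATH_LEVELS = [
    (2579, "Level 4: Standard Exceeded"),
    (2528, "Level 3: Standard Met"),
    (2455, "Level 2: Standard Nearly Met"),
]

def achievement_level(scale_score: int) -> str:
    for cut, label in GRADE5_MATH_LEVELS:
        if scale_score >= cut:
            return label
    return "Level 1: Standard Not Met"

print(achievement_level(2530))   # Matthew's math score -> Level 3: Standard Met
print(achievement_level(2440))   # -> Level 1: Standard Not Met
```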

Clearly, there are serious consequences associated with the assignment of scores to achievement levels. A difference of 10-20 points can make the difference between a school, or student, meeting or failing to meet the standard. Changes in proficiency rates can affect allocation of federal Title 1 funds, the careers of school staff, and even the value of homes in local neighborhoods.

More importantly to me, achievement levels must be carefully set if they are to provide reliable guidance to students, parents, and educators.

Standard Setting

Standard Setting is the process of assigning test score ranges to achievement levels. A score value that separates one achievement level from another is called a cut score. The most important cut score is the one that distinguishes between proficient (meeting the standard) and not proficient (not meeting the standard). For the California Math test, and for Smarter Balanced, that's the "Level 3" score but different tests may have different achievement levels.

When Smarter Balanced performed its standard setting exercise in October of 2014, it used the Bookmark Method. Smarter Balanced had conducted a field test the previous spring (described in Part 4 of this series). From those field test results, they calculated a difficulty level for each test item and converted that into a scale score. For each grade, a selection of approximately 70 items was sorted from easiest to most difficult. This sorted list of items is called an Ordered Item Booklet (OIB) though, in the Smarter Balanced case, the items were presented online. A panel of experts, composed mostly of teachers, went through the OIB starting at the beginning (easiest item) and set a bookmark at the item they believed represented proficiency for that grade. A proficient student should be able to answer all preceding items correctly but might have trouble with the items that follow the bookmark.

There were multiple iterations of this process for each grade, and then the correlation from grade to grade was also reviewed. Panelists were given statistics on how many students in the field tests would be considered proficient at each proposed skill level. Following multiple review passes, the group settled on the recommended cut scores for each grade. The Smarter Balanced Standard Setting Report describes the process in great detail.

Data Form

For each subject and grade, the standard setting process results in cut scores representing the divisions between achievement levels. The cut scores for Grade 5 math, from the table above, are 2455, 2528, and 2579. Psychometricians also calculate the Highest Obtainable Scale Score (HOSS) and Lowest Obtainable Scale Score (LOSS) for the test.

I am not aware of any existing data format standard for achievement levels. Smarter Balanced publishes its achievement levels and cut scores on its web site. The Smarter Balanced test administration package format includes cut scores, and HOSS and LOSS; but not achievement level descriptors.

A data dictionary for publishing achievement levels would include the following elements:

Element                        Definition
Cut Score                      The lowest scale score included in a particular achievement level.
LOSS                           The lowest obtainable scale score that a student can achieve on the test.
HOSS                           The highest obtainable scale score that a student can achieve on the test.
Achievement Level Descriptor   A description of what an achievement level means. For example, "Met Standard" or "Exceeded Standard".
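In the absence of a standard format, a machine-readable publication of these elements might look something like the following sketch. The JSON-style structure and field names are invented for illustration, and the LOSS and HOSS values are placeholders rather than actual published values.

```python
# Hypothetical machine-readable record for Grade 5 Math achievement levels,
# built from the data dictionary above. Field names are invented; LOSS and
# HOSS values are placeholders, not actual published values.
import json

grade5_math = {
    "subject": "Math",
    "grade": 5,
    "LOSS": 2200,   # placeholder value
    "HOSS": 2700,   # placeholder value
    "achievementLevels": [
        {"cutScore": 2200, "descriptor": "Standard Not Met"},      # Level 1 begins at LOSS
        {"cutScore": 2455, "descriptor": "Standard Nearly Met"},
        {"cutScore": 2528, "descriptor": "Standard Met"},
        {"cutScore": 2579, "descriptor": "Standard Exceeded"},
    ],
}

print(json.dumps(grade5_math, indent=2))
```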

Quality Factors

The stakes are high for standard setting. Reliable cut scores for achievement levels ensure that students, parents, teachers, administrators, and policy makers receive appropriate guidance for high-stakes decisions. If the cut scores are wrong, many decisions may be ill-informed. Quality is achieved by following a good process:

  • Begin with a foundation of high quality achievement standards, test items that accurately measure the standards, and a reliable field test.
  • Form a standard-setting panel composed of experts and grade-level teachers.
  • Ensure that the panelists are familiar with the achievement standards that the assessment targets.
  • Inform the panel with statistics regarding actual student performance on the test items.
  • Follow a proven standard-setting process.
  • Publish the achievement levels and cut scores in convenient human-readable and machine-readable forms.

Wrapup

Student achievement rates affect policies at state and national levels, direct budgets, impact staffing decisions, influence real estate values, and much more. Setting achievement level cut scores too high may set unreasonable expectations for students. Setting them too low may offer an inappropriate sense of complacency. Regardless, achievement levels are set on a scale calibrated to achievement standards. If the standards for the skills to be learned are not well-designed, or if the tests don't really measure the standards, then no amount of work on the achievement level cut scores can compensate.

14 September 2018

Quality Assessment Part 5: Blueprints and Computerized-Adaptive Testing

This is part 5 of a 10-part series on building high-quality assessments.

[Image: Arrows in a tree formation]

Molly is a 6th grade student who is already behind in math. Near the end of the school year, she takes her state's annual achievement tests in mathematics and English Language Arts. She is already anxious when she sits down to the test, and her fears are confirmed by the first question, which asks her to divide 3/5 by 7/8. Though her class spent several days on this during the year, she doesn't recall how to divide one fraction by another. As she progresses through the test, she is able to answer a few questions but resorts to guessing on all too many. After twenty minutes of this she gives up and just guesses on the rest of the answers. When her test results are returned a month later, she gets the same rating as in the three previous years: "Needs Improvement." Perpetually behind, she decides that she is "just not good at math."

Molly is fictional but she represents thousands of students across the U.S. and around the world.

Let's try another scenario. In this case, Molly is given a Computerized-Adaptive Test (CAT). When she gets the first question wrong, the testing engine picks an easier question, which she knows how to answer. Gaining confidence, she applies herself to the next question, which she also knows how to answer. The system presents easier and harder questions as it works to pinpoint her skill level within a spectrum extending back to 4th grade and ahead to 8th grade. When her score report comes, she has a scale score of 2505, which is below the 6th grade standard of 2552. The report shows her previous year's score of 2423, which was well below the standard for Grade 5. The summary says that, while Molly is still behind, she has achieved significantly more than a year's progress in the past year of school, much like this example of a California report.

Computerized-Adaptive Testing

A fixed-form Item Response Theory test presents a set of questions at a variety of skill levels centered on the standard for proficiency for the grade or course. Such tests result in a scale score, which indicates the student's proficiency level, and a standard error which indicates a confidence level of the scale score. A simplified explanation is that the student's actual skill level should be within the range of the scale score plus or minus the standard error. Because a fixed-form test is optimized for the mean, the standard error is greater the further the student is from the target proficiency for that test.

Computerized Adaptive Tests (CAT) start with a large pool of assessment items. Smarter Balanced uses a pool of 1,200-1,800 items for a 40-item test. Each question is calibrated according to its difficulty within the range of the test. The test administration starts with a question near the middle of the range. From then on, the adaptive algorithm tracks the student's performance on prior items and then selects questions most likely to discover and increase confidence in the student's skill level.

A stage-adaptive or multistage test is similar except that groups of questions are selected together.

CAT tests have three important advantages over fixed-form:

  • The test can measure student skill across a wider range while maintaining a small standard error.
  • Fewer questions are required to assess the student's skill level.
  • Students may have a more rewarding experience as the testing engine offers more questions near their skill level.

When you combine more accurate results with a broader measured range and use the same test family from year to year, you can reliably measure student growth over time.

Test Blueprints

As I described in Part 2 and Part 3 of this series, each assessment item is designed to measure one or two specific skills. A test blueprint indicates what skills are to be measured in a particular test and how many items of which types should be used to measure each skill.

As an example, here's the blueprint for the Smarter Balanced Interim Assessment Block (IAB) for "Grade 3 Brief Writes":

Block 3: Brief Writes

Claim     Target                                   Items   Total Items
Writing   1a. Write Brief Texts (Narrative)        4       6
          3a. Write Brief Texts (Informational)    1
          6a. Write Brief Texts (Opinion)          1

This blueprint, for a relatively short fixed-form test, indicates a total of six items spread across one claim and three targets. For more examples, you can check out the Smarter Balanced Test Blueprints. The Summative Tests, which are used to measure achievement at the end of each year, have the most items and represent the broadest range of skills to be measured.
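Although blueprints are traditionally published as documents (more on data formats below), the "Brief Writes" blueprint above is simple enough to show as structured data. The field names here are illustrative, not a standard format.

```python
# Illustrative (non-standard) structured form of the "Grade 3 Brief Writes" blueprint.
brief_writes_blueprint = {
    "block": "Block 3: Brief Writes",
    "claim": "Writing",
    "targets": [
        {"target": "1a. Write Brief Texts (Narrative)",     "items": 4},
        {"target": "3a. Write Brief Texts (Informational)", "items": 1},
        {"target": "6a. Write Brief Texts (Opinion)",       "items": 1},
    ],
}

total_items = sum(t["items"] for t in brief_writes_blueprint["targets"])
assert total_items == 6   # matches the blueprint's "Total Items" column
```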

When developing a fixed-form test, the test producer will select a set of items that meets the requirements of the blueprint and represents an appropriate mix of difficulty levels.

For CAT tests it's more complicated. The test producer must select a much larger pool of items than will be presented to the student. A minimum is five to ten items in the pool for each item to be presented to the student. For summative tests, Smarter Balanced uses a ratio averaging around 25 to 1. These items should represent the skills to be measured in approximately the same ratios as they are represented in the blueprint. And they should represent difficulty levels across the range of skill to be measured. (Difficulty level is represented by the IRT b parameter of each item.)

As the student progresses through the test, the CAT Algorithm selects the next item to be presented. In doing so, it takes into account three factors: 1. Information it has determined about the student's skill level so far, 2. How much of the blueprint has been covered so far and what it has yet to cover, and 3. The pool of items it has to select from. From those criteria it selects an item that will advance coverage of the blueprint and will improve measurement of the student's skill level.
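Here is a deliberately simplified sketch of that selection step. Real CAT engines use maximum-information selection under a full IRT model plus exposure controls; in this sketch, "information" is proxied by how close an item's difficulty (the b parameter) is to the current ability estimate, and blueprint coverage is tracked as a count of items still needed per target. All identifiers and values are made up.

```python
# Simplified next-item selection: prefer items for targets the blueprint still
# needs, and among those pick the one whose difficulty (b) is closest to the
# current ability estimate (a stand-in for maximum information).

def pick_next_item(pool, theta_estimate, needed_by_target, administered_ids):
    candidates = [
        item for item in pool
        if item["id"] not in administered_ids
        and needed_by_target.get(item["target"], 0) > 0
    ]
    if not candidates:            # blueprint satisfied (or pool exhausted)
        return None
    return min(candidates, key=lambda item: abs(item["b"] - theta_estimate))

pool = [
    {"id": "item-01", "target": "1a", "b": -1.2},
    {"id": "item-02", "target": "1a", "b": 0.3},
    {"id": "item-03", "target": "3a", "b": 0.9},
]
needed = {"1a": 1, "3a": 1}
print(pick_next_item(pool, theta_estimate=0.5, needed_by_target=needed, administered_ids=set()))
# -> item-02: its target is still needed and its difficulty is closest to 0.5
```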

Data Form

To present a CAT assessment the test engine needs three sets of data:

  • The Test Blueprint
  • A Catalog of all items in the pool. The entry for each item must specify its alignment to the test blueprint (which is equivalent to its alignment to standards), and its IRT Parameters.
  • The Test Items themselves.

Part 3 of this series describes formats for the items. The item metadata should include the alignment and IRT information. The manifest portion of IMS Content Packaging is one format for storing and transmitting item metadata.

To date, there is no standard or commonly-used data format for test blueprints. Smarter Balanced has published open specifications for its Assessment Packages. Of those, the Test Administration Package format includes the test blueprint and the item catalog. IMS CASE is designed for representing achievement standards, but it may also be applicable to test blueprints.

IMS Global has formed an "IMS CAT Task Force" which is working on interoperable standards for Computerized Adaptive Testing. They anticipate releasing specifications later in 2018.

Quality Factors

A CAT Simulation is used to measure the quality of a Computerized Adaptive Test. These simulations use a set of a few thousand simulated students, each assigned a particular skill level. The system then simulates each student taking the test. For each item, the item characteristic function is used to determine whether a student at that skill level is likely to answer correctly. The adaptive algorithm uses those results to determine which item to present next.

The results of the simulation show how well the CAT measures the skill levels of the simulated students by comparing each simulated student's test score against the skill level that student was assigned. Results of a CAT simulation are also used to ensure that the item pool has sufficient coverage, to confirm that the CAT algorithm satisfies the blueprint, and to find out which items get the most exposure. This feedback is used to tune the item pool and the configuration of the CAT algorithm to achieve optimal results across the simulated population of students.
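The core of such a simulation is the response-generation step: for each simulated student, the item characteristic function (the 3PL model discussed in Part 4 of this series) gives the probability of a correct answer, and a random draw decides the simulated response. The sketch below shows only that step, with made-up item parameters; a full simulation would also run the adaptive algorithm and score each simulated test.

```python
# Response generation for a CAT simulation: draw correct/incorrect responses
# from the 3PL item characteristic function. Item parameters are illustrative.
import math
import random

def p_correct(theta, a, b, c):
    """3PL probability that a student of ability theta answers correctly."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def simulate_response(theta, item, rng=random):
    return rng.random() < p_correct(theta, item["a"], item["b"], item["c"])

item = {"a": 1.0, "b": 0.0, "c": 0.25}
for theta in (-2.0, 0.0, 2.0):                    # assigned skill levels of simulated students
    hits = sum(simulate_response(theta, item) for _ in range(1000))
    print(theta, hits / 1000)                     # converges toward p_correct(theta, ...)
```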

To build a high-quality CAT assessment:

  • Build a large item pool with items of difficulty levels spanning the range to be measured.
  • Design a test blueprint that focuses on the skills to be measured and correlates with the overall score and the subscores to be reported.
  • Ensure that the adaptive algorithm effectively covers the blueprint and also focuses in on each student's skill level.
  • Perform CAT simulations to tune the effectiveness of the item pool, blueprint, and CAT algorithm.

Wrapup

Computerized adaptive testing offers significant benefits to students by delivering more accurate measures with a shorter, more satisfying test. CAT is best suited to larger tests with 35 or more questions spread across a broad blueprint. Shorter tests, focused on mastery of one or two specific skills, may be better served by conventional fixed-form tests.

01 September 2018

Quality Assessment Part 4: Item Response Theory, Field Testing, and Metadata

This is part 4 of a 10-part series on building high-quality assessments.

[Image: Drafting tools - triangle, compass, ruler]

Consider a math quiz with the following two items:

Item A:

x = 5 - 2
What is the value of x?

Item B:

x² - 6x + 9 = 0
What is the value of x?

George gets item A correct but gets the wrong answer for item B. Sally has the wrong answer for A but answers B correctly. Using traditional scoring, George and Sally each get 50%.

A more sophisticated quiz might assign 2 points to item A and 6 points to item B (recognizing that B is harder than A). Under such a scoring system, George would get 25% and Sally would get 75%.

But the score is still short on meaning. George scored 25% of what? Sally scored 75% of what?

An even more sophisticated model should acknowledge that knowing how to solve quadratics (item B) is evidence that the student can also perform subtraction (item A). Such a model would position George somewhere between first grade (single-digit subtraction) and High School (solving quadratics). That same model would indicate that Sally either guessed correctly on item B or made a mistake on item A that's not representative of her skill. Due to the conflicting evidence, we are less sure about Sally's skill level than George's. For both students, more items would be required to gain greater confidence in their skill levels.

Item Response Theory

Item Response Theory or IRT is a statistical method for describing how student performance on assessment items relates to their skill in the area the item was designed to measure.

The "three parameter logistic model" (3PL) for IRT describes the probability that a student of a certain skill level will answer the item correctly. Student proficiency is represented by θ (theta) and the three item parameters are a, b, and c. They represent the following factors:

  • a = Discrimination. This value indicates how well the item discriminates between proficient students and those who have not yet learned this skill.
  • b = Difficulty. This value indicates how difficult an item is for the student to answer correctly.
  • c = Guessing. The probability that a student might guess the correct response. For a multiple-choice question with four options, this would be 0.25 because the student has a one-in-four chance of guessing the right answer.

From these parameters we can create an item characteristic curve. The formula is as follows:

formula: p = c + (1 - c) / (1 + e^(-a(θ - b)))

This is much easier to understand in graph form. So I loaded it into the Desmos graphing calculator.

The vertical (y) axis indicates the probability that a student will answer the item correctly. The horizontal (x) axis is student proficiency (represented by θ in the equation). You can move the sliders to change the a, b, and c parameters and see how different items would be represented in an item characteristic curve.

In addition to this "three-parameter" model, there are other IRT models but they all follow this same basic premise: the function represents the probability that a student of a given skill (represented by θ, theta) will answer the question correctly. At least one parameter of the function represents the difficulty of the question. For items scored on a multi-point scale, there are multiple difficulty parameters (typically d1, d2, etc.) representing the difficulty thresholds for each point value.
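For readers who prefer code to a graphing calculator, here is a direct transcription of the three-parameter function into Python. The parameter values in the example are arbitrary.

```python
# The 3PL item characteristic function: probability that a student of ability
# theta answers the item correctly, given discrimination a, difficulty b, and
# guessing parameter c.
import math

def item_characteristic(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# For an item with b = 1.0, a student at theta = 1.0 lands halfway between the
# guessing floor c and 1.0 (0.625 when c = 0.25).
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(item_characteristic(theta, a=1.2, b=1.0, c=0.25), 2))
```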

Scale Scores

The difficulty parameter b, and the student skill value θ, are on the same logistic scale and center on the skill level being measured. For example, if an item is written for grade 5 math, a b parameter of 0 means that the average 5th grade student should be able to answer the question correctly 50% of the time.

Most assessments convert from this theta score into a scale score which is a consistent score reported to educators, students, and parents. For Smarter Balanced, the scale score ranges from 2000 to 3000 and represents skill levels from Kindergarten to High School Graduation. Theta scores are converted to scale scores using a polynomial function.
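As a sketch of what that conversion looks like, the snippet below applies a first-degree polynomial to a theta score. The slope and intercept are invented for illustration; they are not the actual Smarter Balanced scaling constants.

```python
# Hedged sketch of a theta-to-scale-score conversion using made-up coefficients
# (a first-degree polynomial). Actual programs publish their own scaling constants.

def to_scale_score(theta: float, slope: float = 85.0, intercept: float = 2500.0) -> int:
    return round(slope * theta + intercept)

for theta in (-3.0, 0.0, 3.0):
    print(theta, to_scale_score(theta))   # stays within a 2000-3000 style reporting range
```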

Field Testing

So how do we come up with the a, b, and c parameters for a particular item? Based on the item type and potential responses we can predict c (guessing) fairly well, but our experience at Smarter Balanced has shown that authors are not very good at predicting b (difficulty) or a (discrimination). To get an objective measure of these values we use a field test.

In Spring 2014 Smarter Balanced held a field test in which 4.2 million students completed a test - typically in either English Language Arts or Mathematics. Some students took both. For the participating schools and students, this was a practice test - gaining experience in administering and taking tests. Since the items were not yet calibrated, we could not reliably score the tests. For Smarter Balanced it offered critical data on more than 19,000 test items. For each item we gained more than 10,000 scored responses from students representing the target grades across all demographics.

Psychometricians used these data from students taking the test to calculate the parameters (a, b, and c) for each item in the field test. The process of calculating IRT parameters from field test data is called calibration. Once items were calibrated, we examined the parameters and the data to determine which items would be approved for use in tests. For example, if a is too low then the question likely has a flaw. It may not measure the right skill, or the answer key may be incorrect. Likewise, if the b parameter is different across demographic groups then the item may be sensitive to gender, cultural, or ethnic bias. Items from the field test that met statistical standards were approved and became the initial bank of items from which Smarter Balanced produces tests.

Each year Smarter Balanced does an embedded field test. Each test that a student takes has a few new "field test" items included. These items do not contribute to the student's test score. Rather, the students' scored responses are used to calibrate the items. This way the test item bank is being constantly renewed. Other organizations like ACT and SAT follow the same practice of embedding field test questions in regular tests.

To understand more about IRT, I recommend A Simple Guide to IRT and Rasch Modeling by Ho Yu.

Item Metadata

The IRT parameters, alignment to standards, and other critical information are collected as metadata about each item. In most cases, metadata is represented as a set of name-value pairs. There are many formats for representing metadata and also many dictionaries of field definitions. Smarter Balanced uses the metadata structure from IMS Content Packaging and draws field definitions from The Learning Resource Metadata Initiative (LRMI), from Schema.org, and from Common Education Data Standards (CEDS).

Here are some of the most critical metadata elements for assessment items with links to their definitions in those standards:

  • Identifier: A number that uniquely identifies this item.
  • PrimaryStandard: An identifier of the principal skill the item is intended to measure. The skill would be described in an Achievement Standard or Content Specification.
  • SecondaryStandard: Optional identifiers of additional Achievement Standards or Content Specifications that the item measures.
  • InteractionType: The type of interaction (multiple choice, matching, short answer, essay, etc.).
  • IRT Parameters: The a, b, and c parameters or another parameter set for the Item Response Theory function.
  • History: A record of when and how the item has been used to estimate how much it has been exposed.
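Pulled together, the metadata for a single item might look like the following name-value sketch. This is illustrative only; it is not the IMS Content Packaging manifest format, and the identifier, IRT values, and history entry are invented (the PrimaryStandard value reuses the Smarter Balanced content-specification identifier discussed in Part 2 of this series).

```python
# Illustrative name-value metadata for one assessment item (not the actual
# IMS Content Packaging format; values are invented).
item_metadata = {
    "Identifier": "item-200-12345",
    "PrimaryStandard": "M.G5.C1OA.TA.5.OA.A.1",            # content-spec alignment
    "SecondaryStandard": ["CCSS.Math.Content.5.OA.A.1"],
    "InteractionType": "Multiple Choice",
    "IRTParameters": {"model": "3PL", "a": 1.1, "b": 0.3, "c": 0.25},
    "History": [{"use": "field test", "year": 2014, "responses": 10000}],
}
```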

Quality Factors

States, schools, assessment consortia, and assessment companies all maintain banks of assessment items from which they construct their assessments. There are a number of efforts underway to pool resources from multiple entities into large, joint item banks. The value of items in any such bank is multiplied tenfold if the items have consistent and reliable metadata regarding alignment to standards and IRT parameters.

Here are factors to consider related to IRT Calibration and Metadata:

  • Are all items field-tested and calibrated before they are used in an operational test?
  • Is alignment to standards and content specifications an integral part of item writing?
  • Are the identifiers used to record alignment consistent across the entire item bank?
  • Is field testing an integral part of the assessment design?
  • Are IRT parameters consistent and comparable across the entire bank?
  • When sharing items or an item bank across multiple organizations, do all participants agree to contribute data (field testing and operational use) back to the bank?

Wrapup

Field testing can be expensive, inconvenient, or both. But without actual data from student performance we have no objective evidence that a particular assessment item measures what it's intended to measure at the expected level of difficulty.

The challenges around field testing, combined with a lack of training in IRT and related psychometrics, have kept these measures from being used in anything other than large-scale, high-stakes tests. Nevertheless, it's concerning to me that final exams and midterms of great consequence are rarely, if ever, calibrated and validated. Greater collaboration among institutions, among curriculum developers, or both could achieve sufficient scale for calibrated tests to become more common.

23 August 2018

Quality Assessment Part 3: Items and Item Specifications

This is part 3 of a 10-part series on building high-quality assessments.

[Image: Transparent cylindrical vessel with wires leading to an electric spark inside]

Some years ago I remember reading my middle school science textbook. The book was attempting to describe the difference between a mixture and a compound. It explained that water is a compound of two parts hydrogen and one part oxygen. However, if you mix two parts hydrogen and one part oxygen in a container, you will simply have a container with a mixture of the two gasses; they will not spontaneously combine to form water.

So far, so good. Next, the book said that if you introduced an electric spark in the mixed gasses you would, "start to see drops of water appear on the inside surface of the container as the gasses react to form water." This was accompanied by an image of a container with wires and an electric spark.

I suppose the book was technically correct; that is what would happen if the container was strong enough to contain the violent explosion. But, even as a middle school student, I wondered how the dangerously misleading passage got written and how it survived the review process.

The writing and review of assessments require the same rigor as writing textbooks, or more. An error on an assessment item affects the evaluation of all students who take the test.

Items

In the parlance of the assessment industry, test questions are called items. The latter term is intended to include more complex interactions than just answering questions.

Stimuli and Performance Tasks

Oftentimes, an item is based on a stimulus or passage that sets up the question. It may be an article, short story, or description of a math or science problem. The stimulus is usually associated with three to five items. When presented by computer, the stimulus and the associated items are usually shown together on a split screen so that the student can refer to the stimulus while responding to the items.

Sometimes, item authors will write the stimulus; this is frequently the case for mathematics stimuli as they set up a story problem. But the best items draw on professionally-written passages. To facilitate this, the Copyright Clearance Center has set up the Student Assessment License as a means to license copyrighted materials for use in student assessment.

A performance task is a larger-scale activity intended to allow the student to demonstrate a set of related skills. Typically, it begins with a stimulus followed by a set of ordered items. The items build on each other usually finishing with an essay that asks the student to draw conclusions from the available information. For Smarter Balanced this pattern (stimulus, multiple items, essay) is consistent across English Language Arts and Mathematics.

Prompt or Stem

The prompt, sometimes called a stem, is the request for the student to do something. A prompt might be as simple as, "What is the sum of 24 and 62?" Or it might be as complex as, "Write an essay comparing the views of the philosophers Voltaire and Kant regarding enlightenment. Include quotes from each that relate to your argument." Regardless, the prompt must provide the required information and clearly describe what the student is to do and how they are to express their response.

Interaction or Response Types

The response is a student's answer to the prompt. Two general categories of items are selected response and constructed response. Selected response items require the student to select one or more alternatives from a set of pre-composed responses. Multiple choice is the most common selected response type; others include multi-select (in which more than one response may be correct), matching, and true/false.

Multiple choice items are particularly popular due to the ease of recording and scoring student responses. For multiple choice items, alternatives are the responses that a student may select from, distractors are the incorrect responses, and the answer is the correct response.

The most common constructed response item types are short answer and essay. In each case, the student is expected to write their answer. The difference is the length of the answer; short answer is usually a word or phrase while essay is a composition of multiple sentences or paragraphs. A variation of short answer may have a student enter a mathematical formula. Constructed responses may also have students plot information on a graph or arrange objects into a particular configuration.

Technology-Enhanced items are another commonly used category. These items are delivered by computer and include simulations, composition tools, and other creative interactions. However, all technology-enhanced items can still be categorized as either selected response or constructed response.

Scoring Methods

There are two general ways of scoring items, deterministic scoring and probabilistic scoring.

Deterministic scoring is indicated when a student's response may be unequivocally determined to be correct or incorrect. When a response is scored on multiple factors there may be partial credit for the factors the student addressed correctly. Deterministic scoring is most often associated with selected response items, but many constructed response items may also be deterministically scored when the factors of correctness are sufficiently precise, such as a numeric answer or a single word for a fill-in-the-blank question. When answers are collected by computer or are easily entered into a computer, deterministic scoring is almost always done by computer.

Probabilistic scoring is indicated when the quality of a student's answer must be judged on a scale. This is most often associated with essay type questions but may also apply to other constructed response forms. When handled well, a probabilistic score may include a confidence level — how confident is the scoring person or system that the score is correct.

Probabilistic scoring may be done by humans (e.g. judging the quality of an essay) or by computer. When done by computer, Artificial Intelligence techniques are frequently used with different degrees of reliability depending on the question type and the quality of the AI.

Answer Keys and Rubrics

The answer key is the information needed to score a selected-response item. For multiple choice questions, it's simply the letter of the correct answer. A machine scoring key or machine rubric is an answer key coded in such a way that a computer can perform the scoring.

The rubric is a scoring guide used to evaluate the quality of student responses. For constructed response items the rubric will indicate which factors should be evaluated in the response and what scores should be assigned to each factor. Selected response items may also have a rubric which, in addition to indicating which response is correct, would also give an explanation about why that response is correct and why each distractor is incorrect.
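To make deterministic scoring against a machine scoring key concrete, here is a small sketch that covers both a multiple choice answer key and a two-factor key that awards partial credit. The item ids, keys, and factor names are invented for illustration.

```python
# Deterministic scoring sketch: a machine scoring key for a multiple choice item,
# and a two-factor key that awards partial credit. All ids and keys are invented.

MC_KEY = {"item-A": "C"}                                   # correct option letter

def score_multiple_choice(item_id: str, selected: str) -> int:
    return 1 if MC_KEY.get(item_id) == selected else 0

TWO_FACTOR_KEY = {"item-B": {"value_of_x": "3", "shows_work": True}}   # 1 point per factor

def score_two_factor(item_id: str, response: dict) -> int:
    key = TWO_FACTOR_KEY[item_id]
    return sum(1 for factor, expected in key.items() if response.get(factor) == expected)

print(score_multiple_choice("item-A", "C"))                                  # 1
print(score_two_factor("item-B", {"value_of_x": "3", "shows_work": False}))  # 1 of 2 (partial credit)
```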

Item Specifications

An item specification describes the skills to be measured and the interaction type to be used. It serves as both a template and a guide for item authors.

The skills should be expressed as references to the Content Specification and associated Competency Standards (see Part 2 of this series). A consistent identifier scheme for the Content Specification and Standards greatly facilitates this. However, to assist item authors, the specification often quotes relevant parts of the specification and standards verbatim.

If the item requires a stimulus, the specification should describe the nature of the stimulus. For ELA, that would include the type of passage (article, short-story, essay, etc.), the length, and the reading difficulty or text complexity level. In mathematics, the stimulus might include a diagram for Geometry, a graph for data analysis, or a story problem.

The task model describes the structure of the prompt and the interaction type the student will use to compose their response. For a multiple choice item, the task model would indicate the type of question to be posed, sometimes with sample text. That would be followed by the number of multiple choice options to be presented, the structure for the correct answer, and guidelines for composing appropriate distractors. Task models for constructed response would include the types of information to be provided and how the student should express their response.

The item specification concludes with guidelines about how the item will be scored including how to compose the rubric and scoring key. The rubric and scoring key focus on what evidence is required to demonstrate the student's skill and how that evidence is detected.

Smarter Balanced item specifications include references to the Depth of Knowledge that should be measured by the item and guidelines on how to make the items accessible to students with disabilities. Smarter Balanced also publishes specifications for full performance tasks.

Data Form for Item Specifications

Like Content Specifications, Item Specifications have traditionally been published in document form. When offered online they are typically in PDF format. And, as with content specifications, there are great benefits to be achieved by publishing item specs in a structured data form. Doing so can integrate the item specification into the item authoring system — presenting a template for the item with pre-filled content-specification alignment metadata, a pre-selected interaction type, and guidelines about the stimulus and prompt alongside the places where the author is to fill in the information.

Smarter Balanced has selected the IMS CASE format for publishing item specifications in structured form. This is the same data format we used for the content specifications.

Data Form for Items

The only standardized format for assessment items in general use is IMS Question and Test Interoperability (QTI). It's a large standard with many features. Some organizations have chosen to implement a custom subset of QTI features known as a "profile." The soon-to-be-released QTI 3.0 aims to reduce divergence among profiles.

A few organizations, including Smarter Balanced and CoreSpring, have been collaborating on the Portable Interactions and Elements (PIE) concept. This is a framework for packaging custom interaction types using Web Components. If successful, this will simplify the player software and support publishing of custom interaction types.

Quality Factors

A good item specification will likely be much longer than the items it describes. As a result, producing an item specification also consumes a lot more work than writing any single item. But, since each item specification will result in dozens or hundreds of items, the effort of writing good item specifications pays huge dividends in terms of the quality of the resulting assessment.

  • Start with good quality standards and content specifications.
  • Create task models that are authentic to the skills being measured. The task that the student is asked to perform should be as similar as possible to how they would manifest the measured skill in the real world.
  • Choose or write high-quality stimuli. For language arts items, the stimulus should demand the skills being measured. For non-language-arts items, the stimulus should be clear and concise so as to reduce sensitivity to student reading skill level.
  • Choose or create interaction types that are inherently accessible to students with disabilities.
  • Ensure that the correct answer is clear and unambiguous to a person who possesses the skills being measured.
  • Train item authors in the process of item writing. Sensitize them to common pitfalls such as using terms that may not be familiar to students of diverse ethnic backgrounds.
  • Use copy editors to ensure that language is consistent and parallel in structure and that expectations are clear.
  • Develop a review, feedback, and revision process for items before they are accepted.
  • Write specific quality criteria for reviewing items. Set up a review process in which reviewers apply the quality criteria and evaluate the match to the item specification.

Wrapup

Most tests and quizzes we take, whether in K-12 or college, are composed one question at a time based on the skills taught in the previous unit or course. Item specifications are rarely developed or consulted in these conditions and even the learning objectives may be somewhat vague. Furthermore, there is little third-party review of such assessments. Considering the effort students go through to prepare for and take an exam, not to mention the consequences associated with their performance on those exams, it seems like institutions should do a better job.

Starting from an item specification is both easier and produces better results than writing an item from scratch. The challenge is producing the item specifications themselves, which is quite demanding. Just as achievement standards are developed at state or multi-state scale, so also could item specifications be jointly developed and shared broadly. As shown in the links above, Smarter Balanced has published its item specifications and many other organizations do the same. Developing and sharing item specifications will result in better quality assessments at all levels from daily quizzes to annual achievement tests.

11 August 2018

Quality Assessment Part 2: Standards and Content Specifications

This is part 2 of a 10-part series on building high-quality assessments.

[Image: Mountain with flag on summit]

Some years ago my sister was in middle school and I had just finished my freshman year at college. My sister's English teacher kept assigning word search puzzles and she hated them. The family had just purchased an Apple II clone and so I wrote a program to solve word searches for my sister. I'm not sure what skills her teacher was trying to develop with the puzzles; but I developed programming skills and my sister learned to operate a computer. Both skill sets have served us later in life.

Alignment to Standards

The first step in building any assessment, from a quiz to a major exam, should be to determine what you are trying to measure. In the case of academic assessments, we measure skills, also known as competencies. State standards are descriptions of specific competencies that a student should have achieved by the end of the year. They are organized by subject and grade. State summative tests indicate student achievement by measuring how close each student is to the state standards — typically at the close of the school year. Interim tests can be used during the school year to measure progress and to offer more detailed focus on specific skill areas.

At the higher education level, Colleges and Universities set Learning Objectives for each course. A common practice is to use the term "competencies" as a generic reference to both state standards and college learning objectives, and I'll follow that pattern here.

The Smarter Balanced Assessment Consortium, where I have been working, measures student performance relative to the Common Core State Standards. Choosing standards that have been adopted by multiple states enables us to write one assessment that meets the needs of all our member states and territories.

The Content Specification

The content specification is a restatement of competencies organized in a way that facilitates assessment. Related skills are clustered together so that performance measures on related tasks may be aggregated. For example, Smarter Balanced collects skill measures associated with "Reading Literary Texts" and "Reading Informational Texts" together into a general evaluation of "Reading". In contrast, a curriculum might cluster "Reading Literary Texts" with "Creative Writing" because synergies occur when you teach those skills together.

The Smarter Balanced content specification follows a hierarchy of Subject, Grade, Claim, and Target. In Mathematics, the four claims are:

  1. Concepts and Procedures
  2. Problem Solving
  3. Communicating Reasoning
  4. Modeling and Data Analysis

In English Language Arts, the four claims are:

  1. Reading
  2. Writing
  3. Speaking & Listening
  4. Research & Inquiry

These same four claims are repeated in each grade but the expected skill level increases. That increase in skill is represented by the targets assigned to the claims at each grade level. In English Reading (Claim 1), the complexity of the text presented to the student increases and the information the student is expected to draw from the text is increasingly demanding. Likewise, in Math Claim 1 (Concepts and Procedures) the targets progress from simple arithmetic in lower grades to Geometry and Trigonometry in High School.

Data Form

Typical practice is for states to publish their standards as documents. When published online, they have typically been posted as PDF files. Such documents are human readable but they lack the structure needed for data systems to facilitate access. In many cases they also lack the identifiers that are required when referencing standards or content specifications.

Most departments within colleges and universities will develop a set of learning objectives for each course. Oftentimes, a state college system will develop statewide objectives. While these objectives are used internally for course design, there's little consistency in publishing the objectives. Some institutions publish all of their objectives while others keep them as internal documents. The Temple University College of Liberal Arts offers an example of publicly published learning objectives in HTML form.

In August 2017, IMS Global published the Competencies & Academic Standards Exchange (CASE) data standard. It is a vendor-independent format for publishing achievement standards suitable for course learning objectives, state standards, content specifications, and many other competency frameworks.

Public Consulting Group, in partnership with several organizations, built OpenSALT, an open-source "Standards Alignment Tool" that serves as a reference implementation of CASE.

Here's an example. Smarter Balanced originally published its content specifications in PDF form. The latest versions, from July of 2017, are available on the Development and Design page of their website. These documents have complete information but they do not offer any computer-readable structure.

"Boring" PDF form of Smarter Balanced Content Specifications:

In spring 2018, Smarter Balanced published the same specifications in CASE format using the OpenSALT tool. The structure of the format lets you navigate the hierarchy of the specifications. The CASE format also supports cross-references between publications; in this case, Smarter Balanced also published a rendering of the Common Core State Standards in CASE format to facilitate references from the content specifications to the corresponding Common Core standards.

"Cool" CASE form of Smarter Balanced Content Specifications and CCSS:

I hope you agree that the Standards and Content Specifications are significantly easier to navigate in their structured form. Smarter Balanced is presently working on a "Content Specification Explorer" which will offer a friendlier user interface on the structured CASE data.
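
To give a sense of what that machine-readable structure enables, here is a minimal sketch of a client pulling a framework from a CASE-conformant server, such as an OpenSALT instance, and listing its items and cross-references. The base URL and document identifier are placeholders, and the endpoint path and field names reflect my reading of the CASE specification rather than a tested client.

    # Minimal sketch of reading a competency framework from a CASE-conformant
    # server (e.g., an OpenSALT instance). The base URL and the document
    # identifier below are placeholders, not real endpoints.
    import requests

    BASE_URL = "https://example-opensalt-host.org/ims/case/v1p0"
    DOC_ID = "00000000-0000-0000-0000-000000000000"  # a CFDocument UUID

    # A CFPackage bundles the CFDocument with its CFItems and CFAssociations.
    resp = requests.get(f"{BASE_URL}/CFPackages/{DOC_ID}", timeout=30)
    resp.raise_for_status()
    package = resp.json()

    print("Framework:", package["CFDocument"]["title"])

    # Each CFItem carries an opaque identifier, an optional human-readable
    # code (humanCodingScheme), and the statement text itself.
    for item in package.get("CFItems", []):
        code = item.get("humanCodingScheme", "")
        print(code, "-", item.get("fullStatement", "")[:80])

    # Cross-references such as "Exact Match Of" live in CFAssociations.
    for assoc in package.get("CFAssociations", []):
        if assoc.get("associationType") == "exactMatchOf":
            print("exact match:",
                  assoc["originNodeURI"]["title"],
                  "->",
                  assoc["destinationNodeURI"]["title"])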

Identifiers

Regardless of how they are published, use of standards is greatly facilitated when an identifier is assigned to each competency. There are two general categories of identifiers. Opaque identifiers carry no meaning in themselves; they are often "Universally Unique IDs" (UUIDs), generated by an algorithm that ensures the same identifier is not used anywhere else in the world, and any meaning they carry comes from the record to which they are assigned. Human-readable identifiers, in contrast, are constructed so that their structure is meaningful to a human reader. There are good justifications for each approach.

The Common Core State Standards assigned both types of identifier to each standard, and Smarter Balanced has followed a similar practice in the identifiers for its Content Specification.

Common Core State Standards Example:

  • Opaque Identifier: DB7A9168437744809096645140085C00
  • Human Readable Identifier: CCSS.Math.Content.5.OA.A.1
  • URL: http://corestandards.org/Math/Content/5/OA/A/1/
  • Statement: Use parentheses, brackets, or braces in numerical expressions, and evaluate expressions with these symbols.
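
To make the distinction between the two identifier types concrete before moving on to the Smarter Balanced example, here is a short sketch showing how an opaque identifier is typically generated and how a human-readable identifier like the one above can be unpacked. The parsing is my own illustration, not an official CCSS algorithm.

    # Opaque vs. human-readable identifiers (illustrative only).
    import uuid

    # An opaque identifier: a UUID rendered as 32 hex digits. Its only
    # meaning comes from the record it is attached to.
    opaque_id = uuid.uuid4().hex.upper()
    print(opaque_id)  # e.g. 'DB7A9168437744809096645140085C00' (yours will differ)

    # A human-readable identifier encodes its meaning in its structure.
    human_id = "CCSS.Math.Content.5.OA.A.1"
    framework, subject, _, grade, domain, cluster, standard = human_id.split(".")
    print(f"{framework}: {subject}, grade {grade}, domain {domain}, "
          f"cluster {cluster}, standard {standard}")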

Smarter Balanced Content Specification Target Example:

You'll notice that the Smarter Balanced Content Specification target is a copy of the corresponding Common Core State Standard. The CASE representation includes an "Exact Match Of" cross-reference from the content specification to the corresponding standard to show that's the case.

Smarter Balanced has published a specification for its human-readable Content Specification Identifiers. Here's the interpretation of "M.G5.C1OA.TA.5.OA.A.1":

  • M Math
  • G5 Grade 5
  • C1 Claim 1
  • OA Domain OA (Operations & Algebraic Thinking)
  • TA Target A
  • 5.OA.A.1 CCSS Standard 5.OA.A.1
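
Because each segment has a fixed meaning, an identifier like this can be unpacked mechanically. The sketch below is my own illustration of that structure, not code published by Smarter Balanced, and it assumes the segment order described above.

    # Parse a Smarter Balanced human-readable target identifier such as
    # "M.G5.C1OA.TA.5.OA.A.1". Illustrative only; assumes the segment order
    # described above (subject, grade, claim+domain, target, CCSS standard).
    import re

    def parse_target_id(target_id: str) -> dict:
        subject, grade, claim_domain, target, *ccss = target_id.split(".")
        match = re.fullmatch(r"C(\d+)([A-Z]*)", claim_domain)
        if match is None:
            raise ValueError(f"Unexpected claim/domain segment: {claim_domain}")
        return {
            "subject": subject,                 # "M" = Math
            "grade": grade.lstrip("G"),         # "G5" -> "5"
            "claim": match.group(1),            # "C1OA" -> claim "1"
            "domain": match.group(2),           # "C1OA" -> domain "OA"
            "target": target.lstrip("T"),       # "TA" -> "A"
            "ccss_standard": ".".join(ccss),    # "5.OA.A.1"
        }

    print(parse_target_id("M.G5.C1OA.TA.5.OA.A.1"))
    # {'subject': 'M', 'grade': '5', 'claim': '1', 'domain': 'OA',
    #  'target': 'A', 'ccss_standard': '5.OA.A.1'}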

Quality Factors

The design of any educational activity should begin with a set of learning objectives. State standards offer a template for curricula, lesson plans, assessments, supplemental materials, games, and more. At the higher-education level, colleges and universities set learning objectives for each course that serve a similar purpose. The quality of the achievement standards will have a fundamental impact on the quality of the related learning activities.

Factors to consider when selecting or building standards or learning objectives include the following:

  • Are the competencies relevant to the discipline being taught?
  • Are the competencies parallel in construction, describing skills at a similar grain size?
  • Are the skills ordered in a natural learning progression?
  • Are related skills, such as reading and writing, taught together in a coordinated fashion?
  • Is the amount of material covered by the competencies appropriate for the amount of time that will be allocated for learning?

The Development Process and the Standards-Setting Criteria used by the authors of the Common Core State Standards offer some insight into how they sought to develop high quality standards.

Factors to consider when developing an assessment content specification include the following:

  • Does the specification reference an existing standard or competency set?
  • Are the competencies described in such a way that they can be measured?
  • Is the grain size (the amount of knowledge involved) for each competency optimal for construction of test questions?
  • Are the competencies organized so that related skills are clustered together?
  • Does the content standard factor in dependencies between competencies? For example, performing long division is evidence that an individual is also competent at multiplication.
  • Is the organization of the competencies, typically into a hierarchy, consistent and easy to navigate?
  • Does the competency set lend itself to reporting skills at multiple levels? For example, Smarter Balanced reports an overall ELA score and then subscores for each claim: Reading, Writing, Speaking & Listening, and Research & Inquiry. (A toy roll-up illustrating this follows the list.)
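
To illustrate that last point, here is a toy roll-up of item scores into claim subscores and an overall score. Real reporting uses item response theory and scaled scores rather than simple averages, so treat this strictly as a sketch of why clustering competencies under claims makes multi-level reporting possible.

    # Toy roll-up of item scores into claim subscores and an overall score.
    # Real assessments use IRT-based scaling, not simple averages; this only
    # illustrates why clustering competencies under claims enables subscores.
    from collections import defaultdict
    from statistics import mean

    # (claim, points earned out of 1) for a handful of hypothetical items
    item_scores = [
        ("Reading", 1.0), ("Reading", 0.5), ("Writing", 1.0),
        ("Speaking & Listening", 0.0), ("Research & Inquiry", 1.0),
    ]

    by_claim = defaultdict(list)
    for claim, score in item_scores:
        by_claim[claim].append(score)

    subscores = {claim: mean(scores) for claim, scores in by_claim.items()}
    overall = mean(score for _, score in item_scores)

    for claim, subscore in subscores.items():
        print(f"{claim}: {subscore:.2f}")
    print(f"Overall ELA: {overall:.2f}")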

Wrapup

Compared with curricula, standards and content specifications are relatively short documents. The Common Core State Standards total 160 pages, much less than the textbook for a single grade. But standards have a disproportionate impact on all learning activities within the state, college, or class where they are used. Careful attention to the selection or construction of standards is a high-impact effort.

02 August 2018

Quality Assessment Part 1: Quality Factors

This is part 1 of a 10-part series on building high-quality assessments.

Flask

As I wrap up my service at the Smarter Balanced Assessment Consortium, I am reflecting on what we've accomplished over the last 5+ years. We've assembled a full suite of assessments; we built an open-source platform for assessment delivery; and multiple organizations have endorsed Smarter Balanced as more rigorous and better aligned to state standards than prior state assessments.

So, what are the characteristics of a high-quality assessment? How do you go about constructing such an assessment? And what distinguishes an assessment like Smarter Balanced from a typical quiz or exam that you might have in class?

That will be the subject of this series of posts. Starting from the achievement standards that guide construction of both curriculum and assessment, I will walk through the process Smarter Balanced and other organizations use to create standardized assessments, and then indicate the extra effort required to make them both standardized and high quality.

But, to start with, we must define what quality means — at least in the context of an assessment.

Goal of a Quality Assessment

Nearly a year ago the Smarter Balanced member states released test scores for 2017. In most states the results were flat — with little or no improvement from 2016. It was a bit disappointing but what surprised me at the time was the criticism directed at the test. "The test must be flawed," certain critics said, "because it didn't show improvement."

This seemed like a strange criticism to direct at the measurement instrument. If you stick your hand in an oven and it doesn't feel warm, do you wonder why your hand is numb, or do you check the oven to see if it is working? Both are possibilities, but I expect you would check the oven first.

The more I thought about it, however, the more I realized that the critics have a point. Our purpose in deploying assessments is to improve student learning, not just to passively measure learning. The assessment is a critical part of the educational feedback loop.

Smarter Balanced commissioned an independent study, which confirmed that the testing instrument is working properly. Nevertheless, there are more things that the assessment system can do to support better learning.

Features of a Quality Assessment

So, we define a quality assessment as one that consistently contributes to better student learning. What are the features of an assessment that does this?

  • Valid: The test must measure the skills it is intended to measure. That requires us to start with a taxonomy of skills — typically called achievement standards or state standards and also known as competencies. The quality of the standards also matters, of course, but that's the subject of a different blog post. A valid test should be relatively insensitive to skills or characteristics it is not intended to measure. For example, it should be free of ethnic or cultural bias.
  • Reliable: The test should consistently return the same results for students of the same skill level. Since repeated tests may not be composed of the same questions, the measures must be calibrated to ensure they return consistent results. And the test must accurately measure growth of a student when multiple tests are given over an interval of time.
  • Timely: Assessment results must be provided in time to guide future learning activities. Summative assessments, big tests near the end of the school year, are useful but they must be augmented with interim assessments and formative activities that happen at strategic times during the school year.
  • Informative: If an assessment is to support improved learning, the information it offers must be useful for guiding the next steps in a student's learning journey.
  • Rewarding: Test anxiety has been the downfall of many well-intentioned assessment programs. Not only does anxiety interfere with the reliability of results but inappropriate consequences to teachers can encourage poor instructional practice. By its nature, the testing process is demanding of students. Upon completion, their effort should be rewarded with a feeling that they've achieved something important.

Watch This Space

In the coming weeks, I will describe the processes that go into constructing quality assessments. Because I'm a technology person, I'll include discussions of how data and technology standards support the work.

09 June 2018

A Brief History of Copyright

In the early 2000s I began writing a book titled Frictionless Media. The subject was business models for digital and online media. My thesis was that digital media is naturally frictionless — naturally easy to copy and transmit. Prior media formats had natural friction: they required specialized equipment and significant expense to copy. Traditional media business models are based on that natural friction. In order to preserve those business models, publishers have attempted to introduce artificial friction through mechanisms like Digital Rights Management. They would be better off adapting their business models to leverage that frictionlessness to their advantage. My ideas were inspired by my experience at Folio Corporation, where we had invented a sophisticated Digital Rights Management system for textual publications. We found that the fewer restrictions publishers imposed on their publications, the more successful they were.

I didn’t finish the manuscript before the industry caught up with me. Before long, most of my arguments were being made by dozens of pundits. Nevertheless, the second chapter, "A Brief History of Copyright," remains as relevant as ever. In 2018 I updated it to include recent developments such as Creative Commons.