Of That

Brandt Redd on Education, Technology, Energy, and Trust

11 November 2020

"Quality Assessment Part 10: Scoring Tests"

A paper with a pencil and some writing.

Lillian takes an exam and gets a score of 83%. Roberta takes an exam and achieves 95%. Are these comparable scores? Each is a fraction, but a fraction of what?

Suppose Lillian's exam was the final in College Thermodynamics. A score of 83% means that she earned 83% of the maximum possible score on the test. In traditional scoring, teachers assign a certain number of points to each question. A correct answer earns the student all points on that question. Incorrect earns zero points. And a partly correct answer may earn partial points.

Roberta's exam was in middle school Algebra. Like Lillian, her score is represented as a fraction of what can be achieved on that particular exam. Since these are different exams in different subject areas and different difficulty levels, comparing the scores isn't very meaningful. All you can say is that each student did well in her respective subject.

Standardized Scoring for Quality Assessment

This post is a long overdue coda to my series on quality assessment. In part 4 of the series I introduced item response theory (IRT) and how items can be calibrated on a continuous difficulty scale. In part 6 I described how standardized test scores are mapped to a continuous scale and interpreted accordingly.

This post is a bridge between the two. How do you compute a test score from a calibrated set of IRT questions and put it on that continuous scale?

Calibrated Questions

As described in part 4, a calibrated question has three parameters. Parameter a represents discrimination, parameter b represents difficulty, and parameter c represents the probability of getting the question right by guessing.

Consider a hypothetical test composed of five questions, lettered V, W, X, Y, and Z. Here are the IRT parameters:

Item      V      W      X      Y      Z
a       1.7    0.9    1.8    2.0    1.2
b      -1.9   -0.6    0.0    0.9    1.8
c       0.0    0.2    0.1    0.2    0.0

Here are the five IRT graphs plotted together:

Five S curves plotted on a graph with vertical range from 0 to 1.

The horizontal dimension is the skill level on what psychometricians call the "Theta" scale. The usable portion is from approximately -3 to +3. The vertical dimension represents the probability that a student of that skill level will answer the question correctly.
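For readers who prefer code to spreadsheets, here is a minimal sketch of how one such curve might be computed. The post does not state the exact formula, so this assumes the common three-parameter logistic (3PL) form without the optional 1.7 scaling constant; the function name `p_correct` and the `ITEMS` dictionary are illustrative, not taken from the spreadsheet.

```python
import math

# Item parameters (a, b, c) copied from the table in this post.
ITEMS = {
    "V": (1.7, -1.9, 0.0),
    "W": (0.9, -0.6, 0.2),
    "X": (1.8, 0.0, 0.1),
    "Y": (2.0, 0.9, 0.2),
    "Z": (1.2, 1.8, 0.0),
}


def p_correct(theta, a, b, c):
    """Assumed 3PL model: probability of a correct answer at skill level theta.

    a = discrimination, b = difficulty, c = chance of guessing correctly.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Probability that a student at Theta = 0 answers each item correctly.
for name, (a, b, c) in ITEMS.items():
    print(name, round(p_correct(0.0, a, b, c), 3))
```

Under this form, a student whose Theta equals an item's difficulty b has a probability of (1 + c) / 2 of answering correctly, which is why the guessing parameter lifts the left tail of the S curve.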

All of the graphs in this post, and the calculations that produce them, are in an Excel spreadsheet. You can download it from here and experiment with your own scenarios.

Adding Item Scores

When the test is administered, the items would probably be delivered in an arbitrary order. In this narrative, however, I have sorted them from least to most difficult.

For our first illustration the student answers V, W, and X correctly, and they answer Y and Z incorrectly. For answers they get correct, we plot the curve directly. For answers they get incorrect, we plot (1 - y) which inverts the curve. Here's the result for this example.

Three S curves and two inverted S curves plotted on a graph with vertical range from 0 to 1.

Now, we multiply the values of the curves together to produce the test score curve.

A bell-shaped curve with the peak at 0.4

The peak of the curve is at 0.4 (as calculated by the spreadsheet). This is the student's Theta Score for the whole test.
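The multiply-and-find-the-peak step can be sketched in a few lines of Python. This is an illustration under assumptions, not a copy of the spreadsheet: it assumes the common 3PL curve without a scaling constant, and it searches a grid from -4 to +4 in steps of 0.01 (both the range and the step are my choices). The names `likelihood` and `theta_score` are hypothetical.

```python
import math

# Item parameters (a, b, c) from the table earlier in this post.
ITEMS = {
    "V": (1.7, -1.9, 0.0),
    "W": (0.9, -0.6, 0.2),
    "X": (1.8, 0.0, 0.1),
    "Y": (2.0, 0.9, 0.2),
    "Z": (1.2, 1.8, 0.0),
}


def p_correct(theta, a, b, c):
    """Assumed 3PL curve: probability of a correct answer at skill theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def likelihood(theta, responses):
    """Multiply the curves: P for correct answers, (1 - P) for incorrect ones."""
    result = 1.0
    for name, correct in responses.items():
        p = p_correct(theta, *ITEMS[name])
        result *= p if correct else 1.0 - p
    return result


def theta_score(responses, lo=-4.0, hi=4.0, step=0.01):
    """Return the grid point where the test score curve peaks."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: likelihood(t, responses))


# First illustration: V, W, and X correct; Y and Z incorrect.
first = {"V": True, "W": True, "X": True, "Y": False, "Z": False}
print(round(theta_score(first), 1))
```

A grid search is used here only because it mirrors the plotted curve; a production scorer would more likely use a numerical optimizer.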

Scale Scores

As mentioned before, the Theta Score is on an arbitrary scale of roughly -3 to +3. But that scale can be confusing or even troubling to those not versed in psychometrics. A negative score could represent excellent performance but, to a lay person, it doesn't look good.

To make scores more relatable, we convert the Theta Score to a Scale Score. Scale scores are also on an arbitrary scale. For example, the ACT test is mapped to a scale from 1 to 36. SAT is on a scale from 400 to 1600 and the Smarter Balanced tests are on a scale from 2000 to 3000.

For this hypothetical test we decide the scale score should range from 1200 to 1800. We do this by multiplying the Theta Score by 100 and adding 1500. We describe this as having a scaling function with slope of 100 and intercept of 1500.

In this illustration, the Theta Score of 0.4 results in a scale score of 1540.
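The conversion is a single linear function. As a small sketch, using the slope of 100 and intercept of 1500 chosen for this hypothetical test (the function name `scale_score` is illustrative):

```python
SLOPE = 100.0       # scale points per unit of Theta
INTERCEPT = 1500.0  # scale score at Theta = 0


def scale_score(theta, slope=SLOPE, intercept=INTERCEPT):
    """Map a Theta Score onto the reporting scale with a linear function."""
    return slope * theta + intercept


# The usable Theta range of -3 to +3 maps to 1200 through 1800.
for theta in (-3.0, 0.4, 3.0):
    print(theta, round(scale_score(theta)))
```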

Another Example

We have another student that's less consistent with their answers. They get questions X and Z correct and all others incorrect.

A bell-shaped curve with the peak at -0.6 and wider than the previous curve.

In this case, the peak is at -0.6 which results in a scale score of 1440.

Confidence Interval - Standard Error

In addition to a Scale Score, most tests report a confidence interval. No test is a perfect measure of a student's true skill level. The confidence interval is the range of scores in which the student's true skill level should lie. For this hypothetical test we will use a 90% interval which means there is a 90% probability that the student's true skill level lies within the interval.

To calculate the confidence interval, we take the area under the curve. Then, starting at the maximum point, which is the score, we move outward until we get an area that is 90% of the full area under the curve. The interval is always centered on the peak.

To convert the low and high points of the confidence interval from a Theta scale to a Scale Score scale, you use the same slope and intercept as used before.

We will illustrate this with the results of the first test. Recall that it has a score of 0.4. The confidence interval is from -1.08 to 1.88. The Scale Score is 1540 with a confidence interval from 1392 to 1688.

The difference between the top or bottom of the interval and the score is the standard error. In this case, the standard error of the Theta Score is 1.48 and the standard error of the Scale Score is 148. Notice that the Scale Score standard error is the Theta standard error times the slope of the conversion function, 100 in this case.

A bell-shaped curve with the peak at 0.4 with a shaded confidence interval.

The second test result has a Theta Score of -0.6 and a confidence interval from -2.87 to 1.67 (Standard Error ±2.27). The Scale Score is 1440 with a confidence interval from 1213 to 1667 (Standard Error ±227).

A bell-shaped curve with the peak at -0.6 with a shaded confidence interval.

The second student answered harder questions correctly while missing easier questions. Intuitively this should lead to less confidence in the resulting score and that is confirmed by the broader confidence interval (larger standard error).
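The interval procedure described in this section can be sketched as follows. This is an approximation under stated assumptions: the 3PL curve without a scaling constant, a grid from -4 to +4 in steps of 0.01, and a simple sum over grid points standing in for the area. Because the integration range the spreadsheet uses is not given in the post, the widths this produces need not match the figures above exactly. The function names are hypothetical.

```python
import math

# Item parameters (a, b, c) from the table earlier in this post.
ITEMS = {
    "V": (1.7, -1.9, 0.0),
    "W": (0.9, -0.6, 0.2),
    "X": (1.8, 0.0, 0.1),
    "Y": (2.0, 0.9, 0.2),
    "Z": (1.2, 1.8, 0.0),
}


def p_correct(theta, a, b, c):
    """Assumed 3PL curve: probability of a correct answer at skill theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def likelihood(theta, responses):
    """Product of P (correct) or 1 - P (incorrect) over all items."""
    result = 1.0
    for name, correct in responses.items():
        p = p_correct(theta, *ITEMS[name])
        result *= p if correct else 1.0 - p
    return result


def score_and_interval(responses, level=0.90, lo=-4.0, hi=4.0, step=0.01):
    """Return (theta score, half-width of the interval centered on the peak)."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    values = [likelihood(t, responses) for t in grid]
    peak = max(range(len(values)), key=values.__getitem__)
    total = sum(values)

    # Start at the peak and move outward one grid step at a time on both
    # sides until the covered area reaches the requested share of the total.
    area = values[peak]
    half = 0
    while area < level * total:
        half += 1
        if peak - half >= 0:
            area += values[peak - half]
        if peak + half < len(values):
            area += values[peak + half]
    return grid[peak], half * step


first = {"V": True, "W": True, "X": True, "Y": False, "Z": False}
second = {"V": False, "W": False, "X": True, "Y": False, "Z": True}

for label, responses in (("first", first), ("second", second)):
    theta, half_width = score_and_interval(responses)
    scale = 100 * theta + 1500   # same slope and intercept as before
    error = 100 * half_width     # scale standard error = slope * theta error
    print(label, round(theta, 2), round(half_width, 2), round(scale), round(error))
```

Whatever range is chosen, the inconsistent second response pattern should produce the wider interval of the two.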

Disadvantages and Advantages to IRT Test Scoring

IRT Test Scoring is complicated to perform, to explain, and to understand. The math requires numerical methods that are impractical without computers.

Perhaps the biggest weakness is that this form of scoring works best when questions span a broad range of skill levels. That makes it less suitable for mastery assessment which benefits more from a threshold test.

For achievement testing, there are important advantages. Multiple tests can be calibrated to the same scale, supporting the measurement of growth year over year. Calibration of assessment items ensures that scores of different student cohorts are comparable even as the test itself is updated with new questions. And the scoring system is compatible with computer adaptive testing.

Wrapup

Next time you encounter a standardized test score you'll have a better idea of how it was calculated. And, hopefully that will give you a better understanding of how to interpret the score and what to do with it.

There are many types of assessment and corresponding ways to score them. Overall, it's important to ensure that the form of testing and scoring is well-suited to the way the results will be interpreted and applied.

08 May 2020

Did Blended Learning Save My Class from Coronavirus?

A Blender with a Coronavirus Inside

Last week I turned in final grades from teaching my first-ever college course. In December I agreed to teach one section of Information Systems 201 at Brigham Young University. I've been talking and writing about learning technology for a long time; I was overdue for some first-hand experience. Little did I know what an experience we were in for.

The outbreak of COVID-19 forced me, along with hundreds of thousands of other professors, to shift teaching to 100% online. The amazing part is how well it worked out.

About the Class

IS201 is required of all undergraduate business school majors at BYU. Nearly 800 took it Winter Semester. Mine was an evening section composed of 75 students. A handful dropped before the deadline. The other 68 stuck with me to the end. Majors in my class included Accounting, Marketing, Entrepreneurship, Management, Finance, Information Systems and a bunch of others. So, this was a very technical class taught to mostly less-technical majors.

The class starts with a lightweight introduction to Information Systems before diving into four technical subjects. First up was databases. We learned how to design a database using ER Diagrams and how to perform queries in SQL. For that segment, Data.world was our platform. Next up was procedural programming. We used Visual Basic for Applications to manipulate Microsoft Excel spreadsheets. We progressed to data analytics and visualization for which we used Tableau and Microsoft Excel. And the final segment was web programming in HTML and CSS.

Blended Learning

My class was blended from the beginning, which turned out to be quite valuable once Coronavirus hit and Physical Distancing began. As I write this I ponder how freaky it would have been to read those words only three months ago.

The course is standardized across all sections. We have a digital textbook hosted on MyEducator that was developed by a couple of the BYU Information Systems professors. The online materials are rich with both text and video explanations and hands-on exercises for the students. All homework and exams are submitted online and either scored automatically or by a set of TAs. Because the course is challenging, and taken by a lot of students, there is a dedicated lab where students can get TA help pretty much any time during school hours.

The online materials are sufficient that dedicated students can succeed on their own. In fact, an online section relies exclusively on the text and videos (that are available to all) and offers online TA support. Students of that section generally do well. However, they also self-select for the online experience.

So, students of IS 201 have available to them the following blending of learning experiences:

  • Live lectures.
  • Video tutorials. These are focused on a single topic and range from 5 to 25 minutes in length.
  • Written text with diagrams and images.
  • Live TA access.
  • Virtual TA access. Via email and text-chat from the beginning and online web conferencing later.
  • Office hours with the professor.

The assignments on which students are graded are consistent regardless of which learning method they choose. There are a few quizzes but most assignments are project-oriented.

The net result of this is that students are given a set of consistent assignments, mostly project-based, with their choice of a variety of learning opportunities for mastering the material.

Enter Coronavirus

On Thursday, March 12, 2020 BYU faculty and students got word that all classes would be cancelled Friday, Monday, and Tuesday. On Wednesday we would resume with everything entirely online. Students were encouraged to return to their homes in order to increase social distancing. Probably 2/3 of my class did so; one returning home to Korea. Yet, they all persisted. I didn't have any student drop out of the class after the switch to online.

Compared to many of my peers, the conversion was relatively easy. All of the learning materials were already online and students were already expected to submit their assignments online. Despite that, it took me about 10 hours to prepare. I adjusted the due dates for two assignments, scheduled the Zoom lessons and posted links in the LMS, sent out communications to the students, and responded to their questions. On Monday I hosted an optional practice class so that I and the handful of students that joined could get practice with the online format.

The department leadership gave me the option of teaching live classes using Zoom or recording my classes for viewing whenever students chose. I elected to teach live but to post recordings of the live lectures.

Thursday, March 19 I posted this on LinkedIn:

Wednesday was the first day of online instruction for Brigham Young University. That day we achieved more than 1,200 online classes, with more than 26,000 participants coming in from 60-plus countries. Not bad, considering we only had five days to prepare, many of the students returned to their homes during the suspension of classes, and we had an #earthquake that morning.

As with the in-person classes, attendance was optional. Attendance dropped from a typical 60 in-person to about 20-25 online. One particularly sunny afternoon I only had seven show up. On average, the recordings had about 30 views each but the last couple, with focus on the final project, had nearly 90 which means some of the students were watching them more than once.

Having worked from home for the last seven years, I have a lot of experience with online videoconferencing. Despite that, I felt a huge loss moving to online classes. I never before realized how much feedback I got from eye contact and facial expressions. In-person, students were more ready to raise their hands or interrupt with questions. Online, I often felt like I was talking to empty space. I had to be very deliberate in seeking feedback: maintaining long pauses when prompting for questions, encouraging them to post in the chat box, and suggesting questions of my own.

About two weeks into the online mode, I read an article that said there should be at least three live interactions per online class. They can be simple polls, a question to the students for which they are to call out or write a response, or a simple thumbs up or thumbs down on how well they are understanding the material. Zoom, like most other systems, has tools that make this pretty easy. And I found that the advice was good. Engagement really improved when I added even one or two interactions.

The biggest change was with the TA labs. The two TAs that served my class had to move their sessions online, again using Zoom and screen-sharing to support the students. They did an excellent job and I'm enormously grateful. My office hours were also virtually hosted. But, to my surprise, I only had three students make use of that in the online portion of the semester.

A Teaching Opportunity

COVID-19 became a threat to the U.S. just as my class was getting into the unit on Data Analytics. I wrote a little program in C# to download the virus data from Johns Hopkins University and reformat it into a table more suitable for analysis with Microsoft Excel or Tableau. On this webpage I posted links to the data, to the downloader program, and getting started videos for doing the analysis.

Graph of US Coronavirus Cases

Wherever possible, throughout that unit, I used the COVID-19 data for my in-class examples. It turned out to be an excellent opportunity to show the strength of proper visualization with real-world data. I also showed examples of how rendering correct data in the wrong way can be misleading. Feedback from the students was very positive though it was sobering when we analyzed death rates.

Saved by Blended Learning

There are many models for blended learning. My class started out with a selection of learning modes with students given the freedom to choose among them. The LMS we used gives statistics on the students' modes of learning. Across the class, students only watched 15% of videos to completion. Meanwhile, they read 71% of the reading materials and completed 95% of the assessments. My rough estimate is that about 65% attended or viewed the lectures. I don't have statistics on their use of virtual TA help but I'm sure it was considerable.

This correlates with what I have seen in studies. Video is exciting but most students prefer reading with still images. That's because they control the pace. Live interactions remain important because a teacher can respond immediately to feedback from the class. Online-live is more challenging because most visual cues are eliminated but there are ways to compensate. Most of them involve deliberate effort on the part of the instructor such as prompting for questions, instant quizzes, votes, and so forth.

Despite the challenges, my class came out with a 3.4 average, considerably better than the expected 3.2. I would love to take credit for that. But I think it has more to do with a subject and format that are well-suited to a blended model, high-quality online materials (prepared by my predecessors), and resilient students who simply hung in there until the end.

14 January 2020

What’s Up in Learning Technology?

A Lighthouse. Image by PIRO4D from Pixabay.

With the turn of the decade I have read a lot of pessimistic articles about education and learning technology. Most start with the lamentation that there has been little overall progress in student achievement over the last couple of decades – which is true, unfortunately. But what they fail to note are the many small and medium scale successes.

Take, for example, Seaford School District in Delaware. The community has been economically challenged since DuPont closed its Nylon plant there. Three of its four elementary schools were among the lowest performing in the state just a few years ago. Starting with a focus on reading, they ensured a growth mindset among the educators, gave them training and support, and deployed data programs to track progress and inform interventions. They drew in experts in learning science to inform their programs and curriculum. The result: the district now matches the state averages in overall performance and the three challenged elementary schools are strongly outperforming the state in reading and mathematics.

My friend, Eileen Lento, calls this a lighthouse because it marks the way toward the learning successes we’re seeking. For sailing ships, you don’t need just one lighthouse. You need a series of them along the coast. And each lighthouse sends out a distinct flash pattern so that navigators can tell which one they are looking at. By watching educational lighthouses, we gain evidence of the learning reforms that will make a real and substantial difference in students’ lives.

What does the evidence say?

Perhaps the most dramatic evidence-based pivot in the last decade has been the Aurora Institute, formerly iNACOL. In 2010 their emphasis was on online learning and virtual schools. But the evidence pointed them toward competency-based learning and so they launched CompetencyWorks; they renamed the symposium; and, ultimately, renamed the whole organization.

Much criticism has been leveled at No Child Left Behind, and its successor, the Every Student Succeeds Act. The beneficial results of these federal interventions are the state standards, which form the foundation of competency-based learning; and consistent annual reports that indicate how well K-12 schools are performing. On the downside, we’ve learned that measuring and reporting performance, by themselves, are not enough to drive improvement.

Learning science has made great gains in general awareness over the last decade. We’ve learned that a growth mindset makes a critical difference in how students respond to feedback and that the form of praise given by teachers and mentors can develop that mindset. We have evidence backing the notion that deliberate practice and feedback are required to develop a new skill. And we’ve gained nuance about Bloom’s Two Sigma Problem – that tutoring must be backed by a mastery-based curriculum and that measures of mastery must be rigorous in order to achieve the two standard deviation gains that Benjamin Bloom observed.

Finally, we’ve learned that the type of instructional materials doesn’t matter nearly as much as how they are used. Video and animation are not significantly better at teaching than still pictures and text. That is, until interactivity and experimentation are added. To those, we must also add individual attention from a teacher, opportunities to practice, and feedback.

Learning Technology Responding to the Challenge

A common realization in this past decade is that technology does not drive learning improvement. Successful initiatives are based on a sound understanding of how students learn best. Then, technology may be deployed that supports the initiative.

A natural indicator of what technology developers are doing is the cross-vendor standards effort. In the last couple of years there has emerged an unprecedented level of cooperation not just between vendors but also between the technology standards groups.

Here’s what’s up:

Learning Engineering

A properly engineered learning experience requires a coalescence of Instructional Design, Learning Science, Data Science, Competency-Based Learning and more. The IEEE Learning Technology Standards Committee (LTSC) has sponsored the Industry Consortium on Learning Engineering (ICICLE) and I’m pleased to be a member. We held our conference on Learning Engineering in May 2019, and proceedings are due out in Q1 of 2020. The eight Special Interest Groups (SIGs) meet regularly and we have a monthly community meeting.

Interoperable Learner Records (ILR)

The concept is that every learner (and that’s hopefully everyone) should have a portable record that tracks every skill they have mastered. Such a record would support learning plans and guide career opportunities.

  • The T3 Innovation Network, sponsored by the US Chamber of Commerce Foundation, includes “Open Data Standards” and “Map and Harmonize Data Standards” among their pilot projects. These projects are intended to support use of existing standards rather than develop new ones.
  • Common Education Data Standards (CEDS) define the data elements associated with learner records of all sorts and the various standards initiatives continue to align their data models to CEDS.
  • IMS Global has published the Comprehensive Learner Record (CLR) standard.
  • The PESC Standards define how to transfer student records to, from, and between colleges and universities.
  • The Competency Model for Learning Technology Standards (CM4LTS) study group has been authorized by the IEEE LTSC to document a common conceptual model that will harmonize current and future IEEE LTSC standards. The model is anticipated to be based on CEDS.
  • The Advanced Digital Learning Initiative (ADL) has launched the Total Learning Architecture (TLA) working group seeking to develop “plug and play” interoperability between adaptive instructional systems, intelligent digital tutors, real-time data analytics, and interactive e-books. Essential to the TLA will be a portable learner record that functions across products.
  • The HR Open Standards Consortium defines standards to support human resource management. The standards include competency-oriented job descriptions and experience records.

While these may seem like competing efforts, there is a tremendous amount of cooperation and shared membership across the different groups. In fact, A4L, PESC, and HR Open Standards have established an open sharing and cooperation agreement. Our goal is a complementary and harmonious set of standards.

Competency Frameworks

A Competency Framework is a set of competencies (skills, knowledge, abilities, attitudes, or learning outcomes) organized into a taxonomy. Examples include the Common Core State Standards, Next Generation Science Standards, the Physician Competency Reference Set, the Cisco Networking Academy Curriculum, and the O*Net Spectrum of Occupations. There are hundreds of others. Interoperable Learner Records must reference competency frameworks to represent the competencies in the record.

Learning Resource Metadata

Metadata can indicate that a piece of content (text, audio, video, interactive activity, etc.) is intended to teach or assess a particular competency or set of competencies. So, when a person completes an activity, their interoperable learner record can be updated with evidence that they have learned or are learning those competencies.

Standards Advocacy

All of this interoperability effort will be of little use if the developers of learning activities and tools don’t make use of them.

  • Project Unicorn advocates for U.S. school districts to require interoperability standards from the vendors that supply their educational tools.
  • EdMatrix is my own Directory of Learning Standards. It is intended to help developers of learning tools know what standards are applicable, to help learning institutions know what to seek or require, and to help standards developers know what related efforts are underway and to support cooperation among them.

Looking to a New Decade

It can be discouraging to look back on the last decade or two and compare the tremendous investment society has put into education with the lack of measurable progress in outcomes.

I prefer to look forward and right now I’m optimistic. Here’s why:

Our understanding of learning science has grown. In daily conversation we use terms like “Growth Mindset,” “Competency-Based Learning,” “Practice and Feedback,” and “Motivation”.

Online and Blended Learning, and their cousin, Adaptive Learning Platforms, have progressed from the “Peak of Inflated Expectations” through the “Trough of Disillusionment” (using Gartner Hype Cycle terms) and are on their way to the “Plateau of Productivity.” Along the way we’ve learned that technology must serve a theory of learning not the other way around.

Technology and standards efforts now span from primary and secondary education, through higher education, and into workforce training and lifelong learning. This reflects a rapidly changing demand for skills in the 21st century and a realization that most people will have to retrain 3-4 times during their lifetime. I expect that lasting improvement in postsecondary education and training will be driven by workplace demands and that corresponding updates to primary and secondary education will be driven by the downstream demand of postsecondary.

So, despite a lack of measurable impact in standardized tests, previous efforts have established a foundation of competency standards and measures of success. We have hundreds of “lighthouses” - successful initiatives worthy of imitation. On the foundation of competencies and standards, and following the lighthouse guides, we will build successful, student-centric learning systems.

What do you think? Are the investments of the last couple of decades finally about to pay off? Let me know in the comments.

01 November 2019

Themes manifest as iNACOL Becomes Aurora

Arrows representing systems integration.

In 2010 I took on the responsibility of forming an Education Technology Strategy for the Bill & Melinda Gates Foundation. That same year, I also attended the iNACOL Virtual Schools Symposium (VSS). A year later, I presented at the symposium and I've been pleased to present or contribute in some way most years since.

As my colleagues and I at the Gates Foundation worked on a theory of technology and education, something quickly became clear. Technology doesn't drive educational improvement; it's simply an enabler. In the early part of this decade there were numerous 1:1 student:computer initiatives. Most failed to show measurable improvement and many turned into fiascos as teachers were tasked with finding something useful to do with their new computers or tablets.

At the foundation we turned to personalized learning, a theory that was based on promising evidence and one that has gained more support since then. With that as basis we looked to where technology could help. The result was support for key projects including Common Education Data Standards, the Learning Resource Metadata Initiative, and Profiles of Next-Generation Learning.

The great folks at iNACOL observed the same patterns and so they pivoted. VSS became, simply, the iNACOL Symposium and their emphasis shifted to personalized and competency-based education with online and blended learning as enablers. This year, they completed the transition, renaming the whole organization to The Aurora Institute. In their words:

[Our] organization has evolved significantly to become a leading nonprofit organization with a deep reach into practitioners creating next-generation learning models. Our focus has grown to examine systems change and education innovation, facilitating the future of learning through personalized learning and student-centered approaches to next-generation learning.

Serving Educators and Students

A theme that spontaneously emerged at the symposium this year is that we must do for the educators what we want for the students. It was first expressed by Dr. Brooke Stafford-Brizard in her opening keynote. As she advocated that we care for the mental health of the children she said, "Across all of our partners who have successfully integrated whole child practice, there isn’t one who didn’t start with their adults." She proceeded to show examples where school mental health programs were designed to support both staff and students.

With that as precedent, the principle kept reappearing throughout the symposium.

  • If we expect personalized instruction for the students we must offer personalized professional development for their teachers.
  • Establish the competency set we expect of educators and provide opportunities to master those competencies.
  • Actionable feedback to educators is critical to the success of any learning innovation just as actionable feedback to students is critical to their learning.
  • Create an environment of trust and safety among the staff of your institution - then project that to the students.
  • Growth mindset is as important to educators as it is for the students they teach.

Continuous Improvement

Both themes — technology as enabler, and caring for the educators — are simply signposts on a path of continuous improvement. We must follow the evidence and go where it leads us.

07 March 2019

A Support System for High-Performing Schools

Arrows representing systems integration.

Charter schools operated by Charter Management Organizations (CMOs) tend to outperform other charter schools and public schools. The National Study of Charter Management Organization Effectiveness from 2011 was the first rigorous study of CMO effectiveness and it showed that CMO-operated schools were better than other options. A 2017 study by Stanford University's Center for Research on Education Outcomes found that students enrolled in CMO-operated schools in New York City substantially outperformed their peers in conventional public schools and independent charter schools.

This improvement is to be expected. A basic premise of CMO operations is to study what works, and carry successful practices to other schools in the network.

Some conventional public schools are following a similar pattern. Their solution providers don't necessarily manage the school, like a CMO would. Instead, providers offer an integrated set of services backed by an evidence-driven theory of effective teaching. Here is the ecosystem I expect to emerge in the next few years:

  • Component and Curriculum Suppliers
  • Educational Solution Providers
  • Schools (and other learning institutions)

This same basic model applies to primary, secondary, and higher education though large universities and big districts have the capacity to be their own solution providers. Let's look at the components:

Schools, Districts, and other Learning Institutions

The school is where the teaching and learning occurs. It's where the supply chain of standards, curriculum, educational training, assessments, learning science, and everything else finally meets the student.

Many schools are implementing the same kinds of programs as charters: online curricula, blended learning, teacher dashboards, etc. But the complexity of integration grows exponentially with the number of components to combine. Building an integrated whole is beyond the capacity of most schools and all but the largest districts. The same pattern exists in higher education. Large universities can deliver an integrated solution but community colleges have a harder time.

Component and Curriculum Suppliers

On the supply side, there's a rich, complex, and rapidly growing market of component and curriculum suppliers. They include conventional textbook publishers, online curriculum developers, assessment providers, Learning Management Systems (LMS), Student Information Systems (SIS), and more.

Beyond these well-defined categories there's a host of other components, each designed to address a particular need in the educational economy. For example, Learnosity builds tools for creating and embedding high-quality assessments. Gooru offers a learning map, helping students know where they are in their learning progression. EdConnective offers live, virtual coaching for teachers. In 2018, education technology investment grew to a record $5.23 billion in the U.S. and a breathtaking $16.34 billion worldwide. We can expect many more components and materials to be produced from that level of investment.

Many of these components are raw - requiring significant integration effort before they can become part of an integrated learning solution. Despite this, developers of these components attempt to sell them directly to schools, districts, and states.

Educational Solution Providers

Summit Public Schools is a CMO that consistently achieves high rankings. Summit Learning also offers their online curriculum to public schools. But, separating the curriculum from the balance of the solution hasn't been so successful. In November 2018, Brooklyn students held a walkout and parents created a website to protest "Mass Customized Learning." It's not that the materials were bad; they were well-proven in other contexts. But, separated from the balance of the Summit program the student experience suffered.

An important new category in the education supply chain is the Educational Solution Provider. CMOs belong to this category but solution providers to conventional schools don't take over management like a CMO would. Rather, they provide an integrated set of services that includes training and coaching for staff and leadership.

The best solution providers start with an evidence-based learning theory. They then assemble a comprehensive solution based on the theory and selected from the rich menu provided by the component market. A complete solution includes:

  • Training and Coaching Services
  • Professional Development
  • Curriculum (conventional or online)
  • Assessment (ideally curriculum-embedded)
  • Secure Student Data Systems with Educator Dashboards
  • Effectiveness Measures
  • Continuous Improvement

An important job for solution providers is to integrate the components so that they work seamlessly together in support of their learning theory. Training and professional development should embody the same theory that is being expressed to the students. LMS, SIS, dashboards, and all other online systems should function together as one solution even if the provider is sourcing the components from an array of suppliers. In order to do this, the solution provider must have their own curriculum experts for the content side and a talented technology staff focused on systems integration.

Players in this nascent category include The Achievement Network, CLI Solutions Group, and The National Institute for Excellence in Teaching. I think we can expect new entrants in the next few years. Successful CMOs may also cross over to providing services to conventional public schools.

Wrapup

The educational component and curriculum market is rich and rapidly growing with record levels of investment. But, schools don't have the capacity to integrate these components effectively and they need a guiding theory to underpin the selection of components and how they are to be integrated. The emerging category of Educational Solution Provider fills an important role in the ecosystem.

Are you aware of other existing or emerging solution providers? Please let me know in the comments!

11 February 2019

Public-Private Partnership for Public Works

SR 99 tunnel cross section visualization

On February 28, 2001 I was at Microsoft Headquarters in Redmond, Washington when the Nisqually Earthquake hit. I was using Microsoft's scalability lab to perform tests on Agilix software. I remember standing in the doorway and asking someone down the hall, "Is this really an earthquake?" It obviously was, but not having experienced one before, my mind was still disbelieving.

Nine years later we moved to Seattle where I developed an education technology strategy at the Bill & Melinda Gates Foundation. At the time, politicians were still trying to figure out what should replace the Alaskan Way Viaduct which had been damaged in the earthquake, and which engineers predicted could collapse should another earthquake occur.

Last week, the Washington SR 99 tunnel replaced the viaduct, 18 years after the earthquake damaged it. Ironically, the tunnel opening was accompanied by a snowstorm that paralyzed the Northwest, making the tunnel one of the few clear roads in the area.

Funding of Public Works

Grand Central Terminal

A few years back I visited New York's Grand Central Terminal and wondered at the great investments made in public works in the early 20th century. The terminal building is beautiful, functional, and built to last. It's been going for more than a century and will probably continue for a century or two more. I wondered why it is so hard to find contemporary investments in public works of such grandeur. However, upon doing some research I found that Grand Central was funded entirely by private investors. Even today, the building is privately owned though the railroad it serves has now been merged into the MTA, a public benefit corporation.

When we visited Seoul, Korea in 2015 we spent five days getting around on the excellent Seoul Metropolitan Subway. It is fast, efficient, clean, and among the largest subway systems in the world with more than 200 miles of track. It features wireless internet throughout, and most platforms are protected by automated doors, greatly improving safety. Yet, the whole network has been built since 1971. The subway is built and operated by Seoul Metro, Korail, and Metro 9. Seoul Metro and Korail are Korean public corporations; these are corporations where the government owns a controlling interest. Metro 9 is a private venture.

This past December we visited Brisbane, Australia. Brisbane traffic has been mitigated through the construction of several bypass tunnels including the Airport Link. The tunnels have been built in a relatively short time through public-private partnerships.

As I researched these projects I saw a consistent pattern. The most successful public works projects seem to involve some form of cooperation between government and private enterprise. Funding is more easily obtained and project management is better when a private organization participates and stands to benefit from the long-term success of the project. But government support is also needed to represent the public interest, to streamline access to land and permits, and to ensure that profit-taking isn't excessive. Consider the U.S. Transcontinental Railroad. It was built in six years by three companies with a combination of government land grants, private funding, and some government subsidy bonds.

Less-Successful Examples

Less-successful operations seem to be entirely publicly sponsored and managed. Private companies contract to do the work but they aren't invested beyond project completion. For example, the Boston Big Dig was the "most expensive highway project in the US, and was plagued by cost overruns, delays, leaks, design flaws, charges of poor execution and use of substandard materials, criminal arrests, and one death." While the project was built by private contractors, public agencies were exclusively responsible for sponsorship, oversight, funding, and success.

Similarly, the Florida High Speed Corridor was commissioned by a state constitutional amendment, theoretically obligating the state to build the rail system. While still in the planning stages, the project got bogged down in cost overruns, environmental studies, lawsuits, and declining public support. Ultimately, the project was canceled in 2011. In 2018, however, Brightline launched service between Miami, Fort Lauderdale, and West Palm Beach with an extension to Orlando being planned. Brightline is privately funded and operated.

Education

The same principles seem to apply in education. In the U.S. the biggest challenge to traditional public education is charter schools. Studies, including this one from the Center on Reinventing Public Education, show that charter schools managed by Charter Management Organizations (CMOs) perform better than conventional public schools or independently-managed charter schools. Most CMOs are not-for-profit but they still represent a private, non-government entity. Based on the success of CMOs, some school districts are also considering outside management or support firms. In higher education there is a long tradition of government funding for a mix of public and private universities. Like the successful public works, the greatest successes seem to occur when public and private interests are combined and aligned toward a common goal. In these successes, government represents the public interest. The worst outcomes seem to occur when government fails to represent public interests and is either corrupted to serve private needs or excessively focused on politics and party issues.

Organizing for Success

I haven't done a comprehensive search of public works projects. My selection of examples is simply based on projects I happen to be aware of. Nevertheless, it seems that the greatest potential for success is achieved when public and private interests are aligned in a partnership that leverages the strengths of both models and ensures that both groups benefit. Public-private partnerships, state-owned enterprises, and public benefit corporations are different ways of achieving these ends.

The SR 99 tunnel in Seattle was bored by Bertha which, at the time, was the largest-ever tunnel boring machine. Early in the process, the machine broke down and it took two years to dig a recovery pit and make repairs. At the time, two state senators sponsored a bill to cancel the project. Despite this setback, and significant cost overruns, the project was ultimately a success. So, we can add persistence to see things through as another key to success.

Though the contract with Seattle Tunnel Partners will conclude when the tunnel project is complete, the organization has achieved a high degree of cooperation with the Washington State Department of Transportation. Public-private cooperation and alignment of interests are behind many of the most successful public projects. And the private interest is often the source of the persistence needed to see things through.

10 January 2019

Quality Assessment Part 9: Frontiers

This is the final segment of a 9-part series on building high-quality assessments.

Mountains

A 2015 survey of US adults indicated that 34% of those surveyed felt that standardized tests were merely fair at measuring students' achievement; 46% thought that the way schools use standardized tests has gotten worse; and only 20% were confident that tests have done more good than harm. The same year, the National Education Association surveyed 1500 members (teachers) and found that 70% do not feel that their state test is "developmentally appropriate."

In the preceding eight parts of this series I described all of the effort that goes into building and deploying a high-quality assessment. Most of these principles are implemented to some degree in the states represented by these surveys. What these opinion polls tell us is that regardless of their quality, these assessments aren't giving valuable insight to two important constituencies: parents and teachers.

The NEA article describes a hypothetical "Most Useful Standardized Test" which, among other things, would "provide feedback to students that helps them learn, and assist educators in setting learning goals." This brings up a central issue in contemporary testing. The annual testing mandated by the Every Student Succeeds Act (ESSA) is focused on school accountability. This was also true of its predecessor, No Child Left Behind (NCLB). Both acts are based on the theory of measuring school performance, reporting that performance, and incentivizing better school performance. States and testing consortia also strive to facilitate better performance by reporting individual results to teachers and parents. But facilitation remains a secondary goal of large-scale standardized testing.

The frontiers in assessment I discuss here shift the focus to directly supporting student learning with accountability being a secondary goal.

  • Curriculum-Embedded Assessment
  • Dynamically-Generated Assessments
  • Abundant Assessment

Curriculum-Embedded Assessment

The first model involves embedding assessment directly in the curriculum. Of course, nearly all curricula have embedded assessments of some sort. Math textbooks have daily exercises to apply the principles just taught. English and social studies texts include chapter-end quizzes and study questions. Online curricula intersperse the expository materials with questions, exercises, and quizzes. Some curricula even include pre-built exams. But these existing assessments lack the quality assurance and calibration of a high-quality assessment.

In a true Curriculum-Embedded Assessment, some of the items that appear in the exercises and quizzes would be developed with the same rigor as items on a high-stakes exam. They would be aligned to standards, field tested, and calibrated before appearing in the curriculum. In addition to contributing to the score on the exercise or quiz, the scores of these calibrated items would be aggregated into an overall record of the student's mastery of each skill in the standard.

Since the exercises and quizzes would not be administered in as controlled an environment as a high-stakes exam, the scores would not individually be as reliable as in a high-stakes environment. But by accumulating many more data points, and doing so continuously through the student's learning experience, it's possible to assemble an evaluation that is as reliable or more reliable than a year-end assessment.
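
To make the claim about accumulating data points concrete, here is a minimal sketch using the Spearman-Brown prediction formula, a standard psychometric result that this post does not itself invoke. It assumes the measurements are parallel (equal reliability, independent errors), which embedded quiz items would only approximate. The function name and the single-measurement reliability of 0.30 are illustrative assumptions, not data from any real assessment.

```python
def aggregate_reliability(single_reliability: float, n_measurements: int) -> float:
    """Spearman-Brown prediction: reliability of n parallel measurements combined.

    Assumes every measurement has the same reliability and independent error,
    which is an idealization for low-stakes classroom data.
    """
    r = single_reliability
    return n_measurements * r / (1 + (n_measurements - 1) * r)


# Illustrative only: one noisy low-stakes measurement with reliability 0.30.
for n in (1, 5, 20, 50):
    print(n, round(aggregate_reliability(0.30, n), 3))
```

Under these assumptions the predicted reliability climbs from 0.30 for a single measurement to roughly 0.90 after twenty, which is the sense in which many noisy observations can rival one carefully controlled exam.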

Curriculum-Embedded Assessment has several advantages over either a conventional achievement test or the existing exercises and quizzes:

  • Student achievement relative to competency is continuously updated. This can offer much better guidance to students, educators, and parents than existing programs.
  • Student progress and growth can be continuously measured across weeks and months, not just years.
  • Performance relative to each competency can be reliably reported. This information can be used to support personalized learning.
  • Data from calibrated items can be correlated to data from the rest of the items on the exercise or quiz. Over time, these data can be used to calibrate and align the other items, thereby growing the pool of reliable and calibrated assessment items.
  • As Curriculum-Embedded Assessment is proven to offer data as reliable as year-end standardized tests, the standardized tests can be eliminated or reduced in frequency.

Dynamically-Generated Assessments

As described in my post on test blueprints, high-quality assessments begin with a bank of reviewed, field-tested, and calibrated items. Then, a test producer selects from that bank a set of items that match the blueprint of skills to be measured. For Computer-Adaptive Tests, the test is presented to a simulated set of students to determine how well it can measure student skill in the expected range.

In order to provide more frequent and fine-grained measures of student skills, educators prefer shorter interim tests to be used more frequently during the school year. Due to demand from districts and states, the Smarter Balanced Assessment Consortium will more than double the number of interim tests it offers over the next two years. Most of the new tests will be focused on just one or two targets (competencies) and have four to six questions. They will be short enough to be given in a few minutes at the beginning or end of a class period.

But what if you could generate custom tests on-demand to meet specific needs of a student or set of students? A teacher would design a simple blueprint: the skills to be measured and the degree of confidence required on each. Then the system could automatically generate the assessment, the scoring key, and the achievement levels based on the items in the bank and their associated calibration data.

Dynamically-generated assessments like these could target needs specific to a student, cluster of students, or class. With a sufficiently rich item bank, multiple assessments could be generated on the same blueprint thereby allowing multiple tries. And it should reduce the cost of producing all of those short, fine-grained assessments.
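
As a rough sketch of the idea, the following Python draws a form from an item bank according to a blueprint, then draws a second form on the same blueprint for a retake. Everything here is hypothetical: the `Item` fields, the `generate_form` function, and the skill names are invented for illustration. The blueprint is also simplified to a count of items per skill, which stands in for the "degree of confidence required" described above; a real system would use the calibration data to decide how many items are enough.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    skill: str
    difficulty: float  # IRT b parameter from calibration (unused in this simple draw)


def generate_form(bank, blueprint, exclude=frozenset(), seed=None):
    """Pick blueprint[skill] items per skill, skipping any item ids in `exclude`."""
    rng = random.Random(seed)
    form = []
    for skill, count in blueprint.items():
        candidates = [
            item for item in bank
            if item.skill == skill and item.item_id not in exclude
        ]
        if len(candidates) < count:
            raise ValueError(f"item bank too small for skill {skill!r}")
        form.extend(rng.sample(candidates, count))
    return form


# Hypothetical bank: six calibrated items for each of two skills.
bank = [
    Item(f"{skill}-{n}", skill, -1.0 + 0.5 * n)
    for skill in ("fractions", "ratios")
    for n in range(6)
]
blueprint = {"fractions": 3, "ratios": 2}

first = generate_form(bank, blueprint, seed=1)
used = frozenset(item.item_id for item in first)
retake = generate_form(bank, blueprint, exclude=used, seed=2)
```

The `exclude` argument is what makes multiple tries possible: the retake covers the same skills but shares no items with the first attempt, and the function fails loudly when the bank is too thin to support that.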

Abundant Assessment

Ideally, school should be a place where students are safe to make mistakes. We generally learn more from mistakes than from successes because failure affords us the opportunity to correct misconceptions and gain knowledge whereas success merely confirms existing understanding.

Unfortunately, school isn't like that. Whether primary, secondary, or college, school tends to punish failures. At the college level, a failed assignment is generally unchangeable and a failed class or low grade goes on the permanent record. Consider a student who studies hard all semester, gets reasonable grades on homework, but then blows the final exam. Perhaps they were sick on exam day, or perhaps the questions were confusing and different from what they expected, or perhaps the pressure of the exam just messed them up. Their only option is to repeat the whole class, and even then their permanent record will show the class repetition.

Why is this? Why do schools amplify the consequences of such small events? It's because assessments are expensive. They cost a lot to develop, to administer, and to score. In economic terms, assessments are scarce. For schools to offer easy recovery from failure they would have to develop multiple forms for every quiz and exam. They would have to incur the cost of scoring and reporting multiple times. And they would have to select the latest score and ignore all others. To date, such options have been cost-prohibitive.

"Abundant Assessment" is the prospect of making assessment inexpensive, or "abundant" in economic terms. In such a framework, students would be afforded many tries until they succeed or are satisfied with their performance. Negative consequences to failure would be eliminated and the opportunity to learn from failure would be amplified.

This could be achieved by a combination of collaboration and technology. Presently, most quizzes and exams are written by teachers or professors for their class only. If their efforts were pooled into a common item bank, then you could rapidly achieve a collection large enough to generate multiple exams on each topic area. Technological solutions would provide dynamically-generated assessments (as described in the previous section), online test administration, and automated scoring. All of this would dramatically reduce the labor involved in producing, administering, scoring, and reporting exams and quizzes.

Abundant assessment dramatically changes the cost structure of a school, college, or university. When it is no longer costly to administer assessments then you can encourage students to try early and repeat if they don't achieve the desired score. Each assessment, whether an exercise, quiz, or exam can be a learning experience with students encouraged to learn quickly from errors.

Wrapup

These three frontiers are synergistic. I can imagine a student, let's call her Jane, studying in a blended learning environment. Encountering a topic with which she is already familiar, Jane jumps ahead to the topic quiz. But the questions involve concepts she hasn't yet mastered and she fails. Nevertheless, this is a learning experience. Indeed, it could be reframed as a formative assessment as she now goes back and studies the material knowing what will be demanded of her in the assessment. After studying, and working a number of the exercises, Jane returns to the topic assessment and is presented with a new quiz, equally rigorous, on the same subject. This time she passes.

Outside the frame of Jane's daily work, the data from her assessments and those of her classmates are being accumulated. When the time comes, at the end of the year, to report on school performance, the staff are able to produce reliable evidence of student and school performance without the need for day-long standardized testing.

Most importantly, throughout this experience Jane feels confident and safe. At no point is she nervous that a mistake will have any long-term consequence. Rather, she knows that she can simply persist until she understands the subject matter.

06 November 2018

Quality Assessment Part 8: Test Reports

This is part 8 of a 9-part series on building high-quality assessments.

Bicycle

Since pretty much the first Tour de France, cyclists have assumed that narrow tires and higher pressures would make for a faster bike. As tire technology improved to be able to handle higher pressures in tighter spaces the consensus standard became 23mm width and 115 psi. And that standard held for decades. This was despite the science that says otherwise.

Doing the math indicates that a wider tire will have a shorter footprint, and a shorter footprint loses less energy to bumps in the road. The math was confirmed in laboratory tests and the automotive industry has applied this information for a long time. But tradition held in the Tour de France and other bicycle races until a couple of teams began experimenting with wider tires. In 2012, Velonews published a laboratory comparison of tire widths and by 2018 the average moved up to 25mm with some riders going as wide as 30mm.

While laboratory tests still confirm that higher pressure results in lower rolling resistance, high pressure also results in a rougher ride and greater fatigue for the rider. So teams are also experimenting with lower pressures adapted to the terrain being ridden and they find that the optimum pressure isn't necessarily the highest that the tire material can withstand.

You can build the best and most accurate student assessment ever. You can administer it properly with the right conditions. But if no one pays attention to the results, or if the reports don't influence educational decisions, then all of that effort will be for naught. Even worse, correct data may be interpreted in misleading ways. Like the tire width data, the information may be there but it still must be applied.

Reporting Test Results

Assuming you have reliable test results (the subjects of the preceding parts in this series), there are four key elements that must be applied before student learning will improve:

  • Delivery: Students, Parents, and Educators must be able to access the test data.
  • Explanation: They must be able to interpret the data — understand what it means.
  • Application: The student, and those advising the student, must be able to make informed decisions about learning activities based on assessment results.
  • Integration: Educators should correlate the test results with other information they have about the student.

Delivery

Most online assessment systems are paired with online reporting systems. Administrators are able to see reports for districts, schools, and grades sifting and sorting the data according to demographic groups. This may be used to hold institutions accountable and to direct Title 1 funds. Parents and other interested parties can access public reports like this one for California containing similar information.

Proper interpretation of individual student reports has greater potential to improve learning than the school, district, and state-level reports. Teachers have access to reports for students in their classes and parents receive reports for their children at least once a year. But teachers may not be trained to apply the data, or parents may not know how to interpret the test results.

Part of delivery is designing reports so that the information is clear and the correct interpretation is the most natural. To experts in the field, well-versed in statistical methods, the obvious design may not be the best one.

The best reports are designed using a lot of consumer feedback. The designers use focus groups and usability tests to find out what works best. In a typical trial, a parent or educator would be given a sample report and asked to interpret it. The degree to which they match the desired interpretation is an evaluation of the quality of the report.

Explanation

Even the best-designed reports will likely benefit from an interpretation guide. A good example is the Online Reporting Guide deployed by four western states. The individual student reports in these states are delivered to parents on paper. But the online guide provides interpretation and guidance to parents that would be hard to achieve in paper form.

Online reports should be rich with explanations, links, tooltips, and other tools to help users understand what each element means and how it should be interpreted. Graphs and charts should be well-labeled and designed as a natural representation of the underlying data.

An important advantage of online reporting is that it can facilitate exploration of the data. For example, a teacher might be viewing an online report of an interim test. She sees that a cluster of students all got a lower score. Clicking on the scores reveals a more detailed chart that shows how the students performed on each question. She might see that the students in the cluster all missed the same question. From there, she could examine the students' responses to that question to gain insight into their misunderstanding. When done properly, such an analysis would only take a few minutes and could inform a future review period.
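
The drill-down in that example amounts to a small query over item-level results. Here is a minimal sketch, assuming the reporting system can supply each student's per-question correctness. The data layout, the student names, and the `commonly_missed` function are all hypothetical and do not describe any particular reporting product.

```python
def commonly_missed(responses, cluster):
    """Return the question ids that every student in the cluster got wrong.

    `responses` maps student -> {question id: True if answered correctly}.
    """
    missed_sets = [
        {question for question, correct in responses[student].items() if not correct}
        for student in cluster
    ]
    if not missed_sets:
        return []
    return sorted(set.intersection(*missed_sets))


# Hypothetical interim-test results for four students.
responses = {
    "ana": {"Q1": True, "Q2": False, "Q3": False},
    "ben": {"Q1": True, "Q2": True, "Q3": False},
    "cy": {"Q1": False, "Q2": True, "Q3": False},
    "dee": {"Q1": True, "Q2": True, "Q3": True},
}
low_scoring_cluster = ["ana", "ben", "cy"]

print(commonly_missed(responses, low_scoring_cluster))  # prints ['Q3']
```

A report that surfaces this intersection directly saves the teacher from scanning a grid of scores by eye, and it points her to the one question whose responses are worth reading in detail.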

Application

Ultimately, all of this effort should result in good decisions being made by the student and by others on their behalf. Closing the feedback loop in this way consistently results in improved student learning.

In part 2 of this series I wrote that assessment design starts with a set of defined skills, also known as competencies or learning objectives. This alignment can facilitate guided application of test results. When test questions are aligned to the same skills as the curriculum, then students and educators can easily locate the learning resources that are best suited to student needs.

Integration

The best schools and teachers use multiple measures of student performance to inform their educational decisions. In an ideal scenario, all measures, test results, homework, attendance, projects, etc., would be integrated into a single dashboard. Organizations like The Ed-Fi Alliance are pursuing this but it's proving to be quite a challenge.

An intermediate goal is for the measures to be reported in consistent ways. For example, measures related to student skill should be correlated to the state standards. This will help teachers find correlations (or lack thereof) between the different measures.

Quality Factors

  • Make the reports, or the reporting system, available and convenient for students, parents, and educators to use.
  • Ensure that reports are easy to understand and that they naturally lead to the right interpretations. Use focus groups and usability testing to refine the reports.
  • Actively connect test results to learning resources.
  • Support integration of multiple measures.

Wrapup

Every educational program, activity, or material should be considered in terms of its impact on student learning. Effective reporting that informs educational decisions makes the considerable investment in developing and administering a test worthwhile.

16 October 2018

Quality Assessment Part 7: Securing the Test

This is part 7 of a 9-part series on building high-quality assessments.

A Shield

Each spring, millions of students in the United States take their annual achievement tests. Despite proctoring, some fraction of those students carry in a phone or some other sort of camera, take pictures of test questions, and post them on social media. Concurrently, testing companies hire a few hundred people to scan social media sites for inappropriately shared test content and send takedown notices to site operators.

Proctoring, secure browsers, and scanning social media sites are parts of a multifaceted effort to secure tests from inappropriate access. If students have prior access to test content, the theory goes, then they will memorize answers to questions rather than study the principles of the subject. The high-stakes nature of the tests creates incentive for cheating.

Secure Browsers

Most computer-administered tests today are given over the world-wide web. But if students were given unfettered access to the web, or even to their local computer, they could look up answers online, share screen-captures of test questions, access an unauthorized calculator, share answers using chats, or even videoconference with someone who can help with the test. To prevent this, test delivery providers use a secure browser, also known as a lockdown browser. Such a browser is configured so it will only access the designated testing website and it takes over the computer - preventing access to other applications for the duration of the test. It also checks to ensure that no unauthorized applications are already running, such as screen grabbers or conferencing software.

Secure browsers are inherently difficult to build and maintain. That's because operating systems are designed to support multiple concurrent applications and to support convenient switching among applications. In one case, the operating system vendor added a dictionary feature — users could tap any word on the screen and get a dictionary definition of that word. This, of course, interfered with vocabulary-related questions on the test. In this, and many other cases, testing companies have had to work directly with operating system manufacturers to get special features required to enable secure browsing.

Secure browsers must communicate with testing servers. The server must detect that a secure browser is in use before delivering a test and it also supplies the secure browser with lists of authorized applications that can be run concurrently (such as assistive technology). To date, most testing services develop their own secure browsers. So, if a school or district uses tests from multiple vendors, they must install multiple secure browsers.

To encourage a more universal solution, Smarter Balanced commissioned a Universal Secure Browser Protocol that would allow browsers and servers from different companies to work effectively together. They also commissioned and host a Browser Implementation Readiness Test (BIRT) that can be used to verify that a browser implements the required protocols and also the basic HTML5 requirements. So far, Microsoft has implemented their Take a Test feature in Windows 10 that satisfies secure browser requirements and Smarter Balanced has released into open source a set of secure browsers for Windows, MacOS, iOS (iPad), Chrome OS (Chromebook), Android, and Linux. Nevertheless, most testing companies continue to develop their own solutions.

Large Item Pools - An Alternative Approach

Could there be an alternative to all of this security effort? Deploying secure browsers on thousands of computers is expensive and inconvenient. Proctoring and social media policing cost a lot of time and money. And conspiracy theorists ask if the testing companies have something to hide in their tests.

Computerized-adaptive testing opens one possibility. If the pool of questions is big enough, the probability that a student encounters a question they have previously studied will be small enough that it won't significantly impact the test result. With a large enough pool, you could publish all questions for public review and still maintain a valid and rigorous test. I once asked a psychometrician how large the pool would have to be for this. He estimated about 200 questions in the pool for each one that appears on the test. Smarter Balanced presently uses a 20-to-1 ratio. Another benefit of such a large item pool is that students can retake the test and still get a valid result.
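To get a rough feel for these ratios, here is a minimal sketch. It assumes test items are drawn uniformly at random from the pool, which a real adaptive algorithm does not do, so treat the numbers as illustrative only. The function name and the example figures (a 40-question test, 100 questions studied in advance) are hypothetical.

```python
def expected_items_already_seen(items_on_test, pool_ratio, items_studied):
    """Expected number of test questions the student has studied in advance.

    Assumption: questions are drawn uniformly at random, without
    replacement, from a pool of items_on_test * pool_ratio questions.
    Under that assumption the count is hypergeometric, with mean
    items_on_test * items_studied / pool_size.
    """
    pool_size = items_on_test * pool_ratio
    if items_studied > pool_size:
        raise ValueError("cannot study more questions than the pool holds")
    return items_on_test * items_studied / pool_size


# Hypothetical 40-question test; the student memorized 100 pool questions.
for ratio in (20, 200):
    seen = expected_items_already_seen(40, ratio, 100)
    print(f"{ratio}-to-1 pool: about {seen} familiar questions expected")
```

Under this simplification the expected count is just the number of questions studied divided by the pool ratio, so moving from a 20-to-1 pool to a 200-to-1 pool cuts the expected exposure by a factor of ten.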

Even with a large item pool, you would still need to use a secure browser and proctoring to prevent students from getting help from social media. That is, unless we can change incentives to the point that students are more interested in an accurate evaluation than they are in getting a top score.

Quality Factors

The goal of test security is to maintain the validity of test results by ensuring that students do not have access to questions in advance of the test and that they cannot obtain unauthorized assistance during the test. The following practices contribute to a valid and reliable test:

  • For computerized-adaptive tests, have a large item pool, thereby reducing the impact of any item exposure and potentially allowing for retakes.
  • For fixed-form tests, develop multiple forms. As with a large item pool, multiple forms let you switch forms in the event that an item is exposed and also allow for retakes.
  • For online tests, use secure browser technology to prevent unauthorized use of the computer during the test.
  • Monitor social media for people posting test content.
  • Have trained proctors monitor testing conditions.
  • Consider social changes, related to how test results are used, that would better align student motivation toward valid test results.

Wrapup

The purpose of Test Security is to ensure that test results are a valid measure of student skill and that they are comparable to other students' results on the same test. Current best practices include securing the browser, effective proctoring, and monitoring social media. Potential alternatives include larger test item banks and better alignment of student and institutional motivations.

05 October 2018

Quality Assessment Part 6: Achievement Levels and Standard Setting

This is part 6 of a 9-part series on building high-quality assessments.

Two mountains, one with a flag on top.

If you have a child in U.S. public school, chances are that they took a state achievement test this past spring and sometime this summer you received a report on how they performed on that test. That report probably looks something like this sample of a California Student Score Report. It shows that "Matthew" achieved a score of 2503 in English Language Arts/Literacy and 2530 in Mathematics. Both scores are described as "Standard Met (Level 3)". Notably, in prior years Matthew was in the "Standard Nearly Met" category so his performance has improved.

The California School Dashboard offers reports of school performance according to multiple factors. For example, the Detailed Report for Castle View Elementary includes a graph of "Assessment Performance Results: Distance from Level 3".

Line graph showing performance of Lake Matthews Elementary on the English and Math tests for 2015, 2016, and 2017. In all three years, they score between 14 and 21 points above proficiency in math and between 22 and 40 points above proficiency in English.

To prepare this graph, they take the average difference between students' scale scores and the Level 3 standard for proficiency in the grade in which they were tested. For each grade and subject, California and Smarter Balanced use four achievement levels, each assigned to a range of scores. Here are the achievement levels for 5th grade Math (see this page for all ranges).

Level   | Range             | Descriptor
Level 1 | Less than 2455    | Standard Not Met
Level 2 | 2455 to 2527      | Standard Nearly Met
Level 3 | 2528 to 2578      | Standard Met
Level 4 | Greater than 2578 | Standard Exceeded

So, for Matthew and his fellow 5th graders, the Math standard for proficiency, or "Level 3" score, is 2528. Students at Lake Matthews Elementary, on average, exceeded the Math standard by 14.4 points on the 2017 tests.
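The "distance from Level 3" calculation can be sketched in a few lines. This is only an illustration of the averaging described above, not the Dashboard's actual implementation. The student scores are made up (apart from Matthew's 2530), and the lookup table holds only the Grade 5 Math cut score from the table above.

```python
from statistics import mean

# Level 3 cut scores keyed by (subject, grade). Only the Grade 5 Math
# value from the table above is filled in here.
LEVEL_3_CUT = {("Math", 5): 2528}


def distance_from_level_3(results):
    """Average of (scale score - Level 3 cut) over all student results.

    Each result is a (subject, grade, scale_score) tuple. A positive
    value means students, on average, scored above the standard.
    """
    return mean(
        score - LEVEL_3_CUT[(subject, grade)]
        for subject, grade, score in results
    )


# Hypothetical Grade 5 Math results; 2530 is Matthew's score.
results = [("Math", 5, 2530), ("Math", 5, 2560), ("Math", 5, 2515)]
print(distance_from_level_3(results))  # 7
```

Because each score is compared with the cut score for the grade in which the student was tested, results from different grades can be averaged together into a single school-wide number.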

Clearly, there are serious consequences associated with the assignment of scores to achievement levels. A difference of 10-20 points can make the difference between a school, or student, meeting or failing to meet the standard. Changes in proficiency rates can affect allocation of federal Title I funds, the careers of school staff, and even the value of homes in local neighborhoods.

More importantly to me, achievement levels must be carefully set if they are to provide reliable guidance to students, parents, and educators.

Standard Setting

Standard Setting is the process of assigning test score ranges to achievement levels. A score value that separates one achievement level from another is called a cut score. The most important cut score is the one that distinguishes between proficient (meeting the standard) and not proficient (not meeting the standard). For the California Math test, and for Smarter Balanced, that's the "Level 3" score but different tests may have different achievement levels.

When Smarter Balanced performed its standard setting exercise in October of 2014, it used the Bookmark Method. Smarter Balanced had conducted a field test the previous spring (described in Part 4 of this series). From those field test results, they calculated a difficulty level for each test item and converted that into a scale score. For each grade, a selection of approximately 70 items was sorted from easiest to most difficult. This sorted list of items is called an Ordered Item Booklet (OIB) though, in the Smarter Balanced case, the items were presented online. A panel of experts, composed mostly of teachers, went through the OIB starting at the beginning (easiest item), and set a bookmark at the item they believed represented proficiency for that grade. A proficient student should be able to answer all preceding items correctly but might have trouble with the items that follow the bookmark.

There were multiple iterations of this process on each grade, and then the correlation from grade-to-grade was also reviewed. Panelists were given statistics on how many students in the field tests would be considered proficient at each proposed skill level. Following multiple review passes the group settled on the recommended cut scores for each grade. The Smarter Balanced Standard Setting Report describes the process in great detail.
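As a rough illustration of the bookmark step, the sketch below sorts items into an ordered booklet and turns a set of panelist bookmarks into one candidate cut score. It rests on two simplifying assumptions that are mine, not the report's: the cut is read directly from the scale score of the bookmarked item, and the panel's bookmarks are combined by taking the median. The item names, scores, and bookmark positions are hypothetical. The real process adds review rounds, impact statistics, and cross-grade checks.

```python
from statistics import median


def ordered_item_booklet(items):
    """Sort (item_id, scale_score) pairs from easiest to most difficult."""
    return sorted(items, key=lambda item: item[1])


def candidate_cut_score(oib, bookmarks):
    """Median scale score of the items the panelists bookmarked.

    bookmarks holds one 0-based position in the booklet per panelist.
    Using the median is an assumption made for this sketch.
    """
    return median(oib[position][1] for position in bookmarks)


# Hypothetical items with scale-score difficulties from a field test.
items = [("A", 2510), ("B", 2440), ("C", 2560), ("D", 2480), ("E", 2535)]
oib = ordered_item_booklet(items)

# Three hypothetical panelists place bookmarks at positions 2, 3, and 3.
print(candidate_cut_score(oib, [2, 3, 3]))  # 2535
```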

Data Form

For each subject and grade, the standard setting process results in cut scores representing the division between achievement levels. The cut scores for Grade 5 math, from the table above, are 2455, 2528, and 2579. Psychometricians also calculate the Highest Obtainable Scale Score (HOSS) and Lowest Obtainable Scale Score (LOSS) for the test.

I am not aware of any existing data format standard for achievement levels. Smarter Balanced publishes its achievement levels and cut scores on its web site. The Smarter Balanced test administration package format includes cut scores, HOSS, and LOSS, but not achievement level descriptors.

A data dictionary for publishing achievement levels would include the following elements:

Element                      | Definition
Cut Score                    | The lowest *scale score* included in a particular achievement level.
LOSS                         | The lowest obtainable *scale score* that a student can achieve on the test.
HOSS                         | The highest obtainable *scale score* that a student can achieve on the test.
Achievement Level Descriptor | A description of what an achievement level means. For example, "Met Standard" or "Exceeded Standard".
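One way these elements might fit together in machine-readable form is sketched below. Since no format standard exists, the class and field names are hypothetical. The cut scores and descriptors are the Grade 5 Math values given earlier, but the LOSS and HOSS values are placeholders rather than published figures.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class AchievementLevels:
    """Achievement levels for one subject and grade (hypothetical format)."""
    subject: str
    grade: int
    loss: int                     # lowest obtainable scale score
    hoss: int                     # highest obtainable scale score
    cut_scores: tuple[int, ...]   # lowest scale score in Level 2, 3, ...
    descriptors: tuple[str, ...]  # one descriptor per level, Level 1 first

    def level(self, scale_score: int) -> int:
        """Return the 1-based achievement level for a scale score."""
        if not self.loss <= scale_score <= self.hoss:
            raise ValueError("scale score is outside the LOSS to HOSS range")
        # A score equal to a cut score belongs to the higher level.
        return bisect_right(self.cut_scores, scale_score) + 1


grade5_math = AchievementLevels(
    subject="Math",
    grade=5,
    loss=2200,  # placeholder, NOT the published LOSS
    hoss=2700,  # placeholder, NOT the published HOSS
    cut_scores=(2455, 2528, 2579),
    descriptors=("Standard Not Met", "Standard Nearly Met",
                 "Standard Met", "Standard Exceeded"),
)

level = grade5_math.level(2530)
print(level, grade5_math.descriptors[level - 1])  # 3 Standard Met
```

Storing each cut score as the lowest scale score in its level matches the definition in the data dictionary and lets a sorted-list lookup assign levels without any special cases at the boundaries.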

Quality Factors

The stakes are high for standard setting. Reliable cut scores for achievement levels ensure that students, parents, teachers, administrators, and policy makers receive appropriate guidance for high-stakes decisions. If the cut scores are wrong, many decisions may be ill-informed. Quality is achieved by following a good process:

  • Begin with a foundation of high quality achievement standards, test items that accurately measure the standards, and a reliable field test.
  • Form a standard-setting panel composed of experts and grade-level teachers.
  • Ensure that the panelists are familiar with the achievement standards that the assessment targets.
  • Inform the panel with statistics regarding actual student performance on the test items.
  • Follow a proven standard-setting process.
  • Publish the achievement levels and cut scores in convenient human-readable and machine-readable forms.

Wrapup

Student achievement rates affect policies at state and national levels, direct budgets, impact staffing decisions, influence real estate values, and much more. Setting achievement level cut scores too high may set unreasonable expectations for students. Setting them too low may offer an inappropriate sense of complacency. Regardless, achievement levels are set on a scale calibrated to achievement standards. If the standards for the skills to be learned are not well-designed, or if the tests don't really measure the standards, then no amount of work on the achievement level cut scores can compensate.