tag:blogger.com,1999:blog-71932081379428823402024-03-13T21:34:20.083-07:00Of ThatBrandt Redd on Education, Technology, Energy, and TrustBrandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.comBlogger126125tag:blogger.com,1999:blog-7193208137942882340.post-59488081595980304372022-06-12T22:08:00.003-07:002023-01-19T12:38:06.644-08:00Web Scale Authentication<p></p><p class="MsoNormal"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYzu3iUz8TT0Doq0xOhWR92OwPTxzWufXrnWfgGo1DnTXh9n5FyXj8oktOzHafY3wRVl2q9rQYdTgtDEqe-ZLSZ5YMK3Vu4rdBvt_p9flB3dnCH9DkC_4ZowEQDR-nnqErjsV3J4p7rizAvYhojUnTVxlsfzRERalilwXEyZWpEfosEw6qCkaIowLN/s337/lock-7209995_640.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="337" data-original-width="259" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYzu3iUz8TT0Doq0xOhWR92OwPTxzWufXrnWfgGo1DnTXh9n5FyXj8oktOzHafY3wRVl2q9rQYdTgtDEqe-ZLSZ5YMK3Vu4rdBvt_p9flB3dnCH9DkC_4ZowEQDR-nnqErjsV3J4p7rizAvYhojUnTVxlsfzRERalilwXEyZWpEfosEw6qCkaIowLN/w184-h239/lock-7209995_640.png" width="184" /></a></div><br />Today I’m presenting to the BYU IT & Cybersecurity
program on the topic of Web Scale Authentication. This post is a resource for
those attending and a condensed summary for everyone else. It’s a
departure from my typical focus on learning technology and goes deep into web architecture.<o:p></o:p><p></p>
<p class="MsoNormal"><a href="https://1drv.ms/p/s!AsDairlA1Y6-leENpHGd1zmIt6dwTg?e=lJ76uw" rel="nofollow" target="_blank">Click here for the slide deck.</a><o:p></o:p></p>
<p class="MsoNormal"><a href="https://github.com/bredd/AuthenticationBenchmark" rel="nofollow" target="_blank">Click here to access the source code to the authentication performance simulator.</a><o:p></o:p></p>
<p class="MsoNormal">Nine years ago, I wrote on this blog that <a href="https://www.ofthat.com/2013/01/enterprise-scale-is-not-web-scale.html">enterprise scale is not web scale</a> and I outlined seven principles that should be applied to web
scale design. In this presentation I focus on handling authentication and state.<o:p></o:p></p>
<p class="MsoNormal">Despite the name, most web applications are developed to enterprise scale, anticipating tens to hundreds of thousands of users. To scale up enterprise applications you get bigger servers. But there's a limit to how big servers can get. Web scale applications have millions or billions of users. Think Google, Ancestry, Reddit, Facebook, etc. To scale up a web scale application you add more servers. Which you can keep doing for a long time. But even if you don't anticipate web-scale demand, there are important benefits to web-scale development.</p>
<p class="MsoNormal">Most web development frameworks have a way to store data associated with a web session. In PHP, for example, you call “session_start()” and then have access to a collection called $_SESSION. Within the session collection you can store <b>authentication</b> information such as the logged-in user and permissions about what they can do.<o:p></o:p></p><p class="MsoNormal">In web scale applications, you should not use session
variables. In fact, you should not store state on the web server at all. All state must be pushed to the browser or stored in the database. To understand why, you need to know how sessions work.<o:p></o:p></p>
<p class="MsoNormal">When a session is created, the server creates a record in which
session values will be stored. Then it creates a cookie that corresponds with
the record and sends that to the browser. With each subsequent request, the server uses the cookie to look up the session
record. Each record has an associated timeout, and the session record is discarded
when enough time has passed without activity.<o:p></o:p></p>
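<p>A minimal sketch of that mechanism (illustrative Python, not any particular framework's implementation; the store, the names, and the 30-minute timeout are assumptions):</p>

```python
import secrets
import time

# Illustrative in-memory session store. Real frameworks differ in detail,
# but the mechanism is the same: a random cookie value keys a server-side
# record that is discarded after a period of inactivity.
SESSIONS = {}            # cookie value -> {"data": {...}, "touched": timestamp}
TIMEOUT = 30 * 60        # assumed 30-minute inactivity timeout

def session_start():
    """Create a session record and return the cookie value that keys it."""
    sid = secrets.token_urlsafe(32)
    SESSIONS[sid] = {"data": {}, "touched": time.time()}
    return sid

def session_lookup(sid):
    """On each request, use the cookie to find the record; discard stale ones."""
    record = SESSIONS.get(sid)
    if record is None:
        return None
    if time.time() - record["touched"] > TIMEOUT:
        del SESSIONS[sid]            # timed out: discard the session record
        return None
    record["touched"] = time.time()  # activity resets the timeout clock
    return record["data"]
```

<p>Note that every request pays for a lookup in the store, and the store itself ties the user to whichever server holds it.</p>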
<p class="MsoNormal">The trouble with this is how and where the session information
is stored. If it’s stored in memory, and you get enough users, you can consume
too much RAM and performance will drop as the virtual memory system tries to swap
things to disk. If you store session information on the disk, as PHP does, then
there’s an extra IO request for every web page and that can slow things down due to the disk storage bottleneck. In fact, many requests spend more time
retrieving session data than delivering content. Either way, you have to set up server
affinity on your load balancer because the same user must always be directed to
the same server in your pool.<o:p></o:p></p>
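<p>For example, in an nginx load balancer, server affinity looks like this (a hedged sketch; the upstream name and addresses are hypothetical):</p>

```nginx
# Hypothetical nginx upstream pool. With ip_hash, each client IP is pinned
# to one server (session affinity); removing it lets nginx balance freely.
upstream app_pool {
    ip_hash;
    server 10.0.0.11;
    server 10.0.0.12;
}
```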
<p class="MsoNormal">If your web server has no session state, the load balancer is free to shift traffic around and to add servers as needed. It speeds up servers
because they don’t have to retrieve state with every request. And it lets you
perform maintenance and software upgrades without downtime because you can
rotate servers out of the pool, upgrade them, and return them to the pool
without interrupting sessions.<o:p></o:p></p>
<p class="MsoNormal">So, if the server has no session state, where does state go?
It can be stored in the URL – the path and/or query string portion usually
indicate what the user wants to do. It can be stored on the browser in Javascript variables (for
the duration of the page) or in browser-local storage for cross-page operations.
Or it can be stored in cookies. The choice among these depends on the kind of state data being stored.<o:p></o:p></p>
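<p>One stateless pattern is to put a signed, self-validating record in a cookie. Here is a hedged Python sketch (the field names, secret, and token format are illustrative assumptions, not a specific product's scheme):</p>

```python
import base64
import hashlib
import hmac
import json
import time

# Illustrative stateless authentication token: a record holding the user id
# and an expiration time, signed with a server-side secret via HMAC-SHA256.
SECRET = b"server-side-secret"       # in practice, loaded from configuration

def issue_token(user_id, ttl_seconds=3600):
    """Build the signed record to store in a cookie."""
    record = json.dumps({"uid": user_id, "exp": time.time() + ttl_seconds})
    payload = base64.urlsafe_b64encode(record.encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def validate_token(token):
    """Authenticate a request with pure CPU work: no session store lookup."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                   # forged or tampered record
    record = json.loads(base64.urlsafe_b64decode(payload))
    if record["exp"] < time.time():
        return None                   # expired
    return record["uid"]
```

<p>Because validation is just a hash computation, any server in the pool can authenticate any request without shared session storage.</p>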
<p class="MsoNormal">It’s cookies where web-scale applications keep
authentication information. Typically, they create a record that includes the
user id and an expiration time. There may be other key information such as access
permissions. A keyed-hash algorithm such as HMAC is used to secure the record –
preventing others from forging or manipulating it. On each request, all the
server must do to authenticate the request is validate the hash. This is strictly
a CPU operation, so it is orders of magnitude faster than retrieving records
from disk. And it does not consume any memory or disk storage. The result is
much greater scalability on the same server hardware. And that results in
considerable cost savings.<o:p></o:p></p><p class="MsoNormal">For any application, the benefits of web-scale design include lower and more predictable hosting costs, more robust operations, high availability, and you're ready to scale when the need arises.</p><p></p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-32959897790063888782021-04-22T22:49:00.156-07:002021-12-13T16:04:19.657-08:00Abundant Assessment<p><em><br />The <a href="https://www.assessmentforlearningconference.org/">Assessment for Learning 2021</a> conference is using a flipped model, much like a <a href="https://en.wikipedia.org/wiki/Flipped_classroom">flipped classroom</a>. For <a href="https://bit.ly/3dSMiMN">my session on Abundant Assessment</a> I prepared a five-minute Ignite talk in video form (<a href="https://bit.ly/3dSMiMN">view here</a>). The <a href="https://afl2021.sched.com/event/j7EY">live Q&amp;A discussion</a> will be on 6 May 2021 at 12:00pm Pacific time. The post below is a slightly edited transcript of the video.</em></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-47UvEkrGOpg/YIjfyImc5DI/AAAAAAAAGOY/s2cWxKOR2i06gk2BkOGI5g5UsNaX0JMrACLcBGAsYHQ/s1195/Retry.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="Figure skating judging form with the &quot;Retry&quot; rating emphasized." border="0" data-original-height="679" data-original-width="1195" height="182" src="https://1.bp.blogspot.com/-47UvEkrGOpg/YIjfyImc5DI/AAAAAAAAGOY/s2cWxKOR2i06gk2BkOGI5g5UsNaX0JMrACLcBGAsYHQ/w320-h182/Retry.jpg" width="320" /></a></div><p></p>
<p>My daughter-in-law teaches figure skating. To advance, skaters must pass tests in which they trace standard patterns in the ice, skating forward and backward and changing feet at precise locations. While the artistic part of figure skating competition has changed a lot, the “<a href="https://en.wikipedia.org/wiki/Moves_in_the_field">Moves in the Field</a>” tests have changed little in more than 30 years. Skaters know exactly which patterns will be on a test. They practice them hundreds of times. And they are evaluated by their coaches before they go before judges for a formal exam. The lowest result on an exam is “Retry” and there is no shame in receiving that score. Judges give valuable feedback and coaches consider exams to be an important part of the learning experience.
</p><p>Skating tests are <em>Abundant Assessments</em> because students know what will be on the test, they are designed to support learning, and students can take them as many times as needed at a modest cost.</p>
<p>Assessments foster learning by letting a student demonstrate their skill, and letting them and their teacher, tutor, or coach know how far they have progressed. Many subjects, such as arithmetic, cannot be learned without constant assessment.</p>
<p>Unfortunately, the consequences of many assessments seem to be excessive. If you fail a final exam, you generally must retake the entire course. Getting a low score on the SAT may block you from the university you hope to attend. Simply being sick on the wrong day may result in a student taking remedial mathematics in summer school.</p>
<p>Why is this? Why aren’t there more practice tests? Why can’t you retake a final?</p>
<h2>Assessment Scarcity</h2>
<p>Unlike figure skating tests, MOST exams are expensive. They cost a lot to create, to score, and to report. In economic terms, this means that they are scarce.</p>
<p>Assessment scarcity provokes many problems in contemporary education. Learning from assessment is impaired because feedback is infrequent. Instructors are reluctant to give early access to tests for fear that students will memorize answers. Parents and the community can’t review exam questions. Assessment groups must monitor social media for leaked questions.</p>
<h2>Abundant Assessment</h2>
<p>Abundance is the economic term for things that are cheap and plentiful. Abundant Assessment offers students early and frequent feedback. A failed exam is transformed into a learning experience with opportunity to retry. Assessment becomes part of the learning process rather than a distinct event.</p>
<p>Can academic assessments be made abundant? During my tenure at the Gates Foundation, I asked that question. You would have to cut the per-student cost of preparing tests, scoring them, and reporting by about one hundred times while still maintaining high quality.</p>
<p>I postulated that this could be accomplished by pooling the resources of numerous institutions and the strategic application of technology.
A shared question bank would spread the cost of writing good questions across many institutions. Technology, including strategic application of AI, can cut the cost of scoring. And for questions that cannot or should not be scored by computer, self-scoring and peer-scoring are proven to be excellent learning activities.</p>
<p>I accepted the role of CTO at the <a href="https://smarterbalanced.org/">Smarter Balanced Assessment Consortium</a>, in part, to learn as much as I could about assessment and determine whether these theories are supportable. Over five and a half years I helped develop tests that are now used in twelve states and administered to six and a half million students each year.</p>
<p>Along the way I participated in a workshop at an <a href="https://hewlett.org/strategy/open-education/">Open Educational Resource</a> conference where we asked the question, “Can an assessment be offered under an open license?” Conventional wisdom is that you can’t openly license assessment questions because students will memorize the answers.</p>
<p>But that wisdom is wrong. For some assessments, memorization is appropriate. That’s what’s happening with the figure skating tests – muscle memory. For assessments requiring reasoning and problem solving, you apply computer-adaptive testing algorithms and a large pool of questions. In that framework, the probability of a student seeing a question they studied is small enough not to materially affect the test result. So, an assessment can be open if it has a large enough pool of questions from which tests are assembled.</p>
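<p>A back-of-the-envelope illustration of that claim (all numbers are hypothetical, and a uniform random draw is a simplification of adaptive selection):</p>

```python
from fractions import Fraction

# Hypothetical sizes: an openly published item pool, a drawn test form, and
# the items one diligent student managed to memorize.
POOL = 2000       # published items in the open bank (assumed)
ON_TEST = 40      # items on a single test form (assumed)
STUDIED = 100     # items the student memorized (assumed)

# Under a uniform draw, each studied item lands on the form with probability
# ON_TEST / POOL, so the expected count of familiar items is:
expected_overlap = Fraction(ON_TEST, POOL) * STUDIED
print(expected_overlap)            # about 2 familiar items out of 40
```

<p>A couple of memorized items out of forty barely moves the score, and growing the pool shrinks the overlap further.</p>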
<p>Can Open and Abundant Assessment be achieved? Absolutely! Practice, data, and theory all support the proposition.</p><p>The technical requirements are:</p><p></p><ul style="text-align: left;"><li>Question/Interaction types that are authentic to the skills being taught and assessed.</li><li>An assessment format that is accepted by a broad range of Assessment and Learning Management systems.</li><li>A very large question/item bank.</li><li>Metadata and algorithms for automated assessment composition.</li></ul><p></p>
<p>The organizational requirements are:</p>
<ul>
<li>Passionate Leadership,</li>
<li>Resource sharing among institutions,</li>
<li>Strategic application of technology.</li>
</ul>
<p>The beauty of an Abundant Assessment initiative is that it leverages economic incentives to encourage and enable the desired outcome. When offered inexpensive and easy-to-administer tests and quizzes, students and teachers will naturally take the opportunity to practice, get feedback, and learn.</p>
<p>As the cost of retaking an exam drops, its use for learning is amplified and the consequences naturally become milder. “Retry” becomes the standard low score instead of “Fail.” This will literally save lives.</p>
<p>To be sure, some advocacy and training will be helpful but it’s easier to push a boulder downhill than up. Abundant assessment aligns incentives and facilitates the use of assessments for learning.</p>
<p>Thank you.</p>
<p></p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-12758759666608306862020-11-11T07:00:00.018-08:002022-11-11T11:11:12.257-08:00"Quality Assessment Part 10: Scoring Tests"<p><em>This is the last of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><strong>Part 10: Scoring Tests</strong></li>
</ul>
<img alt="A paper with a pencil and some writing." height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSwtQNt7sueSEyOMFGWCm-UZUUY2-cAQUn0FIDEQIGdYtgboP75ZFW8qwyi69Jqp9U7hLDsHHbheRFUGUz7v_d-gU77S89h0qXHBiyBtWb_toT3CY9BzXIkiQ9EJ6PAlu73InJNLhBU0NcFG3n_cxviJ0JN3g2F7drZnvXfnDyKgocCMol22aT-jXR/s1600/scoring2.jpg" style="float: right; margin: 1em;" title="Image Credit: OpenClipart-Vectors at Pixabay" width="200" />
<p>Lillian takes an exam and gets a score of 83%. Roberta takes an exam and achieves 95%. Are these scores comparable, or is each merely a fraction of what was achievable on its own exam?</p><p>Suppose Lillian's exam was the final in College Thermodynamics. A score of 83% means that she earned 83% of the maximum possible score on the test. In traditional scoring, teachers assign a certain number of points to each question. A correct answer earns the student all points on that question. An incorrect answer earns zero points. And a partly correct answer may earn partial points.</p><p>Roberta's exam was in middle school Algebra. Like Lillian, her score is represented as a fraction of what can be achieved <em>on that particular exam.</em> Since these are different exams in different subject areas and at different difficulty levels, comparing the scores isn't very meaningful. All you can say is that each student did well in her respective subject.</p><h2>Standardized Scoring for Quality Assessment</h2><p>This post is a long-overdue coda to my <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">series on quality assessment</a>. In <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">part 4</a> of the series, I introduced <a href="https://en.wikipedia.org/wiki/Item_response_theory">item response theory (IRT)</a> and how items can be calibrated on a continuous difficulty scale. In <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">part 6</a> I described how standardized test scores are mapped to a continuous scale and interpreted accordingly.</p><p>This post is a bridge between the two. How do you compute a test score from a calibrated set of IRT questions and put it on that continuous scale?</p><h2>Calibrated Questions</h2><p>As described in <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">part 4</a>, a calibrated question has three parameters. 
Parameter <strong>a</strong> represents discrimination, parameter <strong>b</strong> represents difficulty, and parameter <strong>c</strong> represents the probability of getting the question right by guessing.</p><p>Consider a hypothetical test composed of five questions, lettered V, W, X, Y, and Z. Here are the IRT parameters:</p><table>
<tbody><tr><th>Item</th><th>V</th><th>W</th><th>X</th><th>Y</th><th>Z</th></tr>
<tr><th>a</th><td>1.7</td><td>0.9</td><td>1.8</td><td>2.0</td><td>1.2</td></tr>
<tr><th>b</th><td>-1.9</td><td>-0.6</td><td>0.0</td><td>0.9</td><td>1.8</td></tr>
<tr><th>c</th><td>0.0</td><td>0.2</td><td>0.1</td><td>0.2</td><td>0.0</td></tr>
</tbody></table><p>Here are the five IRT graphs plotted together:</p><img alt="Five S curves plotted on a graph with vertical range from 0 to 1." height="433" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjM4K-3VTkPoxy3_X3GyhK7vvP7n6sGUx7f3_RDbW-zS2M7zVbxRca-UusHgNCrqCIPvYljOsHtXdTp2xve4r3IoDbe_dh0P6fnxB6HoSa1uiIYgYfTV_qxGxuGOxA5deVdLPhBSrrjqcKz-zluDF5DNWWahetQ_0_4fTmuLoU7f_1jXBny0FTkH_wN/s320/Scoring-1.jpg" style="border: 1px solid black;" width="624" /><p>The horizontal dimension is the skill level on what psychometricians call the "Theta" scale. The usable portion is from approximately -3 to +3. The vertical dimension represents the probability that a student of that skill level will answer the question correctly.</p><p><em>All of the graphs in this post, and the calculations that produce them are in an Excel spreadsheet. You can download it from <a href="https://brandtredd.org/files/OfThat/OfThat-IRT-Test-Score-Calculation.xlsm">here</a> and experiment with your own scenarios.</em></p><h2>Adding Item Scores</h2><p>When the test is administered, the items would probably be delivered in an arbitrary order. In this narrative, however, I have sorted them from least to most difficult.</p><p>For our first illustration the student answers V, W, and X correctly, and they answer Y and Z incorrectly. For answers they get correct, we plot the curve directly. For answers they get incorrect, we plot <code>(1 - y)</code> which inverts the curve. Here's the result for this example.</p><img alt="Three S curves and two inverted S curves plotted on a graph with vertical range from 0 to 1." 
height="433" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQVoVUnjBy-dABStTJouOZArQC-GNyGtb9Z9D1LEoZxJWyA-mb5dALDgUK1fQEJYtoanIzPMwig5lQH3Mr8wc4IGcJtM-wDDhmPs7E-i1gndAoOPligT5CBnb2vnnF2XVlKfXTORfXwIW-WDhHRCJGwsbcrc7qHntDDnD-BOVkSCK3b6T5muBHrdau/s320/Scoring-2.jpg" style="border: 1px solid black;" width="624" /><p>Now, we multiply the values of the curves together to produce the test score curve.</p><img alt="A bell-shaped curve with the peak at 0.4" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6wwZbzT1soGtO71voD0YIsBWbZjZCkaO-bkMhzD4z1Y3rTBAtadNsQPqNnIKnJgMsvYjdHvp-X_IK0BmEYroAzbfdLlGFX9o2mNlSFNlS010unMI2qMkUQCwEmdWZBgWKyjJ1mV-j9LAcVusPIzWW7ooRx5qbtuJfPOZuATf2H658vlQY3zbf1H5K/s320/Scoring-3.jpg" style="border: 1px solid black;" width="576" /><p>The peak of the curve is at <strong>0.4</strong> (as calculated by the <a href="https://brandtredd.org/files/OfThat/OfThat-IRT-Test-Score-Calculation.xlsm">spreadsheet</a>). This is the student's <em>Theta Score</em> for the whole test.</p><h2>Scale Scores</h2><p>As mentioned before, the Theta Score is on an arbitrary scale of roughly -3 to +3. But that scale can be confusing or even troubling to those not versed in psychometrics. A negative score could represent excellent performance but, to a lay person, it doesn't look good.</p><p>To make scores more relatable, we convert the Theta Score to a <em>Scale Score</em>. Scale scores are also on an arbitrary scale. For example, the <a href="https://www.act.org/">ACT</a> test is mapped to a scale from 1 to 36. <a href="https://collegereadiness.collegeboard.org/sat">SAT</a> is on a scale from 400 to 1600 and the <a href="https://smarterbalanced.org">Smarter Balanced</a> tests are on a scale from 2000 to 3000.</p><p>For this hypothetical test we decide the scale score should range from 1200 to 1800. We do this by multiplying the Theta Score by 100 and adding 1500. 
We describe this as having a scaling function with slope of 100 and intercept of 1500.</p><p>In this illustration, the Theta Score of <strong>0.4</strong> results in a scale score of <strong>1540</strong>.</p><h2>Another Example</h2><p>We have another student who is less consistent in their answers. They get questions X and Z correct and all others incorrect.</p><img alt="A bell-shaped curve with the peak at -0.6 and wider than the previous curve." height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQThWdl7xIXrR-adZ8N33dQqE47jijCW8r-CkhtD7cUUcQrFa9w8VVjZ8keHf1pYzk-em9Q4N-2HOIQySdp0JTzJpBpW6-ig89CFS9vER4rG35EaTRckWRW0UvW6_m-K-D-I8lzZJIiCKKN5YO6N4TVbT7yrZ6KDDIqG99F-YQBJNvOlUK1CmRL3p1/s320/Scoring-4.jpg" style="border: 1px solid black;" width="576" /><p>In this case, the peak is at <strong>-0.6</strong> which results in a scale score of <strong>1440</strong>.</p><h2>Confidence Interval - Standard Error</h2><p>In addition to a Scale Score, most tests report a <a href="https://www.leadersproject.org/2013/03/01/the-importance-of-understanding-the-confidence-interval">confidence interval</a>. No test is a perfect measure of a student's true skill level. The confidence interval is the range of scores in which the student's true skill level should lie. For this hypothetical test we will use a 90% interval which means there is a 90% probability that the student's true skill level lies within the interval.</p><p>To calculate the confidence interval, we take the area under the curve. Then, starting at the maximum point, which is the score, we move outward until we get an area that is 90% of the full area under the curve. The interval is always centered on the peak.</p><p>To convert the low and high points of the confidence interval from a Theta scale to a Scale Score scale, you use the same slope and intercept as used before.</p><p>We will illustrate this with the results of the first test. Recall that it has a score of <strong>0.4</strong>. 
The confidence interval is from <strong>-1.08</strong> to <strong>1.88</strong>. The Scale Score is <strong>1540</strong> with a confidence interval from <strong>1392</strong> to <strong>1688</strong>.</p><p>The difference between the top or bottom of the interval and the score is the <em>standard error</em>. In this case, the standard error of the Theta score is <strong>1.48</strong> and the standard error of the scale score is <strong>148</strong>. Notice that scale score standard error is Theta standard error times the slope from the conversion function; 100 in this case.</p><img alt="A bell-shaped curve with the peak at 0.4 with a shaded confidence interval." height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMc57hdax1jNxj_GKSMc92HSF4_houuPIzosXGvei3aI-MKYoQHo0W4NoxHAf7KLD4xxcVs7DOBQ_HRh971lV1czBEzo8JDC5OEAIX32rUUEUk5KfcOCtXMAkdC8RRxtJEkfZ2bgeVdIQkblkPHdvpTwELkKc5e-rtLe9SZfgVyAAVSbAti-pXMwMH/s320/Scoring-5.jpg" style="border: 1px solid black;" width="576" /><p>The second test result has a Theta Score of <strong>-0.6</strong> and a confidence interval from <strong>-2.87</strong> to <strong>1.67</strong> (Standard Error <strong>±2.27</strong>). The Scale Score is <strong>1440</strong> with a confidence interval from <strong>1213</strong> to <strong>1667</strong> (Standard Error <strong>±227</strong>).</p><img alt="A bell-shaped curve with the peak at -0.6 with a shaded confidence interval." height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYK1QclnznyH73bpx1AoQALQW3sjHYMh5gY7A3YLoYoEglSGn36gNWxNWkW4Q1BdPJpJuQxgE-Eln2W6CANKg3POkKU8e6Qz7TcdtUpr1Kp9pI-dErXaInjKJfxOC-ThQZpMS8iVYQdZScN4HdSpMOKSQQRPH4zMBUQmJYT3HAx_lzN-y5fAatbcdK/s320/Scoring-6.jpg" style="border: 1px solid black;" width="576" /><p>The second student answered harder questions correctly while missing easier questions. 
Intuitively this should lead to less confidence in the resulting score, and that is confirmed by the broader confidence interval (larger standard error).</p><h2>Disadvantages and Advantages of IRT Test Scoring</h2><p>IRT Test Scoring is complicated to perform, to explain, and to understand. The math requires numerical methods that are impractical without computers.</p><p>Perhaps the biggest weakness is that this form of scoring works best when questions span a broad range of skill levels. That makes it less suitable for mastery assessment, which benefits more from a threshold test.</p><p>For achievement testing, there are important advantages. Multiple tests can be calibrated to the same scale, supporting the measurement of growth year over year. Calibration of assessment items ensures that scores of different student cohorts are comparable even as the test itself is updated with new questions. And the scoring system is compatible with <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">computer adaptive testing</a>.</p><h2>Wrapup</h2><p>Next time you encounter a standardized test score you'll have a better idea of how it was calculated. And, hopefully, that will give you a better understanding of how to interpret the score and what to do with it.</p><p>There are many types of assessment and corresponding ways to score them. Overall, it's important to ensure that the form of testing and scoring is well-suited to the way the results will be interpreted and applied.</p><h2>References</h2><ul>
<li><a href="https://www.assess.com/docs/Thompson_(2009)_-_Ability_estimation_with_IRT.pdf">"Ability Estimation with Item Response Theory"</a>, Nathan A. Thompson, Ph.D, 2009.</li>
<li><a href="http://www.smarterapp.org/documents/TestScoringSpecs2014-2015.pdf">"Smarter Balanced Scoring Specification"</a>, Prepared by the American Institutes for Research, 2014.</li></ul>
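<p>As a postscript, the scoring walkthrough above can be sketched in a few lines of Python. This is a hedged re-implementation assuming the logistic 3PL form without the 1.7 scaling constant, which approximately reproduces the post's numbers; it is not the spreadsheet's or any vendor's actual code:</p>

```python
import math

# The five hypothetical items from the post, as (a, b, c) parameters.
ITEMS = {"V": (1.7, -1.9, 0.0), "W": (0.9, -0.6, 0.2), "X": (1.8, 0.0, 0.1),
         "Y": (2.0, 0.9, 0.2), "Z": (1.2, 1.8, 0.0)}

def p_correct(theta, a, b, c):
    """3PL item curve: probability of a correct answer at skill level theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def theta_score(responses):
    """Grid-search the theta in [-3, 3] that maximizes the likelihood.

    responses maps an item letter to True (correct) or False (incorrect)."""
    def likelihood(theta):
        prod = 1.0
        for item, right in responses.items():
            p = p_correct(theta, *ITEMS[item])
            prod *= p if right else (1.0 - p)   # invert the curve when wrong
        return prod
    grid = [i / 100.0 for i in range(-300, 301)]
    return max(grid, key=likelihood)

def scale_score(theta, slope=100, intercept=1500):
    """Convert a theta score to the post's 1200-1800 reporting scale."""
    return intercept + slope * theta
```

<p>With the first response pattern (V, W, X correct; Y, Z incorrect) the likelihood peaks near a theta of 0.4, i.e. a scale score near 1540, matching the walkthrough; the exact second decimal depends on the 3PL convention used.</p>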
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-23910238537375601742020-05-08T07:00:00.003-07:002020-05-14T12:12:47.351-07:00Did Blended Learning Save My Class from Coronavirus?<img style="float: right; margin: 1em;" src="https://drive.google.com/uc?export=view&id=1p_lwLutQWwLtUh1tdcLem18EZSg9CLfw" alt="A Blender with a Coronavirus Inside" title="Image Credit: OpenClipart-Vectors and Muhammad Naufal Subhiansyah at Pixabay" width="200" height="309" /><p>Last week I turned in final grades from teaching my first-ever college course. In December I agreed to teach one section of <a href="https://learningoutcomes.byu.edu/Instructors/course-instructors/08962/004/IS+201/322629/1517">Information Systems 201</a> at <a href="https://www.byu.edu/">Brigham Young University</a>. I've been talking and writing about learning technology for a long time; I was overdue for some first-hand experience. Little did I know what an experience we were in for.</p><p>The outbreak of COVID-19 forced me, along with hundreds of thousands of other professors, to shift teaching to 100% online. The amazing part is how well it worked out.</p><h2>About the Class</h2><p>IS201 is required of all undergraduate business school majors at BYU. Nearly 800 took it during Winter Semester. Mine was an evening section composed of 75 students. A handful dropped before the deadline. The other 68 stuck with me to the end. Majors in my class included Accounting, Marketing, Entrepreneurship, Management, Finance, Information Systems, and a bunch of others. So, this was a very technical class taught to mostly less-technical majors.</p><p>The class starts with a lightweight introduction to Information Systems before diving into four technical subjects. First up was databases. 
We learned how to design a database using <a href="https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model">ER Diagrams</a> and how to perform queries in <a href="https://en.wikipedia.org/wiki/SQL">SQL</a>. For that segment, <a href="https://data.world/">Data.world</a> was our platform. Next up was procedural programming. We used <a href="https://en.wikipedia.org/wiki/Visual_Basic_for_Applications">Visual Basic for Applications</a> to manipulate <a href="https://en.wikipedia.org/wiki/Microsoft_Excel">Microsoft Excel</a> spreadsheets. We progressed to data analytics and visualization for which we used <a href="https://www.tableau.com/">Tableau</a> and Microsoft Excel. And the final segment was web programming in <a href="https://en.wikipedia.org/wiki/HTML">HTML</a> and <a href="https://en.wikipedia.org/wiki/Cascading_Style_Sheets">CSS</a>.</p><h2>Blended Learning</h2><p>My class was blended from the beginning, which turned out to be quite valuable once <a href="https://en.wikipedia.org/wiki/Coronavirus_disease_2019">Coronavirus</a> hit and <a href="https://www.iflscience.com/health-and-medicine/why-the-who-is-now-using-the-phrase-physical-distancing-instead-of-social-distancing/">Physical Distancing</a> began. As I write this I ponder how freaky it would have been to read those words only three months ago.</p><p>The course is standardized across all sections. We have a digital textbook hosted on <a href="https://www.myeducator.com/">MyEducator</a> that was developed by a couple of the BYU Information Systems professors. The online materials are rich with both text and video explanations and hands-on exercises for the students. All homework and exams are submitted online and either scored automatically or by a set of <a href="https://en.wikipedia.org/wiki/Teaching_assistant">TAs</a>. 
Because the course is challenging and taken by a lot of students, there is a dedicated lab where students can get TA help pretty much any time during school hours.</p><p>The online materials are complete enough that dedicated students can succeed on their own. In fact, an online section relies exclusively on the text and videos (which are available to all) and offers online TA support. Students of that section generally do well. However, they also self-select for the online experience.</p><p>So, students of IS 201 have available to them the following blending of learning experiences:</p><ul>
<li>Live lectures.</li>
<li>Video tutorials. These are focused on a single topic and range from 5 to 25 minutes in length.</li>
<li>Written text with diagrams and images.</li>
<li>Live TA access.</li>
<li>Virtual TA access: via email and text chat from the beginning, and online web conferencing later.</li>
<li>Office hours with the professor.</li>
</ul><p>The assignments on which students are graded are consistent regardless of which learning method they choose. There are a few quizzes but most assignments are project-oriented.</p><p>The net result is that students are given a consistent set of assignments, mostly project-based, with their choice of a variety of learning opportunities for mastering the material.</p><h2>Enter Coronavirus</h2><p>On Thursday, March 12, 2020, BYU faculty and students got word that all classes would be cancelled Friday, Monday, and Tuesday. On Wednesday, classes would resume entirely online. Students were encouraged to return to their homes in order to increase social distancing. Probably 2/3 of my class did so; one returned home to Korea. Yet, they all persisted. I didn't have any student drop out of the class after the switch to online.</p><p>Compared to many of my peers, the conversion was relatively easy. All of the learning materials were already online and students were already expected to submit their assignments online. Despite that, it took me about 10 hours to prepare. I adjusted the due dates for two assignments, scheduled the Zoom lessons and posted links in the LMS, sent out communications to the students, and responded to their questions. On Monday I hosted an optional practice class so that I and the handful of students who joined could get practice with the online format.</p><p>The department leadership gave me the option of teaching live classes using <a href="https://zoom.us/">Zoom</a> or recording my classes for viewing whenever students chose. I elected to teach live but to post recordings of the live lectures.</p><p>On Thursday, March 19, I <a href="https://www.linkedin.com/feed/update/urn:li:activity:6647304337032916992/">posted this on LinkedIn</a>:</p><blockquote>
<p>Wednesday was the first day of online instruction for Brigham Young University. That day we achieved more than 1,200 online classes, with more than 26,000 participants coming in from 60-plus countries. Not bad, considering we only had five days to prepare, many of the students returned to their homes during the suspension of classes, and we had an #earthquake that morning.</p>
</blockquote><p>As with the in-person classes, attendance was optional. Attendance dropped from a typical 60 in-person to about 20-25 online. One particularly sunny afternoon, only seven showed up. On average, the recordings had about 30 views each, but the last couple, with focus on the final project, had nearly 90, which means some of the students were watching them more than once.</p><p>Having worked from home for the last seven years, I have a lot of experience with online videoconferencing. Despite that, I felt a huge loss moving to online classes. I never before realized how much feedback I got from eye contact and facial expressions. In person, students were more ready to raise their hands or interrupt with questions. Online, I often felt like I was talking to empty space. I had to be very deliberate in seeking feedback: maintaining long pauses when prompting for questions, encouraging students to post in the chat box, and suggesting questions of my own.</p><p>About two weeks into the online mode, I read an article that said there should be at least three live interactions per online class. They can be simple polls, a question to the students for which they are to call out or write a response, or a simple thumbs up or thumbs down on how well they are understanding the material. Zoom, like most other systems, has tools that make this pretty easy. And I found that the advice was good. Engagement really improved when I added even one or two interactions.</p><p>The biggest change was with the TA labs. The two TAs who served my class had to move their sessions online, again using Zoom and screen-sharing to support the students. They did an excellent job and I'm enormously grateful. My office hours were also hosted virtually. But, to my surprise, only three students made use of that in the online portion of the semester.</p><h2>A Teaching Opportunity</h2><p>COVID-19 became a threat to the U.S. just as my class was getting into the unit on Data Analytics. 
I wrote a <a href="https://github.com/FileMeta/ReadAndConvertCovid19Data">little program in C#</a> to download the virus data from Johns Hopkins University and reformat it into a table more suitable for analysis with <a href="https://www.microsoft.com/en-us/microsoft-365/excel">Microsoft Excel</a> or <a href="https://www.tableau.com/">Tableau</a>. <a href="https://brandtredd.org/COVID-19/">On this webpage</a> I posted links to the data, to the downloader program, and getting-started videos for doing the analysis.</p><img style="float: right; margin: 1em;" src="https://drive.google.com/uc?export=view&id=1FDTvjyl8mducoAG2SXK-cXLIAdeB3iy6" alt="Graph of US Coronavirus Cases" width="300" height="242" /><p>Wherever possible, throughout that unit, I used the COVID-19 data for my in-class examples. It turned out to be an excellent opportunity to show the strength of proper visualization with real-world data. I also showed examples of how rendering correct data in the wrong way can be misleading. Feedback from the students was very positive, though it was sobering when we analyzed death rates.</p><h2>Saved by Blended Learning</h2><p>There are many models for blended learning. My class started out with a selection of learning modes with students given the freedom to choose among them. The LMS we used gives statistics on the students' modes of learning. Across the class, students watched only 15% of videos to completion. Meanwhile, they read 71% of the reading materials and completed 95% of the assessments. My rough estimate is that about 65% attended or viewed the lectures. I don't have statistics on their use of virtual TA help but I'm sure it was considerable.</p><p>This correlates with what I have seen in studies. Video is exciting but most students prefer reading with still images. That's because they control the pace. Live interactions remain important because a teacher can respond immediately to feedback from the class. 
Online-live is more challenging because most visual cues are eliminated but there are ways to compensate. Most of them involve deliberate effort on the part of the instructor such as prompting for questions, instant quizzes, votes, and so forth.</p><p>Despite the challenges, my class came out with a 3.4 average, considerably better than the expected 3.2. I would love to take credit for that. But I think it has more to do with a subject and format that are well-suited to a blended model, high-quality online materials (prepared by my predecessors), and resilient students who simply hung in there until the end.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-9624456203416910212020-01-14T16:00:00.000-08:002020-01-15T13:10:51.949-08:00What’s Up in Learning Technology?<img style="float: right; margin: 1em;" src="https://drive.google.com/uc?export=view&id=1BO3x_3xDTT7ZffkSBgqt5iC7Arh3OVxY" alt="A Lighthouse. Image by PIRO4D from Pixabay." width="400" height="256" /><p>With the turn of the decade I have read a lot of pessimistic articles about education and learning technology. Most start with the lamentation that there has been little overall progress in student achievement over the last couple of decades – <a href="https://www.nationsreportcard.gov/">which is true</a>, unfortunately. But what they fail to note are the many small and medium scale successes.</p><p>Take, for example, <a href="https://edtrust.org/extraordinary-districts-seaford-delaware/">Seaford School District</a> in Delaware. The community has been economically challenged since DuPont closed its Nylon plant there. Three of its four elementary schools were among the lowest performing in the state just a few years ago. Starting with a focus on reading, they ensured a growth mindset among the educators, gave them training and support, and deployed data programs to track progress and inform interventions. 
They drew in experts in learning science to inform their programs and curriculum. The result: the district now matches the state averages in overall performance and the three challenged elementary schools are strongly outperforming the state in reading and mathematics.</p><p>My friend, <a href="https://www.linkedin.com/in/eileenlento/">Eileen Lento</a>, calls this a lighthouse because it marks the way toward the learning successes we’re seeking. For sailing ships, you don’t need just one lighthouse. You need a series of them along the coast. And each lighthouse sends out a distinct flash pattern so that navigators can tell which one they are looking at. By watching educational lighthouses, we gain evidence of the learning reforms that will make a real and substantial difference in students’ lives.</p><h2>What does the evidence say?</h2><p>Perhaps the most dramatic evidence-based pivot in the last decade has been the <a href="https://www.inacol.org">Aurora Institute</a>, <a href="https://www.ofthat.com/2019/11/themes-manifest-as-inacol-becomes-aurora.html">formerly iNACOL</a>. In 2010 their emphasis was on online learning and virtual schools. But the evidence pointed them toward competency-based learning, so they launched <a href="https://www.competencyworks.org/">CompetencyWorks</a>; they renamed the symposium; and, ultimately, renamed the whole organization.</p><p>Much criticism has been leveled at <a href="https://en.wikipedia.org/wiki/No_Child_Left_Behind_Act">No Child Left Behind</a> and its successor, the <a href="https://en.wikipedia.org/wiki/Every_Student_Succeeds_Act">Every Student Succeeds Act</a>. The beneficial results of these federal interventions are the state standards, which form the foundation of competency-based learning, and consistent annual reports that indicate how well K-12 schools are performing. 
On the downside, we’ve learned that measuring and reporting performance, by themselves, are not enough to drive improvement.</p><p>Learning science has made great gains in general awareness over the last decade. We’ve learned that a <a href="https://www.mindsetworks.com/science/">growth mindset</a> makes a critical difference in how students respond to feedback and that the form of praise given by teachers and mentors can develop that mindset. We have evidence backing the notion that <a href="https://medium.com/the-crossover-cast/deliberate-practice-learn-like-an-expert-cc3114b8a10e">deliberate practice and feedback</a> are required to develop a new skill. And we’ve gained nuance about <a href="https://www.ofthat.com/2013/01/blooms-two-sigma-problem-revisited.html">Bloom’s Two Sigma Problem</a> – that tutoring must be backed by a mastery-based curriculum and that measures of mastery must be rigorous in order to achieve the two standard deviation gains that Benjamin Bloom observed.</p><p>Finally, we’ve learned that the type of instructional materials doesn’t matter nearly as much as how they are used. Video and animation are <a href="http://tecfa.unige.ch/perso/mireille/papers/Betrancourt05.pdf">not significantly better</a> at teaching than still pictures and text. That is, until interactivity and experimentation are added. To those, we must also add individual attention from a teacher, opportunities to practice, and feedback.</p><h2>Learning Technology Responding to the Challenge</h2><p>A common realization in this past decade is that technology does not drive learning improvement. Successful initiatives are based on a sound understanding of how students learn best. Then, technology may be deployed that supports the initiative.</p><p>A natural indicator of what technology developers are doing is the cross-vendor standards effort. 
In the last couple of years there has emerged an unprecedented level of cooperation, not just between vendors but also between the technology standards groups.</p><p>Here’s what’s up:</p><h3>Learning Engineering</h3><p>A properly engineered learning experience requires a coalescence of Instructional Design, Learning Science, Data Science, Competency-Based Learning, and more. The <a href="https://sagroups.ieee.org/ltsc/">IEEE Learning Technology Standards Committee (LTSC)</a> has sponsored the <a href="https://www.ieeeicicle.org/">Industry Consortium on Learning Engineering (ICICLE)</a> and I’m pleased to be a member. We held our conference on Learning Engineering in May 2019; proceedings are due out in Q1 of 2020; the eight Special Interest Groups (SIGs) meet regularly; and we hold a monthly community meeting.</p><h3>Interoperable Learner Records (ILR)</h3><p>The concept is that every learner (and that’s hopefully everyone) should have a portable record that tracks every skill they have mastered. Such a record would support learning plans and guide career opportunities.</p><ul>
<li>The <a href="https://www.uschamberfoundation.org/t3-innovation">T3 Innovation Network</a>, sponsored by the <a href="https://www.uschamberfoundation.org/">US Chamber of Commerce Foundation</a>, includes “Open Data Standards” and “Map and Harmonize Data Standards” among their <a href="https://www.uschamberfoundation.org/t3-innovation/pilot-projects">pilot projects</a>. These projects are intended to support use of existing standards rather than develop new ones.</li>
<li><a href="https://ceds.ed.gov/">Common Education Data Standards (CEDS)</a> define the data elements associated with learner records of all sorts and the various standards initiatives continue to align their data models to CEDS.</li>
<li>IMS Global has published the <a href="https://www.imsglobal.org/activity/comprehensive-learner-record">Comprehensive Learner Record (CLR)</a> standard.</li>
<li>The <a href="https://www.pesc.org/pesc-approved-standards.html">PESC Standards</a> define how to transfer student records to, from, and between colleges and universities.</li>
<li>The Competency Model for Learning Technology Standards (CM4LTS) study group has been authorized by the IEEE LTSC to document a common conceptual model that will harmonize current and future IEEE LTSC standards. The model is anticipated to be based on CEDS.</li>
<li>The <a href="https://www.adlnet.gov/">Advanced Distributed Learning (ADL) Initiative</a> has launched the <a href="https://adlnet.gov/projects/tla/">Total Learning Architecture (TLA)</a> working group seeking to develop “plug and play” interoperability between adaptive instructional systems, intelligent digital tutors, real-time data analytics, and interactive e-books. Essential to the TLA will be a portable learner record that functions across products.</li>
<li>The <a href="https://hropenstandards.org/">HR Open Standards Consortium</a> defines standards to support human resource management. The standards include competency-oriented job descriptions and experience records.</li>
</ul><p>While these may seem like competing efforts, there is a tremendous amount of cooperation and shared membership across the different groups. In fact, <a href="https://www.a4l.org/">A4L</a>, PESC, and HR Open Standards have established an open sharing and cooperation agreement. Our goal is a complementary and harmonious set of standards.</p><h3>Competency Frameworks</h3><p>A Competency Framework is a set of competencies (skills, knowledge, abilities, attitudes, or learning outcomes) organized into a taxonomy. Examples include the <a href="http://www.corestandards.org/">Common Core State Standards</a>, <a href="https://www.nextgenscience.org/">Next Generation Science Standards</a>, the <a href="https://www.aamc.org/what-we-do/mission-areas/medical-education/curriculum-inventory/establish-your-ci/physician-competency-reference-set">Physician Competency Reference Set</a>, the <a href="https://www.cisco.com/c/dam/en_us/training-events/netacad/downloads/779/edu/media/pdf/CurriculumRoadmapDS.pdf">Cisco Networking Academy Curriculum</a>, and the <a href="https://www.onetcenter.org/taxonomy.html">O*Net Spectrum of Occupations</a>. There are hundreds of others. Interoperable Learner Records must reference competency frameworks to represent the competencies in the record.</p><ul>
<li>The <a href="http://achievementstandards.org/">Achievement Standards Network (ASN)</a> is a registry of competency frameworks from many different domains in a browsable and machine-readable format.</li>
<li>The IEEE LTSC is renewing the <a href="https://site.ieee.org/sagroups-1484-20-1/">Reusable Competency Definitions Standard IEEE 1484.20.1</a>. A key part of this project is the “Best Practices for Developing Competencies.”</li>
<li><a href="https://www.imsglobal.org/activity/case">IMS CASE</a> is an interoperable format for representing competency frameworks and the <a href="https://www.imsglobal.org/casenetwork">IMS CASE Network Registry</a> is a registry of competency frameworks in CASE format.</li>
<li>The <a href="https://credentialengine.org/about/credential-registry-overview/">Credential Registry</a> by Credential Engine is an open library that describes credentials in terms of the associated competencies. Credentials are machine readable in the <a href="https://credreg.net/ctdl/handbook">Credential Transparency Description Language (CTDL)</a>.</li>
<li>The <a href="https://www.uschamberfoundation.org/t3-innovation/pilot-projects">T3 Innovation Network Pilot Projects 5 and 6</a> (Competency Data Collaborative and Competency Translation and Analysis) seek to harmonize competency use across existing frameworks, registries, and formats like those listed above. (I'm pleased to be contributing to the T3 Competency Data Collaborative.)</li>
</ul><h3>Learning Resource Metadata</h3><p>Metadata can indicate that a piece of content (text, audio, video, interactive activity, etc.) is intended to teach or assess a particular competency or set of competencies. So, when a person completes an activity, their interoperable learner record can be updated with evidence that they have learned or are learning those competencies.</p><ul>
<li>The <a href="https://www.dublincore.org/specifications/lrmi/">Learning Resource Metadata Initiative (LRMI)</a> is a working group within the <a href="https://www.dublincore.org/about/lrmi/">Dublin Core Metadata Initiative (DCMI)</a> that defines learning-related metadata properties included in DCMI vocabularies and <a href="http://schema.org">Schema.org</a>. (I've been a contributor to LRMI since its inception.)</li>
<li><a href="https://standards.ieee.org/standard/1484_12_1-2002.html">IEEE Learning Object Metadata (LOM)</a> is a standard published by the IEEE LTSC and incorporated into many other learning data standards.</li>
</ul><h3>Standards Advocacy</h3><p>All of this interoperability effort will be of little use if the developers of learning activities and tools don’t make use of them.</p><ul>
<li><a href="https://www.projectunicorn.org/">Project Unicorn</a> advocates for U.S. school districts to require interoperability standards from the vendors that supply their educational tools.</li>
<li><a href="https://www.edmatrix.org/matrix.html">EdMatrix</a> is my own Directory of Learning Standards. It is intended to help developers of learning tools know what standards are applicable, to help learning institutions know what to seek or require, and to help standards developers know what related efforts are underway and to support cooperation among them.</li>
</ul><h2>Looking to a New Decade</h2><p>It can be discouraging to look back on the last decade or two and compare the tremendous investment society has put into education with the lack of measurable progress in outcomes.</p><p>I prefer to look forward and right now I’m optimistic. Here’s why:</p><p>Our understanding of learning science has grown. In daily conversation we use terms like “Growth Mindset,” “Competency-Based Learning,” “Practice and Feedback,” and “Motivation.”</p><p>Online and Blended Learning, and their cousin, Adaptive Learning Platforms, have progressed from the “Peak of Inflated Expectations” through the “Trough of Disillusionment” (using <a href="https://en.wikipedia.org/wiki/Hype_cycle">Gartner Hype Cycle</a> terms) and are on their way to the “Plateau of Productivity.” Along the way we’ve learned that technology must <a href="https://www.ofthat.com/2019/03/a-support-system-for-high-performing.html">serve a theory of learning</a>, not the other way around.</p><p>Technology and standards efforts now span from primary and secondary education, through higher education, and into workforce training and lifelong learning. This reflects a rapidly changing demand for skills in the 21st century and a realization that most people will have to retrain 3-4 times during their lifetime. I expect that lasting improvement in postsecondary education and training will be driven by workplace demands and that corresponding updates to primary and secondary education will be driven by the downstream demand of postsecondary.</p><p>So, despite a lack of measurable impact in standardized tests, previous efforts have established a foundation of competency standards and measures of success. We have hundreds of “lighthouses” - successful initiatives worthy of imitation. On that foundation of competencies and standards, following the lighthouse guides, we will build successful, student-centric learning systems.</p><p>What do you think? 
Are the investments of the last couple of decades finally about to pay off? Let me know in the comments.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-35882348032385014292019-11-01T15:00:00.000-07:002019-11-01T15:23:12.102-07:00Themes manifest as iNACOL Becomes Aurora<img style="float: right; margin: 1em;" src="https://drive.google.com/uc?export=view&id=1GunMGd8Tv-JBxFp_1dBON4cxZ8TiBVBV" alt="Arrows representing systems integration." width="300" height="110" /><p>In 2010 I took on the responsibility of forming an Education Technology Strategy for the <a href="https://www.gatesfoundation.org/">Bill & Melinda Gates Foundation</a>. That same year, I also attended the iNACOL Virtual Schools Symposium (VSS). A year later, I <a href="https://www.ofthat.com/2011/11/technology-changing-rules-to-game-of.html">presented at the symposium</a> and I've been pleased to present or contribute in some way most years since.</p><p>As my colleagues and I at the Gates Foundation worked on a theory of technology and education, something quickly became clear. Technology doesn't drive educational improvement; it's simply an enabler. In the early part of this decade there were numerous 1:1 student:computer initiatives. Most failed to show measurable improvement and many turned into fiascos as teachers were tasked with finding something useful to do with their new computers or tablets.</p><p>At the foundation we <a href="https://www.ofthat.com/2011/10/personalized-learning-model.html">turned to personalized learning</a>, a theory that was based on promising evidence and one that has gained more support since then. With that as a basis we looked to where technology could help. 
The result was support for key projects including <a href="https://ceds.ed.gov">Common Education Data Standards</a>, the <a href="https://www.dublincore.org/specifications/lrmi/">Learning Resource Metadata Initiative</a>, and <a href="https://vimeo.com/channels/428558">Profiles of Next-Generation Learning</a>.</p><p>The great folks at iNACOL observed the same patterns and so they pivoted. VSS became, simply, the iNACOL Symposium and their emphasis shifted to personalized and <a href="https://www.competencyworks.org/">competency-based education</a> with online and blended learning as enablers. This year, they completed the transition, renaming the whole organization to The Aurora Institute. In <a href="https://www.inacol.org/inacol-is-now-the-aurora-institute/">their words</a>:</p><blockquote>
<p>[Our] organization has evolved significantly to become a leading nonprofit organization with a deep reach into practitioners creating next-generation learning models. Our focus has grown to examine systems change and education innovation, facilitating the future of learning through personalized learning and student-centered approaches to next-generation learning.</p>
</blockquote><h2>Serving Educators and Students</h2><p>A theme that spontaneously emerged at the symposium this year is that we must do for the educators what we want for the students. It was first expressed by Dr. Brooke Stafford-Brizard in her <a href="https://chanzuckerberg.com/newsroom/broadening-the-definition-of-student-success-a-spotlight-on-mental-health/">opening keynote</a>. As she advocated that we care for the mental health of the children, she said, "Across all of our partners who have successfully integrated whole child practice, there isn’t one who didn’t start with their adults." She proceeded to show examples where school mental health programs were designed to support both staff and students.</p><p>With that as precedent, the principle kept reappearing throughout the symposium.</p><ul>
<li>If we expect personalized instruction for the students we must offer personalized professional development for their teachers.</li>
<li>Establish the competency set we expect of educators and provide opportunities to master those competencies.</li>
<li>Actionable feedback to educators is critical to the success of any learning innovation just as actionable feedback to students is critical to their learning.</li>
<li>Create an environment of trust and safety among the staff of your institution - then project that to the students.</li>
<li>Growth mindset is as important to educators as it is to the students they teach.</li>
</ul><h2>Continuous Improvement</h2><p>Both themes — technology as enabler, and caring for the educators — are simply signposts on a path of continuous improvement. We must follow the evidence and go where it leads us.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-17177128957484205882019-03-07T15:00:00.000-08:002019-03-07T20:50:35.986-08:00A Support System for High-Performing Schools<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6665877516642552770"><img style="float: right; margin: 5px;" src="https://lh3.googleusercontent.com/-WbMaCRYkGdQ/XIH0SLe-C8I/AAAAAAAAEgQ/4h6ZYMXb7acD5v78_Jb8s7kSsIKETNi2gCHMYBhgL/s1600/arrows2.jpg" alt="Arrows representing systems integration." width="250" height="166" /></a><p>Charter schools operated by Charter Management Organizations (CMOs) tend to outperform other charter schools and public schools. The <a href="https://eric.ed.gov/?id=ED526951">National Study of Charter Management Organization Effectiveness</a> from 2011 was the first rigorous study of CMO effectiveness and it showed that CMO-operated schools were better than other options. A <a href="http://credo.stanford.edu/pdfs/nyc_report%202017%2010%2002%20FINAL.pdf">2017 study</a> by Stanford University's <a href="https://credo.stanford.edu">Center for Research on Education Outcomes</a> found that students enrolled in CMO-operated schools in New York City substantially outperformed their peers in conventional public schools and independent charter schools.</p><p>This improvement is to be expected. A basic premise of CMO operations is to study what works and carry successful practices to other schools in the network.</p><p>Some conventional public schools are following a similar pattern. Their solution providers don't necessarily manage the school, like a CMO would. 
Instead, providers offer an integrated set of services backed by an evidence-driven theory of effective teaching. Here is the ecosystem I expect to emerge in the next few years:</p><ul>
<li>Component and Curriculum Suppliers</li>
<li>Educational Solution Providers</li>
<li>Schools (and other learning institutions)</li>
</ul><p>This same basic model applies to primary, secondary, and higher education though large universities and big districts have the capacity to be their own solution providers. Let's look at the components:</p><h2>Schools, Districts, and other Learning Institutions</h2><p>The school is where the teaching and learning occurs. It's where the supply chain of standards, curriculum, educational training, assessments, learning science, and everything else finally meets the student.</p><p>Many schools are implementing the same kinds of programs as charters: online curricula, blended learning, teacher dashboards, etc. But the complexity of integration grows exponentially with the number of components to combine. Building an integrated whole is beyond the capacity of most schools and all but the largest districts. The same pattern exists in higher education. Large universities can deliver an integrated solution but community colleges have a harder time.</p><h2>Component and Curriculum Suppliers</h2><p>On the supply side, there's a rich, complex, and rapidly growing market of component and curriculum suppliers. They include conventional textbook publishers, online curriculum developers, assessment providers, <a href="https://en.wikipedia.org/wiki/Learning_management_system">Learning Management Systems (LMS)</a>, <a href="https://en.wikipedia.org/wiki/Student_information_system">Student Information Systems (SIS)</a>, and more.</p><p>Beyond these well-defined categories there's a host of other components, each designed to address a particular need in the educational economy. For example, <a href="https://learnosity.com/">Learnosity</a> builds tools for creating and embedding high-quality assessments. <a href="http://gooru.org">Gooru</a> offers a learning map, helping students know where they are in their learning progression. <a href="https://www.edconnective.com/">EdConnective</a> offers live, virtual coaching for teachers. 
<a href="https://www.prweb.com/releases/2018_global_edtech_investments_surge_to_breathtaking_16_3_billion/prweb16016689.htm">In 2018, education technology investment grew to a record $5.23 billion in the U.S. and a breathtaking $16.34 billion worldwide</a>. We can expect many more components and materials to be produced from that level of investment.</p><p>Many of these components are raw - requiring significant integration effort before they can become part of an integrated learning solution. Despite this, developers of these components attempt to sell them directly to schools, districts, and states.</p><h2>Educational Solution Providers</h2><p><a href="https://www.summitlearning.org">Summit Public Schools</a> is a CMO that consistently achieves <a href="https://www.usnews.com/education/best-high-schools/search?name=summit">high rankings</a>. Summit Learning also offers their online curriculum to public schools. But, separating the curriculum from the balance of the solution hasn't been so successful. In November 2018, <a href="https://nypost.com/2018/11/10/brooklyn-students-hold-walkout-in-protest-of-facebook-designed-online-program/">Brooklyn students held a walkout</a> and parents <a href="https://www.wetheparents.net/">created a website</a> to protest "Mass Customized Learning." It's not that the materials were bad; they were well-proven in other contexts. But, separated from the balance of the Summit program the student experience suffered.</p><p>An important new category in the education supply chain are Educational Solution Providers. CMOs belong to this category but solution providers to conventional schools don't take over management like a CMO would. Rather, they provide an integrated set of services that includes training and coaching for staff and leadership.</p><p>The best solution providers start with an evidence-based learning theory. 
They then assemble a comprehensive solution based on the theory and selected from the rich menu provided by the component market. A complete solution includes:</p><ul>
<li>Training and Coaching Services</li>
<li>Professional Development</li>
<li>Curriculum (conventional or online)</li>
<li>Assessment (ideally <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">curriculum-embedded</a>)</li>
<li>Secure Student Data Systems with Educator Dashboards</li>
<li>Effectiveness Measures</li>
<li>Continuous Improvement</li>
</ul><p>An important job for solution providers is to integrate the components so that they work seamlessly together in support of their learning theory. Training and professional development should embody the same theory that is being expressed to the students. LMS, SIS, dashboards, and all other online systems should function together as one solution even if the provider is sourcing the components from an array of suppliers. In order to do this, the solution provider must have their own curriculum experts for the content side and a talented technology staff focused on systems integration.</p><p>Players in this nascent category include <a href="https://www.achievementnetwork.org/">The Achievement Network</a>, <a href="https://clisolutionsgroup.org">CLI Solutions Group</a>, and <a href="https://www.niet.org/what-we-do/school-improvement-solutions">The National Institute for Excellence in Teaching</a>. I think we can expect new entrants in the next few years. Successful CMOs may also cross over to providing services to conventional public schools.</p><h2>Wrapup</h2><p>The educational component and curriculum market is rich and rapidly growing with record levels of investment. But, schools don't have the capacity to integrate these components effectively and they need a guiding theory to underpin the selection of components and how they are to be integrated. The emerging category of Educational Solution Provider fills an important role in the ecosystem.</p><p>Are you aware of other existing or emerging solution providers? 
Please let me know in the comments!</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-83016527750251641612019-02-11T14:00:00.000-08:002019-02-11T22:05:02.537-08:00Public-Private Partnership for Public Works<div style="float: right; margin: 5;"><a data-flickr-embed="true" href="https://www.flickr.com/photos/wsdot/4907983170/in/set-72157624760624786" title="SR 99 tunnel cross section visualization"><img src="https://farm5.staticflickr.com/4121/4907983170_9df528f424_n.jpg" width="320" height="213" alt="SR 99 tunnel cross section visualization" /></a><script async="" src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script></div><p>On February 28, 2001, I was at Microsoft Headquarters in Redmond, Washington when the <a href="https://en.wikipedia.org/wiki/2001_Nisqually_earthquake">Nisqually Earthquake</a> hit. I was using Microsoft's scalability lab to perform tests on <a href="http://Agilix.com">Agilix</a> software. I remember <a href="http://earthquakekitguide.com/earthquake-myths-2-stand-in-a-doorway-during-an-earthquake">standing in the doorway</a> and asking someone down the hall, "Is this really an earthquake?" It obviously was, but never having experienced one before, my mind was still disbelieving.</p><p>Nine years later we moved to Seattle where I developed an education technology strategy at the <a href="https://www.gatesfoundation.org">Bill & Melinda Gates Foundation</a>. At the time, politicians were still trying to figure out what should replace the <a href="https://en.wikipedia.org/wiki/Alaskan_Way_Viaduct">Alaskan Way Viaduct</a>, which had been damaged in the earthquake and which engineers predicted could collapse should another earthquake occur.</p><p>Last week, the <a href="https://en.wikipedia.org/wiki/Alaskan_Way_Viaduct_replacement_tunnel">Washington SR 99 tunnel</a> replaced the viaduct, 18 years after the earthquake threatened its predecessor. 
Ironically, the tunnel opening was accompanied by a snowstorm that paralyzed the Northwest making the tunnel one of the few clear roads in the area.</p><h2>Funding of Public Works</h2>
<div style="float: left; padding: 5;"><a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6656987635436681890"><img src="https://lh3.googleusercontent.com/-pitVszoey98/XGJe-_Re-qI/AAAAAAAAEcg/qNcdjzkoNQwq7jbwofXGYVBsEFlIIGqgACHMYBhgL/s1600/GrandCentral.jpg" alt="Grand Central Terminal" width="250" height="187" /></a></div>
<p>A few years back I visited New York's <a href="https://en.wikipedia.org/wiki/Grand_Central_Terminal">Grand Central Terminal</a> and wondered at the great investments made in public works in the early 20th century. The terminal building is beautiful, functional, and built to last. It's been going for more than a century and will probably continue for a century or two more. I wondered why it is so hard to find contemporary investments in public works of such grandeur. However, upon doing some research I found that Grand Central was funded entirely by private investors. Even today, the building is privately owned though the railroad it serves has now been merged into the <a href="https://en.wikipedia.org/wiki/Metropolitan_Transportation_Authority">MTA</a>, a <a href="https://en.wikipedia.org/wiki/New_York_state_public-benefit_corporations">public benefit corporation</a>.
</p>
<p>When we <a href="http://bollard.brandtredd.com/2015/06/daehanmun-gate-seoul-korea.html">visited Seoul, Korea</a> in 2015, we spent five days getting around on the excellent <a href="https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway">Seoul Metropolitan Subway</a>. It is fast, efficient, clean, and among the largest subway systems in the world with more than 200 miles of track. It features wireless internet throughout, and most platforms are protected by automated doors, greatly improving safety. Yet, the whole network has been built since 1971. The subway is built and operated by <a href="https://en.wikipedia.org/wiki/Seoul_Metro">Seoul Metro</a>, <a href="https://en.wikipedia.org/wiki/Korail">Korail</a>, and <a href="https://en.wikipedia.org/wiki/Seoul_Metro_Line_9_Corporation">Metro 9</a>. Seoul Metro and Korail are <a href="https://en.wikipedia.org/wiki/State-owned_enterprise">Korean public corporations</a>; these are corporations in which the government owns a controlling interest. Metro 9 is a private venture.</p><p>This past December we <a href="http://bollard.brandtredd.com/2018/12/south-bank-brisbane-australia.html">visited Brisbane, Australia</a>. Brisbane traffic congestion has been mitigated through the construction of several bypass tunnels including the <a href="https://en.wikipedia.org/wiki/Airport_Link,_Brisbane">Airport Link</a>. The tunnels have been built in a relatively short time through <a href="https://en.wikipedia.org/wiki/Public%E2%80%93private_partnership">public-private partnerships</a>.</p><p>As I researched these projects I saw a consistent pattern. The most successful <a href="https://en.wikipedia.org/wiki/Public_works">public works</a> projects seem to involve some form of cooperation between government and private enterprise. Funding is more easily obtained and project management is better when a private organization participates and stands to benefit from the long-term success of the project. 
But government support is also needed to represent the public interest, to streamline access to land and permits, and to ensure that profit-taking isn't excessive. Consider the U.S. <a href="https://en.wikipedia.org/wiki/First_Transcontinental_Railroad">Transcontinental Railroad</a>. It was built in six years by three companies with a combination of government land grants, private funding, and some government subsidy bonds.</p><h2>Less-Successful Examples</h2><p>Less-successful operations seem to be entirely publicly sponsored and managed. Private companies contract to do the work but they aren't invested beyond project completion. For example, the <a href="https://en.wikipedia.org/wiki/Big_Dig">Boston Big Dig</a> was the "most expensive highway project in the US, and was plagued by cost overruns, delays, leaks, design flaws, charges of poor execution and use of substandard materials, criminal arrests, and one death." While the project was built by private contractors, public agencies were exclusively responsible for sponsorship, oversight, funding, and success.</p><p>Similarly, the <a href="https://en.wikipedia.org/wiki/Florida_High_Speed_Corridor">Florida High Speed Corridor</a> was commissioned by a state constitutional amendment, theoretically obligating the state to build the rail system. While still in the planning stages, the project got bogged down in cost overruns, environmental studies, lawsuits, and declining public support. Ultimately, the project was canceled in 2011. In 2018, however, <a href="https://en.wikipedia.org/wiki/Brightline">Brightline</a> launched service between Miami, Fort Lauderdale, and West Palm Beach with an extension to Orlando being planned. Brightline is privately funded and operated.</p><h2>Education</h2><p>The same principles seem to apply in education. In the U.S., the biggest challenge to traditional public education is charter schools. 
Studies, including <a href="https://www.crpe.org/sites/default/files/pub_cmofinal_Jan12_0.pdf">this one from the Center on Reinventing Public Education</a>, show that charter schools managed by <a href="https://en.wikipedia.org/wiki/Charter_management_organization">Charter Management Organizations (CMOs)</a> perform better than conventional public schools or independently-managed charter schools. Most CMOs are not-for-profit but they still represent a private, non-government entity. Based on the success of CMOs, some school districts are also considering outside management or support firms. In higher education there is a long tradition of government funding for a mix of public and private universities. Like the successful public works, the greatest successes seem to occur when public and private interests are combined and aligned toward a common goal. In these successes, government represents the public interest. The worst outcomes seem to occur when government fails to represent public interests and is either corrupted to serve private needs or excessively focused on politics and party issues.</p><h2>Organizing for Success</h2><p>I haven't done a comprehensive search of public works projects. My selection of examples is simply based on projects I happen to be aware of. Nevertheless, it seems that the greatest potential for success is achieved when public and private interests are aligned in a partnership that leverages the strengths of both models and ensures that both groups benefit. 
<a href="https://en.wikipedia.org/wiki/Public%E2%80%93private_partnership">Public-private partnerships</a>, <a href="https://en.wikipedia.org/wiki/State-owned_enterprise">state-owned enterprises</a>, and <a href="https://en.wikipedia.org/wiki/New_York_state_public-benefit_corporations">public benefit corporations</a> are different ways of achieving these ends.</p><p>The SR 99 tunnel in Seattle was bored by <a href="https://en.wikipedia.org/wiki/Bertha_(tunnel_boring_machine)">Bertha</a> which, at the time, was the largest-ever <a href="https://en.wikipedia.org/wiki/Tunnel_boring_machine">tunnel boring machine</a>. Early in the process, the machine broke down and it took two years to dig a recovery pit and make repairs. At the time, two state senators sponsored a bill to cancel the project. Despite this setback, and significant cost overruns, the project was ultimately a success. So, we can add persistence to see things through as another key to success.</p><p>Though the contract with Seattle Tunnel Partners will conclude when the tunnel project is complete, the organization has achieved a high degree of cooperation with the Washington State Department of Transportation. Public-private cooperation and alignment of interests are behind many of the most successful public projects. And the private interest is often the source of the persistence needed to see things through.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-38139600327829083832019-01-10T14:00:00.001-08:002022-11-11T10:12:24.833-08:00Quality Assessment Part 9: Frontiers<p><em>This is part 9 of a 10-part series on building high-quality assessments.</em></p><ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><strong>Part 9: Frontiers</strong></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6645048181048002882"><img alt="Mountains" height="192" src="https://lh3.googleusercontent.com/-w_jzkneqoT8/XDf0HRq4yUI/AAAAAAAAEPU/U0NW_n_2qgYeOSbsW142UyS5D0d6CUprQCHMYBhgL/s1600/mountains2.jpg" style="float: right; margin: 5;" width="300" /></a>
<p>A <a href="https://today.yougov.com/topics/politics/articles-reports/2015/03/16/poll-results-standardized-testing">2015 survey of US adults</a> indicated that 34% of those surveyed felt that standardized tests were merely fair at measuring students' achievement; 46% thought that the way schools use standardized tests had gotten worse; and only 20% were confident that tests have done more good than harm. The same year, the National Education Association surveyed 1500 members (teachers) and found that <a href="http://neatoday.org/2016/02/18/standardized-tests-not-developmentally-appropriate/">70% do not feel that their state test is "developmentally appropriate."</a></p><p>In the preceding eight parts of this series I described all of the effort that goes into building and deploying a high-quality assessment. Most of these principles are implemented to some degree in the states represented by these surveys. What these <em>opinion polls</em> tell us is that regardless of their quality, these assessments aren't giving valuable insight to two important constituencies: parents and teachers.</p><p>The <a href="http://neatoday.org/2016/02/18/standardized-tests-not-developmentally-appropriate/">NEA article</a> describes a hypothetical "Most Useful Standardized Test" which, among other things, would "provide feedback to students that helps them learn, and assist educators in setting learning goals." This brings up a central issue in contemporary testing. The annual testing mandated by the <a href="https://en.wikipedia.org/wiki/Every_Student_Succeeds_Act">Every Student Succeeds Act (ESSA)</a> is focused on school accountability. This was also true of its predecessor, <a href="https://en.wikipedia.org/wiki/No_Child_Left_Behind_Act">No Child Left Behind (NCLB)</a>. Both acts are based on the theory of measuring school performance, reporting that performance, and <em>incentivizing</em> better school performance. 
States and testing consortia also strive to <em>facilitate</em> better performance by reporting individual results to teachers and parents. But facilitation remains a secondary goal of large-scale standardized testing.</p><p>The frontiers in assessment I discuss here shift the focus to directly supporting student learning with accountability being a secondary goal.</p><ul>
<li>Curriculum-Embedded Assessment</li>
<li>Dynamically-Generated Assessments</li>
<li>Abundant Assessment</li>
</ul><h2>Curriculum-Embedded Assessment</h2><p>The first model involves embedding assessment directly in the curriculum. Of course, nearly all curricula have embedded assessments of some sort. Math textbooks have daily exercises to apply the principles just taught. English and social studies texts include chapter-end quizzes and study questions. Online curricula intersperse the expository materials with questions, exercises, and quizzes. Some curricula even include pre-built exams. But these existing assessments lack the quality assurance and calibration of a high-quality assessment.</p><p>In a true Curriculum-Embedded Assessment, some of the items that appear in the exercises and quizzes would be developed with the same rigor as items on a high-stakes exam. They would be <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">aligned to standards</a>, <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html" title="Item Response Theory">field tested, and calibrated</a> before appearing in the curriculum. In addition to contributing to the score on the exercise or quiz, the scores of these calibrated items would be aggregated into an overall record of the student's mastery of each skill in the standard.</p><p>Since the exercises and quizzes would not be administered in as controlled an environment as a high-stakes exam, the scores would not individually be as reliable as in a high-stakes environment. But by accumulating many more data points, and doing so continuously through the student's learning experience, it's possible to assemble an evaluation that is as reliable or more reliable than a year-end assessment.</p><p>Curriculum-Embedded Assessment has several advantages over either a conventional achievement test or the existing exercises and quizzes:</p><ul>
<li>Student achievement relative to competency is continuously updated. This can offer much better guidance to students, educators, and parents than existing programs.</li>
<li>Student progress and growth can be continuously measured across weeks and months, not just years.</li>
<li>Performance relative to each competency can be reliably reported. This information can be used to support <a href="https://www.ofthat.com/2015/12/personalized-learning-more-evidence.html">personalized learning</a>.</li>
<li>Data from calibrated items can be correlated to data from the rest of the items on the exercise or quiz. Over time, these data can be used to calibrate and align the other items, thereby growing the pool of reliable and calibrated assessment items.</li>
<li>As Curriculum-Embedded Assessment is proven to offer data as reliable as year-end standardized tests, the standardized tests can be eliminated or reduced in frequency.</li>
</ul><h2>Dynamically-Generated Assessments</h2><p>As described in <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">my post on test blueprints</a>, high-quality assessments begin with a bank of reviewed, field-tested, and calibrated items. Then, a test producer selects from that bank a set of items that match the blueprint of skills to be measured. For Computer-Adaptive Tests, the test is presented to a simulated set of students to determine how well it can measure student skill in the expected range.</p><p>In order to provide more frequent and fine-grained measures of student skills, educators prefer shorter interim tests to be used more frequently during the school year. Due to demand from districts and states, the <a href="http://www.smarterbalanced.org/">Smarter Balanced Assessment Consortium</a> will more than double the number of <a href="http://www.smarterbalanced.org/assessments/">interim tests</a> it offers over the next two years. Most of the new tests will be focused on just one or two targets (competencies) and have four to six questions. They will be short enough to be given in a few minutes at the beginning or end of a class period.</p><p>But what if you could generate custom tests on-demand to meet specific needs of a student or set of students? A teacher would design a simple blueprint — the skills to be measured and the degree of confidence required on each. Then the system could automatically generate the assessment, the scoring key, and the <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">achievement levels</a> based on the items in the bank and their associated calibration data.</p><p>Dynamically-generated assessments like these could target needs specific to a student, cluster of students, or class. With a sufficiently rich item bank, multiple assessments could be generated on the same blueprint thereby allowing multiple tries. 
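The core of such a generator can be quite small. Below is a minimal sketch in Python, assuming a hypothetical item-bank format and blueprint shape (the bank contents, field names, and `generate_test` function are illustrative, not drawn from any real assessment system):

```python
import random

# Hypothetical item bank: each entry is a calibrated item carrying the
# skill it measures and an IRT difficulty parameter (names illustrative).
ITEM_BANK = [
    {"id": "FR-01", "skill": "fractions", "difficulty": -0.5},
    {"id": "FR-02", "skill": "fractions", "difficulty": 0.2},
    {"id": "FR-03", "skill": "fractions", "difficulty": 0.9},
    {"id": "RA-01", "skill": "ratios", "difficulty": -0.3},
    {"id": "RA-02", "skill": "ratios", "difficulty": 0.6},
    {"id": "RA-03", "skill": "ratios", "difficulty": 1.1},
]

def generate_test(blueprint, bank, rng=random):
    """Assemble a test form that satisfies a simple blueprint.

    blueprint maps skill -> number of items required. Sampling randomly
    means repeated calls yield different but comparably rigorous forms,
    which is what makes multiple tries on the same blueprint possible.
    """
    form = []
    for skill, count in blueprint.items():
        candidates = [item for item in bank if item["skill"] == skill]
        if len(candidates) < count:
            raise ValueError(f"item bank too small for skill {skill!r}")
        form.extend(rng.sample(candidates, count))
    return form

# A teacher's mini-blueprint: two fraction items and two ratio items.
form = generate_test({"fractions": 2, "ratios": 2}, ITEM_BANK)
```

An operational version would also use the items' calibration data to hit a target measurement precision and would emit the scoring key and achievement levels alongside the form.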
And it should reduce the cost of producing all of those short, fine-grained assessments.</p><h2>Abundant Assessment</h2><p>Ideally, school should be a place where students are safe to make mistakes. We generally learn more from mistakes than from successes because failure affords us the opportunity to correct misconceptions and gain knowledge whereas success merely confirms existing understanding.</p><p>Unfortunately, school isn't like that. Whether primary, secondary, or college, school tends to punish failures. At the college level, a failed assignment is generally unchangeable, and a failed class or low grade goes on the permanent record. Consider a student who studies hard all semester, gets reasonable grades on homework, but then blows the final exam. Perhaps they were sick on exam day, or perhaps the questions were confusing and different from what they expected, or perhaps the pressure of the exam just messed them up. Their only option is to repeat the whole class — and even then their permanent record will show the class repetition.</p><p>Why is this? Why do schools amplify the consequences of such small events? It's because assessments are expensive. They cost a lot to develop, to administer, and to score. In economic terms, assessments are scarce. For schools to offer easy recovery from failure they would have to develop multiple forms for every quiz and exam. They would have to incur the cost of scoring and reporting multiple times. And they would have to select the latest score and ignore all others. To date, such options have been cost-prohibitive.</p><p>"Abundant Assessment" is the prospect of making assessment inexpensive — "abundant" in economic terms. In such a framework, students would be afforded many tries until they succeed or are satisfied with their performance. 
Negative consequences of failure would be eliminated and the opportunity to learn from failure would be amplified.</p><p>This could be achieved by a combination of collaboration and technology. Presently, most quizzes and exams are written by teachers or professors for their class only. If their efforts were pooled into a common item bank, then you could rapidly achieve a collection large enough to generate multiple exams on each topic area. Technological solutions would provide dynamically-generated assessments (as described in the previous section), online test administration, and automated scoring. All of this would dramatically reduce the labor involved in producing, administering, scoring, and reporting exams and quizzes.</p><p>Abundant assessment dramatically changes the cost structure of a school, college, or university. When it is no longer costly to administer assessments, then you can encourage students to try early and repeat if they don't achieve the desired score. Each assessment, whether an exercise, quiz, or exam, can be a learning experience with students encouraged to learn quickly from errors.</p><h2>Wrapup</h2><p>These three frontiers are synergistic. I can imagine a student, let's call her Jane, studying in a blended learning environment. Encountering a topic with which she is already familiar, Jane jumps ahead to the topic quiz. But the questions involve concepts she hasn't yet mastered and she fails. Nevertheless, this is a learning experience. Indeed, it could be reframed as a formative assessment as she now goes back and studies the material knowing what will be demanded of her in the assessment. After studying, and working a number of the exercises, Jane returns to the topic assessment and is presented with a new quiz, equally rigorous, on the same subject. This time she passes.</p><p>Outside the frame of Jane's daily work, the data from her assessments and those of her classmates are being accumulated. 
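One way to picture that accumulation is as a running ability estimate over all the calibrated items a student has answered. The sketch below is a minimal illustration assuming a one-parameter Rasch model; the function name and data shapes are hypothetical, and operational programs use far richer psychometric models:

```python
import math

def estimate_ability(responses, iterations=30):
    """Maximum-likelihood ability estimate under a simple Rasch model.

    responses is a list of (item_difficulty, answered_correctly) pairs
    gathered from calibrated items embedded in exercises and quizzes.
    """
    theta = 0.0  # ability estimate, on the same logit scale as difficulty
    for _ in range(iterations):
        grad = 0.0   # first derivative of the log-likelihood
        hess = 0.0   # second derivative (always non-positive here)
        for difficulty, correct in responses:
            p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
            grad += (1.0 if correct else 0.0) - p
            hess -= p * (1.0 - p)
        if hess == 0.0:
            break
        theta -= grad / hess  # Newton-Raphson update
        # The MLE is unbounded for all-correct or all-incorrect response
        # patterns, so keep the estimate in a sane range.
        theta = max(-4.0, min(4.0, theta))
    return theta

# Mixed performance near the items' difficulty yields a mid-range estimate.
history = [(-1.0, True), (0.0, True), (0.0, False), (1.0, False)]
ability = estimate_ability(history)
```

Each new calibrated response refines the estimate, which is how many low-stakes data points can add up to a measure as reliable as a single high-stakes test.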
When the time comes, at the end of the year, to report on school performance, the staff are able to produce reliable evidence of student and school performance without the need for day-long standardized testing.</p><p>Most importantly, throughout this experience Jane feels confident and safe. At no point is she nervous that a mistake will have any long-term consequence. Rather, she knows that she can simply persist until she understands the subject matter.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-30862956497067589982018-11-06T12:00:00.003-08:002022-11-11T10:28:41.827-08:00Quality Assessment Part 8: Test Reports<p><em>This is part 8 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><strong>Part 8: Test Reports</strong></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6621316750070094834"><img alt="Bicycle" height="211" src="https://lh3.googleusercontent.com/-pTUR5JP7s9c/W-OkgMFFm_I/AAAAAAAAD5w/7NWIJW5HBDEY4J2BlxApc5wN61rm6vDjQCHMYBhgL/s1600/bicycle.jpg" style="float: right; margin: 5;" width="300" /></a><p>Since pretty much the first <a href="https://en.wikipedia.org/wiki/Tour_de_France">Tour de France</a>, cyclists have assumed that narrow tires and higher pressures would make for a faster bike. As tire technology improved to be able to handle higher pressures in tighter spaces, the consensus standard became 23mm width and 115 psi. And that standard held for decades. This was despite the science that says otherwise.</p><p>Doing the math indicates that a wider tire will have a shorter footprint, and a shorter footprint loses less energy to bumps in the road. The math was confirmed in laboratory tests and the automotive industry has applied this information for a long time. But tradition held in the Tour de France and other bicycle races until a couple of teams began experimenting with wider tires. In 2012, Velonews published a <a href="https://www.velonews.com/2014/12/bikes-and-tech/resistance-futile-tire-pressure-width-affect-rolling-resistance_355085">laboratory comparison of tire widths</a> and by 2018 the average <a href="https://www.businessinsider.com/tour-de-france-wider-tires-lower-pressure-faster-2018-7">moved up to 25 mm</a> with some riders going as wide as 30mm.</p><p>While laboratory tests still confirm that higher pressure results in lower rolling resistance, high pressure also results in a rougher ride and greater fatigue for the rider. So teams are also experimenting with lower pressures adapted to the terrain being ridden and they find that the optimum pressure isn't necessarily the highest that the tire material can withstand.</p><p>You can build the best and most accurate student assessment ever. 
You can administer it properly with the right conditions. But if no one pays attention to the results, or if the reports don't influence educational decisions, then all of that effort will be for naught. Even worse, correct data may be interpreted in misleading ways. Like the tire width data, the information may be there but it still must be applied.</p><h2>Reporting Test Results</h2><p>Assuming you have reliable test results (the subjects of the preceding parts in this series), there are four key elements that must be applied before student learning will improve:</p><ul>
<li><strong>Delivery:</strong> Students, Parents, and Educators must be able to access the test data.</li>
<li><strong>Explanation:</strong> They must be able to interpret the data — understand what it means.</li>
<li><strong>Application:</strong> The student, and those advising the student, must be able to make informed decisions about learning activities based on assessment results.</li>
<li><strong>Integration:</strong> Educators should correlate the test results with other information they have about the student.</li>
</ul><h3>Delivery</h3><p>Most online assessment systems are paired with online reporting systems. Administrators are able to see reports for districts, schools, and grades, sifting and sorting the data according to demographic groups. This may be used to hold institutions accountable and to direct <a href="https://www.brighthubeducation.com/teaching-methods-tips/11105-basics-of-title-1-funds/">Title 1</a> funds. Parents and other interested parties can access public reports like <a href="https://www.caschooldashboard.org/#/Home">this one for California</a> containing similar information.</p><p>Proper interpretation of individual student reports has greater potential to improve learning than the school, district, and state-level reports. Teachers have access to reports for students in their classes and parents receive reports for their children at least once a year. But teachers may not be trained to apply the data, or parents may not know how to interpret the test results.</p><p>Part of delivery is designing reports so that the information is clear and the correct interpretation is the most natural. The design that seems obvious to experts well-versed in statistical methods may not be the best one for parents and educators.</p><p>The best reports are designed using a lot of consumer feedback. The designers use focus groups and <a href="https://en.wikipedia.org/wiki/Usability_testing">usability tests</a> to find out what works best. In a typical trial, a parent or educator would be given a sample report and asked to interpret it. The degree to which they match the desired interpretation is an evaluation of the quality of the report.</p><h3>Explanation</h3><p>Even the best-designed reports will likely benefit from an interpretation guide. A good example is the <a href="http://testscoreguide.org/">Online Reporting Guide</a> deployed by four western states. The individual student reports in these states are delivered to parents on paper. 
But the online guide provides interpretation and guidance to parents that would be hard to achieve in paper form.</p><p>Online reports should be rich with explanations, links, <a href="https://en.wikipedia.org/wiki/Tooltip">tooltips</a>, and other tools to help users understand what each element means and how it should be interpreted. Graphs and charts should be well-labeled and designed as a natural representation of the underlying data.</p><p>An important advantage of online reporting is that it can facilitate exploration of the data. For example, a teacher might be viewing an online report of an interim test. They notice that a cluster of students all got lower scores. Clicking on the scores reveals a more detailed chart that shows how the students performed on each question. They might see that the students in the cluster all missed the same question. From there, they could examine the students' responses to that question to gain insight into their misunderstanding. When done properly, such an analysis would only take a few minutes and could inform a future review period.</p><h3>Application</h3><p>Ultimately, all of this effort should result in good decisions being made by the student and by others on their behalf. Closing the <a href="https://www.ofthat.com/2013/10/things-engineers-can-teach-us-about.html">feedback loop</a> in this way <a href="https://www.ofthat.com/2012/07/learning-everything-works-but-how-well.html">consistently results in improved student learning</a>.</p><p>In <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">part 2</a> of this series, I wrote that assessment design starts with a set of defined skills, also known as competencies or learning objectives. This alignment can facilitate guided application of test results. 
When test questions are aligned to the same skills as the curriculum, students and educators can easily locate the learning resources best suited to student needs.</p><h3>Integration</h3><p>The best schools and teachers use multiple measures of student performance to inform their educational decisions. In an ideal scenario, all measures (test results, homework, attendance, projects, and so on) would be integrated into a single dashboard. Organizations like <a href="https://www.ed-fi.org/">The Ed-Fi Alliance</a> are pursuing this, but it's proving to be quite a challenge.</p><p>An intermediate goal is for the measures to be reported in consistent ways. For example, measures related to student skill should be correlated to the state standards. This will help teachers find correlations (or lack thereof) between the different measures.</p><h2>Quality Factors</h2><ul>
<li>Make the reports, or the reporting system, available and convenient for students, parents, and educators to use.</li>
<li>Ensure that reports are easy to understand and that they naturally lead to the right interpretations. Use focus groups and usability testing to refine the reports.</li>
<li>Actively connect test results to learning resources.</li>
<li>Support integration of multiple measures.</li>
</ul><h2>Wrapup</h2><p>Every educational program, activity, or material should be considered in terms of its impact on student learning. Effective reporting that informs educational decisions makes the considerable investment in developing and administering a test worthwhile.</p>Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-49047252243323042092018-10-16T12:00:00.003-07:002022-11-11T10:28:30.644-08:00Quality Assessment Part 7: Securing the Test<p><em>This is part 7 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><strong>Part 7: Securing the Test</strong></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6613187730318518434"><img alt="A Shield" height="210" src="https://lh3.googleusercontent.com/-w2WHDqTXdlA/W8bDM9qbaKI/AAAAAAAAD1M/gSe4j0tHB3MLiLrYUCiFJTWITZvHASP3wCHMYBhgL/s1600/Shield.jpg" style="float: right; margin: 5px;" width="210" /></a>
<p>Each spring, millions of students in the United States take their annual achievement tests. Despite proctoring, some fraction of those students carry in a phone or some other sort of camera, take pictures of test questions, and post them on social media. Concurrently, testing companies hire a few hundred people to scan social media sites for inappropriately shared test content and send <a href="https://en.wikipedia.org/wiki/Notice_and_take_down">takedown notices</a> to site operators.</p>
<p>Proctoring, secure browsers, and scanning social media sites are parts of a multifaceted effort to secure tests from inappropriate access. If students have prior access to test content, the theory goes, then they will memorize answers to questions rather than study the principles of the subject. The high-stakes nature of the tests creates incentive for cheating.</p>
<h2>Secure Browsers</h2>
<p>Most computer-administered tests today are given over the world-wide web. But if students were given unfettered access to the web, or even to their local computer, they could look up answers online, share screen-captures of test questions, access an unauthorized calculator, share answers using chats, or even videoconference with someone who can help with the test. To prevent this, test delivery providers use a <em>secure browser</em>, also known as a <em>lockdown browser</em>. Such a browser is configured so it will only access the designated testing website and it takes over the computer - preventing access to other applications for the duration of the test. It also checks to ensure that no unauthorized applications are already running, such as screen grabbers or conferencing software.</p>
<p>Secure browsers are inherently difficult to build and maintain. That's because operating systems are designed to support multiple concurrent applications and to support convenient switching among applications. In one case, the operating system vendor added a dictionary feature — users could tap any word on the screen and get a dictionary definition of that word. This, of course, interfered with vocabulary-related questions on the test. In this, and many other cases, testing companies have had to work directly with operating system manufacturers to develop special features required to enable secure browsing.</p>
<p>Secure browsers must communicate with testing servers. The server must detect that a secure browser is in use before delivering a test and it also supplies the secure browser with lists of authorized applications that can be run concurrently (such as <a href="https://en.wikipedia.org/wiki/Assistive_technology">assistive technology</a>). To date, most testing services develop their own secure browsers. So, if a school or district uses tests from multiple vendors, they must install multiple secure browsers.</p>
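<p>Conceptually, the launch-time check described above is a comparison of running processes against blocklists and allowlists. The following Python sketch is purely illustrative: the process names are hypothetical, and a real secure browser queries the operating system directly for the process list.</p>

```python
# Illustrative sketch of a secure browser's pre-launch application check.
# Process names, the allowlist, and the blocklist are hypothetical examples.

ALLOWED = {"securebrowser", "jaws_screenreader"}   # e.g. assistive technology
BLOCKED = {"obs", "zoom", "snippingtool"}          # capture/conferencing tools

def launch_check(running_processes):
    """Return (ok, violations) for a list of running process names."""
    violations = [p for p in running_processes if p.lower() in BLOCKED]
    return (len(violations) == 0, violations)

ok, found = launch_check(["SecureBrowser", "Zoom"])
print(ok, found)  # False ['Zoom'] -> refuse to start the test
```

In practice the allowlist would come from the testing server, as described above, so that districts can authorize specific assistive technologies.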
<p>To encourage a more universal solution, Smarter Balanced commissioned a <a href="http://www.smarterapp.org/securebrowsers.html">Universal Secure Browser Protocol</a> that allows browsers and servers from different companies to work effectively together. They also commissioned and host a <a href="http://birt.smarterbalanced.org">Browser Implementation Readiness Test (BIRT)</a> that can be used to verify that a browser implements the required protocols as well as the basic HTML 5 requirements. So far, Microsoft has implemented its <a href="https://education.microsoft.com/gettrained/win10takeatest">Take a Test</a> feature in Windows 10 that satisfies secure browser requirements, and Smarter Balanced has <a href="http://www.smarterapp.org/securebrowsers.html">released into open source</a> a set of secure browsers for Windows, MacOS, iOS (iPad), Chrome OS (ChromeBook), Android, and Linux. Nevertheless, most testing companies continue to develop their own solutions.</p>
<h2>Large Item Pools - An Alternative Approach</h2>
<p>Could there be an alternative to all of this security effort? Deploying secure browsers on thousands of computers is expensive and inconvenient. Proctoring and social media policing cost a lot of time and money. And conspiracy theorists ask if the testing companies have something to hide in their tests.</p>
<p><a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Computerized-adaptive testing</a> opens one possibility. If the pool of questions is big enough, the probability that a student encounters a question they have previously studied will be small enough that it won't significantly impact the test result. With a large enough pool, you could publish all questions for public review and still maintain a valid and rigorous test. I once asked a psychometrician how large the pool would have to be for this. He estimated about 200 questions in the pool for each one that appears on the test. Smarter Balanced presently uses a 20-to-1 ratio. Another benefit of such a large item pool is that students can retake the test and still get a valid result.</p>
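<p>The arithmetic behind that estimate is easy to sketch. Assuming (unrealistically) that test items are drawn uniformly at random from the pool, the expected number of previously studied items a student encounters is just the test length times the fraction of the pool the student has seen:</p>

```python
# Back-of-envelope estimate of item exposure on a test drawn from a pool.
# Assumes uniform random selection, which is a simplification: a real CAT
# algorithm selects by difficulty and blueprint coverage.

def expected_exposure(pool_size, test_length, items_studied):
    """Expected number of test items the student has previously studied."""
    return test_length * items_studied / pool_size

# A 40-item test at a 200:1 pool ratio (8,000 items), where the student
# has memorized 100 leaked items: about half of one question.
print(expected_exposure(8000, 40, 100))  # 0.5

# The same scenario at a 20:1 ratio (800-item pool): five questions.
print(expected_exposure(800, 40, 100))   # 5.0
```

The pool sizes and the number of leaked items here are made-up inputs for illustration; only the 200:1 and 20:1 ratios come from the discussion above.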
<p>Even with a large item pool, you would still need to use a secure browser and proctoring to prevent students from getting help from social media. That is, unless we can change incentives to the point that students are more interested in an accurate evaluation than in getting a top score.</p>
<h2>Quality Factors</h2>
<p>The goal of test security is to maintain the validity of test results: ensuring that students do not have access to questions in advance of the test and that they cannot obtain unauthorized assistance during the test. The following practices contribute to a valid and reliable test:</p>
<ul>
<li>For computerized-adaptive tests, have a large item pool thereby reducing the impact of any item exposure and potentially allowing for retakes.</li>
<li>For fixed-form tests, develop multiple forms. As with a large item pool, multiple forms let you switch forms in the event that an item is exposed and also allows for retakes.</li>
<li>For online tests, use secure browser technology to prevent unauthorized use of the computer during the test.</li>
<li>Monitor social media for people posting test content.</li>
<li>Have trained proctors monitor testing conditions.</li>
<li>Consider social changes, related to how test results are used, that would better align student motivation toward valid test results.</li>
</ul>
<h2>Wrapup</h2>
<p>The purpose of Test Security is to ensure that test results are a valid measure of student skill and that they are comparable to other students' results on the same test. Current best practices include securing the browser, effective proctoring, and monitoring social media. Potential alternatives include larger test item banks and better alignment of student and institutional motivations.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-8805518290540306972018-10-05T12:00:00.002-07:002022-11-11T10:28:15.705-08:00Quality Assessment Part 6: Achievement Levels and Standard Setting<p><em>This is part 6 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><strong>Part 6: Achievement Levels and Standard Setting</strong></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6608992905536048290"><img alt="Two mountains, one with a flag on top." height="199" src="https://lh3.googleusercontent.com/-DRd5kf5wZBA/W7fcCBmNjKI/AAAAAAAAD0I/t8rAn4_at_YqmMdN26FDIOSe0iyxd2AlACHMYBhgL/s1600/Mountains.jpg" style="float: right; margin: 5;" width="250" /></a>
<p>If you have a child in U.S. public school, chances are that they took a state achievement test this past spring and sometime this summer you received a report on how they performed on that test. That report probably looks something like <a href="http://testscoreguide.org/ca/sample/">this sample of a California Student Score Report</a>. It shows that "Matthew" achieved a score of 2503 in English Language Arts/Literacy and 2530 in Mathematics. Both scores are described as "Standard Met (Level 3)". Notably, in prior years Matthew was in the "Standard Nearly Met" category so his performance has improved.</p>
<p>The <a href="https://www.caschooldashboard.org">California School Dashboard</a> offers reports of school performance according to multiple factors. For example, the <a href="https://www.caschooldashboard.org/#/Details/33672156032569/3/DetailedReport">Detailed Report for Castle View Elementary</a> includes a graph of "Assessment Performance Results: Distance from Level 3".</p>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6608992916606674098"><img alt="Line graph showing performance of Lake Matthews Elementary on the English and Math tests for 2015, 2016, and 2017. In all three years, they score between 14 and 21 points above proficiency in math and between 22 and 40 points above proficiency in English." height="270" src="https://lh3.googleusercontent.com/-FSP44jzxeTI/W7fcCq1p5LI/AAAAAAAAD0M/D1YzQnxR4GoK6eB1_d7fz23tCx_ef1lvgCHMYBhgL/s1600/LakeMatthews.jpg" style="margin: 5;" width="450" /></a>
<p>To prepare this graph, they take the average difference between students' scale scores and the Level 3 standard for proficiency in the grade in which they were tested. For each grade and subject, California and Smarter Balanced use four <em>achievement levels</em>, each assigned to a range of scores. Here are the achievement levels for 5th grade Math (<a href="http://www.smarterbalanced.org/assessments/scores/">see this page</a> for all ranges).</p>
<table style="font-size: 0.8em;"><tbody><tr><th>Level</th><th>Range</th><th>Descriptor</th></tr><tr><td>Level 1</td><td>Less than 2455</td><td>Standard Not Met</td></tr><tr><td>Level 2</td><td>2455 to 2527</td><td>Standard Nearly Met</td></tr><tr><td>Level 3</td><td>2528 to 2578</td><td>Standard Met</td></tr><tr><td>Level 4</td><td>Greater than 2578</td><td>Standard Exceeded</td></tr></tbody></table>
<p>So, for Matthew and his fellow 5th graders, the Math standard for proficiency, or "Level 3" score, is 2528. Students at Lake Matthews Elementary, on average, exceeded the Math standard by 14.4 points on the 2017 tests.</p>
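<p>The table above amounts to a simple lookup from scale score to achievement level. Here is a minimal sketch in Python using the Grade 5 Math cut scores; the function and variable names are mine, not part of any testing standard.</p>

```python
# Grade 5 Math achievement levels from the table above, ordered from the
# highest cut score down. Level 4 is "Greater than 2578", i.e. 2579 and up.
GRADE5_MATH_CUTS = [
    (2579, "Level 4", "Standard Exceeded"),
    (2528, "Level 3", "Standard Met"),
    (2455, "Level 2", "Standard Nearly Met"),
    (0,    "Level 1", "Standard Not Met"),
]

def achievement_level(scale_score):
    """Return the (level, descriptor) pair for a scale score."""
    for cut, level, descriptor in GRADE5_MATH_CUTS:
        if scale_score >= cut:
            return level, descriptor

print(achievement_level(2530))  # ('Level 3', 'Standard Met')
print(achievement_level(2440))  # ('Level 1', 'Standard Not Met')
```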
<p>Clearly, there are serious consequences associated with the assignment of scores to achievement levels. A difference of 10-20 points can make the difference between a school, or student, meeting or failing to meet the standard. Changes in proficiency rates can affect allocation of federal <a href="https://www2.ed.gov/policy/elsec/leg/esea02/pg1.html">Title 1</a> funds, the careers of school staff, and even the <a href="https://www.sfgate.com/education/article/BAY-AREA-NEW-MEANING-TO-API-SCORES-HOME-2567517.php">value of homes</a> in local neighborhoods.</p>
<p>More importantly to me, achievement levels must be carefully set if they are to provide reliable guidance to students, parents, and educators.</p>
<h2>Standard Setting</h2>
<p><a href="https://en.wikipedia.org/wiki/Standard-setting_study">Standard Setting</a> is the process of assigning test score ranges to achievement levels. A score value that separates one achievement level from another is called a <em>cut score</em>. The most important cut score is the one that distinguishes between <em>proficient</em> (meeting the standard) and <em>not proficient</em> (not meeting the standard). For the California Math test, and for Smarter Balanced, that's the "Level 3" score but different tests may have different achievement levels.</p>
<p>When Smarter Balanced <a href="https://portal.smarterbalanced.org/library/en/achievement-level-setting-overview.pdf">performed its standard setting exercise in October of 2014</a>, it used the <a href="https://en.wikipedia.org/wiki/Standard-setting_study#Item-centered_studies">Bookmark Method</a>. Smarter Balanced had conducted a field test the previous spring (described in <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Part 4 of this series</a>). From those field test results, they calculated a difficulty level for each test item and converted that into a scale score. For each grade, a selection of approximately 70 items was sorted from easiest to most difficult. This sorted list of items is called an Ordered Item Booklet (OIB) though, in the Smarter Balanced case, the items were presented online. A panel of experts, composed mostly of teachers, went through the OIB starting at the beginning (easiest item), and set a bookmark at the item they believed represented proficiency for that grade. A proficient student should be able to answer all preceding items correctly but might have trouble with the items that follow the bookmark.</p>
<p>There were multiple iterations of this process on each grade, and then the correlation from grade to grade was also reviewed. Panelists were given statistics on how many students in the field tests would be considered proficient at each proposed skill level. Following multiple review passes, the group settled on the recommended cut scores for each grade. The <a href="https://portal.smarterbalanced.org/library/en/standard-setting-observation-and-report.pdf">Smarter Balanced Standard Setting Report</a> describes the process in great detail.</p>
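<p>One common convention for turning a round of bookmark placements into a proposed cut score is to take the median panelist bookmark and use the scale-score location of that item. The sketch below uses made-up numbers and is not the actual Smarter Balanced data or their exact procedure.</p>

```python
import statistics

# Sketch of deriving a cut score from one bookmark round. Each panelist
# places a bookmark at an item index in the Ordered Item Booklet (OIB);
# the median bookmark's scale-score location becomes the proposed cut.
# All values here are illustrative.

# Scale-score location of each OIB item, sorted easiest to hardest.
oib_item_scores = [2400, 2420, 2445, 2460, 2475, 2500, 2520, 2540, 2560]

# Item index where each of five panelists placed their bookmark.
bookmarks = [4, 5, 5, 6, 7]

median_bookmark = round(statistics.median(bookmarks))
cut_score = oib_item_scores[median_bookmark]
print(cut_score)  # 2500
```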
<h2>Data Form</h2>
<p>For each subject and grade, the standard setting process results in <em>cut scores</em> representing the divisions between achievement levels. The cut scores for Grade 5 Math, from the table above, are 2455, 2528, and 2579. Psychometricians also calculate the Highest Obtainable Scale Score (HOSS) and Lowest Obtainable Scale Score (LOSS) for the test.</p>
<p>I am not aware of any existing data format standard for achievement levels. Smarter Balanced publishes its <em>achievement levels</em> and <em>cut scores</em> <a href="http://www.smarterbalanced.org/assessments/scores/">on its web site</a>. The Smarter Balanced <a href="http://www.smarterapp.org/documents/AdministrationTestPackageFormat.pdf">test administration package</a> format includes cut scores, and HOSS and LOSS; but not achievement level descriptors.</p>
<p>A data dictionary for publishing achievement levels would include the following elements:</p>
<table style="font-size: 0.8em;"><tbody><tr><th>Element</th><th>Definition</th></tr><tr><td>Cut Score</td><td>The lowest *scale score* included in a particular achievement level.</td></tr><tr><td>LOSS</td><td>The lowest obtainable *scale score* that a student can achieve on the test.</td></tr><tr><td>HOSS</td><td>The highest obtainable *scale score* that a student can achieve on the test.</td></tr><tr><td>Achievement Level Descriptor</td><td>A description of what an achievement level means. For example, "Met Standard" or "Exceeded Standard".</td></tr></tbody></table>
<h2>Quality Factors</h2>
<p>The stakes are high for standard setting. Reliable cut scores for achievement levels ensure that students, parents, teachers, administrators, and policy makers receive appropriate guidance for high-stakes decisions. If the cut scores are wrong - many decisions may be ill informed. Quality is achieved by following a good process:</p>
<ul>
<li>Begin with a foundation of <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">high quality achievement standards</a>, <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">test items</a> that accurately measure the standards, and a reliable <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">field test</a>.</li>
<li>Form a standard-setting panel composed of experts and grade-level teachers.</li>
<li>Ensure that the panelists are familiar with the <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">achievement standards</a> that the assessment targets.</li>
<li>Inform the panel with statistics regarding actual student performance on the test items.</li>
<li>Follow a proven <a href="https://en.wikipedia.org/wiki/Standard-setting_study">standard-setting process</a>.</li>
<li>Publish the achievement levels and cut scores in convenient human-readable and machine-readable forms.</li>
</ul>
<h2>Wrapup</h2>
<p>Student achievement rates affect policies at state and national levels, direct budgets, impact staffing decisions, influence real estate values, and much more. Setting achievement level cut scores too high may set unreasonable expectations for students. Setting them too low may offer an inappropriate sense of complacency. Regardless, achievement levels are set on a scale calibrated to achievement standards. If the standards for the skills to be learned are not well-designed, or if the tests don't really measure the standards, then no amount of work on the achievement level cut scores can compensate.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-58243893268157523322018-09-14T06:00:00.003-07:002022-11-11T10:28:01.578-08:00Quality Assessment Part 5: Blueprints and Computerized-Adaptive Testing<p><em>This is part 5 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><strong>Part 5: Blueprints and Computerized-Adaptive Testing</strong></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6601095243302761122"><img alt="Arrows in a tree formation." height="250" src="https://lh3.googleusercontent.com/-87-i0Q-arB4/W5vNJlGU8qI/AAAAAAAADxY/8CTU22Zdgf447XlbSq-mJSfEW0Uzh65tgCHMYBhgL/s1600/arrows.jpg" style="float: right; margin: 5px;" width="250" /></a>
<p>Molly is a 6th grade student who is already behind in math. Near the end of the school year she takes her state's annual achievement tests in mathematics and English Language Arts. Already anxious when she sits down to the test, her fears are confirmed by the first question where she is asked to divide 3/5 by 7/8. Though they spent several days on this during the year, she doesn't recall how to <a href="http://www.corestandards.org/Math/Content/6/NS/A/1/">divide one fraction by another</a>. As she progresses through the test, she is able to answer a few questions but resorts to guessing on all too many. After twenty minutes of this she gives up and just guesses on the rest of the answers. When her test results are returned a month later she gets the same rating as three previous years, "Needs Improvement." Perpetually behind, she decides that she is, "Just not good at math."</p>
<p>Molly is fictional but she represents thousands of students across the U.S. and around the world.</p>
<p>Let's try another scenario. In this case, Molly is given a <a href="https://en.wikipedia.org/wiki/Computerized_adaptive_testing">Computerized-Adaptive Test</a> (CAT). When she gets the first question wrong, the testing engine picks an easier question which she knows how to answer. Gaining confidence, she applies herself to the next question which she also knows how to answer. The system presents easier and harder questions as it works to pinpoint her skill level within a spectrum extending back to 4th grade and ahead to 8th grade. When her score report comes, she has a scale score of <a href="http://www.smarterbalanced.org/assessments/scores/">2505</a>, which is below the 6th grade standard of 2552. The report shows her previous year's score of 2423, which was well below standard for Grade 5. The summary says that, while Molly is still behind, she has achieved significantly more than a year's progress in the past year of school; <a href="http://testscoreguide.org/ca/sample/">much like this example of a California report.</a></p>
<h2>Computerized-Adaptive Testing</h2>
<p>A fixed-form <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory</a> test presents a set of questions at a variety of skill levels centered on the standard for proficiency for the grade or course. Such tests result in a scale score, which indicates the student's proficiency level, and a <a href="https://en.wikipedia.org/wiki/Standard_error">standard error</a>, which indicates the confidence range around that score. A simplified explanation is that the student's actual skill level should be within the range of the scale score plus or minus the standard error. Because a fixed-form test is optimized for the mean, the standard error grows the further the student's skill is from the test's target proficiency.</p>
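<p>That simplified explanation can be stated as a one-line function (the numbers below are illustrative only):</p>

```python
# The simplified interpretation above: a student's actual skill should fall
# within the scale score plus or minus the standard error.

def score_band(scale_score, standard_error):
    """Return the (low, high) range implied by a score and its standard error."""
    return (scale_score - standard_error, scale_score + standard_error)

# Illustrative numbers, not from any actual score report.
print(score_band(2505, 25))  # (2480, 2530)
```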
<p>Computerized Adaptive Tests (CAT) start with a large pool of <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">assessment items</a>. <a href="https://www.smarterbalanced.org">Smarter Balanced</a> uses a pool of 1,200-1,800 items for a 40-item test. Each question is calibrated according to its difficulty within the range of the test. The test administration starts with a question near the middle of the range. From then on, the <a href="http://www.smarterapp.org/documents/AdaptiveAlgorithmSummary.pdf">adaptive algorithm</a> tracks the student's performance on prior items and selects the questions most likely to pinpoint, and increase confidence in, the student's skill level.</p>
<p>A stage-adaptive or <a href="https://en.wikipedia.org/wiki/Multistage_testing">multistage test</a> is similar except that groups of questions are selected together.</p>
<p>CAT tests have three important advantages over fixed-form:</p>
<ul>
<li>The test can measure student skill across a wider range while maintaining a small standard error.</li>
<li>Fewer questions are required to assess the student's skill level.</li>
<li>Students may have a more rewarding experience as the testing engine offers more questions near their skill level.</li>
</ul>
<p>When you combine more accurate results with a broader measured range and then use the same test family over time, you can reliably measure student growth over a period of time.</p>
<h2>Test Blueprints</h2>
<p>As I described in <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Part 2</a> and <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Part 3</a> of this series, each assessment item is designed to measure one or two specific skills. A <em>test blueprint</em> indicates what skills are to be measured in a particular test and how many items of which types should be used to measure each skill.</p>
<p>As an example, here's the blueprint for the Smarter Balanced Interim Assessment Block (IAB) for "Grade 3 Brief Writes":</p>
<table style="border-collapse: collapse; border: 1px solid black;"><tbody>
<tr><td colspan="4" style="border: 1px solid black; font-weight: bold; padding: 0.5em; text-align: center;">Block 3: Brief Writes</td></tr>
<tr><td style="border: 1px solid black; font-weight: bold; padding: 0.5em;">Claim</td><td style="border: 1px solid black; font-weight: bold; padding: 0.5em;">Target</td><td style="border: 1px solid black; font-weight: bold; padding: 0.5em;">Items</td><td style="border: 1px solid black; font-weight: bold; padding: 0.5em;">Total Items</td></tr>
<tr><td rowspan="3" style="border: 1px solid black; padding: 0.5em; vertical-align: middle;">Writing</td><td style="border: 1px solid black; padding: 0.5em;">1a. Write Brief Texts (Narrative)</td><td style="border: 1px solid black; padding: 0.5em;">4</td><td rowspan="3" style="border: 1px solid black; padding: 0.5em; vertical-align: middle;">6</td></tr>
<tr><td style="border: 1px solid black; padding: 0.5em;">3a. Write Brief Texts (Informational)</td><td style="border: 1px solid black; padding: 0.5em;">1</td></tr>
<tr><td style="border: 1px solid black; padding: 0.5em;">6a. Write Brief Texts (Opinion)</td><td style="border: 1px solid black; padding: 0.5em;">1</td></tr>
</tbody></table>
<p>This blueprint, for a relatively short fixed-form test, indicates a total of six items spread across one claim and three targets. For more examples, you can check out the <a href="http://www.smarterbalanced.org/assessments/development/">Smarter Balanced Test Blueprints</a>. The Summative Tests, which are used to measure achievement at the end of each year, have the most items and represent the broadest range of skills to be measured.</p>
<p>When developing a fixed-form test, the test producer will select a set of items that meets the requirements of the blueprint and represents an appropriate mix of difficulty levels.</p>
<p>For CAT tests it's more complicated. The test producer must select a much larger pool of items than will be presented to the student. A minimum is five to ten items in the pool for each item to be presented to the student. For summative tests, Smarter Balanced uses a ratio averaging around 25 to 1. These items should represent the skills to be measured in approximately the same ratios as they are represented in the blueprint. And they should represent difficulty levels across the range of skill to be measured. (Difficulty level is represented by the <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html#IRT">IRT <em>b</em> parameter</a> of each item.)</p>
<p>As the student progresses through the test, the <em>CAT Algorithm</em> selects the next item to be presented. In doing so, it takes into account three factors: (1) what it has determined about the student's skill level so far, (2) how much of the blueprint has been covered and what it has yet to cover, and (3) the pool of items it has to select from. From those criteria it selects an item that will advance coverage of the blueprint and will improve measurement of the student's skill level.</p>
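<p>To make the selection step concrete, here is a minimal sketch in Python. It assumes a 3PL item model and a toy blueprint keyed by target; the item pool, identifiers, and selection policy are invented for illustration. Real CAT engines add exposure control and far more sophisticated blueprint constraints.</p>

```python
import math

def p_correct(theta, a, b, c):
    # 3PL item characteristic function: probability of a correct response
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def information(theta, a, b, c):
    # Fisher information of a 3PL item at ability theta
    p = p_correct(theta, a, b, c)
    return (a ** 2) * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def next_item(theta_est, pool, counts, blueprint):
    # Pick the target furthest behind its blueprint quota, then the
    # most informative item for that target at the current theta estimate.
    behind = min(blueprint, key=lambda t: counts.get(t, 0) / blueprint[t])
    candidates = [item for item in pool if item["target"] == behind]
    return max(candidates,
               key=lambda i: information(theta_est, i["a"], i["b"], i["c"]))

pool = [
    {"id": 1, "target": "1a", "a": 1.0, "b": 0.0, "c": 0.2},
    {"id": 2, "target": "3a", "a": 1.2, "b": 0.5, "c": 0.2},
    {"id": 3, "target": "3a", "a": 0.4, "b": 3.0, "c": 0.2},
]
# Target "1a" has met its quota of 4; "3a" has not, so the next item
# comes from "3a", and item 2 is the more informative of the two here.
chosen = next_item(0.0, pool, {"1a": 4, "3a": 0}, {"1a": 4, "3a": 1})
```
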
<h2>Data Form</h2>
<p>To present a CAT assessment the test engine needs three sets of data:</p>
<ul>
<li>The <strong>Test Blueprint</strong></li>
<li>A <strong>Catalog</strong> of all items in the pool. The entry for each item must specify its alignment to the test blueprint (which is equivalent to its <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">alignment to standards</a>), and its <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">IRT Parameters</a>.</li>
<li>The <strong>Test Items</strong> themselves.</li>
</ul>
<p><a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Part 3</a> of this series describes formats for the items. The item metadata should include the alignment and IRT information. The manifest portion of <a href="https://www.imsglobal.org/content/packaging/index.html">IMS Content Packaging</a> is one format for storing and transmitting item metadata.</p>
<p>To date, there is no standard or commonly-used data format for <em>test blueprints</em>. Smarter Balanced has published open specifications for its <a href="http://www.smarterapp.org/documents/AssessmentPackageTypes.html">Assessment Packages</a>. Of those, the <a href="http://www.smarterapp.org/documents/AdministrationTestPackageFormat.pdf">Test Administration Package</a> format includes the test blueprint and the item catalog. <a href="https://www.imsglobal.org/activity/case">IMS CASE</a> is designed for representing achievement standards, but it may also be applicable to test blueprints.</p>
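<p>For illustration, here is what a structured version of the "Brief Writes" blueprint above might look like. Since there is no standard format, the field names below are invented, not drawn from any published specification.</p>

```python
# Hypothetical structured form of the Grade 3 "Brief Writes" blueprint;
# field names are illustrative only.
blueprint = {
    "block": "Block 3: Brief Writes",
    "claims": [
        {
            "claim": "Writing",
            "targets": [
                {"target": "1a. Write Brief Texts (Narrative)", "items": 4},
                {"target": "3a. Write Brief Texts (Informational)", "items": 1},
                {"target": "6a. Write Brief Texts (Opinion)", "items": 1},
            ],
        }
    ],
}

# A test engine can derive totals directly from the structure.
total_items = sum(t["items"]
                  for claim in blueprint["claims"]
                  for t in claim["targets"])
```
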
<p><a href="https://www.imsglobal.org">IMS Global</a> has formed an "IMS CAT Task Force" which is working on interoperable standards for Computerized Adaptive Testing. They anticipate releasing specifications later in 2018.</p>
<h2>Quality Factors</h2>
<p>A <em>CAT Simulation</em> is used to measure the quality of a Computerized Adaptive Test. These simulations use a set of a few thousand simulated students, each assigned a particular skill level. The system then simulates each student taking the test. For each item, the <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html#ICC">item characteristic function</a> is used to determine whether a student at that skill level is likely to answer correctly. The <em>adaptive algorithm</em> uses those results to determine which item to present next.</p>
<p>The results of the simulation show how well the CAT measures skill: each simulated student's test score is compared against the skill level that student was assigned. Results are used to ensure that the <em>item pool</em> has sufficient coverage, that the <em>CAT algorithm</em> satisfies the blueprint, and to find out which items get the most exposure. This feedback is used to tune the item pool and the configuration of the CAT algorithm to achieve optimal results across the simulated population of students.</p>
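<p>The following sketch shows the core of such a simulation in Python. To keep it short it uses a fixed set of 3PL items rather than a full adaptive algorithm, and a coarse grid-based maximum-likelihood estimate of θ; the item parameters and student population are invented for illustration.</p>

```python
import math
import random

def p_correct(theta, a, b, c):
    # 3PL item characteristic function
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def estimate_theta(responses, items):
    # Maximum-likelihood estimate of theta over a coarse grid
    grid = [g / 10 for g in range(-40, 41)]
    def loglik(theta):
        return sum(math.log(p_correct(theta, a, b, c) if r
                            else 1 - p_correct(theta, a, b, c))
                   for r, (a, b, c) in zip(responses, items))
    return max(grid, key=loglik)

random.seed(1)
items = [(1.0, b / 4, 0.2) for b in range(-8, 9)]  # difficulties -2.0 .. 2.0
errors = []
for _ in range(200):                       # simulated students
    true_theta = random.gauss(0, 1)
    responses = [random.random() < p_correct(true_theta, a, b, c)
                 for a, b, c in items]
    errors.append(abs(estimate_theta(responses, items) - true_theta))

# With only 17 fixed items the estimates are noisy; adding items or
# adapting their difficulty to the student narrows this error.
mean_abs_error = sum(errors) / len(errors)
```
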
<p>To build a high-quality CAT assessment:</p>
<ul>
<li>Build a large <em>item pool</em> with items of difficulty levels spanning the range to be measured.</li>
<li>Design a <em>test blueprint</em> that focuses on the skills to be measured and correlates with the overall score and the subscores to be reported.</li>
<li>Ensure that the <em>adaptive algorithm</em> effectively covers the blueprint and also focuses in on each student's skill level.</li>
<li>Perform <em>CAT simulations</em> to tune the effectiveness of the item pool, blueprint, and CAT algorithm.</li>
</ul>
<h2>Wrapup</h2>
<p>Computerized adaptive testing offers significant benefits to students by delivering more accurate measures with a shorter, more satisfying test. CAT is best suited to larger tests with 35 or more questions spread across a broad blueprint. Shorter tests, focused on mastery of one or two specific skills, may be better served by conventional fixed-form tests.</p>
<hr />
<h1>Quality Assessment Part 4: Item Response Theory, Field Testing, and Metadata</h1>
<p><em>By Brandt Redd. Published September 1, 2018; updated November 11, 2022.</em></p>
<p><em>This is part 4 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><strong>Part 4: Item Response Theory, Field Testing, and Metadata</strong></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6596430503257021186"><img alt="Drafting tools - triangle, compass, ruler." height="271" src="https://lh3.googleusercontent.com/-zZUSEWly45E/W4s6l-QvmwI/AAAAAAAADvo/IisDJBMBDIQZ5pyarz5X0QkjywkURdkKgCHMYBhgL/s1600/drafting-tools.jpg" style="float: right; margin: 5px;" width="250" /></a>
<p>Consider a math quiz with the following two items:</p>
<div style="border: 1px solid black; margin: 0px 0px 1em 4em; padding: 1em; width: 15em;">Item A:
<p>x = 5 - 2<br />
What is the value of x?</p>
</div>
<div style="border: 1px solid black; margin: 0px 0px 1em 4em; padding: 1em; width: 15em;">Item B:
<p>x<span style="font-size: smaller; vertical-align: super;">2</span> - 6x + 9 = 0<br />
What is the value of x?</p>
</div>
<p>George gets item A correct but gets the wrong answer for item B. Sally has the wrong answer for A but answers B correctly. Using traditional scoring, George and Sally each get 50%.</p>
<p>A more sophisticated quiz might assign 2 points to item A and 6 points to item B (recognizing that B is harder than A). Under such a scoring system, George would get 25% and Sally would get 75%.</p>
<p>But the score is still short on meaning. George scored 25% of what? Sally scored 75% of what?</p>
<p>An even more sophisticated model should acknowledge that knowing how to solve quadratics (item B) is evidence that the student can also perform subtraction (item A). Such a model would position George somewhere between first grade (<a href="http://www.corestandards.org/Math/Content/1/OA/C/6/">single-digit subtraction</a>) and High School (<a href="http://www.corestandards.org/Math/Content/HSA/SSE/B/3/a/">solving quadratics</a>). That same model would indicate that Sally either guessed correctly on item B or made a mistake on item A that's not representative of her skill. Due to the conflicting evidence, we are less sure about Sally's skill level than George's. For both students, more items would be required to gain greater confidence in their skill levels.</p>
<h2 id="IRT">Item Response Theory</h2>
<p><a href="https://en.wikipedia.org/wiki/Item_response_theory">Item Response Theory</a> or IRT is a statistical method for describing how student performance on assessment items relates to their skill in the area the item was designed to measure.</p>
<p>The "three-parameter logistic model" (3PL) for IRT describes the probability that a student of a certain skill level will answer the item correctly. Student proficiency is represented by θ (theta) and the three <em>item</em> parameters are a, b, and c. They represent the following factors:</p>
<ul>
<li><em>a</em> = Discrimination. This value indicates how well the item discriminates between proficient students and those who have not yet learned this skill.</li>
<li><em>b</em> = Difficulty. This value indicates how difficult an item is for the student to answer correctly.</li>
<li><em>c</em> = Guessing. The probability that a student might guess the correct response. For a four-option multiple-choice question, this would be 0.25 because the student has a one-in-four chance of guessing the right answer.</li>
</ul>
<p>From these parameters we can create an <em id="ICC">item characteristic curve</em>. The formula is as follows:</p>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6596430506664273634"><img alt="formula: p=c+(1-c)/(1+e^(-a(θ-b)))" height="96" src="https://lh3.googleusercontent.com/-GdFiafWQwco/W4s6mK9GJuI/AAAAAAAADvs/KAJz9dyYhyMOA5l-A35kE5WuBnHfANZFwCHMYBhgL/s1600/IRT-ICC.jpg" width="280" /></a>
<p>This is much easier to understand in graph form. So I loaded it into the <a href="https://www.desmos.com">Desmos</a> graphing calculator.</p>
<iframe frameborder="0" height="420px" src="https://www.desmos.com/calculator/sanzwwisog" style="border: 1px solid #ccc;" width="800px"></iframe>
<p>The vertical (y) axis indicates the probability that a student will answer the item correctly. The horizontal (x) axis is student proficiency (represented by θ in the equation). You can move the sliders to change the <em>a</em>, <em>b</em>, and <em>c</em> parameters and see how different items would be represented in an item characteristic curve.</p>
<p>In addition to this "three-parameter" model, there are other IRT models, but they all follow the same basic premise: the function represents the probability that a student of given skill (represented by θ, theta) will answer the question correctly. At least one parameter of the function represents the difficulty of the question. For items scored on a multi-point scale, there are difficulty parameters (typically d1, d2, etc.) representing the difficulty thresholds for each point value.</p>
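<p>The curve is also easy to compute directly. This short sketch evaluates the 3PL function shown above; note that at θ = <em>b</em> the probability is exactly halfway between the guessing floor <em>c</em> and 1.</p>

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.25):
    # 3PL item characteristic curve: probability that a student of
    # proficiency theta answers this item correctly
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

p_at_b = icc(0.0, a=1.0, b=0.0, c=0.25)   # (1 + 0.25) / 2 = 0.625
p_low = icc(-3.0)                          # weak student: near the floor c
p_high = icc(3.0)                          # strong student: near 1.0
```
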
<h2>Scale Scores</h2>
<p>The difficulty parameter <em>b</em>, and the student skill value θ, are on the same <em>logistic</em> scale and center on the skill level being measured. For example, if an item is written for grade 5 math, a <em>b</em> parameter of 0 means that the average 5th grade student should be able to answer the question correctly 50% of the time.</p>
<p>Most assessments convert from this <em>theta score</em> into a <em>scale score</em> which is a consistent score reported to educators, students, and parents. For Smarter Balanced, the <a href="http://www.smarterbalanced.org/assessments/scores/">scale score ranges from 2000 to 3000</a> and represents skill levels from Kindergarten to High School Graduation. <em>Theta scores</em> are converted to <em>scale scores</em> <a href="https://bit.ly/2wBGSyD">using a polynomial function</a>.</p>
<h2>Field Testing</h2>
<p>So how do we come up with the <em>a</em>, <em>b</em>, and <em>c</em> parameters for a particular item? Based on the item type and potential responses we can predict <em>c</em> (guessing) fairly well, but our experience at Smarter Balanced has shown that authors are not very good at predicting <em>b</em> (difficulty) or <em>a</em> (discrimination). To get an objective measure of these values we use a <em>field test</em>.</p>
<p>In Spring 2014 <a href="https://portal.smarterbalanced.org/library/en/2014-field-test-report.pdf">Smarter Balanced held a field test</a> in which 4.2 million students completed a test - typically in either English Language Arts or Mathematics. Some students took both. For the participating schools and students, this was a practice test - gaining experience in administering and taking tests. Since the items were not yet calibrated, we could not reliably score the tests. For Smarter Balanced it offered critical data on more than 19,000 test items. For each item we gained more than 10,000 scored responses from students representing the target grades across all demographics.</p>
<p>Psychometricians used these data to calculate the parameters (<em>a</em>, <em>b</em>, and <em>c</em>) for each item in the field test. The process of calculating IRT parameters from field test data is called <em>calibration</em>. Once items were calibrated we examined the parameters and the data to determine which items to approve for use in tests. For example, if <em>a</em> is too low then the question likely has a flaw. It may not measure the right skill or the answer key may be incorrect. Likewise, if the <em>b</em> parameter is different across demographic groups then the item may be sensitive to gender, cultural, or ethnic bias. Items from the field test that met statistical standards were approved and became the initial bank of items from which Smarter Balanced produces tests.</p>
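<p>As a toy illustration of calibration, the sketch below recovers a known difficulty <em>b</em> from simulated scored responses by grid-search maximum likelihood. It cheats by treating student abilities as known and holding <em>a</em> and <em>c</em> fixed; real calibration (e.g. marginal maximum likelihood) must estimate abilities and all item parameters jointly.</p>

```python
import math
import random

def p_correct(theta, a, b, c):
    # 3PL item characteristic function
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def calibrate_b(thetas, responses, a=1.0, c=0.25):
    # Grid-search MLE for difficulty b, with a and c held fixed
    def loglik(b):
        return sum(math.log(p_correct(t, a, b, c) if r
                            else 1 - p_correct(t, a, b, c))
                   for t, r in zip(thetas, responses))
    grid = [g / 20 for g in range(-60, 61)]   # b from -3.0 to 3.0
    return max(grid, key=loglik)

random.seed(7)
true_b = 0.8
thetas = [random.gauss(0, 1) for _ in range(5000)]   # simulated abilities
responses = [random.random() < p_correct(t, 1.0, true_b, 0.25)
             for t in thetas]
b_hat = calibrate_b(thetas, responses)   # lands close to true_b
```
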
<p>Each year Smarter Balanced does an <em>embedded field test</em>. Each test that a student takes has a few new "field test" items included. These items do not contribute to the student's test score. Rather, the students' scored responses are used to calibrate the items. This way the test item bank is being constantly renewed. Other organizations like ACT and SAT follow the same practice of embedding field test questions in regular tests.</p>
<p>To understand more about IRT, I recommend <a href="http://creative-wisdom.com/computer/sas/IRT.pdf">A Simple Guide to IRT and Rasch Modeling</a> by Ho Yu.</p>
<h2>Item Metadata</h2>
<p>The IRT parameters, alignment to standards, and other critical information are collected as <a href="https://en.wikipedia.org/wiki/Metadata">metadata</a> about each item. In most cases, metadata is represented as a set of name-value pairs. There are many formats for representing metadata and also many dictionaries of field definitions. Smarter Balanced uses the <a href="http://www.imsglobal.org/content/packaging/cpv1p1p4/imscp_bestv1p1p4.html#1671456">metadata structure from IMS Content Packaging</a> and draws field definitions from <a href="http://www.lrmi.net/specifications/lrmi_1/">The Learning Resource Metadata Initiative (LRMI)</a>, from <a href="https://schema.org/">Schema.org</a>, and from <a href="https://ceds.ed.gov">Common Education Data Standards (CEDS)</a>.</p>
<p>Here are some of the most critical metadata elements for assessment items with links to their definitions in those standards:</p>
<ul>
<li><a href="https://schema.org/identifier">Identifier</a>: A number that uniquely identifies this item.</li>
<li><a href="https://schema.org/educationalAlignment">PrimaryStandard</a>: An identifier of the principal skill the item is intended to measure. The skill would be described in an <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Achievement Standard or Content Specification</a>.</li>
<li>SecondaryStandard: Optional identifiers of additional Achievement Standards or Content Specifications that the item measures.</li>
<li><a href="https://ceds.ed.gov/CEDSElementDetails.aspx?TermxTopicId=38193">InteractionType</a>: The type of interaction (multiple choice, matching, short answer, essay, etc.).</li>
<li><a href="https://ceds.ed.gov/CEDSElementDetails.aspx?TermxTopicId=39094">IRT Parameters</a>: The <em>a</em>, <em>b</em>, and <em>c</em> parameters or another parameter set for the Item Response Theory function.</li>
<li>History: A record of when and how the item has been used to estimate how much it has been exposed.</li>
</ul>
<h2>Quality Factors</h2>
<p>States, schools, assessment consortia, and assessment companies all maintain banks of assessment items from which they construct their assessments. There are a number of efforts underway to pool resources from multiple entities into large, joint item banks. The value of items in any such bank is <strong>multiplied tenfold</strong> if the items have consistent and reliable metadata regarding <strong>alignment to standards</strong> and <strong>IRT parameters</strong>.</p>
<p>Here are factors to consider related to IRT Calibration and Metadata:</p>
<ul>
<li>Are all items field-tested and calibrated before they are used in an operational test?</li>
<li>Is alignment to standards and content specifications an integral part of item writing?</li>
<li>Are the identifiers used to record alignment consistent across the entire item bank?</li>
<li>Is field testing an integral part of the assessment design?</li>
<li>Are IRT parameters consistent and comparable across the entire bank?</li>
<li>When sharing items or an item bank across multiple organizations, do all participants agree to contribute data (field testing and operational use) back to the bank?</li>
</ul>
<h2>Wrapup</h2>
<p>Field testing can be expensive, inconvenient, or both. But without actual data from student performance we have no objective evidence that a particular assessment item measures what it's intended to measure at the expected level of difficulty.</p>
<p>The challenges around field testing, combined with the lack of training in IRT and related psychometrics, have kept these measures from being used in anything other than large-scale, high-stakes tests. Nevertheless, it's concerning to me that final exams and midterms of great consequence are rarely, if ever, calibrated and validated. Greater collaboration among institutions, among curriculum developers, or both could achieve sufficient scale for calibrated tests to become more common.</p>
<hr />
<h1>Quality Assessment Part 3: Items and Item Specifications</h1>
<p><em>By Brandt Redd. Published August 23, 2018; updated November 11, 2022.</em></p>
<p><em>This is part 3 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><strong>Part 3: Items and Item Specifications</strong></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6593124492233477458"><img alt="Transparent cylindrical vessel with wires leading to an electric spark inside." height="333" src="https://lh3.googleusercontent.com/-C8723myjKG0/W397yzeBDVI/AAAAAAAADuo/o2nBriLTs4k5AXpUvqEB4IOQVd85u7rWgCHMYBhgL/s1600/spark.jpg" style="float: right; margin: 5px;" width="250" /></a>
<p>Some years ago I remember reading my middle school science textbook. The book was attempting to describe the difference between a mixture and a compound. It explained that water is a <em>compound</em> of two parts hydrogen and one part oxygen. However, if you mix two parts hydrogen and one part oxygen in a container, you will simply have a container with a <em>mixture</em> of the two gasses, they will not spontaneously combine to form water.</p>
<p>So far, so good. Next, the book said that if you introduced an electric spark in the mixed gasses you would, "start to see drops of water appear on the inside surface of the container as the gasses react to form water." This was accompanied by an image of a container with wires and an electric spark.</p>
<p>I suppose the book was technically correct; that is what would happen if the container was strong enough to contain the violent explosion. But, even as a middle school student, I wondered how the dangerously misleading passage got written and how it survived the review process.</p>
<p>The writing and review of assessments requires at least as much rigor as writing textbooks. An error on an assessment item affects the evaluation of all students who take the test.</p>
<h2>Items</h2>
<p>In the parlance of the assessment industry, test questions are called <em>items</em>. The latter term is intended to include more complex interactions than just answering questions.</p>
<h3>Stimuli and Performance Tasks</h3>
<p>Oftentimes, an item is based on a <em>stimulus</em> or <em>passage</em> that sets up the question. It may be an article, short story, or description of a math or science problem. The stimulus is usually associated with three to five items. When presented by computer, the stimulus and the associated items are usually presented on one split screen so that the student can refer to the stimulus while responding to the items.</p>
<p>Sometimes, item authors will write the stimulus; this is frequently the case for mathematics stimuli as they set up a story problem. But the best items draw on professionally-written passages. To facilitate this, the Copyright Clearance Center has set up the <a href="https://www.copyright.com/solutions-annual-copyright-license-student-assessments/">Student Assessment License</a> as a means to license copyrighted materials for use in student assessment.</p>
<p>A <em>performance task</em> is a larger-scale activity intended to allow the student to demonstrate a set of related skills. Typically, it begins with a stimulus followed by a set of ordered items. The items build on each other usually finishing with an essay that asks the student to draw conclusions from the available information. For <a href="https://www.smarterbalanced.org/">Smarter Balanced</a> this pattern (stimulus, multiple items, essay) is consistent across English Language Arts and Mathematics.</p>
<h3>Prompt or Stem</h3>
<p>The <em>prompt</em>, sometimes called a <em>stem</em>, is the request for the student to do something. A prompt might be as simple as, "What is the sum of 24 and 62?" Or it might be as complex as, "Write an essay comparing the views of the philosophers Voltaire and Kant regarding enlightenment. Include quotes from each that relate to your argument." Regardless, the prompt must provide the required information, clearly describe what the student is to do, and specify how they are to express their response.</p>
<h3>Interaction or Response Types</h3>
<p>The <em>response</em> is a student's answer to the prompt. Two general categories of items are <em>selected response</em> and <em>constructed response</em>. Selected response items require the student to select one or more <em>alternatives</em> from a set of pre-composed responses. Multiple choice is the most common selected response type, but others include multi-select (in which more than one response may be correct), matching, true/false, and others.</p>
<p>Multiple choice items are particularly popular due to the ease of recording and scoring student responses. For multiple choice items, <em>alternatives</em> are the responses that a student may select from, <em>distractors</em> are the incorrect responses, and the <em>answer</em> is the correct response.</p>
<p>The most common <em>constructed response</em> item types are <em>short answer</em> and <em>essay</em>. In each case, the student is expected to write their answer. The difference is the length of the answer; short answer is usually a word or phrase while essay is a composition of multiple sentences or paragraphs. A variation of short answer may have a student enter a mathematical formula. Constructed responses may also have students plot information on a graph or arrange objects into a particular configuration.</p>
<p><em>Technology-Enhanced</em> items are another commonly used category. These items are delivered by computer and include simulations, composition tools, and other creative interactions. However, all technology-enhanced items can still be categorized as either selected response or constructed response.</p>
<h3>Scoring Methods</h3>
<p>There are two general ways of scoring items, <em>deterministic scoring</em> and <em>probabilistic scoring</em>.</p>
<p>Deterministic scoring is indicated when a student's response may be unequivocally determined to be correct or incorrect. When a response is scored on multiple factors there may be partial credit for the factors the student addressed correctly. Deterministic scoring is most often associated with selected response items, but many constructed response items may also be deterministically scored when the factors of correctness are sufficiently precise, such as a numeric answer or a single word for a fill-in-the-blank question. When answers are collected by computer or are easily entered into a computer, deterministic scoring is almost always done by computer.</p>
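<p>As a sketch of deterministic scoring with partial credit, consider a multi-select item scored one point per correct selection, minus one per incorrect selection, floored at zero. The scoring policy here is an assumption for illustration, not a published rubric.</p>

```python
def score_multiselect(selected, answer_key):
    # Deterministic partial-credit scoring for a multi-select item:
    # +1 per correct selection, -1 per incorrect selection, floor at 0.
    selected, answer_key = set(selected), set(answer_key)
    hits = len(selected & answer_key)
    misses = len(selected - answer_key)
    return max(0, hits - misses)

full = score_multiselect(["A", "C", "D"], ["A", "C", "D"])  # 3: full credit
part = score_multiselect(["A", "C"], ["A", "C", "D"])       # 2: partial credit
none = score_multiselect(["A", "B"], ["A", "C", "D"])       # 0: hit offset by miss
```
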
<p>Probabilistic scoring is indicated when the quality of a student's answer must be judged on a scale. This is most often associated with essay type questions but may also apply to other constructed response forms. When handled well, a probabilistic score may include a confidence level — how confident is the scoring person or system that the score is correct.</p>
<p>Probabilistic scoring may be done by humans (e.g. judging the quality of an essay) or by computer. When done by computer, Artificial Intelligence techniques are frequently used with different degrees of reliability depending on the question type and the quality of the AI.</p>
<h3>Answer Keys and Rubrics</h3>
<p>The <em>answer key</em> is the information needed to score a selected-response item. For multiple choice questions, it's simply the letter of the correct answer. A <em>machine scoring key</em> or <em>machine rubric</em> is an answer key coded in such a way that a computer can perform the scoring.</p>
<p>The <em><a href="https://bit.ly/1tzdOx5">rubric</a></em> is a scoring guide used to evaluate the quality of student responses. For constructed response items the rubric will indicate which factors should be evaluated in the response and what scores should be assigned to each factor. Selected response items may also have a rubric which, in addition to indicating which response is correct, would also give an explanation about why that response is correct and why each distractor is incorrect.</p>
<h2>Item Specifications</h2>
<p>An <em>item specification</em> describes the skills to be measured and the interaction type to be used. It serves as both a template and a guide for item authors.</p>
<p>The skills should be expressed as references to the Content Specification and associated Competency Standards (<a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">see Part 2 of this series</a>). A consistent identifier scheme for the Content Specification and Standards greatly facilitates this. However, to assist item authors, the specification often quotes relevant parts of the specification and standards verbatim.</p>
<p>If the item requires a stimulus, the specification should describe the nature of the stimulus. For ELA, that would include the type of passage (article, short-story, essay, etc.), the length, and the reading difficulty or <a href="http://www.corestandards.org/ELA-Literacy/standard-10-range-quality-complexity/measuring-text-complexity-three-factors/">text complexity</a> level. In mathematics, the stimulus might include a diagram for Geometry, a graph for data analysis, or a story problem.</p>
<p>The <em>task model</em> describes the structure of the prompt and the interaction type the student will use to compose their response. For a multiple-choice item, the task model would indicate the type of question to be posed, sometimes with sample text. That would be followed by the number of multiple-choice options to be presented, the structure for the correct answer, and guidelines for composing appropriate distractors. Task models for constructed response would include the types of information to be provided and how the student should express their response.</p>
<p>The item specification concludes with guidelines about how the item will be scored including how to compose the rubric and scoring key. The rubric and scoring key focus on what evidence is required to demonstrate the student's skill and how that evidence is detected.</p>
<p>Smarter Balanced item specifications include references to the <a href="https://www.edutopia.org/blog/webbs-depth-knowledge-increase-rigor-gerald-aungst">Depth of Knowledge</a> that should be measured by the item, and guidelines on how to make the items accessible to students with disabilities. Smarter Balanced also publishes specifications for full performance tasks.</p>
<h2>Data Form for Item Specifications</h2>
<p>Like Content Specifications, Item Specifications have traditionally been published in document form. When offered online they are typically in PDF format. As with content specifications, there are great benefits to publishing item specs in a structured data form. Doing so can integrate the item specification into the item authoring system — presenting a template for the item with pre-filled content-specification alignment metadata, a pre-selected interaction type, and guidelines about the stimulus and prompt alongside the places where the author is to fill in the information.</p>
<p>Smarter Balanced has selected the <a href="https://www.imsglobal.org/activity/case">IMS CASE</a> format for publishing item specifications in structured form. This is the <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">same data format we used for the content specifications</a>.</p>
<ul>
<li><a href="https://www.smarterbalanced.org/assessments/development/#acc4">Smarter Balanced Item Specifications in PDF form</a> (see the Item and Task Specifications section)</li>
<li><a href="https://case.smarterbalanced.org/">Smarter Balanced Item Specifications in CASE form</a> (See the item specification sections)</li>
</ul>
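<p>As a rough illustration of that integration, here is a minimal sketch of how a structured item specification could pre-fill an authoring template. The field and function names here are entirely hypothetical; this is not the Smarter Balanced or CASE schema, just the general idea in code:</p>

```python
# Hypothetical sketch: a structured item specification pre-fills an
# authoring template. Field names are invented for illustration.

def template_from_spec(spec: dict) -> dict:
    """Build a starter item from a structured item specification."""
    return {
        # Alignment metadata copied straight from the specification
        "alignment": {
            "subject": spec["subject"],
            "grade": spec["grade"],
            "claim": spec["claim"],
            "target": spec["target"],
        },
        # Interaction type fixed by the task model
        "interaction": spec["task_model"]["interaction_type"],
        # Authors fill these in, guided by the specification's text
        "stimulus": f"TODO: {spec['stimulus_guidelines']}",
        "prompt": f"TODO: {spec['task_model']['prompt_guidelines']}",
        "options": [None] * spec["task_model"].get("option_count", 0),
    }

spec = {
    "subject": "Math",
    "grade": 5,
    "claim": 1,
    "target": "A",
    "stimulus_guidelines": "Present a numerical expression with grouping symbols.",
    "task_model": {
        "interaction_type": "multiple-choice",
        "option_count": 4,
        "prompt_guidelines": "Ask the student to evaluate the expression.",
    },
}

item = template_from_spec(spec)
print(item["interaction"])   # multiple-choice
print(len(item["options"]))  # 4
```

<p>The point is that the author starts from a form that already carries the alignment metadata and guidelines, rather than a blank page.</p>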
<h2>Data Form for Items</h2>
<p>The only standardized format for assessment items in general use is <a href="https://www.imsglobal.org/question/index.html">IMS Question and Test Interoperability (QTI)</a>. It's a large standard with many features. Some organizations have chosen to implement a custom subset of QTI features known as a "profile." The soon-to-be-released QTI 3.0 aims to reduce divergence among profiles.</p>
<p>A few organizations, including Smarter Balanced and <a href="http://www.corespring.org/">CoreSpring</a>, have been collaborating on the <a href="https://pielabs.github.io/pie-docs/">Portable Interactions and Elements (PIE)</a> concept. This is a framework for packaging custom interaction types using <a href="https://en.wikipedia.org/wiki/Web_Components">Web Components</a>. If successful, it will simplify player software and support publishing of custom interaction types.</p>
<h2>Quality Factors</h2>
<p>A good item specification will likely be much longer than the items it describes. As a result, producing an item specification takes considerably more work than writing any single item. But since each item specification will yield dozens or hundreds of items, the effort of writing good item specifications pays huge dividends in the quality of the resulting assessment.</p>
<ul>
<li>Start with good-quality <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">standards and content specifications</a>.</li>
<li>Create task models that are authentic to the skills being measured. The task that the student is asked to perform should be as similar as possible to how they would manifest the measured skill in the real world.</li>
<li>Choose or write high-quality stimuli. For language arts items, the stimulus should demand the skills being measured. For non-language-arts items, the stimulus should be clear and concise so as to reduce sensitivity to student reading skill level.</li>
<li>Choose or create interaction types that are inherently accessible to students with disabilities.</li>
<li>Ensure that the correct answer is clear and unambiguous to a person who possesses the skills being measured.</li>
<li>Train item authors in the process of item writing. Sensitize them to common pitfalls such as using terms that may not be familiar to students of diverse ethnic backgrounds.</li>
<li>Use copy editors to ensure that language use is consistent, parallel in structure, and that expectations are clear.</li>
<li>Develop a review, feedback, and revision process for items before they are accepted.</li>
<li>Write specific quality criteria for reviewing items. Set up a review process in which reviewers apply the quality criteria and evaluate the match to the item specification.</li>
</ul>
<h2>Wrapup</h2>
<p>Most tests and quizzes we take, whether in K-12 or college, are composed one question at a time based on the skills taught in the previous unit or course. Item specifications are rarely developed or consulted under these conditions, and even the learning objectives may be somewhat vague. Furthermore, there is little third-party review of such assessments. Considering the effort students put into preparing for and taking an exam, not to mention the consequences associated with their performance on those exams, institutions should do a better job.</p>
<p>Starting from an item specification is both easier and produces better results than writing an item from scratch. The challenge is producing the item specifications themselves, which is quite demanding. Just as achievement standards are developed at state or multi-state scale, so also could item specifications be jointly developed and shared broadly. As shown in the links above, Smarter Balanced has published its item specifications and many other organizations do the same. Developing and sharing item specifications will result in better quality assessments at all levels from daily quizzes to annual achievement tests.</p>
<p><em>Brandt Redd</em></p>
<h1>Quality Assessment Part 2: Standards and Content Specifications</h1>
<p><em>Posted 2018-08-11, updated 2022-11-11</em></p>
<p><em>This is part 2 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><em>Part 1: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-1-quality.html">Quality Factors</a></em></li>
<li><strong>Part 2: Standards and Content Specifications</strong></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6589215170079649266"><img alt="Mountain with Flag on Summit" height="211" src="https://lh3.googleusercontent.com/-jxSUQNla-SE/W3GYSTk2bfI/AAAAAAAADtw/pbRkonRnKIAZiqEo3rbpqlBr-lZYtPpGgCHMYBhgL/s1600/destination.jpg" style="float: right; margin: 5px;" width="300" /></a>
<p>Some years ago my sister was in middle school and I had just finished my freshman year at college. My sister's English teacher kept assigning <a href="https://en.wikipedia.org/wiki/Word_search">word search puzzles</a> and she hated them. The family had just purchased an <a href="https://en.wikipedia.org/wiki/Apple_II_series">Apple II</a> clone and so I wrote a program to solve word searches for my sister. I'm not sure what skills her teacher was trying to develop with the puzzles; but I developed programming skills and my sister learned to operate a computer. Both skill sets have served us later in life.</p>
<h2>Alignment to Standards</h2>
<p>The first step in building any assessment, from a quiz to a major exam, should be to determine what you are trying to measure. In the case of academic assessments, we measure skills, also known as <em>competencies</em>. <em>State standards</em> are descriptions of specific competencies that a student should have achieved by the end of the year. They are organized by subject and grade. State <em>summative</em> tests indicate student <em>achievement</em> by measuring how close each student is to the state standards — typically at the close of the school year. <em>Interim</em> tests can be used during the school year to measure progress and to offer more detailed focus on specific skill areas.</p>
<p>At the higher education level, colleges and universities set learning objectives for each course. A common practice is to use the term "<em>competencies</em>" as a generic reference to both state standards and college learning objectives, and I'll follow that pattern here.</p>
<p>The <a href="http://www.smarterbalanced.org/">Smarter Balanced Assessment Consortium</a>, where I have been working, measures student performance relative to the <a href="http://www.corestandards.org/">Common Core State Standards</a>. Choosing standards that have been adopted by multiple states enables us to write one assessment that meets the needs of all our member states and territories.</p>
<h2>The Content Specification</h2>
<p>The <em>content specification</em> is a restatement of competencies organized in a way that facilitates assessment. Related skills are clustered together so that performance measures on related tasks may be aggregated. For example, Smarter Balanced collects skill measures associated with "Reading Literary Texts" and "Reading Informational Texts" together into a general evaluation of "Reading". In contrast, a curriculum might cluster "Reading Literary Texts" with "Creative Writing" because synergies occur when you teach those skills together.</p>
<p>The Smarter Balanced content specification follows a hierarchy of Subject, Grade, Claim, and Target. In Mathematics, the four claims are:</p>
<ol>
<li>Concepts and Procedures</li>
<li>Problem Solving</li>
<li>Communicating Reasoning</li>
<li>Modeling and Data Analysis</li>
</ol>
<p>In English Language Arts, the four claims are:</p>
<ol>
<li>Reading</li>
<li>Writing</li>
<li>Speaking & Listening</li>
<li>Research & Inquiry</li>
</ol>
<p>These same four claims are repeated in each grade but the expected skill level increases. That increase in skill is represented by the targets assigned to the claims at each grade level. In English Reading (Claim 1), the complexity of the text presented to the student increases and the information the student is expected to draw from the text is increasingly demanding. Likewise, in Math Claim 1 (Concepts and Procedures) the targets progress from simple arithmetic in lower grades to Geometry and Trigonometry in High School.</p>
<h2>Data Form</h2>
<p>Typical practice is for states to publish their standards as documents; when offered online, they are typically PDF files. Such documents are human readable, but they lack the structure needed for data systems to facilitate access. In many cases they also lack the identifiers required when referencing standards or content specifications.</p>
<p>Most departments within colleges and universities develop a set of learning objectives for each course, and often a state college system will develop statewide objectives. While these objectives are used internally for course design, there is little consistency in publishing them. Some institutions publish all of their objectives while others keep them as internal documents. The Temple University College of Liberal Arts offers an <a href="https://liberalarts.temple.edu/content/learning-objectives">example of publicly published learning objectives</a> in HTML form.</p>
<p>In August 2017, <a href="https://www.imsglobal.org">IMS Global</a> published the <a href="https://www.imsglobal.org/activity/case">Competencies & Academic Standards Exchange (CASE)</a> data standard. It is a vendor-independent format for publishing achievement standards suitable for course learning objectives, state standards, content specifications, and many other competency frameworks.</p>
<p><a href="http://www.publicconsultinggroup.com/">Public Consulting Group</a>, in partnership with several organizations, built <a href="https://github.com/opensalt">OpenSALT</a>, an open source "Standards Alignment Tool," as a reference implementation of CASE.</p>
<p>Here's an example. Smarter Balanced originally published its content specifications in PDF form. The latest versions, from July of 2017, are available on the <a href="https://smarterbalanced.org/our-system/smarter-system/development/">Development and Design</a> page of their website. These documents have complete information but they do not offer any computer-readable structure.</p>
<p><strong>"Boring" PDF form of Smarter Balanced Content Specifications:</strong></p>
<ul>
<li><a href="https://portal.smarterbalanced.org/library/en/english-language-artsliteracy-content-specifications.pdf">(PDF) Smarter Balanced ELA/Literacy Content Specifications</a></li>
<li><a href="https://portal.smarterbalanced.org/library/en/mathematics-content-specifications.pdf">(PDF) Smarter Balanced Mathematics Content Specifications</a></li>
</ul>
<p>In Spring 2018, Smarter Balanced published the same specifications, in CASE format, using the OpenSALT tool. The structure of the format lets you navigate the hierarchy of the specifications. The CASE format also supports cross-references between publications. In this case, Smarter Balanced also published a rendering of the Common Core State Standards in CASE format to facilitate references from the content specifications to the corresponding Common Core standards.</p>
<p><strong>"Cool" CASE form of Smarter Balanced Content Specifications and CCSS:</strong></p>
<ul>
<li><a href="https://case.smarterbalanced.org/cftree/doc/171">(CASE/OpenSALT) Smarter Balanced ELA/Literacy Content Specifications</a></li>
<li><a href="https://case.smarterbalanced.org/cftree/doc/182">(CASE/OpenSALT) Smarter Balanced Mathematics Content Specifications</a></li>
<li><a href="https://case.smarterbalanced.org/cftree/doc/88">(CASE/OpenSALT) Common Core State Standards for ELA/Literacy</a></li>
<li><a href="https://case.smarterbalanced.org/cftree/doc/6">(CASE/OpenSALT) Common Core State Standards for Mathematics</a></li>
</ul>
<p>I hope you agree that the Standards and Content Specifications are significantly easier to navigate in their structured form. Smarter Balanced is presently working on a "Content Specification Explorer" which will offer a friendlier user interface on the structured CASE data.</p>
<h2>Identifiers</h2>
<p>Regardless of how they are published, use of standards is greatly facilitated if an identifier is assigned to each competency. There are two general categories of identifiers. <em>Opaque</em> identifiers carry no meaning; they are just a number, often a "Universally Unique ID" (<a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUID</a>) generated using an algorithm that assures the identifier is not used anywhere else in the world. Any meaning of an opaque identifier is by virtue of the record to which it is assigned. <em>Human-readable</em> identifiers are constructed to have a meaningful structure to a human reader. There are good justifications for each approach.</p>
<p>The Common Core State Standards <a href="http://www.corestandards.org/developers-and-publishers/">assigned both types of identifier to each standard</a>. Smarter Balanced has followed a similar practice in the identifiers for our Content Specification.</p>
<p><strong>Common Core State Standards Example:</strong></p>
<ul>
<li><strong>Opaque Identifier:</strong> DB7A9168437744809096645140085C00</li>
<li><strong>Human Readable Identifier:</strong> CCSS.Math.Content.5.OA.A.1</li>
<li><strong>URL:</strong> <a href="http://corestandards.org/Math/Content/5/OA/A/1/">http://corestandards.org/Math/Content/5/OA/A/1/</a></li>
<li><strong>Statement:</strong> Use parentheses, brackets, or braces in numerical expressions, and evaluate expressions with these symbols.</li>
</ul>
<p><strong>Smarter Balanced Content Specification Target Example:</strong></p>
<ul>
<li><strong>Opaque Identifier URL:</strong> <a href="https://case.smarterbalanced.org/uri/aa55f687-5c0f-5066-8a1d-5e1231741968">https://case.smarterbalanced.org/uri/aa55f687-5c0f-5066-8a1d-5e1231741968</a></li>
<li><strong>Human Readable Identifier:</strong> M.G5.C1OA.TA.5.OA.A.1</li>
<li><strong>Statement:</strong> Use parentheses, brackets, or braces in numerical expressions, and evaluate expressions with these symbols.</li>
</ul>
<p>You'll notice that the Smarter Balanced Content Specification target is a copy of the corresponding Common Core State Standard. The CASE representation includes an "Exact Match Of" cross-reference from the content specification to the corresponding standard to show that's the case.</p>
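<p>A simplified sketch of what that cross-reference could look like as data, assembled here in Python. The field names follow the general shape of CASE documents, and the identifiers are the ones shown in the examples above, but this is an abbreviated illustration, not a complete or authoritative CASE package:</p>

```python
import json

# Simplified sketch of a CASE-style "exactMatchOf" association linking a
# Smarter Balanced target to its Common Core standard. Abbreviated for
# illustration; a real CASE package carries many more required fields.

target = {
    "identifier": "aa55f687-5c0f-5066-8a1d-5e1231741968",
    "humanCodingScheme": "M.G5.C1OA.TA.5.OA.A.1",
    "fullStatement": ("Use parentheses, brackets, or braces in numerical "
                      "expressions, and evaluate expressions with these symbols."),
}

ccss_standard = {
    "identifier": "DB7A9168437744809096645140085C00",
    "humanCodingScheme": "CCSS.Math.Content.5.OA.A.1",
    "fullStatement": target["fullStatement"],  # the target copies the standard
}

# The association declares that the target is an exact match of the standard
association = {
    "associationType": "exactMatchOf",
    "originNodeURI": {"identifier": target["identifier"]},
    "destinationNodeURI": {"identifier": ccss_standard["identifier"]},
}

print(json.dumps(association, indent=2))
```

<p>Because the link is data rather than prose, a tool can follow it automatically, for example to show the underlying Common Core standard alongside any item aligned to the target.</p>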
<p>Smarter Balanced has published a specification for its human-readable <a href="http://www.smarterapp.org/documents/ContentSpecificationIdFormats.pdf">Content Specification Identifiers</a>. Here's the interpretation of "M.G5.C1OA.TA.5.OA.A.1":</p>
<ul>
<li><strong>M</strong> Math</li>
<li><strong>G5</strong> Grade 5</li>
<li><strong>C1</strong> Claim 1</li>
<li><strong>OA</strong> Domain OA (Operations & Algebraic Thinking)</li>
<li><strong>TA</strong> Target A</li>
<li><strong>5.OA.A.1</strong> CCSS Standard <a href="http://www.corestandards.org/Math/Content/5/OA/A/1/">5.OA.A.1</a></li>
</ul>
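<p>For illustration, here is a small parser for identifiers of this shape. It is a sketch that handles only this Math example; the full Smarter Balanced identifier specification linked above covers more variations (ELA identifiers, optional fields) that this does not attempt:</p>

```python
# Illustrative parser for the Math identifier shape broken down above.
# Only a sketch; the full identifier specification covers more cases.

def parse_target_id(identifier: str) -> dict:
    subject, grade, claim_domain, target, *ccss = identifier.split(".")
    return {
        "subject": {"M": "Math", "E": "ELA"}.get(subject, subject),
        "grade": int(grade[1:]),          # "G5"   -> 5
        "claim": int(claim_domain[1]),    # "C1OA" -> claim 1
        "domain": claim_domain[2:],       # "C1OA" -> domain "OA"
        "target": target[1:],             # "TA"   -> target "A"
        "ccss_standard": ".".join(ccss),  # trailing parts -> "5.OA.A.1"
    }

parsed = parse_target_id("M.G5.C1OA.TA.5.OA.A.1")
print(parsed["subject"], parsed["grade"], parsed["domain"])  # Math 5 OA
```

<p>This is one of the practical payoffs of human-readable identifiers: simple tools can recover the subject, grade, claim, target, and referenced standard without a database lookup.</p>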
<h2>Quality Factors</h2>
<p>The design of any educational activity should begin with a set of learning objectives. State Standards offer a template for curricula, lesson plans, assessments, supplemental materials, games and more. At the higher education level, Colleges and Universities set learning objectives for each course that serve a similar purpose. The quality of the achievement standards will have a fundamental impact on the quality of the related learning activities.</p>
<p>Factors to consider when <strong>selecting or building standards or learning objectives</strong> include the following:</p>
<ul>
<li>Are the competencies relevant to the discipline being taught?</li>
<li>Are the competencies parallel in construction, describing skills at a similar grain size?</li>
<li>Are the skills ordered in a natural learning progression?</li>
<li>Are related skills, such as reading and writing, taught together in a coordinated fashion?</li>
<li>Is the amount of material covered by the competencies appropriate for the amount of time that will be allocated for learning?</li>
</ul>
<p>The <a href="http://www.corestandards.org/about-the-standards/development-process/">Development Process</a> and the <a href="http://www.corestandards.org/assets/Criteria.pdf">Standards-Setting Criteria</a> used by the authors of the Common Core State Standards offer some insight into how they sought to develop high quality standards.</p>
<p>Factors to consider when developing an <strong>assessment content specification</strong> include the following:</p>
<ul>
<li>Does the specification reference an existing standard or competency set?</li>
<li>Are the competencies described in such a way that they can be measured?</li>
<li>Is the grain size (the amount of knowledge involved) for each competency optimal for construction of test questions?</li>
<li>Are the competencies organized so that related skills are clustered together?</li>
<li>Does the content specification factor in dependencies between competencies? For example, performing long division is evidence that an individual is also competent at multiplication.</li>
<li>Is the organization of the competencies, typically into a hierarchy, consistent and easy to navigate?</li>
<li>Does the competency set lend itself to reporting skills at multiple levels? For example, Smarter Balanced reports an overall ELA score and then subscores for each claim: Reading, Writing, Speaking & Listening, and Research & Inquiry.</li>
</ul>
<h2>Wrapup</h2>
<p>Compared with curricula, standards and content specifications are relatively short documents. The Common Core State Standards total 160 pages, much less than the textbook for a single grade. But standards have a disproportionate impact on all learning activities within the state, college, or class where they are used. Careful attention to the selection or construction of standards is a high-impact effort.</p>
<p><em>Brandt Redd</em></p>
<h1>Quality Assessment Part 1: Quality Factors</h1>
<p><em>Posted 2018-08-02, updated 2022-11-11</em></p>
<p><em>This is part 1 of a 10-part series on building high-quality assessments.</em></p>
<ul>
<li><strong>Part 1: Quality Factors</strong></li>
<li><em>Part 2: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-2-standards-and.html">Standards and Content Specifications</a></em></li>
<li><em>Part 3: <a href="https://www.ofthat.com/2018/08/quality-assessment-part-3-items-and.html">Items and Item Specifications</a></em></li>
<li><em>Part 4: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-4-item-response.html">Item Response Theory, Field Testing, and Metadata</a></em></li>
<li><em>Part 5: <a href="https://www.ofthat.com/2018/09/quality-assessment-part-5-blueprints.html">Blueprints and Computerized-Adaptive Testing</a></em></li>
<li><em>Part 6: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-6-achievement.html">Achievement Levels and Standard Setting</a></em></li>
<li><em>Part 7: <a href="https://www.ofthat.com/2018/10/quality-assessment-part-7-securing-test_16.html">Securing the Test</a></em></li>
<li><em>Part 8: <a href="https://www.ofthat.com/2018/11/quality-assessment-part-8-test-reports.html">Test Reports</a></em></li>
<li><em>Part 9: <a href="https://www.ofthat.com/2019/01/quality-assessment-part-9-frontiers.html">Frontiers</a></em></li>
<li><em>Part 10: <a href="https://www.ofthat.com/2020/11/quality-assessment-part-10-scoring-tests.html">Scoring Tests</a></em></li>
</ul>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6585540943983666978"><img alt="Flask" height="200" src="https://lh3.googleusercontent.com/-LnfVtVsh1L0/W2SKmMvv5yI/AAAAAAAADs8/J41iIGd1OfcbrFyS3m2SffXO815RpdnlgCHMYBhgL/s1600/flask-640.jpg" style="float: right; margin: 5px;" width="200" /></a>
<p>As I <a href="https://www.linkedin.com/feed/update/urn:li:activity:6412664317061275648">wrap up my service at the Smarter Balanced Assessment Consortium</a> I am reflecting on what we've accomplished over the last 5+ years. We've assembled a <a href="http://www.smarterbalanced.org/assessments/">full suite of assessments</a>; we built an <a href="http://www.smarterapp.org">open source platform for assessment delivery</a>; and multiple organizations have endorsed <a href="http://www.smarterbalanced.org">SmarterBalanced</a> as <a href="http://www.smarterbalanced.org/new-report-teachers-call-smarter-balanced-rigorous-better-aligned/">more rigorous and better aligned to state standards</a> than prior state assessments.</p>
<p>So, what are the characteristics of a high-quality assessment? How do you go about constructing such an assessment? And what distinguishes an assessment like Smarter Balanced from a typical quiz or exam that you might have in class?</p>
<p>That will be the subject of this series of posts. Starting from the achievement standards that guide construction of both curriculum and assessment I will walk through the process Smarter Balanced and other organizations use to create standardized assessments and then indicate the extra effort required to make them both standardized and high quality.</p>
<p>But, to start with, we must define what quality means — at least in the context of an assessment.</p>
<h2>Goal of a Quality Assessment</h2>
<p>Nearly a year ago the Smarter Balanced <a href="http://www.smarterbalanced.org/about/members/">member states</a> released test scores for 2017. In most states the results were flat — with little or no improvement from 2016. It was a bit disappointing but what surprised me at the time was the criticism directed at the test. "The test must be flawed," certain critics said, "because it didn't show improvement."</p>
<p>This seemed like a strange criticism to direct at the measurement instrument. If you stick your hand in an oven and it doesn't feel warm do you wonder why your hand is numb or do you check the oven to see if it is working? Both are possibilities but I expect you would check the oven first.</p>
<p>The more I thought about it, however, the more I realized that the critics have a point. Our purpose in deploying assessments is to improve student learning, not just to passively measure learning. The assessment is a critical part of the educational <a href="https://www.ofthat.com/2013/10/things-engineers-can-teach-us-about.html">feedback loop</a>.</p>
<p>Smarter Balanced commissioned an independent study and confirmed that <a href="http://www.smarterbalanced.org/2017-test-score-analysis/">the testing instrument is working properly</a>. Nevertheless, there are more things that the assessment <em>system</em> can do to support better learning.</p>
<h2>Features of a Quality Assessment</h2>
<p>So, we define a quality assessment as one that consistently contributes to better student learning. What are the features of an assessment that does this?</p>
<ul>
<li><strong>Valid:</strong> The test must measure the skills it is intended to measure. That requires us to start with a taxonomy of skills — typically called <em>achievement standards</em> or <em>state standards</em> and also known as <i>competencies</i>. The quality of the standards also matters, of course, but that's the subject of <a href="https://www.ofthat.com/2013/04/the-common-core-state-standards-for-my.html">a different blog post</a>. A valid test should be relatively insensitive to skills or characteristics it is not intended to measure. For example, it should be free of ethnic or cultural bias.</li>
<li><strong>Reliable:</strong> The test should consistently return the same results for students of the same skill level. Since repeated tests may not be composed of the same questions, the measures must be calibrated to ensure they return consistent results. And the test must accurately measure growth of a student when multiple tests are given over an interval of time.</li>
<li><strong>Timely:</strong> Assessment results must be provided in time to guide future learning activities. Summative assessments, big tests near the end of the school year, are useful but they must be augmented with interim assessments and formative activities that happen at strategic times during the school year.</li>
<li><strong>Informative:</strong> If an assessment is to support improved learning, the information it offers must be useful for guiding the next steps in a student's learning journey.</li>
<li><strong>Rewarding:</strong> <a href="https://www.ofthat.com/2013/12/guest-post-teacher-attitude-affects.html">Test anxiety</a> has been the downfall of many well-intentioned assessment programs. Not only does anxiety interfere with the reliability of results but inappropriate consequences to teachers can encourage poor instructional practice. By its nature, the testing process is demanding of students. Upon completion, their effort should be rewarded with a feeling that they've achieved something important.</li>
</ul>
<h2>Watch This Space</h2>
<p>In the coming weeks, I will describe the processes that go into constructing quality assessments. Because I'm a technology person, I'll include discussions of how data and technology standards support the work.</p>
<p><em>Brandt Redd</em></p>
<h1>A Brief History of Copyright</h1>
<p><em>Posted 2018-06-09, updated 2018-10-17</em></p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-zZT8IRksWwU/WxxWsrTzU6I/AAAAAAAADoM/1WzN2tNKUJMmQBIGWwxVMvYtlwlljq9OQCEwYBhgL/s1600/Copyright.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="200" data-original-width="200" src="https://2.bp.blogspot.com/-zZT8IRksWwU/WxxWsrTzU6I/AAAAAAAADoM/1WzN2tNKUJMmQBIGWwxVMvYtlwlljq9OQCEwYBhgL/s1600/Copyright.png" /></a></div>
In the early 2000s I began writing a book titled <i>Frictionless Media</i>. The subject was business models for digital and online media. My thesis was that digital media is naturally frictionless — naturally easy to copy and transmit. Prior media formats had natural friction, they required specialized equipment and significant expense to copy. Traditional media business models are based on that natural friction. In order to preserve business models, publishers have attempted to introduce artificial friction through mechanisms like Digital Rights Management. They would be better off adapting their business models to leverage that frictionlessness to their advantage. My ideas were inspired by experience at <a href="https://en.wikipedia.org/wiki/Folio_Corporation">Folio Corporation</a> where we had invented a sophisticated <a href="https://en.wikipedia.org/wiki/Digital_rights_management">Digital Rights Management</a> system for textual publications. We found that the fewer restrictions publishers imposed on their publications the more successful they were.<br />
<br />
I didn’t finish the manuscript before the industry caught up with me. Before long, most of my arguments were being made by dozens of pundits. Nevertheless, the second chapter, "<a href="https://www.ofthat.com/p/a-brief-history-of-copyright.html">A Brief History of Copyright</a>," remains as relevant as ever. In 2018 I updated it to include recent developments such as Creative Commons.<br />
<a href="https://www.ofthat.com/p/a-brief-history-of-copyright.html"></a><br />
<div style="text-align: center;">
<a href="https://www.ofthat.com/p/a-brief-history-of-copyright.html"><b><span style="font-size: large;">Click Here to read the chapter.</span></b></a></div>
<p><em>Brandt Redd</em></p>
<h1>Why Assessment?</h1>
<p><em>Posted 2017-07-31, updated 2018-10-17</em></p>
<p><a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6451376276730471042"><img style="float: right;" src="https://lh3.googleusercontent.com/-P73sSqaRepM/WYfkivbm4oI/AAAAAAAADQU/_t9Caqjkh6g8oASvnhM_OM4aD9BVrJBbwCHMYBhgL/s1600/checkmark.jpg" alt="Checkmark" width="200" height="163" /></a>Most departments or ministries of education state the purposes of assessment. I'm particularly fond of <a href="http://assessment.tki.org.nz/Assessment-in-the-classroom/Underlying-principles-of-assessment-for-learning/Purposes-of-assessment">New Zealand's statement</a>:</p>
<blockquote>
<p>The primary purpose of assessment is to improve students’ learning and teachers’ teaching as both respond to the information it provides. Assessment for learning is an ongoing process that arises out of the interaction between teaching and learning.</p>
</blockquote>
<p>I like this because it captures the feedback process and acknowledges that both students and educators should respond to that feedback. It also encompasses the various goals of assessment — to inform individual student learning, to measure the effectiveness of the learning system, and to serve as evidence of student skills.</p>
<p>Today I'm writing about the purposes of assessment and the value of standardized assessment.</p>
<h2>Inform Individual Student Learning</h2>
<p>The most important use of assessment is to improve individual learning. When used properly, assessments improve individual learning in three ways. 1) Exercising and demonstrating skills reinforces student understanding and helps retention. 2) Student attention to the assessment results can increase motivation and direct their choice of learning activities. 3) Educator attention to assessment results can direct their assignment of learning activities or inform interventions.</p>
<p>All of these involve <a href="http://www.ofthat.com/2013/10/things-engineers-can-teach-us-about.html">educational feedback loops</a>. However these impacts are only achieved if the right assessment is used for the right purpose. For example, many high-stakes assessments were developed primarily to comply with regulations, such as <a href="https://en.wikipedia.org/wiki/No_Child_Left_Behind_Act">The No Child Left Behind Act</a> (replaced by <a href="https://en.wikipedia.org/wiki/Every_Student_Succeeds_Act">ESSA</a>). The reports required by these regulations focus on the percentage of students at each institution that achieve standards for grade-level competency. A test focused on that type of report centers on the threshold of competency. It can indicate with great reliability whether a student is above or below the threshold but may not be reliable for other insight. Such a threshold test is a poor choice for informing learning activities, diagnosing areas of weakness, or measuring growth.</p>
<p>More advanced tests include questions at a variety of skill levels centered on the expected competency level. These tests indicate the student's competency on a continuous scale. Accordingly, they can indicate how far ahead or behind the student is, not just whether they are above or below a certain threshold. By comparing scores from successive tests, you can measure student growth over a period of time. Advanced tests also include questions designed to measure greater <a href="https://www.edutopia.org/blog/webbs-depth-knowledge-increase-rigor-gerald-aungst">depths of knowledge</a>. Such tests offer more reliable detail about student skills in specific areas.</p>
<p>One objection to advanced tests is that it takes more questions and more time to measure skills at this level of detail. The use of <a href="https://en.wikipedia.org/wiki/Computerized_adaptive_testing">computer adaptive testing</a> can shorten the test while maintaining reliability.</p>
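<p>To make the idea concrete, here is a toy sketch of the adaptive loop: pick the unused item closest to the current ability estimate, score the response, re-estimate ability, and stop once the estimate is precise enough. It assumes a simple one-parameter (Rasch) item model with a clamped maximum-likelihood update; real assessment vendors use more elaborate models, so treat this as an illustration of the general technique only.</p>

```python
import math

def p_correct(ability, difficulty):
    # Rasch (1PL) model: probability that a student at `ability` answers
    # an item of `difficulty` correctly (both on the same logit scale).
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def adaptive_test(item_bank, answer_fn, max_items=20, se_target=0.6):
    """Adaptive loop: administer the most informative unused item,
    update the ability estimate, and stop when the standard error
    of the estimate drops below se_target (or items run out)."""
    ability, responses = 0.0, []
    unused = sorted(item_bank)
    while unused and len(responses) < max_items:
        # For a 1PL model the most informative item is the one whose
        # difficulty is nearest the current ability estimate.
        item = min(unused, key=lambda d: abs(d - ability))
        unused.remove(item)
        responses.append((item, bool(answer_fn(item))))
        # A few Newton steps toward the maximum-likelihood estimate,
        # clamped so an all-correct/all-wrong start cannot diverge.
        for _ in range(8):
            probs = [p_correct(ability, d) for d, _ in responses]
            gradient = sum(c - p for (_, c), p in zip(responses, probs))
            info = sum(p * (1 - p) for p in probs)
            ability = max(-4.0, min(4.0, ability + gradient / info))
        se = 1.0 / math.sqrt(sum(p * (1 - p) for p in
                                 (p_correct(ability, d) for d, _ in responses)))
        if se <= se_target:
            break
    return ability, se
```

<p>Because each item is chosen near the student's current estimate, every response carries close to the maximum possible information, which is why an adaptive test can reach a given precision with fewer questions than a fixed-form test.</p>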
<h2>Measure the Effectiveness of the Learning System</h2>
<p>When standardized assessments are used to measure the effectiveness of the learning system, individual student results are aggregated to indicate the fraction that are achieving competency levels. Typically results are compared with previous years to see if schools are improving. <a href="http://news.delaware.gov/2017/07/27/state-assessment-results-show-students-proficient-math-english/">This Delaware Press Release</a> is a typical example of the public statements made each year.</p>
<p>Assessments like these are based on the premise that if you <a href="https://english.stackexchange.com/questions/14952/that-which-is-measured-improves">measure performance and report on it, then performance will improve</a>. Unfortunately, education has proven to be a stubborn counterexample to this premise. Sixteen years after the No Child Left Behind Act mandated standardized testing and established remedies for underperforming schools, there has been <a href="http://www.ofthat.com/2012/07/education-by-numbers.html">limited progress</a>. This leads some to call for abandoning standardized assessment altogether. But if we don't measure performance, we will never know whether we are succeeding.</p>
<p>These are our contemporary challenges: discovering the factors that contribute to better learning and investing the resources needed to improve those factors. Continuing to measure performance will support gathering evidence about the principles of effective teaching and learning.</p>
<p>Less frequently applied, but equally important, is using assessments to evaluate parts of the learning system. For example, assessment data can be used to compare different curricula or textbooks, to drive continuous improvement of online learning systems, and to evaluate the effectiveness of professional development programs.</p>
<h2>Provide Evidence of Student Achievement</h2>
<p>Since 1905, the primary measure of student achievement in the U.S. has been the <a href="https://en.wikipedia.org/wiki/Carnegie_Unit_and_Student_Hour">Carnegie Unit</a>. This measure uses the time a student spends in the classroom as a proxy for how much they have learned. A century later, in 2005, <a href="http://educationnext.org/new-hampshires-journey-toward-competency-based-education/">New Hampshire began converting to a competency-based system</a> in which student skill is measured directly rather than by proxy. Other states have <a href="http://www.ofthat.com/2013/11/quantifying-learning-alternatives-to.html">programs that allow competency measures as an alternative to seat time</a>. Such measures depend on high-quality assessments aligned to specific and relevant standards of achievement.</p>
<h2>Standardized and High Quality Assessment</h2>
<p>It has become common, in recent years, to complain about achievement standards and the associated standardized assessments. A typical protest might be, "My child is not standardized." To be sure, our goal should not be to achieve sameness among children and this is not the purpose of achievement standards. Rather, we recognize that people need to achieve a basic competency level in language arts and mathematics in order to function and achieve in our society. The standards are intended to reflect that basic competency with the hope that students and educators will build a wide variety of skill and achievement on that core foundation.</p>
<p>All of these uses — informing learning activities, measuring program effectiveness, and providing evidence of achievement — depend on the quality of the assessment. An assessment will provide poor guidance if it is sensitive to the wrong factors, is unreliable, or is tuned to the wrong skill level. I've written before that <a href="http://www.ofthat.com/2015/12/personalized-learning-more-evidence.html">personalized learning</a> is currently the most promising approach to improving learning. Choosing high quality assessments to inform personalization is essential to the success of such programs.</p>
<p>That should be our demand — that states, districts, and schools give us evidence of the quality of the assessments they use.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-5464583084052275732017-06-30T16:00:00.000-07:002018-10-17T16:01:18.891-07:00Reducing Income Inequality when Productivity Exceeds Consumption<p><a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6437616763818011474"><img style="float: right; margin: 5px;" src="https://lh3.googleusercontent.com/-YV6isZqMTsM/WVcCVrd461I/AAAAAAAADL8/ThmADqtDPns9VzPweAVd2KylSQNBMdjZgCHMYBhgL/s1600/RobotHarvester.jpg" alt="Prototype apple harvester by Abundant Robotics." width="400" height="216" /></a>Among the last bastions of labor demand is Agricultural Harvesting. Every fall, groups of migrant workers follow the maturation of fruit and vegetables across the different climates. But even those jobs are going away. In one case, <a href="https://www.abundantrobotics.com/">Abundant Robotics</a> is developing robots that use image recognition to <a href="https://techcrunch.com/2017/05/03/abundant-robotics-rakes-in-10-million-for-apple-harvesting-robots/">locate ripe apples and delicate manipulators to harvest fruit without bruising it</a>.</p>
<p>In my <a href="http://www.ofthat.com/2017/03/cut-taxes-or-increase-spending-is_91.html">last post</a> I described how <a href="https://en.wikipedia.org/wiki/Creative_destruction">creatively destructive</a> innovations like these have increased productivity in the United States until it exceeds basic needs by nearly four times. If it weren't for distribution problems, we could eliminate poverty altogether.</p>
<p>The problem is that when production exceeds consumption, prices fall and the economy goes into recession. As I <a href="http://www.ofthat.com/2017/03/cut-taxes-or-increase-spending-is_91.html">wrote previously</a>, the U.S. and most of the world economy have relied on a combination of advertising, increased standards of living, and fiscal policy to stimulate demand sufficiently to keep up with productivity. But these stimulus methods have the side effect of increasing wage disparity.</p>
<p>The impact is mostly on the unskilled labor force as these jobs are the easiest to automate or export. Even though the income pie keeps growing, the slice owned by the lowest quintile shrinks. Free trade exacerbates the spread between skilled and unskilled labor as do attempts to stimulate consumption through government spending, low interest rates, and tax cuts. (Please see that <a href="http://www.ofthat.com/2017/03/cut-taxes-or-increase-spending-is_91.html">previous post</a> for details on all of this.)</p>
<h2>Conventional Solutions and Their Limits</h2>
<p>This is a gnarly problem. Potential remedies tend to have side effects that can make the situation worse, or at least limit the positive impact. In this post I'll name some of the conventional solutions and then attempt to frame up the kind of innovation that will be required to properly solve the problem.</p>
<h3>Infrastructure Investment</h3>
<p>Investments in infrastructure are a good short-term stimulus. Constructing or upgrading roads, bridges, energy, and communications infrastructure employs a lot of people of all skill levels. As most infrastructure is government-funded, it's a convenient <a href="https://en.wikipedia.org/wiki/Fiscal_policy">fiscal stimulus</a>. The trouble is that, in the long run, these infrastructure improvements result in greater overall productivity, thereby reducing labor demand.</p>
<h3>Progressive Taxes</h3>
<p>A common method of wealth redistribution is a <a href="https://en.wikipedia.org/wiki/Progressive_tax">progressive tax</a>. Presumably the wealthy can afford to pay a larger fraction of their income than those with lower incomes. Progressive taxes are effective tools but, in the U.S., we have pretty much reached the limit of how much a progressive system can help low income households. Those earning less than the <a href="https://en.wikipedia.org/wiki/Income_tax_threshold">income tax threshold</a> already pay no taxes. For a single parent with one child, the marginal tax threshold for 2016 was <a href="https://www.cbo.gov/sites/default/files/114th-congress-2015-2016/reports/50923-MarginalTaxRates.pdf">approximately $21,000</a> before accounting for federal and state health care benefits which amount to approximately $9,400 more.</p>
<p>Since the lowest tax bracket is already zero we can consider increasing the tax rate in upper tax brackets. While that may increase the amount paid by the wealthy it doesn't directly benefit the poor. Indeed, the resulting economic slowdown may worsen the situation.</p>
<h3>Refundable Tax Credits</h3>
<p>Through tax credits you can reduce the lowest effective income tax rate to negative values — paying money rather than collecting it. In the U.S. the <a href="https://en.wikipedia.org/wiki/Earned_income_tax_credit">Earned Income Tax Credit (EITC)</a> serves this role with progressive benefits for the number of children in the household. Because the EITC is designed as a percentage of income, the benefit increases as the individual earns more money before being phased out at higher income levels. This is intended to incentivize workers to find and improve their employment even while drawing government benefits.</p>
<p>The downside of tax credits is that they are tied in with the very complicated process of filing income tax returns. The Government Accountability Office and IRS indicate that between <a href="https://en.wikipedia.org/wiki/Earned_income_tax_credit#Uncollected_tax_credits">15% and 25% of eligible households don't collect credits</a> to which they are entitled. Nevertheless, this is a promising approach because tax credits are unrelated to productivity and have a direct impact on income inequality.</p>
<h3>Increased Minimum Wage</h3>
<p>Recently there have been increased calls for a $15 minimum wage. Many cities and some states have <a href="http://www.pewresearch.org/fact-tank/2017/01/04/5-facts-about-the-minimum-wage/">already passed laws to increase wages to that level</a>. Critics of the minimum wage point out that increased wages can lead to increased prices — thereby reducing the benefit to low-income workers. And indexing minimum wage to the cost of living, as certain states and municipalities have done, <a href="https://www.epionline.org/studies/r118/">may eliminate entry level jobs</a> and accelerate inflation.</p>
<p>Many are watching to see how experiments in Seattle, New York City, and San Francisco will work out. Two recent studies of Seattle's minimum wage offer early indicators. <a href="http://www.nber.org/papers/w23532.pdf">One, from the University of Washington</a> indicates an increase of 3% in total wages in low-paying jobs (the desired outcome) but a reduction, by 9%, of total hours worked (not so desirable). <a href="http://irle.berkeley.edu/files/2017/Seattles-Minimum-Wage-Experiences-2015-16.pdf">A study by UC Berkeley</a> confirms the increase in total wages and indicates no reduction in overall employment though it does show a decline in employment of workers by limited-service restaurants (fast food). Hours worked was not among the data in the Berkeley study.</p>
<p>These are preliminary results but they tend to confirm the expected pattern. An increased minimum wage incentivizes greater automation and thereby reduces total labor demand. For example, <a href="http://www.inquisitr.com/2135669/mcdonalds-self-serve-kiosks-a-response-to-higher-minimum-wage/">fast food restaurants are deploying self-serve kiosks</a> in place of human cashiers. So, while the employed benefit from higher wages, many jobs may be eliminated.</p>
<h3>Idling Portions of the Workforce</h3>
<p>A recession naturally reduces the number of workers until production more closely matches demand. Unfortunately, those who lose their jobs are predominantly low-income workers, the people least able to tolerate job loss. Better options preserve household income while reducing hours worked. These include mandatory vacation time, a shortened work week, and more holidays. When couples elect to have only one spouse employed, that also reduces the workforce.</p>
<h3>Education</h3>
<p><a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6437616771120579602"><img style="float: right; margin: 5px;" src="https://lh3.googleusercontent.com/-IBtW2eLd4O8/WVcCWGq9RBI/AAAAAAAADMA/JONcNzru4agRlqyUEvfX-KJnHvcOa4-uACHMYBhgL/s1600/DemandForJobs.jpg" alt="Bar graph showing decline in jobs requiring high school or less and increase in jobs requiring a postsecondary degree" width="300" height="398" /></a> Increasing educational achievement is the intervention most dear to my heart. Not only does a better education enable better wages, but it also results in <a href="https://www.cdc.gov/media/releases/2012/p0516_higher_education.html">better health</a>, <a href="http://bit.ly/1tgjeQ3">greater happiness, more active citizenship, and reduced violence</a>. Those with higher educational attainment also make better financial decisions. All of this results in a better quality of life.</p>
<p>As more and more routine jobs are automated, the opportunities for those without a postsecondary degree or certificate will continue to diminish. A recent study at Georgetown University indicates that, <a href="https://cew.georgetown.edu/cew-reports/recovery-job-growth-and-education-requirements-through-2020/">by 2020, 65% of jobs in the U.S. will require postsecondary education</a>. That's up from only 28% in 1973.</p>
<p>Certainly greater educational achievement is a part of the solution as it moves more people into higher-wage jobs. But those jobs pay higher wages precisely because they are more productive. So, as more people achieve higher levels of education, productivity will continue to increase.</p>
<h2>Working With Market Forces</h2>
<p>Nobel Laureate Milton Friedman observed that the communication function of markets is as important as the commerce function. When demand rises then prices rise. Rising prices signal suppliers to produce more goods. On the other hand, excess goods result in falling prices signaling manufacturers to reduce production. No <a href="https://en.wikipedia.org/wiki/Planned_economy">command economy</a> has achieved the signaling efficiency of the market.</p>
<p>The same signaling occurs in labor markets. Higher wages for more-skilled jobs encourage people to seek the education and training needed to qualify for those jobs. But the counterpoint is our contemporary concern: when there is excess labor available, especially for low-skill jobs, wages may fall below the point where workers can earn an adequate living. Some workers will retrain, and many programs will help pay for that. But retraining takes time, interest, and an affinity for the new field.</p>
<p>With worker productivity in the U.S. soon to crest four times basic needs, the natural market signal is for less production – but that would mean unemployment. To date, we have interfered with that signal by artificially propping up demand. Massive advertising, high consumer debt, low interest rates, the housing bubble, planned obsolescence; all are symptoms of the interference. Moreover, the overconsumption caused by this stimulus is damaging to the environment.</p>
<p>As productivity continues to increase we will have to allow the signal to get through – it inevitably will anyway. The challenge is finding ways to match production to demand while sustaining employment and wages, especially for the most vulnerable.</p>
<h3>Framing the Problem</h3>
<p>The innovation we need is this:</p>
<p>A way to distribute abundant resources more equitably;<br />
while preserving incentives to learn, work, and make a difference;<br />
and allowing market signals to balance production and consumption.</p>
<p>We can't look to the past because the challenge of abundance is different from any faced by previous generations. It's going to require a generous sharing of ideas, some experimentation, and development of greater trust between parties.</p>
<p>I'm optimistic that there is a solution because never before in history has society had such plentiful resources as today.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com4tag:blogger.com,1999:blog-7193208137942882340.post-60191166787319406842017-03-31T14:00:00.002-07:002018-10-17T16:01:50.488-07:00Cut Taxes or Increase Spending - Is the Debate Obsolete?<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6404553760538632802"><img style="float: right;" src="https://lh3.googleusercontent.com/-CY1_IRFUIew/WOGLtmHMImI/AAAAAAAADCU/npGr7SdCj3sXgDd4R_T3TDkczNmEUomYgCHM/s1600/USCapital.jpg" alt="Photo of the U.S. Capitol with full moon overhead." width="200" height="266" /></a>
<p>As the Trump administration turns its attention to a tax reform plan, debate swirls about the best way to stimulate the economy. Traditionally, Democrats have advocated increased government spending while Republicans have fought for reduced taxes. Both methods succeed in stimulating, and both have their roots in the theories of prominent economists. But it may be that both strategies, and the theories that support them, are obsolete in a day when production is many times basic needs.</p>
<p>Government spending advocates cite the work of <a href="https://en.wikipedia.org/wiki/John_Maynard_Keynes">John Maynard Keynes</a>. Prior to Keynes, neoclassical economists theorized that free markets should naturally balance the economy toward full employment. Keynes observed that economies tended to swing between boom and bust cycles and advocated government intervention through fiscal policy (government taxing and spending) and monetary policy (central bank regulation of the <a href="https://en.wikipedia.org/wiki/Money_supply">money supply</a>) to moderate the swings. Keynesian theory was influential in addressing the Great Depression and remained dominant following World War II into the 1970s.</p>
<p>Among the expectations of early Keynesian economics was that high inflation and high unemployment should not co-exist. Economist <a href="https://en.wikipedia.org/wiki/Milton_Friedman">Milton Friedman</a> challenged that notion and was proven right when <a href="https://en.wikipedia.org/wiki/Stagflation">"stagflation"</a> emerged in the 1970s. Friedman theorized that stagflation and related poor economic conditions result from excessive or malinformed government intervention. The solution, he said, was to free the market through reduced regulation and lower taxes. This school of thought is generally known as "supply-side" or "monetarist". President Reagan successfully employed that approach early in his presidency launching a <a href="https://en.wikipedia.org/wiki/Reaganomics#Results">sustained period of economic growth</a> that continued through the Bush and Clinton administrations.</p>
<p>Today, Keynesian economics is associated with greater regulation, increased government spending, and with an overall trust in government interventions. Meanwhile, monetarist economics is associated with free markets, reduced taxes, and with an overall trust in the market's ability to self-balance. In fact, both schools of thought are much more nuanced than these broad strokes. On the Keynesian side, it matters a lot where the government spends its money. On the monetarist side, it matters a great deal which taxes are reduced and how regulations are tuned. Earnest theorists on both sides have a healthy respect for the other theory.</p>
<p>But the nuance is quickly lost in the morass of political debate. Indeed, I fear that most political Keynesians choose that theory because it justifies their existing desire to increase government spending. And most monetarists choose supply-side theory for its justification of reduced taxes and regulation. In each case I think they first choose their preferred intervention and then select a theory to justify it.</p>
<p>Through the latter half of the 20th century, U.S. government economic focus was pretty much what Keynes described - moderating the boom and bust cycle toward more stable continuous growth. During slow cycles this meant adding economic stimulus through increased spending and reduced taxes. When inflation started to get out of hand, government would slow things by increasing taxes, reducing spending, and raising interest rates. Reagan met the stagflation challenge (high inflation and high unemployment) with an unusual combination of reduced taxes (to stimulate hiring) and increased interest rates (to slow inflation). Nevertheless, Reaganomics still used the same tools, just in different ways.</p>
<p>Our contemporary challenge is a new one. Since roughly 2001 the economy has required continuous stimulation to maintain growth. Radical new stimuli such as <a href="https://en.wikipedia.org/wiki/Quantitative_easing">Quantitative Easing</a> and <a href="https://en.wikipedia.org/wiki/Zero_interest-rate_policy">zero interest rates</a> have been used. Experts previously avoided those stimuli because of their potential to provoke high inflation, yet inflation remains at historically low levels and it seems that, without continuous stimulation, the economy will slow to a crawl.</p>
<h2>Production compared to Basic Needs</h2>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6404555994872044578"><img style="float: right; border: 1px solid black;" src="https://lh3.googleusercontent.com/-XsKndgb_FSE/WOGNvppyXCI/AAAAAAAADCk/0PXeUOk08e4cQU3pCzl1pqbhGliHSvvlACHM/s1600/OutputPerHour.jpg" alt="Output per hour of all persons 1947 to 2010" width="350" height="333" /></a>
<p>The new economic challenge is due principally to the rapid increase in <a href="https://en.wikipedia.org/wiki/Workforce_productivity">workforce productivity</a>. According to the <a href="https://www.bls.gov">U.S. Bureau of Labor Statistics</a> individual worker productivity has more than quadrupled since World War II. Overall productivity per person in 2012 was 412% that of 1947.</p>
<p>Productivity growth becomes even more striking when compared with basic needs. In 2014 U.S. per capita GDP was <a href="http://data.worldbank.org/indicator/NY.GDP.PCAP.CD?locations=US">$54,539</a>. Basic needs per capita that same year were <a href="#fn1">approximately $13,908</a>. So per-capita production is nearly four times basic needs. And while the basic needs side of this equation includes the whole population, the productivity side accounts only for those employed; it doesn't include unemployed workers or people choosing not to seek paid employment. So <em>productive capacity</em> compared with basic needs would be even higher.</p>
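<p>The "nearly four times" figure follows directly from the two numbers quoted above; for the skeptical, the arithmetic is easy to check:</p>

```python
# Figures as quoted in the text: 2014 U.S. per-capita GDP (World Bank)
# and the estimated per-capita basic needs for the same year.
per_capita_gdp = 54_539
basic_needs = 13_908

ratio = per_capita_gdp / basic_needs
print(f"Per-capita production is {ratio:.2f}x basic needs")  # about 3.92x
```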
<p>If it weren't for <a href="https://en.wikipedia.org/wiki/Distribution_of_wealth">problems of distribution</a> this would be a great thing! For the first time in history, society has sufficient capacity to provide comfortable housing, plenty of food, health care, entertainment, and leisure time for all. The challenge is that, in a market economy, productivity increases disproportionally benefit those who are already at the higher end of the wage scale.</p>
<h2>Disproportionate Benefits</h2>
<p><a href="https://en.wikipedia.org/wiki/Creative_destruction">Creative destruction</a> is the term economists use to describe the transformation of an industry by innovation. It is usually associated with the elimination of jobs due to new technology, but any innovation that increases individual productivity qualifies. Some examples: The backhoe replaces the jobs of several ditch diggers with that of a more-skilled heavy equipment operator. Computerized catalogs reduce the demand for librarians. Industrial robots replace factory workers. The common feature of such innovations is that they substantially increase the productivity of individual workers. Frequently, these innovations also move jobs upscale — requiring more skill or training and paying correspondingly higher wages.</p>
<p>Creatively destructive innovations have led to enormous productivity increases in recent decades thereby reducing the demand for labor. As with any market, when supply increases or demand declines the value also declines. In this case the value of routine jobs has declined dramatically. Here's how economist <a href="http://freakonomics.com/podcast/safe-job-rebroadcast/">Dr. David Autor describes it</a>.</p>
<blockquote>
<p>And so the things that are most susceptible to computerization or to automation with computers are things where we have explicit procedures for accomplishing them. Right? They’re what my colleagues and I often call “routine tasks.” I don't mean routine in the sense of mundane. I mean routine in the sense of being codifiable. And so the things that were first automated with computers were military applications like encryption. And then banking and census-taking and insurance, and then things like word processing and office clerical operations. But what you didn’t see computers doing a lot of — and still don't, in fact — are tasks that demand flexibility and don't follow well-understood procedures. I don’t know how to tell someone how do you write a persuasive essay, or come up with a great new hypothesis, or develop an exciting product that no one has seen before. ... What we’ve been very good at doing with computers is substituting them for routine, codifiable tasks. The tasks done by workers on production lines, the tasks done by clerical workers, the tasks done by librarians, the tasks done by kind of para-professionals, like legal assistants who go into the stacks for you. And so we see a big decline in clerical workers. We see a decline in production workers. We see a decline even in lower-level management positions because they’re all kind of information processing tasks that have been codified.</p>
</blockquote>
<p>Recent creative destruction has predominantly affected lower-middle-class jobs and manufacturing jobs. While increased productivity has made our nation more wealthy as a whole, large sectors of the labor force have been left behind. This may be the biggest factor behind the <a href="https://www.cbo.gov/publication/43707">slow recovery from the 2008 recession</a>. Automation substituted for jobs that were eliminated during the recession; <a href="https://fivethirtyeight.com/features/manufacturing-jobs-are-never-coming-back/">those jobs are not coming back</a>.</p>
<p>The decline in U.S. manufacturing employment has been balanced, in part, by growth in the service sector. This makes sense; growth in productivity has resulted in greater overall wealth. On average, people in the U.S. have more money to spend on eating out, recreation, vacations, and health care. But again, the benefits are not evenly distributed. As workers displaced from manufacturing have moved into the service sector, <a href="http://www.economicmodeling.com/2011/03/08/the-biggest-threat-to-service-sector-wages-the-service-sector/">wages in that area have stagnated</a>.</p>
<h2>Disproportionate Impact of Globalization</h2>
<p>Economists have consistently advocated for <a href="https://en.wikipedia.org/wiki/Free_trade">free trade</a>. The math is incontrovertible; when regions or countries with different costs of production trade goods and services, all communities benefit as each is able to specialize and all benefit from the overall productivity increase.</p>
<p>Only recently have economists begun to study how free trade impacts sectors of the economy rather than the economy as a whole. Unsurprisingly, the impact to the U.S. has disproportionally affected manufacturing and routine labor. Here's another <a href="http://freakonomics.com/podcast/china-eat-americas-jobs/">quote from Dr. Autor</a>:</p>
<blockquote>
<p>When we import those labor-intensive goods, we’re going to reduce demand for blue-collar workers, who are not doing skill-intensive production. Now we benefit because we get lower prices on the goods we consume and we sell the things that we're good at making at a higher price to the world. So that raises GDP but simultaneously it tends to make high-skilled and highly educated labor better off, raise their wages, and it tends to make low-skilled manually intensive laborers worse off because there is less demand for their services — so there's going to be fewer of them employed or they're going to be employed at lower wages. So the net effect you can show analytically is going to be positive. But the redistributional consequences are, many of us would view that as adverse because we would rather redistribute from rich to poor than poor to rich. And trade is kind of working in the redistributing from poor to rich direction in the United States. The scale of benefits and harms are rather incommensurate. ...</p>
<p>We would conservatively estimate that more than a million manufacturing jobs in the U.S. were directly eliminated between 2000 and 2007 as a result of China's accelerating trade penetration in the United States. Now that doesn't mean a million jobs total. Maybe some of those workers moved into other sectors. But we've looked at that and as best we can find in that period, you do not see that kind of reallocation. So we estimate that as much as 40 percent of the drop in U.S. manufacturing between 2000 and 2007 is attributable to the trade shock that occurred in that period, which is really following China's ascension to the <a href="https://en.wikipedia.org/wiki/World_Trade_Organization">WTO</a> in 2001.</p>
</blockquote>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6404556000579532514"><img style="float: right; border: 1px solid black;" src="https://lh3.googleusercontent.com/-Sq-yJwa91gg/WOGNv-6jtuI/AAAAAAAADCk/44_fBc6D3YA4LRDeNkJv7xREuqETYKFDwCHM/s1600/ManfacturingOutputVsEmployment.jpg" alt="Manufacturing Output Versus Employment" width="350" height="246" /></a>
<p>During the campaign, Donald Trump and Bernie Sanders both advocated rethinking free trade. Perhaps we can use tariffs or government incentives to return manufacturing back to the U.S. As it turns out, that's already happening even without incentives. As labor costs increase in Asia the offshoring advantage is diluted. Many manufacturers are, indeed, opening new U.S. plants. The trouble is that returning manufacturing <a href="http://www.manufacturing.net/blog/2014/04/we-are-post-industrial-economy">doesn't result in substantial job or wage growth</a>. These are highly automated plants, employing a fraction of the workers whose jobs were eliminated when manufacturing went overseas. For example, Standard Textile just opened a new plant in Union, SC to make towels for Marriott International. Due to automation, the plant <a href="https://www.washingtonpost.com/business/capitalbusiness/factory-jobs-trickle-back-to-the-us-giving-hope-to-a-once-booming-mill-town/2016/03/16/5647cd7a-e4a3-11e5-b0fd-073d5930a7b7_story.html">only created 150 new jobs</a>. A generation ago the same plant would have employed more than 1000 people. And many of the new jobs are more highly skilled — designing, operating, and maintaining automated machinery.</p>
<p>Creative destruction and globalization are working together here. Both increase overall GDP, both increase individual worker productivity, both increase total wealth, and both disproportionately benefit skilled upper-middle-class workers over blue collar and middle-management workers. Any benefit from manufacturing returning to the U.S. will be blunted by the increase in automation reducing labor needs and shifting what remains to more skilled jobs.</p>
<h2>Demand-Side Economics</h2>
<p>So far, we have looked at the supply side of labor. The massive increase in productivity over the last six decades has been driven by innovative technology with global trade as an accelerant. As noted before, when the supply of labor exceeds demand, its value decreases. When supply exceeds demand across the economy as a whole, you get a <a href="https://en.wikipedia.org/wiki/Recession">recession</a>.</p>
<p>From the end of World War II through the rest of the 20th century we succeeded in driving demand to keep up with supply. Advertising grew tremendously as an important demand driver. Television programs established new norms: two cars per family, a large home in the suburbs, annual luxury vacations, and designer clothing labels to name a few. Home appliances like air conditioners and dishwashers went from luxury to necessity.</p>
<p>Government has participated in driving demand. Housing programs made home ownership much more accessible. So much so that it resulted in the <a href="https://en.wikipedia.org/wiki/Real_estate_bubble#2007:_many_countries">2007 real estate bubble</a>. Likewise, the Federal Reserve has kept interest rates down ensuring that consumer credit remains accessible and people can buy ahead of income.</p>
<p>In the 21st century we seem to have reached the limits of demand stimuli to compensate for ever-increasing productivity. Smaller cars like the Mini Cooper or Fiat 500 have become stylish. Even the wealthy are choosing to reduce consumption — buying smaller homes or moving into the city. The result is that it takes increasingly strong stimuli to keep the economy moving. To counter the recession of 2008, the government spent unprecedented amounts of money, <a href="https://en.wikipedia.org/wiki/Quantitative_easing">borrowing directly from the Federal Reserve to do so</a>. Despite this pressure, interest and inflation rates remain at historically low levels.</p>
<h2>Increase Spending or Cut Taxes?</h2>
<p>And so we return to the contemporary debate: Should government increase spending or cut taxes to stimulate the economy? When government cuts taxes, individuals and companies have more disposable income. Presumably they will spend some of that income and save part. When government increases spending, it chooses directly where that money will be spent. Both theories depend on "trickle-down" effects even though that term has traditionally been associated with tax cuts. In each case, the direct beneficiary of government policy employs more people and purchases more goods and services; those employees and suppliers also do more business and the impact "trickles" through the economy. The primary differentiator is whether you have greater trust in government (increase spending) or the market (cut taxes) to determine who is at the top of the trickle-down pyramid.</p>
<p>The question is really obsolete. Regardless of which stimulus you choose, demand stimuli are increasingly unable to keep up with increased productive capacity. As a country, we already produce nearly four times our basic needs and the multiplier will continue to grow. Meanwhile, the twin pressures of Creative Destruction and Globalization will continue to drive the greater benefit of demand stimulus to those who already earn higher wages. Under either strategy, wage disparity will continue to worsen despite attempts by policymakers to direct tax breaks or government spending toward lower income households.</p>
<p>It seems that we will need a greater economic innovation than either of these 20th century solutions. In my next blog post I will write about some promising ideas. More effective education for all students is, of course, an essential component but insufficient by itself.</p>
<hr />
<p><a name="fn1" style="font-weight: bold;"></a><strong>Estimating Basic Needs Per Capita</strong>: The <a href="http://www.selfsufficiencystandard.org">Self-Sufficiency Standard</a> is a measure of the income necessary to meet <a href="https://en.wikipedia.org/wiki/Basic_needs">basic needs</a> without assistance. Values are expressed per household. National averages aren't published, so we make an approximation starting with two sample cities. The <a href="http://www.infoplease.com/business/economy/cost-living-index-us-cities.html">cost of living index</a> for Milwaukee, WI is 101.9% of the national average. Rochester, NY is exactly 100.0%. The 2014 average household size in the U.S. <a href="https://www.statista.com/statistics/183648/average-size-of-households-in-the-us/">was 2.54</a>. We round up to 3 - two adults and one child. For Milwaukee the 2016 Self-Sufficiency Standard for that household is <a href="http://selfsufficiencystandard.org/wisconsin">$43,112 annually</a>. For Rochester, the 2010 Self-Sufficiency Standard for the same family <a href="http://selfsufficiencystandard.org/new-york">is $40,334</a>. Per-capita values are $14,371 and $13,445 respectively. Averaging the two gives <strong>$13,908</strong> as the approximate U.S. basic needs per capita in 2014. To be sure, there's a lot of variability across region, household size, medical needs, and so forth. I also mixed figures across 2010-2016. Nevertheless, this is a good enough working figure for comparing to per capita production in the same timeframe.</p>
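<p>For those who want to check the arithmetic in this footnote, here is a minimal Python sketch. The dollar figures are the ones cited above; the city labels and variable names are just for illustration.</p>

```python
# Reproduce the footnote's basic-needs-per-capita estimate.
household_size = 3  # 2.54 average, rounded up to two adults and one child

# Annual Self-Sufficiency Standard for that household, from the cited sources
standards = {"Milwaukee, WI (2016)": 43112, "Rochester, NY (2010)": 40334}

# Per-capita value for each city, rounded to the nearest dollar
per_capita = {city: round(cost / household_size) for city, cost in standards.items()}
print(per_capita)  # {'Milwaukee, WI (2016)': 14371, 'Rochester, NY (2010)': 13445}

# Average the two cities for a rough national figure
estimate = round(sum(per_capita.values()) / len(per_capita))
print(estimate)  # 13908
```

<p>As the footnote notes, this is only a working approximation; any refinement would weight by region, household size, and year.</p>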
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-18741086466840802572016-12-30T16:00:00.000-08:002018-10-17T16:02:13.684-07:00The Challenge of Information Democracy<p><a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6370529636651323490"><img style="float: right;" src="https://lh3.googleusercontent.com/-8ySSj9dubXg/WGiq87ezUGI/AAAAAAAAC44/m2jDuugyxA8LgbaWGWVVg9RUdcgoF3pmACHM/s1600/Folio.jpg" alt="Folio Corporation Logo" width="250" height="122" /></a>In the 1990s I was a co-founder at Folio Corporation, an electronic publishing software company. As the internet grew, we produced tools that let ordinary individuals publish their content and search vast pools for the information they needed. Such tools are common today, but they were cutting edge at the time.</p>
<p>"Information Democracy" was the term we used to describe the concept. In previous generations, a select few were able to publish their words to a sizable audience. Likewise, only business leaders and rulers of countries could afford the research staff necessary to stay well-informed. We <a href="https://www.youtube.com/watch?v=HLTIv0hPZ2c">produced a video featuring James Earl Jones</a> and held conferences anticipating how greater access to media would spread liberty, increase productivity, and support a more moral society.</p>
<p>We weren't alone in our optimism. In <a href="https://www.amazon.com/dp/0393311589"><em>Life After Television</em></a> George Gilder wrote, "Television is not vulgar because people are vulgar; it is vulgar because people are similar in their prurient interests and sharply differentiated in their civilized concerns." Ever the optimist, Gilder anticipated that greater diversity of media channels would result in a gradual elevation of quality and subject matter.</p>
<p>We have, indeed, achieved a world where any organization can publish to the whole world, where individual citizens can create TV channels on YouTube, and where average researchers have better resources than national leaders had a generation ago. Unfortunately, unfettered access to media hasn't resulted in the utopia many of us expected. Today's challenge is distinguishing reliable information from deliberate deceit and the whole spectrum between.</p>
<h2>Unreliable Information</h2>
<p>The recent presidential election brought the issue of fake news to the media's attention. It will probably take years to sort out the origins and impact. One source seems to be <a href="http://www.nbcnews.com/news/world/fake-news-how-partying-macedonian-teen-earns-thousands-publishing-lies-n692451">entrepreneurial Macedonian teenagers</a> making money with fake news sites and Google AdSense. The Washington Post claims it was <a href="https://www.washingtonpost.com/business/economy/russian-propaganda-effort-helped-spread-fake-news-during-election-experts-say/2016/11/24/793903b6-8a40-4ca9-b712-716af66098fe_story.html">a coordinated Russian effort</a> to destabilize American democracy.</p>
<p>The <a href="http://money.cnn.com/2016/12/05/media/fake-news-real-violence-pizzagate/index.html">Pizzagate episode</a> offers a warning sign of how fake information can provoke violent response. In terms of death toll, Andrew Wakefield's <a href="https://en.wikipedia.org/wiki/MMR_vaccine_controversy">fraudulent MMR Vaccine paper</a> was worse. Despite millions of dollars invested in follow-on studies and publicity campaigns, the anti-vaccine movement has contributed to <a href="https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5518a4.htm">thousands of illnesses and numerous childhood deaths.</a></p>
<p>In recent decades, most newspapers and magazines have <a href="https://www.poynter.org/2012/the-story-of-when-newsweek-ditched-its-fact-checkers-then-made-a-major-error/185899/">reduced or eliminated their fact-checking departments</a>. Fact-checking of this sort is a cost center and, with declining revenues due to internet media, publishers have sought to reduce costs. The decline in <em>ante hoc</em> (before publication) fact checking has been matched by a growth in <em>post hoc</em> efforts like <a href="http://www.factcheck.org/">FactCheck</a> and <a href="http://www.politifact.com">PolitiFact</a> as well as fact-checking pages at major news publications. <em>Post hoc</em> fact checking builds a revenue center out of the effort by sensationalizing politicians' and other publications' mistakes. The unfortunate result is that <em>post hoc</em> fact checking is selective, biased, and often missed or ignored by those who prefer to believe an inaccurate story.</p>
<h2>Post-Truth</h2>
<p>The Oxford English Dictionary (OED) <a href="https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2016">named "Post-Truth" as its word of the year for 2016.</a> Their definition is "relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief." The OED editors acknowledge that "post-truth" conditions aren't new; only that use of the term increased rapidly in 2016 in the context of the <a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2016">U.S. Presidential Election</a> and the <a href="https://en.wikipedia.org/wiki/Brexit">U.K. Brexit vote</a>.</p>
<p>I prefer the term "Confirmation Bias", defined as <a href="https://en.oxforddictionaries.com/definition/confirmation_bias">"The tendency to interpret new evidence as confirmation of one's existing beliefs or theories."</a> In extreme cases, <a href="https://en.wikipedia.org/wiki/Conspiracy_theory">conspiracy theorists</a> tend to reject any evidence contrary to their opinion as part of the conspiracy while considering all evidence in support of their opinion as true and factual.</p>
<h2>Information Democracy</h2>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6370529639783758146"><img style="float: right;" src="https://lh3.googleusercontent.com/-Gdo-mw429-8/WGiq9HJoeUI/AAAAAAAAC48/NrITn7o6osIxiU1cQ76RIJzZ3ymx2VZDwCHM/s1600/InfoDemocracy.jpg" alt="2 by 2 matrix. Horizontal dimension is Access to Publish Media with left being Exclusive and right being Open. Vertical dimension is Trust in Media with top being High and bottom being Low. Upper-left quadrant labeled Information Hegemony. Upper-right, Information Democracy. Lower-Left, Propaganda. Lower-Right, Information Anarchy." width="400" height="346" /></a>
<p>It's time to resort to that old standby - the 2x2 matrix. In the latter half of the 20th century, prior to the advent of the internet and world-wide web, U.S. society was in a state of high media trust but all publishing flowed through a relatively small set of media outlets. Opinion polls identified Walter Cronkite as <a href="https://en.wikipedia.org/wiki/Walter_Cronkite">"the most trusted man in America."</a> This state of high trust and exclusive access is the upper-left quadrant, "Information Hegemony."</p>
<p>We optimists of the 1990s expected free access to media to provoke a shift to Information Democracy. Likewise, we anticipated that totalitarian states would be forced from state-controlled media to an information democracy model.</p>
<p>We didn't understand the importance of trust systems in that shift. As access was generalized, and economic forces pushed media to a more sensationalist orientation, trust declined and we have ended up with Information Anarchy. It's hard to say whether this is superior to the more trustworthy but also restricted hegemony that preceded our day. But this state isn't unprecedented. In the 19th and early 20th centuries, there was a greater variety of newspapers and magazines, each with strong biases and <a href="https://en.wikipedia.org/wiki/Yellow_journalism">little distinction between fiction and fact.</a> Journalistic objectivity as a value didn't become prominent <a href="https://en.wikipedia.org/wiki/Journalistic_objectivity">until the mid-20th Century.</a></p>
<h2>Tools for Discerning Truth</h2>
<p>At present, the main tool most citizens use to judge media is whether a story matches their existing world view and opinions. Confirmation bias turns out to be a pretty good tool so long as one's world view is somewhat close to truth. And, of course, everyone thinks that their biases are the "true" ones. The problem is that when one relies exclusively on confirmation bias, they don't have a tool for correcting their biases - for getting closer to what's really true.</p>
<p>Trust is our second-best tool for judging content. It's also an excellent tool for correcting one's own biases. That's why the decline of trust, and the concurrent decline of trustworthiness, is such a problem today. The sensationalism of most media outlets concentrates on confirmation bias as a way to gain audience. Careful readers have to seek trustworthy journalists rather than organizations - at least until the trend turns.</p>
<p><a href="https://en.wikipedia.org/wiki/Critical_thinking">Critical thinking</a> isn't so much a tool as a discipline. It's something that our schools can and should teach and it's incorporated into good quality language curricula. As students are taught critical thinking they are taught to recognize and use good-quality arguments, to measure the credibility of facts based on origins and citations, and to compare and contrast writings from multiple authors.</p>
<h2>Taking Personal Responsibility</h2>
<p>I'm afraid that profit motive will prevent the media industry from solving this problem for us. Rather, we need to take individual responsibility for recognizing and tuning our own biases. We must bring the language of critical thinking into our vocabulary: asking about sources, seeking contrasting points of view, looking for supporting evidence, checking the logic of arguments, and discounting emotional appeals.</p>
<p>Clay Johnson wrote in <a href="http://www.informationdiet.com/"><em>The Information Diet</em></a>, "The pattern here is simple: seek to get information directly from the sources, and when the information requires you to act, interact directly with those sources. An over-reliance on third party sources for information and action reduces your ability to know the truth about what's happening, and dilutes your ability to cause change." (Page 140)</p>
<p>The Information Age has given us unprecedented access to the original sources. We can take advantage of that. Institutions will follow the people, not the other way around.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-70215037867238220062016-11-10T16:00:00.000-08:002018-10-17T16:02:46.816-07:00What I Would Tell Donald Trump about Education<p>I never thought Donald Trump would survive the first primary much less gain the nomination. By the time we reached the general election I gave up making predictions because, where Trump is concerned, I was always wrong. I don't expect this post to ever make it to the Trump transition team. But I could be wrong about that as well. Regardless, I hope it will help some of you in the community.</p>
<a href="https://picasaweb.google.com/108622145326404270334/OfThat?authkey=Gv1sRgCOvZs8myqq_2UQ#6351544500055053122"><img style="float: right;" src="https://lh3.googleusercontent.com/-aZ5Fe_Zj2fE/WCU4EiTeE0I/AAAAAAAACzg/Ey68efjOfZUghnQGAP3gtiyBJXCE9yCegCHM/s1600/election2016.jpg" width="400" height="299" /></a>
<p>The <a href="https://www.donaldjtrump.com/policies/education/">Trump Policy Page on Education</a> is pretty spare. During the campaigns, Trump spoke very little about education policy. In the primaries he made a few anti-Common Core remarks that seemed requisite of all Republican candidates. But those quotes date back to February. Mike Pence has been a strong advocate for school choice and that's reflected in the policy page. Their goal is to "provide school choice to every one of the 11 million school aged children living in poverty."</p>
<p>On the prospect that Trump's education strategy is still nascent, here's what I would tell him if I were asked:</p>
<h1>Leave Standards to the States</h1>
<p>The No Child Left Behind Act required states to set educational achievement standards and measure the degree to which students meet those standards. Its successor, the <a href="https://en.wikipedia.org/wiki/Every_Student_Succeeds_Act">Every Student Succeeds Act (ESSA)</a>, was passed in December 2015 with broad bipartisan support. ESSA maintains the emphasis on <a href="http://www.ofthat.com/2013/03/theories-of-education-reform.html">standards and accountability</a> while returning responsibility to states to decide how to address underperforming schools.</p>
<p>Contrary to popular belief, the <a href="http://www.corestandards.org/">Common Core State Standards (CCSS)</a> are not a federal mandate. They were created in a state-led cooperative effort with support from private foundations. The Obama Administration's <a href="https://en.wikipedia.org/wiki/Race_to_the_Top">Race to the Top grants</a> encouraged adoption of common standards among states without specifying any particular set. Those grants have mostly expired and there is no continuing <em>federal</em> support for the CCSS.</p>
<p>So, for Trump to eliminate the Common Core or to substitute other standards in their place would constitute <em>more</em> federal meddling in education, not less. Leave the development of standards to the states. Some will choose to collaborate on the CCSS, others will go their own way. We're in the third year of Common Core deployment. Within one or two more years we'll know whether it's been effective.</p>
<h1>Ensure Title I Funds Really Benefit Economically Disadvantaged Students</h1>
<p>This is a gnarly problem loaded with unintended consequences. <a href="http://www2.ed.gov/programs/titleiparta/index.html">Title I of ESSA</a> (which is the latest reauthorization of the <a href="https://en.wikipedia.org/wiki/Elementary_and_Secondary_Education_Act">Elementary and Secondary Education Act</a>) provides extra funding to schools and districts with a high proportion of children from low-income families. The goal is to close the <a href="https://en.wikipedia.org/wiki/Achievement_gap_in_the_United_States">achievement gap</a> by offering more resources to schools that serve children with greater needs.</p>
<p>Unfortunately, as <a href="http://www.crpe.org/experts/marguerite-roza">Marguerite Roza</a> observed in <a href="https://www.goodreads.com/book/show/8530953-educational-economics"><em>Educational Economics</em></a>, the greater the distance between funding decisions and the students, the less effective they are at achieving the intended result. All too often, Title I funds are balanced by other funds being directed toward more mainstream schools and the most challenged schools remain with the fewest funds.</p>
<p>The Trump Campaign's proposal is to have specific money allocated to each economically disadvantaged child and for that money to move with the child to whatever school they choose. It's a promising strategy because it ties the funding decisions directly to the child but the concept won't work if there aren't good quality schools available for parents and their children to choose from.</p>
<h1>Base Strategic Initiatives on Reliable Evidence</h1>
<p>The theory behind the No Child Left Behind Act was to measure success and incentivize improvement. It's an approach that has worked in other domains but education has proven to be more challenging. That's because we still don't have a good model for effectively educating all students, at scale while preserving initiative, creativity, the arts, and joy.</p>
<p>We're making progress. And there's a growing body of evidence supporting some key strategies. They include:</p>
<ul>
<li><a href="http://www.ofthat.com/2015/12/personalized-learning-more-evidence.html">Personalized Learning</a></li>
<li><a href="http://www.ofthat.com/2013/11/quantifying-learning-alternatives-to.html">Competency Education</a></li>
<li><a href="http://www.ofthat.com/2013/10/things-engineers-can-teach-us-about.html">Feedback</a></li>
<li><a href="http://www.ofthat.com/2016/02/growth-mindset-is-buzzword-of-2016-and.html">Growth Mindset</a></li>
<li>And Technological Supports for All of the Above.</li>
</ul>
<h1>Choose a Secretary of Education Who Understands the Landscape</h1>
<p>Education doesn't need another shakeup right now. There are a lot of experiments underway that will yield great insights into what works. Some of these are at statewide scale like the competency-based <a href="http://education.nh.gov/innovations/hs_redesign/index.htm">New Hampshire High School Transformation</a> or the <a href="http://governor.ri.gov/initiatives/state-of-education/">Rhode Island Education Action Plan</a>. Others are at district or school scale. We are rapidly learning what works and US Ed can shine a light on successful programs.</p>
<p>The Secretary of Education should have an optimistic outlook for US Education. They should have spoken at <a href="http://www.inacol.org">iNACOL</a>, <a href="http://www.educause.edu/">Educause</a>, and <a href="http://www.sxswedu.com/">SXSWEdu</a>. They should know the education leaders at the <a href="http://www.gatesfoundation.org/">Gates</a>, <a href="http://www.hewlett.org/">Hewlett</a>, and <a href="https://www.msdf.org/">Dell</a> foundations. Most of all, they should have a humble attitude about the challenges ahead and the limited but important role of the federal government in US education.</p>
Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com2tag:blogger.com,1999:blog-7193208137942882340.post-21343043803018601932016-02-26T08:03:00.003-08:002016-02-26T08:03:40.112-08:00"Growth Mindset" is the Buzzword of 2016 - and That's a Good ThingI first encountered the Growth Mindset nearly ten years ago in a New York Magazine article titled <a href="http://nymag.com/news/features/27840/">"How Not to Talk to Your Kids"</a>. The central point of the article was that when a child succeeds at a task, it makes a big difference whether you praise them for their effort or praise them for their talent or ability. Praising a child for their effort is associated with a growth mindset. It fosters children's belief that they can overcome obstacles and increase their mental capacity.<br />
<br />
The article I read was based on the <a href="http://www.ascd.org/publications/educational-leadership/oct07/vol65/num02/The-Perils-and-Promises-of-Praise.aspx">research of Dr. Carol Dweck</a>. There is a large and growing body of evidence showing that students with a growth mindset achieve more and overcome challenges more consistently. It's also supported by contemporary research in psychology and neurology. "The brain is like a muscle," goes a common metaphor; "giving it a harder workout makes you smarter." Indeed, <a href="http://www.newsweek.com/brain-muscle-really-223348">continuing research shows that IQ is malleable and can be increased</a>.<br />
<br />
In recent years, both anecdotal and rigorous evidence for Growth Mindset has increased with books, school programs, and parental training programs. <a href="http://www.mindsetworks.com/">Mindset Works</a> is an advocacy organization dedicated to the concept. The result is an explosion of Growth Mindset interest in late 2015 and 2016.<br />
<script src="//www.google.com/trends/embed.js?hl=en-US&q=growth+mindset&tz=Etc/GMT%2B7&content=1&cid=TIMESERIES_GRAPH_0&export=5&w=450&h=330" type="text/javascript"></script>
<br />
And here are some recent examples:<br />
<ul>
<li><a href="https://youtu.be/t1t7cV_4GAM">Eduardo Briceño keynote at iNACOL 2015 conference.</a></li>
<li><a href="http://www.educationworld.com/a_news/should-how-teachers-develop-%E2%80%98grit%E2%80%99-and-%E2%80%98growth-mindset%E2%80%99-be-better-reflected-teacher">Education World: Should How Teachers Develop 'Grit' and 'Growth Mindset' Be Better Reflected in Teacher Evaluations?</a></li>
<li><a href="http://blogs.edweek.org/edweek/DigitalEducation/2016/02/growth_mindset_touted_by_compa.html">Education Week: 'Growth Mindset' Touted by Companies, Ed. Groups in National Campaign</a></li>
<li><a href="http://blogs.edweek.org/teachers/prove-it-math-and-education-policy/2016/02/learned-about-growth-mind-set-when-students-taught.html">What I Learned About Growth Mindset When Students Taught Me</a></li>
</ul>
<h2>
Risk of a Buzzword</h2>
<div>
Growth Mindset is based on solid evidence and sound psychology. But as the buzzword starts trending we risk the idea failing and being discredited through enthusiastic but misguided efforts. A colleague recently worried that growth mindset might fall victim to the <a href="http://www.examiner.com/article/self-esteem-fad-harms-student-achievement-teaching-self-esteem-is-misguided">Self-Esteem fad of the 1990s</a>. To be sure, the right kind of praise is connected with growth mindset. But equally important are fostering the determination to overcome obstacles and the safety to fail.</div>
<div>
<br /></div>
<div>
Some years ago I had the privilege of being a chaperone when my children's school competed in the Utah Shakespearian Festival. It was a small school and the drama team was composed of the majority of the high school - grades 9 through 12. I watched in amazement as these average kids rehearsed dramatic scenes, choreographed their own dance pieces, and performed a breathtakingly creative ensemble scene from <i>Much Ado About Nothing</i>. In the sweepstakes, they took second place against much larger and better-equipped schools. I chatted with teachers and other parents about what qualities enabled our school to perform so well without cherry-picking the best drama students for the team. We decided that an important factor is the emotional safety students had at the school. The cultural climate enabled students to take risks and regularly fail with minimal fear of ridicule. The courage to step out and take risks is especially important in the performing arts. Years later I found corroborating evidence in <a href="http://brenebrown.com/">Brene Brown's research on vulnerability</a>. </div>
<div>
<br /></div>
<div>
Growth Mindset has as much or more to do with proper response to failure as it has to do with proper praise for success. Like a scientist performing experiments, students should be encouraged to treat failures as opportunities to learn and gain insight. Indeed, study of a failure can yield new understanding whereas success simply confirms existing knowledge.</div>
<h2>
Learning Mindsets</h2>
The Raikes Foundation considers a <a href="http://www.raikesfoundation.org/education-strategy">broader concept of "Learning Mindsets"</a>. This includes growth mindset and adds other skills that help students "actively participate, work through problems, think critically, and approach learning with energy and enthusiasm." <a href="http://gettingsmart.com/2015/03/its-time-to-trash-the-terms-non-cogs-and-soft-skills/">Andy Calkins calls this "Agency."</a> Of these skills; which include grit, determination, self-advocacy, and confidence; growth mindset seems to be getting the attention in 2016. If people study the concept and implement it well, that will be a good thing!Brandt Reddhttp://www.blogger.com/profile/11766989840552023101noreply@blogger.com0tag:blogger.com,1999:blog-7193208137942882340.post-90764157864573711152015-12-30T13:57:00.000-08:002015-12-30T13:57:24.707-08:00Personalized Learning - More Evidence, More ProgressI've <a href="https://www.google.com/?gws_rd=ssl#q=site:ofthat.com+personalized+learning">written a lot about Personalized Learning</a> on this blog. The theory has a lot of things going for it. It's intuitive, it's the <a href="http://www.ofthat.com/2012/07/learning-everything-works-but-how-well.html">principle behind the most effective learning factors</a>, and <a href="http://collegeready.gatesfoundation.org/wp-content/uploads/2015/11/Gates-ContinuedProgress-Nov13.pdf">supporting evidence continues to accumulate</a>.<br />
<br />
When introducing personalized learning it's useful to contrast with <a href="http://www.christenseninstitute.org/?multimedia=ending-the-classroom-factory-model-how-technology-will-personalize-education">factory-model</a> education. Under a factory model, students with wide variation in personality, interests, skills, and talents are exposed to a consistent educational experience. Unsurprisingly, there is wide variation in the results because the consistent learning activities resonate better with some students than others. So, we grade the students with some portion of the grade attributable to student effort and other parts attributable to evidence of subject mastery.<br />
<br />
Personalized education applies in two ways. For fundamental subjects like Reading, Writing, and Mathematics, the learning experience should be personalized to meet the diverse needs of individual students. Customizing the experience to each student's individual needs can result in consistent achievement in a diverse population. <br />
<br />
With a foundation of core skills in place, the second form of personalization is supporting students as they pursue diverse interests - science, music, art, history, sports, and so forth. The most successful students have always personalized their education. The innovation is for institutions to deliberately participate in the personalization effort.<br />
<h3>
Accumulating Evidence</h3>
<div>
Earlier this year, the Bill & Melinda Gates Foundation commissioned a RAND Corporation study of <a href="http://collegeready.gatesfoundation.org/continued-progress-promising-evidence-on-personalized-learning/">62 public charter and district schools pursuing a variety of personalized learning practices</a>. The results are promising. Average performance of students in the study schools was below the national average at the beginning of a two-year study period and was above the national average at the conclusion. Growth rates increased over time, with <a href="http://www.ofthat.com/2012/07/learning-everything-works-but-how-well.html">effect sizes</a> exceeding 0.4 in the third year.</div>
<div>
<a href="http://1.bp.blogspot.com/-juSh67_j4nI/VoM9iWVmWJI/AAAAAAAABeQ/o_s-F1NIy_g/s1600/Gates-RAND-ContinuedProgress-Chart-277x300.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://1.bp.blogspot.com/-juSh67_j4nI/VoM9iWVmWJI/AAAAAAAABeQ/o_s-F1NIy_g/s1600/Gates-RAND-ContinuedProgress-Chart-277x300.jpg" /></a>The study identified and examined five specific personalization strategies:</div>
<ul>
<li>Increased one-on-one time between student and instructor.</li>
<li>Personalized learning paths with students able to choose from a variety of instructional formats.</li>
<li>Competency-based learning models that enable individual-pacing with supports tailored to each student's learning level.</li>
<li>Flexible learning environments that can be adapted to student needs, particularly when they have conflicting demands on their time.</li>
<li>College and career readiness programs.</li>
</ul>
<div>
The authors observe that, "While the concept of personalized learning has been around for some time, advances in technology and digital content have placed personalized learning within reach for an increasing number of schools." </div>
<h3>
Progress and Public Support</h3>
<div>
The most significant policy event this year was the reauthorization of the Elementary and Secondary Education Act (ESEA). The previous iteration was known as "No Child Left Behind"; this version is titled the "Every Student Succeeds Act". About the new law, <a href="http://www.inacol.org/news/inacol-applauds-u-s-congress-esea-conference-committee-vote-to-reauthorize-federal-k-12-education-law/">iNACOL wrote</a>, "Through ESEA reauthorization, Congress [supports] the shift to new, personalized learning models by redesigning assessments, rethinking accountability, and supporting the modernization of educator and leadership development."</div>
<div>
<br /></div>
<div>
Another important event this year is <a href="http://education-reimagined.org/">Education Reimagined</a>. The <a href="http://www.convergencepolicy.org/">Convergence Center for Policy Resolution</a> brought together leaders from across the political and educational spectrum to describe a new vision for education. As they describe it, "We were not your typical group -- no two in agreement about how to fix the current system. What we did share, however, was a fundamental commitment for all children to love learning and thrive regardless of their circumstances. We knew it was time to stop debating how to fix the system and start imagining a new system." I had the privilege of hearing Becky Pringle, vice president of the National Education Association, and Gisèle Huff, director of the libertarian Jaquelin Hume Foundation, <a href="http://opinionator.blogs.nytimes.com/2015/11/03/the-art-of-getting-opponents-to-we/">describe their shared vision of student-centered education</a>. It's compelling that, when you get all of the parties to converge on a shared educational vision, it focuses on personalization: on meeting the specific needs of each student.</div>
<div>
<br /></div>
<div>
As we head into the new year, I'm optimistic. At this moment, we have progress, evidence, and policy coherently driving toward a better education for all of our students.</div>
<div>
<br /></div>