People often think that big data for education is a new thing. And it’s true that using big data the way Knewton does sounds almost like science fiction — our engine passively “norms” content at scale, uses normed content to determine students’ conceptual proficiencies to the percentile, and makes granular content recommendations for each student based on the combined anonymized data of all the other students in our network.
But there is an older, bricks-and-mortar type of big data for education: standardized tests.
People always ask me whether I think standardized tests actually measure anything. I used to give nerdy statistical answers about scoring validity and such. Eventually I realized that all they want to hear is that standardized tests are useless, possibly corrupt.
Standardized tests are increasingly unpopular, and not without reason. But when used properly they do, or at least should, serve important purposes. There are two main kinds of standardized tests: admissions and state assessment. Admissions tests like the SAT were built to predict academic performance in college and graduate school. Grades and transcripts also do this, but academic standards and programs differ so greatly from school to school and region to region that a central standard measure is extremely useful. State assessments help demonstrate whether students are graduating from high school with basic literacy and math skills. They exist because society, which pays for every child to have free K-12 education, has a right to know that kids are actually learning.
To fulfill its purpose (and to have a shot at being fair), a standardized test must yield totally consistent scores across administrations. If Maria takes the SAT in May, takes the summer off (and, let’s assume, doesn’t gain or lose any knowledge), and then takes the SAT again in September, she should get the exact same score both times (plus or minus the margin of error).
It is a nearly impossible technical challenge to build a test so well that it yields totally consistent scores across multiple examinations. Try holding a dinner party and giving every guest a 50-question test on any topic. Then invite them back the following week and give them another test on the same topic, but with different questions. Spare your friends the ordeal and trust me: most of them won’t receive similar scores. Yet the big U.S. test makers have figured out how to do exactly this, except they can do it repeatedly, at much larger scale, and with examinees about whom they know almost nothing.[1]
The reality is that standardized tests are effective predictors of academic success in college or graduate school.[2] That’s a fact of statistics, whether anyone likes it or not. It’s also just an average, and anyone could be an exception one way or the other. Standardized test scores, in my opinion, are more reliable at the top end of the scale than they are lower down. High scores generally indicate that someone is quite capable; but low scores may not mean very much at all. For these reasons, among others, standardized tests should be just part of the admissions process, and should never be the most important part.
The problem isn’t with the tests themselves, or their goals. The problems with major standardized tests are nearly always in how they are used. Admissions offices grossly misuse assessments like the SAT. As do all standardized tests, the SAT has a margin of error: when comparing the scores of two students, it’s about 28 points.[3] The difference between Maria’s 1410 and Sam’s 1390 could mean absolutely nothing. This is why test makers repeatedly tell schools not to use cutoff scores. But schools do it anyway. Their admissions departments are small and understaffed, so they must take shortcuts. Plus, to the average human brain, a score that starts with 14 just looks better than one that starts with 13.
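The arithmetic behind the Maria and Sam example can be sketched in a few lines of Python. Only the 28-point margin and the two scores come from the text; the function name and the strict greater-than rule are assumptions made for illustration, not a published scoring procedure.

```python
# Margin of error when comparing two students' SAT scores, as quoted in the text.
COMPARISON_MARGIN = 28  # points


def is_meaningful_difference(score_a: int, score_b: int,
                             margin: int = COMPARISON_MARGIN) -> bool:
    """Hypothetical helper: True only if the gap exceeds the margin of error."""
    return abs(score_a - score_b) > margin


maria, sam = 1410, 1390
print(abs(maria - sam))                      # 20
print(is_meaningful_difference(maria, sam))  # False
```

A 20-point gap sits inside the 28-point margin, so a cutoff placed at 1400 would separate two students whose scores are statistically indistinguishable.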
State assessments are even more misused than admissions tests are, in large part thanks to No Child Left Behind. This legislation has a strong focus on testing kids and judging schools. (The international high-stakes assessment PISA is also routinely misinterpreted, with who knows what policy consequences.) Again, society absolutely has a right and responsibility to make sure students are learning; the country’s competitive future depends upon it. Taxpayers have a right to know that the dollars taken from them to pay for free K-12 education are being spent effectively. But as sometimes happens with sweeping legislation, unintended consequences may in fact exacerbate the problems the legislation was trying to solve.
For reasons such as societal inequality, systemic selection bias, overwhelming data noise, and problems with the tests themselves, state assessments are a terrible way of judging schools or teachers. Trying to judge teachers algorithmically is a fool’s errand; it will never work as well as doing what good principals have always done: observing them in the classroom. Yet the outcomes of these tests can have serious punitive consequences for schools. The most immediately impactful way for schools to improve their performance on these tests is by turning the classroom into a test prep course. Many kids in public school today are faced with relentless homework, because that’s the easiest way to practice kids up, teach them to recognize question types, and train their endurance. This is bad for teachers, bad for kids, and bad for America. Not only is overworking kids like crazy not the way to fix problems in education, it will ultimately increase our problems by driving all the joy out of learning (and out of teaching, which will discourage people from entering the profession and make the problem worse).
But there is reason to be hopeful. The new big data of Knewton-powered adaptive learning can partially disrupt the old big data of state assessments. Like standardized tests, Knewton can predict performance and measure proficiency. The difference is that our technology does so continuously as an organic part of the learning process, rather than interrupting the learning process.[4] Plus, Knewton doesn’t get just three hours of student performance data one time from a bricks-and-mortar test administration. We can see hours of (anonymized) data from students every day, so our proficiency estimates can be more accurate and continuously updated.
High-stakes standardized tests aren’t going away. They will continue to be used for test security, external validation, etc. But Knewton’s real-time, concept-level analytics may end up relaxing the need for and overuse of high-stakes assessments. Furthermore, Knewton can analyze what students know in the natural course of doing their homework and help reduce the need for teachers to “teach to the test.” Knewton wants to free teachers to emphasize creativity and critical thinking, develop social and emotional skills in young students, and tickle curiosity and develop a love of learning in every student. Knewton-powered materials could, for example, guide students through the facts and concepts on the causes of World War I, so that they come to class ready to experience a high-level analysis or discussion.
There are those who believe we’re about to enter a new era of “teachers vs. big data.” That’s wrong. We are, though, about to enter an era of “old big data vs. new big data.” If there is a single best hope to eliminate onerous over-assessment, it is in the latter. And who will be the winner of this impending battle of old big data vs. new big data? Teachers and students.
[1] Ironically, there is only one way to accomplish this, and it simultaneously opens up the tests to their greatest weakness. To solve the problem of scoring consistency, test-makers must give basically the same exam every time. They test from the exact same pool of concepts every time in more or less the same proportions; they just change the particular words or numbers in the questions, but the questions have the same underlying structure in every single test. If you can train yourself to recognize and master that underlying structure, standardized tests become embarrassingly easy.
[2] Studies have shown the SAT, especially in conjunction with high school grades, to be a “robust predictor of college success.” Studies have also shown graduate admissions tests to be more effective predictors of performance in grad school than transcripts alone (the combination of tests and grades is the most accurate predictor).
[4] Knewton has built the necessary infrastructures to passively, algorithmically, and inexpensively norm content at scale. Anyone who wants to build true adaptive learning apps, without doing all the painful and expensive work on a one-off basis, can plug into our network and build on top of our infrastructures.