Knewton Reads: A Data Scientist Reviews “The Information”

April 25, 2012

James Gleick’s “The Information: A History, a Theory, a Flood” was the pick for this month’s Knewton book club. The book covers the history of information, from the invention of scripts and alphabets to Morse code and the arrival of the Information Age. We’ll be posting reviews throughout the month; read others here.

When a molecular biologist reasons about genes and heredity, the second law of thermodynamics is not always her second thought. As I sit here composing these sentences, I’m barely cognizant of the fact that I could leave out roughly half of the textual characters I’m typing and still my point would come across, and only mildly adulterated. Because I’m a creature of 2012, I think nothing of the miracle of this language transmitted in full fidelity across the many wires of the web — but what if I were a creature of some time earlier and my medium was the drum?

It’s a rare book that manages fantastic leaps across time and concept while staying so faithful to the science and to the biographies of the people who developed those concepts. It was a pleasure to spend a few hours inside the mind of James Gleick, reading his latest book, “The Information,” which explores information in all its streaming, noisy, lively, expressive, fickle, and multitudinous incarnations. It was particularly rewarding to realize that many of the connections he draws are foundational to the thought processes running through an information scientist’s mind here at Knewton at any given moment.

“The Information” dissects the lifecycle of information itself: transmitter, to encoder, to imperfect medium, to decoder, to receiver. For a data scientist at Knewton, this particular lifecycle of information can be mapped to the way we model student understanding and behavior: a student’s unknowable state of understanding, to an imperfect assessment of that understanding, to a receiver. The one exception is that the receiver in our system is really a feedback loop, in which we update both our knowledge of the properties of our assessments and our knowledge of each student who passes through the loop. This is a process-oriented description of what Item Response Theory (one of the fundamental tools we use at Knewton) provides us.

Given a configuration of messages we derive from an assessment medium, the task is to “decode” the inner state of a student’s knowledge. To me, this is just one notion of what a (probabilistic) model is — a layer of abstraction that rides just above the clickstream, the knowable answers to questions on tests viewed through an imperfect lens, that gives us a picture of a student’s state of knowledge, from which we attempt to infer a student’s optimal next course of action.
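
To make the “decoding” idea concrete, here is a minimal sketch in Python. It is not Knewton’s actual model; it assumes the standard two-parameter logistic (2PL) form of Item Response Theory, a standard normal prior on ability, and made-up item parameters. The function names and the grid approximation are illustrative choices only.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: chance that a student with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def ability_posterior(responses, items, grid):
    """Approximate the posterior over ability on a grid of theta values.

    responses: list of booleans (True = correct), the observed "messages"
    items:     list of (a, b) pairs, assumed known here for simplicity
    grid:      candidate ability values
    """
    weights = []
    for theta in grid:
        # Unnormalized standard normal prior (an assumption of this sketch).
        w = math.exp(-0.5 * theta * theta)
        # Multiply in the likelihood of each observed response.
        for correct, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(grid, posterior):
    """Point estimate of ability: the mean of the grid posterior."""
    return sum(t * p for t, p in zip(grid, posterior))


# Hypothetical item bank: (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
grid = [i / 10.0 for i in range(-40, 41)]

strong = ability_posterior([True, True, True], items, grid)
weak = ability_posterior([False, False, False], items, grid)
```

Here the answers play the role of the noisy message and the posterior over `theta` is the decoded estimate of the student’s hidden state. A fuller version of the feedback loop would also re-estimate the item parameters from many students’ responses, which this sketch holds fixed.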