Chapter 4

Big data & adaptive infrastructure

The term “big data” is used to describe the tremendous volume, velocity, and variety of data generated by various technology platforms, many of which involve the continuous or ubiquitous collection of data. Big data refers specifically to data sets that are so large and complex that they are challenging to work with using traditional database management tools; specific challenges include the storage, search, analysis, and visualization of data.

Big data & education

The advent of “big data” in areas such as internet search and social media has disrupted existing industries, created new industries, and led to the extraordinary success of companies such as Google and Facebook. Big data opens up a range of productive possibilities in the education domain in particular, since data that reflects cognition is structurally different from the data generated by user activity around web pages, social profiles, and online purchasing habits.

One feature that distinguishes the data produced by students (from that of consumers shopping online or engaging in social media, for example) is the fact that academic study requires a prolonged period of engagement; students thus remain on the platform for an extended length of time. Furthermore, there is a focus, intention, and intensity to students’ activity: they are engaging in high-stakes situations, such as taking a course for credit, trying to improve their future, or expanding their range of skills. The sustained intensity of these efforts generates vast quantities of meaningful data that can be harnessed continuously to power personalized learning for each individual.

Another feature that distinguishes the data produced by students is the very high degree of correlation within educational data, together with the aggregated effect of all those correlations. If, for example, a student has demonstrated mastery of fractions, algorithms can reveal how likely it is that the student will demonstrate mastery of exponentiation as well, and how best to introduce that concept. If a student has demonstrated mastery of various grammatical concepts (say, subjects, verbs, and clauses), educational data can be used to optimize the student’s learning path, so that different sentence patterns will “click” as quickly as possible.
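To make the fractions example concrete, the following minimal sketch estimates how likely a student is to master one concept given mastery of another, using simple counting. The records and the function name are hypothetical and chosen for illustration; the text does not describe the algorithms actually used.

```python
# Hypothetical records, one per student:
# (mastered_fractions, mastered_exponentiation). Illustrative data only.
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]

def conditional_mastery(records, given=True):
    """Estimate P(mastered B | mastery of A == given) by counting."""
    outcomes = [b for a, b in records if a == given]
    if not outcomes:
        raise ValueError("no students match the condition")
    # True counts as 1 and False as 0, so the mean is the observed rate.
    return sum(outcomes) / len(outcomes)

p_with_fractions = conditional_mastery(records, given=True)
p_without_fractions = conditional_mastery(records, given=False)
```

In this toy data set, students who have mastered fractions master exponentiation far more often than those who have not. That gap is the kind of correlation the paragraph above refers to.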

The hierarchical nature of educational concepts means that they can be organized in a graph-like structure. As a result, the flow of students from concept to concept can be optimized over time, as data reveals more and more about the relationships between concepts. Every student action and response around each content item ripples out and affects the system’s understanding of all the content in the system and all the students in the network.
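The graph-like structure described above can be sketched as a small prerequisite graph. The concepts and edges below are assumptions made up for illustration, and the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) rather than any method named in the text.

```python
from graphlib import TopologicalSorter

# Hypothetical concept graph: each concept maps to its prerequisites.
prerequisites = {
    "addition": set(),
    "multiplication": {"addition"},
    "division": {"multiplication"},
    "exponentiation": {"multiplication"},
    "fractions": {"multiplication", "division"},
}

def learning_order(prereqs):
    """Return one ordering in which every prerequisite comes first."""
    return list(TopologicalSorter(prereqs).static_order())

def ready_to_learn(prereqs, mastered):
    """Concepts not yet mastered whose prerequisites are all mastered."""
    return sorted(
        concept
        for concept, needs in prereqs.items()
        if concept not in mastered and needs <= mastered
    )

order = learning_order(prerequisites)
next_concepts = ready_to_learn(prerequisites, {"addition", "multiplication"})
```

A static graph like this only captures which concepts come before which. Learning the strength of each relationship from student data, as the text describes, would require weights on the edges that are updated over time.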

Adaptive infrastructure

Knewton has established an infrastructure that allows the platform to process tremendous amounts of student data. Inference on probabilistic graphical models, for instance, belongs to a class of algorithms called “graph algorithms.” These algorithms are special in that they can be broken down into units of computation that depend only on other specific units. They can thus be parallelized very efficiently, provided the work is split between computers in a way that requires only limited coordination.
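The property described above, that each unit of computation depends only on a few specific other units, can be illustrated with a simplified stand-in for graphical-model inference. In this sketch each node's new value is the average of its own value and its neighbors' values. The graph, the update rule, and the partitioning are all assumptions; threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chain graph: each node lists the nodes it depends on.
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

def update_node(node, values):
    """Compute a node's new value from its own and its neighbors' values."""
    local = [values[node]] + [values[n] for n in neighbors[node]]
    return node, sum(local) / len(local)

def run_round(values, partitions):
    """Run one synchronous round, one worker per partition of nodes.

    Workers only read the previous round's values, so they do not need
    to coordinate with each other while the round is in progress.
    """
    def work(part):
        return [update_node(node, values) for node in part]

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(work, partitions))
    return {node: value for part in results for node, value in part}

values = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
partitions = [["a", "b"], ["c", "d"]]
after_one = run_round(values, partitions)
```

Only the values on the boundary between partitions (here, `b` and `c`) would need to be exchanged between machines after each round, which is why the coordination cost stays low when the graph is split sensibly.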

Given the absence of robust, public frameworks for accomplishing these computations at a large scale, Knewton has designed its own framework, called AltNode, which works by dividing work between machines and then sending continuous updates between the minimum number of machines necessary. All significant updates are stored in a distributed database and broadcast between services. If one machine fails, another one automatically takes its place, recovering recent values from the database and resuming work. One unique feature of AltNode is that it allows models to recover from any state and respond to new data as it arrives.
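The recovery behavior described above can be sketched as a checkpoint-and-resume pattern. This is not AltNode's actual design or API, which the text does not specify. The class names are hypothetical, and an in-memory dictionary stands in for the distributed database.

```python
class CheckpointStore:
    """In-memory stand-in for a distributed database (an assumption)."""

    def __init__(self):
        self._data = {}

    def save(self, key, state):
        self._data[key] = dict(state)  # store a copy, not a reference

    def load(self, key):
        # A partition with no checkpoint starts from an empty state.
        return dict(self._data.get(key, {"count": 0, "total": 0.0}))


class Worker:
    """Hypothetical worker that keeps a running mean for one partition."""

    def __init__(self, partition, store):
        self.partition = partition
        self.store = store
        self.state = store.load(partition)  # recover the latest saved values

    def handle(self, score):
        """Apply one update and persist the new state immediately."""
        self.state["count"] += 1
        self.state["total"] += score
        self.store.save(self.partition, self.state)

    def mean(self):
        return self.state["total"] / self.state["count"]


store = CheckpointStore()
first = Worker("students-0", store)
first.handle(0.5)
first.handle(1.0)
del first  # simulate the machine failing

replacement = Worker("students-0", store)  # takes over the same partition
replacement.handle(0.0)  # new data keeps arriving after recovery
```

Because every update is persisted before the next one is handled, the replacement worker resumes from the same state the failed worker had reached, and the running mean reflects all three updates.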

Figure H. AltNode framework