This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.
In this post, I will discuss an approach to conveying information about student behaviors using generative models.
A generative model is a description of how to generate data that is similar to the observed data.
Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.
Here are some examples of observations of correctness on practice questions from when each student started a course until today.
We could specify different generative models for each student. One possible generative model for student A could be:
Model #1 (Student A)
“print nine 1′s and then one 0″.
If we wanted to describe student B similarly the description would be:
Model #1 (Student B)
“print one 1 then three 0’s then four 1s then one 0 …”.
Models like Model #1 exactly preserve all of the information that was present in the student data but in doing so don’t reduce the complexity of the observations at all. Often, in communicating information about student behaviours, we seek to preserve the important information while reducing the complexity to communicate concisely.
A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.
Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.
“Flip a weighted coin that lands on heads with probability w. If it’s heads, print 1, otherwise print 0″.
In reporting to a teacher, I can report that the model that best fits with the student history is w=0.9.
That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).
While the model parameters may be meaningful themselves (in this case they are the student’s overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w0.65. If I wanted to find the student most similar to StudentC (w0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (eg. Levenshtein distance, which would yield studentB as the closest student to studentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.
In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*
In choosing Model #2 to convey my information, I’ve consciously made some decisions about what qualities of the data are important or not. For example this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but had its order reversed, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.
Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.
In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”
In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”
*For example, a common generative model used for summarizing text documents is the Latent Dirichlet Allocation (LDA) model is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.
Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.