Offline Parameter Estimation

As part of my summer internship at Knewton, I helped design and implement an offline parameter estimation batch. This blog post discusses some of the design considerations and challenges associated with the implementation of offline parameter learning. But first, I’ll provide a little context around the model and an overview of the parameter estimation process.

Vocabulary and Context

  • A student event is any interaction a student has with educational content that is recorded in Knewton’s data store (e.g., watching an instructional video or answering a quiz question).
    • When an event consists of a student responding to an assessment question, it is referred to as an item interaction. In this case, the item refers to the actual question.
    • Items are responsible for assessing one or more concepts (e.g., long division). The content for a particular book consists of a collection of concepts.
    • A graph structure identifies the relationship between concepts (e.g., a concept might be a prerequisite for another concept). The graph structure also identifies which concept(s) a particular item assesses. Therefore, a single graph identifies the relationships for all items and concepts contained in the book used for a course.
    • By understanding the characteristics of each item (e.g., its level of difficulty) as well as a student’s current level of proficiency in a particular concept, Knewton is able to provide a recommendation on what piece of educational content the student should interact with next in order to meet the learning objectives of the course. This explanation glosses over a number of other factors taken into consideration by Knewton’s recommendation engine, but suffice to say, the recommendation relies on an understanding of item characteristics and student proficiency.
    • Item characteristics and student proficiency are modeled using Item Response Theory (IRT) as discussed in a previous blog post.

Parameter Estimation Overview

Once the initial model components are identified based on experimentation and evaluation by data science teams, Knewton has to assess how the parameters for this model will be estimated over time to ensure robustness and accuracy as more data are collected. The team can choose to train parameters in an ad hoc mode, a real-time online mode, a recurring offline mode, or some hybrid.

Ad hoc

In this mode, the data scientist(s) derive parameters outside of the operational functions of the product, based on their own analysis. The resulting parameters are then configured into the product in an ad hoc manner, typically following additional experiments or analysis by the data scientist(s).

The key operational drawback of this approach is the lack of a systematic and real-time response to new student events. Parameter updates have to wait until the next time the data scientist(s) choose to gather and incorporate new student events into a parameter estimation run. In addition, without a standardized and repeatable process in place, different data scientists may approach the parameter updates in a slightly different way based on their familiarity with the problem. This can pose a challenge in maintaining quality consistency and reproducibility.

The benefit of this approach, however, is that it has few constraints imposed upon it, and so is incredibly flexible. Data scientists are able to tweak the training process as they see fit.

Offline

This mode refers to a recurring operational process by which parameters are regularly re-estimated as part of some (semi-)automated process. At Knewton, this is achieved through scheduled batch jobs that automate the gathering of training data and execute the parameter learning.

While this limits the flexibility afforded to the parameter estimation method (since the logic has to be implemented in an automated batch), it improves robustness by simplifying the process by which model parameters can be updated in response to changes in the underlying data. This process still does not provide real-time parameter updates, but, at the time of execution, it can incorporate all available data into its parameter estimates.

Online

This mode refers to the integration of the parameter update process into the online behavior of the product. At Knewton, this means that each incoming student event can trigger a parameter update. This is necessary because each additional student interaction provides insight into the student’s current proficiency level. An application of this approach outside of Knewton is clustering news articles into related topic areas: since new articles are constantly coming in over a newswire, the clusters must be updated to reflect the growing corpus of articles [1].

The key operational challenge of online estimation is that data generally have to be processed incrementally. Multiple scans of the full data set are not always feasible. In addition, real-time response rate requirements limit the amount of computation that can take place for each incoming data point (particularly if the number of incoming events is high) [2]. As an example in the context of news articles and topic models, Banerjee et al. employ algorithms which hold the “number of clusters” constant and only update the cluster parameters for the topic area which appears to be most closely aligned with an incoming news article [1].
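
To give a rough sense of what an incremental update looks like, here is a minimal Python sketch in the spirit of the streaming algorithms discussed in [1]: the number of clusters stays fixed, and only the statistics of the single best-matching cluster are touched for each incoming article. The function and variable names are ours, and a production system would use the specific sufficient statistics of its own model.

import numpy as np

def online_cluster_update(centroids, counts, article_vec):
    """Incremental (online) cluster update: constant work per incoming data
    point, no rescan of previously seen articles.

    centroids : (K, d) float array of current cluster centers
    counts    : length-K int array of how many articles each cluster has absorbed
    """
    # find the cluster whose centroid is closest to the incoming article
    distances = np.linalg.norm(centroids - article_vec, axis=1)
    k = int(np.argmin(distances))

    # incremental mean update for that single cluster
    counts[k] += 1
    centroids[k] += (article_vec - centroids[k]) / counts[k]
    return k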

While these techniques are effective in providing real-time response rates, the accuracy of model predictions may suffer over time if there are underlying changes to the model parameters that were held constant. In the case of news topic clusters, the study finds that, in general, “online algorithms give worse nMI (a measure of cluster quality) results than corresponding batch algorithms, which is expected since the online algorithms can only update the cluster statistics incrementally” [1]. In Knewton’s case, new data might suggest that item parameters need to be tweaked from the default settings with which they were originally configured.

Hybrid

The authors of the news topic clustering study — like Knewton — propose the use of a hybrid model where offline parameter estimation is used to regularly refresh model parameters that are otherwise being updated incrementally through the online parameter update [1]. This allows model parameters to react in real time to incoming data points, while systematically refreshing parameters that might otherwise become stale as a result of only making incremental online adjustments.

The hybrid approach does pose the challenge of having to identify an optimal frequency at which offline parameter estimation should be carried out to augment the performance of the online parameter updates. The frequency is likely to be dependent on how quickly errors accumulate due to online incremental updates and the tolerance range for such prediction errors.

Design Considerations for Offline Parameter Estimation

The remainder of this post discusses the offline batch process that was to be introduced in support of the hybrid parameter estimation approach. As such, the objectives of this batch job were to:

  • Extract item interactions from the student events recorded by Knewton;
  • Parse them into training data for a parameter learning algorithm for the IRT model; and
  • Estimate new item parameters grouped by book. By grouping items by the book, the estimation process can use relationships between different concepts (e.g., if mastery of concept 1 is a prerequisite for mastery of concept 2) to inform parameter selection for items assessing each of those concepts.

There were four major design considerations which went into achieving these objectives: robustness, flexibility, scale, and transparency.

Robustness

One of the main considerations was to design an offline parameter estimation job which would be able to accommodate future changes in the estimation process as well as the underlying data. As a result, there was heavy use of modularized design to decouple various components of the batch process:

    • Rather than retrieving student events directly from the database, the batch relies on an intermediary batch job that extracts the events and parses them into an easily consumable input format.
    • Similarly, a separate library accepts a training data set as input and uses it to perform the actual parameter estimation.

This allows the batch process to gather a training dataset and pass it to a machine learning algorithm for parameter estimation while being agnostic of the underlying database structure, graph structure, and parameter estimation algorithm details.
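
As a sketch of what this decoupling looks like, the batch reduces to a function that only knows about two narrow interfaces: something that yields parsed item interactions and something that estimates parameters from them. This is illustrative Python with hypothetical names (the actual batch is written in Java and the estimation library in Python, as described below).

from typing import NamedTuple

class ItemInteraction(NamedTuple):
    """Simplified stand-in for the parsed event format."""
    student_id: str
    item_id: str
    correct: bool

def run_offline_batch(event_source, estimator):
    """Wire the pieces together without touching the database schema,
    the graph structure, or the estimation math, so each piece can
    change independently."""
    training_set = list(event_source.interactions())  # intermediary batch output
    return estimator.estimate(training_set)           # parameter learning library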

Flexibility

Though the batch job is itself written in Java, the library responsible for performing parameter estimation is written in Python in order to provide the flexibility of leveraging mathematical packages such as pandas and numpy. To facilitate this, the batch job accommodates Python code execution as a post-Reduce calculation step using a custom MapReduce framework (which uses Thrift to communicate between the Java and Python components).

The library used to perform the actual parameter estimation is the same library which the data science teams utilize for their experimental analysis. This eliminates the need to generate duplicate code going from exploratory analysis to offline parameter estimation — particularly as the parameter estimation techniques are tweaked. In addition, the batch is able to support input parameters corresponding to the full range of optimization settings used by the data science teams to tweak the parameter learning process.

Scale

The biggest challenge was dealing with the amount of data involved in the parameter estimation step. Knewton records over 500 million student interactions each month. Using a MapReduce design allowed the extraction and parsing of item interactions from student events to be achieved by distributing the student event records across multiple Hadoop partitions. Using a 15-node cluster of R3 AWS instances, the batch was able to parse student events for a given month within an hour. The R3 instances are optimized for in-memory analytics, providing 61 GiB of memory and 160 GB of SSD storage.

Even after the batch had extracted the item interactions from the student event records, this still left between one and five million training data instances per book for the estimation step. Such large data sets place a heavy load on memory during the iterative parameter estimation process. While this issue was addressed by using AWS instances optimized for memory, another option would be to sub-sample down to manageable levels as the amount of data increases over time. The other alternative would be to modify the learning algorithm to avoid storing all 5 million training data instances in memory during the update calculations. However, devising a special algorithm for this may limit the long-term flexibility of the parameter estimation process.
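
For illustration, here is a minimal sketch of the sub-sampling option mentioned above. Uniform sampling and the particular size cap are assumptions of ours; in practice the sample might be stratified by item or student.

import random

def subsample(training_data, max_instances=1_000_000, seed=0):
    """Cap the number of training instances per book so the iterative
    estimation step fits comfortably in memory."""
    if len(training_data) <= max_instances:
        return training_data
    rng = random.Random(seed)
    return rng.sample(training_data, max_instances)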

Transparency

The batch job uses Counters and logging to provide transparency into the parameter estimation steps. The logging is facilitated by the parameter estimation library which enables the user to see parameter estimation statistics (like log-likelihood values) at the end of each iteration in the algorithm. This was designed so that one can look inside the parameter estimation process and evaluate discrepancies or optimization opportunities.

Parameter Validations

Finally, a systematic QA step is an important feature for any automated offline parameter estimation. Knewton exercises a number of manual validation steps ranging from quantitative analysis (e.g., running historical events through the model to evaluate predictive power) to qualitative analysis (e.g., reviewing recommendations produced with the parameter estimates). In the future, some of these QA steps could be incorporated into the batch process to provide fully automated offline parameter estimation.

References

[1] A. Banerjee and S. Basu, “Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning,” Society for Industrial and Applied Mathematics, pp. 431-436, 2007.
[2] S. Zhong, “Efficient streaming text clustering,” Neural Networks, vol. 18, no. 5-6, pp. 790-798, 2005.

Online Cross-Validation to Predict Future Student Behavior

At Knewton we use various mathematical models to understand how students learn. When building such models we want to make sure they generalize, or perform well for a large population of students. Cross-validation, a technique in machine learning, is a way to assess the predictive performance of a mathematical model. At Knewton, we use cross-validation extensively to test our models.

Cross-validation is based on the fact that we don’t have access to unlimited data. If we had all the possible data on student learning patterns, the solution would be straightforward. We would test all our models with the data and pick the one with the lowest error rate. In reality, we only have a finite set of student data to work with. Given a limited amount of data, how do we decide which model performs the best?

One approach is to use all of the available data to test our model. A major problem with this approach is overfitting, which is demonstrated in Figure 1.

Gizem1.jpg

Figure 1: Left: the model (blue) underfits the data (orange). This is an over-simplistic explanation of the data where the model would be a better fit if it had more parameters. Middle: the model fits the data just right, where the model captures the overall pattern in the data well. Right: the model overfits the data, where the model fits the noise in the dataset. (Source)

If our model overfits the data, the error rate on that data will be low, but the model may perform poorly when new data is added to the dataset because the fit does not explain the new data well. This is why models that overfit do not generalize well and should be avoided.

This is where cross-validation comes into play. In this approach, rather than fitting the model to the full dataset we split it into training and test sets. This is also referred to as holdout cross-validation, as we are leaving a portion of the data out for testing. The model is fitted using only the training portion of the dataset. Then we assess the predictive performance of the model on the left-out data, which is the test set.

As an example, one model we use to assess student learning is Item Response Theory (IRT). We want to cross-validate our IRT model for a set of student responses to test the performance of our model. To do this, we can split the student response data into training and test sets, fit the model to the training data, and validate it on the test data. If the fitted model predicts the student responses in the test set accurately we can accept this IRT model.
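
In code, the holdout procedure is just a split, a fit, and a scoring pass. The sketch below uses hypothetical fit_irt and predict stand-ins for the actual model code, and a plain accuracy score for simplicity.

import random

def holdout_accuracy(responses, fit_irt, predict, test_fraction=0.2, seed=0):
    """responses: list of (student, item, correct) tuples."""
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)
    split = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:split], shuffled[split:]

    model = fit_irt(train)                      # fit on the training portion only
    hits = sum(int(predict(model, student, item) == correct)
               for student, item, correct in test)
    return hits / len(test)                     # predictive accuracy on held-out data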

When measuring how students learn, we assume they learn over time. Therefore, it is useful to understand how students behave as time progresses. A shortcoming of the holdout cross-validation technique is that it compares models on randomly held-out pieces of past student data, so it cannot tell us how well a model predicts how students will behave in the future. It would be very useful if we were able to make predictions about students’ future behavior given their past learning patterns.

Online cross-validation is a version of cross-validation which can validate over time series data. Going back to our student response data example, online cross-validation uses a student’s past data to predict how that student will behave in the future. The dataset for online cross-validation is a time-ordered set of responses the student gave in the past. We take the first k responses of a student and use them for the training set, then we try to predict that student’s k+1st, k+2nd, …, k+nth response. If our prediction accuracy is high, we can say that our model is a good fit for our dataset.

Let’s look at how online cross-validation works in more detail. The students answer some questions over time. Some of these responses are correct (green) and some are incorrect (red). Online cross-validation will start by training on the student’s first response only (k=1), then use this to predict whether the student is going to get the next item (k+1 = 2) correct or incorrect.

Gizem2.jpg

Figure 2: The first iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first response (k=1) and then used to predict the second, k+1st item (k+1=2). If our prediction matches the student response, our model accuracy increases. 0/1 refers to incorrect/correct.

In the next iteration of online cross-validation, we can use the first two responses (k=2) as our training set, fit the model using these two data points, and predict the third response (k+1=3).

Gizem3.jpg

Figure 3: The second iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first two responses (k=2) and then used to predict the third, k+1st item (k+1=3). 0/1 refers to incorrect/correct.

Online cross-validation continues until we run through all the iterations by increasing the training set one student response at a time. We expect to make better predictions as we add more data to our training set.
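
In code, the expanding-window loop looks roughly like the sketch below; fit_model and predict_next are hypothetical stand-ins for whichever model is being validated.

def online_cv_accuracy(responses, fit_model, predict_next):
    """responses: one student's time-ordered list of 0/1 outcomes."""
    hits, trials = 0, 0
    for k in range(1, len(responses)):
        model = fit_model(responses[:k])    # train on the first k responses
        prediction = predict_next(model)    # predict the (k+1)st response
        hits += int(prediction == responses[k])
        trials += 1
    return hits / trials if trials else None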

With online cross-validation, we are not limited to predicting only the next response in the future. We can predict a student’s next 2, 3, …, n responses. This makes online cross-validation a very useful technique if we want to make predictions far in the future.

Both holdout cross-validation and online cross-validation are very useful methods for assessing the performance of models. The holdout cross-validation method is useful for assessing performance on a static dataset, whereas online cross-validation is helpful when we want to test a model on time series data.

 

Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. The Kalman filter has numerous applications in finance and different areas of engineering, and in this blog post, we will show how the Kalman filter can apply to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.

Model

To begin, we are going to describe our inference problem and a model of the system – a set of assumptions of how student learning works.

Given a series of scalar measurements:

Z=\{z_1,z_2,...,z_n\}

where z_i denotes the natural logarithm of the time it took our student to answer question i,

We want to infer the scalar latent variables:

X=\{x_1,x_2,...,x_n\}

where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), where \tau is the time the student spent between questions, and \sigma^2_x is a hyperparameter that corresponds to the variance for this latent process (1).

Having the variance of the noise increase linearly with the time difference makes our equation consistent with a continuous Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience significant change in ability if they’re between questions in a quiz, but if it’s been a day since they’ve answered a question, they’re more likely to have taken the time to learn more about the concept, or, conversely, they might have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter, since it assumes a continuous-time model for a discrete set of observations.

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), where \sigma^2_z is a hyperparameter that corresponds to the variance of the Gaussian noise (2).


Kalman Filter blog post (v2)

The resulting model is pictured by the diagram above: the ability of the student at the previous question x_{k-1} determines the log response time z_{k-1}. The ability of the student at current question x_k is determined by the ability at the previous question x_{k-1}, and determines the current log response time, z_k.


Inference

The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given observations Z = \{z_1, z_2,...,z_k\}. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on syntax:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the k-1th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents our estimate of the variance given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of current latent state x_k having seen everything but our current observation z_{k}.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the sum of two independent Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the addition of two independent random variables can be expressed as a convolution of the two composite probability densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1}) p(x_k|x_{k-1}) dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that the likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the multiplication of two Gaussians is still a Gaussian, although unnormalized:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1})*p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:

x_{k|k}=x_{k|k-1}+C_k(z_k-x_{k|k-1})

C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.
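
To make the recursion concrete, here is a minimal Python sketch of a single filter step built directly from Equations 4 and 6. The variable names are ours; the open-source implementation linked below is the complete version.

def kalman_update(x_prev, var_prev, z_k, tau, var_x, var_z):
    """One recursive step of the hybrid Kalman filter.

    x_prev, var_prev : posterior mean/variance for x_{k-1}
    z_k              : log response time for question k
    tau              : time elapsed since the previous question
    var_x, var_z     : hyperparameters sigma^2_x and sigma^2_z
    """
    # Predict (Equation 4): the prior keeps the old mean but gains
    # variance proportional to the elapsed time.
    x_prior = x_prev
    var_prior = var_prev + tau * var_x

    # Update (Equation 6): blend the prior with the new observation.
    gain = var_prior / (var_prior + var_z)          # C_k
    x_post = x_prior + gain * (z_k - x_prior)
    var_post = 1.0 / (1.0 / var_prior + 1.0 / var_z)
    return x_post, var_post

Starting from an initial estimate (x_{0|0}, \sigma^2_{0|0}) and calling this once per response, with \tau set to the time since the previous response, yields the running ability estimates.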

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is on Knewton’s GitHub:

https://github.com/Knewton/Kalman

Authors

Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.


Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the added hands may contribute more headache than help, as even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos of either kangaroos or kittens, but you do not know which photo depicts what. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining whether a photo shows a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question j with difficulty \beta_j and student i with ability \theta_i, the probability that question j is answered correctly, x_j = 1, is

P(x_j = 1 | \theta_i, \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}}

For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include the labels L, a vector of abilities \theta, and a vector of difficulties \beta. To make sure we’re all on the same page: L represents all of the labels given by our labelers, and not every labeler needs to label every photo. Each ability \theta_i can range from negative infinity to positive infinity; the greater the ability, the more skilled the labeler. Each difficulty \beta_j ranges from zero to infinity; the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.

Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

What does all of this have to do with whether the photo is a kangaroo or a kitten? For a specific photo j, we can derive how likely the photo is to depict either adorable animal. That is, the posterior probability of the correct label z_j for photo j tells us how likely the photo is to depict each animal.

Because we know that the photo contains one of the two animals, we can designate kangaroo as 0 and kitten as 1. The posterior probability then ranges from 0 to 1, indicating how likely the photo is to contain either animal. If we assume that the correct label z_j is independent of both the abilities \theta and the difficulties \beta, the probability simplifies dramatically.

The posterior probability now consists of two components: a prior belief and an IRT-based probability. The first term, p(z_j), captures our prior knowledge about how many of the photos contain each animal. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use this parameter to encode that prior belief in the model. The second term uses our 1PL IRT probability to give the probability that a labeler gave a particular label (i.e., answered a test question) conditioned on the correct answer, the labeler’s ability, and the difficulty of the question.
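
One way to write this decomposition, with the product running over the labelers i who labeled photo j, is:

p(z_j | L, \theta, \beta) \propto p(z_j) \prod_i p(l_{ij} | z_j, \theta_i, \beta_j)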

Expectation-Maximization

Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to make our posterior probabilities as large as possible (denoting how certain we are overall of our guesses of the correct label) by adjusting our parameters \theta and \beta. More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta) with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, we can find our joint log-likelihood function:

Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.
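
To make the alternation concrete, here is a compact Python sketch of the procedure for a small example. The parameterization of the correctness probability (a logistic in \theta_i / \beta_j, so that difficulties stay positive), the initialization, and all names are our own choices in the spirit of the Whitehill et al. model cited in the references, not Knewton’s implementation, and the M-step uses scipy’s L-BFGS optimizer with numerical gradients in place of hand-rolled gradient ascent.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def em_label_aggregation(L, prior_kitten=0.5, n_iters=20):
    """L: (n_labelers, n_photos) array with entries 1 (kitten), 0 (kangaroo),
    or np.nan where a labeler skipped a photo. Returns posterior P(z_j = 1)
    per photo, plus ability and difficulty estimates."""
    n_labelers, n_photos = L.shape
    observed = ~np.isnan(L)
    theta = np.ones(n_labelers)    # start labelers slightly better than chance
    log_beta = np.zeros(n_photos)  # log-difficulties, so beta = exp(log_beta) > 0

    def p_correct(th, lb):
        # probability that a labeler's label matches the true class
        return np.clip(expit(th[:, None] / np.exp(lb)[None, :]), 1e-9, 1 - 1e-9)

    for _ in range(n_iters):
        # E-step: posterior over true labels, holding theta and beta fixed
        pc = p_correct(theta, log_beta)
        match1 = np.where(observed, np.where(L == 1, pc, 1 - pc), 1.0)  # if z_j = 1
        match0 = np.where(observed, np.where(L == 0, pc, 1 - pc), 1.0)  # if z_j = 0
        lik1 = prior_kitten * match1.prod(axis=0)
        lik0 = (1 - prior_kitten) * match0.prod(axis=0)
        post = lik1 / (lik1 + lik0)

        # M-step: maximize the expected joint log-likelihood Q(theta, beta),
        # weighting each label's contribution by the E-step posterior.
        def neg_Q(params):
            th, lb = params[:n_labelers], params[n_labelers:]
            pcur = p_correct(th, lb)
            ll1 = np.where(observed, np.where(L == 1, np.log(pcur), np.log(1 - pcur)), 0.0)
            ll0 = np.where(observed, np.where(L == 0, np.log(pcur), np.log(1 - pcur)), 0.0)
            return -(post * ll1.sum(axis=0) + (1 - post) * ll0.sum(axis=0)).sum()

        # finite-difference gradients are fine for a toy example; at scale you
        # would supply analytic gradients for the gradient ascent described above
        res = minimize(neg_Q, np.concatenate([theta, log_beta]), method="L-BFGS-B")
        theta, log_beta = res.x[:n_labelers], res.x[n_labelers:]

    return post, theta, np.exp(log_beta)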

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach outscores the majority approach by nearly 5% initially, before the two converge later. Additionally, as we increase the number of voters, the accuracy increases. Success!

While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

References
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and J. Movellan. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.

Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach*

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition include adding single-digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach*

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x^2 + _x + _.

The underlines are “holes” that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers one through 10, leading to 10^3= 1000 possible questions. The instructor may wish to further restrict the template to only permit quadratics with real, distinct roots.

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessments generator engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The below template is used to generate questions of the form _ +_ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9
    sum=x+y
  }
}

The question and answer are simply strings with variable names denoting the “holes.” Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with 2 decimal places, or even the set of Fibonacci numbers. Derived variables are defined by mathematical expressions.

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
}

Here, we are generating each coefficient (a, b, c) from the range [-10, 10]. However, the below table illustrates an issue with this template. For certain coefficients, the solutions can be integral, fractions, irrational numbers, or even imaginary numbers.

(a, b, c)          Solutions
(1, -10, 16)       2, 8
(-4, 7, -3)        0.75, 1
(3, 6, 2)          -1.577…, -0.422…
(1, 2, 1)          -1
(1, 4, 8)          -2 + 2i, -2 - 2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1
  }
}

In general, constraints are useful for narrowing the skill set covered by the template, and to ensure that instantiations of the template are sufficiently varied. In this example, requiring unique, integer roots is used to control difficulty level, while requiring gcd(a, b, c) = 1 ensures that no two questions will be multiples of one another.

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <m> minutes traveling <r> miles/hour?"
  answer is "<dist>miles"
  variables {
    m is an integer between 30 and 90 divisible by 10
    r is an integer between 40 and 65 divisible by 5
    dist = rate*time/60
    wrong1 = rate*time
    wrong2 = rate/time
    wrong3 = rate/time/60
  }
  wrong answers {
    "<wrong1>miles"
    "<wrong2>miles"
    "<wrong3>miles"
  }
}

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “rate/time/60” because she doesn’t know the correct distance formula, while another student might choose “rate*time” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from the sample sets and using the formulas to compute the derived variables

  2. If the variables satisfy all of the constraints, add it to the list of questions.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a large enough fraction of the sample space (the set of all possible questions, i.e. the cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can handle 100,000 iterations, roughly 1/1000 questions need to be valid. This isn’t too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.
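
As a concrete illustration, here is a Python sketch of the generate-and-test loop hard-coded for the quadratic-roots template above. The real engine is driven by the DSL and written in Scala (as described below); the a != 0 guard is an extra assumption of ours to avoid degenerate quadratics.

import math
import random

def generate_quadratic_questions(n_questions=100, max_iters=100_000, seed=0):
    rng = random.Random(seed)
    questions = []
    for _ in range(max_iters):
        if len(questions) >= n_questions:
            break
        # generated variables, drawn uniformly from their sample sets
        a, b, c = (rng.randint(-10, 10) for _ in range(3))
        if a == 0:
            continue  # extra guard so the expression stays quadratic
        # derived variables
        disc = b * b - 4 * a * c
        if disc <= 0:
            continue  # constraint: real, distinct roots
        sqrt_disc = math.isqrt(disc)
        if sqrt_disc * sqrt_disc != disc:
            continue  # discriminant must be a perfect square for integer roots
        num1, num2, den = -b + sqrt_disc, -b - sqrt_disc, 2 * a
        if num1 % den or num2 % den:
            continue  # constraint: integer roots
        if math.gcd(math.gcd(abs(a), abs(b)), abs(c)) != 1:
            continue  # constraint: no question is a multiple of another
        root1, root2 = num1 // den, num2 // den
        questions.append((f"{a}x^2 + {b}x + {c} = 0. Solve for x.",
                          f"x = {root1}, {root2}"))
    return questions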

It is very difficult to come up with a more efficient approach. For some problems, algorithms do exist to generate solutions (see Euclid’s method for Pythagorean triples), but for others it is mathematically impossible (see Hilbert’s tenth problem). In many cases, introducing heuristics may improve the random algorithm. For instance, it may be possible to identify a large chunk of the sample space that leads to solutions that are too large, non-integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala’s powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like “sqrt(x^3 + 5)” into my internal data model. While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map with a ParMap, and appending “.par” to the main program loop.

Here is a screenshot of the prototype in action.

*For examples of solution-oriented approaches in the literature, see https://act.org/research/researchers/reports/pdf/ACT_RR93-09.pdf (survey) and http://research.microsoft.com/en-us/um/people/sumitg/pubs/chi13.pdf (grade school math).

*For examples of template-based approaches in the literature, see http://www.eecs.berkeley.edu/~dsadigh/WESE12.pdf (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means To Measure in the Xitsonga Language

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic (in this case Engagement) and data from different sources within our products, and engineer tools to calculate that analytic. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, there is a need to analyze and validate the measure on a larger scale than the one initially used to develop it, which necessitates a feedback loop between the data analysis and data infrastructure components.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There can be many ways of quantifying this: time spent on the system, number of logins per week, time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that takes the student interactions and the recommendations into account and calculates the engagement analytic. Further, this is an example of an analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: The user’s intention is important as it can make an analytic more or less useful. For example, consider a student who did not do the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). We would have to treat this as a data point that is not useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might also be more efficient to infer an analytic from a readily available dataset than from a much richer but harder-to-obtain dataset that provides only a small boost in accuracy at a large cost in resources.

Computing Engagement

The overall view of the pipeline created for Engagement is shown in the figure below. The pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data, student learning/task sessions are inferred, and from those sessions we calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.

 Building the Analytics Pipeline

After the initial pipeline is engineered there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected. We need to know whether, on sample data, it produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools in order to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Generative Models for Description and Prediction

This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.

In this post, I will discuss an approach to conveying information about student behaviors using generative models.

A generative model is a description of how to generate data that is similar to the observed data.

Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.

Here are some examples of observations of correctness on practice questions from when each student started a course until today.

StudentA: “1111111110”

StudentB: “100011110111100111111011100110011”

StudentC: “111011110111111111111111111111111”

We could specify different generative models for each student. One possible generative model for student A could be:

Model #1 (Student A)

“print nine 1’s and then one 0”.

If we wanted to describe student B similarly the description would be:

Model #1 (Student B)

“print one 1 then three 0’s then four 1s then one 0 …”.

Models like Model #1 exactly preserve all of the information that was present in the student data but in doing so don’t reduce the complexity of the observations at all. Often, in communicating information about student behaviors, we seek to preserve the important information while reducing the complexity to communicate concisely.

A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.

Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.

Model #2:

“Flip a weighted coin that lands on heads with probability w. If it’s heads, print 1, otherwise print 0”.

In reporting to a teacher, I can say that the model that best fits the student’s history is w=0.9.

That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).

While the model parameters may be meaningful themselves (in this case they are the student’s overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w≈0.65. If I wanted to find the student most similar to StudentC (w≈0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (e.g., Levenshtein distance, which would yield StudentB as the closest student to StudentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.
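
Because Model #2 is so simple, both the fit and the comparison reduce to a few lines. Here is a sketch (function names are ours) that fits w for each student above and picks the student closest to StudentC using the Bernoulli KL divergence mentioned earlier.

import math

def fit_w(history):
    """Maximum-likelihood w for Model #2 is just the sample mean of the 0/1 stream."""
    bits = [int(c) for c in history]
    return sum(bits) / len(bits)

def bernoulli_kl(p, q, eps=1e-9):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

students = {
    "StudentA": "1111111110",
    "StudentB": "100011110111100111111011100110011",
    "StudentC": "111011110111111111111111111111111",
}
w = {name: fit_w(h) for name, h in students.items()}
# with the histories above, StudentA comes out closest, matching the discussion
closest_to_c = min((n for n in w if n != "StudentC"),
                   key=lambda n: bernoulli_kl(w["StudentC"], w[n]))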

In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*

In choosing Model #2 to convey my information, I’ve consciously made some decisions about what qualities of the data are important or not. For example this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but had its order reversed, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.

Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.

In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”

In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”

*For example, a common generative model used for summarizing text documents, the Latent Dirichlet Allocation (LDA) model, is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.

Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni [discussed in a previous post], Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question answered at that time: red dots are questions answered incorrectly, green dots are questions answered correctly, and the {y}-value of each dot is the question’s difficulty. The green line tracks our simulated student proficiency, and the blue line tracks the recovered student proficiency parameter given the questions the student has answered.

In this post, I will discuss three methods which we used to evaluate the performance of our algorithms and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.

Figure 1: A visual representation of our temporal IRT model.

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
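In code, this is a direct translation of the formula above (a small sketch, with the parameter vectors passed in as arrays):

```python
import numpy as np

def mse(theta_actual, theta_recovered):
    """Mean-squared error between the generating and recovered parameter vectors."""
    theta_actual = np.asarray(theta_actual, dtype=float)
    theta_recovered = np.asarray(theta_recovered, dtype=float)
    return float(np.mean((theta_actual - theta_recovered) ** 2))
```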

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations of the same data at different scales. Because the MSE looks at the difference between the two parameter vectors, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale as the actual ones. If they were, the line of best fit would have slope one and {y}-intercept {0}; in this case, however, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.
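The scale problem is easy to demonstrate with simulated numbers. In the toy sketch below (reusing the mse helper defined above), the recovered proficiencies are just the actual ones shrunk by a factor of 2/3: the MSE is clearly nonzero, while the correlation is a perfect 1.0 because the ordering is untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_actual = rng.normal(0.0, 1.0, size=500)   # simulated "true" proficiencies
theta_recovered = (2.0 / 3.0) * theta_actual     # same ordering, shrunken scale

print(mse(theta_actual, theta_recovered))                # clearly nonzero
print(np.corrcoef(theta_actual, theta_recovered)[0, 1])  # 1.0: ordering intact
```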

Figure 2: Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our data given the recovered parameters. At instantiation, our algorithms should have low log-likelihoods: the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but such data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
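A small sketch of both measures, following the definitions above, looks like this. It assumes the standard three-parameter logistic (3PL) form for the response probability as a stand-in for the model in Alex’s post, and it takes the parameter arrays to be aligned per datum (the proficiency of that datum’s student and the parameters of that datum’s question); all function names are illustrative.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic (3PL)
    model: proficiency theta, discrimination a, difficulty b, guessability c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def datum_likelihoods(responses, theta, a, b, c):
    """Pr(datum | parameters) for each (student, question) response."""
    p = p_correct(theta, a, b, c)
    return np.where(responses == 1, p, 1.0 - p)

def log_likelihood(responses, params):
    """Sum of log Pr(datum | parameters) over all responses."""
    return float(np.sum(np.log(datum_likelihoods(responses, *params))))

def kl_divergence(responses, actual_params, recovered_params):
    """D_KL(P || P') with P(i) = Pr(datum_i | theta_act) / N, as defined above."""
    n = len(responses)
    P = datum_likelihoods(responses, *actual_params) / n
    Q = datum_likelihoods(responses, *recovered_params) / n
    return float(np.sum(P * np.log(P / Q)))
```

Because the 1/N factors cancel inside the log ratio, the KL divergence depends only on how the per-datum likelihoods compare under the two parameter sets, which is what makes it insensitive to an overall rescaling.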

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about the fit of the parameters to the data, and not about exact convergence to the original values. So they often prove to be reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, Figure 3 suggests that our algorithm has likely converged (since the log-likelihood and KL divergence are no longer changing much) and gives us a reliable measure of model fit.

Figure 3: The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

Web-Based, Real-Time Data Monitoring

This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.

THE PROBLEM

How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?

These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.

At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.

For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.

CONSIDERATIONS

With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:

Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to log in at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.

Responsiveness: How fast can we refresh the data?

Redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.

Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?

Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.

Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.

IMPLEMENTATION

Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.

This resulted in three strategies for obtaining the data:

Auto-refresh: We query for updates every hour.

On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.

Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.

As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.

First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.

Secondly, speed. After all, drawing information from local cache is faster than having to gather it, and we were even able to cache significant portions of the web pages themselves.

Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. Because of the nature of the internal back-end references, querying on demand for, say, information about one course or user would require querying in sequence rather than in parallel.
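The polling-and-caching idea itself is simple; the sketch below shows the shape of it in Python (the real dashboard was Java, and fetch_fn here is only a stand-in for a call to one of the platform services). It covers the hourly auto-refresh and the on-demand refresh; the daily summary snapshots would be a separate scheduled job writing the cached values out to permanent storage.

```python
import threading
import time

class CachedMetric:
    """Illustrative cache for one piece of dashboard data."""

    def __init__(self, fetch_fn, refresh_seconds=3600):
        self.fetch_fn = fetch_fn                 # stand-in for a platform API call
        self.refresh_seconds = refresh_seconds   # auto-refresh: hourly by default
        self._value = None
        self._fetched_at = 0.0
        self._lock = threading.Lock()

    def get(self, force=False):
        """Serve from cache; re-fetch only if the data is stale or a user forces it."""
        with self._lock:
            stale = (time.time() - self._fetched_at) > self.refresh_seconds
            if force or stale or self._value is None:
                self._value = self.fetch_fn()
                self._fetched_at = time.time()
            return self._value
```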

TECHNOLOGIES

Given that most of the Knewton platform is written in Java, and the non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward. Admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users while parallelizing refreshes.

The webpage itself was run using an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and have run through a cycle of configurations, I would have given this choice more consideration — either by looking at different server technologies or different configuration methods. Compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLs for both webpages and resource files ended up taking notably more effort than expected with this configuration style. This challenge was compounded by the use of JSP (JavaServer Pages, i.e., Java-embedded HTML) files, which handled the UI and CSS better than the Java-generated HTML in the first running version of the dashboard.

The interaction within the system can be represented in the diagram below:

PERFORMANCE OPTIMIZATIONS

While testing functionality, I ran the code on a local machine against a very small set of data (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything was running and that we were gathering the expected data. As a result, in our first run against a bigger data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).

To improve performance, we took a few steps:

Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance as the other changes, they definitely reduced the memory footprint.

Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.

Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.

These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.
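The delta-refresh and lazy-loading tweaks, in particular, boil down to a small amount of bookkeeping. A rough sketch (Python for brevity; fetch_updated_since and fetch_details are hypothetical stand-ins for platform calls, and not every real service supported a "modified since" query):

```python
import time

class CourseDataCache:
    """Illustrative cache combining delta refreshes with lazy loading."""

    def __init__(self, fetch_updated_since, fetch_details):
        self.fetch_updated_since = fetch_updated_since
        self.fetch_details = fetch_details
        self.summary = {}      # course_id -> summary row, kept fresh by the hourly refresh
        self.details = {}      # course_id -> detail blob, only loaded on demand
        self.last_refresh = 0.0

    def refresh(self):
        """Delta refresh: only pull rows modified since the last successful refresh."""
        for course_id, row in self.fetch_updated_since(self.last_refresh).items():
            self.summary[course_id] = row
        self.last_refresh = time.time()

    def get_details(self, course_id):
        """Lazy loading: expensive detail data is fetched the first time it is
        requested (hence the short wait for the user), then cached afterward."""
        if course_id not in self.details:
            self.details[course_id] = self.fetch_details(course_id)
        return self.details[course_id]
```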

While the server sent out data in HTML tables, data manipulations were left to JavaScript on the client side. I started with the TableFilter library (tablefilter.free.fr); however, once we ran the dashboard against more realistically sized datasets, we hit the ceiling on this library’s ability to sort and search through large tables. After being pointed in a new direction by a fellow Knerd, we switched to the DataTables JavaScript library which, in addition to having much better performance, also offered a larger variety of plugins and extensions. While JavaScript will work for the near term, and using pre-existing libraries allowed us to get off the ground quickly, processing of huge datasets will ultimately need to move to the back end as the number of Knewton users grows.

All in all, I was able to get the dashboard up and running in production just a few minutes before the final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments and for several hundred thousand students across the country!

Knerdlings: Improving Courses through Simulation

N choose K’s inaugural post is brought to you by the Knewton Adaptive Learning Team (Ashley Miller, Christopher Tang, David Kuntz,  George Davis, Jesse St. Charles, John Davies, Sidharth Gupta)

No two students are identical—they learn and forget at different rates, come from different educational backgrounds, and have different intellectual capabilities, attention spans, and modes of learning. As a result, designing a real-time recommendation engine that is sensitive to the idiosyncrasies of each student is an immensely complex task.

Here at Knewton we are addressing this challenge head-on, using educational path planning technologies and advanced models of student ability. These technologies and models ensure that, no matter how different students are when they enter our system, each of them progresses through the course material in a way that maximizes his or her individual rate of learning. In our ongoing research and experimentation to improve these student models, our latest and most diabolical step has been to create our very own army of virtual abecedarians (Knerdlings), who assist in the testing of our recommendation engine.

Testing Our Recommendation Engine

One good way to evaluate the courses that make use of our recommendation engine is of course to collect and analyze real data generated by real students. These analyses are critical to maintaining and improving the quality of our courses for future students, but real students take real time to complete a course. So, to enable ourselves to research and develop in the interim, we needed a flexible and automated way of generating student data. With this goal in mind, we engineered a virtual student named “Isaak” and used him as the prototype for a whole army of Knerdlings.

In addition to meeting the goal of generating usable data, there are numerous advantages that this strategy offers us. First, it gives us an efficient way to simulate taking our course over and over (and over and over) again. Second, we gain fine-grained control over the many possible kinds of students who could use our course in the future. Next, we can induce particular behaviors in students, and simulate a broad range of student abilities and learning styles. Last but not least, we can be as realistic as we need to be, and nuanced in the exact ways we expect a real-world student would actually be. With these thoughts in mind, we designed the inner workings of Isaak’s mind.

Modeling Student Behavior

The core of Isaak’s mind is a graphical (that’s “graph-based,” not “made of pretty pictures”) model that keeps track of the current state of Isaak’s latent abilities—like his proficiency in “Solving 30-60-90 triangle problems,” or in “Understanding trigonometry”. If Isaak were a real student in a Knewton-powered course, he would be exposed to instructional content and take assessments, and his performance on these assessments would play a role in determining his path through the course. Each student’s latent abilities determine how well the student performs on any assessment of a given topic or concept—in technical parlance, latent ability contributes to a “generative model of performance”. To model student behavior more accurately, Isaak’s latent ability levels also need to improve over time as he is “exposed” to more material, and to degrade without proper reinforcement.

In short, Isaak needs to be able to learn, and (regrettably) needs to be able to forget.  Inspired by research that started with Hermann Ebbinghaus’s work on memory retention and learning curves, we model the changes in Isaak’s latent ability, while learning and forgetting, using exponential growth and decay curves. Isaak doesn’t actually do any learning (or forgetting), of course—he just receives a “bump” in his virtual ability level for a topic each time we expose him to content associated with that topic. (The precise nature of that “bump” depends on the functional form of the learning curve.) Likewise, if Isaak is not exposed to some other topic, he might forget that topic over time. The forgetting curve itself that governs Isaak’s rate of retention is roughly described by the following formula:

R = e^{-t/S}, where R is memory retention, S is the relative strength of memory, and t is time.

By integrating this process into the simulation, we can capture the way in which a student’s knowledge waxes and wanes, depending on how and when they are exposed to various content. The resulting simulation isn’t a perfect recreation of real student behavior, but, in addition to being really cool, the corps of Knerdlings generated in the simulation lets us test the algorithms that govern a student’s flow through the course quite stringently, while maintaining enough simplicity for us to understand the results.
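A toy version of these dynamics is easy to write down. The sketch below gives a Knerdling an ability bump each hour he is exposed to a topic and decays his ability exponentially (Ebbinghaus-style, retention proportional to exp(-t/S)) when he is not. The diminishing-returns bump rule and all parameter values here are illustrative assumptions, not Knewton’s actual model.

```python
import numpy as np

def simulate_knerdling(exposure_hours, total_hours, bump=0.3, strength=40.0):
    """Toy learning/forgetting trace for one topic's latent ability."""
    exposure_hours = set(exposure_hours)
    ability = np.zeros(total_hours)
    level = 0.0
    for t in range(total_hours):
        if t in exposure_hours:
            level += bump * (1.0 - level)       # diminishing-returns learning bump
        else:
            level *= np.exp(-1.0 / strength)    # hourly forgetting with strength S
        ability[t] = level
    return ability

# A Knerdling studies a topic during hours 5-14, reviews it around hour 60,
# then stops; `trace` shows his ability waxing and waning over five days.
trace = simulate_knerdling(list(range(5, 15)) + [60, 61, 62], total_hours=120)
```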

Visualizing the Data

Once we were ready to set our army of Knerdlings loose in our courses, we needed to make sense of the copious amount of data they were about to generate. We did this by summarizing the results in a visualization that lets us understand the behavior of our virtual students graphically (this time, that’s “graph-based” and “in a pretty picture”). The visualization lets us watch latent ability levels change over time, as individual Knerdlings interact with a Knewton course, in a way that’s much easier to grasp than, say, a spreadsheet containing the exact values of a student’s ability on every concept for each hour that they were enrolled. We have nothing against spreadsheets here at Knewton—they have their place—but problems and strange behavior jump out much more clearly to us visually.

Creating Better Algorithms

The visualization tool will also have applications outside of Knewton HQ. In the hands of an instructor, for example, it could help track the progress of students, and allow the instructor to adjust her course plan to account for her students’ likely learning states. Say that a student has passed a sizable number of the units in the course, at a point well into the semester. The student has most likely forgotten some of the topics that were covered earlier in the semester and not revisited since. Using this visualization, along with the Ebbinghaus-based model of learning and forgetting, the instructor now can see at a glance what parts of the course the student is most likely to have forgotten—even what parts the student is likely to forget in the future—so that she can remediate her student appropriately.

So far, this graphical tool and its associated models are only operational internally, but those models are already helping us create better algorithms, which in turn are going to start moving students through Knewton-powered courses more efficiently than ever before, during this next academic year.