Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. It has numerous applications in finance and engineering, and in this blog post, we will show how it can be applied to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.

Model

To begin, we are going to describe our inference problem and a model of the system – a set of assumptions about how student learning works.

Given a series of scalar measurements:

Z=\{z_1,z_2,...,z_n\}

where z_i denotes the natural logarithm of the time it took our student to answer question i.

We want to infer the scalar latent variables:

X=\{x_1,x_2,...,x_n\}

where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), \tau is the time elapsed between questions k-1 and k, and \sigma^2_x is a hyperparameter that corresponds to the variance of this latent process (1).

Having the variance of the noise increase linearly with the elapsed time makes our equation consistent with a continuous-time Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience a significant change in ability between consecutive questions in a quiz, but if it’s been a day since they last answered a question, they’re more likely to have taken the time to learn more about the concept or, conversely, to have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter: it assumes a continuous-time model for a discrete set of observations.

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), and \sigma^2_z is a hyperparameter that corresponds to the variance of the observation noise (2).


[Diagram: the graphical model relating student abilities and log response times]

The resulting model is pictured in the diagram above: the ability of the student at the previous question, x_{k-1}, determines the log response time z_{k-1}. The ability of the student at the current question, x_k, is determined by the ability at the previous question, x_{k-1}, and in turn determines the current log response time, z_k.
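To make the model concrete, here is a minimal Python sketch of equations 1 and 2. It is an illustration only, not Knewton’s production code; the hyperparameter values and gap times are made up:

```python
import numpy as np

def simulate_student(taus, sigma2_x, sigma2_z, x0=0.0, seed=0):
    """Simulate abilities x_k (equation 1) and log response times z_k (equation 2)."""
    rng = np.random.default_rng(seed)
    xs, zs = [], []
    x = x0
    for tau in taus:
        # Gaussian random walk: the step variance grows linearly with the gap tau
        x = x + rng.normal(0.0, np.sqrt(tau * sigma2_x))
        xs.append(x)
        # The observed log response time is the ability plus observation noise
        zs.append(x + rng.normal(0.0, np.sqrt(sigma2_z)))
    return np.array(xs), np.array(zs)

# Example: five questions, with a day-long gap (in minutes) before the last one
xs, zs = simulate_student(taus=[1, 1, 1, 1, 1440], sigma2_x=0.01, sigma2_z=0.5)
```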


Inference

The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given the observations \{z_1, z_2,...,z_k\} seen so far. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on notation:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the (k-1)th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents the variance of that estimate given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of the current latent state x_k having seen everything but our current observation z_k.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the sum of two independent Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the sum of two independent random variables is the convolution of their individual densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1}) p(x_k|x_{k-1}) dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)
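In code, equation 4 becomes the “predict” step of the filter. The following is a minimal sketch, with function and variable names of our own choosing:

```python
def predict(x_post, var_post, tau, sigma2_x):
    """Equation 4: propagate the posterior for x_{k-1} into a prior for x_k."""
    x_prior = x_post                       # x_{k|k-1} = x_{k-1|k-1}
    var_prior = var_post + tau * sigma2_x  # sigma^2_{k|k-1} = sigma^2_{k-1|k-1} + tau * sigma^2_x
    return x_prior, var_prior
```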

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k, and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that the likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the product of two Gaussian densities is still a Gaussian, although an unnormalized one:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1}) \, p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:

x_{k|k}=x_{k|k-1}+C_k(z_k-x_{k|k-1})

C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.
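Putting equations 4 and 6 together gives the complete recursion. The sketch below is a simplified illustration of our own (see the open-source implementation linked below for the real thing); it reuses the simulated zs and gap times from the earlier snippet:

```python
def kalman_step(x_post, var_post, z, tau, sigma2_x, sigma2_z):
    """One recursive step: prior via equation 4, posterior via equation 6."""
    # Predict (equation 4)
    x_prior = x_post
    var_prior = var_post + tau * sigma2_x
    # Update (equation 6)
    gain = var_prior / (var_prior + sigma2_z)           # C_k
    x_new = x_prior + gain * (z - x_prior)              # x_{k|k}
    var_new = 1.0 / (1.0 / var_prior + 1.0 / sigma2_z)  # sigma^2_{k|k}
    return x_new, var_new

# Filter the simulated log response times from the earlier sketch
x_est, var_est = 0.0, 1.0  # initial guess for the student's ability
for z, tau in zip(zs, [1, 1, 1, 1, 1440]):
    x_est, var_est = kalman_step(x_est, var_est, z, tau, sigma2_x=0.01, sigma2_z=0.5)
```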

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is on Knewton’s GitHub:

https://github.com/Knewton/Kalman

Authors

Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.


Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni discussed in a previous post, Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question answered at that time. Red dots represent incorrectly answered questions while green dots represent correctly answered questions. The {y}-value of a dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions the student has answered.

In this post, I will discuss three methods we used to evaluate the performance of our algorithms, along with their relative strengths and weaknesses: the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.

Figure 1: A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.
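As a concrete sketch, here is how such a simulation might look, assuming a 3PL-style response model (this parameterization is our assumption for illustration; the class size and parameter distributions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def response_prob(theta, disc, diff, guess):
    # 3PL-style probability of a correct response (assumed parameterization)
    return guess + (1.0 - guess) / (1.0 + np.exp(-disc * (theta - diff)))

# Hypothetical class: 100 students answering 20 questions
theta = rng.normal(0.0, 1.0, size=(100, 1))   # student proficiencies
disc = rng.lognormal(0.0, 0.3, size=(1, 20))  # question discriminations
diff = rng.normal(0.0, 1.0, size=(1, 20))     # question difficulties
guess = np.full((1, 20), 0.25)                # question guessabilities

# Simulate a 100 x 20 matrix of correct (1) / incorrect (0) responses
responses = (rng.random((100, 20)) < response_prob(theta, disc, diff, guess)).astype(int)
```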

One method of quantifying the accuracy of estimates is the mean-squared error (MSE): the mean of the squared differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} is the vector of parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
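In code, the MSE is a one-liner (a minimal numpy sketch; theta_act and theta_rec stand for the vectors above):

```python
import numpy as np

def mse(theta_act, theta_rec):
    """Mean of the squared differences between actual and recovered parameters."""
    return np.mean((np.asarray(theta_act) - np.asarray(theta_rec)) ** 2)
```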

While the MSE is a good indicator of accuracy in many settings, it has problems when models have multiple solutions at different scales. Let’s see why through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations of the same data at different scales. Because the MSE looks at the difference between the two sets of parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale as the actual ones. If they were, the line of best fit would have slope one and {y}-intercept {0}. In this case, however, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Figure 2: Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy: it tells us the log-probability of our data given our recovered parameters. At instantiation, our algorithms should have low log-likelihoods – the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer it correctly, then maximizing the log-likelihood will lead to marking this very hard question as easier than it actually is. This, of course, could be solved with more data from large numbers of extremely proficient students, but such data is often hard to come by.

Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
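As a concrete sketch, the per-datum distributions {P} and {P'} and the KL divergence can be computed from the simulated responses above, reusing the 3PL-style response_prob from the earlier snippet (again, an assumed parameterization for illustration):

```python
import numpy as np

def datum_probs(responses, theta, disc, diff, guess):
    """P(datum) = Pr(datum | theta) / (# of data), as defined above."""
    p_correct = response_prob(theta, disc, diff, guess)
    likelihoods = np.where(responses == 1, p_correct, 1.0 - p_correct)
    return likelihoods.ravel() / likelihoods.size

def kl_divergence(p, p_prime):
    """D_KL(P || P') = sum_i P(i) * ln(P(i) / P'(i))."""
    mask = p > 0  # a zero-probability datum contributes nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / p_prime[mask]))
```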

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about how well the parameters fit the data, not about exact convergence to the original parameters. So they often prove to be reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, Figure 3 shows us that our algorithm has likely converged (the log-likelihood and KL divergence are no longer changing much) and gives us a reliable measure of model fit.

Figure 3: The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.