Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni [discussed in a previous post], Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficult, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Dot represent the question which was answered at that time. Red dots represent incorrectly answered questions while green dots represent correctly answered questions. The {y}-value of the dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency paramter given the questions that they have answered.

In this post, I will discuss three methods which we used to evalute the performance of our algorithms and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Liebler divergence.

A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario–the students have average proficiences and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice-versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations at different scales of the same data. Because the MSE looks at the difference between the two parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}. but the recovered parameters are not on the same scale. If that were the case, then the line of best fit would have slope one and {y}-intercept {0}. However, in this case, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At instantiation, our algorithms should have low log-likelihoods–the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but this data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question and the {\theta} paramters are the student proficiency and question discrimination, difficult, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about fit of the parameters to the data, and not the exact convergence to the original. So they often prove to be reliable measures of fit to judge our algorithms on. For instance, even though the MSE of our recovered parameters is large, Figure 2 shows us that our algorithm has likely converged (since the log-likelihood and KL divergences are not changing much) and give us a reliable measure of model fit.

The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency

Imagine for a second that you’re teaching a math remediation course full of fourth graders. You’ve just administered a test with 10 questions. Of those 10 questions, two questions are trivial, two are incredibly hard, and the rest are equally difficult. Now imagine that two of your students take this test and answer nine of the 10 questions correctly. The first student answers an easy question incorrectly, while the second answers a hard question incorrectly. How would you try to identify the student with higher ability?

Under a traditional grading approach, you would assign both students a score of 90 out of 100, grant both of them an A, and move on to the next test. This approach illustrates a key problem with measuring student ability via testing instruments: test questions do not have uniform characteristics. So how can we measure student ability while accounting for differences in questions?

Item response theory (IRT) attempts to model student ability using question level performance instead of aggregate test level performance. Instead of assuming all questions contribute equally to our understanding of a student’s abilities, IRT provides a more nuanced view on the information each question provides about a student. What kind of features can a question have? Let’s consider some examples.

First, think back to an exam you have previously taken. Sometimes you breeze through the first section, work through a second section of questions, then battle with a final section until the exam ends. In the traditional grading paradigm described earlier, a correct answer on the first section would count just as much as a correct answer on the final section, despite the fact that the first section is easier than the last! Similarly, a student demonstrates greater ability as she answers harder questions correctly; the traditional grading scheme, however, completely ignores each question’s difficulty when grading students!

The one-parameter logistic (1PL) IRT model attempts to address this by allowing each question to have an independent difficulty variable. It models the probability of a correct answer using the following logistic function:where j represents the question of interest, theta is the current student’s ability, and beta is item j’s difficulty. This function is also known as the item response function. We can examine its plot (with different values of beta) below
to confirm a couple of things:

  1. For a given ability level, the probability of a correct answer increases as item difficulty decreases. It follows that, between two questions, the question with a lower beta value is easier.
  2. Similarly, for a given question difficulty level, the probability of a correct answer increases as student ability increases. In fact, the curves displayed above take a sigmoidal form, thus implying that the probability of a correct answer increases monotonically as student ability increases.

Now consider using the 1PL model to analyze test responses provided by a group of students. If one student answers one question, we can only draw information about that student’s ability from the first question. Now imagine a second student answers the same question as well as a second question, as illustrated below.
We immediately have the following additional information about both students and both test questions:

  1. We now know more about student 2’s ability relative to student 1 based on student 2’s answer to the first question. For example, if student 1 answered correctly and student 2 answered incorrectly we know that student 1’s ability is greater than student 2’s ability.
  1. We also know more about the first question’s difficulty after student 2 answered the second question. Continuing the example from above, if student 2 answers the second question correctly, we know that Q1 likely has a higher difficulty than Q2 does.
  1. Most importantly, however, we now know more about the first student! Continuing the example even further, we now know that Q1 is more difficult than initially expected. Student 1 answered the first question correctly, suggesting that student 1 has greater ability than we initially estimated!

This form of message passing via item parameters is the key distinction between IRT’s estimates of student ability and other naive approaches (like the grading scheme described earlier). Interestingly, it also suggests that one could develop an online version of IRT that updates ability estimates as more questions and answers arrive!

But let’s not get ahead of ourselves. Instead, let’s continue to develop item response theory by considering the fact that students of all ability levels might have the same probability of correctly answering a poorly-written question. When discussing IRT models, we say that these questions have a low discrimination value, since they do not discriminate between students of high- or low-ability. Ideally, a good question (i.e. one with a high discrimination) will maximally separate students into two groups: those with the ability to answer correctly, and those without.

This gets at an important point about test questions: some questions do a better job than others of distinguishing between students of similar abilities. The two-parameter logistic (2PL) IRT model incorporates this idea by attempting to model each item’s level of discrimination between high- and low-ability students. This can be expressed as a simple tweak to the 1PL:

How does the addition of alpha, the item discrimination parameter, affect our model? As above, we can take a look at the item response function while changing alpha a bit:

As previously stated, items with high discrimination values can distinguish between students of similar ability. If we’re attempting to compare students with abilities near zero, a higher discrimination sharply decreases the probability that a student with ability < 0 will answer correctly, and increases the probability that a student with ability > 0 will answer correctly.

We can even go a step further here, and state that an adaptive test could use a bank of high-discrimination questions of varying difficulty to optimally identify a student’s abilities. As a student answers each of these high-discrimination questions, we could choose a harder question if the student answers correctly (and vice versa). In fact, one could even identify the student’s exact ability level via binary search, if the student is willing to work through a test bank with an infinite number of high-discrimination questions with varying difficulty!

Of course, the above scenario is not completely true to reality. Sometimes students will identify the correct answer by simply guessing! We know that answers can result from concept mastery or filling in your Scantron like a Christmas tree. Additionally, students can increase their odds of guessing a question correctly by ignoring answers that are obviously wrong. We can thus model each question’s “guess-ability” with the three-parameter logistic (3PL) IRT model. The 3PL’s item response function looks like this:

where chi represents the item’s “pseudoguess” value. Chi is not considered a pure guessing value, since students can use some strategy or knowledge to eliminate bad guesses. Thus, while a “pure guess” would be the reciprocal of the number of options (i.e. a student has a one-in-four chance of guessing the answer to a multiple-choice question with four options), those odds may increase if the student manages to eliminate an answer (i.e. that same student increases her guessing odds to one-in-three if she knows one option isn’t correct).

As before, let’s take a look at how the pseudoguess parameter affects the item response function curve:

Note that students of low ability now have a higher probability of guessing the question’s answer. This is also clear from the 3PL’s item response function (chi is an additive term and the second term is non-negative, so the probability of answering correctly is at least as high as chi). Note that there are a few general concerns in the IRT literature regarding the 3PL, especially regarding whether an item’s “guessability” is instead a part of a student’s “testing wisdom,” which arguably represents some kind of student ability.

Regardless, at Knewton we’ve found IRT models to be extremely helpful when trying to understand our students’ abilities by examining their test performance.

References

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.
Kim, J.S., Bolt, D (2007). “Estimating Item Response Theory Models Using Markov Chain Monte Carlo Methods.” Educational Measurement: Issues and Practices 38 (51).
Sheng, Y (2008). “Markov Chain Monte Carlo Estimation of Normal Ogive IRT Models in MATLAB.” Journal of Statistical Software 25 (8).

Also, thanks to Jesse St. Charles, George Davis, and Christina Yu for their helpful feedback on this post!