Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni [discussed in a previous post], Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficult, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Dot represent the question which was answered at that time. Red dots represent incorrectly answered questions while green dots represent correctly answered questions. The {y}-value of the dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency paramter given the questions that they have answered.

In this post, I will discuss three methods which we used to evalute the performance of our algorithms and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Liebler divergence.

A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario–the students have average proficiences and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice-versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations at different scales of the same data. Because the MSE looks at the difference between the two parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}. but the recovered parameters are not on the same scale. If that were the case, then the line of best fit would have slope one and {y}-intercept {0}. However, in this case, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At instantiation, our algorithms should have low log-likelihoods–the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but this data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question and the {\theta} paramters are the student proficiency and question discrimination, difficult, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about fit of the parameters to the data, and not the exact convergence to the original. So they often prove to be reliable measures of fit to judge our algorithms on. For instance, even though the MSE of our recovered parameters is large, Figure 2 shows us that our algorithm has likely converged (since the log-likelihood and KL divergences are not changing much) and give us a reliable measure of model fit.

The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

Web-Based, Real-Time Data Monitoring

This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.

THE PROBLEM

How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?

These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.

At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.

For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.

CONSIDERATIONS

With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:

Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to login at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.

Responsiveness: How fast can we refresh the data?

Redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.

Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?

Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.

Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.

IMPLEMENTATION

Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.

This resulted in three strategies for obtaining the data:

Auto-refresh: We query for updates every hour.

On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.

Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.

As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.

First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.

Secondly, speed. After all, drawing information from local cache is faster than having to gather it, and we were even able to cache significant portions of the web pages themselves.

Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. Thus, the nature of the internal back-end references meant that querying on-demand for say, information about one course or user, would require querying in sequence, rather than querying in parallel.

TECHNOLOGIES

Given that most of the Knewton platform is written in Java, and non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward; admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users during parallelizing refreshes.

The webpage itself was run using an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and having run through a cycle of configurations, I likely would have given this choice more consideration — either by looking at different server technologies or different configuration methods. Especially compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLS both for webpage and resource files ended up taking notably more effort than expected with this configuration style. This challenge was compounded by the use of JSP (Java Server Page = Java-embedded HTML) files which facilitated UI and CSS better than the Java-generated HTML that I had in the first running dashboard version.

The interaction within the system can be represented in the diagram below:

PERFORMANCE OPTIMIZATIONS

While testing functionality, I ran the code on a local machine against a very small set of data (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything was running and that we were gathering the expected data. Consequently, in our first run against a bigger sized data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).

To improve performance, we took a few steps:

Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance, they definitely reduced the memory footprint.

Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.

Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.

These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.

While the server sent out data in HTML tables, data manipulations were left to JavaScript on the client side. I started with the TableFilter library (tablefilter.free.fr); however, once running the dashboard against more realistically sized datasets, we hit the ceiling on the ability of this library to sort and search through large tables. After being pointed in a new direction by a fellow Knerd, we switched to the DataTables javascript library which, in addition to having much better performance, also offered a larger variety of available plugins and extensions. While JavaScript will work for the near-term, and using the pre-existing libraries allowed us to get off the ground quickly, processing of huge datasets will ultimately need to be moved to the back end as the number of Knewton users grows.

All in all, I was able to get the dashboard up and running into production just a few minutes before my final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments and for several hundred thousand students across the country!

Knerdlings: Improving Courses through Simulation

N choose K’s inaugural post is brought to you by the Knewton Adaptive Learning Team (Ashley Miller, Christopher Tang, David Kuntz,  George Davis, Jesse St. Charles, John Davies, Sidharth Gupta)

No two students are identical—they learn and forget at different rates, come from different educational backgrounds, and have different intellectual capabilities, attention spans, and modes of learning. As a result, designing a real-time recommendation engine that is sensitive to the idiosyncrasies of each student is an immensely complex task.

Here at Knewton we are addressing this challenge head-on, using educational path planning technologies and advanced models of student ability. These technologies and models ensure that, no matter how different students are when they enter our system, each of them progresses through the course material in a way that maximizes his or her individual rate of learning. In our ongoing research and experimentation to improve these student models, our latest and most diabolical step has been to create our very own army of virtual abecedarians (Knerdlings), who assist in the testing of our recommendation engine.

Testing Our Recommendation Engine

One good way to evaluate the courses that make use of our recommendation engine is of course to collect and analyze real data generated by real students. These analyses are critical to maintaining and improving the quality of our courses for future students, but real students take real time to complete a course. So, to enable ourselves to research and develop in the interim, we needed a flexible and automated way of generating student data. With this goal in mind, we engineered a virtual student named “Isaak” and used him as the prototype for a whole army of Knerdlings.

In addition to meeting the goal of generating usable data, there are numerous advantages that this strategy offers us. First, it gives us an efficient way to simulate taking our course over and over (and over and over) again. Second, we gain fine-grained control over the many possible kinds of students who could use our course in the future. Next, we can induce particular behaviors in students, and simulate a broad range of student abilities and learning styles. Last but not least, we can be as realistic as we need to be, and nuanced in the exact ways we expect a real-world student would actually be. With these thoughts in mind, we designed the inner workings of Isaak’s mind.

Modeling Student Behavior

The core of Isaak’s mind is a graphical (that’s “graph-based,” not “made of pretty pictures”) model that keeps track of the current state of Isaak’s latent abilities—like his proficiency in “Solving 30-60-90 triangle problems,” or in “Understanding trigonometry”. If Isaak were a real student in a Knewton-powered course, he would be exposed to instructional content and take assessments, and his performance on these assessments would play a role in determining his path through the course. Each student’s latent abilities determine how well the student performs on any assessment of a given topic or concept—in technical parlance, latent ability contributes to a “generative model of performance”. To model student behavior more accurately, Isaak’s latent ability levels also need to improve over time as he is “exposed” to more material, and to degrade without proper reinforcement.

In short, Isaak needs to be able to learn, and (regrettably) needs to be able to forget.  Inspired by research that started with Hermann Ebbinghaus’s work on memory retention and learning curves, we model the changes in Isaak’s latent ability, while learning and forgetting, using exponential growth and decay curves. Isaak doesn’t actually do any learning (or forgetting), of course—he just receives a “bump” in his virtual ability level for a topic each time we expose him to content associated with that topic. (The precise nature of that “bump” depends on the functional form of the learning curve.) Likewise, if Isaak is not exposed to some other topic, he might forget that topic over time. The forgetting curve itself that governs Isaak’s rate of retention is roughly described by the following formula:

,  where R is memory retention, S is the relative strength of memory, and t is time.

By integrating this process into the simulation, we can capture the way in which a student’s knowledge waxes and wanes, depending on how and when they are exposed to various content. The resulting simulation isn’t a perfect recreation of real student behavior, but, in addition to being really cool, the corps of Knerdlings generated in the simulation lets us test the algorithms that govern a student’s flow through the course quite stringently, while maintaining enough simplicity for us to understand the results.

Visualizing the Data

Once we were ready to set our army of Knerdlings loose in our courses, we needed to make sense of the copious amount of data they were about to generate. We did this by summarizing the results in a visualization that lets us understand the behavior of our virtual students graphically (this time, that’s “graph-based” and “in a pretty picture”). The visualization lets us watch latent ability levels change over time, as individual Knerdlings interact with a Knewton course, in a way that’s much easier to grasp than, say, a spreadsheet containing the exact values of a student’s ability on every concept for each hour that they were enrolled. We have nothing against spreadsheets here at Knewton—they have their place—but problems and strange behavior jump out much more clearly to us visually.

Creating Better Algorithms

The visualization tool will also have applications outside of Knewton HQ. In the hands of an instructor, for example, it could help track the progress of students, and allow the instructor to adjust her course plan to account for her students’ likely learning states. Say that a student has passed a sizable number of the units in the course, at a point well into the semester. The student has most likely forgotten some of the topics that were covered earlier in the semester and not revisited since. Using this visualization, along with the Ebbinghaus-based model of learning and forgetting, the instructor now can see at a glance what parts of the course the student is most likely to have forgotten—even what parts the student is likely to forget in the future—so that she can remediate her student appropriately.

So far, this graphical tool and its associated models are only operational internally, but those models are already helping us create better algorithms, which in turn are going to start moving students through Knewton-powered courses more efficiently than ever before, during this next academic year.