Code Swarm Post Git: Data Visualization

The above visualization shows the history of commits for the Knewton Adaptive Learning Platform. A commit occurs when a developer alters the code and transfers the changes into our central git repository. Each file in a commit is represented by a color-coded blob which indicates the type of file committed. Each time a developer commits a file, the algorithm brings the file closer to the user’s blob; this proximity reflects the associativity between code files and users as well as the relationships among users themselves. The histogram at the bottom displays the frequency of commits over time.

This visualization can also be referred to as a “code swarm”, and is an example of “organic information visualization” (a phrase coined by Ben Fry), that is, any data visualization that allows the elements to interact in a dynamic and unpredictable fashion. The benefit of such a visualization is that it brings the relationships between different project components into relief. The viewer can also observe the evolution of a project over time.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni discussed in a previous post, Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question which was answered at that time. Red dots represent incorrectly answered questions while green dots represent correctly answered questions. The {y}-value of the dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions that they have answered.

In this post, I will discuss three methods which we used to evaluate the performance of our algorithms and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.


A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
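In code, the MSE is a one-line loop over the parameter vectors. Here is a minimal Java sketch of the formula above (the class and method names are ours, for illustration only):

```java
// Minimal sketch: mean-squared error between the actual parameter vector
// used to generate the data and the recovered one. Names are illustrative.
class Mse {
    public static double mse(double[] actual, double[] recovered) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - recovered[i];
            sum += diff * diff;        // squared error for parameter i
        }
        return sum / actual.length;    // mean over all N parameters
    }
}
```

For example, with actual parameters (1.0, -0.5, 2.0) and recovered parameters (0.8, -0.4, 1.7), the squared errors are 0.04, 0.01, and 0.09, giving an MSE of about 0.047.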

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations of the same data at different scales. Because the MSE looks at the difference between the two parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale as the actual ones. If they were, the line of best fit would have slope one and {y}-intercept {0}. However, in this case, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.


Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At instantiation, our algorithms should have low log-likelihoods–the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm.

But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but this data is often hard to come by.

Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )


\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question and the {\theta} parameters are the student proficiency and question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
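As a concrete illustration, the discrete KL divergence defined above can be sketched in a few lines of Java (naming is ours; we assume {P'} assigns positive probability wherever {P} does):

```java
// Sketch of D_KL(P || P') for discrete distributions given as arrays.
// Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
class Kl {
    public static double divergence(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * Math.log(p[i] / q[i]);  // P(i) * ln(P(i)/P'(i))
            }
        }
        return d;
    }
}
```

Identical distributions give a divergence of exactly zero, and, per Gibbs’ inequality, any pair of distributions gives a nonnegative value.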

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about the fit of the parameters to the data, and not the exact convergence to the original. So they often prove to be reliable measures of fit to judge our algorithms on. For instance, even though the MSE of our recovered parameters is large, the figure below shows us that our algorithm has likely converged (since the log-likelihood and KL divergence are not changing much) and gives us a reliable measure of model fit.


The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

Java-Scala Interoperability

To many people, Java is programming. It’s the only programming language most of my non-technical friends could name off-hand, and the first one that many programmers learn. Since its initial appearance in 1995, Java’s usage has steadily grown.

There are many reasons for this. The design of the Java platform, in which Java code compiles to bytecode that runs on the Java Virtual Machine (JVM), allows Java to be a “write once, run anywhere” language. That is, Java code that works on my computer will work exactly the same on your computer, or a server in the cloud (in theory, at least). Its language features, especially static typing and object-orientation, lend it to robust, modular, rock-solid code. Due to years of JVM and Java development, the platform is blazingly fast and stable. In addition, the many libraries available for Java allow anybody to leverage years of development and experience to perform complicated tasks simply and quickly. The Java community is also strong: on top of the official documentation, there are years of blog posts (an unscientific Google search for “java blog” yields 268,000,000 results) and close to 280,000 questions tagged “Java” on StackExchange. That’s a lot of collective knowledge and expertise.

At Knewton our systems are Java-based, for a few of the reasons mentioned above. But Java is far from perfect. Its syntax includes much boilerplate, and its support for concepts like concurrency, parallelism, and functional programming is unidiomatic. There are a lot of small tweaks or features that could make Java a lot more pleasant to work with. These imperfections aren’t significant enough to outweigh the benefits of using Java, but they do make daily work a little more challenging.

Enter Scala. Scala is what would happen if there were a Genie that eavesdropped on every Java developer as they cursed when their build broke due to a missing semicolon, then inferred a wish from their complaints. It offers what Java lacks: a much cleaner syntax (semicolons optional, types inferred), pattern matching and case classes, mix-ins (a more powerful inheritance/interface system), higher-order/first-class/anonymous functions, built-in immutable data structures, a built-in read-evaluate-print-loop (REPL) and actor-based concurrency. Even more importantly than all of this, though, is one huge advantage: it runs on the JVM. This means that, unlike other languages, Scala doesn’t require the reimplementation of large banks of code; programs can use existing Java libraries with minimal effort. For instance, the Java standard libraries are available as soon as your Scala REPL starts. To translate

ArrayList<String> fruits = new ArrayList<String>();
for (String fruit : fruits) {
    System.out.println(fruit);
}

into Scala, we simply write the following (JavaConversions is a Scala library that makes using Java collections like ArrayList much more idiomatic):

import java.util.ArrayList
import scala.collection.JavaConversions._

val fruits = new ArrayList[String]
for (fruit <- fruits) {
  println(fruit)
}
// More idiomatically:
fruits.foreach(println)

If you’re using libraries that aren’t packaged with Java or your own classes, just ensure that they lie in the classpath. In addition, the syntactic similarities between Java and Scala mean that the information from nearly any Java resource — a blog post, StackExchange question, Javadoc, or whiteboard code snippet — will be equally valuable in Scala, allowing Scala to tap into over 15 years of experience. While Scala may be a relatively new language, it has access to all of the network benefits of a behemoth like Java.

Portability works the other way, too, but with a few caveats. Consider the following Scala class, in the file “Hamster.scala”:

class Hamster()

The Java file “HamsterTest.java” might contain the following:

public class HamsterTest {
    public static void main(String[] args) {
        Hamster hamster = new Hamster();
    }
}
To compile and run, execute

$ scalac Hamster.scala
$ javac -cp path/to/scala-library.jar:. HamsterTest.java
$ java -cp path/to/scala-library.jar:. HamsterTest

in the shell, and hopefully find that everything works as expected. However, if we want to add a weight field to our Hamster class (“class Hamster(var weight: Int)”), things start to break.

Hamster hamster = new Hamster(42);
// This works (note: "weight()", not "weight"):
System.out.println("The hamster weighs " + hamster.weight());
// But if our hamster then goes on a diet, we're out of luck:
hamster.setWeight(10); // doesn't work
hamster.weight(10);    // doesn't work

We can solve this problem with the handy scala.reflect.BeanProperty annotation, which adds getWeight and setWeight methods (we could also add them explicitly):

class Hamster(@BeanProperty var weight: Int)

Then, from Java:

Hamster hamster = new Hamster(42);
hamster.setWeight(hamster.getWeight() - 32);

What about using Scala traits within Java? Since traits aren’t quite interfaces, but aren’t quite abstract classes either, they’re a little tricky. A basic Scala trait, such as

trait Mammal {
    def beHairy
}

can be used just as an interface. But if we add another trait, Rodent,

trait Rodent extends Mammal {
    def beHairy {
        println("I have fur!")
    }
    def gnaw
}

and attempt to extend it

public class Rat extends Rodent {
    public void gnaw() {
        System.out.println("Gnawing away!");
    }
}

we are met with the compile-time error “no interface expected here”, even though Rodent implements beHairy(). To see why we get this message, whip out javap, the Java disassembler bundled with every JDK (javap is an essential tool for deciphering strange Scala behavior).

$ javap Rodent
Compiled from "Rodent.scala"
public interface Rodent extends Mammal,scala.ScalaObject{
    public abstract void beHairy();
    public abstract void gnaw();
}

Thus, even though Rodent contains a method implementation, it still compiles to an interface. Some compile-time magic allows Scala classes to use these traits. The easiest fix is to add something like

abstract class RodentJava extends Rodent

for each trait that implements methods, and have Java classes extend the abstract class, rather than the trait. This has the advantage of being dead simple, but the disadvantage of requiring the maintenance of a separate Java API and sacrificing the ability to mix in multiple traits as in Scala (though this is impossible in Java anyway, since it lacks multiple inheritance). A single abstract Scala class for each potentially used set of mix-ins would work here, but that approach risks a combinatorial explosion.

Traits are just one example of a Scala feature that lacks a direct Java equivalent. First-class functions, anonymous functions, Scala’s greater number of access level modifiers, pattern matching, tuples, implicits, and operator overloading are all either non-trivial or impossible to use in Java. In general, to be useful to Java programmers, a library that uses any Scala-specific features must offer a Java-oriented API (of greater complexity than the examples seen here, but involving similar workarounds). Both Apache Kafka [3] and Twitter’s Finagle [4], which I’ve worked with in my summer at Knewton, take this approach; while it’s possible to write Scala code that can be used seamlessly from Java, to do so is to give up the expressiveness of idiomatic Scala.

Whether, as some suggest, Scala will be a successor to Java, or merely a fad, its compatibility with Java is one of its greatest strengths. Combining Java and Scala is nowhere near the undertaking that combining two unrelated languages would be, so the cost of experimenting with Scala as a part of an existing Java architecture is minimal. Given the potential benefits and low risk, it makes sense to consider a total or partial move to Scala as part of a JVM-based ecosystem.

Web-Based, Real-Time Data Monitoring

This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.


How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?

These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.

At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.

For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.


With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:

Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to log in at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.

Responsiveness: How fast can we refresh the data?

Redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.

Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?

Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.

Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.


Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.

This resulted in three strategies for obtaining the data:

Auto-refresh: We query for updates every hour.

On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.

Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.
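The first two strategies above can coexist in one cache. Here is a minimal Java sketch of the idea, assuming hypothetical names (the real dashboard’s internals were more involved than this):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: a cache that serves recent data (auto-refreshing stale
// entries after an hour) and also supports forced on-demand refreshes.
class DashboardCache {
    private static final long MAX_AGE_MS = 60L * 60L * 1000L; // hourly auto-refresh

    private static final class Entry {
        final Object value;
        final long fetchedAt;
        Entry(Object value, long fetchedAt) { this.value = value; this.fetchedAt = fetchedAt; }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    // Serve from cache; re-fetch only if the entry is missing or stale.
    public Object get(String key, Supplier<Object> fetch) {
        Entry e = cache.get(key);
        if (e == null || System.currentTimeMillis() - e.fetchedAt > MAX_AGE_MS) {
            e = new Entry(fetch.get(), System.currentTimeMillis());
            cache.put(key, e);
        }
        return e.value;
    }

    // On-demand refresh: bypass the cache and fetch the latest state.
    public Object refresh(String key, Supplier<Object> fetch) {
        Entry e = new Entry(fetch.get(), System.currentTimeMillis());
        cache.put(key, e);
        return e.value;
    }
}
```

Repeated reads within the hour hit the cache and never touch the platform; a forced refresh always does.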

As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.

First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.

Secondly, speed. After all, drawing information from local cache is faster than having to gather it, and we were even able to cache significant portions of the web pages themselves.

Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. Thus, the nature of the internal back-end references meant that querying on demand for, say, information about one course or user would require querying in sequence, rather than querying in parallel.


Given that most of the Knewton platform is written in Java, and non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward; admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users while parallelizing refreshes.

The webpage itself was run using an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and have run through a cycle of configurations, I would have given this choice more consideration, either by looking at different server technologies or at different configuration methods. Compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLs for both webpages and resource files took notably more effort than expected with this configuration style. This challenge was compounded by the use of JSP (JavaServer Pages, i.e. Java-embedded HTML) files, which handled UI and CSS better than the Java-generated HTML in the first running version of the dashboard.

The interaction within the system can be represented in the diagram below:


While testing functionality, I ran the code on a local machine against a very small set of data (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything was running and that we were gathering the expected data. Consequently, in our first run against a bigger sized data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).

To improve performance, we took a few steps:

Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance, they definitely reduced the memory footprint.

Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.

Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.

These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.

While the server sent out data in HTML tables, data manipulations were left to JavaScript on the client side. I started with the TableFilter library; however, once running the dashboard against more realistically sized datasets, we hit the ceiling on the ability of this library to sort and search through large tables. After being pointed in a new direction by a fellow Knerd, we switched to the DataTables JavaScript library which, in addition to having much better performance, also offered a larger variety of available plugins and extensions. While JavaScript will work for the near term, and using the pre-existing libraries allowed us to get off the ground quickly, processing of huge datasets will ultimately need to be moved to the back end as the number of Knewton users grows.

All in all, I was able to get the dashboard up and running into production just a few minutes before my final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments and for several hundred thousand students across the country!

Announcing WP Stack

My name is Mark Jaquith. I’m one of the lead developers on the WordPress core software, and am a technical consultant to Knewton. The marketing team at Knewton wanted their WordPress-powered site to have a more professional development approach using version control; a real staging and deployment system (with the ability to roll back); and fast, scalable redundancy. I helped create that system for them. They loved it, and after talking to other companies with similar needs for their WordPress-powered sites, they thought it would be useful to turn this system into a generic WordPress deployment system that anyone could use to run a professional WordPress site.

Today, we’re announcing the result of that effort: WP Stack.

What it is

WP Stack is a set of deployment scripts using Capistrano, and a series of drop-in WordPress plugins that work in tandem with this system to perform commonly desired tasks on professional sites, such as rewriting media files to a CDN or serving WordPress multisite uploads directly through Nginx. It supports both production and staging environments, as well as file/database snapshotting from production to staging, so that you can test your code changes on fresh data.

The commands for tasks like deployment, rollback, and database/file syncs are short one-liners such as cap production deploy or cap staging db:sync. And while code changes (like templates or plugins) require commits and deployments, the regular WordPress workflow is unchanged… so content creators can post, upload media, and edit content to their heart’s content without needing to know what’s going on under the hood.

WP Stack can be used to deploy your existing WordPress site repository, but you’ll get best results if you start with WordPress Skeleton, a complementary project that gives you a nicely laid out starter WordPress Git repository, that not-coincidentally is pre-wired to work well with WP Stack.

Who this is for

This project will have the most benefit for professional WordPress sites, where “doing it live” just isn’t an option. But there’s no reason it can’t be used for personal or small business websites too. Many organizations know that they should be using version control and a deployment system, but they don’t quite know where to start. It is my hope that WP Stack (and WordPress Skeleton) will help lower the barriers and lead to more organizations using professional development and deployment techniques for using the software that powers their public face.

The future

We’re not done! The roadmap for WP Stack includes things like multi-server uploads syncing, Puppet/Chef manifests with full Nginx configs, and a Vagrant config for easier local development. If you’d like to get involved, head on over to GitHub.

I’d also like to thank Knewton for sponsoring the development of WP Stack. It’s great to be able to work with a company that can see the value of open source contribution.

The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency

Imagine for a second that you’re teaching a math remediation course full of fourth graders. You’ve just administered a test with 10 questions. Of those 10 questions, two questions are trivial, two are incredibly hard, and the rest are equally difficult. Now imagine that two of your students take this test and answer nine of the 10 questions correctly. The first student answers an easy question incorrectly, while the second answers a hard question incorrectly. How would you try to identify the student with higher ability?

Under a traditional grading approach, you would assign both students a score of 90 out of 100, grant both of them an A, and move on to the next test. This approach illustrates a key problem with measuring student ability via testing instruments: test questions do not have uniform characteristics. So how can we measure student ability while accounting for differences in questions?

Item response theory (IRT) attempts to model student ability using question level performance instead of aggregate test level performance. Instead of assuming all questions contribute equally to our understanding of a student’s abilities, IRT provides a more nuanced view on the information each question provides about a student. What kind of features can a question have? Let’s consider some examples.

First, think back to an exam you have previously taken. Sometimes you breeze through the first section, work through a second section of questions, then battle with a final section until the exam ends. In the traditional grading paradigm described earlier, a correct answer on the first section would count just as much as a correct answer on the final section, despite the fact that the first section is easier than the last! Similarly, a student demonstrates greater ability as she answers harder questions correctly; the traditional grading scheme, however, completely ignores each question’s difficulty when grading students!

The one-parameter logistic (1PL) IRT model attempts to address this by allowing each question to have an independent difficulty variable. It models the probability of a correct answer using the following logistic function:

\displaystyle Pr(\text{correct}) = \frac{1}{1 + e^{-(\theta - \beta_j)}}

where j represents the question of interest, theta is the current student’s ability, and beta is item j’s difficulty. This function is also known as the item response function. We can examine its plot (with different values of beta) below to confirm a couple of things:

  1. For a given ability level, the probability of a correct answer increases as item difficulty decreases. It follows that, between two questions, the question with a lower beta value is easier.
  2. Similarly, for a given question difficulty level, the probability of a correct answer increases as student ability increases. In fact, the curves displayed above take a sigmoidal form, thus implying that the probability of a correct answer increases monotonically as student ability increases.
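Both properties above are easy to check numerically. Here is a minimal Java sketch of the 1PL item response function (naming is ours, for illustration):

```java
// Sketch of the 1PL item response function:
// P(correct | theta, beta) = 1 / (1 + exp(-(theta - beta)))
class OnePl {
    public static double pCorrect(double theta, double beta) {
        return 1.0 / (1.0 + Math.exp(-(theta - beta)));
    }
}
```

When ability exactly matches difficulty, the probability is exactly 0.5; lowering beta (an easier item) or raising theta (a stronger student) both push the probability toward 1.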

Now consider using the 1PL model to analyze test responses provided by a group of students. If one student answers one question, we can only draw information about that student’s ability from the first question. Now imagine a second student answers the same question as well as a second question, as illustrated below.
We immediately have the following additional information about both students and both test questions:

  1. We now know more about student 2’s ability relative to student 1 based on student 2’s answer to the first question. For example, if student 1 answered correctly and student 2 answered incorrectly we know that student 1’s ability is greater than student 2’s ability.
  2. We also know more about the first question’s difficulty after student 2 answered the second question. Continuing the example from above, if student 2 answers the second question correctly, we know that Q1 likely has a higher difficulty than Q2 does.
  3. Most importantly, however, we now know more about the first student! Continuing the example even further, we now know that Q1 is more difficult than initially expected. Student 1 answered the first question correctly, suggesting that student 1 has greater ability than we initially estimated!

This form of message passing via item parameters is the key distinction between IRT’s estimates of student ability and other naive approaches (like the grading scheme described earlier). Interestingly, it also suggests that one could develop an online version of IRT that updates ability estimates as more questions and answers arrive!

But let’s not get ahead of ourselves. Instead, let’s continue to develop item response theory by considering the fact that students of all ability levels might have the same probability of correctly answering a poorly-written question. When discussing IRT models, we say that these questions have a low discrimination value, since they do not discriminate between students of high- or low-ability. Ideally, a good question (i.e. one with a high discrimination) will maximally separate students into two groups: those with the ability to answer correctly, and those without.

This gets at an important point about test questions: some questions do a better job than others of distinguishing between students of similar abilities. The two-parameter logistic (2PL) IRT model incorporates this idea by attempting to model each item’s level of discrimination between high- and low-ability students. This can be expressed as a simple tweak to the 1PL:

How does the addition of alpha, the item discrimination parameter, affect our model? As above, we can take a look at the item response function while changing alpha a bit:

As previously stated, items with high discrimination values can distinguish between students of similar ability. If we’re attempting to compare students with abilities near zero, a higher discrimination sharply decreases the probability that a student with ability < 0 will answer correctly, and increases the probability that a student with ability > 0 will answer correctly.
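To make this concrete, here is a minimal sketch of the standard 2PL response function (the function name and the example values are illustrative, not from the post):

```python
import math

def irf_2pl(theta, alpha, beta):
    """Two-parameter logistic item response function: the probability
    that a student with ability theta answers an item with
    discrimination alpha and difficulty beta correctly."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# A higher discrimination sharpens the curve around the item's difficulty:
low = irf_2pl(0.5, alpha=0.5, beta=0.0)   # gently sloped item
high = irf_2pl(0.5, alpha=4.0, beta=0.0)  # sharply discriminating item
print(round(low, 3), round(high, 3))      # 0.562 0.881
```

For the same above-average student, the high-discrimination item is far more informative: it pushes the predicted probability well away from the 50/50 coin flip.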

We can even go a step further here, and state that an adaptive test could use a bank of high-discrimination questions of varying difficulty to optimally identify a student’s abilities. As a student answers each of these high-discrimination questions, we could choose a harder question if the student answers correctly (and vice versa). In fact, one could even identify the student’s exact ability level via binary search, if the student is willing to work through a test bank with an infinite number of high-discrimination questions with varying difficulty!
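The binary-search idea can be sketched directly. This toy model assumes an idealized bank of perfectly discriminating items (a student answers correctly exactly when their ability exceeds the item's difficulty); all names here are mine:

```python
def adaptive_estimate(answers_correctly, lo=-4.0, hi=4.0, steps=20):
    """Estimate a student's ability by binary search over item difficulty.
    `answers_correctly(difficulty)` plays the role of the student, and is
    assumed to return True iff the student's ability exceeds `difficulty`."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if answers_correctly(mid):   # item was too easy: ability is above mid
            lo = mid
        else:                        # item was too hard: ability is below mid
            hi = mid
    return (lo + hi) / 2.0

# A simulated student with true ability 1.3:
estimate = adaptive_estimate(lambda d: 1.3 > d)
print(round(estimate, 2))  # converges toward 1.3
```

Each question halves the remaining interval, which is why an adaptive test can pin down ability far faster than a fixed-form test of the same length.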

Of course, the above scenario is not completely true to reality. Sometimes students will identify the correct answer by simply guessing! We know that answers can result from concept mastery or filling in your Scantron like a Christmas tree. Additionally, students can increase their odds of guessing a question correctly by ignoring answers that are obviously wrong. We can thus model each question’s “guess-ability” with the three-parameter logistic (3PL) IRT model. The 3PL’s item response function looks like this:

where chi represents the item’s “pseudoguess” value. Chi is not considered a pure guessing value, since students can use some strategy or knowledge to eliminate bad guesses. Thus, while a “pure guess” would be the reciprocal of the number of options (i.e. a student has a one-in-four chance of guessing the answer to a multiple-choice question with four options), those odds may increase if the student manages to eliminate an answer (i.e. that same student increases her guessing odds to one-in-three if she knows one option isn’t correct).

As before, let’s take a look at how the pseudoguess parameter affects the item response function curve:

Note that students of low ability now have a higher probability of guessing the question’s answer. This is also clear from the 3PL’s item response function (chi is an additive term and the second term is non-negative, so the probability of answering correctly is at least as high as chi). Note that there are a few general concerns in the IRT literature regarding the 3PL, especially regarding whether an item’s “guessability” is instead a part of a student’s “testing wisdom,” which arguably represents some kind of student ability.
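A sketch of the standard 3PL response function makes the floor at chi visible (again, the function name and values are illustrative):

```python
import math

def irf_3pl(theta, alpha, beta, chi):
    """Three-parameter logistic IRF: chi is the pseudoguess value, i.e.
    the floor probability contributed by guessing; the logistic term is
    scaled by (1 - chi) so the curve still tops out at 1."""
    return chi + (1.0 - chi) / (1.0 + math.exp(-alpha * (theta - beta)))

# Even a very low-ability student keeps roughly a one-in-four chance
# on a four-option multiple-choice item:
print(round(irf_3pl(-5.0, alpha=1.0, beta=0.0, chi=0.25), 3))  # 0.255
```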

Regardless, at Knewton we’ve found IRT models to be extremely helpful when trying to understand our students’ abilities by examining their test performance.


de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press.
Kim, J.-S., & Bolt, D. (2007). “Estimating Item Response Theory Models Using Markov Chain Monte Carlo Methods.” Educational Measurement: Issues and Practice 38 (51).
Sheng, Y. (2008). “Markov Chain Monte Carlo Estimation of Normal Ogive IRT Models in MATLAB.” Journal of Statistical Software 25 (8).

Also, thanks to Jesse St. Charles, George Davis, and Christina Yu for their helpful feedback on this post!

GraphLint: Creating a Domain Specific Language for Graph Validation

Kenny Yu is currently a sophomore at Harvard College where he is studying computer science. In the summer of 2011, he interned at Knewton as a software engineer. During his internship, he built a new source review and release process and created a graph validation engine based on configuration schemas. He interned at Knewton again in January 2012 and rebuilt the graph validator by creating a domain specific language. The following blog post is a summary of this project and the process he used to design the language and the new validation engine.

The Question

At the core of the Knewton Adaptive Learning Platform is the course structure that brings all the lessons, concepts, and content into one connected graph. This graph is so large that checking it by hand is infeasible. Thus, Knewton’s graph datastore needed a validation system to check the structure of its graphs.

The validator idea is simple: given a graph, how can we tell if it is “well-formed”? How can we create a set of rules to express what edge and node types are acceptable in a given graph? My mentors, Jordan Lewis and Trevor Smith, decided that I should build a domain specific language (DSL) for writing the ontology schemas to solve this problem.

Developing the Language: “Onto”

Before creating the DSL, I had to understand the use case of the language I was building. After speaking with developers and content creators at Knewton, I composed this list of considerations that would drive my planning process:

  1. Expressiveness of the language – Above all, the language must be capable of encoding complex relationships between nodes and edges.
  2. Simplicity and ease of writing the language – The language must be easy to write and so easy to understand that non-developers can think and write in it.
  3. Ease of implementing the language – It should be easy to take the language and parse it into a form that can be used for graph validation.
  4. Graph validation run-time – Once an ontology has been parsed from the DSL into a usable form, how quickly can we validate a graph against it?

The graphs I worked with are directed multigraphs whose edges and nodes may have data associated with them. In Knewton’s case, both nodes and edges have types (e.g. “Concept”, “Module” and “Contains”, “Represents”). Using these types, we want to assert relationships like:

For all nodes m of type Module, there exists exactly 1 node c of type Concept such that the directed edge (c,Contains,m) exists in the graph.

We also want to express logical combinations like the following:

For all nodes c of type Concept:

there exists exactly 1 node m of type Module such that the directed edge (c,Contains,m) OR (c,Represents,m) exists in the graph


there exists exactly 0 or 1 nodes m of type Module such that the directed edge (c,Contains,m) exists in the graph.

I used statements like these as inspiration to create the DSL. The DSL consists of these statements:

  1. Universal quantifier statement: for all <variable> in <node type> <statement>
  2. Existential quantifier statement: there exists <number rule> <variable> in <node type> such that <statement>
  3. Edge rule assertion statement: (<variable>,<edge type>,<variable>)
  4. And statement: <statement> and <statement>
  5. Or statement: <statement> or <statement>
  6. Xor (Exclusive or) statement: <statement> xor <statement>
  7. Not statement: not <statement>

Thus, the above two statements, written in the DSL, look like this:
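Following the grammar above, a plausible Onto rendering of the first statement (a reconstruction; the exact concrete syntax may differ) is:

```
for all m in Module
    there exists exactly 1 c in Concept such that (c,Contains,m)
```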

and the second one:
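A plausible reconstruction from the same grammar, combining the two existential statements with an and:

```
for all c in Concept
    (there exists exactly 1 m in Module such that
        ((c,Contains,m) or (c,Represents,m)))
    and
    (there exists 0 or 1 m in Module such that (c,Contains,m))
```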

Lacking creativity, I decided to give ontology file names the “.onto” extension, and thus the Onto language was born.

Language Implementation Choice

It was difficult to determine which language I would use to implement this project. Having recently taken a compilers class in OCaml and knowing the benefits of OCaml’s strong type system and functional programming features (not to mention my personal preference for coding in OCaml), I felt OCaml was the natural choice for this kind of project. However, after considering how this project would need to interoperate with other systems in the company, I realized that OCaml would not be the best choice. The company had recently decided to move towards Java and possibly other JVM-based languages, and my research into Java-OCaml bindings showed that the existing options couldn’t meet this project’s needs. As a result, I chose to implement this project in Java to make it more compatible with other systems at Knewton.

Having only used ocamllex and ocamlyacc to create parser generators, I needed to research other parser generators for Java. My mentors suggested ANTLR. I found ANTLR to be a high quality project that was incredibly easy to use.

Validation Algorithm

Now that I could successfully parse ontology files into memory, how would I validate a graph? I decided to implement the algorithm in a functional style: traverse the abstract syntax tree (AST) generated by the parser, and for each statement node create a closure that, given an environment (a mapping from variable names to actual nodes in the graph), returns “true” or “false” to indicate whether the graph passes validation on that statement. Traversing the entire AST thus produces a single closure representing the validate function, and a list of errors is generated if the graph fails validation.

For example, I did the following for the Universal quantifier statement:

for all <variable> in <node type> <statement>

  1. Generate the closure by traversing the AST in <statement>
  2. Iterate over all nodes with type <node type> in the provided graph, and for each node:
     1. extend the current environment by mapping <variable> to the current node
     2. call the closure on this new environment, and ensure the closure returns true

I followed a similar procedure for the rest of the node types in an ontology.
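The whole approach can be sketched in a few lines of Python (the AST encoding, Graph class, and names here are my own simplifications, not the actual Java implementation, and the exists sketch checks only "at least one" rather than a full number rule):

```python
class Graph:
    """A tiny directed multigraph with typed nodes and typed edges."""
    def __init__(self, nodes, edges):
        self.nodes = nodes       # {node_id: node_type}
        self.edges = set(edges)  # {(src_id, edge_type, dst_id)}

    def nodes_of_type(self, node_type):
        return [n for n, t in self.nodes.items() if t == node_type]

    def has_edge(self, src, edge_type, dst):
        return (src, edge_type, dst) in self.edges


def compile_statement(stmt, graph):
    """Traverse the AST and return a closure that takes an environment
    (a dict mapping variable names to graph nodes) and reports whether
    the graph satisfies the statement under that environment."""
    kind = stmt[0]
    if kind == "forall":   # ("forall", var, node_type, body)
        _, var, node_type, body = stmt
        check = compile_statement(body, graph)
        return lambda env: all(check({**env, var: n})
                               for n in graph.nodes_of_type(node_type))
    if kind == "exists":   # ("exists", var, node_type, body)
        _, var, node_type, body = stmt
        check = compile_statement(body, graph)
        return lambda env: any(check({**env, var: n})
                               for n in graph.nodes_of_type(node_type))
    if kind == "edge":     # ("edge", src_var, edge_type, dst_var)
        _, src, edge_type, dst = stmt
        return lambda env: graph.has_edge(env[src], edge_type, env[dst])
    if kind == "and":      # ("and", left, right)
        left = compile_statement(stmt[1], graph)
        right = compile_statement(stmt[2], graph)
        return lambda env: left(env) and right(env)
    raise ValueError("unknown statement kind: %r" % kind)


# "For all nodes m of type Module, there exists a node c of type Concept
# such that the directed edge (c, Contains, m) exists":
g = Graph({"c1": "Concept", "m1": "Module"},
          [("c1", "Contains", "m1")])
validate = compile_statement(
    ("forall", "m", "Module",
     ("exists", "c", "Concept", ("edge", "c", "Contains", "m"))), g)
print(validate({}))  # True for this graph
```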

Unfortunately, Java does not support closures directly, so I had to generate wrapper classes that wrapped methods representing the closures. In addition, because Java lacks match statements, I used the visitor pattern to traverse the AST.

Testing and Profiling

After unit testing the validation engine on small graphs and simple ontologies, I moved on to a real-life, error-riddled data set. The graph in the data set had 53,377 nodes and 53,960 edges. When I initially tried to validate this graph, the JVM exploded: even with 2GB of memory, it did not have enough heap space to finish validating the graph within half an hour. After profiling, I discovered that over a hundred million Strings were being generated as error messages for failed validations. As a result, I made several changes to make the validator more memory efficient: I used the flyweight pattern when allocating environment variables and node and edge types, and I generated error messages lazily, creating them only when necessary.

This solved the memory problem, but the run-time of the validation algorithm was still terrible. Each nesting level of a for all or exists statement added a factor of V to the asymptotic run-time (where V is the number of vertices). For example, if an ontology had three nested for all statements, the run-time would be O(V^3). Fortunately, since edges can only connect two vertices (unlike in a hypergraph), I could optimize each exists statement to check only the neighboring set of the current node. With this optimization, exists statements no longer multiplied the computational complexity. As a result of these memory and time optimizations, the run-time for the 50,000+ node data set went from 30+ minutes to less than 4 seconds. Changing the run-time from V^2 to V makes quite a difference!

Looking Back, and Next Steps

It’s a rare opportunity to go back and rebuild old projects from scratch. It’s an even rarer opportunity to get complete freedom (as an intern) to design a central component of the company’s system. At Knewton, I was fortunate to have both opportunities. Knewton tasked me with rebuilding the graph validation engine, and I had complete freedom to design the project however I wished. Even more remarkable: this project allowed me to work at a scale (50,000+ nodes!) that I would rarely ever touch at school.

There are several things that in retrospect I would have done differently. The lack of functional language features in Java made it difficult to implement the validation algorithm. As a result, if I were to redo the project, I would probably implement it in Scala. Scala’s interoperability with Java and its functional language features (especially the ability to create closures and use case classes) would have made the implementation easier to write and more concise. Furthermore, there are many optimizations I could have implemented to make the validation engine faster: I could have used properties of second-order logic to transform quantifier and logic statements into equivalent statements with a faster run-time, and I could have parallelized the algorithm to make use of concurrency. But this is pretty good for just two and a half weeks of work!

Simplifying Cross-Browser Javascript Unit Testing

By Daniel Straus and Eric Garside

Javascript unit testing is complicated and expensive; unlike server-side logic, code that seems to work fine may execute differently on different browser/OS combinations, which means that in order to get adequate test coverage, all supported platforms need to be tested thoroughly. In order to keep our codebase testable as it grows in size, we came up with a way to make our testing process easy and scalable.

To write our tests we used ScrewUnit, a behavior-driven testing framework which provides an extensible cross-browser utility for developing tests. Each test suite is just a web page; running it is only as difficult as pointing a browser at it. Let’s take a look at an example:

This test passes in a Webkit browser, but not in Internet Explorer (IE can’t process the date format). As you can imagine, it would be quite time-consuming to run tests like this manually on multiple browsers on every check-in.

To make the use of ScrewUnit manageable and efficient, we turned to Selenium-WebDriver (a browser automation tool). We can write Selenium routines that open a project’s ScrewUnit test and analyze its results. With a single line of code, we can check whether ScrewUnit detects failures:

This is an improvement over the previous method, but to make the process optimal we wanted to run these routines in parallel. To accomplish this (running across multiple browser/OS combinations), we used Sauce Labs. Sauce Labs allows us to kick off tests from a single machine quickly and doesn’t require a large testing hardware infrastructure on our end (everything runs in Sauce Labs’ cloud). Below is an example of how this all comes together:

By combining ScrewUnit, Selenium and Sauce Labs, we removed the complexity of running Javascript unit tests across multiple browsers. We’ve put together a demo of our tests as they run on our own project.

Knerdlings: Improving Courses through Simulation

N choose K’s inaugural post is brought to you by the Knewton Adaptive Learning Team (Ashley Miller, Christopher Tang, David Kuntz, George Davis, Jesse St. Charles, John Davies, Sidharth Gupta)

No two students are identical—they learn and forget at different rates, come from different educational backgrounds, and have different intellectual capabilities, attention spans, and modes of learning. As a result, designing a real-time recommendation engine that is sensitive to the idiosyncrasies of each student is an immensely complex task.

Here at Knewton we are addressing this challenge head-on, using educational path planning technologies and advanced models of student ability. These technologies and models ensure that, no matter how different students are when they enter our system, each of them progresses through the course material in a way that maximizes his or her individual rate of learning. In our ongoing research and experimentation to improve these student models, our latest and most diabolical step has been to create our very own army of virtual abecedarians (Knerdlings), who assist in the testing of our recommendation engine.

Testing Our Recommendation Engine

One good way to evaluate the courses that make use of our recommendation engine is of course to collect and analyze real data generated by real students. These analyses are critical to maintaining and improving the quality of our courses for future students, but real students take real time to complete a course. So, to enable ourselves to research and develop in the interim, we needed a flexible and automated way of generating student data. With this goal in mind, we engineered a virtual student named “Isaak” and used him as the prototype for a whole army of Knerdlings.

In addition to meeting the goal of generating usable data, there are numerous advantages that this strategy offers us. First, it gives us an efficient way to simulate taking our course over and over (and over and over) again. Second, we gain fine-grained control over the many possible kinds of students who could use our course in the future. Next, we can induce particular behaviors in students, and simulate a broad range of student abilities and learning styles. Last but not least, we can be as realistic as we need to be, and nuanced in the exact ways we expect a real-world student would actually be. With these thoughts in mind, we designed the inner workings of Isaak’s mind.

Modeling Student Behavior

The core of Isaak’s mind is a graphical (that’s “graph-based,” not “made of pretty pictures”) model that keeps track of the current state of Isaak’s latent abilities—like his proficiency in “Solving 30-60-90 triangle problems,” or in “Understanding trigonometry”. If Isaak were a real student in a Knewton-powered course, he would be exposed to instructional content and take assessments, and his performance on these assessments would play a role in determining his path through the course. Each student’s latent abilities determine how well the student performs on any assessment of a given topic or concept—in technical parlance, latent ability contributes to a “generative model of performance”. To model student behavior more accurately, Isaak’s latent ability levels also need to improve over time as he is “exposed” to more material, and to degrade without proper reinforcement.

In short, Isaak needs to be able to learn, and (regrettably) needs to be able to forget. Inspired by research that started with Hermann Ebbinghaus’s work on memory retention and learning curves, we model the changes in Isaak’s latent ability, while learning and forgetting, using exponential growth and decay curves. Isaak doesn’t actually do any learning (or forgetting), of course—he just receives a “bump” in his virtual ability level for a topic each time we expose him to content associated with that topic. (The precise nature of that “bump” depends on the functional form of the learning curve.) Likewise, if Isaak is not exposed to some other topic, he might forget that topic over time. The forgetting curve that governs Isaak’s rate of retention is roughly described by the following formula:

R = e^(-t/S), where R is memory retention, S is the relative strength of memory, and t is time.

By integrating this process into the simulation, we can capture the way in which a student’s knowledge waxes and wanes, depending on how and when they are exposed to various content. The resulting simulation isn’t a perfect recreation of real student behavior, but, in addition to being really cool, the corps of Knerdlings generated in the simulation lets us test the algorithms that govern a student’s flow through the course quite stringently, while maintaining enough simplicity for us to understand the results.
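As a rough illustration of this learning-and-forgetting dynamic, here is a toy sketch of a Knerdling's ability on a single topic. The bump size, memory strength, and additive form are illustrative choices of mine, not Knewton's actual model:

```python
import math

def retention(t, strength):
    """Ebbinghaus-style forgetting curve: R = e^(-t/S)."""
    return math.exp(-t / strength)

def simulate(events, horizon, strength=5.0, bump=0.3):
    """Each exposure at time e bumps the student's ability on the topic,
    and every bump decays along the forgetting curve afterward."""
    ability = []
    for t in range(horizon):
        level = sum(bump * retention(t - e, strength)
                    for e in events if e <= t)
        ability.append(min(level, 1.0))  # cap proficiency at 1.0
    return ability

# Expose the virtual student at t=0 and t=6, then watch the decay:
trace = simulate(events=[0, 6], horizon=12)
```

Plotting `trace` shows exactly the wax-and-wane shape described above: ability jumps at each exposure and slides back down between them.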

Visualizing the Data

Once we were ready to set our army of Knerdlings loose in our courses, we needed to make sense of the copious amount of data they were about to generate. We did this by summarizing the results in a visualization that lets us understand the behavior of our virtual students graphically (this time, that’s “graph-based” and “in a pretty picture”). The visualization lets us watch latent ability levels change over time, as individual Knerdlings interact with a Knewton course, in a way that’s much easier to grasp than, say, a spreadsheet containing the exact values of a student’s ability on every concept for each hour that they were enrolled. We have nothing against spreadsheets here at Knewton—they have their place—but problems and strange behavior jump out much more clearly to us visually.

Creating Better Algorithms

The visualization tool will also have applications outside of Knewton HQ. In the hands of an instructor, for example, it could help track the progress of students, and allow the instructor to adjust her course plan to account for her students’ likely learning states. Say that a student has passed a sizable number of the units in the course, at a point well into the semester. The student has most likely forgotten some of the topics that were covered earlier in the semester and not revisited since. Using this visualization, along with the Ebbinghaus-based model of learning and forgetting, the instructor now can see at a glance what parts of the course the student is most likely to have forgotten—even what parts the student is likely to forget in the future—so that she can remediate her student appropriately.

So far, this graphical tool and its associated models are only operational internally, but those models are already helping us create better algorithms, which in turn are going to start moving students through Knewton-powered courses more efficiently than ever before, during this next academic year.