Why “N choose K”?

The name of the Knewton tech blog is “N choose K”. If it’s been a long time since you set foot in a math class, this may not have any meaning to you. However, it’s an interesting concept, and really not all that difficult. In this post, I’ll go over what the term means in simple English and discuss how we use it here at Knewton.

Imagine you’re visiting the newest, trendiest lounge in New York. Like many such lounges, this one specializes in producing excellent cocktails from before the Prohibition era. You look over the menu and see that they offer one drink for each type of alcohol: champagne, gin, rum, tequila, and vodka. You decide that you want to try all of them, in pairs rather than one at a time, to see how well they go with each other. You want to know how many different pairs (one for each hand) you can make with these 5 drinks.

One way of doing this would be to count each pair by hand. To do this, we can label each of the drinks with its main ingredient:

C (champagne), G (gin), R (rum), T (tequila), V (vodka)

and systematically count each of the pairs (note that the order of ingredients is not important):

CG, CR, CT, CV, GR, GT, GV, RT, RV, TV

for a total of 10 different pairs of drinks. Mathematically, this is known as n choose k, and is written as {n \choose k}, with n being the total number of items and k being the number to choose each time. You’re asking “how many ways can I choose 2 items out of 5,” which we know from our count above is:

\displaystyle {5 \choose 2} = 10

It’s not too difficult to count each of these by hand. But now imagine that you’re an octopus at a different bar, which has 20 drinks, and you want to find out how many ways you can choose 8 drinks out of 20, i.e. {20 \choose 8}. Clearly, counting by hand is not an option here.

The great thing about n choose k is that there’s a simple formula to figure it out: you just plug in your values for n and k, and out comes the answer. Here’s the formula:

\displaystyle {n \choose k} = \frac{n!}{k!\,(n-k)!}

which is a lot less complicated than it seems. The exclamation point is known as a factorial, and you can find any factorial by multiplying together all the whole numbers from that number down to 1. So, 3! is simply:

\displaystyle 3! = 3 \times 2 \times 1 = 6

Let’s try it on our original problem: how many ways can we choose 2 drinks out of 5?

\displaystyle {5 \choose 2} = \frac{5!}{2!\,3!} = \frac{5 \times 4 \times 3!}{2 \times 1 \times 3!} = \frac{5 \times 4}{2 \times 1} = 10

Note how we can make the calculations easier by canceling the top and bottom of the fraction when they’re exactly the same.

Now with {20 \choose 8}:

\displaystyle {20 \choose 8} = \frac{20!}{8!\,12!} = 125,970

It’s a good thing octopi have a strong tolerance for liquor.
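
If you’d rather not multiply the factorials out by hand, a few lines of code will do it. This is just an illustrative sketch (the class and method names are made up for this post); it avoids computing the full factorials by building the product one factor at a time:

    // Sketch: compute "n choose k" without computing full factorials.
    public class Choose {
        static long choose(int n, int k) {
            if (k < 0 || k > n) return 0;
            k = Math.min(k, n - k);                 // use symmetry: C(n, k) == C(n, n - k)
            long result = 1;
            for (int i = 1; i <= k; i++) {
                result = result * (n - k + i) / i;  // exact integer division at every step
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(choose(5, 2));   // 10
            System.out.println(choose(20, 8));  // 125970
        }
    }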

Of course, “n choose k” pertains to more than just mixing drinks at bars. In fact, you can see this in all sorts of applications, and often in exactly the sort of work we do here at Knewton. For example, let’s say we have a graph that has only paths to adjacent nodes:


The question we need to answer is: how many shortest paths are there from A to Z? (These are called lattice paths.) There are algorithms that allow you to determine this exactly, but they require polynomial time to complete. Instead, let’s solve an easier problem: what is the upper bound on the number of shortest paths from A to Z? In other words, what would the graph look like if the upper bound were the same as the actual number?

Well, a fully-connected graph would fit this scenario:


We can find the length of a shortest path from A to Z by walking one such path and counting its steps, for example four steps to the right followed by two steps down, for a total of 6 steps.

Clearly we cannot have a shorter path than this, and so all shortest paths must have exactly 6 steps. The only independent variable then is the choice of which steps we take to the right; the dependent variable is which steps we take down (it’s dependent because it is completely determined by the other variables). So in the example above:

Independent: steps 1, 2, 3, 4 are to the right
Dependent: steps 5, 6 are to the bottom (whatever is left)

Starting to sound familiar? It’s the same as “n choose k”! Given the total number of steps, how many ways can we choose which of them go to the right? And so we get:

\displaystyle {6 \choose 4} = \frac{6!}{4!\,2!} = 15

More generally, for an m x n lattice grid, we can use the formula:

\displaystyle {j + k - 2 \choose j - 1}

where j is one of the dimensions of the grid (counted in grid points) and k is the other; which one you pick doesn’t matter, as you’ll see below.

Be aware, you might see this formula as {m + n \choose m}. This is correct if you count the starting point A as (0, 0) and the ending point Z as (m, n), which subtracts one in each direction.
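
If you want to double-check the count, here’s a small sketch (the names are mine, just for this post) that counts shortest lattice paths directly, using the fact that the number of paths to a point is the number of paths to the point on its left plus the number of paths to the point above it. The same “add the two numbers next to you” idea shows up again in Pascal’s triangle below.

    // Sketch: count shortest lattice paths on a grid with `right` steps across
    // and `down` steps downward.
    public class LatticePaths {
        static long countPaths(int right, int down) {
            long[][] paths = new long[right + 1][down + 1];
            for (int i = 0; i <= right; i++) {
                for (int j = 0; j <= down; j++) {
                    if (i == 0 || j == 0) {
                        paths[i][j] = 1;    // along an edge of the grid there is only one way
                    } else {
                        paths[i][j] = paths[i - 1][j] + paths[i][j - 1];
                    }
                }
            }
            return paths[right][down];
        }

        public static void main(String[] args) {
            System.out.println(countPaths(4, 2));   // 15, the same as "6 choose 4"
        }
    }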

You might be wondering why we chose the steps to the right as the independent variable and the steps down as the dependent one; in other words, can we switch them? The answer is yes, it doesn’t matter, and it’s pretty neat, because as you can calculate:

\displaystyle {6 \choose 4} = {6 \choose 2} = 15

In fact, if you calculate a number of these and group them into a triangle shape, it results in a pattern known as Pascal’s triangle:
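
The first few rows look like this, with row n (counting from n = 0 at the top) listing {n \choose 0} through {n \choose n}:

                    1
                  1   1
                1   2   1
              1   3   3   1
            1   4   6   4   1
          1   5  10  10   5   1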

To find {5 \choose 2}, for example, count down to the fifth row and across to the second entry (counting from zero in both directions), and you’ll find that {5 \choose 2} = 10. As you can see, the left side of the triangle has the exact same values as the right side, which is why {6 \choose 4} = {6 \choose 2}, and more generally:

\displaystyle {n \choose k} = {n \choose n - k}

These values are also known as binomial coefficients; if you’re interested in why, you can easily find more information on the web.

There are a couple of neat patterns in Pascal’s triangle. First, each number in the triangle is the sum of the two numbers above it (you can think of the entire triangle as being surrounded by zeros if that helps); in symbols, {n \choose k} = {n - 1 \choose k - 1} + {n - 1 \choose k}. This is pretty miraculous if you recall our original formula for determining n choose k.
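
Here’s a quick sketch of that rule in action, building each row of the triangle from the one above it (the class name is made up for this post):

    // Sketch: build Pascal's triangle using "each entry is the sum of the two
    // entries above it", treating everything outside the triangle as zero.
    public class Pascal {
        public static void main(String[] args) {
            long[] row = {1};                                   // row 0
            for (int n = 0; n < 6; n++) {
                System.out.println(java.util.Arrays.toString(row));
                long[] next = new long[row.length + 1];
                for (int k = 0; k < next.length; k++) {
                    long left = (k > 0) ? row[k - 1] : 0;
                    long right = (k < row.length) ? row[k] : 0;
                    next[k] = left + right;
                }
                row = next;
            }
        }
    }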

The second pattern has to do with its shape: if you take Pascal’s triangle and color only the odd numbers, you get a Sierpinski triangle, which you may be familiar with: it’s our n choose k blog logo!

Our work at Knewton allows us to play with ideas all day and turn those ideas into reality. We chose the binomial coefficient {n \choose k} to be the title of our tech blog because it reflects many of our core values as Knewton developers: our passion for mathematics and the language of ideas, and our commitment to intellectual endeavor.

Conclusion

In this entry we discussed what n choose k means and how you can use it to find the number of ways groups of things can be chosen (and, perhaps, to pace your alcohol consumption). Then we found a simple formula to calculate it. We saw a practical use of the formula in finding an upper bound on the number of shortest paths in a particular graph. Finally, we looked at Pascal’s triangle and the relationships between the binomial coefficients, a.k.a. the results of multiple n choose k computations. I hope you had some fun, learned a bit about math, and have something new to discuss at your next cocktail party.

Thanks to Kevin Wilson and Solon Gordon for their help with this post.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni [discussed in a previous post], Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and of questions (difficulty, discrimination, and guessability) which explain the accuracy of the responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiency over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents a question answered at that time: red dots are questions answered incorrectly, green dots are questions answered correctly, and the {y}-value of a dot is the question’s difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions that the student has answered.

In this post, I will discuss three methods which we used to evaluate the performance of our algorithms and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.

Figure 1: A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
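
As a concrete (if trivial) illustration, here is a sketch of the computation; the names are made up for this post and this is not our production code:

    // Sketch: mean-squared error between actual and recovered parameter vectors.
    public class MeanSquaredError {
        static double mse(double[] actual, double[] recovered) {
            double sum = 0.0;
            for (int i = 0; i < actual.length; i++) {
                double diff = actual[i] - recovered[i];
                sum += diff * diff;
            }
            return sum / actual.length;
        }

        public static void main(String[] args) {
            double[] actual    = {0.5, -1.2, 2.0};   // toy parameter values
            double[] recovered = {0.4, -1.0, 1.7};
            System.out.println(mse(actual, recovered));   // about 0.047
        }
    }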

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations of the same data at different scales. Because the MSE looks at the difference between the two parameter vectors, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale as the actual ones. If they were, the line of best fit would have slope one and {y}-intercept {0}. However, in this case, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Figure 2: Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our data given our recovered parameters. At instantiation, our algorithms should have low log-likelihoods: the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm.

But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer it correctly, then maximizing the log-likelihood will lead to marking this very hard question as easier than it actually is. This, of course, could be solved with more data from large numbers of extremely proficient students, but such data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
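
For intuition, here is a minimal sketch of the computation on two small discrete distributions (toy numbers and made-up names, not our actual model output):

    // Sketch: Kullback-Leibler divergence D_KL(P || P') for discrete distributions.
    public class KlDivergence {
        static double kl(double[] p, double[] pPrime) {
            double sum = 0.0;
            for (int i = 0; i < p.length; i++) {
                if (p[i] > 0.0) {                            // terms with P(i) == 0 contribute nothing
                    sum += p[i] * Math.log(p[i] / pPrime[i]);
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] p      = {0.5, 0.3, 0.2};   // from the "actual" parameters
            double[] pPrime = {0.4, 0.4, 0.2};   // from the "recovered" parameters
            System.out.println(kl(p, p));        // 0.0 when the distributions match
            System.out.println(kl(p, pPrime));   // a small positive number otherwise
        }
    }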

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about how well the parameters fit the data, not about exact convergence to the original parameters. So they often prove to be reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, Figure 3 shows us that our algorithm has likely converged (since the log-likelihood and KL divergence are not changing much) and gives us a reliable measure of model fit.

Figure 3: The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

GraphLint: Creating a Domain Specific Language for Graph Validation

Kenny Yu is currently a sophomore at Harvard College where he is studying computer science. In the summer of 2011, he interned at Knewton as a software engineer. During his internship, he built a new source review and release process and created a graph validation engine based on configuration schemas. He interned at Knewton again in January 2012 and rebuilt the graph validator by creating a domain specific language. The following blog post is a summary of this project and the process he used to design the language and the new validation engine.

The Question

At the core of the Knewton Adaptive Learning Platform is the course structure that brings all the lessons, concepts, and content into one connected graph. This graph is so large that checking it by hand is infeasible. Thus, Knewton’s graph datastore needed a validation system to check the structure of its graphs.

The validator idea is simple: given a graph, how can we tell if it is “well-formed”? How can we create a set of rules to express what edge and node types are acceptable in a given graph? My mentors, Jordan Lewis and Trevor Smith, decided that I should build a domain specific language (DSL) for writing the ontology schemas to solve this problem.

Developing the Language: “Onto”

Before creating the DSL, I had to understand the use case of the language I was building. After speaking with developers and content creators at Knewton, I composed this list of considerations that would drive my planning process:

  1. Expressiveness of the language – Above all, the language must be capable of encoding complex relationships between nodes and edges.
  2. Simplicity and ease of writing the language – The language must be easy to write and so easy to understand that non-developers can think and write in it.
  3. Ease of implementing the language – It should be easy to take the language and parse it into a form that can be used for graph validation.
  4. Graph validation run-time – Once the ontology has been parsed from the DSL into a usable form, how quickly can we validate a graph using the provided ontology?

The graphs I worked with are directed multigraphs, where edges and nodes may have data associated with them. In Knewton’s case, both nodes and edges have types (e.g., node types like “Concept” and “Module”, and edge types like “Contains” and “Represents”). Using these types, we want to assert relationships like:

For all nodes m of type Module, there exists exactly 1 node c of type Concept such that the directed edge (c,Contains,m) exists in the graph.

We also want to express logical combinations like the following:

For all nodes c of type Concept:

there exists exactly 1 node m of type Module such that the directed edge (c,Contains,m) OR (c,Represents,m) exists in the graph

AND

there exists exactly 0 or 1 nodes m of type Module such that the directed edge (c,Contains,m) exists in the graph.

I used statements like these as inspiration to create the DSL. The DSL consists of these statements:

  1. Universal quantifier statement: for all <variable> in <node type> <statement>
  2. Existential quantifier statement: there exists <number rule> <variable> in <node type> such that <statement>
  3. Edge rule assertion statement: (<variable>,<edge type>,<variable>)
  4. And statement: <statement> and <statement>
  5. Or statement: <statement> or <statement>
  6. Xor (Exclusive or) statement: <statement> xor <statement>
  7. Not statement: not <statement>

Thus, the above two statements, written in the DSL, look like this:
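
(A reconstruction from the grammar above; the exact keywords and parenthesization in the original snippet may have differed slightly.)

    for all m in Module
        there exists exactly 1 c in Concept such that
            (c, Contains, m)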

and the second one:
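
(Again, a reconstruction in the same style.)

    for all c in Concept
        (there exists exactly 1 m in Module such that
            (c, Contains, m) or (c, Represents, m))
        and
        (there exists exactly 0 or 1 m in Module such that
            (c, Contains, m))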

Lacking creativity, I decided to give ontology file names the “.onto” extension, and thus the Onto language was born.

Language Implementation Choice

It was difficult to determine which language I would use to implement this project. Having recently taken a compilers class in OCaml, knowing the benefits of OCaml’s strong type system and functional programming features, and having a personal preference for coding in OCaml, I thought that OCaml was the natural choice for this kind of project. However, after I considered the reusability of this project with other systems in the company, I realized that OCaml would not be the best choice. The company had recently decided to move towards Java and possibly other JVM-based languages. I researched Java-OCaml bindings and discovered that the existing options couldn’t meet this project’s needs. As a result, I chose to implement this project in Java to make it more compatible with other systems at Knewton.

Having only used ocamllex and ocamlyacc to create parser generators, I needed to research other parser generators for Java. My mentors suggested ANTLR. I found ANTLR to be a high quality project that was incredibly easy to use.

Validation Algorithm

Now that I could successfully parse ontology files into memory, how would I validate a graph? I decided to implement the algorithm in a functional way: traverse the abstract syntax tree (AST) generated by the parser and, for each statement node, create a closure that, when provided an environment (a mapping from variable names to actual nodes in the graph) and called, would return “true” or “false” (and thus represent whether the graph passed validation on that statement). After traversing the entire AST, one closure would be produced, representing the validate function. A list of errors would be generated if the graph failed validation.

For example, I did the following for the Universal quantifier statement:

for all <variable> in <node type> <statement>

  1. Generate the closure by traversing the AST of <statement>.
  2. Iterate over all nodes with type <node type> in the provided graph, and for each node:
     1. extend the current environment by mapping <variable> to the current node;
     2. call the closure on this new environment, and ensure the closure returns true.

I followed a similar procedure for the rest of the node types in an ontology.

Unfortunately, closures are not directly available in Java. Instead, I had to generate wrapper classes that wrapped methods representing the closures. In addition, because Java lacks match statements, I used the visitor pattern to traverse the AST.
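
To give a flavor of the approach, here is an illustrative sketch with made-up names (the Node and Graph classes are minimal stand-ins, not the actual GraphLint code):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal stand-ins for the real graph model, just enough for the sketch.
    class Node {
        final String type;
        Node(String type) { this.type = type; }
    }

    class Graph {
        final List<Node> nodes = new ArrayList<>();
        List<Node> nodesOfType(String type) {
            List<Node> result = new ArrayList<>();
            for (Node n : nodes) {
                if (n.type.equals(type)) result.add(n);
            }
            return result;
        }
    }

    // A "closure" as a one-method interface: given an environment (variable name
    // to graph node), does the statement hold?
    interface Validator {
        boolean check(Map<String, Node> environment);
    }

    // Wrapper class for the universal quantifier:
    // "for all <variable> in <node type> <statement>".
    class ForAllValidator implements Validator {
        private final String variable;
        private final String nodeType;
        private final Validator body;
        private final Graph graph;

        ForAllValidator(String variable, String nodeType, Validator body, Graph graph) {
            this.variable = variable;
            this.nodeType = nodeType;
            this.body = body;
            this.graph = graph;
        }

        public boolean check(Map<String, Node> environment) {
            for (Node node : graph.nodesOfType(nodeType)) {
                Map<String, Node> extended = new HashMap<>(environment);
                extended.put(variable, node);   // extend the environment with the new binding
                if (!body.check(extended)) {
                    return false;               // one failing binding fails the whole "for all"
                }
            }
            return true;
        }
    }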

Testing and Profiling

After unit testing the validation engine on small graphs and simple ontologies, I moved on to a real-life, error-riddled data set. The graph in this data set had 53,377 nodes and 53,960 edges. When I initially tried to validate it, the JVM exploded: even with 2GB of memory, the JVM still did not have enough heap space to validate the graph within half an hour. After profiling, I discovered that over a hundred million Strings were being generated by error messages for failed validations. As a result, I made several changes to make the validator more memory efficient: I used the flyweight pattern when allocating environment variables and node and edge types, and I generated the error messages lazily, only creating them when necessary.

This solved the memory problem. However, the run-time of the validation algorithm was still terrible. Each nesting level of a for all or exists statement added a factor of V to the asymptotic run-time (where V is the number of vertices). For example, if an ontology had three nested for all statements, the run-time would be O(V^3). Fortunately, since an edge can only connect two vertices (unlike in a hypergraph), I could optimize each exists statement to check only the neighbors of the current node. With this optimization, exists statements no longer multiplied the computational complexity. As a result of these memory and time optimizations, the run-time for the 50,000+ node data set went from 30+ minutes to less than 4 seconds. Changing the run-time from V^2 to V makes a big difference!
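
As an illustration of that last optimization (again with made-up names and a minimal graph model), compare a naive “exists” check that scans every vertex with one that scans only the adjacency list of the current vertex:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: restricting an "exists" check to the neighbors of the current vertex.
    class Vertex {
        final String type;
        final List<Vertex> neighbors = new ArrayList<>();
        Vertex(String type) { this.type = type; }
    }

    public class ExistsOptimization {
        // Naive: scan every vertex in the graph, O(V) per check.
        static boolean existsEdgeNaive(List<Vertex> allVertices, Vertex from, String toType) {
            for (Vertex candidate : allVertices) {
                if (candidate.type.equals(toType) && from.neighbors.contains(candidate)) {
                    return true;
                }
            }
            return false;
        }

        // Optimized: an edge can only lead to a neighbor, so scan just the
        // adjacency list, O(degree of the vertex) per check.
        static boolean existsEdgeViaNeighbors(Vertex from, String toType) {
            for (Vertex neighbor : from.neighbors) {
                if (neighbor.type.equals(toType)) {
                    return true;
                }
            }
            return false;
        }
    }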

Looking Back, and Next Steps

It’s a rare opportunity to go back and rebuild old projects from scratch. It’s an even rarer opportunity to get complete freedom (as an intern) to design a central component of the company’s system. At Knewton, I was fortunate to have both opportunities. Knewton tasked me with rebuilding the graph validation engine, and I had complete freedom to design the project however I wished. Even more remarkable: this project allowed me to work at a scale (50,000+ nodes!) that I would rarely ever touch at school.

There are several things that, in retrospect, I would have done differently. The lack of functional language features in Java made the validation algorithm difficult to implement. As a result, if I were to redo the project, I would probably implement it in Scala. Scala’s interoperability with Java and its functional language features (especially the ability to create closures and use case classes) would have made the implementation easier to write and more concise. Furthermore, there are many optimizations I could have implemented to make the validation engine faster: I could have used properties of second-order logic to transform quantifier and logic statements into equivalent statements with faster run-times, and I could have parallelized the algorithm to take advantage of concurrency. But this is pretty good for just two and a half weeks of work!