Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. The Kalman filter has numerous applications in finance and different areas of engineering, and in this blog post, we will show how the Kalman filter can apply to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.


To begin, we are going to describe our inference problem and a model of the system – a set of assumptions of how student learning works.

Given a series of scalar measurements:


where z_i denotes the natural logarithm of the time it took our student to answer question i,

We want to infer the scalar latent variables:


where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), where \tau is the time the student spent between questions, and \sigma^2_x is a hyperparameter that corresponds to the variance for this latent process (1).

Having the variance of the noise increase linearly with the time difference makes our equation consistent with a continuous Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience significant change in ability if they’re between questions in a quiz, but if it’s been a day since they’ve answered a question, they’re more likely to have taken the time to learn more about the concept, or, conversely, they might have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter, since it assumes a continuous-time model for a discrete set of observations.

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), where \sigma^2_z is a hyperparameter that corresponds to the variance of the Gaussian noise (2).

Kalman Filter blog post (v2)

The resulting model is pictured by the diagram above: the ability of the student at the previous question x_{k-1} determines the log response time z_{k-1}. The ability of the student at current question x_k is determined by the ability at the previous question x_{k-1}, and determines the current log response time, z_k.


The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given observations Z = \{z_1, z_2,...,z_k\}. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on syntax:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the k-1th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents our estimate of the variance given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of current latent state x_k having seen everything but our current observation z_{k}.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the addition of two independent random Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the addition of two independent random variables can be expressed as a convolution of the two composite probability densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1}) p(x_k|x_{k-1}) dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the multiplication of two Gaussians is still a Gaussian, although unnormalized:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1})*p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:


C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is on Knewton’s GitHub:


Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.

Cassandra and Hadoop – Introducing the KassandraMRHelper

Here at Knewton we use Cassandra for storing a variety of data. Since we follow a service-oriented architecture, many of our internal services are backed by their own data store. Some of the types of data we store include student events, recommendations, course graphs, and parameters for models. Many of these services and clusters are often deployed in more than two environments, increasing the total number of Cassandra clusters deployed.

On the Analytics team at Knewton we are constantly working on improving a lot of the inferential models that go into our platform, while at the same time building new ones. This often involves munging a lot of data in short periods of time. For a lot of our ad-hoc analysis we use a data warehouse by which analysts can query and extract data relatively quickly. One of the challenges we’ve faced at Knewton — and specifically in Analytics — involved how to go about populating our data warehouse with data from Cassandra clusters that predated our data warehouse. To solve this problem, we implemented an internal library for bulk extracting data out of Cassandra into Hadoop with zero hits to the Cassandra cluster. A few months later we opened sourced it here and called it the KassandraMRHelper.

KassandraMRHelper takes a slightly different approach than the constructs contained in the Hadoop package in the Cassandra source code (e.g. AbstractColumnFamilyInputFormat), in that it doesn’t require a live Cassandra cluster to extract the data from. This allows us to re-run map-reduce jobs multiple times without worrying about any performance degradation of our production services. This means that we don’t have to accommodate more traffic for these offline analyses, which keeps costs down.

How does it work?
The KassandraMRHelper includes specialized Input Formats and Record Readers for SSTables. First, here’s a little bit about SSTables:

SSTables are immutable; once they’re written they never change.
SSTables can exist independently of each other but collectively they form the complete data set.
SSTables consist of 4-5 parts depending on which version you’re using:


There are 4 to 5 different components:

  • Data.db files store the data compressed or uncompressed depending on the configuration
  • Index.db is an index to the Data.db file.
  • Statistics.db stores various statistics about that particular SSTable (times accessed etc)
  • Filter.db is a bloomfilter that’s loaded in memory by the cassandra node that can tell it quickly whether a key is in a table or not.
  • CompressionInfo.db may or may not be there depending on whether compression is enabled for a ColumnFamily. It contains information about the compressed chunks in the Data.db file.

Data in columns and rows are essentially key value pairs, with rows as the keys and columns as values to the rows. The columns are also key value pairs consisting of a name and a value.

Given how data are stored, Cassandra is in fact a really good fit for MapReduce. The same partitioning schemes that Cassandra uses can also be used in MapReduce. Columns and rows can be the keys and values that get passed in the Mappers or Reducers in a MapReduce job.

Some key components of KassandraMRHelper are:

  • The SSTableInputFormat: The SSTableInputFormat describes how to read the SSTables and filters through all the components and ColumnFamilies, keeping only what’s necessary for processing using a custom PathFilter. There are two types of SSTableInputFormats depending on how you want to represent key/value pairs in MapReduce. The SSTableColumnInputFormat constructs an SSTableColumnRecordReader in which keys in the Mapper represent the row keys in Cassandra and values represent a single column under that row key. Similarly the SSTableRowInputFormat constructs an SSTableRowRecordReader in which keys in the Mappers are the Cassadndra row keys and values are column iterators over all the columns under a single row key.
  • The SSTableRecordReader: It internally uses an SSTableScanner and a Descriptor similar to the one contained in recent version of Cassandra but with backwards compatibility to identify all the parts of an SSTable. As described previously, subclasses of the SSTableRecordReader can represent values as single columns under a row or entire column iterators.
  • The SSTableColumnMapper: This abstract mapper inherits from the default Hadoop mapper but adds a bit more structure to deal with the ByteBuffers and columns contained in the SSTables. It can also skip tombstoned columns.
  • The SSTableRowMapper: This is similar to the mapper above that deals with column iterators.

Setting up a MapReduce job for reading a Cassandra cluster becomes very simple. The only missing piece is finding an easy way to get all the SSTables into a Hadoop cluster. At Knewton we found Netflix’s Priam to be a good match. Priam backs up our Cassandra cluster multiple times a day into S3 making it really easy to transfer the data to Elastic MapReduce (EMR).

This simple MapReduce job shows a complete example job that consumes student event data from backed up Cassandra SSTables. The example can also be found here.

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf);
      RandomPartitioner.class.getName(), job);
  SSTableInputFormat.setComparatorClass(LongType.class.getName(), job);
  SSTableInputFormat.setColumnFamilyName("StudentEvents", job);
  SSTableInputFormat.addInputPaths(job, args[0]);
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
public class StudentEventMapper extends SSTableColumnMapper
      <Long, StudentEvent, LongWritable, StudentEventWritable> {
  public void performMapTask(Long key, StudentEvent value, Context context) {
    context.write(new LongWritable(getMapperKey(),
                  new StudentEventWritable(studentEvent));
  protected Long getMapperKey(ByteBuffer key, Context context) {
    ByteBuffer dup = key.slice();
    Long studentId = dup.getLong();
    return studentId;
  protected StudentEvent getMapperValue(IColumn iColumn, Context context) {
    return getStudentEvent(iColumn, context);

Notice that the mapper extends from a specialized SSTableColumnMapper which can be used in conjunction with the SSTableColumnRecordReader.

The example above uses the identity reducer to simply write the data as comma separated values by calling the toString() method on the StudentEventWritable objects. The only additional task you have to worry about in the Reducer is deduping the data, since you will probably have a replication factor of > 1.

Automating this job becomes an easy task given that SSTables are immutable and older tables don’t have to be read if they were already read once. Enabling incremental snapshots can also help here.


If you want to get started on using the KassandraMRHelper you can check out the code here: Contributions are more than welcome and encouraged.

If you’re interested in additional topics in Cassandra and Hadoop you should check out the presentation on bulk reading and writing Cassandra using Hadoop here with the slides shared here.


Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the added help may contribute more headache than help as even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos of either kangaroos and kittens, but you do not know which photo depicts what. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining if a photo is a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question \beta j with difficulty j and student i with ability \theta i, the probability that question j is answered correctly, xj = 1, is


For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include labels L, vector of abilities theta, and vector of difficulties beta. To make sure we’re all on the same page, the labels L represents all of the given labels by our labelers. Not all labelers need label every photo. Each of our abilities \theta can range from negative infinity to positive infinity. The greater the ability, the more skilled the labeler. Our difficulties range from zero to infinity where the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.


Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

How does this all have anything to do with if the photo is a kangaroo or a kitten? For specific photo j, we can derive how likely the photo depicts either adorable animal. That is, the posterior probability of the correct label zj for photo j denotes the probability the photo depicts an animal.

Because we know that the photo contains either of the two animals, we can designate kangaroo as 0 and kitten as 1. Our posterior probability then designates from 0 to 1 how likely the photo is to contain either animal. If we assume that the correct label zj is independent of both abilities theta and difficulties beta, the probability simplifies dramatically.



The posterior probability now consists of two components: a prior belief and an IRT-based probability. Our first term p(zj) captures our prior knowledge about how many of the photos contain each. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use that parameter to include our prior belief in the model. The second probability uses our 1PL IRT probability to denote the probability the labeler gave a label (aka answered a test question) conditioned on the correct answer, the labeler ability, and the difficulty of the question.


Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to find a way to make our posterior probabilities as large as possible—denoting how certain we are overall of our guesses of the correct label—by adjusting our parameters and . More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta) with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, we can find our joint log-likelihood function:


Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach outscores the majority approach at nearly 5% initially before converging later. Additionally, as we increase the number of voters, the accuracy increases. Success!


While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and J. Movellan. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.

Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach*

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition includes adding single digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach*

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x+ _x + _.

The underlines are “holes” that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers one through 10, leading to 10^3= 1000 possible questions. The instructor may wish to further restrict the template to only permit quadratics with real, distinct roots.

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessments generator engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The below template is used to generate questions of the form _ +_ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9

The question and answer are simply strings with variable names denoting the “holes.” Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with 2 decimal places, or even the set of fibonacci numbers. Derived variables are defined by mathematical expressions.

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = -b + sqrt(discriminant)/(2*a)
    root2 = -b - sqrt(discriminant)/(2*a)

Here, we are generating each coefficient (a, b, c) from the range [-10, 10]. However, the below table illustrates an issue with this template. For certain coefficients, the solutions can be integral, fractions, irrational numbers, or even imaginary numbers.



(1, -10, 16)

2, 8

(-4, 7, -3)

0.75, 1

(3, 6, 2)

-1.577…, -0.422…

(1, 2, 1)


(1, 4, 8)

-2 + 2i, -2 -2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = -b + sqrt(discriminant)/(2*a)
    root2 = -b - sqrt(discriminant)/(2*a)
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1

In general, constraints are useful for narrowing the skill set covered by the template, and to ensure that instantiations of the template are sufficiently varied. In this example, requiring unique, integer roots is used to control difficulty level, while requiring gcd(a, b, c) = 1 ensures that no two questions will be multiples of one another.

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <m> minutes traveling <r> miles/hour?"
  answer is "<dist>miles"
  variables {
    m is an integer between 30 and 90 divisible by 10
    r is an integer between 40 and 65 divisible by 5
    dist = rate*time/60
    wrong1 = rate*time
    wrong2 = rate/time
    wrong3 = rate/time/60
  wrong answers {

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “rate/time/60” because she doesn’t know the correct distance formula, while another student might choose “rate*time” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from the sample sets and using the formulas to compute the derived variables

  2. If the variables satisfy all of the constraints, add it to the list of questions.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a large enough fraction of the sample space (the set of all possible questions, i.e. the cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can handle 100,000 iterations, roughly 1/1000 questions need to be valid. This isn’t too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.

It is a very difficult to come up with a more efficient approach. For some problems, algorithms do exist to generate solutions (see Euclid’s method for pythagorean triples), but for others it is mathematically impossible (see Hilbert’s tenth problem). In many cases, introducing heuristics may improve the random algorithm. For instance, it may be possible to identify a large chunk of the sample space that leads to solutions that are too large, non integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala’s powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like “sqrt(x^3 + 5)” into my internal data model. While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map with a ParMap, and appending “.par” to the main program loop.

*For examples of solution-oriented approaches in the literature, see (survey) and (grade school math).

*For examples of template-based approaches in the literature, see (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means To Measure in the Xitsonga Language

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic, in this case Engagement, data from different sources within our products, and  engineer tools to calculate this analytic. This task itself encompases not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further there is a need to analyze and validate the measure on a larger scale than the initial one used to develop it. This necessitates the need for a feedback loop between the data analysts and data infrastructure components.

Engagement is a many-faceted construct.  One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: Let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There can be many ways of quantifying this: time spent on the system; number of logins per week; time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that will take student interactions and the recommendations into account and calculate the engagement analytic. Further, this analytic is an example of analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: The user’s intention is important as it can make an analytic useful or less so. For example: there is a situation where the student did not intend to do the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). In this situation we would have to treat this as a data point that would not be useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might be more efficient to infer an analytic from a readily available dataset than use a much richer, hard to obtain dataset that is less efficient and only provides a small boost in accuracy with an accompanying large use of resources.

Computing Engagement

The Engagement pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data the student learning/task sessions are inferred. From this data we then calculate the Engagement analytic. The computed analytic is reported and the data used for its calculation stored for future use and analysis.

 Building the Analytics Pipeline

After the initial pipeline is engineered there are still a number of tasks to complete. Validation is very important in determining if the analytic can be interpreted as expected. We need to know whether with sample data it produces similar results to the analytic development stage. This part of the process involves some modeling and statistics and needs analysis tools in order to detect errors and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Generative Models for Description and Prediction

This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.

In this post, I will discuss an approach to conveying information about student behaviors using generative models.

A generative model is a description of how to generate data that is similar to the observed data.

Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.

Here are some examples of observations of correctness on practice questions from when each student started a course until today.

StudentA: “1111111110”

StudentB: “100011110111100111111011100110011”

StudentC: “111011110111111111111111111111111”

We could specify different generative models for each student. One possible generative model for student A could be:

Model #1 (Student A)

“print nine 1’s and then one 0″.

If we wanted to describe student B similarly the description would be:

Model #1 (Student B)

“print one 1 then three 0’s then four 1s then one 0 …”.

Models like Model #1 exactly preserve all of the information that was present in the student data but in doing so don’t reduce the complexity of the observations at all. Often, in communicating information about student behaviours, we seek to preserve the important information while reducing the complexity to communicate concisely.

A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.

Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.

Model #2:

“Flip a weighted coin that lands on heads with probability w. If it’s heads, print 1, otherwise print 0″.

In reporting to a teacher, I can report that the model that best fits with the student history is w=0.9.

That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).

While the model parameters may be meaningful themselves (in this case they are the student’s overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w0.65. If I wanted to find the student most similar to StudentC (w0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (eg. Levenshtein distance, which would yield studentB as the closest student to studentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.

In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*

In choosing Model #2 to convey my information, I’ve consciously made some decisions about what qualities of the data are important or not. For example this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but had its order reversed, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.

Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.

In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”

In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”

*For example, a common generative model used for summarizing text documents is the Latent Dirichlet Allocation (LDA) model is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.

Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.

On Comparable Noise and Satisfiable Sound

The Motivation

Working for a startup can be intense. Many in this great city succumb to the pressures of work life and relinquish their purely intellectual pursuits. Here at Knewton, however, our love affairs with art, music, literature, formal systems — love affairs that, for the most part, began within very walls of the institutions that we hope to disrupt — these passions inspire our work and our camaraderie with each other. And practically speaking, interview questions at New York’s hottest startups can lean heavily on what was learned in the undergraduate CS lecture hall. The need to keep these skills sharp is clearly still relevant.

The aim here is to kick off a series of blog posts that pair substantial results in CS or formal systems with significant works of classical music. The series is limited to classical works explicitly — there is plenty of headphones-on time at work to listen to The Postal Service.

The Works

Listen to this: Gondwana by Tristan Murail

Read this: On Computable Numbers by Alan Turing

Alan Turing’s On Computable Numbers, with an Application to the Entscheidungsproblem is a seminal paper. Anyone taking a CS undergraduate core will encounter the notion of computability and the halting problem. Further from the ivory tower, it is not difficult to find people who are near-giddy about Turing Completeness:

Breaking news: HTML + CSS3 is Turing Complete,

People looking for help developing a Turing machine in Microsoft Excel,

A Turing machine built using Legos,

Brainfuck, a language described on its website as “an eight instruction Turing-Complete Programming Language”

Turing’s name has become quite literally a “formal” stamp of approval in the computing world. Turing’s seminal paper defined a set of abstractions that formed the basis for what we now call “Turing Completeness,” a requirement of all first-class computing languages. What exactly did Turing say in that paper? And what does it have to do with French spectral music?

Turing’s Paper

Beyond a few feeble mechanical attempts, computers did not exist in 1936, and Mr. Turing didn’t appear that excited about making one — he was interested in making Godel’s long, German work on incompleteness comprehensible. That is, Turing wanted to intuitively construct a class of decision problems that could be proven undecidable. Turing could not tackle this without defining a notion of “computable.” His definition leaned heavily on a few notions that we would recognize today as common, but required a little more formal attention-to-detail back then.

Turing describes memory:

We may compare a man in the process of computing a real number to a machine which is only capable of a finite number of conditions q1, q2, …. qI; which will be called “m-configurations”

and the idea of output:

… the subsequence of the symbols printed by it which are of the first kind will be called the sequence computed by the machine. The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

Academic language aside, the machine Turing is defining is simple. Turing calls this machine an “‘automatic-machine’ (or alpha-machine).” The machine has the following properties:

a tape divided into squares

symbols that are on the squares

the scanned symbol (the rth symbol). Modern texts will likely refer to this as the symbol the ‘head’ is pointing to.

configuration: the current m-configuration and the current scanned symbol (qn, G(r))

But this paper is not without its difficulties, especially if it is approached today.

One initial difficulty: having completed standard undergraduate classes in formal computation theory, Turing’s initial choice of terminology — “Circle-free” and “Circular” — was extremely confusing to my modern eyes. Most undergraduates approach Turing’s ideas in their coursework from the “halting problem” angle. What is initially confusing here is the following:


How can this be? Turing quickly skirts over the definition of a circle-free machine early on in the paper when he describes it as the following:

If a computing machine never writes down more than a finite number of symbols of the first kind, it will be called circular. Otherwise it is said to be circle-free. A machine will be circular if it reaches a configuration from which there is no possible move, or if it goes on moving, and possibly printing symbols of the second kind, but cannot print any more symbols of the first kind. (p. 233)

Finite. This should be the same as a machine that “halts,” right?

No. Turing’s machines are built to write down numbers. These numbers are (somewhat sneakily, in my opinion) implied to be repeating binary decimals on page 232:

The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

So we’re essentially talking about endlessly repeating floats — the kind of numbers that you get when you type in [1] [/] [3] on a calculator, except in binary. Numbers with bars over the top of the last digit: 0.33333333… for example (0.010101010… in binary). In Turing’s paper, the machine that prints 0.010101010… is “circle-free.” This means it encounters no terminal difficulty in computation — it can keep printing out the endless decimal representation without pausing indefinitely to calculate. By this definition, “finite” numbers such as 4, 5, and 6 would be circle-free numbers, because the machine could be constructed to just print out a sequence of zeroes after the decimal point: 6.000000000….

“Halting” really has no place in his discussion: if we took a machine that “did not halt” by the “halting problem definition”, it would print out, say, 6.000 and then wait for an indefinite amount of time before printing out the next zero.

Understanding the important concept of a circle-free machine is critical, even if your only goal in reading the Turing is to see Godel’s diagonalization “done simply” (see the argument on page 247 of the article, where he argues that the machine H is circular by suggesting an infinite loop).

Why the Murail?

The Murail is a significant representation of the spectral music genre which has significant parallels with the work of Turing.

The Murail, like the Turing, uses an intuitive language, completely divorced from the method-as-music ideal of serialism (about serial music). Gondwana is based entirely off of the first sounds you hear in the piece — it is the subsequent deconstruction that leads the listener down a path that is nearly always defined momentarily, in that each subsequent moment of sound in the piece is very often based off of the prior moment. There are few moments of discontinuity.

The work starts off very simply. The piece begins with seven to eight moments of repeated sound (see 0:12, 0:30, 0:46, 1:01, 1:15, 1:28, 1:38, 1:48, and then MAYBE [1:55, 2:01, 2:07, 2:13-2:40] in this recording), as easily understood derivations from a simple sonic idea.

Indeed, these events are like Turing’s “appeal to intuition” by Murail. There is no recognition of pattern possible without repetition. With the seven chime-strike chords, Murail is handing us a ball of yarn — we will need it to find our way home from the complicated, thorny walk in the woods that is the rest of the piece. There will be no resounding return, no recapitulation to save the day here.

Beyond the opening, the Murail begins to dig deep into the essence of sound. At roughly eight minutes into the work, a hissing begins to become a main sonic feature. The connection to the beginning is simple — there is a brief, percussive sound at the beginning of the piece with a few strings playing, perhaps a maracca — there are lots of high, hiss-like frequencies that comprise the sound of a maracca. But it brings to mind an important question — a question that is at the heart of spectral music: what kind of music is happening in the parts of sound waves that we DON’T hear? What about all the extremely high frequencies that dogs can hear? Above those, even? They exist. They have to. A sound wave can be encoded by a number. Large numbers exist. What if we slowed the noise, say, of water molecules at room temperature bouncing around inside a mason jar? Anyone that has been distracted by the sound of a tea kettle on the verge of boiling already knows that interesting sonic things can happen with water molecules on the surface of a boiling hot-plate. But what about beyond the ear? What types of large-number-based frequencies exist at this sonic extremes of our world?

Computable vs. Uncomputable Sounds?

There is a strong argument that much of “Western Harmony” can be boiled down into small, rational-number relationships between different frequencies; numerical relationships, in fact, that are easily computable by a Turing machine (with a reasonable “skeleton table,” even!). A major triad, for example, is little more than the three ratios:

1, 5/4, and 3/2 (pitches ‘do’, ‘mi’ and ‘so’, if you’ve ever watched “The Sound of Music”)

By the standards of Murail (and Turing, for that matter), these are small numbers, and the human ear is quite adept at performing the computation of recognizing them and placing them in reasonable cultural and musical contexts.

The human ear, from a perspective of what it can understand or “compute” in this way, is quite good. We like minor chords, diminished chords — heck, we’ll even listen to Coltrane. Serial music is tougher. In general, music that doesn’t “sound good” to the western ear because of dissonance, etc. likely involves sound frequency relationships that make use of larger numbers. A minor chord, for example, requires use of the 6th and 8th partials. Its ratios, depending on how you define the third of the chord, require much larger numbers than that of a major chord. The numbers involved with frequencies heard in the Murail then, in all their roaring, bending, and hissing glory, beg a final question:

What is the limit? What is uncomputable for the human ear?


Parallels with music aside, readers not familiar with Turing’s life and the circumstances of his death should read this: turing wikipedia article. Oppression and intolerance are real, and Turing suffered greatly at the hands of the British government.


Why “N choose K”?

The name of the Knewton tech blog is “N choose K”. If it’s been a long time since you set foot in a math class, this may not have any meaning to you. However, it’s an interesting concept, and really not all that difficult. In this post, I’ll go over what the term means in simple English and discuss how we use it here at Knewton.

Imagine you’re visiting the newest, trendiest lounge in New York. Like many such lounges, this one specializes in producing excellent cocktails from before the Prohibition era. You look over the menu and see that they offer one drink for each type of alcohol: champagne, gin, rum, tequila, and vodka. You decide that you want to try all of them, in pairs rather than one at a time, to see how well they go with each other. You want to know how many different pairs‚ (one for each hand) you can make with these 5 drinks.

One way of doing this would be to count each of them. To do this, we can label each of the drinks with their main ingredient:


and systematically count each of the pairs (note that order of ingredients is not important):

for a total of 10 different pairs of drinks. Mathematically, this is known as n choose k, and is written as  shawn22with n being the total number of items, and k being the number to choose each time. You’re asking “how many ways can I choose 2 items out of 5,” which we know from our graph is: shawn3
It’s not too difficult to count each of these by hand. But now imagine that you’re an octopus at a different bar, which has 20 drinks, and you want to find out how many ways you can choose 8 drinks out of 20, i.e.Screen Shot 2015-01-29 at 6.00.37 PM
Clearly, counting is not an option here.

The great thing about n choose k is that there’s a simple formula to figure it out: you just plug in your values for n and k, and out comes the answer. Here’s the formula:

shawn4which is a lot less complicated than it seems. The exclamation point is known as a factorial, and you can find any factorial by multiplying together all the numbers from that number to 1. So, 3! is simply: shawn5Let’s try it on our original problem, how many ways can we choose 2 drinks out of 5?


Note how we can make the calculations easier by canceling the top and bottom of the fraction when they’re exactly the same.

Now withScreen Shot 2015-01-29 at 6.00.37 PM


It’s a good thing octopi have a strong tolerance for liquor.

Of course, “n choose k” pertains to more than just mixing drinks at bars. In fact, you can see this in all sorts of applications, and often in exactly the sort of work we do here at Knewton. For example, let’s say we have a graph that has only paths to adjacent nodes:

The question we need to answer is how many shortest paths are there from A to Z? (These are called lattice paths). There are algorithms that allow you to determine this exactly, but they require polynomial time to complete. Instead, let’s solve an easier problem: What is the upper bound of the number of shortest paths from A to Z? So, what would the graph would look like where the upper bound is the same as the actual number?

Well, a fully-connected graph would fit this scenario:

We can count the length of the path from A to Z by counting the length of the path:

Clearly we cannot have a shorter path than this, and so all shortest paths must have exactly 6 steps. The only independent variable then is the choice of which steps we take to the right; the dependent variable is which steps we take down (it’s dependent because it is completely determined by the other variables). So in the example above:

Independent: steps 1, 2, 3, 4 are to the right
Dependent: steps 5, 6 are to the bottom (whatever is left)

Starting to sound familiar? It’s the same as “n choose k”! Given the total number of steps, how many ways can we choose to go right? And so we get:

More generally, for an m x n lattice grid, we can use the formula:

where j is one of the dimensions of the grid and k is the other; which one you pick doesn’t matter, as you’ll see below.

Be aware, you might see this formula as shawn14. This is correct if you count the starting point A as (0, 0) and ending point Z as (m, n), which subtracts one in each direction.

You might be interested in why we chose which to the right and which down; in other words, can we switch the independent variable? The answer is yes, and it doesn’t matter, and it’s pretty neat, because as you could calculate: shawn15

In fact, if you calculate a number of these and group them into a triangle shape, it results in a pattern known as Pascal’s triangle:


To find shawn17 for example, count to the fifth row and the second column, and you’ll find that shawn18. As you can see, the left side of the triangle has the exact same values as the right side, which is why shawn19, and more generally:

These values are also known as binomial coefficients; if you’re interested in why, you can easily find more information on the web.

There are a couple neat patterns that Pascal’s triangle has. First, each number in the triangle is the sum of the two numbers above (you can think of the entire triangle surrounded by zeros if that helps). This is pretty miraculous if you recall our original formula for determining n choose k.

The second pattern has to do with its shape: if you take Pascal’s triangle and color only the odd numbers, you get a Sierpinski triangle, which you may be familiar with: it’s our n choose k blog logo!


Our work at Knewton allows us to play with ideas all day and turn those ideas into a reality. We chose the binomial coefficient  shawn22 to be the title of our tech blog because it reflects many of our core values as Knewton developers: our passion for mathematics and the language of ideas, and our commitment to intellectual endeavor.


In this entry we discussed what n choose k means and how you can use it to limit your alcohol consumption find the number of ways groups of things can be chosen. Then we found a simple formula to calculate it. We saw a practical use of formula in finding an upper bound of the number of shortest paths in a particular graph. Finally we looked at Pascal’s Triangle and the relationships between the binomial coefficients, aka the results of multiple n choose k computations. I hope you had some fun, learned a bit about math, and have something new to discuss at your next cocktail party.

Thanks to Kevin Wilson and Solon Gordon for their help with this post.

Knewton Crab Stacker: Innovations in Systems Engineering

Creating and maintaining the infrastructure to support a 40+ person developer team and more than a million users on the world’s most powerful and rigorously adaptive learning platform is no simple task. Conventional wisdom would suggest that a ten-person team with a wide swath of specialists would be the ideal arrangement. But in this regard, as with a number of other tech team practices, Knewton is anything but conventional.

Simply put, Knewton does not have an Ops team. Instead, the knowledge and tools required for infrastructure and systems tasks are distributed throughout the team. This structure confers a number of benefits. Developers are better able to write configurable, maintainable, deployable software because they have a strong understanding of our systems infrastructure. And systems-focused engineers are better able to optimize and maintain our infrastructure because they are also contributing members of the service developer teams.

In practice, this means that all software developers at Knewton are expected to both understand and utilize infrastructure technologies. All developers may find themselves on production support and tasked with updating environment configurations or deploying new code. Likewise, systems engineers all have a solid software engineering background and are expected to write production code.

Expectations and cross-pollination are only part of the process, however. Here’s some insight into the tools we use to create an environment where development and infrastructure work hand-in-hand.

AWS for everyone!

Every Knewton developer has access to our AWS console as well as the AWS command line tools pre-installed on their machines. Eventually every developer will go through the process of deploying code to production, at which point he or she will learn the basics of the AWS toolkit as well as our deployment and configuration management systems. As a result, no developer need waste time emailing the Systems group for basic information about EC2 instances or other AWS resources.

Knewton makes use of a number of AWS technologies, from EC2, to RDS, to ElastiCache and ElasticMapReduce. Many of these products have their own CLI tools, though they are all simple enough to install and configure that at Knewton we’ve made them part of box setup for new developers. While not every engineer is a wizard with the console or CLI commands, there is enough documentation to ensure any developer working alone on a late night will not be blocked by an AWS issue (unless the block is on AWS’ end…).

We have a few in-house tools that make use of AWS’ APIs, among them and the Knewton Crab Stacker. To focus this article a bit, I’ll dive specifically into the latter, as it addresses an important problem: deploying and updating software stacks.

CloudFormation and KCS

Knewton Crab Stacker, or KCS, makes use of the AWS CloudFormation toolset. CloudFormation makes it possible to define and deploy combinations of AWS resources, from EC2 instances to load balancers. The “define” part comes in the form of JSON templates, while deployment can be done either using the AWS console or a CLI tool.

Now, CloudFormation on its own is great for just “putting a box out there.” The templates let you define a wide variety of resources and set parameters and mappings for a base level service setup and tuning.

What CloudFormation doesn’t do well is box configuration. The JSON templates don’t allow you to do much in the way of defining functions, setting conditions, or even basic math. If you try to force configuration management onto it, you end up with a lot of bash scripts and config files floating around, or worse, hardcoded into templates.

Even relegating KCS to deploying boxes can be tricky. The launch command for a given stack can include a dozen or more command line arguments, such as EC2 instance parameters (size, type, etc.) and command flags.

The simplest case launch will make use of all defaults in the template and look something like this on the command line:

cfn-create-stack $StackName --template-file $Template

But if you need to use other CloudFormation functionality and override a number of parameters at launch, you’ll end up with something like this:

cfn-create-stack $StackName --template-file $Template --parameters AWSAccount=Production;InstanceType=m1.large;ClusterSize=4;ConfigTarballVersion=2.1.5;AppVersion=1.2.3;SSHKey=ProductionKey --capabilities CAPABILITY_IAM --disable-rollback

Yikes! That’s a lot to type from memory for each deploy. You’re especially going to want that last option to disable rollback as it keeps the instances from failed launches around for you to debug — essential for when you inevitably mistype a version number.

If stack launches are fairly consistent, you can mitigate the annoyance of launch commands with BASH scripts, but these will be a pain to maintain. But what if you have a number of frequently changing parameters or decisions that need to be made at launch? What if you need to work with multiple AWS accounts or validate components of your launch config so as to avoid a painful debug cycle? (Is that tarball you need in s3? Does this environment have all of my new stack’s dependencies?) Complex stacks can take ten to twenty minutes to launch. You don’t want to have to keep re-launching just because you fat-fingered the instance type.

The problem with the command above is that every parameter represents a potential point of failure. CloudFormation is only able to ensure that your template is logically consistent and the JSON is valid. It can’t know whether or not AppVersion 1.2.3 is a thing, or whether a four node cluster matches what is in the current environment, or numerous other details that can spoil an update before it begins.

This is where KCS steps in. Knewton Crab Stacker was developed by a team of Knewton engineers (including yours truly). KCS is a Python command line tool designed to make CloudFormation deployment much simpler.

The first nice thing KCS does is add the abstraction “environment” to our AWS accounts. It does this by simply taking the stackname parameter and appending $EnvironmentName + “-” to the front of it. From CloudFormation’s perspective, the stackname is “Dev-UserService,” but KCS understands the stack as “The UserService stack in the Dev environment.”

Making use of the namespace this way greatly simplifies the task of isolating test environments from one another. It adds one more piece to launch commands, which in the simplest case look like this:

kcs stacks create $Service $Template $Environment

The difference between this and the simple CloudFormation command above is what goes on behind the scenes.

Before initiating the create, KCS checks a number of things. First, KCS makes sure that the environment has any stacks that the new service needs. If a dependency is missing, you can still force a create. Secondly, KCS ensures that any s3 resources referenced in the template or launch command actually exist. In other words, if your launch command specifies “ServiceTarballVersion=0.3.444”, KCS makes sure that said tarball is actually there.

Updates are far more common than creates, and it is here where KCS really does a lot for you. Here’s a simple update command:

kcs stacks update $Service $Template $Environment

Like the create, KCS does a ton of validation on the template and environment. With the update however, KCS also runs a diff on the existing stack. Before the update actually runs, you will be shown a list of every parameter the update adds, removes, or changes. From there, you can either proceed with or cancel the update.

Before I do an update of a stack, I can also use “describe” to see what’s in the environment currently. The full command is “kcs stacks describe”, but I can shorten it using “s” and “d”, and aim it at our Dev environment like so:

kcs s d User Dev

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

SystemConfig Version: 1.1.55 User App Version: 1.1.224

i-01234567 (m1.medium)

This gives me a lot of cool info including the version of the App, some parameter information, as well as the instance ID, type, and hostname. If I want an exhaustive list of parameters I can do this:

kcs s d User Dev --detailed

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

Environment: Dev
Cluster Size: 4
SystemConfig Version: 1.1.55
Environment Class: Dev
User Version: 1.1.224
Instance Type: m1.medium
DB Password: ******
Billing Environment: Staging
ELBRegistration: Yes
AMI: ami-11223344
UserDB Address:
Key Name: STAGING-007
i-1234567   (m1.medium)
12.345.678.999                             us-east-1c

These commands make it easy to run updates without knowing much about the stack, but there is an even easier method for the truly lazy:

kcs stacks interactive-update $Service $Environment

This command uses the existing stack’s template and then lets you pass in values for each parameter while showing you what is in the current environment. It guarantees that you only change exactly what you want to change.

When the update actually runs, KCS adds a few layers of insurance that CloudFormation does not. For one, it spins up brand new instances, runs their launch config, and then waits for success signals before tearing down the old stack. This allows you to set up whatever level of functionality and performance testing you want as a condition of a successful update. If part of the launch config fails or does something unexpected, KCS rolls everything back.

All of this just scratches the surface at what KCS can do. I could write a few dozen pages about KCS’ other abilities, like grabbing service logs, executing remote commands, hotswapping jars on Java stacks, and even snapshotting entire environments and then recreating new environments from snapshots (need to copy a 100 instance staging environment? No problem).

The main thing that KCS does is to kill the need for a Release Engineer or “Deployment Guy.” Nobody is happier about this than I am, as I was the “Deployment Guy” for months. Instead, we have a situation now where systems engineers can focus on improving infrastructure and devs can get new code out easily.

The lion’s share of the credit for KCS has to go to Sarah Haskins and Trevor Smith, the two developers who did the bulk of the coding. It has made life easier for all developers here at Knewton, and we hope to open source it in the future.

Configuration management and future challenges

As nice as KCS is for our deployment workflow, it is only able to solve one part of our infrastructure needs. Like any moderately large tech team, there are natural conflicts of interest that arise between those focused on system stability and maintenance, and those trying to push out a slick new feature. We’re not immune from knowledge bottlenecks and technical debt, but as the team grows and practices are refined, the future looks brighter and brighter for our tech team.

At the very least, thanks to KCS, we have a pretty good handle on deploying services. Swell. But how do we configure boxes once CloudFormation puts them out there? How do we ensure that services are able to talk to one another and that stacks are resilient to a plethora of errors?

Those fantastic questions, I’m afraid, will have to be the subject of another “N Choose K” post.

Code Swarm Post Git: Data Visualization

The above visualization shows the history of commits for the Knewton Adaptive Learning Platform. A commit occurs when a developer alters the code and transfers the changes into our central git repository. Each file in a commit is represented by a colored-coded blob which indicates the type of file committed. Each time a developer commits a file, the algorithm brings the file closer to the user’s blob; this proximity reflects the associativity between code files and users as well as the relationships among users themselves. The histogram at the bottom displays the frequency of commits over time.

This visualization can also be referred to as a “code swarm”, and is an example of “organic information visualization” (a phrase coined by Ben Fry), that is, any data visualization that allows the elements to interact in a dynamic and unpredictable fashion. The benefit of such a visualization is that it brings the relationships between different project components into relief. The viewer can also observe the evolution of a project over time.