Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach*

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition include adding single-digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach*

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x^2 + _x + _.

The underlines are “holes” that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers 1 through 10, leading to 10^3 = 1,000 possible questions. The instructor may wish to further restrict the template to only permit quadratics with real, distinct roots.

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessment-generation engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The template below is used to generate questions of the form _ + _ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9
    sum=x+y
  }
}

The question and answer are simply strings with variable names denoting the “holes.” Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with 2 decimal places, or even the set of Fibonacci numbers. Derived variables are defined by mathematical expressions.
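To make the distinction concrete, here is a rough sketch, in Scala (the language the prototype is written in), of how a parsed template might be represented internally. The class and field names here are illustrative only, not the engine's actual data model.

// Hypothetical data model for a parsed template (names are illustrative).
sealed trait Variable
case class Generated(name: String, sampleSet: Seq[Int]) extends Variable
case class Derived(name: String, formula: Map[String, Int] => Int) extends Variable

case class Template(name: String, question: String, answer: String, variables: Seq[Variable])

// The "add 2 numbers" template expressed against this model:
val addTwoNumbers = Template(
  name = "add 2 numbers",
  question = "<x>+<y>=?",
  answer = "<sum>",
  variables = Seq(
    Generated("x", 1 to 9),
    Generated("y", 1 to 9),
    Derived("sum", bound => bound("x") + bound("y"))
  )
)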

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
}

Here, we are generating each coefficient (a, b, c) from the range [-10, 10]. However, the table below illustrates an issue with this template: depending on the coefficients, the solutions can be integers, fractions, irrational numbers, or even imaginary numbers.

(a, b, c)        Solutions
(1, -10, 16)     2, 8
(-4, 7, -3)      0.75, 1
(3, 6, 2)        -1.577…, -0.422…
(1, 2, 1)        -1
(1, 4, 8)        -2 + 2i, -2 - 2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1
  }
}

In general, constraints are useful for narrowing the skill set covered by the template and for ensuring that instantiations of the template are sufficiently varied. In this example, requiring unique, integer roots is used to control difficulty level, while requiring gcd(a, b, c) = 1 ensures that no two questions will be multiples of one another.
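To illustrate how such constraints might be checked, here is a small Scala sketch that tests one candidate set of coefficients against the three constraints above. The helper names are hypothetical; the real engine evaluates constraints written in the DSL.

// Illustrative constraint check for one (a, b, c) instantiation.
def isIntegral(x: Double, eps: Double = 1e-9): Boolean =
  math.abs(x - math.round(x)) < eps

def gcd(a: Int, b: Int): Int = if (b == 0) math.abs(a) else gcd(b, a % b)

def satisfiesConstraints(a: Int, b: Int, c: Int): Boolean = {
  val discriminant = b * b - 4 * a * c
  if (a == 0 || discriminant <= 0) false            // need a real quadratic with distinct real roots
  else {
    val root1 = (-b + math.sqrt(discriminant)) / (2 * a)
    val root2 = (-b - math.sqrt(discriminant)) / (2 * a)
    isIntegral(root1) && isIntegral(root2) &&       // root1, root2 must be integers
      root1 != root2 &&                             // root1 <> root2
      gcd(gcd(a, b), c) == 1                        // gcd(a, b, c) = 1
  }
}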

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <m> minutes traveling <r> miles/hour?"
  answer is "<dist>miles"
  variables {
    m is an integer between 30 and 90 divisible by 10
    r is an integer between 40 and 65 divisible by 5
    dist = r*m/60
    wrong1 = r*m
    wrong2 = r/m
    wrong3 = r/m/60
  }
  wrong answers {
    "<wrong1>miles"
    "<wrong2>miles"
    "<wrong3>miles"
  }
}

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “r/m/60” because she doesn’t know the correct distance formula, while another student might choose “r*m” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from the sample sets and using the formulas to compute the derived variables

  2. If the variable values satisfy all of the constraints, add the resulting question to the list.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a sufficiently large fraction of the sample space (the set of all possible questions, i.e., the Cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can afford 100,000 iterations, roughly 1 in 1,000 candidate questions needs to be valid. This isn’t too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.
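Here is what that generate-and-test loop might look like in Scala, specialized to the quadratic-roots template. The Candidate type, the sampling ranges, and the satisfiesConstraints predicate (sketched earlier) are illustrative rather than the engine's real interfaces.

import scala.util.Random

case class Candidate(a: Int, b: Int, c: Int)

val rng = new Random()

// Step 1: sample the generated variables uniformly at random from their sample sets.
def randomCandidate(): Candidate =
  Candidate(rng.nextInt(21) - 10, rng.nextInt(21) - 10, rng.nextInt(21) - 10)

// Steps 2 and 3: keep only candidates that satisfy the constraints, and repeat
// until enough questions are found or the iteration budget is exhausted.
def generate(needed: Int, budget: Int, valid: Candidate => Boolean): Seq[Candidate] =
  Iterator.continually(randomCandidate())
    .take(budget)
    .filter(valid)
    .take(needed)
    .toSeq

// e.g. generate(100, 100000, c => satisfiesConstraints(c.a, c.b, c.c))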

It is very difficult to come up with a more efficient approach. For some problems, algorithms do exist to generate solutions directly (see Euclid’s formula for Pythagorean triples), but for others it is mathematically impossible (see Hilbert’s tenth problem). In many cases, introducing heuristics may improve the random algorithm. For instance, it may be possible to identify a large chunk of the sample space that leads to solutions that are too large, non-integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala’s powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like “sqrt(x^3 + 5)” into my internal data model. While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map for a ParMap and appending “.par” to the main program loop (see the sketch below).
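For instance, a toy version of that change looks something like this. (On recent Scala versions the parallel collections live in the separate scala-parallel-collections module and need the import below; on the Scala versions of that era they shipped with the standard library.)

import scala.collection.parallel.CollectionConverters._  // not needed on older Scala versions

val candidates = (1 to 1000000).toVector

// Sequential filtering
val sequentialHits = candidates.filter(_ % 7 == 0)

// The same work spread across cores: just call .par first
val parallelHits = candidates.par.filter(_ % 7 == 0)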

Here is a screenshot of the prototype in action.

*For examples of solution-oriented approaches in the literature, see https://act.org/research/researchers/reports/pdf/ACT_RR93-09.pdf (survey) and http://research.microsoft.com/en-us/um/people/sumitg/pubs/chi13.pdf (grade school math).

*For examples of template-based approaches in the literature, see http://www.eecs.berkeley.edu/~dsadigh/WESE12.pdf (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means To Measure in the Xitsonga Language

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic (in this case, Engagement) and data from different sources within our products, and engineer tools to calculate this analytic. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, there is a need to analyze and validate the measure on a larger scale than the one initially used to develop it. This necessitates a feedback loop between the data analysis and data infrastructure components.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There can be many ways of quantifying this: time spent on the system, number of logins per week, time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that takes student interactions and recommendations into account and calculates the engagement analytic. Further, this analytic is an example of an analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: The user’s intention is important as it can make an analytic useful or less so. For example: there is a situation where the student did not intend to do the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). In this situation we would have to treat this as a data point that would not be useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might also be more practical to infer an analytic from a readily available dataset than to use a much richer but hard-to-obtain dataset that provides only a small boost in accuracy at a large cost in resources.

Computing Engagement

The overall view of the pipeline created for Engagement is shown in the figure below. The pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data the student learning/task sessions are inferred, and from those sessions we calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.

 Building the Analytics Pipeline

After the initial pipeline is engineered there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected. We need to know whether, given sample data, it produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Generative Models for Description and Prediction

This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.

In this post, I will discuss an approach to conveying information about student behaviors using generative models.

A generative model is a description of how to generate data that is similar to the observed data.

Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.

Here are some examples of observations of correctness on practice questions from when each student started a course until today.

StudentA: “1111111110”

StudentB: “100011110111100111111011100110011”

StudentC: “111011110111111111111111111111111”

We could specify different generative models for each student. One possible generative model for student A could be:

Model #1 (Student A)

“print nine 1’s and then one 0”.

If we wanted to describe student B similarly the description would be:

Model #1 (Student B)

“print one 1, then three 0’s, then four 1’s, then one 0, …”.

Models like Model #1 exactly preserve all of the information that was present in the student data, but in doing so they don’t reduce the complexity of the observations at all. Often, in communicating information about student behaviors, we seek to preserve the important information while reducing the complexity so we can communicate concisely.

A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.

Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.

Model #2:

“Flip a weighted coin that lands on heads with probability w. If it’s heads, print 1; otherwise print 0”.

In reporting to a teacher, I can say that the model that best fits StudentA’s history is the one with w=0.9.

That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).

While the model parameters may be meaningful themselves (in this case they are the student’s overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w ≈ 0.65. If I wanted to find the student most similar to StudentC (w ≈ 0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (e.g., Levenshtein distance, which would yield StudentB as the closest student to StudentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.
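As a small illustration (not production code), fitting Model #2 and comparing students can be done in a few lines of Scala; for this model the maximum-likelihood value of w is simply the fraction of 1's in a student's history.

val histories = Map(
  "StudentA" -> "1111111110",
  "StudentB" -> "100011110111100111111011100110011",
  "StudentC" -> "111011110111111111111111111111111"
)

// Maximum-likelihood fit of w for Model #2: the fraction of observations that are 1.
def fitW(observations: String): Double =
  observations.count(_ == '1').toDouble / observations.length

val fitted = histories.map { case (student, obs) => student -> fitW(obs) }
// fitted("StudentA") == 0.9

// The student whose fitted w is closest to the target student's w.
def mostSimilarTo(target: String): String =
  (fitted - target).minBy { case (_, w) => math.abs(w - fitted(target)) }._1

// mostSimilarTo("StudentC") returns "StudentA"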

In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*

In choosing Model #2 to convey my information, I’ve consciously made some decisions about what qualities of the data are important or not. For example this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but had its order reversed, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.

Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.
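A rough sketch of that forward simulation in Scala, under the purely illustrative assumption that each of the 168 hours in a week is an independent coin flip with probability w:

import scala.util.Random

// Simulate one possible week of hourly activity for a student with parameter w.
def simulateNextWeek(w: Double, rng: Random, hours: Int = 168): Int =
  (1 to hours).count(_ => rng.nextDouble() < w)

// Run many simulations and summarize the distribution of active hours.
def predictedHours(w: Double, simulations: Int = 10000): (Double, Double) = {
  val rng = new Random()
  val samples = Vector.fill(simulations)(simulateNextWeek(w, rng).toDouble)
  val mean = samples.sum / samples.size
  val variance = samples.map(x => (x - mean) * (x - mean)).sum / samples.size
  (mean, math.sqrt(variance))  // mean and standard deviation of predicted hours
}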

In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”

In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”

*For example, a common generative model used for summarizing text documents is the Latent Dirichlet Allocation (LDA) model, which is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.

Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.

On Comparable Noise and Satisfiable Sound

The Motivation

Working for a startup can be intense. Many in this great city succumb to the pressures of work life and relinquish their purely intellectual pursuits. Here at Knewton, however, our love affairs with art, music, literature, formal systems — love affairs that, for the most part, began within the very walls of the institutions that we hope to disrupt — these passions inspire our work and our camaraderie with each other. And practically speaking, interview questions at New York’s hottest startups can lean heavily on what was learned in the undergraduate CS lecture hall. The need to keep these skills sharp is clearly still relevant.

The aim here is to kick off a series of blog posts that pair substantial results in CS or formal systems with significant works of classical music. The series is limited to classical works explicitly — there is plenty of headphones-on time at work to listen to The Postal Service.

The Works

Listen to this: Gondwana by Tristan Murail

Read this: On Computable Numbers by Alan Turing

Alan Turing’s On Computable Numbers, with an Application to the Entscheidungsproblem is a seminal paper. Anyone taking a CS undergraduate core will encounter the notion of computability and the halting problem. Further from the ivory tower, it is not difficult to find people who are near-giddy about Turing Completeness:

Breaking news: HTML + CSS3 is Turing Complete,

People looking for help developing a Turing machine in Microsoft Excel,

A Turing machine built using Legos,

Brainfuck, a language described on its website as “an eight instruction Turing-Complete Programming Language”

Turing’s name has become quite literally a “formal” stamp of approval in the computing world. Turing’s seminal paper defined a set of abstractions that formed the basis for what we now call “Turing Completeness,” a requirement of all first-class computing languages. What exactly did Turing say in that paper? And what does it have to do with French spectral music?

Turing’s Paper

Beyond a few feeble mechanical attempts, computers did not exist in 1936, and Mr. Turing didn’t appear that excited about making one — he was interested in making Gödel’s long, German work on incompleteness comprehensible. That is, Turing wanted to intuitively construct a class of decision problems that could be proven undecidable. Turing could not tackle this without defining a notion of “computable.” His definition leaned heavily on a few notions that we would recognize today as common, but required a little more formal attention-to-detail back then.

Turing describes memory:

We may compare a man in the process of computing a real number to a machine which is only capable of a finite number of conditions q1, q2, …. qI; which will be called “m-configurations”

and the idea of output:

… the subsequence of the symbols printed by it which are of the first kind will be called the sequence computed by the machine. The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

Academic language aside, the machine Turing is defining is simple. Turing calls this machine an “‘automatic-machine’ (or alpha-machine).” The machine has the following properties:

a tape divided into squares

symbols that are on the squares

the scanned symbol (the rth symbol). Modern texts will likely refer to this as the symbol the ‘head’ is pointing to.

configuration: the current m-configuration and the current scanned symbol (qn, G(r))

But this paper is not without its difficulties, especially if it is approached today.

One initial difficulty: having completed standard undergraduate classes in formal computation theory, Turing’s initial choice of terminology — “Circle-free” and “Circular” — was extremely confusing to my modern eyes. Most undergraduates approach Turing’s ideas in their coursework from the “halting problem” angle. What is initially confusing here is the following:

CIRCLE-FREE != HALTS

How can this be? Turing quickly skirts over the definition of a circle-free machine early on in the paper when he describes it as the following:

If a computing machine never writes down more than a finite number of symbols of the first kind, it will be called circular. Otherwise it is said to be circle-free. A machine will be circular if it reaches a configuration from which there is no possible move, or if it goes on moving, and possibly printing symbols of the second kind, but cannot print any more symbols of the first kind. (p. 233)

Finite. This should be the same as a machine that “halts,” right?

No. Turing’s machines are built to write down numbers. These numbers are (somewhat sneakily, in my opinion) implied to be repeating binary decimals on page 232:

The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

So we’re essentially talking about endlessly repeating floats — the kind of numbers that you get when you type in [1] [/] [3] on a calculator, except in binary. Numbers with bars over the top of the last digit: 0.33333333… for example (0.010101010… in binary). In Turing’s paper, the machine that prints 0.010101010… is “circle-free.” This means it encounters no terminal difficulty in computation — it can keep printing out the endless decimal representation without pausing indefinitely to calculate. By this definition, “finite” numbers such as 4, 5, and 6 would be circle-free numbers, because the machine could be constructed to just print out a sequence of zeroes after the decimal point: 6.000000000….

“Halting” really has no place in his discussion: if we took a machine that “did not halt” by the “halting problem definition”, it would print out, say, 6.000 and then wait for an indefinite amount of time before printing out the next zero.

Understanding the important concept of a circle-free machine is critical, even if your only goal in reading the Turing is to see Gödel’s diagonalization “done simply” (see the argument on page 247 of the article, where he argues that the machine H is circular by suggesting an infinite loop).

Why the Murail?

The Murail is a significant representation of the spectral music genre, which has notable parallels with the work of Turing.

The Murail, like the Turing, uses an intuitive language, completely divorced from the method-as-music ideal of serialism (about serial music). Gondwana is based entirely off of the first sounds you hear in the piece — it is the subsequent deconstruction that leads the listener down a path that is nearly always defined momentarily, in that each subsequent moment of sound in the piece is very often based off of the prior moment. There are few moments of discontinuity.

The work starts off very simply. The piece begins with seven to eight moments of repeated sound (see 0:12, 0:30, 0:46, 1:01, 1:15, 1:28, 1:38, 1:48, and then MAYBE [1:55, 2:01, 2:07, 2:13-2:40] in this recording), as easily understood derivations from a simple sonic idea.

Indeed, these events are Murail’s own “appeal to intuition,” much like Turing’s. There is no recognition of pattern possible without repetition. With the seven chime-strike chords, Murail is handing us a ball of yarn — we will need it to find our way home from the complicated, thorny walk in the woods that is the rest of the piece. There will be no resounding return, no recapitulation to save the day here.

Beyond the opening, the Murail begins to dig deep into the essence of sound. At roughly eight minutes into the work, a hissing begins to become a main sonic feature. The connection to the beginning is simple — there is a brief, percussive sound at the beginning of the piece with a few strings playing, perhaps a maraca — there are lots of high, hiss-like frequencies that comprise the sound of a maraca. But it brings to mind an important question — a question that is at the heart of spectral music: what kind of music is happening in the parts of sound waves that we DON’T hear? What about all the extremely high frequencies that dogs can hear? Above those, even? They exist. They have to. A sound wave can be encoded by a number. Large numbers exist. What if we slowed the noise, say, of water molecules at room temperature bouncing around inside a mason jar? Anyone who has been distracted by the sound of a tea kettle on the verge of boiling already knows that interesting sonic things can happen with water molecules on the surface of a boiling hot-plate. But what about beyond the ear? What types of large-number-based frequencies exist at the sonic extremes of our world?

Computable vs. Uncomputable Sounds?

There is a strong argument that much of “Western Harmony” can be boiled down into small, rational-number relationships between different frequencies; numerical relationships, in fact, that are easily computable by a Turing machine (with a reasonable “skeleton table,” even!). A major triad, for example, is little more than the three ratios:

1, 5/4, and 3/2 (pitches ‘do’, ‘mi’ and ‘so’, if you’ve ever watched “The Sound of Music”)

By the standards of Murail (and Turing, for that matter), these are small numbers, and the human ear is quite adept at performing the computation of recognizing them and placing them in reasonable cultural and musical contexts.

The human ear, from a perspective of what it can understand or “compute” in this way, is quite good. We like minor chords, diminished chords — heck, we’ll even listen to Coltrane. Serial music is tougher. In general, music that doesn’t “sound good” to the western ear because of dissonance, etc. likely involves sound frequency relationships that make use of larger numbers. A minor chord, for example, requires use of the 6th and 8th partials. Its ratios, depending on how you define the third of the chord, require much larger numbers than that of a major chord. The numbers involved with frequencies heard in the Murail then, in all their roaring, bending, and hissing glory, beg a final question:

What is the limit? What is uncomputable for the human ear?

Note:

Parallels with music aside, readers not familiar with Turing’s life and the circumstances of his death should read the Turing Wikipedia article. Oppression and intolerance are real, and Turing suffered greatly at the hands of the British government.

from: www.nateburke.com

Why “N choose K”?

The name of the Knewton tech blog is “N choose K”. If it’s been a long time since you set foot in a math class, this may not have any meaning to you. However, it’s an interesting concept, and really not all that difficult. In this post, I’ll go over what the term means in simple English and discuss how we use it here at Knewton.

Imagine you’re visiting the newest, trendiest lounge in New York. Like many such lounges, this one specializes in producing excellent cocktails from before the Prohibition era. You look over the menu and see that they offer one drink for each type of alcohol: champagne, gin, rum, tequila, and vodka. You decide that you want to try all of them, in pairs rather than one at a time, to see how well they go with each other. You want to know how many different pairs (one for each hand) you can make with these 5 drinks.

One way of doing this would be to count each of them. To do this, we can label each of the drinks with its main ingredient: champagne (C), gin (G), rum (R), tequila (T), and vodka (V), and systematically count each of the pairs (note that order of ingredients is not important):

CG, CR, CT, CV, GR, GT, GV, RT, RV, TV

for a total of 10 different pairs of drinks. Mathematically, this is known as n choose k, and is written as \binom{n}{k}, with n being the total number of items, and k being the number to choose each time. You’re asking “how many ways can I choose 2 items out of 5,” which we know from our count above is \binom{5}{2} = 10.
It’s not too difficult to count each of these by hand. But now imagine that you’re an octopus at a different bar, which has 20 drinks, and you want to find out how many ways you can choose 8 drinks out of 20, i.e. \binom{20}{8}. Clearly, counting is not an option here.

The great thing about n choose k is that there’s a simple formula to figure it out: you just plug in your values for n and k, and out comes the answer. Here’s the formula:

\displaystyle \binom{n}{k} = \frac{n!}{k!\,(n-k)!}
which is a lot less complicated than it seems. The exclamation point is known as a factorial, and you can find any factorial by multiplying together all the numbers from that number down to 1. So, 3! is simply 3 × 2 × 1 = 6.

Let’s try it on our original problem: how many ways can we choose 2 drinks out of 5?

\displaystyle \binom{5}{2} = \frac{5!}{2!\,3!} = \frac{5 \times 4 \times 3!}{2! \times 3!} = \frac{5 \times 4}{2 \times 1} = 10

Note how we can make the calculations easier by canceling the top and bottom of the fraction when they’re exactly the same.

Now with \binom{20}{8}:

\displaystyle \binom{20}{8} = \frac{20!}{8!\,12!} = 125970

It’s a good thing octopi have a strong tolerance for liquor.
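If you would rather let a computer do the arithmetic, the formula translates directly into code. Here is a short Scala version (a sketch using BigInt to avoid overflow); binomial(5, 2) gives 10 and binomial(20, 8) gives 125970, matching the calculations above.

def factorial(n: Int): BigInt = (BigInt(1) to BigInt(n)).product

def binomial(n: Int, k: Int): BigInt =
  factorial(n) / (factorial(k) * factorial(n - k))

// binomial(5, 2)  == 10
// binomial(20, 8) == 125970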

Of course, “n choose k” pertains to more than just mixing drinks at bars. In fact, you can see this in all sorts of applications, and often in exactly the sort of work we do here at Knewton. For example, let’s say we have a graph that has only paths to adjacent nodes:


The question we need to answer is: how many shortest paths are there from A to Z? (These are called lattice paths.) There are algorithms that determine this exactly, but they take polynomial time to complete. Instead, let’s solve an easier problem: what is an upper bound on the number of shortest paths from A to Z? In other words, what would the graph look like if the upper bound were the same as the actual number?

Well, a fully-connected graph would fit this scenario:


We can find the length of a shortest path from A to Z by counting the steps along one such path: four steps to the right and two steps down, for a total of 6 steps.

Clearly we cannot have a shorter path than this, and so all shortest paths must have exactly 6 steps. The only independent variable then is the choice of which steps we take to the right; the dependent variable is which steps we take down (it’s dependent because it is completely determined by the other variables). So in the example above:

Independent: steps 1, 2, 3, 4 are to the right
Dependent: steps 5, 6 are to the bottom (whatever is left)

Starting to sound familiar? It’s the same as “n choose k”! Given the total number of steps, how many ways can we choose to go right? And so we get:

\displaystyle \binom{6}{4} = 15

More generally, for a j × k lattice grid, we can use the formula:

\displaystyle \binom{j+k}{j}

where j is one of the dimensions of the grid and k is the other; which one you pick doesn’t matter, as you’ll see below.

Be aware, you might see this formula written in a slightly different form. That form is correct if you count the starting point A as (0, 0) and the ending point Z as (m, n), which subtracts one in each direction.

You might be interested in why we chose the steps to the right as the independent variable rather than the steps down; in other words, can we switch the independent variable? The answer is yes, it doesn’t matter, and it’s pretty neat, because as you can calculate:

\displaystyle \binom{6}{4} = \binom{6}{2} = 15

In fact, if you calculate a number of these and group them into a triangle shape, it results in a pattern known as Pascal’s triangle:

To find \binom{5}{2}, for example, count to the fifth row and the second column, and you’ll find that \binom{5}{2} = 10. As you can see, the left side of the triangle has the exact same values as the right side, which is why \binom{5}{2} = \binom{5}{3}, and more generally:

\displaystyle \binom{n}{k} = \binom{n}{n-k}

These values are also known as binomial coefficients; if you’re interested in why, you can easily find more information on the web.

There are a couple neat patterns that Pascal’s triangle has. First, each number in the triangle is the sum of the two numbers above (you can think of the entire triangle surrounded by zeros if that helps). This is pretty miraculous if you recall our original formula for determining n choose k.
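As a quick illustration of that “sum of the two numbers above” rule, here is a few-line Scala sketch that builds rows of Pascal’s triangle by adding a shifted copy of each row to itself:

// Each row of Pascal's triangle is the previous row added to a shifted copy of itself.
def nextRow(row: Vector[BigInt]): Vector[BigInt] =
  ((BigInt(0) +: row) zip (row :+ BigInt(0))).map { case (a, b) => a + b }

val triangle = Iterator.iterate(Vector(BigInt(1)))(nextRow).take(6).toVector
// Vector(1), Vector(1, 1), Vector(1, 2, 1), Vector(1, 3, 3, 1),
// Vector(1, 4, 6, 4, 1), Vector(1, 5, 10, 10, 5, 1)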

The second pattern has to do with its shape: if you take Pascal’s triangle and color only the odd numbers, you get a Sierpinski triangle, which you may be familiar with: it’s our n choose k blog logo!

Our work at Knewton allows us to play with ideas all day and turn those ideas into a reality. We chose the binomial coefficient \binom{n}{k} to be the title of our tech blog because it reflects many of our core values as Knewton developers: our passion for mathematics and the language of ideas, and our commitment to intellectual endeavor.

Conclusion

In this entry we discussed what n choose k means and how you can use it to find the number of ways groups of things can be chosen (and, perhaps, to limit your alcohol consumption). Then we found a simple formula to calculate it. We saw a practical use of the formula in finding an upper bound on the number of shortest paths in a particular graph. Finally, we looked at Pascal’s Triangle and the relationships between the binomial coefficients, a.k.a. the results of multiple n choose k computations. I hope you had some fun, learned a bit about math, and have something new to discuss at your next cocktail party.

Thanks to Kevin Wilson and Solon Gordon for their help with this post.

Knewton Crab Stacker: Innovations in Systems Engineering

Creating and maintaining the infrastructure to support a 40+ person developer team and more than a million users on the world’s most powerful and rigorously adaptive learning platform is no simple task. Conventional wisdom would suggest that a ten-person team with a wide swath of specialists would be the ideal arrangement. But in this regard, as with a number of other tech team practices, Knewton is anything but conventional.

Simply put, Knewton does not have an Ops team. Instead, the knowledge and tools required for infrastructure and systems tasks are distributed throughout the team. This structure confers a number of benefits. Developers are better able to write configurable, maintainable, deployable software because they have a strong understanding of our systems infrastructure. And systems-focused engineers are better able to optimize and maintain our infrastructure because they are also contributing members of the service developer teams.

In practice, this means that all software developers at Knewton are expected to both understand and utilize infrastructure technologies. All developers may find themselves on production support and tasked with updating environment configurations or deploying new code. Likewise, systems engineers all have a solid software engineering background and are expected to write production code.

Expectations and cross-pollination are only part of the process, however. Here’s some insight into the tools we use to create an environment where development and infrastructure work hand-in-hand.

AWS for everyone!

Every Knewton developer has access to our AWS console as well as the AWS command line tools pre-installed on their machines. Eventually every developer will go through the process of deploying code to production, at which point he or she will learn the basics of the AWS toolkit as well as our deployment and configuration management systems. As a result, no developer need waste time emailing the Systems group for basic information about EC2 instances or other AWS resources.

Knewton makes use of a number of AWS technologies, from EC2, to RDS, to ElastiCache and ElasticMapReduce. Many of these products have their own CLI tools, though they are all simple enough to install and configure that at Knewton we’ve made them part of box setup for new developers. While not every engineer is a wizard with the console or CLI commands, there is enough documentation to ensure any developer working alone on a late night will not be blocked by an AWS issue (unless the block is on AWS’ end…).

We have a few in-house tools that make use of AWS’ APIs, among them k.aws and the Knewton Crab Stacker. To focus this article a bit, I’ll dive specifically into the latter, as it addresses an important problem: deploying and updating software stacks.

CloudFormation and KCS

Knewton Crab Stacker, or KCS, makes use of the AWS CloudFormation toolset. CloudFormation makes it possible to define and deploy combinations of AWS resources, from EC2 instances to load balancers. The “define” part comes in the form of JSON templates, while deployment can be done either using the AWS console or a CLI tool.

Now, CloudFormation on its own is great for just “putting a box out there.” The templates let you define a wide variety of resources and set parameters and mappings for a base level service setup and tuning.

What CloudFormation doesn’t do well is box configuration. The JSON templates don’t allow you to do much in the way of defining functions, setting conditions, or even basic math. If you try to force configuration management onto it, you end up with a lot of bash scripts and config files floating around, or worse, hardcoded into templates.

Even relegating CloudFormation to deploying boxes can be tricky. The launch command for a given stack can include a dozen or more command line arguments, such as EC2 instance parameters (size, type, etc.) and command flags.

The simplest case launch will make use of all defaults in the template and look something like this on the command line:

cfn-create-stack $StackName --template-file $Template

But if you need to use other CloudFormation functionality and override a number of parameters at launch, you’ll end up with something like this:

cfn-create-stack $StackName --template-file $Template --parameters AWSAccount=Production;InstanceType=m1.large;ClusterSize=4;ConfigTarballVersion=2.1.5;AppVersion=1.2.3;SSHKey=ProductionKey --capabilities CAPABILITY_IAM --disable-rollback

Yikes! That’s a lot to type from memory for each deploy. You’re especially going to want that last option to disable rollback as it keeps the instances from failed launches around for you to debug — essential for when you inevitably mistype a version number.

If stack launches are fairly consistent, you can mitigate the annoyance of launch commands with BASH scripts, but these will be a pain to maintain. But what if you have a number of frequently changing parameters or decisions that need to be made at launch? What if you need to work with multiple AWS accounts or validate components of your launch config so as to avoid a painful debug cycle? (Is that tarball you need in s3? Does this environment have all of my new stack’s dependencies?) Complex stacks can take ten to twenty minutes to launch. You don’t want to have to keep re-launching just because you fat-fingered the instance type.

The problem with the command above is that every parameter represents a potential point of failure. CloudFormation is only able to ensure that your template is logically consistent and the JSON is valid. It can’t know whether or not AppVersion 1.2.3 is a thing, or whether a four node cluster matches what is in the current environment, or numerous other details that can spoil an update before it begins.

This is where KCS steps in. Knewton Crab Stacker was developed by a team of Knewton engineers (including yours truly). KCS is a Python command line tool designed to make CloudFormation deployment much simpler.

The first nice thing KCS does is add the abstraction of an “environment” to our AWS accounts. It does this by simply taking the stackname parameter and prepending $EnvironmentName + “-” to it. From CloudFormation’s perspective, the stackname is “Dev-UserService,” but KCS understands the stack as “the UserService stack in the Dev environment.”

Making use of the namespace this way greatly simplifies the task of isolating test environments from one another. It adds one more piece to launch commands, which in the simplest case look like this:

kcs stacks create $Service $Template $Environment

The difference between this and the simple CloudFormation command above is what goes on behind the scenes.

Before initiating the create, KCS checks a number of things. First, KCS makes sure that the environment has any stacks that the new service needs. If a dependency is missing, you can still force a create. Secondly, KCS ensures that any s3 resources referenced in the template or launch command actually exist. In other words, if your launch command specifies “ServiceTarballVersion=0.3.444”, KCS makes sure that said tarball is actually there.

Updates are far more common than creates, and it is here where KCS really does a lot for you. Here’s a simple update command:

kcs stacks update $Service $Template $Environment

Like the create, KCS does a ton of validation on the template and environment. With the update however, KCS also runs a diff on the existing stack. Before the update actually runs, you will be shown a list of every parameter the update adds, removes, or changes. From there, you can either proceed with or cancel the update.

Before I do an update of a stack, I can also use “describe” to see what’s in the environment currently. The full command is “kcs stacks describe”, but I can shorten it using “s” and “d”, and aim it at our Dev environment like so:

kcs s d User Dev

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

SystemConfig Version: 1.1.55 User App Version: 1.1.224

i-01234567  ec2-01-23-45-678.compute-1.amazonaws.com (m1.medium)


This gives me a lot of cool info including the version of the App, some parameter information, as well as the instance ID, type, and hostname. If I want an exhaustive list of parameters I can do this:

kcs s d User Dev --detailed

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

Environment: Dev
Cluster Size: 4
SystemConfig Version: 1.1.55
Environment Class: Dev
User Version: 1.1.224
Instance Type: m1.medium
DB Password: ******
Billing Environment: Staging
ELBRegistration: Yes
AMI: ami-11223344
UserDB Address: amazonrds.us-east-1.rds.amazonaws.com
Key Name: STAGING-007
i-1234567   ec2-01-23-45-678.compute-1.amazonaws.com   (m1.medium)
12.345.678.999                             us-east-1c

These commands make it easy to run updates without knowing much about the stack, but there is an even easier method for the truly lazy:

kcs stacks interactive-update $Service $Environment

This command uses the existing stack’s template and then lets you pass in values for each parameter while showing you what is in the current environment. It guarantees that you only change exactly what you want to change.

When the update actually runs, KCS adds a few layers of insurance that CloudFormation does not. For one, it spins up brand new instances, runs their launch config, and then waits for success signals before tearing down the old stack. This allows you to set up whatever level of functionality and performance testing you want as a condition of a successful update. If part of the launch config fails or does something unexpected, KCS rolls everything back.

All of this just scratches the surface at what KCS can do. I could write a few dozen pages about KCS’ other abilities, like grabbing service logs, executing remote commands, hotswapping jars on Java stacks, and even snapshotting entire environments and then recreating new environments from snapshots (need to copy a 100 instance staging environment? No problem).

The main thing that KCS does is to kill the need for a Release Engineer or “Deployment Guy.” Nobody is happier about this than I am, as I was the “Deployment Guy” for months. Instead, we have a situation now where systems engineers can focus on improving infrastructure and devs can get new code out easily.

The lion’s share of the credit for KCS has to go to Sarah Haskins and Trevor Smith, the two developers who did the bulk of the coding. It has made life easier for all developers here at Knewton, and we hope to open source it in the future.

Configuration management and future challenges

As nice as KCS is for our deployment workflow, it is only able to solve one part of our infrastructure needs. Like any moderately large tech team, there are natural conflicts of interest that arise between those focused on system stability and maintenance, and those trying to push out a slick new feature. We’re not immune from knowledge bottlenecks and technical debt, but as the team grows and practices are refined, the future looks brighter and brighter for our tech team.

At the very least, thanks to KCS, we have a pretty good handle on deploying services. Swell. But how do we configure boxes once CloudFormation puts them out there? How do we ensure that services are able to talk to one another and that stacks are resilient to a plethora of errors?

Those fantastic questions, I’m afraid, will have to be the subject of another “N Choose K” post.

Code Swarm Post Git: Data Visualization

The above visualization shows the history of commits for the Knewton Adaptive Learning Platform. A commit occurs when a developer alters the code and transfers the changes into our central git repository. Each file in a commit is represented by a color-coded blob which indicates the type of file committed. Each time a developer commits a file, the algorithm brings the file closer to the user’s blob; this proximity reflects the associativity between code files and users as well as the relationships among users themselves. The histogram at the bottom displays the frequency of commits over time.

This visualization can also be referred to as a “code swarm”, and is an example of “organic information visualization” (a phrase coined by Ben Fry), that is, any data visualization that allows the elements to interact in a dynamic and unpredictable fashion. The benefit of such a visualization is that it brings the relationships between different project components into relief. The viewer can also observe the evolution of a project over time.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni [discussed in a previous post], Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question that was answered at that time: red dots represent incorrectly answered questions, while green dots represent correctly answered questions. The {y}-value of each dot is the question’s difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions that the student has answered.

In this post, I will discuss three methods which we used to evaluate the performance of our algorithms, and discuss their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.

Figure 1: A visual representation of our temporal IRT model.

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
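In code, this is essentially a one-liner; here is a direct Scala translation, assuming the actual and recovered parameters are given as equal-length vectors.

// Mean-squared error between actual and recovered parameter vectors.
def meanSquaredError(actual: Vector[Double], recovered: Vector[Double]): Double = {
  require(actual.length == recovered.length)
  actual.zip(recovered).map { case (a, r) => (a - r) * (a - r) }.sum / actual.length
}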

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario–the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice-versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations at different scales of the same data. Because the MSE looks at the difference between the two parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale. If they were on the same scale, then the line of best fit would have slope one and {y}-intercept {0}. However, in this case, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Figure 2: Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At instantiation, our algorithms should have low log-likelihoods–the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but this data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
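
To make these metrics concrete, here is a minimal Java sketch. It assumes we have already computed, for each observed response, the probability assigned to it by the actual parameters and by the recovered parameters; the method names and the plain-array representation are illustrative only:

// Log-likelihood of the recovered parameters: the sum of the log-probabilities
// they assign to each observed response (assuming responses are independent
// given the parameters).
public static double logLikelihood(double[] recoveredProbs) {
    double total = 0.0;
    for (double p : recoveredProbs) {
        total += Math.log(p);
    }
    return total;
}

// KL divergence D_KL(P || P') as defined above, where P and P' scale each
// datum's probability by 1 / (# of data).
public static double klDivergence(double[] actualProbs, double[] recoveredProbs) {
    int n = actualProbs.length;
    double divergence = 0.0;
    for (int i = 0; i < n; i++) {
        double p = actualProbs[i] / n;
        double pPrime = recoveredProbs[i] / n;
        divergence += Math.log(p / pPrime) * p;
    }
    return divergence;
}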

The log-likelihood and KL divergence both use likelihoods to measure fit, which means they care only about how well the parameters fit the data, not about exact convergence to the original values. This makes them reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, Figure 3 shows that our algorithm has likely converged (the log-likelihood and KL divergence are no longer changing much) and gives us a reliable measure of model fit.

Figure 3: The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

Java-Scala Interoperability

To many people, Java is programming. It’s the only programming language most of my non-technical friends could name off-hand, and the first one that many programmers learn. Since its initial appearance in 1995, Java’s usage has steadily grown.

There are many reasons for this. The design of the Java platform, in which Java code compiles to bytecode that runs on the Java Virtual Machine (JVM), allows Java to be a “write once, run anywhere” language. That is, Java code that works on my computer will work exactly the same on your computer, or a server in the cloud (in theory, at least). Its language features, especially static typing and object-orientation, lend it to robust, modular, rock-solid code. Due to years of JVM and Java development, the platform is blazingly fast and stable. In addition, the many libraries available for Java allow anybody to leverage years of development and experience to perform complicated tasks simply and quickly. The Java community is also strong: on top of the official documentation, there are years of blog posts (an unscientific Google search for “java blog” yields 268,000,000 results) and close to 280,000 questions tagged “Java” on StackExchange. That’s a lot of collective knowledge and expertise.

At Knewton our systems are Java-based, for several of the reasons mentioned above. But Java is far from perfect. Its syntax is heavy on boilerplate, and its support for concepts like concurrency, parallelism, and functional programming is unidiomatic. There are a lot of small tweaks and features that could make Java much more pleasant to work with. These imperfections aren’t significant enough to outweigh the benefits of using Java, but they do make daily work a little more challenging.

Enter Scala. Scala is what would happen if there were a Genie that eavesdropped on every Java developer as they cursed when their build broke due to a missing semicolon, then inferred a wish from their complaints. It offers what Java lacks: a much cleaner syntax (semicolons optional, types inferred), pattern matching and case classes, mix-ins (a more powerful inheritance/interface system), higher-order/first-class/anonymous functions, built-in immutable data structures, a built-in read-evaluate-print loop (REPL), and actor-based concurrency. Even more important than all of this, though, is one huge advantage: it runs on the JVM. This means that, unlike with other languages, adopting Scala doesn’t require reimplementing large bodies of code; programs can use existing Java libraries with minimal effort. For instance, the Java standard libraries are available as soon as your Scala REPL starts. To translate

ArrayList<String> fruits = new ArrayList<String>();
fruits.add("apple");
fruits.add("pear");
fruits.add("banana");
for (String fruit : fruits) {
    System.out.println(fruit);
}

into Scala, we simply write the following (JavaConversions is a Scala library that makes using Java collections like ArrayList much more idiomatic):

import java.util.ArrayList
import scala.collection.JavaConversions._
val fruits = new ArrayList[String]
fruits.add("apple")
fruits.add("pear")
fruits.add("banana")
for (fruit <- fruits) {
    println(fruit)
}
// More idiomatically
fruits.foreach(println)

If you’re using libraries that aren’t packaged with Java, or your own classes, just ensure that they’re on the classpath. In addition, the syntactic similarities between Java and Scala mean that the information from nearly any Java resource — a blog post, StackExchange question, Javadoc, or whiteboard code snippet — will be equally valuable in Scala, allowing Scala to tap into over 15 years of experience. While Scala may be a relatively new language, it has access to all of the network benefits of a behemoth like Java.

Portability works the other way, too, but with a few caveats. Consider the following Scala class, in the file “Hamster.scala”:

class Hamster()

The Java file “HamsterTest.java” might contain the following:

public class HamsterTest {
    public static void main(String[] args) {
        Hamster hamster = new Hamster();
    }
}

To compile and run, execute

$ scalac Hamster.scala
$ javac -cp path/to/scala-library.jar:. HamsterTest.java
$ java -cp path/to/scala-library.jar:. HamsterTest

in the shell, and hopefully find that everything works as expected. However, if we want to add a weight field to our Hamster class (“class Hamster(var weight: Int)”), things start to break.

Hamster hamster = new Hamster(42);
// This works (note: "weight()", not "weight"):
System.out.println("The hamster weighs " + hamster.weight());
// But if our hamster then goes on a diet, we're out of luck:
hamster.setWeight(10); // doesn't work -- no JavaBean-style setter is generated
hamster.weight(10);    // doesn't work either -- Scala names the setter weight_$eq

We can solve this problem with the handy scala.reflect.BeanProperty annotation, which adds getWeight and setWeight methods (we could also add them explicitly):

import scala.reflect.BeanProperty

class Hamster(@BeanProperty var weight: Int)

Then, from Java:

Hamster hamster = new Hamster(42);
hamster.setWeight(hamster.getWeight() - 32);

What about using Scala traits within Java? Traits aren’t quite interfaces, but they aren’t quite abstract classes either, so they’re a little tricky. A basic Scala trait, such as

trait Mammal {
    def beHairy
}

can be used just as an interface. But if we add another trait, Rodent,

trait Rodent extends Mammal {
    def beHairy {
        println("I have fur!")
    }
    def gnaw
}

and attempt to extend it

public class Rat extends Rodent {
    public void gnaw() {
        System.out.println("gnawsome");
    }
}

we are met with the compile-time error “no interface expected here”, even though Rodent implements beHairy(). To see why we get this message, whip out javap, the Java disassembler bundled with every JDK (javap is an essential tool for deciphering strange Scala behavior).

$ javap Rodent
Compiled from "Rodent.scala"
public interface Rodent extends Mammal,scala.ScalaObject{
    public abstract void beHairy();
    public abstract void gnaw();
}

Thus, even though Rodent contains a method implementation, it still compiles to an interface. Some compile-time magic allows Scala classes to use these traits. The easiest fix is to add something like

abstract class RodentJava extends Rodent

for each trait that implements methods, and have Java classes extend the abstract class rather than the trait. This has the advantage of being dead simple, but the disadvantages of requiring the maintenance of a separate Java-facing API and of sacrificing the ability to mix in multiple traits as you would in Scala (something Java cannot express anyway, since it lacks multiple inheritance). A single abstract Scala class for each set of mix-ins you expect Java to use would work, but that approach risks a combinatorial explosion.
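
For example, assuming Rodent and RodentJava above have been compiled with scalac and scala-library.jar is on the classpath, a Java class can then extend the abstract class directly (a sketch):

public class Rat extends RodentJava {
    public void gnaw() {
        System.out.println("gnawsome");
    }

    public static void main(String[] args) {
        Rat rat = new Rat();
        rat.beHairy(); // prints "I have fur!" -- the implementation mixed in from Rodent
        rat.gnaw();    // prints "gnawsome"
    }
}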

Traits are just one example of a Scala feature that lacks a direct Java equivalent. First-class functions, anonymous functions, Scala’s greater number of access level modifiers, pattern matching, tuples, implicits, and operator overloading are all either non-trivial or impossible to use in Java. In general, to be useful to Java programmers, a library that uses any Scala-specific features must offer a Java-oriented API (of greater complexity than the examples seen here, but involving similar workarounds). Both Apache Kafka [3] and Twitter’s Finagle [4], which I’ve worked with in my summer at Knewton, take this approach; while it’s possible to write Scala code that can be used seamlessly from Java, to do so is to give up the expressiveness of idiomatic Scala.

Whether, as some suggest, Scala will be a successor to Java, or merely a fad, its compatibility with Java is one of its greatest strengths. Combining Java and Scala is nowhere near the undertaking that combining two unrelated languages would be, so the cost of experimenting with Scala as a part of an existing Java architecture is minimal. Given the potential benefits and low risk, it makes sense to consider a total or partial move to Scala as part of a JVM-based ecosystem.

Web-Based, Real-Time Data Monitoring

This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.

THE PROBLEM

How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?

These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.

At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.

For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.

CONSIDERATIONS

With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:

Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to log in at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.

Responsiveness: How fast can we refresh the data?

Avoiding redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.

Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?

Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.

Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.

IMPLEMENTATION

Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.

This resulted in three strategies for obtaining the data:

Auto-refresh: We query for updates every hour.

On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.

Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.

As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.

First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.

Second, speed. Drawing information from a local cache is faster than gathering it anew, and we were even able to cache significant portions of the web pages themselves.

Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. The nature of these internal back-end references meant that querying on demand for, say, information about one course or user would require querying in sequence rather than in parallel.
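
To make the caching strategy concrete, here is a minimal sketch of the approach (the class, the PlatformClient wrapper, and its fetch method are illustrative, not the actual dashboard code): a scheduled task refreshes everything we track, reads come from an in-memory cache, and any key can be refreshed on demand.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StatsCache {

    // Hypothetical wrapper around the platform's query APIs.
    public interface PlatformClient {
        Object fetch(String key);
    }

    private final Map<String, Object> cache = new ConcurrentHashMap<String, Object>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final PlatformClient client;

    public StatsCache(PlatformClient client, long refreshIntervalMinutes) {
        this.client = client;
        // Auto-refresh: re-query everything we are tracking on a fixed schedule.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                for (String key : cache.keySet()) {
                    refresh(key);
                }
            }
        }, refreshIntervalMinutes, refreshIntervalMinutes, TimeUnit.MINUTES);
    }

    // On-demand refresh: force a fresh query for one key.
    public void refresh(String key) {
        cache.put(key, client.fetch(key));
    }

    // Reads are served from the cache; fetch lazily the first time a key is requested.
    public Object get(String key) {
        if (!cache.containsKey(key)) {
            refresh(key);
        }
        return cache.get(key);
    }
}

The daily snapshot for stats over time would be a similar scheduled task, except that it would write summary numbers to permanent storage rather than to the in-memory cache.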

TECHNOLOGIES

Given that most of the Knewton platform is written in Java, and the non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward. Admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users while parallelizing refreshes.

The webpage itself was served by an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and have run through a cycle of configurations, I would likely have given this choice more consideration, either by looking at different server technologies or at different configuration methods. Especially compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLs for both web pages and resource files took notably more effort than expected with this configuration style. The challenge was compounded by the use of JSP (JavaServer Pages: Java-embedded HTML) files, which handled the UI and CSS better than the Java-generated HTML in the first running version of the dashboard.
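
For reference, here is a minimal sketch of an embedded Jetty setup along these lines (the port, the path, and the placeholder servlet are illustrative; the real dashboard wires in JSP pages and the cached data):

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;

public class DashboardServer {

    // Placeholder servlet; the real dashboard renders tables from the cache.
    public static class DashboardServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            resp.setContentType("text/html");
            resp.getWriter().println("<h1>Dashboard</h1>");
        }
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);
        ServletContextHandler context = new ServletContextHandler(ServletContextHandler.SESSIONS);
        context.setContextPath("/");
        context.addServlet(new ServletHolder(new DashboardServlet()), "/");
        server.setHandler(context);
        server.start();
        server.join();
    }
}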

The interaction within the system can be represented in the diagram below:

PERFORMANCE OPTIMIZATIONS

While testing functionality, I ran the code on a local machine against a very small data set (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything ran and that we were gathering the expected data. Consequently, in our first run against a larger data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).

To improve performance, we took a few steps:

Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance as the other changes, they definitely reduced the memory footprint.

Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.

Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.

These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.
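
As an illustration of the delta-refresh idea, here is a sketch of a poller that only asks for records changed since its last successful run (the CourseStats type, the client interface, and its "updated since" parameter are hypothetical; as noted above, not every platform API we queried supported this):

import java.util.List;

public class CourseStatsPoller {

    public static class CourseStats { /* per-course counts, timestamps, etc. */ }

    // Hypothetical client exposing an incremental query.
    public interface PlatformClient {
        List<CourseStats> fetchCourseStatsUpdatedSince(long sinceMillis);
    }

    private final PlatformClient client;
    private long lastRefreshMillis = 0L;

    public CourseStatsPoller(PlatformClient client) {
        this.client = client;
    }

    public List<CourseStats> poll() {
        long now = System.currentTimeMillis();
        // Only the records updated since the previous poll come over the wire.
        List<CourseStats> updated = client.fetchCourseStatsUpdatedSince(lastRefreshMillis);
        lastRefreshMillis = now;
        return updated;
    }
}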

While the server sent out data in HTML tables, data manipulation was left to JavaScript on the client side. I started with the TableFilter library (tablefilter.free.fr); however, once we ran the dashboard against more realistically sized datasets, we hit the ceiling of that library’s ability to sort and search large tables. After being pointed in a new direction by a fellow Knerd, we switched to the DataTables JavaScript library, which, in addition to having much better performance, also offered a larger variety of plugins and extensions. While JavaScript will work for the near term, and using pre-existing libraries allowed us to get off the ground quickly, processing of huge datasets will ultimately need to move to the back end as the number of Knewton users grows.

All in all, I was able to get the dashboard up and running in production just a few minutes before the final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments for several hundred thousand students across the country!