Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the extra hands can create more headache than help: even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos, each of either a kangaroo or a kitten, but you do not know which photo depicts what. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining whether a photo shows a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question j with difficulty \beta_j and student i with ability \theta_i, the probability that student i answers question j correctly, x_{ij} = 1, is

p(x_{ij} = 1 \mid \theta_i, \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}}
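
If it helps to see the formula in code, here is a minimal sketch in plain Python (an illustration, not Knewton's production implementation):

import math

def p_correct(theta_i, beta_j):
    """1PL (Rasch) probability that a student with ability theta_i
    answers a question with difficulty beta_j correctly."""
    return 1.0 / (1.0 + math.exp(-(theta_i - beta_j)))

# Example: a strong student on an easy question
print(p_correct(theta_i=2.0, beta_j=-1.0))  # ~0.95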

For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include the labels L, a vector of abilities \theta, and a vector of difficulties \beta. To make sure we're all on the same page: the labels L represent all of the labels given by our labelers, and not every labeler needs to label every photo. Each ability \theta_i can range from negative infinity to positive infinity; the greater the ability, the more skilled the labeler. Our difficulties range from zero to infinity, where the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.

Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

How does all of this tell us whether the photo is a kangaroo or a kitten? For a specific photo j, we can derive how likely the photo is to depict either adorable animal. That is, the posterior probability of the correct label z_j for photo j tells us how likely the photo is to depict each animal.

Because we know that the photo contains one of the two animals, we can designate kangaroo as 0 and kitten as 1. The posterior probability then ranges from 0 to 1, indicating how likely the photo is to contain each animal. If we assume that the correct label z_j is independent of both the abilities \theta and the difficulties \beta, the probability simplifies dramatically:

p(z_j \mid L, \theta, \beta) \propto p(z_j) \prod_i p(l_{ij} \mid z_j, \theta_i, \beta_j)

The posterior probability now consists of two components: a prior belief and an IRT-based probability. Our first term p(z_j) captures our prior knowledge about how many of the photos contain each animal. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use that term to encode our prior belief in the model. The second term uses our 1PL IRT probability to express how likely the labeler was to give the observed label (i.e., answer a test question) conditioned on the correct answer, the labeler's ability, and the difficulty of the question.
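
As a minimal sketch of that computation (reusing the p_correct function above; the data layout, with each photo's labels stored as a dict from labeler index to 0/1 label, is an assumption for illustration only):

def posterior_kitten(labels_j, theta, beta_j, prior_kitten=0.5):
    """Posterior probability that photo j is a kitten (z_j = 1), given the
    labels it received, the labeler abilities, its difficulty, and a prior.
    labels_j: dict mapping labeler index i -> observed label (0 or 1)."""
    likelihood = {0: 1.0 - prior_kitten, 1: prior_kitten}
    for z in (0, 1):
        for i, label in labels_j.items():
            # Probability this labeler answers "correctly" given true label z
            p_right = p_correct(theta[i], beta_j)
            likelihood[z] *= p_right if label == z else (1.0 - p_right)
    return likelihood[1] / (likelihood[0] + likelihood[1])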

Expectation-Maximization

Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to find a way to make our posterior probabilities as large as possible—denoting how certain we are overall of our guesses of the correct labels—by adjusting our parameters \theta and \beta. More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta) with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, the expected joint log-likelihood Q takes the form

Q(\theta, \beta) = \sum_j \sum_{z \in \{0,1\}} p(z_j = z \mid L, \theta^{\text{old}}, \beta^{\text{old}}) \left[ \ln p(z_j = z) + \sum_i \ln p(l_{ij} \mid z_j = z, \theta_i, \beta_j) \right]

Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.
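
Putting the pieces together, a toy end-to-end sketch might look like the following. It reuses p_correct and posterior_kitten from above, and the finite-difference update stands in for a proper gradient ascent routine; none of this is Knewton's actual implementation.

import math

def em_kangaroo_kitten(labels, n_labelers, n_photos, n_iters=50, lr=0.05, eps=1e-4):
    """Toy EM loop for the labeling model. labels maps photo index j to a
    dict of {labeler index i: observed label (0 or 1)}."""
    theta = [0.0] * n_labelers          # labeler abilities, initialized to 0
    beta = [1.0] * n_photos             # photo difficulties, kept positive
    post = {}                           # posterior P(z_j = kitten) per photo

    def q(th, be):
        """Expected joint log-likelihood under the current posteriors."""
        total = 0.0
        for j, labels_j in labels.items():
            for z, weight in ((1, post[j]), (0, 1.0 - post[j])):
                ll = math.log(0.5)      # uniform prior over kangaroo/kitten
                for i, label in labels_j.items():
                    p = 1.0 / (1.0 + math.exp(-(th[i] - be[j])))
                    ll += math.log(p if label == z else 1.0 - p)
                total += weight * ll
        return total

    for _ in range(n_iters):
        # E-step: recompute posteriors with theta and beta held fixed
        for j, labels_j in labels.items():
            post[j] = posterior_kitten(labels_j, theta, beta[j])
        # M-step: nudge each parameter uphill on Q (finite-difference gradient)
        for params in (theta, beta):
            for k in range(len(params)):
                params[k] += eps
                up = q(theta, beta)
                params[k] -= 2 * eps
                down = q(theta, beta)
                params[k] += eps                      # restore original value
                params[k] += lr * (up - down) / (2 * eps)
        beta = [max(b, 1e-3) for b in beta]           # difficulties stay positive
    return post, theta, beta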

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach outscores the majority approach by nearly 5% initially before the two converge later. Additionally, as we increase the number of voters, the accuracy increases. Success!

While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

References
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and J. Movellan. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.

Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach*

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition include adding single-digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach*

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x^2 + _x + _.

The underlines are "holes" that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers 1 through 10, leading to 10^3 = 1,000 possible questions. The instructor may wish to further restrict the template to permit only quadratics with real, distinct roots.

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessments generator engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The template below is used to generate questions of the form _ + _ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9
    sum=x+y
  }
}

The question and answer are simply strings with variable names denoting the "holes." Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with two decimal places, or even the set of Fibonacci numbers. Derived variables are defined by mathematical expressions.

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = -b + sqrt(discriminant)/(2*a)
    root2 = -b - sqrt(discriminant)/(2*a)
  }
}

Here, we are generating each coefficient (a, b, c) from the range [-10, 10]. However, the below table illustrates an issue with this template. For certain coefficients, the solutions can be integral, fractions, irrational numbers, or even imaginary numbers.

(a, b, c)       Solutions
(1, -10, 16)    2, 8
(-4, 7, -3)     0.75, 1
(3, 6, 2)       -1.577…, -0.422…
(1, 2, 1)       -1
(1, 4, 8)       -2 + 2i, -2 - 2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = -b + sqrt(discriminant)/(2*a)
    root2 = -b - sqrt(discriminant)/(2*a)
  }
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1
  }
}

In general, constraints are useful for narrowing the skill set covered by the template, and to ensure that instantiations of the template are sufficiently varied. In this example, requiring unique, integer roots is used to control difficulty level, while requiring gcd(a, b, c) = 1 ensures that no two questions will be multiples of one another.

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <m> minutes traveling <r> miles/hour?"
  answer is "<dist>miles"
  variables {
    time is an integer between 30 and 90 divisible by 10
    rate is an integer between 40 and 65 divisible by 5
    dist = rate*time/60
    wrong1 = rate*time
    wrong2 = rate/time
    wrong3 = rate/time/60
  }
  wrong answers {
    "<wrong1>miles"
    "<wrong2>miles"
    "<wrong3>miles"
  }
}

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “rate/time/60” because she doesn’t know the correct distance formula, while another student might choose “rate*time” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from the sample sets and using the formulas to compute the derived variables

  2. If the variables satisfy all of the constraints, add it to the list of questions.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a large enough fraction of the sample space (the set of all possible questions, i.e. the Cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can handle 100,000 iterations, roughly 1 in 1,000 candidate questions needs to be valid. This isn't too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.
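
To make the loop concrete, here is a minimal sketch in Python using the quadratic-roots template as an example. The actual engine was written in Scala against the DSL above; the function and field names here are illustrative only.

import math
import random

def generate_quadratic_questions(n_questions=100, max_iters=100_000):
    """Rejection-sample template instantiations until enough pass the constraints."""
    questions = []
    for _ in range(max_iters):
        if len(questions) >= n_questions:
            break
        # Generated variables, drawn uniformly at random from their sample sets
        a, b, c = (random.randint(-10, 10) for _ in range(3))
        if a == 0:
            continue                          # skip degenerate (non-quadratic) draws
        # Derived variables
        disc = b * b - 4 * a * c
        if disc < 0:
            continue                          # complex roots fail the constraints below
        root1 = (-b + math.sqrt(disc)) / (2 * a)
        root2 = (-b - math.sqrt(disc)) / (2 * a)
        # Constraints: integer, distinct roots; gcd(a, b, c) = 1
        if root1 == root2 or not (root1.is_integer() and root2.is_integer()):
            continue
        if math.gcd(math.gcd(abs(a), abs(b)), abs(c)) != 1:
            continue
        questions.append({"question": f"{a}x^2 + {b}x + {c} = 0. Solve for x.",
                          "answer": f"x = {root1:g}, {root2:g}"})
    return questions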

It is very difficult to come up with a more efficient approach. For some problems, algorithms do exist to generate valid solutions directly (see Euclid's formula for Pythagorean triples), but for others it is mathematically impossible (see Hilbert's tenth problem). In many cases, introducing heuristics can improve on the random algorithm. For instance, it may be possible to identify a large chunk of the sample space that leads to solutions that are too large, non-integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala’s powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like “sqrt(x^3 + 5)” into my internal data model. While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map with a ParMap, and appending “.par” to the main program loop.

Here is a screenshot of the prototype in action.

*For examples of solution-oriented approaches in the literature, see https://act.org/research/researchers/reports/pdf/ACT_RR93-09.pdf (survey) and http://research.microsoft.com/en-us/um/people/sumitg/pubs/chi13.pdf (grade school math).

*For examples of template-based approaches in the literature, see http://www.eecs.berkeley.edu/~dsadigh/WESE12.pdf (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means To Measure in the Xitsonga Language

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic (in this case Engagement), along with data from different sources within our products, and engineer tools to calculate it. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, the measure needs to be analyzed and validated on a larger scale than the one initially used to develop it, which necessitates a feedback loop between the Data Analysis and Data Infrastructure components.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends "actively engaged" on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: say a student is busy with a homework assignment. After they have submitted their work, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged the student is with the learning platform. There are many ways of quantifying this: time spent on the system, number of logins per week, time spent on recommended material, and so on. The Analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that takes the student interactions and recommendations into account and calculates the engagement analytic. Further, this is an example of an analytic that needs to be inferred. By "infer" we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users' intentions: The user's intention is important, as it can make an analytic more or less useful. For example, a student might not do the next recommended action, not because they thought it was irrelevant, but because they had to move on to another task (say, another homework assignment with an upcoming deadline). In that situation we would have to treat this as a data point that is not useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic one way as opposed to another. It might be more efficient to infer an analytic from a readily available dataset than from a much richer but harder-to-obtain dataset that only provides a small boost in accuracy at a large cost in resources.

Computing Engagement

The overall view of the pipeline created for Engagement is shown in the figure below. The pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data the student learning/task sessions are inferred, and from those sessions we calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.
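
As a toy illustration of the "infer sessions, then compute the analytic" steps: the event format, the 30-minute gap threshold, and this particular definition of engaged time are assumptions made for the sketch, not Knewton's actual pipeline.

from datetime import timedelta

def infer_sessions(event_times, gap=timedelta(minutes=30)):
    """Cluster a student's interaction timestamps into sessions:
    consecutive events closer together than `gap` belong to the same session."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def engaged_minutes(event_times):
    """A crude engagement analytic: total time spanned by the inferred sessions."""
    return sum((s[-1] - s[0]).total_seconds() / 60 for s in infer_sessions(event_times))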


After the initial pipeline is engineered, there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected: we need to know whether, on sample data, the pipeline produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Generative Models for Description and Prediction

This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.

In this post, I will discuss an approach to conveying information about student behaviors using generative models.

A generative model is a description of how to generate data that is similar to the observed data.

Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.

Here are some examples of observations of correctness on practice questions from when each student started a course until today.

StudentA: “1111111110”

StudentB: “100011110111100111111011100110011”

StudentC: “111011110111111111111111111111111”

We could specify different generative models for each student. One possible generative model for student A could be:

Model #1 (Student A)

"print nine 1's and then one 0".

If we wanted to describe student B similarly the description would be:

Model #1 (Student B)

"print one 1, then three 0's, then four 1's, then one 0, …".

Models like Model #1 exactly preserve all of the information present in the student data, but in doing so they don't reduce the complexity of the observations at all. Often, in communicating information about student behaviors, we seek to preserve the important information while reducing the complexity so we can communicate concisely.

A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.

Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.

Model #2:

"Flip a weighted coin that lands on heads with probability w. If it's heads, print 1, otherwise print 0".

In reporting to a teacher, I can report that the model that best fits the student's history is w=0.9.

That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).

While the model parameters may be meaningful themselves (in this case they are the student's overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w≈0.65. If I wanted to find the student most similar to StudentC (w≈0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (e.g. Levenshtein distance, which would yield StudentB as the closest student to StudentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.

In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*
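
Fitting Model #2 and comparing students in the resulting parameter space takes only a few lines; this is a sketch of the idea rather than production analytics code:

def fit_w(observations: str) -> float:
    """Maximum-likelihood estimate of w for Model #2: the sample mean of the 1's."""
    return observations.count("1") / len(observations)

students = {
    "StudentA": "1111111110",
    "StudentB": "100011110111100111111011100110011",
    "StudentC": "111011110111111111111111111111111",
}
w = {name: fit_w(obs) for name, obs in students.items()}   # e.g. StudentA -> 0.9

# Most similar student to StudentC under this model: smallest difference in w
closest = min((n for n in w if n != "StudentC"),
              key=lambda n: abs(w[n] - w["StudentC"]))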

In choosing Model #2 to convey my information, I’ve consciously made some decisions about what qualities of the data are important or not. For example this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but had its order reversed, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.

Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.
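
Continuing the hourly-interaction example, the prediction is just the fitted model run forward. The sketch below assumes a 168-hour horizon and the Bernoulli independence baked into Model #2; the choice of w = 0.1 is arbitrary:

import random

def simulate_hours_next_week(w: float, n_simulations: int = 10_000, horizon: int = 168):
    """Run Model #2 forward for a week of hourly observations and return a
    Monte Carlo sample of 'hours spent on the system next week'."""
    return [sum(random.random() < w for _ in range(horizon)) for _ in range(n_simulations)]

samples = sorted(simulate_hours_next_week(w=0.1))
mean_hours = sum(samples) / len(samples)
# A rough central 95% interval for the prediction
low, high = samples[len(samples) // 40], samples[-len(samples) // 40]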

In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”

In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”

*For example, a common generative model used for summarizing text documents, the Latent Dirichlet Allocation (LDA) model, is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.

Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.

On Comparable Noise and Satisfiable Sound

The Motivation

Working for a startup can be intense. Many in this great city succumb to the pressures of work life and relinquish their purely intellectual pursuits. Here at Knewton, however, our love affairs with art, music, literature, formal systems — love affairs that, for the most part, began within the very walls of the institutions that we hope to disrupt — these passions inspire our work and our camaraderie with each other. And practically speaking, interview questions at New York’s hottest startups can lean heavily on what was learned in the undergraduate CS lecture hall. The need to keep these skills sharp is clearly still relevant.

The aim here is to kick off a series of blog posts that pair substantial results in CS or formal systems with significant works of classical music. The series is limited to classical works explicitly — there is plenty of headphones-on time at work to listen to The Postal Service.

The Works

Listen to this: Gondwana by Tristan Murail

Read this: On Computable Numbers by Alan Turing

Alan Turing’s On Computable Numbers, with an Application to the Entscheidungsproblem is a seminal paper. Anyone taking a CS undergraduate core will encounter the notion of computability and the halting problem. Further from the ivory tower, it is not difficult to find people who are near-giddy about Turing Completeness:

Breaking news: HTML + CSS3 is Turing Complete,

People looking for help developing a Turing machine in Microsoft Excel,

A Turing machine built using Legos,

Brainfuck, a language described on its website as “an eight instruction Turing-Complete Programming Language”

Turing’s name has become quite literally a “formal” stamp of approval in the computing world. Turing’s seminal paper defined a set of abstractions that formed the basis for what we now call “Turing Completeness,” a requirement of all first-class computing languages. What exactly did Turing say in that paper? And what does it have to do with French spectral music?

Turing’s Paper

Beyond a few feeble mechanical attempts, computers did not exist in 1936, and Mr. Turing didn’t appear that excited about making one — he was interested in making Gödel’s long, German work on incompleteness comprehensible. That is, Turing wanted to intuitively construct a class of decision problems that could be proven undecidable. Turing could not tackle this without defining a notion of “computable.” His definition leaned heavily on a few notions that we would recognize today as common, but required a little more formal attention-to-detail back then.

Turing describes memory:

We may compare a man in the process of computing a real number to a machine which is only capable of a finite number of conditions q1, q2, …. qI; which will be called “m-configurations”

and the idea of output:

… the subsequence of the symbols printed by it which are of the first kind will be called the sequence computed by the machine. The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

Academic language aside, the machine Turing is defining is simple. Turing calls this machine an “‘automatic-machine’ (or alpha-machine).” The machine has the following properties:

a tape divided into squares

symbols that are on the squares

the scanned symbol (the rth symbol). Modern texts will likely refer to this as the symbol the ‘head’ is pointing to.

configuration: the current m-configuration and the current scanned symbol (qn, G(r))

But this paper is not without its difficulties, especially if it is approached today.

One initial difficulty: having completed standard undergraduate classes in formal computation theory, I found Turing’s initial choice of terminology — “circle-free” and “circular” — extremely confusing to my modern eyes. Most undergraduates approach Turing’s ideas in their coursework from the “halting problem” angle. What is initially confusing here is the following:

CIRCLE-FREE != HALTS

How can this be? Turing quickly skirts over the definition of a circle-free machine early on in the paper when he describes it as the following:

If a computing machine never writes down more than a finite number of symbols of the first kind, it will be called circular. Otherwise it is said to be circle-free. A machine will be circular if it reaches a configuration from which there is no possible move, or if it goes on moving, and possibly printing symbols of the second kind, but cannot print any more symbols of the first kind. (p. 233)

Finite. This should be the same as a machine that “halts,” right?

No. Turing’s machines are built to write down numbers. These numbers are (somewhat sneakily, in my opinion) implied to be repeating binary decimals on page 232:

The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

So we’re essentially talking about endlessly repeating floats — the kind of numbers that you get when you type in [1] [/] [3] on a calculator, except in binary. Numbers with bars over the top of the last digit: 0.33333333… for example (0.010101010… in binary). In Turing’s paper, the machine that prints 0.010101010… is “circle-free.” This means it encounters no terminal difficulty in computation — it can keep printing out the endless decimal representation without pausing indefinitely to calculate. By this definition, “finite” numbers such as 4, 5, and 6 would be circle-free numbers, because the machine could be constructed to just print out a sequence of zeroes after the decimal point: 6.000000000….

“Halting” really has no place in his discussion: if we took a machine that “did not halt” by the “halting problem definition”, it would print out, say, 6.000 and then wait for an indefinite amount of time before printing out the next zero.

Understanding the important concept of a circle-free machine is critical, even if your only goal in reading the Turing is to see Gödel’s diagonalization “done simply” (see the argument on page 247 of the article, where he argues that the machine H is circular by suggesting an infinite loop).

Why the Murail?

The Murail is a significant representation of the spectral music genre which has significant parallels with the work of Turing.

The Murail, like the Turing, uses an intuitive language, completely divorced from the method-as-music ideal of serialism (about serial music). Gondwana is based entirely off of the first sounds you hear in the piece — it is the subsequent deconstruction that leads the listener down a path that is nearly always defined momentarily, in that each subsequent moment of sound in the piece is very often based off of the prior moment. There are few moments of discontinuity.

The work starts off very simply. The piece begins with seven to eight moments of repeated sound (see 0:12, 0:30, 0:46, 1:01, 1:15, 1:28, 1:38, 1:48, and then MAYBE [1:55, 2:01, 2:07, 2:13-2:40] in this recording), as easily understood derivations from a simple sonic idea.

Indeed, these events are Murail’s own “appeal to intuition,” much like Turing’s. There is no recognition of pattern possible without repetition. With the seven chime-strike chords, Murail is handing us a ball of yarn — we will need it to find our way home from the complicated, thorny walk in the woods that is the rest of the piece. There will be no resounding return, no recapitulation to save the day here.

Beyond the opening, the Murail begins to dig deep into the essence of sound. At roughly eight minutes into the work, a hissing begins to become a main sonic feature. The connection to the beginning is simple — there is a brief, percussive sound at the beginning of the piece with a few strings playing, perhaps a maraca — there are lots of high, hiss-like frequencies that comprise the sound of a maraca. But it brings to mind an important question — a question that is at the heart of spectral music: what kind of music is happening in the parts of sound waves that we DON’T hear? What about all the extremely high frequencies that dogs can hear? Above those, even? They exist. They have to. A sound wave can be encoded by a number. Large numbers exist. What if we slowed the noise, say, of water molecules at room temperature bouncing around inside a mason jar? Anyone who has been distracted by the sound of a tea kettle on the verge of boiling already knows that interesting sonic things can happen with water molecules on the surface of a boiling hot-plate. But what about beyond the ear? What types of large-number-based frequencies exist at the sonic extremes of our world?

Computable vs. Uncomputable Sounds?

There is a strong argument that much of “Western Harmony” can be boiled down into small, rational-number relationships between different frequencies; numerical relationships, in fact, that are easily computable by a Turing machine (with a reasonable “skeleton table,” even!). A major triad, for example, is little more than the three ratios:

1, 5/4, and 3/2 (pitches ‘do’, ‘mi’ and ‘so’, if you’ve ever watched “The Sound of Music”)

By the standards of Murail (and Turing, for that matter), these are small numbers, and the human ear is quite adept at performing the computation of recognizing them and placing them in reasonable cultural and musical contexts.
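
For instance, the frequencies of a just-intoned major triad are a one-line computation from those ratios (the 440 Hz reference pitch is simply a convenient choice, not something from the piece):

ratios = {"do": 1, "mi": 5 / 4, "so": 3 / 2}
root_hz = 440.0  # reference pitch
triad = {name: root_hz * r for name, r in ratios.items()}  # {'do': 440.0, 'mi': 550.0, 'so': 660.0}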

The human ear, from a perspective of what it can understand or “compute” in this way, is quite good. We like minor chords, diminished chords — heck, we’ll even listen to Coltrane. Serial music is tougher. In general, music that doesn’t “sound good” to the western ear because of dissonance, etc. likely involves sound frequency relationships that make use of larger numbers. A minor chord, for example, requires use of the 6th and 8th partials. Its ratios, depending on how you define the third of the chord, require much larger numbers than that of a major chord. The numbers involved with frequencies heard in the Murail then, in all their roaring, bending, and hissing glory, beg a final question:

What is the limit? What is uncomputable for the human ear?

Note:

Parallels with music aside, readers not familiar with Turing’s life and the circumstances of his death should read the Turing Wikipedia article. Oppression and intolerance are real, and Turing suffered greatly at the hands of the British government.

from: www.nateburke.com

Why “N choose K”?

The name of the Knewton tech blog is “N choose K”. If it’s been a long time since you set foot in a math class, this may not have any meaning to you. However, it’s an interesting concept, and really not all that difficult. In this post, I’ll go over what the term means in simple English and discuss how we use it here at Knewton.

Imagine you’re visiting the newest, trendiest lounge in New York. Like many such lounges, this one specializes in producing excellent cocktails from before the Prohibition era. You look over the menu and see that they offer one drink for each type of alcohol: champagne, gin, rum, tequila, and vodka. You decide that you want to try all of them, in pairs rather than one at a time, to see how well they go with each other. You want to know how many different pairs (one for each hand) you can make with these 5 drinks.

One way of doing this would be to count each of them. To do this, we can label each of the drinks by its main ingredient: champagne (C), gin (G), rum (R), tequila (T), and vodka (V),

and systematically count each of the pairs (note that order of ingredients is not important): CG, CR, CT, CV, GR, GT, GV, RT, RV, TV,

for a total of 10 different pairs of drinks. Mathematically, this is known as n choose k, and is written as \binom{n}{k}, with n being the total number of items and k being the number to choose each time. Here you're asking "how many ways can I choose 2 items out of 5," which we know from our count is \binom{5}{2} = 10.

It's not too difficult to count each of these by hand. But now imagine that you're an octopus at a different bar, which has 20 drinks, and you want to find out how many ways you can choose 8 drinks out of 20, i.e. \binom{20}{8}. Clearly, counting is not an option here.

The great thing about n choose k is that there's a simple formula to figure it out: you just plug in your values for n and k, and out comes the answer. Here's the formula:

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
which is a lot less complicated than it seems. The exclamation point is known as a factorial, and you can find any factorial by multiplying together all the numbers from that number down to 1. So, 3! is simply 3 \times 2 \times 1 = 6.

Let's try it on our original problem: how many ways can we choose 2 drinks out of 5?

\binom{5}{2} = \frac{5!}{2!\,3!} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(3 \cdot 2 \cdot 1)} = \frac{5 \cdot 4}{2 \cdot 1} = 10

Note how we can make the calculations easier by canceling the top and bottom of the fraction when they’re exactly the same.

Now with \binom{20}{8}:

\binom{20}{8} = \frac{20!}{8!\,12!} = 125{,}970

It’s a good thing octopi have a strong tolerance for liquor.
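
If you'd rather not push the factorials around by hand, Python's standard library will check the arithmetic for you (math.comb requires Python 3.8 or later):

from math import comb

print(comb(5, 2))   # 10 pairs of drinks
print(comb(20, 8))  # 125970 ways for the octopus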

Of course, “n choose k” pertains to more than just mixing drinks at bars. In fact, you can see this in all sorts of applications, and often in exactly the sort of work we do here at Knewton. For example, let’s say we have a grid-shaped graph with edges only between adjacent nodes, with a node A in one corner and a node Z in the opposite corner:


The question we need to answer is: how many shortest paths are there from A to Z? (These are called lattice paths.) There are algorithms that allow you to determine this exactly, but they require polynomial time to complete. Instead, let's solve an easier problem: what is the upper bound on the number of shortest paths from A to Z? That is, what would the graph look like if the upper bound were the same as the actual number?

Well, a fully-connected graph would fit this scenario:


We can count the length of a shortest path from A to Z by counting its steps: 4 to the right and 2 down, for a total of 6.

Clearly we cannot have a shorter path than this, and so all shortest paths must have exactly 6 steps. The only independent variable then is the choice of which steps we take to the right; the dependent variable is which steps we take down (it’s dependent because it is completely determined by the other variables). So in the example above:

Independent: steps 1, 2, 3, 4 are to the right
Dependent: steps 5, 6 are to the bottom (whatever is left)

Starting to sound familiar? It's the same as "n choose k"! Given the total number of steps, how many ways can we choose to go right? And so we get:

\binom{6}{4} = \frac{6!}{4!\,2!} = 15

More generally, for a j x k lattice grid, we can use the formula:

\binom{(j-1)+(k-1)}{j-1}

where j is one of the dimensions of the grid (counted in nodes) and k is the other; which one you pick doesn't matter, as you'll see below.

Be aware, you might see this formula as \binom{m+n}{m}. This is correct if you count the starting point A as (0, 0) and the ending point Z as (m, n), which subtracts one in each direction.

You might be interested in why we chose the steps to the right rather than the steps down; in other words, can we switch the independent variable? The answer is yes, it doesn't matter, and it's pretty neat, because as you can calculate:

\binom{6}{4} = \binom{6}{2} = 15

In fact, if you calculate a number of these and group them into a triangle shape, it results in a pattern known as Pascal’s triangle:

To find \binom{5}{2}, for example, count down to row five and across to entry two (counting from zero in both directions), and you'll find that \binom{5}{2} = 10. As you can see, the left side of the triangle has the exact same values as the right side, which is why \binom{5}{2} = \binom{5}{3}, and more generally \binom{n}{k} = \binom{n}{n-k}.

These values are also known as binomial coefficients; if you’re interested in why, you can easily find more information on the web.

There are a couple of neat patterns in Pascal's triangle. First, each number in the triangle is the sum of the two numbers above it (you can think of the entire triangle as surrounded by zeros if that helps); in terms of our notation, \binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}. This is pretty miraculous if you recall our original formula for determining n choose k.
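
Here's a quick way to see both facts at once: build the triangle using only the "sum of the two numbers above" rule and check it against the formula. A small illustrative script:

from math import comb

def pascal_rows(n_rows):
    """Each row is built only from the previous row, via Pascal's rule."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        rows.append([1] + [prev[i] + prev[i + 1] for i in range(len(prev) - 1)] + [1])
    return rows

rows = pascal_rows(6)
assert rows[5][2] == comb(5, 2) == 10           # the entry matches n choose k
assert all(row == row[::-1] for row in rows)    # and each row is symmetric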

The second pattern has to do with its shape: if you take Pascal’s triangle and color only the odd numbers, you get a Sierpinski triangle, which you may be familiar with: it’s our n choose k blog logo!

Our work at Knewton allows us to play with ideas all day and turn those ideas into reality. We chose the binomial coefficient \binom{n}{k} to be the title of our tech blog because it reflects many of our core values as Knewton developers: our passion for mathematics and the language of ideas, and our commitment to intellectual endeavor.

Conclusion

In this entry we discussed what n choose k means and how you can use it to find the number of ways groups of things can be chosen. Then we found a simple formula to calculate it. We saw a practical use of the formula in finding an upper bound on the number of shortest paths in a particular graph. Finally, we looked at Pascal's Triangle and the relationships between the binomial coefficients, aka the results of multiple n choose k computations. I hope you had some fun, learned a bit about math, and have something new to discuss at your next cocktail party.

Thanks to Kevin Wilson and Solon Gordon for their help with this post.

Knewton Crab Stacker: Innovations in Systems Engineering

Creating and maintaining the infrastructure to support a 40+ person developer team and more than a million users on the world’s most powerful and rigorously adaptive learning platform is no simple task. Conventional wisdom would suggest that a ten-person team with a wide swath of specialists would be the ideal arrangement. But in this regard, as with a number of other tech team practices, Knewton is anything but conventional.

Simply put, Knewton does not have an Ops team. Instead, the knowledge and tools required for infrastructure and systems tasks are distributed throughout the team. This structure confers a number of benefits. Developers are better able to write configurable, maintainable, deployable software because they have a strong understanding of our systems infrastructure. And systems-focused engineers are better able to optimize and maintain our infrastructure because they are also contributing members of the service developer teams.

In practice, this means that all software developers at Knewton are expected to both understand and utilize infrastructure technologies. All developers may find themselves on production support and tasked with updating environment configurations or deploying new code. Likewise, systems engineers all have a solid software engineering background and are expected to write production code.

Expectations and cross-pollination are only part of the process, however. Here’s some insight into the tools we use to create an environment where development and infrastructure work hand-in-hand.

AWS for everyone!

Every Knewton developer has access to our AWS console as well as the AWS command line tools pre-installed on their machines. Eventually every developer will go through the process of deploying code to production, at which point he or she will learn the basics of the AWS toolkit as well as our deployment and configuration management systems. As a result, no developer need waste time emailing the Systems group for basic information about EC2 instances or other AWS resources.

Knewton makes use of a number of AWS technologies, from EC2, to RDS, to ElastiCache and ElasticMapReduce. Many of these products have their own CLI tools, though they are all simple enough to install and configure that at Knewton we’ve made them part of box setup for new developers. While not every engineer is a wizard with the console or CLI commands, there is enough documentation to ensure any developer working alone on a late night will not be blocked by an AWS issue (unless the block is on AWS’ end…).

We have a few in-house tools that make use of AWS’ APIs, among them k.aws and the Knewton Crab Stacker. To focus this article a bit, I’ll dive specifically into the latter, as it addresses an important problem: deploying and updating software stacks.

CloudFormation and KCS

Knewton Crab Stacker, or KCS, makes use of the AWS CloudFormation toolset. CloudFormation makes it possible to define and deploy combinations of AWS resources, from EC2 instances to load balancers. The “define” part comes in the form of JSON templates, while deployment can be done either using the AWS console or a CLI tool.

Now, CloudFormation on its own is great for just “putting a box out there.” The templates let you define a wide variety of resources and set parameters and mappings for a base level service setup and tuning.

What CloudFormation doesn’t do well is box configuration. The JSON templates don’t allow you to do much in the way of defining functions, setting conditions, or even basic math. If you try to force configuration management onto it, you end up with a lot of bash scripts and config files floating around, or worse, hardcoded into templates.

Even relegating CloudFormation to deploying boxes can be tricky. The launch command for a given stack can include a dozen or more command line arguments, such as EC2 instance parameters (size, type, etc.) and command flags.

The simplest case launch will make use of all defaults in the template and look something like this on the command line:

cfn-create-stack $StackName --template-file $Template

But if you need to use other CloudFormation functionality and override a number of parameters at launch, you’ll end up with something like this:

cfn-create-stack $StackName --template-file $Template --parameters AWSAccount=Production;InstanceType=m1.large;ClusterSize=4;ConfigTarballVersion=2.1.5;AppVersion=1.2.3;SSHKey=ProductionKey --capabilities CAPABILITY_IAM --disable-rollback

Yikes! That’s a lot to type from memory for each deploy. You’re especially going to want that last option to disable rollback as it keeps the instances from failed launches around for you to debug — essential for when you inevitably mistype a version number.

If stack launches are fairly consistent, you can mitigate the annoyance of launch commands with BASH scripts, but these will be a pain to maintain. But what if you have a number of frequently changing parameters or decisions that need to be made at launch? What if you need to work with multiple AWS accounts or validate components of your launch config so as to avoid a painful debug cycle? (Is that tarball you need in s3? Does this environment have all of my new stack’s dependencies?) Complex stacks can take ten to twenty minutes to launch. You don’t want to have to keep re-launching just because you fat-fingered the instance type.

The problem with the command above is that every parameter represents a potential point of failure. CloudFormation is only able to ensure that your template is logically consistent and the JSON is valid. It can’t know whether or not AppVersion 1.2.3 is a thing, or whether a four node cluster matches what is in the current environment, or numerous other details that can spoil an update before it begins.

This is where KCS steps in. Knewton Crab Stacker was developed by a team of Knewton engineers (including yours truly). KCS is a Python command line tool designed to make CloudFormation deployment much simpler.

The first nice thing KCS does is add the abstraction of an "environment" to our AWS accounts. It does this by simply taking the stack name parameter and prepending $EnvironmentName + "-" to it. From CloudFormation's perspective, the stack name is "Dev-UserService," but KCS understands the stack as "the UserService stack in the Dev environment."

Making use of the namespace this way greatly simplifies the task of isolating test environments from one another. It adds one more piece to launch commands, which in the simplest case look like this:

kcs stacks create $Service $Template $Environment

The difference between this and the simple CloudFormation command above is what goes on behind the scenes.

Before initiating the create, KCS checks a number of things. First, KCS makes sure that the environment has any stacks that the new service needs. If a dependency is missing, you can still force a create. Secondly, KCS ensures that any s3 resources referenced in the template or launch command actually exist. In other words, if your launch command specifies “ServiceTarballVersion=0.3.444”, KCS makes sure that said tarball is actually there.
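
As an illustration of the kind of pre-flight check described above, a boto3 sketch might look like this (the bucket and key layout are hypothetical, and this is not KCS's actual code):

import boto3
from botocore.exceptions import ClientError

def s3_artifact_exists(bucket: str, key: str) -> bool:
    """Return True if the referenced artifact (e.g. a service tarball) exists in S3."""
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False

# e.g. validate "ServiceTarballVersion=0.3.444" before creating the stack
ok = s3_artifact_exists("build-artifacts", "user-service/user-service-0.3.444.tar.gz")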

Updates are far more common than creates, and it is here where KCS really does a lot for you. Here’s a simple update command:

kcs stacks update $Service $Template $Environment

Like the create, KCS does a ton of validation on the template and environment. With the update however, KCS also runs a diff on the existing stack. Before the update actually runs, you will be shown a list of every parameter the update adds, removes, or changes. From there, you can either proceed with or cancel the update.
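
The diff itself is conceptually simple. Here is a rough boto3 sketch of how one could compute it against a live stack (again illustrative, not KCS's implementation):

import boto3

def stack_parameter_diff(stack_name: str, new_params: dict):
    """Compare a running CloudFormation stack's parameters against proposed ones,
    returning what an update would add, remove, or change."""
    cfn = boto3.client("cloudformation")
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    current = {p["ParameterKey"]: p["ParameterValue"] for p in stack.get("Parameters", [])}
    added = {k: v for k, v in new_params.items() if k not in current}
    removed = sorted(k for k in current if k not in new_params)
    changed = {k: (current[k], new_params[k]) for k in current
               if k in new_params and current[k] != new_params[k]}
    return added, removed, changed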

Before I do an update of a stack, I can also use “describe” to see what’s in the environment currently. The full command is “kcs stacks describe”, but I can shorten it using “s” and “d”, and aim it at our Dev environment like so:

kcs s d User Dev

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

SystemConfig Version: 1.1.55 User App Version: 1.1.224

i-01234567  ec2-01-23-45-678.compute-1.amazonaws.com (m1.medium)


This gives me a lot of cool info including the version of the App, some parameter information, as well as the instance ID, type, and hostname. If I want an exhaustive list of parameters I can do this:

kcs s d User Dev --detailed

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

Environment: Dev
Cluster Size: 4
SystemConfig Version: 1.1.55
Environment Class: Dev
User Version: 1.1.224
Instance Type: m1.medium
DB Password: ******
Billing Environment: Staging
ELBRegistration: Yes
AMI: ami-11223344
UserDB Address: amazonrds.us-east-1.rds.amazonaws.com
Key Name: STAGING-007
i-1234567   ec2-01-23-45-678.compute-1.amazonaws.com   (m1.medium)
12.345.678.999                             us-east-1c

These commands make it easy to run updates without knowing much about the stack, but there is an even easier method for the truly lazy:

kcs stacks interactive-update $Service $Environment

This command uses the existing stack’s template and lets you pass in a value for each parameter while showing you what is currently in the environment. It guarantees that you change exactly what you want to change and nothing else.
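
In spirit, the interactive flow is just a prompt per parameter, with the current value as the default. Here is a hypothetical sketch, not KCS’s actual implementation:

def interactive_update(current_parameters):
    """Prompt for each parameter, keeping the current value if nothing is entered."""
    new_parameters = {}
    for name, current_value in sorted(current_parameters.items()):
        entered = input("{0} [{1}]: ".format(name, current_value)).strip()
        new_parameters[name] = entered or current_value
    return new_parameters

new_params = interactive_update({"ClusterSize": "4", "InstanceType": "m1.medium"})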

When the update actually runs, KCS adds a few layers of insurance that CloudFormation does not. For one, it spins up brand new instances, runs their launch config, and then waits for success signals before tearing down the old stack. This allows you to set up whatever level of functionality and performance testing you want as a condition of a successful update. If part of the launch config fails or does something unexpected, KCS rolls everything back.
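
The success-signal and rollback mechanics live inside CloudFormation and KCS itself, but a deploy tool watching an update for success or rollback might look roughly like this boto3 sketch (illustrative only, with a stack name taken from the examples above):

import time
import boto3

def wait_for_update(stack_name, poll_seconds=30):
    """Poll a stack until CloudFormation reports the update succeeded or rolled back."""
    cfn = boto3.client("cloudformation")
    while True:
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if status == "UPDATE_COMPLETE":
            return True
        if status in ("UPDATE_FAILED", "UPDATE_ROLLBACK_COMPLETE", "UPDATE_ROLLBACK_FAILED"):
            return False
        time.sleep(poll_seconds)

print(wait_for_update("Dev-UserService"))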

All of this just scratches the surface of what KCS can do. I could write a few dozen pages about KCS’s other abilities, like grabbing service logs, executing remote commands, hotswapping jars on Java stacks, and even snapshotting entire environments and then recreating new environments from those snapshots (need to copy a 100-instance staging environment? No problem).

The main thing KCS does is kill the need for a Release Engineer or “Deployment Guy.” Nobody is happier about this than I am, as I was the “Deployment Guy” for months. Instead, systems engineers can now focus on improving infrastructure, and devs can get new code out easily.

The lion’s share of the credit for KCS has to go to Sarah Haskins and Trevor Smith, the two developers who did the bulk of the coding. It has made life easier for all developers here at Knewton, and we hope to open source it in the future.

Configuration management and future challenges

As nice as KCS is for our deployment workflow, it only solves one part of our infrastructure needs. Like any moderately large tech team, we see natural conflicts of interest arise between those focused on system stability and maintenance and those trying to push out a slick new feature. We’re not immune to knowledge bottlenecks and technical debt, but as the team grows and practices are refined, the future looks brighter and brighter for our tech team.

At the very least, thanks to KCS, we have a pretty good handle on deploying services. Swell. But how do we configure boxes once CloudFormation puts them out there? How do we ensure that services are able to talk to one another and that stacks are resilient to a plethora of errors?

Those fantastic questions, I’m afraid, will have to be the subject of another “N Choose K” post.

Code Swarm Post Git: Data Visualization

The above visualization shows the history of commits for the Knewton Adaptive Learning Platform. A commit occurs when a developer alters the code and transfers the changes into our central git repository. Each file in a commit is represented by a color-coded blob which indicates the type of file committed. Each time a developer commits a file, the algorithm brings the file closer to that developer’s blob; this proximity reflects the association between code files and users as well as the relationships among users themselves. The histogram at the bottom displays the frequency of commits over time.

This visualization can also be referred to as a “code swarm”, and is an example of “organic information visualization” (a phrase coined by Ben Fry), that is, any data visualization that allows the elements to interact in a dynamic and unpredictable fashion. The benefit of such a visualization is that it brings the relationships between different project components into relief. The viewer can also observe the evolution of a project over time.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni discussed in a previous post, Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiency over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question answered at that point in time: red dots are questions answered incorrectly, green dots are questions answered correctly, and the {y}-value of each dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions the student has answered.

In this post, I will discuss three methods we used to evaluate the performance of our algorithms and their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.

Figure 1: A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
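
As a quick sanity check of the definition, here is a minimal Python sketch of the MSE computation on made-up parameter vectors:

def mean_squared_error(actual, recovered):
    """Mean of the squared differences between actual and recovered parameters."""
    assert len(actual) == len(recovered)
    return sum((a - r) ** 2 for a, r in zip(actual, recovered)) / float(len(actual))

print(mean_squared_error([0.0, 1.0, -0.5], [0.1, 0.8, -0.4]))  # 0.02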

While the MSE is a good indicator of accuracy in many settings, it has problems when models have multiple solutions at different scales. Let’s see why through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice-versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations, at different scales, of the same data. Because the MSE looks at the difference between the two sets of parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!
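
To make this indeterminacy concrete, consider (purely as an illustration) a two-parameter logistic model in which question {j} also has a discrimination {a_j}, so that

\displaystyle Pr(x_{ij} = 1 \,|\, \theta_i, \beta_j, a_j) = \frac{1}{1 + e^{-a_j(\theta_i - \beta_j)}}.

Rescaling every proficiency and difficulty by a constant {c > 0} while dividing every discrimination by {c} (that is, {\theta_i \rightarrow c\theta_i}, {\beta_j \rightarrow c\beta_j}, and {a_j \rightarrow a_j/c}) leaves every response probability, and therefore the likelihood of the data, exactly unchanged, even though the MSE between the two parameter sets can be large.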

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale. If they were, the line of best fit would have slope one and {y}-intercept {0}; in this case, however, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.

Figure 2: Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our data given our recovered parameters. At instantiation, our algorithms should have low log-likelihoods: the parameters we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm.

But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer it correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but such data is often hard to come by.

Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{act}} )

and

\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} Pr(\text{datum} | \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \displaystyle\sum_i \ln\left({\frac{P(i)}{P'(i)}}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL Divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
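
Here is a small Python sketch of the computation on made-up distributions {P} and {P'} over four possible responses; the numbers are purely illustrative:

import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) between two discrete distributions."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p_actual = [0.25, 0.25, 0.25, 0.25]
p_recovered = [0.30, 0.20, 0.30, 0.20]
print(kl_divergence(p_actual, p_recovered))  # roughly 0.02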

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about the fit of the parameters to the data, not about exact convergence to the original values. This often makes them reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, Figure 3 shows that our algorithm has likely converged (since the log-likelihood and KL divergence are not changing much) and gives us a reliable measure of model fit.

Figure 3: The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

Java-Scala Interoperability

To many people, Java is programming. It’s the only programming language most of my non-technical friends could name off-hand, and the first one that many programmers learn. Since its initial appearance in 1995, Java’s usage has steadily grown.

There are many reasons for this. The design of the Java platform, in which Java code compiles to bytecode that runs on the Java Virtual Machine (JVM), allows Java to be a “write once, run anywhere” language. That is, Java code that works on my computer will work exactly the same on your computer, or a server in the cloud (in theory, at least). Its language features, especially static typing and object-orientation, lend it to robust, modular, rock-solid code. Due to years of JVM and Java development, the platform is blazingly fast and stable. In addition, the many libraries available for Java allow anybody to leverage years of development and experience to perform complicated tasks simply and quickly. The Java community is also strong: on top of the official documentation, there are years of blog posts (an unscientific Google search for “java blog” yields 268,000,000 results) and close to 280,000 questions tagged “Java” on StackExchange. That’s a lot of collective knowledge and expertise.

At Knewton our systems are Java-based, for a few of the reasons mentioned above. But Java is far from perfect. Its syntax includes a great deal of boilerplate, and its support for concepts like concurrency, parallelism, and functional programming is unidiomatic. There are a lot of small tweaks or features that could make Java much more pleasant to work with. These imperfections aren’t significant enough to outweigh the benefits of using Java, but they do make daily work a little more challenging.

Enter Scala. Scala is what would happen if there were a Genie that eavesdropped on every Java developer as they cursed when their build broke due to a missing semicolon, then inferred a wish from their complaints. It offers what Java lacks: a much cleaner syntax (semicolons optional, types inferred), pattern matching and case classes, mix-ins (a more powerful inheritance/interface system), higher-order/first-class/anonymous functions, built-in immutable data structures, a built-in read-evaluate-print loop (REPL), and actor-based concurrency. More important than all of this, though, is one huge advantage: it runs on the JVM. This means that, unlike other languages, Scala doesn’t require the reimplementation of large bodies of code; programs can use existing Java libraries with minimal effort. For instance, the Java standard libraries are available as soon as your Scala REPL starts. To translate

ArrayList<String> fruits = new ArrayList<String>();
fruits.add("apple");
fruits.add("pear");
fruits.add("banana");
for (String fruit : fruits) {
    System.out.println(fruit);
}

into Scala, we simply write the following (JavaConversions is a Scala library that makes using Java collections like ArrayList much more idiomatic):

import java.util.ArrayList
import scala.collection.JavaConversions._
val fruits = new ArrayList[String]
fruits.add("apple")
fruits.add("pear")
fruits.add("banana")
for (fruit <- fruits) {
    println(fruit)
}
// More idiomatically
fruits.foreach(println)

If you’re using libraries that aren’t packaged with Java or your own classes, just ensure that they lie in the classpath. In addition, the syntactic similarities between Java and Scala mean that the information from nearly any Java resource — a blog post, StackExchange question, Javadoc, or whiteboard code snippet — will be equally valuable in Scala, allowing Scala to tap into over 15 years of experience. While Scala may be a relatively new language, it has access to all of the network benefits of a behemoth like Java.

Portability works the other way, too, but with a few caveats. Consider the following Scala class, in the file “Hamster.scala”:

class Hamster()

The Java file “HamsterTest.java” might contain the following:

public class HamsterTest {
    public static void main(String[] args) {
        Hamster hamster = new Hamster();
    }
}

To compile and run, execute

$ scalac Hamster.scala
$ javac -cp path/to/scala-library.jar:. HamsterTest.java
$ java -cp path/to/scala-library.jar:. HamsterTest

in the shell, and hopefully find that everything works as expected. However, if we want to add a weight field to our Hamster class (“class Hamster(var weight: Int)”), things start to break.

Hamster hamster = new Hamster(42);
// This works (note: "weight()", not "weight"):
System.out.println("The hamster weighs " + hamster.weight());
// But if our hamster then goes on a diet, we're out of luck:
hamster.setWeight(10); // doesn't work
hamster.weight(10); // doesn't work

We can solve this problem with the handy scala.reflect.BeanProperty annotation, which adds getWeight and setWeight methods (we could also add them explicitly):

class Hamster(@BeanProperty var weight: Int)

Then, from Java:

Hamster hamster = new Hamster(42);
hamster.setWeight(hamster.getWeight() - 32);

What about using Scala traits within Java? Since traits aren’t quite interfaces, and they aren’t quite abstract classes either, they’re a little tricky. A basic Scala trait, such as

trait Mammal {
    def beHairy
}

can be used just as an interface. But if we add another trait, Rodent,

trait Rodent extends Mammal {
    def beHairy {
        println("I have fur!")
    }
    def gnaw
}

and attempt to extend it

public class Rat extends Rodent {
    public void gnaw() {
        System.out.println("gnawsome");
    }
}

we are met with the compile-time error “no interface expected here”, even though Rodent implements beHairy(). To see why we get this message, whip out javap, the Java disassembler bundled with every JDK (javap is an essential tool for deciphering strange Scala behavior).

$ javap Rodent
Compiled from "Rodent.scala"
public interface Rodent extends Mammal,scala.ScalaObject{
    public abstract void beHairy();
    public abstract void gnaw();
}

Thus, even though Rodent contains a method implementation, it still compiles to an interface. Some compile-time magic allows Scala classes to use these traits. The easiest fix is to add something like

abstract class RodentJava extends Rodent

for each trait that implements methods, and have Java classes extend the abstract class rather than the trait. This has the advantage of being dead simple, but the disadvantage of requiring the maintenance of a separate Java API; it also sacrifices the ability to mix in multiple traits as you would in Scala (something that is impossible in Java anyway, since it lacks multiple inheritance). A single abstract Scala class for each potentially used set of mix-ins is the way to go here, but that approach risks a combinatorial explosion.

Traits are just one example of a Scala feature that lacks a direct Java equivalent. First-class functions, anonymous functions, Scala’s greater number of access level modifiers, pattern matching, tuples, implicits, and operator overloading are all either non-trivial or impossible to use in Java. In general, to be useful to Java programmers, a library that uses any Scala-specific features must offer a Java-oriented API (of greater complexity than the examples seen here, but involving similar workarounds). Both Apache Kafka [3] and Twitter’s Finagle [4], which I’ve worked with in my summer at Knewton, take this approach; while it’s possible to write Scala code that can be used seamlessly from Java, to do so is to give up the expressiveness of idiomatic Scala.

Whether, as some suggest, Scala will be a successor to Java, or merely a fad, its compatibility with Java is one of its greatest strengths. Combining Java and Scala is nowhere near the undertaking that combining two unrelated languages would be, so the cost of experimenting with Scala as a part of an existing Java architecture is minimal. Given the potential benefits and low risk, it makes sense to consider a total or partial move to Scala as part of a JVM-based ecosystem.