Making Lives Easier with Knewton Crab Stacker

In a previous post, we discussed Knewton’s in-house deployment tool, Knewton Crab Stacker (KCS). To summarize, KCS is a command line tool used to make deployment of CloudFormation (one of Amazon’s services) easier. It saves our engineers from banging their heads against their desks when trying to deploy their services.

So what exactly makes KCS such a valuable, can’t-live-without-it tool? In this post, we’ll take a look at some of the many KCS commands that make the lives of a Knewton engineer easier.

SSH

Normally, when you want to ssh into an EC2 instance, you have to go through a long and arduous process to find the instance’s public DNS name, then locate your own ssh key for that instance, and then finally type out the command that lets you ssh into the instance. You have to do this every single time you want to ssh. As you may imagine, this gets annoying fast.

To make this whole process simpler, we have a KCS command that does everything for you. All you have to do is specify which stack and target environment you’re trying to ssh into, and then KCS will take care of the rest. It will find the public DNS of the instance, find your ssh key, and finally ssh-es into the box for you. Besides being a huge time saver, my favorite part about this command is that it adds colors to the instance’s terminal. Colors make everything better.


Kick

Often while working on a service, we will make modifications to the instance (which we get into by using the awesome KCS ssh command). But when you make modifications, inevitably something gets messed up. No one wants their instance to be messed up, so you have to restart it. This usually involves relaunching the stack the instance is a part of, and twiddling your thumbs while you wait.

Here at Knewton, we like to save time, so we created a command that allows us to essentially restart our instance. We call this command kick.

Underneath the hood, kick gets the address of the instance we want to kick, ssh-es into the instance, and re-runs cfn-init (the command that is first run when the instance is created). This re-downloads the needed resources and configures everything you need using Chef. After kicking an instance, the instance is essentially brand new and ready for more tinkering.

Roundhouse Kick

A very common scenario that an engineer comes across is when he’s made a change to a service, has finished testing it locally, and then wants to test it on a real stack. To do this using just CloudFormation, we would have to first upload our new build to S3, then update our stack to use the new build. Updating a stack takes quite a bit of time, anywhere from a couple to ten-plus minutes. That’s a lot of waiting around.

That’s why we invented roundhouse kick. Roundhouse kick does everything you need to update the version of your service without having to relaunch your stack.

Here’s how it works: first, it will upload your build to S3. Next, it will do what we call an in-place update of the stack. Instead of launching new instances as a regular update would, an in-place update just updates the existing instances; this is where most of the time savings come from. After updating the stack, KCS will then kick all the instances of the stack, which, in effect, restarts the stack and grabs the new version of the service you uploaded earlier. We like to think we made Chuck Norris proud with roundhouse kick.

“Chuck Norris can upload his new build to S3, update his stack, and kick his stack all at once.”

Grab logs and Failure logs

Sometimes you roundhouse kick your stack too hard and it stops working (there’s no such thing as a soft roundhouse kick). To find out what’s wrong, you have to ssh into the instance and check the logs. But there are many logs. And you’ll probably have forgotten where all of these logs are located.

Don’t worry — KCS has got you covered.

With a simple command, you can get all of the logs from your instance in a nicely bundled tarball. To do this, KCS knows the location of your logs thanks to some coordination with the Chef recipes that set up the logging system. After determining these locations, KCS will then perform an scp command with all the needed arguments to retrieve all the files. Now you can find out why your stack couldn’t handle the roundhouse kick.

What’s Next for KCS?

Even with all the cool commands that KCS has, there’s always room for improvement. People want KCS to run faster, have more features, and be invincible to bugs. When there’s a bug in a new release of KCS (bugs are unfortunately inevitable), the deployment team gets bombarded with complaints from disgruntled KCS users. We then work to fix everything and try to get a new release of KCS out. But even when we do release a new version, not everyone remembers to upgrade, and we continue to get complaints. We ask users to check their version, discover they aren’t on the latest release, and an upgrade fixes the issue. This back-and-forth is annoying and unnecessary for both KCS users and the deployment team.

To solve this problem, we created KCSServer — the website version of KCS, which has been my baby during my summer internship. Since KCSServer is a website, we don’t have to worry about people having different versions of KCS. We can very easily make changes to KCSServer without having to worry about getting people to install the latest version.

Migrating KCS to a website also provides many other benefits. One of the main issues we wanted to address was the speed of KCS. As a command line tool, KCS is pretty slow. For a command (such as describing a stack), KCS has to make a call to Amazon after determining all the proper credentials, and then once it retrieves the information, it has to output everything in a readable format for the user. With KCSServer, we can make this command much faster by utilizing a cache. A command only has to be run once; for every subsequent run, KCSServer can retrieve the output from the cache (which we update as needed). This reduces the latency of a command from a couple of seconds to milliseconds. Considering that our rapidly growing team of engineers uses KCS a lot, those seconds saved will quickly add up to hours, and then days, of developer time. Another added benefit? With some CSS, we can make KCSServer a whole lot more pleasant to look at than a dull terminal.
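The caching layer itself is internal to KCSServer, but a minimal sketch of the idea, with hypothetical names and a simple time-to-live policy, might look like this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** A tiny time-based cache in the spirit of the KCSServer caching described above. */
public class DescribeCache {
    private static class Entry {
        final String value;
        final long expiresAtMillis;
        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public DescribeCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached output for a stack, or recomputes it via the slow AWS call. */
    public String describeStack(String stackName, Supplier<String> slowAwsCall) {
        Entry entry = cache.get(stackName);
        if (entry == null || System.currentTimeMillis() > entry.expiresAtMillis) {
            // Cache miss or stale entry: hit AWS once, then serve from memory until the TTL expires.
            entry = new Entry(slowAwsCall.get(), System.currentTimeMillis() + ttlMillis);
            cache.put(stackName, entry);
        }
        return entry.value;
    }
}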

What’s the Take-Away?

Hopefully after reading about how we at Knewton use KCS to maximize our efficiency, you’ll start thinking more about how to eliminate inefficiencies in your own deployment process, or any process for that matter. Hopefully you’ll start asking yourself, “What’s slowing me down at doing my job?” and “What can I do about it?” Then you can go out there and create your own version of KCS. Don’t forget to give it an awesome name.

Online Cross-Validation to Predict Future Student Behavior

At Knewton we use various mathematical models to understand how students learn. When building such models we want to make sure they generalize, or perform well for a large population of students. Cross-validation, a technique in machine learning, is a way to assess the predictive performance of a mathematical model. At Knewton, we use cross-validation extensively to test our models.

Cross-validation is based on the fact that we don’t have access to unlimited data. If we had all the possible data on student learning patterns, the solution would be straightforward. We would test all our models with the data and pick the one with the lowest error rate. In reality, we only have a finite set of student data to work with. Given a limited amount of data, how do we decide which model performs the best?

One approach is to use all of the available data to test our model. A major problem with this approach is overfitting, which is demonstrated in Figure 1.


Figure 1: Left: the model (blue) underfits the data (orange). This is an over-simplistic explanation of the data where the model would be a better fit if it had more parameters. Middle: the model fits the data just right, where the model captures the overall pattern in the data well. Right: the model overfits the data, where the model fits the noise in the dataset. (Source)

If our model overfits the data, the error rate will be low, but when new data is added to the dataset, the model might perform poorly because the fit doesn’t explain the new data well. This is why models that overfit do not generalize well and should be avoided.

This is where cross-validation comes into play. In this approach, rather than fitting the model to the full dataset we split it into training and test sets. This is also referred to as holdout cross-validation, as we are leaving a portion of the data out for testing. The model is fitted using only the training portion of the dataset. Then we assess the predictive performance of the model on the left-out data, which is the test set.

As an example, one model we use to assess student learning is Item Response Theory (IRT). We want to cross-validate our IRT model for a set of student responses to test the performance of our model. To do this, we can split the student response data into training and test sets, fit the model to the training data, and validate it on the test data. If the fitted model predicts the student responses in the test set accurately we can accept this IRT model.

When measuring how students learn, we assume they learn over time. Therefore, it is useful to understand how students behave as time progresses. A shortcoming of the holdout cross-validation technique is that it makes comparisons between random bits of past student data so it can’t make predictions about how students will behave in the future. It would be very useful if we were able to make predictions about students’ future behavior given their past learning patterns.

Online cross-validation is a version of cross-validation which can validate over time series data. Going back to our student response data example, online cross-validation uses a student’s past data to predict how that student will behave in the future. The dataset for online cross-validation is a time-ordered set of responses the student gave in the past. We take the first k responses of a student and use them for the training set, then we try to predict that student’s k+1st, k+2nd, …, k+nth response. If our prediction accuracy is high, we can say that our model is a good fit for our dataset.

Let’s look at how online cross-validation works in more detail. The students answer some questions over time. Some of these responses are correct (green) and some are incorrect (red). Online cross-validation will start by training on the student’s first response only (k=1), then use this to predict whether the student is going to get the next item (k+1 = 2) correct or incorrect.


Figure 2: The first iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first response (k=1) and then used to predict the second, k+1st item (k+1=2). If our prediction matches the student response, our model accuracy increases. 0/1 refers to incorrect/correct.

In the next iteration of online cross-validation, we can use the first two responses (k=2) as our training set, fit the model using these two data points, and predict the third response (k+1=3).


Figure 3: The second iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first two responses (k=2) and then used to predict the third, k+1st item (k+1=3). 0/1 refers to incorrect/correct.

Online cross-validation continues until we run through all the iterations by increasing the training set one student response at a time. We expect to make better predictions as we add more data to our training set.
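To make the procedure concrete, here is a minimal sketch in Java. The Model interface and the toy majority-vote model are hypothetical stand-ins for a real model such as IRT; the point is the train-on-the-first-k, predict-the-(k+1)st loop.

import java.util.Arrays;

/**
 * A minimal sketch of online cross-validation over a time-ordered array of
 * student responses (true = correct answer).
 */
public class OnlineCrossValidation {

    interface Model {
        void fit(boolean[] trainingResponses);
        boolean predictNext();
    }

    /** Toy model: predicts whichever outcome the student has produced most often so far. */
    static class MajorityModel implements Model {
        private int correct;
        private int total;

        @Override
        public void fit(boolean[] trainingResponses) {
            correct = 0;
            total = trainingResponses.length;
            for (boolean response : trainingResponses) {
                if (response) correct++;
            }
        }

        @Override
        public boolean predictNext() {
            return 2 * correct >= total;
        }
    }

    /** Trains on the first k responses and predicts the (k+1)st, for k = 1..n-1. */
    static double evaluate(Model model, boolean[] responses) {
        int hits = 0;
        for (int k = 1; k < responses.length; k++) {
            model.fit(Arrays.copyOfRange(responses, 0, k));
            if (model.predictNext() == responses[k]) hits++;
        }
        return (double) hits / (responses.length - 1);
    }

    public static void main(String[] args) {
        boolean[] responses = {true, false, true, true, false, true, true};
        System.out.println("Online CV accuracy: " + evaluate(new MajorityModel(), responses));
    }
}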

With online cross-validation, we are not limited to predicting only the next response in the future. We can predict a student’s next 2, 3, …, n responses. This makes online cross-validation a very useful technique if we want to make predictions far in the future.

Both holdout cross-validation and online cross-validation are very useful methods for assessing the performance of models. The holdout method is useful for assessing performance on a static dataset, whereas online cross-validation is helpful when we want to test a model on time series data.

 

Kankoku: A Distributed Framework for Implementing Statistical Models

As future-facing as Knewton’s adaptive learning platform may be, the concept of a personalized classroom has a surprisingly rich history. The idea has intrigued educators and philosophers for decades. In 1954, behavioral psychologist B.F. Skinner invented the concept of programmed instruction, along with a working mechanical prototype to boot. His “teaching machine” consisted of a wooden box on which questions were displayed to students on strips of paper controlled by turning knobs. One would only progress upon answering a question correctly. A crucial feature of the teaching machine was that “the items were arranged in a special sequence, so that, after completing the material in frame 1, the students were better able to tackle frame 2, and their behavior became steadily more effective as they passed from frame to frame.” The argument upon which Skinner’s teaching machine was founded still holds water today: that “what is taught to a large group cannot be precisely what each student is ready just at that moment to learn.”1


Sixty years later, examining Skinner’s prototype still provides an insightful frame of reference. Knewton’s platform is responsible for tracking the individual learning states of each student at the granularity of individual concepts and questions. Like the teaching machine, we must deliver relevant recommendations in real-time and classroom analytics in near real-time. Those recommendations and analytics serve as a tool for both students and teachers to improve student outcomes. Considerations like these influence the engineering decisions we make on a daily basis, including the decision to use a stream-processing framework to power several of our statistical models. In this blog post, we will open the hood of our own teaching machine to explore the tradeoffs behind the design of Knewton’s scientific computing platform.

Why Stream Processing?

Knewton’s recommendation engine faces the task of providing recommendations to millions of students in real-time. As one of the pioneers of behaviorism, Skinner certainly understood the importance of delivering the right feedback at the right time.2 Respond to a student event (e.g., finishing an article) just two minutes late, and the impact of a recommendation diminishes rapidly. But what goes into each recommendation under the hood? A recommendation is essentially a ranked selection of instructional content that is most relevant to the subject matter that a student is studying at any particular time. Every student’s learning history (the data representing their interactions with content and their activity on the system) is taken into account. Knewton’s recommendation engine also considers other factors, such as each student’s learning goals and deadlines. All of this data is processed through a variety of psychometric and statistical models that estimate various characteristics of students (e.g., their proficiency or engagement level) and content (e.g., its difficulty or effectiveness). While some of these computations can be performed ahead of time, there are still numerous models that must be computed on the spot in response to a student interaction.3 Combining and processing all of this data results in a very large sequence of actions that must be performed in a small period of time.

Knewton is much more than just a differentiated learning app. Imagine if Skinner’s teaching machine knew every student’s individual learning history, knowledge state, habits, strengths, and upcoming goals, and could take into account goals set by teachers or administrators.

To handle all this data, Knewton has built Kankoku4, a stream processing framework that can respond to individual events in real-time.5 Stream processing systems operate under the requirement that inputs must be processed “straight-through” — that is, real-time feeds must trigger a set of downstream outputs without necessarily having to resort to polling or any intermediate storage. Stream processing systems are also characterized by their support of real-time querying, fault-tolerance, and ability to scale horizontally.6 The primary complement to stream processing is batch processing, consisting of programming models such as MapReduce that execute groups of events scheduled as jobs. Batch computing is fantastic for efficiently performing heavy computations that don’t require immediate response times.

However, these advantages of batch processing are also what make it less suitable for responsive, high availability systems like Knewton’s.7

Kankoku

Kankoku is a scientific computing Java framework developed in-house that provides a programming model for developing decoupled scientific models that can be composed to create any kind of computable result. The framework aims to abstract away the details of retrieving and storing data from databases, reliability, scalability, and data durability, letting model writers concentrate on creating accurate and efficient models. In the example workflow below, the nodes (or Kankokulators, as we call them) represent individual (or sets of) calculations. Streams are fed into Kankoku from a queue, which serves as a message broker by publishing received student events into various topics to which Kankoku subscribes.

[Figure: an example Kankoku workflow, with Kankokulator nodes subscribing to student event topics published by a queue]

With this framework, complex multi-stage computations can be expressed as networks of smaller, self-contained calculations. This style of programming is especially well-suited for data analysis where the outputs of an arbitrary statistical model could be used as inputs to another. One example of this could be aggregating student psychometrics as inputs for modeling student ability using Item Response Theory (IRT).
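Kankoku itself is an internal framework, so the snippet below is only a hypothetical sketch of the style described above: small, self-contained calculators with explicit inputs and outputs, composed so that one model's output feeds another. The interface and class names are invented for illustration.

import java.util.List;

/** A hypothetical calculator node: consumes one input type, produces one output type. */
interface Kankokulator<I, O> {
    O compute(I input);
}

/** Aggregates a student's responses into a simple proficiency estimate. */
class ProficiencyEstimator implements Kankokulator<List<Boolean>, Double> {
    @Override
    public Double compute(List<Boolean> responses) {
        if (responses.isEmpty()) {
            return 0.5; // no evidence yet
        }
        long correct = responses.stream().filter(r -> r).count();
        return (double) correct / responses.size();
    }
}

/** Consumes the upstream proficiency estimate and picks the next piece of content. */
class ContentRanker implements Kankokulator<Double, String> {
    @Override
    public String compute(Double proficiency) {
        return proficiency < 0.6 ? "remedial-item" : "advanced-item";
    }
}

public class WorkflowExample {
    public static void main(String[] args) {
        List<Boolean> responses = List.of(true, false, true, true);
        Double proficiency = new ProficiencyEstimator().compute(responses);
        String recommendation = new ContentRanker().compute(proficiency);
        System.out.println("proficiency=" + proficiency + ", recommend=" + recommendation);
    }
}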

Speed and horizontal scalability are also important in developing a stream processing framework for real-time events. One of the many ways Knewton achieves horizontal scalability is by partitioning the input data stream using a partitioning key in the queue.8

“Kankoku” Means “Recommendation”

Similar to how Skinner’s teaching machine immediately responds to individual inputs, Kankoku streamlines responsive event processing for arbitrary, unbounded data streams. Both serve a complex need — providing personalized learning recommendations — yet have internal mechanisms that are easily decomposable, and execution that is reproducible.

But Kankoku is very different from the teaching machine. The software it powers is capable of understanding and analyzing the learning mechanisms of millions of students. Ensuring that Knewton doesn’t sacrifice quality to meet the demands of quantity or speed is a top priority. To meet these ends, we are continually revising and extending our models to run more efficiently while delivering better results. Kankoku’s design is a strength here. Not only does it help Knewton break down a complex task into smaller pieces, it also makes it simpler to understand and tweak each component. Monitoring these models requires complex visibility tools that allow Knewton to examine intermediate computation in real-time. Kankoku is less like one teaching machine than it is hundreds of small machines working together in concert.

So What?

In his exposition “Programming Instruction Revisited,” Skinner spoke of his dream of creating technology that would help classrooms evolve beyond the “phalanx formation” by helping teachers become even more attuned to every student’s individual needs. As history has shown us, implementing such technology at scale is an extremely difficult problem. Truly understanding student needs and providing feedback in real-time is a non-trivial challenge for any person, much less a computer program. Practical machine learning and “artificial intelligence” is in many ways a systems engineering challenge — building models that can handle real-time workloads at scale is crucial to creating a service that will actually be useful to students and teachers. Well-designed systems will never replace teaching, but they can provide an automated, responsive, and unified platform to expose insights about student learning to teachers and parents around the world, who do understand how to best act on those insights.


Acknowledgements

I’d like to thank the creators of Kankoku — Nikos Michalakis, Ferdi Adeputra, Jordan Lewis, Erion Hasanbelliu, Rafi Shamim, Renee Revis, Paul Kernfeld, Brandon Reiss, George Davis, and Kevin Wilson — for their tireless work as well as letting me play with such an awesome piece of technology. Stay tuned for part 2 of this blog post for more details on my internship project (extending the Kankoku framework with Apache Storm).


  1. B.F. Skinner. Programming Instruction Revisited

  2. Knewton is not preaching or practicing behaviorism. This is only meant to be an analogy. 

  3. http://en.wikipedia.org/wiki/Online_algorithm 

  4. Kankoku means “advice” or “recommendation” in Japanese. It also means “Korea.” 

  5. In addition to powering Knewton’s recommendation engine, stream processing suits a variety of applications, ranging from powering Google Trends to supporting fraud detection and “ubiquitous computing” systems built on cheap micro-sensor technology that demand high-volume and low-latency requirements. Other applications include powering bank transactions (which require exactly-once delivery), image processing for Google Street View, and command-and-control in military environments. See: Akidau, et al. MillWheel: Fault-Tolerant Stream Processing at Internet Scale

  6. Stonebraker, et al. The 8 Requirements of Real-Time Stream Processing

  7. Frameworks such as the Lambda Architecture exist that unite both programming models. There is also technically a gray zone between batch and streaming processing frameworks – for instance, Spark Streaming processes events in microbatches. Some of our models can’t be implemented with microbatching, but it is an interesting idea worth exploring. 

  8. Alternative terminology for “grouping”: sharding, shuffling. 

Kankoku: A Distributed Framework For Implementing Statistical Models (Part 2)

The focus of my internship project this summer was to extend Kankoku (Knewton’s scientific computing framework) to operate in a more distributed fashion. There are a few reasons that drove this change — namely, increased functionality in model expression and greater operational control in model execution. In this blog post I will analyze these objectives in detail and explore why they necessitated tradeoffs from both the systems engineering and business perspectives.

Kankoku is a scientific computing Java framework developed in-house at Knewton that provides a stream-processing programming model for developing decoupled scientific models that can be composed to create any kind of computable result. For a more detailed discussion on stream processing and Kankoku, see part one of this blog post.

Weathering the Storm: Introducing Distributed Kankoku

Partitioning, or dividing a set of inputs into a collection of subsets, is a key problem in any distributed system. Mod hashing and consistent hashing are examples of how shuffle groupings would be implemented, in which keys are uniformly distributed across partitions in a pseudorandom process. Kankoku currently performs a shuffle grouping before model execution, which allows workloads to be balanced across separate machine stacks that each run independent instances of the same topology.1 However, the calculation of certain psychometrics may require additional partitioning (i.e., multi-partitioning).
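As a concrete illustration (a sketch, not Kankoku's actual partitioner), a mod-hash shuffle grouping over string keys might look like the snippet below. Unlike consistent hashing, changing the number of partitions under this scheme reshuffles most keys.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** A minimal sketch of mod-hash partitioning for a shuffle grouping. */
public class ModHashPartitioner {
    private final int numPartitions;

    public ModHashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Maps a key (e.g., a student ID) to one of numPartitions partitions. */
    public int partitionFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // CRC32 spreads keys roughly uniformly; the modulo keeps the result in range.
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        ModHashPartitioner partitioner = new ModHashPartitioner(4);
        for (String studentId : new String[] {"student-101", "student-102", "student-103"}) {
            System.out.println(studentId + " -> partition " + partitioner.partitionFor(studentId));
        }
    }
}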

Recall that Knewton performs online analysis of both the student and the classroom. Consider the scenario in which the output of Kankokulator node A (calculating a student metric) serves as the input to Kankokulator node B (calculating a classroom metric). Since A processes events per student, the initial grouping must happen by student ID. However, B must process events per classroom. This presents a problem, since there is no guarantee that two students in the same class are grouped to the same partition. A simple solution might be to route the output of A through a queue serving as an intermediate message broker. This queue can then regroup the data stream based on class ID:

[Figure: the output of Kankokulator A routed through an intermediate queue that regroups the stream by class ID before reaching Kankokulator B]

However, this approach scales poorly for several reasons. Creating new queue shards to track each new multi-partition can be difficult to maintain from a development standpoint. Rerouting the data stream to an intermediate broker with every grouping also introduces extra overhead and network latency. There is also no guarantee that the models execute deterministically. Previously, each instantiation of a Kankoku topology ran on its own machine, processing each input in a topologically-ordered fashion. With intermediate queues, keys may be processed out of order due to varying latency. A more general-purpose solution is preferable.

This is where the Apache Storm framework (originally developed by Twitter) comes in as a possible candidate. Like Kankoku, Storm is a general stream-processing framework, but with one crucial design difference: it is strongly distributed, in that nodes in the same topology need not run sequentially on the same machine. As a result, Storm supports the ability to perform arbitrary groupings between each node, and multiple groupings within the same topology.2 Nodes in a Storm topology are referred to as bolts, and data sources are referred to as spouts.

Using Storm’s Trident API, declaring a new grouping within the topology is as simple as calling the function partitionBy. The example below shows how our hypothetical scenario above might be implemented using Storm instead of rerouting through a queue:

[Figure: the same two-stage grouping expressed as a topology using Storm's Trident API]
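For readers unfamiliar with Trident, the sketch below shows how such a two-stage grouping might be wired up. It assumes a spout that emits tuples with studentId, classId, and responseTime fields, and the calculations inside the functions are placeholders rather than real Knewton models.

import org.apache.storm.topology.IRichSpout;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

/**
 * A hypothetical Trident topology for the scenario above: a per-student
 * calculation followed by a regrouping and a per-classroom calculation.
 */
public class StudentClassTopology {

    /** Stand-in for Kankokulator A: derives a per-student metric from an event. */
    public static class StudentMetric extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            double responseTime = tuple.getDoubleByField("responseTime");
            collector.emit(new Values(Math.log(responseTime))); // placeholder calculation
        }
    }

    /** Stand-in for Kankokulator B: derives a per-classroom value from a student metric (placeholder pass-through). */
    public static class ClassMetric extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            collector.emit(new Values(tuple.getDoubleByField("studentMetric")));
        }
    }

    public static TridentTopology build(IRichSpout studentEventSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("studentEvents", studentEventSpout)
                // First grouping: route all of a student's events to the same partition.
                .partitionBy(new Fields("studentId"))
                .each(new Fields("responseTime"), new StudentMetric(), new Fields("studentMetric"))
                // Second grouping: regroup the derived stream by classroom.
                .partitionBy(new Fields("classId"))
                .each(new Fields("studentMetric"), new ClassMetric(), new Fields("classMetric"));
        return topology;
    }
}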

Kankoku can therefore be extended by “wrapping” subtopologies (individual Kankokulators or groups of Kankokulators) within Storm bolts. Bolts will encompass contiguous Kankokulators expecting data streams partitioned by a common key type, and a new bolt will be created whenever an additional partitioning operation is required. This interaction introduces the functionality of multi-partitioning while still preserving our original model execution; bolts do not define how data is managed and arbitrary Kankokulator code can still run within a bolt. Hence, in this architecture Kankoku provides a higher-level programming model built upon Storm.

Another use case for this particular design arises from Storm’s convenient parallelism hint feature. Parallelism hints are the initial number of executor threads allocated to a particular bolt, which can be rebalanced during runtime. Tuning the parallelism hint of bolts gives us additional operational control over executing topologies by weighting CPU resources differently for separate subtopologies. Therefore, subtopologies that we expect to be more computationally expensive can be allocated more processing power, which in turn helps increase throughput.

[Figure: a sample Storm-Kankoku topology, with Kankokulator subtopologies wrapped inside Storm bolts fed by a queue]

The topology above shows how a Storm-Kankoku topology might be represented. Within each bolt, the Kankoku subtopology will run deterministically so as to take advantage of data locality. Hence, it is advantageous to wrap as many Kankokulators as possible within each given bolt while still fitting the constraints imposed by weighted parallelism and multi-partitioning.

Tradeoffs of Operating In A Distributed Environment

My internship project this summer consisted of implementing a prototype of the integrated Storm-Kankoku framework similar to the sample topology displayed above in addition to examining the tradeoffs behind extending the Kankoku framework using Storm. Introducing added parallelism at a platform level can have sweeping effects on the behavior of our statistical models, affecting both integrity and performance. A few considerations we explored:

A) Bolt-level deterministic execution. Although Storm may not be able to recover the state of execution within an individual bolt if it fails, Storm’s “Transactional Topologies” guarantee that “multiple batches can be processed in parallel, but commits are guaranteed to be ordered.” Hence, topological ordering still applies and we expect reproducible execution.

B) Fault-tolerance. Storm provides fault tolerance with clear guarantees across bolt execution and state-saving operations (either exactly-once or at-least-once delivery). By assigning a monotonically increasing transaction ID to each commit of events, Storm provides the semantics needed to detect and filter out duplicate events replayed by Storm in the event of a failure. Fault tolerance is especially important when the outputs of Kankokulator nodes are saved or transmitted during execution — without Storm’s guarantees, events might be lost or duplicated.

C) Horizontal Scalability. Any implementation must take care to increase throughput without increasing latency. One possible performance pitfall faced in a distributed environment is the added latency introduced by redundant computations that must be computed by each bolt (such as loading the Knewton knowledge graph). This could potentially be solved by an off-node cache such as ElastiCache at the cost of introducing additional complexity. In general, careful load testing must be performed to determine the ideal method of data processing — whether to pass values across the wire or to utilize intermediate checkpointing and storage structures.

As expected, many of these tradeoffs don’t point to a single right answer. For instance, depending on the scenario Knewton might leverage Storm’s exactly-once functionality at the expense of introducing more latency. In situations like these, it becomes less a question of which approach to take and more so a question of how important each requirement is. How important is it to filter out duplicate events? What is the cost of producing a recommendation that is stale, possibly by just a few seconds? How important is it for Knewton to keep its latency as low as possible? These questions strike at the heart of both Knewton’s internal systems design and its core business value-add, and encapsulate much of what made my internship intellectually engaging and rewarding.

Sources


  1. By topology, we mean a directed acyclic graph (DAG) that defines the workflow of calculations. 

  2. Storm implements the ability to partition by using an abstraction called an Active Distributed Hash Table (Active DHT). Active DHTs extend distributed hash tables to allow an arbitrary user defined function (UDF) to be executed on a key-value pair. Source: A. Goel, Algorithms for Distributed Stream Processing

Make Your Test Suite Accessible

What does it mean for something as complex and dynamic as a platform to be “well-tested”? The answer goes beyond simple functional coverage.

Testing has been my specialty through much of my 14 years of experience in software. If there is one thing I’ve learned about testing, it is that tests can, and should, do more than just test. Tests can be used to communicate and collaborate. Tests can also be used to discover what your product is, as well as what it should be. At their best, tests can be the frame of reference that anchors a team and solidifies team goals into verifiable milestones.

Testing the platform

The Knewton platform is composed of many component services. Each of those services is developed by a dedicated team, and each service is tested on its own with the standard unit, integration, and performance tests. This article is not really about those tests but about how we test the platform as a whole.

The Knewton platform uses data to continuously personalize the delivery of online learning content for individual students. The platform determines student proficiencies at extremely detailed levels, provides activity recommendations, and generates analytics. To do all this, our platform must be fast, scalable, and reliable. Our team must be skilled at grappling with intricate technical problems, while maintaining high-level perspective and focus on the greater system. Testing is part of how we maintain this dual perspective.

Accessibility

Accessibility is the most important criterion we build into our tests to help us achieve the above goals.

In the context of a full-stack test suite, accessibility to me means at least the following:

- Anyone can run the tests
- Anyone can read the test report and analyze test failures
- Anyone can read, change, extend, or otherwise interact with the test definitions

Making tests accessible and promoting those accessible tests can be a tough cultural challenge as well as a tough technical challenge. But the cost of failing at this is high. The more isolated your test suite (and the engineers who create and execute it), the less value you will derive from it. Your tests will not reflect involvement from the greater organization, and more importantly, the information your tests generate will not be as widely disseminated throughout the organization as it could be.

So how is a test suite made “accessible”?

Anyone can run the tests

The best thing you can do with a test suite is get it running in your continuous integration server. At Knewton we use Jenkins as our CI server. Anyone in our organization can use Jenkins to invoke the tests against any testing environment, at any time, without any special setup on their computer whatsoever.

Additionally, the test code is in our Git repository, and everyone is encouraged to check it out and invoke the tests in very flexible ways. Developers have the option of running a single test, a set of related tests, tests that correlate with a given JIRA ticket, or other options. Developers can run the tests against a local development environment, or a deployed environment. A test suite that can be run in flexible ways is an important part of accessibility.

Anyone can read the test report

Our test suite produces several kinds of test reports. The report I enjoy showing off the most is the HTML report, which lists every test that runs and details every test that fails (this capability is built into the wonderful RSpec testing framework we use). This HTML report is archived in Jenkins with every test run, so anyone can read it for any test run right within their browser. And because the report uses plain English, it is comprehensible by anyone who is familiar with our platform’s features, developers or not.

Here is what a small portion of our HTML test report looks like, showing both passing and failing tests:

What may or may not be obvious here is that tests are really about information. When I test a piece of software, my product is actionable information. When I make an automated test suite, my product is an information generator. Building a generator of information is one of the more valuable and interesting bits of work a QA engineer can do; here at Knewton, we encourage this mentality.

Anyone can change the tests

First and foremost, my job at Knewton is to enable tests to take place easily. Secondly, my job is to assist and initiate the creation of actual tests. Here at Knewton, it’s great for me to see the testing framework I created be picked up by developers, changed, extended and generally used. While we do formal code reviews on the tests, we try to make that process very efficient in order to ensure that there are very low barriers for anyone who creates a platform test.

What does accessibility get you?

Here are just a few of the ways that an accessible test suite brings value to an organization:

- Raising awareness of the behaviors of the system and the interactions between various components in the system throughout the entire organization.

- Eliminating bottlenecks when releasing: got the code deployed and need to run the tests? Just go press the button.

- Enabling continuous deployment: when your tests are in your continuous integration system, it becomes easy to chain together build, deploy, and test plans into a continuous deployment scheme (we are still working on this one).

- Encouraging better tests: when non-testers are encouraged to get involved in testing, unexpected questions get asked.

More to come

Testing is a massively important part of the puzzle for Knewton as we scale our technology and our organization. We are learning more every day about how to make the best, most accessible and valuable tests we can. In a future post, I intend to share some of the technical details and tools we have been using to make our tests. In the meantime, I welcome your feedback on the ideas presented here and around testing in general.

Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. The Kalman filter has numerous applications in finance and different areas of engineering, and in this blog post, we will show how the Kalman filter can apply to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.

Model

To begin, we are going to describe our inference problem and a model of the system – a set of assumptions of how student learning works.

Given a series of scalar measurements:

Z=\{z_1,z_2,...,z_n\}

where z_i denotes the natural logarithm of the time it took our student to answer question i,

We want to infer the scalar latent variables:

X=\{x_1,x_2,...,x_n\}

where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), where \tau is the time the student spent between questions, and \sigma^2_x is a hyperparameter that corresponds to the variance for this latent process (1).

Having the variance of the noise increase linearly with the time difference makes our equation consistent with a continuous Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience significant change in ability if they’re between questions in a quiz, but if it’s been a day since they’ve answered a question, they’re more likely to have taken the time to learn more about the concept, or, conversely, they might have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter, since it assumes a continuous-time model for a discrete set of observations.

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), where \sigma^2_z is a hyperparameter that corresponds to the variance of the Gaussian noise (2).


[Figure: graphical model relating the latent abilities x_{k-1} and x_k to the observed log response times z_{k-1} and z_k]

The resulting model is pictured by the diagram above: the ability of the student at the previous question x_{k-1} determines the log response time z_{k-1}. The ability of the student at current question x_k is determined by the ability at the previous question x_{k-1}, and determines the current log response time, z_k.


Inference

The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given observations Z = \{z_1, z_2,...,z_k\}. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on syntax:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the k-1th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents our estimate of the variance given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of current latent state x_k having seen everything but our current observation z_{k}.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the sum of two independent Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the addition of two independent random variables can be expressed as a convolution of the two composite probability densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1}) p(x_k|x_{k-1}) dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the multiplication of two Gaussians is still a Gaussian, although unnormalized:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1})*p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:

x_{k|k}=x_{k|k-1}+C_k(z_k-x_{k|k-1})

C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.
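To make the recursion concrete, here is a minimal sketch of the prediction and correction steps (equations 4 and 6) in Java, assuming fixed hyperparameters \sigma^2_x and \sigma^2_z. This is an illustration only, separate from the open source implementation linked below.

/**
 * A minimal hybrid Kalman filter over log response times, following
 * equations (4) and (6) above.
 */
public class HybridKalmanFilter {
    private double mean;          // x_{k|k}: current ability estimate
    private double variance;      // sigma^2_{k|k}: uncertainty of that estimate
    private final double sigma2X; // latent random-walk variance per unit time
    private final double sigma2Z; // observation (log response time) noise variance

    public HybridKalmanFilter(double initialMean, double initialVariance,
                              double sigma2X, double sigma2Z) {
        this.mean = initialMean;
        this.variance = initialVariance;
        this.sigma2X = sigma2X;
        this.sigma2Z = sigma2Z;
    }

    /**
     * Incorporates one observation z (a log response time) taken tau time
     * units after the previous one, and returns the updated ability estimate.
     */
    public double update(double z, double tau) {
        // Prediction, equation (4): the mean carries over and the variance
        // grows linearly with the elapsed time tau.
        double priorMean = mean;
        double priorVariance = variance + tau * sigma2X;

        // Correction, equation (6): blend the prior with the new observation.
        double gain = priorVariance / (priorVariance + sigma2Z);
        mean = priorMean + gain * (z - priorMean);
        variance = 1.0 / (1.0 / priorVariance + 1.0 / sigma2Z);
        return mean;
    }

    public static void main(String[] args) {
        HybridKalmanFilter filter = new HybridKalmanFilter(0.0, 1.0, 0.01, 0.5);
        double[] logTimes = {3.2, 2.9, 3.5, 2.7};
        double[] gaps = {1.0, 1.0, 24.0, 1.0}; // time elapsed before each question
        for (int k = 0; k < logTimes.length; k++) {
            System.out.println("Ability estimate: " + filter.update(logTimes[k], gaps[k]));
        }
    }
}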

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is on Knewton’s GitHub:

https://github.com/Knewton/Kalman

Authors

Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on Twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.


Cassandra and Hadoop – Introducing the KassandraMRHelper

Here at Knewton we use Cassandra for storing a variety of data. Since we follow a service-oriented architecture, many of our internal services are backed by their own data store. Some of the types of data we store include student events, recommendations, course graphs, and parameters for models. Many of these services and clusters are often deployed in more than two environments, increasing the total number of Cassandra clusters deployed.

On the Analytics team at Knewton we are constantly working on improving a lot of the inferential models that go into our platform, while at the same time building new ones. This often involves munging a lot of data in short periods of time. For a lot of our ad-hoc analysis we use a data warehouse from which analysts can query and extract data relatively quickly. One of the challenges we’ve faced at Knewton — and specifically in Analytics — involved how to go about populating our data warehouse with data from Cassandra clusters that predated our data warehouse. To solve this problem, we implemented an internal library for bulk extracting data out of Cassandra into Hadoop with zero hits to the Cassandra cluster. A few months later we open sourced it here and called it the KassandraMRHelper.

KassandraMRHelper takes a slightly different approach than the constructs contained in the Hadoop package in the Cassandra source code (e.g. AbstractColumnFamilyInputFormat), in that it doesn’t require a live Cassandra cluster to extract the data from. This allows us to re-run map-reduce jobs multiple times without worrying about any performance degradation of our production services. This means that we don’t have to accommodate more traffic for these offline analyses, which keeps costs down.

How does it work?
The KassandraMRHelper includes specialized Input Formats and Record Readers for SSTables. First, here’s a little bit about SSTables:

SSTables are immutable; once they’re written they never change.
SSTables can exist independently of each other, but collectively they form the complete data set.
SSTables consist of four to five components, depending on which version you’re using:

  • Data.db files store the data, compressed or uncompressed depending on the configuration.
  • Index.db is an index into the Data.db file.
  • Statistics.db stores various statistics about that particular SSTable (times accessed, etc.).
  • Filter.db is a Bloom filter that’s loaded in memory by the Cassandra node and can quickly tell whether a key is in the table or not.
  • CompressionInfo.db may or may not be present, depending on whether compression is enabled for a ColumnFamily. It contains information about the compressed chunks in the Data.db file.

Data in columns and rows are essentially key value pairs, with rows as the keys and columns as values to the rows. The columns are also key value pairs consisting of a name and a value.

Given how data are stored, Cassandra is in fact a really good fit for MapReduce. The same partitioning schemes that Cassandra uses can also be used in MapReduce. Columns and rows can be the keys and values that get passed in the Mappers or Reducers in a MapReduce job.

Some key components of KassandraMRHelper are:

  • The SSTableInputFormat: The SSTableInputFormat describes how to read the SSTables and filters through all the components and ColumnFamilies, keeping only what’s necessary for processing using a custom PathFilter. There are two types of SSTableInputFormats depending on how you want to represent key/value pairs in MapReduce. The SSTableColumnInputFormat constructs an SSTableColumnRecordReader in which keys in the Mapper represent the row keys in Cassandra and values represent a single column under that row key. Similarly the SSTableRowInputFormat constructs an SSTableRowRecordReader in which keys in the Mappers are the Cassandra row keys and values are column iterators over all the columns under a single row key.
  • The SSTableRecordReader: It internally uses an SSTableScanner and a Descriptor similar to the one contained in recent versions of Cassandra but with backwards compatibility to identify all the parts of an SSTable. As described previously, subclasses of the SSTableRecordReader can represent values as single columns under a row or entire column iterators.
  • The SSTableColumnMapper: This abstract mapper inherits from the default Hadoop mapper but adds a bit more structure to deal with the ByteBuffers and columns contained in the SSTables. It can also skip tombstoned columns.
  • The SSTableRowMapper: This is similar to the mapper above that deals with column iterators.

Example
Setting up a MapReduce job for reading a Cassandra cluster becomes very simple. The only missing piece is finding an easy way to get all the SSTables into a Hadoop cluster. At Knewton we found Netflix’s Priam to be a good match. Priam backs up our Cassandra cluster multiple times a day into S3 making it really easy to transfer the data to Elastic MapReduce (EMR).

This simple MapReduce job shows a complete example job that consumes student event data from backed up Cassandra SSTables. The example can also be found here.

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf);
  SSTableInputFormat.setPartitionerClass(
      RandomPartitioner.class.getName(), job);
  SSTableInputFormat.setComparatorClass(LongType.class.getName(), job);
  SSTableInputFormat.setColumnFamilyName("StudentEvents", job);
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(StudentEventWritable.class);
  job.setMapperClass(StudentEventMapper.class);
  job.setReducerClass(StudentEventReducer.class);
  job.setInputFormatClass(SSTableColumnInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  SSTableInputFormat.addInputPaths(job, args[0]);
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
public class StudentEventMapper extends SSTableColumnMapper
      <Long, StudentEvent, LongWritable, StudentEventWritable> {
  @Override
  public void performMapTask(Long key, StudentEvent value, Context context) {
    // Emit the student ID as the key and the deserialized event as the value.
    context.write(new LongWritable(key),
                  new StudentEventWritable(value));
  }
  @Override
  protected Long getMapperKey(ByteBuffer key, Context context) {
    ByteBuffer dup = key.slice();
    Long studentId = dup.getLong();
    return studentId;
  }
  @Override
  protected StudentEvent getMapperValue(IColumn iColumn, Context context) {
    return getStudentEvent(iColumn, context);
  }
}

Notice that the mapper extends from a specialized SSTableColumnMapper which can be used in conjunction with the SSTableColumnRecordReader.

The example above uses the identity reducer to simply write the data as comma separated values by calling the toString() method on the StudentEventWritable objects. The only additional task you have to worry about in the Reducer is deduping the data, since you will probably have a replication factor of > 1.
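As an illustration, a deduping reducer might look like the sketch below. It assumes that duplicate copies of an event (one per replica) serialize to identical strings via toString(); a real job might compare explicit event IDs or timestamps instead.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of a reducer that drops duplicate events introduced by replication. */
public class StudentEventDedupReducer
    extends Reducer<LongWritable, StudentEventWritable, LongWritable, Text> {

  @Override
  protected void reduce(LongWritable studentId,
                        Iterable<StudentEventWritable> events,
                        Context context) throws IOException, InterruptedException {
    Set<String> seen = new HashSet<>();
    for (StudentEventWritable event : events) {
      String line = event.toString();
      if (seen.add(line)) { // emit each distinct event only once
        context.write(studentId, new Text(line));
      }
    }
  }
}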

Automating this job becomes an easy task given that SSTables are immutable and older tables don’t have to be read if they were already read once. Enabling incremental snapshots can also help here.

Conclusion

If you want to get started on using the KassandraMRHelper you can check out the code here: https://github.com/Knewton/KassandraMRHelper. Contributions are more than welcome and encouraged.

If you’re interested in additional topics in Cassandra and Hadoop you should check out the presentation on bulk reading and writing Cassandra using Hadoop here with the slides shared here.

 

Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the added help may contribute more headache than help as even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos of either kangaroos or kittens, but you do not know which photo depicts what. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining if a photo is a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question j with difficulty \beta_j and student i with ability \theta_i, the probability that question j is answered correctly, x_j = 1, is

P(x_j = 1|\theta_i, \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}}

For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include the labels L, a vector of abilities \theta, and a vector of difficulties \beta. To make sure we’re all on the same page, the labels L represent all of the labels given by our labelers. Not all labelers need label every photo. Each of our abilities \theta can range from negative infinity to positive infinity. The greater the ability, the more skilled the labeler. Our difficulties range from zero to infinity, where the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.

Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

How does all of this relate to whether the photo is a kangaroo or a kitten? For a specific photo j, we can derive how likely the photo is to depict either adorable animal. That is, the posterior probability of the correct label z_j for photo j gives the probability that the photo depicts each animal.

Because we know that the photo contains one of the two animals, we can designate kangaroo as 0 and kitten as 1. Our posterior probability then ranges from 0 to 1, indicating how likely the photo is to contain either animal. If we assume that the correct label z_j is independent of both the abilities \theta and the difficulties \beta, the probability simplifies dramatically.

The posterior probability now consists of two components: a prior belief and an IRT-based probability. Our first term, p(z_j), captures our prior knowledge about how many of the photos contain each animal. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use that parameter to include our prior belief in the model. The second term uses our 1PL IRT probability to denote the probability that the labeler gave a particular label (i.e., answered a test question) conditioned on the correct answer, the labeler's ability, and the difficulty of the question.

Expectation-Maximization

Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to find a way to make our posterior probabilities as large as possible (reflecting how certain we are overall of our guesses of the correct label) by adjusting our parameters \theta and \beta. More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta), with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, we can find our joint log-likelihood function:
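
Sketching it out with the notation above (and writing E_{z_j} for the expectation taken with respect to the posteriors from the expectation step), the objective has roughly this form:

Q(\theta, \beta) = E_{Z}\left[\ln p(L, Z \mid \theta, \beta)\right] = \sum_{j} E_{z_j}\left[\ln p(z_j)\right] + \sum_{i,j} E_{z_j}\left[\ln p(l_{ij} \mid z_j, \theta_i, \beta_j)\right]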

Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.
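
To make the alternation concrete, here is a minimal Scala sketch of one expectation step and one maximization step. It assumes, for simplicity, that the probability of a correct label follows the standard 1PL form \sigma(\theta_i - \beta_j) from the IRT section above; the flat 0.5 prior, the single gradient step per round, and all names are illustrative choices rather than the implementation behind the results below.

import scala.math.{exp, log}

// value: 0 = kangaroo, 1 = kitten
case class Label(labeler: Int, photo: Int, value: Int)

object LabelEM {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  // Expectation step: the posterior probability that each photo is a kitten (z = 1),
  // holding the abilities (theta) and difficulties (beta) fixed.
  def eStep(labels: Seq[Label], theta: Array[Double], beta: Array[Double],
            priorKitten: Double): Array[Double] = {
    val logOdds = Array.fill(beta.length)(log(priorKitten) - log(1.0 - priorKitten))
    for (l <- labels) {
      val pCorrect = sigmoid(theta(l.labeler) - beta(l.photo))
      // A kitten label is evidence for z = 1 with likelihood pCorrect and for
      // z = 0 with likelihood 1 - pCorrect; a kangaroo label is the reverse.
      val delta = log(pCorrect) - log(1.0 - pCorrect)
      logOdds(l.photo) += (if (l.value == 1) delta else -delta)
    }
    logOdds.map(sigmoid)
  }

  // Maximization step: one gradient-ascent update of theta and beta on the expected
  // joint log-likelihood, holding the posteriors fixed.
  def mStep(labels: Seq[Label], theta: Array[Double], beta: Array[Double],
            posterior: Array[Double], learningRate: Double): Unit = {
    val gradTheta = Array.fill(theta.length)(0.0)
    val gradBeta = Array.fill(beta.length)(0.0)
    for (l <- labels) {
      val p = sigmoid(theta(l.labeler) - beta(l.photo))
      val qMatch = if (l.value == 1) posterior(l.photo) else 1.0 - posterior(l.photo)
      val g = qMatch * (1.0 - p) - (1.0 - qMatch) * p // derivative w.r.t. (theta - beta)
      gradTheta(l.labeler) += g
      gradBeta(l.photo) -= g
    }
    for (i <- theta.indices) theta(i) += learningRate * gradTheta(i)
    for (j <- beta.indices) beta(j) += learningRate * gradBeta(j)
  }

  // Alternate the two steps for a fixed number of rounds (a real implementation
  // would instead check the posteriors for convergence).
  def run(labels: Seq[Label], numLabelers: Int, numPhotos: Int, rounds: Int): Array[Double] = {
    val theta = Array.fill(numLabelers)(1.0)
    val beta = Array.fill(numPhotos)(1.0)
    var posterior = Array.fill(numPhotos)(0.5)
    for (_ <- 1 to rounds) {
      posterior = eStep(labels, theta, beta, priorKitten = 0.5)
      mStep(labels, theta, beta, posterior, learningRate = 0.1)
    }
    posterior
  }
}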

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach outscores the majority-vote approach by nearly 5% when few labels are available, with the two methods converging as more labels are added. Additionally, as we increase the number of voters, the accuracy increases. Success!

While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

References
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and Movellan, J. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press.

Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach*

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition include adding single-digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach*

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x^2 + _x + _.

The underlines are “holes” that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers 1 through 10, leading to 10^3 = 1000 possible questions. The instructor may wish to further restrict the template to only permit quadratics with real, distinct roots.

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessments generator engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The below template is used to generate questions of the form _ + _ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9
    sum=x+y
  }
}

The question and answer are simply strings with variable names denoting the “holes.” Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with 2 decimal places, or even the set of Fibonacci numbers. Derived variables are defined by mathematical expressions.

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
}

Here, we are generating each coefficient (a, b, c) from the range [-10, 10]. However, the table below illustrates an issue with this template: for certain coefficients, the solutions can be integers, fractions, irrational numbers, or even complex numbers.

(a, b, c)        Solutions
(1, -10, 16)     2, 8
(-4, 7, -3)      0.75, 1
(3, 6, 2)        -1.577…, -0.422…
(1, 2, 1)        -1
(1, 4, 8)        -2 + 2i, -2 - 2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1
  }
}

In general, constraints are useful for narrowing the skill set covered by the template and for ensuring that instantiations of the template are sufficiently varied. In this example, requiring unique, integer roots controls the difficulty level, while requiring gcd(a, b, c) = 1 ensures that no two questions will be multiples of one another.

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <m> minutes traveling <r> miles/hour?"
  answer is "<dist>miles"
  variables {
    m is an integer between 30 and 90 divisible by 10
    r is an integer between 40 and 65 divisible by 5
    dist = rate*time/60
    wrong1 = rate*time
    wrong2 = rate/time
    wrong3 = rate/time/60
  }
  wrong answers {
    "<wrong1>miles"
    "<wrong2>miles"
    "<wrong3>miles"
  }
}

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “rate/time/60” because she doesn’t know the correct distance formula, while another student might choose “rate*time” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from their sample sets and using the formulas to compute the derived variables.

  2. If the resulting values satisfy all of the constraints, add the instantiation to the list of questions.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a large enough fraction of the sample space (the set of all possible questions, i.e., the Cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can afford 100,000 iterations, roughly 1 in 1,000 candidate instantiations needs to be valid. This isn't too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.
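
Here is a minimal Scala sketch of this naive rejection-sampling loop. The Template and Question types are simplified, hypothetical stand-ins for the real data model; derived formulas are assumed to depend only on previously defined variables, and duplicate instantiations are not filtered out.

import scala.util.Random

object ProblemGenerator {
  // Simplified, hypothetical stand-ins for the real template data model.
  case class Template(
    generated: Map[String, Seq[Double]],                    // variable name -> sample set
    derived: Seq[(String, Map[String, Double] => Double)],  // evaluated in order
    constraints: Seq[Map[String, Double] => Boolean]
  )
  case class Question(bindings: Map[String, Double])

  def generate(template: Template, desired: Int, maxIterations: Int): Seq[Question] = {
    val rng = new Random()
    val questions = scala.collection.mutable.ListBuffer[Question]()
    var iterations = 0
    while (questions.size < desired && iterations < maxIterations) {
      iterations += 1
      // 1. Pick a value for each generated variable uniformly at random from its sample set.
      val sampled = template.generated.map { case (name, values) =>
        name -> values(rng.nextInt(values.size))
      }
      // 2. Use the formulas to compute the derived variables.
      val bindings = template.derived.foldLeft(sampled) { case (acc, (name, formula)) =>
        acc + (name -> formula(acc))
      }
      // 3. Keep the instantiation only if it satisfies every constraint.
      if (template.constraints.forall(c => c(bindings))) questions += Question(bindings)
    }
    questions.toList
  }

  def main(args: Array[String]): Unit = {
    // The "add 2 numbers" template from earlier, expressed against these stand-in types.
    val addTwoNumbers = Template(
      generated = Map("x" -> (1 to 9).map(_.toDouble), "y" -> (1 to 9).map(_.toDouble)),
      derived = Seq(("sum", (b: Map[String, Double]) => b("x") + b("y"))),
      constraints = Seq.empty
    )
    generate(addTwoNumbers, desired = 10, maxIterations = 1000).foreach { q =>
      println(s"${q.bindings("x").toInt} + ${q.bindings("y").toInt} = ?")
    }
  }
}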

It is very difficult to come up with a more efficient general approach. For some problems, algorithms do exist to generate solutions directly (see Euclid's formula for Pythagorean triples), but for others it is mathematically impossible (see Hilbert's tenth problem). In many cases, introducing heuristics may improve the random algorithm. For instance, it may be possible to identify a large chunk of the sample space that leads to solutions that are too large, non-integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala's powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like "sqrt(x^3 + 5)" into my internal data model (a pared-down sketch appears just after this list). While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map with a ParMap, and appending “.par” to the main program loop.
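
To give a flavor of the parser-combinator approach mentioned above, here is a pared-down, hypothetical expression parser built on Scala's standard parser combinators. The AST and grammar below are simplified stand-ins for the real internal data model, which handles far more than numbers, variables, binary operators, and one-argument function calls.

import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical, pared-down AST; the real internal data model is richer.
sealed trait Expr
case class Num(value: Double) extends Expr
case class Var(name: String) extends Expr
case class BinOp(op: String, left: Expr, right: Expr) extends Expr
case class Call(fn: String, arg: Expr) extends Expr

object ExprParser extends JavaTokenParsers {
  // For brevity, "^" is parsed at the same precedence as "*" and "/"; a real
  // grammar would give it higher precedence and right associativity.
  def expr: Parser[Expr] = term ~ rep(("+" | "-") ~ term) ^^ {
    case head ~ tail => tail.foldLeft(head) { case (acc, op ~ rhs) => BinOp(op, acc, rhs) }
  }
  def term: Parser[Expr] = factor ~ rep(("*" | "/" | "^") ~ factor) ^^ {
    case head ~ tail => tail.foldLeft(head) { case (acc, op ~ rhs) => BinOp(op, acc, rhs) }
  }
  def factor: Parser[Expr] =
    floatingPointNumber ^^ (n => Num(n.toDouble)) |
    ident ~ ("(" ~> expr <~ ")") ^^ { case fn ~ arg => Call(fn, arg) } |
    ident ^^ (Var(_)) |
    "(" ~> expr <~ ")"

  def parse(s: String): ParseResult[Expr] = parseAll(expr, s)
}

// ExprParser.parse("sqrt(x^3 + 5)") yields
// Call("sqrt", BinOp("+", BinOp("^", Var("x"), Num(3.0)), Num(5.0)))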

Here is a screenshot of the prototype in action.

*For examples of solution-oriented approaches in the literature, see https://act.org/research/researchers/reports/pdf/ACT_RR93-09.pdf (survey) and http://research.microsoft.com/en-us/um/people/sumitg/pubs/chi13.pdf (grade school math).

*For examples of template-based approaches in the literature, see http://www.eecs.berkeley.edu/~dsadigh/WESE12.pdf (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means “to measure” in the Xitsonga language.

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic (in this case, Engagement) and data from different sources within our products, and engineer tools to calculate this analytic. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, there is a need to analyze and validate the measure on a larger scale than the initial one used to develop it. This necessitates a feedback loop between the Data Analysis and Data Infrastructure components.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There are many ways of quantifying this: time spent on the system, number of logins per week, time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that will take the student interactions and recommendations into account and calculate the engagement analytic. Further, engagement is an example of an analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: The user’s intention is important, as it can make an analytic more or less useful. For example, a student might not take the next recommended action, not because they thought it was irrelevant, but because they had to move on to another task (say, another homework assignment with an upcoming deadline). In this situation, we would have to treat this as a data point that is not useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might also be more efficient to infer an analytic from a readily available dataset than to use a much richer but hard-to-obtain dataset that provides only a small boost in accuracy at a large cost in resources.

Computing Engagement

The overall view of the pipeline created for Engagement is shown in the figure below. The pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data, the student learning/task sessions are inferred, and from those sessions we then calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.
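
To make the dataflow a bit more concrete, here is a hypothetical Scala sketch of the two inference stages described above. The type names, the idea of splitting sessions on a fixed time gap, and the time-based engagement measure are illustrative assumptions only, not the actual definitions or code used in the Knewton pipeline (the recommendation data, reporting, and storage stages are omitted).

object EngagementPipeline {
  case class Interaction(studentId: String, timestampMs: Long)
  case class Session(studentId: String, startMs: Long, endMs: Long)
  case class Engagement(studentId: String, activeMs: Long)

  // Infer learning/task sessions by splitting each student's interactions wherever
  // the gap between consecutive events exceeds maxGapMs.
  def inferSessions(interactions: Seq[Interaction], maxGapMs: Long): Seq[Session] =
    interactions.groupBy(_.studentId).toSeq.flatMap { case (studentId, events) =>
      val times = events.map(_.timestampMs).sorted
      times.foldLeft(List.empty[Session]) {
        case (Nil, t) => List(Session(studentId, t, t))
        case (current :: rest, t) if t - current.endMs <= maxGapMs =>
          current.copy(endMs = t) :: rest
        case (sessions, t) => Session(studentId, t, t) :: sessions
      }
    }

  // One simple facet of engagement: total active time per student across their sessions.
  def computeEngagement(sessions: Seq[Session]): Seq[Engagement] =
    sessions.groupBy(_.studentId).toSeq.map { case (studentId, ss) =>
      Engagement(studentId, ss.map(s => s.endMs - s.startMs).sum)
    }
}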

After the initial pipeline is engineered, there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected. We need to know whether, given sample data, it produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.