Web-Based, Real-Time Data Monitoring

This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.


How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?

These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.

At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.

For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.


With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:

Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to login at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.

Responsiveness: How fast can we refresh the data?

Redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.

Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?

Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.

Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.


Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.

This resulted in three strategies for obtaining the data:

Auto-refresh: We query for updates every hour.

On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.

Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.

As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.

First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.

Secondly, speed. After all, drawing information from local cache is faster than having to gather it, and we were even able to cache significant portions of the web pages themselves.

Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. Thus, the nature of the internal back-end references meant that querying on-demand for say, information about one course or user, would require querying in sequence, rather than querying in parallel.


Given that most of the Knewton platform is written in Java, and non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward; admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users during parallelizing refreshes.

The webpage itself was run using an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and having run through a cycle of configurations, I likely would have given this choice more consideration — either by looking at different server technologies or different configuration methods. Especially compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLS both for webpage and resource files ended up taking notably more effort than expected with this configuration style. This challenge was compounded by the use of JSP (Java Server Page = Java-embedded HTML) files which facilitated UI and CSS better than the Java-generated HTML that I had in the first running dashboard version.

The interaction within the system can be represented in the diagram below:


While testing functionality, I ran the code on a local machine against a very small set of data (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything was running and that we were gathering the expected data. Consequently, in our first run against a bigger sized data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).

To improve performance, we took a few steps:

Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance, they definitely reduced the memory footprint.

Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.

Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.

These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.

While the server sent out data in HTML tables, data manipulations were left to JavaScript on the client side. I started with the TableFilter library (tablefilter.free.fr); however, once running the dashboard against more realistically sized datasets, we hit the ceiling on the ability of this library to sort and search through large tables. After being pointed in a new direction by a fellow Knerd, we switched to the DataTables javascript library which, in addition to having much better performance, also offered a larger variety of available plugins and extensions. While JavaScript will work for the near-term, and using the pre-existing libraries allowed us to get off the ground quickly, processing of huge datasets will ultimately need to be moved to the back end as the number of Knewton users grows.

All in all, I was able to get the dashboard up and running into production just a few minutes before my final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments and for several hundred thousand students across the country!

GraphLint: Creating a Domain Specific Language for Graph Validation

Kenny Yu is currently a sophomore at Harvard College where he is studying computer science. In the summer of 2011, he interned at Knewton as a software engineer. During his internship, he built a new source review and release process and created a graph validation engine based on configuration schemas. He interned at Knewton again in January 2012 and rebuilt the graph validator by creating a domain specific language. The following blog post is a summary of this project and the process he used to design the language and the new validation engine.

The Question

At the core of the Knewton Adaptive Learning Platform is the course structure that brings all the lessons, concepts, and content into one connected graph. This graph is so large that checking it by hand is infeasible. Thus, Knewton’s graph datastore needed a validation system to check the structure of its graphs.

The validator idea is simple: given a graph, how can we tell if it is “well-formed”? How can we create a set of rules to express what edge and node types are acceptable in a given graph? My mentors, Jordan Lewis and Trevor Smith, decided that I should build a domain specific language (DSL) for writing the ontology schemas to solve this problem.

Developing the Language: “Onto”

Before creating the DSL, I had to understand the use case of the language I was building. After speaking with developers and content creators at Knewton, I composed this list of considerations that would drive my planning process:

  1. Expressiveness of the language – Above all, the language must be capable of encoding complex relationships between nodes and edges.
  2. Simplicity and ease of writing the language – The language must be easy to write and so easy to understand that non-developers can think and write in it.
  3. Ease of implementing the language – It should be easy to take the language and parse it into a form that can be used for graph validation.
  4. Graph validation run-time – Once the ontology had been parsed from the DSL into usable form, how quickly can we validate a graph using the provided ontology?

The graphs I worked with are directed multigraphs, where edges and nodes may have associated data with them. In Knewton’s case, both nodes and edges have types (e.g. “Concept”, “Module” and “Contains”, “Represents”). Using these types, we want to assert relationships like:

For all nodes m of type Module, there exists exactly 1 node c of type Concept such that the directed edge (c,Contains,m) exists in the graph.

We also want to express logical combinations like the following:

For all nodes c of type Concept:

there exists exactly 1 node m of type Module such that the directed edge (c,Contains,m) OR (c,Represents,m) exists in the graph


there exists exactly 0 or 1 nodes m of type Module such that the the directed edge (c,Contains,m) exists in the graph.

I used statements like these as inspiration to create the DSL. The DSL consists of these statements:

  1. Universal quantifier statement: for all <variable> in <node type> <statement>
  2. Existential quantifier statement: there exists <number rule> <variable> in <nodetype> such that <statement>
  3. Edge rule assertion statement: (<variable>,<edge type>,<variable>)
  4. And statement: <statement> and <statement>
  5. Or statement: <statement> or <statement>
  6. Xor (Exclusive or) statement: <statement> xor <statement>
  7. Not statement: not <statement>

Thus, the above two statements, written in the DSL, look like this

and the second one:

Lacking creativity, I decided to give ontology file names the “.onto” extension, and thus the Onto language was born.

Language Implementation Choice

It was difficult to determine which language I would use to implement this project. Having recently taken a compilers class in OCaml and knowing the benefits of OCaml’s strongly typed system and functional programming language features (and my personal preference for coding in OCaml), I thought that OCaml seemed like the natural choice for this kind of project. However, after I considered the reusability of this project with other systems in the company, I realized that OCaml would not be the best choice. The company recently decided to move towards Java and possibly other JVM-based languages. I had researched Java-OCaml bindings and discovered that existing options couldn’t meet this project’s need. As a result, I chose to implement this project in Java to make it more compatible with other systems at Knewton.

Having only used ocamllex and ocamlyacc to create parser generators, I needed to research other parser generators for Java. My mentors suggested ANTLR. I found ANTLR to be a high quality project that was incredibly easy to use.

Validation Algorithm

Now that I could successfully parse ontology files into memory, how would I validate a graph? I decided to implement the algorithm in a functional way: traverse the abstract syntax tree (AST) generated by the parser and for each statement node, create a closure that, when provided an environment–a mapping from variable names to actual nodes in the graph–and called, would return “true” or “false” (and thus represent whether the graph passed validation on that statement). After traversing the entire AST, one closure would be produced and would represent the validate function. A list of errors would be generated if the graph failed validation.

For example, I did the following for the Universal quantifier statement:

for all <variable> in <node type> <statement>

  1. Generate the closure by traversing the AST in <statement>
  2. Iterate over all nodes with type <node type> in the provided graph, and for each node:
  1. extend the current environment by mapping: <variable> to current node
  2. call the closure on this new environment, and ensure the closure returns true

I followed a similar procedure for the rest of the node types in an ontology.

Unfortunately, closures are not directly accessible in Java. Instead, I had to generate wrapper classes that wrapped methods representing the closure. In addition, because of the lack of match statements in Java, I used the visitor pattern to traverse the AST.

Testing and Profiling

After unit testing the validation engine on small graphs and simple ontologies, I moved on to a real life error-riddled data set. The graph in the data set had 53,377 nodes and 53,960 edges. When I initially tried to validate this graph, the JVM exploded. Even with 2GB of memory, the JVM still did not have enough heap space to validate this graph in half an hour. After profiling, I discovered that over a hundred million Strings were being generated by error messages for failed validation. As a result, I made several changes to make the validator more memory efficient: I used the flyweight pattern when allocating space for environment variables, node and edge types, and I lazily generated the error messages, only creating them when necessary.

This solved the memory problem. However, the run-time of the validation algorithm was still horrible. Each nesting level of a for all and exists statement added a factor of V to the asymptotic run-time (where V is the number of vertices). For example, if an ontology had three nested for all statements, the run-time would be O(V^3). Fortunately, since edges could only connect two vertices (unlike in a hypergraph), I could optimize the exists statement to only check in the neighboring set of the current node. With this optimization, exists statements no longer multiplied the computational complexity. As a result of these memory and time optimizations, the run-time for the 50,000+ node data set went from 30+ minutes to less than 4 seconds. Changing the run-time from V^2 to V makes such a big difference!

Looking Back, and Next Steps

It’s a rare opportunity to go back and rebuild old projects from scratch. It’s an even rarer opportunity to get complete freedom (as an intern) to design a central component of the company’s system. At Knewton, I was fortunate to have both opportunities. Knewton tasked me with rebuilding the graph validation engine, and I had complete freedom to design the project however I wished. Even more remarkable–this project allowed me to to work at a scale (50,000+ nodes!) that I would rarely ever touch at school.

There are several things that in retrospect I would have done differently. The lack of functional language features in Java made it difficult to implement the validation algorithm. As a result, if I were to redo the project, I would probably implement it in Scala. Using Scala’s interoperability with Java and its functional language features (especially the ability to create closures and use case classes) would have made the implementation easier to write and more concise. Furthermore, there are many optimizations I could have implemented to make the validation engine faster: I could have used properties of second-order logic to transform quantifier and logic statements into equivalent statements with a faster run-time, and I could have parallelized the algorithm to make use of concurrency. But this is pretty good for working just two and a half weeks!