Kenny Yu is currently a sophomore at Harvard College where he is studying computer science. In the summer of 2011, he interned at Knewton as a software engineer. During his internship, he built a new source review and release process and created a graph validation engine based on configuration schemas. He interned at Knewton again in January 2012 and rebuilt the graph validator by creating a domain specific language. The following blog post is a summary of this project and the process he used to design the language and the new validation engine.
At the core of the Knewton Adaptive Learning Platform is the course structure that brings all the lessons, concepts, and content into one connected graph. This graph is so large that checking it by hand is infeasible. Thus, Knewton’s graph datastore needed a validation system to check the structure of its graphs.
The validator idea is simple: given a graph, how can we tell if it is “well-formed”? How can we create a set of rules to express what edge and node types are acceptable in a given graph? My mentors, Jordan Lewis and Trevor Smith, decided that I should build a domain specific language (DSL) for writing the ontology schemas to solve this problem.
Developing the Language: “Onto”
Before creating the DSL, I had to understand the use case of the language I was building. After speaking with developers and content creators at Knewton, I composed this list of considerations that would drive my planning process:
- Expressiveness of the language – Above all, the language must be capable of encoding complex relationships between nodes and edges.
- Simplicity and ease of writing the language – The language must be easy to write and so easy to understand that non-developers can think and write in it.
- Ease of implementing the language – It should be easy to take the language and parse it into a form that can be used for graph validation.
- Graph validation run-time – Once the ontology had been parsed from the DSL into usable form, how quickly can we validate a graph using the provided ontology?
The graphs I worked with are directed multigraphs, where edges and nodes may have associated data with them. In Knewton’s case, both nodes and edges have types (e.g. “Concept”, “Module” and “Contains”, “Represents”). Using these types, we want to assert relationships like:
For all nodes m of type Module, there exists exactly 1 node c of type Concept such that the directed edge (c,Contains,m) exists in the graph.
We also want to express logical combinations like the following:
For all nodes c of type Concept:
there exists exactly 1 node m of type Module such that the directed edge (c,Contains,m) OR (c,Represents,m) exists in the graph
there exists exactly 0 or 1 nodes m of type Module such that the the directed edge (c,Contains,m) exists in the graph.
I used statements like these as inspiration to create the DSL. The DSL consists of these statements:
- Universal quantifier statement: for all <variable> in <node type> <statement>
- Existential quantifier statement: there exists <number rule> <variable> in <nodetype> such that <statement>
- Edge rule assertion statement: (<variable>,<edge type>,<variable>)
- And statement: <statement> and <statement>
- Or statement: <statement> or <statement>
- Xor (Exclusive or) statement: <statement> xor <statement>
- Not statement: not <statement>
Thus, the above two statements, written in the DSL, look like this
and the second one:
Lacking creativity, I decided to give ontology file names the “.onto” extension, and thus the Onto language was born.
Language Implementation Choice
It was difficult to determine which language I would use to implement this project. Having recently taken a compilers class in OCaml and knowing the benefits of OCaml’s strongly typed system and functional programming language features (and my personal preference for coding in OCaml), I thought that OCaml seemed like the natural choice for this kind of project. However, after I considered the reusability of this project with other systems in the company, I realized that OCaml would not be the best choice. The company recently decided to move towards Java and possibly other JVM-based languages. I had researched Java-OCaml bindings and discovered that existing options couldn’t meet this project’s need. As a result, I chose to implement this project in Java to make it more compatible with other systems at Knewton.
Having only used ocamllex and ocamlyacc to create parser generators, I needed to research other parser generators for Java. My mentors suggested ANTLR. I found ANTLR to be a high quality project that was incredibly easy to use.
Now that I could successfully parse ontology files into memory, how would I validate a graph? I decided to implement the algorithm in a functional way: traverse the abstract syntax tree (AST) generated by the parser and for each statement node, create a closure that, when provided an environment–a mapping from variable names to actual nodes in the graph–and called, would return “true” or “false” (and thus represent whether the graph passed validation on that statement). After traversing the entire AST, one closure would be produced and would represent the validate function. A list of errors would be generated if the graph failed validation.
For example, I did the following for the Universal quantifier statement:
for all <variable> in <node type> <statement>
- Generate the closure by traversing the AST in <statement>
- Iterate over all nodes with type <node type> in the provided graph, and for each node:
- extend the current environment by mapping: <variable> to current node
- call the closure on this new environment, and ensure the closure returns true
I followed a similar procedure for the rest of the node types in an ontology.
Unfortunately, closures are not directly accessible in Java. Instead, I had to generate wrapper classes that wrapped methods representing the closure. In addition, because of the lack of match statements in Java, I used the visitor pattern to traverse the AST.
Testing and Profiling
After unit testing the validation engine on small graphs and simple ontologies, I moved on to a real life error-riddled data set. The graph in the data set had 53,377 nodes and 53,960 edges. When I initially tried to validate this graph, the JVM exploded. Even with 2GB of memory, the JVM still did not have enough heap space to validate this graph in half an hour. After profiling, I discovered that over a hundred million Strings were being generated by error messages for failed validation. As a result, I made several changes to make the validator more memory efficient: I used the flyweight pattern when allocating space for environment variables, node and edge types, and I lazily generated the error messages, only creating them when necessary.
This solved the memory problem. However, the run-time of the validation algorithm was still horrible. Each nesting level of a for all and exists statement added a factor of V to the asymptotic run-time (where V is the number of vertices). For example, if an ontology had three nested for all statements, the run-time would be O(V^3). Fortunately, since edges could only connect two vertices (unlike in a hypergraph), I could optimize the exists statement to only check in the neighboring set of the current node. With this optimization, exists statements no longer multiplied the computational complexity. As a result of these memory and time optimizations, the run-time for the 50,000+ node data set went from 30+ minutes to less than 4 seconds. Changing the run-time from V^2 to V makes such a big difference!
Looking Back, and Next Steps
It’s a rare opportunity to go back and rebuild old projects from scratch. It’s an even rarer opportunity to get complete freedom (as an intern) to design a central component of the company’s system. At Knewton, I was fortunate to have both opportunities. Knewton tasked me with rebuilding the graph validation engine, and I had complete freedom to design the project however I wished. Even more remarkable–this project allowed me to to work at a scale (50,000+ nodes!) that I would rarely ever touch at school.
There are several things that in retrospect I would have done differently. The lack of functional language features in Java made it difficult to implement the validation algorithm. As a result, if I were to redo the project, I would probably implement it in Scala. Using Scala’s interoperability with Java and its functional language features (especially the ability to create closures and use case classes) would have made the implementation easier to write and more concise. Furthermore, there are many optimizations I could have implemented to make the validation engine faster: I could have used properties of second-order logic to transform quantifier and logic statements into equivalent statements with a faster run-time, and I could have parallelized the algorithm to make use of concurrency. But this is pretty good for working just two and a half weeks!