The Knewton Blog

Our monthly newsletter features edtech and product updates, with a healthy dose of fun Knerd news.

NY Graph Meetup: Big Data & Dr. Jim Webber

Posted in Knerds on June 30, 2012

Photo courtesy NY Graph Meetup

Knewton Knerds are big fans of NY Graph Meetup, a group devoted to the discussion of graph structures, graph theory, graph databases, and related topics. Why graphs? And why now? According to the group, graph structures and methods are becoming ubiquitous: “As the size of datasets being generated, processed, and presented continues to grow, the relative strengths and weaknesses of different data structures, including graphs, have become more and more significant in everyday applications.” Organized by developer Scott Bullard, the group aims to host technical presentations and lectures, product demos, lightning talks, seminars, town hall-style open forums, and hands-on tutorials, in addition to less-structured monthly gatherings.

The most recent gathering featured a presentation from Dr. Jim Webber (author of REST in Practice and Chief Scientist at Neo4j) and was well attended by several Knewton Knerds. The morning after, software engineers Jordan Lewis and Urjit Bhatia shared their experience with us.

CY: Why do you think the NY Graph Meetup is so popular in the NY startup community?

Jordan: NY is establishing itself as the center for big data startups like Knewton, Foursquare, and Spotify. For those who aren’t familiar with the term, “big data” refers to the tremendous volume, velocity, and variety of data generated by today’s technology platforms, many of which collect data continuously and ubiquitously. A lot of the data involved with this new breed of startup is not simple. In the old days, it was stuff like payroll data: social security numbers, salaries, and so on, all very simple to model. Now that the relationships between data points are growing more complex, we need to explore new ways of grappling with them. At Knewton, for instance, we need to understand the proficiencies of each student in relation to the proficiencies of every other student in the network.
This is how it works: in isolation, each student’s response to each question is only a tiny scrap of information, but when propagated through the entire system and understood in context, its value is amplified tremendously. So yeah, I think the popularity of NY Graph Meetup reflects the fact that we’re at an inflection point as a community regarding graphs. Everyone is realizing how complex their data is and grappling with that complexity. It’s so intricate that it’s difficult to model in any conventional way.

CY: What were some of the technical specifics of the talk?

Urjit: With Neo4j, a popular up-and-coming NoSQL graph database, there are limits to how much data can be held. The other aspect of Neo4j is that it doesn’t yet support data sharding. This means we cannot have a setup in which multiple Neo4j servers divide the workload among themselves, though they are going to release a solution sometime next year. This is a major challenge in the big data paradigm. As the buzzword suggests, the data is very “big,” too big to be served and processed in its entirety by a single server. Sharding is a clever solution to this problem. In the case of graph data, for example, your graph can be chopped up and distributed across the underlying servers. This is all transparent to the consumers of the data but provides huge gains in performance.

Dr. Webber also talked about how the data model laid out on a whiteboard during brainstorming sessions translates more naturally into a graph-based data store than into the conventional relational data model, which takes us through multiple normalization and de-normalization cycles. Normalization is the technique that minimizes duplication of data by grouping it into separate tables and linking them.
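The sharding idea Urjit describes can be sketched with a toy hash partitioner. Everything here (node names, edge list, shard count) is invented for illustration, and real graph partitioners go further by trying to minimize cross-shard edges, which naive hashing ignores:

```python
import hashlib

def shard_for(node_id: str, num_shards: int) -> int:
    """Assign a node to a shard by hashing its id."""
    digest = hashlib.md5(node_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Toy graph: edges between (hypothetical) student and concept nodes.
edges = [("alice", "fractions"), ("bob", "fractions"), ("alice", "decimals")]

NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}
for src, dst in edges:
    # Store each edge on the shard that owns its source node.
    # Edges whose endpoints land on different shards become
    # cross-shard "cut" edges -- the hard part of graph sharding.
    shards[shard_for(src, NUM_SHARDS)].append((src, dst))

for shard_id, bucket in sorted(shards.items()):
    print(shard_id, bucket)
```

Clients would call `shard_for` to locate a node’s server, which is what makes the partitioning transparent to consumers of the data.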
In the real world, however, where heavy traffic from users and services demands super-fast response times, normalized data is so divided that it slows performance, which then prompts people to de-normalize (and the cycle distorts the original layout).

The other interesting tidbit was the introduction of Cypher, a language for querying the graphs in a Neo4j database. I think it will prove to be a great tool in the graph-data toolbelt. Jim also talked about overlaying a search engine (Apache Lucene, a search index) on top of the graph data. It makes searching for nodes in your graph very easy: you can say to it, “give me a node in the graph that has this or that property…”

CY: How will all this inform your work at Knewton going forward? Did you gain any insights which are immediately applicable?

Urjit: A lot of these technologies have discrete math algorithms built into them. Dijkstra’s shortest-path search, for example. So instead of every data scientist having to redo all the research that’s been done previously, one can leverage existing work in the field. Since Knewton works with very complex data models, many of the ideas discussed at this meetup are directly applicable to our work, and we can apply the learnings of other data scientists to make our product better and more robust. Cypher, the graph query language, is something that would be great to have at Knewton; it would make it easier to validate the data we have. We also learned about how people are dealing with things like super-nodes, sometimes called “Britney Spears nodes” after her huge fan following on Twitter. Such nodes have millions of connections while others have only a couple hundred.
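As a concrete reminder of the kind of prior work Urjit mentions, here is a minimal sketch of Dijkstra’s shortest-path algorithm over an adjacency-list graph (the nodes and weights are made up for illustration; graph databases ship tuned implementations of this and similar algorithms):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted graph.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]          # priority queue of (distance, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue              # stale queue entry; already improved
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

g = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

The point of having such algorithms built into the database is exactly what Urjit describes: application teams query for shortest paths instead of re-deriving and re-tuning the traversal themselves.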