Trivia: Ku Pima means To Measure in the Xitsonga Language
An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.
Building the Analytics Pipeline
On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.
My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic, in this case Engagement, data from different sources within our products, and engineer tools to calculate this analytic. This task itself encompases not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further there is a need to analyze and validate the measure on a larger scale than the initial one used to develop it. This necessitates the need for a feedback loop between the data analysts and data infrastructure components.
Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: Let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There can be many ways of quantifying this: time spent on the system; number of logins per week; time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that will take student interactions and the recommendations into account and calculate the engagement analytic. Further, this analytic is an example of analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.
There Can Be Multiple Reasons to Infer an Analytic
The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)
You have to categorize the users’ intentions: The user’s intention is important as it can make an analytic useful or less so. For example: there is a situation where the student did not intend to do the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). In this situation we would have to treat this as a data point that would not be useful in calculating engagement.
Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might be more efficient to infer an analytic from a readily available dataset than use a much richer, hard to obtain dataset that is less efficient and only provides a small boost in accuracy with an accompanying large use of resources.
The overall view of the pipeline created for Engagement is shown in the figure below. The pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data the student learning/task sessions are inferred. From this data we then calculate the Engagement analytic. The computed analytic is reported and the data used for its calculation stored for future use and analysis.
Building the Analytics Pipeline
After the initial pipeline is engineered there are still a number of tasks to complete. Validation is very important in determining if the analytic can be interpreted as expected. We need to know whether with sample data it produces similar results to the analytic development stage. This part of the process involves some modeling and statistics and needs analysis tools in order to detect errors and may lead to refining the analytic itself.
Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.
If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.