As part of my summer internship at Knewton, I helped design and implement an offline parameter estimation batch. This blog post discusses some of the design considerations and challenges associated with the implementation of offline parameter learning. But first, I’ll provide a little context around the model and an overview of the parameter estimation process.
Vocabulary and Context
- A student event is any interaction a student has with educational content that is recorded in Knewton’s data store (e.g., watching an instructional video or answering a quiz question).
- When an event consists of a student responding to an assessment question, it is referred to as an item interaction. In this case, the item refers to the actual question.
- Items are responsible for assessing one or more concepts (e.g., long division). The content for a particular book consists of a collection of concepts.
- A graph structure identifies the relationship between concepts (e.g., a concept might be a prerequisite for another concept). The graph structure also identifies which concept(s) a particular item assesses. Therefore, a single graph identifies the relationships for all items and concepts contained in the book used for a course.
- By understanding the characteristics of each item (e.g., its level of difficulty) as well as a student’s current level of proficiency in a particular concept, Knewton is able to provide a recommendation on what piece of educational content the student should interact with next in order to meet the learning objectives of the course. This explanation glosses over a number of other factors taken into consideration by Knewton’s recommendation engine, but suffice to say, the recommendation relies on an understanding of item characteristics and student proficiency.
- Item characteristics and student proficiency are modeled using Item Response Theory (IRT) as discussed in a previous blog post.
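As a concrete illustration of the IRT framing, the probability of a correct response under the common two-parameter logistic (2PL) model can be computed as below. This is a generic textbook formulation, not necessarily the exact IRT variant Knewton uses:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a student with proficiency theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

For a student at proficiency theta = 0, an item of average difficulty (b = 0) is a coin flip, while harder items (b > 0) are less likely to be answered correctly.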
Parameter Estimation Overview
Once the initial model components are identified based on experimentation and evaluation by data science teams, Knewton has to assess how the parameters for this model will be estimated over time to ensure robustness and accuracy as more data are collected. The team can choose to train parameters in an ad hoc mode, a real-time online mode, a recurring offline mode, or some hybrid.
Ad Hoc Estimation
In this mode, data scientists derive parameters outside of the operational functions of the product, based on their own analysis and experimentation, and configure them into the product in an ad hoc manner.
The key operational drawback of this approach is the lack of a systematic and real-time response to new student events. Parameter updates have to wait until the next time the data scientist(s) choose to gather and incorporate new student events into a parameter estimation run. In addition, without a standardized and repeatable process in place, different data scientists may approach the parameter updates in a slightly different way based on their familiarity with the problem. This can pose a challenge in maintaining quality consistency and reproducibility.
The benefit of this approach, however, is that it has few constraints imposed upon it, and so is incredibly flexible. Data scientists are able to tweak the training process as they see fit.
Offline Estimation
This mode refers to a recurring operational process by which parameters are regularly re-estimated as part of some (semi-)automated process. At Knewton, this is achieved through scheduled batch jobs that automate the gathering of training data and execute the parameter learning.
While this limits the flexibility afforded to the parameter estimation method (since the logic has to be implemented in an automated batch), it improves robustness by simplifying the process by which model parameters can be updated in response to changes in the underlying data. This process still does not provide real-time parameter updates, but, at the time of execution, it can incorporate all available data into its parameter estimates.
Online Estimation
This mode refers to the integration of the parameter update process into the online behavior of the product. At Knewton, this means that each incoming student event can trigger a parameter update. This is necessary because each additional student interaction provides insight into the student's current level of proficiency. An application of this approach outside of Knewton is clustering news articles into related topic areas: since new articles are constantly coming in over a newswire, the clusters must be updated to reflect the growing corpus of articles.
The key operational challenge of online estimation is that data generally have to be processed incrementally; multiple scans of the full data set are not always feasible. In addition, real-time response requirements limit the amount of computation that can take place for each incoming data point (particularly if the rate of incoming events is high). As an example in the context of news articles and topic models, Banerjee et al. employ algorithms which hold the number of clusters constant and only update the parameters of the cluster that appears most closely aligned with an incoming news article.
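The incremental scheme described above, fixing the number of clusters and updating only the best-matching cluster, can be sketched as follows. The dense vector representation of an article and the use of Euclidean distance with a running-mean update are simplifying assumptions, not the actual algorithm from the study:

```python
import math

def assign_and_update(centroids, counts, article):
    """Assign an incoming article vector to the closest centroid, then
    incrementally update only that centroid (running mean). The number
    of clusters never changes."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    k = min(range(len(centroids)), key=lambda i: dist(centroids[i], article))
    counts[k] += 1
    n = counts[k]
    # Incremental mean: only the winning cluster's parameters move.
    centroids[k] = [c + (x - c) / n for c, x in zip(centroids[k], article)]
    return k
```

Because each article touches only one cluster and the update is O(dimension), this keeps per-event cost low, at the price of never revisiting earlier assignments.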
While these techniques are effective in providing real-time response rates, the accuracy of model predictions may suffer over time if there are underlying changes to the model parameters that were held constant. In the case of news topic clusters, the study finds that, in general, “online algorithms give worse nMI (normalized mutual information, a measure of cluster quality) results than corresponding batch algorithms, which is expected since the online algorithms can only update the cluster statistics incrementally”. In Knewton’s case, new data might suggest that item parameters need to be tweaked from the default settings with which they were originally configured.
Hybrid Estimation
The authors of the news topic clustering study, like Knewton, propose the use of a hybrid model in which offline parameter estimation is used to regularly refresh model parameters that are otherwise updated incrementally online. This allows model parameters to react in real time to incoming data points, while systematically refreshing parameters that might otherwise become stale as a result of only making incremental online adjustments.
The hybrid approach does pose the challenge of having to identify an optimal frequency at which offline parameter estimation should be carried out to augment the performance of the online parameter updates. The frequency is likely to be dependent on how quickly errors accumulate due to online incremental updates and the tolerance range for such prediction errors.
Design Considerations for Offline Parameter Estimation
The remainder of this post discusses the offline batch process that was to be introduced in support of the hybrid parameter estimation approach. As such, the objectives of this batch job were to:
- Extract item interactions from the student events recorded by Knewton;
- Parse them into training data for a parameter learning algorithm for the IRT model; and
- Estimate new item parameters grouped by book. By grouping items by book, the estimation process can use relationships between concepts (e.g., that mastery of concept 1 is a prerequisite for mastery of concept 2) to inform parameter selection for items assessing each of those concepts.
There were four major design considerations which went into achieving these objectives: robustness, flexibility, scale, and transparency.
Robustness
One of the main considerations was to design an offline parameter estimation job which would be able to accommodate future changes in the estimation process as well as the underlying data. As a result, there was heavy use of modularized design to decouple various components of the batch process:
- Rather than directly retrieve student events from the database, the batch relies on an intermediary batch job that retrieves student events from the database and parses them into an easily consumable input format.
- Finally, a separate library is defined that accepts a training data set as input and uses it to perform the actual parameter estimation.
This allows the batch process to gather a training dataset and pass it to a machine learning algorithm for parameter estimation while being agnostic of the underlying database structure, graph structure, and parameter estimation algorithm details.
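The decoupling described above can be captured by a narrow interface: the batch only needs something that maps a training set to parameter estimates. A hypothetical sketch (the class names and the toy estimator are illustrative, not Knewton's actual library):

```python
from abc import ABC, abstractmethod

class ParameterEstimator(ABC):
    """Hypothetical interface: the batch hands a training set in and gets
    item parameters back, with no knowledge of the algorithm inside."""
    @abstractmethod
    def estimate(self, training_data):
        """training_data: iterable of (student_id, item_id, correct)."""

class MeanDifficultyEstimator(ParameterEstimator):
    """Toy implementation: difficulty = fraction of incorrect responses."""
    def estimate(self, training_data):
        totals, wrong = {}, {}
        for _, item_id, correct in training_data:
            totals[item_id] = totals.get(item_id, 0) + 1
            if not correct:
                wrong[item_id] = wrong.get(item_id, 0) + 1
        return {i: wrong.get(i, 0) / totals[i] for i in totals}
```

Swapping in a different estimation algorithm then requires no change to the data-gathering side of the batch.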
Flexibility
Though the batch job is itself written in Java, the library responsible for performing parameter estimation is written in Python in order to provide the flexibility of leveraging mathematical packages such as pandas and numpy. To facilitate this, the batch job accommodates Python code execution as a post-Reduce calculation step using a custom MapReduce framework (which uses Thrift to communicate between the Java and Python components).
The library used to perform the actual parameter estimation is the same library which the data science teams utilize for their experimental analysis. This eliminates the need to generate duplicate code going from exploratory analysis to offline parameter estimation — particularly as the parameter estimation techniques are tweaked. In addition, the batch is able to support input parameters corresponding to the full range of optimization settings used by the data science teams to tweak the parameter learning process.
Scale
The biggest challenge was dealing with the amount of data involved in the parameter estimation step. Knewton records over 500 million student interactions each month. Using a MapReduce design allowed the extraction and parsing of item interactions from student events to be achieved by distributing the student event records across multiple Hadoop partitions. Using a 15-node cluster of R3 AWS instances, the batch was able to parse a month of student events within an hour. (R3 instances are optimized for in-memory analytics, providing 61 GiB of memory and 160 GB of SSD storage.)
Even after the batch had extracted the item interactions from the student event records, this still left between one and five million training data instances per book for the estimation step. Such large data sets place a heavy load on memory during the iterative parameter estimation process. While this issue was addressed by using AWS instances optimized for memory, another option would be to sub-sample down to manageable levels as the amount of data increases over time. The other alternative would be to modify the learning algorithm to avoid storing all 5 million training data instances in memory during the update calculations. However, devising a special algorithm for this may limit the long-term flexibility of the parameter estimation process.
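The sub-sampling mitigation mentioned above can be as simple as a uniform random sample per book once the training set exceeds a threshold. A minimal sketch (the cap value is illustrative; a fixed seed keeps successive batch runs reproducible):

```python
import random

def subsample(training_data, max_instances, seed=0):
    """Uniformly sub-sample a training set down to a manageable size.
    Data at or under the cap is returned unchanged."""
    if len(training_data) <= max_instances:
        return list(training_data)
    rng = random.Random(seed)  # deterministic across batch runs
    return rng.sample(training_data, max_instances)
```

The trade-off is statistical: a uniform sample keeps the estimation unbiased on average, but discards information, so the cap should stay well above what the learning algorithm needs for stable estimates.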
Transparency
The batch job uses Counters and logging to provide transparency into the parameter estimation steps. The logging is facilitated by the parameter estimation library which enables the user to see parameter estimation statistics (like log-likelihood values) at the end of each iteration in the algorithm. This was designed so that one can look inside the parameter estimation process and evaluate discrepancies or optimization opportunities.
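The per-iteration logging pattern looks roughly like the following toy example, which fits a single item's difficulty under a one-parameter logistic model with proficiency fixed at zero. This is purely illustrative (not Knewton's actual algorithm); the point is the log line emitted each iteration:

```python
import logging
import math

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("param_estimation")

def fit_difficulty(responses, iters=50, lr=0.5):
    """Toy 1PL fit of one item's difficulty b via gradient ascent on the
    log-likelihood (student proficiency fixed at theta = 0), logging the
    log-likelihood at the end of each iteration."""
    b = 0.0
    for it in range(iters):
        p = 1.0 / (1.0 + math.exp(b))  # P(correct | theta=0, b)
        ll = sum(math.log(p) if r else math.log(1 - p) for r in responses)
        grad = sum(-(1 - p) if r else p for r in responses)  # d ll / d b
        b += lr * grad
        log.info("iteration %d: log-likelihood %.4f", it, ll)
    return b
```

Watching the log-likelihood trace makes it easy to spot non-convergence or a step size that is too aggressive.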
Finally, a systematic QA step is an important feature for any automated offline parameter estimation. Knewton exercises a number of manual validation steps ranging from quantitative analysis (e.g., running historical events through the model to evaluate predictive power) to qualitative analysis (e.g., reviewing recommendations produced with the parameter estimates). In the future, some of these QA steps could be incorporated into the batch process to provide fully automated offline parameter estimation.
References
1. A. Banerjee and S. Basu, “Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning,” Society for Industrial and Applied Mathematics, pp. 431-436, 2007.
2. S. Zhong, “Efficient streaming text clustering,” Neural Networks, vol. 18, no. 5-6, pp. 790-798, 2005.