To provide students with sensible, targeted recommendations, the Knewton platform uses a variety of statistical models to track and interpret student progress. These models regularly undergo improvements designed to progressively increase our understanding of student behavior. One tool for assessing the accuracy of these improvements is a model dashboard — a visual display of model fitness metrics that allows a reviewer to interpret a model’s ability to capture and explain diverse student behaviors within the Knewton platform.
Model dashboards are primarily useful for analyzing performance on student behavior observed after the initial model release. Models are rigorously validated during the development phase, wherein data scientists clearly state assumptions made by the model and test how well it explains held-out data collected from student interactions. However, as Knewton launches new adaptive integrations, the population of students, and therefore student behaviors, diversifies and drives the need for further monitoring of model performance. A dashboard that displays quantitative metrics on the model’s function can help pinpoint when and how the model deviates from expectations, providing empirical motivation and direction for adjustments. By monitoring the models during the cycle of modification and validation, data scientists discover ways to make improvements responsibly and scientifically.
Since proficiency estimation is a fundamental pillar of Knewton recommendations, it was the first model that I instrumented with a dashboard. Via the proficiency model, Knewton computes a quantitative measure of a student’s proficiencies on concepts for every recommendation, which can then be used to predict the probability of a student getting an item (e.g., a question) correct. The system then fuses this proficiency data with several other types of inferences to develop a comprehensive understanding of a student’s academic abilities and thereby make personalized recommendations.
One of the most fundamental requirements for the proficiency model is the ability to successfully predict student responses. There are several ways to measure this accuracy. Two examples of relevant questions are, “How far off were the predictions from the observations?” and “How well do our predictions discriminate correct from incorrect responses?” The metrics I chose for the dashboard’s initial release attempt to address a few of these different nuances.
For each item that Knewton recommends, we compute a response probability (RP) representing our prediction of whether the student will respond correctly to the item given her estimated proficiency in each of its related concepts. For example, a response probability of 0.7 would predict a 70% chance that a student would answer the question correctly. To explore prediction accuracy from a variety of perspectives, I implemented three separate accuracy metrics for the first iteration of the proficiency dashboard.
For the first metric, I used a rough measure of how close the predictions were to the observed responses by computing a mean squared error (MSE) between the correctness values and the prediction probabilities — for binary outcomes, this quantity is also known as the Brier score. While a low MSE is a strong indicator of accurate predictions, a large MSE is not as easily interpreted, so another metric is needed to characterize that case.
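This first metric can be sketched in a few lines of NumPy. The function name and array-based interface here are illustrative assumptions, not the production implementation:

```python
import numpy as np

def prediction_mse(rps, outcomes):
    """Mean squared error between response probabilities (RPs)
    and observed 0/1 correctness values."""
    rps = np.asarray(rps, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((rps - outcomes) ** 2)

# Three predictions compared against observed correctness;
# a value near 0 indicates well-calibrated, confident predictions.
mse = prediction_mse([0.9, 0.2, 0.7], [1, 0, 1])
```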
It may be that although the RPs are not near 0 or 1, by defining a threshold of, say, 0.5, the RPs do a reasonable job of separating correct responses from incorrect responses, in that most RPs greater than 0.5 pair with a correct response, and vice versa. It would not be very informative, though, to choose a single threshold, since different choices of threshold often trade off correct categorizations against incorrect ones. To visualize this tradeoff easily, I computed a receiver operating characteristic (ROC) curve for the second metric. A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for a sliding threshold from 0 to 1. More specifically, using the following data, known as a confusion matrix:
|                | Incorrect response | Correct response |
|----------------|--------------------|------------------|
| RP > threshold | FP                 | TP               |
| RP < threshold | TN                 | FN               |
the TPR (TP/(TP+FN)) is plotted against the FPR (FP/(FP+TN)). As the probability threshold decreases from 1, a highly predictive model will increase the TPR more rapidly than the FPR, while a completely random prediction would increase them both roughly equally. So, the ROC curve provides an easy interpretation at a glance — a strong deviation from a 45-degree line shows that the RPs are good at separating correct from incorrect responses.
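The computation of these TPR/FPR pairs for a sliding threshold can be sketched as follows, assuming RPs and observed correctness are available as arrays (the function name and interface are illustrative):

```python
import numpy as np

def roc_points(rps, outcomes, thresholds=None):
    """(FPR, TPR) pairs for a sliding threshold, per the confusion matrix:
    a response is predicted correct when its RP exceeds the threshold."""
    rps = np.asarray(rps, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    points = []
    for t in thresholds:
        predicted_correct = rps > t
        tp = np.sum(predicted_correct & outcomes)
        fp = np.sum(predicted_correct & ~outcomes)
        fn = np.sum(~predicted_correct & outcomes)
        tn = np.sum(~predicted_correct & ~outcomes)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points
```

Plotting these points traces the ROC curve; a curve hugging the top-left corner indicates strong discrimination, while the 45-degree diagonal indicates chance-level predictions.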
For the last metric, I computed the empirical distribution of probabilities for the correct and incorrect responses, displayed as a histogram of RPs. Data scientists examine the asymmetry of RPs over correct versus incorrect responses to validate prior assumptions made about student proficiencies in the model. In addition, outliers in this distribution (such as a peak around high RPs for incorrect responses) may not be apparent in the previous two metrics, yet may reflect subsets of students whose behaviors do not adhere to model assumptions and ought to be investigated.
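The underlying counts for this histogram can be sketched by binning RPs separately by observed correctness; bin count and interface are illustrative assumptions:

```python
import numpy as np

def rp_histograms(rps, outcomes, bins=10):
    """Empirical RP distributions split by observed correctness.
    Returns per-bin counts for correct and incorrect responses."""
    rps = np.asarray(rps, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    edges = np.linspace(0.0, 1.0, bins + 1)
    correct_counts, _ = np.histogram(rps[outcomes], bins=edges)
    incorrect_counts, _ = np.histogram(rps[~outcomes], bins=edges)
    return correct_counts, incorrect_counts
```

For a well-behaved model, the correct-response histogram should skew toward high RPs and the incorrect-response histogram toward low RPs; a peak of high RPs among incorrect responses is the kind of outlier worth investigating.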
These metrics provide a graded set of interpretations of the model’s accuracy. In order to implement the dashboard, data required by each metric must be transformed over a series of phases from the live service to their final presentable form. The proficiency model runs as a component of the Recommendation Service, which runs on an AWS EC2 instance. The model running within the service takes in student responses and updates student proficiencies online. For each batch of responses, metadata are logged in a format containing RPs and item correctness. These logs are swept on a periodic basis to an S3 bucket, providing a persistent, canonical record of model state at the time that we actually served a student.
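As an illustration, one such logged metadata record might look like the following. The field names and JSON format are assumptions for this sketch, not Knewton’s actual log schema:

```python
import json

# Hypothetical shape of one logged record capturing model state
# at serving time; all field names are illustrative.
record = {
    "student_id": "s-123",         # anonymized student identifier
    "item_id": "q-456",            # the recommended item
    "response_probability": 0.7,   # RP computed before the response arrived
    "correct": True,               # observed correctness of the response
    "timestamp": "2015-06-01T12:00:00Z",
}
log_line = json.dumps(record)
```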
A workflow executes periodically to refresh the metrics on display. Its steps include ETL scripts for loading the logs into an Amazon Redshift cluster, parsing the logs, transforming them into the metrics, and publishing the results to the dashboard. For displaying these metrics, Knewton currently uses Looker, which provides sophisticated graphical visualization of the metrics stored in the database.
The dashboard currently implements the three metrics outlined above, yet there are many other important validation criteria for the proficiency model that will benefit from similar automation and reporting. Iterating on and augmenting the metrics will provide a fresher and more holistic snapshot of model behavior, streamlining how Knewton data scientists track model performance day-to-day. The dashboard that I delivered provides our data scientists with actionable feedback to ensure that our models serve diverse student needs with a powerful adaptive education experience.