At Knewton we use various mathematical models to understand how students learn. When building such models we want to make sure they generalize, that is, perform well for a large population of students. Cross-validation, a standard technique in machine learning, is a way to assess the predictive performance of a mathematical model. At Knewton, we use cross-validation extensively to test our models.
Cross-validation is based on the fact that we don’t have access to unlimited data. If we had all the possible data on student learning patterns, the solution would be straightforward. We would test all our models with the data and pick the one with the lowest error rate. In reality, we only have a finite set of student data to work with. Given a limited amount of data, how do we decide which model performs the best?
One approach is to use all of the available data to test our model. A major problem with this approach is overfitting, which is demonstrated in Figure 1.
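A quick way to see overfitting in action is to fit polynomials of different degrees to the same noisy data. The sketch below uses synthetic data (purely for illustration, not actual student data): a higher-degree polynomial always achieves training error at least as low as a simpler one, but that says nothing about how well it generalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from an underlying quadratic trend (synthetic data).
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.2, size=x.size)

def train_error(degree):
    """Mean squared error of a least-squares polynomial fit,
    measured on the same points it was fitted to."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# The degree-6 fit chases the noise, so its training error is lower --
# exactly the low-error-but-poor-generalization trap described above.
print(train_error(1))  # underfit: straight line
print(train_error(6))  # overfit: follows the noise
```

Testing on the training data alone would favor the overfit model, which is why a separate test set is needed.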
Figure 1: Left: the model (blue) underfits the data (orange); the model is too simple an explanation of the data and would fit better with more parameters. Middle: the model fits the data just right, capturing the overall pattern well. Right: the model overfits the data, fitting the noise in the dataset. (Source)
If our model overfits the data, its error rate on that data will be low, but it may perform poorly when new data is added, because the fit captures noise rather than the underlying pattern. This is why models that overfit do not generalize well and should be avoided.
This is where cross-validation comes into play. In this approach, rather than fitting the model to the full dataset, we split it into training and test sets. This is also referred to as holdout cross-validation, as we are leaving a portion of the data out for testing. The model is fitted using only the training portion of the dataset. Then we assess the predictive performance of the model on the left-out data, which is the test set.
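A minimal sketch of a holdout split, using a toy dataset of (features, label) pairs (in the post's setting each item would be a student response record):

```python
import random

# Hypothetical toy dataset; stands in for student response records.
data = [(i, i % 2) for i in range(100)]

# Shuffle so the held-out portion is a random sample, not the tail.
random.seed(42)
random.shuffle(data)

# Hold out 20% of the data for testing; fit only on the remaining 80%.
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

print(len(train_set), len(test_set))  # 80 20
```

The model never sees `test_set` during fitting, so the error measured on it is an estimate of performance on genuinely new data.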
As an example, one model we use to assess student learning is Item Response Theory (IRT). We want to cross-validate our IRT model for a set of student responses to test the performance of our model. To do this, we can split the student response data into training and test sets, fit the model to the training data, and validate it on the test data. If the fitted model predicts the student responses in the test set accurately we can accept this IRT model.
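To make the IRT step concrete, here is a sketch of scoring a fitted model on held-out responses, using the simple one-parameter (Rasch) form of IRT. The ability and difficulty values are hypothetical stand-ins for parameters a real fitting procedure would estimate from the training set:

```python
import math

def p_correct(ability, difficulty):
    """1PL (Rasch) IRT: probability that a student with the given
    ability answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical fitted ability for one student, plus three held-out
# test items as (difficulty, observed response) pairs.
ability = 0.8
test_items = [(-0.5, 1), (0.3, 1), (1.9, 0)]

# Predict "correct" when p >= 0.5 and compare against observations.
hits = sum((p_correct(ability, d) >= 0.5) == bool(r) for d, r in test_items)
accuracy = hits / len(test_items)
print(accuracy)  # 1.0
```

If accuracy like this on the test set is high, the fitted model explains responses it was never trained on, which is the acceptance criterion described above.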
When measuring how students learn, we assume they learn over time. Therefore, it is useful to understand how students behave as time progresses. A shortcoming of the holdout cross-validation technique is that it makes comparisons between random bits of past student data so it can’t make predictions about how students will behave in the future. It would be very useful if we were able to make predictions about students’ future behavior given their past learning patterns.
Online cross-validation is a version of cross-validation that can validate over time series data. Going back to our student response data example, online cross-validation uses a student’s past data to predict how that student will behave in the future. The dataset for online cross-validation is a time-ordered set of responses the student gave in the past. We take the first k responses of a student and use them as the training set, then we try to predict that student’s (k+1)st, (k+2)nd, …, (k+n)th responses. If our prediction accuracy is high, we can say that our model is a good fit for our dataset.
Let’s look at how online cross-validation works in more detail. Suppose a student answers a series of questions over time. Some of these responses are correct (green) and some are incorrect (red). Online cross-validation will start by training on the student’s first response only (k=1), then use this to predict whether the student is going to get the next item (k+1 = 2) correct or incorrect.
Figure 2: The first iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first response (k=1) and then used to predict the second, (k+1)st item (k+1 = 2). If our prediction matches the student response, our model accuracy increases. 0/1 refers to incorrect/correct.
In the next iteration of online cross-validation, we can use the first two responses (k=2) as our training set, fit the model using these two data points, and predict the third response (k+1=3).
Figure 3: The second iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first two responses (k=2) and then used to predict the third, (k+1)st item (k+1 = 3). 0/1 refers to incorrect/correct.
Online cross-validation continues until we run through all the iterations by increasing the training set one student response at a time. We expect to make better predictions as we add more data to our training set.
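The iterations above form an expanding window over the time-ordered responses. The sketch below runs the full loop; to stay self-contained it swaps in a trivial majority-vote predictor as a stand-in for a real model such as IRT:

```python
# Time-ordered responses for one student: 1 = correct, 0 = incorrect.
responses = [0, 1, 1, 0, 1, 1, 1]

def predict(history):
    """Stand-in model (NOT IRT): predict 'correct' if the student has
    answered at least half of the past items correctly."""
    return 1 if sum(history) * 2 >= len(history) else 0

hits = 0
trials = 0
# Expanding window: train on the first k responses, predict response k+1.
for k in range(1, len(responses)):
    if predict(responses[:k]) == responses[k]:
        hits += 1
    trials += 1

print(hits, trials)  # 4 hits out of 6 predictions
```

With a real model, the `predict` call would refit on `responses[:k]` at each step, and we would expect accuracy to improve as the training window grows.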
With online cross-validation, we are not limited to predicting only the next response. We can predict a student’s next 2, 3, …, n responses. This makes online cross-validation a very useful technique when we want to make predictions far into the future.
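Extending the previous sketch to the n-step case: fit on the first k responses, then predict the next n responses from that fixed training window (again with the hypothetical majority-vote stand-in for a real model):

```python
# Time-ordered responses for one student: 1 = correct, 0 = incorrect.
responses = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]

def predict(history):
    # Stand-in model (NOT IRT): majority vote over past responses.
    return 1 if sum(history) * 2 >= len(history) else 0

# Train on the first k responses, then predict the next n at once.
k, n = 5, 3
history = responses[:k]
predictions = [predict(history) for _ in range(n)]
actual = responses[k:k + n]
print(predictions, actual)  # [1, 1, 1] vs [1, 1, 0]
```

Multi-step accuracy typically degrades as n grows, since none of the intervening responses are available to update the model.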
Both holdout cross-validation and online cross-validation are very useful methods for assessing the performance of models. The holdout method is useful for assessing performance on a static dataset, whereas online cross-validation is helpful when we want to test a model on time series data.