About irenechen

Irene is a senior at Harvard College where she is studying Applied Math with a focus field in Computer Science and Economics.

Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the added help may contribute more headache than help as even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos of either kangaroos and kittens, but you do not know which photo depicts what. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining if a photo is a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question \beta j with difficulty j and student i with ability \theta i, the probability that question j is answered correctly, xj = 1, is

For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include labels L, vector of abilities theta, and vector of difficulties beta. To make sure we’re all on the same page, the labels L represents all of the given labels by our labelers. Not all labelers need label every photo. Each of our abilities \theta can range from negative infinity to positive infinity. The greater the ability, the more skilled the labeler. Our difficulties range from zero to infinity where the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.

Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

How does this all have anything to do with if the photo is a kangaroo or a kitten? For specific photo j, we can derive how likely the photo depicts either adorable animal. That is, the posterior probability of the correct label zj for photo j denotes the probability the photo depicts an animal.

Because we know that the photo contains either of the two animals, we can designate kangaroo as 0 and kitten as 1. Our posterior probability then designates from 0 to 1 how likely the photo is to contain either animal. If we assume that the correct label zj is independent of both abilities theta and difficulties beta, the probability simplifies dramatically.

The posterior probability now consists of two components: a prior belief and an IRT-based probability. Our first term p(zj) captures our prior knowledge about how many of the photos contain each. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use that parameter to include our prior belief in the model. The second probability uses our 1PL IRT probability to denote the probability the labeler gave a label (aka answered a test question) conditioned on the correct answer, the labeler ability, and the difficulty of the question.


Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to find a way to make our posterior probabilities as large as possible—denoting how certain we are overall of our guesses of the correct label—by adjusting our parameters and . More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta) with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, we can find our joint log-likelihood function:

Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach outscores the majority approach at nearly 5% initially before converging later. Additionally, as we increase the number of voters, the accuracy increases. Success!

While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and J. Movellan. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.