When I first arrived for my internship at Knewton, I had a bit of free time to get settled in and decided I wanted to do something fun (but still related to data science). So I made a map of the world out of tweets.
I began by writing a script in Python to extract geotagged tweets using Twitter’s API. I ignored the text contained in the tweets, and only stored the longitude and latitude of the tweets as they streamed in. Within an hour or so, I had collected around 150,000 locations.
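What follows is a minimal sketch of the extraction step, not the original script. It assumes each tweet arrives as a JSON object in the format of Twitter's v1.1 API, where a geotagged tweet carries a GeoJSON-style `coordinates` field ordered `[longitude, latitude]`. The function name `extract_location` and the sample payloads are made up for illustration.

```python
import json

def extract_location(raw_tweet):
    """Return (longitude, latitude) for a geotagged tweet, else None.

    Assumes the v1.1-style payload, where geotagged tweets carry
    {"coordinates": {"type": "Point", "coordinates": [lon, lat]}}.
    """
    try:
        tweet = json.loads(raw_tweet)
    except ValueError:
        return None  # skip blank keep-alive lines and malformed payloads
    if not isinstance(tweet, dict):
        return None
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None  # tweet has no geotag
    lon, lat = point["coordinates"]
    return float(lon), float(lat)

# Hypothetical sample lines standing in for the live stream.
sample_stream = [
    '{"text": "hello", "coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]}}',
    '{"text": "no geotag", "coordinates": null}',
    '',
]
locations = [loc for loc in map(extract_location, sample_stream) if loc is not None]
print(locations)
```

Only the coordinate pair is kept, so the tweet text never needs to be stored.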
Plotting the tweets on a scatter plot in MATLAB yielded a nice visualization. We can already see rough outlines of populated areas:
This prompted me to investigate further. It was clear that these points represented a rough map of the world, but there were a few things that needed to be improved:
- Above a certain threshold, it was hard to see the relative density of tweets. For example, it was impossible to differentiate between cities and rural areas in the eastern half of the US because it was completely blue.
- Multiple tweets coming from the same location, or being quantized to the same location, only showed up as one point.
- The map, by itself, didn’t make any probabilistic inferences. We’d mapped out our data, but we hadn’t derived anything from it.
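The second limitation is easy to reproduce. Here is a small sketch, using made-up coordinates and a hypothetical `count_by_cell` helper, that counts how many tweets share each quantized location. A scatter plot would draw each of these cells once, no matter how large its count.

```python
from collections import Counter

def count_by_cell(locations, decimals=1):
    """Count tweets per (lon, lat) cell after rounding to `decimals` places."""
    return Counter((round(lon, decimals), round(lat, decimals))
                   for lon, lat in locations)

# Made-up points: three near New York, one near London.
points = [(-73.99, 40.73), (-74.01, 40.71), (-74.00, 40.70), (-0.13, 51.51)]
cells = count_by_cell(points)
print(len(points), "tweets fall into", len(cells), "distinct cells")
```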
I wanted to approximate the probability density function of tweet locations, that is, the relative likelihood that a tweet comes from a given location. To do that, I used a technique called kernel density estimation, which infers a probability distribution from data by smoothing out the individual data points.
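To make the idea concrete, here is a minimal sketch of a two-dimensional kernel density estimate in plain Python. The post does not say which kernel or bandwidth was used, so the Gaussian kernel, the fixed `bandwidth` (in degrees), the function name `gaussian_kde_2d`, and the sample points are all assumptions for illustration. It also treats longitude and latitude as a flat plane, ignoring the curvature of the Earth.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Build a density estimate from (lon, lat) points.

    Each point contributes an isotropic 2-D Gaussian bump of width
    `bandwidth` (in degrees); the estimate is the average of the bumps.
    """
    n = len(points)
    # Normalizing constant so the estimate integrates to 1 over the plane.
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(lon, lat):
        total = 0.0
        for p_lon, p_lat in points:
            sq_dist = (lon - p_lon) ** 2 + (lat - p_lat) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Made-up sample: a tight cluster plus one isolated point.
sample = [(-74.0, 40.7), (-73.9, 40.8), (-74.1, 40.6), (2.35, 48.85)]
density = gaussian_kde_2d(sample, bandwidth=1.0)
print(density(-74.0, 40.7) > density(2.35, 48.85))
```

Evaluating `density` over a grid of longitudes and latitudes and coloring each cell by its value gives a heat map. A larger bandwidth smooths more aggressively, while a smaller one preserves detail at the cost of a noisier picture.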
Kernel density estimation got me this:
Areas of high tweet density are red, and areas of even higher tweet density are yellow.
This visualization turned out pretty much as I expected. Places that are generally uninhabited, such as the Amazon Rainforest, don’t have as many tweets. The Western world, especially the east coast of the US, the UK, and the Netherlands, has the highest density of tweets. Note that the UK and the Netherlands have the highest tweet density of the European nations, probably because a majority of their citizens speak English.
China is noticeably absent from this map, which makes sense; Twitter is blocked by the Chinese government.
There are a few interesting things to note:
- Southeast Asia was surprisingly active in terms of tweets.
- Australia was surprisingly dormant. Perhaps it was because I sampled the tweets at around 11 AM EST, which meant that it was pretty late at night there. But that didn’t stop Southeast Asia and Japan from tweeting.
- Notice that the western half of the US (not including the West Coast) had a sparser tweet density than the eastern half of the US. There seems to be a dividing line at around 100°W (a longitude of about -100°).
This exercise shows why it is an awesome time to be a data scientist. So much data is readily available online for us to analyze, and we now have the tools to analyze it efficiently. In just a few hours, I was able to go from an idea to a visualization. This summer, I look forward to using these same tools to rapidly iterate through mathematical models to make inferences from student data.