Saving Money with Beautiful Soup and Hashing
Fangfei Shen | October 11, 2018
Back in the summer of 2017, Knewton’s alta images had an expensive problem. Whether the images were of parabolas, molecules, or supply and demand curves, they were all missing two important things: alt text and long descriptions.
Why do our images need alt text and long descriptions?
As part of being ADA Compliant, alta’s images need alt text and long descriptions to make the images accessible to screen readers. This is important so that our blind or visually impaired students can use assistive technology like screen readers to interact with our visual content.
An accessible image needs alternative text (alt text) and possibly also a long description, which the screen reader can read out to the user. Alternative text is typically brief (we limit ours to 255 characters) and should always be included on accessible images. Long descriptions are used to describe more complex images, like a detailed diagram.
Alt text and long descriptions are added to images via HTML attributes. Here’s an example image:
Let’s say that the HTML tag for this image is:
<img src="cat.jpg" alt="Fluffy gray cat" longdesc="cat.html"/>
With the alt and longdesc attributes in the image tag, screen readers can read out “Fluffy gray cat” and the contents of cat.html to the user.
The idea is that a visually impaired student can get all the information they need about an image via its alt text and long description. Take this image for example:
This image’s alt text is very specific:
This figure shows two curves. The first curve is marked in blue and passes through the points (negative 1, 2), (0, 1), and (1, 1 over 2). The second curve is marked in red and passes through the points (negative 1, 3), (0, 1), and (1, 1 over 3).
(Note: Due to limitations there are no alt text and long descriptions for images in this blog post, but we’ve included image captions as an alternative for screen readers.)
Unsurprisingly, writing alt text and long descriptions for thousands of images gets expensive fast. If there were only a way for us to not start from scratch…
Scrape OpenStax, save money
A good chunk of Knewton’s content is curated from the open source OpenStax textbooks.
Greg, our Senior Manager of Content, was staring at OpenStax textbooks online — as senior managers of content are wont to do — when it hit him: OpenStax includes alt text with their images. We could scrape out alt text from OpenStax and match them with the images used in our courses! To Greg, this sounded like an ideal Hack Day project.
Knewton holds “Hack Days” a few times a year, in which Knerds (Knewton employees) get to work on whatever projects we want. For the August 2017 Hack Day, Greg and I teamed up to make his OpenStax-scraping, money-saving dream happen. Our solution had two steps:
- Scrape OpenStax textbooks for images and their associated alt text.
- Associate the scraped alt text with their images in our content management system’s database.
Finding images with Beautiful Soup
Conveniently, each OpenStax textbook has a downloadable zip, containing all of the textbook’s HTML and image files.
I wrote a Python script to walk through all the directories of the unzipped book, looking for HTML files. Then it became a straightforward application of Beautiful Soup, a popular Python HTML parser.
>>> import codecs
>>> from bs4 import BeautifulSoup
>>> # file_path is an HTML file's path
>>> page = codecs.open(file_path, 'r', 'utf-8')
>>> soup = BeautifulSoup(page.read(), 'html.parser')
>>> image_tags = soup.find_all('img')
Notice how simple it is to find all img tags. Once I had the HTML content, I just needed two lines:
- Create a “soup” using the HTML.
- Use the soup’s find_all method to find every img tag.
For each tag in image_tags, Beautiful Soup makes it easy to extract the alt and src attributes:
>>> tag
<img alt="Cute puppy" src="puppy.jpg"/>
>>> tag['alt']
'Cute puppy'
>>> tag['src']
'puppy.jpg'
The alt attribute contains, well, the alt text. The src attribute contains the file path to the image, which will be used in the next section.
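The directory walk mentioned at the start of this section is not shown in the post. A minimal sketch of it, using only the standard library, might look like the following. The function names, the list of file extensions, and the assumption that each src is a path relative to the HTML file containing the tag are all illustrative guesses, not details of the original script.

```python
import os


def find_html_files(book_dir):
    """Yield the path of every HTML file under book_dir, in a stable order."""
    for dir_path, dir_names, file_names in os.walk(book_dir):
        dir_names.sort()  # sorting in place makes os.walk visit subdirectories in order
        for file_name in sorted(file_names):
            # Assumption: HTML files can be recognized by their extension.
            if file_name.lower().endswith(('.html', '.htm')):
                yield os.path.join(dir_path, file_name)


def resolve_image_path(html_path, src):
    """Turn an img tag's src into a file path.

    Assumption: src is relative to the HTML file that contains the tag.
    """
    html_dir = os.path.dirname(html_path)
    return os.path.normpath(os.path.join(html_dir, src))
```

Each path yielded by find_html_files could be passed as file_path to the Beautiful Soup snippet above, and resolve_image_path would then turn each tag['src'] into a path that can be opened for hashing.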
Matching up images with hashing
Now that I have scraped all the alt text in an OpenStax book, I need to match it up with the images in our content management system (CMS). This is where hashing comes in: two byte-for-byte identical image files always have the same hash, while two different files will, for all practical purposes, have different hashes. If the OpenStax image behind a piece of alt text has a hash that matches the hash of an image in our CMS, then we can apply that alt text to that image in the CMS.
In Python, hashing an image is a matter of using the image’s file path (which was scraped with Beautiful Soup) to read the image’s bytes, and then feeding the bytes into one of Python’s built-in hashing algorithms.
>>> import hashlib
>>> with open(image_path, 'rb') as image:
...     image_bytes = image.read()
...
>>> hashlib.sha1(image_bytes).hexdigest()
'bd61b167999f3ad52ae51f3223e6570ec52ac218'
That hash there is of this image below. Try hashing it yourself with the same hashing algorithm and you should see the same result. (This is assuming that the image’s compression has not changed since this post’s publishing.)
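To make the matching step concrete, here is a rough sketch of how scraped alt text could be joined to CMS images by hash. The post does not describe how the CMS exposes its images, so the cms_images dictionary (image ID to file path) is a hypothetical stand-in, and the function names are invented for illustration. The sketch uses SHA-1, as in the snippet above, but reads each file in chunks, which produces the same digest as hashing all the bytes at once.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def match_alt_text(scraped, cms_images):
    """Map CMS image IDs to scraped alt text via identical file hashes.

    scraped:    iterable of (image_path, alt_text) pairs from the textbook
    cms_images: dict of {cms_image_id: image_path} (hypothetical CMS export)
    """
    alt_by_hash = {}
    for image_path, alt_text in scraped:
        # If the same image appears more than once, keep the first alt text seen.
        alt_by_hash.setdefault(sha1_of_file(image_path), alt_text)

    matches = {}
    for image_id, image_path in cms_images.items():
        alt_text = alt_by_hash.get(sha1_of_file(image_path))
        if alt_text is not None:
            matches[image_id] = alt_text
    return matches
```

Because the hash is computed over raw bytes, an image that was resized or re-compressed on its way into the CMS would not match with this approach, even if it looks identical.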
What about the long descriptions?
You’ll notice that I glossed over long descriptions in the last few sections. Did we scrape for that at all? Yes and no. OpenStax image tags do not have longdesc attributes. However, we did end up repurposing many OpenStax alt texts as long descriptions because they were so detailed and, well, long.
We also did not use every OpenStax alt text verbatim, as our team of subject matter experts sometimes improved upon them or shortened them to fit within our 255-character limit.
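The decisions described here were made by people, not code. Still, a first-pass sort by length could flag which scraped texts already fit the alt text limit and which are candidates for a long description. The sketch below is hypothetical: the function name and the three labels are invented for illustration, and only the 255-character limit comes from this post.

```python
ALT_TEXT_LIMIT = 255  # character limit mentioned in this post


def triage_alt_text(text, limit=ALT_TEXT_LIMIT):
    """Suggest how a scraped alt text might be used, pending human review.

    Returns a (label, cleaned_text) pair. The labels are illustrative only.
    """
    cleaned = ' '.join(text.split())  # collapse runs of whitespace and newlines
    if not cleaned:
        return 'missing', cleaned
    if len(cleaned) <= limit:
        return 'alt', cleaned
    # Too long for alt text: shorten it by hand or reuse it as a long description.
    return 'longdesc_candidate', cleaned
```

A subject matter expert would still need to review every result, since a short alt text is not necessarily a good one.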
How much money did we save?
We haven’t calculated exactly how much money we saved (we’ve been busy building out alta instead 😉). But if we do a back-of-the-envelope calculation:
- We were able to scrape several thousand alt texts from OpenStax.
- Writing the alt text and long description for an image costs between $5.00 and $40.00 depending on the image’s complexity.
Therefore, thousands of images times tens of dollars per image equals tens of thousands of dollars saved! Not too shabby for a hack day project that I coded in a day.
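The back-of-the-envelope estimate can be written as a two-line calculation. The per-image cost range is the one quoted above. The image count of 3,000 in the usage example is purely an illustrative stand-in for “several thousand”, not a figure from our records.

```python
def savings_range(image_count, low_cost=5.00, high_cost=40.00):
    """Return the (lowest, highest) estimated savings in dollars."""
    return image_count * low_cost, image_count * high_cost


# Illustrative count only; the post says "several thousand".
low, high = savings_range(3000)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$15,000 to $120,000"
```

Even the low end of that range lands in the tens of thousands of dollars, which is consistent with the estimate above.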