Project Gutenberg in R

Project Gutenberg is one of the greatest digital libraries on the internet, with over 60,000 free ebooks produced entirely by volunteers. It hosts the world’s great literature, with a focus on older works for which U.S. copyright has expired. Many literary scholars use Project Gutenberg for research, and they often need to download its texts in order to run text analysis or other procedures on them. R is a handy tool for converting texts into data sets (one usually uses RStudio to write R), and there is an R package for downloading and processing Project Gutenberg works called gutenbergr. This tutorial is an introduction to the gutenbergr package and shows how to convert an ebook into a data frame.

Step 1: Install R and RStudio on your machine.

Since this tutorial focuses on the gutenbergr package, I will not go into the specific steps of the installation. You can refer to this guide for installation and this W3Schools webpage for a general R tutorial.

Step 2: Install and load the gutenbergr package.

Run the following code chunk in R:

install.packages("gutenbergr")
library(gutenbergr)

The gutenbergr package should now be loaded, and you have access to the following functions and built-in data sets:

gutenberg_authors: a data frame with author information.
gutenberg_download: download one or more works by their Project Gutenberg IDs into a data frame.
gutenberg_get_mirror: get the recommended mirror for Gutenberg files by accessing the wget harvest path.
gutenberg_languages: a data frame with metadata about the languages of each Project Gutenberg work.
gutenberg_metadata: a data frame with metadata about each of the Project Gutenberg works.
gutenberg_strip: strip header and footer content from a Project Gutenberg book.
gutenberg_subjects: a data frame with metadata about the subject of each work.
gutenberg_works: get a table of Gutenberg work metadata that has been filtered by some common (settable) defaults along with the option to add additional filters.
read_zip_url: download, read, and delete a .zip file.

Step 3: Download your book(s) of interest from Gutenberg.

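Simply pass the book ID as an argument to the gutenberg_download function. The ID is the number at the end of the book's Project Gutenberg URL.

# save the raw text as a variable; replace ID with the numeric book ID
book_crude <- gutenberg_download(ID)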
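
If you do not know the ID of a book, one way to look it up is to search the metadata that ships with gutenbergr. The following is a minimal sketch, assuming dplyr is installed; the variable names matches and book_id are only illustrative, and the title pattern is just an example.

```r
library(dplyr)
library(gutenbergr)

# search the bundled metadata for works whose title contains the pattern
matches <- gutenberg_metadata %>%
  filter(grepl("Twenty Thousand Leagues", title, ignore.case = TRUE)) %>%
  select(gutenberg_id, title, author)

print(matches)

# pick one ID from the matches; this value can be passed to gutenberg_download()
book_id <- matches$gutenberg_id[1]
```

A title search may return several editions or translations, so it is worth reading the printed table before choosing an ID rather than always taking the first row.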

Step 4: Convert the crude text into a data frame of words:

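You will need two more R packages here: tidytext, which provides unnest_tokens and the stop_words data set, and dplyr, which provides the %>% pipe, mutate, and anti_join. Install them with install.packages if you have not already.

library(dplyr)
library(tidytext)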

# create a data frame with each row representing a word in a line 
book_words <- book_crude %>%
  mutate(LineNumber = row_number()) %>%
  unnest_tokens(word, text)
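
To see what unnest_tokens does without downloading anything, here is a small sketch on a made-up two-line "book". The toy_book and toy_words names and the sample sentences are assumptions for illustration only; the columns mimic the gutenberg_id and text columns that gutenberg_download returns.

```r
library(dplyr)
library(tidytext)

# a stand-in for a downloaded book: one row per line of text
toy_book <- tibble(
  gutenberg_id = 0L,
  text = c("The sea is everything.",
           "It covers seven tenths of the globe.")
)

# same pipeline as above: number the lines, then split each line into words
toy_words <- toy_book %>%
  mutate(LineNumber = row_number()) %>%
  unnest_tokens(word, text)

print(toy_words)
```

By default unnest_tokens lowercases the words and strips punctuation, and the LineNumber column records which line each word came from.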

Step 5: Remove stop words from the data frame:

stop_words, which ships with tidytext, is a data frame of very common English words (such as "the", "of", and "to") that one usually wants to exclude from an analysis. You can customize the stop_words data frame as you wish by either taking out or adding in words.

book_words_clean <- book_words %>%
  anti_join(stop_words, by = "word")
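
Customizing the stop word list could look like the sketch below. The words chosen here ("great", "captain", "nautilus"), the "custom" lexicon label, and the toy_words table are all hypothetical; substitute whatever suits your own book.

```r
library(dplyr)
library(tidytext)

custom_stop_words <- stop_words %>%
  # take a word out of the list so that it is kept in the analysis
  filter(word != "great") %>%
  # add book-specific words that should be dropped as well
  bind_rows(tibble(word = c("captain", "nautilus"), lexicon = "custom"))

# a made-up word table standing in for book_words
toy_words <- tibble(word = c("the", "captain", "saw", "a", "great", "whale"))

toy_clean <- toy_words %>%
  anti_join(custom_stop_words, by = "word")

print(toy_clean)
```

Because anti_join only looks at the word column here, the lexicon value given to the added rows does not affect the result; it simply keeps the table in the same shape as the original stop_words.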

The result is a cleaned data set of all the words in a book. Taking Twenty Thousand Leagues under the Sea as an example, the resulting data frame has one row per remaining word, with the columns gutenberg_id, LineNumber, and word.

Now you can conduct other cool analyses with this cleaned data frame!

Word Count:

# count each word in the data frame and
# extract the top 10 most frequently used words
book_words_clean %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10)

Still taking Twenty Thousand Leagues under the Sea as an example, the output is a short table listing the most frequent words alongside their counts in the n column.

Word Cloud:

You will need to install and load a package called wordcloud. You can cap the number of words included in the word cloud by passing a number to the max.words argument.

library(wordcloud)

# get word count first
word_count <- book_words_clean %>%
  count(word)

# create a word cloud of the 50 most frequent words
wordcloud(words = word_count$word, freq = word_count$n, max.words = 50)
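
The layout of a word cloud is random, so two runs rarely look the same. Below is a small sketch of a more controlled call, using the min.freq and random.order arguments of wordcloud. The toy_counts table and its numbers are made up for illustration and are not counts from any real book.

```r
library(wordcloud)

# a made-up count table standing in for word_count
toy_counts <- data.frame(
  word = c("captain", "nautilus", "sea"),
  n = c(30, 20, 10)
)

set.seed(42)  # fix the random layout so the plot is reproducible
wordcloud(words = toy_counts$word, freq = toy_counts$n,
          min.freq = 1,          # keep rare words too (the default is 3)
          max.words = 50,        # upper limit on the number of words drawn
          random.order = FALSE)  # plot words in decreasing frequency
```

Setting a seed is a design choice rather than a requirement; it is mainly useful when you want the same picture again for a report or a blog post.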

With max.words set to 50, the result is a word cloud of the top 50 most frequently used words in Twenty Thousand Leagues under the Sea.

Hope this tutorial is helpful! Enjoy hacking the humanities!

jeannyzhang
