Text Analysis and Network Analysis

For today’s class, we are going to start by generating our own qualitative survey data set that we can use to try out some network analysis for homework.

Fill out the quick five-question survey below:

h/t to Miriam Posner for the framework for this exercise

We can use OpenRefine to transform our survey data into a format that can be used for network analysis, using its “Transpose cells across columns into rows” operation.
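The same transformation can be sketched in plain Python. Here is a minimal example (the respondent names and columns are invented for illustration) of turning one wide survey row per person into one (respondent, answer) row per cell, which is the edge-list shape that network tools expect:

```python
import csv
import io

# Hypothetical survey export: one row per respondent, one column per answer.
# (Respondent and column names here are invented for illustration.)
wide = io.StringIO(
    "respondent,fav_book,fav_film\n"
    "Ada,Frankenstein,Metropolis\n"
    "Ben,Dracula,Metropolis\n"
)

# OpenRefine's "Transpose cells across columns into rows" turns each
# (respondent, answer) cell into its own row, i.e. an edge list.
edges = []
for row in csv.DictReader(wide):
    for column in ("fav_book", "fav_film"):
        edges.append((row["respondent"], row[column]))

print(edges)
```

Each tuple is now an edge between a respondent node and an answer node, ready for a tool like Gephi.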


Text Analysis 101

Now that we know what XML markup looks like, we can turn to the broader and more fundamental question facing digital humanists: why should we mark up texts in the first place?

Computer-assisted text analysis is one of the oldest forms of digital humanities research, going back to the origins of “humanities computing” in the 1940s.  Computers are well suited to data mining large bodies of texts, and with the rise of massive digitization projects like Google Books, many research questions can be answered using plain “full text” versions of documents.  The Google Ngram viewer is probably the best known of these tools, allowing simple comparisons of the frequencies of words across Google’s enormous corpus of digitized texts.  Someone interested in Prohibition for instance might compare the frequency of “alcohol” and “prohibition” in the American English corpus to see how the two terms were used during the period of its enforcement.

Google Ngram viewer based on its digitized corpus
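To see the kind of computation behind such a comparison, here is a toy sketch in Python. The three-sentence “corpus” is invented; the real Ngram viewer computes these relative frequencies over millions of digitized volumes:

```python
from collections import Counter

# Invented mini-corpus standing in for Google's digitized texts.
corpus_by_year = {
    1920: "prohibition of alcohol begins and alcohol sales end",
    1925: "alcohol smuggling rises despite prohibition",
    1933: "prohibition repealed and alcohol returns",
}

def term_frequency(term, text):
    # Relative frequency: occurrences of the term divided by total words,
    # which is what the Ngram viewer plots over time.
    words = text.lower().split()
    return words.count(term) / len(words)

for year, text in corpus_by_year.items():
    print(year, term_frequency("alcohol", text), term_frequency("prohibition", text))
```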

More sophisticated text analysis tools also exist that let you perform some pretty impressive data mining analytics on plain texts.  

Voyant Tools is one of the best-known and most useful tools out there.  It permits some pretty impressive analysis and visualization of plain texts, but also allows you to upload XML and TEI files for finer-grained control over your data.  For how-to information, check out their extensive documentation page.

Voyant Tools — the “gateway drug” of text analysis for DH

Exercise (Plain Text Analysis)

Let’s take a look at what these text analysis tools can do with a classic example text: Shakespeare’s Romeo and Juliet.

  • Go to Voyant Tools, click the Open folder icon and choose Shakespeare’s Plays to load the open-access plain-text versions of all the Bard’s plays.

Voyant Tools default main navigation dashboard.
  • Explore the interface, read the summary statistics and hover your mouse over various words to see what pops up
    • What do you notice about the Cirrus tag cloud?
    • Are there words you would expect that are missing?
  • To make it more useful, let’s edit the stop word list, “a set of words that should be excluded from the results of a tool”
    • Hover over the “?” in the top right corner of any tool to reveal additional options
      • Click the option slider icon to launch the options menu
      • By default Voyant auto-selects a list for you, but you can also choose one explicitly, e.g. the English list, which contains all the common short words, prepositions, and articles in English.  
      • Most prepositions are removed, but some contractions, like “I’ll”, still show up in the word cloud. Click “Edit List” and manually add “I’ll” or any other words you deem not semantically meaningful to the list.
      • Make sure the Apply Globally checkbox is checked so this option will act across all tools.
    • Click on any word in the Cirrus cloud to filter the other tools — the fact that they are connected provides a lot of exploratory data viz power.
  • Swap out the default tool set for a few more using the windows icon dropdown list to explore the possibilities that Voyant offers.
Voyant offers a wide range of visualization options, making it a superb exploratory data viz tool

We’ve been exploring the entire corpus, but you can also filter the list down to individual documents.

  • Open the Documents window at the bottom left, and click on Romeo and Juliet to load just that play’s statistics
    • Investigate the Trends and Contexts tools to analyze some key thematic words in the play, like “love” and “death”
  • There are a number of other analysis and visualization tools that Voyant offers, which can be accessed via a direct URL in the address bar.

DISCUSSION:

  1. What kinds of questions can you answer with this sort of data?
  2. Are there research topics that would NOT benefit from the approach?

Structured Full Text: XML and the TEI

Text analysis and data mining tools can get a lot of statistics out of the full texts, but what if we are interested in more fine grained questions?  What if we want to know, for instance, how the words Shakespeare used in dialogues differed from soliloquies?  Or what if we were working with a manuscript edition that was composed by several different hands and we wanted to compare them?  For these kinds of questions, we need to go beyond full text. We need our old friend metadata! One way to add it is to encode our texts with meaningful tags.  Enter XML and TEI.

Over the past few weeks we have discussed and seen how the modern dynamic web—and the digital humanities projects it hosts—comprises structured data (usually residing in a relational database) that is served to the browser in response to a user request, where it is rendered in HTML markup.  

XML (eXtensible Markup Language) is a sibling of HTML, and a mainstay of DH methods. Whereas HTML can include formatting instructions telling the browser how to display information, XML is merely descriptive.  XML doesn’t do anything, it just describes the data in a regularized, structured way that allows for the easy storage and interchange of information between different applications.  Making decisions about how to describe the contents of a text involves interpretive decisions that can pose challenges to humanities scholarship.
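To see what “merely descriptive” means in practice, here is a minimal, invented snippet parsed with Python’s standard library. The tags identify what each piece of text is; nothing here tells a browser how to display it:

```python
import xml.etree.ElementTree as ET

# A made-up descriptive markup fragment: the tags say WHAT each piece
# of text is (a speaker, a line), not HOW it should look on screen.
doc = """
<speech>
  <speaker>ROMEO</speaker>
  <line>But soft, what light through yonder window breaks?</line>
</speech>
"""

root = ET.fromstring(doc)
# Because the structure is regular, any application can pull out the
# pieces it cares about by name.
print(root.find("speaker").text)  # ROMEO
```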

We went over the main parameters in class, but an excellent primer to XML in the context of DH has been put together by Frédéric Kaplan for his DH101 course and can be viewed in the slideshow below.

Introduction to XML from Frederic Kaplan

For humanities projects, the de facto standard markup language is that specified by the Text Encoding Initiative.  They have spent years working out a flexible yet consistent system of tags with the aim of standardizing the markup of literary, artistic and historical documents to ensure interoperability between different projects, while at the same time allowing scholars the discretion to customize a tag set matching the needs of their specific project.  This flexibility can make it a daunting realm to enter for newbies, and we will not be going very far down this path.

The best gentle introduction to getting a TEI project up and running can be found at TEI By Example.  For our purposes, let’s see how a properly marked up version of Romeo and Juliet differs from the plain text version and what that means for our scholarly pursuits.


Exercise (Encoded Text Analysis)

The Folger Shakespeare Library’s Digital Texts is a major text encoding initiative that provides high-quality digital editions of all the plays online for free.  These are not only available online through a browser, but the Folger has also made their source code available for download.

Folger Shakespeare formatted HTML R&J
  • First, look at the online edition of Romeo and Juliet in formatted HTML.
    • This provides a much nicer reading experience than a plain text version, mimicking the print edition while adding search and navigation
      • Explore the features and options and think about what this presentation offers compared to a traditional print version
  • To see how they made this happen, download the XML source code from this page, and open it on your computer.
    • Open the files in a text editor and try to make sense of them.
    • For the XML file, you’ll see that this XML is much more complex than the HTML we looked at before.
      • Try to find the Act 1, Scene 1 opening in the image above and compare the two.
        (HINT: you’ll need to scroll ALL the way down to line 645 or so)
      • What elements have they tagged?  
      • How fine-grained did they get?  
      • What is the point of tagging to this level of detail?
  • You may find their Tag Guide documentation helpful to understand what each element represents
    • Using the guide, try to locate a speech by Romeo and another by Juliet.

As you might imagine, this kind of granularity of markup opens the door to much more sophisticated queries and analyses.  Let’s say, for instance, that we wanted to compare the text of Romeo’s speeches with those of Juliet as part of a larger project exploring gender roles on the Elizabethan stage.  The detailed TEI encoding of the Folger edition should let us do this pretty easily.  Unfortunately the user interfaces for analysis of TEI documents have not been developed as much as the content model itself, and most serious analytical projects rely on statistical software packages used in scientific and social science research, like the open-source R Project.  We’re not going to go that route for this class.

Voyant Tools will allow us to select certain elements of the marked up code, but only if we understand the XML document’s structure and know how to navigate it with XPATH (part of XSL and used in conjunction with XSLT).  So let’s see how that works on our Romeo and Juliet example.

To do so, we’re actually going to use a slightly simpler XML markup version of Romeo and Juliet, so that it’s easier to see what’s going on.

  • Go back to Voyant Tools and paste the URL below into the Add Texts box
    • BEFORE YOU CLICK REVEAL: open the Options dialog in the top right corner
http://www.ibiblio.org/xml/examples/shakespeare/r_and_j.xml
  • Under “XML > Content”, type (or copy/paste) the following expression:
    //SPEECH[contains(SPEAKER,"ROMEO")]/LINE
  • Let’s also give this document the Title of ROMEO
    • Under “Title”, try to alter the expression above to select the SPEAKER values within Romeo’s speeches, instead of the LINE values
  • Finally, click on OK, and then Reveal
    • Apply any stop words and explore how this differs from the full text version of the play we examined earlier
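For those curious what that XPath expression is doing under the hood, here is a sketch in Python using the standard library. The three-speech document is an invented stand-in for the ibiblio file, and because ElementTree implements only a subset of XPath (no contains()), the filter is reproduced in plain Python:

```python
import xml.etree.ElementTree as ET

# A tiny invented stand-in for the structure of r_and_j.xml.
play = """
<PLAY>
  <SPEECH><SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE></SPEECH>
  <SPEECH><SPEAKER>JULIET</SPEAKER><LINE>What satisfaction canst thou have tonight?</LINE></SPEECH>
  <SPEECH><SPEAKER>ROMEO</SPEAKER><LINE>Th' exchange of thy love's faithful vow for mine.</LINE></SPEECH>
</PLAY>
"""

root = ET.fromstring(play)

# Mirrors //SPEECH[contains(SPEAKER,"ROMEO")]/LINE: keep only SPEECH
# elements whose SPEAKER text contains "ROMEO", then collect their LINEs.
romeo_lines = [
    line.text
    for speech in root.iter("SPEECH")
    if "ROMEO" in speech.findtext("SPEAKER", "")
    for line in speech.findall("LINE")
]

print(romeo_lines)
```

Voyant applies the same selection for you, passing only the matching LINE text on to its analysis tools.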

I’m sure you can see the powerful possibilities afforded by using the encoded tags to quickly select certain portions of the text.  When you do this type of analysis on an entire corpus, you can generate lots of data to compare with relative ease.  So let’s compare Romeo’s text to Juliet’s.

  • Edit the Trends tool to show trend lines for the words “love” and “death”
  • To preserve Romeo’s text analysis, we can get a static URL to this instance of Voyant.
    • Go to the export icon in the top right, and Export a URL for the Trends tool and current data
    • then click on the URL link to launch the current view in a new window

Now you can go back to the home screen and repeat the process above, changing the XPath expressions to select all of Juliet’s lines, and title the new document JULIET.

  • Apply any stop words again
  • Now you can use all of the analytical tools to compare the two lovers’ words!
  • What does the encoded version offer that the other lacked?
Love and death in Romeo’s speeches
Love and death in Juliet’s speeches

Network Analysis 101

The advent of the internet, and especially of its more socially connected Web 2.0 variant, has ushered in a golden age for the concept of the network.  The interconnected world we now live in has changed not only the way we study computers and the internet, but the very way we envision the world and humanity’s place in it, as Thomas Fisher has argued.  The digital technologies that we are learning to use in this class are tightly linked to these new understandings, making network analysis a powerful addition to the Digital Humanist’s toolkit.  According to Fisher,

The increasingly weblike way of seeing the world … has profound implications for how and in what form we will seek information. The printed book offers us a linear way of doing so. We begin at the beginning—or maybe at the end, with the index—and work forward or backward through a book, or at least parts of it, to find the information we need. Digital media, in contrast, operate in networked ways, with hyperlinked texts taking us in multiple directions, social media placing us in multiple communities, and geographic information systems arranging data in multiple layers. No one starting place, relationship, or layer has privilege over any other in such a world.

Small Network

To study this world, it can therefore be helpful to privilege not the people, places, ideas or things that have traditionally occupied humanistic scholarship, but the relationships between them.  Network analysis, at root, is the study of the relationships between discrete objects, which are represented as graphs of nodes or vertices (the things) and edges (the relationships between those things).  This is a very active area of research that emerged from mathematics but is being explored in a wide array of disciplines, resulting in a vast literature.  (Scott Weingart offers a gentle introduction for the non-tech savvy in his Networks Demystified series and you can get a sense of the scope from the Wikipedia entry on Network Theory.)  As hackers, we are not going to get too deep into the mathematical underpinnings and rely mostly on software platforms that make network visualization relatively easy, but it is important to have a basic understanding of what these visualizations actually mean in order to use them critically and interpret them correctly.
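The vocabulary is easier to grasp with a toy example. Here is a sketch in Python (the characters and relationships are invented for illustration) that represents a graph as an edge list and computes each node’s degree, the simplest network metric:

```python
from collections import defaultdict

# A toy undirected network: nodes are people, edges are relationships.
# (Edges invented for illustration.)
edges = [("Romeo", "Juliet"), ("Romeo", "Mercutio"),
         ("Romeo", "Benvolio"), ("Juliet", "Nurse")]

# Degree: how many edges touch each node.
degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(dict(degree))  # Romeo is the most connected node in this graph
```

Tools like Gephi compute this and far more elaborate measures (centrality, clustering, community detection) over exactly this kind of node-and-edge data.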


Network Analysis DH Projects

Now that you know the basics of what a network graph is, let’s explore some much more sophisticated network analysis DH projects.  With your neighbors, explore one or more of the following projects:

Assignment

As you explore the project, consider the following questions about the nature of this network analysis and write a blog post for next class that answers two or more of them, with a link to your project:

  • What (or who) are the nodes and what are the edges?
  • How are the relationships characterized and categorized?
  • What interactions does the project allow?
    • How does this impact their effectiveness and/or your engagement?
  • How was the project created?
    • See if you can dig around in the documentation and discover what tools or data manipulation steps produced the outcome you see.
    • Does the project combine network analysis with any other information or technique, like spatial analysis or text mining?

BONUS (Optional)

Follow along with Miriam Posner’s Gephi tutorial to see how network analysis works in practice. Download Gephi here; it is the network graphing software most used in DH projects.

See how far you get and come to the next class with questions!


Resources

XML DOM and JavaScript

w3schools is always a good place to start for the by-the-book definition of a language or standard with some good interactive examples, and their XML DOM tutorial is no exception.

SimpleXML in PHP 5

If your project allows server-side scripting it is MUCH easier to use PHP to parse XML than JavaScript.  The w3schools introduction to SimpleXML in PHP 5 is solid, but TeamTreehouse.com has a more readable and accessible real-world example of how to parse XML with PHP’s SimpleXML functions.

XSLT

XSLT (eXtensible Stylesheet Language Transformations) is to XML what CSS is to HTML, but it’s also a lot more.  More like a programming language within markup tags than a regular markup language, it’s a strange but powerful hybrid that will let you do the same things as the languages above: transform XML data into other XML documents or HTML.  Learn the basics at w3schools’ XSLT tutorial.  If you’d like a more in-depth explanation of how XML and XSLT work together, check out this tutorial from xmlmaster.org, which is geared toward the retail/business world but contains basic information relevant for DH applications.
