Bashing my head against the command line

BASHing–get it?? (Too violent a pun? See also: hack.) 

The terminal window displays the commands involved in generating lists of the people and organizations mentioned in the collection of Turner Cassity poems.

A screenshot of my session in Terminal.

MARBL houses and owns the rights to the poet Turner Cassity’s papers, including born digital materials from one computer. Dorothy Waugh, my colleague on the Digital Archives team, processed the born digital materials and is now working to get them online and publicly available on an Omeka site. I am experimenting with text analysis of his born digital materials.

So far, my experimentation has entailed using the Stanford Natural Language Processing group’s Named Entity Recognition software, following this tutorial by William Turkel.

My introduction to the command line

Over the past month or so, I have become somewhat comfortable on the command line by working with text analysis tools and generating network graph data.

I started by working through bits of Prof Hacker’s Guide to the Command Line, at the suggestion of Sara Palmer, electronic text specialist at ECDS.

Since then, I have mastered basic navigation among directories. “cd [name of directory]” and “ls” (which lists the files and directories within the current directory) came to be easy. I also had a better time when I realized that you either have to navigate to the folder that has the stuff you’re working with, or else you have to include the file path within your command to make things work. I learned how to create and edit files within the command line, with vim. “vim” plus the name of the new file will create and open a file within Terminal, then you hit “i” to “insert” text, “esc” to exit the edit mode, and “:wq” to save and quit. As with many of these sorts of things, internet searches are your friend when you get stuck.
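The two ways of pointing a command at a file (navigate to the folder first, or spell out the path) can be sketched as follows. This is only an illustration: the directory and file names are made up, and everything happens in a temporary folder so nothing real is touched.

```shell
# Work in a throwaway directory (hypothetical names throughout).
workdir=$(mktemp -d)
mkdir "$workdir/poems"
echo "draft" > "$workdir/poems/example.txt"

# Option 1: navigate to the folder, then use the bare file name.
cd "$workdir/poems"
ls
from_inside=$(cat example.txt)

# Option 2: stay one level up and include the path in the command.
cd "$workdir"
from_outside=$(cat poems/example.txt)

echo "$from_inside / $from_outside"
```

Both commands read the same file; the only difference is where you are standing when you run them.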

Preparing files for text analysis

First, I wanted to get the text from all of the pdfs of Cassity’s files into a single plain text file. To do so, I got pdftotext, and used Ken Benoit’s instructions to convert the folder of pdfs to a plain text file. Following his instructions, I saved the following script as “convertmyfiles.sh”:

#!/bin/bash
FILES=~/Documents/ECdsmarblprojects/642_TurnerCassityPapers/642_TurnerCassityPapers_Omeka/*.pdf
for f in $FILES
do
  echo "Processing $f file..."
  pdftotext -enc UTF-8 $f
done
But when I ran the script, not all of the pdfs were converted to txt files. I realized that I had issues wherever the file names included spaces, because (I have learned) the shell splits an unquoted variable like $f at each space, so a file name with spaces reaches pdftotext as several separate arguments. So, I deleted all of the text files that had been created by entering “rm *.txt”. (Note that “rm” is one of those commands you have to be extra careful with in the command line: it deletes files outright rather than moving them to the Trash, so they generally can’t be recovered.) Then I set out to fix the file names.
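To see why spaces trip up a script like convertmyfiles.sh, here is a minimal sketch of what the shell does with a variable that holds a file name containing a space. The file name and the count_args helper are invented for the demonstration, and the behavior shown is that of bash and other sh-style shells.

```shell
# A hypothetical file name containing a space.
f="Poems 1990.pdf"

# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $f)      # the shell splits the value at the space
quoted=$(count_args "$f")      # double quotes keep the name in one piece

echo "unquoted: $unquoted, quoted: $quoted"
```

Unquoted, the name arrives as two arguments; quoted, as one. So an alternative to renaming the files would probably have been to write "$f" (in double quotes) on the pdftotext line of the script, though I have not tested that against the Cassity files.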
I ran a command (which I found by googling) to replace any spaces in file names with an underscore.
for i in *.pdf; do mv "$i" "`echo $i | sed -e 's, ,_,g'`"; done
I ran the convertmyfiles.sh script again, and it worked. I checked that there was a .txt file for each .pdf by listing the files in the directory (ls).
Terminal window listing .txt and .pdf files in the directory

List of files in the directory.
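Eyeballing the output of ls works for a small folder, but the same check can be scripted. The sketch below assumes pdftotext's default behavior of writing each text file next to its PDF with the same base name; the file names are placeholders, and c.txt is deliberately left out so the check has something to report.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch a.pdf a.txt b.pdf b.txt c.pdf   # hypothetical files; c.txt is missing on purpose

missing=""
for pdf in *.pdf; do
  txt="${pdf%.pdf}.txt"               # same base name, .txt extension
  if [ ! -e "$txt" ]; then
    missing="$missing$pdf "
  fi
done

echo "PDFs without a text file: ${missing:-none}"
```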

Next, I wanted to get all of the newly created text files into one .txt file, so that I could run the name recognition software against a single file. So I concatenated all of the .txt files:
cat *.txt > everything.txt
I opened the file to check that it had worked.
open everything.txt
The “open” command opens the text in an external text editor, but I can also open the file within the command line, with “vim everything.txt”. I also created a combined pdf, just for my own perusal, by installing pdftk and following this tutorial on how to join multiple pdfs.
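One caveat about the concatenation step: because everything.txt itself ends in .txt, running “cat *.txt > everything.txt” a second time in the same folder means the pattern also matches the output file, which can cause a warning or unexpected output depending on the version of cat. A hedged variation is to write the combined file into a separate folder. The folder name “combined” and the sample files here are my own placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'first poem\n' > a.txt     # stand-ins for the converted files
printf 'second poem\n' > b.txt

# Keep the output out of the folder that *.txt is matched against.
mkdir -p combined
cat -- *.txt > combined/everything.txt

line_count=$(wc -l < combined/everything.txt | tr -d ' ')
echo "$line_count"
```

Rerunning the cat line now rebuilds the combined file from the same inputs each time.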

Named Entity Recognition Software

Once I had my everything.txt file, I turned to Turkel’s tutorial on using Stanford Natural Language Processing group’s Named Entity Recognizer (NER) software.

I had to upgrade to the latest version of Java–which then messed up Gephi (the network graph program I’m using for another project), which doesn’t work with newer versions of Java. Using the command line, I had to edit the Gephi configuration file to point to an older version of Java, but that is another story.
I got the Named Entity Recognizer software according to Turkel’s instructions, but changed the address in the command to access the latest version (“wget http://nlp.stanford.edu/software/stanford-ner-2015-01-29.zip”). Then, I ran the ner.sh script on the everything.txt file:
stanford-ner/ner.sh everything.txt > cassity_ner.txt
The NER script appends a label to each word in the file. Words detected as names of a person, location, or organization are labeled, respectively, /PERSON, /LOCATION, /ORGANIZATION, and all of the other words get an /O.
Text file includes text of poems with labels appended, for instance, "Teatro/ORGANIZATION Amazonas/ORGANIZATION and/O the/O Rand/ORGANIZATION Club/ORGANIZATION ,/O ranks/O Of/O derricks/O in/O the/O Caspian/LOCATION ,/O now/O brighten/O blanks/O On/O surveys/O past/O and/O stay/O to/O say/O the/O worst/O of/O times/O May/O be/O the/O best/O if/O it/O can/O overlook/O the/O crimes/O ./O John/PERSON Maynard/PERSON Keynes/PERSON ,/O Ploesti/LOCATION ,/O and/O Malaya/LOCATION "

Text file with NER labels. Note that in lines 2-3 the script didn’t catch Red Sea as a /LOCATION.

Turkel’s tutorial then walked me through:

  • removing all of the “/O” labels to create a clean copy with only the person, organization, and location labels:
    • sed 's/\/O / /g' < cassity_ner.txt > cassity_ner_clean.txt
  • using egrep to create files of the words that precede each of the /PERSON, /ORGANIZATION, and /LOCATION labels:
    • egrep -o -f pattr cassity_ner_clean.txt > cassity_ner_pers.txt
    • pattr is a file that gives the rule for what to retrieve: the string of letters that precedes the /PERSON label, with adjacent /PERSON strings joined (e.g. Maya/PERSON Angelou/PERSON would be retrieved as a single named entity):
      • (([[:alpha:]]|\.)*/PERSON([[:space:]]|$))+
  • sorting the lists of people, organizations, and locations by frequency:
    • cat cassity_ner_loc.txt | sed 's/\/LOCATION//g' | sort | uniq -c | sort -nr > cassity_ner_loc_freq.txt
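The three steps above can be strung together and tried on a tiny sample before running them on the full file. This is a sketch rather than Turkel's exact commands: the tagged sentence is invented (borrowing a few labels from the screenshot), the pattern is passed inline instead of through a pattr file, I use “grep -E” in place of egrep, and I added one extra sed expression to trim trailing spaces so that the same name is not counted twice depending on what followed it.

```shell
# Invented sample of NER output; "New Orleans" is there to show joined labels.
tagged='in/O the/O Caspian/LOCATION ,/O Ploesti/LOCATION ,/O New/LOCATION Orleans/LOCATION ./O Back/O to/O the/O Caspian/LOCATION ./O'

# Same rule as the pattr file, adapted for /LOCATION.
pattern='(([[:alpha:]]|\.)*/LOCATION([[:space:]]|$))+'

# 1. drop the /O labels  2. pull out the labeled runs
# 3. strip the labels and trailing spaces  4. count and sort by frequency
freq=$(printf '%s\n' "$tagged" \
  | sed 's/\/O / /g' \
  | grep -E -o "$pattern" \
  | sed -e 's/\/LOCATION//g' -e 's/[[:space:]]*$//' \
  | sort | uniq -c | sort -nr)

echo "$freq"

# The most frequent entry and its count (first line of the sorted list).
top=$(printf '%s\n' "$freq" | head -n 1 | awk '{print $2}')
top_count=$(printf '%s\n' "$freq" | head -n 1 | awk '{print $1}')
```

In this sample the Caspian appears twice and lands at the top of the list, while “New Orleans” comes through as a single two-word entry.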

I ended up with three files with the lists of people, organizations, and locations, in order of frequency. The winners are (if most appearances in the text equals winning):

  • Person: Galt
  • Organization: Artemis (actually a personal name…organization seems to be the trickiest designation to determine, because the list is the most confused of the three). The first actual organization to appear on the list is Teatro Amazonas.
  • Location: Lombok

Now, my next step will be to explore the best way to clean up or at least make more sense of the data. As Turkel says in the tutorial, even though the data isn’t perfect, it still is useful: “these errors are interesting, in the sense that they give us a bit better idea of what this text might be about.”

Turner Cassity and Place

Turner Cassity was a poet and librarian who was born and buried in Mississippi; he lived for much of his life in Georgia, and worked as a librarian at Emory; he also traveled and lived outside of the continental U.S. for significant periods of his life. I plan to use text analysis to explore the places he writes about in his poetry.

Cassity was known as a southern poet, and as a poet who wrote about a geographically diverse array of places. The Atlanta Journal-Constitution’s obituary of Turner Cassity foregrounds his identity as a southern poet who didn’t write about the South:

“He was so very Southern that he didn’t need to write about the region to prove it,” Dana Gioia, former chairman of the National Endowment for the Arts, wrote in an e-mail. “He didn’t write about conventional ‘Southern’ literary themes because he represented the more cosmopolitan side of Southern identity. He was also a Southern eccentric in the style of Flannery O’Connor or John Kennedy Toole.” 

Critics often comment on Cassity’s relationship with place.

Donald Davie, in an earlier assessment of Cassity, observed:
Cassity is very much a world-wandering poet, a globetrotter, and although the places that he responds to have something in common (mostly, for instance, they are post-colonial places), still the sheer irreducible variety of visitable places and climates is something brought powerfully home to us as we read Cassity. 

Yi-Hsuan Tso notes:

[Cassity’s] poem “Cartography is an Inexact Science” accentuates his idea of the syndication of culture, suggesting that geography is defined more by people’s interrelationships and less by space. 

If cartography is an inexact science, the extraction of place names from poetry and geolocating them on a map is certainly even more inexact.
I look forward to this roundabout exploration of Cassity’s work.

1 Comment

  1. Anne Donlon

    A resource for learning the command line: Zed A. Shaw’s Command Line Crash Course.
