SI601/618: Data Retrieval and Analysis Techniques

Jackie Cerretani, Fall 2007

Unix Utilities, Large Corpora

Notes

In assignment one, there was a fundamental mistake in the data analysis. Instead of reviewing the column that indicated word frequency (column 5), I reviewed a different column. In this revision, I fix that error and reprint the data.

Abstract

In this assignment, I compare the word choices of two groups of authors writing in a particular genre: University-wide administrators and the like (group B) versus Academic schools and departments (group C), both in the "officially/politically" genre (genre 6). In my experience covering the Multicultural and Race Relations beat at the Cornell Daily Sun during my undergraduate years, the voices of these two groups tended to be quite different. At Cornell, the way officials and admins spoke about diversity tended to be highly scripted, spun, and heavily focused on university politics. My conversations with faculty tended to be more nuanced, more world-aware, and often aligned with the more radical or intellectual ideas of students. I thought it would be interesting to compare the language used by these two groups to see whether the phenomenon I observed at Cornell held across all the universities represented in the Diversity Kaleidoscope.

Data Choice

The data used is a subset of the Diversity Kaleidoscope, a corpus of documents gathered from U.S. university websites that treat the topic of diversity. In this subset, 1216 documents were authored by University-wide administrators and 588 by Academic departments, all in the "officially/politically" genre.

Process Diary

First, I checked for line counts for various combinations of author and genre, using grep and then piping the results to wc.

grep "|b|" * > allbfiles # find documents with the author category "b"
grep "|6|" < allbfiles | wc -l # count the files in author category "b" that are in genre 6
grep "|c|" * > allcfiles # find documents with the author category "c"
grep "|6|" < allcfiles | wc -l # count the files in author category "c" that are in genre 6

When I had found a pair of author/genre combinations that had an interesting relationship, as well as the required number of documents, I wrote them to their own files.

grep "|6|" < allbfiles > b6files
grep "|6|" < allcfiles > c6files
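
One caveat with the grep approach above is that a pattern such as "|6|" matches a 6 in any column, not only the genre column. A minimal sketch of a column-aware filter follows, assuming (hypothetically) that the author code is in column 2 and the genre code in column 3; the function name filter_rows and the sample rows are made up for illustration, and the column numbers would need to be checked against the real index files.

```shell
#!/bin/sh
# filter_rows FILE AUTHOR GENRE [AUTHOR_COL] [GENRE_COL]
# Print rows of a "|"-delimited file whose author and genre columns match
# exactly. The default column positions (2 and 3) are assumptions made for
# this sketch; adjust them to the real index layout.
filter_rows() {
    awk -F'|' -v a="$2" -v g="$3" -v ac="${4:-2}" -v gc="${5:-3}" \
        '$ac == a && $gc == g' "$1"
}

# Made-up index rows: id|author|genre|url|extra|note
index=$(mktemp)
cat > "$index" <<'EOF'
1|b|6|http://example.edu/a|0|x
2|b|3|http://example.edu/b|0|x
3|c|6|http://example.edu/c|0|x
4|b|2|http://example.edu/d|6|x
EOF

# Count rows where the author is b and the genre is 6.
filter_rows "$index" b 6 | wc -l

# For comparison: the two-step grep also picks up row 4, whose 6 is in another column.
grep "|b|" "$index" | grep -c "|6|"

rm -f "$index"
```

If the column positions differ, they can be passed as the fourth and fifth arguments, for example filter_rows allfiles b 6 3 5.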

Then I used fetchdocs.pl to grab the docs from the web and save their contents to a directory. I did this for both b6 files and c6 files.

#first, make the directory where the files will be saved
mkdir b6filesformonty
mkdir c6filesformonty


# grab column four of this "|"-delimited file and pipe it to fetchdocs, which grabs the documents and saves them in the folder "b6filesformonty"
cut -d "|" -f 4 b6files | ./fetchdocs.pl b6filesformonty

# do the same for the c6 files, saving the documents in the folder "c6filesformonty"
cut -d "|" -f 4 c6files | ./fetchdocs.pl c6filesformonty

This retrieved approximately 1800 documents, stored in two folders, one per author group.

Next, I ran the droid series on the folders of files.

#call the droid to get the file list
perl droidfilelist.pl c6filesformonty
perl calldroid.pl droidfilelist.xml
perl convertdocs.pl droidoutput.xml

Next I reran MontyLingua using test-6.py in order to get all of the analyzable data.

#run monty
python test-6.py ~/si618/c6filesconverted > ~/si618/week1rev/c6filesmontyedfull.txt
python test-6.py ~/si618/b6filesconverted > ~/si618/week1rev/b6filesmontyedfull.txt

Then I pulled the lines tagged as noun phrases (np) into a temporary list for each group:

grep "\bnp\b" c6filesmontyedfull.txt > c6listtemp.txt
grep "\bnp\b" b6filesmontyedfull.txt > b6listtemp.txt

Finally, I rearranged the columns and sorted the data:

#get the nounphrases only in a list
cut -f 5 c6listtemp.txt > nounphrases.txt

# sort the nounphrases
sort nounphrases.txt > sortednounphrases.txt

#count the noun phrases and sort in descending order
uniq -c sortednounphrases.txt | sort -nr > sortedcountedc6.txt

#now do the same for b6 files
#get the nounphrases only in a list
cut -f 5 b6listtemp.txt > nounphrasesb6.txt

# sort the nounphrases
sort nounphrasesb6.txt > sortednounphrasesb6.txt

#count the noun phrases and sort in descending order
uniq -c sortednounphrasesb6.txt | sort -nr > sortedcountedb6.txt
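
The cut, sort, and count steps above are identical for both groups, so they can be sketched as one small function. This is an illustrative wrapper rather than the commands actually run; the name count_phrases and the sample rows are made up, and the only thing taken from the steps above is that the phrases sit in tab-separated column 5.

```shell
#!/bin/sh
# count_phrases FILE [COLUMN]
# Print each distinct value of a tab-separated column with its count,
# most frequent first. COLUMN defaults to 5, the column used above.
count_phrases() {
    cut -f "${2:-5}" "$1" | sort | uniq -c | sort -k1,1nr -k2
}

# Tiny made-up sample (not real MontyLingua output): five tab-separated columns.
sample=$(mktemp)
printf 'd1\t1\tnp\tx\tdiversity\n' >  "$sample"
printf 'd1\t2\tnp\tx\tstudents\n'  >> "$sample"
printf 'd2\t1\tnp\tx\tdiversity\n' >> "$sample"

# Prints the count for "diversity" first, then "students".
count_phrases "$sample"

rm -f "$sample"
```

With the real data this would be called once per group, for example count_phrases c6listtemp.txt > sortedcountedc6.txt. The second sort key only makes the order of tied counts predictable.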

Output

Click below to view the lists:

  • sortedcountedb6.txt
  • sortedcountedc6.txt

Academic Schools and Departments    University-wide Administrators

728 that                            316 that
675 diversity                       290 diversity
609 University                      229 students
572 who                             216 who
542 it                              216 we
523 which                           212 which
507 we                              210 University
496 students                        198 We
446 they                            190 it
421 It                              183 Diversity

Revised Results and Meaning

It is revealing that the most frequent word for both author groups is "that". This tells me that removing the stop words will make these results significantly more relevant. It is also clear from the repetition of the word "we" in both lowercase and initial cap in the University-wide Administrators list that changing the noun phrases to lowercase before counting would be helpful (though I understand this is not the case for all situations).
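
The two clean-up steps described above, lowercasing and stop-word removal, can be sketched with standard utilities. This is only a rough illustration under stated assumptions: the function name normalize_counts is made up, the stop-word list is a hypothetical file with one word per line, and the filter only drops lines that consist entirely of a stop word, so a multi-word phrase containing "that" would be kept.

```shell
#!/bin/sh
# normalize_counts PHRASEFILE STOPFILE
# Lowercase every line of PHRASEFILE, drop lines that exactly equal a word
# listed in STOPFILE (one word per line), then count and rank what is left.
normalize_counts() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        grep -v -x -F -f "$2" |
        sort | uniq -c | sort -k1,1nr -k2
}

# Made-up inputs for illustration only.
phrases=$(mktemp)
stops=$(mktemp)
printf 'that\nWe\nwe\ndiversity\nDiversity\nstudents\n' > "$phrases"
printf 'that\nwe\nit\nwhich\nwho\nthey\n' > "$stops"

# "Diversity" and "diversity" are now counted together; "that" and "we" are gone.
normalize_counts "$phrases" "$stops"

rm -f "$phrases" "$stops"
```

Lowercasing before filtering means the stop list itself only needs lowercase entries.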

I hesitate to draw conclusions based on the above, but for the sake of interest, I will note that both groups use the word "diversity" very frequently. Also, from the inverse placement of the words "students" and "University" in the lists, it would appear that University-wide Administrators are more concerned with the former, whereas Academic Schools and Departments are more concerned with the latter.
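
Comparing placements by eye gets harder as the lists grow, so one option is to put the two counts for each shared word on a single line. The sketch below does this with join, using the top-ten counts reported above and assuming the first list belongs to Academic Schools and Departments and the second to University-wide Administrators, following the order of the headings. The function name side_by_side is made up, and the output shows raw counts only, with no adjustment for the different sizes of the two groups.

```shell
#!/bin/sh
# side_by_side FILE_A FILE_B
# Each input holds "count word" lines. Prints "word count_a count_b"
# for every word that appears in both files.
side_by_side() {
    _sa=$(mktemp) _sb=$(mktemp)
    # join needs both inputs sorted on the join field in the same collation.
    LC_ALL=C sort -k2b,2 "$1" > "$_sa"
    LC_ALL=C sort -k2b,2 "$2" > "$_sb"
    LC_ALL=C join -1 2 -2 2 "$_sa" "$_sb"
    rm -f "$_sa" "$_sb"
}

acad=$(mktemp)
admin=$(mktemp)
cat > "$acad" <<'EOF'
728 that
675 diversity
609 University
572 who
542 it
523 which
507 we
496 students
446 they
421 It
EOF
cat > "$admin" <<'EOF'
316 that
290 diversity
229 students
216 who
216 we
212 which
210 University
198 We
190 it
183 Diversity
EOF

side_by_side "$acad" "$admin"
rm -f "$acad" "$admin"
```

Among the output lines are "University 609 210" and "students 496 229", which show the reversal discussed above. Words that appear in only one list, such as "they" or "We", are left out.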

Other than that, I'd like to withhold further interpretation until the stop words are removed and the phrases are changed to lowercase, which happens in Assignment 2.
