Jackie Cerretani, Fall 2007
In assignment one, there was a fundamental mistake in the data analysis. Instead of reviewing the column that indicated word frequency (column 5), I reviewed a different column. In this revision, I fix that error and reprint the data.
In this assignment, I chose to compare the word choices of two groups of authors writing in a particular genre: University-wide administrators and the like (group B) versus Academic schools and departments (group C), both in the "officially/politically" genre (genre 6). In my experience covering the Multicultural and Race Relations beat at the Cornell Daily Sun during my undergraduate years, the voices of these two groups tended to be quite different. At Cornell, the way officials and admins spoke about diversity tended to be highly scripted, spun, and heavily focused on university politics. My conversations with faculty tended to be more nuanced, more world-aware, and often aligned with the more radical or intellectual ideas of students. I thought it would be interesting to compare the language used by these two groups to see if the phenomenon I observed at Cornell held across all the universities represented in the Diversity Kaliedescope.
The data is a subset of the Diversity Kaliedescope, a corpus of documents about diversity gathered from U.S. university websites. In this subset, 1216 documents were authored by University-wide administrators and 588 by Academic departments, all in the "officially/politically" genre.
First, I checked line counts for various combinations of author and genre, using grep and piping the results to wc.
grep "|b|" * > allbfiles #find documents with the author category "b"
grep "|6|" < allbfiles | wc # all files in author category "b" that are in genre 6
grep "|c|" * > allcfiles #find documents with the author category "c"
grep "|6|" < allcfiles | wc # all files in author category "c" that are in genre 6
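The same check can be repeated for several author/genre combinations without creating intermediate files. The following is only a minimal sketch under stated assumptions: the helper name `count_combo` is hypothetical, the sample rows are made up, and it assumes (as the greps above do) that author and genre codes appear as pipe-delimited fields such as `|b|` and `|6|`.

```shell
# Hypothetical helper: count the lines in the given index files that
# contain both an author code and a genre code as pipe-delimited fields.
count_combo() {
  author=$1; genre=$2; shift 2
  # -h drops the filename prefix grep adds when given several files;
  # -F treats the pattern as a literal string rather than a regex.
  grep -hF "|$author|" "$@" | grep -cF "|$genre|"
}

# Demo on a tiny made-up index file (the column layout is an assumption).
combo_demo=$(mktemp -d)
printf '%s\n' \
  '1|b|6|http://example.edu/a' \
  '2|b|3|http://example.edu/b' \
  '3|c|6|http://example.edu/c' \
  '4|b|6|http://example.edu/d' > "$combo_demo/index.txt"

for author in b c; do
  echo "author $author, genre 6: $(count_combo "$author" 6 "$combo_demo/index.txt")"
done
```

Because the match is on a literal substring, a `6` in any pipe-delimited field would also be counted, so the approach is only as reliable as the original greps.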
When I had found a pair of author/genre combinations that had an interesting relationship, as well as the required number of documents, I wrote them to their own files.
grep "|6|" < allbfiles > b6files
grep "|6|" < allcfiles > c6files
Then I used fetchdocs.pl to grab the docs from the web and save their contents to a directory. I did this for both b6 files and c6 files.
#first, make the directory where the files will be saved
mkdir b6filesformonty
mkdir c6filesformonty
# grab column four of this "|"-delimited file and pipe it to fetchdocs, which fetches the documents and saves them in the folder "b6filesformonty"
cut -d "|" -f 4 b6files | ./fetchdocs.pl b6filesformonty
# do the same for the c6 files, saving the documents in the folder "c6filesformonty"
cut -d "|" -f 4 c6files | ./fetchdocs.pl c6filesformonty
This retrieved approximately 1800 documents, stored in two folders, each named for its subgroup.
Next, I ran the droid series on the folders of files.
#call the droid to get the file list
perl droidfilelist.pl c6filesformonty
perl calldroid.pl droidfilelist.xml
perl convertdocs.pl droidoutput.xml
Next I reran MontyLingua using test-6.py in order to get all of the analyzable data.
#run monty
python test-6.py ~/si618/c6filesconverted > ~/si618/week1rev/c6filesmontyedfull.txt
python test-6.py ~/si618/b6filesconverted > ~/si618/week1rev/b6filesmontyedfull.txt
Then I pulled the lines with nps and converted them to lists for each group:
grep "\bnp\b" c6filesmontyedfull.txt > c6listtemp.txt
grep "\bnp\b" b6filesmontyedfull.txt > b6listtemp.txt
Finally, I rearranged the columns and sorted the data:
# get only the noun phrases (column 5) in a list
cut -f 5 c6listtemp.txt > nounphrases.txt
# sort the nounphrases
sort nounphrases.txt > sortednounphrases.txt
#count the noun phrases and sort in descending order
uniq -c sortednounphrases.txt | sort -nr > sortedcountedc6.txt
#now do the same for b6 files
# get only the noun phrases (column 5) in a list
cut -f 5 b6listtemp.txt > nounphrasesb6.txt
# sort the nounphrases
sort nounphrasesb6.txt > sortednounphrasesb6.txt
#count the noun phrases and sort in descending order
uniq -c sortednounphrasesb6.txt | sort -nr > sortedcountedb6.txt
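The two passes above differ only in their file names, so the steps could be folded into one pipeline. This is a sketch, not the commands actually run: `count_phrases` is a hypothetical helper name, the sample rows are made up, and it assumes the input is tab-delimited with the noun phrase in column 5, as the cut calls above imply.

```shell
# Hypothetical helper: print the noun phrases in a tab-delimited file,
# each prefixed by its count, most frequent first.
count_phrases() {
  # Column 5 is assumed to hold the noun phrase; sort groups identical
  # phrases so uniq -c can count them, then sort -nr orders by count.
  cut -f 5 "$1" | sort | uniq -c | sort -nr
}

# Demo on made-up rows with five tab-separated columns.
np_demo=$(mktemp)
{
  printf 'd1\t1\t2\tnp\tdiversity\n'
  printf 'd1\t1\t3\tnp\tstudents\n'
  printf 'd2\t1\t1\tnp\tdiversity\n'
} > "$np_demo"

count_phrases "$np_demo"
```

With the files from this write-up, the calls would look like `count_phrases c6listtemp.txt > sortedcountedc6.txt` and `count_phrases b6listtemp.txt > sortedcountedb6.txt`.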
Click below to view the lists:
Academic Schools and Departments | University-wide Administrators |
728 that | 316 that |
It is revealing that the most frequent word for both author groups is "that". This suggests that removing stop words will make the results considerably more meaningful. It is also clear from the repetition of the word "we" in both lowercase and initial cap in the University-wide Administrators list that converting the noun phrases to lowercase before counting would be helpful (though I understand this is not appropriate in all situations).
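As a rough illustration of the two clean-up steps just described, the sketch below lowercases each phrase and drops phrases that consist solely of a stop word before counting. It is not the Assignment 2 method: `count_normalized` is a hypothetical helper name, the stop word list is a made-up two-word example, and the input is assumed to hold one noun phrase per line (like nounphrases.txt above).

```shell
# Hypothetical helper: $1 = file with one noun phrase per line,
# $2 = file with one lowercase stop word per line.
count_normalized() {
  # Lowercase everything, drop lines that exactly match a stop word
  # (-x whole line, -F literal strings, -f read patterns from a file),
  # then count and sort in descending order.
  tr '[:upper:]' '[:lower:]' < "$1" |
    grep -vxFf "$2" |
    sort | uniq -c | sort -nr
}

# Demo on made-up data.
norm_phrases=$(mktemp)
norm_stops=$(mktemp)
printf '%s\n' We we that Diversity diversity students > "$norm_phrases"
printf '%s\n' that we > "$norm_stops"

count_normalized "$norm_phrases" "$norm_stops"
```

Multi-word phrases that merely contain a stop word are left alone here; only phrases that are exactly a stop word are removed.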
I hesitate to draw conclusions based on the above, but for the sake of interest, I will note that both groups use the word "diversity" very frequently. Also, from the inverse placement of the words "students" and "University" in the lists, it would appear that University-wide Administrators are more concerned with the former, whereas actual Academic Schools and Departments are more concerned with the latter.
Other than that, I'd like to withhold further interpretation until removal of the stop words and conversion to lowercase are complete, which happens in Assignment 2.