SI601/618: Data Retrieval and Analysis Techniques

Jackie Cerretani, Fall 2007

SQLite Databases

Abstract

In the last assignment, I pulled approximately 1500 documents related to diversity from university websites. I chose to look at documents that were of the "offically/politically" speaking genre, written by two different audiences: University-wide administrators adn Academic schools and departments.

In this assignment, I take the full list of nounphrases for each user/author group gathered in assignment 1 and put them in an sql database using sqlite in a perl script. I then query the database and output a list of noun phrases and their frequency in the corpus using a second perl script.

Process Diary

The following process is repeated for both c6 documents and b6 documents, so I will only include one set here for deomonstration.

Script 1

These scripts can be viewed at c6makedb.pl and b6makedb.pl.

First, I read in a stopwords file and created a hash to use for comparison as I read in lines from the input.

#read in the stopwords file for eliminating those records
open (STOPFILE, "stopwords.txt") || die "Cannot open $stopfile: $!";
#make a hash of stopwords
my %stopwordlist;
while (defined ($stopword = )) {
chomp($stopword);
$stopwordlist{$stopword}=1;
}
close(STOPFILE);


# test the hash
#for $stopword (keys %stopwordlist) {
# print "$stopword\n";
# }

Next, I opened both the input file and a connection to the database via the SQLite module.

#open the c6 file
open(IN, 'c6listtemp.txt') || die "Can't read in file\n";
#connect to the database
my $dbh;
$dbh = DBI->connect( "dbi:SQLite:week3.dbl" ) || die "Cannot connect: $DBI::errstr";

Next I made the table and defined the fields.

$dbh->do( "DROP TABLE c6" );
$dbh->do( "CREATE TABLE c6 (id integer primary key, docname varchar(255), author, genre, phrase, phrasecount)" );


Next, I read in the file, compared the noun phrase to the stop words, and if they didn't match, inserted the variables from the line in to the table.

#read in file
my $line;
while ($line=) {
#lowercase the line
$line = lc($line);
($docname, $qnum, $qnum2, $qnum3, $phrase, $phrasecount) = split(/\t/, $line);
print $docname . "\n" . $phrase . "\n" . $phrasecount;


#check for stopwords
unless (exists($stopwordlist{$phrase})) {

#insert into the db
$dbh->do("INSERT INTO c6 VALUES (NULL,'$docname', 'c','6', '$phrase', '$phrasecount')");
}

}

Script 2

Next, I used a second script to query the database and output the list.The scripts are c6query.pl and b6query.pl.

#connect to the database
my $dbh;
$dbh = DBI->connect( "dbi:SQLite:week3.dbl" ) || die "Cannot connect: $DBI::errstr";

#open our eventual output file
open (OUT, '>c6comparelist.txt');


#prepare and execute the query
#that will pick noun phrases and count them
$query=$dbh->prepare("SELECT phrase, count(*) as npcount FROM c6 GROUP BY phrase ORDER BY npcount desc");
$query->execute();

#print to output file
my @results;
while (@results = $query->fetchrow_array) {
$count = $results[1];
$phrase = $results[0];


print OUT "$count\t$phrase\n";
}

#close the query handle to avoid errors
#close the db
undef $query;
$dbh->disconnect();

Output

The files can be viewed here and here.

Academic Schools and Departments

University-wide Administrators

478 diversity
316 students
265 university
180 college
167 school
155 faculty
139 people
135 office
134 edu
132 1

1059 diversity
929 university
712 students
543 office
431 campus
420 faculty
402 people
375 edu
336 members
308 1

Results and Meaning

Though the differences between these two lists is small, it's interesting to note where they diverge. On both lists, "diversity" takes the top spot, however, after that they differ. My original hypothesis was that Academic Schools and Departments (AS&D) are more student-focused, while University-wide Administrators (UWA) are more organizationally focused. The placement of "students" before "university" at spots 2 and 3 in the AS&D lists, and vice-versa in the UWA list, seems to support this.

We see further evidence in the words in places 5 and 6. For Academic Schools and Departments, "college" and "school" occupy these spots, while for University-wide Administrators, "office" and "campus" are there.

"Faculty" occupies position 6 on both lists, and "people" occupies position 7 in both as well. After that, the appearance of "edu" and the number 1 suggests that perhaps domain extensions and numbers need to be added to the stopwords list.

Finally, one puzzling list item "members" appears at spot 9 for the Administrators. Could this be a generalized self-reference term used by Administrators but not by those in Academic departments?

Further Analysis

Right now, these figures are in absolute quantities, which makes the comparison shaky due to the fact that one group has almost three times the number of documents than the other. I think it would be more revealing to compare the lists if they were changed to show relative frequency of a nounphrase in comparison to all the others in the list, thus abstracting away the difference in volume of documents in each corpus.

Session II: Exploratory Data Analysis (html)

Unix Utilites, Large Corpora

SQLite databases

Data Display in R

Server Logs, IP parsing

Advanced Regular Expressions

Dissimilarity Matrices & Dendrograms

Session I: Data Manipulation (pdfs)

Parsing Large Text Files, Map Visualizations

Parsing Large Text Files

Regular Expressions and Tree Diagrams

Interacting with Large Data Sets

Parsing Server Logs

Scraping Data from Web Pages, Network Visualizations

Scraping Data from Web Pages, Multi-category Visualization

Parsing XML

Using APIs

Perl to CGI

Parsing Query Logs in SQL