SI601/618: Data Retrieval and Analysis Techniques

Jackie Cerretani, Fall 2007

Dissimilarity Matrices & Dendrograms

Abstract

Using dissim, I produced a dissimilarity matrix of from a list of documents and terms generated in the previous exercises using MontyLingua. The lists derive from a group of official documents written by two groups from the Diversity Kaliedescope data: Administrators and Academic Departments at universities. The goal of this exercise is to use the dissimilarity matrix to produce a dendrogram, which will help us examine the similarity boundaries present in the body of documents, and thus lend linguistic evidence to the question of whether the above groupings are arbitrary or meaningful.

Process Diary

First, I chose an arbitrary subset of documents from my two groups to analyze, the 50 largest from each. Because my earlier database files did not contain all the columns from the MontyLingua output, I chose to rerun it on my new subset of documents.

In order to be able to analyze all the fields together, but still easily see later which author/genre group they came from (in this case c6 or b6), I prepended a code at the beginning of each file name indicating the group.

First, I used a perl script to generate a database:

use DBI;

# connect to the database
my $dbh;
$dbh = DBI->connect( "dbi:SQLite:week6.dbl" ) || die "Cannot connect: $DBI::errstr";

# make nounphrase table
$dbh->do( "DROP TABLE nounphrases" );
$dbh->do( "CREATE TABLE nounphrases (id integer primary key, docname, docnumber, termnumber, termtype)" );

#open the flat file
open (IN, "100files.txt") || die "Couldn't open 100files.txt";

#make all the hashes
my %namenum;
my %termfreq;

while ($line = ) {

@vars = split(/\t/,$line);
$docname = $vars[0];
$docnumber = $vars[1];
$termnumber= $vars[2];
$termtype = $vars[3];

next if ($termtype ne "np");

$dbh->do("INSERT INTO nounphrases VALUES (NULL, '$docname', '$docnumber', '$termnumber', '$termtype')");

# then, if we haven't see this doc before
# record its name/number relationship

if (!$namenum{$docnumber}) {
$namenum{$docnumber} = $docname;
}

}

Then I made views in the database to get the required lists for dissim:

#make the namenumlist

CREATE VIEW namenum AS SELECT count(*) AS doccount, docname, docnumber FROM nounphrases GROUP BY docname;

CREATE VIEW namenumlist AS SELECT docname, docnumber FROM namenum;

# count the distinct word values present
# and choose only those who occur in two or more documents
CREATE VIEW termcount AS SELECT termnumber, count(*) AS termcount FROM nounphrases GROUP BY termnumber;
CREATE VIEW termcounttwoplus AS SELECT * FROM termcount WHERE termcount >= 2;

# get a list from of docnumber-termnumber pairs for nounphrases tables
# matching only to the 2+ count terms

CREATE VIEW doctermtwoplus AS SELECT nounphrases.docnumber, nounphrases.termnumber FROM nounphrases, termcounttwoplus WHERE nounphrases.termnumber=termcounttwoplus.termnumber;

From here, I used a perl script to grab the records from the database and print into a text file so I could run dissim on them.

#connect to the database
use DBI;

my $dbh;
$dbh = DBI->connect( "dbi:SQLite:week6.dbl" ) || die "Cannot connect: $DBI::errstr";

#open our eventual output file
open (OUT, '>dissim.txt');


#prepare and execute the query
#that will pick noun phrases and count them
$query=$dbh->prepare("SELECT * from doctermtwo");
$query->execute();

#print to output file
my @results;
while (@results = $query->fetchrow_array) {
$docnumber = $results[1];
$termnumber = $results[0];


print OUT "$docnumber\t$termnumber\n";
}

#close the query handle to avoid errors
#close the db
undef $query;
$dbh->disconnect();

I used the following script to count the number of documents and terms.

open(IN, "dissim1.txt") || die ("Can't open student file: $!\n");

$numdocs = 0;
$numterms = 0;
my %doccount;
my %termcount;

while ($linein = ) {
@vars = split(/\t/,$linein);
$doc = $vars[0];
$term = $vars[1];
print "$doc\t$term\n";
if (!exists $doccount{$doc}) {
$doccount{$doc}++;
$numdocs++;
}

if (!exists $termcount{$term}) {
$termcount{$term}++;
$numterms++;
}
}

print "Docs: $numdocs, Terms: $numterms\n";

Output 1

I ran dissim on the textfile created above to create the matrix using the following code.

~mcq/mdscaler/dissim -r 100 -c 7152 < dissim1.txt > matrix.txt

Click here for the output.

Then I ran the matrix through R to create my dendrogram using the following code adapted from the professor's example, with labeling techniques from Mark's assignment.

setwd("/Users/lostartmedia/Documents/_msi_classes/si618/week6")
table <- read.table("matrix.txt", header=FALSE)
matr <- as.matrix(table);
scalematr <- t(scale(t(matr)))
hr <- hclust(as.dist(1-cor(t(scalematr),method="pearson")),method="complete")
library(lattice)
library(stats)
label_table <- read.table("namenum.txt", header=FALSE)
par("ps"=8)
plot(hr, labels=label_table$V1)

Output 2

I created a few dendrograms, including the heatmaps included in the professor's code. However, the following was the most interesting. Click on the image (or here) for a pdf version whose labels are easier to read.

Interpretation

Each document in the dendrogram is labeled with the name of the document, preceeded by its grouping code (c6 and b6). Judging from the labels, it appears that the documents from both groups are evenly dispersed between the divisions. Indicating that, by this analysis, the original groupings aren't as lingustically different as one might imagine. With that in mind, I would choose to cut the dendrogram at the first or second division, because both show a high degree of association between their component documents.

Session II: Exploratory Data Analysis (html)

Unix Utilites, Large Corpora

SQLite databases

Data Display in R

Server Logs, IP parsing

Advanced Regular Expressions

Dissimilarity Matrices & Dendrograms

Session I: Data Manipulation (pdfs)

Parsing Large Text Files, Map Visualizations

Parsing Large Text Files

Regular Expressions and Tree Diagrams

Interacting with Large Data Sets

Parsing Server Logs

Scraping Data from Web Pages, Network Visualizations

Scraping Data from Web Pages, Multi-category Visualization

Parsing XML

Using APIs

Perl to CGI

Parsing Query Logs in SQL