SI601/618: Data Retrieval and Analysis Techniques

Jackie Cerretani, Fall 2007

Advanced Regular Expressions

Abstract

In this assignment, I pre-filter the text files used in assignments 1 and 2 with regular expressions in order to produce more accurate and interesting results in Monty Lingua. I not only identify words and characters that should be removed, but also take care to remove them in an order that does not disrupt Monty Lingua's ability to discern sentence structure.

Process Diary

First, I looked at my original noun-phrase output from Monty Lingua to see what might bear tidying up.

Here is a list of what I chose to excise from the original files:

  • gif, jpg and png images, both names and extensions
  • numbers
  • strings of punctuation characters
  • translated HTML characters, like ” and • (also s, “It, “I, »)
  • words for numbers, such as "two", "three"
  • percentages (6%, 20%)
  • pipe-separated lists... but not sure how to do this one
  • email addresses

Here are some replacements I thought might be useful:

  • Merge Ph.D. and PhD
  • Merge U.S. and US (before lowercasing the words)
  • Merge E-mail and Email
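
The three merges above could be sketched as follows. This is a minimal sketch, not the code used for the assignment: the sub name normalize_terms is hypothetical, and choosing PhD, US and Email as the canonical spellings is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse spelling variants into one canonical form.
# Run it before lowercasing, otherwise "US" and "us" can no longer be told apart.
# Caveat: an abbreviation that ends a sentence loses its final period.
sub normalize_terms {
    my ($line) = @_;
    $line =~ s/\bPh\.D\.?/PhD/g;       # Ph.D. -> PhD
    $line =~ s/\bU\.S\.?/US/g;         # U.S.  -> US
    $line =~ s/\bE-mail\b/Email/gi;    # E-mail, e-mail -> Email
    return $line;
}

print normalize_terms("She earned a Ph.D. in the U.S. and lists her e-mail."), "\n";
# prints: She earned a PhD in the US and lists her Email.
```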

Once the noun-phrase lists are generated, there are some records to extract:

  • Single initials, such as A. or C.
  • single letters that are not words, such as "s" and "e" as opposed to "i"
  • Those with stop words
  • Those with pipes and commas
  • Single punctuation units
  • Noun phrases starting with comma-space, i.e. ", "
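
Most of the record filters listed above can be expressed as one predicate over a noun phrase. The sketch below is illustrative only: keep_phrase is a hypothetical name, treating "a" and "i" as the only single-letter words is an assumption, and the stop-word check is left out because it needs the stop-word list.

```perl
use strict;
use warnings;

# Hypothetical predicate: returns 1 if a noun phrase should be kept, 0 otherwise.
sub keep_phrase {
    my ($phrase) = @_;
    return 0 if $phrase =~ /^[A-Za-z]\.$/;     # single initial such as "A."
    return 0 if $phrase =~ /^[b-hj-z]$/i;      # single letter other than "a" or "i"
    return 0 if $phrase =~ /[|,]/;             # contains a pipe or a comma (covers a leading ", ")
    return 0 if $phrase =~ /^[[:punct:]]$/;    # a lone punctuation mark
    return 1;
}

for my $phrase ("student affairs", "A.", "s", "i", ", diversity", "-") {
    printf "%-16s %s\n", $phrase, keep_phrase($phrase) ? "keep" : "drop";
}
```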

Then I ran the following commands on just one of my groups (due to the slow processing of Monty Lingua) to compare the output with my previous lists.

Below I'll go through each regex I used, in order, with its output to show its effectiveness. Each regex is followed by the text it captured.

# first, strip email addresses
# (\b instead of ^ and $, so addresses inside a longer line are caught too)
$line =~ s/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b//gi;
sweaver@wcl.american.edu
gbfranklin@cmu.edu
rf51@andrew.cmu.edu
jbartter@bus.ucf.edu
jay@jayporcher.com
jbartter@bus.ucf.edu
jay@jayporcher.com
jbartter@bus.ucf.edu
# next, let's knock out images and their names
# this will kill some of the numbers
# we also take out their extensions, which are sometimes referred to on their own
# (\b keeps the second regex from eating "gif" out of a longer word)
$line =~ s/(\[?\w+\.(?:gif|jpg|png)\]?)//g;
$line =~ s/\b(gif|jpg|png)\b//g;
[greydot.gif]
[calendar.gif]
[emailicon.gif]
[head_2.gif]
[hed_gra.gif]
[gra_req2.gif]
[gra_fun2.gif]
[gra_ath3.gif]
[gra_div2.gif]
[gra_faq2.gif]
[skypurp.jpg]
[skyhead2.jpg]
[notespin.jpg]
strip.jpg]
[DSOVIEWASC.jpg]
063.jpg]
[spacer.gif]
[bullet2.gif]
# also, any string of numbers; doing this before punctuation
# so we catch percent signs, hyphens, periods later on
$line =~ s/\s(\d+)//g;
1995
467
30
28
21
13
5
41
7
13
10
8
2

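In the above, I learned that \b marks the boundary between a \w character and a non-\w character. Digits are themselves \w characters, so \b does not separate a digit from an adjacent letter or underscore (as in head_2.gif), and I matched on the preceding whitespace instead.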
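
A quick way to see how \b behaves around digits is to run both patterns over the same string. This is a small sketch with a made-up sample sentence, not text from the corpus.

```perl
use strict;
use warnings;

# Made-up sample line (an assumption, not taken from the assignment files).
my $text = "In 1995 we saw 20% growth on the 3rd floor; see head_2.";

# \b\d+\b: digits count as \w, so a boundary exists only where the digits
# touch a non-\w character. "3rd" and "head_2" therefore do not match.
my @bounded = $text =~ /\b(\d+)\b/g;

# \s(\d+): the pattern used above; it only requires whitespace on the left,
# so it also picks up the 3 in "3rd".
my @after_space = $text =~ /\s(\d+)/g;

print "bounded:     @bounded\n";       # bounded:     1995 20
print "after space: @after_space\n";   # after space: 1995 20 3
```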

# before we kill loose punctuation, let's kill
# the pipe-separated lists, which seem to be for links
$line =~ s/((\w+)\s+(\|)\s*)+//ig;
Life |
Life |
Life |
Archives |
Archives |
Stop |
Stop |
Policy |
Policy |
events |
events |
Resources |
Resources |
Committee |
Committee |
Resources |
Resources |
Committee |
Committee |
# find the umlaut-A translated html characters
# I wanted to do this just by capturing strings starting with the
# odd character, but it captured words next to them as well
# like this: $line =~ s/(,?\S+\s)//g;
# so instead, I had to grab each and every one!
# bleck.
$line =~ s/(‘|’|•||”|,|,(\S?))//g;
•
•
•
•
”

’
•
•
”
”
”
”
”
# and now punctuation -- first strings of several together
# and single ones with word borders
# originally I tried to include them all
#$line =~ s/[¦ñ÷~òøãá^]+//gi;
# but then decided excluding was easier -- very SI 500
# interestingly, it looks as if this catches all of the above umlaut-A characters
$line =~ s/([^\w\s.?!'",&():;-]+)//g;
óƾê
=Í}Î
ã
¹
ñû
æ
¾¾Æÿ
ÜãÃÏÌé
©ò#Ú~
´º
Ç»ÓÚÏ$
²
Ô>>¢Û

You can't see all the crazy characters here because of the HTML, but if you view source, you'll see everything it caught. I chose not to use the file provided by the professor for stripping characters because the characters in my documents were of a different variety (specifically, they all started with comma-umlaut-A).

I tried to grab web addresses, but was unsuccessful. Here are some of my tries:

# grab and nix web addresses, which might take out
# some of the images as well
#$line = m{^http://([^/:]+)(:(\d+))?(/.+)?$}i;
#$line =~ m/^(http:\/\/\S+)$/g;
#$line =~ s{(http(s*):\/\/)?(www\.)?((\S+\.)+)+([^.?!,]+)}{}gi;
#$line =~ m/(html|www)/gi;
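
Two things stand out in the attempts above: the ^ and $ anchors only allow a match when the address is the entire line, and the first attempt uses = (assignment) where =~ (binding) is needed. A minimal sketch of an unanchored version follows. The sub name strip_urls and the example.edu addresses are hypothetical, and the pattern is deliberately loose rather than a full URL grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: delete anything that looks like a web address.
sub strip_urls {
    my ($line) = @_;
    # Start at http://, https:// or www., then take non-space characters lazily,
    # stopping before any trailing punctuation that is followed by whitespace or
    # the end of the string, so sentence-final periods survive.
    $line =~ s{\b(?:https?://|www\.)\S+?(?=[.,;:!?)]*(?:\s|\z))}{}gi;
    return $line;
}

print strip_urls("Visit http://example.edu. Thanks"), "\n";
# prints: Visit . Thanks
```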

After I'd finished with these, I attempted to run them against the stopwords list, but was unsuccessful. Here is my code.

open (IN,"c6tmp9.txt") || die "couldn't open in";
open (OUT, "> c6tmp10.txt") || die "couldn't open out";

#read in the stopwords file for eliminating those records
open (STOPFILE, "stopwords.txt") || die "Cannot open $stopfile: $!";

#make a hash of stopwords
my %stopwordlist;
while (defined ($stopword = <STOPFILE>)) {
    chomp($stopword);
    $stopwordlist{$stopword}=1;
}
close(STOPFILE);

#check for stopwords
while ($line=<IN>) {
    foreach $phrase (%stopwordlist) {
        #print $phrase . "\n";
        if ($line = m/(\s+\d\s($phrase)\n)/gi) {
            next;
        } else {
            print OUT $line;
        }
    }
}

In the above, I attempted to use Perl variables inside a regex, though I'm not sure whether it worked.
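
Three details in the listing above would each keep it from working as intended. First, $line = m/.../ assigns the match result to $line instead of testing $line (the binding operator is =~). Second, iterating over %stopwordlist yields the values as well as the keys. Third, the print sits inside the foreach, so a line is written once for every stop word it does not match. The sketch below shows one possible repair under stated assumptions: it uses in-memory stand-ins for stopwords.txt and for the "count phrase" records, and it replaces the per-word regex with a hash lookup.

```perl
use strict;
use warnings;

# Stand-ins for stopwords.txt and for the "count phrase" records (both assumed).
my @stopwords = ('that', 'who', 'we', 'which', 'it');
my @records   = ("    316 that\n", "    290 diversity\n",
                 "    198 We\n",   "      4 student affairs\n");

# Build the lookup hash once; lowercased keys make the check case-insensitive.
my %is_stopword;
$is_stopword{ lc($_) } = 1 for @stopwords;

my @kept;
for my $line (@records) {
    # Pull the phrase out of the record; skip lines that are not "count phrase".
    my ($phrase) = $line =~ /^\s*\d+\s+(.+?)\s*$/ or next;
    next if $is_stopword{ lc($phrase) };   # drop records that are just a stop word
    push @kept, $line;
}

print @kept;   # prints the "290 diversity" and "4 student affairs" records
```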


After this, I followed the same procedure as in Homework 1, Revised to generate the noun phrase list, with the exception that I ran two additional scripts. The first is a simple one to lowercase the file before sorting and the second is a script to pull out the stopwords.

Output

The table below contains my original output for the Academic Schools and Departments group of files on the left, and my new output, after scrubbing the files with regular expressions before processing, on the right.

Original Results        Regex Results

316 that                4 student affairs
290 diversity           4 recruitment and retention
229 students            4 our nation
216 who                 4 our department
216 we                  4 multicultural education
212 which               4 graduate students
210 University          4 faculty and staff
198 We                  4 edu
190 it                  4 diversity diversity
183 Diversity           4 diversity and community

Interpretation

There is a very dramatic difference between these two sets, some of which is interesting and some of which is suspect. What is suspect is that the number of occurrences of the top terms is much lower. This could be because many words that were part of navigation, or were not actual parts of speech (but rather parts of bulleted lists), have been removed. Alternatively, it could be that my regexes were too greedy and ate more of the document than I expected. I was systematic in checking each one, though, so this would seem likely only if my testing methods were faulty. (Generally speaking, I tested each regex by running it on every line and printing out variables $1 to $10 to see what it had captured.) There is evidence both ways -- on one hand, there are coherent noun phrases on most lines; on the other hand, I'd be interested to read the documents where "diversity diversity" was a common phrase.
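
The capture check described above can be sketched as a small helper. One assumption worth flagging: after a s///g substitution, $1 holds only the capture from the last match, so this sketch walks the matches one at a time with m//g before anything is deleted. The name captured_by and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: list every piece of text a pattern's first capture
# group would grab from a line, without modifying the line.
sub captured_by {
    my ($pattern, $line) = @_;
    my @hits;
    # In scalar context m//g resumes after the previous match on each pass.
    while ($line =~ /$pattern/g) {
        push @hits, $1;
    }
    return @hits;
}

my @hits = captured_by(qr/\s(\d+)/, "In 1995 there were 467 students in 30 programs.");
print "$_\n" for @hits;
# prints 1995, 467 and 30 on separate lines
```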

If we can assume, for the moment, that my techniques were correct, we can comment on the content of the lists. What stands out most boldly to me is that the phrases in the regex list carry much more weight and tell a lot more about the authors (academic departments) and genre (official documents) than the original set. Where in the first we see mostly stop words, with just a few telling nouns, in the second we have almost all coherent phrases. These phrases seem to indicate that this group has an interest in its own context, both broad and narrow -- they speak about "our nation" and "our department." There is a greater weight on the academic members of the universities -- five of the terms mention some form of or action relating to students and faculty. Finally, the last term, "diversity and community," indicates an interest in the effect diversity will have on the nature of the group itself.

For my next assignment, I hope to clean up any malformed regular expressions used here, run this analysis on both groups, and thus be able to make a more meaningful comparison between the two.

Bibliography

Perl Manual Page on Regular Expressions
http://www.perl.com/doc/manual/html/pod/perlre.html

Email regular expression (for when you just can't figure out why it won't work!)
http://www.regular-expressions.info/email.html

Some details about perl substitution with regex
http://www.comp.leeds.ac.uk/Perl/sandtr.html

Help with using variables in replace regex
http://www.comp.leeds.ac.uk/Perl/sandtr.html
