Jackie Cerretani, Fall 2007
In this assignment, we access the data stored in our sqlite3 database and feed it to R to create a double-histogram visualization of how frequently words appear in the corpus for each author/genre group.
In homework 2, I created a database that contained the tables of noun phrases created by MontyLingua, so I first had to access that database from R and recreate the tables of term frequency in documents to be able to output them graphically.
#! Load the RSQLite library
library(RSQLite)
#! Instantiate the driver
m <-SQLite(max.con = 16, fetch.default.rec = 500, force.reload = FALSE, shared.cache=FALSE)
#! Connect to the database
lists <- dbConnect (m, dbname="week3.dbl")
#! Confirm connection by listing tables
dbListTables(lists)
#! Query the db for the b6 list and store in a variable
loadb6 <- dbSendQuery(lists, "SELECT phrase, count(*) as npcount FROM b6 GROUP BY phrase ORDER BY npcount desc")
while(!dbHasCompleted(loadb6)) {
datab6 <- fetch(loadb6, n=20000)
}
#clear the results handle to avoid R errors
dbClearResult(loadb6)
#! Query the db for the c6 list and store in a variable
loadc6 <- dbSendQuery(lists, "SELECT phrase, count(*) as npcount FROM c6 GROUP BY phrase ORDER BY npcount desc")
while(!dbHasCompleted(loadc6)) {
datac6 <- fetch(loadc6, n = 20000)
}
#clear the results handle to avoid R errors
dbClearResult(loadc6)
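The send/fetch/clear sequence above can be collapsed into a single call. The following is a minimal sketch, assuming a current RSQLite/DBI installation; the in-memory database and its three toy rows are hypothetical stand-ins for the homework database. It also sidesteps a quirk of the fetch loop, which would keep only the last batch if a table ever returned more than 20000 rows.

```r
library(DBI)
library(RSQLite)

# Toy in-memory database standing in for the homework file (assumption)
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "b6",
             data.frame(phrase = c("the sea", "the sea", "a ship"),
                        stringsAsFactors = FALSE))

# dbGetQuery() sends the query, fetches every row, and clears the result
datab6 <- dbGetQuery(con,
  "SELECT phrase, count(*) AS npcount FROM b6 GROUP BY phrase ORDER BY npcount DESC")
print(datab6)

dbDisconnect(con)
```

The result is an ordinary data frame with phrase and npcount columns, so the plotting code that follows would not need to change.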
Next I tried several different methods to generate output, including histbackback alone, histbackback for axes plus barplots, and several different methods of labeling the axes, including axis. Below is the process that generated the most successful output for me without consulting other students' homework. It wasn't perfect, though I am pleased with how far I got on my own.
#set the number of decimal points on our numbers
options(digits=1)
set.seed(1)
# get the package
require(Hmisc)
#generate the histbackback, even though the data doesn't really work
# because it sets up the margins nicely
histbackback(-datab6$npcount,datac6$npcount,xlim=c(-1000,1000))
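#specify the margins around the output, though this doesn't seem to be working either
#! (a bare assignment only creates a variable named mai; margins are set with par(mai=...) before the plot is drawn)
mai=c(3,3,3,3)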
#Now, let's overlay the vertical barplots on the histbackback basis so it's centered in the output
#! This plots the values, using the words on the axes: col names the color, beside places bars beside each other rather than on top of one another, horiz draws the bars horizontally, names.arg gives the labels, cex.names scales the label text, las is a par value that turns the labels horizontal, axes=FALSE suppresses the numeric axis, add draws onto the existing plot, width gives the width of the bars, and space sets the gap between them
barplot(datab6$npcount, beside=TRUE, horiz=TRUE, col="blue", cex.names=1.5, axes=FALSE, las=2, names.arg=datab6$phrase, add=TRUE, width=.3, space=.1)
barplot(-datac6$npcount, beside=TRUE, horiz=TRUE, col="orange", cex.names=1.5, axes=FALSE, las=2, names.arg=datac6$phrase, add=TRUE, width=.3, space=.1)
Output
I have yet to successfully label these margins without overlap, but here is the output so far:
Results and Interpretation
This output hasn't revealed much more than comparing two lists side by side, especially considering that I haven't been able to label the bars. I think it would be helpful here to compare these words as proportions of the total number of noun-phrase occurrences in the corpus. That would make the bars on each side of the central axis easier to visually compare.
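The proportion idea can be sketched in a few lines. This is only an illustration: the counts below are made up, and it assumes each group's counts are divided by that group's own total, which is one reading of "proportions of the total".

```r
# Hypothetical counts standing in for datab6 and datac6 (not real results)
datab6 <- data.frame(phrase = c("the sea", "a ship", "the captain"),
                     npcount = c(50, 30, 20))
datac6 <- data.frame(phrase = c("the house", "a letter"),
                     npcount = c(6, 4))

# Share of all noun-phrase occurrences within each group
datab6$prop <- datab6$npcount / sum(datab6$npcount)
datac6$prop <- datac6$npcount / sum(datac6$npcount)

print(datab6)
print(datac6)
```

Plotting prop instead of npcount would put both sides on the same 0 to 1 scale, so a group with far more text would no longer dominate the picture.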
I've learned from this assignment that R is a very powerful tool for information display, which isn't terribly hard to learn but requires enough time for immersion and study of its idiosyncrasies.
Bibliography
- Demo of barplot/histbackback mashup
- Explanation of how to use a par parameter to make horizontal labels on a barplot
- The ever-helpful R documentation
I learned that with many graphics functions, in addition to the arguments they specifically take, you can also pass general graphics parameters from the set documented under par.
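As a small illustration of that point, here is a sketch with made-up counts showing the same parameter (las) passed directly to barplot and set through par. The pdf(NULL) call is only there so the example runs without opening a plot window.

```r
pdf(NULL)  # null device: nothing is written to disk

counts <- c(alpha = 5, beta = 3, gamma = 1)  # made-up values

# las passed straight to barplot(): applies to this call only
barplot(counts, horiz = TRUE, las = 1)

# the same parameter set through par(): applies until it is reset
old <- par(las = 1)
barplot(counts, horiz = TRUE)
par(old)  # restore the previous setting

dev.off()
```

Not every parameter can be passed inline. Margin settings such as mar and mai are among those that can only be set through a par() call, and they need to be set before the plot is drawn.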
Addendum
The morning this assignment was due, I checked out others' homework. I wanted to do as much of this as I could without using others' processes so I could learn to problem-solve in this environment more independently. However, once I had done that, I also thought it would be helpful to see what they had done and improve my work based on it. So here is the revised graphics portion of my script, using some of the properties other students used, and my revised output.
Addendum Try #1
In this try, I just used additional properties of par and axis to fix the labels I had reached an impasse with above.
#use par to set properties of the graph
par("bg"="#ffffff") #! this sets the bg color
par("mar"=c(2.2,1,1,1)) #! this sets the margin -- we knew this before
par("ps"=12) #! this changes the point size of the font
par("fg"="#303333") #! this changes the foreground color (axes, boxes, and so on)
#! now, instead of drawing the histbackback for the margins, first draw the 2 barplots
#! many fewer params than I used in mine
barplot(datab6$npcount, horiz=TRUE, space=0, col="blue", xlim=c(-1000,1000))
barplot(-datac6$npcount, horiz=TRUE, col="orange", space=0,add=TRUE)
#! and then add the axes
#! first the b6 axis, offset by -2
axis(2, at=1:20, labels=datab6$phrase, pos=0, col.axis="black", las=2, tick=FALSE, hadj=-2, padj=1, mgp=c(3,0,0))
#! then the c6 axis (left side), offset by 2
axis(2, at=1:20, labels=datac6$phrase, pos=0, col.axis="black", las=2, tick=FALSE, hadj=2, padj=1, mgp=c(3,0,0))
#! 2 = left axis, 1:20 is for the 20 terms, labels defines the labels, pos locates the axis at the center, las orients the labels, tick=FALSE turns off ticks, hadj shifts the labels to either side of the center, padj lines them up with each bar, mgp removes the margin between labels and axis. Thanks Mark!
Addendum Try #1 Output
Here's the graphical output:
Addendum Try #1 Interpretation
Clearly, I still don't have the alignment of the labels perfectly right, but this is a big improvement over my last diagram. I did this first addendum differently than the other students because that is how I understood the original assignment: to put the most frequent noun phrases in each group side by side for comparison in the style of "You say...We say," i.e. to match up the bars by relative frequency rather than by word. However, I did find that matching the words side by side had its own benefits as a visualization, so I hope to recreate that in a second addendum (below) in the future.
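One possible way to tighten the label alignment: barplot returns the midpoint of each bar, and those values can be used for at instead of 1:20. The sketch below uses made-up three-row data frames (left and right are hypothetical names) and a null device, so it illustrates the idea rather than reproducing the homework plot.

```r
pdf(NULL)  # null device so the sketch runs without a plot window

# Made-up data (hypothetical, not the homework results)
right <- data.frame(phrase = c("a", "b", "c"), npcount = c(9, 5, 2))
left  <- data.frame(phrase = c("x", "y", "z"), npcount = c(8, 4, 1))

par(mar = c(2.2, 1, 1, 1))

# barplot() returns the midpoint of each bar along the category axis
mids <- barplot(right$npcount, horiz = TRUE, space = 0, col = "blue",
                xlim = c(-12, 12))
barplot(-left$npcount, horiz = TRUE, space = 0, col = "orange", add = TRUE)

# Place the labels at the bar midpoints rather than at 1:n
axis(2, at = mids, labels = right$phrase, pos = 0, las = 2,
     tick = FALSE, hadj = -2, mgp = c(3, 0, 0))
axis(2, at = mids, labels = left$phrase, pos = 0, las = 2,
     tick = FALSE, hadj = 2, mgp = c(3, 0, 0))

print(mids)
dev.off()
```

With space=0 and the default bar width of 1, the midpoints fall at 0.5, 1.5, 2.5 and so on, which is why labels placed at 1:20 sit half a bar too high unless padj is used to nudge them.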
Addendum Try #2
In this try, I'd like to mimic Yarun's query statement in order to try to get lists with matching words in them, and to order them by those words. As I'm visiting family, time for this extra work is slim right now (11/25/07) but I hope to work on it in the coming week.
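A query along these lines might be one starting point. This is a sketch under stated assumptions, not the other student's actual statement: it assumes the b6 and c6 tables each have a phrase column, as in the queries earlier, and uses a toy in-memory database with invented rows.

```r
library(DBI)
library(RSQLite)

# Toy in-memory database (assumption; the rows are invented)
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "b6",
             data.frame(phrase = c("the sea", "the sea", "a ship", "the wind"),
                        stringsAsFactors = FALSE))
dbWriteTable(con, "c6",
             data.frame(phrase = c("the sea", "a ship", "a ship", "the house"),
                        stringsAsFactors = FALSE))

# Count phrases per table, keep only phrases found in both, order by phrase
shared <- dbGetQuery(con, "
  SELECT b.phrase, b.npcount AS b6count, c.npcount AS c6count
  FROM (SELECT phrase, count(*) AS npcount FROM b6 GROUP BY phrase) AS b
  JOIN (SELECT phrase, count(*) AS npcount FROM c6 GROUP BY phrase) AS c
    ON b.phrase = c.phrase
  ORDER BY b.phrase")
print(shared)

dbDisconnect(con)
```

Each row of shared then carries one phrase with its count in both groups, so the two barplots could be drawn from b6count and c6count with a single set of labels.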