Jackie Cerretani, Fall 2007
In this assignment, I analyze a server log showing the visits over one week to a set of online slideshows on mickmcquaid.com. I looked at the popularity of individual photos and galleries, the frequency of visits from different IP addresses, and the geographic and ownership data associated with those IPs. I display results in a treemap, two geographical maps and a bar chart. Then I collaborate with Vahed to investigate the 80/20 rule as it relates to IP addresses viewing the most popular photos.
I first needed to parse the server logs, save the data into an sql database, and gather some basic information about them.
#!/usr/bin/perl
use warnings;
use DBI;
#open the photologs file
open(IN, 'photo-logs.txt') || die "Can't read in file\n";
#connect to the database
my $dbh;
$dbh = DBI->connect( "dbi:SQLite:week4try5.dbl" ) || die "Cannot connect: $DBI::errstr";
# make the database table for the photolog
$dbh->do( "DROP TABLE photolog" );
$dbh->do("CREATE TABLE photolog1 (id integer primary key, ip, ident, user, date, time, timezone, requestedfile, protocol, status, size, referrer, mozilla, mystery5 )" ) || die "Couldn't make the db";
#read in file and parse using regex
while () {
my ($ip, $ident, $user, $date, $time, $timezone, $requestedfile, $protocol, $status, $size, $referrer, $mozilla, $mystery5) =
/^(\S+) (\S+) (\S+) \[(\S+):(\S+) -(\S+) "(\S+) (\S+) (\S+) (\S+) (\S+) "(\S+)" "(.*)" "(\S+)"$/;
# save variables into the photolog table
$dbh->do("INSERT INTO photolog1 VALUES (NULL, '$ip', '$ident', '$user', '$date', '$time', '$timezone', '$requestedfile', '$protocol', '$status', '$size', '$referrer', '$mozilla', '$mystery5')");
}
$dbh->disconnect;
close(IN);
Next, I created a table with the frequency of views of photos.
# make photofreq table
$dbh->do( "DROP TABLE photofreq" );
$dbh->do( "CREATE TABLE photofreq (photoname varchar(255), photocount)" );
#make a selection from the db
$query=$dbh->prepare ( "SELECT requestedfile, count(*) as filecount FROM photolog GROUP BY requestedfile ORDER BY filecount desc");
$query->execute() || die "query didn't execute";
# save results to new table photofreq
my @results;
while (@results = $query->fetchrow_array) {
$photoname = $results[0];
$count = $results[1];
$dbh->do("INSERT INTO photofreq VALUES ('$photoname','$count')");
print "$photoname\t$count\n";
}
From here, I ran Visitors on my culled log file. Here is the resulting file.
./visitors ../photo-logs-culled.txt > ~/HTML/week4rpt.html
Next I ran GeoIP to get the locations of my visitors, including city, state, country, zip code, area code latitude and longitude, among a few other parameters. See the first output and the count list.
# run the app
test-geoip-city > tmp2
# then get the ip addresses and count them
sort -n tmp2 | uniq -c | sort -n
Process
Using the data above, I wrote a perl script that retrieved the top 10 most visited photos, and then counted the number of times each IP viewed each photo. Here is the script. With this script, I generated a treemap. Click on the image below to see the interactive visualization.
Interpretation
This tree diagram is interesting, because it allows one to see both which IPs saw each photo most often, and which IPs were active across multiple photos. You can also glean which photos were visited most often, although in this case, since all the photos were visited a simlar number of times, it's harder to detect visually.
Process
For this, I created a simple visualization showing the number of IPs from various countries represented in the server logs. I used a simple script to access the database:
create view countrycount AS SELECT count(*) as countrycount, country FROM countip GROUP BY country;
I then used both a world map with bubbles corresponding to size, and a bar chart, to display the data. Click the image below to see the interactive visualization.
Number of IPs visiting slideshows by country of origin

Interpretation
The map is overkill for such a small dataset; it is easier to see the relative relationships by looking at the bar chart. Clearly the majority of IPs visiting the slideshows were from the US.
It is important to note that these data displays only tell us how many IPs were from each country, now how much activity came from the IPs in those countries.
Process
Again, I made a view to gather information about the number of IPs from different states.
create view statecount AS SELECT count(*) as statecount, statname FROM countip GROUP BY statname;
I then used a US map with bubbles to represent the number of IPs from each state. Click on the image below to see the larger visualization.

Interpretation
This reveals that the majority of IPs were from Michigan. The near equal size of the bubbles on the rest of the states accurately reveals that the number of IPs from each those places were nearly the same. However, it does not tell us which state's IPs were responsible for the majority of activity. Instead of exploring that question, however, I chose to work on the next visualization, which was more challenging and resulted in much more learning about sqlite3, R, perl and data comparison.
Process
Vahed and I worked together to investigate the 80/20 rule. First we found the top 22 most visited photos the set (any photo with more than 13 views). Then we found the top 20% most active IP addresses. Next, we found the total number of views of the top 20 photos by the top 20% most active IP addresses. Finally, we plotted these together on a stacked bar plot in R.
Here's how we created the appropriate views and tables from the sql database, working from the command line in sqlite3.
# select only the photos viewed more than 13 times # and create a view of them called "topphotos" CREATE VIEW topphotos AS SELECT * FROM photofreq WHERE photocount >= 14;
# grab previous output from geoip above and put into a sql table
CREATE TABLE ipgeodata (ip, country, statecode, statname, city, zip, lat, long, dmacode, areacode, timezone);
.import allipscopy.txt ipgeodata
# now we create a view of the ips counted by frequency
# and select the top 21, which is 20% of the total
CREATE VIEW topip AS SELECT * FROM (SELECT count(*) as ipcount,ip,country,statname,city FROM ipgeodata GROUP BY ip) AS allip WHERE ipcount >14 ORDER BY ipcount DESC;
# next, do a subselect from main table of photos viewed only # by the top 21 ips, and then count the total occurrences for each photo
CREATE VIEW top21photo AS SELECT count(*) as top20freq, photo FROM (SELECT photolog1.requestedfile AS photo, photolog1.ip FROM photolog1 INNER JOIN topip ON photolog1.ip = topip.ip) AS temptable GROUP BY photo ORDER BY top20freq ASC;
#next join the tables to get photo, total views, views from top 20% ips
CREATE VIEW result1 AS SELECT topphotos.photoname AS photo, topphotos.photocount AS totalcount, top21photo.top20freq AS top21count FROM topphotos INNER JOIN top21photo ON topphotos.photoname = top21photo.photo ORDER BY totalcount DESC, top21count DESC;
Finally, we wrote a script for R that created stacked barplots and labels, drawing directly from the sql database.
#go to the correct director
setwd("~/Documents/_msi_classes/si618/week4")
#! Load the RSQLite library
library(RSQLite)
#! Instantiate the driver
m <-SQLite(max.con = 16, fetch.default.rec = 500, force.reload = FALSE, shared.cache=FALSE)
#! Connect to the database
lists <- dbConnect (m, dbname="week4try5.dbl")
#! Confirm connection by listing tables
dbListTables(lists)
#! Query the db for the b6 list and store in a variable
query <- dbSendQuery(lists, "SELECT * FROM result1")
result1 <- fetch(query)
#! Expand right side of clipping rect to make room for the legend
par(xpd=T, mar=par()$mar+c(0,0,0,4))
#! make the barplots
barplot(result1$totalcount, col="blue",ylim=c(0,20))
barplot(result1$top21count, col="red", add=TRUE)
#! define tick marks along left axis
axis(2, at=1*0:20)
#! add the photonames as labels along the bottom axis
text((.5:22)*1.19,5, srt=90, adj=1, labels=c(result1$photo), xpd=T, cex=0.8)

Interpretation
Though inexact (since we have not actually calculated standard deviation and error), we feel that our graphical output indicates that the most active IP addresses did comprise a significant portion of the total photo views. We can see by the bars that in for 90% of the photos, high-frequency IPs comprise more than 50% of all views. Vahed also says that we can imagine this plot as a birpartate graph, with one part as the IPs and the other as the photos and with the directed edges showing a visit. If you do this, one can easily see that good hubs point to good authorities (i.e. good photos are visited by active visitors) and good authorities are pointed to by good hubs.
Many Eyes -- really great data display and visualization for the everyman
The R documentation, but with the TOC in frames -- much easier to use
Producing Simple Graphs with R -- incredibly helpful for figuring out the barplots and finessing the axes.
Really good tips for debugging perl
A short summary of the SQL Select statements
Session II: Exploratory Data Analysis (html)
Unix Utilites, Large CorporaSession I: Data Manipulation (pdfs)
Parsing Large Text Files, Map Visualizations