SI601/618: Data Retrieval and Analysis Techniques

Jackie Cerretani, Fall 2007

In the fall of 2007, I took a two-course series on Data Retrieval and Analysis Techniques. This website houses all of my homework write-ups from those classes, including code, visualizations, and interpretations. It reflects how my skills evolved and how I learned to reason through the problems they can be applied to.

Here are the course descriptions from the University of Michigan School of Information website:

SI 601: Data Manipulation
Aims to help students get started with their own data harvesting, processing, and aggregation. Data analysis is crucial to evaluating and designing solutions and applications, as well as understanding users' information needs and uses. In many cases, the data we need to access is distributed online among many Web pages, stored in a database or available in a large text file. Often these data (e.g., Web server logs) are too large to obtain and/or process manually. Instead, we need an automated way to gather the data, parse it, and summarize it before we can do more advanced analysis. In this course, you will learn to use Perl and its modules to accomplish these tasks in a quick and easy yet useful and repeatable way. The companion half of this half-semester course, SI 618: "Exploratory Data Analysis," teaches how to further glean insights from the data through analysis and visualization.

SI 618: Exploratory Data Analysis
Aims to help students get started with their own data acquisition and analysis. Data analysis is crucial to evaluating and designing solutions and applications as well as to understanding information needs and use. Students in this course (who will have just completed SI 601: "Data Manipulation") will learn techniques of exploratory data analysis using scripting, text parsing, structured query language, regular expressions, graphing, and clustering methods to explore data. Students will be able to make sense of and see patterns in otherwise intractable quantities of data.

Session II: Exploratory Data Analysis (HTML)

Unix Utilities, Large Corpora

SQLite databases

Data Display in R

Server Logs, IP parsing

Advanced Regular Expressions

Dissimilarity Matrices & Dendrograms

Session I: Data Manipulation (PDFs)

Parsing Large Text Files, Map Visualizations

Parsing Large Text Files

Regular Expressions and Tree Diagrams

Interacting with Large Data Sets

Parsing Server Logs

Scraping Data from Web Pages, Network Visualizations

Scraping Data from Web Pages, Multi-category Visualization

Parsing XML

Using APIs

Perl to CGI

Parsing Query Logs in SQL