For this week’s part of the project, our plan was to gain access to the data that we will use in the coming weeks to create our map. The first step of the data-gathering process was to decide on and narrow down the type of data we hope to display. After some discussion, we settled on gathering the following variables: Name, Year, Location (State, City, Town), High School, Major, and Gender.
We began by searching the internet for available data. After some searching, we found the Alumni Directory, which gives us access to Carleton majors for the last hundred years; however, it displays no locational information. Because locational information is a key component of our project, we needed a supplementary source to provide it. In the end, PDFs of old Zoobooks were the key. The Zoobooks display all the other information that is not available in the Directory.
Our plan is to compile the two sources into one dataset by matching the names of individuals across both sources.
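As a rough sketch of how that matching might work, the snippet below joins two lists of records on a normalized name plus class year. Everything specific here is an assumption on our part rather than the real layout of either source: the field names (`name`, `year`, `major`, `city`), the helper names `normalize_name` and `merge_by_name`, and the choice to include the year in the key to reduce collisions between people who share a name.

```python
def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace so that
    'Jane  Doe ' and 'jane doe' compare equal."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())


def merge_by_name(directory_rows, zoobook_rows):
    """Join Directory records to Zoobook records on (normalized name, year).

    Both inputs are lists of dicts with hypothetical 'name' and 'year' keys.
    Returns (merged, unmatched): merged rows combine the fields of both
    sources; unmatched holds Directory rows with no Zoobook counterpart.
    """
    # Index the Zoobook rows once so each lookup is a dictionary access.
    zoobook_index = {
        (normalize_name(row["name"]), row["year"]): row for row in zoobook_rows
    }

    merged, unmatched = [], []
    for row in directory_rows:
        key = (normalize_name(row["name"]), row["year"])
        match = zoobook_index.get(key)
        if match is None:
            unmatched.append(row)  # keep for manual review
        else:
            # Directory fields win when both sources define the same key.
            merged.append({**match, **row})
    return merged, unmatched
```

Keeping the unmatched rows in a separate list means that spelling differences between the two sources show up as a list to check by hand instead of silently disappearing from the dataset.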
We have also sent emails to Alumni Relations, Carleton Archives, and Admissions. In the emails we asked for any data related to our subject that may already have been compiled.
At the moment we are running into two large problems, one for each of the two sources we plan to use for our data.
i) Alumni Directory:
While all the information here is easily available in digitized form, it is not gathered in one place, so going through and transcribing it all by hand would be overly time-consuming.
ii) Zoobooks:
The problem with the Zoobooks is the direct inverse of the one we have with the Directory: while all our information is in one place, it is locked in separate PDF files that are unusable in their current form.
Tools and Techniques
We are planning to use a PDF-to-text converter to make the Zoobooks usable. Once the information is in text format, we will use a Python script of our own to reformat the massive text blocks created by the converter and import them into an Excel spreadsheet.
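A minimal sketch of what that script might look like is below. We do not yet know what the converter output will look like, so the layout is purely an assumption: entries separated by blank lines, with the name on the first line, the high school on the second, and "City, ST" on the third. The function names `parse_entries` and `write_csv` and the column names are likewise placeholders. The sketch writes a CSV file, which Excel can open, rather than a native Excel workbook.

```python
import csv
import re

# Hypothetical column names for the output spreadsheet.
FIELDS = ["name", "high_school", "city", "state"]


def parse_entries(text):
    """Split converter output into one dict per student.

    ASSUMED layout (not confirmed against real Zoobook text):
    entries are separated by blank lines; line 1 is the name,
    line 2 the high school, line 3 'City, ST'.
    """
    rows = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # skip fragments that do not fit the assumed layout
        city, _, state = lines[2].partition(",")
        rows.append({
            "name": lines[0],
            "high_school": lines[1],
            "city": city.strip(),
            "state": state.strip(),
        })
    return rows


def write_csv(rows, out_file):
    """Write the parsed rows, with a header line, to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To use it on a real file, the converter output would be read into a string, passed to `parse_entries`, and the result handed to `write_csv` along with a file opened using `open(path, "w", newline="")`. The parsing rules will almost certainly need adjusting once we see the actual text.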
As for the Alumni Directory, we are hoping to obtain the raw data from Alumni Relations. If Alumni Relations proves fruitless, then we may have to turn to data scraping.
We expect to be a little behind. However, as long as we get all our data over the course of the week, we should be back on track for data scrubbing and formatting next weekend.