3A: Big Data and Digital Humanities

In-Class Project

Last week we tore furiously through the front-end of web development — HTML, CSS and JavaScript.  But we are not learning coding here, we are doing Digital Humanities. Today let’s look at how we can put those skills into practice for humanities data to tell a story.

One of the longest-running types of applications is an interactive timeline.  We will do a class project to convert a flat timeline into an interactive one using an easy-to-use application, but there are other tools that require more coding and show you more of how these work.  Take a look at the SIMILE timeline tool for an example of one of those.

Today, we are going to use the beautiful TimelineJS framework to give the college archive’s Timeline of Carleton History a web 2.0 overhaul.  This timeline was created in 1991 for the college’s 125th anniversary celebration, and while the content is still great, the presentation could use an update as we near the 150th anniversary in 2016.  Archivist Tom Lamb has given us permission to use the timeline as our dataset, and we are going to build a new, dynamic JSified instance of it as our first group project.

The timeline is broken into five date ranges between 1866 and 2002.  I have set up the first one as an example, and it is each group’s task to replicate this work for your own date range by doing the following:

  1. Go to the TimelineJS page and follow the 4 step instructions to Make a New Timeline
    • In Step 2, whoever downloads the Google template should share it with the other group members so that all may edit
    • In Step 4, copy the embed code and paste it into a new jsBin.  This is where you will work on your own date ranges for now, and we will combine them all together next week.
  2. Once you are set up, delete the template data and move over your group’s data from the Timeline of Carleton History. The dates and captions should come over with an easy copy/paste, but then you’ll probably need to finesse the rest of the data a bit.
    • You might need to change the number format of the Date columns to a different date display or even Number > Plain Text to get them to display and order properly
    • All entries should have a brief headline that summarizes the text on that date’s card, which you’ll need to write
    • Where there are images on your page, click them to bring up the full resolution version in the fancybox viewer, then use your DevTools knowledge to find the image URL to paste in the appropriate Media column in the sheet
    • Where there are no images, see if you can insert a Google Map if appropriate.  Or search the Carleton Archives Digital Collections to find other appropriate photos or scanned documents
      • NB: All Media should have a Media Credit, which will usually be “Archives, Carleton College Gould Library”
    • Finally, explore what happens to the timeline when you use tags to categorize events.  I used buildings and people as two basic categories on my example
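
Before pasting a batch of rows into the sheet, it can help to sanity-check them. The sketch below is only an illustration of the checks described above (every entry needs a headline, every Media entry needs a Media Credit, and dates must sort chronologically rather than as text). The field names `start_date`, `headline`, `text`, `media` and `media_credit` are hypothetical stand-ins, not the template's actual column headers, and the rows are placeholders rather than real timeline entries.

```python
from datetime import datetime

# Hypothetical field names for illustration only; the real template's
# column headers may differ.
REQUIRED_FIELDS = ("start_date", "headline", "text")


def parse_date(text):
    """Parse a month/day/year string such as '9/1/1867' into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def check_rows(rows):
    """Return (row_number, problem) tuples for rows that need attention."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                problems.append((number, f"missing {field}"))
        # Every media item should carry a credit line.
        if row.get("media") and not row.get("media_credit"):
            problems.append((number, "media has no media_credit"))
    return problems


def sort_by_date(rows):
    """Sort rows chronologically by parsing the date instead of comparing text."""
    return sorted(rows, key=lambda row: parse_date(row["start_date"]))


# Placeholder rows, not actual entries from the Timeline of Carleton History.
rows = [
    {"start_date": "10/12/1900", "headline": "Placeholder headline B",
     "text": "Placeholder event B", "media": "https://example.org/b.jpg",
     "media_credit": ""},
    {"start_date": "9/1/1867", "headline": "",
     "text": "Placeholder event A", "media": "", "media_credit": ""},
]

for number, problem in check_rows(rows):
    print(f"row {number}: {problem}")

print([row["start_date"] for row in sort_by_date(rows)])
```

Sorting the raw strings would put "10/12/1900" ahead of "9/1/1867" because "1" sorts before "9", which is the same kind of mis-ordering that changing the column's number format is meant to avoid.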

If you’re in doubt or stuck, post a comment, ask a question on your blog or (as a last resort) send an email and we’ll try to help each other out.


Big Data

Big Data generally refers to extremely large datasets that require demanding computational analysis to reveal patterns and trends, such as the map below generated from the data in millions of Twitter posts. We are producing reams of this data in the 21st century, but how do we analyze it from a humanities perspective?  How do we perform these sorts of analyses if we are interested in periods before regular digital record keeping?

World travel and communications recorded on Twitter

Enter digitization and citizen science initiatives.  One of the major trends in Digital Humanities work is the digitization of old records or print books that are then made searchable and available online for analysis.  Google Books is the most well-known project of this type, and we also read Tim Hitchcock’s article about his pioneering historical projects in this arena, e.g. the Old Bailey Online and London Lives.  These projects took years to build and required the dedicated paid labor of a team of scholars and professionals.  But there’s another model out there that relies on the unpaid labor of thousands of non-expert volunteers who collectively are able to do this work faster and more accurately than our current computers: crowdsourcing.

Zooniverse is a crowdsourcing initiative that bills itself as “the world’s largest and most popular platform for people-powered research.”  This platform takes advantage of the fact that people can distinguish detailed differences between images that regularly trip up computers, and empowers non-experts to contribute to serious research by reducing complex problems to relatively straightforward decisions:

  • is this galaxy a spiral or an ellipse?
  • is this a lion or a zebra?
  • is this the Greek letter tau or epsilon?
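
When several volunteers answer the same simple question, their answers still have to be combined somehow. One common approach is a majority vote with an agreement threshold, sketched below. This is a generic illustration and not a description of how Zooniverse or any particular project aggregates its data; the function name `consensus` and the 0.6 threshold are assumptions chosen for the example.

```python
from collections import Counter


def consensus(answers, min_agreement=0.6):
    """Return the most common answer if enough volunteers agree, else None.

    min_agreement is an arbitrary illustrative threshold, not a value
    taken from any real project.
    """
    if not answers:
        return None
    # Treat "Spiral" and " spiral " as the same answer.
    normalized = [answer.strip().lower() for answer in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_agreement:
        return top_answer
    return None  # too much disagreement; flag for expert review


print(consensus(["spiral", "Spiral", "ellipse"]))
print(consensus(["tau", "epsilon"]))
```

Items that come back as `None` are the ones where volunteers disagree, and those are natural candidates for a closer look by the research team.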

The project that Evan and his team just launched, Measuring the Anzacs, seeks to study demographic and health trends in the early 20th century by transcribing 4.5 million pages worth of service records from the Australian and New Zealand Army Corps during WWI.  This data would take countless years to process with a small team of researchers, but as Evan told us, they hope to speed up this process tremendously by taking advantage of the fact that there are lots of people who have access to a computer, speak English and can read handwriting.

Tim Hitchcock ended his piece with a conundrum:

How to turn big data in to good history?  How do we preserve the democratic and accessible character of the web, while using the tools of a technocratic science model in which popular engagement is generally an afterthought rather than the point.

The Zooniverse model has taken a major step towards resolving this tension and turning formerly restricted research practices into consciously public digital humanities work.

 


Assignment

Explore the Measuring the Anzacs project and work your way through at least one document, marking and transcribing the text.

When you’re done, post a brief comment below giving some feedback on the process.  Were the instructions easy to follow?  Was the text easy to transcribe?  Did you feel like you were making a real contribution to the project?  What did you get out of the project, from a humanities perspective?  Did you come away with a greater understanding of either the research process or the lived experience of the individual people whose records you were working with?

14 Replies to “3A: Big Data and Digital Humanities”

  1. I actually found myself getting really distracted by what I was reading. That’s certainly not a bad thing; I’m sure the researchers want people to engage with the material. In fact, if they were keeping track of how much time people spent on each page, it might provide an interesting meta-analysis that could suggest how interested people will be in the project as a whole.

    As far as the site itself goes, I thought the little blurb they had at the beginning was concise and to the point. I jumped right in, the interface was smooth, and the right hand menu guided me through collecting the information. The only UI elements that were a little iffy for me were the buttons that appeared when you hovered over text that had already been marked. They were sometimes hard to click.

    I think it’s really smart to have multiple people go over each document. That data should allow the researchers to get an idea of which pages are most legible and easily understood. Pages that have a lot of competing mark-up or transcription will need further examination, but they may also be hard or impossible to decipher.

    The process that this team has come up with not only furthers their research, it also provides a meaningful experience in and of itself.

  2. I really enjoyed transcribing the pages and thought that it was a cool way to interact with historical documents and really see what they said (instead of just looking at them and going “Mm, yes, old.”) It is a bit odd how incomplete some of them were, especially regarding the planter and laborer’s documents I was given to mark. Also, on the second page of the History sheet, the prompts on the right hand side didn’t draw attention to the top most category of that sheet so I was forced to put those marks under “other”. That seemed odd to me, and makes me wonder if that part of the form is of less interest to the researchers or they just forgot to incorporate that part in their instructions(?)

    I will explore further

  3. I found the transcription process to be surprisingly interesting. The instructions, though confusing at first, were simple enough to follow. I started with transcribing documents, looking at what seemed like promotion papers and troop reinforcement requests; some of the letters were difficult to decipher, with the words often being abbreviated or incomplete. After a few transcriptions, I gave marking up the papers a shot, which went a lot smoother as I was mainly scanning for printed instead of handwritten material. I got to mark up a History Sheet of a recipient of the British medal of honor, which was pretty awesome. I think it’s interesting to examine the primary source material that we would otherwise not have access to and come away from the process with more respect for the data transcribers out there. (It’s hard work.)

  4. Interacting with old documents was an interesting experience that I have never had before. At first I did not know whether I should choose marking or transcribing first. For marking, I felt I was making a real contribution marking each section that corresponds to each category. For transcribing, however, I had a very hard time reading or deciphering cursive script, so I sadly marked most of them illegible. But it gave me a good understanding of how the process of digitizing old documents works. You first categorize fields in the document and then enter the data that correspond to each field. Also, it gave me an impression of how painful it is to read through and transcribe tons of those cursive documents.

  5. I thought that it was interesting to interact with these documents rather than just look at them. I was certainly a lot more curious about Mr. Titchener and what happened to him after reading through his papers than I would have been had I simply seen them in a museum. I didn’t think that the instructions were as helpful as they could have been. For instance, even though nearly every Historical Sheet that I came across has notes covering sections I was supposed to highlight, I wasn’t sure how I should record that fact or if I need to at all. A video example might have been a more effective way to demonstrate how to mark and transcribe these documents. As for transcribing I was pretty much worthless. I could pretty much only transcribe the print.

  6. The instructions for marking were very easy to follow. At each step in the process I had to make a simple yes/no choice, or pick the best suited marked out of a few options. There really wasn’t any point where I wasn’t sure if my marking was correct or not. I enjoyed this part, and felt that the results would be fairly accurate for most markers.

    Transcribing, on the other hand, was a complete mess. I felt I was not very helpful with my contributions. Many times I was asked to transcribe a large table. First, I don’t know how they want the table data organized, and I felt far too intimidated to try and tackle a transcription like that myself. There were other sections where I couldn’t make out a few letters. I wasn’t sure if it’s more important for everything to be transcribed, or if it was important that the transcriptions be completely accurate. Certain highlighted sections to be transcribed seemed to be a bit blurry or over-zoomed in. I also have an especially hard time reading cursive, because I almost never see it written anymore.

    After participating in the transcribing, I felt that the results would have a fair degree of inaccuracy to them. That being said, projects of this magnitude cannot be done by an individual. Text recognition algorithms and other forms of classification may not be accurate enough to perform these kinds of tasks yet, so human outsourcing makes sense for this type of project.

    I thought it was pretty fun to view these primary sources. Something about the old WWI era made it especially enjoyable for me. Going through and marking some documents really sparked my imagination.

  7. This was an interesting experience for me. I found that I quite enjoyed the transcription tasks — I enjoyed trying to read what was often near-illegible scrawl. The instructions were clear for both that and the marking task, however I didn’t enjoy the marking ones. I clicked through perhaps five documents, and each of them had already been completed by previous users. There were multiple rectangles dragged over each of the responses on the form, and I didn’t particularly want to be the fourth person to identify his name, DOB, address, etc. I understand that there is merit to the concept of reliability through redundancy, but perhaps they should take away the previous markings if that is their goal (given that you would want independent responses).

  8. There wasn’t really much material that required instructions since most of it was straightforward, but it did take me a few minutes to feel out the site. As has been mentioned in other comments, the marking portion was kind of disheartening seeing as all the documents had already been marked thoroughly, which made me unsure as to if I was supposed to mark something new or not. The transcription portion was more interesting (when legible) especially the annotations of wounds. At some points however there were parts of sentences or sections where I couldn’t make out one of the words or a number or so on, and was again unsure whether it would be more useful to click the illegible button and skip the entire thing or to transcribe what I could of it and replace the words I couldn’t get with a question mark or something. The transcription process on its own brought to light some of the issues with the marking part, since many times the site would prompt you to transcribe an empty box or a data set that wasn’t highlighted. I guess they could have added more instructions in regards to their personal preferences and how to deal with some of the issues one might run into so as to make it easier on their end.

  9. Personally, I felt as though this experience really brought to the foreground the potential of digital humanities and collaboration. Furthermore, the entire experience also felt really human. There is something slightly shocking about reading through the history sheets and finding details about a person who died during a war you have only read about. Reading through these sheets and seeing the information they filled out, from their next of kin to their age, puts a much more human perspective on the lives of people who otherwise would simply be a number in a history book. There is something so much more empathetic about the reality which these sheets present. I could easily imagine myself in the same situation filling out the forms and the brooding fear I would have. In terms of the experience itself I found marking to be much easier than transcribing. The areas which had to be marked were often obvious and legible whereas the handwriting was much harder to decipher. However, I guess the illegibility of the handwriting sort of necessitates the transcription so my complaining is sort of irrelevant. Nonetheless, the collaboration gave me a new perspective on the potential for digital humanities and the advancement and cataloging of knowledge in general. Overall the experience was an interesting one which I think could provide an interesting twist for history students especially, offering a more human approach to studying.

  10. The instructions were very straightforward and easy to understand. I loved how the website made it very easy for the reader to transcribe what was shown on the page by highlighting the part they were asking you to transcribe. The instructions were very intuitive, i.e. highlighted text plus a question asking you about said text. I found that the answer box and the few buttons at the bottom of the text box were super simple to understand. The people who made this program obviously thought about making the process as simple as possible, to minimize areas of different interpretations. The buttons provided all the choices needed to go through the transcription. Obviously, when there are specific cases of problems the buttons won’t be enough to solve them, but as a general problem solver they do get the job done.

    The text was easy to transcribe to a certain extent. There were times when the hand writing was super clear, and there were times when I had to guess what they were writing. But there were also other times when the text was completely unreadable, and so I had to click the “unintelligible” button at the bottom of the text box to skip the sentence. Again transcribing handwriting can be a hit or a miss. The degree of legibility of handwriting varied greatly between different texts. I could see why computers and programs would have a hard time transcribing correctly the text because sometimes I had to infer the text or make an intelligent guess on what the text was, so definitely it required some human pattern recognition skills.

    I didn’t really feel I was making a noticeable contribution to the project. This is because the scale of the project is so huge that one or two entries from me wouldn’t make any difference in the grand scale of things. I also didn’t see the fruits of my labor. They didn’t have an end screen showing me the progress made so far by all the contributions people have made. It would’ve been cool if they had a graphic that was a little bar, showing how close they are to transcribing the whole thing, that was slowly filling up after each submission I made, no matter how minuscule the bar would rise, I think that if I saw some sort of change in the grand scale of things I would have felt like I made a contribution to this project. So I felt like they should’ve put some perspective on how the contribution impacted the overall project.

    Crowd sourcing is an easy and effective resource. More and more big scale projects such as this one are looking towards crowd sourcing to complete their work effectively. I remember reading that NASA is using crowd sourcing to find new supernovas in our sky by releasing deep space photographs to the public, so that people can see if they can find never before found supernovas. More specifically in the humanities field, when there are big projects that require the synthesis of huge amounts of data, which would require expensive computers and software to complete, they could turn towards crowd sourcing, which has the same effect but with less cost and, maybe more importantly, indirectly gets people more INTERESTED in the topic by allowing them to contribute to it. By crowd sourcing you are also getting more and more people involved and interested in the thing that they are contributing to, and the feeling of contribution to a project you care about is important to a lot of people.

    Oh yes, I definitely saw an effective way of getting people to help you with crowd sourcing to synthesize huge amounts of data like this. You want the instructions to be simple and visual so that there is the smallest possible room for interpretation. In order to get clear cut instructions to a massive random crowd, one needs to make the instructions as simple as possible, and to help with this, take advantage of the fact that humans are naturally a visual species.

  11. The instructions were easy to follow from my perspective, and the text was easy to transcribe. I did run into trouble once when there was a sticky note on top of the document which blocked my view, so I could not transcribe or mark the rest of the document. In terms of my contribution, I feel that the instructions were so clear that I was left to wonder why this was made an online collaborative effort instead of hiring under/grad students to transcribe the documents.
    Thinking about what I learned from the project, I think this kind of project does expose the intricacies of knowledge production—not in the sense that everyone can participate in some way, but rather that even (or especially) moving to a digital phase, hierarchies still persist in thinking about who gets to produce knowledge and what is produced. From this standpoint, I think I’ve been largely disappointed by the researchers and the kind of research that is considered “trendy” or “status quo” in the field. The current state of digital humanities research makes me wonder about the obvious reproduction of privileged perspectives and their (seeming) invisibility to the class, to the field, and to bystanders.

  12. I thoroughly enjoyed my time helping mark and transcribe the war stories of New Zealanders. The layout and interface used for the visitor to contribute to the project were simple to use and understand. Halfway through the transcription of one document, I actually found myself wandering off and studying the document to better understand its context and the information that it brings. I believe this was their ideal goal: having a visitor both contribute to and learn from their project.
    However, crowdsourcing in such a manner is an inefficient way of collecting data, due to the large margin of error each transcriber/marker is subject to as well as the large amount of time required for the data to be transcribed/marked.

  13. I’ve done crowdsourced projects before, but nothing as complex as this one. Usually it’s just animal identification in camera traps. (As a side note, I only felt comfortable reading the cursive script because I started re-teaching myself how to read and write cursive one year ago this winter actually!)

    On sources like this, I would be a bit more cautious about letting users view the transcription others have done to avoid a “group think” interpretation of any difficult words. I feel like I’ve come away with maybe a bit more surface knowledge of the material these documents contain, but ultimately I’m not sure if I truly understand their significance, which is an issue in a project like this. The crowd is handed a bunch of material, given a job, and no real context or interpretation of what that material means.

    1. What site did you do animal identification for, would you happen to remember? My dad used to work with WCS where a lot of his work was setting up camera traps and then working with the images they took.
