On the last day of class (Thursday 3/9) each group will give a Pecha Kucha style presentation on their completed and published project. The rules of such a presentation are below, with credits for the format going to Ryan Cordell, via Jim Spickard.
Here is an example slideshow I produced last year to illustrate the format:
In this presentation, you will have exactly 6 minutes and 40 seconds to present your material: 20 slides that auto-advance every 20 seconds. These presentations will follow the Pecha Kucha presentation format. Here are the rules:
You will have exactly 6 minutes and 40 seconds.
Your presentation will use PowerPoint (or Keynote or Google Presentations), but you’ll be restricted to 20 slides. No more, no less. Period.
Each slide must be set to auto-advance after 20 seconds. No clickers, no exceptions.
Your presentation must also follow the 1/1/5 rule. You must have at least one image per slide, you can use each exact image only once, and you should add no more than five words per slide.
Your images can be some of the ones you used for your project itself, screen shots of your process, or illustrative ones gathered from the internet.
You may trade off between your members however you see fit, but the presentation should be rehearsed and polished.
You should not attempt to tell us everything that you might say in a written paper nor explain every nuance of your argument. That level of detail should be on the project website itself.
Instead, you should aim to give us an overview of the project and highlight its particular strengths. When designing the presentation, think SHORT, INFORMAL, and CREATIVE. Paradoxically, the Pecha Kucha form’s restrictions promote this creativity.
A Social Network of Carleton College during the First World War
Today we are going to try out a real-world network analysis project and attempt to reconstruct the social network of Carleton College around its 50th anniversary in 1916. This was a turbulent time for Carleton, since the US entered World War One in 1917, and many Carls past and present went to Europe to fight. Carleton itself was destabilized in the fall of 1918, when the Carleton-St Olaf Unit of the Student Army Training Corps was formed and took over the men’s dormitory as a barracks.
The school was small enough at that time that everyone probably knew everyone else, so the social world was relatively restricted, but we obviously cannot interview people to ask who their friends were. Of course, Facebook didn’t exist in 1916, so we can’t easily download a list of friends to start exploring.
So how are we going to construct an historical social network? Well, fortunately, the Algol Yearbook was first published in 1890 and all issues since have been digitized by our helpful librarians and made available through the library catalog. During the war the yearbooks went on hold, and so the volume published in 1919-1920 contains several years’ worth of data. As the yearbook committee wrote in their Foreword:
SINCE the last appearance of the Algol, in the fall of 1916, Carleton has passed from days of peace and normal activities, through the unrest and sacrifice of war, to those of triumphant victory, bringing readjustment, and preparation for even greater progress than before. To the classes of 1919 and 1920 has fallen the peculiar task and privilege of recording Carleton life, not only on the campus, but in military forces of the nation and battleships of our fighting forces, during these eventful years. This book, we hope, will be more than a record, more than a list of names and facts, more than a celebration of victory; it is intended to be a simple expression of appreciation, a permanent memorial, to those who, clad in the Khaki and Blue, served their country in her time of need.
The yearbook won’t exactly tell us who was friends with whom, but it does give the complete class lists for a given year (providing all the individuals that will be the nodes in our network) and also all of the organizations on campus with their member lists. We can assume that people who belonged to the same organization interacted regularly and construct relationships (the edges of our network) based on these affiliations.
This is called an affiliation network and can be a powerful way of exploring connections and influence, as illustrated by the graph of regional networks of railroad directors below. Ours will allow us to construct a bimodal network graph of individuals and organizations in Carleton’s wartime history, from which we can derive a one-mode network of connected individuals based on their co-membership in these clubs. The result should allow us to ascertain who were the most central individuals at Carleton that year and who formed bridges between different communities, and to calculate other useful network metrics.
But we are hackers here and are going to use easy, off-the-shelf tools: Google Sheets to collect our data collaboratively in the cloud and the free NodeXL template for Microsoft Excel to visualize the network. (Microsoft Excel itself is, of course, not free, but it is so widely available that this tool should be accessible and of use to most readers.)
First, we need to extract the lists of students and their membership affiliations from the yearbook. The volumes were OCRd, but the process is far from perfect, making automatic data extraction difficult. We will need to do a good amount of copy/pasting and hand editing to get these lists into shape.
Go to the 1919-20 Algol in the Carleton Digital Collections and explore the item
The interface allows us to view the PDF images and the scanned text
Go to pages 82-83, which list the Senior Class, and
Click the View PDF & Text button to see what I mean about the data quality
The original includes a lot of white space and multiple columns for formatting, which does not translate very well into a clean table formatted CSV, our ultimate end goal. These yearbooks were also scanned as two-page spreads, which caused the OCR to freak out and mix up values from both columns.
If the text were cleaner, we might be able to automate some of the data cleaning process by using a Python script and regular expressions, or a tool like OpenRefine, but alas, for a project of this scale, this would be more labor than it is worth.
So we’ll need to do this the old fashioned way with manual copy/pasting into spreadsheets. We’ll try to speed it up, at least, by dividing the labor among us.
Getting our Data
NB: The instructions for doing this are below, but we are going to leverage the work of last year’s students and just re-use their gathered and cleaned data:
We are interested in names of students and their organizational affiliations, and you’ll notice that all of this information is right here for the Senior Class. Great! Unfortunately, it is only there for the Seniors and Juniors, but not for anyone else.
We are going to gather our data from Book 4 of the Yearbook: Literary Societies and Other Organizations. These lists contain full names, under each heading, and often class year, and they include underclassmen and -women as well. Plus, these “literary” societies were as close as Carleton came to Greek fraternities and sororities, and since they served a similar social function they are especially relevant to our social network analysis.
Download the PDF to your local machine so that it is easier to work with.
Open the PDF in Preview or Acrobat
Go to page 161, Book 4 and pick a society.
Highlight the text and see what the underlying OCR will let you grab at one go. Chances are the class years will get all out of whack, so you’ll probably need to copy names individually.
For instance, you could do a find/replace swapping out something like \.\s (a period followed by whitespace) for \n (a new line).
The result should be a single list containing the names of all members of the society with their class year
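As a rough illustration of that find/replace idea, here is a minimal Python sketch. The sample string is hypothetical, and the sketch assumes names are recorded surname-first with no spaces between initials. A name written like "Hannah M. Griffith" would be split in the wrong place, which is one reason hand checking is still needed.

```python
import re

# Hypothetical OCR output: several members run together on one line.
raw = "Griffith, H.M. Smith, J.A. Jones, M.E."

# Split at whitespace that follows a period, keeping the period itself.
members = re.split(r"(?<=\.)\s+", raw)

# One name per line, ready to paste into a spreadsheet column.
print("\n".join(members))
```

The lookbehind `(?<=\.)` keeps the final period attached to each name, whereas a plain replacement of the period plus whitespace would drop it.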
Now let’s compile the Master List
Copy your data into the Master List sheet below that of others
Since the yearbook compilers were not thinking in terms of machine compatibility, they often recorded names differently in different lists, e.g. “Hannah M. Griffith” in the class list might show up as “Griffith, H.M.” in a member list.
To make our graph meaningful we would need to create an Index of Persons from the Index of Names we just created. This is a standard step in prosopographical research (that is, research on a group of named individuals). For an example of a major DH project underway to distinguish persons from names in the ancient world, see the SNAP:DRGN project.
To do this properly, we would construct a database with unique IDs for each name, linked as foreign keys to a new table listing unique persons, along with new fields describing the rationale for our decisions. We don’t have the time for that in this project, so we are just going to assume that if the initials match, we have the same person.
We will just try to be as accurate as we can.
Once everyone’s data is entered, we will sort it alphabetically by name to put all the similar names together.
Use the Google Sheets =UNIQUE formula to check for slight differences
Go through the list of names, and replace each instance with its equivalent
Repeat until cleaned
Re-sort the list based on Organization
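The "same initials, same person" shortcut described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `name_key` is made up, "Griffith, Helen" is a hypothetical extra entry, and the rule assumes every name is either "Given Names Surname" or "Surname, Initials".

```python
import re
from collections import defaultdict

def name_key(name):
    """Reduce a name to 'SURNAME INITIALS' so that variants compare equal."""
    name = name.strip()
    if "," in name:
        # "Griffith, H.M." -> surname first, given names after the comma
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Hannah M. Griffith" -> surname is the last word
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    # Take the first letter of every run of letters in the given names.
    initials = "".join(w[0].upper() for w in re.findall(r"[A-Za-z]+", given))
    return f"{surname.upper()} {initials}"

variants = ["Hannah M. Griffith", "Griffith, H.M.", "Griffith, Helen"]

# Group every spelling under its key so likely duplicates sit together.
groups = defaultdict(list)
for variant in variants:
    groups[name_key(variant)].append(variant)

for key, spellings in groups.items():
    print(key, "->", spellings)
```

Two different people who share a surname and initials would be merged by this rule, so the grouped output is a list of candidates to review rather than a final Index of Persons.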
Exercise: Affiliation Network in NodeXL
NodeXL is a free extension that gives Excel the functionality of a network analysis program, allowing you to store, analyze and visualize networks.
Now that we have all of the yearbook data collected in our sheet, we need to turn it into a format that NodeXL can read. The simplest of these is an edge list, which consists of two columns containing the node names, where each row is an edge connecting those two vertices.
The Orgs sheet you created above will be our edge list, which we can use to make a bimodal display of our affiliation network data in NodeXL.
Follow the installation instructions and launch a blank template
You should now have a form to begin filling in, and a new NodeXL ribbon of tools at the top of the Excel window (like the image above).
Copy and paste the edge list of all your people and organizations into the Edges tab’s first two Vertex Columns
This detailed introductory tutorial by Derek Hansen and Ben Shneiderman will give you a step-by-step guide to using the program. The Serious Eats Analysis section beginning on page 27 provides an example of working with bimodal data like ours.
Follow the instructions in the tutorial linked above to test out NodeXL’s capabilities. Try to figure out how to do the following:
Generate a graph,
Generate graph metrics
Add styling and color to vertices and edges
Change the layout
Experiment with good options for visualizing a multimodal network (hint: try the Harel-Koren Fast Multiscale option)
What insights can you glean from this visualization?
Is it helpful to see the affiliations in this way?
Do any individuals or organizations stand out as particularly isolated or well connected?
Exercise: Person-to-Person Network
We are not just interested in the indirect connections through affiliations, however; we also want to see how co-membership creates direct connections between people. In order to transform our bimodal network into a person-to-person unimodal one, we need to turn this edge list into a matrix. A network can be represented as a binary matrix wherein a connection is indicated by a 1 and no connection receives a 0. The following two tables represent the same network information, showing directed relationships between Nodes 1 and 2 (note that Alice’s row contains all 0s, since she never appears in the Node 1 list).
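As a small illustration of how an edge list and a binary matrix carry the same information, here is a Python sketch. The three edges are hypothetical, chosen so that Alice only ever appears as a target (the second column), never as a source.

```python
# Hypothetical directed edge list: (Node 1, Node 2).
edges = [("Bob", "Alice"), ("Carol", "Alice"), ("Bob", "Carol")]

# Every distinct name becomes one row and one column of the matrix.
nodes = sorted({name for edge in edges for name in edge})
index = {name: position for position, name in enumerate(nodes)}

# Start with all 0s, then mark a 1 wherever a source points at a target.
matrix = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    matrix[index[source]][index[target]] = 1

for name, row in zip(nodes, matrix):
    print(name, row)
```

Because the relationships are directed, the matrix is not symmetric: Bob's row has a 1 in Alice's column, but Alice's row stays all 0s.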
To get a matrix like this for our data, we would put the Organizations along the top axis, and enter 1s wherever people were members and zeroes everywhere else. Doing this by hand for a large dataset would be very time consuming. Statistical packages like R have functions that will do these transformations for you, but it can also be done using Excel’s Pivot Table feature to generate the person-to-affiliation matrix and the SumProduct function on the result to connect people to people based on the number of organizations they both attended.
If you want a model, download the Excel file below to see how everything listed below works together.
Put your cursor in the data range for your edge list and select Pivot Table from the Insert or Data menu
Drag the field names into the areas indicated in the image, so that
Names label Rows,
Organizations label Columns,
Then drag Organizations into Values as well, so that Count of Organization fills in the matrix.
The result should look like that below
Great, we’ve got a matrix showing a 1 (true) wherever a given person is affiliated with an organization and nothing (false) where there was no affiliation. This is just a different way to represent the same data that was in our edge list. But we want to see how people were connected with people, not the groups, so we need to do some matrix math.
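For readers who prefer to see the pivot step as code, the following Python sketch builds the same kind of person-to-affiliation matrix from an edge list. The people and clubs are hypothetical, and unlike the Pivot Table (which leaves non-memberships blank) it fills them in with explicit 0s.

```python
# Hypothetical two-column edge list: (person, organization).
edge_list = [
    ("Ada", "Debate Club"),
    ("Ada", "Glee Club"),
    ("Ben", "Debate Club"),
    ("Cora", "Debate Club"),
    ("Cora", "Glee Club"),
]

people = sorted({person for person, _ in edge_list})  # row labels
orgs = sorted({org for _, org in edge_list})          # column labels

# Start every cell at 0, then count each membership (like Count of Organization).
affiliation = {person: {org: 0 for org in orgs} for person in people}
for person, org in edge_list:
    affiliation[person][org] += 1

for person in people:
    print(person, [affiliation[person][org] for org in orgs])
```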
Specifically, we are going to compare the records of every two people in our matrix and make use of binary math to see where their membership in groups overlaps. We’ll make a new matrix that has the people’s names across both axes, and the values in this new matrix will indicate how many affiliations the people at each intersection shared.
Make sense? Sort of? Hopefully it will make more sense as we step through the process in the exercise below.
Creating a Person-to-Person Matrix
The first step is to create a new box for our data that has the names on both axes, not just the vertical. So we’ll copy the names over to a new range, and also transpose them across the top.
Copy the names (and only the names) from the Pivot Table into a new empty column to the right of the Pivot Table.
For example, my pivot table takes up columns A through D, so I pasted my names into column F.
Click in the cell above and to the right of the top name (the first cell of the column headings), choose Edit > Paste Special… and check the Transpose box to copy the same list across instead of down
You should now have an empty matrix with names across both axes
To fill this grid, we need to use some advanced Excel features: named ranges, and nested formulas. The formula we will be using is complex and looks like this:
I will try to break it down a bit. If you don’t care about how this works, feel free to skip this section!
The main component is SUMPRODUCT, which will take as input two rows of our person-to-affiliation matrix, multiplying each pair of values together and adding the results. 1×0 or 0×0 gives us a 0, but wherever we have two ones in the same column (i.e. the people in those rows belonged to the same group) 1×1 = 1.
If the two people belonged to multiple groups together, those 1s are added together, giving us a weighted value in our new matrix: 1s for one co-membership, 0s for none, and higher numbers for multiples.
SUMPRODUCT takes two ranges as inputs, and to select them we have two OFFSET functions, each of the form OFFSET(reference, rows, cols, [height], [width])
This looks in a range of cells (reference), starting a certain number of rows (rows) and columns (cols) away from the reference point, and returns a selected range of (height) by (width) cells.
In our example, the reference is the named range “matrix”, and the function returns a range 1 row high and the width of our “matrix” range (COLUMNS(matrix)). The number of rows is provided by the output of another function, MATCH(lookup_value, lookup_array, [match_type])
This returns the position of the “lookup_value” in “lookup_array”, which in our case means finding the name in the current row of our new matrix ($G2 in the example above) in the named range “names”
The second OFFSET is identical to the first, except it will match against the name in the current column of our matrix (H$1 above)
Finally, we wrap the whole thing in an IF function, IF(logical_test, value_if_true, value_if_false)
This makes sure the names in the current row and column are not equal to each other ($G2<>H$1), and runs the function only if true, otherwise returning a 0
Why? Because otherwise we would get very high values across the diagonal since each person obviously shared membership with themselves in each group they belonged to!
Phew! Got all that? It’s a lot, I know, but if you name the ranges correctly and set the values for your first cell, you should be able to copy the formula into the rest of the blank cells and Hey Presto! a weighted person-to-person matrix should appear.
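Read together, the pieces described above suggest a formula along the lines of =IF($G2<>H$1, SUMPRODUCT(OFFSET(matrix, MATCH($G2, names, 0)-1, 0, 1, COLUMNS(matrix)), OFFSET(matrix, MATCH(H$1, names, 0)-1, 0, 1, COLUMNS(matrix))), 0). Treat that as a reconstruction from the breakdown, not necessarily the exact original. The same logic can be sketched in Python; the names and membership values below are hypothetical.

```python
# Hypothetical stand-ins for the two named ranges.
names = ["Ada", "Ben", "Cora"]   # the "names" range
matrix = [                       # the "matrix" range: one row per person
    [1, 1],                      # Ada:  member of both groups
    [1, 0],                      # Ben:  member of the first group only
    [1, 1],                      # Cora: member of both groups
]

def sumproduct(row_a, row_b):
    """Multiply the rows position by position and add up the results."""
    return sum(a * b for a, b in zip(row_a, row_b))

def shared_groups(row_name, column_name):
    # The IF test: a person is not counted as connected to themselves.
    if row_name == column_name:
        return 0
    # names.index plays the role of MATCH; indexing matrix plays OFFSET.
    row_a = matrix[names.index(row_name)]
    row_b = matrix[names.index(column_name)]
    return sumproduct(row_a, row_b)

person_to_person = [[shared_groups(r, c) for c in names] for r in names]

for name, row in zip(names, person_to_person):
    print(name, row)
```

Ada and Cora share two groups, so their cell holds a 2, while each of them shares only one group with Ben.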
Using Named Ranges and Modifying the Formulas
Name your ranges so that the formula will work
In your Matrix sheet, select the binary values (omitting the Grand Totals!)
Put your cursor in the name field at top left and type “matrix” then hit return
Select the list of names (omitting the Row Labels and Grand Total)
Put your cursor back in the name field and type “names” then hit return
Almost there! Now for the formula
Copy the full formula above
Double click in the top left cell of your blank person-to-person matrix and paste the formula
Before you hit return
You need to change the values of $G2 and H$1 to select the first names on each axis
In my example, I would need to change BOTH $G2s to $F4s, and BOTH H$1s to G$3s
Make sure you keep the dollar signs in place, so that the labels remain selected
Copy the cell you just entered the formula in and paste it in the rest of the range and you should see all the values magically populate
Finally, we can output this matrix as a CSV file and import it into NodeXL
Copy the whole matrix you just created including labels (e.g. F3:J7 above)
Add a new sheet called CSV
Click cell A1
Edit > Paste Special…
Click on the Values radio button and hit OK
Go to File > Save As
Choose Format: Comma Separated Values (.csv)
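If the matrix were built in code rather than in Excel, the same labeled layout could be written out with Python's csv module. This is a sketch with hypothetical names and values; it writes to an in-memory buffer, and saving to a real file would mean opening one with open(..., newline="") instead.

```python
import csv
import io

# Hypothetical weighted person-to-person matrix with its labels.
names = ["Ada", "Ben", "Cora"]
matrix = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Header row: an empty corner cell, then the column labels.
writer.writerow([""] + names)
# One row per person: the row label followed by that person's values.
for name, row in zip(names, matrix):
    writer.writerow([name] + row)

csv_text = buffer.getvalue()
print(csv_text)
```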
We are finally ready to load the person-to-person matrix you created above into NodeXL
Go to NodeXL > Import > From Open Matrix Workbook… and import the CSV file you just made
Generate a graph and explore visualization options.
You did it! Give yourself a pat on the back. That was hard work.
Assignment: Tutorial Blog Post (Due TUESDAY, 11/10)
For this assignment, create a step-by-step tutorial as a blog post demonstrating a particular technique, tool, or other helpful how-to discovery you’ve made over the past several weeks in this course.
Pick a DH tool that we haven’t discussed yet and figure out an interesting use case for it (or, vice versa, pick a use case and figure out a potentially viable DH tool or methodology). You can highlight a technique that you have discovered in class, or in the preparation of your projects, as long as it is not one we’ve all covered together already. If you’re stuck for ideas, the Dirt Digital Research Tools directory offers an extensive list of software for academic uses.
Once you have an idea, create an online tutorial for the rest of us and the wider world to start paying forward what you’ve learned in the course and becoming the “local computer expert.” For examples, you can look at some of the posts for this class, think back on all those SketchUp resources you’ve looked through, or see the software posts on the Profhacker blog.
Your tutorial blog post should include:
An introductory paragraph explaining clearly
what the tool or technique is and
why or in what context it would be useful
A step-by-step walkthrough of how to accomplish a specific task using the tool that contains
At least 5 steps
EITHER screenshots illustrating the steps where appropriate
OR a screencast video in which you record your actions while speaking about the process into a microphone
Links to at least two further resources like the software’s documentation or other tutorials around the web
The advent of the internet, and especially of its more socially connected Web 2.0 variant, has ushered in a golden age for the concept of the network. The interconnected world we now live in has changed not only the way we study computers and the internet, but the very way we envision the world and humanity’s place in it, as Thomas Fisher has argued. The digital technologies that we are learning to use in this class are tightly linked to these new understandings, making network analysis a powerful addition to the Digital Humanist’s toolkit. According to Fisher,
The increasingly weblike way of seeing the world … has profound implications for how and in what form we will seek information. The printed book offers us a linear way of doing so. We begin at the beginning—or maybe at the end, with the index—and work forward or backward through a book, or at least parts of it, to find the information we need. Digital media, in contrast, operate in networked ways, with hyperlinked texts taking us in multiple directions, social media placing us in multiple communities, and geographic information systems arranging data in multiple layers. No one starting place, relationship, or layer has privilege over any other in such a world.
To study this world, it can therefore be helpful to privilege not the people, places, ideas or things that have traditionally occupied humanistic scholarship, but the relationships between them. Network analysis, at root, is the study of the relationships between discrete objects, which are represented as graphs of nodes or vertices (the things) and edges (the relationships between those things). This is a very active area of research that emerged from mathematics but is being explored in a wide array of disciplines, resulting in a vast literature. (Scott Weingart offers a gentle introduction for the non-tech-savvy in his Networks Demystified series and you can get a sense of the scope from the Wikipedia entry on Network Theory.) As hackers, we are not going to get too deep into the mathematical underpinnings and will rely mostly on software platforms that make network visualization relatively easy, but it is important to have a basic understanding of what these visualizations actually mean in order to use them critically and interpret them correctly.
Exercise: Your (analog) social network
The basics of visualizing a network are fairly intuitive and can be done with pen and paper.
Draw a simple diagram of your own social network including
10-12 people as nodes and
your relationship to them as edges
Put yourself at the center and then place other people around you.
Start with your immediate family (your kinship network) and then expand out to include extended family, friends, people you know through clubs or activities, etc.
Draw lines to connect these people to yourself
Now draw lines to connect them to each other.
How many have relationships that do not run through you?
As undifferentiated lines, these are probably not very informative, so code the lines to indicate the nature of each relationship
What takeaways emerge from your diagram?
Are there connections that surprised you or figures that emerge as more central to your network than you had realized?
Swap diagrams with your neighbor and see if the diagram helps you understand their network more easily.
Exercise: Your (digital) social network
The relationships you just drew can be expressed in a simple data model as a “triple” composed of a subject, a predicate, and an object. My relationship to my friend Chris, for instance, can be expressed as a triple in the following format:
Austin — is friends with — Chris
subject — predicate — object
Each relationship in your whole network can be represented this way, as a set of triples, which allows for easily readable data storage and ready network visualization. Many DH projects make extensive use of the RDF (Resource Description Framework) specifications for modeling large sets of data as an RDF graph of triples. For our small example, we are going to recast our personal network as a set of triples and visualize it as a digital network using Google’s Fusion Tables application.
As we’ve already seen, Fusion Tables is an experimental platform for data visualization that Google has developed to allow spreadsheet data to be quickly visualized in any number of ways from traditional bar and line charts to maps and network visualizations. Google launched its first MOOC around Fusion Tables a while back called Making Sense of Data that you can still view if you want an in-depth look at how to use all the features of this application. For now, we are going to focus on its Network Graph capabilities.
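To make the triple model concrete, here is a minimal Python sketch that stores a tiny network as (subject, predicate, object) tuples and counts each person's connections. Only the Austin and Chris triple comes from the example above; Dana and Eli are hypothetical additions, and all relationships are treated as mutual.

```python
from collections import defaultdict

# Each relationship is one (subject, predicate, object) triple.
triples = [
    ("Austin", "is friends with", "Chris"),
    ("Austin", "is friends with", "Dana"),  # hypothetical
    ("Chris", "is friends with", "Dana"),   # hypothetical
    ("Dana", "is friends with", "Eli"),     # hypothetical
]

# Friendship is mutual, so record the link in both directions.
neighbors = defaultdict(set)
for subject, predicate, obj in triples:
    neighbors[subject].add(obj)
    neighbors[obj].add(subject)

# Degree: how many distinct people each node is connected to.
degree = {person: len(others) for person, others in neighbors.items()}
print(degree)
```

Notice that names repeat across triples; that repetition is what lets a list of simple statements describe a whole graph.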
Our first step will be to populate a Google Sheet with triples representing our own network data, and then import it into Fusion Tables and visualize it.
Launch Google Drive and create a new sheet with the following three columns: Person A, Relation, and Person B
Go through your hand-drawn diagram and translate each network relationship into a triple following the model above
(One word of caution — there are two types of relationship that can be expressed here: mutual and unreciprocated. “Is friends with” or “is a sibling of” would be mutual relationships that produce an undirected graph. Directed graphs map one-sided relationships like “is the parent of,” “is the student of” or “is in love with” by drawing a directional arrow for the edge. Both are possible and can be used, but you should be aware of the distinction as you draw up your triples and stick to one or the other.)
This data model is unlike a relational database in that you will be repeating names in order to express all of the relationships in the graph.
Try to connect each person or node with at least two others
Make sure you are logged in and save your sheet
Import your data into Fusion Tables
Go to the Fusion Tables start page, click on Google Spreadsheets and import your data, checking the Export box if you wish to make the data public and downloadable.
A window should open showing your data table. Add a new chart of the type “Network graph” by clicking the red plus sign, then change the options to show the link between your Person A and Person B columns.
Congratulations! You have just made a graph of your social network. Explore the limited options and apply some filters, then click and drag around the graph to see how you can change the visualization.
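The distinction between mutual and one-sided relationships mentioned in the steps above can also be shown in code. In this sketch, the function name `build_edges` and the names Pat and Sam are hypothetical; the only difference between the two graph types is whether each link is also stored in reverse.

```python
def build_edges(triples, directed):
    """Turn (Person A, Relation, Person B) rows into a set of links."""
    edges = set()
    for person_a, _relation, person_b in triples:
        edges.add((person_a, person_b))
        if not directed:
            # A mutual relationship also holds in the other direction.
            edges.add((person_b, person_a))
    return edges

friendships = [("Austin", "is friends with", "Chris")]
parentage = [("Pat", "is the parent of", "Sam")]  # hypothetical names

undirected_edges = build_edges(friendships, directed=False)
directed_edges = build_edges(parentage, directed=True)

print(("Chris", "Austin") in undirected_edges)  # friendship runs both ways
print(("Sam", "Pat") in directed_edges)         # parenthood does not
```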
Now that you know the basics of what a network graph is and how to create a rudimentary one, let’s explore some much more sophisticated network analysis DH projects. With your group, explore one or more of the following projects:
This week we are going to explore some dos and don’ts of data visualization as you prepare for your final projects. Edward Tufte is widely considered one of the world’s leading data visualization gurus, and has been called everything from the “Leonardo da Vinci of data” to the “Galileo of graphics.” Tufte will be our guide as we think through what good visualizations say and how bad data displays can lie and distort or even undermine your intended argument.
The Minard Map
It may well be the best statistical graphic ever drawn.
—Edward Tufte, The Visual Display of Quantitative Information (1983)
Why is this considered such a landmark visualization, if not the best ever?
What are the key features that make it stand out?
How would you improve on it, if you were to take a stab?
Keeping it Honest: How Not to Lie with Pictures
This may well be the worst graphic ever to find its way into print.
—Edward Tufte, The Visual Display of Quantitative Information (1983)
We’ve already discussed how not to lie with maps, but it’s easy to do with visualizations as well. One of the biggest issues that Tufte stresses in his seminal work is how to stay honest with infographics. One of the easiest errors to make, for instance, is to scale the radius of circles, or one axis of two-dimensional shapes, in direct proportion to your data, which results in far larger differences in area than your data actually warrants.
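A quick arithmetic check shows why this matters. In the Python sketch below, the two data values (1 and 3) are hypothetical. Scaling the radius by the value makes the larger circle nine times the area of the smaller one, while scaling the radius by the square root of the value keeps the areas in the true 3:1 ratio.

```python
import math

def circle_area(radius):
    return math.pi * radius ** 2

small_value, large_value = 1.0, 3.0  # hypothetical data: one is 3x the other

# Misleading: radius proportional to the value, so area grows with its square.
naive_ratio = circle_area(large_value) / circle_area(small_value)

# Honest: radius proportional to the square root, so area tracks the value.
honest_ratio = (circle_area(math.sqrt(large_value))
                / circle_area(math.sqrt(small_value)))

print(round(naive_ratio, 2), round(honest_ratio, 2))
```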
What mistakes did you not think of before that you might want to avoid?
What examples might you like to emulate for your own projects?
Google Motion Charts (Gapminder)
One of the most impressive data visualization breakthroughs of recent years was Hans Rosling’s invention of Gapminder: an application that really unleashed the “fourth dimension” of time and allowed data to be animated in an immediately understandable and powerful way. His TED talk below illustrating global health data with the tool is legendary.
Google bought the technology and made it available for all to use as Motion Charts.
We’ve already explored some visualization environments, but here are two more very impressive tools to check out:
Over the past few weeks we have discussed and seen how the modern dynamic web—and the digital humanities projects it hosts—comprise structured data (usually residing in a relational database) that is served to the browser based on a user request where it is rendered in HTML markup. This week we are exploring how these two elements (structured data and mark up) come together in a mainstay of DH methods: encoding texts using XML and the TEI.
XML (eXtensible Markup Language) is a sibling of HTML, but whereas the latter can include formatting instructions telling the browser how to display information, the former is merely descriptive. XML doesn’t do anything, it just describes the data in a regularized, structured way that allows for the easy storage and interchange of information between different applications. Making decisions about how to describe the contents of a text involves interpretive decisions that can pose challenges to humanities scholarship, which we’ll discuss more in the next class on the Text Encoding Initiative (TEI). For now, we’re going to explore the basics of XML and see how we can store, access, and manipulate data.
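As a first taste of storing, accessing, and manipulating XML data, here is a minimal Python sketch using the standard library's ElementTree module. The element names (roster, member, name, society) and all of the values are invented for illustration; XML lets you define whatever descriptive tags suit your data.

```python
import xml.etree.ElementTree as ET

# A made-up XML document: the tags describe the data, nothing more.
xml_text = """
<roster>
  <member year="1920">
    <name>Ada Example</name>
    <society>Example Society</society>
  </member>
  <member year="1919">
    <name>Ben Sample</name>
    <society>Example Society</society>
  </member>
</roster>
""".strip()

root = ET.fromstring(xml_text)

# Access: read an attribute and a child element from each member.
names_by_year = {m.get("year"): m.findtext("name") for m in root.findall("member")}
print(names_by_year)

# Manipulate: add a new member, then serialize the tree back to text.
new_member = ET.SubElement(root, "member", year="1921")
ET.SubElement(new_member, "name").text = "Cora Test"
print(ET.tostring(root, encoding="unicode"))
```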
We went over the main parameters in class, but an excellent primer to XML in the context of DH has been put together by Frédéric Kaplan for his DH101 course and can be viewed in the slideshow below.
Now that we know what XML markup looks like, we can turn to the broader and more fundamental question facing digital humanists: why should we mark up texts in the first place?
Computer-assisted text analysis is one of the oldest forms of digital humanities research, going back to the origins of “humanities computing” in the 1940s. Computers are well suited to data mining large bodies of texts, and with the rise of massive digitization projects like Google Books, many research questions can be answered using plain “full text” versions of documents. The Google Ngram viewer is probably the best known of these tools, allowing simple comparisons of the frequencies of words across Google’s enormous corpus of digitized texts. Someone interested in Prohibition for instance might compare the frequency of “alcohol” and “prohibition” in the American English corpus to see how the two terms were used during the period of its enforcement.
More sophisticated text analysis tools also exist that let you perform some pretty impressive data mining analytics on plain texts. Voyant Tools is one of the best known and most useful of these: it offers analysis and visualization of plain texts, and also allows you to upload XML and TEI files that permit fine tuning of your data. For how-to information, check out their extensive documentation page.
Exercise (Plain Text Analysis)
Let’s take a look at what these text analysis tools can do with a classic example text: Shakespeare’s Romeo and Juliet.
Go to Voyant Tools, click the Open folder icon and choose Shakespeare’s Plays (Moby) to load the open-access plain-text versions of all the Bard’s plays.
Explore the interface, read the summary statistics and hover your mouse over various words to see what pops up
What do you notice about the Cirrus tag cloud?
To make it more useful, add a stop word list, “a set of words that should be excluded from the results of a tool”
Click the gear icon to launch the options menu
Choose the English (Taporware) list, which contains all the common short words, prepositions and articles in modern English. Since this is Shakespeare you’ll still be left with a lot of thees, thous, and thys, and if you wanted you could go back into the options, click “Edit Stop Words” and manually add more words to the list.
Click on any word to launch more tools and investigate what they tell you and continue to explore the possibilities that Voyant offers.
Open the Corpus window at the bottom, and click on Romeo and Juliet to load just that play’s statistics
Investigate the Word Trends, and Keywords in Context tools to analyze some key thematic words in the play, like “love” and “death”
There are a number of other analysis and visualization tools that Voyant offers, which can be accessed via a direct URL in the address bar.
What kinds of questions can you answer with this sort of data?
Are there research topics that would not benefit from the approach?
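If you are curious what a tool like this is doing under the hood, the sketch below counts word frequencies with a stop word list and produces a simple keywords-in-context listing. It is a rough approximation in plain Python, not Voyant's actual code. The four lines of text are from Romeo and Juliet, and the stop word list is a tiny hand-picked one assumed for illustration.

```python
import re
from collections import Counter

# A few lines from Romeo and Juliet, standing in for a full text.
TEXT = """
My only love sprung from my only hate!
Too early seen unknown, and known too late!
Prodigious birth of love it is to me,
That I must love a loathed enemy.
"""

# A tiny hand-picked stop word list (an assumption for illustration;
# real lists contain hundreds of words).
STOP_WORDS = {"my", "from", "too", "and", "of", "it", "is", "to",
              "me", "that", "i", "a", "must"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, stop_words):
    """Count every token that is not in the stop word list."""
    return Counter(t for t in tokens if t not in stop_words)


def keywords_in_context(tokens, keyword, window=2):
    """Return each occurrence of keyword with `window` words on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token] + right))
    return hits


tokens = tokenize(TEXT)
frequencies = word_frequencies(tokens, STOP_WORDS)
print(frequencies.most_common(3))
print(keywords_in_context(tokens, "love"))
```

Try emptying STOP_WORDS and rerunning: the most frequent words quickly fill up with function words, which is the same effect you saw in the Cirrus cloud before applying a stop word list.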
Text analysis and data mining tools can get a lot of statistics out of the full texts, but what if we are interested in more fine grained questions? What if we want to know, for instance, how the words Shakespeare used in dialogues differed from soliloquies? Or what if we were working with a manuscript edition that was composed by several different hands and we wanted to compare them? For these kinds of questions, we need to go beyond full text. We need to encode our texts with meaningful tags. Enter XML and TEI.
For humanities projects, the de facto standard markup language is that specified by the Text Encoding Initiative. They have spent years working out a flexible yet consistent system of tags with the aim of standardizing the markup of literary, artistic and historical documents to ensure interoperability between different projects, while at the same time allowing scholars the discretion to customize a tag set matching the needs of their specific project. This flexibility can make it a daunting realm to enter for newbies, and we will not be going very far down this path.
The best gentle introduction to getting a TEI project up and running can be found at TEI By Example. For our purposes, let’s see how a properly marked up version of Romeo and Juliet differs from the plain text version and what that means for our scholarly pursuits.
Exercise (Encoded Text Analysis)
The Folger Shakespeare Library’s Digital Texts is a major text encoding initiative that provides high quality digital editions of all the plays online for free. These are not only available online through a browser, but the Folger has also made their source code available for download.
Using the guide, try to locate a speech by Romeo and another by Juliet.
As you might imagine, this kind of granularity of markup opens the door to much more sophisticated queries and analyses. Let’s say, for instance, that we wanted to compare the text of Romeo’s speeches with those of Juliet as part of a larger project exploring gender roles on the Elizabethan stage. The detailed TEI encoding of the Folger edition should let us do this pretty easily. Unfortunately the user interfaces for analysis of TEI documents have not been developed as much as the content model itself, and most serious analytical projects rely on statistical software packages used in scientific and social science research, like the open-source R. We’re not going to go that route for this class.
Voyant Tools will allow us to select certain elements of the marked up code, but only if we understand the XML document’s structure and know how to navigate it with XPATH (part of XSL and used in conjunction with XSLT). So let’s see how that works on our Romeo and Juliet example.
To do so, we’re actually going to use a slightly simpler XML markup version of Romeo and Juliet, so that it’s easier to see what’s going on.
Go back to Voyant Tools and paste the URL above into the Add Texts box
Before you click Reveal, open the Options dialog in the top right corner
Under “XPath to Content”, type (or copy/paste) the following expression //SPEECH[contains(SPEAKER,"ROMEO")]/LINE
Let’s also give this document the Title of ROMEO
Under “XPath to Title”, try to alter the expression above to select the SPEAKER values within Romeo’s speeches, instead of the LINE values
Finally, click on OK, and then Reveal
Apply your stop words and explore how this differs from the full text version of the play we examined earlier
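To see what the XPath expression above is selecting, here is a minimal sketch that performs the same selection in plain Python. The fragment is a made-up stand-in that assumes the SPEECH/SPEAKER/LINE structure implied by the expression, with a PLAY root element added as a further assumption. Python's built-in ElementTree module supports only a subset of XPath and has no contains() function, so the test on the speaker's name is written as an ordinary Python condition instead.

```python
import xml.etree.ElementTree as ET

# A made-up fragment that assumes the SPEECH/SPEAKER/LINE structure implied
# by the XPath expression above; the real file is much larger.
SAMPLE_PLAY = """
<PLAY>
  <SPEECH>
    <SPEAKER>ROMEO</SPEAKER>
    <LINE>But, soft! what light through yonder window breaks?</LINE>
    <LINE>It is the east, and Juliet is the sun.</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>JULIET</SPEAKER>
    <LINE>O Romeo, Romeo! wherefore art thou Romeo?</LINE>
  </SPEECH>
</PLAY>
"""


def lines_spoken_by(root, name):
    """Mimic //SPEECH[contains(SPEAKER,"NAME")]/LINE in plain Python."""
    selected = []
    # ".//SPEECH" finds SPEECH elements at any depth, like "//SPEECH" in XPath.
    for speech in root.iterfind(".//SPEECH"):
        # XPath's contains() tests the string value of the first SPEAKER child.
        speaker = speech.findtext("SPEAKER", default="")
        if name in speaker:
            # "/LINE" keeps only the LINE children of each matching SPEECH.
            selected.extend("".join(line.itertext())
                            for line in speech.findall("LINE"))
    return selected


root = ET.fromstring(SAMPLE_PLAY)
romeo_lines = lines_spoken_by(root, "ROMEO")
print(romeo_lines)
```

Notice that Juliet's line mentions Romeo by name but is not selected, because the test is applied to the SPEAKER tag rather than to the text of the line. A plain full-text search cannot make that distinction.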
I’m sure you can see the powerful possibilities afforded by using the encoded tags to quickly select certain portions of the text. When you do this type of analysis on an entire corpus, you can generate lots of data to compare with relative ease. So let’s compare Romeo’s text to Juliet’s.
To preserve Romeo’s text analysis, we can get a static URL to this instance of Voyant.
Go to the floppy disk image in the top right, and Export a URL for this tool and current data
then click on the URL link to launch the current view in a new window
Now you can go back to the home screen and repeat the process above, making changes to the XPath expressions in order to get all of Juliet’s lines and title the new document JULIET.
Apply your stop words again
Now you can use all of the analytical tools to compare the two lovers’ words!
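One simple comparison you could make between two speakers is to look for words that one uses and the other never does. The sketch below assumes you have already extracted each character's text. The two short excerpts stand in for the full speeches, and the function names are hypothetical.

```python
import re
from collections import Counter

# Short excerpts standing in for each character's complete text.
ROMEO_TEXT = ("But, soft! what light through yonder window breaks? "
              "It is the east, and Juliet is the sun.")
JULIET_TEXT = ("O Romeo, Romeo! wherefore art thou Romeo? "
               "Deny thy father and refuse thy name.")


def relative_frequencies(text):
    """Map each word to its share of all the words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def distinctive_words(freq_a, freq_b):
    """Words used in A but never in B, most frequent first (ties alphabetical)."""
    only_a = {w: f for w, f in freq_a.items() if w not in freq_b}
    return sorted(only_a, key=lambda w: (-only_a[w], w))


romeo_freq = relative_frequencies(ROMEO_TEXT)
juliet_freq = relative_frequencies(JULIET_TEXT)

print(distinctive_words(romeo_freq, juliet_freq)[:5])
print(distinctive_words(juliet_freq, romeo_freq)[:5])
```

Relative rather than raw frequencies are used so that a character with more lines does not dominate simply by speaking more. With excerpts this short the results mean very little, which is a useful reminder to check how much text is behind any comparison.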
Now that we’ve seen an example of a large TEI project and gotten a glimpse of the analytical power detailed markup affords, let’s think through the interpretive implications of deciding on a tag set for a project.
The content model for any given document is decided on by the interpreter, so your first task is to figure out what questions you think the document will be used to answer. As you might imagine, this up front work will shape the final outcome and must be considered (and carefully documented!) when starting a new project.
The classic example is to take a recipe and figure out what tags to use to meaningfully mark up its elements.
Download the following text and work with your group to mark it up in tags.
Consult your neighbors and then compare with another group.
How did you do? Would your individual mark up efforts combine to form a usable corpus, or would you need to do some adjusting?
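The sketch below shows why the question matters in practice. It imagines two groups encoding the same line of a recipe with different tag sets (both invented here for illustration) and then runs a query written for the first group's tags against both.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same recipe line by different groups.
GROUP_A = "<recipe><ingredient quantity='2' unit='cups'>flour</ingredient></recipe>"
GROUP_B = "<recipe><item><amount>2 cups</amount><name>flour</name></item></recipe>"


def ingredient_names(xml_text):
    """A query written against group A's tag set."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("ingredient")]


print(ingredient_names(GROUP_A))  # finds the flour
print(ingredient_names(GROUP_B))  # finds nothing: same recipe, different tags
```

Both encodings are defensible, but a corpus that mixes them cannot be queried consistently until one tag set is mapped onto the other, which is the interoperability problem that shared standards like the TEI are meant to address.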
Voyant lists many more examples of how people use the tools in their Examples Gallery.
The main TEI site has links to a ton of documentation on the Initiative, including many how-tos.
The TEI at Oxford teaching pages also contain a lot of slides and exercises from previous workshops, in many of which you can witness prominent TEI members Lou Burnard and Sebastian Rahtz wrestling with the challenges of maintaining the TEI’s dual goals of flexibility and interoperability.
XSLT (eXtensible Stylesheet Language Transformations) is to XML what CSS is to HTML, but it’s also a lot more. More like a programming language within markup tags than a regular markup language, it’s a strange but powerful hybrid that will let you do the same things as the languages above: transform XML data into other XML documents or HTML. Learn the basics at w3schools’ XSLT tutorial. If you’d like a more in-depth explanation of how XML and XSLT work together, check out this tutorial at xmlmaster.org, which is geared toward the retail/business world but contains basic information relevant for DH applications.
Manual 3D modeling techniques are very effective and have had a long history of producing impressive digital humanities projects. Lisa Snyder’s long-running project to recreate the World’s Columbian Exposition of 1893 in Chicago is a prime example of what these techniques can accomplish in skilled hands.
Increasingly, however, computers are doing more of the heavy lifting and there are several methods of generating 3D models that rely on algorithms to create geometric meshes that are being adopted for DH projects.
Procedural modeling refers to the generation of complex geometry from basic shapes through the application of code-based rules. The leading platform for this type of work in DH is CityEngine, owned by ESRI, the makers of ArcGIS. This technique allows a user to produce, modify and update large, textured models of entire cities quickly and iteratively. The output can be explored online or integrated with gaming software or 3D animation packages to produce video games, simulations and movies.
This software was developed for modern city planners and urban architects, but has increasingly been put to use on historic landscapes and built environments, as in the impressive work of Marie Saldaña who developed a Roman temple rule set.
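To give a feel for what "generating geometry from rules" means, here is a toy sketch in Python. It is not CityEngine's own rule language, and the rule, function names and measurements are all invented for illustration: one rule extrudes a flat footprint into a block and divides it into floors, and the same rule is then applied to several footprints.

```python
def extrude(footprint, height):
    """Turn a 2D footprint [(x, y), ...] into 3D vertices: base ring, then top ring."""
    base = [(x, y, 0.0) for x, y in footprint]
    top = [(x, y, float(height)) for x, y in footprint]
    return base + top


def generate_building(footprint, floors, floor_height=3.0):
    """Rule: extrude the footprint, then mark the elevation of each floor."""
    height = floors * floor_height
    return {
        "vertices": extrude(footprint, height),
        "floor_levels": [i * floor_height for i in range(floors)],
    }


# Hypothetical rectangular footprints (in meters) and floor counts.
footprints = [
    ([(0, 0), (10, 0), (10, 6), (0, 6)], 4),
    ([(15, 0), (23, 0), (23, 8), (15, 8)], 2),
]

# Applying one rule to many footprints is what makes the approach scale.
city = [generate_building(shape, floors) for shape, floors in footprints]
for building in city:
    print(len(building["vertices"]), building["floor_levels"])
```

Changing floor_height once and rerunning regenerates every building, which is the quick, iterative updating described above.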
We will explore this technique briefly using CityEngine on the lab computers.
Download the zipped file of Carleton College buildings at the link below (or from our Google Drive shared folder) and choose “Import Zipped Project into Workspace” from the File menu to get started.
Photogrammetry is another algorithmic modeling technique that consists of taking multiple overlapping photographs and deriving measurements from them to create 3D models of objects or scenes. The basic principle is quite similar to the way many cameras these days allow you to create a panorama by stitching together overlapping photographs into a 2D mosaic. Photogrammetry takes the concept one step further by using the position of the camera as it moves through 3D space to estimate X, Y and Z coordinates for each pixel of the original image; for that reason it is also known as structure from motion or SfM.
Photogrammetry can be used to make highly accurate and realistically photo textured models of buildings, archaeological sites, landscapes (if the images are taken from the air) and objects. Close range photogrammetry of historical objects offers the possibility of both digitally preserving artifacts before they may be lost or damaged, and of allowing a whole suite of digital measurements, manipulations and other analyses to be performed that allow insights into the material that might not be visible to the naked eye. The technique is gaining in popularity and usage, since it produces very impressive results comparable to high end laser scanning technologies for a mere fraction of the cost.
We will learn this technique using PhotoScan, the leading photogrammetry software. A demo mode is available for free that will let you try everything except exporting and saving your model. If you want to explore more, they offer a 30-day free trial of the full Standard or full Professional editions.
They have also provided us with guidance on how to translate the metadata found in the archives into the standard Dublin Core metadata schema employed by Omeka. We will eventually be building a collection in an Omeka site, but IT is still setting up our server so we’ll start (as many good DH projects do) gathering data on a simple spreadsheet that you should have access to in Google Drive.
We will be using SketchUp’s Match Photo technique to create our models of historic buildings on Carleton’s campus, which we went over together briefly in class. The clearest step-by-step introduction I’ve found to the Match Photo technique of geomodeling is several years old but the basic principles still apply. The first link takes you to step-by-step instructions for the process using a photo of a barn as an example, and the next two link to videos walking through the same example.
The other method often used to model existing buildings is the “Extruded Footprint” technique, which has the benefit of georeferencing your model with Google Earth from the outset. If your building is still standing and visible in Google satellite imagery, then this method might work as a starting point for you, but it won’t provide the level of detail we want unless you combine it with matched photos to add the photo textures and architectural elements. The video below offers an excellent introduction to how these two techniques can be combined to produce an accurate model, in the same way we practiced together in class.
We are primarily interested in exteriors for this class, but if you find floor plans or architectural blue prints for your building in the archives and want to go nuts and try to start modeling a version with the interior walls, go for it. Here are a few basic resources to get you started.
The past decade has witnessed a proliferation of web mapping tools and platforms. These tools have long allowed the simple display of and basic interactions with spatially referenced data, but until recently, if you wanted to do any sort of analysis you had to use a desktop GIS system. That situation has begun to change, however, and there are now many solutions out there for mapping your own data alongside hosted layers from around the web.
There are a lot of open source GIS options out there (see this list for a complete rundown) and which one will be right for you depends on the needs of your project and your familiarity and comfort with coding. Today we’re going to explore some of the most common web mapping platforms out there and see how you can start making fairly complex maps with relatively little startup cost.
Google Maps / Fusion Tables / API
Google’s mapping products are the most well known to the general public, since Google Maps and Google Earth have been around a long time and are ubiquitous. For an example of how to use the API, let’s briefly look at the 2015 Google Maps API tutorial.
First, let’s look briefly at Google’s newer offering, Fusion Tables, which offers a dead simple way to convert a spreadsheet with location data into points on a map. This can be great to get a first look at your data, and do some basic filtering, but if you want to do any more complex visualization or analysis you need a more powerful tool.
Getting Data to Map
For today’s exercises, we are going to use a dataset that is, at least thematically, relevant to our interest in the early history of Carleton. Our dataset can be accessed in the shared folder at this link. (The file, early-colleges.csv, is part of Lincoln Mullen’s historydata datasets on GitHub.)
Download the CSV and open it in Excel or a text editor to see what you will be mapping.
What information does this file hold?
Where is the spatial data? What kind of spatial data is there?
Where might it have come from? How reliable do you think it is?
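You can also ask these questions of a CSV programmatically. The sketch below uses Python's built-in csv module on a few inline rows written in the style of the dataset. Only the city and state columns are confirmed by this exercise; the other column names (college, established) are assumptions, so check them against the header row of the real file.

```python
import csv
import io

# A few illustrative rows. Only "city" and "state" are confirmed column names;
# "college" and "established" are assumptions for this sketch.
SAMPLE_CSV = """college,city,state,established
Harvard,Cambridge,MA,1636
William and Mary,Williamsburg,VA,1693
Yale,New Haven,CT,1701
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)

# What information does the file hold? Start with the header row.
print(reader.fieldnames)

# Where is the spatial data? Look for coordinate columns versus place names.
coordinate_columns = {"lat", "latitude", "lon", "long", "longitude"}
has_coordinates = any(name.lower() in coordinate_columns
                      for name in reader.fieldnames)
print("Has coordinates:", has_coordinates)

# With no coordinates, the place names are the only spatial data available.
places = [(row["city"], row["state"]) for row in rows]
print(places)
```

To run this on the downloaded file instead, replace io.StringIO(SAMPLE_CSV) with an opened file, for example open("early-colleges.csv", newline="").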
ArcGIS is the industry leading GIS platform. It is very powerful, very difficult to learn, and very expensive, since it is proprietary software created and owned by a company called ESRI.
ArcGIS Online is the company’s attempt to reach a more mass market audience. It is a cloud-based GIS service that offers an easy way to add, store, and visualize spatial data, much like Fusion Tables. But ArcGIS Online also offers sophisticated ways to analyze data that until recently were only available in high-end desktop software, and — crucially for humanities projects — any map you create can be turned into a Story Map like the Battle of Gettysburg example we looked at last time.
Like the desktop version, however, ArcGIS Online is not free. It does offer a public version, but it is very limited and offers no analysis capabilities. “Subscription” accounts for organizations start at $2,500/year for 5 users, which is not exactly cheap. We are going to use it in this class, because we are fortunate that Carleton has a subscription and excellent support in the person of our GIS specialist, Wei-Hsin Fu. But I’ve also included information on open source alternatives at the bottom of this post.
Logging in to your College Account
In your email account, you should have received a message from ArcGIS Notifications.
Follow the link in the email to the Carleton College ArcGIS Online homepage (or go directly to carleton.maps.arcgis.com). You will be prompted to sign in to the account. Choose the Carleton College Account option and enter in your account information on the following page.
Creating a Map
In ArcGIS Online, click on the Map tab at the top of the page to bring up the main map editor window
Explore the main map interface. A few things to notice
There are three Guided Tours to get started (we’ll go over these in detail, but you have some built in help here)
The main operations you can perform are listed across the top menu. What can you do with a blank map? What can’t you do yet?
Note the relationship between the scale bar and the zoom control as you navigate around the map
What data do they give you by default?
Click the Basemap tab and explore the options. Pick one that is fairly neutral as a background to our data.
Upload and Map our Data
Now let’s add our data to the map. There are two ways to do this
Click Add and choose to Add Layer from File then navigate through your computer to find the file you want. Note the types of files they allow you to upload
You can import a zipped shape file (ZIP: the default ESRI format for desktop GIS that is widely recognized by other GIS and web mapping platforms)
a comma, semi-colon, or tab delimited text file (CSV or TXT: this kind of tabular data is the most common way to collect your own information and will be portable just about anywhere, not to mention about as future proof as you can get)
a GPS Exchange Format (GPX: this is data uploaded from a GPS tracker, say following a route you ran or biked that was logged by your phone or another GPS enabled device)
Drag and drop a file onto the map window
The second option is much easier and quicker, but either way, find “early-colleges.csv” and upload it. You should be presented with the following import options
Like Google’s Fusion Tables, ArcGIS Online is going to try to geocode your data and provide latitude and longitude coordinates for any place references in your data set by comparing them to a gazetteer (a place-based dictionary) somewhere. It should have correctly recognized the city and state columns as Location Fields
Click Add Layer and see how it did.
Click on a couple of the points on your new map at random to verify if they look correct. The geocoder is pretty good, but ArcGIS does not provide much in the way of error checking and you can’t easily tell what it got wrong just by looking. Buyer beware!
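Conceptually, gazetteer-based geocoding is just a lookup of place names in a table of known coordinates. The sketch below is a simplified illustration, not how ArcGIS Online's geocoder is actually implemented. The gazetteer holds three entries with approximate, rounded coordinates, and the third record is a deliberately misspelled, made-up example.

```python
# A toy gazetteer: (city, state) -> (latitude, longitude), rounded and approximate.
GAZETTEER = {
    ("cambridge", "ma"): (42.37, -71.11),
    ("williamsburg", "va"): (37.27, -76.71),
    ("new haven", "ct"): (41.31, -72.93),
}


def geocode(city, state, gazetteer=GAZETTEER):
    """Look up a place name, ignoring case and surrounding whitespace."""
    key = (city.strip().lower(), state.strip().lower())
    return gazetteer.get(key)  # None when the place is not in the gazetteer


records = [
    {"college": "Harvard", "city": "Cambridge", "state": "MA"},
    {"college": "Yale", "city": "New Haven", "state": "CT"},
    # A made-up record with a misspelled city, to show a failed match.
    {"college": "Example College", "city": "Newhaven", "state": "CT"},
]

matched, unmatched = [], []
for record in records:
    coords = geocode(record["city"], record["state"])
    if coords is None:
        unmatched.append(record)
    else:
        matched.append({**record, "lat": coords[0], "lon": coords[1]})

print(len(matched), "matched;", len(unmatched), "need checking")
for record in unmatched:
    print("No match for:", record["city"], record["state"])
```

Keeping an explicit list of unmatched records is the kind of error checking you have to do yourself when a tool does not report its failures. A misspelling, a renamed town, or two places with the same name can each put a point in the wrong spot or drop it entirely.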
Symbolize and Visualize the Data
By default, the application will try to figure out some pattern in your data and will suggest an attribute to symbolize by. Already, we can see the benefits of a more robust GIS over the simple uniform symbols used on the Fusion Table map.
Explore the style options down the left hand column.
How do the data look if you choose a different attribute to show?
How do the single symbol, heat map, and types options differ? Which is most appropriate for these data?
Explore the various symbolization options under the Types (Unique symbols) drawing style.
Figure out a map display that you are comfortable with and press DONE when you are finished.
Now that you have mapped some data, what else can you do with it?
Check out the options below the layer and see if you can do the following
Show the table of the data. How is it different than or similar to the original CSV?
Create labels for the points that use an appropriate value from the data table, and change the style to your liking
Configure the Pop-Up to show a Custom Attribute Display and combine the data from several fields into a sentence.
Change the value of a point already in your dataset
Add Carleton College to the data set
So what? Asking and Answering Questions
We’ve got a map, but what does it tell us? What can we learn from it that we couldn’t learn from reading the data in a table?
One of the greatest benefits of a GIS is that we can compare different types of data based on their geographic proximity and spatial relationships. ArcGIS Online allows access to a multitude of other datasets hosted on its own servers and around the internet. Let’s see how our data look compared to other information. Can we make it tell other stories of interest?
Click Add again, but this time choose Search for Layers
See if you can find some boundary layers, population data, or land cover information that seems to have a relationship with the colleges that might be meaningfully interpreted.
You might also search the internet for other GIS datasets that might be fruitfully compared.
You’ll need to Zip (compress) the individual year folders in order to upload them, but once you do, see if there is any correlation between the growth of railroads between 1840, 1845 and 1850 and the spread of the colleges.
Save and Share your map
Click Save on the toolbar to title your map and save it to your account. You will need to enter in a title and tags. The map description is optional. Click Save Map when you are satisfied with your descriptions.
You can share a link directly to this map view, but you can also publish a nicer looking layout to share publicly.
Click SHARE on the top toolbar. A new window will pop up. Choose to share this with “Everyone” and then click on Create a Web App.
In the next window, choose the format of the web application. There are many options and you can preview them by clicking on “Create” then “Preview”. Once you find a template you like, click on “Create” then “Create” again.
On the next screen, enter a title and summary for your application (it can be the same as your map title and description). Click SAVE AND PUBLISH.
Congratulations! You’ve made a live interactive web map!
Create a new blog post on your personal WordPress site and embed the new map.
Write a few sentences explaining what it is and how you made it.
Open Source Alternatives
CartoDB is a “freemium” service that offers an entirely cloud-based all-in-one mapping service. You can upload your data, perform analyses, and publish from their servers to the web through a very simple and clean user interface. They also offer one of the best styling interfaces around with many nice looking templates and the ability to tweak them or roll your own using the equivalent of CSS for web mapping that is called, appropriately enough, CartoCSS. And it’s all open source!
One of the big benefits of CartoDB is their “torque map” feature, which allows you to animate your data with minimal effort. In the map below, I’ve uploaded the same early-colleges.csv dataset and animated it so that colleges pop up in the order of their founding.
We won’t go into detail on CartoDB here, but I encourage you to explore on your own with the same data or a different dataset to see how this tool compares with ArcGIS Online.
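The idea behind a time animation like this can be sketched in a few lines: sort the records by date, then build a series of frames in which each new point joins the ones already showing. This illustrates the concept only and is not CartoDB's implementation. The field layout (name, founding year) is assumed.

```python
# (name, founding year) pairs, deliberately out of order.
colleges = [
    ("Yale", 1701),
    ("Harvard", 1636),
    ("William and Mary", 1693),
]


def animation_frames(records):
    """Return (year, names visible so far) for each step of the animation."""
    ordered = sorted(records, key=lambda record: record[1])
    frames = []
    visible = []
    for name, year in ordered:
        visible.append(name)
        frames.append((year, list(visible)))  # copy so earlier frames stay fixed
    return frames


frames = animation_frames(colleges)
for year, names in frames:
    print(year, names)
```

Each frame is cumulative, so the map fills in over time rather than showing one college at a time. Showing only the points founded in each time step would be an equally valid design and would emphasize bursts of activity instead.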
WorldMap is an open source project developed by the Center for Geographic Analysis (CGA) at Harvard with the aim of filling the gap between powerful but hard to learn desktop GIS applications and lightweight online map viewers. You can easily create your own map layers in standard, exportable formats, and you can also view the many maps created by others, including Africa Map, the project that launched the program, and a Tweet Map, a great example of a “big data” geographic data mashup.
WorldMap also hosts its own instance of MapWarper, which will let you georeference an image for free.
The blog assignment for this weekend is to work on your final project proposal with your group.