8B: Network Analysis: A Real-World Project

A Social Network of Carleton College during the First World War

Today we are going to try out a real-world network analysis project and attempt to reconstruct the social network of Carleton College around its 50th anniversary in 1916.  This was a turbulent time for Carleton, since the US was engaged in World War One, and many Carls past and present went to Europe to fight.  Carleton itself was destabilized in the fall of 1918, when the Carleton-St Olaf Unit of the Student Army Training Corps was formed and took over the men’s dormitory as a barracks.

The school was small enough at that time that everyone probably knew everyone else, so the social world was relatively restricted, but we obviously cannot interview people to ask who their friends were.  Of course, Facebook didn’t exist in 1916, so we can’t easily download a list of friends to start exploring.

So how are we going to construct an historical social network?  Well, fortunately, the Algol Yearbook was first published in 1890 and all issues since have been digitized by our helpful librarians and made available through the library catalog.  During the war the yearbooks went on hold, and so the volume published in 1919-1920 contains several years’ worth of data. As the yearbook committee wrote in their Forward:

SINCE the last appearance of the Algol, in the fall of 1916, Carleton has passed from days of peace and normal activities, through the unrest and sacrifice of war, to those of triumphant victory, bringing readjustment, and preparation for even greater progress than before. To the classes of 1919 and 1920 has fallen the peculiar task and privilege of recording Carleton life, not only on the campus, but in military forces of the nation and battleships of our fighting forces, during these eventful years. This book, we hope, will be more than a record, more than a list of names and facts, more than a celebration of victory; it is intended to be a simple expression of appreciation, a permanent memorial, to those who, clad in the Khaki and Blue, served their country in her time of need.

The yearbook won’t exactly tell us who was friends with whom, but they do list the complete class lists for a given year (providing all the individuals that will be the nodes in our network) and also all of the organizations on campus with their member lists.  We can assume that people who belonged to the same organization interacted regularly and construct relationships (the edges of our network) based on these affiliations.

This is called an affiliation network and can be a powerful way of exploring connections and influence, as illustrated by the graph of regional networks of railroad directors below.  Ours will allow us to construct a bimodal network graph of individuals and organizations in Carleton’s wartime history, from which we can derive a one-mode network of connected individuals based on their co-membership in these clubs.  The result should allow us to ascertain who were the most central individuals at Carleton that year, who formed bridges between different communities and other useful network metrics.

Networks of Boards of Directors of Western railroads, from “Railroaded” by the Stanford Spatial History Project

There are many sophisticated ways to perform this type of analysis (here’s a good Stanford tutorial on doing so with R, and here’s a tutorial for using Palladio at the great Programming Historian).

But we are hackers here and are going to use easy, off-the-shelf tools: Google Sheets to collect our data collaboratively in the cloud and the free NodeXL template for Microsoft Excel to visualize the network.  (Microsoft Excel itself, is of course, not free, but so widely available that this tool should be accessible and of use to most readers.)

 


Data Collection

First, we need to extract the lists of students and their membership affiliations from the yearbook.  The volumes were OCRd, but the process is far from perfect, making automatic data extraction difficult.  We will need to do a good amount of copy/pasting and hand editing to get these lists into shape.

  1. Go to the 1919-20 Algol  in the Carleton Digital Collections and explore the item
    • The interface allows us to view the PDF images and the scanned text
    • Go to page 82-83 which lists the Senior Class, and
  2. Click the View PDF & Text button to see what I mean about the data quality
    • The original includes a lot of white space and multiple columns for formatting, which does not translate very well into a clean table formatted CSV, our ultimate end goal. These yearbooks were also scanned as two-page spreads, which caused the OCR to freak out and mix up values from both columns.

If it were cleaner, we might be able to automate some of the data cleaning process by using a python script and regular expressions, or a tool like Open Refine, but alas, for a project of this scale, this would be more labor than it is worth.

So we’ll need to do this the old fashioned way with manual copy/pasting into spreadsheets.  We’ll try to speed it up, at least, by dividing the labor among us.

 

Getting our Data

 

NB: The instructions for doing this are below, but we are going to leverage the work of last year’s students and just re-use their gathered and cleaned data:

 NetworkAnalysisDataCollection Google Sheet

We are interested in names of students and their organizational affiliations, and you’ll notice that all of this information is right here for the Senior Class.  Great!  Unfortunately, it is only there for the Seniors and Juniors, but not for anyone else.

We are going to gather our data from Book 4 of the Yearbook: Literary Societies and Other Organizations. These lists contain full names, under each heading, and often class year, and they include underclassmen and -women as well.  Plus, these “literary” societies were as close as Carleton came to Greek fraternities and sororities, and since they served a similar social function they are especially relevant to our social network analysis.

  1. Download the PDF to your local machine so that it is easier to work with.
  2. Open the PDF in Preview or Acrobat
  3. Go to page 161, Book 4 and pick a society.
    • Highlight the text and see what the underlying OCR will let you grab at one go.  Chances are the class years will get all out of whack, so you’ll probably need to copy names individually.
  4. Go to our NetworkAnalysisDataCollection Google Sheet
    • You’ll see that I’ve set up a Master List sheet with the following columns
      • Name
      • Organization
      • Class
    • We’ll all populate this at the end, but to avoid stepping all over each other, we’ll each make a separate sheet for primary data collectionScreen Shot 2015-11-05 at 12.47.25 PM
  5. Duplicate the Sheet
    1. Rename it to the Name of your Organization
      1. E.g. Adelphic Society
  6. Now the hard part:
    1. Copy all the names to Name column
    2. Add the Class year where appropriate
    3. Fill in the name of your organization for each row
  7. Clean the data if necessary to get rid of any extraneous characters
    • You might find Google’s Find and Replace using regular expressions useful.
      • For instance, you could do a find/replace swapping out something like
        .s(a period followed by whitespace) for
        \n (a new line, or carriage return).
    • The result should be a single list containing the names of all members of the society with their class year
  8. Now lets compile the Master List
    1. Copy your data into the Master List sheet below that of others

Data Cleaning

Since the yearbook compilers were not thinking in terms of machine compatibility, they often recorded names differently in different lists, e.g. “Hannah M. Griffith” in the class list might show up as “Griffith, H.M.” in a member list.

To make our graph meaningful we would need to create an Index of Persons from the Index of Names we just created.  This is a standard step in prosopographical research (that is, research on a group of named individuals).  For an example of a major DH project underway to distinguish persons from names in the ancient world, see the SNAP:DRGN project.

To do this properly, we would construct a database with unique IDs for each name, linked as foreign keys to a new table listing unique persons, along with new fields describing the rationale for our decisions. We don’t have the time for that in this project, so we are just going to assume that if the initials match, we have the same person

We will just try to get as accurate as we can.

  1. Once everyone’s data is entered, we will sort it alphabetically by name to put all the similar names together.
  2. Use the Google =UNIQUE formula to check for slight differences
    • Go through the list of names, and replace each instance with its equivalent
      • Repeat until cleaned
  3. Resort the list based on Organization

 


Exercise: Affiliation Network in NodeXL

NodeXL is a free extension that gives Excel the functionality of a network analysis program, allowing you to store, analyze and visualize networks.

Now that we have all of the yearbook data collected in our sheet, we need to turn it into a format that NodeXL can read.  The simplest of these is an edge list, which consists of two columns containing the node names, where each row is an edge connecting those two vertices.

The Orgs sheet you created above will be our edge list, which we can use to make a bimodal display of our affiliation network data in NodeXL.

  1. Download and install NodeXL Template from this site
    • Follow the installation instructions and launch a blank template
NodeXLribbon
The NodeXL ribbon of tools added to Microsoft Excel

You should now have a form to begin filling in, and a new NodeXL ribbon of tools at the top of the Excel window (like the image above).

  1. Copy and paste the edge list of all your people and organizations into the Edges tab’s first two Vertex Columns

This detailed introductory tutorial by Derek Hanson and Ben Schneiderman will give you a step by step guide to using the program.  The Serious Eats Analysis section beginning on page 27 provides an example of working with bimodal data like ours.

Follow the instructions in the tutorial linked above to test out NodeXL’s capabilities.  Try to figure out how to do the following

  • Generate a graph,
  • Generate graph metrics
    • Particularly degree
  • Add styling and color to vertices and edges
  • Add labels
  • Change the layout
    • Experiment with good options for visualizing a multimodal network (hint: try the Harel-Koren Fast Multiscale option)

What insights can you glean from this visualization?

Is it helpful to see the affiliations in this way?

Do any individuals or organizations stand out as particularly isolated or well connected?

 


Exercise: Person-to-Person Network

We are not just interested in the indirect connections through affiliations, however, we also want to see how co-membership creates direct connection between people.  In order to transform our bimodal network into a person-to-person unimodal one we need to turn this edge list into a matrix.  A network can be represented as a binary matrix wherein a connection is indicated by a 1 and no connection receives a 0.  The following two tables represent the same network information, showing directed relationships between Nodes 1 and 2 (note that Alice’s row contains all 0s, since she never appears in the Node 1 list).

Screen Shot 2015-02-24 at 11.33.31 AM

To get a matrix like this for our data, we would put the Organizations along the top axis, and enter 1s wherever people were members and zeroes everywhere else.  Doing this by hand for a large dataset would be very time consuming.  Statistical packages like R have functions that will do these transformations for you, but it can also be done using Excel’s Pivot Table feature to generate the person-to-affiliation matrix and the SumProduct function on the result to connect people to people based on the number of organizations they both attended.

If you want a model download the excel file below to see how everything listed below works together.

Matrix_Example

Creating an Affiliation Matrix

  1. Put your cursor in the data range for your edge list and select Pivot Table from the Insert or Data menu
    • Drag the field names into the areas indicated in the image, so that
      • Names label Rows,
      • Organizations label Columns,
    • Then drag Organizations into Values as well, so that Count of Organization fills in the matrix.
    • The result should like like that belowScreen Shot 2015-02-24 at 11.51.23 AM

Great, we’ve got a matrix showing a 1 (true) wherever a given person is affiliated with an organization and nothing (false) where there was no affiliation.  This is just a different way to represent the same data that was in our edge list.  But we want to see how people were connected with people, not the groups, so we need to do some matrix math.

Specifically we are going to compare the records of every two people in our matrix and if make use of binary math to see where their membership in groups overlaps.  We’ll make a new matrix that has the people’s names across both axes, and the values in this new matrix will indicate how many affiliations the people at each intersection shared.

Make sense?  Sort of?  Hopefully it will make more as we step through the process in the exercise below.

 

Creating a Person-to-Person Matrix

The first step is to create a new box for our data that has the names on both axes, not just the vertical.  So we’ll copy the names over to a new range, and also transpose them across the top.

  • Copy the names (and only the names) from the Pivot Table into a new empty column to the right of the Pivot Table.  
    • For example, my pivot table takes up columns A through D, so I pasted my names into column F.
  • Click in the cell above and to the right of the top name (the first cell of the column headings), choose Edit > Paste Special… and check the Transpose box to copy the same list across instead of downScreen Shot 2015-02-24 at 11.57.33 AM
  • You should now have an empty matrix with names across both axes

To fill this grid, we need to use some advanced Excel features: named ranges, and nested formulas.  The formula we will be using is complex and looks like this:

=IF($G2<>H$1,SUMPRODUCT(OFFSET(matrix,MATCH($G2,names,0)-1,0,1,COLUMNS(matrix)),OFFSET(matrix,MATCH(H$1,names,0)-1,0,1,COLUMNS(matrix))),0)

I will try to break it down a bit.  If you don’t care about how this works, feel free to skip this section!

  1. The main component is SUMPRODUCT, which will take as input two rows of our person-to-affiliation matrix, multiplying each set of values together and adding the results. 1X0 or 0x0 gives us a 0, but wherever we have two ones in the same column (i.e. the people in those rows belonged to the same group) 1×1 = 1.
    • If the two people attended multiple groups, those 1s are added together, giving us a weighted value in our new matrix: 1s for one co-membership, 0s for none, and higher number for multiples.
  2. SUMPRODUCT takes two ranges as inputs, and to select them we have two OFFSET functions
    • OFFSET(reference,rows,cols,height,width)
      • This looks in a range of cells (reference), starting a certain number of rows (rows), and columns (cols) away from the reference point, and returns a selected range of (height) and (width) cells.
      • In our example, the reference is the named range “matrix”, which will return a range 1 row high and the width of our “matrix” range (COLUMNS(matrix)).  The number of rows is provided by the output of another function
        • MATCH(lookup_value,lookup_array,match_type)
          • This returns the position of the “lookup_value” in “lookup_array”, which in our case is matching the name in the current row in our new matrix ($G2 in the example above)
      • The second OFFSET is identical to the first, except it will match against the name in the current column of our matrix (H$1) above
  3. Finally, we wrap the whole thing in an IF function
    • IF(logical_test,value_if_true,value_if_false)
    • This makes sure the names in the current row and column are not equal to each other ($G2<>H$1), and runs the function only if true, otherwise returning a 0
      • Why?  Because otherwise we would get very high values across the diagonal since each person obviously shared membership with themselves in each group they belonged to!

Phew! Got all that? It’s a lot, I know, but if you name the ranges correctly and set the values for your first cell, you should be able to copy the formula into the rest of the blank cells and Hey Presto! a weighted person-to-person matrix should appear.

 

Using Named Ranges and Modifying the Formulas

  1. Name your ranges so that the formula will work
    • In your Matrix sheet, select the binary values (omitting the Grand Totals!)
      • Put your cursor in the name field at top left and type “matrix” then hit returnScreen Shot 2015-02-24 at 12.39.42 PM
    • Select the list of names (omitting the Row Labels and Grand Total)
      • Put your cursor back in the name field and type “names” then hit returnScreen Shot 2015-02-24 at 12.41.18 PM
  2. Almost there! Now for the formula
    • Copy the full formula above
    • Double click in the top left cell of your blank person-to-person matrix and paste the formula
    • Before you hit return
      • You need to change the values of $G2 and H$1 to select the first names on each axisScreen Shot 2015-02-24 at 12.45.12 PM
        • In my example, I would need to change BOTH $G2s to $F4s, and BOTH H$1s to G$3s
          • Make sure you keep the dollar signs in place, so that the labels remain selected
    • Copy the cell you just entered the formula in and paste it in the rest of the range and you should see all the values magically populate
  3. Finally, we can output this matrix as a csv file and import it into NodeXL
    1. Copy the whole matrix you just created including labels (e.g. F3:J7 above)
      1. Add a new sheet called CSV
      2. Click cell A1
      3. Edit > Paste Special…
      4. Click on the Values radio button and hit OK
    2. Got to File > Save As
      1. Choose Format: Comma Separated Values (.csv)

 

We are finally ready to load the person-to-person matrix you created above into NodeXL

  1. Go to NodeXL > Import >  From Open Matrix Workbook… and import the csv file you just made
  2. Generate a graph and explore visualization options.

 

You did it! Give yourself a pat on the back.  That was hard work.


Assignment Tutorial Blog Post (Due TUESDAY, 11/10)

For this assignment, create a step-by-step tutorial as a blog post demonstrating a particular technique, tool, or other helpful how-to discovery you’ve made over the past several weeks in this course.

Pick a DH tool that we haven’t discussed yet and figure out an interesting use case for it (or, vice versa, pick a use case and figure out a potentially viable DH tool or methodology).  You can highlight a technique that you have discovered in class, or in the preparation of your projects, as long as it is not one we’ve all covered together already.  If you’re stuck for ideas, the Dirt Digital Research Tools directory offers an extensive list of software for academic uses.

Once you have an idea, create an online tutorial for the rest of us and the wider world to start paying forward what you’ve learned in the course and becoming the “local computer expert.”  For examples, you can look at some of the posts for this class, think back on all those SketchUp resources you’ve looked through, or see the software posts on the Profhacker blog.

Your tutorial blog post should include:

  • An introductory paragraph explaining clearly
    • what the tool or technique is and
    • why or in what context it would be useful
  • A step-by-step walkthrough of how to accomplish a specific task using the tool that contains
    • At least 5 steps 
    • EITHER screenshots illustrating the steps where appropriate
    • OR a screencast video in which you record your actions while speaking about the process into a microphone
  • A link to at least two further resources like the software’s documentation or other tutorials around the web

For screen capture software, if you Google “how to create tutorials screenshot” you’ll be overwhelmed with options.

8A: Network Analysis

The advent of the internet, and especially of its more socially connected Web 2.0 variant, has ushered in a golden age for the concept of the network.  The interconnected world we now live in has changed not only the way we study computers and the internet, but the very way we envision the world and humanity’s place in it, as Thomas Fisher has argued.  The digital technologies that we are learning to use in this class are tightly linked to these new understandings, making network analysis a powerful addition to the Digital Humanist’s toolkit.  According to Fisher,

The increasingly weblike way of seeing the world … has profound implications for how and in what form we will seek information. The printed book offers us a linear way of doing so. We begin at the beginning—or maybe at the end, with the index—and work forward or backward through a book, or at least parts of it, to find the information we need. Digital media, in contrast, operate in networked ways, with hyperlinked texts taking us in multiple directions, social media placing us in multiple communities, and geographic information systems arranging data in multiple layers. No one starting place, relationship, or layer has privilege over any other in such a world.

Small NetworkTo study this world, it can therefore be helpful to privilege not the people, places, ideas or things that have traditionally occupied humanistic scholarship, but the relationships between them.  Network analysis, at root, is the study of the relationships between discrete objects, which are represented as graphs of nodes or vertices (the things) and edges (the relationships between those things).  This is a very active area of research that emerged from mathematics but is being explored in a wide array of disciplines, resulting in a vast literature.  (Scott Weingart offers a gentle introduction for the non-tech savvy in his Networks Demystified series and you can get a sense of the scope from the Wikipedia entry on Network Theory.)  As hackers, we are not going to get too deep into the mathematical underpinnings and rely mostly on software platforms that make network visualization relatively easy, but it is important to have a basic understanding of what these visualizations actually mean in order to use them critically and interpret them correctly.

 


Exercise: Your (analog) social network

The basics of visualizing a network are fairly intuitive and can be done with pen and paper.

  • Draw a simple diagram of your own social network including
    • 10-12 people as nodes and
    • your relationship to them as edges
  • Put yourself at the center and then place other people around you.
    • Start with your immediate family (your kinship network) and then expand out to include extended family, friends, people you know through clubs or activities, etc.
  • Draw lines to connect these people to yourself
  • Now draw lines to connect them to each other.
    • How many have relationships that do not run through you?
  • As undifferentiated lines, these are probably not very informative, so code the lines to indicate the nature of each relationship

What takeaways emerge from your diagram?

Are there connections that surprised you or figures that emerge as more central to your network than you had realized?

  • Swap diagrams with your neighbor and see if the diagram helps you understand their network more easily.

 


Exercise: Your (digital) social network

The relationships you just drew can be expressed in a simple data model as a “triple” comprised of a subject, a predicate, and an object.  My relationship to my friend Chris for instance can be expressed as a triple in the following format:

Austin — is friends with — Chris

subject — predicate — object

Each relationship in your whole network can be represented this way as a set of triples, that allow for easily readable data storage and ready network visualization.   Many DH projects make extensive use of the RDF (Resource Description Framework) specifications for modeling large sets of data as an RDF graph of triples.  For our small example, we are going to recast our personal network as a set of triples and visualize it as a digital network using Google’s Fusion Tables application.

As we’ve already seen, Fusion Tables is an experimental platform for data visualization that Google has developed to allow spreadsheet data to be quickly visualized in any number of ways from traditional bar and line charts to maps and network visualizations.  Google launched its first MOOC around Fusion Tables a while back called Making Sense of Data that you can still view if you want an in depth look at how to use all the features of this application.  For now, we going to focus on its Network Graph capabilities.

Our first step will be to populate a Google Sheet with triples representing our own network data, and then import it into Fusion Tables and visualize it.Screen Shot 2015-02-17 at 11.38.45 AM

  • Launch Google Drive and create a new sheet with the following three columns: Person A, Relation, and Person B
    • Go through your hand-drawn diagram and translate each network relationship into a triple following the model above

(One word of caution — there are two types of relationship that can be expressed here: mutual and unreciprocated.  “Is friends with” or “is a sibling of” would be mutual relationships that produce an undirected graphDirected graphs map one-sided relationships like “is the parent of,” “is the student of” or “is in love with” by drawing a directional arrow for the edge.  Both are possible and can be used, but you should be aware of the distinction as you draw up your triples and stick to one or the other.)

This data model is unlike a relational database in that you will be repeating names in order to express all of the relationships in the graph.

  • Try to connect each person or node with at least two others
  • Make sure you are logged in and save your sheet

Import your data into Fusion Tables

Go to the Fusion Tables start page, click on Google Spreadsheets and import your data, checking the Export box if you wish to make the data public and downloadable.  Screen Shot 2015-02-17 at 11.49.41 AM

  • A window should open showing your data table.  You will add a new chart by clicking the red plus sign of the type “Network graph” and change the options to Show the Link Between your Person A and Person B columns.Screen Shot 2015-02-17 at 11.57.27 AM
  • Congratulations!  You have just made a graph of your social network.  Explore the limited options and apply some filters, then click and drag around the graph to see how you can change the visualization.

 


 

Assignment

Now that you know the basics of what a network graph is and how to create a rudimentary one, let’s explore some much more sophisticated network analysis DH projects.  With your group, explore one or more of the following projects:

As you explore the project, consider the following questions about the nature of this network analysis:

  • What (or who) are the nodes and what are the edges?
  • How are the relationships characterized and categorized?
  • What interactions does the project allow?
    • How does this impact their effectiveness and/or your engagement?
  • How was the project created?
    • See if you can dig around in the documentation and discover what tools or data manipulation steps produced the outcome you see.
    • Does the project combine network analysis with any other information or technique, like spatial analysis or text mining?