Photogrammetry rocks!

Sorry the last third sounds like it was recorded with a toaster. There were some hardware difficulties. And software difficulties. But nonetheless, here is a tutorial outlining my process of making a 3D model, including aligning photos by marker and merging chunks by marker.

My Agisoft Tutorial!


Data Visualization Using RStudio + ggplot2


I took a 300-level stats class last term called Data Science, and in that class I learned how to use the ggplot2 package in R (a programming language that is great for statistical analysis) to plot various interesting graphs for data visualization. I found this R package extremely useful (I am actually using it to plot various graphs for my Computer Science Senior Capstone project). I want to share with you here the very basics of data visualization using ggplot2 in RStudio (an IDE, or integrated development environment, for R).


So first of all, you need to download R and RStudio before you get started. Once you have downloaded them, click the RStudio icon (which appears as the one below) to launch it.

After launching it, you will get something like this:

Next, select File -> New File -> R Markdown, and name the R Markdown file you are going to create. I named it DataVisualizationTutorial.

Then, an R Markdown file will appear that replaces the console, and you will notice that there’s already something in this file:

In our next step, we will insert a script in the R Markdown file to upload some R packages including the ggplot2 package we will be using:

After typing this chunk of script in the R Markdown file right after the {r setup} chunk, you can click the green triangle on the top right corner of the chunk you just typed in to run this chunk of script (to actually install the packages).
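The screenshot of that chunk isn't reproduced here, but it is presumably along these lines (ggplot2 is the one package we actually need below; anything else is optional):

```r
# Install the package once, then load it for this session
install.packages("ggplot2")
library(ggplot2)
```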

Then we want to load a dataset in .csv form into RStudio.

This data file serves as an example in this tutorial. When you want to visualize your own data, you just load your own .csv file into RStudio.

We load the .csv file by adding the following script in our R Markdown file (at the bottom of the image below):
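The image doesn't come through in this copy of the post; the chunk is presumably a single read.csv call (the file name here is just a stand-in for your own):

```r
# Read the .csv into a dataframe; substitute your own file name
graphDataFrame <- read.csv("graphData.csv")
```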

Then we click the green triangle run button for this R script chunk to run our newly added script and load the data. After doing so, you will see the Data window on the top right corner of your screen will now have this graphDataFrame variable.

By clicking this graphDataFrame in the Data window, another window tab will pop up next to our R Markdown file. This is an R dataframe that we just created from the .csv file.

Now that the data is loaded as a dataframe in R, we can use ggplot commands to plot the data. Below is the script to plot a point graph with the x-axis corresponding to the “bin_size” column in our dataframe, y corresponding to “rmse”, and the color/shape of the points corresponding to the “variable” column.
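The script itself is in a missing screenshot, but from that description it is presumably something like:

```r
# Point plot: x = bin_size, y = rmse, color/shape mapped to variable
ggplot(graphDataFrame, aes(x = bin_size, y = rmse,
                           color = variable, shape = variable)) +
  geom_point()
```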

If we click the green button, we will see the graph showing up in the bottom right corner of the screen:

We can now change the labels for the x and y axes and also add a title to our graph by writing the following script:
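Again, the script is in a missing screenshot; adding labels and a title presumably uses labs(), with the label text below being placeholders:

```r
ggplot(graphDataFrame, aes(x = bin_size, y = rmse,
                           color = variable, shape = variable)) +
  geom_point() +
  labs(x = "Bin size", y = "RMSE", title = "RMSE by bin size")
```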

Now if we run it again, we will have a graph with better x and y labels and a title!

Next, we can export this visualization using the Export button:

Then, we will get a .png file for this newly created plot!


Here are two further resources about using ggplot2 to make data visualizations:

8B: Network Analysis: A Real-World Project

A Social Network of Carleton College during the First World War

Today we are going to try out a real-world network analysis project and attempt to reconstruct the social network of Carleton College around its 50th anniversary in 1916.  This was a turbulent time for Carleton, since the US was engaged in World War One, and many Carls past and present went to Europe to fight.  Carleton itself was destabilized in the fall of 1918, when the Carleton-St Olaf Unit of the Student Army Training Corps was formed and took over the men’s dormitory as a barracks.

The school was small enough at that time that everyone probably knew everyone else, so the social world was relatively restricted, but we obviously cannot interview people to ask who their friends were.  Of course, Facebook didn’t exist in 1916, so we can’t easily download a list of friends to start exploring.

So how are we going to construct an historical social network?  Well, fortunately, the Algol Yearbook was first published in 1890 and all issues since have been digitized by our helpful librarians and made available through the library catalog.  During the war the yearbooks went on hold, and so the volume published in 1919-1920 contains several years’ worth of data. As the yearbook committee wrote in their Foreword:

SINCE the last appearance of the Algol, in the fall of 1916, Carleton has passed from days of peace and normal activities, through the unrest and sacrifice of war, to those of triumphant victory, bringing readjustment, and preparation for even greater progress than before. To the classes of 1919 and 1920 has fallen the peculiar task and privilege of recording Carleton life, not only on the campus, but in military forces of the nation and battleships of our fighting forces, during these eventful years. This book, we hope, will be more than a record, more than a list of names and facts, more than a celebration of victory; it is intended to be a simple expression of appreciation, a permanent memorial, to those who, clad in the Khaki and Blue, served their country in her time of need.

The yearbook won’t exactly tell us who was friends with whom, but it does list the complete class lists for a given year (providing all the individuals that will be the nodes in our network) and also all of the organizations on campus with their member lists.  We can assume that people who belonged to the same organization interacted regularly and construct relationships (the edges of our network) based on these affiliations.

This is called an affiliation network and can be a powerful way of exploring connections and influence, as illustrated by the graph of regional networks of railroad directors below.  Ours will allow us to construct a bimodal network graph of individuals and organizations in Carleton’s wartime history, from which we can derive a one-mode network of connected individuals based on their co-membership in these clubs.  The result should allow us to ascertain who the most central individuals at Carleton were that year, who formed bridges between different communities, and other useful network metrics.

Networks of Boards of Directors of Western railroads, from “Railroaded” by the Stanford Spatial History Project

There are many sophisticated ways to perform this type of analysis (here’s a good Stanford tutorial on doing so with R, and here’s a tutorial for using Palladio at the great Programming Historian).

But we are hackers here and are going to use easy, off-the-shelf tools: Google Sheets to collect our data collaboratively in the cloud and the free NodeXL template for Microsoft Excel to visualize the network.  (Microsoft Excel itself is, of course, not free, but it is so widely available that this tool should be accessible and of use to most readers.)


Data Collection

First, we need to extract the lists of students and their membership affiliations from the yearbook.  The volumes were OCRd, but the process is far from perfect, making automatic data extraction difficult.  We will need to do a good amount of copy/pasting and hand editing to get these lists into shape.

  1. Go to the 1919-20 Algol  in the Carleton Digital Collections and explore the item
    • The interface allows us to view the PDF images and the scanned text
    • Go to pages 82-83, which list the Senior Class, and
  2. Click the View PDF & Text button to see what I mean about the data quality
    • The original includes a lot of white space and multiple columns for formatting, which does not translate very well into a clean table formatted CSV, our ultimate end goal. These yearbooks were also scanned as two-page spreads, which caused the OCR to freak out and mix up values from both columns.

If it were cleaner, we might be able to automate some of the data cleaning process by using a Python script and regular expressions, or a tool like OpenRefine, but alas, for a project of this scale, this would be more labor than it is worth.

So we’ll need to do this the old fashioned way with manual copy/pasting into spreadsheets.  We’ll try to speed it up, at least, by dividing the labor among us.


Getting our Data


NB: The instructions for doing this are below, but we are going to leverage the work of last year’s students and just re-use their gathered and cleaned data:

 NetworkAnalysisDataCollection Google Sheet

We are interested in names of students and their organizational affiliations, and you’ll notice that all of this information is right here for the Senior Class.  Great!  Unfortunately, it is only there for the Seniors and Juniors, but not for anyone else.

We are going to gather our data from Book 4 of the Yearbook: Literary Societies and Other Organizations. These lists contain full names, under each heading, and often class year, and they include underclassmen and -women as well.  Plus, these “literary” societies were as close as Carleton came to Greek fraternities and sororities, and since they served a similar social function they are especially relevant to our social network analysis.

  1. Download the PDF to your local machine so that it is easier to work with.
  2. Open the PDF in Preview or Acrobat
  3. Go to page 161, Book 4 and pick a society.
    • Highlight the text and see what the underlying OCR will let you grab at one go.  Chances are the class years will get all out of whack, so you’ll probably need to copy names individually.
  4. Go to our NetworkAnalysisDataCollection Google Sheet
    • You’ll see that I’ve set up a Master List sheet with the following columns
      • Name
      • Organization
      • Class
    • We’ll all populate this at the end, but to avoid stepping all over each other, we’ll each make a separate sheet for primary data collection
  5. Duplicate the Sheet
    1. Rename it to the Name of your Organization
      1. E.g. Adelphic Society
  6. Now the hard part:
    1. Copy all the names to Name column
    2. Add the Class year where appropriate
    3. Fill in the name of your organization for each row
  7. Clean the data if necessary to get rid of any extraneous characters
    • You might find Google’s Find and Replace using regular expressions useful.
      • For instance, you could do a find/replace swapping out something like
        \.\s (a period followed by whitespace) for
        \n (a new line, or carriage return).
    • The result should be a single list containing the names of all members of the society with their class year
  8. Now let’s compile the Master List
    1. Copy your data into the Master List sheet below that of others

Data Cleaning

Since the yearbook compilers were not thinking in terms of machine compatibility, they often recorded names differently in different lists, e.g. “Hannah M. Griffith” in the class list might show up as “Griffith, H.M.” in a member list.

To make our graph meaningful we would need to create an Index of Persons from the Index of Names we just created.  This is a standard step in prosopographical research (that is, research on a group of named individuals).  For an example of a major DH project underway to distinguish persons from names in the ancient world, see the SNAP:DRGN project.

To do this properly, we would construct a database with unique IDs for each name, linked as foreign keys to a new table listing unique persons, along with new fields describing the rationale for our decisions. We don’t have the time for that in this project, so we are just going to assume that if the initials match, we have the same person.

We will just try to be as accurate as we can.

  1. Once everyone’s data is entered, we will sort it alphabetically by name to put all the similar names together.
  2. Use the Google =UNIQUE formula to check for slight differences
    • Go through the list of names, and replace each instance with its equivalent
      • Repeat until cleaned
  3. Resort the list based on Organization
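As a sketch of the =UNIQUE step, assuming the names live in column A of the Master List, you would put something like this in an empty column and compare its output against the raw list:

```
=UNIQUE(A2:A1000)
```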


Exercise: Affiliation Network in NodeXL

NodeXL is a free extension that gives Excel the functionality of a network analysis program, allowing you to store, analyze and visualize networks.

Now that we have all of the yearbook data collected in our sheet, we need to turn it into a format that NodeXL can read.  The simplest of these is an edge list, which consists of two columns containing the node names, where each row is an edge connecting those two vertices.
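For instance, the first few rows of an affiliation edge list might look like this (names made up):

```
Vertex 1         Vertex 2
Alice Smith      Adelphic Society
Alice Smith      Glee Club
John Doe         Adelphic Society
```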

The Orgs sheet you created above will be our edge list, which we can use to make a bimodal display of our affiliation network data in NodeXL.

  1. Download and install NodeXL Template from this site
    • Follow the installation instructions and launch a blank template
The NodeXL ribbon of tools added to Microsoft Excel

You should now have a form to begin filling in, and a new NodeXL ribbon of tools at the top of the Excel window (like the image above).

  1. Copy and paste the edge list of all your people and organizations into the Edges tab’s first two Vertex Columns

This detailed introductory tutorial by Derek Hansen and Ben Shneiderman will give you a step-by-step guide to using the program.  The Serious Eats Analysis section beginning on page 27 provides an example of working with bimodal data like ours.

Follow the instructions in the tutorial linked above to test out NodeXL’s capabilities.  Try to figure out how to do the following:

  • Generate a graph,
  • Generate graph metrics
    • Particularly degree
  • Add styling and color to vertices and edges
  • Add labels
  • Change the layout
    • Experiment with good options for visualizing a multimodal network (hint: try the Harel-Koren Fast Multiscale option)

What insights can you glean from this visualization?

Is it helpful to see the affiliations in this way?

Do any individuals or organizations stand out as particularly isolated or well connected?


Exercise: Person-to-Person Network

We are not just interested in the indirect connections through affiliations, however; we also want to see how co-membership creates direct connections between people.  In order to transform our bimodal network into a person-to-person unimodal one, we need to turn this edge list into a matrix.  A network can be represented as a binary matrix wherein a connection is indicated by a 1 and no connection receives a 0.  The following two tables represent the same network information, showing directed relationships between Node 1 and Node 2 (note that Alice’s row contains all 0s, since she never appears in the Node 1 list).

Edge list and matrix representations of the same network

To get a matrix like this for our data, we would put the Organizations along the top axis, and enter 1s wherever people were members and zeroes everywhere else.  Doing this by hand for a large dataset would be very time consuming.  Statistical packages like R have functions that will do these transformations for you, but it can also be done using Excel’s Pivot Table feature to generate the person-to-affiliation matrix and the SUMPRODUCT function on the result to connect people to people based on the number of organizations they both belonged to.
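To see what those two steps compute, here is a tiny illustration in Python rather than Excel (names and clubs are invented): each person gets a row of the affiliation matrix, and the person-to-person entry is the sum of the products of two people's rows, with the diagonal zeroed out.

```python
# People's rows in a person-to-affiliation matrix:
# columns are two made-up organizations.
membership = {
    "Alice": [1, 1],
    "Bob":   [1, 0],
    "Carol": [0, 1],
}
people = list(membership)

# Person-to-person matrix: entry (a, b) counts shared organizations
# (the sum-of-products of the two rows), with the diagonal zeroed out.
shared = {
    (a, b): (sum(x * y for x, y in zip(membership[a], membership[b]))
             if a != b else 0)
    for a in people for b in people
}

print(shared[("Alice", "Bob")])    # → 1 (both in the first club)
print(shared[("Bob", "Carol")])    # → 0 (no club in common)
```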

If you want a model, download the Excel file below to see how everything described here works together.


Creating an Affiliation Matrix

  1. Put your cursor in the data range for your edge list and select Pivot Table from the Insert or Data menu
    • Drag the field names into the areas indicated in the image, so that
      • Names label Rows,
      • Organizations label Columns,
    • Then drag Organizations into Values as well, so that Count of Organization fills in the matrix.
    • The result should look like that below

Great, we’ve got a matrix showing a 1 (true) wherever a given person is affiliated with an organization and nothing (false) where there was no affiliation.  This is just a different way to represent the same data that was in our edge list.  But we want to see how people were connected with people, not the groups, so we need to do some matrix math.

Specifically, we are going to compare the records of every two people in our matrix and make use of binary math to see where their memberships in groups overlap.  We’ll make a new matrix that has the people’s names across both axes, and the values in this new matrix will indicate how many affiliations the people at each intersection shared.

Make sense?  Sort of?  Hopefully it will make more sense as we step through the process in the exercise below.


Creating a Person-to-Person Matrix

The first step is to create a new box for our data that has the names on both axes, not just the vertical.  So we’ll copy the names over to a new range, and also transpose them across the top.

  • Copy the names (and only the names) from the Pivot Table into a new empty column to the right of the Pivot Table.  
    • For example, my pivot table takes up columns A through D, so I pasted my names into column F.
  • Click in the cell above and to the right of the top name (the first cell of the column headings), choose Edit > Paste Special… and check the Transpose box to copy the same list across instead of down
  • You should now have an empty matrix with names across both axes

To fill this grid, we need to use some advanced Excel features: named ranges, and nested formulas.  The formula we will be using is complex and looks like this:
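The screenshot of the formula doesn't survive in this copy of the post. Reconstructed from the breakdown below, it is presumably something like the following (the "- 1" offsets are my inference, since MATCH returns a 1-based position while OFFSET counts rows from the reference itself):

```
=IF($G2<>H$1,
  SUMPRODUCT(
    OFFSET(matrix, MATCH($G2, names, 0) - 1, 0, 1, COLUMNS(matrix)),
    OFFSET(matrix, MATCH(H$1, names, 0) - 1, 0, 1, COLUMNS(matrix))),
  0)
```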


I will try to break it down a bit.  If you don’t care about how this works, feel free to skip this section!

  1. The main component is SUMPRODUCT, which will take as input two rows of our person-to-affiliation matrix, multiplying each set of values together and adding the results. 1×0 or 0×0 gives us a 0, but wherever we have two ones in the same column (i.e. the people in those rows belonged to the same group) 1×1 = 1.
    • If the two people attended multiple groups, those 1s are added together, giving us a weighted value in our new matrix: 1 for one co-membership, 0 for none, and higher numbers for multiples.
  2. SUMPRODUCT takes two ranges as inputs, and to select them we have two OFFSET functions
    • OFFSET(reference,rows,cols,height,width)
      • This looks in a range of cells (reference), starting a certain number of rows (rows), and columns (cols) away from the reference point, and returns a selected range of (height) and (width) cells.
      • In our example, the reference is the named range “matrix”, which will return a range 1 row high and the width of our “matrix” range (COLUMNS(matrix)).  The number of rows is provided by the output of another function
        • MATCH(lookup_value,lookup_array,match_type)
          • This returns the position of the “lookup_value” in “lookup_array”, which in our case is matching the name in the current row in our new matrix ($G2 in the example above)
      • The second OFFSET is identical to the first, except it will match against the name in the current column of our matrix (H$1) above
  3. Finally, we wrap the whole thing in an IF function
    • IF(logical_test,value_if_true,value_if_false)
    • This makes sure the names in the current row and column are not equal to each other ($G2<>H$1), and runs the function only if true, otherwise returning a 0
      • Why?  Because otherwise we would get very high values across the diagonal since each person obviously shared membership with themselves in each group they belonged to!

Phew! Got all that? It’s a lot, I know, but if you name the ranges correctly and set the values for your first cell, you should be able to copy the formula into the rest of the blank cells and Hey Presto! a weighted person-to-person matrix should appear.


Using Named Ranges and Modifying the Formulas

  1. Name your ranges so that the formula will work
    • In your Matrix sheet, select the binary values (omitting the Grand Totals!)
      • Put your cursor in the name field at top left and type “matrix” then hit return
    • Select the list of names (omitting the Row Labels and Grand Total)
      • Put your cursor back in the name field and type “names” then hit return
  2. Almost there! Now for the formula
    • Copy the full formula above
    • Double click in the top left cell of your blank person-to-person matrix and paste the formula
    • Before you hit return
      • You need to change the values of $G2 and H$1 to select the first names on each axis
        • In my example, I would need to change BOTH $G2s to $F4s, and BOTH H$1s to G$3s
          • Make sure you keep the dollar signs in place, so that the labels remain selected
    • Copy the cell you just entered the formula in and paste it in the rest of the range and you should see all the values magically populate
  3. Finally, we can output this matrix as a csv file and import it into NodeXL
    1. Copy the whole matrix you just created including labels (e.g. F3:J7 above)
      1. Add a new sheet called CSV
      2. Click cell A1
      3. Edit > Paste Special…
      4. Click on the Values radio button and hit OK
    2. Go to File > Save As
      1. Choose Format: Comma Separated Values (.csv)


We are finally ready to load the person-to-person matrix you created above into NodeXL.

  1. Go to NodeXL > Import >  From Open Matrix Workbook… and import the csv file you just made
  2. Generate a graph and explore visualization options.


You did it! Give yourself a pat on the back.  That was hard work.

Assignment Tutorial Blog Post (Due TUESDAY, 11/10)

For this assignment, create a step-by-step tutorial as a blog post demonstrating a particular technique, tool, or other helpful how-to discovery you’ve made over the past several weeks in this course.

Pick a DH tool that we haven’t discussed yet and figure out an interesting use case for it (or, vice versa, pick a use case and figure out a potentially viable DH tool or methodology).  You can highlight a technique that you have discovered in class, or in the preparation of your projects, as long as it is not one we’ve all covered together already.  If you’re stuck for ideas, the DiRT Digital Research Tools directory offers an extensive list of software for academic uses.

Once you have an idea, create an online tutorial for the rest of us and the wider world to start paying forward what you’ve learned in the course and becoming the “local computer expert.”  For examples, you can look at some of the posts for this class, think back on all those SketchUp resources you’ve looked through, or see the software posts on the Profhacker blog.

Your tutorial blog post should include:

  • An introductory paragraph explaining clearly
    • what the tool or technique is and
    • why or in what context it would be useful
  • A step-by-step walkthrough of how to accomplish a specific task using the tool that contains
    • At least 5 steps 
    • EITHER screenshots illustrating the steps where appropriate
    • OR a screencast video in which you record your actions while speaking about the process into a microphone
  • A link to at least two further resources like the software’s documentation or other tutorials around the web

For screen capture software, if you Google “how to create tutorials screenshot” you’ll be overwhelmed with options.

Gephi Tutorial

Last lesson we learned about network analysis. Network analysis focuses on displaying relationship data. Usually, this data is presented in a sort of mind map consisting of what are known as nodes and edges. Nodes are the individual data points and edges are the connections between these nodes. The picture below demonstrates this:


Network analysis is useful in the sense that it shows us the relationships in possibly large quantities of points. A common example that has often been used is the network analysis of Facebook friends. Below is an example of a Facebook friend network analysis created through Gephi, taken from this YouTube tutorial.

This image demonstrates that this person has friends from distinct groups. In the video, he elaborates, saying that the groups were generally divided into the different universities that he had attended or just people from different geographical locations. This data shows that his Facebook is populated with large distinct groups of friends rather than small scatters. It shows the different communities he has been a part of throughout his life and is a much better representation of the people in his friends list than the simple alphabetical list provided by Facebook.

This tutorial, however, will not involve Facebook data, as Facebook has recently placed regulations making it much harder to obtain data regarding friends and their respective connections. If we were to rewind a year, we could use a Facebook application called netvizz which would extract our respective Facebook friend data in a format usable in Gephi (GDF).

Instead, this tutorial will use a data set concerning the co-appearances of characters in the novel Les Misérables, compiled by D. E. Knuth. This data set can be obtained at the following link.

After obtaining the data set, proceed to download Gephi. Gephi is a free program and can be downloaded here. After you have downloaded both files, install Gephi and unzip the Les Misérables file. Additionally, you may receive an error when opening Gephi for the first time stating that your version of Java is not compatible. This can be fixed by simply downloading the newest version of Java through a simple Google search.

Once you have completed all of this, open up Gephi and you should be greeted
with a screen similar to this:

Once you have opened Gephi, click on File in the top left-hand corner and then Open. Then select the Les Misérables file you extracted before and you should be prompted with a screen with a few options. Select the following and click OK. You will then be greeted with an interesting looking cluster of nodes and edges (77 nodes and 254 edges to be exact!).

From here you can organize and format the points by first choosing a layout. Personally, I prefer ForceAtlas 2, though you can pick and choose which one you like best. This layout will cause the data to disperse and somewhat organize itself into relevant groups with similar connections, making your network look something like this (you can zoom in using the middle mouse scroll).

Now although this seems pretty good thus far, numerous improvements can still be made. First of all, currently, all of the nodes are the same size. Depending on the complexity of the data, Gephi can also alter the size of the nodes based on numerous factors. These options can be accessed through this box located on the left hand side of the screen.

Due to the simplicity of the data set, we can only analyze the nodes with respect to their degree, which really means their number of connections. We can display this either by changing the color of the nodes, putting each one on a spectrum based on its degree, or by changing the size of the nodes, which is the second tab with the three circles. For this tutorial let’s change the size of the nodes. Press on the icon with the three circles and enter values for the minimum and maximum size of the nodes. Here are the values I chose to enter along with the resulting graph.
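Under the hood, degree is just a count of how many edges touch each node. A tiny sketch (in Python, purely to illustrate what Gephi computes; the sample edges are an arbitrary subset of the novel's characters):

```python
# Degree = number of edges touching a node.
# Tiny made-up sample of a co-appearance edge list.
from collections import Counter

edges = [("Valjean", "Javert"), ("Valjean", "Cosette"), ("Cosette", "Marius")]

degree = Counter()
for a, b in edges:
    degree[a] += 1  # each edge counts toward both endpoints
    degree[b] += 1

print(degree["Valjean"])  # → 2
print(degree["Marius"])   # → 1
```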

As you can see, the nodes now greatly differ in size based on the number of connections they have with other nodes. However, the graph itself seems to be a bit squished. This can be solved by choosing the layout option Expansion. This layout option will cause the nodes to move further apart. After applying this layout a few times, the graph will look much more spaced out. I decided that it was probably better to change the node sizes a bit and ended up with the following:

Now that the data points are more spread out and the nodes are different sizes based on their degree, we can more clearly see the different groups present in the novel based on their co-appearances. In order to see the names associated with each point, press on the T symbol in the bottom left-hand corner as shown below. In addition, you can also change the color of the nodes with the color block above the underlined A. I would suggest changing the color in order to more clearly see the names of the characters.

Additionally, one can also use color to define the different groups which are present in the data set. In order to do so, click on the Statistics tab and click Run on the Modularity function. This function can be found on the right-hand side of the screen.

A pop-up will then appear; simply click OK and close the graph showing the calculations. The modularity function basically calculates how the nodes are placed and their proximity to other nodes, thus showing which groups are present. In order to display this data, once again return to the Appearance tab, but this time click on the color option.

Once here, select Modularity Class and click Apply to see your graph be divided into different groups separated by color. Finally, you can examine the data set and see for yourself the different groups present.

This data shows that not all the characters interact with each other in the novel based on co-appearance. Rather, there are different groups of characters which follow different storylines. Still, there are some characters which are part of multiple groups, and these can be seen near the middle of the cluster as larger nodes.

Overall, this tutorial was aimed at showing the potential of network analysis and how it can be used to analyze and visualize data in a way which separates different groups. This data set is relatively simple, and Gephi itself can deal with much more complicated things. This was merely an introduction. Well, I hope you enjoyed the tutorial!

Java Data Analysis Tutorial

So I’m gonna begin by saying this is in no way the best or most efficient method we’ve come across to analyze data, but I figured it would be fun (at least a little bit) and different than the usual methods. Anyway, on to the good part.

Seeing as my group ran into a few issues in finding ways to extract the important pieces of information from the mess that was our OCR’d, scanned (decade-old) student directories, I figured I’d try applying what little Computer Science knowledge I’ve picked up during my time here to the issue. This tutorial assumes you have a certain understanding of the Java programming language (I promise it isn’t that much), which you can gain in a few different ways.

Codecademy has some pretty good, interactive video lessons on the subject, and the Oracle (official Java site) website has its own text tutorials. Alternatively you can look them up on YouTube or take CS 201.

The Actual Tutorial

Before beginning any coding, it’s important to know three key elements:

  • What your program should take as input
  • What your program should do
  • What your program should output

Java is fairly limited in what it can take as input, so I would suggest (if possible) placing your data into .txt format, probably the simplest way to make it accessible to Java.

The two most important elements in this exercise will be the use of the Scanner class and for loops, since the program will have to iterate through the files provided and compare elements of them to your queries. This is essentially what the computer does when you press Ctrl + F, but implementing a program helps us add more delimiters to the search (a double find, if you will). While the program can probably be written in one method/function, I’ll walk through the setup with 4 methods, since it is easier to understand when broken up into smaller problems.

Loading files once they’re in text format is quite simple. Using the standard command-line input for Java main methods, you can just type the file name as a parameter on the command line. Using the File class, you’d first create a new File variable with the filename, and subsequently a new Scanner with that File. The Scanner then allows you to iterate through the file by line, character, or string.
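As a minimal, self-contained sketch of that loading pattern (the class name FileReadDemo and the helper readLines are my own, not part of the tutorial’s program):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class FileReadDemo {
    // Collect every line the Scanner can see into a list
    public static List<String> readLines(Scanner input) {
        List<String> lines = new ArrayList<>();
        while (input.hasNextLine()) {
            lines.add(input.nextLine());
        }
        return lines;
    }

    public static void main(String[] args) throws FileNotFoundException {
        File file = new File(args[0]);     // filename comes in as a command-line parameter
        Scanner input = new Scanner(file); // Scanner wraps the File for iteration
        for (String line : readLines(input)) {
            System.out.println(line);
        }
        input.close();
    }
}
```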

The for loops are somewhat self-explanatory in their function. Using the enhanced form, for (an item in your list : the list), the iterator is implemented automatically by Java for you.
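A quick sketch of that loop syntax (the class name and hall names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ForEachDemo {
    // Sum the lengths of all strings in the list using an enhanced for loop
    public static int countChars(List<String> halls) {
        int total = 0;
        // "for (an item in your list : the list)" -- Java manages the iterator itself
        for (String hall : halls) {
            total += hall.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> halls = Arrays.asList("Goodhue", "Burton");
        System.out.println(countChars(halls)); // prints 13
    }
}
```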

My Example

In this case, my expectation was for the program to take in a text file, count the number of occurrences of a combination of residence hall and class year, and output another text file with the results. You’d begin by writing a standard Java file: the imports (I usually import everything just to be sure), a class definition, a main method, and the first method, which loads the input text file. My approach is to use a dictionary where the keys are the residence halls and the values are the numbers of students in each dorm; a list keeps track of which residence halls we are searching through. The program should have 3 critical methods: one to load the .txt files, one to count the number of matches in each hall per year, and a display (or print) method. None of the methods needs a return value, since we’ll be using and changing the instance variables created at the beginning. The final output will either be printed in the command prompt window or saved to the computer as a new text file.
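Under those assumptions, the shared state might be declared like this (the class name HallCounter and the helper addHall are hypothetical, not necessarily what you’d write):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HallCounter {
    // Keys are residence halls, values are the number of matching students
    static Map<String, Integer> halls = new HashMap<>();
    // Keeps track of which residence halls we are searching through
    static List<String> hallsList = new ArrayList<>();

    // Register a hall with a starting count of zero
    public static void addHall(String name) {
        hallsList.add(name);
        halls.put(name, 0);
    }

    public static void main(String[] args) {
        addHall("Goodhue");
        halls.put("Goodhue", halls.get("Goodhue") + 1); // one match found
        System.out.println(halls.get("Goodhue")); // prints 1
    }
}
```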

Basis for the program.

The load method creates the dictionary with the provided halls to search through and sets all of the values to 0 as starting points. The try-catch block makes sure you are loading an actual file; if not, an exception is caught and reported. Because of how this method is coded, your hall text file should be formatted with one hall per line. The while loop continues reading through the file until it reaches the end, adding each line to both the list of hall names and the dictionary.

Example residence hall names file.
public static void loadHalls(File fileName) {
   Scanner hallsInput = null;
   try { hallsInput = new Scanner(fileName); }
   catch (FileNotFoundException e) { System.out.println("File loading error"); return; }
   while (hallsInput.hasNextLine()) {
      String hallName = hallsInput.nextLine();
      hallsList.add(hallName);  // list of hall names to search through
      halls.put(hallName, 0);   // dictionary entry starts at 0
   }
}

The bulk of the work is done by the count method, which uses the same Scanner approach to look through the lines of the text you’re searching. The file name, the year of the directory, and the class year to be searched for are provided through the method call. In this case, lines aren’t added to the dictionary; instead, each line is split into words separated by spaces and put into a list, which can be accessed more freely. The nested for loops then check whether the line contains both a hall name and the predetermined class year. The format in which these two are found also depends on the directory (Goodhue vs. GHUE, or ’17 vs. Senior), but these are details that can be dealt with in the main method. If both are in the line, then the value in the dictionary for the respective hall (key) is increased by 1.

public static void count(File fileName, String year, String classYear) {
   Scanner input = null;
   try { input = new Scanner(fileName); }
   catch (FileNotFoundException e) { System.out.println("File loading error."); return; }
   String[] valuesInLine;
   while (input.hasNextLine()) {
      String line = input.nextLine();
      valuesInLine = line.split(" ");             // split the line into words
      List<String> valuesInLineList = new ArrayList<>();
      for (String word : valuesInLine)
         valuesInLineList.add(word);
      for (String hall : hallsList)
         if (valuesInLineList.contains(hall) && valuesInLineList.contains(classYear))
            halls.put(hall, halls.get(hall) + 1); // increment the count for that hall
   }
}

Finally, the display method, which gives the end result. This can be done in one of two ways: the easy route is to have the program print the final values in the command prompt window; the slightly more complex one is to create a text file with those values and save it to the computer. While the former takes less time to code, the latter might save more time in the long run, so I’ll talk about that one. This method takes in the class standing (defined in the main) and the year of the directory, used simply to name the file where the data is stored. These details are mostly my personal preference and could be done without. The method creates a new PrintWriter instance, which takes two parameters: the name of the file and (essentially) the character encoding to be used in it. Next, the halls and their respective values are put into the file one by one, using the same type of for loop as the count method, and an exception is caught if there is an issue with the instantiation of the PrintWriter. As for the specifics of the PrintWriter methods, those can be read in the PrintWriter javadoc.

public static void printTextFile(String standing, String year) {
   try {
      PrintWriter writer = new PrintWriter(standing + "PerHall" + year, "UTF-8");
      for (String hall : hallsList)
         writer.println(hall + ": " + halls.get(hall));
      writer.close();
   } catch (IOException e) { System.out.println("File load failure"); }
}


With all the methods done, the main method can be put together to tie the pieces together. Since all the methods return void, they all need to be called in the main. So that the program can be as flexible as possible, the filenames and years can be entered as parameters in the command-line call, since Java requires the program name to be typed anyway. Anything following the name of the program, separated by spaces, is treated as another String in an array, with the first string after the program name being the 0th element of the array. The main therefore standardizes the input, expecting the user to enter the name of the file to be analyzed, the file with hall names, the year of the directory, and the class year to search for (these details can be changed with no severe impact on the program).

Error message when providing no parameters in command line.
public static void main(String[] args) {
   String file1 = null, file2 = null, year = null, classyear = null;
   try {
      file1 = args[0];
      file2 = args[1];
      year = args[2];
      classyear = args[3];
   } catch (IndexOutOfBoundsException e) {
      System.out.println("Input file to analyze, file with halls, year and classyear");
      return;
   }
   String standing;
   // e.g. class year 17 minus directory year 20(14) = 3 -> freshmen in that directory
   int FSJS = Integer.parseInt(classyear) - Integer.parseInt(year.substring(2, 4));
   if (FSJS == 3) standing = "Freshmen";
   else if (FSJS == 2) standing = "Sophomores";
   else if (FSJS == 1) standing = "Juniors";
   else standing = "Seniors";
   File hallNameFile = new File(file2);
   File text = new File(file1);
   loadHalls(hallNameFile);
   count(text, year, classyear);
   printTextFile(standing, year);
}

Afterwards, each of the methods must be called with the specified parameters, and bingo! You have results. Because the code is made as general as possible (restricted only by my knowledge), it is pretty flexible and can be changed with minimal effort to take into account different formats of the directories.

Printed results.
Results as file, automatically saved to folder.

ArcGIS Online – Organizing and Expressing Data Tutorial

While I was making my Japanese Mascot Map (which you can see here!) a lot of my time was spent experimenting with how to organize my data and which data ArcGIS Online could pull directly from the resulting spreadsheets. I explained some of my process HERE but I want to explain in greater depth how I organized, input, and expressed my data.

I used GoogleSheets to store and organize my data, which worked well. However, ArcGIS Online can only upload spreadsheet files in .csv or .txt format. GoogleSheets lets you download individual sheets in .csv format, but that means if you edit your data, you then have to delete the layer made with the unedited file, download the edited file to your computer, and render the edited file to your map as a new layer. This takes a lot of time. I’m hoping this tutorial can spare you some of that editing/downloading/uploading time on your own project.


What type of location data do you want to map? This will affect how many spreadsheets you make and how you express location within them. My rule of thumb: make a new spreadsheet (and a new layer) if the locations of places/regions will be rendered to the map by different parameters. In other words, locations denoted by City, State vs. Street Address vs. Latitude/Longitude should be organized in their own spreadsheets. This is helpful for a couple of reasons:

  1. Confuses ArcGIS less – When you import a layer, ArcGIS asks from which columns it should pull location data. With multiple datasheets/layers, you can choose which data are rendered in which manner without having to sacrifice accuracy, arbitrarily pick drop points for larger regions, or confuse the program with void entries.
  2. As data sets get larger, problems are easier to find – A number of my spellings didn’t match those ArcGIS used and some of my photo links were broken. It was much easier to delete one layer, find the error, fix it, and re-upload a smaller spreadsheet than it was to do the same for a spreadsheet with 50+ entries.
  3. Easier to stylize different categories of data – Not to mention you only have to re-stylize some of your data if you have to fix a layer. Whenever you (re)upload a layer, the points are set as red dots by default.

In my case, I wanted to map mascots from Prefectures, Cities, Buildings, Organizations, and Companies. I used two different methods for designating location so I made two map layers from two datasheets.

Prefectures and Cities I mapped as points denoted by two columns of data: Prefecture, City (i.e. Hokkaido, Hakodate). Because prefectures aren’t associated with any one city, I used the same format with the capital as my city marker (i.e. Hyogo, Kobe). It would be the same as making one column each for State and City if you were mapping in the United States.

NOTE!: If you want to make polygons for prefectures/states and not points, I would suggest making a separate spreadsheet for them. I did not use polygons in my original map, but if you include a city, then ArcGIS will pin that city instead of denoting a region.

Buildings, Organizations, and Companies I mapped using Latitude and Longitude. These are things with definite locations, usually denoted by street addresses. However, street addresses are often very different across countries, and that’s before differing spelling conventions for foreign languages. Even in familiar areas, points sometimes don’t get dropped in the right place. The easiest way to get an accurate point the first time is to use its latitude and longitude.

An easy way to find it is to use GoogleMaps. To do so:

1.) Search Location

2.) Right click on the point and choose “What’s here?”

A small bar should show up at the bottom of the screen. See the numbers at the very bottom?

The first number is the Latitude. The second number is the Longitude.

In my spreadsheet, I made 4 columns for location data: Prefecture, City, Latitude, Longitude. I included the Prefecture and the City because I wanted to display this information at each point, but when I uploaded the layer the program used the latitude and longitude to drop the pins. A window may pop up asking you to specify which data you’d like to use for locations. In that case, pick your preference.

NOTE!: ArcGIS sometimes gives you the option to limit your expressed dataset to a single country, in my case, Japan. If your data set reaches across countries, include a Country column in each spreadsheet. So, in the first example, the location of a city would be expressed in 3 columns: Japan, Hokkaido, Hakodate. The location of a building would be expressed in 5: Japan, Kyoto (prefecture), Kyoto (city), 34.987756, 135.759333.

You should include a column for any categories you want to distinguish stylistically.

In my spreadsheets I added a Mascot_Type column. This I kept close to my other non-location data: Name and Name of Building/Company/Organization.

From that data, I could set the layer to display points based on what type of mascot each point represents. When you upload a new layer, a menu called “Change Style” will appear on the left. In the drop-down menu under “Choose an attribute to show,” pick the column where you put your categories.

You can then change how each category appears on the map by changing the appearance of the point. Click on one of the sample points in the “Change Style” menu. A window will pop up with point style options for the category you selected. When you are done, press “OK” both in the window and the “Change Style” menu.

If you want an image to pop up when you click on a point, ArcGIS Online can pull images straight from your spreadsheets. In a column titled “Image” or “URL”, add URL links to images you want to use for each location. Here is the image of Tawawachan I used and the corresponding link highlighted in yellow. Because URLs are long, I recommend making this the last column in your datasheet.

To add these images to your pop-ups, make sure you pressed “Ok” on any open menus and click the “…” next to the layer you want to add images to. From that menu, click “Configure Pop-up.”

The menu will open on the left hand side. Here you can change which data is expressed where. To add images, go to  “Pop-up Media” and press “Add.” From that drop down menu, select “Image.”

A window called “Configure Image” will appear. Here you can add titles, captions, and hyperlinks to your images. To add the images from your spreadsheet, go down to “URL” and press the small boxed cross to the right. Scroll down and select the name of the column where you put your image URLs. I called mine “Image.”

Press “Ok” in both the “Configure Image” window and “Configure Pop-up” menu. Once you do, an image should appear in your pop-ups when you click on a point. If you don’t see it immediately, scroll down or enlarge the pop-up window as they are quite small. If you still don’t see an image, the URL in your spreadsheet may be broken.

NOTE!: Make sure to double check the links aren’t broken while you’re still working in your spreadsheet. When you render your data onto the map, the program won’t tell you if it can’t find images. It’s better to check before you render your data instead of after you’ve spent time stylizing your points because you will have to re-upload the sheet, setting the points back to default red circles.