Java Data Analysis Tutorial

So I’m gonna begin by saying this is in no way the best or most efficient method we’ve come across to analyze data, but I figured it would be fun (at least a little bit) and different than the usual methods. Anyway, on to the good part.

Seeing as my group ran into a few issues finding ways to extract the important pieces of information from the mess that was our OCR’d, scanned (decade-old) student directories, I figured I’d try applying what little Computer Science knowledge I’ve picked up during my time here to the problem. This tutorial assumes you have a basic understanding of the Java programming language (I promise it isn’t that much), which you can pick up in a few different ways.

Codecademy has some pretty good, interactive video lessons on the subject, and Oracle’s official Java site has its own text tutorials. Alternatively, you can look them up on YouTube or take CS 201.

The Actual Tutorial

Before beginning any coding, it’s probably worth pinning down three key elements:

  • What your program should take as input
  • What your program should do
  • What your program should output

Java can read plenty of formats, but most of them take extra libraries or parsing work, so I would suggest (if possible) putting your data into .txt format, which is probably the simplest way to make it accessible to Java.

The two most important elements in this exercise will be the Scanner class and for loops, since the program will have to iterate through the files provided and compare their contents to your queries. This is essentially what the computer does when you press Ctrl+F, but writing a program lets us add more conditions to the search (a double find, if you will). While the program can probably be written in one method/function, I’ll walk through the setup with 4 methods, since it is easier to understand when broken up into smaller problems.

Loading files once they’re in text format is quite simple. Since a Java main method receives its command-line arguments as an array of Strings, you can just type the file name as a parameter on the command line. You’d first create a new File object with the file name, and then a new Scanner with that File. The Scanner then lets you iterate through the file line by line or token by token (by default, tokens are chunks of text separated by whitespace).
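To make that concrete, here is a minimal sketch of the File-plus-Scanner setup. The class name ReadLinesSketch and the helper readLines are made up for this illustration and are not part of my program.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
class ReadLinesSketch
{
   // Collects every remaining line the Scanner can see into a list.
   static List<String> readLines(Scanner input)
   {
      List<String> lines = new ArrayList<String>();
      while (input.hasNextLine())
      {
         lines.add(input.nextLine());
      }
      return lines;
   }

   public static void main(String[] args)
   {
      if (args.length < 1)
      {
         System.out.println("Usage: java ReadLinesSketch <file name>");
         return;
      }
      // args[0] is the first word typed after the program name
      File file = new File(args[0]);
      try
      {
         Scanner input = new Scanner(file);
         for (String line : readLines(input))
         {
            System.out.println(line);
         }
         input.close();
      }
      catch (FileNotFoundException e)
      {
         System.out.println("Could not find " + args[0]);
      }
   }
}
```

Run with a file name as the only parameter, it should simply echo the file back one line at a time, which is a handy first check that the file is being found and read.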

The for loops are somewhat self-explanatory. Writing for (Type item : list) visits each item in the list in turn; Java sets up and advances the underlying iterator for you.
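If it helps to see what Java is doing behind the scenes, here is a small sketch with the same loop written both ways. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class ForEachSketch
{
   // Walks the list with the for-each form.
   static String joinWithForEach(List<String> items)
   {
      String result = "";
      for (String item : items)
      {
         result = result + item + ";";
      }
      return result;
   }

   // Same walk through the list, with the iterator written out by hand.
   static String joinWithIterator(List<String> items)
   {
      String result = "";
      Iterator<String> it = items.iterator();
      while (it.hasNext())
      {
         String item = it.next();
         result = result + item + ";";
      }
      return result;
   }

   public static void main(String[] args)
   {
      List<String> names = new ArrayList<String>();
      names.add("Goodhue");
      names.add("GHUE");
      // Both calls produce the same text
      System.out.println(joinWithForEach(names));
      System.out.println(joinWithIterator(names));
   }
}
```

The two methods do the same thing; the for-each version is just shorter and harder to get wrong.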

My Example

In this case, my expectation was for the program to take in a text file, count the number of occurrences of each combination of residence hall and class year, and hopefully output another text file with the results. You’d begin by writing your standard Java file: the imports (I usually import everything just to make sure), a class definition, a main function, and the first method, which loads the input text file. My approach is to use a dictionary (a Map in Java) where the keys are the residence halls and the values are the number of students in each hall, plus a list to keep track of which residence halls we are searching for.

The program should have 3 critical methods: one to load the .txt files, one to count the number of “x” (students of a given class year) in each hall, and the display (or print) method. None of the methods need return values, since they all use and change the variables declared at the top of the class (static fields, strictly speaking, since the methods are static). The final output will either be printed in the command prompt window or saved to the computer as a new text file.

(Screenshot: basis for the program.)
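The setup described above (imports, class definition, shared dictionary and list) might look roughly like this. The class name HallCounter is made up, and HashMap and ArrayList are my assumptions for the concrete types; the field names halls and hallsList are the ones the methods in this post use.

```java
import java.io.*;     // File, FileNotFoundException, PrintWriter, IOException
import java.util.*;   // Scanner, List, ArrayList, Map, HashMap

// Hypothetical class name, used only for this illustration.
class HallCounter
{
   // hall name -> number of matching students found so far
   static Map<String, Integer> halls = new HashMap<String, Integer>();
   // hall names in the order they were read, used to loop over the halls
   static List<String> hallsList = new ArrayList<String>();

   public static void main(String[] args)
   {
      // Stand-in for the load step: register one hall with a starting count of 0.
      hallsList.add("Goodhue");
      halls.put("Goodhue", 0);
      System.out.println(hallsList + " " + halls);
   }
}
```

Keeping the list alongside the dictionary is what lets the later loops go through the halls in the same order as the hall names file, since a HashMap on its own does not promise any particular order.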

The load function fills the dictionary with the provided halls to search through, and sets all of the values to 0 as starting points. The try-catch block makes sure you are loading an actual file; if the file can’t be found, the Scanner constructor throws a FileNotFoundException, which is caught and reported. Because of how this method is coded, your hall text file should be formatted with one hall per line. The while loop continues reading through the file until it reaches the end, adding each line to both the List of hall names and the dictionary.

(Screenshot: example residence hall names file.)

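// Reads one hall name per line and starts each hall's count at 0.
public static void loadHalls(File fileName)
{
   Scanner hallsInput = null;
   try
   {
      hallsInput = new Scanner(fileName);
   }
   catch (FileNotFoundException e)
   {
      System.out.println("File loading error");
      return; // no file to read, so stop before touching the null Scanner
   }
   while (hallsInput.hasNextLine())
   {
      String hallName = hallsInput.nextLine();
      hallsList.add(hallName);   // remembers the order of the halls
      halls.put(hallName, 0);    // nothing counted yet
   }
   hallsInput.close();
}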

The bulk of the work is done by the count method, which uses a Scanner in the same way to go through the lines of the text you’re searching. The file name, year of the directory and class year to be searched for are provided through the method call. In this case, lines aren’t added to the dictionary; instead, each line is split into words separated by spaces and put into a list, which can be searched more easily. The following for loop checks if the line contains both the hall name and the chosen class year. The format in which these two appear also depends on the directory (Goodhue vs. GHUE or ’17 vs. Senior), but these are details that can be dealt with in the main method. If both are in the line, then the value in the dictionary for the respective hall (key) is increased by 1.

// Counts, for each hall, the lines that mention both the hall and the class year.
// (year is passed in but not needed for the counting itself.)
public static void count(File fileName, String year, String classYear)
{
   Scanner input = null;
   try
   {
      input = new Scanner(fileName);
   }
   catch (FileNotFoundException e)
   {
      System.out.println("File loading error.");
      return; // no file to read, so stop before touching the null Scanner
   }
   while (input.hasNextLine())
   {
      String line = input.nextLine();
      // Break the line into words so each one can be compared exactly
      String[] valuesInLine = line.split(" ");
      List<String> valuesInLineList = new ArrayList<String>();
      for (String word : valuesInLine)
      {
         valuesInLineList.add(word);
      }
      for (String hall : hallsList)
      {
         if (valuesInLineList.contains(hall) && valuesInLineList.contains(classYear))
         {
            halls.put(hall, halls.get(hall) + 1);
         }
      }
   }
   input.close();
}
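One thing to keep in mind with the split-and-contains check: it compares whole words, so a hall name or class year only matches if it appears as its own space-separated word on the line. Here is a rough illustration. The directory lines are made up, lineMatches is a hypothetical helper, and Arrays.asList is used as a shortcut for the copy loop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class TokenMatchSketch
{
   // True when both the hall and the class year appear as whole words in the line.
   static boolean lineMatches(String line, String hall, String classYear)
   {
      List<String> words = Arrays.asList(line.split(" "));
      return words.contains(hall) && words.contains(classYear);
   }

   public static void main(String[] args)
   {
      // Made-up lines, only to show how the comparison behaves.
      System.out.println(lineMatches("Doe Jane Goodhue 17", "Goodhue", "17"));   // true
      System.out.println(lineMatches("Doe Jane Goodhue, 17", "Goodhue", "17"));  // false: "Goodhue," is a different word
      System.out.println(lineMatches("Doe Jane GHUE 17", "Goodhue", "17"));      // false: the abbreviation differs
      System.out.println(lineMatches("Doe Jane Goodhue 2017", "Goodhue", "17")); // false: "2017" is not "17"
   }
}
```

With OCR’d text this matters quite a bit, since stray punctuation stuck to a word is enough to make a line get skipped. It also means a hall name made of two words would never match, because the split breaks it apart.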

Finally, the display method, which gives the end result. This can be done in one of two ways: the easy route is to have the program print the final values in the command prompt window; the slightly more complex one is to create a text file with those values and save it to the computer. While the former takes less time to code, the latter might save more time in the long run, so I’ll talk about that one. This method takes in the class standing (defined in the main) and the year of the directory, used simply to name the file where the data is stored. These details are mostly my personal preference and could be done without. The method creates a new PrintWriter instance, which takes two parameters: the name of the file and the character encoding to use (UTF-8 here). Next, the halls and their respective values are written to the file one by one, using the same type of for loop as the count method. If there is a problem creating the PrintWriter, an exception is thrown, which the method catches and reports. As to specifics of the PrintWriter methods, those can be read in the PrintWriter Javadoc.

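// Writes one "hall: count" line per hall to a file named <standing>PerHall<year>.
public static void printTextFile(String standing, String year)
{
   try
   {
      PrintWriter writer = new PrintWriter(standing + "PerHall" + year, "UTF-8");
      for (String hall : hallsList)
      {
         writer.println(hall + ": " + halls.get(hall));
      }
      writer.close();
   }
   catch (IOException e)
   {
      // Thrown if the file cannot be created or the encoding is not supported
      System.out.println("File write failure");
   }
}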
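For completeness, the easy route (printing to the command prompt window) could look something like the sketch below. This is a guess at a console version rather than my actual code: buildReport, the heading line, and the placeholder data in main are all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this illustration.
class ConsolePrintSketch
{
   static Map<String, Integer> halls = new HashMap<String, Integer>();
   static List<String> hallsList = new ArrayList<String>();

   // Builds the report text: a heading, then one "hall: count" line per hall.
   static String buildReport(String standing, String year)
   {
      StringBuilder report = new StringBuilder();
      report.append(standing + " per hall, " + year + "\n");
      for (String hall : hallsList)
      {
         report.append(hall + ": " + halls.get(hall) + "\n");
      }
      return report.toString();
   }

   // Same loop as the file version, but sent to System.out instead of a PrintWriter.
   public static void print(String standing, String year)
   {
      System.out.print(buildReport(standing, year));
   }

   public static void main(String[] args)
   {
      // Placeholder data, not real results.
      hallsList.add("Goodhue");
      halls.put("Goodhue", 0);
      print("Freshmen", "2014");
   }
}
```

Building the text first and printing it afterwards is just a convenience: the same string could be handed to a PrintWriter, so the console and file versions can’t drift apart.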

With all the methods done, the main method can be written to bring the pieces together. Since all the methods return void and only change the shared variables, each of them needs to be called from the main, in order. To keep the program as flexible as possible, the file names and years are passed as parameters in the command line call, since you have to type the program name to run it anyway. Anything following the name of the program, separated by spaces, is treated as a String in an array, with the first string after the program name being the 0th element. The main therefore standardizes the input, expecting the user to enter the name of the file to be analyzed, the file with hall names, the year of the directory, and the class year to search for (these details can be changed with no severe impact on the program).

(Screenshot: error message when providing no parameters in command line.)

public static void main(String[] args)
{
   String file1 = null, file2 = null, year = null, classyear = null;
   try
   {
      file1 = args[0];      // file to analyze
      file2 = args[1];      // file with hall names
      year = args[2];       // year of the directory (its 3rd and 4th characters are used below)
      classyear = args[3];  // class year to search for
   }
   catch (IndexOutOfBoundsException e)
   {
      System.out.println("Input file to analyze, file with halls, year and classyear");
      return; // nothing to work with, so stop here
   }

   // Years until graduation: class year minus the last two digits of the directory year
   String standing;
   int FSJS = Integer.parseInt(classyear) - Integer.parseInt(year.substring(2, 4));

   if (FSJS == 3) standing = "Freshmen";
   else if (FSJS == 2) standing = "Sophomores";
   else if (FSJS == 1) standing = "Juniors";
   else standing = "Seniors";

   File hallNameFile = new File(file2);
   loadHalls(hallNameFile);
   File text = new File(file1);
   count(text, year, classyear);
   printTextFile(standing, year);
   print(standing, year); // console version of the display method (not shown in this post)
}
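The subtraction in the main is easier to follow when pulled out into a small helper. This is only a sketch, assuming a four-digit directory year and a two-digit class year; standingFor and the example years are made up for illustration.

```java
// Hypothetical class name, used only for this illustration.
class StandingSketch
{
   // directoryYear is assumed to be four digits (e.g. "2014"), classYear two digits (e.g. "17").
   static String standingFor(String directoryYear, String classYear)
   {
      int yearsLeft = Integer.parseInt(classYear) - Integer.parseInt(directoryYear.substring(2, 4));
      if (yearsLeft == 3) return "Freshmen";
      else if (yearsLeft == 2) return "Sophomores";
      else if (yearsLeft == 1) return "Juniors";
      else return "Seniors";
   }

   public static void main(String[] args)
   {
      System.out.println(standingFor("2014", "17")); // Freshmen
      System.out.println(standingFor("2014", "16")); // Sophomores
      System.out.println(standingFor("2014", "15")); // Juniors
      System.out.println(standingFor("2014", "14")); // Seniors
   }
}
```

Note that anything other than a difference of 3, 2 or 1 falls through to “Seniors”, so a typo in either year, or a pair of years that straddles a century (say a 1999 directory and the class of ’02), would quietly be labeled as seniors. Whether a given directory year lines up with these differences also depends on how the directories are dated, so it is worth checking one case by hand.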

Once each of the methods has been called with the specified parameters, bingo! You have results. Because the code is as general as I could make it (that being restricted by my knowledge), it is pretty flexible and can be changed with minimal effort to handle different directory formats.

(Screenshot: printed results.)
(Screenshot: results as file, automatically saved to folder.)

3 Replies to “Java Data Analysis Tutorial”

  1. I think learning to write little scripts/programs like this is really useful. It’s a good introduction to a language like python or java, and it’s definitely more satisfying than going through all that manual labor. Once you get good at it, you can also save yourself a bunch of time. And if you make your programs well, you can reuse them for other data sets.

    1. I agree! Unfortunately this one probably took me more time to code/write down what I was doing than it would take to look through the data by hand but I like to believe it could be useful.

  2. Wes,

    This is a really helpful tutorial that shows the concrete benefits of writing a script in a language like Java for a DH project. I especially like that it was part of the work on your actual final group project, which makes the relevance crystal clear.

    Your explanations were clear and you struck a good balance between the specifics of your project and the general principles one should follow in any similar setup. Your screen shots and code snippets are also very helpful (I enabled a pretty print plugin so the code is now even prettified and color coded). Well done!
