Data Visualization Using RStudio + ggplot2

                                 

I took a 300-level stats class last term called Data Science, and in that class I learnt how to use the ggplot2 package in R (a programming language that is great for statistical analysis) to plot various interesting graphs for data visualization. I found this R package extremely useful (I am actually using it to plot various graphs for my Computer Science Senior Capstone project). I want to share with you here the very basics of data visualization using ggplot2 in RStudio (an IDE, or integrated development environment, for R).

 

So first of all, you need to download and install R and RStudio before you get started. Once you have installed them, click the RStudio icon (which appears as the one below) to launch it.

After launching it, you will get something like this:

Next, select File -> New File -> R Markdown, and name the R Markdown file you are going to create. I named it DataVisualizationTutorial.

Then, an R Markdown file will appear that replaces the console, and you will notice that there’s already something in this file:

In our next step, we will insert a script in the R Markdown file to install and load some R packages, including the ggplot2 package we will be using:
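Here is a minimal sketch of what such a setup script could look like. The package list is an assumption: ggplot2 is the only package named in this tutorial, so that is the only one included.

```r
# Packages this tutorial needs. Only ggplot2 is named in the text;
# add others to this vector if your own analysis needs them.
packages <- c("ggplot2")

# Install any package that is not already available on this machine.
for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Attach ggplot2 so its functions can be called directly.
library(ggplot2)
```

Checking with requireNamespace() first means the packages are not reinstalled every time the chunk runs.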

After typing this chunk of script in the R Markdown file right after the {r setup} chunk, you can click the green triangle on the top right corner of the chunk you just typed in to run this chunk of script (to actually install the packages).

Then we want to load a dataset in .csv form into RStudio.

This data file serves as an example in this tutorial. When you want to visualize your own data, you just load your own .csv file into RStudio.

We load the .csv file by adding the following script in our R Markdown file (at the bottom of the image below):
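A sketch of what the loading step could look like is below. The variable name graphDataFrame and the column names bin_size, rmse and eval.dev come from this tutorial. The file itself is a stand-in written to a temporary location so the example runs on its own, and its values are made-up placeholders, not the real data.

```r
# Stand-in for your own file: write a tiny CSV to a temporary location.
# Column names follow the tutorial; the values are made-up placeholders.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "bin_size,rmse,eval.dev",
  "10,0.42,dev",
  "10,0.47,eval",
  "20,0.35,dev",
  "20,0.39,eval"
), csv_path)

# Read the CSV into a data frame.
graphDataFrame <- read.csv(csv_path)

# Inspect the column types and the first few rows.
str(graphDataFrame)
head(graphDataFrame)
```

For your own data, replace csv_path with the path to your .csv file and drop the writeLines() step.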

Then we click the green triangle run button for this R script chunk to run our newly added script and load the data. After doing so, you will see that the Data window in the top right corner of your screen now has a graphDataFrame variable.

By clicking graphDataFrame in the Data window, another tab will open next to our R Markdown file. This is the R data frame that we just created from the .csv file.

Now that the data is loaded as a data frame in R, we can use ggplot commands to plot the data. Below is the script to plot a point graph with the x axis corresponding to the “bin_size” column in our data frame, the y axis corresponding to the rmse column, and the color and shape of the points corresponding to the eval.dev variable.
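A sketch of such a plotting script follows. The small data frame is a placeholder with made-up values so the example runs on its own, and treating eval.dev as a discrete variable is an assumption about the data.

```r
library(ggplot2)

# Placeholder data with the tutorial's column names (values are made up).
graphDataFrame <- data.frame(
  bin_size = c(10, 10, 20, 20),
  rmse     = c(0.42, 0.47, 0.35, 0.39),
  eval.dev = c("dev", "eval", "dev", "eval")
)

# Map bin_size to x, rmse to y, and eval.dev to both color and shape.
# factor() makes sure eval.dev is treated as discrete, which shape requires.
point_plot <- ggplot(
  graphDataFrame,
  aes(x = bin_size, y = rmse,
      color = factor(eval.dev), shape = factor(eval.dev))
) +
  geom_point(size = 3)

# Printing the plot object draws it in the Plots pane.
point_plot
```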

If we click the green button, we will see the graph showing up in the bottom right corner of the screen:

We can now change the labels for the x and y axes and also add a title to our graph by writing the following script:
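One way to do this is with labs(), sketched below. The label and title strings are hypothetical, and the data frame is again a made-up placeholder so the example runs on its own.

```r
library(ggplot2)

# Placeholder data with the tutorial's column names (values are made up).
graphDataFrame <- data.frame(
  bin_size = c(10, 10, 20, 20),
  rmse     = c(0.42, 0.47, 0.35, 0.39),
  eval.dev = c("dev", "eval", "dev", "eval")
)

labeled_plot <- ggplot(
  graphDataFrame,
  aes(x = bin_size, y = rmse,
      color = factor(eval.dev), shape = factor(eval.dev))
) +
  geom_point(size = 3) +
  # labs() sets the axis labels, the title, and the legend titles.
  # Giving color and shape the same title keeps them in a single legend.
  labs(
    x = "Bin size",
    y = "RMSE",
    title = "RMSE by bin size",
    color = "eval.dev",
    shape = "eval.dev"
  )

labeled_plot
```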

Now if we run it again, we will have a graph with better x and y labels and a title!

Next, we can export this visualization using the Export button:

Then, we will get a .png file for this newly created plot!
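If you would rather save the plot from a script than click the Export button, ggplot2's ggsave() function can write it to a file. This is an alternative to the steps above, not what the screenshots showed. The file name, size and resolution are assumptions, and the data is a made-up placeholder.

```r
library(ggplot2)

# Placeholder data with the tutorial's column names (values are made up).
graphDataFrame <- data.frame(
  bin_size = c(10, 10, 20, 20),
  rmse     = c(0.42, 0.47, 0.35, 0.39),
  eval.dev = c("dev", "eval", "dev", "eval")
)

plot_to_save <- ggplot(graphDataFrame, aes(x = bin_size, y = rmse)) +
  geom_point()

# Save into the session's temporary folder; use any path you like instead.
out_path <- file.path(tempdir(), "rmse_by_bin_size.png")

# Width and height are in inches by default; dpi sets the resolution.
ggsave(out_path, plot = plot_to_save, width = 6, height = 4, dpi = 300)
```

Saving from a script makes the figure reproducible: rerunning the chunk regenerates the same .png after the data changes.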

 

Here are two further resources about using ggplot2 to make data visualizations:

3D modeling

This past week, we learnt about various tools for 3D modeling. We expanded our SketchUp toolkit, explored procedural modeling through CityEngine, and also used PhotoScan to build a 3D model of an artifact from some 2D pictures (Photogrammetry).

In terms of research questions for which 3D modeling and simulation would be an appropriate methodology, what came to my mind first were modeling ancient artifacts and constructing small-scale 3D models of ancient architecture and landscapes. In an art history class I took at Carleton, we looked at ancient Chinese bronze vessels. There are different kinds of them, each with a different use. Each kind, however, has a very standard shape, and there are also several standard surface decoration patterns for these vessels (that indicate the owners’ social classes).

With 3D modeling techniques, especially procedural modeling, we can easily construct 3D prototypes for different kinds of vessels, which might help us better classify them. Also, 3D models of ancient artifacts, architecture and landscapes can help us better understand how people lived hundreds or even thousands of years ago.

I found it hard to come up with an occasion when manual modeling would make the most sense. I guess that when one tries to model an ancient artifact that is too delicate to be scanned or photographed, we might need to resort to manual modeling. On most other occasions, scanning is likely to yield a more accurate model of small objects than manual modeling, and procedural modeling is likely to model large-scale objects much faster. Procedural modeling can potentially be combined with manual modeling to make the code-generated 3D models more accurate.

From reading Marie Saldana’s paper “An Integrated Approach to the Procedural Modeling of Ancient Cities and Buildings”, I got a sense that while procedural modeling allows rapid prototyping and interactive updating of 3D content, the process of finding the optimal set of “rules” is hard and requires some backward thinking. So I was wondering whether computer scientists can develop algorithms and programs that automatically generate the optimal set of “rules” when fed information about the objects to be modeled.

 

Spreadsheets vs. Relational Databases

Last year I went to the Carleton Hackathon, which is a 36-hour non-stop coding contest. Our group built a mobile app that displays the daily menus for the Carleton dining halls and cafes and sends notifications to a user when the food that user likes is on the menu.

One of the major challenges we faced was collecting data from the Bon Appétit website. It seemed that the site stores the weekly menus in an online database, which we were not able to access. We ended up coding our app so that it makes a URL request every day at midnight to get the HTML page that displays the daily menu, and then scrapes the useful data from the messy HTML page.

I wondered at that time how the online database for a website works, and now, through exploring the WordPress backend, I think I have a better understanding of it.

In terms of pros and cons for spreadsheets and relational databases, I summarized below some points that came to my mind immediately:

  • Spreadsheets: 
    • Pros:
      • user friendly
      • easy to read and understand
    • Cons:
      • might store redundant information: e.g. in the book example in class, when we use a flat spreadsheet to store the information, many publishers and authors are repeated multiple times, because a publisher has published many books and an author has written many books.
      • pain to modify the data: if some information is wrong and we want to correct it, we need to fix every occurrence of that mistake: e.g. if a publisher’s information has a typo in it, and 10 books in the spreadsheet are published by that publisher, we will need to change the 10 cells that have typos. For a relational database, we only need to go to the publisher table and change one line.
      • little support for reproducible data manipulation: with a relational database, we normally write a script to do the data manipulation. If we find that the input data is outdated, we can easily import the updated data and run the script again with one click (reproducible data manipulation). However, for a flat spreadsheet, we will need to repeat the data manipulation (which might involve many steps) all over again by hand.
  • Relational Databases: 
    • Pros:
      • easy to modify stored data
      • support reproducible data manipulation
    • Cons:
      • steep learning curve for new users
      • hard to read and understand
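
To make the "change one line" point concrete, here is a small sketch in R that uses two data frames as stand-ins for the tables of a relational database. The table names, column names and book data are all hypothetical, and merge() plays the role of a database join.

```r
# Publisher "table": each publisher is stored exactly once.
# The first name contains a deliberate typo.
publishers <- data.frame(
  publisher_id   = c(1, 2),
  publisher_name = c("Northfeild Press", "Lakeside Books")
)

# Book "table": books point to their publisher by id instead of repeating its name.
books <- data.frame(
  title        = c("Book A", "Book B", "Book C"),
  publisher_id = c(1, 1, 2)
)

# Fix the typo once, in the publisher table only.
publishers$publisher_name[publishers$publisher_id == 1] <- "Northfield Press"

# Join the two tables to rebuild the flat, spreadsheet-like view.
flat_view <- merge(books, publishers, by = "publisher_id")
print(flat_view)
```

Every book from publisher 1 picks up the corrected name after the join, even though only one cell was edited.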

Should humanities students learn to program?

As a computer science major without much humanities exposure, my instant reaction to this question was: “humanities students do NOT need to learn to program”. Programming is a handy skill to have, but one can easily learn to code when one needs to by using websites like Codecademy. It might well be that throughout a humanities student’s career, he/she never needs to write any computer programs. If this is the case, then I see no point in learning to program.

I certainly agree with the argument by Matthew Kirschenbaum:

I believe that, increasingly, an appreciation of how complex ideas can be imagined and expressed as a set of formal procedures — rules, models, algorithms — in the virtual space of a computer will be an essential element of a humanities education. Hello Worlds (why humanities students should learn to program) 

Indeed, it would be beneficial to know how a computer abstracts complex things using a set of fixed formal procedures, and to appreciate the elegance and neatness of it. However, one does not necessarily need to program to achieve this goal. Many mathematicians who can’t code are working on intriguing theoretical computer science research problems. Coding is just one of the many aspects of the field of computer science, as Evan Donahue noted in his post:

To think of the computer sciences as one “computer science” unified by the language of code makes as much sense as thinking of the humanities as one discipline united by the language of (in the case of the American academy) English. A “Hello World” Apart (why humanities students should NOT learn to program) 

Although I believe that humanities students do not necessarily need to learn to program, I still think that it would be beneficial to know one programming language, like Python or R. If one wants to clean a data file for a DH project, for instance, one can certainly do it using a spreadsheet (there are many tools that one can use to select a subset of the data, chop off some unwanted data points, etc.). However, if, say, there are 100 files we want to clean, and each one takes a fixed set of 10 procedures to clean in a spreadsheet application, then doing all 1000 procedures with spreadsheet tools would consume a lot of time. Now if one knows R programming, one can write a simple R script that takes in a file name as input and saves the cleaned data to a new file. We then simply run this script 100 times (which can be done using a for loop), and get what we want in less than 10 seconds. I did a lot of data cleaning and organization (all tedious work if done in a spreadsheet) using R scripts for the Data Science class I took last term, and I can see this being applied to DH projects that involve a lot of data manipulation.
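A rough sketch of that idea in R is below. The cleaning steps, folder names and column names are all hypothetical, and the example creates three small stand-in files instead of 100 so that it runs on its own.

```r
# Clean one file: read it, apply a fixed set of steps, save the result.
# The two steps here are hypothetical examples of "cleaning".
clean_file <- function(in_path, out_path) {
  raw <- read.csv(in_path)
  cleaned <- raw[complete.cases(raw), ]     # drop rows with missing values
  cleaned <- cleaned[cleaned$value >= 0, ]  # drop negative values (assumed invalid)
  write.csv(cleaned, out_path, row.names = FALSE)
  invisible(out_path)
}

# Demo setup: create a few stand-in input files in a temporary folder.
in_dir  <- file.path(tempdir(), "raw_data")
out_dir <- file.path(tempdir(), "clean_data")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
for (i in 1:3) {
  stand_in <- data.frame(id = 1:4, value = c(1.5, NA, -2, 3))
  write.csv(stand_in, file.path(in_dir, sprintf("file_%d.csv", i)),
            row.names = FALSE)
}

# Run the same cleaning function on every .csv file in the input folder.
input_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
for (in_path in input_files) {
  clean_file(in_path, file.path(out_dir, basename(in_path)))
}
```

The loop does not care whether the folder holds 3 files or 100, which is where the time saving over a spreadsheet comes from.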

I learned a little bit of HTML/CSS/JS several years ago for a software engineering internship. I refreshed my memory by taking four lessons on HTML, CSS and JS on Codecademy yesterday. I really like the interactive nature of Codecademy, and the exercises would be very helpful for people without much coding experience. As a computer science major, however, I feel that the pacing is a little bit too slow, and I was frustrated that I needed to pass each exercise to get to the next step. It might be better for me to just look at an HTML/CSS/JS cheatsheet to figure out how things work in much less time =P

 

Musical Passage

Musical Passage immediately caught my attention among all the listed DH projects. As a music enthusiast, I love to listen to music of all different kinds, and Musical Passage turned out to be even more fun to explore than I had expected.

This DH project is about little-known early African diasporic music. It contains recordings of several pieces of early African diasporic music, and a relatively short article explaining the background. Compared to the Enchanting the Desert project we’ve looked at in class, Musical Passage is of a much smaller scale, and I found it much easier to figure out a “correct” order to explore this project:

  • first listen to the music pieces,
  • then read the article for more information.

Reverse Engineering Musical Passage:

In terms of sources, this project primarily used Hans Sloane’s 1707 book Voyage to the Islands of Madera, Barbados, Nieves, S. Christophers and Jamaica. This book contains some of the very first transcriptions of African diasporic music in the European tradition, and the music pieces presented in Musical Passage are a subset of the book’s collection.

In terms of processes, the authors of Musical Passage took some music transcriptions from Hans Sloane’s book, and then reconstructed and recorded the music themselves. In this step, they also consulted music professionals to interpret the music using Sloane’s notes. They also wrote the background article that is another important component of their website.

When it comes to presentation, they created two separate sections on their website, EXPLORE and READ. The EXPLORE section contains a single page of music scores of the pieces they’ve selected. When a viewer clicks the score of any piece, a small window appears next to it, which gives the viewer two recordings of the piece that he/she can play, followed by text that introduces the piece and its background. The READ section contains the background article as well as some illustrations from Sloane’s book.

Although Musical Passage is a small DH project, it is very well-designed. The creators’ decision to include recordings of the music allows us to connect more easily to music from hundreds of years ago.

 

SketchUp 101, Childhood House

Jan. 12 In class:


Tried to draw a chimney but failed 🙁 Will make this house prettier later…


Jan. 15 Update:

I grew up in a big, crowded city and there are very few houses near the city center (I guess only the richest people can afford a house). I thus lived in apartments throughout my childhood. Building a 3D model for an apartment building I lived/live in seemed too ambitious for a SketchUp beginner. Thus I decided to make a 3D model of my grandparents’ house in the countryside where I spent almost every summer of my childhood.

Its layout is pretty simple: a one-story kitchen attached to a two-story house. I figured out how to add a chimney to the kitchen roof, which is good. But I couldn’t figure out how to draw the wooden front door so that it is not in a “closed” state. I actually spent most of my time figuring out how to draw “open” doors, but failed. So I guess I will just leave the door closed in my 3D model (although during the day, the door is always open). Some other things I found hard to draw using the basic tools are the two trash cans (and lids) and the fish tank in the front yard. So I drew something similar to the dog food bowls we did in class.

A tip for other novice SketchUp users: the Apple Magic Mouse doesn’t support the orbit mode, and is thus harder to use for SketchUp than a normal mouse. I used the Apple mouse in class but a normal mouse to finish up my drawings later. The normal mouse definitely makes life better.