Databases (or: Ramblings and Questions about Categorization)

(FAIR WARNING: rambling ahead. I couldn’t put my thoughts and questions into words that made sense so please, bear with me. It’s Monday.)

Databases boggle my mind – I honestly can’t tell if it’s because of the great volume of information that we’ve managed to store in databases, or because of the care in categorization that databases require.

Making categories seems to be the biggest issue, as we saw with the wide variance of tags in the Carleton history timeline we made in class. Each event clearly holds meaning to us, but how to express that meaning (Academics? Academia?) and how specifically that meaning should be represented are important and difficult questions. Somewhere in our reading from coding week, the study of code as a linguistic signature of individuals came up. I am sure the same could be done for databases because, as we saw last week, everyone had a different way of organizing and presenting data based on what patterns they saw and what they thought was important to emphasize.

There’s also the question, I think, about what information is important now and what could be important in the future, especially as it pertains to crowd sourced big data projects like ANZACs. When ANZACs first started, they could have kept the crowd sourcing to individuals’ heights, but they also crowd source names, regiment numbers, causes of death, etc. I wonder how they store that data and how it will be made accessible in the future?

(side note: will there be a generational difference in database-makers who will only use # to denote qualitative categories? Does anyone else remember when # was called the “pound sign”?)

I’m very curious about what the process would be if we wanted to combine the information held in databases. For example: say that the researchers behind ANZACs found a group in the US and a group in Europe who did a similar crowd sourced project, which took more or less the same data from similar sources from a similar time period, and wanted to pool their data. Flat data sets with different organizational systems would likely be tedious to merge, but how much would you have to backtrack through a relational database to get at the keys that equate Location1=New York and AuthorID1=Mark Twain and change them so the two databases can talk to each other without redundancy? By this I mean: if Location1=New York in one data set and Location1=Auckland in another, that would lead to problems. Once the data is large enough that doing it manually is impractical, how does one untangle the categories? Is this a problem digital humanists have encountered yet?
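To make the key-collision worry concrete, here is a minimal sketch (my own toy example, not how ANZACs or any real project actually stores its data): two datasets each hand out their own surrogate keys like Location1, so the same key can point at different values. One way to merge them is to re-key everything on the underlying value, producing a shared lookup table plus a remapping table for each original dataset.

```python
# Toy example: two projects each assigned their own Location keys,
# so "Location1" means different things in each dataset.
dataset_a = {"Location1": "New York", "Location2": "Boston"}
dataset_b = {"Location1": "Auckland", "Location2": "New York"}

def merge_keyed(*datasets):
    """Assign one shared key per distinct value across all datasets.

    Returns the merged lookup table (shared key -> value) and, for each
    input dataset, a remap table (old key -> new shared key).
    """
    merged = {}        # shared key -> value
    value_to_key = {}  # value -> shared key (so "New York" is stored once)
    remaps = []
    for data in datasets:
        remap = {}
        for old_key, value in data.items():
            if value not in value_to_key:
                new_key = f"Location{len(value_to_key) + 1}"
                value_to_key[value] = new_key
                merged[new_key] = value
            remap[old_key] = value_to_key[value]
        remaps.append(remap)
    return merged, remaps

merged, (remap_a, remap_b) = merge_keyed(dataset_a, dataset_b)
# "New York" now has a single shared key, and each dataset's old
# Location1 points to a different shared key, as it should.
```

Of course, this only works when the values themselves match exactly; "New York" versus "New York City" versus "NYC" is where the manual untangling (or fuzzy matching) would have to come in, and that's exactly the part that stops scaling.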
