Sunday, November 3, 2013

Creating the Massachusetts Settlement Maps

Did you know that not everything that appears on the Internet is true? (How about that statement itself? Ah, gotcha.)

Right, a central theme for this blog is critical skepticism. The Internet is a wondrous tool for sharing information, but a healthy skepticism is more than justified. When someone is saying, "I've analyzed this data and this is what I've learned", you might want to ask where the data comes from, how the analysis was done and if the analysis can be repeated and verified.

And that ability to repeat and verify is not always easy (in fact, it's usually not easy). It's common to find yourself working with large amounts of data and with data coming from varied sources. And the analysis process itself can produce new and very large data sets that then become the basis for later steps. Managing all of this data so that the process can be repeated is where workflows come into play. This is a topic I'll get into in much greater depth later on but it's only fair that I set the right tone by documenting the process used to create the maps seen in the "Why put a city there?" post.


That post was meant to illustrate some key ideas about patterns and the analysis of geodata, and it uses basic geodata management and analysis techniques. Even so, how do you know if the evidence is real or contrived? Where did the data come from? Is the representation fair or distorted? You'd hope that I can answer those questions. So, let's give it a try.

For starters, the list of settlement dates comes from a website provided by the State of Massachusetts. 
Massachusetts Incorporation and Settlement Dates ( http://www.sec.state.ma.us/cis/cisctlist/ctlistalph.htm )


Figure 1: QGIS with settlement map layers loaded

I used QGIS 2.0 to manage the data and create the maps (Figure 1). The base map layers are shapefiles (states, cities, rivers) that came from various sources. The state boundaries and cities were bundled with an obsolete educational GIS product distributed by ESRI (ArcGIS Java Edition for Education). It's usually OK to use data provided with your GIS software to create maps for publication, and with a data browser (like the QGIS Data Browser) you can check the table metadata for restrictions on use. But these tables have been around for a very long time and do not include that information. To be absolutely safe I should have found tables with clear licensing info. Rule for the day: do as I say, not as I do. The rivers and streams layer was extracted from the USGS hydrological data set. Explaining how to do that will require a post of its own, so I'll leave it for another time.


The process used to create the maps was:
  1. Create a project (I refer to these generally as workspaces)
  2. Load the US cities layer. Open the table view and select the cities in Massachusetts.
  3. Save the selection to a new shapefile, paying close attention to where you save the file. (My first save went into the QGIS program folder, which is not where I want this data file to be.)
  4. Add the new Massachusetts cities layer into the workspace and put it into edit mode.
  5. Add a column for settlement year. In QGIS you can do this from the attribute table view. I wanted to name this new column settlement_year, but shapefiles are based on a very old database file format and column names are limited to 10 characters, so I went with settleyear.
  6. I removed all but a few essential columns from the table to make it easier to add the settlement years.
  7. With the MA-Cities attribute table open on one screen, and the website with the settlement dates open on my second screen, I went down the list and manually entered the year of settlement for each place. Having two screens makes this much easier, and arranging the displays to make this go quickly is essential. It took maybe 15 minutes to do this. I've seen people spend hours on something like this because they didn't think about how they could speed up the process. With everything arranged nicely it was click, enter a number, click, enter a number, and so on.
  8. However, when I went to add the first city I found that I could only enter one digit in the settleyear column. When I created the column I didn't pay attention to the "width" setting for the column. I selected "integer" as the column type and I assumed that it would store values up to around 33 thousand (that's a programmer thing, don't worry about where that number comes from). And this assumption was wrong. In QGIS you must specify the width for integer columns. I couldn't find a way to change the width so I dropped the column I had just added, and added it again with a width of 4. 
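The two shapefile gotchas from steps 5 and 8 (the 10-character column name limit and the integer width) can be checked with a few lines of plain Python. This is only a sketch to make the rules concrete: the helper names are made up for illustration and are not part of QGIS, and 1630 is just a sample four-digit year.

```python
# Illustrative checks only. These helpers are hypothetical, not QGIS functions.

# Shapefile attribute tables use the old dBASE format, which caps
# column names at 10 characters.
DBF_MAX_FIELD_NAME = 10


def field_name_fits(name: str) -> bool:
    """Return True if the column name fits the 10-character limit."""
    return len(name) <= DBF_MAX_FIELD_NAME


def integer_fits(value: int, width: int) -> bool:
    """Return True if the integer, written out as text (sign included),
    fits in a column that is `width` characters wide."""
    return len(str(value)) <= width


print(field_name_fits("settlement_year"))  # 15 characters: too long
print(field_name_fits("settleyear"))       # 10 characters: fits
print(integer_fits(1630, 1))               # a 4-digit year in a width-1 column
print(integer_fits(1630, 4))               # the same year in a width-4 column
```

The point is that the "width" of an integer column is a count of digits, not a storage size, so a width of 1 holds only 0 through 9.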
One more thing. Since I'm new to this release of QGIS, I wanted to make sure that I was not going to do a bunch of manual editing and then find out that I couldn't save those edits. "Saving" in a GIS tends to be more complicated than you expect; there are almost always many files involved. So after adding a few settlement dates I selected Save from the file menu (on general principle). I then closed the attribute table view (Figure Two) and again selected Save from the file menu. Still not sure that everything was saved, I tried to exit QGIS. Doh! Sure enough, it warned me that I had unsaved changes. When you directly edit a shapefile you have to toggle "edit" mode off before you can save your changes. Saving the project (workspace) doesn't automatically do that. To save my edits I right-clicked on the table name in the layer list and selected Current Edits-Save Selected Layers. I could also have toggled editing off and then used Save or Save As to save the changes to the layer. After all of this I saved the workspace one more time, closed the workspace, and reloaded it. My edits were there, so I proceeded to add the bulk of the settlement dates.

Figure Two: Attribute Table View in QGIS

So what about bias? This analysis is probably not going to generate a lot of controversy. Still, you could wonder if the conclusions I'm drawing are valid. And one point of concern occurred to me as I was doing the work. I started with a list of modern towns and cities and added settlement dates to that list. What if those cities are not actually representative of 17th century settlement patterns? For example, what about settlements that failed to prosper and didn't last? Those places would not show up in the data I'm using and this might have distorted the entire analysis. There are many more places in Massachusetts that were settled before 1764. If we could see all of those places, would different patterns emerge?

That, grasshopper, is an excellent question.

And we could endeavor to answer that question by taking the full list of cities from the website and creating a map that shows them all. The HTML from the website could be converted to a table (I would probably use a spreadsheet for that) and then the table could be geocoded using the city names to get the latitude/longitude pairs (that's another whole story). That table could then be used as the basis for a new shapefile that we could view in our GIS. There's a bunch of useful stuff embedded in that process, so I think I'll make it my next post. But until then, remember, when crossing a minefield avoid the temptation to pick up shiny objects.
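For the curious, the first step of that process (turning HTML into a table) does not strictly need a spreadsheet. Here is a minimal sketch using only the Python standard library. It assumes the list is laid out as an ordinary HTML table, which I have not verified for the state's page, and the sample markup and place names below are invented for illustration. Geocoding is left out entirely.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html: str) -> str:
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample markup; the real page's structure is an assumption.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Settled</th></tr>
  <tr><td>Exampleton</td><td>1650</td></tr>
  <tr><td>Sampleville</td><td>1701</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE))
```

The resulting CSV text could be saved to a file and opened in a spreadsheet, or fed to whatever geocoding step comes next.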

