Saturday, November 16, 2013

Swimming Upstream Against the Info Torrent - Part One - Revision Control

Tsunami, torrent, fire hose, etc. These days data seems to grow on trees (my current interest is in Beech Bark Disease).

In any case, to do serious work you have to take care with the management and storage of the data you work with. I just spent several hours organizing and archiving recent work and this reminded me how significant a hurdle this can be. When you work with geodata just keeping track of all your files is a big job.

The approach I use comes directly from working as a software developer. Programmers deal with lots of files, and it's common to have many versions of a file (I have over 1.5 million files on my laptop - how did that happen?). Software developers deal with this problem by using a version control system to keep versioned copies of their files. There are (at least) a half dozen version control systems in common use, with both commercial and open source alternatives. Even so, if your objective is the management of research data for the long term, there is a compelling case for using an open source product. Open source software avoids the risk of getting stuck with a commercial product that doesn't meet your needs in the future (or that you might not want to continue to pay for). Git, Mercurial and Subversion are among the best known open source options for version control.

A version control system (VCS) allows you to store your data and files in what amounts to a structured database. You start by creating a repository where you place an initial copy of each file you want to store (you can have more than one repository). Each time you edit a file (or dataset) you commit the new version to the repository along with comments that say what you changed. When used for software development the revision control process gets complicated, but an individual or group working with research data can get most of the benefits by following a few basic practices. The generalized steps for using revision control are (a short scripted example follows the list):
  • Create a repository -- a common approach is one project per repository
  • Create a folder structure within the repository. This structure represents the organization of your project data and files
  • Copy your project files into the repository. In VCS lingo you commit or check-in your work 
  • Add files to the project repository as needed. When you change an existing file the new version is stored as a revision of the existing file (previous versions are also available). Add good comments with each update so you know what the change represents.
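If you'd like to see what those steps look like in practice, here is a minimal sketch that drives Subversion's command-line tools from a short Python script (TortoiseSVN users do the same things through the right-click menus). The repository path, file names and commit messages are made up for illustration.

```python
import subprocess

def run(*cmd):
    """Run a command and stop immediately if it fails."""
    subprocess.run(cmd, check=True)

# Step 1: create a repository (one project per repository).
run("svnadmin", "create", "/home/me/repos/beech-bark-project")

# Step 2: check out a working copy; the folders you create inside it
# become the organization of your project data and files.
run("svn", "checkout", "file:///home/me/repos/beech-bark-project", "beech-bark-work")

# Step 3: after copying project files into the working copy, tell Subversion
# about them and commit (check in) the work with a comment saying what it is.
run("svn", "add", "beech-bark-work/field-data-2013.csv")
run("svn", "commit", "beech-bark-work", "-m", "Initial load of 2013 field data")

# Step 4: after editing a file, commit again; the new version is stored as a
# revision and the earlier versions remain available.
run("svn", "commit", "beech-bark-work", "-m", "Corrected coordinates for plot 7")
```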
Using a VCS to store and version your files means that you don't need to create copies when you make a change. If you've ever ended up with a folder looking like Figure One, you know what I mean. 

  Figure One: Out of control file copies

I was experimenting with different ways to process a KML file (the Google format for geodata) and I saved the result of each experiment to a separate file. A few weeks later I wanted to repeat the analysis using a new version of the data. Unfortunately, I couldn't associate the refinement techniques I had used with the copied files, so I ended up repeating some of the previously done work. Using a VCS, with comments that said what I had done at each stage in the process, would have saved a bunch of re-work.

If you are new to the idea of using revision control I recommend that you start with TortoiseSVN (Figure Two). 


Figure Two: The TortoiseSVN Repository Browser

TortoiseSVN is actually just a client for the Subversion system. Subversion provides the data management core (and a command-line interface) and TortoiseSVN puts a user-friendly interface on top of that. When you install TortoiseSVN it also installs a local copy of the Subversion core, so you can create local repositories. You can also use TortoiseSVN to access repositories stored on remote servers.

TortoiseSVN is especially nice if you are using Microsoft Windows. It integrates with Windows Explorer so you manage your files by right clicking on a folder or file and selecting the appropriate option from the popup menu (the Macintosh version is similar). 

It's useful to have local repositories for your day-to-day work, but the value of a VCS is multiplied when your repositories are stored on servers. Using a server-based repository you can allow others to access and update files (with control over who can do what). And you don't even have to install and manage the server software yourself. Services such as GitHub or Google Code provide browser-based front-ends to Git and Subversion (respectively). They store your files on remote servers run by the provider, so you don't have to manage the VCS software or make backups. Depending on the amount of space you need you might have to pay a fee, but basic use is free or low cost. Using an online VCS is convenient, especially if you need to share work with other contributors or if you are placing your work into the public domain, but I recommend a local VCS for day-to-day work and the online alternatives when sharing is the main objective.

Of course, setting this up and using it is more work on top of all the other things you probably don't have time to do. A really simple alternative is to start with an online storage system such as Google Drive or DropBox (or similar services). These systems store your files on a remote server, making them easy to share (and accessible from multiple devices). They typically offer limited free storage and let you buy additional space as needed (usually inexpensively).

In addition to Google Drive there is Google Docs, Google's online office suite; the "docs" you create are stored in Google Drive. One of the best features of Google Docs is that the system automatically stores a revision for every change you save. This is the wiki approach popularized by Wikipedia, and this "save everything" model makes it possible to let collaborators edit any document. If an edit is wrong or unwanted you can easily go back to an earlier version of a document. This is similar to using a version control system, except that you don't get to add comments to each revision. Still, used with a disciplined approach, it's far better than nothing.

Lastly, you can simply organize your work in folders on your computer. This approach can work but it requires great discipline if the system is going to hold up over time. To quote Uncle Ben, "with great power comes great responsibility."


If you value your work you must think about how you will organize and archive your data. Backups are needed for when disaster strikes, but they are just the start. Protecting the value of your work for the long term is a much bigger challenge. 


Wednesday, November 6, 2013

GPS Accuracy

GPS is a truly amazing technology. I mean, here's an everyday technology that relies on the relationship between space and time that we wrap up in the theories of special and general relativity. Wow.

But not to worry, that's not what this post is all about. I have four devices that incorporate GPS in some way and I'd like to compare their relative accuracy. I'd also like to know which one is most accurate and what conditions affect that accuracy.

The four devices that I'm evaluating:
  • A Garmin eTrex Venture HC hand-held GPS - (from 2009, now discontinued, but similar models are available for around $125, see the Garmin eTrex 10, 20, 30)
  • A Magellan Explorist 210 (from roughly 2002, long ago discontinued)
  • A Samsung Charge smart phone (2.5 year old Android smart phone)
  • A Canon Powershot sx260HS camera with built-in GPS

The testing was done under ideal GPS operating conditions:
  • Clear skies
  • In the middle of a large open area with no obstructions
  • Plenty of time for each device to lock signals from as many GPS satellites as possible
The Basic Test
My testing consists of recording latitude and longitude pairs from each device at known locations so that the coordinates returned by the devices can be compared with the actual coordinates. So how am I finding these "known locations"? I'm using Google Earth to zoom in on an easily recognized location near my home. I then add placemarks in GE and view the coordinates that GE assigns to those placemarks. So, yes, there are assumptions being made. For one, I'm assuming that the coordinates I obtain from Google Earth are trustworthy, and, two, I'm discounting (for now) any discrepancies that might come from the use of different coordinate systems by the devices involved.

The location for the testing is a square wading pool (82 feet on a side) located in a "quad" complex in the Saratoga Spa State Park in Saratoga Springs, NY. Figure One shows a screen capture of the pool as it appears in Google Earth; the right-hand image shows the pool with more of the surrounding area visible to provide context. I made three circuits of the pool, stopping at each corner and recording the location with each of the four devices. I then averaged the latitude and longitude values for each device to obtain a single location per device. These coordinates were used to add point markers to Google Earth, and I could then use the Google Earth measuring tool to find the distance between the locations provided by each device and the reference location. Those measurements are shown in Table One.
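As an aside, if you would rather not lean on the Google Earth measuring tool for the distance comparison, the averaging and the offset calculation are easy to script. The sketch below averages a handful of fixes and reports the great-circle distance to a reference point in feet; the coordinates are placeholders rather than my actual readings.

```python
import math

def haversine_feet(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in feet."""
    earth_radius_feet = 20902231  # mean Earth radius (~6371 km) expressed in feet
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_feet * math.asin(math.sqrt(a))

# Placeholder fixes for one device at one corner (three circuits = three readings).
readings = [(43.0621, -73.8010), (43.0622, -73.8011), (43.0620, -73.8009)]
avg_lat = sum(lat for lat, lon in readings) / len(readings)
avg_lon = sum(lon for lat, lon in readings) / len(readings)

# Reference coordinates for the same corner, read from Google Earth (also placeholders).
ref_lat, ref_lon = 43.0621, -73.8010

offset = haversine_feet(avg_lat, avg_lon, ref_lat, ref_lon)
print("Offset from reference: %.1f feet" % offset)
```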



Figure One: Wading Pool in Saratoga Spa State Park as seen in Google Earth. GPS fixes were obtained at each of the four corners. The image on the right shows the wading pool in context.



                                                 Garmin   Magellan   Phone   Camera
Difference between GPS location and the
Google Earth coordinates (feet)                     8         6        25       10
Difference between GPS elevation and the
elevation obtained from Google Earth (feet)         1        10        94       20
No. of satellites registered by each device
(after a minimum 5 minute warmup)                  10         8        10        ?

Table One


The averaged results are in line with what I expected. The Garmin and Magellan GPS units produced results within the expected accuracy level of 10 feet. The camera was nearly as accurate, but the phone was accurate to 25 feet at best, and it appears to be off by about 50 feet for a typical single reading. This may be by design: you can use the phone to provide your location to others in real time, and being really accurate might not be desirable. That is just a guess, and I'm sure that GPS accuracy varies widely among phone models.
Also worth noting is that the three sets of readings obtained from each of the dedicated GPS devices were consistent with the overall accuracy level provided by the devices (falling within a circle centered on the location with a radius of 15 feet). The locations obtained from the camera and the phone were much more variable, with both devices producing some fixes that were hundreds of feet away from the actual locations (more on this later).

In addition to the latitude/longitude values representing the location, GPS devices are also able to provide a measurement of elevation relative to sea level. I recorded and averaged these elevations, and Table One includes the comparison of the calculated elevation with the elevation that Google Earth shows at that place. The reference elevation provided by Google Earth was checked against a nearby USGS benchmark, and the benchmark value differed from the Google Earth value by just 2 feet.

The elevation measurements provide additional insight into the accuracy of the devices. Calculating elevation with GPS requires signals from at least four satellites (as opposed to a minimum of three for a horizontal position alone). Given that these devices were able to lock signals from 8 to 10 satellites I would expect them to provide accurate elevations, but, historically, elevation measurements made by GPS have tended to be less accurate and less reliable (for various reasons). The two GPS devices and the camera worked well under the ideal conditions of this testing, but the phone GPS was off by close to 100 feet on average.

Overall, the Garmin GPS performed best, but that is mostly a question of convenience. The Garmin is typically ready to go within 30-60 seconds, while the Magellan unit commonly needs several minutes to lock signals from a full complement of satellites. The Garmin uses far superior antenna technology, but that is no surprise given that the Magellan unit is over 10 years old (a newer Magellan would probably be comparable). Both GPS units and the camera provide geographic references that can be used for a variety of purposes. The next question I want to consider is precision: what does it mean to say that a lat/lng pair is within 10 feet of the "actual" value? There's more to it than you might think.

Scientific Workflows

If acquiring and analyzing your data is not enough to keep you busy, you've also got to think about following a process that ensures that you and others will be able to repeat and verify your work. You'll need to carefully store and catalog the data you use, both to provide repeatability and to protect the long-term availability of the data.

This problem is not new, but the scale of the difficulties has expanded in recent years. Looking for answers to questions that are spatial, or that have a geographic context, will commonly require the use of data from multiple sources. These data sets tend to be large and you typically will need to massage the data into compatible formats. Just managing the intermediary result sets that come out of these processes can be a big job. Software designed for these tasks can help but the bottom line is that you must think about a process up front and then stick with it.

Figure 1

Figure one shows an idealized flow of data through an analysis pipeline. You acquire the data you need from various sources. The data is transformed into compatible formats and normalized. You analyze the data. You curate (store and catalog) data that represents your results, as well as data created by intermediate steps in the process. Make no mistake, it can be a whole lot of work and requires specialized skills and knowledge.
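Before getting to dedicated tools, it's worth noting that even a plain script that names each stage and appends a line to a provenance log captures a surprising amount of this. The sketch below is just an illustrative skeleton; the stage functions are stubs and the file names are made up.

```python
import datetime
import json

LOG_FILE = "provenance-log.jsonl"

def log_step(stage, inputs, outputs, note):
    """Append one line per pipeline stage: when it ran, what went in, what came out."""
    entry = {
        "stage": stage,
        "when": datetime.datetime.now().isoformat(),
        "inputs": inputs,
        "outputs": outputs,
        "note": note,
    }
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(entry) + "\n")

def acquire():
    # Download or copy the source data sets here; return the files you saved.
    return ["raw/source-a.csv", "raw/source-b.kml"]

def transform(raw_files):
    # Reproject, reformat, normalize; return the intermediate files you created.
    return ["work/combined.csv"]

def analyze(work_files):
    # The actual analysis; return the result files.
    return ["results/summary.csv"]

raw = acquire()
log_step("acquire", [], raw, "original downloads, stored read-only")

work = transform(raw)
log_step("transform", raw, work, "converted to a common format and merged")

results = analyze(work)
log_step("analyze", work, results, "summary statistics per town")
```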

As mentioned, there is software available that can help address these issues. I organize these tools into three categories, based on whether the workflow support is internal to an analysis framework or general purpose, and on whether or not the system is open source. The point of all of this is to ensure that the work can be recreated at a later time, and the use of proprietary software places an external constraint on your ability to do that. You might use closed source software for some parts of your process, but you'll want to ensure that all the data you create is stored in non-proprietary formats. And for the workflow tools themselves, the long-term availability of the workflow system is a critical consideration.
  1. Open source: Frameworks with internal workflow support
  2. Open source: General purpose workflow management tools/frameworks
  3. Closed source: Workflow management tools/frameworks
Category one is represented by data analysis frameworks such as SAGA or ParaView. These frameworks can record a script that represents the sequence of actions you take as you load, transform and analyze your data. These scripts can be reloaded and rerun as needed. As long as you complete the task inside the tool, you have a workflow that represents the work.

Category two includes general purpose data management and analysis workflow systems such as Kepler. Kepler generalizes the dedicated capabilities of the systems in category one. Kepler allows you to incorporate entire systems into a single integrated process. It does this by providing basic processing primitives that are used to connect inputs and outputs from various systems. Kepler can be extended by creating plugins that know how to integrate with external systems. This is a great approach but it comes with a cost in the form of complexity. An alternative is the Trident system created by Microsoft Research. The Trident workflow manager itself is open source, but Trident runs on top of a closed source platform (Microsoft Windows Server and SQL Server) so it cannot be considered to be a fully open source system.

Category three is represented by the Workflow Manager that ESRI makes available for use with ArcGIS. At the risk of mischaracterizing a product that I have not personally used, the Workflow Manager is similar to general purpose tools such as Kepler. It's aimed at the integration of GIS-based processes into larger business or research processes. ArcGIS also includes a feature called the Model Builder that is similar in concept to the data analysis pipelines discussed in category one. The downside is that, like ArcGIS itself, these tools are closed source, proprietary systems.

Resources:
Kepler
Trident


Pattern and Process

Pattern and Process - A Geographic Perspective

Just for fun let's say that the analysis of data to gain information or knowledge can be reduced to two simple questions:
  • Is there a pattern? 
  • Can an evident pattern be attributed to a process? 
Sound familiar? "Pattern and process" might remind you of the scientific method featured in high school science courses. And for good reason; it's the same idea. Given data that represents something of interest we can look for patterns and try to figure out what circumstances or process produced those patterns. Turning the question around we can start with an idea about how something works and try to acquire evidence (in the form of data) to support or refute our idea. The analysis of geographic data lends itself to thinking in terms of "pattern and process". 

That said, caution is always required. Not every pattern represents a meaningful truth (whatever that means) and there are many ways for the whole thing to go wrong. We can look at data and find real patterns that result from nothing more than chance. Or our data can be biased, meaning that it systematically over- or under-represents some aspect of what we are interested in. You might associate bias with deliberate attempts to skew the results, but that's just one source of bias. Bias also comes from flaws in the methods we use to acquire data or in how we manage the data once we've got it.
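If "patterns from nothing more than chance" feels abstract, the little experiment below makes it concrete: it generates a handful of purely random measurement series and then reports the strongest correlation it can find among them. The numbers (20 series, 15 observations each) are arbitrary choices for illustration.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(13)

# Twenty "variables" of pure noise, 15 observations each.
series = {i: [random.gauss(0, 1) for _ in range(15)] for i in range(20)}

# Search every pair for the strongest apparent relationship.
best = max(combinations(series, 2),
           key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])))
r = pearson(series[best[0]], series[best[1]])
print("Strongest correlation between two random series: r = %.2f" % r)
```

Run it a few times with different seeds; a "strong" correlation between two streams of pure noise turns up far more often than intuition suggests.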

On top of chance and bias we have uncertainty about how well our data actually represents the things we want to learn about. When we use data to represent something in the real world we are almost always summarizing and sampling the real attribute of interest. Even our brains do this. You have no doubt seen optical illusions that trick our brains into seeing things that are not really there. This happens because our brains summarize and sample the stream of data that comes from our eyes so as to not be overwhelmed by the flow. Similarly, we summarize and sample when we acquire data to avoid being overwhelmed by the complexity, or just the sheer volume, of the data.

And if that's not enough, we have to take care to ensure that we don't overreach in drawing conclusions from the data we analyze. We might find a pattern, and we might even be certain that the pattern is connected to the process of interest, but that does not mean that we understand cause and effect. Statisticians have a catchy phrase for this and I recommend that you repeat it quietly to yourself three times each day:


Correlation does not imply causation
Correlation does not imply causation
Correlation does not imply causation 


A well-known example of cause-and-effect thinking run amok is the often-cited notion that marijuana is a "gateway" drug and that using marijuana leads to the use of harder drugs. This idea grew out of data showing that many heroin addicts used marijuana before they got hooked on heroin. That is a correlation that I don't doubt for a second, but it does not mean that preventing the use of marijuana will prevent heroin addiction. 

Correlation does not imply causation

The same correlation could be found with the use of alcohol or with many other behaviors that are common among heroin addicts. Jon Stewart (of The Daily Show) nailed this with his theory that, for kids growing up in Illinois, participating in student government in high school leads to political corruption and prison later on. In short, student class president is a gateway office. Stewart's logic is:  
  • Over the past 20 years a high percentage of prominent political figures in Illinois have ended up in prison 
  • Stewart noted that many of these figures had held student government offices in high school
  • Therefore, participation in student government is a "gateway office" leading to political corruption, prison and despair. He urged parents to protect their children by knowing the signs of political ambition and taking direct action to stop it before it's too late.
Brilliant. Correlation does not imply causation. Don't ever forget this.

Community Science

Community science is a variation on a better known theme that you might know as "citizen science". The idea is that "regular" people -citizens or communities- can and should participate in research that contributes to a better understanding of our world. The distinction lies in where the research questions or issues originate. Citizen science is commonly associated with traditional research projects where citizen volunteers participate by gathering or evaluating data. The questions to be answered tend to come from traditional sources in academia or government and the projects are managed by those institutions. Community science extends the idea that "anyone can do real science" by encouraging the involvement of groups that self-organize around an issue or question of special interest to a community. 

But this distinction is somewhat arbitrary. The big question is:
  • How does the research get turned into useful knowledge?
This is one of those questions that gets more complicated every time you look at it and I will come back to it again and again in this blog. If you are interested in this topic of "open" research, here are some sources for the information that shapes my thinking on this:
Community science is sometimes associated with "adaptive management", a methodology that attempts to reduce conflict in natural resources management by involving stakeholders in an iterative process. The Wikipedia page provides an overview and links to resources.

[Free eBook available on-line from Microsoft Press]
Dozier, Jeff and Gail, William B., The Emerging Science of Environmental Applications, in The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle, Microsoft Press

Victoria Stodden has written about open science and participatory science. Her blog has links to her work and other resources. I found this paper to be interesting.

[you need access to an academic library for these] 
Carr, A. J. L., 2004, Why Do We All Need Community Science?, Society and Natural Resources, 17:841–849

Gibbons, Michael, 1999, Science's New Social Contract with Society, Nature 402 (Supplement), C81

Goodchild, Michael F., Citizens as sensors: the world of volunteered geography, GeoJournal (2007) 69:211–221


Data Overview

Data, data everywhere...

Links to data resources referenced in the blog.


Source: NYS GIS Clearinghouse
Comment: First place to check for geodata created by or used by New York State official bodies. Some data is not available to the public by default but can be requested from the originating source.

more to come...

Sunday, November 3, 2013

Creating the Massachusetts Settlement Maps

Did you know that not everything that appears on the Internet is true? (how about that statement itself, ah, gotcha).

Right, a central theme for this blog is critical skepticism. The Internet is a wondrous tool for sharing information, but a healthy skepticism is more than justified. When someone says, "I've analyzed this data and this is what I've learned", you might want to ask where the data comes from, how the analysis was done, and whether the analysis can be repeated and verified.

And repeating and verifying is not always easy (in fact, it's usually not). It's common to find yourself working with large amounts of data coming from varied sources. And the analysis process itself can produce new and very large data sets that then become the basis for later steps. Managing all of this data so that the process can be repeated is where workflows come into play. This is a topic I'll get into in much greater depth later on, but it's only fair that I set the right tone by documenting the process used to create the maps seen in the "Why put a city there?" post.


That post was meant to illustrate some key ideas about patterns and the analysis of geodata, and it uses basic geodata management and analysis techniques. Even so, how do you know if the evidence is real or contrived? Where did the data come from? Is the representation fair or distorted? You'd hope that I can answer those questions. So, let's give it a try.

For starters, the list of settlement dates comes from a website provided by the State of Massachusetts. 
Massachusetts Incorporation and Settlement Dates ( http://www.sec.state.ma.us/cis/cisctlist/ctlistalph.htm )


Figure 1: QGIS with settlement map layers loaded

I used QGIS 2.0 to manage the data and create the maps (Figure 1). The base map layers are shapefiles (states, cities, rivers) that came from various sources. The state boundaries and cities were bundled with an obsolete educational GIS product distributed by ESRI (ArcGIS Java Edition for Education). It's usually OK to use data provided with your GIS software to create maps for publication, and using a data browser (like the QGIS Data Browser) you can check the table metadata for restrictions on use. But these tables have been around for a very long time and do not include that information. To be absolutely safe I should have found tables with clear licensing info. Rule for the day: do as I say, not as I do. The rivers and streams layer was extracted from the USGS hydrological data set. Explaining how to do that will require another post and I'll leave it for another time.


The process used to create the maps was:
  1. Create a project (I refer to these generally as workspaces)
  2. Load the US cities layer. Open the table view and select the cities in Massachusetts.
  3. Save the selection to a new shapefile, paying close attention to where you save the file (my first save went into the QGIS program folder, not where I wanted this data file to be).
  4. Add the new Massachusetts cities layer into the workspace and put it into edit mode.
  5. Add a column for settlement year. In QGIS you can do this from the attribute table view. I wanted to name this new column settlement_year, but shapefiles are based on a very old database file format and column names are limited to 10 characters, so I went with settleyear.
  6. I removed all but a few essential columns from the table to make it easier to add the settlement years.
  7. With the MA-Cities attribute table open on one screen, and the website with the settlement dates open on my second screen, I went down the list and manually entered the year of settlement for each place. Having two screens makes this much easier, and arranging the displays so the work goes quickly is essential. It took maybe 15 minutes to do this. I've seen people spend hours on something like this because they didn't think about how they could speed up the process. With everything arranged nicely it was click, enter a number, click, enter a number, and so on.
  8. However, when I went to add the first city I found that I could only enter one digit in the settleyear column. When I created the column I didn't pay attention to the "width" setting for the column. I had selected "integer" as the column type and assumed that it would store values up to around 33 thousand (that's a programmer thing, don't worry about where that number comes from). That assumption was wrong: in QGIS you must specify the width for integer columns. I couldn't find a way to change the width, so I dropped the column I had just added and added it again with a width of 4. (A scripted version of this column work is sketched just after this list.)
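For anyone who would rather script the column work than click through the dialogs, something along these lines should do the same job from the QGIS Python console. This is only a sketch against the QGIS 2.x API; the attribute name ("NAME") and the sample years dictionary are illustrative, so check them against your own table.

```python
from PyQt4.QtCore import QVariant
from qgis.core import QgsField

layer = iface.activeLayer()  # the MA-Cities layer, selected in the layer list

# Add an integer column with an explicit width of 4; leaving the width at its
# default is what bit me in step 8.
layer.dataProvider().addAttributes([QgsField("settleyear", QVariant.Int, "integer", 4)])
layer.updateFields()

# The years themselves can still be typed into the attribute table by hand, or
# filled in here if you already have them keyed by town name (this dictionary is
# just a stub with two well-known examples).
years = {"Plymouth": 1620, "Boston": 1630}
idx = layer.fieldNameIndex("settleyear")

layer.startEditing()
for feature in layer.getFeatures():
    name = feature["NAME"]  # the name column in the bundled cities table; check yours
    if name in years:
        layer.changeAttributeValue(feature.id(), idx, years[name])

# Nothing reaches the shapefile until the edits are committed, which is the
# scripted equivalent of toggling editing off and saving, described below.
layer.commitChanges()
```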
One more thing. Since I'm new to this release of QGIS, I wanted to make sure that I was not going to do a bunch of manual editing and then find out that I couldn't save those edits. "Saving" in a GIS tends to be more complicated than you expect; there are almost always many files involved. So after adding a few settlement dates I selected Save from the File menu (on general principle). I then closed the attribute table view (Figure Two) and again selected Save from the File menu. Still not sure that everything was saved, I tried to exit QGIS. Doh! Sure enough, it warned me that I had unsaved changes. When you directly edit a shapefile you have to toggle "edit" mode off before you can save your changes; saving the project (workspace) doesn't automatically do that. To save my edits I right-clicked on the table name in the layer list and selected Current Edits-Save Selected Layers. I could also have toggled editing off and then used Save or Save As to save the changes to the layer. After all of this I saved the workspace one more time, closed it, and reloaded it. My edits were there, so I proceeded to add the bulk of the settlement dates.

Figure Two: Attribute Table View in QGIS

So what about bias? This analysis is probably not going to generate a lot of controversy. Still, you could wonder if the conclusions I'm drawing are valid, and one point of concern occurred to me as I was doing the work. I started with a list of modern towns and cities and added settlement dates to that list. What if those cities are not actually representative of 17th-century settlement patterns? For example, what about settlements that failed to prosper and didn't last? Those places would not show up in the data I'm using, and that might distort the entire analysis. There are many more places in Massachusetts that were settled before 1764. If we could see all of those places, would different patterns emerge?

That, grasshopper, is an excellent question.

And we could endeavor to answer that question by taking the full list of cities from the website and creating a map that shows them all. The HTML from the website could be converted to a table (I would probably use a spreadsheet for that) and then the table could be geocoded using the city names to get the latitude/longitude pairs (that's another whole story). That table could then be used as the basis for a new shapefile that we could view in our GIS. There's a bunch of useful stuff embedded in that process, so I think I'll make it my next post. But until then, remember, when crossing a mine field avoid the temptation to pick up shiny objects.