Wednesday, November 6, 2013

Scientific Workflows

If acquiring and analyzing your data is not enough to keep you busy, you've also got to think about following a process that ensures that you and others will be able to repeat and verify your work. You'll need to carefully store and catalog the data you use, both to provide repeatability and to protect the long-term availability of the data.

This problem is not new, but the scale of the difficulties has expanded in recent years. Looking for answers to questions that are spatial, or that have a geographic context, will commonly require the use of data from multiple sources. These data sets tend to be large, and you will typically need to massage the data into compatible formats. Just managing the intermediate result sets that come out of these processes can be a big job. Software designed for these tasks can help, but the bottom line is that you must think about a process up front and then stick with it.

Figure 1

Figure 1 shows an idealized flow of data through an analysis pipeline. You acquire the data you need from various sources. The data is transformed into compatible formats and normalized. You analyze the data. You curate (store and catalog) the data that represents your results, as well as the data created by intermediate steps in the process. Make no mistake, it can be a whole lot of work, and it requires specialized skills and knowledge.
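The flow described above can be sketched in a few lines of Python. This is only a minimal illustration, not a real tool: the function names (`acquire`, `transform`, `analyze`, `run_pipeline`), the sample elevation records and the catalog layout are all assumptions made up for the example. The idea it shows is that every step's output, including the intermediate ones, gets a fingerprint in a catalog so that a later rerun can be checked against it.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 digest of the data in a stable JSON encoding."""
    encoded = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def acquire():
    # Hypothetical sources that report elevation in different units.
    return [
        {"site": "A", "elevation": 100.0, "unit": "m"},
        {"site": "B", "elevation": 1000.0, "unit": "ft"},
    ]


def transform(records):
    # Normalize every elevation to metres so the records are compatible.
    factor = {"m": 1.0, "ft": 0.3048}
    return [
        {"site": r["site"],
         "elevation_m": round(r["elevation"] * factor[r["unit"]], 2)}
        for r in records
    ]


def analyze(records):
    values = [r["elevation_m"] for r in records]
    return {"mean_elevation_m": round(sum(values) / len(values), 2)}


def run_pipeline():
    """Run acquire -> transform -> analyze, cataloging each step's output."""
    catalog = []
    data = acquire()
    catalog.append({"step": "acquire", "sha256": fingerprint(data)})
    for name, step in [("transform", transform), ("analyze", analyze)]:
        data = step(data)
        catalog.append({"step": name, "sha256": fingerprint(data)})
    return data, catalog


if __name__ == "__main__":
    result, catalog = run_pipeline()
    print(result)
    for entry in catalog:
        print(entry["step"], entry["sha256"][:12])
```

In a real project the catalog would be written to disk next to the data files. Keeping it in memory here just keeps the sketch short.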

As mentioned, there is software available that can help address these issues. I organize these tools into three categories based on whether the workflow support is internal to an analysis framework or general purpose. An additional consideration is whether or not the system is open source. The point of all of this is to ensure that the work can be recreated at a later time, and the use of proprietary software places an external constraint on your ability to do that. You might use closed source software for some parts of your process, but you'll want to ensure that all the data you create is stored in non-proprietary formats. As for the workflow tools themselves, their long-term availability is a critical consideration.
  1. Open source: Frameworks with internal workflow support
  2. Open source: General purpose workflow management tools/frameworks
  3. Closed source: Workflow management tools/frameworks

Category one is represented by data analysis frameworks such as SAGA or ParaView. These frameworks record a script that represents the sequence of actions you take as you load, transform and analyze your data. These scripts can be reloaded and rerun as needed. As long as you complete the task inside the tool, you have a workflow to represent the work.
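To make the record-and-rerun idea concrete, here is a small Python sketch. It is not how SAGA or ParaView actually implement their scripting. The `RecordingSession` class, the `OPERATIONS` table and the `replay` function are hypothetical names invented for this illustration. Each action is appended to a script as it is applied, and the saved script can later be replayed to reproduce the same result.

```python
import json

# Hypothetical operations that an analysis tool might expose.
OPERATIONS = {
    "load": lambda data, values: list(values),
    "scale": lambda data, factor: [v * factor for v in data],
    "clip": lambda data, low, high: [min(max(v, low), high) for v in data],
}


class RecordingSession:
    """Applies operations to the data and records each one in a script."""

    def __init__(self):
        self.data = None
        self.script = []

    def apply(self, name, *args):
        self.data = OPERATIONS[name](self.data, *args)
        self.script.append({"op": name, "args": list(args)})
        return self.data

    def save(self):
        # Serialize the recorded actions as plain JSON text.
        return json.dumps(self.script)


def replay(script_text):
    """Rerun a saved script from scratch and return the resulting data."""
    session = RecordingSession()
    for action in json.loads(script_text):
        session.apply(action["op"], *action["args"])
    return session.data


if __name__ == "__main__":
    session = RecordingSession()
    session.apply("load", [1, 2, 3])
    session.apply("scale", 10)
    session.apply("clip", 15, 25)
    saved = session.save()
    print(session.data)   # result of the interactive session
    print(replay(saved))  # the same result, rebuilt from the script
```

The script is stored as plain JSON on purpose. A non-proprietary text format is the kind of thing you can still read years later, even if the tool that wrote it is gone.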

Category two includes general purpose data management and analysis workflow systems such as Kepler. Kepler generalizes the dedicated capabilities of the systems in category one. It allows you to incorporate entire systems into a single integrated process by providing basic processing primitives that connect the inputs and outputs of various systems. Kepler can be extended by creating plugins that know how to integrate with external systems. This is a great approach, but it comes with a cost in the form of complexity. An alternative is the Trident system created by Microsoft Research. The Trident workflow manager itself is open source, but Trident runs on top of a closed source platform (Microsoft Windows Server and SQL Server), so it cannot be considered a fully open source system.

Category three is seen in the Workflow Manager that ESRI makes available for use with ArcGIS. At the risk of mischaracterizing a product that I have not personally used, the Workflow Manager is similar to general purpose tools such as Kepler. It's aimed at the integration of GIS-based processes into larger business or research processes. ArcGIS also incorporates a feature called ModelBuilder that is similar in concept to the data analysis pipelines discussed in category one. The downside is that, like ArcGIS itself, these tools are closed source proprietary systems.

Resources:
Kepler
Trident

