This problem is not new, but its scale has grown in recent years. Answering questions that are spatial, or that have a geographic context, commonly requires data from multiple sources. These data sets tend to be large, and you will typically need to massage the data into compatible formats. Just managing the intermediate result sets that come out of these processes can be a big job. Software designed for these tasks can help, but the bottom line is that you must think through a process up front and then stick with it.
Figure 1
Figure 1 shows an idealized flow of data through an analysis pipeline. You acquire the data you need from various sources. You transform the data into compatible formats and normalize it. You analyze the data. You curate (store and catalog) the data that represents your results, as well as the data created by intermediate steps in the process. Make no mistake, this can be a whole lot of work, and it requires specialized skills and knowledge.
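To make the four stages concrete, here is a minimal Python sketch of the acquire, transform, analyze, and curate flow. Everything in it is an assumption made for illustration: the two in-memory sample sources, the field names, the function names, and the choice of JSON files plus a SHA-256 checksum as the catalog. A real pipeline would read from actual sources and would likely use a proper metadata store.

```python
import csv
import hashlib
import io
import json
from pathlib import Path

# Hypothetical sample inputs standing in for two real data sources.
SOURCE_A = "site,lat,lon,temp_f\nA,44.0,-123.1,68\nB,45.5,-122.7,59\n"
SOURCE_B = [{"site": "C", "lat": 43.2, "lon": -123.4, "temp_c": 18.0}]


def acquire():
    """Gather raw data from each source (here, in-memory stand-ins)."""
    rows_a = list(csv.DictReader(io.StringIO(SOURCE_A)))
    return {"a": rows_a, "b": SOURCE_B}


def transform(raw):
    """Normalize both sources to one schema, with temperatures in Celsius."""
    records = []
    for row in raw["a"]:
        records.append({
            "site": row["site"],
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
            "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 1),
        })
    for row in raw["b"]:
        records.append(dict(row))
    return records


def analyze(records):
    """Compute a simple summary over the normalized records."""
    mean = sum(r["temp_c"] for r in records) / len(records)
    return {"site_count": len(records), "mean_temp_c": round(mean, 1)}


def curate(name, data, out_dir, catalog):
    """Store one result as JSON (a non-proprietary format) and catalog it."""
    text = json.dumps(data, indent=2, sort_keys=True)
    path = Path(out_dir) / f"{name}.json"
    path.write_text(text, encoding="utf-8")
    catalog.append({
        "name": name,
        "file": path.name,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    })
    return path


def run_pipeline(out_dir):
    """Run every stage, curating intermediate results as well as the final one."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    catalog = []
    raw = acquire()
    curate("raw", raw, out, catalog)
    normalized = transform(raw)
    curate("normalized", normalized, out, catalog)
    summary = analyze(normalized)
    curate("summary", summary, out, catalog)
    (out / "catalog.json").write_text(json.dumps(catalog, indent=2), encoding="utf-8")
    return summary, catalog
```

Calling `run_pipeline("results")` writes one file per stage plus a `catalog.json` that lists them. The checksums are one simple way to confirm later that a stored intermediate result is still the one the analysis actually used.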
As mentioned, there is software available that can help address these issues. I organize these tools into three categories based on two considerations: whether the workflow support is internal to an analysis framework or general purpose, and whether or not the system is open source. The point of all of this is to ensure that the work can be recreated at a later time, and the use of proprietary software places an external constraint on your ability to do that. You might use closed source software for some parts of your process, but you'll want to ensure that all the data you create is stored in non-proprietary formats. And for the workflow tools themselves, long-term availability is a critical consideration.
- Open source: Frameworks with internal workflow support
- Open source: General purpose workflow management tools/frameworks
- Closed source: Workflow management tools/frameworks
Category two includes general purpose data management and analysis workflow systems such as Kepler. Kepler generalizes the dedicated capabilities of the systems in category one and allows you to incorporate entire systems into a single integrated process. It does this by providing basic processing primitives that connect the inputs and outputs of various systems. Kepler can be extended by creating plugins that know how to integrate with external systems. This is a great approach, but it comes at a cost in the form of complexity. An alternative is the Trident system created by Microsoft Research. The Trident workflow manager itself is open source, but it runs on top of a closed source platform (Microsoft Windows Server and SQL Server), so it cannot be considered a fully open source system.
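The idea of wiring one step's output into another step's input can be illustrated with a small Python sketch. This is not Kepler's or Trident's API; the `Workflow` class, its method names, and the sample steps are all hypothetical, and the sketch only shows the general pattern of named steps connected through their inputs and outputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """A hypothetical, minimal workflow: named steps wired by their inputs."""

    def __init__(self):
        self.steps = {}  # step name -> (function, names of upstream steps)

    def add_step(self, name, func, inputs=()):
        """Register a step whose arguments are the outputs of `inputs`."""
        self.steps[name] = (func, tuple(inputs))

    def run(self):
        """Run each step after the steps it depends on; return all outputs."""
        graph = {name: set(inputs) for name, (_, inputs) in self.steps.items()}
        results = {}
        for name in TopologicalSorter(graph).static_order():
            func, inputs = self.steps[name]
            results[name] = func(*(results[i] for i in inputs))
        return results


# Illustrative steps: each one could wrap a call into an external system.
wf = Workflow()
wf.add_step("elevations", lambda: [120, 340, 560])
wf.add_step("threshold", lambda: 300)
wf.add_step(
    "high_ground",
    lambda elev, limit: [e for e in elev if e > limit],
    inputs=["elevations", "threshold"],
)
results = wf.run()
```

Because `run` returns the output of every step, the intermediate results are available for curation alongside the final one. A real workflow system adds a great deal on top of this, which is where the complexity comes from.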
Resources:
Kepler
Trident