In any case, to do serious work you have to take care with the management and storage of the data you work with. I just spent several hours organizing and archiving recent work and this reminded me how significant a hurdle this can be. When you work with geodata just keeping track of all your files is a big job.
The approach I use comes directly from working as a software developer. Programmers deal with lots of files and it's common to have many versions of a file (I have over 1.5 million files on my laptop - how did that happen). Software developers deal with this problem by using a version control systems to keep versioned copies of their files. There are (at least) a half dozen different version control systems in common use with both commercial and open source alternatives. Even so, if your objective is the management of research data for the long term there is a compelling case for using an open source product. Using open source software avoids the risk of getting stuck with a commercial product that doesn't meet your needs in the future (or that you might not want to continue to pay for). Git, Mercurial and Subversion are among the best known open source options for version control.
A version control system (VCS) allows you to store your data and files in what amounts to a structured database. You start by creating a repository where you place an initial copy of each file you want to store (you can have more than one repository). Each time you edit a file (or dataset) you commit the new version to the repository along with comments that say what you changed. When used for software development the revision control process gets complicated, but an individual or group working with research data can get most of the benefits by following a few basic practices. The generalized steps for using revision control are:
I was experimenting with different ways to process a KML file (the Google Format for geodata) and I saved the result of each experiment to a separate file. A few weeks later I wanted to repeat the analysis using a new version of the data. Unfortunately I couldn't associate the refinement techniques I had use with the copied files. So I ended up repeating some of the previously done work. Using a VCS --with comments that said what I had done at each stage in the process-- would have saved a bunch of re-work.
If you are new to the idea or using revision control I recommend that you start with TortoiseSVN (Figure Two).
TortoiseSVN is actually just the client to the Subversion system. Subversion provides the data management core of the system (and a command-line interface) and TortoiseSVN puts a nice user-friendly interface on top of that. But, when you install TortoiseSVN it also installs a local copy of Subversion core so you can create local repositories. You can also use TortoiseSVN to access repositories stored on remote servers.
TortoiseSVN is especially nice if you are using Microsoft Windows. It integrates with Windows Explorer so you manage your files by right clicking on a folder or file and selecting the appropriate option from the popup menu (the Macintosh version is similar).
It's useful to have local repositories for your day to day work but the value of using a VCS is multiplied when your repositories are stored on servers. Using a server based repository you can allow others to access and update files (with control over who can do what). And you don't even have to install and manage the server software yourself. Systems such as GitHub or Google Code provide browser based front-ends to Git and Subversion (respectively). These systems store your files on remote servers run by the provider so you don't have to manage the VCS system software (and make backups). Depending on the amount of space you need you might need to pay a fee but basic uses are free or low cost. Using an online VCS is convenient, especially if you need to share work with other contributors or if you are placing your work into the public domain, but I recommend that you use a local VCS for day-to-day work and use the online alternatives when sharing is the main objective.
Of course, setting this up and using it is more work on top of all the other things you probably don't have time to do. A really basic and simple alternative is to start with an online storage system such as Google Drive or DropBox (or similar alternatives). These systems store your files on a remote server making them easy to share (and providing access from multiple devices). These systems typically offer limited free storage and you buy additional space as needed (usually inexpensive).
In addition to Google drive there is Google Docs. Google Docs is Google's online office suite and you store the "docs" you create via Google Drive. One of the best features of Google Docs is that the system automatically stores a revision for every change you save. This is the wiki approach popularized by Wikipedia and using this "save everything" approach makes it possible to allow collaborators to edit any document. If an edit is wrong or unwanted you can easily go back to an earlier version of a document. This is similar to using a version control system except that you don't get to add comments to each revision. Still, used with a disciplined approach, it's far better than nothing.
A version control system (VCS) allows you to store your data and files in what amounts to a structured database. You start by creating a repository where you place an initial copy of each file you want to store (you can have more than one repository). Each time you edit a file (or dataset) you commit the new version to the repository along with comments that say what you changed. When used for software development the revision control process gets complicated, but an individual or group working with research data can get most of the benefits by following a few basic practices. The generalized steps for using revision control are:
- Create a repository -- a common approach is one project per repository
- Create a folder structure within the repository. This structure represents the organization of your project data and files
- Copy your project files into the repository. In VCS lingo you commit or check-in your work
- Add files to the project repository as needed. When you change an existing file the new version is stored as a revision of the existing file (previous versions are also available). Add good comments with each update so you know what the change represents.
Figure One: Out of control file copies
I was experimenting with different ways to process a KML file (the Google Format for geodata) and I saved the result of each experiment to a separate file. A few weeks later I wanted to repeat the analysis using a new version of the data. Unfortunately I couldn't associate the refinement techniques I had use with the copied files. So I ended up repeating some of the previously done work. Using a VCS --with comments that said what I had done at each stage in the process-- would have saved a bunch of re-work.
If you are new to the idea or using revision control I recommend that you start with TortoiseSVN (Figure Two).
Figure Two: The TortoiseSVN Repository Browser
TortoiseSVN is actually just the client to the Subversion system. Subversion provides the data management core of the system (and a command-line interface) and TortoiseSVN puts a nice user-friendly interface on top of that. But, when you install TortoiseSVN it also installs a local copy of Subversion core so you can create local repositories. You can also use TortoiseSVN to access repositories stored on remote servers.
TortoiseSVN is especially nice if you are using Microsoft Windows. It integrates with Windows Explorer so you manage your files by right clicking on a folder or file and selecting the appropriate option from the popup menu (the Macintosh version is similar).
It's useful to have local repositories for your day to day work but the value of using a VCS is multiplied when your repositories are stored on servers. Using a server based repository you can allow others to access and update files (with control over who can do what). And you don't even have to install and manage the server software yourself. Systems such as GitHub or Google Code provide browser based front-ends to Git and Subversion (respectively). These systems store your files on remote servers run by the provider so you don't have to manage the VCS system software (and make backups). Depending on the amount of space you need you might need to pay a fee but basic uses are free or low cost. Using an online VCS is convenient, especially if you need to share work with other contributors or if you are placing your work into the public domain, but I recommend that you use a local VCS for day-to-day work and use the online alternatives when sharing is the main objective.
Of course, setting this up and using it is more work on top of all the other things you probably don't have time to do. A really basic and simple alternative is to start with an online storage system such as Google Drive or DropBox (or similar alternatives). These systems store your files on a remote server making them easy to share (and providing access from multiple devices). These systems typically offer limited free storage and you buy additional space as needed (usually inexpensive).
In addition to Google drive there is Google Docs. Google Docs is Google's online office suite and you store the "docs" you create via Google Drive. One of the best features of Google Docs is that the system automatically stores a revision for every change you save. This is the wiki approach popularized by Wikipedia and using this "save everything" approach makes it possible to allow collaborators to edit any document. If an edit is wrong or unwanted you can easily go back to an earlier version of a document. This is similar to using a version control system except that you don't get to add comments to each revision. Still, used with a disciplined approach, it's far better than nothing.
Lastly, you can simply organize your work in folders on your computer. This approach can work but it requires great discipline if the system is going to hold up over time. To quote Uncle Ben, "with great power comes great responsibility."
If you value your work you must think about how you will organize and archive your data. Backups are needed for when disaster strikes but it is just the start. Protecting the value of your work for the long term is a much bigger challenge.
No comments:
Post a Comment