A few weeks ago, my advisor asked me to give a talk to our research group about data management. There are a fair number of undergraduates new to research who don’t yet have a coherent data management strategy, so giving them my perspective would hopefully prompt them to improve their digital filing skills.
I didn’t like the idea of just talking about this; I needed some sort of visual aid. A flowchart seemed like a good idea. What I came up with looked like this:
A lot of my actual file structure (bottom box) is a result of my time as a website designer/developer (which I still do occasionally). The data management strategy was taught to me by James Negley, who was a firm believer in uniform filing under headings like “ADMIN,” “BUSINESS,” and so forth.
Don’t Forget Your Lab Notebook!
Of course, research files are more fragmented than web design files. Some projects require large amounts of imaging, while others have none. Some projects result in almost no digital data aside from the occasional spreadsheet, while others are difficult to put into a notebook. To reconcile this, I follow a simple rule:
Every result makes it into my lab notebook, digital or not.
This can be a big pain. For something like systematic tensile testing, which I have done my fair share of, it means including all of the individual stress/strain plots in the lab notebook, as well as aggregated numerical data, statistical analyses, and resultant plots. It’s a lot of work, but the result is satisfying.
Digital data management remains a challenge. The file structure is the key to my organization: no folder should present you with more than a handful of files or subfolders at any given time. This has saved me many headaches, though it does take a fair amount of clicking to drill down through my file structure. At the end of the day, though, it is worth it. The exception to this is my literature directory, which can have dozens of papers on roughly the same topic. It doesn’t seem useful to further segregate literature files.
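If you want to check an existing file structure against the “handful of files” rule, a short script can list the folders that have grown too crowded. This is only a rough sketch: the function name, the limit of seven entries, and the exempt folder name “literature” are all assumptions made for illustration, so adjust them to match your own setup.

```python
import os

def crowded_directories(root, limit=7, exempt=("literature",)):
    """Return sorted (path, entry_count) pairs for folders under `root`
    that hold more than `limit` files and subfolders combined.

    `limit` and `exempt` are illustrative defaults, not fixed rules.
    """
    crowded = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Count everything a person would see when opening this folder.
        count = len(dirnames) + len(filenames)
        if count > limit:
            crowded.append((dirpath, count))
        # Do not descend into exempt folders (compared case-insensitively).
        dirnames[:] = [d for d in dirnames if d.lower() not in exempt]
    return sorted(crowded)
```

Calling `crowded_directories` on the top-level research folder gives a to-do list of places that probably need another layer of subfolders.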
Being a Vigilant Curator
It’s one thing to manage your own files, but adding a collaborative tool like Dropbox or Google Drive means either (1) having redundant files, (2) having fragmented files, or (3) paying for a huge amount of cloud storage so you can have everything on the web. My many gigs of image files preclude option (3), and redundant files (1) drive me bonkers because I wind up with multiple versions to reconcile. Because of this, I put “high priority” files (e.g. manuscripts, important/recent results) on Dropbox to share with my advisor, collaborators, and undergraduate researchers. Once they are no longer “high priority” (e.g. old spreadsheets, accepted publications), I move them to my local hard drive.
Now that they’re off an actively backed-up cloud platform, they’re my problem. This means that backups are a must. Some folks prefer to use an active backup (I have toyed with active backup solutions before, and they’re OK, but not for me). Others prefer a cloud option. These are great, but I’m old school: on the first day of every month, and before vacations, I back up all of my files to a reliable (I like LaCie) portable hard drive. I’m not really afraid of a fire claiming my whole office, so I feel safe keeping the backup in my office cabinet. Really, I’m much more worried about a simple hard drive crash.
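For anyone who would rather script the monthly copy than drag folders by hand, here is a minimal sketch of one way to do it. It is not my actual routine: the function names, the `backup-YYYY-MM-DD` folder naming, and any paths you pass in are assumptions for illustration only.

```python
import shutil
from datetime import date
from pathlib import Path

def backup(source, drive, today=None):
    """Copy `source` into a dated folder on `drive` and return that folder's path.

    The folder name format (backup-YYYY-MM-DD) is an illustrative choice.
    """
    today = today or date.today()
    target = Path(drive) / f"backup-{today.isoformat()}"
    # copytree raises FileExistsError if the target already exists,
    # so a backup made earlier the same day is never overwritten.
    shutil.copytree(source, target)
    return target

def count_files(folder):
    """Count regular files under `folder`, for a quick sanity check after copying."""
    return sum(1 for p in Path(folder).rglob("*") if p.is_file())
```

After a run, comparing `count_files(source)` with `count_files(target)` is a cheap check that nothing was skipped. Each backup lands in its own dated folder, so older snapshots stay around until you delete them yourself.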
Note: if you happen to be a student in an earthquake- or wildfire-prone area and wish to curate your files similarly, you may want to look closely at cloud solutions or geographic backups (e.g. mailing a physical HDD to your out-of-state parents).
Edit, 1-14-13: There are some other websites that do a very good job of laying out “higher-level” flow charts that demonstrate how a university/department/lab-wide data repository should be set up. This wasn’t really the point of my post, but there is some value in looking at the “big picture.” This is a good example of what I’m talking about.