The right tool for the job

October 17th, 2008 by james

Following my earlier post on using numbered folders to keep track of your projects, I received a couple emails from readers wondering what software I use to implement the system. As a quick recap, the idea was that you create a single projects folder and in that directory, every new project goes into its own sequentially numbered folder. Then – and here’s the key bit – you use a file of some sort to keep track of these ids and any associated meta-data (e.g. titles, status, todos).

In this post, I want to review the basic tools for storing key indexed data and the pros and cons of each technology. Being productive is largely about using the right tools for the job and the numbered folders problem is no different. So below I’ll review what I see as the four basic options and how they can be used generally and for the numbered folders technique.

  1. Plain text
    This is the mother of all data formats. Plain text files are portable (they work on any operating system, as long as you’re careful about end-of-line characters), open (i.e. non-proprietary), and therefore you can format and hack them any way you like. For these reasons and more, the experts at 43folders have discussed the appeal of plain text for productivity tasks at length.

    To make this work with the numbered folders approach, you might go for a simple comma-separated schema like the following:

    id,title,start,status
    1,My first paper,17/10/2007,Published
    2,My second paper,17/10/2008,Concept

    You can then use almost any programming tool you can think of (including standard spreadsheet software) to manipulate this data and generate useful reports. For more sophisticated projects, plain text can also mean using XML, which uses strictly formatted text validated against a schema or document type definition. This is more than you need for numbered folders but hopefully it gives you an idea of the scope for plain-text data storage. Anything’s possible really.

  2. Spreadsheet
    Most people use spreadsheets as the default option when playing with data and in many cases, this is a good way to go. I tend to think of spreadsheets as a sketchpad: it’s easy to enter information, do some quick calculations, and maybe make a few graphs. They often have built-in scripting tools as well (e.g. VBA for Excel) but in my opinion, these are dirty hacks and shouldn’t be used for any serious analysis.

    There are at least three problems with spreadsheets. First, they store their data in proprietary formats such as Excel’s .xls. This means is you want to edit the file, you need to have access to the software that will open it. If you’re using OpenOffice, the data format is open and so as long as you know how it stores data, you can still access your information any way you like. But it doesn’t entirely avoid the second point, that you’re meant to keep your data in the host application. By adding graphs and pivot tables and other features, you basically have to manipulate your data from within these applications which may or may not be acceptable to you. And the final point is that you are often restricted in how much data you can store in one file. More than 256 columns or 65536 rows in Excel 2003? Tough. (Things have improved in Excel 2007 but size limits are still a problem.)

    For the numbered folders technique, these issues aren’t really an obstacle. Your index spreadsheet is likely to be small and you can easily export the information to a plain text format if you really need to. But bear in mind that for any serious data application, you should not use a spreadsheet. In fact, if you even think your application might get big, start with a database.

  3. Portable database
    Databases are completely different beasts from spreadsheets and while they may have a bit of a learning curve, they are ultimately the most scalable and powerful data storage applications. There are different types of database and by a “portable” database, I mean one where the data is contained in a single file that can easily moved from one computer to another. Microsoft Access is the prime example.

    There are two main reasons for using a database. One is when you have lots of data, i.e. millions of records, and a spreadsheet won’t cut it. The second is when you have structured data relationships. In these cases, you can apply the concepts of data normalization to split independent aspects of your data set. Let’s see what this means for the numbered folders approach.

    Suppose you have a data table to track your folders like the following:

    id,title,date,status
    1,My title,1/1/08,Concept

    Looks great but what if you want to add a status update? Your choices are either a) overwrite your current data:

    id,title,date,status
    1,My title,1/2/08,Drafting

    Or b) add another row:

    id,title,date,status
    1,My title,1/1/08,Concept
    1,My title,1/2/08,Drafting

    Fine but suppose you typed the title wrong in the second row? Now you can’t sort by title and get the right paper!

    I personally use a database structure like the following to keep track of my numbered folders.

    Note that each type of data is confined to a unique table. This means I can then declare the links between the tables using the unique ids and not have to worry about data duplication and errors. Then it’s simply a matter of joining the tables and pulling out the data I need. As an example, the following query gives me the titles of all papers with the status of CONCEPT, targeted at the journal Energy Policy, sorted by descending date.

    SELECT p.title FROM papers p
    INNER JOIN journals j ON p.jid = j.id
    INNER JOIN status s ON p.sid = s.id
    WHERE s.name='CONCEPT' AND j.name='ENERGY POLICY'
    ORDER BY p.duedate DESC;

    Getting the hang of writing queries and so on does take time. But once you’re used to thinking in the database way, there’s no going back really – it’s just too powerful.

  4. Database server
    Last but certainly not least, the most powerful option is the database server. Here your data is stored on a computer and can be accessed from wherever that server is accessible, whether that’s keeping it only on your local computer, over an intranet or the entire internet. MySQL, for example, is a powerful open-source database platform which runs many of the web’s largest websites, containing many gigabytes (even terabytes of data). Portable databases, like Access, struggle at these sizes and with thousands of simultaneous requests for data.

    The two big advantages of database servers, besides the size thing, is that they are good at handling multiple users with complex permission schemes and they can support a wide range of software applications. So if you want your colleagues to read what projects you’re working on but not change any of the underlying data, you can easily control this. And this information can be presented via a php script generating an online report of your papers, analysed in detail using R or manipulated with a Java interface. Pretty much whatever you can think of, a database server will support it.

    A database server is overkill for the numbered folders approach, unless you need your information to be accessible from different locations. It also has a larger learning curve, as database servers typically lack the graphic user interface tools found for the other platforms discussed here. But with a database server, you can get your data from anywhere, access it using almost any modern programming language, and you’ll rarely have capacity problems.

Conclusion

I’ve covered a lot of material here but hopefully the message is clear: use the right tool for the job. If you’re generating gigabytes of highly structured data and sharing it with colleagues all over the world, then a database server like MySQL is the only practical option. However if you are simply trying to manage a few local projects, a tool like Access with its accessible interface and reports can be very helpful. A spreadsheet works fine too as long as there isn’t too much data and complicated relationships between the data. The table below summarizes the main features of each technique.

Advantages Disadvantages
Plain text
  • cross platform
  • non-proprietary
  • fully customizable
  • cumbersome for large data sets
  • need other tools for analysis
Spreadsheet
  • intuitive interface
  • quick calculations and graphs
  • widely available
  • proprietary format
  • often requires host application
  • Scales poorly (>1000 rows, I’d use a database)
Portable database
  • Handles relational datasets
  • Good with larger datasets (< 2 GB)
  • Single portable file
  • Data should be normalized (not necessarily a bad thing but requires thought)
  • requires host application
  • limited graphics capabilities
  • steep learning curve
Database server
  • Stored data widely accessible (e.g. over internet, from different analysis software)
  • Scalable and powerful (handles very large datasets)
  • Sophisticated access permissions
  • No or limited user interface
  • Requires a server
  • Takes time to learn effectively

So getting back to the readers’ original questions about the numbered folders technique, I would say you’re probably fine using a spreadsheet. If you like maximum control, a text file is a good choice and shouldn’t get too unwieldy unless you’re really a prodigious writer. If you want to learn a bit about databases, then managing your folders is a perfect excuse to try out a portable database like Access. Just choose what works best for you and your needs.

If you enjoyed this post, make sure you !


2 Responses to “The right tool for the job”

  1. kdNo Gravatar Says:

    You forgot sqlite – a better solution than mysql for small single user projects. there’s also a Firefox extension called SQLite Manager which is probably handy.

  2. joseNo Gravatar Says:

    If you are on windows, there’s a strange beast of an application called SQLnotes:
    http://www.sqlnotes.net/

    It’s a mixture of a notetaker, PIM, and database. looks daunting at first. I have not used it, but there are people raving about it around the DC forums:

    http://www.donationcoder.com/Forums/bb/index.php?topic=10432.msg81492

    I use oneNote myself, but I’m looking for a cross-platform, portable alternative. Unfortunately nothing comes close to it.

Leave a Reply