    Created: mid-1997?    
    Updated: Aug 2011    

Greg's Life as BibTeX Hacker

This whole bibliography thing started for me with a ferocious all-nighter on March 31, 1996. The next day, our lab was due to present a letter of intent to the MRC (Medical Research Council of Canada), stating that we were going to assault them with a massive grant application that coming September. Part of what was needed for this letter of intent was a rough bibliography of all the lab's output over the last five years (this was for a renewal---the grant was first awarded in 1991).

So I was presented---about 24 hours before this thing had to be finished---with a pile of supposedly correct BibTeX files gleaned from PERUSE (a periodical index used at McGill). Unfortunately, the PERUSE data was rife with errors, full of duplicates, and had been incorrectly converted to BibTeX format. Worse yet, the original data wasn't kept, so I was stuck with the faults inherent to PERUSE as well as various errors (both syntactic and semantic) in the .bib files. By shortly after midnight, I had managed to throw together an ad hoc BibTeX parser in Perl that just happened to work on 99% of the data we had (it helped that all the .bib files were generated by the same program!); some fuzzy code that did various heuristic comparisons to find duplicates, correct mis-classified entries (e.g., change @article entries to @inproceedings when they're really articles from conference proceedings), and attempt to collapse variant spellings of author names; and a curses-based interface to the whole thing that let us step through the data, one entry at a time, verifying that the fuzzy checks were doing the right thing. Amazingly enough, we had a reasonably decent-looking (although quite incomplete and still full of errors---but hey, it looked good) bibliography by about 6 am. I had just enough energy left to make a second version of the bibliography with all article titles `chefferized' (bork bork bork!) before heading home for some blissful slumber. (Hey, it was April Fool's Day!)

I spent most of the summer of 1996 working on the `real thing'---the final lab bibliography that was to be included in the grant application, due on the 15th of September. At the time, I scoured CTAN (the Comprehensive TeX Archive Network) for one-stop-shopping solutions, but the only thing that came up was BibTool. We were able to go fairly far with BibTool, but not quite far enough---so I ended up harnessing BibTool's parser/prettyprinter to print the BibTeX entries in an easy-to-parse format, glomming a Perl module onto it, and writing a bunch of medium-sized, but very flexible, Perl programs in lieu of tiny, but not-quite-flexible-enough, BibTool resource files. Eventually, I wrote my own lexer and parser (using PCCTS) to fill BibTool's role, mainly out of curiosity and the perverse enjoyment I get out of writing parsers.

Once I had the BibTeX data in Perl---whether read through a pipe to BibTool, or to the initial version of my parser, or (as it's done now) through direct C-to-Perl "glue code"---I could do pretty much as I pleased with it: suck in a list of entries and sort them on ridiculously complex keys, process them one at a time and correct misspelled author and journal names, detect duplicates by a complicated set of heuristics, and so forth. All this was pretty slow---a "do-nothing" filter in Perl, with an interface to my C parse/simplify program, was several times slower than BibTool (pure C) doing the same thing. But hey, sometimes programmer efficiency is the most important thing---and writing those little hacks went very quickly.
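
As an illustration only, a duplicate-detecting filter of the kind described above might look something like the following Python sketch. The entry layout (plain dictionaries with a `key` field), the function names, and the matching rule (same first author, year, and normalized title) are all assumptions made for this example; they are not the original Perl code or its heuristics.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title; drop accents, braces, and punctuation."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower().replace("{", "").replace("}", ""))
    return text.strip()


def author_key(name):
    """Reduce 'Last, First M.' or 'First M. Last' to 'last f'."""
    name = name.replace("{", "").replace("}", "").strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last.lower()} {first[:1].lower()}".strip()


def duplicate_key(entry):
    """Hypothetical matching rule: first author, year, normalized title."""
    names = re.split(r"\s+and\s+", entry.get("author", ""))
    authors = [author_key(n) for n in names if n.strip()]
    first_author = authors[0] if authors else ""
    year = entry.get("year", "").strip()
    return (first_author, year, normalize_title(entry.get("title", "")))


def find_duplicates(entries):
    """Return lists of citation keys whose entries share a duplicate_key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[duplicate_key(entry)].append(entry["key"])
    return [keys for keys in groups.values() if len(keys) > 1]


# Made-up sample data, purely for demonstration.
entries = [
    {"key": "smith96a", "author": "Smith, John A. and Doe, Jane",
     "title": "{MRI} Segmentation of the Brain", "year": "1996"},
    {"key": "smith96b", "author": "J. Smith and J. Doe",
     "title": "MRI segmentation of the brain.", "year": "1996"},
    {"key": "doe95", "author": "Doe, Jane",
     "title": "Something Else", "year": "1995"},
]
print(find_duplicates(entries))  # [['smith96a', 'smith96b']]
```

A real set of heuristics would need to be fuzzier than this exact-key match (tolerating typos in titles, for instance), which is where most of the complication comes from.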

Not surprisingly, during most of that summer, it occurred to me again and again that there must be a better way. At the time (alas!) I didn't know about Nelson Beebe's collection of BibTeX processing tools, or a variety of other sites that were then available on the Internet (but have since disappeared).

So, as any self-respecting young programmer would do, I implemented my own "better way": btOOL, a pair of libraries (one C and one Perl) for general, unhindered access to BibTeX files.