Joe Ganley

This is the story of the biggest mistake I've made in my professional career. (I say I; there were other people involved, but it was as much my fault as anyone's.) It was early in my career, and I was tasked with building a new algorithmic optimization system from the ground up. This meant we needed a database; this is a database of geometric objects and connectivity, not the kind people normally think of when they hear the word database. This is exactly what OpenAccess, which I helped develop some years later and which had it existed back then would've allowed me to avoid this whole fiasco, is.

When I came on board, we needed a prototype done, like, yesterday. So I slapped together the quickest, dirtiest database that could possibly work, with the intention that eventually it would be replaced by something more production-worthy. That may have been a mistake too, but it's not the big one that the title refers to.

Fast-forward about a year. The prototype is done, and is producing fantastic results. We've written a real database, and it is time to move the system onto it. Here's the big mistake: I did that as open-heart surgery, completely ripping the system apart and replacing the old database calls with new ones. The data model had changed substantially, so this was a major effort, not just a syntactic change, and calls to the database were woven through the entire codebase; to make another surgical analogy, imagine trying to replace all of a person's nerves. There was a period of a couple of months in which the system did not work at all. Eventually we got it running, and then there were several weeks of debugging - just plain old bug fixes, many of them fixing bugs that had probably already been fixed in the old version but the fixes were lost with the old database calls.

Meanwhile, we weren't able to just freeze the old system in carbonite; we needed to continue improving it, competing in benchmarks, and the like. So it continued to evolve forward from the code base that had been used to begin the conversion. Had the new version ever worked properly, we would have had to make all of the same improvements to it when we were done that we had made to the original system during those months.

Because here is the worst part: The system running on top of this database was mostly Monte Carlo optimization algorithms. Such machinery is highly dependent, in unpredictable and hard-to-debug ways, on such harmless-seeming transformations as changing the units in which a size is expressed, or changing the order in which a graph vertex's edges are enumerated. There were many such differences between the old and new databases, and the new system never did produce results as good as the old one.

After it was all over, it was clear to me that this way of making this conversion was totally wrong-headed. The right way would have been to first write a bunch of regression tests. Then write a facade over the new database that had the old database's API. Move the system onto it (which is nothing more than recompiling against the new library). Then slowly migrate the code, and the facade API, to look more like the new database's API. Run the regression tests frequently, so that if you make a change that breaks things, you know what change is to blame. Eventually the facade API looks just like the new database's API, and at this point the facade is vestigial and can be removed.

This approach has two key features: There is just one version of the system, and it is always working throughout the process. It probably takes substantially more time than the open-heart surgery approach would if everything went smoothly, but how often does that happen?

So imagine how I felt listening to this week's Stack Overflow podcast, in which Jeff talks about facing the same problem with Stack Overflow. Evidently the schemas of his core tables turn out to be really wrong, and force a lot of baroque and complicated over-joining and such in the site's code. Joel suggested something almost identical to what I decided so long ago I should have done: Create a view that looks like what he wants the new table to look like but has the original tables underneath. Then, migrate the code a piece at a time from the underlying tables to the new view. Again, there remains just one, working system the whole time. To my horror, Jeff disagreed quite vehemently and said he planned to go the open-heart surgery route. He went on a bit about the romance and adventure of that sort of swashbuckling. Surprisingly, Joel acquiesced a little and said that might be the right approach for Jeff. I seriously doubt it, and I was disappointed that Joel didn't let Jeff have it with both barrels; after all, this is just a smaller-scale version of the exact same mistake Joel wrote about in Things You Should Never Do, Part I, for pretty much the same reasons. (By the way, that hadn't been written yet when I had my little misadventure.)

Just as Joel says quite unequivocally that you should never do a full rewrite of an application, I'll say just as unequivocally that you shouldn't perform this kind of massive surgery on a working application unless it is simply impossible to do it incrementally. Indeed, in the decade since then I've formed a habit of never having my software be broken for more than a few hours at a time.

Labels: programming, rewriting, software_development

Comments (0)

Joe Ganley	I make software and sometimes other things.

Due to Blogger's termination of support for FTP, this blog is no longer active. It is possible that some links from here, particularly those within the site, are now broken. If you encounter one of those, your best bet is to go to the new front page and hunt for it from there. Most, but not all, of the blog's posts are on this page; the archives are here.
Big mistake
This is the story of the biggest mistake I've made in my professional career. (I say I; there were other people involved, but it was as much my fault as anyone's.) It was early in my career, and I was tasked with building a new algorithmic optimization system from the ground up. This meant we needed a database; this is a database of geometric objects and connectivity, not the kind people normally think of when they hear the word database. This is exactly what OpenAccess, which I helped develop some years later and which had it existed back then would've allowed me to avoid this whole fiasco, is. When I came on board, we needed a prototype done, like, yesterday. So I slapped together the quickest, dirtiest database that could possibly work, with the intention that eventually it would be replaced by something more production-worthy. That may have been a mistake too, but it's not the big one that the title refers to. Fast-forward about a year. The prototype is done, and is producing fantastic results. We've written a real database, and it is time to move the system onto it. Here's the big mistake: I did that as open-heart surgery, completely ripping the system apart and replacing the old database calls with new ones. The data model had changed substantially, so this was a major effort, not just a syntactic change, and calls to the database were woven through the entire codebase; to make another surgical analogy, imagine trying to replace all of a person's nerves. There was a period of a couple of months in which the system did not work at all. Eventually we got it running, and then there were several weeks of debugging - just plain old bug fixes, many of them fixing bugs that had probably already been fixed in the old version but the fixes were lost with the old database calls. Meanwhile, we weren't able to just freeze the old system in carbonite; we needed to continue improving it, competing in benchmarks, and the like. So it continued to evolve forward from the code base that had been used to begin the conversion. Had the new version ever worked properly, we would have had to make all of the same improvements to it when we were done that we had made to the original system during those months. Because here is the worst part: The system running on top of this database was mostly Monte Carlo optimization algorithms. Such machinery is highly dependent, in unpredictable and hard-to-debug ways, on such harmless-seeming transformations as changing the units in which a size is expressed, or changing the order in which a graph vertex's edges are enumerated. There were many such differences between the old and new databases, and the new system never did produce results as good as the old one. After it was all over, it was clear to me that this way of making this conversion was totally wrong-headed. The right way would have been to first write a bunch of regression tests. Then write a facade over the new database that had the old database's API. Move the system onto it (which is nothing more than recompiling against the new library). Then slowly migrate the code, and the facade API, to look more like the new database's API. Run the regression tests frequently, so that if you make a change that breaks things, you know what change is to blame. Eventually the facade API looks just like the new database's API, and at this point the facade is vestigial and can be removed. This approach has two key features: There is just one version of the system, and it is always working throughout the process. It probably takes substantially more time than the open-heart surgery approach would if everything went smoothly, but how often does that happen? So imagine how I felt listening to this week's Stack Overflow podcast, in which Jeff talks about facing the same problem with Stack Overflow. Evidently the schemas of his core tables turn out to be really wrong, and force a lot of baroque and complicated over-joining and such in the site's code. Joel suggested something almost identical to what I decided so long ago I should have done: Create a view that looks like what he wants the new table to look like but has the original tables underneath. Then, migrate the code a piece at a time from the underlying tables to the new view. Again, there remains just one, working system the whole time. To my horror, Jeff disagreed quite vehemently and said he planned to go the open-heart surgery route. He went on a bit about the romance and adventure of that sort of swashbuckling. Surprisingly, Joel acquiesced a little and said that might be the right approach for Jeff. I seriously doubt it, and I was disappointed that Joel didn't let Jeff have it with both barrels; after all, this is just a smaller-scale version of the exact same mistake Joel wrote about in Things You Should Never Do, Part I, for pretty much the same reasons. (By the way, that hadn't been written yet when I had my little misadventure.) Just as Joel says quite unequivocally that you should never do a full rewrite of an application, I'll say just as unequivocally that you shouldn't perform this kind of massive surgery on a working application unless it is simply impossible to do it incrementally. Indeed, in the decade since then I've formed a habit of never having my software be broken for more than a few hours at a time. Labels: programming, rewriting, software_development Comments (0)