All Blogs Go to Heaven

veni, vidi, vefactori

Published: March 1, 2020, 2:40 p.m.


Over the past few months I have been working on modernizing a data pipeline at the AgeLab. The pipeline cleans and processes data about human behavior while driving, especially human interactions with automation features. In the research, ML is present both as the technology being tested and as a technology used to assist in collecting data... but I'm going to talk about the boring stuff for now.


A few dozen terabytes of data are collected each year, and all of it needs to be processed and stored. The current process does everything it needs to do. It was written to get the project up and running, and it did that very well, but the code coalesced from a number of scripts that were meant to be run manually. Everything works, but there is a lot of duplicated code, and almost-duplicated code. In the rush to go from 0 to 60, unused code lingers in the shadows, and a bloated "common" module does way too much of the heavy lifting.
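
To make that concrete, here is a minimal sketch of the kind of near-duplication I mean. It's hypothetical Python, not the actual pipeline code, and the file layout, the "sensor" column, and the process_trip_files name are all made up for illustration: two scripts that differed only in which sensor they pulled and where the output landed collapse into one parameterized function.

    from pathlib import Path
    import csv

    def process_trip_files(input_dir: Path, output_path: Path, sensor: str) -> None:
        """Gather rows for one sensor across all trip files into a single CSV."""
        rows = []
        for trip_file in sorted(input_dir.glob("*.csv")):
            with trip_file.open(newline="") as f:
                for row in csv.DictReader(f):
                    if row.get("sensor") == sensor:
                        rows.append(row)
        if not rows:
            return  # nothing recorded for this sensor
        with output_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    # What used to be two copy-pasted scripts (hypothetical names) becomes two calls:
    # process_trip_files(Path("raw/2019"), Path("out/gps.csv"), sensor="gps")
    # process_trip_files(Path("raw/2019"), Path("out/canbus.csv"), sensor="canbus")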


For an early-career engineer, this is probably one of the most exciting positions to be in: I'm in charge of optimizing code and streamlining processes around a code base that A) works and B) was rushed to completion.


B) might sound like a bad thing, but it means that there is a lot of low-hanging fruit. There are tons of places in the code where it's obvious that the original engineers had more valuable things to do than shave a few minutes off of processing that takes hours to complete. They were worried about getting data and processing it so that it could be used by researchers. That's a bit more stressful.


Since I have a "working" product, I get a chance to think about code structure and how to simplify what we have. It gives me an opportunity to think about what makes quality code instead of just what will work at this moment. Of course... on the negative side, there is a ton of mess to cut through. Obviously, if I were writing from scratch, that would NEVER happen...