
Data engineering "best practices" aren’t enough anymore



I have been in data (in the fuzziest sense of the word) since about 2009, whether that means data engineering, management, analysis, strategy, or visualization. Over that time, which is not, in the grand scheme of technology, all that long, things have changed DRASTICALLY. In my first "real" data position, I was asked to identify and organize fallout from a claims auto-adjudication engine and find ways to avoid manual processing… in MS Access. This was objectively the wrong tool for this type of analysis, but I did what I was told, and it took forever to do something that could have been solved much more simply in basically any other technology. Our "data engineering" process was to manually pull down an Excel sheet that had been manually created and saved to a SharePoint site each day. If you missed work one day, that day's data would be overwritten and lost into the ether. It was… not ideal. If this is how you are currently handling data engineering, I apologize, and please reach out to me, because I can help you do better.


As my career progressed, so did the sophistication I saw at companies and in their data engineering. Things moved from manual processes, to on-prem databases with manual processes, to automated processes kicked off by humans, to fully automated processes with no human interaction necessary. We have come a looong way, but what makes a data engineering process really stand out? Five years ago, I probably would have said some amount of automation with the option for manual refreshes, the ability to process a full data load in a reasonable amount of time, a process for what to do if something goes wrong, and some sort of alerting for abnormalities in the data (at one company, a quirky engineer did this by sending emails with emojis based on the data load; this was not especially helpful).


Now, however, effective data engineering requires much more. You need to leverage CI/CD processes, have tracking that identifies who is working on what and where it lives, and check in your code every day (maybe even more than once). In addition, work in lower environments and make sure that there is some sort of code review process that doesn't just involve the person who wrote the code. And finally, use some sort of data pipeline orchestration that can run processes without human intervention, on a cadence that appeases all data users, and that can detect, through statistical analysis and validation testing, when something is unexpected AND notify multiple people who know what to do when something is wrong.
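
To make that last piece concrete, here is a minimal sketch (in plain Python, with no particular orchestrator assumed) of what an unattended run with a simple statistical check and multi-person alerting could look like. The helper names, recipients, and the z-score threshold are hypothetical placeholders, not a prescription for any specific tool.

```python
from statistics import mean, stdev

ON_CALL = ["data-eng-lead@example.com", "analytics-owner@example.com"]  # hypothetical recipients
Z_THRESHOLD = 3.0  # flag anything more than ~3 standard deviations from the recent baseline


def load_daily_batch() -> int:
    """Stand-in for the real extract/load step; returns the number of rows landed."""
    return 0  # placeholder


def send_alert(recipients: list[str], message: str) -> None:
    """Stand-in for your email/chat/pager integration; notifies people who can act."""
    print(f"ALERT to {', '.join(recipients)}: {message}")


def row_count_looks_normal(today: int, recent_counts: list[int]) -> bool:
    """Simple statistical validation: compare today's volume to recent history."""
    if len(recent_counts) < 7:
        return True  # not enough history yet to judge what "normal" is
    mu, sigma = mean(recent_counts), stdev(recent_counts)
    if sigma == 0:
        return today == mu
    return abs(today - mu) / sigma <= Z_THRESHOLD


def run_nightly_pipeline(recent_counts: list[int]) -> None:
    """One unattended run: load, validate, and escalate to humans only on surprise."""
    try:
        today = load_daily_batch()
    except Exception as exc:
        send_alert(ON_CALL, f"Nightly load failed: {exc}")
        raise
    if not row_count_looks_normal(today, recent_counts):
        send_alert(ON_CALL, f"Row count {today} is outside the expected range.")


if __name__ == "__main__":
    # Example: validate today's run against the last week of row counts.
    run_nightly_pipeline([10_250, 10_400, 10_310, 10_180, 10_500, 10_290, 10_350])
```

In practice you would swap the stand-in helpers for your real load job and your paging or chat integration, and let your orchestrator own the schedule and retries; the point is that validation and notification are part of the pipeline, not an afterthought.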


That's not all, though: you also need documentation. I suggest that every company maintain a Best Practices Doctrine, a document that outlines how the work should be done (what technology, tools, and practices) and what common mistakes or "gotchas" occur in the data and why they matter. One thing that has never changed, and is the most often overlooked, is the need for clear and concise technical documentation of the data and the process itself, updated on a regular basis. I know that this is the very last thing that anyone in a "data" position wants to do, but that is why we invented business analysts. All joking aside, the best data engineering teams do not operate in a vacuum; they partner with their stakeholders and take them along for the data journey, ensuring at multiple points that the data being engineered is the data people want, and that everyone understands where the data comes from and what analyses are appropriate for it.
