Raw data is always a mess. It has to be prepared before it’s useful. You have to fill in missing values, drop duplicates, handle outliers, and generally just make it easier to explore, understand, and model the data.
If data cleansing sounds boring, it’s because it is.
More importantly, though, variance in how your data are tracked is often a major source of extraneous cognitive load for data teams. This load takes the form of your org’s particular quirks: the things it does with data that nobody else does. As your teams grow, keeping your data organized becomes a genuinely hard problem.
Variance Without Value
As a leader, one of the most important levers you have to defend your data team’s productivity is to reduce variance without value in your data.
Timestamps are a great example of variance without value.
Everyone does timestamps differently, even within the same organization. Hours get spent writing & rewriting date-parsing utilities, and the rewrites often introduce bugs.
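A minimal sketch of what a single shared parsing utility might look like: one function that accepts the handful of formats your org has historically produced and always returns a timezone-aware UTC datetime. The specific formats listed here are assumptions for illustration, not a recommendation.

```python
from datetime import datetime, timezone

# Formats hypothetically seen across an org: ISO 8601 with offset,
# US-style date-times, and bare dates. Epoch seconds are the fallback.
KNOWN_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%m/%d/%Y %H:%M", "%Y-%m-%d"]

def parse_timestamp(raw: str) -> datetime:
    """Parse a timestamp string into a timezone-aware UTC datetime."""
    for fmt in KNOWN_FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
            # Naive datetimes are treated as UTC by convention.
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            return dt.astimezone(timezone.utc)
        except ValueError:
            continue
    # Anything else is assumed to be epoch seconds.
    return datetime.fromtimestamp(float(raw), tz=timezone.utc)
```

The point isn’t the particular formats; it’s that the list lives in exactly one place, so the next “weird date” becomes a one-line change instead of a fourth hand-rolled parser.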
Null values & how to handle them are another great example of variance without value. Whether or not nulls and null pointers are a good idea, the fact that every language handles them differently tends to create messes. It is often preferable to be consistent (even if you’re consistently wrong).
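Consistency here can be as simple as mapping every null-ish sentinel to a single representation at ingestion time. A sketch, with an assumed (and illustrative) set of sentinel strings:

```python
# Sentinel strings that hypothetically show up as "missing" across sources.
NULL_SENTINELS = {"", "null", "NULL", "n/a", "N/A", "None", "-"}

def normalize_nulls(record: dict) -> dict:
    """Map every null-ish string sentinel to a single representation: None."""
    return {
        key: (None if isinstance(value, str) and value.strip() in NULL_SENTINELS
              else value)
        for key, value in record.items()
    }
```

For example, `normalize_nulls({"name": "Ada", "email": "N/A"})` yields `{"name": "Ada", "email": None}`, so downstream code only ever has to check for one kind of missingness.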
Managing Variance Without Value
What’s important to note about the kind of data variance we’re talking about here is that it holds no value for you at all. You can’t learn anything from it. The variance is purely a cost.
And it’s a cost you’ll want to manage carefully as your data & engineering teams grow, to help your data teams avoid falling behind.
- Socialize gold standards across the org. Your gold standard should be a set of representative samples that people working with data can mimic where possible. Tie expectations of accuracy, service ownership, etc. to medallions (gold, silver, bronze, none), where different medallions carry different expectations with respect to SLAs/SLOs & data quality.
- Document naming conventions but keep them lightweight. Process must pay for itself. Your naming conventions should generally fit on the back of a napkin. Make sure your conventions help you identify personal data / information under the GDPR and CCPA. Also make sure your engineers use consistent logging formats and respect the naming conventions.
- Leverage SQL where possible. SQL is the lingua franca of data work. Use it whenever you can to simplify hiring & collaboration.
- Invest in your onboarding process for data teams. Hiring data talent is hard and it can easily slow you down. Your documentation will always be out of date, but your onboarding docs shouldn’t be, or you’re just flushing money down the drain bringing data talent up to speed.
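Lightweight conventions have the added benefit of being enforceable by trivial tooling. As one sketch, suppose your napkin says “snake_case column names, and timestamps end in `_at`” (these rules are assumptions for illustration, not a standard); a checker fits in a dozen lines:

```python
import re

# Illustrative convention: lowercase snake_case identifiers.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def check_column(name: str) -> list[str]:
    """Return a list of convention violations for a column name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if name.endswith("_time") or name.endswith("_date"):
        problems.append("use the _at suffix for timestamps")
    return problems
```

Wire something like this into code review or CI and the convention enforces itself, instead of living in one senior engineer’s head.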
You will never completely get rid of the time it takes to prepare data sets for analysis. But by managing variance without value, you’ll protect your data team’s productivity & help them avoid falling behind.
This will be the first in a four-part series on Data Productivity. Next week, we’ll be back to discuss the right way to onboard data teams. Then we’ll talk a bit about game theory & evolutionary software design. And we’ll cap off the discussion on data productivity with a discussion of data technologies I like & why I like them. Stay tuned!