There’s a long-running joke that AI is a lot like teen sex: everyone talks about it, nobody knows how to do it, & everyone thinks everyone else is doing it, so everyone claims to be doing it.
Making matters worse, engineers are notorious for obsessing over their tools. Bikeshedding over tooling is a major source of variance without value, particularly with data (timestamps are a great example if you need convincing).
Although I don’t want to distract people much with conversations about tooling, there are a few state-of-the-art data technologies that I would encourage people coming out of code bootcamps to pay especially close attention to. In no particular order…
State-of-the-Art Big Data Tooling (2020)
Marquez

Marquez is a newer project, and an extremely exciting one at that.
Marquez grew out of data lineage efforts at WeWork. Data lineage is becoming more important to companies lately because of data privacy concerns: regulations like GDPR & CCPA effectively require you to know where personal data came from & where it flows.
It’s still an open market, but I expect the folks at Marquez to lead the practice in open source data lineage for a while. Make sure you take a look. They’re also seeking contributors, if you want a great project to add to your developer portfolio.
PyTextRank

Phrase extraction & text summarization are hard problems, but PyTextRank, a handy little Python library, excels at making good answers extremely easy to come by.
Within a matter of hours, PyTextRank became one of my favorite utilities for textual analysis & summarization. Props to Paco Nathan for getting this into spaCy, & props to the researchers who developed the underlying TextRank algorithm as well. Excellent work, folks!
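Here’s a minimal sketch of how that looks in practice. It assumes spaCy 3.x & a recent pytextrank release; the exact pipeline registration call has shifted between versions, so treat this as a sketch rather than gospel:

import spacy
import pytextrank  # noqa: F401  (importing registers the "textrank" pipeline factory)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp(
    "Big data tooling changes quickly. Marquez, Airflow, and DBT are all "
    "worth watching in 2020."
)

# Top-ranked phrases, highest score first
for phrase in doc._.phrases[:5]:
    print(round(phrase.rank, 3), phrase.text)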
Papermill

Papermill allows you to inject parameters into Jupyter notebooks & then execute them. This in turn allows you to schedule notebooks & capture a record of each execution, outputs & all, just like you would a normal log file.
This is powerful stuff & especially relevant to protecting data privacy.
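A minimal sketch of how that works; the notebook names & parameters below are made up for illustration:

import papermill as pm

# Execute a parameterized notebook, injecting values into its "parameters" cell,
# and write the executed copy (cell outputs, errors, and all) alongside it.
pm.execute_notebook(
    "nightly_report.ipynb",
    "nightly_report_2020-01-01.ipynb",
    parameters={"run_date": "2020-01-01", "region": "us-east-1"},
)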
Airflow / DBT, Metaflow, & TensorFlow
Airflow is like a cross between cron jobs and infrastructure as code, except the infrastructure here is a directed acyclic graph (DAG) of tasks. DBT is similar, but it leans quite a bit more heavily on SQL, which I personally feel makes it a bit more accessible to people with less data fluency.
Airflow & DBT have both made significant inroads into the data community in one form or another. Both are exceptionally good. If you work with data in 2020, make sure you know about both of these. Pick a good open source job scheduler early on instead of writing a new one from scratch or convincing yourself that you won’t need one (you will).
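To make that concrete, here’s a rough sketch of an Airflow DAG. The DAG id, tasks, & schedule are illustrative, & the exact import path for BashOperator differs a bit between Airflow 1.10.x & 2.x:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # 1.10.x: airflow.operators.bash_operator

with DAG(
    dag_id="nightly_export",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The directed acyclic graph itself: extract must finish before load runs.
    extract >> load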
Metaflow is a newer project with a bright future ahead of it. It makes it simple to deploy machine learning models to production. It handles failures intelligently & helps you keep good logs of ML runs. That’s a winning combination in any tool. Make sure you check this out.
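Here’s a minimal sketch of what a Metaflow flow looks like; the flow & step contents are made up, & a real flow would do actual training in the train step:

from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self becomes a versioned artifact of the run.
        self.data = [1, 2, 3]
        self.next(self.train)

    @step
    def train(self):
        self.model = sum(self.data)  # stand-in for real model training
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    TrainFlow()

Running the file with Metaflow’s run subcommand (python train_flow.py run) executes the steps in order & records each run’s artifacts, which is what makes resuming after failures & digging through past runs so painless.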
TensorFlow, of course, is Google’s framework for building machine learning models & pipelines. If you’re working with data in 2020 and you haven’t already used all of these tools, it’s worth loading them up on Docker and getting familiar with them.
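For completeness, here’s a tiny sketch of the kind of model definition TensorFlow’s Keras API makes routine; the layer sizes & random data are arbitrary:

import numpy as np
import tensorflow as tf

# A toy binary classifier on random data, just to show the shape of the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(256, 4)
y = (X.sum(axis=1) > 2.0).astype("float32")
model.fit(X, y, epochs=3, verbose=0)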
JSON Schema

Developers tend to love JSON, but it can be hell to work with in your data warehouse. Encourage your developers to use JSON Schema wherever possible.
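A quick sketch of what that buys you, using the jsonschema Python package & a made-up event payload:

from jsonschema import ValidationError, validate

# A hypothetical schema for an event payload your developers might emit.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "occurred_at": {"type": "string", "format": "date-time"},
        "amount_cents": {"type": "integer", "minimum": 0},
    },
    "required": ["user_id", "occurred_at"],
    "additionalProperties": False,
}

event = {"user_id": "u-123", "occurred_at": "2020-01-01T00:00:00Z", "amount_cents": 499}

try:
    validate(instance=event, schema=EVENT_SCHEMA)
    print("event conforms to the schema")
except ValidationError as err:
    print("bad event:", err.message)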
Apache Arrow & Parquet

Arrow excels at computation over datasets that fit in memory. It’s great at compression & extremely easy to use, especially with Python. It’s a stellar format to recommend to any data scientists at your organization. For larger analytical datasets that have to be stored on disk, Parquet is generally the way to go.
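A minimal sketch with pyarrow; the column names & file path are arbitrary:

import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table, then persist it as Parquet for on-disk analytics.
table = pa.table({
    "user_id": ["u-1", "u-2", "u-3"],
    "amount_cents": [499, 1299, 0],
})

pq.write_table(table, "payments.parquet", compression="snappy")
roundtrip = pq.read_table("payments.parquet")
print(roundtrip.num_rows, roundtrip.schema)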
If you’re starting a data project in 2020, make sure you at least consider these state-of-the-art data tools.
Are there any that you’d like to add or take away from this list? Get in touch and let me know your thoughts!