I’ve been re-reading Inspired by Marty Cagan and recently came across this quote: “To set your expectations, strong teams normally test many product ideas each week, on the order of 10 to 20 or more per week.”
To be honest, I was pretty shocked. Not only have I never seen a team that came even close to that number, but just considering the steps in a typical data science project, I see so much work that 10-20 tests per week seems absolutely unrealistic. Still, it got me thinking about why everything takes so long and what could be done to make the data scientist’s life much better.
The quote above doesn’t specify exactly what a test is. For data science work, testing usually means an A/B test, and rigorously performed A/B tests take time. One week is often the minimum, just to account for weekly patterns in user behavior.
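To see why one week is often a floor, consider a rough sample-size calculation. This is a sketch only, with hypothetical numbers (a 5% baseline conversion rate, a hoped-for lift to 5.5%, 10,000 eligible users per day), using the standard normal-approximation formula for comparing two proportions:

```python
import math

# Hedged sketch: normal-approximation sample size for comparing two
# conversion rates. z_alpha and z_beta are the usual constants for a
# two-sided 5% significance level and 80% power.
def required_sample_size(p1: float, p2: float,
                         z_alpha: float = 1.96,
                         z_beta: float = 0.8416) -> int:
    """Approximate users needed per variant to detect a shift p1 -> p2."""
    effect = abs(p2 - p1)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical numbers: 5% baseline conversion, detecting a lift to
# 5.5%, with 10,000 eligible users per day split across both arms.
n_per_variant = required_sample_size(0.05, 0.055)
days_needed = math.ceil(2 * n_per_variant / 10_000)
```

With these assumed numbers you need roughly 31,000 users per variant, so the test alone runs for about a week, before weekly seasonality is even considered.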
Data might be unavailable or buggy. A frontend team may need to work on tracking user actions to fix this, and it might take weeks before they have time for it. You can try to escalate, but that also takes time.
Getting the data into the right format takes time. The transformation may be tricky, or there may simply be a lot of data. You might be missing labelled data and need to label it yourself, or search for sources where you can get or buy what you’re missing. This usually means integrating more data sources, which again takes time.
We haven’t even gotten to training a model, getting it to work, and iterating on the right features. After that, you need to “productionize” your model and make sure pipelines run robustly and at the scale required. Often this means re-implementing your model in a different programming language altogether.
So, how can we get faster? There are many aspects to this: how to collect and provide data, how to set up teams, how to run data science projects.
I believe one reason is that the tooling landscape for ML is very complex. Not only are you dealing with many different tools, but these also change over the lifecycle of the project; you might use several tools for the same task over the course of a single project. Just on the data preparation side, you might move from manual SQL dumps and pandas to Airflow-orchestrated pipelines running a mixture of SQL and Apache Spark jobs. Your training code may move from a Jupyter notebook running on your laptop to several independent Docker containers deployed on a Kubernetes cluster.
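As an illustration of that first, manual stage, the workflow often starts as little more than “run a query, dump it to CSV, explore.” A minimal sketch using only the standard library (the database path, query, and output file are all hypothetical):

```python
import csv
import sqlite3

# Hedged sketch of the "manual SQL dump" stage. Later project stages
# would replace this with an orchestrated, scheduled pipeline.
def dump_query_to_csv(db_path: str, query: str, out_path: str) -> int:
    """Run a query against a SQLite database and write the result set
    to a CSV file. Returns the number of data rows written."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(query)
        columns = [desc[0] for desc in cur.description]
        rows = cur.fetchall()
    finally:
        conn.close()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)  # header row
        writer.writerows(rows)
    return len(rows)
```

Nothing here is wrong at the exploration stage; the trouble starts when scripts like this quietly become the production pipeline.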
Re-implementing something to do “essentially the same, just better” takes time, and depending on the maturity of the team and whether or not they have software engineering support, it can be substantial. I wouldn’t be surprised if a team built the first model in a week but then needed 1.5 months to productionize their work.
The other, bigger challenge may be that the wrong tool is chosen for the stage the project is in. For example, a team may try to put a notebook into production and then struggle with robustness, monitoring, and so on. Or they may move to production technology too quickly, while they still need to try out many things. Already having a full Airflow pipeline that takes a couple of hours to run makes it much harder to iterate quickly, because every change means you redeploy, run, and wait for results.
To be honest, I don’t think there is a quick fix here. We currently lack tools that span the whole lifecycle of a data science project. There are some attempts to make parts of the work easier, currently mostly focussed on the later stages around “MLOps.”
The data science toolbox consists mostly of individual tools, and no one is actively working to integrate them into a consistent whole. Some data formats and tools have emerged as de-facto standards, like pandas for storing moderate amounts of data, or the scikit-learn interface for learning algorithms, but if you’re out of luck, you find yourself transforming data just to make different models work.
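The scikit-learn interface is essentially a convention: estimators expose fit and predict, and fit returns self. A toy sketch in plain Python, with no scikit-learn dependency (the class itself is made up for illustration):

```python
from collections import Counter

# Toy illustration of the scikit-learn estimator convention:
# fit(X, y) returns self, predict(X) returns one prediction per row.
# This made-up model just predicts the most frequent training label.
class MajorityClassifier:
    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self enables chaining, as in sklearn

    def predict(self, X):
        return [self.majority_ for _ in X]
```

Because it follows the convention, an estimator like this can be dropped into any code written against the fit/predict contract, which is exactly the kind of informal standard that saves you from gluing incompatible tools together.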
Unfortunately, this problem is also hard to solve for individual teams, as it would require significant investment in building quite advanced tooling, something companies tend to frown upon. What happens instead is that you, as a data scientist, have to master a broad array of tools and manually manage how they fit together.
So the best you can currently do is to:
- Understand the different stages of a data science project, from exploratory to first prototype to A/B test to production.
- Make sure you have a good toolset for each stage of the process, and know what to use when. Strive for solid mastery of each stage.
- Be very aware of which transitions you have to make, and develop consistent approaches to them (for example, moving from a notebook to a Python project for training: how do you do it, and what is a good layout for the Python project?).
- If you can, develop tooling (scripts/templates/libraries) to automate repetitive work.
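As one concrete example of such tooling, a small script can scaffold the same project layout every time a notebook graduates into a Python project. This is a sketch; the layout below is one possible convention, not a standard, and all names are hypothetical:

```python
from pathlib import Path

# A hypothetical convention for graduating a notebook into a project.
# Entries ending in "/" are directories; everything else is a file.
PROJECT_LAYOUT = [
    "data/",                   # local data dumps (git-ignored)
    "notebooks/",              # exploration stays in notebooks
    "src/features.py",         # feature code extracted from the notebook
    "src/train.py",            # training entry point
    "tests/test_features.py",  # extracted code becomes testable
    "README.md",
]

def scaffold(root: str) -> list:
    """Create the skeleton under root and return the created paths."""
    created = []
    for entry in PROJECT_LAYOUT:
        path = Path(root) / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        created.append(str(path))
    return created
```

The point is not this particular layout, but that the transition happens the same way every time, so nobody re-decides it per project.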
There is a whole post waiting to be written about how difficult it is to develop good tooling within a company, but that is for another time 🙂
What tools do you use for which stage in the project?