written by Roni Kobrosly on 2023-10-06 | tags: machine learning product management career data
Anyone who has worked in tech in the data science / MLE / applied ML field knows that most ML projects will fail. Newcomers to the field have super high expectations about what they can do, and, sadly, this leads to a lot of them leaving the space altogether (see this goodbye letter, this collection of stories, or this list of unrealistic data science job expectations). ML project failure is an inevitable part of life in this field, and it's happened to me many times at different stages of my career. In the last couple of years, I've begun to think about the building of ML projects as a sort of funnel process: ideas are cheap and many, but only a small number of them reach the end of the funnel and become productionized, solid, used, effective, and maintained ML applications. I sketched it out as this:
Below, I'm going to try to put this idea into writing. There are several "gates" on the way through the funnel, and while it's roughly similar to the lifecycle of an ML project, there isn't perfect overlap. I'll describe each gate in the order I've generally seen teams hit them and sprinkle in examples I've personally encountered. In my estimation, less than 10% of "project intentions" make it all the way through the funnel.
Ah ha! You thought the order of items 1 and 2 would be switched, right? After all, all tech projects ought to begin with a good idea. However, at companies that aren't in the upper echelon of data maturity, it generally works the other way around: they have some collected data (e.g. user clicks, web logs, etc.), the data teams draw up a list of potential project ideas that could employ that data, they prioritize those ideas based on potential impact, likelihood of success, level of effort, and so on, and then they try to build the thing. Sure, sometimes companies begin with a big transformative idea and decide to build or buy software to bring together data in service of that idea, but I find this happens less frequently. All organizations these days know that data is power; the degree to which they can harness it varies.
Accordingly, the first step is some open-ended data exploration, where teams discover whether they have a decent quantity and quality of data to work with. So really, at this stage we don't have a specific project idea so much as a whiff of one: a sense that something exciting could come out of looking at the data.
Most non-data practitioners don't have a good sense of what we need to get a project off the ground. So many times in my career I've had an excited product person from a team reach out to talk about how we could build something off of their data, only to learn that the excitement is about:
This sort of thing happens all the time.
So it turns out you found more than just 100 Reddit comments in an Excel file. Maybe your company has a large central data warehouse in Snowflake, tons of rich web logs in Splunk, or even a beefy Excel file with a carefully curated history of the thousands of technical incidents that have happened at the company. Good. Are you now able to pair that data with a clear business problem statement that all relevant stakeholders can get behind?
Note: unless you're at a small startup, it's unlikely that you'll be socializing this project idea with senior leadership at this point. So this step is really just "do my immediate leadership and the stakeholder team's leadership agree that this is an important business project and not just a science project?"
Now that you have some workable data and an idea, can you execute on this and build a halfway-decent prototype?
This is the section that is probably the least worth drilling into, since the overwhelming majority of internet content on machine learning already focuses on this. How do you pick the right model for a problem? And how do you build that model? And what libraries do you use? And how do you evaluate how well it works? And does it perform its task in an efficient way? And how do you package that model? Do you include unit and integration tests? And is there good documentation on the model? Etc, etc.
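To make that concrete, here is a minimal sketch of the evaluation side of this step, assuming a binary classification problem. The synthetic dataset and logistic regression model are stand-ins for whatever your team actually works with, not a prescription:

```python
# A minimal sketch of evaluating a prototype binary classifier.
# The synthetic dataset stands in for whatever data your team actually has.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

X, y = make_classification(n_samples=5_000, n_features=20,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Report a threshold-free metric (AUC) alongside per-class precision/recall,
# since a single accuracy number can hide a weak minority class.
probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, probs), 3))
print(classification_report(y_test, model.predict(X_test)))
```

The harder part, of course, is deciding which of those numbers your stakeholders actually care about.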
There are three possible outcomes here:
The prototype completely and miserably fails, and there is no way to reframe the problem to make it work. I find that this rarely happens: either the DS/MLE developing the prototype or the team around them has developed the data intuition to know what has a chance of working. I suppose this can happen if you're working in a small, data-immature organization where your first data hires are on the junior side, or if your data leader doesn't have ML expertise and is leading a junior team.
The prototype is a mixed success. By far, this seems to be the most likely outcome. By "mixed success" I mean your binary classification model has metrics in the 70-80% range, or your regression model has a great R-squared for some subpopulations but a poor one for others (see the short sketch after this list), or your computer vision / OCR model does a decent job extracting text from non-blurry images, but only 70% of the images in the wild are non-blurry. You'll need to do the plate-balancing act of fine-tuning the model, reframing or refocusing it to make it more successful, working with the stakeholders to understand whether the model can be made serviceable, seeing if you can buy more time for improving the model, etc. Sometimes you get there, and sometimes you don't 🤷.
You knock the prototype out of the park! Damn, this feels fantastic when it happens. You developed a hybrid ML and regex-based algorithm that can successfully translate complex, unstructured factory codes into meaningful categories, somehow saving hundreds of hours of manual toil. Huzzah!
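As an illustration of the second, "mixed success" outcome above, here is a minimal sketch of surfacing a per-subpopulation gap in a regression model's R-squared. The segment column and the synthetic data are hypothetical, and the scores are in-sample for brevity:

```python
# A minimal sketch: checking whether a regression model that looks fine overall
# is actually weak for particular subpopulations.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 4_000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "segment": rng.choice(["A", "B", "C"], size=n),
})
# Make segment "C" deliberately noisier so its fit is worse.
noise = np.where(df["segment"] == "C", 3.0, 0.5)
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=noise)

model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
df["pred"] = model.predict(df[["x1", "x2"]])

# The overall R-squared can hide the weak segment, so report it per group too.
print("overall R^2:", round(r2_score(df["y"], df["pred"]), 3))
for segment, grp in df.groupby("segment"):
    print(segment, round(r2_score(grp["y"], grp["pred"]), 3))
```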
Throughout my tech career, I've found this step to be the bloodiest of all the steps in the funnel. This part of the funnel contains a vast cemetery of ideas. Losing at this point can be particularly painful for data teams, because by now you or your team have likely invested a decent amount of time getting to know the data, developed a prototype (that may even be quite promising!), and are starting to really believe in the potential of this project (particularly the primary MLE behind it). Here are some examples of non-technical, potentially fatal blows to projects:
Politics:
Economics:
Culture:
Like with step #3, there is a terrific amount of internet content available on how to deploy ML applications, so it's not worth drilling too deep into this step.
The point here is: even if your organization has passed the prior stage of the funnel and has the funding to stand up a beefy Kubernetes cluster (on bare metal, AWS's EKS, GCP's GKE, etc.), it can be challenging for some organizations to put together an effective DevOps team. This is particularly true for older organizations trying to transition into a more data-mature state. Kubernetes is not new, but it is notoriously challenging to troubleshoot and optimize.
I've seen firsthand how a solid model completely crumbled because the small, central DevOps team was inexperienced and dealing with turnover. Kubernetes pods were regularly failing and our external customers were getting 5xx errors from our REST API. It would be months before knowledgeable contractors were brought in to diagnose the issue. So painful. Solid DevOps and engineering are the foundation upon which ML can be built.
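A code snippet won't fix an organizational DevOps gap, but one small thing an ML team can do from its side is make the model-serving API easy for the platform to health-check, so failing pods get restarted or pulled out of rotation before callers see 5xx errors. A minimal sketch, assuming a Flask-based service; the /healthz and /readyz paths and the placeholder model loading are illustrative conventions, not anything from the story above:

```python
# A minimal sketch of liveness/readiness endpoints for a model-serving API.
# Flask and the endpoint paths are assumptions for illustration only.
from flask import Flask, jsonify

app = Flask(__name__)

def load_model():
    # Placeholder for whatever deserialization your project actually uses.
    return object()

MODEL = load_model()  # load once at process startup

@app.get("/healthz")
def healthz():
    # Liveness probe target: the process is up and answering requests.
    return jsonify(status="ok"), 200

@app.get("/readyz")
def readyz():
    # Readiness probe target: only accept traffic once the model is loaded.
    if MODEL is None:
        return jsonify(status="model not loaded"), 503
    return jsonify(status="ready"), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```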
If you've been involved in ML prototyping and application development for more than a year, I'm guessing you've felt pain from this question at least once in your tech career. There are few things more frustrating than having a great idea, getting buy-in, working with a small team for a month to build a tool, deploying it, celebrating, and then ... ouch, the dashboard tracking the usage of your API or web app is almost a flatline. There are so many reasons why this can happen:
I come from an academic background in epidemiology. Epidemiologists make a distinction between "efficacy" and "effectiveness", where the former is defined as the performance of an intervention under ideal and controlled circumstances and the latter as its performance under real-world conditions.
While it's critical to think through what your data distributions in the wild look like versus in your model training sandbox, that's actually not what I'm getting at here. Instead: after your application has been written and deployed and is being used, is it altering real world behavior and moving KPIs in the way you'd like? Human behavior is weird, and if your ML application is meant to be directly consumed by people, sometimes the outcome can shock you. Here's an example:
I have a former colleague who worked for a medium-sized startup focused on algorithmically selected clothing boxes for subscribed customers. As you can imagine, there was a giant warehouse somewhere with hundreds of aisles and rows that housed all of the individual clothing items. My friend created an algorithm to suggest efficient routes around the aisles that the warehouse workers could use to put together a customer's box (their job entailed packing a box by collecting 5 items). All of the above funnel steps were met and his excellent model was put into production. The application was implemented, and in the first week all of the workers got familiar with the system. A month later he ran some queries to see whether the mean minutes needed to construct a box (his metric of interest) had decreased. The metric didn't decrease, it didn't hold constant, IT ACTUALLY INCREASED. He ran all sorts of checks, but he could find nothing wrong from an engineering or mathematical perspective. He excluded the first week of the new system from his calculations, so it wasn't that it took the workers time to adjust. He verified that most workers were indeed taking the new, optimal routes. The team concluded that the increase had to be related to the behavior or psychology of the workers. Maybe workers felt less urgency knowing that their route had been chosen for them, or maybe workers felt it reduced their autonomy, which hurt their morale and speed. A well-intentioned, perfectly delivered ML application flopped in the face of counter-intuitive human nature.
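For what it's worth, the kind of pre/post check described above is easy to sketch. Everything here is hypothetical (the table, column names, rollout date, and synthetic numbers); it just shows the shape of the comparison, including dropping the first adjustment week:

```python
# A minimal sketch of the pre/post comparison described above, using a
# hypothetical events table. Column names and the cutoff dates are made up.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
boxes = pd.DataFrame({
    "packed_at": pd.to_datetime("2019-01-01") + pd.to_timedelta(
        rng.integers(0, 120, size=6_000), unit="D"),
    "minutes_to_pack": rng.normal(loc=14, scale=4, size=6_000),
})

rollout = pd.Timestamp("2019-03-01")
burn_in_end = rollout + pd.Timedelta(days=7)  # drop the first week of adjustment

before = boxes.loc[boxes["packed_at"] < rollout, "minutes_to_pack"]
after = boxes.loc[boxes["packed_at"] >= burn_in_end, "minutes_to_pack"]

print("mean minutes before:", round(before.mean(), 2))
print("mean minutes after: ", round(after.mean(), 2))
# A simple two-sample t-test; a real analysis would also worry about trends,
# seasonality, and worker-level clustering.
print(stats.ttest_ind(before, after, equal_var=False))
```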
Wahoo! Your team's project is running in production and meaningfully improving metrics. You hopefully get the kudos you deserve and the organization's trust in your team increases.
Remember though, we're playing the long game here. There are many technical and non-technical reasons why a productionized and effective ML application could still fail down the road.
Technical:
Non-technical: