The acquisition of Paxata by the unicorn DataRobot in December 2019 was interesting, but it certainly didn’t come as a surprise to me. You see, such unifications in the data and analytics world have been a few years in the making.
As a practitioner of data-science-based analytics (I think of it as non-BI and non-reporting analytics), I have a simple pathway in my mind of what it takes to get the data and make it work for you to unlock business value:
1. Data acquisition – getting your hands on the data that might have value
2. Data preparation – getting the data into a form that is cleaner, aggregated, actionable
3. Data science – running various experiments to unlock deductive, predictive value
4. Deployment – packaging the data and models to extract business value
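To make the four steps concrete, here is a minimal sketch in Python. It is a toy illustration and not any vendor's workflow: the function names, the sample records, and the "model" (a per-region average) are all hypothetical and chosen only to show how each step hands off to the next.

```python
from statistics import mean


def acquire():
    # Step 1: hypothetical raw records; a real source would be a database, API, or file drop.
    return [
        {"region": "east", "sales": "120"},
        {"region": "east", "sales": "80"},
        {"region": "west", "sales": "200"},
        {"region": "west", "sales": None},  # incomplete record, dropped in step 2
    ]


def prepare(rows):
    # Step 2: drop incomplete rows, cast strings to numbers, group by region.
    by_region = {}
    for row in rows:
        if row["sales"] is None:
            continue
        by_region.setdefault(row["region"], []).append(float(row["sales"]))
    return by_region


def fit(by_region):
    # Step 3: a deliberately trivial "model": the mean of each region's sales.
    return {region: mean(values) for region, values in by_region.items()}


def deploy(model):
    # Step 4: package the model behind a callable that downstream systems can use.
    def predict(region):
        return model.get(region)  # None for regions the model has never seen
    return predict


predict = deploy(fit(prepare(acquire())))
```

Even in a toy like this, the last step is where the awkward questions show up, such as what `predict` should return for a region the model has never seen.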
While Deployment has historically been the hardest part and also the most valuable, the lion’s share of approbation has been reserved for Data science. With the availability of cheap and seemingly unlimited memory and compute, and with the advent of cloud platforms, it wasn’t too hard to lionize Data science and to buy into the marketing construct that AI/ML is the panacea for all human (and corporate) woes. Still, I’m sure many of us have visited data science graveyards littered with would-be be-all and end-all solutions, or have been to corporate theatres and felt our excitement deflate as we recognized BI/Reporting masquerading as AI/ML!
The Netflix Prize, the movie recommendation competition that launched in 2006 and was awarded in 2009, was probably one of the first well-known endeavors that provided a shot in the arm for Data science tools. Sites such as Kaggle then tried to create a common platform for data scientists to come in and go head to head against each other. The makers of many of the current Data science tools, such as DataRobot and Dataiku, saw that there was value in automating and simplifying the tasks that a data scientist has to perform in order to compete effectively.
Of course, a lot of these startups then had to figure out how to make money, and had to pivot from serving broke data scientists to serving enterprise data scientists. That pivot has been slightly strange for them though. The typical enterprise doesn’t really pine for a 99.99% accurate model – it will be happy with 90% as long as the costs are in line and results can be delivered that quarter. Also, things had to be easy. As easy as the easy button.
Many of these Fortune 500 enterprises had their own standalone on-prem environments or, even worse, no big data environments at all. Suddenly, the Data science startups had to span a chasm they hadn’t anticipated, because their own tools were standalone systems too. They scrambled to integrate, either by putting the tools on the ‘edge’ or by wrapping their standalone environments in API calls and the like. Still, it was all really choppy. I remember working on a product for which we would prepare data by hand, throw it over to a shared directory, run a standalone data science tool, approximate the model it recommended by writing our own implementation in R, point that back at our data, and try to figure out if it worked properly. The model would almost always be off. Then came the question of deployment. Depending on the environment, one would then do some software engineering hack jobs to run the model, generate results, and create a mechanism to broadcast them.
If I lost you there, dear reader, I apologize. But that was the intent. The path to deployment was like walking a tightrope on the high seas.
Not that it has become a safe harbor yet, but things are getting interesting. As the public cloud platforms have shifted their focus from mostly data and compute hosting to broader end-to-end solutions, things are changing. Having got direct feedback from a lot of enterprise customers, the cloud platforms have certainly realized that a unified solution, all the way from data acquisition to deployment, is the way to go. Some platforms have realized that they have gaps in certain areas and have developed deep relationships with various tool providers (e.g. Google with Trifacta for data prep, Azure with Databricks for Spark). This change has also made the startups realize that, in order to survive, they have to somehow cater to all four steps as well.
Thus, the data science startups have embarked on a two-track strategy. They are going to all the cloud platforms as plug-and-play solutions, and they are also either acquiring or building deep partnerships with data acquisition and data prep vendors. Some are also offering easier ways to get the models production ready and deployable.
In 2012, Wired magazine provided more details on the Netflix $1M challenge for improving the movie recommendation model. Apparently, the winning model was never deployed because of the significant engineering costs involved in “prepping” and deploying the model. It would seem, then, that enterprises should keep a pragmatic view of how to cascade through all the different steps in extracting value from their data, keeping in mind the key metrics of (a) time to value, (b) effectiveness, and (c) costs.
As the Paxata-DataRobot deal illustrates, there is a lot more action to be seen in the market. On one side we have the cloud platforms developing their full end-to-end suites of offerings, and on the other side we have highly specialized tools collaborating with other such tools, and with the cloud platforms, to win the customer. No matter who wins this game, the enterprise customer certainly will. With a choice of various permutations of tools and platforms, all of them selling the united colors of analytics, the enterprise should get its easy button soon enough (fingers crossed).