The Data-Centric AI Movement and Opportunities for the MLOps Ecosystem
Data is the new oil
“Data is the new oil” has become something of a trope in the tech community: a quippy statement to illustrate the vast amount of data in the universe (forecasted to reach almost 80 zettabytes in 2021), and the incredible power that will be granted to those who can mine that resource for all it’s worth.
It’s true that businesses can do a lot with data. They can use it to generate insights about the past, trigger actions in the present, or make predictions about the future. Machine learning is proving to be an especially powerful way to use data because it can spot patterns that are otherwise undetectable by humans and can use those patterns to make decisions (hopefully smarter ones).
But, as any data scientist will tell you, the journey from raw data to accurate, functioning machine learning is long and arduous. Like oil, data for machine learning needs to be refined (collected, curated, cleaned, and organized) and put into a functioning engine (a model) in order to be useful.
Model-centric vs. data-centric machine learning
Historically, most efforts to build better machine learning projects have focused on the engine, that is, on improving the model. While model optimization is certainly important, there’s another part of the equation that’s equally, or perhaps even more, important: refining the data.
Data-centric machine learning is a topic that technologist and entrepreneur Andrew Ng has been advocating for in recent months. If you haven’t seen his talk, “From model-centric to data-centric AI,” we highly recommend watching it. For those who prefer the TL;DR, here are the highlights:
In some cases, changing your data set is more impactful to accuracy than changing your model.
Machine learning models have improved dramatically over the last decade. While there’s certainly more innovation to be had in this category, AI practitioners are starting to see diminishing returns from model improvements. Data sets, on the other hand, which Google researchers have called “the most under-valued and deglamorized aspect of AI,” have a lot more room for improvement.
Today, around 80% of a data scientist’s time is spent cleaning data, and models are typically trained on only thousands of examples, rather than millions. These two facts alone (the time spent dealing with data, and the lack of sufficient examples used to train models) are reason enough to see that data sets deserve more attention than they’ve received historically.
Today, data collection and data labeling lack standardization.
One major shortcoming in machine learning data, according to Ng, is the lack of standardization around data collection and data labeling. There are few consistent guardrails for either today, especially when labeling is outsourced to a third party on a project-by-project basis. A systematic approach across all ML projects would vastly decrease the need for error correction and improve a data set’s predictive value.
Big data isn’t always better data.
It was historically assumed that, to correct errors caused by low-quality or unorganized data, one needed to add more data to train the model. Ng questions this belief and maintains that it is the quality, not the quantity, of data that defines a model’s success. Organized and relevant data will always outperform low-quality big data.
MLOps’ most important task is to make high-quality data available through all stages of the machine learning lifecycle.
Ng believes that the growth of the machine learning operations (MLOps) field will be critical to popularizing efficient, systematic, data-centric AI practices. Whether it’s identifying the right data set for the task at hand, enforcing data labeling standards, deciding when to collect additional training data, or refining data sets as machine learning projects reach real-world production, MLOps teams and MLOps tools should consider their primary mandate to be ensuring that high-quality data is available at every stage of a machine learning project’s life. Ng also sees MLOps as an emerging job field for data scientists, software engineers, and domain experts, and believes many more MLOps roles will be created in the coming years.
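To make the idea of enforcing labeling standards a little more concrete, here is a minimal Python sketch of one possible check: have several annotators label the same items and flag the items where they disagree. The function name `label_agreement`, the `min_agreement` threshold, and the sample labels are all hypothetical, chosen only for illustration; this is not a method described in Ng’s talk.

```python
from collections import Counter


def label_agreement(annotations, min_agreement=1.0):
    """Summarize annotator agreement per item.

    annotations: dict mapping item_id -> list of labels from different annotators.
    Returns (majority, flagged), where majority maps item_id -> most common label
    and flagged maps item_id -> agreement ratio for items below min_agreement.
    """
    majority = {}
    flagged = {}
    for item_id, labels in annotations.items():
        if not labels:
            raise ValueError(f"no labels for item {item_id!r}")
        # most_common breaks ties by first-seen order, so a tie is resolved
        # arbitrarily; such items are always flagged for review below.
        top_label, top_count = Counter(labels).most_common(1)[0]
        majority[item_id] = top_label
        agreement = top_count / len(labels)
        if agreement < min_agreement:
            flagged[item_id] = agreement
    return majority, flagged


# Hypothetical example: three annotators label four equipment images.
annotations = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["dent", "dent", "dent"],
    "img_004": ["ok", "scratch", "dent"],
}
majority, flagged = label_agreement(annotations)
for item_id, agreement in flagged.items():
    print(f"{item_id}: annotators agree on only {agreement:.0%} of labels")
```

Items that end up in `flagged` are natural candidates for review by a subject matter expert, or a signal that the labeling instructions themselves are ambiguous.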
Opportunities for the MLOps ecosystem
As VCs, we’ve seen a number of startups over the years on the model-centric side of AI. Data-centric AI has yet to take center stage, and we think that is due in part to a nascent tooling ecosystem. As companies become increasingly data-driven, we expect there will be more investment in a suite of data-centric machine learning tools. Here are a few of the opportunities we’re most excited about in this emerging category:
Internal data labeling
Data labeling has historically been outsourced to third parties that employ an army of human annotators to label data sets. While third-party labeling is not likely to go away anytime soon (Scale AI’s recent $7.3B valuation certainly validates how effective this approach can be), important domain-specific context can get lost in translation when a third-party annotator, rather than a subject matter expert, is labeling a highly specialized data set like medical imaging or industrial equipment.
These sorts of projects may be better suited for in-house labeling, using tools like Watchful and Heartex that enable subject matter experts to play a role in the process while automating as much of the grunt work as possible. There are also tools like Dataloop (an F2 Ventures portfolio company), an end-to-end data management platform with quality-first data labeling capabilities.
External data tools
Companies often use external public data to augment their private data sets. However, acquiring relevant external data comes with challenges: it is hard to access, tedious to use, doesn’t guarantee sustained relevance, and can pose compliance risks. This is where external data tools come into play, enabling companies to discover thousands of relevant data signals and expand the organization’s data perimeter with ease.
A leading company in this vertical from F2’s portfolio is Explorium, a data science platform that eliminates barriers to acquiring relevant external data. Explorium searches its collection of thousands of external data signals and automatically discovers the most relevant signals to improve the customer’s analytics and machine learning projects.
Ultimately, Explorium taps into the transition to the data-centric approach by offering its platform as a data acquisition strategy. Fidap is another new entrant in this space.
Synthetic data
While synthetic data is still a nascent category, used mostly for privacy-preserving cases, we believe that automated synthetic data solutions could play a major role in expanding small data sets in a highly standardized way. As Ng calls out in his talk, generating a relevant synthetic data set can still be a laborious task, but tools like Datomize (another F2 company) for structured data or Synthesis AI for unstructured data (along with many other new entrants) are trying to make this process easier. These solutions could go one step further by integrating with or enforcing overarching data set standards, applied to both the real data set and the synthetic one. Overall, we believe we’re still in the very early innings of synthetic data and are excited to see what opportunities it creates for companies of all shapes and sizes to do more with less.
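For readers who want a feel for what “expanding a small data set” can mean in code, the Python sketch below copies random real rows and adds a small amount of Gaussian noise to each numeric column (often called jittering). This is a deliberately naive stand-in and not how any of the tools named above work; the function name `jitter_rows`, its parameters, and the sample rows are hypothetical.

```python
import random
import statistics


def jitter_rows(rows, n_new, noise_scale=0.1, seed=0):
    """Create n_new synthetic rows from a non-empty list of numeric tuples.

    Each synthetic row is a randomly chosen real row with Gaussian noise added,
    scaled to noise_scale times that column's standard deviation.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    columns = list(zip(*rows))
    stdevs = [statistics.pstdev(column) for column in columns]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        synthetic.append(tuple(
            value + rng.gauss(0.0, noise_scale * stdev)
            for value, stdev in zip(base, stdevs)
        ))
    return synthetic


# Hypothetical example: four real (age_in_years, weekly_usage_hours) rows.
real_rows = [(1.0, 10.0), (2.0, 12.0), (3.0, 9.0), (4.0, 15.0)]
synthetic_rows = jitter_rows(real_rows, n_new=10)
print(len(synthetic_rows), "synthetic rows generated")
```

Jittering treats every column independently, so it cannot invent genuinely new patterns or respect constraints between columns. That limitation is one reason generating a relevant synthetic data set remains laborious and why dedicated tooling is attractive.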
Verticalization of MLOps
Specialized solutions will always drive better performance than generalized solutions. As in many other industries, we expect MLOps will produce a set of large, enduring players focused on specific verticals, like healthcare, robotics, or transportation. We’re already starting to see this play out, particularly in healthcare with companies like Syntegra (synthetic data) or Centaur Labs (data labeling for medical data), and we expect these divisions to deepen in the coming years.
Perhaps one day we’ll see the specialization go even further than just focusing on one specific vertical. One could imagine a future state where a retail company uses one set of MLOps tools for models that predict shoppers’ behavior and a completely different set for models that identify fraud in credit card transactions. After all, the more specialized the MLOps, the better the chance of strong performance.
Data observability
It used to be assumed that the more data you had, the better. And while this holds some truth, it is equally important that the data flowing into your environment is inspected and validated before it is collected, stored, and analyzed. Otherwise, companies can face the “garbage in, garbage out” problem (not to mention sky-high storage bills). This is where data observability comes into play, sitting far upstream in the data lifecycle and ensuring the company is always collecting useful, clean, high-quality data.
Monte Carlo is a prominent startup in this sphere, and there’s a swath of new entrants like Anomalo and Soda. These tools provide engineers with a view of the company’s data health and its reliability for specific use cases. In providing this visibility across data pipelines and products, data observability tools greatly reduce data downtime and the countless hours that go into root-causing data quality issues.
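As a rough illustration of validating data before it is stored, the Python sketch below checks an incoming batch of records for missing values and out-of-range numbers. It is a toy example under our own assumptions, not a description of how the tools above work; the function name `validate_batch`, its parameters, and the sample records are hypothetical.

```python
def validate_batch(rows, required_fields, numeric_ranges, max_null_rate=0.05):
    """Return a list of human-readable issues found in a batch of dict records.

    required_fields: field names that should rarely be missing (None or absent).
    numeric_ranges: dict mapping field name -> (low, high) inclusive bounds.
    max_null_rate: highest tolerated fraction of missing values per field.
    """
    if not rows:
        return ["batch is empty"]
    issues = []
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        null_rate = missing / len(rows)
        if null_rate > max_null_rate:
            issues.append(
                f"{field}: {null_rate:.0%} missing (limit {max_null_rate:.0%})"
            )
    for field, (low, high) in numeric_ranges.items():
        out_of_range = sum(
            1 for row in rows
            if row.get(field) is not None and not (low <= row[field] <= high)
        )
        if out_of_range:
            issues.append(f"{field}: {out_of_range} value(s) outside [{low}, {high}]")
    return issues


# Hypothetical example: a small batch of transaction records.
batch = [
    {"user_id": 1, "amount": 25.0},
    {"user_id": 2, "amount": -4.0},
    {"user_id": None, "amount": 12.5},
    {"user_id": 4, "amount": 80.0},
]
issues = validate_batch(
    batch,
    required_fields=["user_id", "amount"],
    numeric_ranges={"amount": (0, 10_000)},
)
for issue in issues:
    print("data quality issue:", issue)
```

A pipeline could refuse to load a batch, or alert an engineer, whenever `issues` is non-empty, which is the “garbage in, garbage out” guard described above in its simplest form.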
Quality data will fuel the machine learning revolution
Despite the potential of machine learning, it still isn’t a mainstream business tool. While there is no silver bullet, we think that better data-centric MLOps tooling can play a major role in making sure every machine learning project is successful, whether you’re a Fortune 500 enterprise or a young startup. We believe MLOps tools should make the data, not just the model, a first-class citizen, and give companies the ability to squeeze every last drop of value out of their data sets, no matter how small or specialized they are. Tools that do this well will be critical in democratizing machine learning for companies of all sizes that want to use it to make the world a better place.