The 2021 machine learning, AI, and data landscape

Just when you thought it couldn’t grow any more explosively, the data/AI landscape just did: the rapid pace of company creation, exciting new product and project launches, a deluge of VC financings, unicorn creation, IPOs, etc.

It has also been a year of multiple threads and stories intertwining.

One story has been the maturation of the ecosystem, with market leaders reaching large scale and ramping up their ambitions for global market domination, in particular through increasingly broad product offerings. Some of those companies, such as Snowflake, have been thriving in public markets (see our MAD Public Company Index), and a number of others (Databricks, Dataiku, DataRobot, etc.) have raised very large (or in the case of Databricks, gigantic) rounds at multi-billion valuations and are knocking on the IPO door (see our Emerging MAD company Index).

But at the other end of the spectrum, this year has also seen the rapid emergence of a whole new generation of data and ML startups. Whether they were founded a few years or a few months ago, many experienced a growth spurt in the past year or so. Part of it is due to a rabid VC funding environment and part of it, more fundamentally, is due to inflection points in the market.

In the past year, there’s been less headline-grabbing discussion of futuristic applications of AI (self-driving vehicles, etc.), and a bit less AI hype as a result. Regardless, data and ML/AI-driven application companies have continued to thrive, particularly those focused on enterprise use cases. Meanwhile, a lot of the action has been happening behind the scenes on the data and ML infrastructure side, with entirely new categories (data observability, reverse ETL, metrics stores, etc.) appearing or drastically accelerating.

To keep track of this evolution, this is our eighth annual landscape and “state of the union” of the data and AI ecosystem — coauthored this year with my FirstMark colleague John Wu. (For anyone interested, here are the prior versions: 2012, 2014, 2016, 2017, 2018, 2019: Part I and Part II, and 2020.)

For those who have remarked over the years how insanely busy the chart is, you’ll love our new acronym: Machine learning, Artificial intelligence, and Data (MAD) — this is now officially the MAD landscape!

We’ve learned over the years that those posts are read by a broad group of people, so we have tried to provide a little bit for everyone — a macro view that will hopefully be interesting and approachable to most, and then a slightly more granular overview of trends in data infrastructure and ML/AI for people with a deeper familiarity with the industry.

Quick notes:

My colleague John and I are early-stage VCs at FirstMark, and we invest very actively in the data/AI space. Our portfolio companies are noted with an (*) in this post.

Let’s dig in.

The macro view: Making sense of the ecosystem’s complexity

Let’s start with a high-level view of the market. As the number of companies in the space keeps increasing every year, the inevitable questions are: Why is this happening? How long can it keep going? Will the industry go through a wave of consolidation?

Rewind: The megatrend

Readers of prior versions of this landscape will know that we are relentlessly bullish on the data and AI ecosystem.

As we said in prior years, the fundamental trend is that every company is becoming not just a software company, but also a data company.

Historically, and still today in many organizations, data has meant transactional data stored in relational databases, and perhaps a few dashboards for basic analysis of what happened to the business in recent months.

But companies are now marching towards a world where data and artificial intelligence are embedded in myriad internal processes and external applications, both for analytical and operational purposes. This is the beginning of the era of the intelligent, automated enterprise — where company metrics are available in real time, mortgage applications get automatically processed, AI chatbots provide customer support 24/7, churn is predicted, cyber threats are detected in real time, and supply chains automatically adjust to demand fluctuations.

This fundamental evolution has been powered by dramatic advances in underlying technology — in particular, a symbiotic relationship between data infrastructure on the one hand and machine learning and AI on the other.

Both areas have had their own separate history and constituencies, but have increasingly operated in lockstep over the past few years. The first wave of innovation was the “Big Data” era, in the early 2010s, where innovation focused on building technologies to harness the massive amounts of digital data created every day. Then, it turned out that if you applied big data to some decade-old AI algorithms (deep learning), you got amazing results, and that triggered the whole current wave of excitement around AI. In turn, AI became a major driver for the development of data infrastructure: If we can build all those applications with AI, then we’re going to need better data infrastructure — and so on and so forth.

Fast-forward to 2021: The terms themselves (big data, AI, etc.) have experienced the ups and downs of the hype cycle, and today you hear a lot of conversations around automation, but fundamentally this is all the same megatrend.

The big unlock

A lot of today’s acceleration in the data/AI space can be traced to the rise of cloud data warehouses (and their lakehouse cousins — more on this later) over the past few years.

It is ironic because data warehouses address one of the most basic, pedestrian, but also fundamental needs in data infrastructure: Where do you store it all? Storage and processing are at the bottom of the data/AI “hierarchy of needs” — see Monica Rogati’s famous blog post here — meaning, what you need to have in place before you can do any fancier stuff like analytics and AI.

You’d figure that 15+ years into the big data revolution, that need had been solved a long time ago, but it hadn’t.

In retrospect, the initial success of Hadoop was a bit of a head-fake for the space — Hadoop, the OG big data technology, did try to solve the storage and processing layer. It did play a really important role in terms of conveying the idea that real value could be extracted from massive amounts of data, but its overall technical complexity ultimately limited its applicability to a small set of companies, and it never really achieved the market penetration that even the older data warehouses (e.g., Vertica) had a few decades ago.

Today, cloud data warehouses (Snowflake, Amazon Redshift, and Google BigQuery) and lakehouses (Databricks) provide the ability to store massive amounts of data in a way that’s useful, not completely cost-prohibitive, and doesn’t require an army of very technical people to maintain. In other words, after all these years, it is now finally possible to store and process big data.

That is a big deal and has proven to be a major unlock for the rest of the data/AI space, for several reasons.

First, the rise of data warehouses considerably increases market size not just for its category, but for the entire data and AI ecosystem. Because of their ease of use and consumption-based pricing (where you pay as you go), data warehouses become the gateway to every company becoming a data company. Whether you’re a Global 2000 company or an early-stage startup, you can now get started building your core data infrastructure with minimal pain. (Even FirstMark, a venture firm with several billion under management and 20-ish team members, has its own Snowflake instance.)

Second, data warehouses have unlocked an entire ecosystem of tools and companies that revolve around them: ETL, ELT, reverse ETL, warehouse-centric data quality tools, metrics stores, augmented analytics, etc. Many refer to this ecosystem as the “modern data stack” (which we discussed in our 2020 landscape). A number of founders saw the emergence of the modern data stack as an opportunity to launch new startups, and it is no surprise that a lot of the feverish VC funding activity over the last year has focused on modern data stack companies. Startups that were early to the trend (and played a pivotal role in defining the concept) are now reaching scale, including DBT Labs, a provider of transformation tools for analytics engineers (see our Fireside Chat with Tristan Handy, CEO of DBT Labs and Jeremiah Lowin, CEO of Prefect), and Fivetran, a provider of automated data integration solutions that streams data into data warehouses (see our Fireside Chat with George Fraser, CEO of Fivetran), both of which raised large rounds recently (see Financing section).

Third, because they solve the fundamental storage layer, data warehouses liberate companies to start focusing on high-value projects that appear higher in the hierarchy of data needs. Now that you have your data stored, it’s easier to focus in earnest on other things like real-time processing, augmented analytics, or machine learning. This in turn increases the market demand for all sorts of other data and AI tools and platforms. A flywheel gets created where more customer demand creates more innovation from data and ML infrastructure companies.

As they have such a direct and indirect impact on the space, data warehouses are an important bellwether for the entire data industry — as they grow, so does the rest of the space.

The good news for the data and AI industry is that data warehouses and lakehouses are growing very fast, at scale. Snowflake, for example, showed a 103% year-over-year growth in their most recent Q2 results, with an incredible net revenue retention of 169% (which means that existing customers keep using and paying for Snowflake more and more over time). Snowflake is targeting $10 billion in revenue by 2028. There’s a real possibility they could get there sooner. Interestingly, with consumption-based pricing where revenues start flowing only after the product is fully deployed, the company’s current customer traction could be well ahead of its more recent revenue numbers.
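To make the net revenue retention figure concrete, here is a minimal sketch of how such a ratio is commonly computed: revenue today from the customers you had a year ago, divided by what those same customers paid a year ago. The customer names and dollar amounts are made up for illustration and are not Snowflake's data, and vendors may differ in the exact definition they report.

```python
def net_revenue_retention(revenue_year_ago, revenue_now):
    """Return NRR for the cohort of customers that existed a year ago.

    Both arguments map customer name -> annual revenue. Customers that
    churned have no entry (or 0) in revenue_now; customers acquired since
    then are ignored because they were not part of the original cohort.
    """
    cohort_start = sum(revenue_year_ago.values())
    if cohort_start == 0:
        raise ValueError("cohort had no revenue a year ago")
    cohort_now = sum(revenue_now.get(name, 0.0) for name in revenue_year_ago)
    return cohort_now / cohort_start


# Hypothetical figures, chosen so the result matches the 169% in the text.
year_ago = {"acme": 100_000, "globex": 50_000, "initech": 50_000}
now = {
    "acme": 210_000,    # expanded usage
    "globex": 128_000,  # expanded usage
    "initech": 0,       # churned
    "newco": 75_000,    # new customer, excluded from NRR
}

nrr = net_revenue_retention(year_ago, now)
print(f"NRR: {nrr:.0%}")
```

Anything above 100% means the existing customer base grew its spend even after accounting for churn, which is why the metric matters so much under consumption-based pricing.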

This could certainly be just the beginning of how big data warehouses could become. Some observers believe that data warehouses and lakehouses, collectively, could get to 100% market penetration over time (meaning, every relevant company has one), in a way that was never true for prior data technologies like traditional data warehouses such as Vertica (too expensive and cumbersome to deploy) and Hadoop (too experimental and technical).

While this doesn’t mean that every data warehouse vendor and every data startup, or even market segment, will be successful, directionally this bodes incredibly well for the data/AI industry as a whole.

The titanic shock: Snowflake vs. Databricks

Snowflake has been the poster child of the data space recently. Its IPO in September 2020 was the biggest software IPO ever (we had covered it at the time in our Quick S-1 Teardown: Snowflake). At the time of writing, and after some ups and downs, it is a $95 billion market cap public company.

However, Databricks is now emerging as a major industry rival. On August 31, the company announced a massive $1.6 billion financing round at a $38 billion valuation, just a few months after a $1 billion round announced in February 2021 (at a measly $28 billion valuation).

Up until recently, Snowflake and Databricks were in fairly different segments of the market (and in fact were close partners for a while).

Snowflake, as a cloud data warehouse, is mostly a database to store and process large amounts of structured data — meaning, data that can fit neatly into rows and columns. Historically, it’s been used to enable companies to answer questions about past and current performance (“which were our top fastest growing regions last quarter?”), by plugging in business intelligence (BI) tools. Like other databases, it leverages SQL, a very popular and accessible query language, which makes it usable by millions of potential users around the world.
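As a rough illustration of the kind of question a warehouse answers, the sketch below runs a "fastest growing regions" query in SQL. It uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse. The `sales` table, its columns, and the numbers are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("EMEA", "2021-Q1", 100.0), ("EMEA", "2021-Q2", 150.0),
        ("APAC", "2021-Q1", 80.0), ("APAC", "2021-Q2", 140.0),
        ("AMER", "2021-Q1", 200.0), ("AMER", "2021-Q2", 220.0),
    ],
)

# Quarter-over-quarter growth per region, highest first, top two only.
query = """
SELECT cur.region,
       (cur.revenue - prev.revenue) / prev.revenue AS growth
FROM sales AS cur
JOIN sales AS prev ON prev.region = cur.region
WHERE cur.quarter = '2021-Q2' AND prev.quarter = '2021-Q1'
ORDER BY growth DESC
LIMIT 2
"""
top_regions = conn.execute(query).fetchall()
for region, growth in top_regions:
    print(f"{region}: {growth:.0%}")
```

A BI tool plugged into a warehouse issues queries of this general shape on the user's behalf and renders the result as a chart or dashboard.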

Databricks came from a different corner of the data world. It started in 2013 to commercialize Spark, an open-source framework to process large volumes of generally unstructured data (any kind of text, audio, video, etc.). Spark users used the framework to build and process what became known as “data lakes,” where they would dump just about any kind of data without worrying about structure or organization. A primary use of data lakes was to train ML/AI applications, enabling companies to answer questions about the future (“which customers are the most likely to purchase next quarter?”, i.e., predictive analytics). To help customers with their data lakes, Databricks created Delta, and to help them with ML/AI, it created MLflow. For the whole story on that journey, see my Fireside Chat with Ali Ghodsi, CEO, Databricks.

More recently, however, the two companies have converged towards one another.

Databricks started adding data warehousing capabilities to its data lakes, enabling data analysts to run standard SQL queries, as well as adding business intelligence tools like Tableau or Microsoft Power BI. The result is what Databricks calls the lakehouse — a platform meant to combine the best of both data warehouses and data lakes.

As Databricks made its data lakes look more like data warehouses, Snowflake has been making its data warehouses look more like data lakes. It announced support for unstructured data such as audio, video, PDFs, and imaging data in November 2020 and launched it in preview just a few days ago.

And where Databricks has been adding BI to its AI capabilities, Snowflake is adding AI to its BI compatibility. Snowflake has been building close partnerships with top enterprise AI platforms. Snowflake invested in Dataiku, and named it its Data Science Partner of the Year. It also invested in ML platform rival DataRobot.

Ultimately, both Snowflake and Databricks want to be the center of all things data: one repository to store all data, whether structured or unstructured, and run all analytics, whether historical (business intelligence) or predictive (data science, ML/AI).

Of course, there’s no lack of other competitors with a similar vision. The cloud hyperscalers in particular have their own data warehouses, as well as a full suite of analytical tools for BI and AI, and many other capabilities, in addition to massive scale. For example, listen to this great episode of the Data Engineering Podcast about GCP’s data and analytics capabilities.

Both Snowflake and Databricks have had very interesting relationships with cloud vendors, both as friend and foe. Famously, Snowflake grew on the back of AWS (despite AWS’s competitive product, Redshift) for years before expanding to other cloud platforms. Databricks built a strong partnership with Microsoft Azure, and now touts its multi-cloud capabilities to help customers avoid cloud vendor lock-in. For many years, and still to this day to some extent, detractors emphasized that both Snowflake’s and Databricks’ business models effectively resell underlying compute from the cloud vendors, which put their gross margins at the mercy of whatever pricing decisions the hyperscalers would make.

Watching the dance between the cloud providers and the data behemoths will be a defining story of the next five years.

Bundling, unbundling, consolidation?

Given the rise of Snowflake and Databricks, some industry observers are asking if this is the beginning of a long-awaited wave of consolidation in the industry: functional consolidation as large companies bundle an increasing number of capabilities into their platforms and gradually make smaller startups irrelevant, and/or corporate consolidation, as large companies buy smaller ones or drive them out of business.

Certainly, functional consolidation is happening in the data and AI space, as industry leaders ramp up their ambitions. This is clearly the case for Snowflake and Databricks, and the cloud hyperscalers, as just discussed.

But others have big plans as well. As they grow, companies want to bundle more and more functionality — nobody wants to be a single-product company.

For example, Confluent, a platform for streaming data that just went public in June 2021, wants to go beyond the real-time data use cases it is known for, and “unify the processing of data in motion and data at rest” (see our Quick S-1 Teardown: Confluent).

As another example, Dataiku* natively covers all the functionality otherwise offered by dozens of specialized data and AI infrastructure startups, from data prep to machine learning, DataOps, MLOps, visualization, AI explainability, etc., all bundled in one platform, with a focus on democratization and collaboration (see our Fireside Chat with Florian Douetteau, CEO, Dataiku).

Arguably, the rise of the “modern data stack” is another example of functional consolidation. At its core, it is a de facto alliance among a group of companies (mostly startups) that, as a group, functionally cover all the different stages of the data journey from extraction to the data warehouse to business intelligence — the overall goal being to offer the market a coherent set of solutions that integrate with one another.

For the users of those technologies, this trend towards bundling and convergence is healthy, and many will welcome it with open arms. As it matures, it is time for the data industry to evolve beyond its big technology divides: transactional vs. analytical, batch vs. real-time, BI vs. AI.

These somewhat artificial divides have deep roots, both in the history of the data ecosystem and in technology constraints. Each segment had its own challenges and evolution, resulting in a different tech stack and a different set of vendors. This has led to a lot of complexity for the users of those technologies. Engineers have had to stitch together suites of tools and solutions and maintain complex systems that often end up looking like Rube Goldberg machines.

As they continue to scale, we expect industry leaders to accelerate their bundling efforts and keep pushing messages such as “unified data analytics.” This is good news for Global 2000 companies in particular, which have been the prime target customer for the bigger, bundled data and AI platforms. Those companies have both a tremendous amount to gain from deploying modern data infrastructure and ML/AI, and at the same time much more limited access to top data and ML engineering talent needed to build or assemble data infrastructure in-house (as such talent tends to prefer to work either at Big Tech companies or promising startups, on the whole).

However, as much as Snowflake and Databricks would like to become the single vendor for all things data and AI, we believe that companies will continue to work with multiple vendors, platforms, and tools, in whichever combination best suits their needs.

The key reason: The pace of innovation is just too explosive in the space for things to remain static for too long. Founders launch new startups; Big Tech companies create internal data/AI tools and then open-source them; and for every established technology or product, a new one seems to emerge weekly. Even the data warehouse space, possibly the most established segment of the data ecosystem currently, has new entrants like Firebolt, promising vastly superior performance.

While the big bundled platforms have Global 2000 enterprises as their core customer base, there is a whole ecosystem of tech companies, both startups and Big Tech, that are avid consumers of all the new tools and technologies, giving the startups behind them a great initial market. Those companies do have access to the right data and ML engineering talent, and they are willing and able to do the stitching of best-of-breed new tools to deliver the most customized solutions.

Meanwhile, just as the big data warehouse and data lake vendors are pushing their customers towards centralizing all things on top of their platforms, new frameworks such as the data mesh emerge, which advocate for a decentralized approach, where different teams are responsible for their own data product. While there are many nuances, one implication is to evolve away from a world where companies just move all their data to one big central repository. Should it take hold, the data mesh could have a significant impact on architectures and the overall vendor landscape (more on the data mesh later in this post).

Beyond functional consolidation, it is also unclear how much corporate consolidation (M&A) will happen in the near future.

We’re likely to see a few very large, multi-billion dollar acquisitions as big players are eager to make big bets in this fast-growing market to continue building their bundled platforms. However, the high valuations of tech companies in the current market will probably continue to deter many potential acquirers. For example, everybody’s favorite industry rumor has been that Microsoft would want to acquire Databricks. However, because the company could fetch a $100 billion or more valuation in public markets, even Microsoft may not be able to afford it.

There is also a voracious appetite for buying smaller startups throughout the market, particularly as later-stage startups keep raising and have plenty of cash on hand. However, there is also voracious interest from venture capitalists to continue financing those smaller startups. It is rare for promising data and AI startups these days to not be able to raise the next round of financing. As a result, comparatively few M&A deals get done these days, as many founders and their VCs want to keep turning the next card, as opposed to joining forces with other companies, and have the financial resources to do so.

Let’s dive further into financing and exit trends.

Financings, IPOs, M&A: A crazy market

As anyone who follows the startup market knows, it’s been crazy out there.

Venture capital has been deployed at an unprecedented pace, surging 157% year-on-year globally to $156 billion in Q2 2021, according to CB Insights. Ever-higher valuations led to the creation of 136 newly minted unicorns in the first half of 2021 alone, and the IPO window has been wide open, with public financings (IPOs, direct listings, SPACs) up 687% (496 vs. 63) in the January 1 to June 1, 2021, period vs. the same period in 2020.
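The 687% figure follows directly from the two counts quoted above. A quick check, using only the numbers in the text (the function name is just for illustration):

```python
def pct_change(old, new):
    """Percentage change from old to new, as a fraction (1.0 == +100%)."""
    return (new - old) / old


public_financings_2020 = 63   # from the text
public_financings_2021 = 496  # from the text

change = pct_change(public_financings_2020, public_financings_2021)
print(f"{change:.0%}")
```

Note that this is an increase of roughly 6.9x on top of the 2020 base, meaning 2021 saw nearly eight times as many public financings in the same window.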

In this general context of market momentum, data and ML/AI have been hot investment categories once again this past year.

Public markets

Not so long ago, there were hardly any “pure play” data / AI companies listed in public markets.

However, the list is growing quickly after a strong year for IPOs in the data / AI world. We started a public market index to help track the performance of this growing category of public companies — see our MAD Public Company Index (update coming soon).

On the IPO front, particularly noteworthy were UiPath, an RPA and AI automation company, and Confluent, a data infrastructure company focused on real-time streaming data (see our Confluent S-1 teardown for our analysis). Other notable IPOs were C3.ai, an AI platform (see our C3 S-1 teardown), and Couchbase, a NoSQL database.

Several vertical AI companies also had noteworthy IPOs: SentinelOne, an autonomous AI endpoint security platform; TuSimple, a self-driving truck developer; Zymergen, a biomanufacturing company; Recursion, an AI-driven drug discovery company; and Darktrace, “a world-leading AI for cyber-security” company.

Meanwhile, existing public data/AI companies have continued to perform strongly.

While they’re both off their all-time highs, Snowflake is a formidable $95 billion market cap company, and, for all the controversy, Palantir is a $55 billion market cap company, at the time of writing.

Both Datadog and MongoDB are at their all-time highs. Datadog is now a $45 billion market cap company (an important lesson for investors). MongoDB is a $33 billion company, propelled by the rapid growth of its cloud product, Atlas.

Overall, as a group, data and ML/AI companies have vastly outperformed the broader market. And they continue to command high premiums — out of the top 10 companies with the highest market capitalization to revenue multiple, 4 of them (including the top 2) are data/AI companies.

Chart of top ten EV and NTM revenue multiples. Source: Jamin Ball, Clouded Judgement, September 24, 2021.

Another distinctive characteristic of public markets in the last year has been the rise of SPACs as an alternative to the traditional IPO process. SPACs have proven a very beneficial vehicle for the more “frontier tech” portion of the AI market (autonomous vehicle, biotech, etc.). Some examples of companies that have either announced or completed SPAC (and de-SPAC) transactions include Ginkgo Bioworks, a company that engineers novel organisms to produce useful materials and substances, now a $24B public company at the time of writing; autonomous vehicle companies Aurora and Embark; and Babylon Health.

Private markets

The frothiness of the venture capital market is a topic for another blog post (is it just a consequence of macroeconomics and low interest rates, or a reflection of the fact that we have truly entered the deployment phase of the internet?). But suffice it to say that, in the context of an overall booming VC market, investors have shown tremendous enthusiasm for data/AI startups.

According to CB Insights, in the first half of 2021, investors had poured $38 billion into AI startups, surpassing the full 2020 amount of $36 billion with half a year to go. This was driven by 50+ mega-sized $100 million-plus rounds, also a new high. Forty-two AI companies reached unicorn valuations in the first half of the year, compared to only 11 for the entirety of 2020.

One inescapable feature of the 2020-2021 VC market has been the rise of crossover funds, such as Tiger Global, Coatue, Altimeter, Dragoneer, or D1, and other mega-funds such as Softbank or Insight. While those funds have been active across the Internet and software landscape, data and ML/AI has clearly been a key investing theme.

As an example, Tiger Global seems to love data/AI companies. Just in the last 12 months, the New York hedge fund has written big checks into many of the companies appearing on our landscape, including, for example, Deep Vision, Databricks, Dataiku*, DataRobot, Imply, Prefect, Gong, PathAI, Ada*, Vast Data, Scale AI, Redis Labs, 6sense, TigerGraph, UiPath, Cockroach Labs*, Hyperscience*, and a number of others.

This exceptional funding environment has mostly been great news for founders. Many data/AI companies found themselves the object of preemptive rounds and bidding wars, giving full power to founders to control their fundraising processes. As VC firms competed to invest, round sizes and valuations escalated dramatically. Series A round sizes used to be in the $8-$12 million range just a few years ago. They are now routinely in the $15-$20 million range. Series A valuations that used to be in the $25-$45 million (pre-money) range now often reach $80-$120 million — valuations that would have been considered a great series B valuation just a few years ago.

On the flip side, the flood of capital has led to an ever-tighter job market, with fierce competition for data, machine learning, and AI talent among many well-funded startups, and corresponding compensation inflation.

Another downside: As VCs aggressively invested in emerging sectors up and down the data stack, often betting on future growth over existing commercial traction, some categories went from nascent to crowded very rapidly — reverse ETL, data quality, data catalogs, data annotation, and MLOps.

Regardless, since our last landscape, an unprecedented number of data/AI companies became unicorns, and those that were already unicorns became even more highly valued, with a couple of decacorns (Databricks, Celonis).

Some noteworthy unicorn-type financings (in rough reverse chronological order):
- Fivetran, an ETL company, raised $565 million at a $5.6 billion valuation
- Matillion, a data integration company, raised $150 million at a $1.5 billion valuation
- Neo4j, a graph database provider, raised $325 million at a more than $2 billion valuation
- Databricks, a provider of data lakehouses, raised $1.6 billion at a $38 billion valuation
- Dataiku*, a collaborative enterprise AI platform, raised $400 million at a $4.6 billion valuation
- DBT Labs (fka Fishtown Analytics), a provider of open-source analytics engineering tools, raised a $150 million series C
- DataRobot, an enterprise AI platform, raised $300 million at a $6 billion valuation
- Celonis, a process mining company, raised a $1 billion series D at an $11 billion valuation
- Anduril, an AI-heavy defense technology company, raised a $450 million round at a $4.6 billion valuation
- Gong, an AI platform for sales team analytics and coaching, raised $250 million at a $7.25 billion valuation
- Alation, a data discovery and governance company, raised a $110 million series D at a $1.2 billion valuation
- Ada*, an AI chatbot company, raised a $130 million series C at a $1.2 billion valuation
- Signifyd, an AI-based fraud protection software company, raised $205 million at a $1.34 billion valuation
- Redis Labs, a real-time data platform, raised a $310 million series G at a $2 billion valuation
- Sift, an AI-first fraud prevention company, raised $50 million at a valuation of over $1 billion
- Tractable, an AI-first insurance company, raised $60 million at a $1 billion valuation
- SambaNova Systems, a specialized AI semiconductor and computing platform, raised $676 million at a $5 billion valuation
- Scale AI, a data annotation company, raised $325 million at a $7 billion valuation
- Vectra, a cybersecurity AI company, raised $130 million at a $1.2 billion valuation
- Shift Technology, an AI-first software company built for insurers, raised $220 million
- Dataminr, a real-time AI risk detection platform, raised $475 million
- Feedzai, a fraud detection company, raised a $200 million round at a valuation of over $1 billion
- Cockroach Labs*, a cloud-native SQL database provider, raised $160 million at a $2 billion valuation
- Starburst Data, an SQL-based data query engine, raised a $100 million round at a $1.2 billion valuation
- K Health, an AI-first mobile virtual healthcare provider, raised $132 million at a $1.5 billion valuation
- Graphcore, an AI chipmaker, raised $222 million
- Forter, a fraud detection software company, raised a $125 million round at a $1.3 billion valuation

Acquisitions

As mentioned above, acquisitions in the MAD space have been robust but haven’t spiked as much as one would have guessed, given the hot market. The unprecedented amount of cash floating in the ecosystem cuts both ways: More companies have strong balance sheets to potentially acquire others, but many potential targets also have access to cash, whether in private/VC markets or in public markets, and are less likely to want to be acquired.

Of course, there have been several very large acquisitions:
- Nuance, a public speech and text recognition company (with a particular focus on healthcare), is in the process of getting acquired by Microsoft for almost $20 billion (making it Microsoft’s second-largest acquisition ever, after LinkedIn)
- Blue Yonder, an AI-first supply chain software company for retail, manufacturing, and logistics customers, was acquired by Panasonic for up to $8.5 billion
- Segment, a customer data platform, was acquired by Twilio for $3.2 billion
- Kustomer, a CRM that enables businesses to effectively manage all customer interactions across channels, was acquired by Facebook for $1 billion
- Turbonomic, an “AI-powered Application Resource Management” company, was acquired by IBM for between $1.5 billion and $2 billion

There were also a couple of take-private acquisitions of public companies by private equity firms: Cloudera, a formerly high-flying data platform, was acquired by Clayton Dubilier & Rice and KKR, perhaps the official end of the Hadoop era; and Talend, a data integration provider, was taken private by Thoma Bravo.

Some other notable acquisitions of companies that appeared on earlier versions of this MAD landscape: ZoomInfo acquired Chorus.ai and Everstring; DataRobot acquired Algorithmia; Cloudera acquired Cazena; Relativity acquired Text IQ*; Datadog acquired Sqreen and Timber*; SmartEye acquired Affectiva; Facebook acquired Kustomer; ServiceNow acquired Element AI; Vista Equity Partners acquired Gainsight; AVEVA acquired OSIsoft; and American Express acquired Kabbage.

What’s new for the 2021 MAD landscape

Given the explosive pace of innovation, company creation, and funding in 2020-21, particularly in data infrastructure and MLOps, we’ve had to change things around quite a bit in this year’s landscape.

One significant structural change: As we couldn’t fit it all in one category anymore, we broke “Analytics and Machine Intelligence” into two separate categories, “Analytics” and “Machine Learning & Artificial Intelligence.”

We added several new categories:

In “Infrastructure,” we added:
- “Reverse ETL” — products that funnel data from the data warehouse back into SaaS applications
- “Data Observability” — a rapidly emerging component of DataOps focused on understanding and troubleshooting the root of data quality issues, with data lineage as a core foundation
- “Privacy & Security” — data privacy is increasingly top of mind, and a number of startups have emerged in the category
In “Analytics,” we added:
- “Data Catalogs & Discovery” — one of the busiest categories of the last 12 months; those are products that enable users (both technical and non-technical) to find and manage the datasets they need
- “Augmented Analytics” — BI tools are taking advantage of NLG / NLP advances to automatically generate insights, particularly democratizing data for less technical audiences
- “Metrics Stores” — a new entrant in the data stack which provides a central standardized place to serve key business metrics
- “Query Engines”
In “Machine Learning and AI,” we broke down several MLOps categories into more granular subcategories:
- “Model Building”
- “Feature Stores”
- “Deployment and Production”
In “Open Source,” we added:
- “Format”
- “Orchestration”
- “Data Quality & Observability”

Another significant evolution: In the past, the landscape overwhelmingly featured the more established companies, meaning growth-stage startups (Series C or later) as well as public companies. However, given the emergence of the new generation of data/AI companies mentioned earlier, this year we’ve featured a lot more early-stage startups (Series A, sometimes seed) than ever before.

Without further ado, here’s the landscape:

Key Trends in Data Infrastructure 2021 chart showing key companies and trends in the data infrastructure space, full information available at mattturk.com
