Summary of 2023 State of Data + AI: Powered by Databricks Lakehouse
2 min read · Jul 3, 2023
I recently read the 2023 State of Data + AI Report published by Databricks. This report investigates and answers three questions:
- How are companies applying data science and ML in the real world?
- Which data and AI products are most popular today and which are growing quickly?
- How are companies managing their data warehousing needs in the age of AI?
Here are some of my notes from reading this report:
- The most popular company use case for AI is NLP (LLMs fall into this bucket). NLP handles abstract textual tasks (e.g. summarizing text blocks or doing sentiment analysis). Databricks found that 49% of the AI use cases they investigated at companies were related to NLP.
- The second most common AI bucket for companies was simulations and optimization. An example would be Google using AI to manage electricity usage within its datacenters. This bucket accounted for another 30% of use cases.
- Within the NLP space, LLMs are the hot thing. Companies are adopting LLMs in two ways: building their own in-house LLMs, or buying SaaS contracts that provide access to tools like ChatGPT. Both approaches have seen an explosion of adoption since ChatGPT was released. This growth is X,XXX% YoY.
- After a model is trained, it goes through a process called logging. Logging basically tests the model for accuracy before it is deemed a production candidate. To understand how easy it is to develop a new production-ready model, Databricks compared the number of models that were logged (the testing phase) to the number that were registered (released to production). They found a ratio of 2.9 : 1, meaning that about 34% of models that have been trained and tested end up making it to production. This ratio has improved significantly from a year ago, when it was closer to 5 : 1 (about 20%). This implies the tooling to develop and productionize models is improving.
- As companies' usage of LLMs explodes, we are also seeing XXX% growth in the infrastructure tools used to manage data. A few of the faster-growing tools that Databricks identified were dbt, Fivetran, Informatica and Qlik. These products solve a large set of problems, such as cleaning, loading, transforming, moving and indexing data, and extracting information from it.
- Data integration is the process of moving, cleaning and transforming data from multiple sources into a single data lake. That data lake then serves as input data for training AI models. The data integration market is growing at roughly 100% YoY.
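
As a quick sanity check on the logged-to-registered ratios in the notes above, here is a minimal Python sketch that converts an "N : 1" ratio into the share of logged models that end up registered. The function name `production_share` is my own, not something from the report, and the only inputs are the two ratios quoted above.

```python
def production_share(logged_per_registered: float) -> float:
    """Turn a logged:registered ratio of N : 1 into the fraction of
    logged models that get registered (hypothetical helper)."""
    if logged_per_registered <= 0:
        raise ValueError("ratio must be positive")
    # N logged models for every 1 registered model means 1/N make it through.
    return 1.0 / logged_per_registered


# Ratios quoted in the notes above.
this_year = production_share(2.9)
last_year = production_share(5.0)

print(f"2.9 : 1 -> {this_year:.0%}")  # prints "2.9 : 1 -> 34%"
print(f"5 : 1 -> {last_year:.0%}")    # prints "5 : 1 -> 20%"
```

So the move from roughly 5 : 1 to 2.9 : 1 corresponds to going from about one in five logged models reaching production to about one in three.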