Good Data = Good GenAI

It may not be flashy, but AI readiness is all about the basics — the format and quality of corporate data — according to a survey of 334 data leaders

MIT IDE

Published in MIT Initiative on the Digital Economy

Apr 15, 2024

By Thomas H. Davenport and Priyanka Tiwari

Courtesy of Harvard Business Review

Many organizations are excited about generative AI, and they are mobilizing to take advantage of it. Boards of directors are having educational workshops and encouraging their companies to act. Senior management teams are thinking about what use cases to develop. Individuals and departments are experimenting with how the technology can increase their productivity and effectiveness.

The work that really matters to the success of generative AI, however, falls to chief data officers (CDOs), data engineers, and knowledge curators. And unfortunately, many haven’t even started to do the needed work. A survey of 334 CDOs and data leaders conducted in the second half of 2023 (sponsored by Amazon Web Services and the MIT Chief Data Officer/Information Quality Symposium), along with a series of interviews with these executives, found that while they are as excited about generative AI as everyone else, they have much work to do to get ready for it.

In terms of data preparedness, in particular, companies have not yet created new data strategies or begun to manage their data in the ways necessary to make generative AI work for them.

We’ll describe the results of the survey and what they suggest for next steps with data.

Maximizing Value

AI historically worked with structured data, typically numbers in rows and columns. But generative AI uses unstructured data — text, images, even video — to generate new or recombined forms of unstructured data. It has provided both assistance and competition for human content creators. CDOs and data leaders share the enthusiasm for the technology: 80% agreed that it would eventually transform their organization’s business environment, and 62% said their organizations were planning to increase spending on generative AI. The technology has also brought their roles a lot of attention.

However, the respondents indicated that most are not yet getting substantial economic value from their genAI usage. When asked how their organizations were addressing generative AI, almost a third said they were “experimenting at the individual level,” rather than creating use cases for the enterprise. An additional 21% indicated that they were experimenting, but with usage guidelines for employees. A slightly smaller percentage, 19%, were experimenting at the department or business unit level.

Most importantly, only 6% had a generative AI application in production deployment.

A surprising 16% noted that their organizations had banned use by employees, though that approach seems to be decreasing over time as companies handle data privacy issues with enterprise versions of generative AI models.

Experiments with genAI should focus on core aspects of the business. One company that is experimenting aggressively with generative AI for R&D is Universal Music. The company’s strong interest in the technology is not surprising, since generative AI can create music, write lyrics, and imitate the voices of artists. Universal is exploring how generative AI can be used in ways that protect the intellectual property rights of artists. It has another experiment underway to create AI versions of song tracks using Universal artists’ voices (with their permission and participation). It has also engaged in legal action against an AI provider to protect its artists’ copyrights.

Policies and proofs of concept can be useful, but they don’t generate economic value.

For generative AI to be truly valuable to companies, they need to customize vendors’ language or image models with their own data, and do the internal work to prepare their data for that integration.

Data Prep is Key to Success

The relatively unstructured data that generative AI uses needs to be well-curated for accuracy, newness, uniqueness, and other attributes if generative AI models employing it are to be highly useful. Poor-quality internal data will yield poor-quality responses from genAI models.

Many data leaders in our survey echoed this challenge: 46% identified “data quality” as the greatest challenge to realizing genAI’s potential in their organizations.

Jeff McMillan, the chief data, analytics, and innovation officer at Morgan Stanley Wealth Management — an early adopter of genAI — described some of the issues involved:

“Every single piece of research content has to be reviewed by a registered compliance person, so we know the training content is of very high quality. Even in non-research content, we have a team that scores individual submissions on issues like tagging requirements, broken links, presence of a summary up front, and we give each document a grade.
We also had to spend a lot of time thinking about different content sets and optimizing the results…These large language models do not solve the problem of disparate data sources. Companies need to address data integration and mastering before attempting to access data with generative AI.”
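
As a rough illustration of the kind of document grading McMillan describes, the sketch below scores a document on the three checks he names: required tags, broken links, and a summary up front. Everything specific in it is an assumption made for illustration, not Morgan Stanley's process: the `Document` class, the `REQUIRED_TAGS` set, the "Summary:" convention, the point deductions, and the grade cutoffs are all hypothetical, and the list of dead links is assumed to come from a separate link checker.

```python
import re
from dataclasses import dataclass, field

# Hypothetical rubric: the three checks come from the quote above, but the
# tag names, point deductions, and grade cutoffs are illustrative assumptions.
REQUIRED_TAGS = {"topic", "audience", "review_date"}
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


@dataclass
class Document:
    title: str
    body: str
    tags: dict = field(default_factory=dict)


def score_document(doc, dead_links):
    """Return (score out of 100, letter grade, list of issues found).

    dead_links is a set of URLs already known to be broken, assumed to be
    produced by a separate link-checking job.
    """
    issues = []
    score = 100

    # Check 1: tagging requirements.
    missing = sorted(REQUIRED_TAGS - set(doc.tags))
    if missing:
        issues.append("missing tags: " + ", ".join(missing))
        score -= 10 * len(missing)

    # Check 2: broken links in the body.
    broken = [url for url in URL_PATTERN.findall(doc.body) if url in dead_links]
    if broken:
        issues.append("broken links: " + ", ".join(broken))
        score -= 15 * len(broken)

    # Check 3: a summary up front (assumed convention: first paragraph
    # starts with "Summary:").
    first_paragraph = doc.body.strip().split("\n\n", 1)[0]
    if not first_paragraph.lower().startswith("summary:"):
        issues.append("no summary up front")
        score -= 20

    score = max(score, 0)
    if score >= 90:
        grade = "A"
    elif score >= 75:
        grade = "B"
    elif score >= 60:
        grade = "C"
    else:
        grade = "D"
    return score, grade, issues
```

A team could run a check like this across a content set and send low-graded documents back to their owners before any of that content is exposed to a genAI model.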

Most data leaders have not begun to make the needed changes in their data strategies. While 93% of survey respondents agreed that data strategy is critical for getting value from genAI, 57% said they had made no changes thus far in their organizations’ data. Only 37% agreed, and only 11% agreed strongly, that their organizations have the right data foundation for genAI. In other words, most organizations have a considerable amount of work to do in order to apply genAI broadly in their businesses.

Those data leaders who have begun to make changes have focused on a few specific tasks. Data integration or cleaning of datasets is being undertaken by a quarter of the organizations. Eighteen percent are surveying data that might support the use of genAI. And 17% are curating documents or text to prepare it for domain-specific genAI models.

Walid Mehanna, the group chief data and AI officer at Merck Group, emphasized the importance of several types of data preparation: “If we want to do AI, we need to build it on concrete, not quicksand. We are getting the process and data supply in good shape. We’re working on a data inventory and catalog, a data fabric with a new data structure and metadata layer, data pipelines, and ad-hoc self-service insights generation. We believe that generative AI will be a key way to create insights from data in the future.”

Raj Nimmagadda, the chief data officer for R&D at Sanofi, also said that the company is preparing its data for genAI, and told us that success with genAI “hinges on high-quality, ‘business-ready’ data, which is guided by a robust data foundation, data governance, and standards.” Sanofi, he said, is currently investing in building these core capabilities across its R&D.

As with data transformation in general, for most organizations it will be a monumental effort to curate, clean, and integrate all unstructured data for use in genAI applications. Therefore, most companies should focus on particular data domains where they expect to implement genAI in the near future. Companies might have a longer on-ramp to using and creating value from genAI than many hoped.

In the survey, the most common business areas in which data leaders were prioritizing genAI development were (in order): customer operations such as customer support or chatbots; software engineering/code generation; marketing and sales activities such as personalized marketing campaigns or sales offerings; and R&D/product design and development. Many organizations were also pursuing general personal productivity applications of genAI, but these are not likely to involve a specific data domain.

Time to Start is Now

While we believe that data leaders should increase the pace of preparing their data for genAI, we also acknowledge that there are other important data projects, including improving transaction data and making data available for traditional analytics and machine learning applications. Indeed, one reason for the somewhat slow pace may be that 71% of the CDOs agreed that “generative AI is interesting, but we are more focused on other data initiatives to deliver more tangible value.”

Given the level of excitement about generative AI, this survey result is somewhat surprising. In the 2022 survey, CDOs were under pressure to deliver value quickly, but also to facilitate implementation of genAI. We suspect that they are pivoting a bit slowly from managing and improving structured data to unstructured content. In addition, there is contention over who leads generative AI within their companies; CDOs are competing with CIOs, CTOs, and chief digital officers for leadership of the hot new technology.

If genAI is going to transform the organization, it doesn’t make sense to wait to start preparing data. We, and most of the survey respondents, agree that genAI is a transformative capability. The job of making a large organization’s important data ready for AI could easily take several years. The time to start is now.

Thomas H. Davenport is the President’s Distinguished Professor of Information Technology and Management at Babson College, a visiting scholar at the MIT Initiative on the Digital Economy, and a senior adviser to Deloitte’s AI practice. He is a coauthor of All-in on AI: How Smart Companies Win Big with Artificial Intelligence (Harvard Business Review Press, 2023).

Priyanka Tiwari is a product marketing leader at Amazon Web Services (AWS). She focuses on connected storytelling across AWS databases, analytics, and machine learning services and solutions.

Written by MIT IDE