Cutting-Edge Use of GPT-4 and Claude in Data Documentation Produces Mind-Blowing Results

Imagine spending hours meticulously documenting data tables, only to redo it all with each minor update. This was our reality at Included Health, until we embraced a game-changing solution. Now, we’re not just managing data; we’re revolutionizing its understanding by leveraging Large Language Models (LLMs).

The Pre-LLM Era at Included Health

Data documentation was our Achilles’ heel. Picture this: a data table, crucial to decision-making, yet devoid of any comprehensible documentation. The result? Prolonged hours of confusion and distrust among our teams. So much so that 60% of Included Health analysts indicated data discovery was their #1 problem. A staggering 47% identified the lack of documentation as a significant barrier to their work. Even our senior and middle management reported that meeting basic Level 1 Governance standards was a herculean task due to these documentation demands. That’s not too surprising given we have thousands of datasets ranging from member claims data to member click-through data. At Included Health, data-driven insights drive the value we produce for our members. We gather data on everything, so we can make better decisions to improve the health journey of our members.

In a previous blog post, we talked about how at Included Health, data is integral to our mission of raising the standard of care for everyone. We have a variety of datasets documenting details about all patient interactions with our products, and stakeholders from across the company use this data to inform decisions. Data quality matters because even when pipelines are healthy, the data values themselves may have changed, such as through a distributional shift or an uptick in NULLs. Data quality impacts trust in the data driving decisions. Understanding also impacts trust. Knowing how a dataset came to be, what decisions were made, and what opinions are embedded in it allows someone to decide whether the data is suitable, and can be trusted, to inform decisions that impact ourselves, our business, and most importantly, our members.

Discovering, trusting, and understanding our data is a cornerstone of our data governance program at Included Health. Meeting that standard was expensive: a table seeking to pass data governance required up to 4 hours of a skilled analyst’s time for initial documentation, plus an additional 30 minutes for every subsequent change. More alarmingly, under tight deadlines, documentation was often skipped, leaving most of our tables largely undocumented.

Introducing Our LLM Heroes

Recognizing that human nature resists mundane, repetitive tasks, we turned to Large Language Models (LLMs). Through a series of explorations followed by A/B testing with our internal users, we found that users preferred a combination of GPT-4 and Claude 2. GPT-4 was preferred for table and logic summarization, whereas users preferred Claude’s straightforward write-ups for field descriptions.

The need to be model-agnostic and take the best from the models available is a finding we’ve uncovered internally elsewhere as well. Our Data Science team created Wordsmith (which we’ll describe in a subsequent post) to address this issue. Using Wordsmith, we could delegate common infrastructure concerns like model choice, proxy calls, and budgeting. Wordsmith allows any developer at Included Health to build applications on the best model for the job at hand.

Wordsmith is our platform for interacting with models from different providers. It can take in raw text input from our online services, internal data platform, or ad-hoc requests to the internal API. It then calls models from different providers (in this case, GPT-4 through Microsoft Azure and Claude through AWS Bedrock) to return output for tasks like text summarization and classification. For documentation, the input was the SQL projection logic, and the output was a logic summary, table documentation, and field documentation. We were able to protect our IP by excluding instance data and by using our internal Wordsmith platform, which enforces the contractual usage agreements and terms between our business and the model providers.
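To make the flow above concrete, here is a minimal sketch of routing each documentation task to the preferred model. It is not Wordsmith's actual API, which this post does not show: the task names, prompt wording, model keys, and the `document_projection` function are all hypothetical, and each model client is assumed to be a plain function that takes a prompt and returns text. Only the SQL logic is sent, never instance data.

```python
from typing import Callable, Dict

# Hypothetical routing table based on the preferences described above:
# GPT-4 for logic and table summaries, Claude for field descriptions.
TASK_MODEL: Dict[str, str] = {
    "logic_summary": "gpt-4",
    "table_documentation": "gpt-4",
    "field_documentation": "claude",
}

# Hypothetical prompt templates; the real prompts are not published.
PROMPTS: Dict[str, str] = {
    "logic_summary": "Summarize the business logic of this SQL projection:\n{sql}",
    "table_documentation": "Write a table-level description for the output of this SQL:\n{sql}",
    "field_documentation": "Describe each output column of this SQL in one sentence:\n{sql}",
}

# A model client is assumed to be any callable: prompt in, text out.
ModelFn = Callable[[str], str]


def document_projection(sql: str, clients: Dict[str, ModelFn]) -> Dict[str, str]:
    """Draft all three documentation pieces from SQL projection logic only."""
    drafts: Dict[str, str] = {}
    for task, model_name in TASK_MODEL.items():
        prompt = PROMPTS[task].format(sql=sql)
        # Route the task to whichever model is preferred for it.
        drafts[task] = clients[model_name](prompt)
    return drafts


def stub_client(name: str) -> ModelFn:
    """Stand-in for a real provider call, so the sketch runs offline."""
    def call(prompt: str) -> str:
        return f"[{name}] draft based on a {len(prompt)}-character prompt"
    return call


example_sql = "SELECT member_id, COUNT(*) AS visit_count FROM visits GROUP BY member_id"
example_clients = {"gpt-4": stub_client("gpt-4"), "claude": stub_client("claude")}
for task_name, draft in document_projection(example_sql, example_clients).items():
    print(task_name, "->", draft)
```

Keeping the routing in a single table means that swapping the model for a task is a one-line change, which is the kind of model-agnostic setup described above.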

The Impact of LLMs

At Included Health, we know that effort is not a substitute for impact. We focus on causal impact that removes obstacles and enables employees to put the members first. We wanted to prove our impact and sought diverse perspectives from across all of our internal stakeholders. As we started to backload tables, we engaged our stakeholders and this is what they had to say:

“That is wildly amazing. Very impressive. Really neat project. Thanks for doing this.” – IH Actuarial
“My mind is blown actually. There are a couple of instances where the model was a bit wordy, and maybe an instance of hallucination. Everything else looks good to me.” – IH Data Scientist
“Now that I sat down to read in detail, I actually feel like the accuracy is pretty spot on.” – IH Analyst
“The logic synopsis is really helpful when a query is hard to read, or if you want to check your understanding. I like that.” – IH Data Engineering

Our users preferred a co-pilot that helps them as they create and edit tables. Table owners generate a helpful first draft, then review it and adjust any nuance the model missed. As a starting point, it reduces boilerplate and writer’s block, turning a task that sometimes takes up to two hours into a review and edit that takes only ten to twenty minutes. This speed-up of at least 6-fold (two hours down to twenty minutes or less) is now available for all existing and new projections in QueryBook.

With LLMs in action, every data table on our platform can now boast comprehensive, up-to-date documentation. This includes detailed business logic, table-level insights, and specific column information. These large language models, with their advanced text and code summarization abilities, now handle the tedious task of drafting and updating documentation. Their integration into our workflow was not just about efficiency; it was about redefining how we interact with data.

It’s a paradigm shift: users can now focus on creating valuable insights that inform decisions, rather than spending time inferring the meaning of the data. Data producers want to do the right thing and put a lot of effort into generating data, but before we rolled out LLM-based documentation, the tax on making their transformation logic understandable and meaningful for other use cases was too high. Removing that tax frees up our data professionals to focus on the fun parts of creating insights: the work that comes after the data is found, like writing SQL and code to support insight generation. It also leaves more time to tell the story with our data and ensure our clients and members are satisfied.

A Glimpse Into the Future at Included Health

We ask ourselves every day, “Is there a better way?” We know change is our opportunity and adaptation is our superpower. We work together to spark innovation and break down barriers. We encourage unconventional thinking that inspires and works to solve complex problems.

Our journey with data is evolving. It’s no longer about numbers and charts in isolation; it’s about the story they tell and the decisions they drive. With LLMs, we’ve reduced the effort needed to produce insights and made working with data more natural. By bootstrapping every data table with up-to-date documentation that serves as the starting point for data-driven decision making, we’re laying the groundwork for how analytical data is generated and reused across value streams.

This foundation will enable us to invest in further AI-supported tooling for data professionals, tooling that ensures data consistency and metric reuse. It will also expand the reach and impact of these data tables beyond analysts to anyone at IH with great ideas and the ability to drive decisions that raise the standard of health care for our members.

Conclusion

Join us at Included Health as we continue to explore the frontiers of data management and understanding. We are pushing the boundaries of how we can apply the latest tools to solve our users’ needs and change the healthcare ecosystem for the better. If you’re excited by this work, take a look at some of our open roles!

Matt Vagnoni

Matt Vagnoni, Sr Informatician at Included Health (Information Steward & Evangelist). He has spent most of his career playing with data and dreaming about AI. He has led the creation of six data platforms within new companies or large enterprises. Most recently he became a senior informatician at Included Health, creating a framework for improving trust and usability of data through a combination of evangelism, vision, and new tools like generative AI. He’s actively investigating novel writing and data governance using generative AI.

Eden Grown-Haeberli

Eden Grown-Haeberli, Software Engineer at Included Health. Driven by an unwavering passion for leveraging technology and data to transform healthcare, Eden Grown-Haeberli brings over five years of experience to her role as a Software Engineer at Included Health, where she is currently spearheading the development of a groundbreaking data platform. With a Master’s degree in Computer Science and a Bachelor’s degree in Bioengineering from Stanford University, Eden possesses a deep understanding of the intersection between technology and healthcare, empowering her to drive innovation in the field.

May Xiao

May Xiao, Software Engineer at Included Health. As a software engineer in the healthcare sector, May enjoys designing and implementing innovative solutions that bridge the gap between cutting-edge technology and patient-centric care. Her passion lies in leveraging her technical skills to contribute to the advancement of healthcare systems, ensuring seamless integration and optimal user experiences for medical professionals and patients alike.