The Adventure Begins

It was January 2023, and I was discussing with my manager, Rick, what we wanted the Data Science (DS) Platform team to work on in the next quarter. He mentioned that I should block out some of my time to look into recent developments in natural language processing (NLP), since the field had advanced a lot since our DS team last deployed a major NLP project in 2021. I didn’t know it then, but I was about to embark on what has been the most exciting and rewarding project of my career so far.

The GenAI boom was just starting to heat up: ChatGPT had launched in November 2022, and GPT-4 had yet to be released. The time was ripe to start exploring the possibilities of what these new technologies could do for our company.

Early Experiments

I finally got started on our NLP 2.0 exploration in March 2023. This was excellent timing: GPT-4 was released on March 14. During that week, I tested GPT-4 on several tasks that my coworker had previously tested with GPT-3.5. While GPT-3.5 had failed these tasks just the week before, GPT-4 completed them successfully. It was a major breakthrough, opening up product applications that simply hadn’t been feasible before. Looking back, I’m still amazed at what GPT-4 could do on tasks I was experimenting with for the first time.

In April I gave a presentation to our DS team that I subtitled: “What have I been doing for the past month?” I gave an overview of the ideas we had been brainstorming along with some Product Managers, focusing on the three main ideas that we thought were most promising:

  1. Combining large language models (LLMs) with ongoing work on indexing the Knowledge Base that is referenced when members contact us for support
  2. Intent and Action Classification: improving our text classification using LLMs
  3. Automatic generation of notes from calls and chats: speeding up manual documentation that care coordinators were writing after each encounter with a member.

At this point, Rick really wanted to know which of these ideas we should pursue first.

The Birth of Wordsmith

By May, we had done enough initial validation of these different ideas that we knew we wanted to start investing in some kind of platform for LLMs and GenAI, which I named Wordsmith.

At a high level, Wordsmith was meant to do anything you wanted with text data, and to be so easy to use that other teams could just treat it as a black box. This is the first diagram from my original plan, and one I still frequently reference:

Inside that black box, there would be several parts. In the plan, I laid out the five main components of the platform:

  1. NLP Proxy Service and Client Library
  2. Prompt Tooling
  3. Training Library
  4. Evaluation Library
  5. Model Inference/Model Serving System

Development of the platform has generally proceeded along this breakdown, so I’ll go over each component and the progress that the team has made so far:

Proxy Service and Client Library

Given the strong abilities of API-based foundation models, as well as the growing number of models available from different providers, this has been the heart of our LLM platform thus far.

When we were starting out, there were a few libraries that provided a single interface for making LLM calls and would send your request to different providers from the client side. However, I decided early on that the logic of routing between model providers would be handled on the server. Instead of a request going from a Data Scientist’s local environment directly to Google or OpenAI, all LLM requests at Included Health pass through a single internal service called wordsmith-proxy, which then routes them to the appropriate provider.

Proxying on the server side has several practical benefits. The first is that users are insulated from churn as we work on routing logic and add providers: they don’t have to upgrade a client package to pick up each new change. It also makes provider credential management easier: instead of trying to figure out how to securely distribute separate credentials to each user, we can set up a single OpenAI API key, a single Google service account credential, and so on, all belonging to the proxy service.

How we actually implemented this: on the server, we followed the spec for the OpenAI API, which they publish in OpenAPI format. I figured that OpenAI probably had already written a better API for communicating with LLMs than I could come up with. Because we’re serving the same API as OpenAI, our Data Scientists and other internal users can just use the OpenAI Python SDK, and point the base URL at our internal endpoint.
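
To make this concrete, here’s a minimal sketch of what a call through the proxy looks like from the Data Scientist’s side. The internal URL, token, and model name are illustrative placeholders; the pattern is simply the standard OpenAI Python SDK with base_url overridden:

```python
from openai import OpenAI

# Point the standard OpenAI SDK at the internal proxy instead of api.openai.com.
# The URL and token below are placeholders for our internal setup.
client = OpenAI(
    base_url="https://wordsmith-proxy.internal.example.com/v1",
    api_key="internal-token",  # provider credentials live in the proxy, not here
)

response = client.chat.completions.create(
    model="azure:gpt-4-turbo",  # provider prefix tells the proxy where to route
    messages=[{"role": "user", "content": "Summarize this member chat transcript: ..."}],
)
print(response.choices[0].message.content)
```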

When the request gets to our server, we translate it from OpenAI format into a request to one of four different providers: OpenAI, Google VertexAI, AWS Bedrock, and internal (wordsmith-serving). In the case of OpenAI, obviously, the translation is just a pass-through function: no modifications are needed.
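
The routing step itself amounts to splitting the provider prefix off the model name and applying a provider-specific translation. The sketch below is a hypothetical, heavily simplified illustration of that dispatch, not the actual wordsmith-proxy code:

```python
# Hypothetical sketch of the proxy's dispatch step: strip the provider prefix
# from the model name, translate the body if needed, and forward the request.
def passthrough(request: dict) -> dict:
    # OpenAI (and Azure OpenAI) requests are already in the right format.
    return request

def to_vertex(request: dict) -> dict:
    # Map OpenAI-style messages onto a Vertex AI-style request (simplified).
    return {"contents": [{"role": m["role"], "parts": [{"text": m["content"]}]}
                         for m in request["messages"]]}

TRANSLATORS = {
    "openai": passthrough,
    "azure": passthrough,
    "google": to_vertex,
    # "aws" and "internal" would each have their own translation functions.
}

def route(request: dict) -> tuple[str, dict]:
    provider, _, model = request["model"].partition(":")
    return provider, TRANSLATORS[provider]({**request, "model": model})
```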

For our Data Scientists, this architecture means that they can switch between model providers without needing to write any new code, just by changing the model name from a string like “azure:gpt-4-turbo” to “google:gemini-pro”. It also makes us cross-language out of the box: our engineering team mostly writes in Go, and recently integrated with Wordsmith Proxy using the Go OpenAI library.

I believe that making this choice for our platform has paid off. Since then, open-source LLM server tools have popped up that use the OpenAI schema as a common interface. If I were starting today, I would strongly consider using something like LiteLLM. Additionally, more client tools like LlamaIndex are supporting connections to “OpenAI-like” APIs out of the box.

Model Inference/Model Serving System

For online serving, we deployed MLServer internally, and use the HuggingFace runtime to serve models. This service is connected to our internal deployment of MLFlow, and it can download model artifacts from there. For batch inference, my teammate Amogh Nalwaya built out our wordsmith-inference system, which integrates with our data warehouse to allow users to launch jobs that apply LLMs over text data stored in the tables.
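
As a rough illustration of how a model reaches that serving path, a Data Scientist can log a HuggingFace pipeline to MLflow, and the serving system then pulls the artifact from the registry. This is a hedged sketch; the model and registry names are placeholders:

```python
import mlflow
from transformers import pipeline

# Log a HuggingFace pipeline to the MLflow tracking server so the serving
# system can later download the artifact. Model and names are examples only.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

with mlflow.start_run(run_name="triage-classifier"):
    mlflow.transformers.log_model(
        transformers_model=classifier,
        artifact_path="model",
        registered_model_name="triage-classifier",
    )
```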

Training Library

At the outset, our Kubernetes infrastructure didn’t support GPUs, so the first task was to work with our Infrastructure and Data Infra teams to get that working. We use Karpenter for provisioning nodes in our compute cluster, so we worked to set up a new Provisioner to create GPU nodes. I also deployed the nvidia-device-plugin so that we could request GPUs as Kubernetes resources.

On the software side, we’re using HuggingFace transformers. Most of what I worked on for training was applying the basic examples from HuggingFace to our particular datasets, and experimenting with some more advanced techniques like LoRA.
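
For example, a typical fine-tuning run follows the standard transformers Trainer recipe, with LoRA layered on via the peft library. This is a generic sketch rather than one of our actual training jobs; the base model, dataset, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "distilbert-base-uncased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Wrap the base model with LoRA adapters so only a small set of weights is trained.
lora_config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                         lora_dropout=0.05, target_modules=["q_lin", "v_lin"])
model = get_peft_model(model, lora_config)

# Public dataset standing in for one of our internal text classification datasets.
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,
)
trainer.train()
```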

Evaluation Library

The wordsmith-evaluation library works as a wrapper around HuggingFace evaluate. You can use it with any of the existing metrics in the evaluate library, and we’ve also integrated some open-source metrics that aren’t in the library, as well as custom metrics that we’ve developed in house.
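
In practice it feels much like calling evaluate directly; the snippet below shows the underlying HuggingFace evaluate API that the wrapper builds on (the wrapper’s own interface isn’t shown here):

```python
import evaluate

# Load a standard metric from the HuggingFace evaluate library.
rouge = evaluate.load("rouge")

predictions = ["the member was advised to schedule a follow-up visit"]
references = ["member advised to schedule a follow-up appointment"]

# compute() returns a dict of scores, e.g. {"rouge1": ..., "rouge2": ..., "rougeL": ...}
scores = rouge.compute(predictions=predictions, references=references)
print(scores)
```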

Prompt Tooling

Jack Sullivan led development of a Python library that abstracts away the boilerplate of prompting. For example, it makes it easy for users to generate prompts from string templates, a common need in LLM applications. It also exposes a simple interface for users to provide examples for few-shot prompting, and the library can then select the desired number of examples using semantic similarity.
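
To give a feel for the pattern (this is not the library’s actual interface), a hypothetical sketch of template filling plus similarity-based example selection might look like the following, with sentence-transformers standing in for the embedding model:

```python
from sentence_transformers import SentenceTransformer, util

TEMPLATE = "Classify the member's intent.\n\n{examples}\nMessage: {message}\nIntent:"

# A pool of labeled examples the library can draw few-shot examples from.
few_shot_pool = [
    ("I need to find a dermatologist near me", "find_provider"),
    ("How much will this procedure cost?", "billing_question"),
    ("Can you reschedule my appointment?", "scheduling"),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(message: str, k: int = 2) -> str:
    # Pick the k pool examples most semantically similar to the incoming message.
    pool_texts = [text for text, _ in few_shot_pool]
    scores = util.cos_sim(embedder.encode(message), embedder.encode(pool_texts))[0]
    top = [int(i) for i in scores.argsort(descending=True)[:k]]
    examples = "\n".join(f"Message: {few_shot_pool[i][0]}\nIntent: {few_shot_pool[i][1]}"
                         for i in top)
    return TEMPLATE.format(examples=examples, message=message)

print(build_prompt("Is an MRI covered under my plan?"))
```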

What we’re doing with it

There’s no sense in building a platform that nobody uses. So I’ve been excited to see the growing adoption of Wordsmith at IH, and I want to tell you a little bit about some of the applications that my teammates have been working on. (Although not too much, because I expect they will each get their own highlight on our blog soon enough!)

Automatic Documentation of Data Warehouse Tables

Matt Vagnoni and the rest of our Data Tools team have been great early adopters and collaborators on the Wordsmith platform. They used LLMs to improve documentation of our data warehouse, and have already documented their use case in its own post: Cutting-Edge Use of GPT-4 and Claude in Data Documentation Produces Mind-Blowing Results.

Ghostwriter

This project automatically generates documentation for care coordinators after member interactions. It now covers both chats and calls, using transcription capabilities via a Whisper model hosted on Wordsmith Serving.

Coverage Checker

This is the first GenAI use within the IH app, answering insurance plan questions by retrieving relevant documents. It’s been released for internal testing and rolled out to the first external customer.

Clinical Scribe

This tool automates clinical documentation to enhance healthcare provider efficiency. It supports real-time visit transcription and generation of medical documentation, including SOAP notes.

Records Collection

This system uses LLMs to automate the process of gathering medical information for Expert Medical Opinion services. It parses, reformats, and filters records based on relevance to specific EMO cases.

ChatIH

This started as a solo project of mine that won 2nd place at a past internal hackathon. The idea was to take the work we had already done on the backend with wordsmith-proxy, and connect a deployment of the open-source ChatUI from HuggingFace as a frontend, so that more internal employees could access LLMs without having to write Python code.

We’ve now rolled it out to 400 internal users! The “Assistants” feature has spawned a number of helpful productivity assistants, like Meeting Summarizer, which summarizes Google Meet transcripts, and IH Dejargonizer, a play on the popular Dejargonizer GPT but with our own Included Health acronyms.

Lessons Learned

If you are tackling a similar challenge at your company, here are some of the main lessons I have learned from building Wordsmith over the past year.

Be Flexible

This is by far the most important. Nobody can have a 12-month roadmap for their GenAI platform right now, because nobody (except maybe Sam Altman) has any idea what’s going to happen in this space in 12 months. And that’s just the external environment; you also have to plan for developments internal to your company. Last summer I was concentrating on enabling fine-tuning and self-hosting because I thought we were months away from any kind of deal that would let us send HIPAA data to OpenAI. Then Google Cloud launched VertexAI with PaLM, and all of a sudden we had access to cloud-based LLMs. Fast forward six months, and we had LLMs from three different cloud providers (Google VertexAI, AWS Bedrock, and Azure OpenAI) to manage.

Be Modular 

This goes along with the point about flexibility. Yes, I wanted to build a platform to do anything that could be done with text data and LLMs, and that was an ambitious goal. But I also knew it wouldn’t fit into just one service or library, and probably not in one code repository either. By building a set of composable tools, we empowered the user to select just what they needed to accomplish the task at hand. On the development side, it allowed us to better organize our time and resources, and pivot to work on new priorities as needed.

Open Source, Open Source, Open Source 

Open source tools are a platform builder’s best friend. Whenever I find an open source tool that does something I want to do, it almost feels like cheating. I have tried very hard to use open source and/or common interfaces wherever possible, which ultimately improves our ability to integrate Wordsmith with external tools.

Engage Legal and Security Early

If you work at a company in a regulated industry and want to do something with GenAI, and you haven’t already started engaging your Legal and Security teams, do that now! Through this project, we’ve become much more effective at working quickly with our Security team. GenAI introduces even more wrinkles in these areas on top of those that earlier machine learning techniques already had, so be sure to account for completing reviews and paperwork in your project timeline.

Wordsmith Future

Development of Wordsmith is still very active, because this space is still changing rapidly. For example, when AWS Bedrock announced their Converse API, I was able to refactor our AWS support to simplify it and reduce the amount of provider-specific logic. We support Structured Outputs for OpenAI models, and Google recently announced that controlled generation is generally available, so we will be working to support that in the proxy and bring Google models to feature parity. We’re also looking at adding GPU support to Wordsmith Serving to meet the scaling needs of our transcription inference.
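
To illustrate what Structured Outputs support looks like from the client side, here is a hedged example using the standard OpenAI response_format parameter; the client setup and model name are the same kind of illustrative placeholders as in the earlier proxy sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="https://wordsmith-proxy.internal.example.com/v1",
                api_key="internal-token")  # placeholder proxy setup

response = client.chat.completions.create(
    model="azure:gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Extract the topic of this member question: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "topic_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"topic": {"type": "string"}},
                "required": ["topic"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string conforming to the schema
```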

In terms of big ideas, there are a few that weren’t in the original plan for Wordsmith that we have started development on, and that will likely be key components in the future.

Tool Calling and Agents

wordsmith-tools and wordsmith-agents

This was not on my radar at all when I wrote the first plan for Wordsmith. But if you ask most people in the industry now, the next big thing is LLM agents: LLMs that can call external tools to perform their tasks. We’ve started to develop infrastructure around this pattern, including wordsmith-tools, a simple way to register endpoints with an OpenAPI spec so they can be called by LLMs, and wordsmith-agents, which allows users to configure LLM agents. These new services have been used to configure an agent in ChatIH that can search our internal Confluence documentation; that agent is now used by over 80 of our engineers.
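
Because the proxy speaks the OpenAI API, tool definitions follow the standard function-calling format. Here is a hedged sketch of what a Confluence-search style tool might look like from the client side; the function name, parameters, and endpoint are made up for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="https://wordsmith-proxy.internal.example.com/v1",
                api_key="internal-token")  # placeholder proxy setup

# Hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "search_confluence",
        "description": "Search internal Confluence documentation",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="azure:gpt-4-turbo",
    messages=[{"role": "user", "content": "How do I request a new GPU node pool?"}],
    tools=tools,
)

# If the model decided to call the tool, the calls show up on the message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```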

We’re also looking into implementing the OpenAI Assistants API in wordsmith-proxy to provide a standard API for working with agents.

RAG

wordsmith-retrieval

RAG (retrieval-augmented generation) is one of the hottest ideas in LLMs right now, and we want our platform to make it easier for people to use. The idea here is to provide an API where users can simply send in a set of documents, and we create an endpoint for them that allows retrieval of those documents for RAG. Behind the scenes we would handle chunking, embedding, and the other steps as necessary. Having a unified API over RAG capabilities would also let us easily experiment with cloud-hosted RAG solutions, like Vertex AI Search or Knowledge Bases for Amazon Bedrock.
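
Behind that API, the flow we have in mind is the usual chunk, embed, store, and retrieve loop. The sketch below is a toy in-memory version of that idea, not wordsmith-retrieval itself, with sentence-transformers standing in for whatever embedding model we would actually use:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; a real service would chunk more carefully.
    return [text[i:i + size] for i in range(0, len(text), size)]

documents = ["...plan document text...", "...benefits summary text..."]
chunks = [c for doc in documents for c in chunk(doc)]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    # Cosine similarity reduces to a dot product on normalized vectors.
    query_vector = embedder.encode(query, normalize_embeddings=True)
    scores = chunk_vectors @ query_vector
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved chunks would then be inserted into the prompt sent to the LLM.
context = "\n\n".join(retrieve("Is physical therapy covered?"))
```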

Higher-level frameworks

I’m happy with the results we’ve gotten from basing our platform on the OpenAI API, and in the early stages I was hesitant to move beyond that and go all-in on a single higher-level library like LangChain. At this stage of development, though, I think we understand the value of these abstractions well enough to get more out of them. We’ve had some productive explorations of integrating LlamaIndex for indexing and retrieval, as well as for running single LLM agents. We’re also in the early stages of evaluating CrewAI and AutoGen for coordinating multiple LLM agents.