What I’ve learned in 5 years working on medical speech technology

Erin Palm
7 min read · Oct 7, 2022

I will also be posting my writing on my new site erinpalm.com. Please enjoy!

I remember, as a resident, going slightly hoarse one day while dictating clinic notes. I would dial a transcription service, key in a dictation code, work type, and medical record number. Dictate a recording. A person would type it up and send it back 48 hours later. I had such a long list of notes to dictate, I would give myself targets and small rewards. Get through 5 more notes, get a coffee.

I also remember wondering, why do I dial this antiquated phone service? There should be an app that transcribes my note in real-time, and then I can route it to the transcriptionist to clean up if I need to. The tech must exist.

Not Dragon — I might as well type in the EMR. Something more modern, more mobile. More helpful.

I spent the last few years building that app. Suki does more than write notes, but notes and medical speech are at its core, and I’ve learned a ton bringing that simple idea to life. I want to share a few of the things I’ve learned about medical speech during the past 5 years building Suki.

In a nutshell, today’s medical speech technology is all about:

  • Dictation: The foundation, the real-time ability to turn speech into text.
  • Commands: Must be flexible, natural, and sometimes interleaved with dictation.
  • Ambient documentation: Is it real, or science fiction?
  • How ML fits in

Dictation

Using voice to write notes is an old practice in medicine. It didn’t start with Dragon. It started with the medical transcriptionist, and for the old-school among us, that’s still the service we prefer.

When using technology to transcribe speech, doctors have old-school expectations. We have been trained to say things like “period,” “comma,” and “number next.” We will spell out hard-to-spell words. We will give formatting instructions that may not be machine-interpretable, because we are used to dictating to a human.

We have very high expectations of technology.

Speech-to-text technology is fascinating, complex, and imperfect. Many smart minds and hefty computing resources are dedicated to building automatic speech recognition (ASR) models that take sound as input and write words as output. Their dual success metrics are accuracy and speed.

One surprising thing about speech models in the medical domain is that most models get the complex words right. A general-purpose ASR transcribes “total knee arthroplasty” about as well as an ASR tuned for orthopedic surgery. However, the ortho ASR will be better at transcribing the simpler words strung together in ways peculiar to medical speech. “Patient presents 2 weeks after left total knee” isn’t something most people say in general English.
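To make “accuracy” concrete: the standard way to score ASR output is word error rate (WER), the word-level edit distance between what the model wrote and what was actually said, divided by the length of the reference. Here is a minimal sketch; the two hypothesis transcripts are invented to illustrate how a general-purpose model can stumble on ordinary words arranged in a medical way.

```python
# A minimal sketch of word error rate (WER), the standard accuracy metric for ASR.
# The hypothesis transcripts below are invented for illustration only.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference   = "patient presents 2 weeks after left total knee"
general_asr = "patient presents to weeks after lift total me"   # hypothetical output
ortho_asr   = "patient presents 2 weeks after left total knee"  # hypothetical output

print(f"general-purpose ASR WER: {wer(reference, general_asr):.2f}")  # 0.38
print(f"domain-tuned ASR WER:    {wer(reference, ortho_asr):.2f}")    # 0.00
```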

Another challenge is that doctors don’t speak naturally when they’re talking to software. Some speak in a slow monotone, like a robot, pausing and waiting to see words appear. Others speak rapid-fire, like they’re listing side effects at the end of a pharma commercial (these are folks whose transcriptionists are not just saints but also mind-readers).

Pro tip: You should talk to software like it’s a human. It’s trained to understand humans.

Dictation software products also need to solve some usability hurdles, among them giving real-time feedback that the software is listening, and using context to spell names correctly.

Also, someday soon, doctors won’t be saying “period” and “next line” anymore, unless they like to. The automatic formatting and punctuation models that are now coming online will allow us to speak much more naturally. They will make it worthwhile to give up our old-school ways.

Commands

Doctors are used to medical dictation software by now, but we aren’t used to software that understands commands. My kids are used to it though. All afternoon at home, my kids are shouting things like, “Alexa play The Floor is Lava song” and “Alexa what’s 100 times a thousand?”

Like Alexa, Suki is a voice assistant. When it hears the doctor say “Suki,” it wakes up and listens for a command.

When we started building out Suki’s repertoire of commands for doctors — show a clinic schedule, insert a note template — we ran into something unique about medicine. Dictation is so foundational here that we had to build Suki to mix dictation and commands.

Mixing dictation and commands is hard, but makes sense for doctors. You can be dictating and want to leverage a pre-built template or reference some patient data for your note. Then you immediately want to start dictating again. To enable this, the software has to understand that when you’re saying a command, you don’t want the command written down in the transcript.
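Here is a toy sketch of the kind of routing this implies, under the simplifying assumption that a command is any utterance that starts with the wake word. The wake word handling, the commands, and the handlers are illustrative assumptions, not Suki’s actual implementation.

```python
# A toy sketch of mixing dictation and commands: utterances that start with the
# wake word are routed to a command handler and kept out of the note transcript.
# The wake word, commands, and handlers are illustrative assumptions.

WAKE_WORD = "suki"

def handle_utterance(utterance: str, transcript: list[str]) -> None:
    words = utterance.strip().split()
    if words and words[0].lower().rstrip(",") == WAKE_WORD:
        command = " ".join(words[1:]).lower()
        run_command(command)              # e.g. insert a template, show the schedule
    else:
        transcript.append(utterance)      # plain dictation goes into the note

def run_command(command: str) -> None:
    if "insert" in command and "template" in command:
        print("[inserting note template]")
    elif "show" in command and "schedule" in command:
        print("[showing clinic schedule]")
    else:
        print(f"[unrecognized command: {command}]")

note: list[str] = []
handle_utterance("Patient presents 2 weeks after left total knee.", note)
handle_utterance("Suki insert my post-op template", note)
handle_utterance("Wound is clean, dry, and intact.", note)
print(note)  # only the dictated sentences appear, not the command
```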

Because dictation software is already widespread in medicine, when you want to introduce other ways to use voice, you need to think about whether the flow works well with dictation.

Our physician users keep nudging the Suki team to build voice features that directly mimic keyboard and mouse actions — select the last word, move the cursor after the word “lava.” Be wary of using voice exactly like a keyboard. We can expect better than this from a natural language interface. We should stretch our imaginations beyond what the keyboard and mouse do today.

Flexible commands are an admirable goal, but what about cases where the user can say the exact same words to mean different things? Take the command “Suki I’m done.” The doctor could be done viewing their schedule, done dictating a note, or done eating lunch for that matter. At first glance, this might seem impossible to figure out. But the software knows a lot about the doctor, what they’re doing right now, and what they probably want to do next.

Many times, language is not enough in isolation. The software has to use context to make smarter decisions.
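As a sketch of what “using context” can look like in code: the same phrase resolves to different actions depending on application state. The states and actions below are hypothetical, chosen only to illustrate the idea.

```python
# A minimal sketch of context-aware command resolution: the same words map to
# different actions depending on application state. States and actions are hypothetical.

def resolve_im_done(app_state: str) -> str:
    """Decide what 'Suki, I'm done' means given what the doctor is doing right now."""
    if app_state == "viewing_schedule":
        return "close_schedule"
    if app_state == "dictating_note":
        return "finalize_note"        # stop dictation and mark the note ready for review
    return "ask_for_clarification"    # context is ambiguous, so ask instead of guessing

print(resolve_im_done("dictating_note"))    # finalize_note
print(resolve_im_done("viewing_schedule"))  # close_schedule
```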

Suki set a high bar for flexible, natural commands that work with different phrasing and vocabulary, in all different contexts. There is still a long way to go. For example, can the speech models recognize commands without the wakeword “Suki”? When talking to a colleague, we don’t constantly say their name before every sentence.

With commands — as with dictation — doctors should be able to speak like humans, not robots.

Ambient documentation

There is a holy grail in medical speech called ambience. The quest to achieve it is honestly a bit silly.

The first step to achieving ambience is to transcribe a conversation between 2 or more speakers. This is a hard problem, and because it’s hard, it’s very interesting to speech engineers. Smart people have built some very good models. However, solving this problem (speaker diarization) does not deliver an ambient product.

Once the doctor-patient conversation is transcribed, then a text summarization model has to turn the conversation into a coherent medical note. This is actually the harder part. It’s not clear that these models can mimic the doctor’s thought process, and why should they be able to? We learned a lot in medical school, residency, and years of practice.
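To see why the second step is the harder one, here is a sketch of the shape of the pipeline: speaker-labeled segments come out of diarization and ASR, and a summarization step has to turn them into note sections. The conversation below is invented, and summarize_to_note() is a placeholder for the genuinely hard part, not a real model.

```python
# A sketch of the two-step ambient pipeline. The diarized conversation is invented,
# and summarize_to_note() is a placeholder for the hard summarization step.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # output of speaker diarization: who said it
    text: str      # output of ASR: what they said

conversation = [
    Segment("doctor",  "How is the knee feeling since surgery?"),
    Segment("patient", "Much better, just some stiffness in the morning."),
    Segment("doctor",  "Any fever, redness, or drainage from the incision?"),
    Segment("patient", "No, nothing like that."),
]

def summarize_to_note(segments: list[Segment]) -> dict[str, str]:
    """Placeholder for the summarization model that has to mimic the doctor's reasoning."""
    history = " ".join(s.text for s in segments if s.speaker == "patient")
    return {
        "Subjective": history,  # naive: just the patient's words strung together
        "Assessment": "TODO: requires clinical judgment, not just transcription",
    }

print(summarize_to_note(conversation))
```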

Teams working on ambient technology offer a scribe-like experience. In fact, while they work on the actual tech, they use scribes to create the experience. I have no problem with this at all — it’s a great service. I am just not sure when, if ever, it will actually be tech.

To make it easier for the tech to figure things out, I have heard technology teams suggest that the doctor should interact with the patient in a certain way: speaking out the various thoughts and findings needed to create the note, phrased in ways the software can pick up.

Do we really think doctors are going to change the way they talk to patients, in order to build a note? Isn’t this sort of negating the reason we want ambience, which is to let doctors be doctors, while tech does the documenting?

Let’s not get too fixated on ambience. As a physician, when I am with a patient, I am making observations and forming an assessment and a plan. At some point, I will have to record it in the medical record. There are many ways to do that — and many ways voice technology can help.

How does ML fit in?

Understanding speech in the medical domain is challenging and cutting-edge. Most of what I’ve written about here involves machine learning. Today, Suki’s ML platform transcribes speech and recognizes commands, with speed and accuracy.

We’ve been working on this for 5 years, and there’s a lot to be proud of. But of course, there’s more to do. Doctors expect a lot.

The team has some great ideas about personalization. How can the assistant be more helpful, more accurate, smarter by using what we know about you, your preferences, and your patients? They also have some far-fetched ideas, like building DALL-E for doctors. Instead of creating Impressionist-style paintings of cats on the moon, could Suki listen to the doctor say a few sentences about a patient, and flesh that out into a polished note?

Building medical speech products has taught me that we are just scratching the surface. These are difficult and exciting technical problems with big rewards on the other side. I can’t wait to see what’s next.


Erin Palm, MD FACS is a general surgeon, critical care specialist, and former Head of Product at Suki AI.