Introduction: Why Use Advanced Transcription Features?

Welcome back! So far, you have learned how to capture audio from a microphone and set up an audio analysis pipeline using the Web Audio API. Now, you are ready to take your transcription skills to the next level.

In this lesson, you will learn about advanced features that make your transcriptions more useful and accurate. These features include:

  • Segments: Breaking the transcription into smaller, time-stamped parts.
  • Prompts: Giving the model extra context to improve accuracy.
  • Multilingual Support: Transcribing audio in different languages or letting the model detect the language automatically.

These tools are especially helpful when working with long audio files, specific topics, or audio in multiple languages. By the end of this lesson, you will know how to use these features to get more detailed and helpful transcriptions.


Quick Recap: Preparing the Transcription Client

Before we dive into advanced features, let’s quickly remind ourselves how to set up the transcription client and prepare an audio stream:
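(A minimal recap sketch, assuming the official OpenAI Node SDK; the transcribeAudio helper and file path are illustrative names, not part of the API.)

```typescript
// src/services/transcriber.ts (recap sketch)
import fs from 'fs';
import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// Illustrative helper: stream a recorded audio file to Whisper and return plain text.
export async function transcribeAudio(filePath: string): Promise<string> {
  const response = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: 'whisper-1',
  });
  return response.text;
}
```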

This setup allows you to send audio data to the Whisper model for transcription. In this lesson, we will build on this foundation to use advanced options.


Working With Segmented Transcription Output

When you transcribe long audio files, it is helpful to break the text into smaller parts, each with its own start and end time. These are called segments. Segments make it easier to follow along with the audio, find specific parts, or display subtitles.
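Here is a sketch of how the request and the resulting segments might look, assuming the transcribeAudio setup above. Depending on your SDK version, the typing of segments may differ slightly, and the sample output values are invented purely to show the shape:

```typescript
// Ask Whisper for segment-level detail by switching the response format.
const response = await openai.audio.transcriptions.create({
  file: fs.createReadStream(filePath),
  model: 'whisper-1',
  response_format: 'verbose_json', // includes segments with timestamps
});

// Each segment carries its own start and end time (in seconds) plus the text.
const segments = (response.segments ?? []).map((s) => ({
  start: s.start,
  end: s.end,
  text: s.text,
}));

// Invented sample output, just to show the shape:
// [
//   { start: 0.0, end: 3.4, text: ' Welcome back to the podcast.' },
//   { start: 3.4, end: 7.9, text: ' Today we are talking about audio transcription.' },
//   ...
// ]
```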

Explanation:

  • The verbose_json format includes detailed segments with timestamps.
  • Now we're getting structured segments that include start, end, and text. This format unlocks features like timestamped captions and more granular control over UI rendering.

Guiding Transcription With Prompts

Sometimes, audio is noisy, unclear, or contains domain-specific terms. The prompt parameter helps Whisper handle this better. Think of it as setting the stage for the model: you give it context about what kind of content to expect.

In the backend, the prompt is passed as part of the POST body and forwarded to the Whisper API. To support this, we updated:

  • src/services/transcriber.ts to accept the prompt and include it in the API call.
  • src/routes/transcribe.ts to read it from req.body.

Prompting is optional; it only takes effect if promptInput.value is non-empty.
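For example, the Whisper call inside transcriber.ts might include the prompt like this (the prompt text itself is just an illustration):

```typescript
// Pass the prompt along with the audio so Whisper knows what vocabulary to expect.
const response = await openai.audio.transcriptions.create({
  file: fs.createReadStream(filePath),
  model: 'whisper-1',
  response_format: 'verbose_json',
  prompt: 'A technical discussion about React, hooks, and the DOM.',
});
```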

Example: Without vs With Prompt

🎧 Input Audio:
"...and the React component mounts to the DOM when the hook is called."

❌ Without Prompt (illustrative output):
"...and the react component mounts to the dome when the hook is called."

✅ With Prompt such as "A technical discussion about React, hooks, and the DOM" (illustrative output):
"...and the React component mounts to the DOM when the hook is called."

Explanation:

  • Without a prompt, Whisper misinterprets technical terms.
  • With a clear contextual prompt, Whisper understands that "React", "DOM", and "hook" are expected — so the transcription becomes accurate.

Transcribing In Multiple Languages

Whisper supports many languages and can automatically detect them — but specifying the language can improve results.
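For example, to tell Whisper the audio is Spanish (same illustrative setup as before):

```typescript
const response = await openai.audio.transcriptions.create({
  file: fs.createReadStream(filePath),
  model: 'whisper-1',
  response_format: 'verbose_json',
  language: 'es', // ISO-639-1 code, here Spanish
});
```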

Or, for automatic detection:
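Here the language parameter is simply left out, and Whisper reports what it detected in the verbose_json response (same illustrative setup):

```typescript
const response = await openai.audio.transcriptions.create({
  file: fs.createReadStream(filePath),
  model: 'whisper-1',
  response_format: 'verbose_json',
  // No language parameter: Whisper detects the language on its own.
});

console.log(response.language); // e.g. "spanish"
```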

Whisper will return a language field indicating what it detected.


Backend Support: How We Integrated These Features

Previously, our /transcribe endpoint only required a file path. Now, to support prompts and multilingual transcription, we updated both the route and the service layer, as sketched below.

  • Updated /transcribe route to accept optional prompt and language parameters from the frontend.
  • Modified transcriber.ts service to:
    • Use response_format: 'verbose_json'
    • Pass the prompt and language to Whisper
    • Extract and return a clean array of { start, end, text } segments
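
Here is a minimal sketch of what those two files could look like after the changes. The helper and parameter names are illustrative; only the openai.audio.transcriptions.create call follows the Whisper API directly:

```typescript
// src/services/transcriber.ts (sketch)
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI();

export interface TranscriptSegment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

export async function transcribeAudio(
  filePath: string,
  prompt?: string,
  language?: string
): Promise<TranscriptSegment[]> {
  const response = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: 'whisper-1',
    response_format: 'verbose_json',
    // Forward the optional hints only when they were provided.
    ...(prompt ? { prompt } : {}),
    ...(language ? { language } : {}),
  });

  // Return only the fields the frontend needs.
  return (response.segments ?? []).map((s) => ({
    start: s.start,
    end: s.end,
    text: s.text,
  }));
}
```

```typescript
// src/routes/transcribe.ts (sketch)
import { Router } from 'express';
import { transcribeAudio } from '../services/transcriber';

const router = Router();

router.post('/transcribe', async (req, res) => {
  try {
    const { filePath, prompt, language } = req.body;
    const segments = await transcribeAudio(filePath, prompt, language);
    res.json({ segments });
  } catch (err) {
    res.status(500).json({ error: 'Transcription failed' });
  }
});

export default router;
```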

Frontend Code Integration Overview

In the previous version, transcription requests sent only the audio file. Now we have extended the UI and logic to include:

  • A prompt input field for contextual hints
  • A language dropdown with predefined language options or auto-detect

Here's the updated request payload sent from the browser:
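Something like the following, where promptInput, languageSelect, and uploadedFilePath are illustrative names used for this sketch:

```typescript
// Grab the UI controls (element ids are illustrative).
const promptInput = document.getElementById('prompt') as HTMLInputElement;
const languageSelect = document.getElementById('language') as HTMLSelectElement;

// Path of the audio file uploaded in an earlier step (illustrative placeholder).
const uploadedFilePath = '/uploads/recording.webm';

const payload = {
  filePath: uploadedFilePath,
  // Only send a prompt when the user actually typed one.
  prompt: promptInput.value || undefined,
  // "auto" in the dropdown means: let Whisper detect the language.
  language: languageSelect.value === 'auto' ? undefined : languageSelect.value,
};

const res = await fetch('/transcribe', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload),
});

const { segments } = await res.json();
```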

And here’s how the transcription is rendered when segments are returned:
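A minimal rendering sketch, assuming a container element with an illustrative id of transcript:

```typescript
interface TranscriptSegment {
  start: number;
  end: number;
  text: string;
}

// Show each segment on its own line with its time range.
function renderSegments(segments: TranscriptSegment[]) {
  const container = document.getElementById('transcript')!;
  container.innerHTML = '';

  for (const segment of segments) {
    const line = document.createElement('p');
    line.textContent = `[${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s] ${segment.text}`;
    container.appendChild(line);
  }
}

// Using the segments returned by the fetch call above.
renderSegments(segments);
```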


Complete Example: Advanced Transcription In Action
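Below is a compact end-to-end sketch that pulls the options together on the backend side; the file name, prompt text, and language code are illustrative:

```typescript
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI();

async function main() {
  const response = await openai.audio.transcriptions.create({
    file: fs.createReadStream('lecture.webm'), // illustrative file
    model: 'whisper-1',
    response_format: 'verbose_json',           // required to get segments
    prompt: 'A programming lecture about React, hooks, and the DOM.',
    language: 'en',                            // omit to let Whisper auto-detect
  });

  console.log(`Detected language: ${response.language}`);

  for (const segment of response.segments ?? []) {
    console.log(
      `[${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s] ${segment.text}`
    );
  }
}

main().catch(console.error);
```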

💡 Note: this example assumes you're getting back a list of segments (because of verbose_json). If you use the default format, response.text will contain a single combined transcript without timestamps.


Summary & Practice Preview

In this lesson, you learned how to unlock powerful features in the OpenAI Whisper API:

  • Segments: Improve readability and enable time-aligned playback.
  • Prompts: Steer the model toward more accurate transcriptions.
  • Multilingual Transcription: Transcribe audio in many languages — automatically or manually.

With these tools, your transcription system becomes smarter, more flexible, and easier to integrate into real-world apps.

Coming up next: try these features out on different types of audio, and see how your prompts and language settings affect the results!
