Lesson 2: Segment-Based Transcription with Whisper API

In the previous lesson, you learned how to load and play audio files using Howler.js in the browser, while controlling playback state from a TypeScript backend. This laid the foundation for interactive audio applications. Now it's time to take the next step: extracting meaningful text from audio using OpenAI Whisper.

But instead of transcribing full audio files, we'll clip and transcribe only the audio segments the user actually listens to, based on when they press start and stop. This sets the stage for precise, user-driven transcription workflows.


What You’ll Learn

By the end of this lesson, you will be able to:

  • Clip a specific portion of an audio file using ffmpeg.
  • Prepare the clipped audio segment for Whisper transcription.
  • Send a transcription request to the backend with start and duration parameters.
  • Use the OpenAI Whisper API to convert speech to text.

In this lesson, we focus purely on the backend logic. In the upcoming lesson, you'll learn how the browser tracks segment timing and sends it to the backend.


Prerequisite: Installing ffmpeg

To clip audio files before transcription, we rely on ffmpeg, a powerful command-line tool for processing multimedia files.

While ffmpeg is widely supported, it may not be installed by default on your system. You’ll need to install it manually if you haven’t already.

ffmpeg is a cross-platform utility for handling audio, video, and other media files. In this project, we use it to:

  • Extract a segment of an audio file (-ss and -t)
  • Convert the audio to mono and 16kHz sample rate (as required by Whisper)

Here are common installation methods by platform:

macOS (using Homebrew)
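If you have Homebrew available, installation is a single command:

```bash
brew install ffmpeg
```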

Ubuntu
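On Ubuntu (and most Debian-based distributions), ffmpeg is available through apt:

```bash
sudo apt update
sudo apt install ffmpeg
```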

Windows
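On Windows, a package manager is usually the easiest route; for example, with Chocolatey installed you can run the command below. Alternatively, use winget or download a build from ffmpeg.org and add its bin folder to your PATH.

```bash
choco install ffmpeg
```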

To verify installation, run:
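```bash
ffmpeg -version
```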

You should see version details printed in the terminal.

⚠️ If ffmpeg is not installed or not in your PATH, the clipping step will fail with an error like ffmpeg: command not found.

Backend: Clipping Audio Segments with ffmpeg

To transcribe just part of a file, we need to extract a time-based segment of the audio. We do this with the ffmpeg tool we installed above.
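A minimal sketch of this step is shown below, spawning ffmpeg from Node with child_process. The clipAudio helper name and the promise-based wrapper are illustrative assumptions; the fullPath, start, and duration parameters and the .clip.mp3 output suffix follow the explanation that comes after the code.

```typescript
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

// Extracts a time-based segment of the source file and re-encodes it
// as mono, 16kHz audio so it is ready for Whisper.
// Helper name and structure are illustrative; adapt them to your project.
async function clipAudio(fullPath: string, start: number, duration: number): Promise<string> {
  const clipPath = fullPath.replace(/\.[^.]+$/, ".clip.mp3");

  // -ss: start time, -t: clip length, -ac 1: mono, -ar 16000: 16kHz sample rate
  const command = `ffmpeg -y -ss ${start} -t ${duration} -i "${fullPath}" -ac 1 -ar 16000 "${clipPath}"`;

  await execAsync(command);
  return clipPath;
}
```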

Explanation:

  • -ss ${start}: Start time (in seconds)
  • -t ${duration}: How long the clip should last
  • -ac 1: Force mono audio (required by Whisper)
  • -ar 16000: Set sample rate to 16kHz (also required by Whisper)
  • -i "${fullPath}": Input file path
  • Output is saved to a .clip.mp3 file

This command ensures the audio segment is formatted properly before sending it to Whisper.


Route: /transcribe

Let’s look at the full backend route for segment-based transcription.
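The sketch below shows one way such a route could look with Express. The request field names (filename, start, duration), the audio directory, and the clipAudio and transcribeAudio helpers are assumptions based on this lesson's flow, so adjust them to match your own project.

```typescript
import express from "express";
import fs from "fs";
import path from "path";
import { clipAudio } from "./clipper";          // ffmpeg helper sketched above (assumed module name)
import { transcribeAudio } from "./transcriber"; // Whisper helper shown below (assumed export name)

const router = express.Router();

// POST /transcribe  { filename: string, start: number, duration: number }
router.post("/transcribe", async (req, res) => {
  const { filename, start, duration } = req.body;

  // 1. Validate the request: the file must exist and the duration must be valid
  const fullPath = path.join("audio", filename ?? "");
  if (!filename || !fs.existsSync(fullPath)) {
    return res.status(404).json({ error: "Audio file not found" });
  }
  if (typeof start !== "number" || typeof duration !== "number" || duration <= 0) {
    return res.status(400).json({ error: "Invalid start or duration" });
  }

  try {
    // 2. Clip the requested segment with ffmpeg
    const clipPath = await clipAudio(fullPath, start, duration);

    // 3. Send the clipped audio to the Whisper transcription function
    const text = await transcribeAudio(clipPath);
    res.json({ text });
  } catch (err) {
    res.status(500).json({ error: "Transcription failed" });
  }
});

export default router;
```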

This endpoint does three things:

  1. Validates the request (ensures file exists and duration is valid)
  2. Clips the audio file using ffmpeg
  3. Sends the clipped audio to the Whisper transcription function

Whisper Integration: transcriber.ts

Once the clipped file is ready, we use the Whisper API to transcribe it. You may already be familiar with this API from the previous course.
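A minimal transcriber.ts could look like the sketch below, assuming the official openai Node SDK with an OPENAI_API_KEY environment variable. The transcribeAudio name matches the route sketch above and is otherwise an arbitrary choice.

```typescript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Streams the clipped audio file to the Whisper API and returns the transcript text.
export async function transcribeAudio(clipPath: string): Promise<string> {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(clipPath),
    model: "whisper-1",
  });

  return transcription.text;
}
```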

Explanation:

  • A readable stream is created from the clipped file.
  • The Whisper API is called with model: "whisper-1".
  • The resulting text is returned to the route handler.

This service can be reused anywhere in your app that needs transcription.


Summary

In this lesson, you learned how to:

  • Use ffmpeg to extract a segment of an audio file by time.
  • Set up a backend route to handle transcription requests.
  • Prepare audio clips in a format compatible with OpenAI Whisper.
  • Call Whisper to transcribe audio into text.

You’ve now built the backend pipeline for targeted transcription.

In the next lesson, we’ll shift back to the frontend and implement a user-driven experience that captures the exact start and stop times of playback—sending that segment for transcription.
