In this lesson, we're enhancing our transcription system to work like a live microphone transcription tool. Instead of recording the entire audio before transcribing, we now record short chunks (3 seconds each) and transcribe them one by one as they arrive, simulating a live transcription experience directly in the browser.
This unit covers:
- How to capture short audio snippets (chunks) from the user's microphone in real time.
- How to transcribe each audio chunk immediately after recording.
- How to update the UI with live transcription results.
- How to manage a recording session with duration limits and countdown timers.
We'll begin with `public/app.js`, where we configure how microphone input is handled in real time.
These constants are critical for timing and quality control:
- `mimeType`: tells the `MediaRecorder` what format to use. `audio/webm;codecs=opus` specifies the WebM format with the Opus codec, which is well-suited for audio and supported by Whisper.
- `CHUNK_DURATION`: each recording session will be sliced into 3-second pieces.
- `MAX_CHUNKS`: limits the session to 10 chunks (to simulate a ~30 s cap).
- `MAX_TIME_S`: converts chunk duration × number of chunks into seconds for UI display.
- `chunkCount` & `remainingTime`: track session state and the countdown for the user.
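A minimal sketch of these constants, with names and values taken from the description above:

```js
// Recording configuration for the live-transcription simulation.
const mimeType = 'audio/webm;codecs=opus'; // WebM container with the Opus codec
const CHUNK_DURATION = 3000;               // one chunk = 3 seconds (in ms)
const MAX_CHUNKS = 10;                     // cap the session at 10 chunks
const MAX_TIME_S = (CHUNK_DURATION * MAX_CHUNKS) / 1000; // ~30 s, shown in the UI

let chunkCount = 0;              // how many chunks we've recorded so far
let remainingTime = MAX_TIME_S;  // countdown displayed to the user
```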
- `recordChunk()`: the main function that performs all recording logic for a single audio segment. We discuss it in a separate section below.
- `setInterval`: automatically runs `recordChunk()` every 3 seconds.
- We also invoke `recordChunk()` immediately to avoid waiting for the first interval.
- `clearInterval(intervalId)`: essential for stopping the session; otherwise, recording will continue indefinitely even if the user presses stop.
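A minimal sketch of this session control flow; the `startSession()`/`stopSession()` wrapper names are assumptions for illustration:

```js
let intervalId = null;

function startSession() {
  chunkCount = 0;
  remainingTime = MAX_TIME_S;
  recordChunk();                                           // record the first chunk immediately
  intervalId = setInterval(recordChunk, CHUNK_DURATION);   // then one chunk every 3 seconds
}

function stopSession() {
  clearInterval(intervalId); // without this, recording would continue indefinitely
  intervalId = null;
}
```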
A simple helper that updates the visible timer on the screen using the `remainingTime` variable.
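For example, assuming the helper is called `updateTimer()` and the countdown lives in an element with the hypothetical id `timer`:

```js
function updateTimer() {
  // Reflect the current countdown value in the page.
  const timerEl = document.getElementById('timer');
  timerEl.textContent = `${remainingTime}s remaining`;
}
```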
Now let's dive into the heart of this simulation: the `recordChunk()` function.
This function is responsible for executing one full iteration of the record → upload → transcribe cycle. Every 3 seconds, it does the following:
- Requests microphone access to capture a short audio stream.
- Records exactly one chunk using the browser's `MediaRecorder` API.
- Packages the audio data into a `Blob` for upload.
- Sends the chunk to the backend, where it's temporarily stored.
- Initiates transcription by sending the uploaded file to the Whisper API.
- Appends the returned text to the live transcript in the UI.
This structure enables us to transcribe small segments in near real time, giving users immediate feedback as they speak. By repeating this function on an interval, we simulate continuous live transcription without needing a streaming connection. Let's break it down:
This line is crucial: it asks the browser for access to the user's microphone using `navigator.mediaDevices.getUserMedia`.
The `audio` object specifies a high-quality mono stream:
- `sampleRate: 44100`: CD-quality audio.
- `channelCount: 1`: mono (single channel).
- `noiseSuppression`: reduces background noise.
- `echoCancellation`: removes speaker echo (common in browser mic recordings).
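Roughly, the request inside `recordChunk()` (an async function) looks like this:

```js
// Ask the browser for a mono, noise-suppressed microphone stream.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 44100,      // CD-quality sample rate
    channelCount: 1,        // mono (single channel)
    noiseSuppression: true, // reduce background noise
    echoCancellation: true  // remove speaker echo
  }
});
```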
- `Blob`: a binary large object that packages all audio data into a single file.
- `FormData`: simulates a form submission to send binary files over HTTP.
- `formData.append()`: adds the blob to the form data under the key `'audio'`, with a filename.
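A sketch of this record-and-package step, reusing the constants defined earlier; the chunk filename is illustrative:

```js
// Record exactly one chunk, then package it for upload.
const recorder = new MediaRecorder(stream, { mimeType });
const dataChunks = [];
recorder.ondataavailable = (e) => dataChunks.push(e.data);

recorder.start();
await new Promise((resolve) => {
  recorder.onstop = resolve;
  setTimeout(() => recorder.stop(), CHUNK_DURATION); // stop after 3 seconds
});

const blob = new Blob(dataChunks, { type: mimeType }); // one chunk's worth of audio
const formData = new FormData();
formData.append('audio', blob, `chunk-${chunkCount}.webm`); // filename is illustrative
```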
- The audio blob is sent to the backend `/recordings/upload` route.
- We retrieve the server-side file path of the uploaded chunk.
- We pass the file path to the `/transcribe` endpoint.
- Transcription text is returned and added to the live transcript in the UI.
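A sketch of the two requests; the response shapes (`{ filePath }` and `{ text }`) and the `appendTranscript()` helper are assumptions about this app's code:

```js
// Continuing inside recordChunk(): upload the chunk, then transcribe the stored file.
const uploadRes = await fetch('/recordings/upload', { method: 'POST', body: formData });
const { filePath } = await uploadRes.json(); // server-side path of the saved chunk

const transcribeRes = await fetch('/transcribe', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ filePath })
});
const { text } = await transcribeRes.json();
appendTranscript(text); // add the new text to the live transcript
```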
- Increments counters and refreshes the session timer.
- Automatically stops recording once the max chunk count is reached.
- Cleans up the mic stream with `getTracks().forEach(track => track.stop())`.
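Sketched out, this end-of-chunk bookkeeping might read (using the hypothetical `stopSession()` from earlier):

```js
// Continuing inside recordChunk(): update session state after each chunk.
chunkCount++;
remainingTime = MAX_TIME_S - chunkCount * (CHUNK_DURATION / 1000);
updateTimer();

if (chunkCount >= MAX_CHUNKS) {
  stopSession(); // hit the ~30 s cap, stop automatically
}

// Release the microphone so the browser's recording indicator turns off.
stream.getTracks().forEach((track) => track.stop());
```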
Adds each transcribed chunk as it’s returned from the server.
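For illustration, a minimal version of such a helper, assuming a container with the hypothetical id `transcript`:

```js
function appendTranscript(text) {
  // Append the newest chunk's text to the running transcript.
  const transcriptEl = document.getElementById('transcript');
  transcriptEl.textContent += ' ' + text.trim();
}
```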
- Starts a fresh session and resets all state and UI indicators.
- Gracefully ends a session and restores UI defaults.
The backend from the previous unit continues to work seamlessly:
- `/recordings/upload`: stores each `.webm` chunk.
- `/transcribe`: invokes `transcribe()` to convert the uploaded audio into text using OpenAI's Whisper API.
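For reference, a rough sketch of what those routes could look like; Express, multer, and the module path are assumptions here, not necessarily the previous unit's exact code:

```js
const express = require('express');
const multer = require('multer');
const { transcribe } = require('./transcription'); // module path is illustrative

const app = express();
app.use(express.json());
const upload = multer({ dest: 'recordings/' }); // multer writes each chunk to disk

app.post('/recordings/upload', upload.single('audio'), (req, res) => {
  // The .webm chunk is already saved; return its path for the next step.
  res.json({ filePath: req.file.path });
});

app.post('/transcribe', async (req, res) => {
  // transcribe() wraps the Whisper API call from the previous unit.
  const text = await transcribe(req.body.filePath);
  res.json({ text });
});
```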
In this unit, you:
- Simulated live microphone transcription using 3-second audio chunks.
- Learned to manage a transcription session with time and chunk limits.
- Processed and displayed each chunk’s transcript live in the browser.
- Built a scalable transcription pipeline with clean UI feedback and Whisper API integration.
Next up: we’ll expand on this to support long-form recordings with advanced segmentation and context-aware processing.
