Real-Time Microphone Transcription with Whisper API

In this lesson, you'll learn how to record audio from your browser in real time and use the OpenAI Whisper API to transcribe it. We'll walk through the full logic from initiating the recording in the browser to returning the transcription from the backend.


What You Will Learn

This lesson will guide you through:

  • Setting up the browser to record microphone audio in real time.
  • Uploading recorded audio files from the frontend to the backend.
  • Processing those files using the OpenAI Whisper API.
  • Displaying the transcription result in the browser.
  • Cleaning up files after use to manage server storage efficiently.

Each of these steps contributes to building a fluid real-time transcription interface directly from the browser.


Start Recording Audio

We'll start in public/app.js, which handles browser audio recording and the UI. A sketch of this setup follows the list below.

  • getUserMedia({ audio: true }): Requests access to the user's microphone via navigator.mediaDevices (part of the Media Capture and Streams API, closely related to WebRTC).
  • MediaRecorder: This browser API lets you capture media streams such as audio or video; here, we use it specifically to record audio.
  • mediaRecorder.ondataavailable: This event fires whenever a chunk of recorded audio is available (periodically if a timeslice is passed to start(), otherwise once when recording stops). We push each chunk into audioChunks, an array that holds all segments of the final recording.
  • UI updates ensure a clean user experience:
    • textArea.textContent = '' clears any previous transcriptions.
    • resultPanel.classList.add('hidden') hides the results panel so users don’t see stale output.
    • Button states are updated to reflect that recording has started.
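Here is a minimal sketch of how these pieces could fit together in public/app.js. The textArea and resultPanel elements come from the notes above; the recordButton and stopButton names are placeholders for whatever buttons your markup uses.

```javascript
// public/app.js — starting a recording (button names are placeholders)
let mediaRecorder;
let audioChunks = [];

recordButton.addEventListener('click', async () => {
  // Ask the browser for microphone access
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  mediaRecorder = new MediaRecorder(stream);
  audioChunks = [];

  // Collect each chunk of recorded audio as it becomes available
  mediaRecorder.ondataavailable = (event) => {
    audioChunks.push(event.data);
  };

  // Reset the UI so users don't see stale output
  textArea.textContent = '';
  resultPanel.classList.add('hidden');
  recordButton.disabled = true;
  stopButton.disabled = false;

  mediaRecorder.start();
});
```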

Stop Recording and Transcribe

  • When the user decides to stop recording, mediaRecorder.stop() halts the recording session.
  • The onstop event triggers once the recording has fully stopped, allowing us to safely process the audio data.
  • We create a Blob object from the collected chunks. This blob acts like a file in memory, with the type set to 'audio/webm' so that we retain format compatibility when sending to the server.

A Blob (Binary Large Object) represents raw immutable binary data. In this context, the Blob acts like an in-memory file containing all the audio chunks we recorded. The second argument ({ type: 'audio/webm' }) defines the MIME type, helping both the browser and the server interpret the content format. Blobs are commonly used for handling file-like objects, such as media streams or generated documents, in frontend JavaScript.

FormData is a built-in web API that builds a set of key/value pairs for an HTTP request, mimicking a form submission; it is especially useful when sending files. In this case, we append the Blob under the key 'audio', which acts like an <input type="file" name="audio"> in an HTML form. When a FormData object is used as the request body, the browser automatically sets the multipart/form-data content type (including the boundary), which is required when uploading binary files in an HTTP POST request.
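Putting this together, the stop handler might look roughly like the sketch below. The /recordings and /transcribe endpoint paths and the response field names (filePath, transcription) are assumptions for illustration and should match your own routes.

```javascript
// public/app.js — stopping the recording and uploading the audio
stopButton.addEventListener('click', () => {
  // onstop fires after the final chunk has been delivered
  mediaRecorder.onstop = async () => {
    // Combine all recorded chunks into a single in-memory "file"
    const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });

    // Build a multipart body with the blob under the key 'audio'
    const formData = new FormData();
    formData.append('audio', audioBlob, 'recording.webm');

    // The browser sets the multipart/form-data content type automatically
    const uploadResponse = await fetch('/recordings', {
      method: 'POST',
      body: formData,
    });
    const { filePath } = await uploadResponse.json();

    // Ask the backend to transcribe the uploaded file (endpoint is assumed)
    const transcribeResponse = await fetch('/transcribe', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ filePath }),
    });
    const { transcription } = await transcribeResponse.json();

    // Show the result and restore the button states
    textArea.textContent = transcription;
    resultPanel.classList.remove('hidden');
    recordButton.disabled = false;
    stopButton.disabled = true;
  };

  mediaRecorder.stop();
});
```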

Backend: Handling Audio Uploads

Route: src/routes/recordings.ts

  • This sets up a file storage strategy using multer, a middleware for handling multipart/form-data in Node.js.
  • Files are saved to an uploads/ directory.
  • The filename is dynamically generated using the field name and timestamp, with the original extension preserved—important for ensuring Whisper understands the file format.

multipart/form-data is a content type used when submitting forms that include files. It breaks the request body into parts, each with its own content headers and boundaries. Standard JSON or URL-encoded bodies can't support file uploads, which is why multipart/form-data is required when using FormData.

multer is a Node.js middleware that parses incoming multipart/form-data requests and makes uploaded files accessible via req.file or req.files. It automatically stores files to disk (or memory, if configured) based on the provided storage settings, allowing us to avoid manual parsing of multipart requests.
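With those pieces in mind, a minimal sketch of the recordings route could look like this. The 'audio' field name matches the FormData key from the frontend; the response shape ({ filePath }) is an assumption that the transcription step relies on.

```typescript
// src/routes/recordings.ts — storing uploaded audio with multer
import { Router } from 'express';
import multer from 'multer';
import path from 'path';

// Save files to uploads/ with a name built from the field name,
// a timestamp, and the original extension (so Whisper can detect the format)
const storage = multer.diskStorage({
  destination: 'uploads/',
  filename: (_req, file, cb) => {
    const ext = path.extname(file.originalname);
    cb(null, `${file.fieldname}-${Date.now()}${ext}`);
  },
});

const upload = multer({ storage });
const router = Router();

// 'audio' must match the key used in FormData on the frontend
router.post('/', upload.single('audio'), (req, res) => {
  if (!req.file) {
    return res.status(400).json({ error: 'No audio file uploaded' });
  }
  // Return the stored path so the frontend can request a transcription
  res.json({ filePath: req.file.path });
});

export default router;
```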

Backend: Transcribing Audio with Whisper

Route: src/routes/transcribe.ts

  • This route ensures that a valid filePath is provided and that the corresponding file actually exists on the server.
  • path.resolve(process.cwd(), filePath) computes the absolute path.
  • If the file does not exist, we return a 404 error to prevent further processing.
  • We call the transcribe() function to invoke the OpenAI Whisper API.
  • The result is returned to the frontend in JSON format.
  • Finally, we clean up by deleting the uploaded file from the disk using fs.unlink, which helps maintain server storage hygiene (see the sketch after this list).
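Here is a minimal sketch of the transcription route, assuming the request body carries a filePath field and the response returns the text under a transcription key:

```typescript
// src/routes/transcribe.ts — validating the file and calling the Whisper service
import { Router } from 'express';
import fs from 'fs';
import path from 'path';
import { transcribe } from '../services/transcriber';

const router = Router();

// Assumes express.json() middleware is enabled so req.body is parsed
router.post('/', async (req, res) => {
  const { filePath } = req.body;
  if (!filePath) {
    return res.status(400).json({ error: 'filePath is required' });
  }

  // Resolve to an absolute path and make sure the file exists
  const absolutePath = path.resolve(process.cwd(), filePath);
  if (!fs.existsSync(absolutePath)) {
    return res.status(404).json({ error: 'File not found' });
  }

  try {
    const transcription = await transcribe(absolutePath);
    res.json({ transcription });
  } catch (err) {
    res.status(500).json({ error: 'Transcription failed' });
  } finally {
    // Remove the uploaded file to keep server storage clean
    fs.unlink(absolutePath, () => {});
  }
});

export default router;
```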

Core Logic: src/services/transcriber.ts

  • We use the OpenAI Node.js SDK to interact with the Whisper API (see the sketch after this list).
  • A file stream is passed to the API call, enabling efficient upload of the file contents.
  • The 'whisper-1' model is used, which returns a simple text transcript of the audio.
  • The transcribed result is returned to the route that called this function.
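A minimal version of the service might look like the sketch below, using the OpenAI Node.js SDK. The exported transcribe function name matches the route sketch above; everything else follows the SDK's standard transcription call.

```typescript
// src/services/transcriber.ts — calling the Whisper API with a file stream
import fs from 'fs';
import OpenAI from 'openai';

// The SDK reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

export async function transcribe(filePath: string): Promise<string> {
  // Streaming the file avoids loading the whole recording into memory
  const result = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: 'whisper-1',
  });

  // whisper-1 returns the plain text transcript in the `text` field
  return result.text;
}
```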

Summary

Here's what we built in this lesson:

  • A real-time recording experience using the browser’s MediaRecorder API.
  • Recorded audio is uploaded to a server and stored with appropriate metadata.
  • A robust backend route accepts and saves the file for processing.
  • Another backend route safely invokes the Whisper API and returns clean transcription.
  • Temporary files are deleted to prevent storage clutter.

This end-to-end flow lets users easily record and transcribe their speech, creating a foundation for advanced voice-driven features.


Next, we’ll explore how to transcribe long recordings in segments, unlocking features like timestamped captions and partial analysis.

Let’s keep building! 💪
