Lesson 2
Splitting and Processing Large Files with FFmpeg

Welcome back! In our previous lessons, we explored basic transcription with OpenAI's Whisper API and calculated media duration using FFmpeg. Today, we'll shift our focus to transcribing large files with OpenAI Whisper and FFmpeg. Splitting large audio or video files into manageable pieces lets tasks like transcription run efficiently and without errors. By the end of this lesson, you'll be able to split such files yourself using FFmpeg.

Understanding Transcribing Large Files

OpenAI Whisper has a file size limit of 25 MB, which poses a challenge when transcribing large audio or video files. To work around this constraint, we need a way to divide large files into smaller chunks that can be processed one after another. Our strategy is to use FFmpeg to split each file into segments that fall within the size limit. Because the segments are copied rather than re-encoded, the quality of the original content is preserved, and each chunk can then be sent to Whisper on its own.
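Before splitting anything, it can help to check whether a file exceeds the limit at all. The following is a minimal sketch of such a check. The names `needs_splitting` and `WHISPER_LIMIT_BYTES` are illustrative and not part of the lesson's code, and the sketch assumes that 1 MB means 1024 * 1024 bytes, the same convention the splitting code in this lesson uses.

```python
import os

# Assumption: the 25 MB limit expressed in bytes, with 1 MB = 1024 * 1024 bytes
WHISPER_LIMIT_BYTES = 25 * 1024 * 1024


def needs_splitting(file_path, limit_bytes=WHISPER_LIMIT_BYTES):
    """Return True when the file on disk is larger than the upload limit."""
    return os.path.getsize(file_path) > limit_bytes
```

A file that passes this check can be transcribed directly; anything larger would go through the splitting steps described in this lesson.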

Using FFmpeg to Split Media Files: Media Duration

Let's consider Python code to achieve this, ensuring all steps are easily comprehensible. First, let's revisit how we retrieve the media's length using FFmpeg:

Python
import math
import os
import subprocess
import tempfile


def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""
    cmd = [
        'ffprobe',
        '-v', 'quiet',                                # Suppress log messages
        '-show_entries', 'format=duration',           # Print only the duration field
        '-of', 'default=noprint_wrappers=1:nokey=1',  # Output the bare value
        file_path
    ]
    try:
        output = subprocess.check_output(cmd)
        return float(output)
    except (subprocess.CalledProcessError, OSError, ValueError):
        # ffprobe failed, is not installed, or returned a non-numeric value
        return None

This section of the code employs ffprobe to determine an audio file's duration. ffprobe is a component of FFmpeg that fetches file data without altering it. The command is carefully structured to extract only the duration, allowing us to calculate how to split the file accordingly.
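To make the parsing step concrete, here is a small sketch of what happens to ffprobe's raw output. The helper name `parse_duration` is hypothetical, and the sample byte strings are made-up stand-ins for what ffprobe might print with the options above; they are not captured from a real run.

```python
def parse_duration(raw_output):
    """Convert ffprobe's raw stdout (bytes) into seconds, or None if unusable."""
    text = raw_output.decode().strip()
    try:
        return float(text)
    except ValueError:
        # Covers empty output and non-numeric placeholders
        return None


print(parse_duration(b"125.440000\n"))  # hypothetical output for a ~2 minute file
print(parse_duration(b"N/A\n"))         # hypothetical output when no duration is known
```

Returning None for unusable output mirrors the behavior of get_audio_duration, so the caller only has one failure case to check.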

Using FFmpeg to Split Media Files: Streaming FFmpeg's Output

Now, let's implement one more helper function. Splitting a media file into chunks takes time, and FFmpeg writes its logs as a stream: new lines keep appearing while it processes the file. To follow that progress, we should stream these logs to the console from Python as they arrive:

Python
def run_command_with_output(cmd, desc=None):
    """Run a command and stream its output in real-time"""
    if desc:
        print(f"\n{desc}")

    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True
    )

    for line in iter(process.stdout.readline, ''):
        print(line, end='')

    process.stdout.close()
    return_code = process.wait()

    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, cmd)

This helper function allows us to run commands and stream outputs in real time. By setting up a subprocess, it captures output line-by-line, ensuring you keep track of the progress during long operations, a critical feature when managing large files.
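The same streaming pattern can be tried without FFmpeg installed. The sketch below is a variation, not the lesson's helper: `stream_lines` is a hypothetical generator that yields each line instead of printing it, and the child process is a short Python one-liner standing in for a long-running FFmpeg call.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield a command's combined stdout/stderr one line at a time."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Merge stderr into stdout, as the helper does
        text=True,
    )
    with process.stdout:
        for line in process.stdout:
            yield line.rstrip("\n")
    return_code = process.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, cmd)


# Stand-in for FFmpeg: a child Python process that prints three lines
demo_cmd = [sys.executable, "-c", "for i in range(3): print('step', i)"]
for line in stream_lines(demo_cmd):
    print(line)
```

Yielding lines rather than printing them lets the caller decide what to do with each one, for example filtering for progress information, while the non-zero exit check stays the same.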

Using FFmpeg to Split Media Files: Splitting Files into Chunks

The process of splitting media files into smaller chunks involves key FFmpeg commands that work together to extract segments without re-encoding. Let's break down the code to see how it operates:

Python
def split_media(file_path, chunk_size_mb=20):
    """Split media file into chunks smaller than the API limit"""
    print("\nSplitting media into chunks...")

    duration = get_audio_duration(file_path)
    if not duration:
        raise Exception("Could not determine audio duration")

    file_size = os.path.getsize(file_path)
    # Estimate how many seconds of media fit into chunk_size_mb
    chunk_duration = duration * (chunk_size_mb * 1024 * 1024) / file_size
    num_chunks = math.ceil(duration / chunk_duration)

    chunks = []
    for i in range(num_chunks):
        start_time = i * chunk_duration
        temp_file = tempfile.NamedTemporaryFile(
            delete=False,
            suffix=os.path.splitext(file_path)[1]
        )
        # Close our handle so FFmpeg can write to the path on every platform
        temp_file.close()

        cmd = [
            'ffmpeg',
            '-i', file_path,            # Specify the input file to process
            '-ss', str(start_time),     # Set the start time of the chunk
            '-t', str(chunk_duration),  # Define the chunk's duration
            '-c', 'copy',               # Copy streams without re-encoding for efficiency
            '-y',                       # Overwrite output files without confirmation
            temp_file.name
        ]

        run_command_with_output(
            cmd,
            f"Extracting chunk {i+1}/{num_chunks}"
        )
        chunks.append(temp_file.name)
    print(f"Split media into {len(chunks)} chunk(s): {chunks}")
    return chunks

Code Explanation:

  1. Initialize Variables:

    • We first determine the duration of the media file using the helper get_audio_duration function.
    • The file_size is retrieved to calculate a chunk duration that fits within the specified chunk_size_mb limit (20 MB by default).
  2. Calculate Chunks:

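    • chunk_duration multiplies the duration by the ratio of chunk_size_mb (converted to bytes) to file_size to find how long each chunk should be. This is an estimate that assumes the file's size is spread evenly over its duration.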
    • num_chunks calculates the total number of chunks required by dividing the full duration by chunk_duration and rounding up.
  3. Create Each Chunk:

    • A loop iterates over each chunk, calculating the start_time for each segment.
    • A temporary file is created for storing the chunk. This file will mimic the original file's extension for compatibility.
  4. FFmpeg Command:

    • -i specifies the input file.
    • -ss sets the start time for each chunk.
    • -t sets the duration for each chunk.
    • -c copy copies the streams directly without re-encoding, preserving quality and improving efficiency. Because nothing is re-encoded, chunk boundaries in video files may not be frame-accurate.
    • -y automatically overwrites existing output files without user confirmation.
  5. Run Command and Store Chunks:

    • run_command_with_output executes the FFmpeg command, streaming progress to keep the user informed.
    • Each generated temporary file is appended to the chunks list, which is later returned for further processing.

This approach systematically breaks down large files into smaller, manageable pieces using FFmpeg's powerful media handling capabilities.
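The chunk arithmetic from steps 1 to 3 can be checked in isolation, without running FFmpeg. The sketch below uses a hypothetical helper, `plan_chunks`, that applies the same formulas as split_media and returns the start time and length of each chunk. The 60 MB, 600-second file is an invented example, and the estimate assumes the file's size is spread evenly over its duration.

```python
import math


def plan_chunks(duration, file_size, chunk_size_mb=20):
    """Return (start_time, length) pairs in seconds for each chunk."""
    chunk_bytes = chunk_size_mb * 1024 * 1024
    chunk_duration = duration * chunk_bytes / file_size
    num_chunks = math.ceil(duration / chunk_duration)

    plan = []
    for i in range(num_chunks):
        start = i * chunk_duration
        # The final chunk may be shorter than the others
        length = min(chunk_duration, duration - start)
        plan.append((start, length))
    return plan


# Hypothetical 60 MB file lasting 600 seconds, split at the default 20 MB
for start, length in plan_chunks(600, 60 * 1024 * 1024):
    print(f"start={start:.0f}s length={length:.0f}s")
```

For this example, each 20 MB chunk covers 200 seconds, so the plan contains three chunks starting at 0, 200, and 400 seconds. A file smaller than the chunk size yields a single chunk covering the whole duration.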

Checking Yourself: Executing the Media File Split

Running the code (e.g., split_media('resources/sample_video.mp4', 1)) will print something like this:

Bash
Splitting media into chunks...

Extracting chunk 1/2
<ffmpeg output for chunk 1>


Extracting chunk 2/2
<ffmpeg output for chunk 2>

Split media into 2 chunk(s): ['/tmp/tmprgsjob1j.mp4', '/tmp/tmpr2iqj_ll.mp4']

The sample_video.mp4 file is around 2 MB, so splitting it with chunk_size_mb set to 1 produces two chunks of roughly 1 MB each, both of which are extracted with FFmpeg and saved as separate temporary files.
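Because the chunks are created with delete=False, they stay on disk until something removes them. One possible way to tidy up once the chunks are no longer needed is sketched below. The helper `cleanup_chunks` is hypothetical and not part of the lesson's code, and the demo uses empty stand-in files rather than real FFmpeg output.

```python
import os
import tempfile


def cleanup_chunks(chunk_paths):
    """Delete chunk files, skipping any that are already gone."""
    removed = 0
    for path in chunk_paths:
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            pass
    return removed


# Demo with two empty stand-in files in place of real chunks
demo_paths = []
for _ in range(2):
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
    tmp.close()
    demo_paths.append(tmp.name)

print(cleanup_chunks(demo_paths))
```

In practice, the list returned by split_media could be passed to a helper like this after every chunk has been processed.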

Lesson Summary

Congratulations on mastering the process of splitting large media files using FFmpeg! In this lesson, you used ffprobe to measure a file's duration, streamed FFmpeg's output to follow progress during long operations, and split a file into chunks that fit under the Whisper size limit without re-encoding. You’re now well-equipped to prepare large audio and video files for transcription, one manageable chunk at a time!
