Using Azure to add subtitles to a video

I needed to create English subtitles for a video that was recorded in French, and figured I'd use Azure instead of paying for a dedicated subtitling service. In this guide, I'll walk through how to use Azure Speech Service to automatically generate subtitles for your videos.

Prerequisites

Before we begin, you'll need:

An Azure subscription (you can create one for free)
A video file you want to add subtitles to
Basic familiarity with command line tools

Step 1: Extract Audio from Video and Convert to WAV

First, we need to extract the audio from your video file and convert it to WAV format:

On macOS

Use the built-in afconvert command:

afconvert -f WAVE -d LEI16@44100 input.m4a output.wav

On Windows

Use ffmpeg (you'll need to install it first):

ffmpeg -i input.mp4 -acodec pcm_s16le -ar 44100 output.wav
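If you have several videos to process, the ffmpeg command above can be wrapped in a small shell function. This is only a sketch: extract_wav is a name I made up for illustration, it assumes ffmpeg is on your PATH, and I've added -vn so that any video stream is dropped explicitly.

```shell
# extract_wav VIDEO [WAV]
# Hypothetical helper (not part of ffmpeg): extracts the audio track of VIDEO
# as 16-bit PCM WAV at 44.1 kHz, the same settings as the command above.
# If WAV is omitted, the output name is VIDEO with its extension swapped for .wav.
extract_wav() {
  ew_input=$1
  ew_output=${2:-"${ew_input%.*}.wav"}
  # -vn drops the video stream so only audio is written
  ffmpeg -i "$ew_input" -vn -acodec pcm_s16le -ar 44100 "$ew_output"
}
```

For example, extract_wav talk.mp4 would write talk.wav next to the original file.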

Step 2: Set Up Azure Speech Service

Create a Speech resource in the Azure portal
Once created, get your Speech resource key and region from the "Keys and Endpoint" section

Step 3: Install the Speech CLI

The Azure Speech CLI is the easiest way to generate subtitles. It's installed as a .NET tool.

Note: You'll also need to make sure you have the .NET 6 SDK installed. If, like me, you are on an ARM64 Mac, the Cognitive Services Speech SDK doesn't support that architecture yet, so you'll need to use Rosetta to run the .NET CLI.

I have a guide on running .NET x64 under Rosetta on ARM64 Macs if you need help with that.

Once you have .NET 6 SDK installed, install the Speech CLI:

dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI

Step 4: Configure Speech CLI

Set up your Azure Speech Service credentials:

spx config @key --set YOUR-SUBSCRIPTION-KEY
spx config @region --set YOUR-REGION

Replace YOUR-SUBSCRIPTION-KEY with your Speech resource key and YOUR-REGION with your resource region (e.g., westus, northeurope).
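If you'd rather not paste the key directly into the command, one option is to read it from environment variables. The sketch below uses the same two spx config commands; configure_spx, SPEECH_KEY and SPEECH_REGION are names I picked for illustration, not something the Speech CLI requires.

```shell
# configure_spx
# Hypothetical helper: stores the key and region in the Speech CLI config,
# reading them from SPEECH_KEY and SPEECH_REGION (variable names assumed here).
configure_spx() {
  if [ -z "${SPEECH_KEY:-}" ] || [ -z "${SPEECH_REGION:-}" ]; then
    echo "Set SPEECH_KEY and SPEECH_REGION first" >&2
    return 1
  fi
  spx config @key --set "$SPEECH_KEY"
  spx config @region --set "$SPEECH_REGION"
}
```

You would export both variables in your shell session and then run configure_spx once.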

If you are on Windows or Linux, it's also best to install GStreamer; you can follow the instructions for that here.

If you are on macOS, GStreamer isn't supported. I tried to get it to work for this, but it just refused.

Step 5: Generate Subtitles

Now you can generate subtitles in WebVTT format.

Important: Specifying the --format flag will trigger GStreamer dependency requirements.

On macOS under Rosetta:

spx recognize --file input.wav --language fr-FR --output vtt file subtitles.vtt

Or on Windows or Linux using GStreamer:

spx recognize --file your-audio.m4a --format any --output vtt file subtitles.vtt --output srt file subtitles.srt

Additional options you can use:

--profanity masked: Masks profanity in the output
--phrases "Phrase1;Phrase2": Improves recognition of specific phrases
--property SpeechServiceResponse_StablePartialResultThreshold=5: Asks the service to return fewer but more stable partial results, by requiring more confidence before text is emitted
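These options can be combined in a single call. As a rough sketch, here is the WAV-based command with profanity masking and a phrase list added. generate_subtitles is a hypothetical wrapper of my own, and it only uses flags already shown in this guide.

```shell
# generate_subtitles AUDIO LANGUAGE PHRASES
# Hypothetical wrapper: recognizes AUDIO (a WAV file) in LANGUAGE and writes
# a .vtt file next to it, masking profanity and hinting the given phrases.
# PHRASES is a semicolon-separated list, e.g. "Azure;Rosetta".
generate_subtitles() {
  gs_audio=$1
  gs_language=$2
  gs_phrases=$3
  spx recognize --file "$gs_audio" --language "$gs_language" \
    --output vtt file "${gs_audio%.*}.vtt" \
    --profanity masked \
    --phrases "$gs_phrases"
}
```

For example, generate_subtitles input.wav fr-FR "Azure;Rosetta" would write input.vtt.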

Output Format

The command generates a WebVTT file that looks like this:

WEBVTT

00:00:00.170 --> 00:00:03.230
Welcome to this video tutorial.

00:00:03.230 --> 00:00:06.450
Today we'll be discussing Azure services.
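Because the format is so simple, it's easy to sanity-check the timings with standard tools. The sketch below prints the start and end of each cue in seconds. vtt_cue_seconds is a name I made up, and it assumes timing lines look like the ones above, with the timestamp, the arrow, and the second timestamp separated by spaces.

```shell
# vtt_cue_seconds FILE
# Hypothetical helper: prints "START END" in seconds for every cue in a
# WebVTT file, one cue per line.
vtt_cue_seconds() {
  awk '
    # Convert HH:MM:SS.mmm (or MM:SS.mmm) to seconds.
    function to_seconds(ts,   parts, n) {
      n = split(ts, parts, ":")
      if (n == 2) return parts[1] * 60 + parts[2]
      return parts[1] * 3600 + parts[2] * 60 + parts[3]
    }
    # Timing lines have the arrow as their second field.
    $2 == "-->" {
      printf "%.3f %.3f\n", to_seconds($1), to_seconds($3)
    }
  ' "$1"
}
```

Run against the sample above, vtt_cue_seconds subtitles.vtt prints 0.170 3.230 and 3.230 6.450 on two lines, which makes gaps or overlapping cues easy to spot.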

Clean Up

When you're done, you can remove the Speech resource from the Azure portal if you don't plan to use it again. This ensures you won't incur any additional costs.

Tips for Better Results

Use high-quality audio for better recognition accuracy
Add custom phrases for domain-specific terminology
Test with a small segment first to verify the quality
Consider post-editing the subtitles for perfect accuracy
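To test with a small segment, one approach is to cut the first minute out of the WAV file with ffmpeg and run the recognition on that. This is a sketch under a few assumptions: make_sample is a hypothetical helper name, the 60-second default is arbitrary, and ffmpeg must be on your PATH.

```shell
# make_sample INPUT OUTPUT [SECONDS]
# Hypothetical helper: copies the first SECONDS seconds (default 60) of INPUT
# into OUTPUT without re-encoding the audio.
make_sample() {
  ms_input=$1
  ms_output=$2
  ms_seconds=${3:-60}
  # -t limits the duration; -c copy keeps the PCM audio as-is
  ffmpeg -i "$ms_input" -t "$ms_seconds" -c copy "$ms_output"
}
```

For example, make_sample output.wav sample.wav 30 writes a 30-second sample.wav that you can pass to spx recognize before committing to the full file.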

Conclusion

Azure Speech Service provides a powerful and automated way to generate subtitles for your videos. While the output might need some manual refinement depending on your needs, it significantly reduces the time and effort required compared to manual transcription.

For more advanced scenarios or programmatic access, Azure Speech Service also provides SDKs for various programming languages including C#, Python, and JavaScript.