From Transcripts to Podcast Videos
Last Updated: Sat Mar 30 2024
I recently set myself a little project to try and build a workflow that would allow me to produce the following:
- Podcast Videos: They seem to be all the rage these days, so I thought I would have a look at what would be needed to make a podcast available via YouTube.
- Podcast Transcripts: I've been meaning to look at this for a while now. Mostly for accessibility reasons but now that Apple has announced support for the podcast:transcript tag, it just makes sense.
Podcast Transcripts
When I'm talking about transcriptions I'm really talking about two things:
- A human readable copy of what is said in the audio file
- A marked up record of what is said in the audio file that is able to be displayed IN TIME with the audio (whether that's in video or other format)
Now there's a lot of crossover, and you can get away with fulfilling both requirements using any of the industry standards for subtitles. However, for a proper human readable experience I think it needs a bit more than that. So whatever I need to do is going to have two outputs: one for humans and one for machines.
With that in mind I set out to transcribe the first ever episode of Purser Explores The World - Numbats In Space.
I could have sat down and transcribed the episode manually but given that I was looking at transcribing literally tens of hours of content with PETW and Women In STEMM I thought I would try and see how well the current crop of Open Source Transcription Models performed. For this experiment I used Whisper but I'll be looking at others like Kaldi as well.
I figured Numbats In Space was a good test as not only did it feature three different interviewees, but there were quality issues with the audio (there's lag in some parts, a bit of static in others) that might trip up any model. However, after a bit of futzing around with Whisper and tools like WhisperX, I was able to generate what's called an SRT file (short for SubRip Subtitle) that didn't need a lot of correcting. There was one instance of a speaker being mis-assigned and some cleanup on words that were either unclear or scrambled.
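For anyone who hasn't looked inside an SRT file: it is plain text made up of numbered cues, each with a start and end time and the words spoken. Here is a minimal sketch of reading one in Python and collapsing it into something closer to the human readable output mentioned earlier. The `[SPEAKER_00]: ` prefix is an assumption about how the speaker labels appear in the file, and `Cue`, `parse_srt` and `to_readable` are names made up for this example, so adjust them to match whatever your tooling actually produces.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # empty string when no speaker label is present
    text: str


# Matches "00:00:01,000 --> 00:00:02,500" (a dot is tolerated instead of a comma).
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)
# Assumed label format, e.g. "[SPEAKER_00]: Hello". Change this to suit your files.
SPEAKER = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$", re.DOTALL)


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Split SRT text into cues. Blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for index, line in enumerate(lines):
            timing = TIMING.search(line)
            if timing:
                break
        else:
            continue  # no timing line in this block
        parts = timing.groups()
        text = " ".join(l.strip() for l in lines[index + 1:])
        speaker = ""
        labelled = SPEAKER.match(text)
        if labelled:
            speaker, text = labelled["speaker"], labelled["text"]
        cues.append(Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), speaker, text))
    return cues


def to_readable(cues: list[Cue]) -> str:
    """Merge consecutive cues from the same speaker into one paragraph each."""
    paragraphs = []  # each entry is [speaker, text]
    for cue in cues:
        if paragraphs and paragraphs[-1][0] == cue.speaker:
            paragraphs[-1][1] += " " + cue.text
        else:
            paragraphs.append([cue.speaker, cue.text])
    return "\n\n".join(f"{s}: {t}" if s else t for s, t in paragraphs)
```

The merged output still needs a human pass (real names instead of speaker IDs, punctuation, removing false starts), but it gets rid of the timestamps and the one-line-per-cue choppiness.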
Now, all I have left to do is:
- Add the podcast:transcript field to my RSS builder on the site
- Upload the SRT so that it goes out with the RSS feed
- Edit a copy of the transcript so that it's much more human than just a time marked statement of who said what, when.
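As a rough idea of what the first of those steps involves, here is a sketch using Python's standard library to attach a podcast:transcript element to an RSS item. This is not my actual RSS builder. The URL is a placeholder, `add_transcript` is a name invented for the example, and the `application/x-subrip` type is an assumption on my part, so check the Podcasting 2.0 namespace documentation and Apple's guidance for the value they expect for SRT files.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Podcasting 2.0 "podcast:" tags.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"
ET.register_namespace("podcast", PODCAST_NS)


def add_transcript(item, url, mime_type="application/x-subrip",
                   language=None, rel=None):
    """Append a <podcast:transcript> element to an RSS <item> element.

    url and type are always written; language and rel are optional
    and only written when a value is supplied.
    """
    attrs = {"url": url, "type": mime_type}
    if language:
        attrs["language"] = language
    if rel:
        attrs["rel"] = rel
    return ET.SubElement(item, f"{{{PODCAST_NS}}}transcript", attrs)


# Usage: a bare-bones item with a placeholder transcript URL.
item = ET.Element("item")
ET.SubElement(item, "title").text = "Numbats In Space"
add_transcript(
    item,
    "https://example.com/transcripts/numbats-in-space.srt",  # placeholder
    language="en-AU",
    rel="captions",
)
print(ET.tostring(item, encoding="unicode"))
```

In a real feed the `xmlns:podcast` declaration would normally sit on the root `<rss>` element rather than on each item, which is what happens when the whole feed is serialised from a single tree.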
Podcast Videos
In the meantime I had a little play to see if I could automate the generation of videos based on the audio and transcripts. The goal was to produce something that could be uploaded to YouTube and not actually look like buttock.
I tried with ffmpeg, but I never really nailed the look so I shelved that for the moment (I will be returning to it) and decided to have a play with an actual video tool.
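For the curious, the general shape of the ffmpeg approach is: loop a still image as the video track, draw a waveform from the audio, and overlay one on the other. The sketch below builds such a command in Python. It is not the exact command I was using and it has not been tuned for looks. The file names, sizes and the `build_ffmpeg_command` helper are all assumptions for illustration, and it relies on ffmpeg's `scale`, `showwaves` and `overlay` filters being available in your build.

```python
def build_ffmpeg_command(image, audio, output,
                         width=1280, height=720, wave_height=200):
    """Return an ffmpeg argument list: still image + waveform + audio."""
    filter_graph = (
        # Resize the cover image to the output frame size.
        f"[0:v]scale={width}:{height}[bg];"
        # Draw the audio as a line waveform in a strip the width of the frame.
        f"[1:a]showwaves=s={width}x{wave_height}:mode=line[wave];"
        # Pin the waveform to the bottom edge; stop when the shorter input ends.
        f"[bg][wave]overlay=0:H-h:shortest=1[v]"
    )
    return [
        "ffmpeg",
        "-loop", "1", "-i", image,   # repeat the still image indefinitely
        "-i", audio,
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                 # end the output when the audio ends
        output,
    ]


# Hypothetical file names, purely for illustration.
command = build_ffmpeg_command("cover.png", "episode.mp3", "episode.mp4")
print(" ".join(command))
```

To actually render, pass the list to `subprocess.run(command, check=True)`. Building the arguments as a list avoids shell quoting problems with the filter graph.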
First off I tried DaVinci Resolve. I only have the free version, but it is immensely powerful and I can't recommend it highly enough for anyone who wants to get into compositing and editing.
Now this looks AMAZING, however there were a couple of issues. Firstly the render size was absolutely massive. That's probably on me, as I haven't configured it correctly.
Secondly the render TIME was going to be roughly 14 hours.
14 hours to render a fancy audio visualiser, an image and an audio track. Sorry but my brain just rebelled at having to wait that long to get a result. So I stopped the render and clipped the first minute and a bit.
Another thing you might notice is that the subtitles/captions look a little raw. I hadn't run through and verified the transcript at this point.
Right, so given that I was impatient and what I was going for wasn't actually that complicated (and certainly didn't NEED the power of DaVinci Resolve), I turned to Kdenlive, a Free and Open Source non-linear video editor that I've used on and off for over a decade. I was able to throw together something similar very quickly, and the final render time and size were much smaller.
And here's the result
If you don't have subtitles turned on then turn them on and select the en-AU set (YouTube still offers the auto-generated ones as well, so you have to be clear).
It's all a work in progress of course. I'd still like to be able to automate as much as possible (both for practical reasons and for the fun of it), but I think I'm at a good place to continue.