Improving my audio extraction script.

Tue Jun 14 16:18:48 UTC 2022

Okay, so I often use mkvextract to extract the audio from mkv files,
both because my portable media player generally doesn't like .mkv and
other containers that support multiple audio tracks and built-in soft
subs, and because the media player has much less storage. (A
50-episode television show that takes up 100GB is a drop in the bucket
on a 4TB hard drive, but it's an unmanageable chunk of a 512GB SD card
when you already have 100GB of music and 100GB of audiobooks on that
same card.)

I've written a script to automate the process:

#! /bin/bash
for file in */*.mkv; do
    mkvextract tracks "$file" "$1:$file$1.ac3"
done

This script takes a track number as a command line argument, loops
over all the mkv files in the immediate subdirectories of the working
directory, extracts the track matching the track number provided, and
saves the extracted file with the name of the source file, the track
number, and the .ac3 extension.

This has several issues:

1. I generally don't know ahead of time which track I'm getting: audio
or subtitle, and if it's audio, whether it's English, Japanese, or
something else. The video stream is usually track 0, but this isn't
universal, and a given set of mkvs isn't guaranteed to have the same
track ordering. Since the script loops over every immediate
subdirectory, even if all of the mkvs within one directory are
uniform, uniformity across directories is unlikely. There's also no
easy way to skip directories where extracting track 1 already got me
what I wanted when I do a second run to extract track 2.

2. The new files retain the .mkv from the original filename, so I end
up with a bunch of .mkv1.ac3, .mkv2.ac3, etc. files. Also, since
the output is saved in the same place as the source, I often have to
manually separate the extracted audio from the original files.

3. The script assumes the extracted audio is AC3, and while that seems
to be the most popular codec for storing the audio streams in .mkv
files, it's not universal.
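
For issue 1, one way to see what each file contains before extracting
anything is to ask mkvmerge for its JSON identification and print one
line per track. This is only a sketch: it assumes mkvmerge (from
MKVToolNix) and jq are installed, and list_tracks is a made-up helper
name. The ids it prints should be the same ones mkvextract expects.

```bash
#! /bin/bash
# List id, type, codec and language for every track in the given mkvs.
# Assumes mkvmerge (MKVToolNix) and jq are installed.

# Read mkvmerge's JSON identification on stdin and print one
# tab-separated line per track; tracks without a language show "und".
list_tracks() {
    jq -r '.tracks[]
           | [.id, .type, .codec, (.properties.language // "und")]
           | @tsv'
}

for file in "$@"; do
    echo "== $file"
    mkvmerge -J "$file" | list_tracks
done
```

Run as e.g. ./tracks.sh */*.mkv (the script name is arbitrary) to
compare track layouts across directories before picking a track.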

Improvements I would like to make but am not sure how to implement:

1. Instead of extracting a specific track number, it would be nice if
I could extract English audio regardless of which track it's stored
in. Bonus if I could get all English audio tracks when a file also
contains an English-language commentary.

2. Instead of looping through all subdirectories in the working
directory, loop through a set specified at the command line, perhaps
with the empty set treated like *.

3. Remove the .mkv from the original filename before appending the new
extension.

4. Actually give the output files the appropriate extension.

5. Instead of saving the output files to SourceDirectory, save them to
SourceDirectoryAudio or something similar.

6. This would just be gravy, but if anyone knows a way either to
convert subtitle files to human-readable plain text (e.g. stripping
out the metadata, timestamps, and formatting) or to have a TTS engine
generate an audio file of the subtitles using the subtitle timing,
having the script extract and process English subtitles would be
nice. Doing this if and only if there isn't an English audio track
would be amazing. Extracting audio is my primary concern, though;
doing something useful with subs when no English audio exists is
lower priority.
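
Items 1 and 4 could be approached roughly as follows: filter
mkvmerge's JSON for audio tracks tagged "eng", then pick an extension
from the codec name. Treat this as a sketch under assumptions:
mkvmerge, mkvextract and jq are installed; the helper names
english_audio_tracks and ext_for_codec are invented here; the files
actually tag English audio as "eng" (untagged tracks would be missed);
and the codec strings in the table match what your mkvmerge version
prints, which is worth checking against its own output.

```bash
#! /bin/bash
# Sketch: extract every English audio track from each mkv given as an
# argument. Assumes mkvmerge, mkvextract and jq are installed.

# Print "id<TAB>codec" for each audio track tagged "eng".
english_audio_tracks() {
    jq -r '.tracks[]
           | select(.type == "audio" and .properties.language == "eng")
           | [.id, .codec] | @tsv'
}

# Map a codec name to a file extension. The table is illustrative and
# not exhaustive; unknown codecs fall back to "bin".
ext_for_codec() {
    case "$1" in
        E-AC-3) echo eac3 ;;
        AC-3)   echo ac3 ;;
        AAC)    echo aac ;;
        MP3)    echo mp3 ;;
        FLAC)   echo flac ;;
        Opus)   echo opus ;;
        Vorbis) echo ogg ;;
        DTS*)   echo dts ;;
        *)      echo bin ;;
    esac
}

for file in "$@"; do
    base=${file%.mkv}    # filename without the trailing .mkv
    mkvmerge -J "$file" | english_audio_tracks |
    while IFS=$'\t' read -r id codec; do
        # </dev/null keeps mkvextract from consuming the track list
        mkvextract tracks "$file" \
            "$id:$base.$id.$(ext_for_codec "$codec")" </dev/null
    done
done
```

Because every matching track is extracted and the track id is part of
the output name, a file with both a main English track and an English
commentary would yield two files rather than one overwriting the
other.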

I think the basename command I use in my uncompress script to create a
separate folder for each extracted archive might help with getting rid
of the .mkv in the output filenames, but I'm not sure. For reference,
here are the contents of my uncompress.sh:

#! /bin/bash
for file in *.rar; do
    dir=$(basename "$file" .rar) # remove the .rar from the filename
    mkdir "$dir"
    # extract in a subshell so the working directory is unchanged even
    # if the cd fails; the .rar is kept
    (cd "$dir" && unrar x -y ../"$file")
done
for file in *.zip; do
    dir=$(basename "$file" .zip) # remove the .zip from the filename
    mkdir "$dir"
    # unzip and remove the archive if successful
    (cd "$dir" && unzip ../"$file" && rm ../"$file")
done
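
The same basename trick does carry over to the mkv case. As a sketch
of items 2, 3 and 5 together: take directories from the command line
(falling back to every immediate subdirectory when none are given),
strip the .mkv with basename, and build an output path in a sibling
directory named after the source with "Audio" appended. The helper
name output_stem and the "Audio" suffix are just illustrations; the
script only creates the output directories and prints the mapping,
leaving the actual mkvextract call to be slotted in.

```bash
#! /bin/bash
# Sketch: work out where each extracted file should go.

# Print the output path, minus extension, for one mkv:
# <source dir>Audio/<filename without .mkv>
output_stem() {
    local file=$1
    local dir name
    dir=$(dirname "$file")
    name=$(basename "$file" .mkv)
    echo "${dir}Audio/$name"
}

# No arguments: behave like the original and use every subdirectory.
if [ "$#" -eq 0 ]; then
    set -- */
fi

for dir in "$@"; do
    dir=${dir%/}                      # drop a trailing slash if present
    for file in "$dir"/*.mkv; do
        [ -e "$file" ] || continue    # the glob matched nothing
        stem=$(output_stem "$file")
        mkdir -p "$(dirname "$stem")"
        echo "$file -> $stem"
    done
done
```

For example, "Show/ep 01.mkv" maps to "ShowAudio/ep 01", to which the
track number and a codec-appropriate extension can then be appended.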

But I'm not sure where to start on any of the other issues.
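
For the plain-text half of item 6, a rough sketch is possible with
standard tools, assuming the subtitle track has already been extracted
as SubRip (.srt) text; ASS or image-based subtitles would need a
different approach. The function name srt_to_text is made up. Known
limitations: a subtitle line consisting only of digits is dropped
along with the counters, and a byte-order mark at the start of the
file would leave the first counter in place.

```bash
#! /bin/bash
# Sketch: reduce SubRip (.srt) subtitles to their spoken text.
# Assumes the usual layout: a counter line, a "start --> end" line,
# one or more text lines, then a blank line.

srt_to_text() {
    tr -d '\r' |                      # normalize DOS line endings
    sed -e '/^[0-9][0-9]*$/d' \
        -e '/-->/d' \
        -e 's/<[^>]*>//g' \
        -e 's/{[^}]*}//g' \
        -e '/^[[:space:]]*$/d'
    # sed steps: drop counters, drop timestamps, strip <i>-style tags,
    # strip {\an8}-style overrides, drop blank lines
}

for file in "$@"; do
    srt_to_text < "$file" > "${file%.srt}.txt"
done
```

Each .srt given on the command line gets a .txt written next to it.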