Wednesday, December 08, 2010

Converting YouTube annotations into SRT subtitles

It has been a long time since my last blog post. Well, I'm a lazy guy, and English is not my native language. Besides, most things weren't exciting enough for me to write a long article about, so I usually post short comments on my Buzz instead.

Anyway, let's cut to the chase.

These days, more and more people use annotations rather than captions to add "subtitles" to YouTube videos. There are already lots of online and offline "YouTube downloaders" that can download videos, the corresponding captions, or both at once, such as get_flash_videos, clive, youtube-dl, Google2SRT, and Youtube Subtitle Ripper. However, there is not much information available about how to download the annotations and convert them into SRT subtitles. Today, I found a solution.


First of all, I found this comment on the blog post about how to download the annotations in XML format. And yes, I did write a script that downloads the captions and annotations using wget, but it is too simple to be worth mentioning. After downloading the annotations as XML, the next step is to convert them into some subtitle format.

Although there are many subtitle formats available, and a conversion algorithm probably already exists in the Google2SRT source code, I decided to write my own bash script that converts the XML into SRT, which is one of the simplest subtitle formats.
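
For readers who have never looked inside an SRT file: each cue is a sequence number, a line with the start and end times written as HH:MM:SS,mmm and joined by " --> ", the text, and then a blank line. Here is a minimal sketch of that layout in bash. The function name printCue and the sample cues are made up for illustration and are not part of ann2srt.

```bash
#!/bin/bash
# printCue is a hypothetical helper, used here only to show the SRT layout.
# Arguments: sequence number, start time, end time, text.
function printCue() {
    # number, time line, text, then an empty line that ends the cue
    printf '%d\n%s --> %s\n%s\n\n' "$1" "$2" "$3" "$4"
}

printCue 1 '00:00:01,000' '00:00:04,500' 'Hello, world'
printCue 2 '00:00:05,000' '00:00:07,250' 'Second annotation'
```

Note that SRT uses a comma, not a dot, before the milliseconds, which is why the script further down runs its output through tr.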

The script I wrote, called ann2srt, uses XMLStarlet as the XML parsing tool. Other than that, it only uses bash built-ins and coreutils such as cut and tr. For now, the generated SRT may have compatibility problems with some players, because the annotations in the XML are not in chronological order. Adding a sorting step is possible, but since mplayer handles out-of-order subs correctly, I'll leave it this way for now. Here is the code of ann2srt:


#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang <storm_dot_sfyang_at_gmail_dot_com>
# Version: 0.1
# License: GPL v3

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename "$0") ANNOTATION_FILE\n"
}

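# Print one line per annotation: its TEXT, then the t attribute of each
# rectRegion, with a comma after every value
function parseXML() {
    xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n < "${ANN}"
}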

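# Convert a time such as 0:00:05.2 into the SRT form 00:00:05,200
function reformatTime() {
    local H M S
    H=$(echo "$1" | cut -d ':' -f 1)
    M=$(echo "$1" | cut -d ':' -f 2)
    S=$(echo "$1" | cut -d ':' -f 3)
    # 10# forces base 10, so "08" and "09" are not rejected as octal.
    # %06.3f pads the seconds to SS.mmm (the old %02.3f printed 5.200).
    printf '%02d:%02d:%06.3f' "$((10#${H:-0}))" "$((10#${M:-0}))" "${S:-0}" | tr '.' ','
}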

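ANN=$1

# Quoting matters here: with no argument, an unquoted [ -f ] would succeed
[ -f "${ANN}" ] || { usage; exit 1; }

SRT=$(basename "${ANN}" .xml).srt
I=0

# Start from an empty output file
: > "${SRT}"

# The TEXT may itself contain commas, so take the two timestamps from the
# end of the line. Lines that do not end in two timestamps are skipped.
while IFS= read -r LINE; do
    [[ ${LINE} =~ ^(.*),([0-9:.]+),([0-9:.]+),$ ]] || continue
    C=${BASH_REMATCH[1]}
    B=${BASH_REMATCH[2]}
    E=${BASH_REMATCH[3]}
    I=$((I + 1))
    printf '%d\n%s --> %s\n%s\n\n' "${I}" "$(reformatTime "${B}")" "$(reformatTime "${E}")" "${C}" >> "${SRT}"
done < <(parseXML)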
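
The sorting step that the script leaves out could look roughly like the sketch below. It is not part of ann2srt 0.1. The helper names toMillis and sortByStart are made up, and it assumes that every input line is well formed, i.e. "TEXT,START,END," with times written like 0:01:05.2. The idea is to put a numeric key (the start time in milliseconds) in front of each line, sort on that key, and then cut the key off again.

```bash
#!/bin/bash
# Sketch only: toMillis and sortByStart are hypothetical helpers.
# Assumes well-formed lines "TEXT,START,END," with times like 0:01:05.2.

# Convert H:MM:SS.fff into an integer number of milliseconds
function toMillis() {
    local H M S FRAC=0
    IFS=':' read -r H M S <<< "$1"
    if [[ ${S} == *.* ]]; then
        FRAC=${S#*.}      # digits after the decimal point
        S=${S%%.*}        # whole seconds
    fi
    FRAC=${FRAC}00        # pad on the right, then keep three digits
    FRAC=${FRAC:0:3}
    # 10# forces base 10, so "08" and "09" are not read as octal
    echo $(( (10#${H} * 3600 + 10#${M} * 60 + 10#${S}) * 1000 + 10#${FRAC} ))
}

# Read "TEXT,START,END," lines on stdin and print them ordered by START
function sortByStart() {
    local LINE REST START
    while IFS= read -r LINE; do
        REST=${LINE%,}        # drop the trailing comma
        REST=${REST%,*}       # drop END
        START=${REST##*,}     # START is now the last field
        # Prefix each line with its numeric sort key, separated by a tab
        printf '%s\t%s\n' "$(toMillis "${START}")" "${LINE}"
    done | sort -s -t $'\t' -k 1,1n | cut -f 2-
}

printf '%s\n' \
    'Third,0:01:05.2,0:01:08.0,' \
    'First, with a comma,0:00:02.5,0:00:04.0,' \
    'Second,0:00:10.0,0:00:12.0,' | sortByStart
```

The sample lines come out in the order First, Second, Third. To try it in the script, the loop could read from "parseXML | sortByStart" instead of from parseXML alone. The stable sort (-s) keeps annotations with the same start time in their original order.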


A side note for mplayer users: when playing videos with subs generated by this script, remember to turn on SSA/ASS support with the "-ass" option. Due to the nature of annotations, several of them may occupy the same time period. The built-in SRT parser of mplayer will show only one of them, while they will be stacked when -ass is enabled.

SRT is quite a simple format that does not support the special effects that annotations have, such as position and color. The next version of the script will convert the annotations into SSA/ASS format, but only if I find the motivation to improve it...
