Sunday, October 30, 2011

ann2srt v0.3

Although all the bug fixing, testing, and cleanup were finished several days ago, I was a little too lazy to write it up. Anyway, here is the "official release notice" of ann2srt version 0.3.

Thanks to the commenter L, who helped me test and debug the script on Cygwin, version 0.3 of ann2srt can now handle annotations in languages other than Traditional Chinese that contain newlines and commas, and it also runs correctly in the Cygwin environment on the Win32 platform.


Because the version 0.2 script uses CSV (Comma-Separated Values) as an intermediate format, it fails if an annotation contains a newline or a comma. To fix this, version 0.3 uses tr to replace the newlines in the annotation file with spaces. To address the "comma" problem, the delimiter of the intermediate stream is changed from a comma to "|".
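
To see why the delimiter matters, here is a minimal sketch using a made-up annotation (the text and the time values are assumptions for illustration, not taken from a real annotation file). It flattens the text with tr and then extracts the second field with cut, once with a comma as delimiter and once with "|":

```shell
#!/bin/bash
# Hypothetical sample: annotation text containing a comma and a newline.
TEXT=$(printf 'Hello, world\nsecond line')
BEGIN='0:00:01.0'
END='0:00:03.5'

# Flatten the text onto a single line, as the script does with tr.
FLAT=$(printf '%s' "${TEXT}" | tr '\n' ' ')

# Comma-delimited record: field 2 is a fragment of the text, not the begin time.
CSV_BEGIN=$(echo "${FLAT},${BEGIN},${END}" | cut -d ',' -f 2)

# Pipe-delimited record: field 2 is the begin time, as intended.
PIPE_BEGIN=$(echo "${FLAT}|${BEGIN}|${END}" | cut -d '|' -f 2)

echo "comma: [${CSV_BEGIN}]"
echo "pipe:  [${PIPE_BEGIN}]"
```

With the comma, the "begin time" comes out as a piece of the annotation text. With "|", it is the real time value. By the same logic, an annotation that itself contains "|" would still be split in the wrong place.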

Technically speaking, the version 0.2 script should run correctly under Cygwin without any modification. However, Windows uses "DOS style" line endings that consist of CR+LF, so if any of the external programs used by the script is a Win32 binary, or if the input annotation file is in DOS format, the behavior of the script becomes unpredictable. To fix this, tr is used again to convert both the annotation file and the output of the Win32 XMLStarlet from DOS format to UNIX format.
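
The effect of the stray carriage return can be shown with a short sketch. The input line below is a made-up example of a DOS-format value. The only technique used is the same tr -d '\r' filter that the script applies:

```shell
#!/bin/bash
# Hypothetical DOS-format line: a begin time followed by CR+LF.
# $(...) strips the trailing LF but leaves the CR attached to the value.
RAW=$(printf '0:00:01.0\r\n')

# After tr -d '\r' only the UNIX newline remains, and $(...) strips it.
CLEAN=$(printf '0:00:01.0\r\n' | tr -d '\r')

echo "raw length:   ${#RAW}"
echo "clean length: ${#CLEAN}"
```

The raw value is one character longer than the clean one. That invisible extra character is what ends up inside the time stamps and text fields when the conversion is skipped.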

Let's cut to the chase. Here is the source of the version 0.3 script:

#!/bin/bash
#
# Convert YouTube annotations into SRT subtitles
#
# By Shang-Feng Yang
# Version: 0.3
# License: GPL v3
#
# Changelog:
#   * v0.3 (Oct/19/2011):
#     - Fix the parsing errors caused by comma and newline characters in
#       some English annotations
#     - Add transparent dos2unix conversion for compatibility under Cygwin
#   * v0.2 (Jan/19/2011):
#     - Sort the annotations using the "begin" time as key
#     - Minor bug fixes
#   * v0.1 (Dec/7/2010):
#     - Initial release


ANN=$1
SRT=$(basename ${ANN} .xml).srt
IFS=$'\n'
I=0

function usage() {
    echo -e "Usage:\n"
    echo -e "\t$(basename $0) ANNOTATION_FILE\n"
}

function parseXML() {
    cat ${ANN} | tr -d '\r' | tr '\n' ' ' | xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o '|' -m 'segment/movingRegion/rectRegion' -v '@t' -o '|' -b -n | tr -d '\r'
}

function reformatTime() {
    local H=$(echo $1 | cut -d ':' -f 1)
    local M=$(echo $1 | cut -d ':' -f 2)
    local S=$(echo $1 | cut -d ':' -f 3)
    printf '%02d:%02d:%06.3f' ${H} ${M} ${S} | tr '.' ','
}

function time2sod() {
    # Convert time in HH:MM:SS.SSS format into second-of-the-day value
    local SOD=$(echo $1 | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

    echo ${SOD}
}

[ "x${ANN}" = "x" ] && { usage; exit 1; }
[ -f ${ANN} ] || { usage; exit 1; }
[ -f ${SRT} ] && rm ${SRT}
[ -f ${SRT}.tmp ] && rm ${SRT}.tmp

for LINE in $(parseXML); do
    C=$(echo ${LINE} | cut -d '|' -f 1)
    B=$(echo ${LINE} | cut -d '|' -f 2)
    E=$(echo ${LINE} | cut -d '|' -f 3)
    echo "$(time2sod ${B})#${B}#${E}#${C}" >> ${SRT}.tmp
done

grep "###" ${SRT}.tmp && {
    echo "\"${ANN}\" has no valid annotation!" >&2
    rm ${SRT}.tmp
    exit 1
}

for LINE in $(cat ${SRT}.tmp | sort -n -t '#'); do
    (( I++ ))
    C=$(echo ${LINE} | cut -d '#' -f 4)
    B=$(reformatTime $(echo ${LINE} | cut -d '#' -f 2))
    E=$(reformatTime $(echo ${LINE} | cut -d '#' -f 3))
    echo -e "${I}\n${B} --> ${E}\n${C}\n" >> ${SRT}
done

rm ${SRT}.tmp
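
One possible caveat with reformatTime: the printf builtin in bash reads integer arguments with a leading zero as octal, so a minutes field such as 08 or 09 may be rejected as an invalid octal number. The following is only a sketch of one way around that, assuming time stamps of the form H:MM:SS.s. The function name reformatTimeAwk is hypothetical and not part of the released script. It leaves the formatting to awk, which time2sod already relies on and which treats "08" as the decimal number 8:

```shell
#!/bin/bash
# Hypothetical variant of reformatTime (not part of the released script).
# awk reads "08" as the decimal number 8, so leading zeros are harmless.
reformatTimeAwk() {
    echo "$1" | awk -F ':' '{ printf("%02d:%02d:%06.3f", $1, $2, $3) }' | tr '.' ','
}

# Assumed sample time stamp in H:MM:SS.s form.
reformatTimeAwk '0:08:09.5'
echo
```

For the sample input this should print 00:08:09,500, which is the time format used in SRT files.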


The version 0.3 script can also be downloaded from here to avoid typos caused by copy-and-paste:
http://dl.dropbox.com/u/1382119/tmp/ann2srt

In fact, I just found that the customized "code block" may lose its indentation after the Blogger update. If the listing above looks wrong, please download the correct script from the link above.
