Thursday, January 20, 2011

ann2srt v0.2

Last time in my post "Converting Youtube's annotation into SRT subtitle, I released a bash script called "ann2srt" v0.1. Version 0.1 was a pretty crude one that did not deal with the sorting of the subtitles in SRT file, and could possibly be problematic for some SRT parser. Yesterday, I spent some time to improve the script with the sorting functionality, and also fixed some minor bugs in v0.1.


The sorting is achieved by using awk/gawk to convert the "beginning" time of the annotation into seconds and then passing the results into sort for sorting. Since sort is part of the GNU coreutils, and awk/gawk should be installed on most of the distributions, this change should not be a big deal for most people.

Here is the code for v0.2:

#!/bin/bash
#
# Convert the youtube annotation into SRT subtitle
#
# By Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
# Version: 0.2
# License: GPL v3
#
# Changelog:
# * v0.2 (Jan/19/2011):
# - Sort the annotations using the "begin" time as key
# - Minor bugs fixing

function usage() {
echo -e "Usage:\n"
echo -e "\t$(basename $0) ANNOTATION_FILE\n"
}

function parseXML() {
cat ${ANN} |xmlstarlet sel -t -m 'document/annotations/annotation' -v 'TEXT' -o ',' -m 'segment/movingRegion/rectRegion' -v '@t' -o ',' -b -n
}

function reformatTime() {
local H=$(echo $1 |cut -d ':' -f 1)
local M=$(echo $1 |cut -d ':' -f 2)
local S=$(echo $1 |cut -d ':' -f 3)
printf '%02d:%02d:%06.3f' ${H} ${M} ${S} |tr '.' ','
}

function time2sod() {
# Convert time in HH:MM:SS.SSS format into second-of-the-day value
local SOD=$(echo $1 | awk -F ":" '{printf("%f\n", $1*3600+$2*60+$3);}')

echo ${SOD}
}

ANN=$1
SRT=$(basename ${ANN} .xml).srt
IFS=$'\n'
I=0

[ "x${ANN}" = "x" ] && { usage; exit 1; }
[ -f ${ANN} ] || { usage; exit 1; }
[ -f ${SRT} ] && rm ${SRT}
[ -f ${SRT}.tmp ] && rm ${SRT}.tmp

for LINE in $(parseXML); do
C=$(echo ${LINE} |cut -d ',' -f 1)
B=$(echo ${LINE} |cut -d ',' -f 2)
E=$(echo ${LINE} |cut -d ',' -f 3)
echo "$(time2sod ${B})#${B}#${E}#${C}" >> ${SRT}.tmp
done

grep "###" ${SRT}.tmp && {
echo "\"${ANN}\" has no valid annotation!"
rm ${SRT}.tmp
exit 1
}

for LINE in $(cat ${SRT}.tmp|sort -n -t '#'); do
(( I++ ))
C=$(echo ${LINE} |cut -d '#' -f 4)
B=$(reformatTime $(echo ${LINE} |cut -d '#' -f 2))
E=$(reformatTime $(echo ${LINE} |cut -d '#' -f 3))
echo -e "${I}\n${B} --> ${E}\n${C}\n" >> ${SRT}
done

rm ${SRT}.tmp


The usage should be the same with v0.1.

Read more ...