Saturday, April 07, 2012

URL encoding/decoding with shell script

Quite some time ago, I wrote a simple shell script called "urldecode", which decodes an "escaped" (URL-encoded) string using the "printf" utility from GNU coreutils. Today, while writing a shell script to generate a short URL with tinyurl.com, I ran into the opposite problem: I needed to URL-encode a string. So, after reading the "Percent-encoding" page on Wikipedia, I finished my "urlencode" script.


Let me talk about the decoding part first. Decoding a URL-encoded string is relatively simple. The "printf" utility accepts the "\xHH" escape in its format string, where "HH" is one or two hexadecimal digits giving the value of a byte, so the only pre-processing needed is to replace each '%' character in the string with '\x'. After that, just pass the processed string to "printf" to get the decoded string. The following code is my implementation of this process:


#!/bin/bash
#
# urldecode - decoding the URL-encoded string
#
# (C)2010 Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
#
# License: GPLv3

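ENC_STR="$*"
if [ -z "${ENC_STR}" ]; then
    # No argument given: read the encoded string from STDIN
    TMP_STR="$(sed -e 's/%/\\x/g')"
else
    TMP_STR="$(printf '%s\n' "${ENC_STR}" | sed -e 's/%/\\x/g')"
fi
PRINTF=/usr/bin/printf
# The converted string is the format string, so printf expands each \xHH
exec "${PRINTF}" "${TMP_STR}\n"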

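The "urldecode" script can read the string from either STDIN or the command-line arguments. It has an obvious shortcoming: since the whole string is passed to the "printf" utility as its format string, which is a single command-line argument, the operation will fail if the encoded string exceeds the system's limit on argument length. For the same reason, any backslash sequence already present in the input gets interpreted by "printf" as well.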
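One possible way around this, sketched here under the assumption that bash's builtin "printf" is available, is to pass the converted string as an argument to the "%b" conversion instead of as the format string. Because a builtin is not a separate process, the argument length limit should not apply. The function names "urldecode_arg" and "urldecode_stream" are made up for this illustration, and backslash sequences in the input are still interpreted by "%b":

```shell
#!/bin/bash
# urldecode_arg is a hypothetical helper name, not part of the original script.
urldecode_arg() {
    # Replace every '%' with '\x' via parameter expansion, then let the
    # builtin printf expand the escapes through %b, so the data is an
    # ordinary argument rather than the format string.
    printf '%b\n' "${1//%/\\x}"
}

# Hypothetical STDIN variant: decode the input one line at a time.
urldecode_stream() {
    local line
    while IFS= read -r line || [ -n "${line}" ]; do
        urldecode_arg "${line}"
    done
}
```

Reading line by line also keeps the memory use small for long inputs, at the cost of being slower than a single "sed" call.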

The encoding part is a little more complicated. At first, I thought about finding the reserved characters, escaping them, and then replacing the original characters with the escaped ones. For that purpose, I wrote a short script called "char2hex", which prints the hexadecimal byte value of a given character:


#!/bin/bash
#
# char2hex - returning the hexadecimal value of the given characters
#
# (C)2012 Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
#
# License: GPLv3

function usage() {
echo -e "Usage:\n"
echo -e "\t$(basename "$0") CHARACTER(S)_TO_CONVERT\n"
}

CHAR=$1

[ "x${CHAR}" == "x" ] && { usage; exit 1; }

echo -n "${CHAR}" | od -A n -t x1 | tr -d ' '

This script is quite straightforward. The only thing worth mentioning is the reason for the '-n' option to the "echo" command. By default, "echo" appends a newline character to whatever it prints, so you would get an "additional" "0a" in the output. The '-n' option turns off this behavior.
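To make the newline effect concrete, here is a small sketch that wraps the same "od" pipeline in two functions. The function names are hypothetical, and "printf '%s'" is used as an alternative to "echo -n" for printing the argument without a trailing newline:

```shell
#!/bin/bash
# Hypothetical function form of the char2hex pipeline.
char2hex() {
    # printf '%s' writes the argument without a trailing newline.
    printf '%s' "$1" | od -A n -t x1 | tr -d ' \n'
}

# For comparison: plain echo appends a newline, which od reports as 0a.
char2hex_with_newline() {
    echo "$1" | od -A n -t x1 | tr -d ' \n'
}
```

With these definitions, "char2hex '/'" should give "2f", while "char2hex_with_newline '/'" should give "2f0a".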

This approach seems relatively elegant and simple, but the implementation could potentially be a nightmare. Maybe I'm just not smart enough, but I cannot figure out a simple way to "pick out" the reserved characters from the input string or input stream and pass them to the "char2hex" script using only simple shell syntax or simple utilities. Either it would take too much effort, or the script would be quite inefficient due to heavy I/O. This is clearly not an acceptable way to do this kind of thing for a lazy guy like me.
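For reference, a per-character approach can be sketched in pure bash without calling an external script for each character. This is only an illustration under a few assumptions: the input is ASCII-only, the function name "urlencode_loop" is made up, and, unlike a list of reserved characters, it escapes everything except letters, digits, and the characters '-', '.', '_', and '~':

```shell
#!/bin/bash
# urlencode_loop is a hypothetical name; assumes ASCII-only input.
urlencode_loop() {
    local str=$1 out='' c hex i
    for (( i = 0; i < ${#str}; i++ )); do
        c=${str:i:1}
        case ${c} in
            # Unreserved characters are copied as they are.
            [a-zA-Z0-9._~-]) out+=${c} ;;
            # A leading quote makes printf use the character's numeric
            # value, which %02X prints as two uppercase hex digits.
            *) printf -v hex '%%%02X' "'${c}"
               out+=${hex} ;;
        esac
    done
    printf '%s\n' "${out}"
}
```

The loop stays inside the shell, so there is no extra process per character, but it is still much slower than a single "sed" call on long input.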

After reading the sections "Percent-encoding reserved characters" and "Percent-encoding the percent character" of the Wikipedia page "Percent-encoding", I found that there are not many reserved characters that need to be encoded, so it is practical to implement the encoding with a "lookup table". So, the solution is stupid but simple:


#!/bin/bash
#
# urlencode - escaping the reserved characters using URL-encoding
#
# (C)2012 Shang-Feng Yang <storm_DOT_sfyang_AT_gmail_DOT_com>
#
# License: GPLv3

STR="$*"
[ -z "${STR}" ] && STR="$(cat)"

# '%' must be handled first; otherwise the '%' introduced by the
# other rules would be encoded again (e.g. ' ' -> '%20' -> '%2520').
printf '%s\n' "${STR}" | sed -e 's|%|%25|g' \
    -e 's| |%20|g' \
    -e 's|!|%21|g' \
    -e 's|#|%23|g' \
    -e 's|\$|%24|g' \
    -e 's|&|%26|g' \
    -e "s|'|%27|g" \
    -e 's|(|%28|g' \
    -e 's|)|%29|g' \
    -e 's|*|%2A|g' \
    -e 's|+|%2B|g' \
    -e 's|,|%2C|g' \
    -e 's|/|%2F|g' \
    -e 's|:|%3A|g' \
    -e 's|;|%3B|g' \
    -e 's|=|%3D|g' \
    -e 's|?|%3F|g' \
    -e 's|@|%40|g' \
    -e 's|\[|%5B|g' \
    -e 's|]|%5D|g'

The "urlencode" script is too simple to need much explanation. Like "urldecode", it accepts the target string from either STDIN or the command-line arguments. The following demonstrates the usage of the scripts:


$ urlencode http://en.wikipedia.org/wiki/Percent-encoding
http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPercent-encoding
$ echo 'http://en.wikipedia.org/wiki/Percent-encoding' |urlencode
http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPercent-encoding
$ urldecode $(urlencode http://en.wikipedia.org/wiki/Percent-encoding)
http://en.wikipedia.org/wiki/Percent-encoding
$ urlencode http://en.wikipedia.org/wiki/Percent-encoding |urldecode
http://en.wikipedia.org/wiki/Percent-encoding
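
One detail of the lookup-table approach that the example URL above does not exercise is the order of the "sed" rules. Since every replacement contains a '%', the rule for '%' itself has to run before the others, or it will encode the output of the earlier rules a second time. The following sketch, with hypothetical function names and only two rules each, shows the difference:

```shell
#!/bin/bash
# Both function names are hypothetical and only cover two characters.

# Wrong order: the '%' produced for the space is encoded again.
encode_space_first() {
    printf '%s\n' "$1" | sed -e 's| |%20|g' -e 's|%|%25|g'
}

# Right order: literal '%' characters are encoded before any rule
# introduces new ones.
encode_percent_first() {
    printf '%s\n' "$1" | sed -e 's|%|%25|g' -e 's| |%20|g'
}
```

For the input 'a b', the first function should print "a%2520b" and the second "a%20b".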

PS. Since I "upgraded" the old template to a new one, there are some formatting errors in the code and terminal blocks. I will probably fix them by modifying the underlying CSS of the new template in the future, if I get enough motivation...
