How to Convert a String Representing a Unicode Character Sequence to the Unicode Character

I recently received some translated resource files from the Translations team at work.  To my surprise, all of the files, even those for double-byte languages, came back as ASCII-encoded files.  After some inquiry, I found out that a technical limitation of a proven legacy system forces every translation file to be encoded as ASCII.  This left me with a set of ASCII text files containing Unicode escape sequences (\uxxxx) that I was responsible for converting to a proper Unicode encoding.

While solving the problem, I came across a couple of solutions for converting Unicode escape sequences to a different encoding.  The first was to use the StringEscapeUtils class in Apache Commons Lang.

String lineOfUnicodeText = StringEscapeUtils.unescapeJava(lineOfASCIIText);

Using the StringEscapeUtils class is very straightforward: read the contents of the ASCII file line by line, feed each line into the unescapeJava method, and write the unescaped text to a new, properly encoded file.  The drawback is that this technique requires writing a small utility program to do the reading and writing.  Not hard to do, but much more work than ideal.
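For illustration, here is a minimal sketch of what such a utility might look like.  The class name EscapedFileConverter and its methods are hypothetical, not something from Commons Lang.  To keep the sketch dependency-free, it uses a small hand-written unescapeUnicode method that handles only the backslash-u sequences; with Commons Lang on the classpath, you would presumably call StringEscapeUtils.unescapeJava at that point instead.  It also assumes the input really is pure ASCII and that UTF-8 is the encoding you want for the output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method signatures are illustrative only.
final class EscapedFileConverter {

    // Matches a backslash, one or more 'u' characters, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each Unicode escape sequence with the UTF-16 code unit it names.
    // Assumption: only Unicode escapes matter here. Other Java escapes (newline,
    // tab, escaped backslash) are left untouched, unlike unescapeJava.
    static String unescapeUnicode(String line) {
        Matcher m = UNICODE_ESCAPE.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(line, last, line.length());
        return out.toString();
    }

    // Reads the ASCII file line by line and writes the unescaped text as UTF-8.
    // Reading fails with an exception if the input contains non-ASCII bytes.
    static void convert(Path asciiFile, Path utf8File) throws IOException {
        try (BufferedReader reader =
                     Files.newBufferedReader(asciiFile, StandardCharsets.US_ASCII);
             BufferedWriter writer =
                     Files.newBufferedWriter(utf8File, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(unescapeUnicode(line));
                writer.newLine(); // uses the platform line separator
            }
        }
    }

    // Usage: java EscapedFileConverter <asciiInput> <utf8Output>
    public static void main(String[] args) throws IOException {
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Working line by line is safe here because a single escape sequence never spans a line break, and a surrogate pair written as two consecutive escapes comes out as two adjacent code units that together form the intended character.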

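The second solution is to use the native2ascii utility included with the JDK (through Java 8; the tool was removed in JDK 9).  Run with the -reverse option, the utility reads the input file and performs effectively the same unescape transformation that Apache Commons does, writing the result in the encoding named by -encoding.  Without -reverse it goes the other way, turning native characters into escape sequences.

native2ascii -reverse -encoding utf8 c:\source.txt c:\output.txt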

A very simple solution that works as advertised.  No quirks or caveats that I’ve noticed.  There’s even an Ant task for incorporating native2ascii into build scripts.
