Thursday, March 8, 2018

Removing accents and other diacritical marks from Unicode text to convert it into English letters

Often I need to convert Unicode text, e.g. French, into plain English letters. The general way to remove diacritical marks in Java is to decompose each character into a base letter followed by separate combining marks, using java.text.Normalizer with form NFD, and then to remove all the chars holding diacritical signs with the regular expression \p{InCombiningDiacriticalMarks}+, which matches the "Combining Diacritical Marks" Unicode character block (U+0300 to U+036F).

The sample class below takes as its input a meaningless text made up of French words with various accents:

import java.text.Normalizer;

public class Clean {

    // prints the string followed by its length in UTF-16 chars
    static void describe(String str) {
        System.out.println(str + " " + str.length());
    }

    public static void main(String[] args) {
        String str = "«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé.";
        describe(str);
        // NFD splits each accented letter into a base letter plus combining mark(s)
        String normalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        // the regexp corresponds to Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS
        String noDiacriticalMarks = normalizedString.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        describe(normalizedString);
        describe(noDiacriticalMarks);
    }
}

In the output, the first line is the original string. The second is the same string after normalization: it looks identical when printed, but each accent is now stored as a separate character, which is why the string is 11 chars longer. In the third line those characters have been removed. Each line ends with the length of the string.

«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé. 96
«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé. 107
«J'ai leve la tete. Il doit etre francais». Il n'a pense a lui oter l'age et se met a nager age. 96
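
To make the effect of NFD normalization visible, here is a minimal sketch that prints the individual chars of a single accented letter before and after decomposition. The class name DecompositionDemo and the helper codeUnits are hypothetical and only serve this illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class DecompositionDemo {

    // Renders each UTF-16 char of the string as U+XXXX, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00e9"; // e with acute accent as one precomposed char
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(codeUnits(composed));   // U+00E9
        System.out.println(codeUnits(decomposed)); // U+0065 U+0301
    }
}
```

The second line shows the plain letter e (U+0065) followed by the combining acute accent (U+0301), which falls inside the block that the regular expression removes.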

Remove diacritical marks from a string in JavaScript

An analogous approach in JavaScript, taken from here:

const str = "Crème Brulée"
console.log(str + "=>" + str.normalize('NFD').replace(/[\u0300-\u036f]/g, ""));
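
The JavaScript version spells out the range U+0300 to U+036F instead of naming the block. As a rough sketch, the same explicit range can be used in Java and should behave like the named block for this purpose. The class RangeVsBlock and its methods are hypothetical names introduced for this comparison; both patterns are compiled once so they can be reused.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical comparison class; the names are illustrative only.
final class RangeVsBlock {

    // Named Unicode block, as in the Java sample above.
    private static final Pattern BLOCK =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Explicit range U+0300..U+036F, mirroring the JavaScript regex.
    private static final Pattern RANGE =
            Pattern.compile("[\\u0300-\\u036f]+");

    // Decomposes the input with NFD, then deletes everything the pattern matches.
    private static String strip(Pattern marks, String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return marks.matcher(decomposed).replaceAll("");
    }

    static String stripByBlock(String input) {
        return strip(BLOCK, input);
    }

    static String stripByRange(String input) {
        return strip(RANGE, input);
    }

    public static void main(String[] args) {
        String str = "Crème Brulée";
        System.out.println(str + "=>" + stripByBlock(str)); // Crème Brulée=>Creme Brulee
        System.out.println(str + "=>" + stripByRange(str)); // Crème Brulée=>Creme Brulee
    }
}
```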
