Often I need to convert Unicode text, e.g. French, into plain English letters. The general way to remove diacritical marks is to decompose each character into a base letter and separate combining marks using Normalizer with form NFD, and then remove the combining marks with the regular expression \p{InCombiningDiacriticalMarks}+, which matches the "Combining Diacritical Marks" Unicode block.
The sample class below takes as input a meaningless text made up of French words with various accents:
import java.text.Normalizer;

public class Clean {

    // Prints the string followed by its length in chars
    static void describe(String str) {
        System.out.println(str + " " + str.length());
    }

    public static void main(String[] args) {
        String str = "«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé.";
        describe(str);
        // NFD splits each accented letter into a base letter plus combining marks
        String normalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        // the regexp corresponds to Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS
        String noDiacriticalMarks = normalizedString.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        describe(normalizedString);
        describe(noDiacriticalMarks);
    }
}
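In the output, the first line is the original string and the second is the same string after normalization. Note that in the normalized string the accents are stored as individual characters, which is why its length grows from 96 to 107 (one extra char for each of the 11 accents). These characters are removed in the third line. Each line ends with the length of the string.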
«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé. 96
«J'ai levé la tête. Il doit être français». Il n'a pensé à lui ôter l'âge et se met à nager âgé. 107
«J'ai leve la tete. Il doit etre francais». Il n'a pense a lui oter l'age et se met a nager age. 96
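To make the decomposition visible, here is a minimal sketch that prints the code points of a single accented letter before and after NFD, and wraps the stripping step in a reusable method. The class name DiacriticsDemo and the helpers stripMarks and codePoints are my own names for illustration, not part of any library. The last call shows a limitation worth knowing: letters such as "ø" have no canonical decomposition, so this approach leaves them untouched.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DiacriticsDemo {

    // Compiled once and reused; matches runs of chars from the
    // Combining Diacritical Marks block (U+0300..U+036F).
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical helper: decompose to NFD, then drop the combining marks.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // Hypothetical helper: lists the code points of a string in U+ notation.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String composed = "\u00e9"; // e with acute as one precomposed char
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codePoints(composed));   // U+00E9
        System.out.println(codePoints(decomposed)); // U+0065 U+0301

        // "Creme Brulee": every accent here decomposes into a combining mark
        System.out.println(stripMarks("Cr\u00e8me Br\u00fbl\u00e9e"));

        // Printed unchanged: o with stroke (U+00F8) has no decomposition
        System.out.println(stripMarks("sm\u00f8rrebr\u00f8d"));
    }
}
```

Keeping the Pattern in a static field avoids recompiling the regular expression on every call, which String.replaceAll would otherwise do.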
Remove diacritical marks from a string in JavaScript
An analogous approach works in JavaScript:
const str = "Crème Brulée";
console.log(str + "=>" + str.normalize('NFD').replace(/[\u0300-\u036f]/g, ""));
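The character class [\u0300-\u036f] in this snippet spells out the same Combining Diacritical Marks block by its code point range. As a rough check, the following Java sketch applies both spellings to the same input and compares the results; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class RangeVersusBlock {

    // Strips marks using the named Unicode block, as in the Java sample.
    static String stripByBlock(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Strips marks using the explicit range from the JavaScript regex.
    static String stripByRange(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[\\u0300-\\u036f]+", "");
    }

    public static void main(String[] args) {
        String str = "Cr\u00e8me Brul\u00e9e"; // same text as the JavaScript example

        System.out.println(stripByBlock(str)); // Creme Brulee
        System.out.println(stripByRange(str)); // Creme Brulee
        System.out.println(stripByBlock(str).equals(stripByRange(str))); // true
    }
}
```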