Correctly reversing a string
When people need to reverse a string, they usually do it more or less like this:
char[] a = s.ToCharArray();
System.Array.Reverse(a);
string r = new string(a);
However, this approach is actually wrong, and not just because of the missing null check.
It is wrong because a grapheme cluster (what a reader perceives as a single character) can consist of several code points, and reversing the char array reorders those code points individually.
To see why, we first have to be clear about what the term “character” actually means.
“Character” is an overloaded term that can mean many things.
> A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
>
> A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there’s also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
>
> A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font’s GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
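So in C#, a char is not a grapheme. It is a UTF-16 code unit, which for most text corresponds to exactly one code point (code points outside the Basic Multilingual Plane take two chars, a surrogate pair).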
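To make these units concrete, here is a minimal sketch that counts the same string in char values (UTF-16 code units, which is what string.Length reports), in code points, and in text elements. The class and method names are made up for illustration. Exactly which sequences StringInfo treats as one text element depends on the .NET version, so counts for more complex input such as emoji sequences may differ.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class TextUnits
{
    // Number of UTF-16 code units; this is what string.Length reports.
    public static int CountCodeUnits(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points; a well-formed surrogate pair counts once.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // Skip the low half of a surrogate pair so the pair is counted once.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    // Number of text elements (grapheme clusters) as determined by StringInfo.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "Les Mise\u0301rables" this gives 15 code units and 15 code points, but only 14 text elements, because the e and the combining accent U+0301 form one text element.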
This means that if you simply reverse a valid string like Les Misérables, which can be stored like this
string s = "Les Mise\u0301rables";
as a sequence of characters, you will get:
selbaŕesiM seL
As you can see, the accent is now on the r instead of the e. Reversing the char array twice does yield the original string again, but a single char-by-char reversal is definitely NOT the reverse of the original string as a reader sees it.
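Because a misplaced combining accent is hard to spot in rendered text, it can help to look at the individual char values. The following is a small sketch (the class and method names are hypothetical) that performs the naive reversal and prints each char as a U+XXXX value.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class NaiveReverseDemo
{
    // The naive, char-by-char reversal discussed above.
    public static string ReverseChars(string s)
    {
        char[] a = s.ToCharArray();
        Array.Reverse(a);
        return new string(a);
    }

    // Formats every char (UTF-16 code unit) of s as U+XXXX, separated by spaces.
    public static string ToCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For the three-char input "e\u0301r", ToCodeUnits(ReverseChars(...)) returns "U+0072 U+0301 U+0065": the combining accent U+0301 now follows the r (U+0072) instead of the e (U+0065). The same reversal also swaps the two halves of a surrogate pair, which leaves a string that is no longer well-formed UTF-16.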
Instead, you need to reverse the order of the grapheme clusters while keeping the code points inside each cluster in their original order. So, if done correctly, you reverse a string like this:
private static System.Collections.Generic.List<string> GraphemeClusters(string s)
{
    System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
    System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while (enumerator.MoveNext())
    {
        ls.Add((string)enumerator.Current);
    }
    return ls;
}
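// Reverses the order of the grapheme clusters; each cluster itself stays intact.
private static string ReverseGraphemeClusters(string s)
{
    // Null, empty, and single-char strings have nothing to reverse.
    if (string.IsNullOrEmpty(s) || s.Length == 1)
        return s;
    System.Collections.Generic.List<string> ls = GraphemeClusters(s);
    ls.Reverse();
    return string.Join("", ls.ToArray());
}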
public static void TestMe()
{
    string s = "Les Mise\u0301rables";
    // s = "noël";
    string r = ReverseGraphemeClusters(s);
    // This would be wrong:
    // char[] a = s.ToCharArray();
    // System.Array.Reverse(a);
    // string r = new string(a);
    System.Console.WriteLine(r);
}
And - oh joy - you’ll realize that if you do it correctly like this, it also works for Asian, South Asian, and East Asian scripts (and for French, Swedish, Norwegian, etc.).