To Lower or Not to Lower #
Introduction #
In the good old times, when everybody spoke English and ANSI was the only game in town, converting from uppercase to lowercase meant only flipping a bit. Just OR any uppercase ASCII letter with 0x20 and you go from “A” (0x41) to “a” (0x61). Life was simpler and we were all happy[1].
When you start playing with multi-lingual character sets, changing between uppercase and lowercase becomes more complicated. Just as an example, my name should be written “Neacșu” and in all caps that makes “NEACȘU”. The “Latin small letter s with cedilla”, as Unicode calls it, has code 0x15F and the “Latin capital letter S with cedilla” has code 0x15E. The two happen to differ by just one bit, but it is a different bit.
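The contrast can be checked with a few lines of C++. This is only a sketch: `ascii_tolower`, `bit_diff` and `check_bit_tricks` are hypothetical helpers written for this illustration and are not part of the library.

```cpp
#include <cassert>

// Hypothetical helper: ASCII-only lowercase conversion.
// Only 'A'..'Z' are touched; every other code point is returned unchanged.
constexpr char32_t ascii_tolower (char32_t c)
{
  return (c >= U'A' && c <= U'Z') ? static_cast<char32_t> (c | 0x20) : c;
}

// Hypothetical helper: the bits that differ between two code points
constexpr char32_t bit_diff (char32_t a, char32_t b)
{
  return a ^ b;
}

// Small self-check of the claims made in the text
inline void check_bit_tricks ()
{
  assert (ascii_tolower (U'A') == U'a');     // 0x41 -> 0x61
  assert (bit_diff (U'A', U'a') == 0x20);    // ASCII pair differs in bit 5
  assert (bit_diff (0x15E, 0x15F) == 0x01);  // S-with-cedilla pair differs in bit 0
  assert ((0x15E | 0x20) == 0x17E);          // OR-ing 0x20 lands on U+017E instead
}
```

Blindly applying the ASCII trick to U+015E yields U+017E, which is “Latin small letter z with caron”, an unrelated letter.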
This article shows a multi-lingual implementation of the tolower and toupper functions. The latest version of the code can always be downloaded from my GitHub site.
Background #
Knowing which letters are uppercase and which are lowercase in all the alphabets of all the world's languages is a daunting task for any programmer. Luckily, the Unicode Consortium, the body that administers the Unicode standard, provides this information in its main data file, UnicodeData.txt. Another document describes the format of the data file. You would have to browse other documents to learn about the different kinds of case mapping: common, full, simple and Turkic. My functions implement only the “common” and “simple” cases.
Implementation #
The basic idea of the implementation is quite simple. For uppercase to lowercase mapping, generate from the Unicode data file two tables of equal size, one with the uppercase letters and the other with the matching lowercase ones. The uppercase table is sorted to allow for binary searching. If a code is found in the uppercase table, it is replaced with the code at the same index in the lowercase table. The process is identical for lowercase to uppercase mapping but, owing to the quirks of different languages, the tables are different.
Here is the code for the tolower function:
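//definition of 'u2l' and 'lc' tables
#include "uppertab.h"
//...
std::string tolower (const std::string& str)
{
  // convert UTF-8 input to UTF-32 so each element is one code point
  u32string wstr = runes (str);
  for (auto ptr = wstr.begin (); ptr != wstr.end (); ptr++)
  {
    // binary search in the sorted uppercase table (the tables are const)
    const char32_t *f = lower_bound (begin (u2l), end (u2l), *ptr);
    if (f != end (u2l) && *f == *ptr)
      *ptr = lc[f - u2l]; // same index in the lowercase table
  }
  // convert back to UTF-8
  return narrow (wstr);
}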
The uppertab.h file is generated by a short program, gen_casetab, that reads the Unicode data file and produces the two tables, u2l and lc. Below is a short sample of those tables:
//Upper case table
static const char32_t u2l [1460] = {
0x00041, // LATIN CAPITAL LETTER A
0x00042, // LATIN CAPITAL LETTER B
0x00043, // LATIN CAPITAL LETTER C
0x00044, // LATIN CAPITAL LETTER D
....
0x1e91f, // ADLAM CAPITAL LETTER ZAL
0x1e920, // ADLAM CAPITAL LETTER KPO
0x1e921};// ADLAM CAPITAL LETTER SHA
//Lower case equivalents
static const char32_t lc [1460] = {
0x00061, 0x00062, 0x00063, 0x00064, 0x00065, 0x00066, 0x00067, 0x00068,
0x00069, 0x0006a, 0x0006b, 0x0006c, 0x0006d, 0x0006e, 0x0006f, 0x00070,
...
0x1e939, 0x1e93a, 0x1e93b, 0x1e93c, 0x1e93d, 0x1e93e, 0x1e93f, 0x1e940,
0x1e941, 0x1e942, 0x1e943};
The input string is converted to UTF-32 by calling the runes function. Each character is then searched for in the u2l table and, if found, it is replaced with the matching character from the lc table. Note the use of the lower_bound function, which performs a binary search in the u2l table. The resulting string is converted back to UTF-8 and that is that.
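To see the lookup step in isolation, here is a minimal sketch that works directly on UTF-32 and uses two tiny hand-written tables. The names `toy_u2l`, `toy_lc`, `lower_one`, `lower_all` and `check_toy_lookup` are made up for this illustration; the real tables are generated and much larger, and the UTF-8 conversion done by runes and narrow is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Toy tables for illustration only. The uppercase table must be sorted.
static const char32_t toy_u2l[] = {0x0041, 0x0042, 0x015E, 0x03A6};
static const char32_t toy_lc[]  = {0x0061, 0x0062, 0x015F, 0x03C6};

// Lowercase one code point: binary-search the uppercase table and, on an
// exact match, return the entry at the same index in the lowercase table.
char32_t lower_one (char32_t c)
{
  const char32_t *f = std::lower_bound (std::begin (toy_u2l), std::end (toy_u2l), c);
  if (f != std::end (toy_u2l) && *f == c)
    return toy_lc[f - std::begin (toy_u2l)];
  return c; // not an uppercase letter known to the table
}

// Apply the lookup to every code point of a UTF-32 string
std::u32string lower_all (std::u32string s)
{
  std::transform (s.begin (), s.end (), s.begin (), lower_one);
  return s;
}

// Small self-check
inline void check_toy_lookup ()
{
  assert (lower_one (U'A') == U'a');
  assert (lower_one (U'C') == U'C'); // absent from the toy table
  assert (lower_all (U"AB!") == U"ab!");
}
```

The equality test after lower_bound matters: lower_bound returns the first element not less than the key, so for a code point that is missing from the table it points at some other entry (or at the end).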
The toupper function is very similar except that it uses the l2c and uc tables, defined in the lowertab.h file.
For the sake of completeness, both functions also have an in-place variant:
void make_lower (std::string& str);
void make_upper (std::string& str);
The table generation program, gen_casetab.cpp, is also very straightforward. It reads and parses the UnicodeData.txt file and produces first the uppertab.h file and then the lowertab.h file.[2]
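As an illustration of the parsing step, the sketch below splits one UnicodeData.txt record and extracts the simple lowercase mapping. Each record has 15 semicolon-separated fields: field 0 is the code point, and fields 12, 13 and 14 hold the simple uppercase, lowercase and titlecase mappings. The helper names `split_fields`, `simple_lower` and `check_parse` are hypothetical, and the actual gen_casetab.cpp may be organized quite differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one UnicodeData.txt record into its semicolon-separated fields
std::vector<std::string> split_fields (const std::string& line)
{
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  for (;;)
  {
    auto pos = line.find (';', start);
    if (pos == std::string::npos)
    {
      fields.push_back (line.substr (start)); // last field
      break;
    }
    fields.push_back (line.substr (start, pos - start));
    start = pos + 1;
  }
  return fields;
}

// Hypothetical helper: extract (code, simple lowercase mapping) from a record.
// Returns false when the record is malformed or has no lowercase mapping.
bool simple_lower (const std::string& line, char32_t& code, char32_t& lower)
{
  auto f = split_fields (line);
  if (f.size () < 15 || f[13].empty ())
    return false;
  code = static_cast<char32_t> (std::stoul (f[0], nullptr, 16));
  lower = static_cast<char32_t> (std::stoul (f[13], nullptr, 16));
  return true;
}

// Small self-check using the records for 'A' and 'a'
inline void check_parse ()
{
  char32_t code = 0, lower = 0;
  bool ok = simple_lower ("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", code, lower);
  assert (ok && code == 0x41 && lower == 0x61);
  ok = simple_lower ("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041", code, lower);
  assert (!ok); // 'a' has no lowercase mapping
}
```

A generator built this way would collect every (code, lower) pair, sort the pairs by code and write them out as two parallel arrays.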
Using the Code #
All functions are in the utf8 namespace. Because these and many other functions in this namespace have the same names as standard C/C++ functions, my recommendation is not to use a “using” directive. Below is a short example showing how to call these functions:
#include <utf8.h>
...
string all_caps = utf8::toupper (u8"Neacșu"); // all_caps should be "NEACȘU"
string greek {u8"αλφάβητο"};
utf8::make_upper (greek); // in-place variant; greek should now be "ΑΛΦΆΒΗΤΟ"
Case Mapping vs Case Folding #
An earlier version of these functions used a different Unicode Consortium data file, CaseFolding.txt, which contains correspondences between uppercase symbols and their lowercase equivalents. What was curious was that this correspondence was not one-to-one: more than one uppercase symbol could correspond to the same lowercase symbol. In my early implementation the second code was dropped from the table. It seemed to work OK, but the whole idea was somewhat troubling: it meant there was no unique way to convert a lowercase string to its uppercase equivalent. For instance, both the uppercase Latin letter K (U+004B) and the Kelvin sign (K, U+212A) are folded into the lowercase letter k.
Things came to a head when two different compilers transformed one of the strings shown above, αλφάβητο, in two different ways. One would generate ΑΛΦΆΒΗΤΟ while the other would generate ΑΛϕΆΒΗΤΟ (bonus points if you spot the difference)[3]. After studying some more, I realized that Unicode distinguishes between two operations: case folding and case mapping. The first brings a character string to a case-insensitive form (in particular, its lowercase form), while the second converts from uppercase to lowercase or from lowercase to uppercase. So the case folding table I was using was not meant to be used for case mapping. I had to change my implementation to use the UnicodeData.txt file.
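The problem with running a folding table backwards can be reproduced with four entries. The sketch below uses a toy subset of CaseFolding.txt; the names `fold_entry`, `toy_fold`, `invert_fold` and `check_fold_is_not_invertible` are made up for this illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct fold_entry
{
  char32_t from; // code point being folded
  char32_t to;   // its case-folded form
};

// A few "common" folding entries, as listed in CaseFolding.txt
static const fold_entry toy_fold[] = {
  {0x004B, 0x006B}, // LATIN CAPITAL LETTER K  -> k
  {0x03A6, 0x03C6}, // GREEK CAPITAL LETTER PHI -> small phi
  {0x03D5, 0x03C6}, // GREEK PHI SYMBOL         -> small phi
  {0x212A, 0x006B}, // KELVIN SIGN              -> k
};

// Invert the folding table: for each folded code point, collect every
// code point that folds to it.
std::map<char32_t, std::vector<char32_t>> invert_fold ()
{
  std::map<char32_t, std::vector<char32_t>> inv;
  for (const auto& e : toy_fold)
    inv[e.to].push_back (e.from);
  return inv;
}

// Small self-check: both targets have two candidate sources
inline void check_fold_is_not_invertible ()
{
  auto inv = invert_fold ();
  assert (inv[0x006B].size () == 2); // k <- K, KELVIN SIGN
  assert (inv[0x03C6].size () == 2); // small phi <- capital phi, phi symbol
}
```

Any scheme that keeps a single source per folded code point has to pick one of the candidates, and nothing in the folding data says which one is the uppercase letter.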
Points of Interest #
These functions are not so lightweight: the conversion tables take almost 25 KB (the u2l and lc tables alone hold 2 × 1460 four-byte entries, or 11,680 bytes), but I guess this is the price you have to pay if you need multi-lingual case awareness.
History #
- 17-Feb-2020 - Initial version.
- 05-May-2023 - Updated link to Unicode case folding document
- 06-Nov-2024 - Updated code version and distinction between case folding and case mapping.
Footnotes #
1. There were never such good times: it just happened that most of the nerds who worked in the field spoke English, ANSI code was called ASCII and was used by everyone except for a big blue company that insisted on using EBCDIC. This was so wasteful: EBCDIC used 8 bits for what ASCII could do with 7!
2. The command line for gen_casecvt is: gen_casecvt <input file> <output path>
3. In technical terms, one was translating “Greek Small Letter Phi” (U+03C6) to “Greek Capital Letter Phi” (U+03A6), while the other was translating it to “Greek Phi Symbol” (U+03D5).