To Lower or Not to Lower

Cyrillic letters in cursive - source: Wikipedia

Introduction #

In the good old times, when everybody spoke English and ANSI was the only game in town, converting from upper case to lower case was just a matter of flipping a bit. Just OR any upper case ASCII letter with 0x20 and you go from “A” (0x41) to “a” (0x61). Life was simpler and we were all happy[1].
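As a minimal sketch of that bit trick (the helper names below are made up for illustration and are not part of the library), the range checks are my addition: OR-ing a character that is not a letter with 0x20 would silently corrupt it.

```cpp
// ASCII-only case conversion: upper and lower case letters differ
// only in bit 5 (0x20). Helper names are illustrative.
constexpr char ascii_tolower (char c)
{
  // Set bit 5, but only for 'A'..'Z'
  return (c >= 'A' && c <= 'Z') ? static_cast<char> (c | 0x20) : c;
}

constexpr char ascii_toupper (char c)
{
  // Clear bit 5, but only for 'a'..'z'
  return (c >= 'a' && c <= 'z') ? static_cast<char> (c & ~0x20) : c;
}
```

Anything outside the two letter ranges passes through unchanged, which is exactly where this approach stops being useful for other alphabets.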

When you start playing with multi-lingual character sets, changing between upper case and lower case gets more complicated. Just as an example, my name should be written “Neacșu” and in all caps that makes “NEACȘU”. The “Latin small letter s with cedilla”, as Unicode calls it, has code 0x15F and the “Latin capital letter S with cedilla” has code 0x15E. The difference happens to be just one bit, but it is a different bit (0x01 instead of 0x20).

This article shows a multi-lingual implementation of the tolower and toupper functions. The latest version of the code can always be downloaded from my GitHub site.

Background #

Knowing which letters are upper case and which are lower case in all the alphabets of all the languages in the world is a daunting task for any programmer. Luckily, the Unicode Consortium, the body that administers Unicode, has taken a break from creating emojis and created a long list of case folding codes. You can download the list from the Unicode website. The document describes four types of case folding: common, full, simple and Turkic. My functions implement only the “common” and “simple” cases.
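To make the selection of “common” and “simple” entries concrete, here is a hedged sketch of how one line of the case folding file could be parsed. It assumes the usual layout of that file (code; status; mapping; # name). The fold_entry type and the parse_fold_line function are hypothetical names for this illustration, not code taken from the table generator.

```cpp
#include <sstream>
#include <string>

// Hypothetical record: one upper case -> lower case pair
struct fold_entry {
  char32_t upper;
  char32_t lower;
};

// Parse one line such as
//   "0041; C; 0061; # LATIN CAPITAL LETTER A"
// Returns true and fills 'out' only for status C (common) or S (simple).
bool parse_fold_line (const std::string& line, fold_entry& out)
{
  if (line.empty () || line[0] == '#')
    return false;                       // blank or comment line

  std::istringstream in (line);
  std::string code, status, mapping;
  if (!std::getline (in, code, ';') || !std::getline (in, status, ';')
   || !std::getline (in, mapping, ';'))
    return false;                       // not a data line

  // Fields are separated by "; " so the status has a leading blank
  status.erase (0, status.find_first_not_of (" \t"));
  if (status != "C" && status != "S")
    return false;                       // skip full (F) and Turkic (T)

  // stoul skips leading white space and stops at the first non-hex character
  out.upper = static_cast<char32_t> (std::stoul (code, nullptr, 16));
  out.lower = static_cast<char32_t> (std::stoul (mapping, nullptr, 16));
  return true;
}
```

Full foldings map one code to several (for instance the sharp s to “ss”), which is why they cannot fit in a one-to-one table and are skipped here.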

Implementation #

The basic idea of the implementation is quite simple. Convert the Unicode document into two tables of equal size, one with the upper case letters and the other one with the matching lower case ones. The upper case table is sorted to allow for binary searching. If a code is found in the upper case table, it is replaced with the code at the same index in the lower case table.

Here is the code for the tolower function:

std::string tolower (const std::string& str)
{
  //definition of 'u2l' and 'lc' tables
  #include "uppertab.c"

  //convert UTF-8 input to UTF-32
  u32string wstr = runes (str);
  for (auto ptr = wstr.begin (); ptr < wstr.end (); ptr++)
  {
    //binary search in the sorted upper case table
    char32_t *f = lower_bound (begin (u2l), end (u2l), *ptr);
    if (f != end (u2l) && *f == *ptr)
      *ptr = lc[f - u2l]; //same index in the lower case table
  }
  //convert back to UTF-8
  return narrow (wstr);
}

The uppertab.c file is generated by a short program, gen_casetab, that reads the Unicode case folding file and produces the two tables u2l and lc. Here is a short sample of it:

//Upper case table
static char32_t u2l [1411] = { 
  ...
};

//Lower case equivalents
static char32_t lc [1411] = { 
  0x00061, 0x00062, 0x00063, 0x00064, 0x00065, 0x00066, 0x00067, 0x00068, 
  0x00069, 0x0006a, 0x0006b, 0x0006c, 0x0006d, 0x0006e, 0x0006f, 0x00070, 
  ...
  0x1e939, 0x1e93a, 0x1e93b, 0x1e93c, 0x1e93d, 0x1e93e, 0x1e93f, 0x1e940, 
  0x1e941, 0x1e942, 0x1e943};

The input string is converted to UTF-32 by calling the runes function. Each character is then searched in the u2l table and, if found, it is replaced with the matching character from the lc table. Note the use of the lower_bound function, which performs a binary search in the u2l table. The resulting string is converted back to UTF-8 and that is that.
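The lookup described above can be tried in isolation with a pair of tiny stand-in tables. This is only a sketch: upper_tab, lower_tab and tolower_sketch are illustrative names, the four entries are hand-picked, and the UTF-8 conversion steps are left out by working directly on a UTF-32 string.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Stand-in tables for illustration; the generated ones are much larger.
// 'upper_tab' must be sorted and 'lower_tab' must have matching indexes.
static const char32_t upper_tab[] = { 0x0041, 0x0042, 0x015E, 0x0391 };
static const char32_t lower_tab[] = { 0x0061, 0x0062, 0x015F, 0x03B1 };

// Lower-case a UTF-32 string using the two parallel tables
std::u32string tolower_sketch (std::u32string s)
{
  for (char32_t& ch : s)
  {
    // lower_bound returns the first entry that is not less than 'ch' ...
    const char32_t* f =
      std::lower_bound (std::begin (upper_tab), std::end (upper_tab), ch);
    // ... so an exact match still has to be checked
    if (f != std::end (upper_tab) && *f == ch)
      ch = lower_tab[f - std::begin (upper_tab)];
  }
  return s;
}
```

Characters that are not in the upper case table, including letters that are already lower case, fall through the test and are left as they are.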

The toupper function is very similar, except that it uses the l2c and uc tables defined in the lowertab.c file.

For the sake of completeness, both functions also have an in-place variant:

void tolower (std::string& str);
void toupper (std::string& str);

The table generation program gen_casetab.cpp is also very straightforward. It reads and parses the case folding text file and produces first the uppertab.c file and then the lowertab.c file. In between, it has to re-order the table that was sorted by upper case codes to make it ordered by lower case codes[2].
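The re-ordering step could look something like the sketch below. It is not the generator's actual code: the case_pair alias and the order_by_lower function are hypothetical, and the choice to keep the first upper case code for each lower case code is an assumption about how duplicates are handled.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// (upper, lower) code point pair; a hypothetical in-memory form of the table
using case_pair = std::pair<char32_t, char32_t>;

// Take pairs sorted by upper case code and return them sorted by lower
// case code, with a single entry for each lower case code.
std::vector<case_pair> order_by_lower (std::vector<case_pair> pairs)
{
  // stable_sort preserves the upper case order among equal lower case codes
  std::stable_sort (pairs.begin (), pairs.end (),
    [] (const case_pair& a, const case_pair& b) { return a.second < b.second; });

  // unique keeps the first pair of every run with the same lower case code
  auto last = std::unique (pairs.begin (), pairs.end (),
    [] (const case_pair& a, const case_pair& b) { return a.second == b.second; });
  pairs.erase (last, pairs.end ());
  return pairs;
}
```

Removing the duplicates matters because the lower case table has to be strictly ordered for the binary search to give a predictable answer.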

Using the Code #

All functions are in the utf8 namespace. Because these and many other functions in this namespace have the same name as standard C/C++ functions, my recommendation is not to use a “using” directive. Below is a short example showing how to call these functions:

#include <string>
#include <utf8.h>

std::string all_caps = utf8::toupper (u8"Neacșu"); // all_caps should be "NEACȘU"
std::string greek {u8"αλφάβητο"};
utf8::toupper (greek); // in-place variant; greek should be "ΑΛΦΆΒΗΤΟ"

Points of Interest #

One interesting point is that there are multiple upper case codes that are folded into the same lower case code. In my implementation the second code is dropped from the table. It seems to work OK but the whole idea is somewhat troubling: it means there is no unique way to convert a lower case string to its equivalent upper case one. Personally, I think there are some strange choices that have been baked into the Unicode Consortium table. For instance, both the upper case Latin letter K and the degrees Kelvin symbol (K) are folded into the lower case letter (k)[3].
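The Kelvin example can be reduced to a two-line illustration. The functions below are hypothetical stand-ins for the real table lookups and only know about the letter K (U+004B), the Kelvin sign (U+212A) and the letter k (U+006B).

```cpp
// Both K (U+004B) and the Kelvin sign (U+212A) fold to k (U+006B)
constexpr char32_t fold_sketch (char32_t c)
{
  return (c == U'K' || c == U'\u212A') ? U'k' : c;
}

// The reverse direction can name only one upper case code for k
constexpr char32_t unfold_sketch (char32_t c)
{
  return (c == U'k') ? U'K' : c;
}
```

A Kelvin sign that goes through fold_sketch and then unfold_sketch comes back as a plain letter K, so the round trip loses information.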

Also note that these functions are not so lightweight: each pair of conversion tables is over 10 KB, but I guess this is the price you have to pay if you need multi-lingual case awareness. In a feeble attempt at efficiency, I’ve included each pair of tables in its appropriate function (toupper or tolower) so, if you need only one of them, you do not have to carry the other one[4].

History #

  • 17-Feb-2020 - Initial version.
  • 05-May-2023 - Updated link to Unicode case folding document

Footnotes #

  1. There were never such good times: it just happened that most of the nerds who worked in the field spoke English, ANSI code was called ASCII and was used by everyone except for a big blue company that insisted on using EBCDIC. This was so wasteful: EBCDIC used 8 bits for what ASCII could do with 7! ↩︎

  2. The command line for gen_casetab is:
    gen_casetab <input file> <output path> ↩︎

  3. I would argue that the degrees Kelvin symbol does not have a lower case form and my science teacher would probably agree with me. ↩︎

  4. Assuming the linker does function-level linking (enabled by the /Gy compiler option in Visual Studio). ↩︎