Doing UTF-8 in Windows #
Introduction #
This article shows how to handle UTF-8 encoding on a platform that still encourages the UTF-16 encoding. I am also providing a small library for this purpose. The code works, it is clean, easy to understand and small.
This is an implementation of the solution advocated in the UTF-8 Everywhere manifesto. I would strongly encourage you to go read the whole document to get indoctrinated ☺.
Background #
Let me rehash some of the points made in the manifesto mentioned above:
- UTF-16 (variously called Unicode, widechar or UCS-2) was introduced back in the early ’90s and, at the time, it was believed that its 65000 codes would be enough for all characters.
- Except in particular cases, UTF-16 is not more efficient or easier to use than UTF-8. In fact in many cases the opposite is true.
- UTF-16 is also a variable-width encoding (two or four bytes per character), so counting characters is as difficult as in UTF-8.
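The last point is easy to check. The sketch below is only an illustration and is not part of the library: utf16_code_points is a hypothetical helper that assumes well-formed input. It shows that a character outside the Basic Multilingual Plane (here U+1F600) takes two 16-bit units in UTF-16 and four bytes in UTF-8, so in both encodings the storage size differs from the character count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration: counts code points in a UTF-16
// string by skipping the low (trailing) half of each surrogate pair.
// Assumes well-formed input.
std::size_t utf16_code_points (const std::u16string& s)
{
  std::size_t n = 0;
  for (char16_t u : s)
    if (u < 0xDC00 || u > 0xDFFF)   // not a low surrogate
      ++n;
  return n;
}

void utf16_demo ()
{
  const std::u16string wide = u"\U0001F600";        // one character
  const std::string narrow = "\xF0\x9F\x98\x80";    // same character in UTF-8

  assert (wide.size () == 2);                // two UTF-16 code units
  assert (utf16_code_points (wide) == 1);    // but a single code point
  assert (narrow.size () == 4);              // four UTF-8 bytes
}
```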
If you want to work with UTF-8 encoding in Windows (and you should), and you don’t want to go insane or have your program crash unexpectedly, you must follow these rules:
- Define _UNICODE when compiling your program (or select “Use Unicode Character Set” in Visual Studio).
- Use wchar_t or std::wstring only in arguments to API function calls. Use char or std::string everywhere else.
- Use widen() and narrow() functions to go between UTF-8 and UTF-16.
The functions provided in this package will make your life much easier.
Calling Library Functions #
All functions live in the utf8 namespace, and I would advise you not to place a using directive for this namespace. This is because many/most functions have the same name as the traditional C functions. For instance, if you had a function call:
mkdir (folder_name);
and you want to start using UTF-8 characters, you just have to change it to:
utf8::mkdir (folder_name);
Prefixing the function with the namespace makes it obvious what function you are using.
Basic Conversion Functions #
Following the same manifesto, the basic conversion functions are narrow(), to go from UTF-16 to UTF-8, and widen(), to go in the opposite direction. Their signatures are:
std::string narrow (const wchar_t* s);
std::string narrow (const std::wstring& s);
std::wstring widen (const char* s);
std::wstring widen (const std::string& s);
In addition there are two more functions for conversion from and to UTF-32:
std::string narrow (const std::u32string& s);
std::u32string runes (const std::string& s);
Internally, the conversion is done using the WideCharToMultiByte and MultiByteToWideChar functions.
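To make the conversion itself less of a black box, here is a minimal, portable sketch of what widening does. It is not the library's code, which delegates to the Win32 functions named above. widen_sketch is a hypothetical name, and it returns std::u16string rather than std::wstring so that it behaves the same on platforms where wchar_t is not 16 bits. As a simplification, malformed sequences are replaced with U+FFFD and overlong forms are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: converts UTF-8 to UTF-16.
// Malformed input becomes U+FFFD; overlong encodings are not detected.
std::u16string widen_sketch (const std::string& s)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < s.size ())
  {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    char32_t cp = 0;
    int extra = 0;                       // number of continuation bytes expected
    if (c < 0x80)                { cp = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
    else { out.push_back (u'\uFFFD'); ++i; continue; }   // invalid lead byte
    ++i;

    bool ok = true;
    for (int k = 0; k < extra; ++k)
    {
      if (i >= s.size () || (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
      {
        ok = false;                      // truncated or bad continuation byte
        break;
      }
      cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
      ++i;
    }
    if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    {
      out.push_back (u'\uFFFD');
      continue;
    }

    if (cp < 0x10000)
      out.push_back (static_cast<char16_t>(cp));         // one code unit
    else
    {
      cp -= 0x10000;                                     // surrogate pair
      out.push_back (static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back (static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}

void widen_demo ()
{
  assert (widen_sketch ("A") == u"A");
  assert (widen_sketch ("\xC3\xA9") == u"\u00E9");           // e with acute accent
  assert (widen_sketch ("\xF0\x9F\x98\x80").size () == 2);   // surrogate pair
}
```

In real code on Windows you would simply call utf8::widen() and let the operating system do this work.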
There are also functions for counting the number of characters in a UTF-8 string (length()), checking if a string is valid (valid()), and advancing a pointer/iterator in a character string (next()).
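As a rough idea of how counting and stepping can work, the sketch below relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx. The names length_sketch and next_sketch are hypothetical, their signatures are my assumptions rather than the library's, and both assume the input has already been validated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, not the library's code. Both assume valid UTF-8.

// Continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation (char c)
{
  return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of code points = number of bytes that are not continuation bytes.
std::size_t length_sketch (const std::string& s)
{
  std::size_t n = 0;
  for (char c : s)
    if (!is_continuation (c))
      ++n;
  return n;
}

// Advances pos past one code point; returns false at the end of the string.
bool next_sketch (const std::string& s, std::size_t& pos)
{
  if (pos >= s.size ())
    return false;
  ++pos;                                              // skip the lead byte
  while (pos < s.size () && is_continuation (s[pos]))
    ++pos;                                            // skip continuation bytes
  return true;
}

void length_demo ()
{
  // 'a' (1 byte), e with acute accent (2 bytes), euro sign (3 bytes)
  const std::string s = "a\xC3\xA9\xE2\x82\xAC";
  assert (s.size () == 6);
  assert (length_sketch (s) == 3);

  std::size_t pos = 0;
  assert (next_sketch (s, pos) && pos == 1);
  assert (next_sketch (s, pos) && pos == 3);
  assert (next_sketch (s, pos) && pos == 6);
  assert (!next_sketch (s, pos));
}
```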
Wrappers #
Pretty much all the other functions are wrappers around traditional C/C++ functions or structures:
- directory manipulation functions: mkdir, rmdir, chdir, getcwd
- file operations: fopen, chmod, access, rename, remove
- streams: ifstream, ofstream, fstream
- path manipulation functions: splitpath and makepath
- environment access functions: putenv and getenv
- character classification functions: is... (isalnum, isdigit, isalpha, etc.)
The parameters for all these functions mimic the standard parameters. For some of them, however, like access, rename, etc., the return type is bool, with true indicating success and false indicating failure. This is contrary to standard C functions that return 0 for success. Caveat emptor!
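The inverted convention is the kind of thing that bites during porting, so here is a small portable sketch of it. The u8demo namespace is hypothetical and only forwards to the standard std::remove. On Windows, a wrapper like this would presumably widen the file name and call the wide-character API instead.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

namespace u8demo {   // hypothetical namespace, for illustration only

// Returns true on success, unlike ::remove, which returns 0 on success.
inline bool remove (const std::string& filename)
{
  return std::remove (filename.c_str ()) == 0;
}

} // namespace u8demo

void remove_demo ()
{
  const std::string name = "u8demo_sketch.tmp";
  std::FILE* f = std::fopen (name.c_str (), "w");
  assert (f != nullptr);
  std::fclose (f);

  const bool first = u8demo::remove (name);    // file exists: deleted
  const bool second = u8demo::remove (name);   // file is already gone
  assert (first);
  assert (!second);
}
```

Note that a test such as if (remove (name) != 0), which detects failure with the C function, has to become if (!u8demo::remove (name)) with a wrapper of this style.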
Return Values #
For API functions that return a character string, you would need to set up a wchar_t buffer to receive the value, convert it to UTF-8 using a narrowing function and eventually release the buffer. Below is an example of what this would look like. The code retrieves the name of a temporary file:
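// Wide buffers that receive the UTF-16 results from the API
wstring wpath (_MAX_PATH, L'\0');
wstring wfname (_MAX_PATH, L'\0');
GetTempPath (static_cast<DWORD>(wpath.size ()), const_cast<wchar_t*>(wpath.data ()));
GetTempFileName (wpath.c_str (), L"ABC", 1, const_cast<wchar_t*>(wfname.data ()));
// Convert only up to the first NUL so the padding is not carried into the result
string result = utf8::narrow (wfname.c_str ());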
This seemed a bit too cumbersome and error-prone, so I made a small object designed to hold returned values. It has operators to convert it to a wchar_t buffer and then to a UTF-8 string. For lack of a better name, I called it buffer. Using this object, the same code fragment becomes:
utf8::buffer path (_MAX_PATH);
utf8::buffer fname (_MAX_PATH);
GetTempPath (path.size (), path);
GetTempFileName (path, L"ABC", 1, fname);
string result = fname;
Internally, a buffer object contains UTF-16 characters, but the string conversion operator invokes the utf8::narrow function to convert the string to UTF-8.
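For readers curious about how such an object can be put together, here is a portable sketch of the idea. It is not the library's implementation: buffer_sketch, narrow_sketch and fake_api are hypothetical names, char16_t stands in for the Windows wchar_t, and fake_api merely imitates an API call that fills a caller-supplied UTF-16 buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for utf8::narrow: UTF-16 to UTF-8, up to the first
// NUL. Unpaired surrogates are not handled in this sketch.
std::string narrow_sketch (const char16_t* s)
{
  std::string out;
  while (*s)
  {
    char32_t cp = *s++;
    if (cp >= 0xD800 && cp <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF)
      cp = 0x10000 + ((cp - 0xD800) << 10) + (*s++ - 0xDC00);   // join the pair
    if (cp < 0x80)
      out += static_cast<char>(cp);
    else if (cp < 0x800)
    {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Owns UTF-16 storage; converts to a raw pointer for the API call and to a
// UTF-8 std::string for the rest of the program.
class buffer_sketch
{
public:
  // One extra element guarantees a terminating NUL even if the API fills
  // all n characters.
  explicit buffer_sketch (std::size_t n) : data_ (n + 1, u'\0') {}
  std::size_t size () const { return data_.size () - 1; }
  operator char16_t* () { return data_.data (); }
  operator std::string () const { return narrow_sketch (data_.data ()); }

private:
  std::vector<char16_t> data_;
};

// Hypothetical stand-in for a Windows API call that fills a UTF-16 buffer.
void fake_api (char16_t* out, std::size_t capacity)
{
  if (capacity == 0)
    return;
  const char16_t text[] = u"caf\u00E9";
  std::size_t i = 0;
  for (; text[i] != 0 && i + 1 < capacity; ++i)
    out[i] = text[i];
  out[i] = 0;
}

std::string buffer_demo ()
{
  buffer_sketch buf (260);
  fake_api (buf, buf.size ());       // implicit conversion to char16_t*
  std::string result = buf;          // implicit conversion to UTF-8
  assert (result == "caf\xC3\xA9");
  return result;
}
```

The two conversion operators are what let the calling code stay as short as the fragment above: the same object is passed to the API as a raw pointer and then assigned to a string.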
Program Arguments #
There are two functions for accessing and converting UTF-16 encoded program arguments. The first one, get_argv, returns an argv-like array of pointers to command line arguments:
int argc;
char **argv = utf8::get_argv (&argc);
The second one directly provides a vector of strings:
std::vector<std::string> argv = utf8::argv ();
When using the first function, one has to call the utf8::free_argv function to release the memory allocated for the argv array.
History #
- 22-Nov-2019 - Initial version.