Understanding C string characteristics


#include <cstring>   // std::strlen
#include <iostream>
#include <string>

int main()
{
    // Plain ASCII text, followed by the same text ending in a non-ASCII character.
    const char * C_String = "This is a line of text w";
    const char * C_Problem_String = "This is a line of text ኚ";
    std::string Std_String("This is a second line of text w");
    std::string Std_Problem_String("This is a second line of ϯϵxϯ ኚ");

    // length() reports the number of char elements stored, not readable characters.
    std::cout << "String Length: " << Std_String.length() << '\n';
    std::cout << "String Length: " << Std_Problem_String.length() << '\n';

    // std::strlen counts char elements up to the terminating '\0'.
    std::cout << "CString Length: " << std::strlen(C_String) << '\n';
    std::cout << "CString Length: " << std::strlen(C_Problem_String) << '\n';
    return 0;
}

Depending on the platform (Windows, OS X, etc.) and the compiler (GCC, MSVC, etc.), this program may fail to compile, display different values, or display the same values.

Example output under the Microsoft MSVC compiler:

String Length: 31

String Length: 31

CString Length: 24

This shows that under MSVC each of the extended characters used is counted as a single “character”, so on this platform the reported lengths happen to match the number of readable characters.

Note, however, that this behaviour is unusual. In a Unicode encoding such as UTF-8 these international characters are several bytes long, so code that assumes one byte per character may produce unexpected errors.

Under the GNU/GCC compiler the program output is:

String Length: 31

String Length: 36

CString Length: 24

CString Length: 26

This example demonstrates that while the GCC compiler used on this (Linux) platform does support these extended characters, it also (correctly) uses several bytes to store an individual character.

In this case the use of Unicode characters is possible, but the programmer must take great care in remembering that the length of a “string” in this scenario is the length in bytes, not the length in readable characters.

These differences are due to how international languages are handled on a per-platform basis and, more importantly, to the fact that the C and C++ strings used in this example are arrays of bytes: for this usage, the C++ language considers a character (char) to be a single byte.
