C# vs JavaScript: Strings

This post belongs to the C# vs JavaScript series.
by Nicklas Envall

Our computers only understand 1s and 0s. But by using binary code we can represent different characters with 1s and 0s. In other words, we may assign numbers to characters. For instance, we can determine that 0001 symbolizes A and that 0002 equals B. Then, we could keep going like this and eventually end up with a whole system that maps numbers to characters. Although we don't have to create a new system, considering there already are plenty of character sets (abbreviated as charsets) out there.
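
To make that concrete, here's a small JavaScript sketch using the real-world mapping we'll get to shortly, where A happens to map to 65 rather than 1:

'A'.charCodeAt(0); // 65 <-- the number assigned to A
String.fromCharCode(66); // 'B' <-- and back from number to character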

A charset is a set of characters that maps to unique code points (numeric values). Charsets come in different sizes, like 7-bit, 8-bit, and so on. For instance, a 7-bit character set can provide 2^7 (128) characters. Nevertheless, the first charsets used for computers, like BCD and the Fortran Character Set, were only 6 bits. Eventually, the 7-bit ASCII charset was introduced and became widely adopted. Yet it also caused problems: being 7-bit meant that one bit was left unused, since computers at the time worked with 8-bit bytes. So, people would use that extra bit for characters from other languages like Icelandic, Greek, and so on.

Now, this was fine as long as you only worked on your own computer, but once you started exchanging strings between computers, you'd be in trouble: code points beyond the 7-bit range could map to different characters on different machines, making the decoding incorrect. The rise of the internet accelerated this problem. Luckily, the problem is not as prominent today because of Unicode. As we'll see, Unicode took the first 128 characters from ASCII and built on top of them, offering room for more than a million characters.

To encode code points in Unicode, we use different Unicode Transformation Formats (UTFs). Unicode has three formats: UTF-32 (32-bit), UTF-16 (16-bit), and UTF-8 (8-bit). To understand how we use UTF, we must recognize that a string is a sequence of characters used to represent text. More technically, a string is a data type holding an ordered sequence of code units, where a code unit is the storage allocated for a code point. So, to represent a character like Ǵ with UTF-8, we need two 8-bit code units, since its code point is 500 and 500 exceeds what 8 bits can hold. If we instead encoded that string with UTF-16, we would only need one 16-bit code unit. Thus, UTF-16 appears simpler, but it comes with a trade-off: for a code point like 120 (x), we would still allocate 16 bits, whereas UTF-8 would only require 8 bits. So, be aware that some code points require more bits than others. For instance, in UTF-8, the first 128 ASCII characters (code points in the range 0-127) take only 1 byte of memory. Here are the ranges to be aware of (demonstrated in the sketch after this list):

0 - 127 (1 byte)
128 - 2047 (2 bytes)
2048 - 65535 (3 bytes)
65536 - 2097151 (4 bytes)
2097152 - 67108863 (5 bytes)
67108864 - 2147483647 (6 bytes)
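
It's worth noting that the 5- and 6-byte rows come from UTF-8's original design; modern UTF-8 (RFC 3629) stops at 4 bytes, which matches Unicode's highest code point, 1114111. Here's a minimal JavaScript sketch of the first few ranges, using the standard TextEncoder API to count UTF-8 bytes:

// count how many bytes a string occupies in UTF-8
const utf8Bytes = (str) => new TextEncoder().encode(str).length;

utf8Bytes('x'); // 1 <-- code point 120, range 0-127
utf8Bytes('Ǵ'); // 2 <-- code point 500, range 128-2047
utf8Bytes('€'); // 3 <-- code point 8364, range 2048-65535
utf8Bytes('😃'); // 4 <-- code point 128515, range 65536 and up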

Originally, Unicode was designed with 16 bits in mind, yet it grew over time and, as a result, now has planes. A plane is a group of 65536 code points (the amount that 16 bits can address), and Unicode has 17 of them. The first plane is the Basic Multilingual Plane (BMP), which contains the most commonly used characters in the world. The remaining 16 planes are called supplementary planes (or astral planes), and they include characters from less widely used scripts, as well as emojis.
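
Since each plane holds 65536 code points, dividing a code point by 65536 tells us which plane a character lives in. A small JavaScript sketch, where the plane helper is just for illustration:

// which Unicode plane does a character's code point belong to?
const plane = (char) => Math.floor(char.codePointAt(0) / 65536);

plane('A'); // 0 <-- Basic Multilingual Plane
plane('😃'); // 1 <-- a supplementary (astral) plane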

Strings in JavaScript

Both in JavaScript and C#, a so-called string represents text as a sequence of UTF-16 code units. In fact, a JavaScript string's length property indicates how many UTF-16 code units it contains:

String.fromCodePoint(0).length; // 1 <-- 16 bits
String.fromCodePoint(128).length; // 1 <-- 16 bits
'😃'.length; // 2 <-- 16 bits * 2

As you see, the emoji, strangely enough, has a length of 2. But it's not a bug, it's a feature. The emoji's code point is 128515, and 128515 is higher than 2^16 (65536), which makes it a supplementary character. Because of that, JavaScript needs two UTF-16 code units to represent the emoji. In other words, we need a surrogate pair: two code units that together make up one code point.

By using the charCodeAt function, we can get the high surrogate and the low surrogate that make up the emoji:

const highSurrogate = '😃'.charCodeAt(0); // 55357
const lowSurrogate = '😃'.charCodeAt(1); // 56835
String.fromCharCode(highSurrogate, lowSurrogate); // '😃'
String.fromCharCode('😃'.charCodeAt(0)); // � <-- a lone high surrogate
String.fromCharCode('😃'.charCodeAt(1)); // � <-- a lone low surrogate

You might think, "but 55357 + 56835 does not equal 128515," and you'd be right. There's an algorithm for converting to and from surrogate pairs, and while I won't cover the details, the arithmetic is short enough to sketch.
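
A minimal JavaScript sketch of that conversion, assuming the helper names toSurrogatePair and fromSurrogatePair (they're just for illustration, not built-ins):

// split a code point above 65535 into a UTF-16 surrogate pair
const toSurrogatePair = (codePoint) => {
  const offset = codePoint - 0x10000; // 128515 -> 62979
  const high = 0xD800 + (offset >> 10); // top 10 bits -> 55357
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> 56835
  return [high, low];
};

// ...and combine a pair back into a code point
const fromSurrogatePair = (high, low) =>
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

toSurrogatePair(128515); // [55357, 56835]
fromSurrogatePair(55357, 56835); // 128515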

Now, both String.fromCharCode() and charCodeAt() only work with individual UTF-16 code units, while String.fromCodePoint() and codePointAt() work with whole code points. In other words, they understand whether a surrogate pair is involved or not:

'😃'.charCodeAt(0); // 55357
'😃'.codePointAt(0); // 128515
String.fromCharCode(128515); // "\uF603" <-- 128515 truncated to 16 bits
String.fromCodePoint(128515); // "😃"
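
On the same theme, JavaScript's string iteration is code-point aware, so spreading a string gives us an easy way to count characters rather than code units:

'😃'.length; // 2 <-- UTF-16 code units
[...'😃'].length; // 1 <-- code points
Array.from('😃'); // ['😃']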

Lastly, we can use Blob to see how many code units a string takes up in UTF-8, which uses 8-bit code units (so the byte size equals the code unit count):

// 0-127 range
new Blob([String.fromCodePoint(0)]).size; // 1
new Blob([String.fromCodePoint(127)]).size; // 1

// 128-2047 range
new Blob([String.fromCodePoint(128)]).size; // 2
new Blob([String.fromCodePoint(2047)]).size; // 2

// 2048 - 65535 range
new Blob([String.fromCodePoint(2048)]).size; // 3
new Blob([String.fromCodePoint(65535)]).size; // 3

// 65536 - 1114111 range
new Blob([String.fromCodePoint(65536)]).size; // 4
new Blob([String.fromCodePoint(1114111)]).size; // 4 
new Blob([String.fromCodePoint(1114112)]).size; // Uncaught RangeError: 1114112 is not a valid code point

Now, even though there may be plenty of code points out there, the highest valid code point is 1114111 (0x10FFFF); that's a Unicode limit, so it applies in JavaScript too.

Strings in C#

Many programming languages have a char type, and C# is one of those, while JavaScript is not. A char instance in C# represents a 16-bit code unit, and a C# string is a sequence of char instances.

There are different approaches to creating a char in C#, and the most common one is to wrap the character in single quotes ('A'), whereas in JavaScript we may create strings with both single and double quotes. We see three different ways to create a char in the example below:

char character1 = 'A';
char character2 = '\u0041';
char character3 = (char)65;

Console.WriteLine(character1); // A
Console.WriteLine(character2); // A
Console.WriteLine(character3); // A

To create a string in C#, we may use a literal just like in JavaScript. We can also send in a char array to the String class constructor:

char[] charArray = { 'A', 'B', 'C' };
string str = new string(charArray); // constructor
string str2 = "ABC"; // literal

Console.WriteLine(str); // ABC
Console.WriteLine(str2); // ABC

In C#, String.Length conceptually works the same way as JavaScript's length. But in C#'s case, it returns the number of char instances. Note that we might need two char instances for one character; an emoji is an example of that:

"A".Length; // 1
"๐Ÿ˜ƒ".Length; // 2

As we see, an emoji like 😃 won't fit in a single char and will require two char instances:

char myChar1 = 'A'; // OK
char myChar2 = '😃'; // ERROR: Too many characters in character literal

Similar to JavaScript's charCodeAt function, we can convert a char to an int to get its code unit value. We can also use Char.ConvertToUtf32, which is more similar to codePointAt because it can handle surrogate pairs and returns the full code point as an int. Its counterpart, Char.ConvertFromUtf32, goes the other way and returns a string rather than a char, since the result may require two char instances:

(int)'A'; // 65
(int)"A"; // Cannot convert type 'string' to 'int'
(int)"๐Ÿ˜ƒ"[0]; // 55357
(int)"๐Ÿ˜ƒ"[1]; // 56835
(int)"๐Ÿ˜ƒ"; // Cannot convert type 'string' to 'int'

Char.ConvertToUtf32("A", 0); // 65
Char.ConvertToUtf32("๐Ÿ˜ƒ", 0); // 128515

Char.ConvertFromUtf32(65); // A
Char.ConvertFromUtf32(128515); // ๐Ÿ˜ƒ

C# also gives us some handy methods when working with surrogate pairs, like Char.IsSurrogate, Char.IsHighSurrogate, and Char.IsLowSurrogate:

Char.IsSurrogate("A", 0); // False
Char.IsSurrogate("๐Ÿ˜ƒ", 0); // True
Char.IsSurrogate("๐Ÿ˜ƒ", 1); // True

Char.IsHighSurrogate((char)55357); // True
Char.IsLowSurrogate((char)56835); // True

Working with Strings

To end this article, we'll get a bit more practical and cover some good-to-know things when working with strings in both JavaScript and C#. As you will see, it's very similar overall.

JS vs C#: String Methods

JavaScript:

'my string'.length; // 9
'MY STRING'.toLowerCase(); // my string
'my string'.toUpperCase(); // MY STRING
'my string'.replace('string', 'coffee'); // my coffee
'my string'.includes('my'); // true
'My name is Nicklas'.split(' '); // [ "My", "name", "is", "Nicklas" ]

C#:

"my string".Length; // 9
"MY STRING".ToLower(); // my string
"my string".ToUpper(); // MY STRING
"my string".Replace("string", "coffee"); // my coffee
"my string".Contains("my"); // True
"My name is Nicklas".Split(' '); // [ "My", "name", "is", "Nicklas" ]

JS vs C#: Template Strings

If you're familiar with ES6, then you probably like template strings. In JavaScript, we use backticks:

const expression = 'hello';
`${expression} world`; // hello world

While in C#, we prefix a string with the $ character to make it an interpolated string:

string expression = "hello";
$"{expression} world"; // hello world

JS vs C#: Comparing strings alphabetically

When we compare strings, the result is based on sort order:

  • a negative value (typically -1) indicates that the string comes before the target
  • 0 indicates that the strings are equivalent
  • a positive value (typically 1) indicates that it comes after

JavaScript:

'a'.localeCompare('b'); // -1
'a'.localeCompare('a'); // 0
'b'.localeCompare('a'); // 1

C#:

"a".CompareTo("b"); // -1
"a".CompareTo("a"); // 0
"b".CompareTo("a"); // 1

JS vs C#: Uppercase the first character of a string

Neither JavaScript nor C# has a built-in method to capitalize the first character of a string, so we must write one ourselves.

JavaScript:

const capitalize = (str) => {
  return str.charAt(0).toUpperCase() + str.slice(1);
}

If str is empty or only contains one character, slice(1) returns an empty string, which means that we don't have to handle an exception.
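
A quick usage sketch covering the edge cases:

capitalize('nicklas'); // 'Nicklas'
capitalize('a'); // 'A'
capitalize(''); // ''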

C#:

public static string Capitalize(string str)
{
    if (str == null)
        return null;

    if (str.Length > 1)
        return Char.ToUpper(str[0]) + str.Substring(1);

    return str.ToUpper();
}

Note that there are many approaches to solving this problem. But as we see in our C# example, C# is less forgiving: calling str[0] or str.Substring(1) on an empty string throws an exception, whereas JavaScript's charAt and slice simply return an empty string. That's why the short inputs get their own branch.

A JavaScript developer who's comfortable with the forgiving nature of JavaScript might find the extra code required to handle possible errors excessive. But this type of code is a reality that we must live with when working with C#, and it does come with benefits.