C# vs JavaScript: Strings
This post belongs to the C# vs JavaScript series. By Nicklas Envall.
Our computers only understand 1s and 0s, but with binary code we can represent different characters using those 1s and 0s. In other words, we can assign numbers to characters. For instance, we could decide that 0001 symbolizes A and that 0002 symbolizes B. We could keep going like this and eventually end up with a whole system that maps numbers to characters. Luckily, we don't have to create a new system, considering there are already plenty of character sets (abbreviated as charsets) out there.
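In JavaScript, for instance, we can peek at this number-to-character mapping directly (using the real Unicode code points, where 65 maps to A):

```javascript
// Every character maps to a numeric code point, and back again.
'A'.codePointAt(0);       // 65
String.fromCodePoint(66); // 'B'
```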
A charset is a set of characters that maps to unique code points (numeric values). Charsets come in different sizes, like 7-bit, 8-bit, and so on. A 7-bit charset, for instance, can provide 2^7 (128) characters. Nevertheless, the first charsets used for computers, like BCD and the Fortran Character Set, were only 6 bits. Eventually, the 7-bit ASCII charset was introduced and became widely adopted. Yet it also caused problems, because computers at the time worked with 8-bit bytes, which meant that one bit was left unused. People would use that extra bit for other characters from languages like Icelandic, Greek, and so on.
Now, this was fine as long as you only worked on your own computer, but once you started exchanging strings between computers, you'd be in trouble: code points that require more than 7 bits could differ between systems, making the decoding incorrect. The rise of the internet accelerated this problem. Luckily, the problem is not as prominent today because of Unicode. As we'll see, Unicode took the first 128 characters from ASCII and built on top of them, offering more than a million code points.
To encode code points in Unicode, we use different Unicode Transformation Formats (UTFs). Unicode has three formats: UTF-32 (32-bit), UTF-16 (16-bit), and UTF-8 (8-bit). To understand how we use UTF, we must recognize that a string is a sequence of characters used to represent text. More technically, a string is a data type with an ordered sequence of code units. A code unit is the allocated storage for a code point. So to represent a character like Ǵ with UTF-8, we need two 8-bit code units, since its code point is 500 and 500 exceeds the 8-bit limit. If we instead used UTF-16 to encode that string, we would only need one 16-bit code unit. Thus, UTF-16 appears simpler, but it comes with a trade-off: for a code point like 120 (x), we would still need to allocate 16 bits, whereas UTF-8 would only have required 8 bits. So, be aware that some code points require more bits than others. For instance, in UTF-8, the first 128 ASCII characters (code points 0-127) take only 1 byte of memory. Here are the ranges to be aware of:
- 0 - 127 (1 byte)
- 128 - 2047 (2 bytes)
- 2048 - 65535 (3 bytes)
- 65536 - 2097151 (4 bytes)
- 2097152 - 67108863 (5 bytes)
- 67108864 - 2147483647 (6 bytes)
Originally, Unicode was designed with 16 bits in mind, but it grew over time and, as a result, now has planes. A plane is a group of 65536 code points, and Unicode has 17 of them. The first plane is the Basic Multilingual Plane (BMP), which contains the most commonly used characters in the world. The remaining 16 planes are called supplementary planes, or astral planes, and include characters from less widespread languages as well as emojis.
Strings in JavaScript
In both JavaScript and C#, a string represents text as a sequence of UTF-16 code units. In fact, JavaScript's String.length indicates how many UTF-16 code units the string contains:
```javascript
String.fromCodePoint(0).length;   // 1 <-- 16 bits
String.fromCodePoint(128).length; // 1 <-- 16 bits
'😃'.length;                      // 2 <-- 16 bits * 2
```
As you see, the emoji, strangely enough, has a length of 2. But it's not a bug, it's a feature. The emoji's code point is 128515, and 128515 is higher than what 16 bits can hold (2^16 - 1 = 65535), which makes it a supplementary character. Because of that, JavaScript needs two UTF-16 code units to represent this emoji. In other words, we need a surrogate pair: two code units that together make up one code point.
By using the charCodeAt function, we can get the high surrogate and low surrogate that make up the emoji:
```javascript
const highSurrogate = '😃'.charCodeAt(0); // 55357
const lowSurrogate = '😃'.charCodeAt(1);  // 56835

String.fromCharCode(highSurrogate, lowSurrogate); // '😃'
String.fromCharCode('😃'.charCodeAt(0)); // �
String.fromCharCode('😃'.charCodeAt(1)); // �
```
You might think, "but 55357 + 56835 does not equal 128515," and you'd be right. There's an algorithm for converting to and from surrogate pairs, though I will not cover it in detail here.
Now, both String.fromCharCode() and charCodeAt() only work with individual UTF-16 code units, while String.fromCodePoint() and codePointAt() work with whole code points; they understand whether a position holds a surrogate pair or not:
```javascript
'😃'.charCodeAt(0);  // 55357
'😃'.codePointAt(0); // 128515

String.fromCharCode(128515);  // not the emoji: the value is truncated to 16 bits
String.fromCodePoint(128515); // "😃"
```
Lastly, we can use a Blob to see how many code units a string takes in UTF-8, which has 8-bit code units:
```javascript
// 0 - 127 range
new Blob([String.fromCodePoint(0)]).size;   // 1
new Blob([String.fromCodePoint(127)]).size; // 1

// 128 - 2047 range
new Blob([String.fromCodePoint(128)]).size;  // 2
new Blob([String.fromCodePoint(2047)]).size; // 2

// 2048 - 65535 range
new Blob([String.fromCodePoint(2048)]).size;  // 3
new Blob([String.fromCodePoint(65535)]).size; // 3

// 65536 - 1114111 range
new Blob([String.fromCodePoint(65536)]).size;   // 4
new Blob([String.fromCodePoint(1114111)]).size; // 4

new Blob([String.fromCodePoint(1114112)]).size;
// Uncaught RangeError: 1114112 is not a valid code point
```
Now, even though there are plenty of code points out there, the highest valid code point, in JavaScript as well as in Unicode, is 1114111.
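One practical consequence is worth knowing: JavaScript's string iteration (for...of and the spread operator) walks code points rather than code units, so it gives us a way to count the characters a user actually sees.

```javascript
// .length counts UTF-16 code units, iteration counts code points.
'😃'.length;        // 2
[...'😃'].length;   // 1
[...'hi😃'].length; // 3
```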
Strings in C#
Many programming languages have a char type, and C# is one of them, while JavaScript is not. A char instance in C# represents a 16-bit code unit, and a C# string is a sequence of char instances.
There are different approaches to creating a char in C#, and the most common one is to use single quotes ''. In JavaScript, by comparison, we may create strings with both single quotes and double quotes. The example below shows three different ways to create a char:
```csharp
char character1 = 'A';
char character2 = '\u0041';
char character3 = (char)65;

Console.WriteLine(character1); // A
Console.WriteLine(character2); // A
Console.WriteLine(character3); // A
```
To create a string in C#, we may use a literal, just like in JavaScript. We can also pass a char array to the String class constructor:
```csharp
char[] charArray = { 'A', 'B', 'C' };
string str = new string(charArray); // constructor
string str2 = "ABC";                // literal

Console.WriteLine(str);  // ABC
Console.WriteLine(str2); // ABC
```
In C#, String.Length conceptually works the same way as in JavaScript, but in C#'s case it returns the number of char instances. Note that we might need two char instances for one character; an emoji is an example of that:
```csharp
"A".Length; // 1
"😃".Length; // 2
```
As we see, an emoji like 😃 won't fit in a single char and will require two char instances:
```csharp
char myChar1 = 'A'; // OK
char myChar2 = '😃'; // ERROR: Too many characters in character literal
```
Similar to JavaScript's charCodeAt function, we can cast a char to an int to get its code unit value. We can also use Char.ConvertToUtf32, which is more similar to codePointAt because it can handle surrogate pairs and returns the whole code point as an int. Its counterpart, Char.ConvertFromUtf32, returns a string instead of a char, since one char may not be enough:
```csharp
(int)'A';  // 65
(int)"A";  // Cannot convert type 'string' to 'int'
(int)"😃"[0]; // 55357
(int)"😃"[1]; // 56835
(int)"😃"; // Cannot convert type 'string' to 'int'

Char.ConvertToUtf32("A", 0);  // 65
Char.ConvertToUtf32("😃", 0); // 128515

Char.ConvertFromUtf32(65);     // A
Char.ConvertFromUtf32(128515); // 😃
```
C# also gives us some handy methods when working with surrogate pairs, like Char.IsSurrogate, Char.IsHighSurrogate, and Char.IsLowSurrogate:
```csharp
Char.IsSurrogate("A", 0); // False
Char.IsSurrogate("😃", 0); // True
Char.IsSurrogate("😃", 1); // True

Char.IsHighSurrogate((char)55357); // True
Char.IsLowSurrogate((char)56835);  // True
```
Working with Strings
To end this article, we'll get a bit more practical and cover some good-to-know things when working with strings in both JavaScript and C#. As you will see, it's very similar overall.
JS vs C#: String Methods
JavaScript:
```javascript
'my string'.length;        // 9
'MY STRING'.toLowerCase(); // my string
'my string'.toUpperCase(); // MY STRING
'my string'.replace('string', 'coffee'); // my coffee
'my string'.includes('my'); // true
'My name is Nicklas'.split(' '); // [ "My", "name", "is", "Nicklas" ]
```
C#:
```csharp
"my string".Length;    // 9
"MY STRING".ToLower(); // my string
"my string".ToUpper(); // MY STRING
"my string".Replace("string", "coffee"); // my coffee
"my string".Contains("my"); // True
"My name is Nicklas".Split(' '); // [ "My", "name", "is", "Nicklas" ]
```
JS vs C#: Template Strings
If you're familiar with ES6, then you probably like template strings. In JavaScript, we use backticks:
```javascript
const expression = 'hello';
`${expression} world`; // hello world
```
While in C#, we prefix a string with the $ character to turn it into a template (an interpolated string):
```csharp
string expression = "hello";
$"{expression} world"; // hello world
```
JS vs C#: Comparing strings alphabetically
When we compare strings, the comparison is based on sort order:

- -1 indicates that the string comes before the target
- 0 indicates that it is the same
- 1 indicates that it comes after
JavaScript:
```javascript
'a'.localeCompare('b'); // -1
'a'.localeCompare('a'); // 0
'b'.localeCompare('a'); // 1
```
C#:
```csharp
"a".CompareTo("b"); // -1
"a".CompareTo("a"); // 0
"b".CompareTo("a"); // 1
```
JS vs C#: Uppercase the first character of a string
Neither JavaScript nor C# has a built-in method to capitalize the first character of a string, so we must write one ourselves.
JavaScript:
```javascript
const capitalize = (str) => {
  return str.charAt(0).toUpperCase() + str.slice(1);
};
```
If str is empty or contains only one character, slice(1) simply returns an empty string, which means we don't have to handle an exception.
C#:
```csharp
static public string Capitalize(string str)
{
    if (str == null) return null;
    if (str.Length > 1) return Char.ToUpper(str[0]) + str.Substring(1);
    return str.ToUpper();
}
```
Note that there are many approaches to solving this problem. But as we see in our C# example, we have to guard against bad input ourselves: String.Substring throws an exception when the start index is out of range (for instance, on an empty string), whereas JavaScript's slice just returns an empty string.
A JavaScript developer who's comfortable with the forgiving nature of JavaScript might find the extra code needed to handle possible errors unacceptable. But this type of code is a reality we must live with when working with C#, and it does come with benefits.