Are strings also a collection? And how are characters represented?
This episode explains that strings do in many ways represent a collection of characters. Many programming languages will allow you to work with the individual characters just like you would work with individual items in an array.
Some languages, like C#, have immutable strings, which means that once a string is constructed, it can't be changed. You might think you're changing a C# string, but in fact all you're doing is creating a new string with the modified value.
The traditional method for determining the end of a string is to place a null value at the end. This is called a null-terminated string. But what do you do if you want embedded null characters inside your string? My recommendation is that you don't do this and instead select a different collection such as an array or vector of characters.
The final topic that this episode explores is how character values are represented. You’ll learn about ASCII (American Standard Code for Information Interchange) and Unicode and how Unicode encodes characters. A common encoding used today is called UTF-8. Listen to the full episode or read the full transcript below.
Thank you for your review and comments. This is one of the best reviews yet and it gives me a chance to discuss the promotions and explain how they fit into the episodes.
First of all, producing a show like this is a lot of work. Even a ten minute episode takes me about three hours to prepare, record, edit, and publish. I love doing this and comments from you, my listeners, are very rewarding. The comment above made my day. Just knowing that you’re getting value out of these episodes is amazing. There’s so much more that I can help you with in the live classes which is why I’m sponsoring the show myself for now.
In general, I’ll use the sponsor sections to let you know about products and services that will benefit you. I’ve already had people ask about sponsoring the show but I wasn’t interested because the advertisement was not related to programming. I hope you understand.
I also hope to get some more variety in the sponsor segments eventually. For right now though, sponsors want to know what the show has to offer. So adding my own sponsor segment is a great way to get the process started.
Okay, on to this week's question. This question comes from last week's live class, and I talked about it for at least half an hour.
I don’t always know what you’re struggling to understand. If you’re having difficulty with something, then others are probably having the same trouble. Taking the time to ask a question is a great way to get an answer to your situation and help others too.
Most languages give you the ability to enumerate the characters in a string either through special methods or the same as any other collection. The characters are ordered as you’d expect.
Some languages like C# treat strings as immutable. This means that if you ever need to change a string, you need to create a whole new one. This is not always obvious and C# makes it easy to perform actions that appear to be changing a string. If you have a rather large string or a string that you’re building in pieces by adding a few characters at a time, then it’s easy to end up with some inefficient code. Adding just one character to the end requires allocating a whole new section of memory, copying over all the existing characters, and then adding the single character to the end. And if you then need to add another character, your program will do this all over again.
If you need to build a string in C#, there's a class just for this purpose called the StringBuilder. You construct a StringBuilder, build your string piece by piece, and when you're done, you construct a string from the StringBuilder.
Another topic specific to strings is how to determine the end of a string. Many methods that work with strings assume that the end of a string is marked with a null character. This is just a character with the value zero. Some languages such as C++ allow you to have null characters inside your strings, but you still need to be careful.
My advice is that if you need to work with data that might contain embedded nulls (and this is traditionally the definition of binary data), then don't use a string; use a vector of characters instead.
The next topic is a bit more complicated and involves quite a lot of historical information. I’ll explain how characters are represented right after this message from our sponsor. Yes, that’s me for now and like this message explains, you can sign up for live classes. There are free and paid live classes.
Computers work natively with numeric values stored in binary. That’s all. A computer knows all about a byte and the possible values that the bits in that byte can have. But there is absolutely nothing in a numeric value that leads directly to a letter.
Let’s take the letter A. First, is that a capital or small A? Should it be bold? Should it be slanted? Should it have extra symbols placed above it like some European languages? And what about languages that don’t use letters at all but instead use pictures for concepts? When you think about all these things, handling text is not an easy task even for us humans. And trying to get computers to understand this is even harder.
Starting in the early 1960s, a standard was developed called ASCII, which stands for the American Standard Code for Information Interchange. This standard went through many debates and several changes over the years. At its core, though, was an agreement on which numeric values would represent which characters. For example, the capital letter A is defined to be the decimal value 65. There are ASCII charts that map letters, digits, punctuation symbols, and several other characters to specific decimal values.
The only problem was that ASCII defined just 128 values, because it used 7 bits. Actually, we're lucky to have even 7 bits. That doesn't leave room for very many characters, and many computers in the 1980s had an extra eighth bit in each byte that could be used to define another 128 values. A lot of country-specific standards arose that used all 256 values in a byte.
But even 256 values is not enough. Early computers often had to choose which language and region they would support, and would then use codes that mapped to characters for that language and region. This system allowed similar computers to share information and even mix English text with the local language. But if you saved a document under one standard, you could not reliably make sense of it under another. Sharing information between different systems, or trying to include several different languages in a single document, was really difficult if not impossible.
The one thing good about these standards is that they defined everything about a character. You knew exactly how to represent each character in a byte value. That is, as long as you could figure out what system was being used.
Modern computers use a newer, evolving standard called Unicode. It defines identities for characters, such as a capital A; collections of characters, such as the Latin script; codes that define ordering, so we know which characters come before and after other characters; code points, which assign a specific number to a specific character; and finally encodings, which define exactly how all of this is represented in byte values.
That was a long list of very technical concepts. And even that, I simplified for you. If you find all this exciting, then there’s lots of further study that you can research. And if you’re about to tune out right about now, don’t worry. Things have gotten so much better in the last decade.
Once character standards went beyond 256 values, multiple bytes were required. This is where the concept of encoding becomes important. A very popular encoding standard used now is called UTF-8. It has several benefits, including being compatible with the original 128 values in ASCII.
It can also expand when needed beyond a single byte to handle characters from other languages. With this encoding, you can no longer just count the number of bytes to determine how many characters are in a string.
So what does all this have to do with how strings represent characters? It’s going to depend on your language. Languages like C# have a lot more support for Unicode built-in while C++ really doesn’t care what you put in your strings.
Just be aware that if you're using UTF-8 and storing this data in a C++ string, then it'll work just fine for most scenarios. But once you have text with characters that need more than a single byte, any code that expects to find individual characters at single-byte positions will break.
You can no longer assume that the first character will be at index 0, the second character at index 1, the third character at index 2, etc.