114: Data Types: Strings Part 1.

You need more than a bunch of numbers and logic to write an application. You need text, and working with individual characters isn’t enough either.

In some languages, a string is nothing more than an array of chars that ends with a special null value. Other languages give you much more powerful strings that blend into the language seamlessly. And some languages store strings as a linked list of characters.

A good string type will behave just like any other data type. You shouldn’t need to treat strings any differently. For example, you can add the two numbers five and five to get the value ten. You should be able to add the two strings “pro” and “gram” to get the word “program”, and this operation should use the same plus operator that the numbers used.
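As a small sketch of that idea, here is how it looks in Python, which is just one language where the plus operator works for both numbers and strings. The variable names are made up for illustration.

```python
# The same + operator adds numbers and joins strings.
total = 5 + 5
word = "pro" + "gram"

print(total)  # 10
print(word)   # program
```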

This topic will describe 21 points and be split into three episodes. The first seven points are:

What are string literals?
How are single and double quotes used?
What are escape sequences and why are they needed?
Is the string immutable or mutable?
Is the string null-terminated or capable of containing binary data?
Is the string empty or null?
What’s the difference between the length of strings in character count vs. byte count?

Listen to the full episode or you can also read the full transcript below.

Transcript

For this episode, there are a lot of things to consider and learn about strings. I tried to come up with a clever way to structure all this but decided to go with just a numbered set of topics instead. After thinking about strings for a while, I came up with 21 points. This is too much for one episode, so the first 7 points will be included here and the rest will be in two additional episodes.

#1 What are string literals? When you embed strings in your code, these are string literals. These are strings that are known at compile time. If you need to translate your application so it works with multiple languages, then the translated strings will be held in some kind of string resource file and will not be directly in your code.

#2 How are single and double quotes used? This can depend on your language, but normally single quotes define a single character while double quotes define a string. Let’s say you have the single character ‘a’. Putting it inside single quotes means the letter ‘a’. If you don’t use any quotes at all, then you’re referring to a variable called “a”, which could be a variable of any type. If you put the letter a inside double quotes, then you’re defining a string that just happens to have a single character, the letter ‘a’. The same thing works with numbers. A single digit in single quotes represents the character of that digit and not the numeric value. You can only have a single character inside single quotes because they define single characters. Putting multiple letters or digits inside double quotes defines a string that contains all of those characters. So, the characters 123 in your source code by themselves with no quotes define the numeric value one hundred twenty-three. And the same characters inside double quotes define the string with the three digits one, two, three. If your string needs double quotes inside the string, then what do you do? Many languages will require that you use escape sequences, as the next point describes. But some languages that don’t make any strong difference between single and double quotes will let you switch between them as needed. So if you normally use double quotes to define the beginning and end of a string literal and need to include double quotes inside the string, then just switch to using single quotes to define the beginning and end of the entire string. And the other approach works too. If you need some single quotes inside your string, then use double quotes outside the string. The reason double quotes are more common for defining strings is that it’s rare to need additional double quotes inside the string. But single quotes are needed inside strings all the time because the single quote is the same character as the apostrophe used in contractions like the word who’s, which is short for who is.
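Python is one of the languages that makes no strong difference between single and double quotes, so it can illustrate the quote-switching approach. It has no separate character type, though, so the single-quoted ‘a’ below is really a one-character string. This is only a sketch with made-up variable names.

```python
number = 123     # no quotes: the numeric value one hundred twenty-three
digits = "123"   # double quotes: a string holding the characters 1, 2, 3
letter = 'a'     # in Python this is a one-character string, not a char type

# Switch the outer quotes so the inner ones need no escaping.
quoted = 'She said "hello" to me.'
contraction = "Who's there?"

print(number + 1)    # 124
print(digits + "1")  # 1231
print(len(letter))   # 1
print(quoted)
print(contraction)
```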

#3 What are escape sequences and why are they needed? I already mentioned one need for an escape sequence: when you need to include a double quote inside a string. You can’t just include the double quote directly or its presence would end the string literal right there. Instead, you put another character right before the double quote which acts like a flag. This flag tells the compiler to treat the next character specially. The question then becomes what character should be used for the flag? It makes sense to use a character that’s not often needed itself, right? I mean, you wouldn’t want to use the letter e for a flag or you’ll be signaling special characters all over the place. Because of this, the backslash character was chosen as the special flag. Anytime you include a backslash character in a string literal, the backslash character itself doesn’t get included in the string. Its presence is enough to cause the next character to be treated specially. This is called an escape sequence because the two characters, beginning with the backslash and followed by the next character, form a sequence that means something special. What does it mean to escape the letter c? Nothing. Not all letters are valid escape sequences. Some letters do have a special meaning though. For example, the sequence backslash-n means to include a newline character.
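Here is a short sketch of escape sequences in Python, which uses the backslash the same way. The strings and variable names are invented for illustration. Note that the backslash itself does not count toward the length of the string.

```python
quoted = "She said \"hello\" to me."  # \" puts a double quote inside the string
two_lines = "first\nsecond"           # \n is a single newline character
path = "C:\\temp"                     # \\ is how a real backslash is written

print(quoted)
print(two_lines)
print(len("\n"))  # 1: the backslash flag is not stored in the string
print(len(path))  # 7
```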

I’ll describe the other 4 points right after this message from our sponsor.

( Message from Sponsor )

#4 Is the string immutable or mutable? Some languages let you change your string variables. Note that you can’t change a string literal. Because string literals are defined in your code, they’re not variable. However, a string literal can be used to initialize a normal string variable. That string variable can then be modified. That is, if your language allows this sort of thing. If strings can be modified, then they’re said to be mutable. And if they can’t be modified, then they’re immutable. It’s easy to be fooled by this and write code that looks like it’s modifying strings. In C#, strings are immutable. But C# lets you define a string variable, give it a name fruit and an initial value, say, “apple”. Then you can change fruit to a different string, “orange”. It looks like you were able to change an apple to an orange. But the string variable fruit is just a reference to the initial string instance with the contents “apple”. When you change it to “orange”, what actually happens is a completely new string instance “orange” gets initialized from the string literal in the code and then the reference fruit starts pointing to the new instance. The old string with the content “apple” gets forgotten about and the garbage collector will eventually reclaim the memory.
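Python strings happen to be immutable as well, so the apple and orange example can be sketched there instead of C#. The name fruit comes from the example above; original and in_place_failed are added only to make the behavior visible.

```python
fruit = "apple"
original = fruit   # a second reference to the same string object

fruit = "orange"   # rebinds the name; the "apple" object is untouched
print(original)    # apple
print(fruit)       # orange

in_place_failed = False
try:
    original[0] = "A"   # try to change the string itself
except TypeError as error:
    in_place_failed = True
    print("cannot modify:", error)
```

Keeping a second reference shows that the “apple” string was never changed. Only the variable fruit was pointed somewhere else.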

#5 Is the string null-terminated or capable of containing binary data? The C language defines strings as a simple char array where the last character is the null character. This is just a character where all the bits are zero. Strings that follow this system are usually called C strings. The C string “program” has a length of 7 characters but usually occupies 8 bytes. It needs an extra byte for the null termination character even though the termination character is not visible when the string is printed. A string that can contain binary data is sometimes called a binary string and sometimes called a byte string. There should be no special characters that mark the end of such a string because all values are equal in importance. If there’s a possibility that the data can have values of all zero bits, then that’s usually the main factor that determines if a string is a byte string or not. A byte string will still need some way to mark the end. This can be done with a separate length value. The problem then becomes where to store the length. One common system comes from the Pascal language and uses an integer type at the beginning of the string. These strings are sometimes called P strings. Another string similar to a P string comes from Windows and is called a BSTR or “beester”, and it actually puts the length before the string data. A BSTR is a pointer to the beginning of the data and must usually be manipulated through special methods that know how to find the length. If you’re using an object-oriented class for working with strings, then the class can store the character data and length in separate internal fields.
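The difference between a terminator and a length prefix can be sketched with Python byte strings. The helper names pack_prefixed and unpack_prefixed are hypothetical, and the one-byte length is an assumption chosen to keep the example small. It is not the exact layout of any particular Pascal compiler or of a BSTR.

```python
import struct

text = b"program"

# C-style: the data followed by one zero byte as the terminator.
c_string = text + b"\x00"
print(len(text), len(c_string))  # 7 8

# A terminator breaks down when the data itself contains a zero byte.
binary = b"ab\x00cd"
print(binary.split(b"\x00")[0])  # b'ab': a C-style reader would stop here

def pack_prefixed(data: bytes) -> bytes:
    """Store a one-byte length, then the data (so at most 255 bytes)."""
    return struct.pack("B", len(data)) + data

def unpack_prefixed(buffer: bytes) -> bytes:
    """Read the length byte, then return exactly that many data bytes."""
    (length,) = struct.unpack_from("B", buffer, 0)
    return buffer[1:1 + length]

packed = pack_prefixed(binary)
print(unpack_prefixed(packed) == binary)  # True: the zero byte survives
```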

#6 Is the string empty or null? In programming terms, an empty string is a valid string that just has no content. You usually express this in your code with a pair of double quotes with nothing between them. Note that if this is a C string that uses a null terminating character, then even an empty string will still need the terminating null. A string that contains only spaces or tabs or some other unprintable characters is not normally considered to be empty. But if you trim away the whitespace, then you’re left with an empty string. Whitespace is defined to be any character that when printed leaves a blank space on the paper. That’s why spaces and tabs, but also newlines, are considered to be whitespace. Many languages have either pointers or references that can be null. And a null string is different than an empty string. A null string means you have no string at all. It doesn’t exist. An empty string exists, just with no content.
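In Python, the closest thing to a null string is a variable that holds None instead of a string. The function describe below is a made-up helper that sketches the three cases: no string at all, an empty string, and a string that only becomes empty after trimming whitespace.

```python
from typing import Optional

def describe(value: Optional[str]) -> str:
    if value is None:
        return "null"             # no string exists at all
    if value == "":
        return "empty"            # a real string with no content
    if value.strip() == "":
        return "whitespace only"  # empty only after trimming
    return "has content"

print(describe(None))     # null
print(describe(""))       # empty
print(describe(" \t\n"))  # whitespace only
print(describe("text"))   # has content
```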

#7 What’s the difference between the length of strings in character count vs. byte count? This depends on the encoding being used and even on the contents of the string. First of all, the byte count is the actual number of bytes needed to store the string in memory. If the string uses one byte to store each character, then the byte count and the character count are the same. A byte with 8 bits only has room to hold 256 unique values, 0 through 255. That’s not very many values. It’s enough to hold all the upper and lower case English letters, the Arabic digits 0 through 9, and many punctuation characters and special purpose characters with some extra room to spare. But it’s not enough to hold all the characters from all the languages in the world. We need more bits. The actual representation of the numeric values and how they map to meaningful characters is called the encoding. There have been many different encoding systems used in the past and they’re still evolving and changing, although the situation is getting better now that the Unicode standard is being widely adopted. But the Unicode standard really just defines code points that map to characters. The mapping from code points to bits is left to separate encoding formats. Popular encoding formats include UTF-8, which tries to be consistent with the early ASCII encoding that only mapped English letters. UTF-8 has the ability to use more bits when needed to represent code points from other languages. It can take up to 4 bytes to map a single code point. Another popular encoding is UTF-16, which normally uses 2 bytes for each code point and can also expand to use 4 bytes when needed. Alright, so depending on the encoding and the character, it could require one or more bytes to represent that character. But the whole situation gets even more complicated because some characters have multiple ways they can be represented. These are called composable characters. For example, some characters have special marks above them. Well, these marks can either be composed individually or included in a special version of the character itself. When you put all this together, it’s sometimes possible for a string to print on paper as a single letter but need many bytes, 2, 3, or maybe even 10 bytes in memory.
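These counts can be checked with a short Python sketch. Python’s len counts code points, which is one reasonable meaning of character count, and encoding the string gives the byte count. The helper sizes and the sample characters are chosen for illustration. The characters are written with escape sequences so the example does not depend on how the source file is saved.

```python
import unicodedata

def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (len(text),
            len(text.encode("utf-8")),
            len(text.encode("utf-16-le")))

print(sizes("a"))           # (1, 1, 2)
print(sizes("\u00e9"))      # e with acute accent, one code point: (1, 2, 2)
print(sizes("\u20ac"))      # euro sign: (1, 3, 2)
print(sizes("\U0001F600"))  # an emoji: (1, 4, 4)

# The same accented letter, split into a plain e plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print(sizes(decomposed))    # (2, 3, 4)
```

Both forms of the accented letter print as a single letter, yet they differ in code point count and byte count.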
