You need more than a bunch of numbers and logic to write an application. You need text, and working with individual characters isn’t enough either.
This episode continues the discussion about the string data type and covers the following points:
- Can the encoding be changed and how are character boundaries detected? Is the encoding self-synchronizing?
- How do you expand and collapse composite characters?
- How do you convert numbers and other data types to strings and back?
- How do you append, insert, and remove sections of a string?
- How do you reorder and reverse strings?
- How do you change and detect case of letters?
- How do you control the formatting of a string with placeholders?
Listen to the full episode or you can also read the full transcript below.
This episode continues the explanation of the string data type, what you can do with it, and many of the unique considerations that apply to strings. Listen to episode 114 for the first part and to tomorrow’s episode for the third part.
Here are the next seven points, from #8 to #14.
#8 Can the encoding be changed and how are character boundaries detected? Is the encoding self-synchronizing? Yes, encoding can be changed but not usually in-place. You’ll normally have to read one string and use the information to construct another string. And if you’re reading a stream of string data, then you can write another stream with the output. If the data you’re reading uses a single byte for each character, then there are no issues with detecting character boundaries. You’ll know where one byte ends and the next begins. This is handled for you at a much lower level. But at some point, yes, even the bytes need to be separated from a flow of bits. Where you can run into problems with your code is when working with multiple bytes per character. You need to know which comes first, the high order byte or the low order byte, and many files and streams will start with a byte order mark, or BOM, to help identify this. Imagine for a moment a train passing by where the cars are related in pairs. If you’re lucky enough to catch the beginning of the train and don’t lose track, then you can keep track of the first and second cars, then the third and fourth cars, then the fifth and sixth cars, etc. But if all the cars look the same and you start observing the passing train after the engine has long since passed by, then you have no way of knowing if a particular car is related to the one before it or after it. This is a problem for encoding systems that are not self-synchronizing and you’ll just have to start at the beginning. If the cars don’t all look the same though, and let’s say the first car in each related pair has a green stripe, then it’s easy to tell when a new pair begins. Some encoding systems such as UTF-8 are self-synchronizing. They don’t have green stripes, but you can still tell exactly when a new character begins.
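To make the train analogy a little more concrete, here is a minimal Python sketch. It relies on one fact about UTF-8: every continuation byte has the bit pattern 10xxxxxx, so a reader that lands in the middle of a character can skip forward to the next byte that does not match that pattern. The helper names is_continuation_byte and next_character_start are made up for this illustration. The last few lines show byte order in UTF-16 and how a BOM lets Python's generic "utf-16" decoder pick the right one.

```python
import codecs


def is_continuation_byte(byte: int) -> bool:
    """UTF-8 continuation bytes always look like 10xxxxxx in binary."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_character_start(data: bytes, offset: int) -> int:
    """Return the first index at or after offset where a character begins."""
    while offset < len(data) and is_continuation_byte(data[offset]):
        offset += 1
    return offset


text = "naïve café"
encoded = text.encode("utf-8")

# Offset 3 is the second byte of the two-byte sequence for "ï".
# Skipping continuation bytes resynchronizes on the next character.
start = next_character_start(encoded, 3)
tail = encoded[start:].decode("utf-8")
print(tail)  # ve café

# UTF-16 uses two bytes per unit, so byte order matters.
little_endian = "A".encode("utf-16-le")  # low order byte first
big_endian = "A".encode("utf-16-be")     # high order byte first

# A byte order mark at the front tells the decoder which order to use.
with_bom = codecs.BOM_UTF16_BE + big_endian
decoded = with_bom.decode("utf-16")
print(decoded)  # A
```

This only finds where the next character starts. It does not check that the bytes are valid UTF-8, which the decode call does for you.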
#9 How do you expand and collapse composite characters? The Unicode standard defines four normalization forms. This can get complicated and is often overlooked. Normalization is a process where you put things in a standard representation. The dictionary says it is a process to put something back into a usual or expected state. This is really important for file systems and communications. Let’s say you’re working on a scientific research paper and decide to include a special character in the name of your document. This special character is the angstrom symbol and it looks like an A with a circle on top. In fact it looks exactly like an A with a circle on top. There is another character that’s an A with a circle on top that’s not the angstrom character. Even though both look identical, they’re not the same character. Sometimes, it’s important to swap special characters like this with other characters that look the same but are more common and expected. Another example is characters that contain multiple dots, one above and one below. Each of these dots is itself a character that gets combined with a base character. But what order should they be in? Or should they be there at all, or should we instead just use a single character that already has both dots? You really need to be a language expert to understand all the details. And I’m not a language expert. I do know enough to appreciate the problem and know that I should be calling methods in the operating system to handle these cases. Most of the time, your operating system will take care of these details for you. But if you ever need to write code that communicates with a different operating system or that writes data to a file that can be read back on another computer, then you should pay attention to normalization. It’ll still be too complicated to manage directly. Just make sure that you’re aware of the problem and call the right methods to deal with it.
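Here is a small sketch of what "call the right methods" can look like, using Python's standard unicodedata module as one example of such a method. The choice of Python and the variable names are assumptions for illustration only. It uses the angstrom example from above plus a letter with one dot above and one dot below.

```python
import unicodedata

angstrom_sign = "\u212B"      # ANGSTROM SIGN
a_with_ring = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030A"       # "A" followed by COMBINING RING ABOVE

# All three look the same on screen but compare as different strings.
print(angstrom_sign == a_with_ring)  # False
print(a_with_ring == a_plus_ring)    # False

# NFC collapses to the composed form, NFD expands to base plus marks.
composed = {unicodedata.normalize("NFC", s)
            for s in (angstrom_sign, a_with_ring, a_plus_ring)}
expanded = {unicodedata.normalize("NFD", s)
            for s in (angstrom_sign, a_with_ring, a_plus_ring)}
print(len(composed), len(expanded))  # 1 1

# Combining marks typed in a different order normalize to one order.
dot_above_first = "a\u0307\u0323"    # a, dot above, dot below
dot_below_first = "a\u0323\u0307"    # a, dot below, dot above
same_after_nfc = (unicodedata.normalize("NFC", dot_above_first)
                  == unicodedata.normalize("NFC", dot_below_first))
print(same_after_nfc)  # True
```

Normalizing both sides to the same form before comparing or storing is the usual pattern. Which of the four forms to pick depends on what the other system expects.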
Those points were some big ones. It’s time for a short break for this message from our sponsor.
( Message from Sponsor )
#10 How do you convert numbers and other data types to strings and back? Let’s say you want to display the value of a simple integer with the value two. Seems simple, right? And it mostly is. There are some things to be aware of though. What if you want to display a bunch of numbers in a column? It would be nice to align the numbers. Because if you don’t align them then the 2 would print right above the 1 in the value 10. And if you need to print numbers in the hundreds or thousands, then you’ll need to shift that 2 over even more. It’s more than just shifting it over though. Sometimes, you might want to add leading zeros. Or maybe if you’re displaying page numbers in a table of contents, then you might want leading dots to give that dotted line effect. Still, going from numbers to text is usually much easier than going from text to numbers. Because when you start with text, there are all sorts of formats the numbers can be in. You might think that you can just ignore any characters and just focus on the digits. But what about a minus sign? Or what about the letter e or x that can appear in scientific notation? Going from text to a number can fail for many reasons, so usually the methods you’ll use will be called something related to parsing. This is because the process of parsing involves reading text and figuring out what it represents. And it’s not just numbers that you’ll want to convert. You might want to display the values of bools, or a pointer, or an enum, or a custom type. There might be a method like ToString you can use to convert to text. And going from text back to these other types will likely involve more parsing.
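As a rough illustration of both directions, here is a short Python sketch. It uses Python's built-in format specifications for alignment, leading zeros, and leading dots, and wraps int() in a helper named parse_int that is made up for this example. Other languages spell these things differently, so treat the details as one possible approach rather than the only one.

```python
from typing import Optional

# Numbers to text: right-align in a column that is 5 characters wide.
column = [f"{value:>5}" for value in (2, 10, 250)]
print("\n".join(column))

zero_padded = f"{2:05d}"   # leading zeros
dot_padded = f"{2:.>5}"    # leading dots, like a table of contents
print(zero_padded)  # 00002
print(dot_padded)   # ....2


def parse_int(text: str) -> Optional[int]:
    """Text to number: return None instead of raising when parsing fails."""
    try:
        return int(text.strip())
    except ValueError:
        return None


print(parse_int(" -42 "))  # -42
print(parse_int("12abc"))  # None

# The letter e is meaningful to a float parser, not noise to skip over.
print(float("1.5e3"))      # 1500.0

# Other types convert too, and going back again means more parsing.
print(str(True))           # True
```

Returning None on failure is just one design choice. Raising an error or returning a success flag alongside the value are equally common.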
#11 How do you append, insert, and remove sections of a string? This is a common need with strings. You’ll need to be able to add more text to the end, add more text somewhere in the middle of a string, and sometimes remove portions of text. Just think about all the times you move around in an email adding text here, changing some text there, and deleting other things. Some languages like C++ let you do these things easily. Other languages like C# might make it seem like you’re able to modify a string but what really happens is you end up creating a new string with the changes and throwing away the old string. If you want to avoid this extra creation and destruction, then you’ll want to use a StringBuilder which gives you the ability to modify a string more like in C++.
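Python strings behave like the C# strings described here: they cannot be changed in place, so every edit produces a new string. The sketch below shows append, insert, and remove built from slicing. The helpers insert_text and remove_text are made-up names for this example. Python has no class called StringBuilder, so the last part uses the common idiom of collecting pieces in a list and joining them once, which is offered here only as a rough analog.

```python
def insert_text(original: str, index: int, addition: str) -> str:
    """Return a new string with addition placed before position index."""
    return original[:index] + addition + original[index:]


def remove_text(original: str, start: int, end: int) -> str:
    """Return a new string without the characters from start up to end."""
    return original[:start] + original[end:]


greeting = "Hello world"

appended = greeting + "!"
inserted = insert_text(greeting, 5, ",")
removed = remove_text(greeting, 5, 11)

print(appended)  # Hello world!
print(inserted)  # Hello, world
print(removed)   # Hello
print(greeting)  # Hello world  (the original is untouched)

# Building a long string piece by piece: collect, then join once.
pieces = []
for word in ("adding", "text", "here"):
    pieces.append(word)
sentence = " ".join(pieces)
print(sentence)  # adding text here
```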
#12 How do you reorder and reverse strings? This is similar to the previous point but the intent is usually different. With these types of operations, the user is less likely to be the one adding or removing specific text. Maybe you need to reorder the letters for a word guessing game. Or you need to reverse a string of numbers to show a mirroring effect that still keeps the numbers readable. These types of operations work less with substrings like the previous point and more with individual characters at specific indexes. You’ll want to make sure to understand how your language exposes individual characters within a string. Most languages allow you to work with a string as if it were an array of characters and then you can use normal array syntax such as square brackets with a number inside to identify a character at some numeric index.
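Here is a minimal Python sketch of indexing, reversing, and reordering. The scramble helper and its seed parameter are made up for this example, with the seed there only so that a run can be repeated. One caution that ties back to composite characters: Python indexes a string by code point, so a naive reversal can separate a combining mark from its base letter.

```python
import random

word = "planet"

# Square brackets with a number pick out one character.
first_letter = word[0]
last_letter = word[-1]
print(first_letter, last_letter)  # p t

# A slice with a step of -1 walks the string backwards.
reversed_word = word[::-1]
print(reversed_word)  # tenalp


def scramble(text: str, seed: int) -> str:
    """Shuffle the characters, for example for a word guessing game."""
    letters = list(text)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


puzzle = scramble(word, seed=1)
print(puzzle)

# Caution: "A" plus COMBINING RING ABOVE is two code points. Reversing
# moves the ring in front of the "A", so it no longer sits on its base.
composite = "A\u030A"
naive_reverse = composite[::-1]
print(naive_reverse == "\u030AA")  # True
```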
#13 How do you change and detect case of letters? This is something unique to text but not all languages have this concept. And some languages have complicated rules for how upper and lower case letters interact. If you’re comparing two strings to see if they’re the same or not and they are the same except that one string has a few capital letters, then you might sometimes want to treat them as the same and sometimes different. The system and methods you use can’t make this decision for you. Sure, they might have a default behavior, but you need to be aware of it and know how and when you want to ignore case or not. Sometimes you just want to detect if text has upper or lower case letters. And sometimes you want to convert everything to either all uppercase or all lowercase. Just be careful because there are some languages that have unique capitalization rules. If you want an interesting case, just search Google for the Turkish I problem. You see, if your customer is in Turkey and types a lower case i and your code tries to convert it to an upper case I, the result may not be what you expect. You end up with a different letter that’s actually a capital I with a dot. This can affect you more than you might realize because the word file has an i in it. And controlling access to files is sometimes an important security concern. An attacker might be able to get around your security by temporarily switching the text rules to the ones used in Turkey.
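The sketch below shows case detection, conversion, and case-insensitive comparison in Python. Python's built-in str.upper and str.lower do not follow the current locale, so they never apply the Turkish rule on their own. To show what the Turkish rule would produce, the example includes a helper named turkish_upper that is made up for this illustration and only handles the two i letters. It is a simplified assumption, not a complete implementation of Turkish casing.

```python
name = "Hello"

# Detecting case.
has_upper = any(ch.isupper() for ch in name)
all_upper = name.isupper()
print(has_upper, all_upper)  # True False

# Converting case.
print(name.upper())  # HELLO
print(name.lower())  # hello

# Comparing while ignoring case. casefold is meant for comparisons.
same_ignoring_case = "File".casefold() == "FILE".casefold()
same_exact = "File" == "FILE"
print(same_ignoring_case, same_exact)  # True False


def turkish_upper(text: str) -> str:
    """Hypothetical helper: dotted i becomes dotted capital I (U+0130),
    and dotless i (U+0131) becomes the plain capital I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


default_result = "file".upper()
turkish_result = turkish_upper("file")
print(default_result)                    # FILE
print(turkish_result == default_result)  # False
```

A check such as comparing an uppercased name against "FILE" gives different answers under the two sets of rules, which is why the comparison method needs to be chosen on purpose.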
#14 How do you control the formatting of a string with placeholders? This is an important topic especially for localization. Let’s say you want to enable your application to support multiple languages and there’s a string that lets the user know how much space is left. It could be anything, really. I just used this as an example. You want to display a string to the user that says, “You have room for 2 more items.” The worst thing you can do is create a string for the part before the number and another for the part after the number and then combine them with the number value between. Or even if the number comes at the very end, you still don’t want to create a string for the first part and then just stick the number to the end. Why? Because different languages have different grammar rules. The person doing the translating needs to see the whole sentence including where any numbers will reside in order to properly translate it. If the translator sees two separate strings where the first one says “You have room for” and the second one says “more items” they just won’t have enough information to be able to properly translate the text. And if you really want to be precise, then you’ll have to account for languages that have different rules for plurality. In English, we would say “You have no more room available.” for when the number is zero, and “You have room for one more item.” for when the number is one, and “You have room for 2 more items.” for when the number is 2 or more. But some languages have another rule for three or more and treat two specially. You need special cases for each of these strings and they have to be complete. There can’t be any concatenation allowed. You show in the string where the numbers, or anything for that matter, will be placed by using placeholders. These are special symbols that allow you to insert numbers or other content at that location.
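Here is a minimal Python sketch of the idea, using the three English sentences from above as complete templates. The names TEMPLATES, plural_category, and room_message are made up for this example, and the category rule covers English only. A real application would load a separate set of complete sentences and a separate plural rule for each language.

```python
# Complete sentences, one per plural category. A translator sees each
# sentence whole, including the {count} placeholder where the number goes.
TEMPLATES = {
    "zero": "You have no more room available.",
    "one": "You have room for one more item.",
    "other": "You have room for {count} more items.",
}


def plural_category(count: int) -> str:
    """English-only rule (an assumption); other languages need their own."""
    if count == 0:
        return "zero"
    if count == 1:
        return "one"
    return "other"


def room_message(count: int) -> str:
    """Pick the complete sentence, then fill in the placeholder."""
    template = TEMPLATES[plural_category(count)]
    return template.format(count=count)


for remaining in (0, 1, 2):
    print(room_message(remaining))

# Named placeholders let a translation move values around freely.
original_order = "{count} items in {folder}"
changed_order = "Folder {folder} holds {count} items"
print(original_order.format(count=3, folder="Inbox"))  # 3 items in Inbox
print(changed_order.format(count=3, folder="Inbox"))
```

Because the placeholder is named, the code that supplies the values stays the same no matter where a translation puts the number in the sentence.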