You need more than a bunch of numbers and logic to write an application. You need text, and working with individual characters isn’t enough either.
This episode continues explaining concepts important to the string data type with the following seven points:
- How do you search a string for another string or a pattern?
- How do you use a string as an index or key into a collection?
- How do you read and write strings as data streams?
- How do you use strings to transfer information such as HTML, JSON, and XML?
- How do you convert images to strings and then back again?
- How do you sort strings and control how numbers and special characters affect the order?
- How do you embed special symbols into strings such as line breaks, presentation information, and binary data?
Listen to the full episode, or read the full transcript below.
This episode continues the explanation of the string data type, what you can do with it, and many of the unique considerations that apply to strings. Listen to episodes 114 and 115 for the first two parts.
Here’s the next seven points from #15 to #21.
#15 How do you search a string for another string or a pattern? Most string data types will provide methods like findFirst or findLast that allow you to specify a string to search for. These can be a good choice when you’re fairly sure the text you’re looking for is somewhere in a larger string and you just need to verify that and find the exact location. Sometimes you’re interested in just one character, or in any one of several characters. These could be special characters such as the angle brackets used in HTML, quotation marks that signal special data, or spaces or tabs. When you want to find the first or last of any one of several characters, then look for methods like findFirstOf or findLastOf. These methods also take a string to search for but instead of searching for the string as-is, they search for each character in the search string independently. Putting all this together, if you have the string “Stay with it.” and call findFirst with the string “it”, then you might expect the method to return the index of the word “it”. However, findFirst normally doesn’t know anything about word boundaries and will tell you that the string “it” first occurs inside the word “with”. That’s because there’s an i followed directly by a t inside the word “with”. What if you call findFirstOf with the same string “it”? Now, you’re asking the string to find the first character that matches any of the characters in the search string. The method will return the index of the t in the word “Stay”. If you want to get more complicated search patterns, you can, but I’d suggest using another class called a regular expression or just regex for short. Regular expressions will need a future episode to explain fully.
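To make the “Stay with it.” example concrete, here’s a minimal sketch in Python. Python names these methods differently: `find` and `rfind` play the roles of findFirst and findLast. Python has no built-in findFirstOf, so `find_first_of` below is a hypothetical helper written just for this illustration.

```python
text = "Stay with it."

# str.find returns the index of the first occurrence of the whole search
# string, or -1 when there is no match. It knows nothing about word boundaries.
first_it = text.find("it")    # matches the "it" inside "with"
last_it = text.rfind("it")    # rfind searches from the end, like findLast


def find_first_of(haystack: str, chars: str) -> int:
    """Hypothetical helper: index of the first character in haystack that
    appears anywhere in chars, or -1 when none is found."""
    for index, ch in enumerate(haystack):
        if ch in chars:
            return index
    return -1


first_of_it = find_first_of(text, "it")   # matches the "t" in "Stay"

print(first_it, last_it, first_of_it)     # prints: 6 10 1
```

Indexes start at zero here, so 6 is the i in “with”, 10 is the start of the word “it”, and 1 is the t in “Stay”.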
#16 How do you use a string as an index or key into a collection? This can be an extremely useful way to organize your data. When you define a keyed collection such as a dictionary, you define the types that will be used for the keys and for the values. Listen to episodes 43 and 44… for more information. Let’s say you want to load some information into memory that applies to countries. This information could be weather related such as the high temperature each day. The information will arrive in various orders but you want to be able to access the information related to a specific country very quickly later. It makes sense then to put the information into either a hashtable or a dictionary as you get it. But what do you use for the key? Why not use the standard two-letter country code that each country already has? Even two characters is still a string. Actually, you can have a string with just a single character. Or no characters at all. But in this case, we’ll have two-character strings that we can use as the key.
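Here’s a small Python sketch of the country example, using a dictionary keyed by two-letter country codes. The function name `record_high` and the temperature readings are made up for illustration.

```python
# Keys are two-letter country codes (strings); values are lists of daily highs.
# The readings below are invented sample values, not real measurements.
highs_by_country = {}


def record_high(country_code: str, temperature_c: float) -> None:
    """Store one reading under its country code, creating the list if needed."""
    highs_by_country.setdefault(country_code, []).append(temperature_c)


# Readings can arrive in any order.
record_high("US", 31.5)
record_high("JP", 28.0)
record_high("US", 33.0)

# Later, a lookup by key goes straight to the data for that country.
print(highs_by_country["US"])       # prints: [31.5, 33.0]
print(highs_by_country.get("FR"))   # prints: None, nothing recorded yet
```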
#17 How do you read and write strings as data streams? A stream is a nice way to read or write information that’s more open-ended or that could be arbitrarily long. Streams don’t have to keep the entire contents in memory. They don’t even have to keep the entire contents on the same computer. I mentioned a real stream before as an example. Running water that flows past you. At any time, you have access to the water that’s right in front of you. You can’t get to water that hasn’t arrived yet. And water that passed earlier is already long gone. You can work with a stream of data in the same way. The data could come from a local file stored on disk, it could come from a hardware device that continuously generates random information, or it could come from a network connection. Sometimes you might be able to start at the beginning of a stream and go all the way to the end. And sometimes, there won’t be any special beginning or ending. That doesn’t matter because normally, you end up working with a chunk of information at a time. Each piece of information fits nicely into the concept of a string.
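As a rough sketch of working with a chunk at a time, the Python below reads from one stream and writes to another. It uses `io.StringIO` as an in-memory stand-in for a file, device, or network connection, and the chunk size of 8 characters is an arbitrary choice for the example.

```python
import io

# io.StringIO stands in for a real source such as a file or a socket.
source = io.StringIO("first line\nsecond line\nthird line\n")
destination = io.StringIO()

chunks = []
while True:
    chunk = source.read(8)    # ask for at most 8 characters at a time
    if chunk == "":           # an empty string means the stream has ended
        break
    chunks.append(chunk)      # each chunk is an ordinary string
    destination.write(chunk)  # write it onward without holding everything

print(chunks[0])              # prints: first li
print(len(chunks))            # prints: 5
```

The loop only ever holds one small string from the source at a time, which is the point of a stream. The `chunks` list is kept here only so the result can be inspected.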
#18 How do you use strings to transfer information such as HTML, JSON, and XML? You’ll normally use a communications library which could be something like sockets that allow you to open connections with other computers and then send and receive information. Or it could be a framework like the Windows Communication Foundation or WCF. Or you could interact with a local web server through its own interfaces. However you establish a connection, you’ll then need to figure out what format to use when sending information and what format to expect for any replies. There are standards for this. For example, if you want to request a web page, then you make an HTTP request and expect a string in reply. This string should contain the contents of an HTML page. Or maybe you have a connection to a web service that allows you to request the current high scores for a game. The reply from the web service could be either JSON or XML depending on what the server uses for a default and whether you specifically requested one type or the other. Let’s say that you get back JSON. This will be a string that you can read. You could search for whitespace or curly braces to parse the reply yourself, although a JSON parsing library is usually the more reliable choice.
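Here’s a hedged sketch of the high score example in Python, using the standard `json` module. The reply string and its field names (`game`, `high_scores`, `player`, `score`, `limit`) are invented for illustration; a real web service would define its own.

```python
import json

# A made-up reply, as it might arrive over a connection: just a string.
reply = (
    '{"game": "example", "high_scores": ['
    '{"player": "Ann", "score": 9200}, '
    '{"player": "Raj", "score": 8750}]}'
)

# json.loads turns the string into dictionaries, lists, strings, and numbers.
data = json.loads(reply)
top = data["high_scores"][0]
print(top["player"], top["score"])   # prints: Ann 9200

# Going the other way, json.dumps turns data back into a string to send.
request_body = json.dumps({"game": "example", "limit": 2})
print(request_body)                  # prints: {"game": "example", "limit": 2}
```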
#19 How do you convert images to strings and then back again? This question might be a bit misleading. I’m not talking about scanning a picture to look for text, as in an application that lets you take a picture of a business card or a receipt and then converts all the recognizable text while ignoring any graphics. This is different. You see, an image is binary data and that means it can contain any values in the bytes that make up the image. These values include zero which is normally interpreted as a null character and the end of a string. And the values can contain other special values such as new lines, bell signals, and form feeds. If you’ve ever tried sending the contents of a binary file to a console window, you’ll know what I mean. You end up with a jumble of meaningless characters and sounds. Most of the characters can’t be displayed and all you get is some symbol instead. This point is really about how you convert binary data like this into something that can be easily displayed. It won’t be readable or make any sense but at least the output should consist of normal characters that can be displayed, copied, and pasted as normal text. It works like this. If you take three bytes at a time from the original binary data and stick them all together so all the bits are next to one another, then three bytes means you have 24 bits. Now, redivide those 24 bits into four pieces of just 6 bits each. These are not bytes anymore. They’re just 6-bit values. With 6 bits, we can count from 0 to 63. That’s 64 values, which is where the name Base64 encoding comes from. It’s also a reasonable number because we can assign the regular English letters capital A to Z, then lower case a to z, then the digits 0 through 9, and have just two values left over. These two values have been defined to be the plus sign and the forward slash character.
The output of the 6-bit values gets mapped to these 64 characters and the end result is that for every 3 bytes of binary data, we convert it into 4 bytes of simple letters, numbers, and a couple extra possible characters. The result is still completely unrecognizable but it’s now made up of content that can be transferred around as just text. It’s easy to go back the other way too. Just take 4 of the characters, put their special 6-bit values end-to-end, and divide the values back into the original 3 bytes of binary data. There are some special considerations that need to be accounted for when the number of original binary bytes is not evenly divisible by 3. In this case, Base64 encoding uses some padding and then adds equal sign characters to the end to show that padding was used.
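The steps just described can be sketched in Python. The function `encode_base64` below is an illustrative hand-written encoder, not how you’d normally do this; in practice you’d call the standard `base64` module, which the sketch uses only to check its own answer and to decode.

```python
import base64

# The 64 output characters, in the order described: A-Z, a-z, 0-9, "+", "/".
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_base64(data: bytes) -> str:
    """Illustrative encoder: 3 bytes in, 4 characters out, '=' for padding."""
    output = []
    for start in range(0, len(data), 3):
        group = data[start:start + 3]
        missing = 3 - len(group)               # 0, 1, or 2 bytes short
        padded = group + b"\x00" * missing     # fill the gap with zero bytes
        bits = int.from_bytes(padded, "big")   # all 24 bits as one integer
        # Redivide the 24 bits into four 6-bit values, leftmost first.
        sextets = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[value] for value in sextets]
        if missing:
            # Characters that came only from padding become equal signs.
            chars[4 - missing:] = "=" * missing
        output.extend(chars)
    return "".join(output)


sample = bytes([0, 255, 16, 128, 10])          # includes a zero byte
encoded = encode_base64(sample)

print(encode_base64(b"Man"))                   # prints: TWFu
print(encoded == base64.b64encode(sample).decode("ascii"))   # prints: True
print(base64.b64decode(encoded) == sample)     # prints: True
```

The sample has 5 bytes, which is not divisible by 3, so the encoded text ends with an equal sign.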
#20 How do you sort strings and control how numbers and special characters affect the order? Sorting text should be simple, right? We’ve been doing it since 2nd or 3rd grade. But there’s a lot more to this topic than I could cover here. It’s not simple at all. First of all, what do you do when text contains numbers? If you do a simple alphabetical sort, then all the ones come first, followed by all the twos, etc. This results in sorted data that goes from 1 to 10, 11, 12, all the way to 19, before going back to 2. When sorting numbers, we want to first convert the text into a numeric value and then sort on the value to determine the order of the text. This results in 1, 2, 3, all the way to 9, then 10, 11, 12, etc. And the same process applies to dates. And what do you do when text contains numbers and dates scattered throughout the text? It’s not so easy anymore. And I haven’t even begun to explain all the complexities. You see, proper sorting really depends on your locale. That is, it depends on special rules that vary based on regionally accepted normal practices. In some languages, there are certain sequences of characters such as a double s that, while they remain separate characters, take on a different meaning that should be sorted differently than just a single s or even two normal s characters from some other locale. When you set up a database, one of the initial tasks you should do, but which often gets overlooked, is to set the collating rules. A database engine needs to know what information should come before or after other information, and it uses the choice you make at the beginning, before any data gets added to the database, to determine this.
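Here’s a minimal Python sketch of the difference between a simple alphabetical sort and one that converts embedded numbers to numeric values first. The helper `natural_key` and the file names are made up for this example, and the sketch ignores locale rules entirely.

```python
import re

names = ["file10", "file2", "file1", "file19", "file11"]

# A plain sort compares character by character, so "10" sorts before "2".
plain = sorted(names)
print(plain)     # prints: ['file1', 'file10', 'file11', 'file19', 'file2']


def natural_key(text: str) -> list:
    """Hypothetical sort key: split into non-digit and digit runs, and turn
    the digit runs into integers so they compare by value."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, re.split alternates: text, digits, text, ...
    # so the odd positions always hold the digit runs.
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


natural = sorted(names, key=natural_key)
print(natural)   # prints: ['file1', 'file2', 'file10', 'file11', 'file19']
```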
#21 How do you embed special symbols into strings such as line breaks, presentation information, and binary data? I’ve explained various aspects of this throughout these 21 points so this is more of a recap than anything else. A string is just a series of characters in a specific order and your language will store these characters in its own manner. Each will also have some way to determine the beginning and end of the sequence. Many of the characters in a string are normal characters found in a language alphabet. Or if the language has no alphabet, then the characters will map to various code points defined by the Unicode standard. These characters will be encoded in some specific manner that’s independent of the code point values. Just think about the Base64 encoding that I just described. This encoding can be used for more than just images. Now, your strings will sometimes need to contain special symbols that mean something other than a letter to be displayed. Maybe you want the text to move to the next line at that point. Or you want the text to jump forward a bit as in a tab. Special characters such as these have Unicode values too and you can treat them just like any other character. The problem is that they’re not always capable of being typed directly. Sure, you can hit the enter key on your keyboard but this won’t always insert a newline character into your string. If you’re writing a string literal into your code, then hitting enter will usually just cause a compile error because half the string is now on a different line of your source code. You need to use an escape sequence. Or you can explicitly define the character values you want and insert them directly into your string. Other presentation data can be handled like how HTML defines angle brackets that can be used to insert markup into your text. This markup is part of the string but usually not part of the text that gets displayed.
Instead, the markup defines instructions that change the way the text gets presented. And binary data also presents problems when your string needs to contain embedded null characters. This is because a lot of string processing methods assume C-style strings that use a null to mark the end of the string. You’ll have to select a different encoding that avoids nulls entirely, or keep track of the length of the string in some other way.
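To recap with a short Python sketch: escape sequences put special characters into a string literal, a code point can be given explicitly, and an embedded null is fine as long as the length is tracked separately. Python strings do track their own length, so the last line only imitates what a C-style reader would see.

```python
# \t is the escape sequence for a tab and \n is the one for a newline.
text = "Name:\tAda\nRole:\tEngineer"
print(text)                      # shows two lines with tab-aligned values

# The same newline character, given explicitly by its Unicode code point.
same_newline = "\u000A"
print(same_newline == "\n")      # prints: True
print(chr(10) == "\n")           # prints: True

# Python stores the length with the string, so a null is just another
# character in the middle of the data.
with_null = "abc\x00def"
print(len(with_null))            # prints: 7

# Code that assumes a null marks the end would only see this much.
c_style_view = with_null.split("\x00", 1)[0]
print(c_style_view)              # prints: abc
```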