created: 11/14/2007; revised 07/04/2017


String Puzzles — Introduction

This section is about null-terminated strings. Many of the puzzles are similar to the functions in the standard strings library of most C environments.

1. Null-terminated strings

A null-terminated string is a sequence of bytes in memory where each byte encodes a single character using ASCII. The last character is followed by a byte containing all zero bits. That byte is called a null byte. For example, the string "The game is afoot!" is stored as:

Null Terminated String

The slash in the last cell represents the null byte. The empty cells represent the space character, represented in ASCII as the pattern 0x20.

2. ASCII

Historically, C programs have encoded characters as bit patterns using the ASCII encoding scheme. Although various libraries use other encoding schemes, ASCII remains the most popular and is used in these puzzles.

The following table shows the ASCII encoding scheme. Each character is encoded as an 8-bit pattern. The table shows the pattern using the hexadecimal name for the pattern. The column heading gives the 1's place digit of the hex and the row heading gives the 16's place digit of the hex. For example, find the character 'K' in the central (white) portion of the chart. The column heading is B and the row heading is 4 so the bit pattern is 0x4B, or 0100 1011.

  1's place
16's
place
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SP ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL

Some characters in the chart are control characters, which in historic times were used to send commands to electro-mechanical equipment like teletypes and printers. Control characters do not correspond to printed characters and are shown in the chart as abbreviations. For example, BEL (0x07) commands a device to sound its bell. Modern equipment will sometimes sound a beep when it is sent BEL.

Line feed is 0x0A and is still used to signify the end of a line for text contained in main memory. When text is stored in a disk file, end of lines are signified by LF (in Unix systems), CR-LF (in Microsoft systems), or CR (older Apple systems).

Character SP (0x20) is the space character. Character HT (0x09) is the horizontal tab. NUL is the pattern 0x00 — the null byte placed at the end of strings.

 

3. Characters as Integers

Often it is convenient to regard the bit pattern that encodes a character as encoding an integer. For example, the bit pattern the represents 'A' is 0x41 or 01000001. This pattern could also be regarded as the base 2 encoding of an integer. Adding one to this integer yields 01000010 which is the ASCII encoding for 'B'. The following code fragment outputs the sequence 'A', 'B', 'C', 'D':

int ch = 'A' ;
putchar( ch );
putchar( ch+1 );
putchar( ch+2 );
putchar( ch+3 );

In the first statement, the byte encoding character 'A' is put into the low order byte of the int variable ch because putchar() requires an int argument.

Often you want to convert a digit character into the corresponding integer. For example, you might want to convert the character '1' into the integer value one. The following does not work:

int value;
value = '1';

because it will place the ASCII encoding for character '1' into the low order byte of the variable

0000 0000 0000 0000 0000 0000 0011 0001
(assuming ints are 32 bits), rather than the desired pattern for the integer one:
0000 0000 0000 0000 0000 0000 0000 0001

To get the desired pattern into the variable, do this:

int value;
value = '1' - '0';

Since the ASCII encoding for '1' immediately follows the encoding for '0', the subtraction evaluates to integer one.

 

4. Strings and Pointers

Strings are accessed through pointers. The printf() function from the standard I/O library prints a string to the monitor. For example,

printf("The game is afoot!")

prints its string to the monitor. At run time, the string literal "The game is afoot!" is a null terminated string placed somewhere in memory. The actual argument sent to printf() is a pointer to the first byte of that string:

pointer to first byte of a string

The effect is the same as the following:

char *p;
p = "The game is afoot!";
printf( p );

In the above code, the variable p contains a pointer to a char. To increment the pointer to point to the next char, add one to it. The following code:

char *p;
p = "The game is afoot!";
p++ ;
printf( p );

will print out "he game is afoot!" .

 

5. String Literals

A string literal is a sequence of characters contained between quote marks, like "The game is afoot!" The compiler allocates memory for it, puts the characters followed by a null byte into the memory, and provides a pointer to the first byte.

Bug Alert! In standard C, string literals are constants. They cannot be altered. The following is illegal:

char *p = "The game is afoot!";

*p = 'X' ;

The fragment is attempting to put 'X' into the first byte of the literal, but changing the literal is illegal. The compiler might not catch the error, so you will have a mysterious run-time error instead.

Unfortunately, many compilers do not enforce this, and the above code will compile and run. Worse, many C programs have been written with the expectation that string literals can be altered.

 

6. String Buffers

Often a program will allocate a generous amount of memory for a character array, more bytes than any string is expected to use. The various strings that the program manipulates are stored at the front of the array (and always terminated with a null byte). The array is used as a staging area, and is often called a string buffer. For example the following uses a buffer of 1024 characters which is plenty of room for the strings it is dealing with:

char buffer[1024];
strcpy( buffer, "SHORT String" );
strcat( buffer, " MADE Longer");
strToLower( buffer );

The first function copies characters into the buffer. Although the string literal itself cannot be changed, its characters can be copied into a buffer and the buffer can be changed. The second function appends additional characters to the string in the buffer by copying characters from a second literal. The third function converts characters in the buffer to lower case. The buffer is not a string literal, so the characters it contains can be altered.

String buffer and two strings


Return to Contents — Return to the main contents page