3.4: Characters and Strings

In addition to numeric data, symbolic data is often required. Symbolic or non-numeric data might include an important message such as “Hello World”(For more information, refer to: http://en.Wikipedia.org/wiki/”Hello,_World!”_program), a common greeting for first programs. Such symbols are well understood by English language speakers.

Computer memory is designed to store and retrieve numbers. Consequently, the symbols are represented by assigning numeric values to each symbol or character.

Character Representation

In a computer, a character(For more information, refer to: http://en.Wikipedia.org/wiki/Character_(computing)) is a unit of information that corresponds to a symbol such as a letter in the alphabet. Examples of characters include letters, numerical digits, common punctuation marks (such as "." or "!"), and whitespace. The general concept also includes control characters, which do not correspond to symbols in a particular language, but to other information used to process text. Examples of control characters include carriage return or tab.

American Standard Code for Information Interchange

Characters are represented using the American Standard Code for Information Interchange (ASCII(For more information, refer to: http://en.Wikipedia.org/wiki/ASCII)). Based on the ASCII table, each character and control character is assigned a numeric value. When using ASCII, the character displayed is based on the assigned numeric value. This only works if everyone agrees on common values, which is the purpose of the ASCII table. For example, the letter “A” is defined as \(65_{10}\) (0x41). The 0x41 is stored in computer memory, and when displayed to the console, the letter “A” is shown. Refer to Appendix A for the complete ASCII table.
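The mapping between a character and its assigned numeric value can be checked directly. The following is a minimal sketch in Python (a language chosen here only for illustration; it is not used elsewhere in this text), relying on the built-in `ord` and `chr` functions:

```python
# Look up the numeric value assigned to a character, and the reverse.
code = ord("A")            # character -> numeric value
print(code, hex(code))     # 65 0x41

letter = chr(0x41)         # numeric value -> character
print(letter)              # A
```

The value 65 (0x41) is what is actually stored; the letter “A” appears only when that value is interpreted as a character for display.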

Additionally, numeric symbols can be represented in ASCII. For example, “9” is represented as \(57_{10}\) (0x39) in computer memory. The “9” can be displayed as output to the console. If sent to the console, the integer value \(9_{10}\) (0x09) would be interpreted as an ASCII value, which in this case would be a tab.

It is very important to understand the difference between characters (such as “2”) and integers (such as \(2_{10}\)). Characters can be displayed to the console, but cannot be used for calculations. Integers can be used for calculations, but cannot be displayed to the console (without changing the representation).
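The distinction between the character “9” and the integer 9 can be illustrated with a short Python sketch. The conversion shown (subtracting or adding the code for “0”) is one common way to change representation for a single decimal digit; it is offered as an illustration and the variable names are hypothetical:

```python
char_nine = "9"    # the character, stored as 57 (0x39)
int_nine = 9       # the integer, stored as 9 (0x09)

print(ord(char_nine))            # 57
print(chr(int_nine) == "\t")     # True: value 9 is the tab control character

# Changing the representation of a single decimal digit.
as_int = ord(char_nine) - ord("0")     # 57 - 48 = 9, usable in calculations
as_char = chr(int_nine + ord("0"))     # 9 + 48 = 57, displayable as "9"

print(as_int + 1)    # 10
print(as_char)       # 9
```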

A character is typically stored in a byte (8 bits) of space. This works well since memory is byte-addressable.

Unicode

It should be noted that Unicode(For more information, refer to: http://en.Wikipedia.org/wiki/Unicode) is a current standard that includes support for different languages. The Unicode Standard provides a series of different encoding schemes (UTF-8, UTF-16, UTF-32, etc.) in order to provide a unique number for every character, no matter what platform, device, application or language. In the most common encoding scheme, UTF-8, ASCII English text looks exactly the same as it did in ASCII. Additional bytes are used for other characters as needed. Details regarding Unicode representation are not addressed in this text.
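The compatibility between ASCII and UTF-8 can be observed with a brief Python sketch. The accented letter used below (U+00E9) is an arbitrary example chosen for illustration and does not come from this text:

```python
# ASCII text produces identical bytes under both encodings.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)    # True

# A character outside ASCII needs additional bytes in UTF-8.
accented = "\u00e9".encode("utf-8")  # U+00E9, e with acute accent
print(len(accented))                 # 2
print(accented.hex())                # c3a9
```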

String Representation

A string(For more information, refer to: http://en.Wikipedia.org/wiki/String_...puter_science)) is a series of ASCII characters, typically terminated with a NULL. The NULL is a non-printable ASCII control character with the value 0. Since it is not printable, it can be used to mark the end of a string.

For example, the string “Hello” would be represented as follows:

Character	"H"	"e"	"l"	"l"	"o"	NULL
ASCII Value (decimal)	72	101	108	108	111	0
ASCII Value (hex)	0x48	0x65	0x6C	0x6C	0x6F	0x0
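The values in the table can be reproduced with a short Python sketch. Python strings are not themselves NULL-terminated, so the terminator is appended by hand to a byte sequence here; this imitates the in-memory layout described above and is only an illustration:

```python
text = "Hello"
stored = text.encode("ascii") + b"\x00"    # append the NULL terminator

print(list(stored))                # [72, 101, 108, 108, 111, 0]
print([hex(b) for b in stored])    # ['0x48', '0x65', '0x6c', '0x6c', '0x6f', '0x0']

# Recover the text by reading up to (not including) the first NULL.
end = stored.index(0)
recovered = stored[:end].decode("ascii")
print(recovered)                   # Hello
```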

A string may consist partially or completely of numeric symbols. For example, the string “19653” would be represented as follows:

Character	"1"	"9"	"6"	"5"	"3"	NULL
ASCII Value (decimal)	49	57	54	53	51	0
ASCII Value (hex)	0x31	0x39	0x36	0x35	0x33	0x0

Again, it is very important to understand the difference between the string “19653” (using 6 bytes, including the terminating NULL) and the single integer \(19,653_{10}\) (which can be stored in a single word, which is 2 bytes).
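The size difference can be made concrete with a small Python sketch. The little-endian byte order used for the integer is an assumption made for this illustration (it matches x86 processors); the text above does not specify a byte order:

```python
# The string: five digit characters plus a NULL terminator.
as_string = "19653".encode("ascii") + b"\x00"

# The integer: one 2-byte word (little-endian order assumed for illustration).
as_integer = (19653).to_bytes(2, byteorder="little")

print(len(as_string), as_string.hex())      # 6 313936353300
print(len(as_integer), as_integer.hex())    # 2 c54c
```

The two byte sequences share no common values, even though both represent “19653” in some sense: the first holds the ASCII codes of the digits, while the second holds the binary value 0x4CC5.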