14.3: Numeric data types

Carey Smith
Oxnard College

By Carey A. Smith

Read the MATLAB help for Floating-Point Numbers:

https://www.mathworks.com/help/matlab/matlab_prog/floating-point-numbers.html

That page covers both double-precision and single-precision floating-point numbers. Read both parts.

This information includes the following statements:

Because the default numeric type for MATLAB is double, you can create a double with a simple assignment statement:
x = 25.783;
Because MATLAB stores numeric data as a double by default, you need to use the single conversion function to create a single-precision number:
x = single(25.783);
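
As a quick check of the two statements above, the following sketch confirms the type and memory size of each value. The variable names (x_double, x_single, and so on) are chosen here for illustration and are not part of the MATLAB help text.

```matlab
x_double = 25.783;           % default numeric type is double (8 bytes)
x_single = single(25.783);   % explicit conversion to single (4 bytes)

class(x_double)              % 'double'
class(x_single)              % 'single'

% whos returns a struct whose bytes field gives the memory used
info_double = whos('x_double');
info_single = whos('x_single');
bytes = [info_double.bytes, info_single.bytes]   % 8  4

% Converting back to double exposes the rounding error of the single value
rounding_error = double(x_single) - x_double
```

The rounding error is small but not zero, because a single stores fewer significant bits than a double.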

Also read "Largest and Smallest Values for Floating-Point Classes"
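
The limits described in that section can also be queried directly. This is a minimal sketch using the built-in functions realmax, realmin, and eps; the variable name overflowed is an illustration only.

```matlab
% Largest finite values
realmax('double')    % about 1.7977e+308
realmax('single')    % about 3.4028e+38

% Smallest positive normalized values
realmin('double')    % about 2.2251e-308
realmin('single')    % about 1.1755e-38

% Spacing between 1 and the next larger representable number
eps('double')        % 2^-52, about 2.2204e-16
eps('single')        % 2^-23, about 1.1921e-07

% Going past the largest finite value overflows to Inf
overflowed = realmax('single') * 2   % Inf
```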

For most calculations, our computers have sufficient memory and speed to use double-precision floating-point numbers.

Other numeric data types are described at the following link. These include signed and unsigned integer types that occupy 1, 2, 4, or 8 bytes (int8, int16, int32, int64 and uint8, uint16, uint32, uint64). The smaller integer types save memory. These data types are primarily for interfacing with other applications. They are not commonly used in MATLAB calculations.

https://www.mathworks.com/help/matlab/numeric-types.html
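
Integer types behave differently from floating-point types in a few ways that are easy to overlook. The sketch below illustrates three of them with built-in functions only; the variable names are chosen here for illustration.

```matlab
% Range of a 1-byte signed and a 1-byte unsigned integer
int8_range  = [intmin('int8'),  intmax('int8')]     % -128  127
uint8_range = [intmin('uint8'), intmax('uint8')]    %    0  255

% Integer arithmetic saturates at the limit instead of wrapping around
saturated = int8(100) + int8(100)     % 127

% Integer division rounds to the nearest integer instead of truncating
rounded = int32(7) / int32(2)         % 4

% Memory use: 1000 int8 values need 1000 bytes, 1000 doubles need 8000
a = zeros(1, 1000, 'int8');
b = zeros(1, 1000);
info_a = whos('a');
info_b = whos('b');
bytes = [info_a.bytes, info_b.bytes]  % 1000  8000
```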

Example 14.3.1: double vs. single floating-point sum of fractions

The code shown here sums many terms of a decreasing sequence using both single-precision and double-precision floating-point numbers. The two sums differ for two reasons: single precision carries fewer significant digits, and once a term becomes smaller than the last bit of the single-precision running sum, adding it no longer changes that sum.

The relative error between the sums is a little more than 1%.
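
The effect of a term falling below the last bit of the running sum can be seen in isolation. This is a small sketch, separate from the example code, with variable names chosen here for illustration. It uses eps to get the spacing between a single-precision value and the next larger one.

```matlab
running_sum = single(1);        % a running sum near 1
spacing = eps(running_sum)      % 2^-23, about 1.1921e-07

small_term = spacing / 4;       % a term well below the last bit of the sum

% In single precision the term is rounded away and the sum is unchanged
unchanged = (running_sum + small_term == running_sum)    % logical 1 (true)

% In double precision the same term does change the sum
changed = (double(running_sum) + double(small_term) ~= double(running_sum))  % logical 1 (true)
```

In the example, the running sum is a little above 1, so any term smaller than about 6e-8 (half the spacing) is rounded away in single precision. The smallest terms of the sequence are near 1e-8.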

clear all; close all; clc; format compact; format long

%% Sum a lot of terms of a sequence using double precision floating-point:
a_double = double(1:100:100e6);
a_double_len = length(a_double)
seq_double = 1./a_double;
a_double_sum = double(0);
for k = 1:a_double_len
    a_double_sum = a_double_sum + seq_double(k);
end
disp(a_double_sum) % 1.143763955258335

%% Sum a lot of terms of a sequence using single precision floating-point:
a_single = single(1:100:100e6);
a_single_len = length(a_single)
seq_single = single(1./a_single);
a_single_sum = single(0);
for k = 1:a_single_len
    a_single_sum = a_single_sum + seq_single(k);
end
disp(a_single_sum) % 1.1286

a_single_sum / a_double_sum % 0.98676

%% Note, the sum function appears to use double precision internally,
% so that function is not used for this demonstration.
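
For reference, the sum function accepts an output-type option ('default', 'double', or 'native'). The sketch below only checks the class of the result for a single-precision input. It does not establish how sum accumulates internally, so the loop remains the clearer demonstration. The shorter sequence and the variable names are chosen here for illustration.

```matlab
x = single(1 ./ (1:100:1e6));   % a shorter version of the sequence above

s_default = sum(x);             % single input gives a single result
s_double  = sum(x, 'double');   % request a double result

class(s_default)                % 'single'
class(s_double)                 % 'double'
```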

Exercise 14.3.1: 32-bit vs. 64-bit integer processing speed

In this assignment, you will compare the time it takes to run many iterations of for loops using 32-bit integers and 64-bit integers.

Start a MATLAB script with the following code.

clear all; close all; clc; format compact
% Store the maximum value of each type of integer
int8max = intmax('int8') % 127
int16max = intmax('int16') % 32767
int32max = intmax('int32') % 2147483647
int64max = intmax('int64') % 9223372036854775807

%% int32 loop
tic
result32 = int32(10);
for m = 1:4e4
    for n = 2:2:127
        ii = int32(n);
        result32 = int32(result32*(ii+1));
        result32 = int32(result32/ii);
    end
end
int32time = toc % This reports the time to complete these loops

Write the time it took in a comment.

Then create a 2nd, similar loop for 64-bit integer code. Copy the 32-bit integer code and replace the following in each place it appears:

int32 by int64

result32 by result64

Write the time it took in a comment.

Answer

% On one PC, the 32-bit loop took about 0.1 seconds and the 64-bit loop took about 10 seconds.

% This suggests that, in MATLAB, 32-bit integer computations can be much faster than 64-bit integer computations.
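
A single tic/toc measurement can vary from run to run because of other activity on the PC. One way to get a steadier number is to repeat the timed loop several times and report the smallest and the median times. The sketch below does this for the int32 loop. The names nRuns, times, bestTime, and medianTime are hypothetical and not part of the exercise.

```matlab
nRuns = 5;                        % hypothetical number of repetitions
times = zeros(1, nRuns);          % elapsed time of each repetition

for r = 1:nRuns
    tic
    result32 = int32(10);
    for m = 1:4e4
        for n = 2:2:127
            ii = int32(n);
            result32 = int32(result32*(ii+1));
            result32 = int32(result32/ii);
        end
    end
    times(r) = toc;               % store this repetition's time
end

bestTime = min(times)             % least disturbed measurement
medianTime = median(times)        % typical measurement
```

The same wrapper can be placed around the int64 loop so that the two types are compared on equal terms.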

Exercise 14.3.2: 32-bit vs. 64-bit floating-point processing speed

In this assignment, you will compare the time it takes to run many iterations of for loops using 32-bit floating-point and 64-bit floating-point values.

Start a MATLAB script with the following code:

clear all; close all; clc; format compact; format long
% Store the maximum value of each type of floating-point number
float32max = realmax('single') % 3.4028e+38
float64max = realmax('double') % 1.7977e+308

%% Single loop
tic
for m = 1:10e6
    result32 = single(10);
    for n = 2:2:127
        ii = single(n);
        temp1 = single(result32*(ii+1));
        temp2 = single(temp1/ii);
        result32 = temp2;
    end
end
result32 % echo the answer
singletime = toc

Then create a 2nd, similar loop for 64-bit floating-point code. Copy the 32-bit code and replace the following in each place it appears:

single by double

result32 by result64

Answer

% On one PC, the two loops took about the same amount of time, so there is no large speed advantage to 32-bit floating-point code in this MATLAB loop.

% One possible explanation is that MATLAB performs much of this computation in 64-bit floating-point and then converts the results to 32-bit. Loop and conversion overhead may also outweigh the arithmetic itself.

% In C and assembly language, however, single-precision code is run using single-precision operations. It can then be faster than double-precision code, particularly in vectorized or memory-bound code, because twice as many single-precision values fit in each vector register and cache line.