19.2: Multiprocessing
As noted in Chapter 2, Architecture Overview, most current CPU chips include multiple cores. The CPU cores all have equal access to the main memory resource. Multiprocessing is a form of parallel processing that specifically refers to using multiple cores to perform simultaneous execution of multiple processes.
Focusing on a single large project, a main or initial process might generate subprocesses, referred to as threads (for more information, refer to: http://en.Wikipedia.org/wiki/Thread_(computing)). When the initial process is started and placed in memory by the system loader (for more information, refer to Chapter 5, Tool Chain), the Operating System creates a new process, including the required Operating System data structures, memory allocations, and page/swap file allotments. A thread is often referred to as a light-weight process since it uses the initial process's allocations and address space, making it quicker to create. It is also quicker to terminate since those deallocations are not performed. Since the threads share memory with the initial process and any other threads, there is the potential for very fast communication between simultaneously executing threads. This also adds the requirement that the communications, in the form of memory writes, be carefully coordinated to ensure results are not corrupted. Such a problem is referred to as a race condition (for more information, refer to: http://en.Wikipedia.org/wiki/Race_condition). Race conditions are addressed in the following section.
The threaded approach will not scale to the extent of the distributed approach. CPU chips have a limited number of cores, which places an upper bound on the number of simultaneous computations that can occur; this limit does not apply to distributed computing. However, the limitations regarding network communication speeds that affect distributed computing do not apply to threaded computations.
POSIX Threads
POSIX Threads (for more information, refer to: http://en.Wikipedia.org/wiki/POSIX_Threads), commonly called pThreads, is a widely available thread library on Ubuntu and many other operating systems. The example in this section uses the pThreads thread library Application Programmer Interface (API) for creating and joining threads.
The initial or main process is created during the load process. The main process may create additional threads with the pthread_create() library function. While there is no specific limit to the number of threads that can be created, there is a limit to the number of cores available, which creates a practical limit on the number of threads.
The initial or main process can check whether a thread has completed with the pthread_join() library function. If the thread has not completed, the join function will wait until it does. Ideally, in order to maximize the overall parallel operations, the main process would perform other computations while the thread or threads are executing and only check for thread completion when the other work has been completed.
There are other pThread functions to address mutexes (for more information, refer to: http://en.Wikipedia.org/wiki/Mutual_exclusion), condition variables, and synchronization (for more information, refer to: http://en.Wikipedia.org/wiki/Synchro...puter_science)) between threads, which are not addressed in this very brief overview.
It should be noted that there are other approaches to threading not addressed here.
Race Conditions
A race condition is a generic term referring to a situation where multiple threads write to the same memory location at essentially the same time. Threaded programming is typically done in a high-level language; however, fully understanding the specific cause of a race condition is best done at the assembly language level.
A simple example is provided to purposely create a race condition and examine the problem in detail. This example is not computationally useful.
Assume we wish to compute the following formula MAX times, where MAX is a defined constant. For example;
\[myValue = \left (\dfrac{myValue}{X} \right ) + Y\nonumber\]
We could write a high-level language program something along the lines of:
for (int i=0; i < MAX; i++) myValue = (myValue / X) + Y;
If we wish to speed up this computation, perhaps because MAX is extremely large, we might create two thread functions, each to perform MAX/2 computations. Each of the thread functions would have shared access to the variables myValue, X, and Y. Thus, the code might be;
for (int i=0; i < MAX/2; i++) myValue = (myValue / X) + Y;
This code would be repeated in each of the thread functions. It may not be obvious, but assuming both threads are executing simultaneously, this will cause a race condition on the myValue variable. Specifically, each of the two threads is attempting to update the variable simultaneously, and some of the updates to the variable may be lost.
To further simplify this example, we will assume that X and Y are both set to 1. As such, the result of each calculation would be to increment myValue by 1. If myValue is initialized to 0, repeating the calculation MAX times should result in myValue being set to MAX. This simplification will allow easy verification of the results. It is not computationally useful in any meaningful way.
Implementing this in assembly would first require a main program that creates each of the two threads. Then, each of the thread functions would need to be created. In this very simple example, each thread function would be the same (each performing MAX/2 iterations, assuming MAX is an even number).
Assuming one of the thread functions is named threadFunction0() and given the below pThread thread data structure;
pthreadID0 dd 0, 0, 0, 0, 0
The following code fragment would create and start threadFunction0() executing.
; pthread_create (&pthreadID0, NULL,
;                 &threadFunction0, NULL);
	mov rdi, pthreadID0
	mov rsi, NULL
	mov rdx, threadFunction0
	mov rcx, NULL
	call pthread_create
This would need to be performed for each thread function.
The following code fragment will check if a thread function is completed;
; pthread_join (pthreadID0, NULL);
	mov rdi, qword [pthreadID0]
	mov rsi, NULL
	call pthread_join
If the thread function is not done, the join call will wait until it is completed. The thread function, threadFunction0(), itself might contain the following code;
; -----
global threadFunction0
threadFunction0:

; Perform MAX / 2 iterations to update myValue.

	mov rcx, MAX
	shr rcx, 1		; divide by 2
	mov r10, qword [x]
	mov r11, qword [y]
incLoop0:
;	myValue = (myValue / x) + y
	mov rax, qword [myValue]
	cqo
	div r10
	add rax, r11
	mov qword [myValue], rax
	loop incLoop0

	ret
The code for the second thread function would be similar.
If both threads are simultaneously executing, they are both trying to update the myValue variable. For example, assuming that thread function 0 is executing on core 0 and thread function 1 is executing on core 1 (arbitrarily chosen), the following execution trace is possible;
| Step | Code: Core 0, Thread 0 | Code: Core 1, Thread 1 |
|------|------------------------|------------------------|
| 1 | mov rax, qword [myValue] | |
| 2 | cqo | mov rax, qword [myValue] |
| 3 | div qword [x] | cqo |
| 4 | add rax, qword [y] | div qword [x] |
| 5 | mov qword [myValue], rax | add rax, qword [y] |
| 6 | | mov qword [myValue], rax |
As a reminder, each core has its own set of registers. Thus, the core 0 rax register is different than the core 1 rax register.
If the variable myValue is currently at 730, two thread executions should increase it to 732. On core 0, at step 1, the 730 is copied into the core 0 rax register. On core 1, at step 2, the 730 is copied into the core 1 rax register. As execution progresses on core 0, steps 2 through 4 are performed and the core 0 rax is incremented from 730 to 731. During this time, on core 1, steps 3 through 5 are completed and the 730 in its rax is also incremented to 731. When step 5 is executed on core 0, the 731 is written to the myValue variable. When step 6 is executed on core 1, the value 731 is again written to the myValue variable. The two executions should have incremented the variable twice, to 732. However, since core 1 read the value before core 0 was able to write its result, one of the increments was lost. The end result of this overlap is that the final value of myValue will not be correct. And, since the amount of overlapping execution is not predictable, the degree of incorrectness is not easily predicted and may vary between different executions.
For example, if the value of MAX is fairly small, such as 10,000, there would not likely be any overlap. Each thread would start and complete so quickly, that the chances of overlapping execution would be very small. The problem still exists but would likely appear mostly correct on such executions, making it easy to ignore occasional anomalous output. As MAX is increased, the odds of overlapping execution increase.
Such problems can be very challenging to debug and require an understanding of the low-level operations and the technical architecture.