CPU vs (GP)GPU: the difference

Shortest possible explanation

Each GPU processor core has dedicated registers file and ALU. But program counter (PC) or instruction pointer (IP) is shared among all them.

More info

This is why you read this in CUDA manuals:

A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

...

If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

( src )

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

( src )

Before taking a branch, the compiler executes a special instruction to push this active-thread bit vector onto a stack. The code is then executed twice, once for threads for which the condition was TRUE, then for threads for which the predicate was FALSE. This two-phased execution is managed with a branch synchronization stack, as described by Lindholm et al.15

    If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

( src )

This is what is also called Single instruction, multiple threads (SIMT).

History

There was a time when CPU had two parts: control path and data path. For example, a DEC CPU from 1980s:

The J-11 is a microprocessor chip set that implements the PDP-11 instruction set architecture (ISA) jointly developed by Digital Equipment Corporation and Intersil. It was a high-end chip set designed to integrate the performance and features of the PDP-11/70 onto a handful of chips. It was used in the PDP-11/73, PDP-11/83 and Professional 380.

It consisted of a data path chip[1] and a control chip[2] in ceramic leadless packages mounted on a single ceramic hybrid DIP package. The control chip incorporated a control sequencer and a microcode ROM.

( Wikipedia )

Data path had ALU and registers.

Control path - fetched instructions from RAM/ROM, decoded, controlled everything else. It had PC/IP.

(GP)GPU is like having many data path blocks, each with dedicated ALUs and register files plus one control block with PC/IP, which controls them all synchronously. All ALUs works in sync.

(GP)GPU can't execute two threads with different PC/IP, physically. But can execute threads with different data.

(the post first published at 20221107.)

List of my other blog posts.

Subscribe to my news feed

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.