## Programming Massively Parallel Processors

#### **CUDA** Memories

#### G80 Implementation of CUDA Memories

#### Each thread can:

- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read/only per-grid constant memory



### CUDA Variable Type Qualifiers

| Variable ded    | Memory           | Scope    | Lifetime |             |
|-----------------|------------------|----------|----------|-------------|
| devicelocal     | int LocalVar;    | local    | thread   | thread      |
| deviceshared    | int SharedVar;   | shared   | block    | block       |
| device          | int GlobalVar;   | global   | grid     | application |
| deviceconstant_ | int ConstantVar; | constant | grid     | application |

- \_\_device\_\_ is optional when used with
   \_local\_\_, \_\_shared\_\_, or \_\_constant\_\_
- Automatic variables without any qualifier reside in a register
  - Except arrays that reside in local memory

#### Where to Declare Variables?



#### Variable Type Restrictions

- Pointers can only point to memory allocated or declared in global memory:
  - Allocated in the host and passed to the kernel:

```
__global__ void KernelFunc(float* ptr)
```

– Obtained as the address of a global variable:

```
float* ptr = &GlobalVar;
```

### A Common Programming Strategy

- Global memory resides in device memory (DRAM)
   much slower access than shared memory
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory to global memory

# A Common Programming Strategy (Cont.)

- Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  - But... cached!
  - Highly efficient access for read-only data
- Carefully divide data according to access patterns
  - R/Only → constant memory (very fast if in cache)
  - R/W shared within Block → shared memory (very fast)
  - R/W within each thread → registers (very fast)
  - R/W inputs/results → global memory (very slow)

For texture memory usage, see NVIDIA document.

#### **GPU Atomic Integer Operations**

#### Atomic operations on integers in global memory:

- Associative operations on signed/unsigned ints:
   atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec, atomicCAS()
- Also some operations on bits

Requires hardware with compute capability 1.1 and above.

## Matrix Multiplication using Shared Memory

## Review: Matrix Multiplication Kernel using Multiple Blocks

```
global void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE WIDTH + threadIdx.x;
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
  Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
Pd[Row*Width+Col] = Pvalue;
```

#### How about performance on G80?

 All threads access global memory for their input matrix elements

- Two memory accesses (8 bytes) per floating point multiply-add
- 4B/s of memory bandwidth/ FLOPS
- 4\*346.5 = 1386 GB/s required to achieve peak FLOP rating
- 86.4 GB/s limits the code at 21.6 GFLOPS
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS



# Idea: Use Shared Memory to reuse global memory data

Each input element is read by Width threads.

 Load each element into Shared Memory and have several threads use the local version to

reduce the memory bandwidth

Tiled algorithms

tx

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2 ECE498AL, University of Illinois, Urbana Champaign

#### Tiled Multiply

 Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of Md and Nd

TILE WIDTH



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 - ECE498AL, University of Illinois, Urbana Champaign

### A Small Example



# Every Md and Nd Element is used exactly twice in generating a 2X2 tile of P

 $P_{0,1}$  $thread_{0.0}$ thread<sub>1,0</sub> thread<sub>0.1</sub>  $M_{0,1} * N_{0.0}$  $M_{1.1} * N_{0.1}$ Access order  $M_{3,1} * N_{0,3}$  $M_{3,0} * N_{1,3}$ 

thread<sub>1.1</sub>

 $M_{1,1} * N_{1,1}$ 

 $M_{2.1} * N_{1.2}$ 

 $M_{3,1} * N_{1.3}$ 

#### Breaking Md and Nd into Tiles



## Each phase of a Thread Block uses one tile from Md and one from Nd

|                  | Phase 1                                                                              |                                               |                                                                                                                 | Phase 2                                             |                                                     | 1                                                                                                               |
|------------------|--------------------------------------------------------------------------------------|-----------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|-----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| T <sub>0,0</sub> | $Md_{0,0}$ $\downarrow$ $Mds_{0,0}$                                                  | $Nd_{0,0}$ $\downarrow$ $Nds_{0,0}$           | PValue <sub>0,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>0,1</sub>  | <b>Md<sub>2,0</sub></b><br>↓<br>Mds <sub>0,0</sub>  | $Nd_{0,2}$ $\downarrow$ $Nds_{0,0}$                 | PValue <sub>0,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>0,1</sub>  |
| T <sub>1,0</sub> | <b>Md</b> <sub>1,0</sub><br>↓<br>Mds <sub>1,0</sub>                                  | <b>Nd</b> <sub>1,0</sub> ↓ Nds <sub>1,0</sub> | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub>  | <b>Md</b> <sub>3,0</sub><br>↓<br>Mds <sub>1,0</sub> | <b>Nd</b> <sub>1,2</sub><br>↓<br>Nds <sub>1,0</sub> | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub>  |
| T <sub>0,1</sub> | $\begin{array}{c} \mathbf{Md_{0,1}} \\ \downarrow \\ \mathbf{Mds_{0,1}} \end{array}$ | $Nd_{0,1}$ $\downarrow$ $Nds_{0,1}$           | PdValue <sub>0,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>0,1</sub> | $Md_{2,1}$ $\downarrow$ $Mds_{0,1}$                 | $Nd_{0,3}$ $\downarrow$ $Nds_{0,1}$                 | PdValue <sub>0,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>0,1</sub> |
| T <sub>1,1</sub> | <b>Md</b> <sub>1,1</sub> ↓ Mds <sub>1,1</sub>                                        | $Nd_{1,1}$ $\downarrow$ $Nds_{1,1}$           | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> | <b>Md</b> <sub>3,1</sub><br>↓<br>Mds <sub>1,1</sub> | <b>Nd</b> <sub>1,3</sub><br>↓<br>Nds <sub>1,1</sub> | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> |

#### First-order Size Considerations in G80

- Each thread block should have many threads
  - TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- There should be many thread blocks
  - A 1024\*1024 Pd gives 64\*64 = 4096 Thread Blocks
- Each thread block perform 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor

# CUDA Code – Kernel Execution Configuration

#### Tiled Matrix Multiplication Kernel

global void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width) shared float Mds[TILE WIDTH][TILE WIDTH]; shared float Nds[TILE WIDTH][TILE WIDTH]; int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; Identify the row and column of the Pd element to work on int Row = by \* TILE WIDTH + ty; int Col = bx \* TILE WIDTH + tx; float Pvalue = 0; Loop over the Md and Nd tiles required to compute the Pd element for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre> Coolaborative loading of Md and Nd tiles into shared memory Mds[ty][tx] = Md[Row\*Width + (m\*TILE WIDTH + tx)]; Nds[ty][tx] = Nd[Col + (m\*TILE WIDTH + ty)\*Width]; syncthreads(); for (int k = 0; k < TILE WIDTH; ++k) 11. Pvalue += Mds[ty][k] \* Nds[k][tx]; Synchthreads(); Pd[Row\*Width+Col] = Pvalue; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

ECE498AL, University of Illinois, Urbana Champaign

#### Tiled Multiply

by

m

TILE WIDTH

- Each block computes one square sub-matrix Pd<sub>sub</sub> of size
- Each thread computes one element of Pd<sub>sub</sub>

TILE WIDTH



m

012 TILE WIDTH-1

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 CECE498AL, University of Illinois, Urbana Champaign

#### G80 Shared Memory and Threading

- Each SM in G80 has 16KB shared memory
  - SM size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time
- Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 86.4B/s bandwidth can now support (86.4/4)\*16 = 347.6 GFLOPS!

### Tiling Size Effects



# Summary- Typical Structure of a CUDA Program

```
Global variables declaration
      host
    __device__... __global__, __constant__, __texture__
Function prototypes
   __global__ void kernelOne(...)
 float handyFunction(...)
Main ()

    allocate memory space on the device – cudaMalloc(&d GlbIVarPtr, bytes)

   transfer data from host to device – cudaMemCpy(d_GlblVarPtr, h_Gl...)
   execution configuration setup
   kernel call – kernelOne<<execution configuration>>>( args...);
                                                                               repeat
   transfer results from device to host – cudaMemCpy(h GlblVarPtr,...)
                                                                               as
    optional: compare against golden (host computed) solution
                                                                               needed
Kernel – void kernelOne(type args,...)
 variables declaration - __local__, __shared__

    automatic variables transparently assigned to registers or local memory

 syncthreads()...
Other functions
   float handyFunction(int inVar...);
```

#### **ECE 498AL**

#### Programming Massively Parallel Processors

# Lecture 6: CUDA Memories Part 2

#### Tiled Multiply

 Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of Md and Nd



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE498AL, University of Illinois, Urbana Champaign

### A Small Example



### Every Md and Nd Element is used exactly twice in generating a 2X2 tile of P

Access order

| P <sub>0,0</sub>                               | P <sub>1,0</sub>                               | P <sub>0,1</sub>                    | P <sub>1,1</sub>                    |
|------------------------------------------------|------------------------------------------------|-------------------------------------|-------------------------------------|
| thread <sub>0,0</sub>                          | thread <sub>1,0</sub>                          | thread <sub>0,1</sub>               | thread <sub>1,1</sub>               |
| M <sub>0,0</sub> * N <sub>0,0</sub>            | M <sub>0,0</sub> * N <sub>1</sub>              | M <sub>0,1</sub> * N <sub>0,0</sub> | M <sub>0,1</sub> * N <sub>1</sub>   |
| M <sub>1</sub> <sub>0</sub> * N <sub>0,1</sub> | M <sub>1</sub> <sub>0</sub> * N <sub>1,1</sub> | M <sub>1,1</sub> * N <sub>0,1</sub> | M <sub>1,1</sub> * N <sub>1,1</sub> |
| M <sub>2,0</sub> * N <sub>0,2</sub>            | M <sub>2,0</sub> * N <sub>1,2</sub>            | M <sub>2,1</sub> * N <sub>0,2</sub> | M <sub>2,1</sub> * N <sub>1,2</sub> |
| M <sub>3,0</sub> * N <sub>0,3</sub>            | M <sub>3,0</sub> * N <sub>1,3</sub>            | M <sub>3,1</sub> * N <sub>0,3</sub> | M <sub>3,1</sub> * N <sub>1,3</sub> |

#### Breaking Md and Nd into Tiles

- Break up the inner product loop of each thread into phases
- At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
- Everyone access the Md and Nd elements from the shared memory during the phase



## Each phase of a Thread Block uses one tile from Md and one from Nd

|                  | Phase 1                                             |                                                                    |                                                                                                                 | Phase 2                                                   |                                                     |                                                                                                                 |
|------------------|-----------------------------------------------------|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| T <sub>0,0</sub> | $Md_{0,0}$ $\downarrow$ $Mds_{0,0}$                 | $Nd_{0,0}$ $\downarrow$ $Nds_{0,0}$                                | PValue <sub>0,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>0,1</sub>  | <b>Md<sub>2,0</sub></b><br>↓<br>Mds <sub>0,0</sub>        | $Nd_{0,2}$ $\downarrow$ $Nds_{0,0}$                 | PValue <sub>0,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>0,1</sub>  |
| T <sub>1,0</sub> | <b>Md</b> <sub>1,0</sub><br>↓<br>Mds <sub>1,0</sub> | $Nd_{1,0}$ $\downarrow$ $Nds_{1,0}$                                | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub>  | <b>Md</b> <sub>3,0</sub><br>↓<br>Mds <sub>1,0</sub>       | <b>Nd</b> <sub>1,2</sub><br>↓<br>Nds <sub>1,0</sub> | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub>  |
| T <sub>0,1</sub> | $Md_{0,1}$ $\downarrow$ $Mds_{0,1}$                 | $Nd_{0,1}$ $\downarrow$ $Nds_{0,1}$                                | PdValue <sub>0,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>0,1</sub> | Md <sub>2,1</sub><br>↓<br>Mds <sub>0</sub> , <sub>1</sub> | $Nd_{0,3}$ $\downarrow$ $Nds_{0,1}$                 | PdValue <sub>0,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>0,1</sub> |
| T <sub>1,1</sub> | $Md_{1,1}$ $\downarrow$ $Mds_{1,1}$                 | $\begin{array}{c} Nd_{1,1} \\ \downarrow \\ Nds_{1,1} \end{array}$ | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> | Md <sub>3,1</sub><br>↓<br>Mds <sub>1,1</sub>              | Nd <sub>1,3</sub><br>↓<br>Nds <sub>1,1</sub>        | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> |

#### Tiled Matrix Multiplication Kernel

global void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width) shared float Mds[TILE WIDTH][TILE WIDTH]; shared float Nds[TILE WIDTH][TILE WIDTH]; int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; Identify the row and column of the Pd element to work on int Row = by \* TILE WIDTH + ty; int Col = bx \* TILE WIDTH + tx; float Pvalue = 0; Loop over the Md and Nd tiles required to compute the Pd element for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre> Coolaborative loading of Md and Nd tiles into shared memory Mds[ty][tx] = Md[Row\*Width + (m\*TILE WIDTH + tx)]; Nds[ty][tx] = Nd[Col + (m\*TILE WIDTH + ty)\*Width]; syncthreads(); for (int k = 0; k < TILE WIDTH; ++k) 11. Pvalue += Mds[ty][k] \* Nds[k][tx]; Synchthreads(); Pd[Row\*Width+Col] = Pvalue; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

ECE498AL, University of Illinois, Urbana Champaign

# CUDA Code – Kernel Execution Configuration

#### First-order Size Considerations in G80

- Each thread block should have many threads
  - TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- There should be many thread blocks
  - A 1024\*1024 Pd gives 64\*64 = 4096 Thread Blocks
  - TILE\_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)
- Each thread block perform 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor

#### Tiled Multiply

by

m

TILE WIDTH

- Each block computes one square sub-matrix Pd<sub>sub</sub> of size
- Each thread computes one element of Pd<sub>sub</sub>

TILE WIDTH



m

012 TILE WIDTH-1

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 CECE498AL, University of Illinois, Urbana Champaign

#### G80 Shared Memory and Threading

- Each SM in G80 has 16KB shared memory
  - SM size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - The shared memory can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
    - The threading model limits the number of thread blocks to 3 so shared memory is not the limiting factor here
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time
- Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 86.4B/s bandwidth can now support (86.4/4)\*16 = 347.6 GFLOPS!

#### Tiled Matrix Multiplication Kernel

global void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width) shared float Mds[TILE WIDTH][TILE WIDTH]; shared float Nds[TILE WIDTH][TILE WIDTH]; int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; Identify the row and column of the Pd element to work on int Row = by \* TILE WIDTH + ty; int Col = bx \* TILE WIDTH + tx; float Pvalue = 0; Loop over the Md and Nd tiles required to compute the Pd element for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre> Coolaborative loading of Md and Nd tiles into shared memory Mds[ty][tx] = Md[Row\*Width + (m\*TILE WIDTH + tx)]; Nds[ty][tx] = Nd[Col + (m\*TILE WIDTH + ty)\*Width]; syncthreads(); for (int k = 0; k < TILE WIDTH; ++k) 11. Pvalue += Mds[ty][k] \* Nds[k][tx]; Synchthreads(); Pd[Row\*Width+Col] = Pvalue; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

ECE498AL, University of Illinois, Urbana Champaign

### Tiling Size Effects



# Summary- Typical Structure of a CUDA Program

```
Global variables declaration
      host
    __device__... __global__, __constant__, __texture__
Function prototypes
   __global__ void kernelOne(...)
 float handyFunction(...)
Main ()

    allocate memory space on the device – cudaMalloc(&d GlbIVarPtr, bytes)

   transfer data from host to device – cudaMemCpy(d_GlblVarPtr, h_Gl...)
   execution configuration setup
   kernel call – kernelOne<<execution configuration>>>( args...);
                                                                               repeat
   transfer results from device to host – cudaMemCpy(h GlblVarPtr,...)
                                                                               as
    optional: compare against golden (host computed) solution
                                                                               needed
Kernel – void kernelOne(type args,...)
 variables declaration - __local__, __shared__

    automatic variables transparently assigned to registers or local memory

 syncthreads()...
Other functions
   float handyFunction(int inVar...);
```