CUDA Memories

Global Memory
Global memory, implemented with Dynamic Random Access Memory (DRAM), has long access latencies (hundreds of clock cycles) and limited access bandwidth. The host code can read and write global memory.

Automatic array variables are not stored in registers. Instead, they are stored in global memory and incur long access delays and potential access congestion. The scope of these arrays is, as with automatic scalar variables, a single thread: a private version of such an array is created for and used by every thread. Once a thread terminates its execution, the contents of its automatic array variables also cease to exist. Because of their slow access, one should avoid using automatic array variables; in our experience, they are seldom needed in kernel functions and device functions.

A variable whose declaration is preceded only by the keyword “__device__” (each “__’’ consists of two “_’’ characters) is a global variable and will be placed in global memory. Accesses to a global variable are very slow. However, global variables are visible to all threads of all kernels, and their contents persist through the entire execution. Thus, global variables can be used as a means for threads to collaborate across blocks. One must, however, be aware that there is currently no way to synchronize threads from different thread blocks, or to ensure data consistency across threads accessing global memory, other than terminating the current kernel execution. Therefore, global variables are often used to pass information from one kernel invocation to the next.
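As a minimal sketch of the declaration and use described above (the names d_scale and scaleKernel are illustrative, not from any library), a “__device__” variable is declared at file scope, set from the host with cudaMemcpyToSymbol, and is then visible to every thread of every kernel:

```cuda
// Sketch, not a complete program: a __device__ global variable
// visible to all threads of all kernels, persisting across launches.
__device__ float d_scale;              // placed in global memory

__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= d_scale;            // global-memory read: slow, but visible everywhere
}

// Host side: a global variable declared with __device__ is written
// through the symbol API rather than by ordinary assignment.
// float h_scale = 2.0f;
// cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));
```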

Constant Memory
The host code can read and write constant memory. The device has read-only access to constant memory, which provides faster and more parallel data access for kernel execution than global memory. A variable declaration preceded by the keyword “__constant__’’ (each “__’’ consists of two “_’’ characters) declares a constant variable in CUDA. One can also add an optional “__device__” in front of “__constant__” to achieve the same effect. Declarations of constant variables must reside outside any function body. The scope of a constant variable is all grids: all threads in all grids see the same version of a constant variable. The lifetime of a constant variable is the entire application execution. Constant variables are often used for variables that provide input values to kernel functions. Constant variables are stored in global memory but are cached for efficient access; with appropriate access patterns, accessing constant memory is extremely fast and parallel. Currently, the total size of constant variables in an application is limited to 65,536 bytes.
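A common use of constant memory, matching the description above, is a small read-only coefficient table that every thread reads. The sketch below (c_filter, convolve1D, and FILTER_SIZE are illustrative names) declares the table with “__constant__” and fills it from the host with cudaMemcpyToSymbol; because all threads read the same element at the same time, the constant cache serves the accesses efficiently:

```cuda
#define FILTER_SIZE 9
__constant__ float c_filter[FILTER_SIZE];   // read-only on the device, cached

__global__ void convolve1D(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < FILTER_SIZE; ++j) {
            int k = i + j - FILTER_SIZE / 2;
            if (k >= 0 && k < n)
                sum += c_filter[j] * in[k]; // all threads read the same c_filter[j]
        }
        out[i] = sum;
    }
}

// Host side: constant memory cannot be written by device code;
// it is initialized through the symbol API.
// cudaMemcpyToSymbol(c_filter, h_filter, FILTER_SIZE * sizeof(float));
```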

Shared Memory
Shared memory variables can be accessed at very high speed in a highly parallel manner. Shared memory is allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to that block. Shared memory is an efficient means for threads to cooperate by sharing the results of their work. A variable declaration preceded by the keyword “__shared__’’ (each “__’’ consists of two “_’’ characters) declares a shared variable in CUDA. One can also add an optional “__device__” in front of “__shared__” in the declaration to achieve the same effect. Such a declaration must reside within a kernel function or a device function. The scope of a shared variable is a thread block: all threads in a block see the same version of a shared variable, and a private version of the shared variable is created for and used by each thread block during kernel execution. The lifetime of a shared variable is the duration of the kernel; when a kernel terminates its execution, the contents of its shared variables cease to exist. Accessing shared memory is extremely fast and highly parallel. CUDA programmers often use shared memory to hold the portion of global memory data that is heavily used in an execution phase of a kernel. One may need to adjust the algorithms used in order to create execution phases that focus heavily on small portions of the global memory data.
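The staging pattern described above can be sketched as a per-block sum (blockSum and TILE are illustrative names; the sketch assumes the block size equals TILE and is a power of two): each block copies its portion of global memory into a shared array once, synchronizes, and then works entirely out of the fast shared copy:

```cuda
#define TILE 256
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];          // one private copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage global data in shared memory
    __syncthreads();                      // all loads finish before anyone reads

    // Tree reduction entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];        // one partial sum per block
}
```

Note that __syncthreads() synchronizes only the threads of one block, which is consistent with the earlier point that there is no general cross-block synchronization within a kernel.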

Registers
Register variables can be accessed at very high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold frequently accessed variables that are private to each thread. All automatic variables except for arrays declared in kernel and device functions are placed into registers. We will refer to variables that are not arrays as scalar variables. The scopes of these automatic variables are within individual threads. When a kernel function declares an automatic variable, a private copy of that variable is generated for every thread that executes the kernel function. When a thread terminates, all its automatic variables also cease to exist. Note that accessing these variables is extremely fast and parallel but one must be careful not to exceed the limited capacity of the register storage in the hardware implementations.
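As a small illustration of the rule above (saxpy is an illustrative kernel name), the automatic scalars i, xi, and yi below are private to each thread and would typically be held in registers, while an automatic array in the same kernel would instead be placed in the slow global memory space:

```cuda
__global__ void saxpy(float a, const float *x, const float *y,
                      float *z, int n)
{
    // i, xi, yi are automatic scalar variables: every thread executing
    // this kernel gets its own copy, typically held in registers, which
    // ceases to exist when the thread terminates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];
        float yi = y[i];
        z[i] = a * xi + yi;
    }
}
```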

Variable scope
If a variable’s scope is a single thread, a private version of the variable will be created for each and every thread; every thread can only access its own local version of the variable. For example, if a kernel declares a variable whose scope is a thread and it is launched with one million threads, one million versions of the variable will be created so that each thread initializes and uses its own version of the variable.

Variable lifetime
Lifetime specifies the portion of program execution during which the variable is available for use: either within a kernel’s invocation or throughout the entire application. If a variable’s lifetime is within a kernel invocation, it must be declared within the kernel function body and will be available for use only by the kernel’s code. If the kernel is invoked several times, the contents of the variable are not maintained across these invocations; each invocation must initialize the variable in order to use it. On the other hand, if a variable’s lifetime is throughout the entire application, it must be declared outside of any function body. The contents of the variable are maintained throughout the execution of the application and are available to all kernels.
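The two lifetimes can be contrasted in a short sketch (d_total, producer, and consumer are illustrative names): a “__device__” variable declared outside any function retains a value written by one kernel launch so that a later launch can read it, whereas an automatic variable inside a kernel body starts fresh on every invocation:

```cuda
// Application lifetime: declared outside any function body.
__device__ int d_total;

__global__ void producer(void)
{
    int local = 42;                  // kernel-invocation lifetime: re-initialized
                                     // every launch, gone when the thread ends
    if (threadIdx.x == 0 && blockIdx.x == 0)
        d_total = local;             // written in one kernel invocation...
}

__global__ void consumer(int *out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = d_total;              // ...still available in a later invocation
}
```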