OpenCL Example: Bandwidth Test

Results

The NVIDIA CUDA bandwidth example discussed before has an OpenCL equivalent available here (the OpenCL examples had previously been removed from the CUDA SDK, much to some people's chagrin). A basic comparison was made between the OpenCL bandwidth test downloaded 12/29/2015 and the CUDA 7.5 bandwidth example provided with the CUDA SDK. Interestingly, the OpenCL test runs in PAGEABLE mode by default while the CUDA example defaults to PINNED mode, which makes it look as though moving from OpenCL to CUDA doubles the transfer speed. However, the OpenCL bandwidth example also has a PINNED memory mode, implemented through mapped buffer transfers, and with both tests in PINNED mode we find that OpenCL and CUDA perform comparably.

[Figure: CUDA vs. OpenCL bandwidth comparison]

Operation

The OpenCL example runs in either PINNED or PAGEABLE mode. PAGEABLE mode uses standard clEnqueueWriteBuffer calls on ordinary host memory, while PINNED mode first obtains a host pointer into a pinned buffer with clEnqueueMapBuffer and then issues the same clEnqueueWriteBuffer transfers from that pointer, which is faster.

    if(memMode == PINNED)
    {
        // Get a mapped pointer to the pinned host buffer
        h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);
        oclCheckError(ciErrNum, CL_SUCCESS);
    }

    // DIRECT: API access to device buffer
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_FALSE, 0, memSize, h_data, 0, NULL, NULL);
        oclCheckError(ciErrNum, CL_SUCCESS);
    }
    ciErrNum = clFinish(cqCommandQueue);
    oclCheckError(ciErrNum, CL_SUCCESS);

    // get the elapsed time in seconds
    elapsedTimeInSec = shrDeltaT(0);

    // calculate bandwidth in MB/s
    bandwidthInMBs = ((double)memSize * (double)MEMCOPY_ITERATIONS) / (elapsedTimeInSec * (double)(1 << 20));

    // clean up memory
    if(cmDevData) clReleaseMemObject(cmDevData);
    if(cmPinnedData)
    {
        clEnqueueUnmapMemObject(cqCommandQueue, cmPinnedData, (void*)h_data, 0, NULL, NULL);
        clReleaseMemObject(cmPinnedData);
    }
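
For context, the pinned buffer (cmPinnedData) used above is created up front with the CL_MEM_ALLOC_HOST_PTR flag, which asks the driver to back the buffer with page-locked host memory. A minimal sketch of that allocation, with names following the sample:

    // allocate a page-locked host staging buffer for PINNED mode
    cmPinnedData = clCreateBuffer(cxGPUContext,
                                  CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                  memSize, NULL, &ciErrNum);
    oclCheckError(ciErrNum, CL_SUCCESS);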

OpenCL Example Run

Running on...

GeForce GTX 750

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5125.9

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5181.6

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     64039.2

[oclBandwidthTest.exe] test results...
PASSED

CUDA Example Run

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 750
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11888.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11948.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     68545.4

Result = PASS

CUDA Device Info Class

I adapted the NVIDIA CUDA 6.5 Device Query example to encapsulate it in a cleaner class structure. The code is undocumented at this time, but it is fairly straightforward: it presents the parameters for each CUDA device in the system. The CCUDAInfo class is small; it contains the count of the devices and an array of the devices themselves. The CCUDADeviceInfo class contains the bulk of the useful information. Both classes have overloaded ostream << operators and throw an exception if CUDA fails. CUDA must be initialized before using the class.

The class can be used as follows:

#include <cstdlib>
#include <iostream>
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

#include "CCUDAInfo.h"

int main(int argc, char **argv)
{
    std::cout << "Starting ... \n";

    // Init CUDA for the application:
    CUresult error_id = cuInit(0);

    if (error_id != CUDA_SUCCESS)
    {
        std::cerr << "cuInit failed. Returned " << error_id << ": " << getCudaDrvErrorString(error_id) << std::endl;
        std::cout << "Result = FAIL\n";
        return EXIT_FAILURE;
    }

    // Load and display the CUDA Info Class:
    try
    {
        CCUDAInfo cinfo;
        std::cout << cinfo << "\n";
    }
    catch (std::exception &ex)
    {
        std::cout << "Error: " << ex.what() << "\n";
    }

    return 0;
}

With the following output:

Starting ...
CUDA Driver Version: 6.5
Device Count: 1
*** DEVICE 0 ***
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 835000
Compute Mode: 0
CUDA CORES: 384
Cores Per MP: 192
Device ID: 0
ECC Enabled: No
Is Tesla: No
Kernel Timeout Enabled: Yes
L2 Cache Size: 262144
Max Block Dim: 1024, 1024, 64
Max Grid Dim: 2147483647, 65535, 65535
Max 1D Texture Size: 65536
Max 1D Layered Texture Size: 16384, 2048
Max 2D Texture Size: 65536, 65536
Max 2D Layers Texture Size: 16384, 16384, 2048
Max 3D Texture Size: 4096, 4096, 4096
Max Threads Per Block: 1024
Max Threads Per Multiprocessor: 2048
Memory Bus Width: 128
Memory Clock Rate: 2 Ghz
Memory Pitch Bytes: 2147483647
Multiprocessor Count: 2
PCI Bus ID: 1
PCI Device ID: 0
Registers Per Block: 65536
Shared Memory Per Block: 49152
Total Constant Memory Bytes: 65536
Total Global Memory Bytes: 1073741824
Warp Size: 32
Supports Concurrent Kernels: Yes
Supports GPU Overlap: Yes
Supports Integrated GPU Sharing Host Memory: No
Supports Map Host Memory: Yes
Supports Unified Addressing: No
Surface Alignment Required: Yes

The files can be found here.

CUDA Example: Device Query

Example Path: %NVCUDASAMPLES_ROOT%\1_Utilities\deviceQueryDrv

The NVIDIA CUDA Device Query example shows how to discover GPGPUs on the host and how to query their capabilities.

The basic execution looks like the following for a Geforce GT650M card in an HP Pavilion dv6 Laptop:

deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Clock rate:                                835 MHz (0.83 GHz)
  Memory Clock rate:                             2000 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536),
3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS

The example first discovers the number of devices using cuDeviceGetCount(..) and then iterates over the devices (see the enumeration sketch below), calling a host of capability discovery functions such as:

  1. cuDriverGetVersion(…)
  2. cuDeviceTotalMem(…)
  3. getCudaAttribute(…)
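
A minimal sketch of the enumeration step using the driver API, assuming the sample's checkCudaErrors-style error handling:

    int deviceCount = 0;
    checkCudaErrors(cuDeviceGetCount(&deviceCount));

    for (int i = 0; i < deviceCount; ++i)
    {
        CUdevice dev;
        char name[256];
        checkCudaErrors(cuDeviceGet(&dev, i));
        checkCudaErrors(cuDeviceGetName(name, sizeof(name), dev));
        printf("Device %d: \"%s\"\n", i, name);
    }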

The bulk of the attributes are retrieved with getCudaAttribute, using an enumerated selector to return the right value, as in:

int asyncEngineCount;
getCudaAttribute<int>(&asyncEngineCount, CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
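
getCudaAttribute itself is a thin template wrapper over the driver API call cuDeviceGetAttribute. A minimal sketch of such a wrapper, with error handling simplified relative to the sample:

    template <class T>
    inline void getCudaAttribute(T *attribute, CUdevice_attribute device_attribute, int device)
    {
        // forward to the driver API and bail out on failure
        CUresult error = cuDeviceGetAttribute(attribute, device_attribute, device);

        if (error != CUDA_SUCCESS)
        {
            fprintf(stderr, "cuDeviceGetAttribute returned %d for attribute %d\n",
                    (int)error, (int)device_attribute);
            exit(EXIT_FAILURE);
        }
    }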

And if there are two or more devices, the example checks whether peer-to-peer (RDMA) access is enabled between them.
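
The check itself boils down to a driver API query per device pair. A minimal sketch, assuming device handles obtained via cuDeviceGet:

    // Can device 0 directly access device 1's memory, and vice versa?
    CUdevice dev0, dev1;
    checkCudaErrors(cuDeviceGet(&dev0, 0));
    checkCudaErrors(cuDeviceGet(&dev1, 1));

    int canAccessPeer01 = 0, canAccessPeer10 = 0;
    checkCudaErrors(cuDeviceCanAccessPeer(&canAccessPeer01, dev0, dev1));
    checkCudaErrors(cuDeviceCanAccessPeer(&canAccessPeer10, dev1, dev0));
    printf("Peer access 0->1: %s, 1->0: %s\n",
           canAccessPeer01 ? "Yes" : "No", canAccessPeer10 ? "Yes" : "No");
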
CUDA Example: Bandwidth Test

Example Path: %NVCUDASAMPLES_ROOT%\1_Utilities\bandwidthTest

The NVIDIA CUDA Example Bandwidth test is a utility for measuring the memory bandwidth between the CPU and GPU and between addresses in the GPU.

The basic execution looks like the following:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 650M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6286.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6315.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     44976.7

Result = PASS

This was run on an NVIDIA GeForce GT 650M 1 GB mobile graphics card in a laptop. The host/device memory transfers run at about 6 GB/s, while transfers inside the card occur at ~50 GB/s. The host-to-device and device-to-host rates are often symmetric, although I have seen systems exhibit asymmetric transfers, such as on a PC/104 architecture. More bandwidth can also be achieved by using more PCIe lanes; that is, it matters whether you are running at x4, x8, or x16, and some motherboards have x16 physical connectors that are only x8 electrical. Pinned memory is faster than simple pageable memory, in this case by about 50%, as pageable manages only ~4 GB/s on host<->device transfers. There are a number of options you can pass; to see them, run the executable with the --help flag.
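
For example, to sweep pinned host-to-device transfers over a range of sizes (the flag names below are those reported by --help in the version I used; double-check them against your SDK):

    bandwidthTest.exe --memory=pinned --htod --mode=range --start=1024 --end=33554432 --increment=1048576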

In the code there is a fair amount of command-line parsing and a generic testBandwidthRange, but the real work is in testDeviceToHostTransfer, testHostToDeviceTransfer, and testDeviceToDeviceTransfer. In each case, the host memory is allocated using malloc(memSize) for pageable memory or cudaHostAlloc(…) for pinned memory (for fast transfer). The desired transfer is then performed MEMCOPY_ITERATIONS times in a row and the total time is recorded with events in the stream. If the memory is pinned it uses asynchronous memory transfers, and standard memory transfers otherwise:

    checkCudaErrors(cudaEventRecord(start, 0));

    //copy host memory to device memory
    if (PINNED == memMode)
    {
        for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
        {
            checkCudaErrors(cudaMemcpyAsync(d_idata, h_odata, memSize,
                                            cudaMemcpyHostToDevice, 0));
        }
    }
    else
    {
        for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
        {
            checkCudaErrors(cudaMemcpy(d_idata, h_odata, memSize,
                                       cudaMemcpyHostToDevice));
        }
    }

    checkCudaErrors(cudaEventRecord(stop, 0));
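
After the stop event is recorded, the elapsed time is read back and converted to MB/s. A minimal sketch of that step, mirroring the formula used in the OpenCL example above:

    // wait for the stop event, then compute bandwidth in MB/s
    float elapsedTimeInMs = 0.0f;
    checkCudaErrors(cudaEventSynchronize(stop));
    checkCudaErrors(cudaEventElapsedTime(&elapsedTimeInMs, start, stop));

    double bandwidthInMBs = ((double)memSize * (double)MEMCOPY_ITERATIONS) /
                            ((elapsedTimeInMs / 1000.0) * (double)(1 << 20));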

The total bandwidth measured depends on the size of the transfer. Each copy carries a fixed amount of overhead that drags the effective bandwidth down when you perform many small transfers instead of a few large ones. The effect largely disappears for transfers larger than 256 kB, though.
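
A simple latency-plus-throughput model captures this: each copy pays a roughly fixed setup cost before streaming at the peak rate, so the effective bandwidth is size / (overhead + size / peak). A toy illustration with hypothetical numbers (~10 us overhead, 6 GB/s peak):

    #include <cstdio>

    int main()
    {
        const double overheadSec = 10e-6;      // hypothetical fixed cost per copy
        const double peakBytesPerSec = 6e9;    // hypothetical peak PCIe rate

        for (long long size = 1 << 10; size <= 1 << 25; size <<= 2)
        {
            double seconds = overheadSec + (double)size / peakBytesPerSec;
            printf("%10lld bytes -> %8.1f MB/s effective\n",
                   size, (double)size / seconds / (double)(1 << 20));
        }
        return 0;
    }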

 

[Figure: measured bandwidth vs. transfer size on the GeForce GT 650M]

CUDA Example: Stream Callbacks

The NVIDIA CUDA example simpleCallback shows how to use the CUDA function cudaStreamAddCallback to fire a callback once CUDA has processed the stream up to the point where the callback was added. This can be used to launch kernels asynchronously and be notified of their completion, or to provide status updates on processing.

The key function in the example is

checkCudaErrors(cudaStreamAddCallback(workload->stream, myStreamCallback, workload, 0));

That call injects the callback into the stream. The example creates several workers that execute asynchronously, but instead of using cudaDeviceSynchronize to block the CPU until all the kernels complete, it uses a more subtle barrier synchronization and waits on that.
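
A minimal, self-contained sketch of the pattern, using a hypothetical onDone callback on a single stream (the real example manages one stream per worker):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Runs on a CUDA-managed thread once all work queued on the stream
    // before cudaStreamAddCallback has completed. Note that CUDA API
    // calls are not allowed from inside a stream callback.
    static void CUDART_CB onDone(cudaStream_t stream, cudaError_t status, void *userData)
    {
        printf("Stream work finished with status %d\n", (int)status);
    }

    int main()
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // ... enqueue kernels and async copies on `stream` here ...

        cudaStreamAddCallback(stream, onDone, NULL, 0);

        // Block until everything in the stream, including the callback, has run.
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        return 0;
    }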

A Stack Overflow response from user jet47 concisely shows how to use callbacks with classes by way of a static member function:

class MyClass
{
public:
    static void CUDART_CB Callback(cudaStream_t stream, cudaError_t status, void *userData);

private:
    void callbackFunc();
};

void CUDART_CB MyClass::Callback(cudaStream_t stream, cudaError_t status, void *userData)
{
    // Recover the object passed as userData and forward to the member function.
    MyClass* thiz = (MyClass*) userData;
    thiz->callbackFunc();
}

void MyClass::callbackFunc()
{
    // implementation here
}

// obj must remain alive until the callback fires; CUDA stores only the raw pointer.
MyClass* obj = new MyClass;
cudaStreamAddCallback(GpuStream, MyClass::Callback, obj, 0);