Matlab R2014b CPU and GPU Matrix Multiply Time Comparison

Matlab R2014b supports GPU processing through the Parallel Computing Toolbox: you can create GPU array objects using the gpuArray(...) function. I wrote a brief script to compare a matrix-vector multiply using a 2048 x 2048 matrix. On the CPU, moving from double to single precision gives a reasonable speedup (~2x). Moving to the GPU, however, results in a speedup of 6.8x for double and 5.6x for single precision! This means that if you can take a double-precision matrix-vector multiply and convert it to a single-precision GPU version, you may see a gain of nearly 14x.
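Results computed on the device stay there until you bring them back. As a minimal sketch of the round trip (gpuArray up, gather back), separate from the timing script below:

```matlab
% Minimal gpuArray round trip (Parallel Computing Toolbox).
% Note: a result like y below lives on the device; use gather to copy
% it back to host memory when you need it on the CPU.
A = gpuArray(single(randn(2048)));    % copy A to the GPU as single
x = gpuArray(single(randn(2048, 1)));
y = A*x;                              % runs on the GPU; y is a gpuArray
yHost = gather(y);                    % copy the result back to the host
```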

The following results were generated in Matlab R2014b on an i7-4770 3.5 GHz CPU (4 cores / 8 threads) with 16 GB of RAM and a GeForce GTX 750.

[Figure: CPU/GPU timing and GPU speedup bar charts]

The next step is to evaluate the speed of gpuArray on a basic L1 optimization set, l1-magic.

The code used to generate this data is as follows:

allTypes = {'Double', 'gpuArrayDouble', 'Single', 'gpuArraySingle'};
allTimes = nan(length(allTypes),1);

n = 2048;       % size of operation for Ax
num_mc = 2^10;  % number monte carlo runs to compute time average of runs

randn('seed', 1982);
Am = randn(n,n);
xm = randn(n,1);      

for ind_type = 1:length(allTypes)
   
    myType = allTypes{ind_type};
     
    switch lower(myType)
        case 'double'
            A = double(Am);
            x = double(xm);
        case 'single'
            A = single(Am);
            x = single(xm);
        case 'gpuarraydouble'
            A = gpuArray(Am);
            x = gpuArray(xm);
        case 'gpuarraysingle'
            A = gpuArray(single(Am));
            x = gpuArray(single(xm));
        otherwise
            error('Unknown type');            
    end    
    
    tic
    for ind_mc = 1:num_mc
        y = A*x;
    end
    if isa(A, 'gpuArray')
        wait(gpuDevice);  % GPU calls are asynchronous; block until all finish
    end
    allTimes(ind_type) = toc/num_mc;
    
end

%% Display the results
figure(34);
clf;
bar(allTimes*1000);
set(gca, 'xticklabel', allTypes, 'color', [1 1 1]*.97);
title(['Timing of CPU and GPU in M' version('-release')]);
xlabel('Type');
ylabel('Time (ms)');
grid on

%%
figure(35);
clf;
speedupLabels = {'Double to Single', ...
                 'Double to gpuDouble', ...
                 'Single to gpuSingle', ...
                 'Double to gpuSingle'};
bar([allTimes(1)/allTimes(3), allTimes(1)/allTimes(2),  ...
    allTimes(3)/allTimes(4), allTimes(1)/allTimes(4)]);
set(gca, 'xticklabelrotation', 15, 'xticklabel', speedupLabels, 'color', [1 1 1]*.97);
title(['Speedup of CPU and GPU in M' version('-release')]);
xlabel('Type');
ylabel('Speedup');
grid on

CUDA Device Info Class

I adapted the NVIDIA CUDA 6.5 Device Query example to encapsulate it in a cleaner class structure. The code is undocumented at this time, but it is fairly straightforward: it presents the parameters for each CUDA device in the system. The CCUDAInfo class is small; it contains the device count and an array of the devices themselves. The CCUDADeviceInfo class contains the bulk of the useful information. Both classes have overloaded ostream << operators and throw an exception if a CUDA call fails. CUDA must be initialized (cuInit) before using the class.

The class can be used as follows:

#include <iostream>
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

#include "CCUDAInfo.h"

int main(int argc, char **argv)
{
    std::cout << "Starting ...\n";

    // Initialize the CUDA driver API before using the class:
    CUresult error_id = cuInit(0);
    if (error_id != CUDA_SUCCESS)
    {
        std::cerr << "cuInit failed. Returned " << error_id << ": "
                  << getCudaDrvErrorString(error_id) << std::endl;
        return EXIT_FAILURE;
    }

    // Load and display the CUDA info class:
    try
    {
        CCUDAInfo cinfo;
        std::cout << cinfo << "\n";
    }
    catch (std::exception &ex)
    {
        std::cerr << "Error: " << ex.what() << "\n";
        return EXIT_FAILURE;
    }

    return 0;
}

With the following output:

Starting ...
CUDA Driver Version: 6.5
Device Count: 1
*** DEVICE 0 ***
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 835000
Compute Mode: 0
CUDA CORES: 384
Cores Per MP: 192
Device ID: 0
ECC Enabled: No
Is Tesla: No
Kernel Timeout Enabled: Yes
L2 Cache Size: 262144
Max Block Dim: 1024, 1024, 64
Max Grid Dim: 2147483647, 65535, 65535
Max 1D Texture Size: 65536
Max 1D Layered Texture Size: 16384, 2048
Max 2D Texture Size: 65536, 65536
Max 2D Layers Texture Size: 16384, 16384, 2048
Max 3D Texture Size: 4096, 4096, 4096
Max Threads Per Block: 1024
Max Threads Per Multiprocessor: 2048
Memory Bus Width: 128
Memory Clock Rate: 2 Ghz
Memory Pitch Bytes: 2147483647
Multiprocessor Count: 2
PCI Bus ID: 1
PCI Device ID: 0
Registers Per Block: 65536
Shared Memory Per Block: 49152
Total Constant Memory Bytes: 65536
Total Global Memory Bytes: 1073741824
Warp Size: 32
Supports Concurrent Kernels: Yes
Supports GPU Overlap: Yes
Supports Integrated GPU Sharing Host Memory: No
Supports Map Host Memory: Yes
Supports Unified Addressing: No
Surface Alignment Required: Yes


The files can be found here.