Matlab R2014b CPU and GPU Matrix Multiply Time Comparison

Matlab R2014b supports GPU processing through the Parallel Computing Toolbox: you can create GPU array objects using the gpuArray(...) function. I wrote a brief script to compare a matrix-vector multiply using a 2048 x 2048 matrix. On the CPU, moving from double to single precision gives a reasonable speedup (~2x). Moving to the GPU, however, results in a speedup of 6.8x for double and 5.6x for single precision! This means that if you can take a double-precision matrix-vector multiply and convert it to a single-precision GPU version, you may see a gain of nearly 14x.
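Results computed on the device stay there until you bring them back. As a minimal sketch of the round trip (gpuArray up, gather back), separate from the timing script below:

```matlab
% Minimal gpuArray round trip (Parallel Computing Toolbox).
% Note: a result like y below lives on the device; use gather to copy
% it back to host memory when you need it on the CPU.
A = gpuArray(single(randn(2048)));    % copy A to the GPU as single
x = gpuArray(single(randn(2048, 1)));
y = A*x;                              % runs on the GPU; y is a gpuArray
yHost = gather(y);                    % copy the result back to the host
```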

The following results were generated in Matlab R2014b on an i7-4770 3.5 GHz CPU (4 cores / 8 threads) with 16 GB of RAM and a GeForce GTX 750.

[Figure: CPU/GPU timing and GPU speedup bar charts]

The next step is to evaluate the speed of gpuArray on a basic L1 optimization set, l1-magic.

The code used to generate this data is as follows:

allTypes = {'Double', 'gpuArrayDouble', 'Single', 'gpuArraySingle'};
allTimes = nan(length(allTypes),1);

n = 2048;       % size of operation for Ax
num_mc = 2^10;  % number monte carlo runs to compute time average of runs

randn('seed', 1982);
Am = randn(n,n);
xm = randn(n,1);      

for ind_type = 1:length(allTypes)
   
    myType = allTypes{ind_type};
     
    switch lower(myType)
        case 'double'
            A = double(Am);
            x = double(xm);
        case 'single'
            A = single(Am);
            x = single(xm);
        case 'gpuarraydouble'
            A = gpuArray(Am);
            x = gpuArray(xm);
        case 'gpuarraysingle'
            A = gpuArray(single(Am));
            x = gpuArray(single(xm));
        otherwise
            error('Unknown type');            
    end    
    
    tic
    for ind_mc = 1:num_mc
        y = A*x;
    end
    if isa(A, 'gpuArray')
        wait(gpuDevice);  % GPU calls are asynchronous; block until all finish
    end
    allTimes(ind_type) = toc/num_mc;
    
end

%% Display the results
figure(34);
clf;
bar(allTimes*1000);
set(gca, 'xticklabel', allTypes, 'color', [1 1 1]*.97);
title(['Timing of CPU and GPU in M' version('-release')]);
xlabel('Type');
ylabel('Time (ms)');
grid on

%%
figure(35);
clf;
speedupLabels = {'Double to Single', ...
                 'Double to gpuDouble', ...
                 'Single to gpuSingle', ...
                 'Double to gpuSingle'};
bar([allTimes(1)/allTimes(3), allTimes(1)/allTimes(2),  ...
    allTimes(3)/allTimes(4), allTimes(1)/allTimes(4)]);
set(gca, 'xticklabelrotation', 15, 'xticklabel', speedupLabels, 'color', [1 1 1]*.97);
title(['Speedup of CPU and GPU in M' version('-release')]);
xlabel('Type');
ylabel('Speedup');
grid on

CUDA Device Info Class

I adapted the NVIDIA CUDA 6.5 Device Query example to encapsulate it in a cleaner class structure. The code is undocumented at this time, but it is fairly straightforward: it presents the parameters for each CUDA device in the system. The CCUDAInfo class is small; it contains the device count and an array of the devices themselves. The CCUDADeviceInfo class contains the bulk of the useful information. Both classes have overloaded ostream << operators and throw an exception if a CUDA call fails. CUDA must be initialized (cuInit) before using the class.

The class can be used as follows:

#include <iostream>
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

#include "CCUDAInfo.h"

int main(int argc, char **argv)
{
    std::cout << "Starting ...\n";

    // Initialize the CUDA driver API before using the class:
    CUresult error_id = cuInit(0);
    if (error_id != CUDA_SUCCESS)
    {
        std::cerr << "cuInit failed. Returned " << error_id << ": "
                  << getCudaDrvErrorString(error_id) << std::endl;
        return EXIT_FAILURE;
    }

    // Load and display the CUDA info class:
    try
    {
        CCUDAInfo cinfo;
        std::cout << cinfo << "\n";
    }
    catch (std::exception &ex)
    {
        std::cerr << "Error: " << ex.what() << "\n";
        return EXIT_FAILURE;
    }

    return 0;
}

With the following output:

Starting ...
CUDA Driver Version: 6.5
Device Count: 1
*** DEVICE 0 ***
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 835000
Compute Mode: 0
CUDA CORES: 384
Cores Per MP: 192
Device ID: 0
ECC Enabled: No
Is Tesla: No
Kernel Timeout Enabled: Yes
L2 Cache Size: 262144
Max Block Dim: 1024, 1024, 64
Max Grid Dim: 2147483647, 65535, 65535
Max 1D Texture Size: 65536
Max 1D Layered Texture Size: 16384, 2048
Max 2D Texture Size: 65536, 65536
Max 2D Layers Texture Size: 16384, 16384, 2048
Max 3D Texture Size: 4096, 4096, 4096
Max Threads Per Block: 1024
Max Threads Per Multiprocessor: 2048
Memory Bus Width: 128
Memory Clock Rate: 2 Ghz
Memory Pitch Bytes: 2147483647
Multiprocessor Count: 2
PCI Bus ID: 1
PCI Device ID: 0
Registers Per Block: 65536
Shared Memory Per Block: 49152
Total Constant Memory Bytes: 65536
Total Global Memory Bytes: 1073741824
Warp Size: 32
Supports Concurrent Kernels: Yes
Supports GPU Overlap: Yes
Supports Integrated GPU Sharing Host Memory: No
Supports Map Host Memory: Yes
Supports Unified Addressing: No
Surface Alignment Required: Yes


The files can be found here.