New Graphics in MATLAB 2014b

MathWorks MATLAB R2014b introduced a new graphics system. The charts below compare the old R2013a plots with the new R2014b plots, which feature:

  1. A new color scheme
  2. Anti-aliasing on lines and fonts
  3. A more aesthetically pleasing grid
  4. Better default font size and weight on titles, axis labels, and tick labels

Overall, this is a substantial improvement over the old graphical system; the new figures are tasteful and modern.

 

[Figures: side-by-side comparisons of plot, imagesc, bar, surf, quiver, pie, and hist output in R2014b versus R2013a]

CUDA Device Info Class

I adapted the NVIDIA CUDA 6.5 Device Query example to encapsulate it in a cleaner class structure. The code is undocumented at this time, but it is fairly straightforward: it presents the parameters for each CUDA device in the system. The CCUDAInfo class is small; it contains the count of the devices and an array of the devices themselves. The CCUDADeviceInfo class contains the bulk of the useful information. Both classes have overloaded ostream << operators and throw an exception if CUDA fails. CUDA must be initialized before using the class.
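
As a rough outline of the interface just described (the member and method names here are illustrative assumptions, not copied from the actual header), the two classes look something like this:

#include <iostream>
#include <vector>

// Illustrative sketch only -- the names below are assumptions, not the real header.
class CCUDADeviceInfo
{
public:
    explicit CCUDADeviceInfo(int deviceOrdinal);    // queries the device; throws on CUDA errors

    // The queried parameters (compute capability, memory sizes, limits, ...)
    // are stored as members and printed by the stream operator.
    friend std::ostream &operator<<(std::ostream &os, const CCUDADeviceInfo &info);
};

class CCUDAInfo
{
public:
    CCUDAInfo();                                    // enumerates all devices; throws on CUDA errors

    int DeviceCount() const;                        // number of CUDA devices found

    friend std::ostream &operator<<(std::ostream &os, const CCUDAInfo &info);

private:
    std::vector<CCUDADeviceInfo> m_devices;         // one entry per device
};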

The class can be used as follows:

#include <iostream>
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

#include "CCUDAInfo.h"

int main(int argc, char **argv)
{
    std::cout << "Starting ... \n";

    // Init CUDA for the application:
    CUresult error_id = cuInit(0);

    if (error_id != CUDA_SUCCESS)
    {
        std::cerr << "cuInit Failed. Returned " << error_id << ": "
                  << getCudaDrvErrorString(error_id) << std::endl;
        std::cout << "Result = FAIL\n";
        exit(EXIT_FAILURE);
    }

    // Load and display the CUDA Info class:
    try
    {
        CCUDAInfo cinfo;
        std::cout << cinfo << "\n";
    }
    catch (std::exception &ex)
    {
        std::cout << "Error: " << ex.what() << "\n";
    }

    return 0;
}

With the following output:

Starting ...
CUDA Driver Version: 6.5
Device Count: 1
*** DEVICE 0 ***
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 835000
Compute Mode: 0
CUDA CORES: 384
Cores Per MP: 192
Device ID: 0
ECC Enabled: No
Is Tesla: No
Kernel Timeout Enabled: Yes
L2 Cache Size: 262144
Max Block Dim: 1024, 1024, 64
Max Grid Dim: 2147483647, 65535, 65535
Max 1D Texture Size: 65536
Max 1D Layered Texture Size: 16384, 2048
Max 2D Texture Size: 65536, 65536
Max 2D Layers Texture Size: 16384, 16384, 2048
Max 3D Texture Size: 4096, 4096, 4096
Max Threads Per Block: 1024
Max Threads Per Multiprocessor: 2048
Memory Bus Width: 128
Memory Clock Rate: 2 Ghz
Memory Pitch Bytes: 2147483647
Multiprocessor Count: 2
PCI Bus ID: 1
PCI Device ID: 0
Registers Per Block: 65536
Shared Memory Per Block: 49152
Total Constant Memory Bytes: 65536
Total Global Memory Bytes: 1073741824
Warp Size: 32
Supports Concurrent Kernels: Yes
Supports GPU Overlap: Yes
Supports Integrated GPU Sharing Host Memory: No
Supports Map Host Memory: Yes
Supports Unified Addressing: No
Surface Alignment Required: Yes

 

The files can be found here.

CUDA Example: Device Query

Example Path: %NVCUDASAMPLES_ROOT%\1_Utilities\deviceQueryDrv

The NVIDIA CUDA Device Query example shows how to discover the GPGPUs present on the host and how to query their capabilities.

The basic execution looks like the following for a GeForce GT 650M card in an HP Pavilion dv6 laptop:

deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Clock rate:                                835 MHz (0.83 GHz)
  Memory Clock rate:                             2000 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS

The example first discovers the number of devices using cuDeviceGetCount(..) and then, for each device, calls a host of capability-discovery functions such as:

  1. cuDriverGetVersion(…)
  2. cuDeviceTotalMem(…)
  3. getCudaAttribute(…)

The bulk of the attributes are retrieved with getCudaAttribute, which uses an enumerated selector to return the right value, as in:

int asyncEngineCount;
getCudaAttribute<int>(&asyncEngineCount, CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);

Finally, if there are two or more devices, it checks whether peer-to-peer (RDMA) access is enabled between them.
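
Putting the enumeration step together, a minimal sketch using the driver API directly might look like the following. This is not the sample's code: error checking is omitted, and cuDeviceGetAttribute is used in place of the getCudaAttribute helper (which is essentially a checked wrapper around it). It assumes cuInit(0) has already succeeded.

#include <cstdio>
#include <cuda.h>

void listDevices()
{
    int deviceCount = 0;
    cuDeviceGetCount(&deviceCount);

    for (int i = 0; i < deviceCount; ++i)
    {
        CUdevice dev;
        cuDeviceGet(&dev, i);

        char name[256];
        cuDeviceGetName(name, sizeof(name), dev);

        size_t totalMem = 0;
        cuDeviceTotalMem(&totalMem, dev);

        int smCount = 0;
        cuDeviceGetAttribute(&smCount, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

        printf("Device %d: %s, %zu bytes of global memory, %d multiprocessors\n",
               i, name, totalMem, smCount);
    }
}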

 

CUDA Example: Bandwidth Test

Example Path: %NVCUDASAMPLES_ROOT%\1_Utilities\bandwidthTest

The NVIDIA CUDA Bandwidth Test example is a utility for measuring memory bandwidth between the CPU and GPU and between addresses within the GPU.

The basic execution looks like the following:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 650M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6286.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6315.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     44976.7

Result = PASS

This was run on an NVIDIA GeForce GT 650M 1GB mobile graphics card in a laptop. The host-to-device and device-to-host transfers run at about 6 GB/s, while transfers inside the card occur at roughly 45 GB/s. The host-to-device and device-to-host rates are often symmetric; however, I have seen systems exhibit non-symmetric transfers, such as on a PC104 architecture. More bandwidth can also be achieved by using more PCIe lanes, that is, whether the card sits in a PCIe x4, x8, or x16 slot, and some motherboards currently have x16 physical connectors that are only x8 electrically. Pinned memory is faster than simple pageable memory, in this case about 50% faster, since pageable memory only reaches about 4 GB/s on host-to-device transfers. There are a number of options you can pass; to see them, run the executable with the --help flag.

In the code there is a fair amount of command-line parsing and a generic testBandwidthRange, but the real work is in testDeviceToHostTransfer, testHostToDeviceTransfer, and testDeviceToDeviceTransfer. In any case, the host memory is allocated using malloc(memSize) for pageable memory or cudaHostAlloc(…) for pinned memory (for fast transfer). The desired transfer is then done 10 times in a row and the total time is recorded with an event in the stream. If the memory is pinned, it uses asynchronous memory transfers, and standard memory transfers otherwise:

checkCudaErrors(cudaEventRecord(start, 0));

// Copy host memory to device memory:
if (PINNED == memMode)
{
    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        checkCudaErrors(cudaMemcpyAsync(d_idata, h_odata, memSize,
                                        cudaMemcpyHostToDevice, 0));
    }
}
else
{
    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        checkCudaErrors(cudaMemcpy(d_idata, h_odata, memSize,
                                   cudaMemcpyHostToDevice));
    }
}

checkCudaErrors(cudaEventRecord(stop, 0));
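
After the stop event, the elapsed time between the two events can be converted into an effective bandwidth. Roughly, continuing with the same variables (this is a sketch of the idea, not the sample's exact arithmetic):

// Wait for the copies to complete, then compute MB/s over all iterations.
float elapsedTimeMs = 0.0f;
checkCudaErrors(cudaEventSynchronize(stop));
checkCudaErrors(cudaEventElapsedTime(&elapsedTimeMs, start, stop));

double totalBytes  = (double)memSize * MEMCOPY_ITERATIONS;
double bandwidthMB = (totalBytes / 1e6) / (elapsedTimeMs / 1e3);   // MB/s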

The total bandwidth found depends on the size of the transfer. Each transfer has a certain amount of overhead that will drag the measured bandwidth down if you perform many small transfers instead of a few large ones. The effect largely disappears for transfers larger than 256 kB, though.

 

[Figure: bandwidth measurements on the GeForce GT 650M]

Box-Muller Transform

The Box-Muller Transform is a method for generating two normally distributed random numbers from two uniformly distributed numbers. It is widely used in C++11 library implementations, cuRAND, and elsewhere.

The algorithm is simple (a short code sketch follows the steps below):

  1. Compute uniform random variables \(U_0, U_1 \sim U(0,1)\) using rand()
  2. Compute two Gaussian random variables as:
    1. \(Z_0 = \sqrt{-2 \ln U_0} \cos(2 \pi U_1) \)
    2. \(Z_1 = \sqrt{-2 \ln U_0} \sin(2 \pi U_1) \)
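
A direct translation of those two steps into C++, returning the pair \((Z_0, Z_1)\) at once, might look like this (a sketch; the standard normals can be scaled afterwards for a different mean and variance):

#include <cmath>
#include <cstdlib>
#include <utility>

// Generate one pair of independent standard normal samples (Z0, Z1)
// from two uniform samples (U0, U1), following the two steps above.
std::pair<double, double> boxMullerPair()
{
    const double twoPi = 6.2831853071795864769252866;

    // Shift U0 away from zero so that log(U0) stays finite.
    double u0 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    double u1 = rand() / (double)RAND_MAX;

    double r     = std::sqrt(-2.0 * std::log(u0));
    double theta = twoPi * u1;

    return std::make_pair(r * std::cos(theta), r * std::sin(theta));
}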

Wikipedia’s article at the time of this writing contains a concise bit of C++ to return one random number at a time:

#include <cmath>
#include <cstdlib>

#define TWO_PI 6.2831853071795864769252866

// Returns one Gaussian sample per call; every other call reuses the
// second value produced by the previous Box-Muller pair.
double generateGaussianNoise(const double &variance)
{
	static bool haveSpare = false;
	static double rand1, rand2;

	if(haveSpare)
	{
		haveSpare = false;
		return sqrt(variance * rand1) * sin(rand2);
	}

	haveSpare = true;

	// Draw two uniforms; rand1 becomes -2*ln(U0), rand2 becomes 2*pi*U1.
	rand1 = rand() / ((double) RAND_MAX);
	if(rand1 < 1e-100) rand1 = 1e-100;   // guard against log(0)
	rand1 = -2 * log(rand1);
	rand2 = (rand() / ((double) RAND_MAX)) * TWO_PI;

	return sqrt(variance * rand1) * cos(rand2);
}
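
The C++11 standard library mentioned at the start of this section exposes normally distributed values through std::normal_distribution (implementations commonly use Box-Muller or a close relative), so new code rarely needs to hand-roll the transform. A minimal usage sketch:

#include <iostream>
#include <random>

int main()
{
    std::mt19937 gen(42);                               // fixed seed for repeatable output
    std::normal_distribution<double> normal(0.0, 1.0);  // mean 0, standard deviation 1

    for (int i = 0; i < 5; ++i)
        std::cout << normal(gen) << "\n";               // one Gaussian sample per call

    return 0;
}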

 

 

* Image Courtesy Cmglee CC BY-SA