Category "HPC"


Noah Lorang argues that data scientists mostly just do arithmetic:

The dirty little secret of the ongoing “data science” boom is that most of what people talk about as being data science isn’t what businesses actually need. Businesses need accurate and actionable information to help them make decisions about how they spend their time and resources. There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means that is best gained using simple methods.

This rings true for me in terms of the amount of time I spend on various tasks during a typical day. I would estimate that 90-95% of my time is spent doing basic shell scripting, writing/modifying IDL code to do basic file and array manipulations, writing IDL GUI applications, etc. But the other 5-10% is still important! The mundane tasks would be pointless without choosing the correct optimization technique and verifying that it works. Improving the performance of some section of code can make the difference between keeping up with incoming data or not, and that might mean using some “hardcore” technique such as writing the section in C, using a GPU, or making use of some multicore technology.

via FlowingData

For some time, I have needed an easy way to make use of all of the cores of a single machine through multiple threads or processes, i.e., not a SIMD/vectorized paradigm. The IDL_IDLBridge is capable of this, but its setup and usage are fairly painful. To make it easier, I have created a simple multicore library for IDL.
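For comparison, even the simplest direct IDL_IDLBridge usage requires creating, polling, and destroying the bridge object yourself; here is a rough sketch ('my_long_computation' is a hypothetical routine name):

bridge = obj_new('IDL_IDLBridge', output='')
bridge->execute, 'my_long_computation', /nowait
; poll until the asynchronous command finishes
while bridge->status() eq 1 do wait, 0.5
obj_destroy, bridge

Coordinating several bridges and transferring variables between processes takes considerably more bookkeeping; that is the kind of setup the library is intended to simplify.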

The simplest example creates a process to execute an IDL command and then meets up with the main process, as in the mg_process_join_demo routine in the examples:

p = mg_process(name='subprocess', output='')
p->execute, 'mg_process_join_demo', /nowait
; main process free to do other stuff at this point
p->join ; main process and subprocess will meet here

Another example uses a pool class to create a pool of processes that work together to perform their part of a larger task. In this example, we will compute x^2 + y^3 for 100 values[1] using a pool with the number of processes corresponding to the number of cores available. We need a function to apply to each element of our x and y arrays:

function mg_pool_map_demo, x, y
  compile_opt strictarr

  return, x^2 + y^3
end

Now we create a pool; by default, the number of processes is the number of cores available:

pool = mg_pool()

Create our datasets:

n = 100L
x = findgen(n)
y = 0.5 * findgen(n)

Then just pass our data to MG_Pool::map and indicate which function to apply to the data:

x_squared = pool->map('mg_pool_map_demo', x, y)

This call blocks, i.e., it returns control to the main process only when the processing is finished. At that point, the result has been computed:

print, x_squared, format='(8(F10.2))'
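
As a quick sanity check (not part of the original example), the pooled result can be compared against a direct serial computation:

print, (array_equal(x_squared, x^2 + y^3)) ? 'results match' : 'results differ'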

I have used this library to get fairly good performance on an 8-core laptop doing some image processing operations on a few hundred image files, as shown in the graph below.

Scaling
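
The image processing case follows the same map pattern. The sketch below is hypothetical: the worker function, the filename scheme, passing a single index array to MG_Pool::map, and applying the function element by element are all my own assumptions, not code from the library's examples:

; hypothetical per-file worker: builds its filename from an index,
; processes the image, and returns a status code
function mg_pool_file_demo, index
  compile_opt strictarr
  filename = string(index, format='(%"image-%04d.fits")')
  ; ... read the file and perform the image processing here ...
  return, 0L
end

; dispatch one index per file across the pool
n_files = 300L
status = pool->map('mg_pool_file_demo', lindgen(n_files))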


  1. This, of course, is not enough work to make parallelization worthwhile for performance; it is only intended as a fast and simple example of using the library.

After spending a while last Friday trying to vectorize a loop of a small matrix-vector multiplication for every pixel of an image, I gave up and decided to just write it as a DLM. For my image sizes of 1024 by 1024 pixels (actually two images of that size), the run time went from 3.15 seconds to 0.26 seconds on my MacBook Pro. That’s not a lot of time to save, but since we acquire imagery every 15 seconds, it was useful.

Check out analysis.c for source code. There are also unit tests showing how to use it.
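
For context, the kind of per-pixel loop being replaced looks roughly like the following; this is a hypothetical sketch with made-up 2 x 2 per-pixel matrices, not the actual computation in analysis.c:

; two 1024 x 1024 images combined by a small matrix at every pixel
n = 1024L
im1 = randomu(seed, n, n)
im2 = randomu(seed, n, n)
m = randomu(seed, 2, 2, n, n)   ; assumed per-pixel 2 x 2 matrices
out1 = fltarr(n, n)
out2 = fltarr(n, n)
for j = 0L, n - 1L do begin
  for i = 0L, n - 1L do begin
    v = [im1[i, j], im2[i, j]]
    out1[i, j] = total(m[*, 0, i, j] * v)
    out2[i, j] = total(m[*, 1, i, j] * v)
  endfor
endfor

Loops like this over every pixel are exactly where IDL is slow and a small C DLM pays off.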

GPULib 1.8 has been released with updates to the underlying libraries as well as new features in many areas of the library. It has been updated to use the most recent versions of IDL and CUDA: IDL 8.4 and CUDA 6.5. The new features are:

  • Support for integer data types. I have been wanting to support integer types in GPULib for a while and now GPULib supports all the numeric types provided by IDL! We can finally do the following (a quick round-trip check follows this list):
dx = gpuIndgen(10)
  • Added GPUREPMAT routine. This is a handy routine to create a new array by repeating a 2-dimensional array in a grid.
  • Added GPUCREATEKERNEL routine to create the source code of a simple kernel. This is a code generation routine that can be loaded with GPULOADMODULE/GPULOADFUNCTION and executed with GPUEXECUTEFUNCTION.
  • Added GPUFINITE routine, similar to IDL’s FINITE routine.
  • Added linear algebra routines GPULUDC, GPULUSOL, and GPULEAST_SQUARES. This fills out more of the GPU equivalent of the convenience routines provided by IDL so that the LAPACK interface of MAGMA is not required to perform linear algebra computations.
  • Added support for RHO and THETA keywords in GPURADON.
  • Added GPUMEAN routine. This routine has DIMENSION and NAN keywords with the same functionality as IDL’s MEAN routine.
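
As a quick round-trip check of the new integer support (GPUGETARR, used to transfer results back to the host, also appears in the QR factorization example later in these posts):

dx = gpuIndgen(10)   ; integer array created directly on the GPU
x = gpuGetArr(dx)    ; transfer back to the host
help, x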

Full disclosure: I work for Tech-X and I am the product manager/developer for GPULib.

My MacBook Pro has three OpenCL devices: a CPU, an integrated GPU, and a discrete GPU. I was interested in the performance I could get with my OpenCL GPULib prototype on the various devices, so I ran the benchmark routine on each of them. CL_BENCHMARK simply computes the gamma function for an array of values; see the results for various array sizes below.

Gamma computation performance on host and various OpenCL devices

There are several interesting points in these results:

  1. The discrete GPU did not have the best performance.
  2. The CPU OpenCL device performed better than the host, i.e., the plain CPU computation, for more than a few million elements (a host-side timing sketch follows this list).
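
For reference, the host numbers come from simply computing the gamma function with IDL on the CPU; a minimal host-side timing sketch (not the actual CL_BENCHMARK code) looks like:

n = 10000000L
x = findgen(n) / n * 10.0 + 0.5   ; values in a range where GAMMA is well behaved
t0 = systime(/seconds)
g = gamma(x)
print, systime(/seconds) - t0, format='(%"host GAMMA time: %f seconds")'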

Contact me if you are interested in the GPULib OpenCL prototype (still very rough).

Here are the details on the various OpenCL devices on my laptop:

IDL> cl_report
Platform 0
Name: Apple
Version: OpenCL 1.2 (Apr 25 2014 22:04:25)

Device 0
Name: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Global memory size: 17179869184 bytes (16384 MB)
Double capable: yes
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 1.1

Device 1
Name: Iris Pro
Global memory size: 1610612736 bytes (1536 MB)
Double capable: no
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 1.2(May 5 2014 20:39:23)

Device 2
Name: GeForce GT 750M
Global memory size: 2147483648 bytes (2048 MB)
Double capable: yes
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 8.26.21 310.40.35f08

Here are the slides from my talk today about GPULib and FastDL, given to the Scientific Programming in IDL class at the Exelis VIS office.

One of the features I was most excited about in CUDA 6 is nvBLAS, a drop-in library for BLAS. The idea is to use LD_PRELOAD when launching IDL to indicate that BLAS should come from nvBLAS instead of the BLAS implementation that would normally be found (in IDL’s case, the idl_lapack.so distributed with IDL). I’ve had problems getting it to work so far, though.

Right now, the way I’m starting IDL is something like the following:

$ export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64
$ LD_PRELOAD=/usr/local/cuda-6.0/lib64/libnvblas.so idl

This seems to recognize nvBLAS, because IDL crashes if I don’t set LD_LIBRARY_PATH and NVBLAS_CONFIG_FILE (I’m using a default configuration file). But I have not seen any difference in MATRIX_MULTIPLY performance between running with nvBLAS and without it. I will continue to test this, because it’s too interesting to pass up, and there are a couple of items that I haven’t explored yet:

  1. The 5000 x 5000 element matrices I’m testing may not be big enough.
  2. I may be missing a setting in the configuration file specified by NVBLAS_CONFIG_FILE.

I don’t imagine the speedup from nvBLAS is going to be amazing because memory transfer will eat into the performance, but you can’t beat not having to change any code at all. I am worried that the way that IDL loads dynamic libraries might get in the way of this working.
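
For reference, the timing check itself is minimal; a sketch along these lines (not the exact benchmark I ran) is enough to compare runs with and without the LD_PRELOAD above:

n = 5000L
a = randomu(seed, n, n)
b = randomu(seed, n, n)
t0 = systime(/seconds)
c = matrix_multiply(a, b)
print, systime(/seconds) - t0, format='(%"5000 x 5000 MATRIX_MULTIPLY: %f seconds")'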

Numba is a Python package that uses the LLVM compiler to compile Python code to native code. Numba 0.13, released a few weeks ago, offers support for automatically creating CUDA kernels from Python code.

I created a notebook [1] (HTML) to show off the demo code.

Mandelbrot


  1. I’m not sure which is cooler, IPython notebooks or Numba.

The LAPACK library is distributed along with IDL, but wrappers to most of the routines are not provided. On Linux, CALL_EXTERNAL can be used to easily access any of the routines. On OS X, the Accelerate Framework can be used. I have not found a solution for Windows[1] since the needed symbols in the DLL provided by IDL aren’t exported.

In the following code example for Linux, we will call SGEQRF to determine the QR factorization of a given matrix. First, we need to specify the location of the LAPACK library:

; location of the LAPACK library distributed with IDL
ext = !version.os_family eq 'unix' ? '.so' : '.dll'
lapack = filepath('idl_lapack' + ext, $
                  root=expand_path('<IDL_DEFAULT>', /dlm))

m = 20L
n = 10L
x = randomu(seed, m, n)

info = 0L
tau = fltarr(m < n)

; workspace query: LWORK = -1 asks SGEQRF for the optimal workspace size,
; which is returned in the first element of WORK
lwork = -1L
work = fltarr(1)
status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work, lwork, info, $
                       value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $
                       /auto_glue)

; allocate the optimal workspace and perform the actual factorization in place
lwork = long(work[0])
work2 = fltarr(lwork)
status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work2, lwork, info, $
                       value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $
                       /auto_glue)
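
A quick status check after the second call is worthwhile, since LAPACK sets INFO to zero on success and to a negative value if an argument was illegal:

if info ne 0L then message, 'SGEQRF returned INFO = ' + strtrim(info, 2)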

Here is a more generic routine as an example of calling into LAPACK that works on Linux and OS X.


  1. Please let me know if you have a way to do this on Windows!

QR factorization of a matrix A is the process of determining an orthogonal matrix Q and an upper triangular matrix R such that $$A = QR$$. Using a QR factorization is faster and much more numerically stable than finding an inverse matrix of A to solve the system of equations $$Ax = b$$. QR factorization is used in fundamental linear algebra IDL routines such as LA_EIGENPROBLEM, LA_LEAST_SQUARES, LA_SVD, and others. I was recently using MAGMA’s [SDCZ]GEQRF within GPULib to retrieve a QR factorization and found excellent performance improvements by using the GPU.
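
To see why, note that once $$A = QR$$ is available, the system $$Ax = b$$ becomes $$QRx = b$$, and since Q is orthogonal ($$Q^T Q = I$$) this reduces to the triangular system $$Rx = Q^T b$$, which is solved by back substitution without ever forming an inverse.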

Also, the full decomposition into Q and R matrices is seldom needed, so the results are not returned as explicit Q and R matrices. I will show how to retrieve Q and R from the GPULib routines (the process is slightly different from the CPU version).

Performance of MAGMA for QR factorization was excellent and it scaled quite well compared to the CPU as shown below.

QR timings

For small matrices of size 500 x 500, I got about a 25x speedup, but on larger 5000 x 5000 element matrices about a 435x speedup.[1]

Next, we’ll go through how to perform a QR factorization using GPULib and verify that we are really factoring A correctly. First, as always, we initialize GPULib:

IDL> gpuinit
GPULib Full 1.7.0 (Revision: 2876M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 1205 MB available, 1279 MB total
CUDA version: 5.5
MAGMA version: 1.4.0
Checking GPU memory allocation...cudaSuccess

Next, we make a simple test case and transfer it to the GPU:

IDL> a = [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]]
IDL> da = gpuputarr(a)

There are a few variables we need to set up:

IDL> dims = size(da, /dimensions)
IDL> m = dims[0]
IDL> n = dims[1]

The GPU routine we will use, GPUSGEQRF, is a call directly into the MAGMA C layer, so we need to set up a GPU workspace variable of the correct size as well as create a CPU variable to return the multipliers in tau:

IDL> info = 0L
IDL> tau = fltarr(m < n)
IDL> nb = gpu_get_sgeqrf_nb(m)
IDL> lwork = (2L * (m < n) + (n + 31L) / 32L * 32L) * nb
IDL> dworkspace = gpufltarr(lwork)

We are ready to call the routine to perform the QR factorization:

IDL> status = gpusgeqrf(m, n, da->_getHandle(), m, tau, dworkspace->_getHandle(), info)

Part of the result is done in place in da (the other portion of the result is in the CPU variable tau):

IDL> result = gpugetarr(da)

Q is decomposed into H reflector matrices such that: $$Q = H_1 H_2 \cdots H_k$$ where k = min(m, n), $$H_i = I - \tau_i v v^T$$ and v is stored in da:

v[i + 1:m - 1L] = result[i + 1:m - 1L, i]

The code to reconstruct Q is therefore:

IDL> q = identity(m)
IDL> .run
- for i = 0L, n - 1L do begin
-   v = fltarr(m)
-   v[i] = 1.0
-   v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
-   q_sub = (identity(m) - tau[i] * (v # v))
-   q = q ## q_sub
- endfor
-
- end

R is stored in the other triangle of da:

IDL> r = fltarr(m, n)
IDL> for i = 0L, n - 1L do r[0L:i, i] = result[0L:i, i]

To verify the result, r is upper triangular (IDL and standard mathematical notation have different conventions for displaying matrices):

IDL> print, r
-11.0000 0.00000 0.00000
-8.54546 -7.48166 0.00000

Q is orthogonal:

IDL> print, transpose(q) # q
1.00000 0.00000 5.96046e-08
0.00000 1.00000 -2.98023e-08
5.96046e-08 -2.98023e-08 1.00000

And, $$A = QR$$

IDL> print, transpose(q) # r
9.00000 2.00000 6.00000
4.00000 8.00000 7.00000

Full disclosure: I work for Tech-X and I am the product manager for GPULib.


  1. CPU was a 6-core Intel Core i7-3960X @ 3.30GHz; GPU was a Tesla C2070 with 1279 MB of memory.
