Category "HPC"

Noah Lorang argues that data scientists mostly just do arithmetic:

The dirty little secret of the ongoing “data science” boom is that most of what people talk about as being data science isn’t what businesses actually need. Businesses need accurate and actionable information to help them make decisions about how they spend their time and resources. There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means that is best gained using simple methods.

This rings true for me in terms of the amount of time I spend doing various tasks during a typical day. I would estimate that 90-95% of my time is spent doing basic shell scripting, writing/modifying IDL code to do basic file and array manipulations, writing IDL GUI applications, etc. But the 5-10% of the other time is still important! The mundane tasks would be pointless without choosing the correct optimization technique and verifying it works. It might be that improving the performance of some section of code makes the difference between keeping up with incoming data or not and that might mean using some “hardcore” technique such as writing the section in C, using a GPU, or making use of some multicore technology.

via FlowingData

For some time, I have had a need for an easy way to make use of all of the cores of a single machine through multiple threads or processes, i.e., not a SIMD/vectorized paradigm. The IDL_IDLBridge is capable of doing this, but setup and usage is fairly painful. To make it easier, I have created a simple multicore library for IDL.

Continue reading “A simple multicore library for IDL.”

After spending awhile last Friday trying to vectorize a loop of a small matrix-vector multiplication for every pixel of an image, I gave up and decided to just write it as a DLM. For my image sizes of 1024 by 1024 pixels (actually two images of that size), the run time went from 3.15 seconds to 0.26 seconds on my MacBook Pro. That’s not a lot of time to save, but since we acquire imagery every 15 seconds, it was useful.

Check out analysis.c for source code. There are also unit tests showing how to use it.

GPULib 1.8 has been released with updates to the underlying libraries as well as many other features in many areas of the library. It has been updated to use the most recent versions of IDL and CUDA, IDL 8.4 and CUDA 6.5. The new features are:

  • Support for integer data types. I have been wanting to support integer types in GPULib for awhile and now GPULib supports all the numeric types provided by IDL! We can finally do:

    dx = gpuIndgen(10)
  • Added GPUREPMAT routine. This is a handy routine to create a new array by repeating a 2-dimensional array in a grid.

  • Added GPUCREATEKERNEL routine to create the source code of a simple kernel. This is a code generation routine that can be loaded with GPULOADMODULE/GPULOADFUNCTION and executed with GPUEXECUTEFUNCTION.

  • Added GPUFINITE routine similar to IDL’s library routine.

  • Added linear algebra routines GPULUDC, GPULUSOL, and GPULEAST_SQUARES. This fills out more of the GPU equivalent of the convenience routines provided by IDL so that the LAPACK interface of MAGMA is not required to perform linear algebra computations.

  • Added support for RHO and THETA keywords in GPURADON.

  • Added GPUMEAN routine. This routine has DIMENSION and NAN keywords with the same functionality as IDL’s library routine.

Full disclosure: I work for Tech-X and I am the product manager/developer for GPULib.

My MacBook Pro has three OpenCL devices: a CPU, an integrated GPU, and a discrete GPU. I was interested in the performance I could get with my OpenCL GPULib prototype on the various devices, so I ran the benchmark routine on each of them. CL_BENCHMARK simply computes the gamma function for an array of values; see the results for various array sizes below.

Gamma computation performance on host and various OpenCL devices

There are several interesting points to these results:

  1. the discrete GPU did not have the best performance
  2. the CPU OpenCL device performed better than the host, i.e., the CPU, for more than a few million elements

Contact me if you are interested in the GPULib OpenCL prototype (still very rough).

Here’s the details on the various OpenCL devices on my laptop:

IDL> cl_report
Platform 0
Name: Apple
Version: OpenCL 1.2 (Apr 25 2014 22:04:25)

  Device 0
  Name: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  Global memory size: 17179869184 bytes (16384 MB)
  Double capable: yes
  Available: yes
  Compiler available: yes
  Device version: OpenCL 1.2
  Driver version: 1.1

  Device 1
  Name: Iris Pro
  Global memory size: 1610612736 bytes (1536 MB)
  Double capable: no
  Available: yes
  Compiler available: yes
  Device version: OpenCL 1.2
  Driver version: 1.2(May  5 2014 20:39:23)

  Device 2
  Name: GeForce GT 750M
  Global memory size: 2147483648 bytes (2048 MB)
  Double capable: yes
  Available: yes
  Compiler available: yes
  Device version: OpenCL 1.2
  Driver version: 8.26.21 310.40.35f08

Here are the slides from my talk about GPULib and FastDL today to the Scientific Programming in IDL class at the Exelis VIS office.

One of the features I was most excited about in CUDA 6 is the drop-in library (nvBLAS) support for BLAS. The idea is to use LD_PRELOAD when launching IDL to indicate that BLAS should be coming from the nvBLAS instead of the BLAS implementation that would normally be found, e.g., in IDL’s case, the distributed with IDL. I’ve had problems getting it to work so far, though.

Right now, the way I’m starting IDL is something like the following:

$ export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64
$ LD_PRELOAD=/usr/local/cuda-6.0/lib64/ idl

This seems to be recognizing nvBLAS because it will crash if I don’t set LD_LIBRARY_PATH and NVBLAS_CONFIG_FILE (I’m using a default configuration file). But I have not been able to get any different results testing performance of MATRIX_MULTIPLY between using nvBLAS or not. I will continue to test this, because it’s too interesting to pass up and there are a couple of items that I haven’t explored yet:

  1. I’m not using big enough matrices at 5000 x 5000 elements.
  2. I’m not setting something in the configuration file specified by NVBLAS_CONFIG_FILE.

I don’t imagine the speedup from nvBLAS is going to be amazing because memory transfer will eat into the performance, but you can’t beat not having to change any code at all. I am worried that the way that IDL loads dynamic libraries might get in the way of this working.

Numba is a Python package that uses the LLVM compiler to compile Python code to native code. Numba 0.13, released a few weeks ago, offers support for automatically creating CUDA kernels from Python code.

I created a notebook1 (HTML) to show off the demo code.

  1. I’m not sure which is cooler, IPython notebooks or Numba. 

The LAPACK library is distributed along with IDL, but wrappers to most of the routines are not provided. On Linux, CALL_EXTERNAL can be used to easily access any of the routines. On OS X, the Accelerate Framework can be used. I have not found a solution for Windows1 since the needed symbols in the DLL provided by IDL aren’t exported.

In the following code example for Linux, we will call SGEQRF to determine the QR factorization of a given matrix. First, we need to specify the location of the LAPACK library:

ext = !version.os_family eq 'unix' ? '.so' : '.dll'
lapack = filepath('idl_lapack' + ext, $
                  root=expand_path('<IDL_DEFAULT>', /dlm))

m = 20L
n = 10L
x = randomu(seed, m, n)

info = 0L
tau = fltarr(m < n)
lwork = -1L
work = fltarr(1)
status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work, lwork, info, $
                       value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $
lwork = long(work[0])
work2 = fltarr(lwork)
status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work2, lwork, info, $
                       value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $

Here is a more generic routine as an example of calling into LAPACK that works on Linux and OS X.

  1. Please let me know if you have a way to do this on Windows! 

QR factorization of a matrix A is the process of determining an orthogonal matrix Q and upper triangular matrix R such that $$A = QR$$ Using a QR factorization is faster and much more numerically stable than finding an inverse matrix of A to solve the system of equations Ax = b. QR factorization is used in fundamental linear algebra IDL routines such as LA_EIGENPROBLEM, LA_LEAST_SQUARES, LA_SVD, and others. I was recently using MAGMA‘s [SDCZ]GEQRF within GPULib to retrieve a QR factorization and found excellent performance improvements by using the GPU.

Also, the full decomposition into Q and R matrices is seldom needed, so the results are not returned in a naive manner. I will show how to retrieve Q and R for the GPULib routines (it’s slightly different than the CPU version).

Continue reading “QR factorization in GPULib.”

older posts »