Category "HPC"

Noah Lorang argues that [data scientists mostly just do arithmetic][arithmetic]:

> The dirty little secret of the ongoing “data science” boom is that most of what people talk about as being data science isn’t what businesses actually need. Businesses need accurate and actionable information to help them make decisions about how they spend their time and resources. There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means that is best gained using simple methods.

This rings true for me in terms of the amount of time I spend doing various tasks during a typical day. I would estimate that 90-95% of my time is spent doing basic shell scripting, writing/modifying IDL code to do basic file and array manipulations, writing IDL GUI applications, etc. But the 5-10% of the other time is still important! The mundane tasks would be pointless without choosing the correct optimization technique and verifying it works. It might be that improving the performance of some section of code makes the difference between keeping up with incoming data or not and that might mean using some “hardcore” technique such as writing the section in C, using a GPU, or making use of some multicore technology.

via [FlowingData]

[arithmetic]: “Data scientists mostly just do arithmetic and that’s a good thing”
[FlowingData]: “Data scientists mostly just do arithmetic”

For some time, I have had a need for an easy way to make use of all of the cores of a single machine through multiple threads or processes, i.e., not a SIMD/vectorized paradigm. The `IDL_IDLBridge` is capable of doing this, but setup and usage is fairly painful. To make it easier, I have created a [simple multicore library][multiprocessing] for IDL.

[multiprocessing]: “mglib/src/multiprocessing”

Continue reading “A simple multicore library for IDL.”

After spending awhile last Friday trying to vectorize a loop of a small matrix-vector multiplication for every pixel of an image, I gave up and decided to just write it as a DLM. For my image sizes of 1024 by 1024 pixels (actually two images of that size), the run time went from 3.15 seconds to 0.26 seconds on my MacBook Pro. That’s not a lot of time to save, but since we acquire imagery every 15 seconds, it was useful.

Check out [analysis.c] for source code. There are also [unit tests] showing how to use it.

[analysis.c]: “mglib/src/analysis”

[unit tests]: “unit tests”

[GPULib 1.8] has been released with updates to the underlying libraries as well as many other features in many areas of the library. It has been updated to use the most recent versions of IDL and CUDA, IDL 8.4 and CUDA 6.5. The new features are:

* Support for integer data types. I have been wanting to support integer types in GPULib for awhile and now GPULib supports all the numeric types provided by IDL! We can finally do:

dx = gpuIndgen(10)

* Added `GPUREPMAT` routine. This is a handy routine to create a new array by repeating a 2-dimensional array in a grid.

* Added `GPUCREATEKERNEL` routine to create the source code of a simple kernel. This is a code generation routine that can be loaded with `GPULOADMODULE`/`GPULOADFUNCTION` and executed with `GPUEXECUTEFUNCTION`.

* Added `GPUFINITE` routine similar to IDL’s library routine.

* Added linear algebra routines `GPULUDC`, `GPULUSOL`, and `GPULEAST_SQUARES`. This fills out more of the GPU equivalent of the convenience routines provided by IDL so that the LAPACK interface of MAGMA is not required to perform linear algebra computations.

* Added support for `RHO` and `THETA` keywords in `GPURADON`.

* Added `GPUMEAN` routine. This routine has `DIMENSION` and `NAN` keywords with the same functionality as IDL’s library routine.

*Full disclosure: I work for Tech-X and I am the product manager/developer for GPULib.*

[GPULib 1.8]:

My MacBook Pro has three OpenCL devices: a CPU, an integrated GPU, and a discrete GPU. I was interested in the performance I could get with my OpenCL GPULib prototype on the various devices, so I ran the benchmark routine on each of them. `CL_BENCHMARK` simply computes the gamma function for an array of values; see the results for various array sizes below.

Gamma computation performance on host and various OpenCL devices

There are several interesting points to these results:

1. the discrete GPU did not have the best performance
2. the CPU OpenCL device performed better than the host, i.e., the CPU, for more than a few million elements

Contact me if you are interested in the GPULib OpenCL prototype (still very rough).

Here’s the details on the various OpenCL devices on my laptop:

IDL> cl_report
Platform 0
Name: Apple
Version: OpenCL 1.2 (Apr 25 2014 22:04:25)

Device 0
Name: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Global memory size: 17179869184 bytes (16384 MB)
Double capable: yes
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 1.1

Device 1
Name: Iris Pro
Global memory size: 1610612736 bytes (1536 MB)
Double capable: no
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 1.2(May 5 2014 20:39:23)

Device 2
Name: GeForce GT 750M
Global memory size: 2147483648 bytes (2048 MB)
Double capable: yes
Available: yes
Compiler available: yes
Device version: OpenCL 1.2
Driver version: 8.26.21 310.40.35f08

Here are the [slides] from my talk about GPULib and FastDL today to the Scientific Programming in IDL class at the Exelis VIS office.

[slides]: “High Performance Computing with IDL”

One of the features I was most excited about in [CUDA 6] is the drop-in library ([nvBLAS]) support for BLAS. The idea is to use `LD_PRELOAD` when launching IDL to indicate that BLAS should be coming from the nvBLAS instead of the BLAS implementation that would normally be found, e.g., in IDL’s case, the `` distributed with IDL. I’ve had problems getting it to work so far, though.

Right now, the way I’m starting IDL is something like the following:

$ export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64
$ LD_PRELOAD=/usr/local/cuda-6.0/lib64/ idl

This seems to be recognizing nvBLAS because it will crash if I don’t set `LD_LIBRARY_PATH` and `NVBLAS_CONFIG_FILE` (I’m using a default configuration file). But I have not been able to get any different results testing performance of `MATRIX_MULTIPLY` between using nvBLAS or not. I will continue to test this, because it’s too interesting to pass up and there are a couple of items that I haven’t explored yet:

1. I’m not using big enough matrices at 5000 x 5000 elements.
2. I’m not setting something in the configuration file specified by `NVBLAS_CONFIG_FILE`.

I don’t imagine the speedup from nvBLAS is going to be amazing because memory transfer will eat into the performance, but you can’t beat not having to change any code at all. I am worried that the way that IDL loads dynamic libraries might get in the way of this working.

[CUDA 6]: “CUDA Toolkit Documentation”
[nvBLAS]: “NVBLAS :: CUDA Toolkit Documentation”

Numba is a Python package that uses the LLVM compiler to compile Python code to native code. [Numba 0.13], released a few weeks ago, offers support for automatically creating CUDA kernels from Python code.

I created a [notebook][^1] ([HTML]) to show off the demo code.

[Numba 0.13]: “Numba 0.13”
[HTML]: “Numba with CUDA IPython HTML”
[notebook]: “HTML output of Numba with CUDA IPython notebook”

[^1]: I’m not sure which is cooler, IPython notebooks or Numba.

The LAPACK library is distributed along with IDL, but wrappers to most of the routines are not provided. On Linux, `CALL_EXTERNAL` can be used to easily access any of the routines. On OS X, the Accelerate Framework can be used. I have not found a solution for Windows[^1] since the needed symbols in the DLL provided by IDL aren’t exported.

In the following code example for Linux, we will call [SGEQRF] to determine the QR factorization of a given matrix. First, we need to specify the location of the LAPACK library:

ext = !version.os_family eq ‘unix’ ? ‘.so’ : ‘.dll’
lapack = filepath(‘idl_lapack’ + ext, $
root=expand_path(‘‘, /dlm))

m = 20L
n = 10L
x = randomu(seed, m, n)

info = 0L
tau = fltarr(m < n) lwork = -1L work = fltarr(1) status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work, lwork, info, $ value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $ /auto_glue) lwork = long(work[0]) work2 = fltarr(lwork) status = call_external(lapack, 'sgeqrf_', m, n, x, m, tau, work2, lwork, info, $ value=[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L], $ /auto_glue) [Here][MG_CALL_LAPACK_EXAMPLE] is a more generic routine as an example of calling into LAPACK that works on Linux and OS X. [LAPACK docs]: "Explore LAPACK code" [SGEQRF]: "SRC/sgeqrf.f" [MG_CALL_LAPACK_EXAMPLE]: "" [^1]: Please let me know if you have a way to do this on Windows!

[QR factorization][QR] of a matrix `A` is the process of determining an orthogonal matrix `Q` and upper triangular matrix `R` such that $$A = QR$$ Using a QR factorization is faster and much more numerically stable than finding an inverse matrix of `A` to solve the system of equations `Ax = b`. QR factorization is used in fundamental linear algebra IDL routines such as `LA_EIGENPROBLEM`, `LA_LEAST_SQUARES`, `LA_SVD`, and others. I was recently using [MAGMA]’s `[SDCZ]GEQRF` within [GPULib] to retrieve a QR factorization and found excellent performance improvements by using the GPU.

Also, the full decomposition into `Q` and `R` matrices is seldom needed, so the results are not returned in a naive manner. I will show how to retrieve `Q` and `R` for the GPULib routines (it’s slightly different than the CPU version).

[QR]: “QR decomposition”
[GPULib]: “GPULib”

Continue reading “QR factorization in GPULib.”

older posts »