QR factorization of a matrix A is the process of determining an orthogonal matrix Q and an upper triangular matrix R such that $$A = QR$$ Using a QR factorization is faster and much more numerically stable than computing the inverse of A to solve the system of equations Ax = b. QR factorization is used in fundamental linear algebra IDL routines such as LA_EIGENPROBLEM, LA_LEAST_SQUARES, LA_SVD, and others. I was recently using MAGMA's [SDCZ]GEQRF within GPULib to compute a QR factorization and found excellent performance improvements by using the GPU.
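To illustrate the point about solving Ax = b without forming an inverse, here is a short NumPy/SciPy sketch (CPU-side, unrelated to GPULib): once A = QR, the system reduces to the triangular solve Rx = Qᵀb. The matrix and right-hand side here are made-up example values.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Solve Ax = b via QR: A = QR  =>  R x = Q^T b, a cheap triangular solve.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

Q, R = np.linalg.qr(A)
x = solve_triangular(R, Q.T @ b)  # R is upper triangular

# The residual confirms x solves the original system.
assert np.allclose(A @ x, b)
```

This avoids the numerical pitfalls of explicitly inverting A, since the triangular solve and the orthogonal multiply are both well-conditioned operations.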

Also, the full decomposition into explicit Q and R matrices is seldom needed, so the results are returned in a compact form rather than as the matrices themselves. I will show how to retrieve Q and R from the GPULib routines (the process is slightly different from the CPU version).

Performance of MAGMA for QR factorization was excellent, and it scaled quite well compared to the CPU, as shown below.

For small matrices of size 500 x 500, I got about a 25x speedup, but on larger 5000 x 5000 matrices about a 435x speedup.[1]

Next, we’ll go through how to perform QR factorization using GPULib and verify that we are really factoring A correctly. First, as always, we initialize GPULib:

IDL> gpuinit
GPULib Full 1.7.0 (Revision: 2876M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 1205 MB available, 1279 MB total
CUDA version: 5.5
MAGMA version: 1.4.0
Checking GPU memory allocation...cudaSuccess

Next, we make a simple test case and transfer it to the GPU:

IDL> a = [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]]
IDL> da = gpuputarr(a)

There are a few variables we need to set up:

IDL> dims = size(da, /dimensions)
IDL> m = dims[0]
IDL> n = dims[1]

The GPU routine we will use, GPUSGEQRF, is a call directly into the MAGMA C layer, so we need to set up a GPU workspace variable of the correct size, as well as create a CPU variable, tau, to hold the Householder multipliers:

IDL> info = 0L
IDL> tau = fltarr(m < n)
IDL> nb = gpu_get_sgeqrf_nb(m)
IDL> lwork = (2L * (m < n) + (n + 31L) /32L * 32L ) * nb
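As a sanity check on the workspace arithmetic, the lwork expression above translates to the following (a hypothetical helper for illustration; nb is whatever GPU_GET_SGEQRF_NB returns for the matrix size):

```python
def sgeqrf_lwork(m, n, nb):
    """Workspace size: (2 * min(m, n) + ceil(n / 32) * 32) * nb,
    mirroring the IDL expression (2L*(m < n) + (n + 31L)/32L*32L) * nb."""
    return (2 * min(m, n) + ((n + 31) // 32) * 32) * nb

# For the 3 x 2 example, with a made-up block size nb = 32:
assert sgeqrf_lwork(3, 2, 32) == (2 * 2 + 32) * 32  # 1152 elements
```

The `(n + 31) // 32 * 32` term simply rounds n up to the next multiple of 32, a common alignment requirement for GPU kernels.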
IDL> dworkspace = gpufltarr(lwork)

We are ready to call the routine to perform the QR factorization:

IDL> status = gpusgeqrf(m, n, da->_getHandle(), m, tau, dworkspace->_getHandle(), info)

Part of the result is stored in place in da (the other portion of the result is in the CPU variable tau):

IDL> result = gpugetarr(da)

Q is represented as a product of Householder reflector matrices such that $$Q = H_1 H_2 \cdots H_k$$ where k = min(m, n), $$H_i = I - \tau_i v v^T$$ and v is stored in da:

v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
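To see why this representation yields an orthogonal Q: each reflector $H_i = I - \tau_i v v^T$ is itself orthogonal (and symmetric) when $\tau = 2 / (v^T v)$, which is how LAPACK-style routines choose $\tau$ for real inputs. A quick NumPy check with a made-up reflector vector:

```python
import numpy as np

v = np.array([1.0, 0.5, -2.0])   # made-up reflector; geqrf stores v with v[i] = 1
tau = 2.0 / (v @ v)              # the choice of tau that makes H orthogonal (real case)
H = np.eye(3) - tau * np.outer(v, v)

assert np.allclose(H.T @ H, np.eye(3))  # H is orthogonal
assert np.allclose(H, H.T)              # and symmetric
```

A product of orthogonal matrices is orthogonal, so accumulating the $H_i$ gives an orthogonal Q.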

The code to reconstruct Q is therefore:

IDL> q = identity(m)
IDL> .run
- for i = 0L, n - 1L do begin
-   v = fltarr(m)
-   v[i] = 1.0
-   v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
-   q_sub = (identity(m) - tau[i] * (v # v))
-   q = q ## q_sub
- endfor
-
- end

R is stored in the other triangle of da:

IDL> r = fltarr(m, n)
IDL> for i = 0L, n - 1L do r[0L:i, i] = result[0L:i, i]
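The whole procedure (factor, rebuild Q, extract R) can be sketched on the CPU with SciPy's LAPACK wrapper, which returns its results in the same compact form as the GPU routine. Note that the matrix below is the column-major view of the IDL array a from above (IDL's first index is the column, so m = 3, n = 2):

```python
import numpy as np
from scipy.linalg import lapack

# Column-major view of the IDL array a = [[9, 2, 6], [4, 8, 7]]: m = 3, n = 2.
A = np.array([[9.0, 4.0],
              [2.0, 8.0],
              [6.0, 7.0]], dtype=np.float32, order='F')
m, n = A.shape

qr, tau, work, info = lapack.sgeqrf(A)   # CPU analogue of the GPUSGEQRF call
assert info == 0

# Rebuild Q = H_1 H_2 ... H_k from the reflectors stored below the diagonal.
Q = np.eye(m, dtype=np.float32)
for i in range(min(m, n)):
    v = np.zeros(m, dtype=np.float32)
    v[i] = 1.0
    v[i + 1:] = qr[i + 1:, i]
    Q = Q @ (np.eye(m, dtype=np.float32) - tau[i] * np.outer(v, v))

# R is the upper triangle of the compact result.
R = np.triu(qr)

assert np.allclose(Q.T @ Q, np.eye(m), atol=1e-5)  # Q is orthogonal
assert np.allclose(Q @ R, A, atol=1e-4)            # A = QR
```

The first diagonal entry of R comes out as -11, matching the IDL session output above (the norm of the first column is sqrt(81 + 4 + 36) = 11, with a sign flip from the reflector convention).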

To verify the result, note that r is upper triangular (IDL and standard mathematical notation use different conventions for displaying matrices):

IDL> print, r
-11.0000      0.00000      0.00000
-8.54546     -7.48166      0.00000

Q is orthogonal:

IDL> print, transpose(q) # q
1.00000      0.00000  5.96046e-08
0.00000      1.00000 -2.98023e-08
5.96046e-08 -2.98023e-08      1.00000

And, $$A = QR$$

IDL> print, transpose(q) # r
9.00000      2.00000      6.00000
4.00000      8.00000      7.00000

Full disclosure: I work for Tech-X and I am the product manager for GPULib.

[1] CPU: 6-core Intel Core i7-3960X @ 3.30 GHz; GPU: Tesla C2070 with 1279 MB memory.