QR factorization of a matrix A is the process of determining an orthogonal matrix Q and an upper triangular matrix R such that $$A = QR$$ Using a QR factorization is faster and much more numerically stable than computing the inverse of A to solve the system of equations Ax = b. QR factorization is used in fundamental linear algebra IDL routines such as LA_SVD, among others. I was recently using MAGMA's [SDCZ]GEQRF within GPULib to compute a QR factorization and found excellent performance improvements by using the GPU.
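The point about solving Ax = b without an inverse is easy to demonstrate outside IDL. Here is a small Python/NumPy sketch; the 3 x 3 system is invented purely for illustration:

```python
import numpy as np

# Hypothetical 3x3 system, invented for illustration (not from the post).
A = np.array([[9.0, 2.0, 6.0],
              [4.0, 8.0, 7.0],
              [1.0, 5.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

# Factor A = QR, then solve R x = Q^T b -- a triangular system handled by
# back substitution. No explicit inverse of A is ever formed.
Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)

assert np.allclose(A @ x, b)
```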
Also, the full decomposition into explicit Q and R matrices is seldom needed, so the results are not returned in that naive form. I will show how to retrieve Q and R from the GPULib routine (it's slightly different than the CPU version).
Performance of MAGMA for QR factorization was excellent, and it scaled quite well compared to the CPU. For small 500 x 500 matrices, I got about a 25x speedup; on larger 5000 x 5000 matrices, about a 435x speedup.
Next, we'll go through how to perform a QR factorization using GPULib and verify that we are really factoring A correctly. First, as always, we initialize GPULib:
IDL> gpuinit
GPULib Full 1.7.0 (Revision: 2876M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 1205 MB available, 1279 MB total
CUDA version: 5.5
MAGMA version: 1.4.0
Checking GPU memory allocation...cudaSuccess
Next, we make a simple test case and transfer it to the GPU:
IDL> a = [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]]
IDL> da = gpuputarr(a)
There are a few variables we need to set up:

IDL> dims = size(da, /dimensions)
IDL> m = dims[0]
IDL> n = dims[1]
The GPU routine we will use, GPUSGEQRF, is a call directly into the MAGMA C layer, so we need to set up a GPU workspace variable of the correct size, as well as a CPU variable to return the multipliers in:

IDL> info = 0L
IDL> tau = fltarr(m < n)
IDL> nb = gpu_get_sgeqrf_nb(m)
IDL> lwork = (2L * (m < n) + (n + 31L) / 32L * 32L) * nb
IDL> dworkspace = gpufltarr(lwork)
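The (n + 31L) / 32L * 32L part of the workspace expression rounds n up to the next multiple of 32 before scaling by the block size nb. A quick Python sketch of that integer rounding idiom (the function name is my own, purely illustrative):

```python
def round_up(n, block=32):
    """Round a positive integer n up to the next multiple of block using
    integer arithmetic only -- the same trick as the IDL expression
    (n + 31L) / 32L * 32L."""
    return (n + block - 1) // block * block

print(round_up(1), round_up(32), round_up(33), round_up(500))  # 32 32 64 512
```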
We are ready to call the routine to perform the QR factorization:
IDL> status = gpusgeqrf(m, n, da->_getHandle(), m, tau, dworkspace->_getHandle(), info)
Part of the result is computed in place in da (the other portion of the result is in the CPU variable tau), so we retrieve it from the GPU:

IDL> result = gpugetarr(da)
Q is decomposed into Householder reflector matrices such that: $$Q = H_1 H_2 \cdots H_k$$ where k = min(m, n) and $$H_i = I - \tau_i v v^T$$ For each H_i, the vector v has v[i] = 1.0, zeros above it, and its remaining elements stored below the diagonal of result:

v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
The code to reconstruct Q is therefore:

IDL> q = identity(m)
IDL> .run
- for i = 0L, n - 1L do begin
-   v = fltarr(m)
-   v[i] = 1.0
-   v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
-   q_sub = (identity(m) - tau[i] * (v # v))
-   q = q ## q_sub
- endfor
- end
R is stored in the other triangle of result:

IDL> r = fltarr(m, n)
IDL> for i = 0L, n - 1L do r[0L:i, i] = result[0L:i, i]
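This compact storage mirrors LAPACK's GEQRF on the CPU, so the same reconstruction can be sketched in Python via SciPy's low-level LAPACK wrappers. The row-major orientation of the example matrix and the dgeqrf call here are my own rendering for illustration, not GPULib code:

```python
import numpy as np
from scipy.linalg import lapack

# The example matrix in math (row-major) orientation: 3 rows, 2 columns.
a = np.array([[9.0, 4.0],
              [2.0, 8.0],
              [6.0, 7.0]])
m, n = a.shape

# DGEQRF returns the same compact form: R in the upper triangle,
# Householder vectors below the diagonal, multipliers in tau.
qr, tau, work, info = lapack.dgeqrf(a)
assert info == 0

# R is the upper triangle of the compact result.
r = np.triu(qr)

# Rebuild Q = H_1 H_2 ... H_k with H_i = I - tau_i v v^T, where v[:i] = 0,
# v[i] = 1, and v[i+1:] is stored below the diagonal of the compact result.
q = np.eye(m)
for i in range(min(m, n)):
    v = np.zeros(m)
    v[i] = 1.0
    v[i + 1:] = qr[i + 1:, i]
    q = q @ (np.eye(m) - tau[i] * np.outer(v, v))

assert np.allclose(q.T @ q, np.eye(m))  # Q is orthogonal
assert np.allclose(q @ r, a)            # A = QR
```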
To verify the result, r is upper triangular (IDL and standard mathematical notation have different conventions for displaying matrices):

IDL> print, r
     -11.0000      0.00000      0.00000
     -8.54546     -7.48166      0.00000
Q is orthogonal:

IDL> print, transpose(q) # q
      1.00000      0.00000  5.96046e-08
      0.00000      1.00000 -2.98023e-08
  5.96046e-08 -2.98023e-08      1.00000
And, $$A = QR$$

IDL> print, transpose(q) # r
      9.00000      2.00000      6.00000
      4.00000      8.00000      7.00000
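As a cross-check outside IDL, NumPy's qr (which also uses LAPACK's Householder QR underneath) reproduces the same R values, assuming the IDL array [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]] corresponds to the 3 x 2 math matrix with columns (9, 2, 6) and (4, 8, 7):

```python
import numpy as np

# The post's example matrix in row-major (math) orientation.
a = np.array([[9.0, 4.0],
              [2.0, 8.0],
              [6.0, 7.0]])

q, r = np.linalg.qr(a)        # economy-size: q is 3x2, r is 2x2
print(np.round(r, 5))         # r matches the -11.0, -8.54546, -7.48166 above

assert np.allclose(q @ r, a)  # A = QR
```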
Full disclosure: I work for Tech-X and I am the product manager for GPULib.
The timings above used a 6-core Intel Core i7-3960X CPU @ 3.30 GHz and a Tesla C2070 GPU with 1279 MB of memory.