★ Finding the n smallest elements in an array
posted Fri 2 Jun 2006 by Michael Galloy under IDL, OptimizationFinding the n
smallest elements of an array is a common problem with an informative solution. It is not necessary to sort the entire array to solve this problem. To solve it, I’m going to use HISTOGRAM
in a slightly non-intuitive manner (or maybe it is intuitive if you’ve read J.D. Smith’s Histogram Tutorial)
The code quoted here is from a simplified MG_N_SMALLEST
routine that finds only the smallest elements of an array (the actual code can also find the largest elements of the array).
function mg_n_smallest, data, n<br />compile_opt strictarr
; use histogram to find a set with more elements than n
; of smallest elements
nData = n_elements(data)
nBins = nData / n
h = histogram(data, nbins=nBins, reverse_indices=ri)
; loop through the bins until we have n or more elements
nCandidates = 0L
for bin = 0L, nBins - 1L do begin
nCandidates += h[bin]
if (nCandidates ge n) then break
endfor
; get the candidates and sort them
candidates = ri[ri[0] : ri[bin + 1L] - 1L]
sortedCandidates = sort(data[candidates])
; return the proper n of them
return, (candidates[sortedCandidates])[0:n-1L]
end
This routine assumes that the data set is close to uniformly distributed. It uses a histogram to find a superset of candidates for the n
smallest, then sorts them to find exactly the right ones. If the data is non-uniform, the sorting step will have to sort more than n
elements, slowing the routine to the extent of the non-uniformity.
Let’s follow through the lines of code with a simple example data set.
IDL> data = randomu(seed, 10)
IDL> n = 3
IDL> print, data
0.115125 0.283748 0.942899 0.770475 0.818332
0.138929 0.175479 0.663068 0.806682 0.974345
IDL> nData = n_elements(data)
IDL> nBins = nData / n
The calculation of the histogram is (hopefully) where all the work will be done. (Remember, in bad cases, the SORT
routine will take up the most time.)
IDL> h = histogram(data, nbins=nBins, reverse_indices=ri)
IDL> print, h
4 5 1
IDL> print, ri
4 8 13 14 0
1 5 6 2 3
4 7 8 9
The four elements in the first bin of the histogram are the candidates, but which elements are they? The output from the REVERSE_INDICES
keyword of HISTOGRAM
indicates which elements fall in each bin. It is an index array in two parts; what makes it confusing is the first part is an index into the second part. In our example here, the parts are divided as (ri
below with the indices of the elements of ri
above it):
0 1 2 3 | 4 5 6 7 8 9 10 11 12 13
4 8 13 14 | 0 1 5 6 2 3 4 7 8 9
The first part, 4 8 13 14
, gives the starting indices of each of the three bins (plus a dummy last bin to indicate where the real last bin ends). These indices are indices into ri
itself. So the first bin starts at index 4 (of ri
), the second bin starts at index 8, and the third bin starts at index 13. The bins end one before the next bin starts (which is why the dummy last bin is there). So we have, the first bin in ri[4:7]
, the second bin in ri[8:12]
, and the third bin in ri[13:13]
. Now, the “values” in the bins are actually indices into the original data set. So values in the first bin are:
IDL> print, ri[4:7]
0 1 5 6
IDL> print, d[ri[4:7]]
0.115125 0.283748 0.138929 0.175479
If you wanted to do this in general for bin i
, it would be
ri[ri[i]:ri[i+1] - 1]
We’ll use this formula in a second. Back to our example, we now have the (four) candidates for the three smallest elements of the array. In the general case, there might not be n
elements here, so we might have to include more bins than just the first one.
nCandidates = 0L
for bin = 0L, nBins - 1L do begin
nCandidates += h[bin]
if (nCandidates ge n) then break
endfor
This code will stop as soon as we have have enough candidates, i.e. n
or more. We know how many bins (and hence how many candidates), but we don’t know which elements are the candidates yet. They are:
candidates = ri[ri[0] : ri[bin + 1L] - 1L]
We sort the candidates to find the smallest.
sortedCandidates = sort(data[candidates])
Now that we have sorted the candidates, just return the first n
of them (remember that SORT
returns indices into the original data array).
IDL> print, (candidates[sortedCandidates])[0:n-1L]
0 5 6
Here is the source code for MG_N_SMALLEST
(docs).
July 22nd, 2014 at 11:50 am
[…] have a more detailed description of what MG_N_SMALLEST is doing in an older article. I like this routine as a good example of using HISTOGRAM and its REVERSE_INDICES keyword. It also […]