After spending a while last Friday trying to vectorize a loop that does a small matrix-vector multiplication for every pixel of an image, I gave up and decided to just write it as a DLM (dynamically loadable module). For my image sizes of 1024 by 1024 pixels (actually two images of that size), the run time went from 3.15 seconds to 0.26 seconds on my MacBook Pro, about 12 times faster. That's not a lot of time to save, but since we acquire imagery every 15 seconds, it was useful.
Check out [analysis.c] for the source code. There are also [unit tests] showing how to use it.
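
For a sense of what the core of such a routine involves, here is a minimal sketch in plain C. It is not the code in the linked source: the function name `batched_matvec`, the `double` type, and the memory layout (row-major matrices stored back to back, one per pixel) are all assumptions made for illustration, and the IDL DLM wrapper code is left out entirely.

```c
#include <stddef.h>

/* Hypothetical sketch: multiply n_pixels independent (m x n) matrices by
 * n_pixels length-n vectors, writing n_pixels length-m results.
 *
 * Assumed layout (not taken from the original source):
 *   a: n_pixels matrices stored back to back, each row-major, m * n values
 *   x: n_pixels vectors stored back to back, n values each
 *   y: n_pixels result vectors stored back to back, m values each
 */
void batched_matvec(const double *a, const double *x, double *y,
                    size_t n_pixels, size_t m, size_t n)
{
    for (size_t p = 0; p < n_pixels; p++) {
        const double *ap = a + p * m * n;  /* matrix for pixel p */
        const double *xp = x + p * n;      /* input vector for pixel p */
        double *yp = y + p * m;            /* output vector for pixel p */

        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t j = 0; j < n; j++)
                sum += ap[i * n + j] * xp[j];
            yp[i] = sum;
        }
    }
}
```

For a 1024 by 1024 image, `n_pixels` would be 1048576. The loops are trivial in compiled code, which is presumably why moving them out of an interpreted per-pixel loop helps so much.
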
[analysis.c]: https://github.com/mgalloy/mglib/tree/master/src/analysis "mglib/src/analysis"
[unit tests]: https://github.com/mgalloy/mglib/blob/master/unit/analysis_ut/mg_batched_matrix_vector_multiply_ut__define.pro "unit tests"