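The only thing I can see that would improve your OpenMP time on its own is to parallelize along rows, not columns, i.e. put the parallel for on the outer loop. Image data is stored row by row, so each thread then works on contiguous memory, which is much friendlier to the cache.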
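To make that concrete, here is a minimal sketch of a row-parallel loop. I don't know what your filter actually looks like, so this assumes a plain 3x3 box blur on a row-major float buffer; `boxBlur3x3` and its parameters are names I made up for illustration, not anything from your code or from OpenCV. Border pixels are just copied to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV function): 3x3 box blur on a
// row-major grayscale image. Border pixels are copied unchanged.
std::vector<float> boxBlur3x3(const std::vector<float>& src, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    assert(src.size() == static_cast<std::size_t>(rows) * cols);

    std::vector<float> dst(src);  // start from a copy so borders are preserved

    // The pragma sits on the OUTER (row) loop: each thread gets whole rows,
    // so it reads and writes contiguous stretches of memory.
    // If you build without OpenMP enabled, the pragma is ignored and the
    // loop simply runs serially.
    #pragma omp parallel for
    for (int y = 1; y < rows - 1; ++y) {
        for (int x = 1; x < cols - 1; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    sum += src[static_cast<std::size_t>(y + dy) * cols + (x + dx)];
                }
            }
            dst[static_cast<std::size_t>(y) * cols + x] = sum / 9.0f;
        }
    }
    return dst;
}
```

Build with your compiler's OpenMP switch (for example `-fopenmp` on GCC or Clang) to actually get the threads. Each output pixel is written by exactly one thread, so no locking is needed.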
Really though, as berak said, the real improvement here would come from the base algorithm. Look up separable filters, look at this tutorial, and take a look at how OpenCV does it and time your code against it. Not all of it will make sense for your project, but you can at least make sure you're not doing anything silly, and get an idea of how well optimized your code is.
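In case the separable filter idea is new: if your 2D kernel can be written as the outer product of a 1D kernel with itself (box and Gaussian blurs can), you can run one horizontal pass and one vertical pass instead of the full 2D convolution. For a kernel of width k that is roughly 2k multiplies per pixel instead of k*k. The sketch below is my own illustration under those assumptions, not how OpenCV implements it; `separableFilter` is a made-up name, and the border handling (clamping to the nearest edge pixel) is just one choice among several.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV function): applies the 1D kernel k
// horizontally, then vertically, on a row-major grayscale image.
// Out-of-range neighbours are clamped to the nearest edge pixel.
std::vector<float> separableFilter(const std::vector<float>& src,
                                   int rows, int cols,
                                   const std::vector<float>& k)
{
    assert(rows > 0 && cols > 0);
    assert(k.size() % 2 == 1);  // odd length so the kernel has a centre tap
    assert(src.size() == static_cast<std::size_t>(rows) * cols);

    const int r = static_cast<int>(k.size()) / 2;
    std::vector<float> tmp(src.size());
    std::vector<float> dst(src.size());

    // Pass 1: horizontal. Parallel over rows, as in the advice above.
    #pragma omp parallel for
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            float sum = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int xx = std::min(std::max(x + i, 0), cols - 1);
                sum += k[i + r] * src[static_cast<std::size_t>(y) * cols + xx];
            }
            tmp[static_cast<std::size_t>(y) * cols + x] = sum;
        }
    }

    // Pass 2: vertical, reading the result of pass 1. Still parallel over
    // output rows, so each thread writes contiguous memory.
    #pragma omp parallel for
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            float sum = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int yy = std::min(std::max(y + i, 0), rows - 1);
                sum += k[i + r] * tmp[static_cast<std::size_t>(yy) * cols + x];
            }
            dst[static_cast<std::size_t>(y) * cols + x] = sum;
        }
    }
    return dst;
}
```

For example, passing `{0.25f, 0.5f, 0.25f}` as `k` gives a small 3x3 smoothing filter. Time this kind of two-pass version against your current loop and against OpenCV's own filtering functions to see where you stand.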