Hi, I am analysing a set of images for subpixel image shifts. I have code which essentially loops through:
loop(){
- read binary image, upload it to a GpuMat (CUDA)
// the next 2 steps are based on dft, mulSpectrums, magnitude (all CUDA, "Streamable")
- convolve with smoothing/gradient kernels (CUDA)
- cross-correlate (phase-correlate) with base image (CUDA)
// the next steps locate the maximum with subpixel precision
- find maxLoc (CUDA, but the value is sent to Point.x/Point.y on the CPU)
- copy the 3x3 neighbourhood of maxLoc into a Mat (CPU)
- subpixel registration by quadratic fit (CPU)
- resulting (x,y) pixel shifts are placed in shift maps (CPU) }
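To make the subpixel step concrete, the quadratic fit on the 3x3 neighbourhood can be sketched roughly as below. This is only a minimal sketch, assuming a separable fit (one 1D parabola through the centre row, one through the centre column); the real code may use a full 2D quadratic surface instead. The names parabolaOffset, refinePeak3x3 and SubpixelShift are hypothetical and used for illustration only.

```cpp
#include <array>
#include <cmath>

// Hypothetical helper: vertex offset of the parabola through three equally
// spaced samples at positions -1, 0, +1. Returns 0 when the three samples
// are collinear (no curvature), since the vertex is undefined in that case.
static double parabolaOffset(double left, double centre, double right) {
    const double denom = left - 2.0 * centre + right;
    if (std::fabs(denom) < 1e-12) {
        return 0.0;
    }
    return 0.5 * (left - right) / denom;
}

struct SubpixelShift {
    double dx;  // offset along columns, relative to the integer peak
    double dy;  // offset along rows, relative to the integer peak
};

// n[row][col] holds the 3x3 correlation values around the integer maximum,
// with n[1][1] being the value at maxLoc itself.
// Assumption: separable fit through the centre row and the centre column.
static SubpixelShift refinePeak3x3(const std::array<std::array<double, 3>, 3>& n) {
    SubpixelShift s;
    s.dx = parabolaOffset(n[1][0], n[1][1], n[1][2]);
    s.dy = parabolaOffset(n[0][1], n[1][1], n[2][1]);
    return s;
}
```

The final shift would then be maxLoc.x + dx and maxLoc.y + dy. The fit itself is only a handful of floating-point operations per image, so the per-iteration GPU-to-CPU round trip around it is a more likely cost than the arithmetic.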
All of this is computed ~65000 times and takes about 8 minutes in total (256x256 base 16-bit B&W images). The CUDA card is not even heating up (nvidia-smi shows 6% GPU-Util).
Any suggestions on how to parallelize this (the faster the better)?