Hi folks, I'm trying to implement the following pipeline in C++ using OpenCV 3.4.7:
feature extraction -> transform and crop 32x32 patches around features -> use the patches in LibTorch.
The feature extraction is offloaded to a SIFT implementation written in CUDA, since in my scenario I can't afford to sacrifice execution time.
After this is done, the data is retrieved from GPU memory to host memory and I compute the transformation matrices Ms for every keypoint. Then I crop the patches around each keypoint by calling warpAffine with the matrix Ms, the interpolation flags WARP_INVERSE_MAP + INTER_LINEAR + WARP_FILL_OUTLIERS, and BORDER_REPLICATE as the border mode. After a while (well, not that long, but still too much for my scenario) I get the patches and can feed them to LibTorch.
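To be concrete, the crop of a single patch boils down to one call (sketch; image is the host-side frame and Ms the 2x3 matrix of one keypoint):

```cpp
#include <opencv2/imgproc.hpp>

// Crop one 32x32 patch around a keypoint. Ms maps patch pixels back into the
// image (hence WARP_INVERSE_MAP); sampling is bilinear with a replicated border.
cv::Mat cropOnePatch(const cv::Mat& image, const cv::Mat& Ms)
{
    cv::Mat patch;
    cv::warpAffine(image, patch, Ms, cv::Size(32, 32),
                   cv::WARP_INVERSE_MAP | cv::INTER_LINEAR | cv::WARP_FILL_OUTLIERS,
                   cv::BORDER_REPLICATE);
    return patch;
}
```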
Right now the computation of the Ms matrices and the warpAffine calls run on the CPU, in a loop that I can parallelize with parallel_for_ (see the sketch below), although dealing with the indices is not trivial. I would like to avoid that much I/O and do all the hard work on the GPU: I already perform the SIFT extraction there, so all the data is stored in GPU memory, and since LibTorch can run on a CUDA device, doing the whole job on the GPU would avoid all the upload/download calls.
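Here's roughly what that parallelized loop looks like (simplified sketch; Ms holds the 2x3 matrix of each keypoint):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Parallelized CPU crop: the range runs over keypoint indices, so every
// worker writes only to its own pre-allocated slot in `patches`.
void cropPatches(const cv::Mat& image,
                 const std::vector<cv::Mat>& Ms,   // one 2x3 matrix per keypoint
                 std::vector<cv::Mat>& patches)
{
    patches.resize(Ms.size());
    cv::parallel_for_(cv::Range(0, static_cast<int>(Ms.size())),
                      [&](const cv::Range& range)
    {
        for (int i = range.start; i < range.end; ++i)
            cv::warpAffine(image, patches[i], Ms[i], cv::Size(32, 32),
                           cv::WARP_INVERSE_MAP | cv::INTER_LINEAR | cv::WARP_FILL_OUTLIERS,
                           cv::BORDER_REPLICATE);
    });
}
```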
I made a CUDA kernel that computes the Ms matrices for every keypoint; I haven't tested it yet, but I'm going to, and it is feasible. Then I have to extract the patches. I could do it with cv::cuda::warpAffine to offload the computation to the graphics card and use the data already stored there, but I checked the code, and it doesn't seem that cv::cuda::warpAffine expects the Ms matrix to already be stored on the GPU; rather, a creation and allocation is performed on every call.
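So through the public API the per-keypoint loop would look roughly like this (sketch; note that every Ms has to be a host-side cv::Mat, which is exactly the round trip I want to avoid):

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>
#include <vector>

// Per-keypoint crop through the public CUDA API. The image and the patches
// live on the GPU, but each 2x3 matrix is still passed from the host.
void cropPatchesGpu(const cv::cuda::GpuMat& image,
                    const std::vector<cv::Mat>& Ms,        // host-side 2x3 matrices
                    std::vector<cv::cuda::GpuMat>& patches,
                    cv::cuda::Stream& stream)
{
    patches.resize(Ms.size());
    for (size_t i = 0; i < Ms.size(); ++i)
        cv::cuda::warpAffine(image, patches[i], Ms[i], cv::Size(32, 32),
                             cv::WARP_INVERSE_MAP | cv::INTER_LINEAR,
                             cv::BORDER_REPLICATE, cv::Scalar(), stream);
}
```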
My question is the following:
Is there a way to call the warpAffine OpenCV CUDA kernel directly, with the parameters above, on multiple keypoints? Since the final patches are just 32x32, I don't think that would cause much trouble after all. I've looked at the .cu file and seen that there are two dispatchers, WarpDispatcherStream and WarpDispatcherNonStream, and there is also the kernel warp, but I cannot find anything regarding the declarations of B<work_type>, BorderReader and Filter. I found something in the cudev interface, but I don't know how to put everything together. Calling the proper kernel directly should do the trick for my use case: that way I should be able to put the images in shared memory and work directly on the keypoints, avoiding the function calls and any other overhead. A sketch of what I have in mind is below.
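To make the intent concrete, here's an untested sketch of the kind of batched kernel I mean (all names are mine, grayscale image assumed, no shared-memory staging yet): bilinear interpolation and the replicate border are written by hand instead of going through the internal Filter / BorderReader templates, and all the Ms matrices are already resident in GPU memory:

```cpp
#include <cuda_runtime.h>

// One 32x32 thread block per keypoint; all 2x3 matrices already on the GPU.
// Assumes a single-channel 8-bit image with `step` bytes per row.
__global__ void warpPatchesKernel(const unsigned char* img, int rows, int cols,
                                  size_t step,
                                  const float* Ms,        // N x 6: row-major 2x3 per keypoint
                                  unsigned char* patches) // N x 32 x 32, contiguous
{
    const int kp = blockIdx.x;   // one block per keypoint
    const int x  = threadIdx.x;  // 0..31
    const int y  = threadIdx.y;  // 0..31

    const float* M = Ms + 6 * kp;
    // Inverse mapping, as with WARP_INVERSE_MAP: patch (x, y) -> source coords.
    const float sx = M[0] * x + M[1] * y + M[2];
    const float sy = M[3] * x + M[4] * y + M[5];

    // BORDER_REPLICATE: clamp the four neighbours into the image.
    const int xf = static_cast<int>(floorf(sx));
    const int yf = static_cast<int>(floorf(sy));
    const int x0 = min(max(xf, 0), cols - 1);
    const int y0 = min(max(yf, 0), rows - 1);
    const int x1 = min(max(xf + 1, 0), cols - 1);
    const int y1 = min(max(yf + 1, 0), rows - 1);
    const float ax = sx - xf;    // bilinear weights
    const float ay = sy - yf;

    // INTER_LINEAR: blend the four clamped neighbours.
    const float v = (1.f - ax) * (1.f - ay) * img[y0 * step + x0]
                  +        ax  * (1.f - ay) * img[y0 * step + x1]
                  + (1.f - ax) *        ay  * img[y1 * step + x0]
                  +        ax  *        ay  * img[y1 * step + x1];

    patches[(kp * 32 + y) * 32 + x] = static_cast<unsigned char>(v + 0.5f);
}

// Host-side launch: one thread per patch pixel (32x32 = 1024 threads, the
// per-block limit on most GPUs), one block per keypoint.
void launchWarpPatches(const unsigned char* d_img, int rows, int cols, size_t step,
                       const float* d_Ms, unsigned char* d_patches,
                       int numKeypoints, cudaStream_t stream)
{
    const dim3 block(32, 32);
    const dim3 grid(numKeypoints);
    warpPatchesKernel<<<grid, block, 0, stream>>>(d_img, rows, cols, step,
                                                  d_Ms, d_patches);
}
```

The resulting patch buffer could then be handed to LibTorch as a CUDA tensor without ever leaving the device.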
Thank you in advance.