
[OPENCV GPU] How can I convert GpuMat and Vector<Point2f> using HostMem? [closed]

asked 2019-11-27 04:32:09 -0600


updated 2019-11-27 07:29:05 -0600

Hello,

I managed to convert from GpuMat to vector&lt;Point2f&gt; by following this post. Now I am trying to optimize the code. I found that to reduce the time spent in cudaMemcpy2D I have to pin the host buffer memory. In the following image you can see how cudaMemcpy2D uses a lot of resources at every frame:

[Nvidia Visual Profiler timeline: cudaMemcpy2D consumes a large share of each frame]

In order to pin the host memory, I found the class:

cv::cuda::HostMem

However, when I do:

void download(const cuda::GpuMat& d_mat, vector<Point2f>& vec)
{
    cv::Mat mat(1, vec.size(), CV_32FC2, (void*)&vec[0]);
    cv::cuda::HostMem h_mat(mat);
    d_mat.download(h_mat);
}

and then run the function, vec is filled with points at (0,0):

std::vector of length 193, capacity 193 = {{x = 0, y = 0}, {x = 0, y = 0}, ... , {x = 0, y = 0}, {x = 0, y = 0}}

Can anyone help me with this?

Thank you in advance.


Closed for the following reason duplicate question by JordiGC
close date 2019-12-02 06:17:37.881079

1 answer


answered 2019-11-27 07:31:53 -0600

updated 2019-11-27 07:47:20 -0600

Hi, can you share your code? You can use GpuMat with vector<Point2f> and HostMem in the following way:

cv::cuda::Stream stream;
Point2f p = Point2f(1, 2);
vector<Point2f> vec = { p, p };
cv::cuda::HostMem h_vec_src(vec);
cv::cuda::GpuMat d_vec;
d_vec.upload(h_vec_src,stream);
cv::cuda::HostMem h_dst;
d_vec.download(h_dst,stream);
/* sync to ensure the result in h_dst has been downloaded - in practice, if you sync directly after downloading you lose the benefit of using CUDA streams. */
stream.waitForCompletion();

after which the contents of h_dst are equal to vec.

If the result is zero now and wasn't before using HostMem, I would suspect that you are using streams without synchronizing, and/or synchronizing on the wrong stream.

Additionally, judging from the Nvidia Visual Profiler output, the maximum speed-up I would expect from using streams and an async pipeline is ~25%.


Comments

I am also using streams to make it asynchronous, but I am not using stream.waitForCompletion();. What is it for?

Can I treat the HostMem h_dst as the vector<Point2f> vec? d_mat is the output of cuda::createGoodFeaturesToTrackDetector(CV_8UC1, maxCorners, qualityLevel, minDistance, blockSize) and I need to access the points to calculate the centre of all of them; that's why I want to put them into vec.

JordiGC ( 2019-11-27 08:04:05 -0600 )

When you use streams, control returns immediately to the host. Therefore you need a way to know when work such as d_vec.download(h_dst,stream); (a MemCpy DtoH) has completed. One way is to poll with stream.queryIfComplete(); another is to force the host to wait until the device work has completed with stream.waitForCompletion();.

Put another way, if you call d_vec.download(h_dst,stream); you cannot be sure that the copy from d_vec to h_dst has completed until stream.queryIfComplete() returns true. Alternatively, you can call stream.waitForCompletion();, which guarantees that the host waits until stream.queryIfComplete() would return true. The downside is that you have just lost the advantage of an async pipeline.

cudawarped ( 2019-11-27 09:00:45 -0600 )
  1. Then, why use streams if we need to make sure they are done, thereby losing the advantage of an asynchronous pipeline? Is there a way not to lose the advantages of streams?

  2. How do you access the data of a HostMem variable to treat it as a vector<Point2f>?

JordiGC ( 2019-11-27 09:16:40 -0600 )

Mainly, streams allow you to overlap host and device computation. You need to sync before examining the result on the host. That said, you get to choose when to sync; ideally this would be when you know all the work on the device has finished.

In your case, if you examine the result immediately after downloading, this defeats the objective of using streams. Ideally you would call download, then proceed with some more work on the host before synchronizing.

All that said, I don't know exactly what you are doing because you haven't shared your code yet.

I wouldn't worry about converting HostMem to vector<Point2f> at the moment as you may not even benefit from streams.

cudawarped ( 2019-11-27 09:29:28 -0600 )

Okay. I will need to find a way to be able to use asynchronous calls.

I am preparing the code to share it with you.

JordiGC ( 2019-11-27 09:56:08 -0600 )

OK, have you checked out Accelerating OpenCV with CUDA streams in Python for an overview of stream usage?

Is a 25% reduction in execution time enough for your application?

cudawarped ( 2019-11-27 10:03:48 -0600 )

Yes. I've been following this post to optimize my code. I will need to look at how I can make my code follow the second statement of the Summary in that post.

JordiGC ( 2019-11-27 10:11:27 -0600 )

Please find the code I use at this link: Link to code

JordiGC ( 2019-11-29 02:35:02 -0600 )

I have had a very quick look and I may be wrong, but because the result of your CPU tracking determines the ROI you use for the next detection of key points, you will always need two sync points. If it were possible to perform the same tracking without the CPU, this would remove that requirement and allow you to run asynchronously with respect to the host.

That said, I think there are a few places where you can overlap host/device computation. For example, you should be able to overlap the frame decoding by placing it between the call to download the optical-flow points and the call to detect the next key points. frame_gray is not used immediately, so you may be able to move cuda::cvtColor(frame_rgb, frame_gray, COLOR_RGB2GRAY, 0, stream); to a later point, etc.
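A rough sketch of that reordering (the variable names and the optical-flow setup are placeholders based on the linked code, not the exact pipeline; initialisation of the first frame pair is elided):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudaoptflow.hpp>
#include <opencv2/videoio.hpp>

int main() {
    cv::VideoCapture cap("video.mp4");   // placeholder video source
    cv::cuda::Stream stream;
    cv::Mat frame_rgb;
    cv::cuda::GpuMat d_rgb, d_gray, d_prevGray, d_prevPts, d_nextPts, d_status;
    cv::cuda::HostMem h_nextPts;

    cv::Ptr<cv::cuda::SparsePyrLKOpticalFlow> lk =
        cv::cuda::SparsePyrLKOpticalFlow::create();

    // ... initialise d_prevGray, d_gray and d_prevPts from the first frames ...

    for (;;) {
        // 1. Enqueue the GPU work for the current frame pair.
        lk->calc(d_prevGray, d_gray, d_prevPts, d_nextPts, d_status,
                 cv::noArray(), stream);
        d_nextPts.download(h_nextPts, stream);   // async DtoH into pinned memory

        // 2. Overlap the CPU frame decode with the GPU work enqueued above.
        if (!cap.read(frame_rgb)) break;
        d_prevGray.swap(d_gray);                 // current frame becomes previous
        d_rgb.upload(frame_rgb, stream);
        cv::cuda::cvtColor(d_rgb, d_gray, cv::COLOR_RGB2GRAY, 0, stream);

        // 3. Sync only when the downloaded points are needed on the host.
        stream.waitForCompletion();
        cv::Mat pts = h_nextPts.createMatHeader();   // 1xN CV_32FC2
        // ... compute the centre of pts, update the ROI, and refresh
        //     d_prevPts for the next iteration ...
    }
    return 0;
}
```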

cudawarped ( 2019-11-29 11:17:03 -0600 )

So you recommend rewriting all the code that currently runs on the CPU in CUDA so it can run asynchronously with respect to the host? Do you have any link where I can get started quickly with CUDA? Thank you very much for all your time and help.

JordiGC ( 2019-12-02 02:17:03 -0600 )
