
Slow initial call in a batch of cuda sparse pyrLK optical flow operations.

asked 2015-10-06 10:08:50 -0600

dtmoodie

updated 2015-10-06 10:43:53 -0600

I'm writing a program that registers a frame against a previous set of frames by using optical flow to track key points. The keyframes are stored in a circular buffer, and optical flow is called starting with the oldest frame in the buffer and moving towards the newer ones. I'm doing this on a Windows 7 x64 machine with NVIDIA driver 353.90 on a GTX Titan X.

Because of the architecture of the program, there may be a delay between batches of operations while new images are loaded, etc. I.e. the stream queue would look like:

upload
opt flow (20 ms)
opt flow (1 ms)
opt flow (1 ms)
opt flow  (1 ms)
opt flow (1 ms)
download
upload
opt flow (20 ms)
opt flow  (1 ms)
opt flow (1 ms)
.........

I'm running all of this on a stream; however, for the sake of measuring time, I'm calling stream.waitForCompletion() after each operation. Ideally, once this is working correctly, I'll be able to take out all of the synchronization.
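For reference, the timed loop looks roughly like this (a simplified sketch only; the keyframe container and key point list are placeholders, and it assumes the OpenCV 3.x cv::cuda::SparsePyrLKOpticalFlow interface):

    #include <opencv2/core/cuda.hpp>
    #include <opencv2/cudaoptflow.hpp>
    #include <iostream>
    #include <vector>

    // Sketch only: keyframes and keyPoints (1xN CV_32FC2) are already on the GPU.
    void trackAgainstKeyframes(const std::vector<cv::cuda::GpuMat>& keyframes,
                               const cv::cuda::GpuMat& current,
                               const cv::cuda::GpuMat& keyPoints)
    {
        cv::Ptr<cv::cuda::SparsePyrLKOpticalFlow> lk =
            cv::cuda::SparsePyrLKOpticalFlow::create();
        cv::cuda::Stream stream;
        cv::cuda::GpuMat trackedPts, status;

        for (size_t i = 0; i < keyframes.size(); ++i)
        {
            int64 t0 = cv::getTickCount();
            lk->calc(keyframes[i], current, keyPoints, trackedPts, status,
                     cv::noArray(), stream);
            stream.waitForCompletion();  // only here so each call can be timed
            double ms = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();
            std::cout << "opt flow " << i << ": " << ms << " ms" << std::endl;
        }
    }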

I'm also aware that first launches should take longer while the driver compiles code. However, I was under the impression that this would only affect the very first launch, not the first launch in each batch of launches. Is there any way to reduce that 20 ms first call to optical flow to something more reasonable?
Should I set up two streams, so that the memory transfers are on one and the other is dedicated to optical flow?

[EDIT] I've tested whether it could be a WDDM driver queue issue similar to this one: https://devtalk.nvidia.com/default/to.... by manually flushing the queue with a cudaEventQuery on one of my events; however, this doesn't seem to do anything. If I remove the synchronization, the second call to optical flow costs 20 ms instead.
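For reference, the manual flush I tried is essentially the following (sketch; the raw cudaStream_t behind a cv::cuda::Stream can be obtained with cv::cuda::StreamAccessor::getStream):

    #include <cuda_runtime.h>

    // Recording an event on the stream and immediately querying it is the usual
    // trick to push the WDDM software command queue to submit its pending work
    // without blocking the CPU.
    void flushWddmQueue(cudaStream_t stream, cudaEvent_t event)
    {
        cudaEventRecord(event, stream);  // enqueue a marker on the stream
        cudaEventQuery(event);           // non-blocking; nudges the driver to flush
    }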


1 answer


answered 2016-11-10 15:48:04 -0600

jaybo_nomad

I'd guess you're allocating a GpuMat at the start of the loop and then reusing it in subsequent iterations. I've found that a critical optimization for CUDA operations is to preallocate all GpuMats and never allocate them on the stack. Similarly, don't resize a GpuMat once it has been allocated. CUDA itself is blindingly fast, but GPU memory allocation is not.
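For example, something along these lines (a sketch only; sizes, types, and names are illustrative):

    #include <opencv2/core/cuda.hpp>

    // Allocate every GpuMat once, at its final size and type, and keep reusing
    // it, so no cudaMalloc/cudaFree ever happens inside the timed loop.
    struct FlowBuffers
    {
        FlowBuffers(cv::Size frameSize, int maxPoints)
        {
            frame.create(frameSize, CV_8UC1);        // reused upload target
            prevPts.create(1, maxPoints, CV_32FC2);  // key point positions
            nextPts.create(1, maxPoints, CV_32FC2);  // tracked positions
            status.create(1, maxPoints, CV_8UC1);    // per-point tracking status
        }

        cv::cuda::GpuMat frame, prevPts, nextPts, status;
    };

Passing the preallocated outputs into the optical flow call then just reuses them instead of reallocating device memory every iteration.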

