Whilst I don't have the same hardware or implementation, the CUDA implementation (here) of the OpenCV CPU example was around 50% faster on the small-resolution video on my hardware (CPU: i7-8700, mobile GPU: RTX 2080). I would expect the performance increase to be greater on larger video, but this will depend on how the algorithm is implemented. If you are timing the same function calls with high-resolution timers on the Jetson and the CUDA implementation is 2-3 times slower, then I would guess the best you can do with any implementation is to match the CPU performance on the GPU.
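To make sure you are comparing like for like, the sketch below shows roughly how I would time just the `calc()` calls in Python with `cv2.TickMeter`. It is a minimal sketch, not a drop-in benchmark: the frame file names and feature parameters are placeholders, and it assumes an OpenCV build with the CUDA modules enabled.

```python
import cv2

# Placeholder inputs - substitute two consecutive grayscale frames from your video.
prev_frame = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
pts = cv2.goodFeaturesToTrack(prev_frame, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)

tm = cv2.TickMeter()

# CPU version.
tm.start()
cpu_pts, cpu_status, cpu_err = cv2.calcOpticalFlowPyrLK(
    prev_frame, next_frame, pts, None)
tm.stop()
print(f'CPU calc: {tm.getTimeMilli():.2f} ms')

# GPU version - upload outside the timed region so only calc() is measured.
# The sparse CUDA LK expects the points as a 1xN 2-channel float32 matrix.
gpu_prev = cv2.cuda_GpuMat(); gpu_prev.upload(prev_frame)
gpu_next = cv2.cuda_GpuMat(); gpu_next.upload(next_frame)
gpu_pts = cv2.cuda_GpuMat(); gpu_pts.upload(pts.reshape(1, -1, 2))
lk = cv2.cuda.SparsePyrLKOpticalFlow_create()

lk.calc(gpu_prev, gpu_next, gpu_pts, None)  # warm-up: first call pays init cost

tm.reset()
tm.start()
gpu_next_pts, gpu_status, gpu_err = lk.calc(gpu_prev, gpu_next, gpu_pts, None)
tm.stop()
print(f'GPU calc: {tm.getTimeMilli():.2f} ms')
```

Note the warm-up call: the first CUDA invocation includes context creation and module loading, so timing it would unfairly penalize the GPU.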
Regarding your other questions:
1. If you are timing the execution of SparsePyrLKOpticalFlow.calc() only, then processing a video won't make any difference. If you are timing the execution of the entire pipeline and using CUDA streams, then this could make a difference depending on the implementation of SparsePyrLKOpticalFlow.calc() (e.g. if the algorithm is iterative, with each iteration depending on the previous one, then there will most likely be fixed device sync points which stall execution on every iteration even if you use CUDA streams).
2. If the calculations inside SparsePyrLKOpticalFlow.calc() can be placed in separate streams without any forced synchronization (as described above), then it is possible to overlap host (CPU) and device (GPU) computation to avoid performance degradation when the host relies on output from the device (see the sketch after this list).
3. If the above regarding CUDA streams is confusing, then Accelerating OpenCV with CUDA streams in Python may be useful. Although the example is in Python, the concepts, apart from the pre-allocation of arrays, remain the same.
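To illustrate point 2, here is a rough sketch of what overlapping host and device work with a stream can look like in Python. It assumes the same frame/point setup as the timing sketch above; also, whether `calc()` accepts the stream as a keyword argument may depend on your OpenCV version, so treat this as an outline rather than a verified snippet.

```python
import cv2

prev_frame = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
pts = cv2.goodFeaturesToTrack(prev_frame, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)

stream = cv2.cuda.Stream()
lk = cv2.cuda.SparsePyrLKOpticalFlow_create()

gpu_prev = cv2.cuda_GpuMat()
gpu_next = cv2.cuda_GpuMat()
gpu_pts = cv2.cuda_GpuMat()
gpu_prev.upload(prev_frame, stream)   # copies are queued on the stream
gpu_next.upload(next_frame, stream)
gpu_pts.upload(pts.reshape(1, -1, 2), stream)

# Queue the flow calculation on the same stream; if the internal
# implementation has fixed sync points, this call may still block.
gpu_next_pts, gpu_status, gpu_err = lk.calc(
    gpu_prev, gpu_next, gpu_pts, None, stream=stream)

# ... independent CPU work can run here while the GPU is busy ...

stream.waitForCompletion()            # block until the queued work finishes
next_pts = gpu_next_pts.download()
```

Bear in mind that uploads from ordinary numpy arrays are not truly asynchronous; the pre-allocation with pinned memory discussed in the linked article is what allows the host-to-device copies themselves to overlap.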