How to estimate CUDA speedup in advance?

I wanted to investigate the performance gain that can be achieved with CUDA under 32-bit Windows, to find out whether CUDA is worth the effort in general and on my current laptop in particular.

So I used demo_performance.exe from OpenCV-2.4.0-GPU-demos-pack-win32.exe to compare CPU and GPU calculations and got the following results here, with an average GPU speedup of x6.144. Then I realized that, although the willowgarage download page still suggests 2.4.0, there is a 2.4.3 version of the GPU demos pack. The same test with this new demo gives the following results, with an average GPU speedup of x4.584. The two numbers differ significantly. The tests themselves are different, and because the test types and image sizes have changed I would say the results are hard to compare. For the sake of comparison it would be good if the tests matched from version to version. At this point I do not know whether the difference in average speedup is explained by the differences in the tests or by the differences between the OpenCV versions.

After configuring OpenCV-2.4.3 with CUDA, I compiled demo_performance with Visual Studio 2010 Express. Using the 2.4.0 source of the GPU demo I got the following results, with an average GPU speedup of x3.728. Using the 2.4.3 source of the GPU demo I got the following results, with an average GPU speedup of x4.151. I know that the gemm tests are missing because of CUBLAS, so these averages are a bit misleading.

However, I would say that the two self-compiled tests are in the same range. If I assume that the gemm test would raise the self-compiled averages (it shows more than x30 speedup in the precompiled cases), they would be in the same range as the precompiled 2.4.3 test. The precompiled 2.4.0 demo, with its x6 speedup, is still well away from them.
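A rough sketch of the arithmetic behind that assumption, in Python. It assumes the reported average is a plain arithmetic mean over the tests. The function name `average_with_extra_test` and the test count `n_tests = 20` are hypothetical placeholders, since the real number of tests in the demo is not stated here; only the two averages and the x30 figure come from the numbers above.

```python
def average_with_extra_test(current_avg, n_tests, extra_speedup):
    """Arithmetic mean after adding one more result to n_tests existing ones."""
    return (current_avg * n_tests + extra_speedup) / (n_tests + 1)


# Hypothetical placeholder: the real number of tests in the demo is unknown here.
n_tests = 20

# Averages of the two self-compiled runs, as reported above.
for label, avg in [("2.4.0 source", 3.728), ("2.4.3 source", 4.151)]:
    adjusted = average_with_extra_test(avg, n_tests, 30.0)
    print(f"{label}: x{avg:.3f} -> x{adjusted:.3f} with one x30 gemm test added")
```

How far a single x30 result moves the average depends strongly on the number of tests, so the real count would have to be substituted before drawing any conclusion.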

If my argument is valid, what is the cause of the differences? Is it the new version? Is it that I compiled with different options? Both?

In general, how can I estimate whether, in a given case, it is worth implementing a CUDA version instead of a CPU-based algorithm, before actually writing the code both with and without CUDA? Or, from another perspective, should I choose based on the specific algorithm? For example, should I avoid the GPU Laplacian but prefer the GPU resize, since the speedup is below 1 for the first and above 1 for the second?
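The per-algorithm choice described above could be sketched like this in Python. The helper names `speedup` and `prefer_gpu` are hypothetical, and the timings are made-up placeholders, not measured results; they only mirror the pattern of Laplacian being below 1 and resize above 1.

```python
def speedup(cpu_ms, gpu_ms):
    """Speedup factor: values above 1 mean the GPU run was faster."""
    return cpu_ms / gpu_ms


def prefer_gpu(cpu_ms, gpu_ms, threshold=1.0):
    """True when the measured speedup exceeds the chosen threshold."""
    return speedup(cpu_ms, gpu_ms) > threshold


# Hypothetical (cpu_ms, gpu_ms) timings; replace with real per-test measurements.
timings = {"Laplacian": (2.0, 3.0), "resize": (8.0, 2.0)}

for name, (cpu_ms, gpu_ms) in timings.items():
    backend = "GPU" if prefer_gpu(cpu_ms, gpu_ms) else "CPU"
    print(f"{name}: x{speedup(cpu_ms, gpu_ms):.2f} -> prefer {backend}")
```

For such a comparison to be fair, the GPU timing would presumably need to include any upload and download of the image data that the real pipeline requires, and a threshold somewhat above 1 may be reasonable to justify the extra implementation effort.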