Hello all,
Prereqs for posting, my environment: Linux x86_64, OpenCV 2.4.6.1, CUDA 5.0, Tesla K20c (Kepler) GPU.
I've got a simple C++ application to benchmark CUDA performance. It makes and times the following calls once each, in order:
cudaSetDevice(0);
cudaMalloc(&someMemory, sizeof(float)*1024*1024);
cudaFree(someMemory);
cudaDeviceReset();
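In case it helps, here's roughly the whole program (a minimal sketch of what I'm running; the now() helper and the printf labels are just shorthand for this post, and error checking is omitted):

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

// Wall-clock time in seconds.
static double now()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
    float *someMemory = 0;
    double t;

    t = now(); cudaSetDevice(0);
    printf("cudaSetDevice:   %.3f s\n", now() - t);

    t = now(); cudaMalloc(&someMemory, sizeof(float) * 1024 * 1024);
    printf("cudaMalloc:      %.3f s\n", now() - t);

    t = now(); cudaFree(someMemory);
    printf("cudaFree:        %.3f s\n", now() - t);

    t = now(); cudaDeviceReset();
    printf("cudaDeviceReset: %.3f s\n", now() - t);

    return 0;
}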
With just the CUDA libraries linked, each call takes tens of milliseconds, except for the malloc, which takes about 0.25 seconds. Fine... no big deal, it's all part of the GPU startup cost.
Here's the weird part - if I add libopencv_gpu.so and libopencv_core.so to the linker list (-lopencv_gpu -lopencv_core), without changing the code at all, those timings go through the roof. The cudaSetDevice call takes ~2.5 seconds and the malloc takes ~5 seconds. Calls after that seem just as fast as before, but a ~7.5 second startup cost is ridiculous considering it's only ~0.5 seconds without the OpenCV libraries.
Another oddity: dropping libopencv_gpu and leaving just the core library still has an effect - the cudaSetDevice call still takes ~2.5 seconds, and the malloc takes ~0.7 seconds. What gives?
This affects more than my benchmark app, and it is repeatable. Does anyone have any insight into why OpenCV is destroying my startup performance? I tried setting CUDA_DEVCODE_PATH to /tmp/devcode, thinking the cost was PTX compilation, but nothing was created in that directory - am I using it wrong?
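For reference, this is roughly how I'm checking for JIT activity (a sketch; the setenv() call here just stands in for exporting the variable in the shell, on the assumption that the runtime reads it when the first CUDA call initializes the context):

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    // Point the device-code cache at a known directory before any CUDA call.
    // Assumption: the runtime reads this variable during context creation.
    setenv("CUDA_DEVCODE_PATH", "/tmp/devcode", 1);

    cudaSetDevice(0);      // if PTX is being JIT-compiled, I'd expect
    cudaDeviceReset();     // files to appear under /tmp/devcode
    return 0;
}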
Any help would be great. Thanks!