3

GPU vs CPU end to end latency for dynamic image resizing

asked 2018-02-22 12:29:57 -0600

gegupta

updated 2018-02-28 15:20:54 -0600

I have benchmarked throughput with OpenCV and ImageMagick, and I am not finding the GPU to be much faster than the CPU. Our use case on the site is to resize a master copy dynamically to whatever size a service call requests, and I am trying to evaluate whether a GPU makes sense for resizing per service call.

Below is the code I wrote for OpenCV. I run the following function serially over all the images stored in a folder, and I run N such processes in parallel to achieve X image resizes. I want to understand whether my evaluation approach is incorrect or whether this use case simply doesn't fit typical GPU workloads, and what exactly might be limiting GPU performance. GPU utilization does not get anywhere close to 100%.

resizeGPU.cpp: {

    cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator (cv::cuda::HostMem::AllocType::PAGE_LOCKED));

    auto t_start = std::chrono::high_resolution_clock::now();
    Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
    auto t_end_read = std::chrono::high_resolution_clock::now();
    if(!src.data){
            std::cout<<"Image Not Found: "<< input_file << std::endl;
            return;
    }

    cuda::GpuMat d_src;
    d_src.upload(src,stream);
    auto t_end_h2d = std::chrono::high_resolution_clock::now();
    cuda::GpuMat d_dst;

    cuda::resize(d_src, d_dst, Size(400, 400),0,0, CV_INTER_AREA,stream);
    auto t_end_resize = std::chrono::high_resolution_clock::now();

    Mat dst;
    d_dst.download(dst,stream);
    auto t_end_d2h = std::chrono::high_resolution_clock::now();
    std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",host2device,"<<std::chrono::duration<double, std::milli>(t_end_h2d-t_end_read).count()
                            <<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_h2d).count()
                            <<",device2host,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_end_resize).count()
            <<",total,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_start).count()<<endl;


}

Calling function:

    cv::cuda::Stream stream;
    std::string dir_path="/home/gegupta/GPUvsCPU/";

    const auto& directory_path = dir_path;
    const auto& files = GetDirectoryFiles(directory_path);
    for (const auto& file : files) {
        std::string full_path = dir_path + file;
        processUsingOpenCvGpu(full_path,stream);
    }

resizeCPU.cpp:

    auto t_start = std::chrono::high_resolution_clock::now();
    Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
    auto t_end_read = std::chrono::high_resolution_clock::now();
    if(!src.data){
            std::cout<<"Image Not Found: "<< input_file << std::endl;
            return;
    }

    Mat dst;
    resize(src, dst, Size(400, 400),0,0, CV_INTER_AREA);
    auto t_end_resize = std::chrono::high_resolution_clock::now();

    std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_read).count()
        <<",total,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_start).count()<<endl;

Compiling: g++ -std=c++11 resizeCPU.cpp -o resizeCPU `pkg-config --cflags --libs opencv`

I run each program N times, controlled by the following script, runMultipleGPU.sh:

#!/bin/bash
echo $1
START=1
END=$1
for (( c=$START; c<=$END; c++ ))
do
./resizeGPU "$c" &
done
wait
echo All done

Run : ./runMultipleGPU.sh <number_of_such_processes>

The timers in the code above lead to the following aggregate data (times in milliseconds, since the timers print std::milli durations):

No_processes    resizeCPU   resizeGPU   memcpyGPU   totalresizeGPU
1                 1.51        0.55        2.13         2.68
10                5.67        0.37        2.43         2.80
15                6.35        2.30       12.45        14.75
20                6.30        2.05       10.56        12.61
30                8.09        4.57       23.97 ...
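A minimal sketch of how the per-image timer lines could be reduced to averages like the ones in the table. It assumes the output of each process has been collected as text lines in the `label,value,label,value,...` format printed by the code above; the `averageTimings` helper and the `Stat` struct are hypothetical names, not part of the original programs.

```cpp
#include <exception>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Running sum and sample count for one timer label (hypothetical helper type).
struct Stat {
    double sum = 0.0;
    int count = 0;
};

// Average every labelled timing found in lines such as
// "read,1.2,resize,0.5,total,1.7". Lines without label/value pairs
// (for example "Image Not Found: ...") contribute nothing.
std::map<std::string, double> averageTimings(const std::vector<std::string>& lines) {
    std::map<std::string, Stat> stats;
    for (const auto& line : lines) {
        std::istringstream in(line);
        std::string label, value;
        while (std::getline(in, label, ',') && std::getline(in, value, ',')) {
            double ms = 0.0;
            try {
                ms = std::stod(value);  // parse before touching the map
            } catch (const std::exception&) {
                continue;               // skip malformed values
            }
            Stat& s = stats[label];
            s.sum += ms;
            s.count += 1;
        }
    }
    std::map<std::string, double> averages;
    for (const auto& entry : stats) {
        averages[entry.first] = entry.second.sum / entry.second.count;
    }
    return averages;
}
```

Averaging per label keeps the read, transfer, and resize stages separate, which makes it easier to see which stage grows as the number of processes increases.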

1 answer

3

answered 2018-02-26 04:34:08 -0600

Hi, whilst I don't have results for interpolation type CV_INTER_AREA (I only have results from the standard performance tests, which don't include it), I can highlight a few things to be aware of.

  1. The CPU resize operations are fast because they are highly optimized using Intel IPP.
  2. It may be obvious, but the performance gain will depend on which CPU and GPU you are using and how they compare in performance to each other.
  3. The performance increase from the CPU to the GPU is highly dependent on the image type, interpolation method, and scale, and to a lesser extent on the image size. Below is an image showing the speedup in the resize operation when going from an i5_6500 to a 1060 (less than half as powerful as a single P100). The original image size is 1080p and the scale is 0.5. As you can see, there is a speed increase of 6.27 when resizing a 1080p image by 0.5 using INTER_LINEAR for CV_32F BGR images; however, there is no speedup at all for CV_8U gray images.
  4. You are using streams without any synchronization, so I don't think your timing code is working as you expect. I would try timing without streams to start with, or include cudaDeviceSynchronize() after d_src.upload(src,stream); and cuda::resize(d_src, d_dst, Size(400, 400),0,0, CV_INTER_AREA,stream); to ensure that you are only timing the resize operation.

I hope this points you in the right direction.


Comments

Thanks, cudawarped! I edited my post to include the code that calls the resize method. The input parameters are essentially input_file and the Stream, which is created only once; we then go through each file in a directory in a for loop, making it serial per process, so I thought I didn't need synchronization. I parallelize by creating multiple processes that do exactly the same thing, each with its own stream. What your benchmark shows is the speedup of the individual operation, which, as I mentioned above, is better than the CPU anyway. But overall there should be enough parallelism between CPU and GPU to hide the cost of transfers, hence this post asking how best to evaluate it. What I think I didn't consider overall is that multiple contexts compete for GPU resources on a first-come, first-served basis.

gegupta ( 2018-02-27 18:11:15 -0600 )
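
The gap between the speedup of the resize operation alone and the end-to-end speedup can be made concrete with the single-process row of the table in the question (resizeCPU 1.51, resizeGPU 0.55, memcpyGPU 2.13). The sketch below is only arithmetic on those reported numbers; the `Timings` struct and both function names are hypothetical, image read time is left out on the assumption that it is the same for both programs, and the GPU figures come from timers without synchronization, so they may not be accurate.

```cpp
// Per-image timings in milliseconds (hypothetical container type).
struct Timings {
    double cpuResizeMs;
    double gpuResizeMs;
    double gpuMemcpyMs;  // host-to-device plus device-to-host
};

// Speedup of the resize operation alone.
double kernelSpeedup(const Timings& t) {
    return t.cpuResizeMs / t.gpuResizeMs;
}

// Speedup once the transfers are counted against the GPU path.
double endToEndSpeedup(const Timings& t) {
    return t.cpuResizeMs / (t.gpuResizeMs + t.gpuMemcpyMs);
}
```

With `Timings{1.51, 0.55, 2.13}` the resize alone is about 2.7 times faster on the GPU, while the end-to-end figure is about 0.56, meaning the GPU path is slower once the unhidden transfers are included.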
1

Hi, I am not sure I understand correctly. Do the times you present include synchronization? If not, how do you know the GPU is faster than the CPU for your use case? Your call to std::chrono::high_resolution_clock::now(); could happen before your CUDA resize kernel has even started running.

Additionally, are you expecting to be able to run multiple images at once on the GPU in the same way as you can on a CPU (one per core, for example)? This is not possible; each resize operation will be called sequentially. The GPU resize operation is data parallel, not task parallel.

cudawarped ( 2018-02-28 08:52:10 -0600 )

I was under the impression that OpenCV would manage that for me, since it is an abstraction over CUDA, and that the user wouldn't need to manage synchronization. Also, many code samples I came across didn't have synchronization. Yes, after further reading and profiling I understood that only one resize can happen at a time, no matter how many images you throw at it. I am able to achieve parallelism between memcpy and resize, or between memcpys among the various processes, but since only one kernel can run at a time, I still need to wait on the GPU before I return to the CPU. So a lot of blocking keeps me from achieving that performance overall.

gegupta ( 2018-02-28 09:28:40 -0600 )
1

Hi, it is not that you need synchronization; in fact, the code you have is perfect for hiding the memcopy latency during kernel calls. Using streams is also perfect for getting the most out of the GPU, because you can queue up several kernels and let the GPU run them one after the other. In all senses, the code you have presented is perfect for achieving your objective of getting the most out of the GPU. That said, you need to time the operations with synchronization for the results to make sense, and for you to know whether the resize operation is quicker on the GPU and how long the memory operations take compared to the resize kernel. The results show the memcopy taking longer, but it might not be; you cannot tell unless you time it correctly.

cudawarped ( 2018-02-28 10:47:08 -0600 )
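
As a rough illustration of what hiding the memcopy latency can buy, the sketch below compares a fully serial schedule with an idealized two-stage pipeline in which the copy for one image overlaps the kernel for another. The model (steady-state cost per image equal to the slower of the two stages) is a simplifying assumption, not a measurement, and both function names are hypothetical.

```cpp
#include <algorithm>

// Total time when every image is copied and resized with no overlap.
double serialTotalMs(int images, double copyMs, double kernelMs) {
    return images * (copyMs + kernelMs);
}

// Idealized total when copies and kernels overlap perfectly: the first
// image pays both stages, each later image only the slower stage.
double overlappedTotalMs(int images, double copyMs, double kernelMs) {
    if (images <= 0) {
        return 0.0;
    }
    return copyMs + kernelMs + (images - 1) * std::max(copyMs, kernelMs);
}
```

Taking the single-process figures from the question (copy 2.13 ms, resize 0.55 ms) for 10 images, the serial schedule costs 26.8 ms and the idealized overlapped one 21.85 ms. Under this model the transfer, not the kernel, sets the floor whenever the copy is the slower stage.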

Thanks for that validation, I definitely needed it :) If it is not too much to ask, could you edit one part above to show me what synchronization would look like? I can then modify the rest and re-evaluate. Since I have the same timers in the CPU version, I assumed the numbers would be comparable. The end user cares about the perceived latency rather than how long the actual kernel run takes on the GPU.

gegupta ( 2018-02-28 11:25:24 -0600 )

I would try timing without streams to start with (just don't pass the stream argument), or include cudaDeviceSynchronize() after d_src.upload(src,stream);, cuda::resize(d_src, d_dst, Size(400, 400),0,0, CV_INTER_AREA,stream); and d_dst.download(dst,stream); to ensure that you are timing the resize and memcopy operations correctly. This may not change the conclusion, but it should give you the correct answer.

cudawarped ( 2018-02-28 12:30:54 -0600 )
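
The pattern described here can be sketched without any GPU at all: read the clock only after the asynchronous work has been waited on. In the sketch below, `timeStageMs` is a hypothetical helper and `FakeStream` is a hypothetical stand-in that simulates an asynchronous stream with a background thread; neither is an OpenCV or CUDA API.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Time one stage in milliseconds: enqueue the work, wait for it to
// finish, and only then read the clock again.
template <typename Enqueue, typename Sync>
double timeStageMs(Enqueue&& enqueue, Sync&& sync) {
    const auto start = std::chrono::steady_clock::now();
    enqueue();  // returns immediately for asynchronous work
    sync();     // blocks until the work has actually completed
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Hypothetical stand-in for an asynchronous stream: work runs on
// another thread while the caller continues.
struct FakeStream {
    std::future<void> pending;

    void enqueueSleep(int ms) {
        pending = std::async(std::launch::async, [ms] {
            std::this_thread::sleep_for(std::chrono::milliseconds(ms));
        });
    }

    void waitForCompletion() {
        if (pending.valid()) {
            pending.get();
        }
    }
};
```

Applied to the code in the question, the upload stage would be timed along the lines of `timeStageMs([&]{ d_src.upload(src, stream); }, [&]{ stream.waitForCompletion(); })`, and likewise for the resize and download calls. `cv::cuda::Stream::waitForCompletion()` waits on that one stream only, whereas `cudaDeviceSynchronize()` waits for the whole device. Passing an empty `sync` reproduces the original measurement, which mostly captures the cost of enqueueing the work.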


Stats

Asked: 2018-02-22 12:29:57 -0600

Seen: 7,791 times

Last updated: Feb 28 '18