
OpenCV optimisation - multiplying a set of matrices by a set of scalars and summing.

asked 2016-04-01 05:50:16 -0600

Goz

updated 2016-04-01 09:36:50 -0600

I have an optimisation problem and I'm wondering what the best way to approach the problem is.

At present I have a set of matrices (mats) that I need to scale by a set of values held in a vector and then sum together. I have written the following code, but it seems to be painfully slow (far more so than I would have thought).

cv::Mat sum = cv::Mat::zeros( mats[0].rows, mats[0].cols, cvType );
for( int m = 0; m < mats.size(); m++ )
{
     const Type val = rowVec.at< Type >( m );
     sum += val * mats[m];
}

Can anyone suggest a faster way of doing the above loop?

Edit:

I wrote a little function to try to improve performance:

template< typename Type >
void ScaleMatAndSum( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    for( int r = 0; r < mats.rows; r++ )
    {
        for( int c = 0; c < mats.cols; c++ )
        {
            scale.at< Type >( r, c )    += mats.at< Type >( r, c ) * val;
        }
    }
}

This is about 8 times faster when multi-threaded than the original code. Can anyone explain what is going on?


2 answers


answered 2016-04-01 20:31:02 -0600

Tetragramm

updated 2016-04-01 20:38:55 -0600

Try using the scaleAdd function. It has SIMD optimizations built in.

cv::Mat sum = cv::Mat::zeros( mats[0].rows, mats[0].cols, cvType );
for( int m = 0; m < mats.size(); m++ )
{
     const Type val = rowVec.at< Type >( m );
     cv::scaleAdd(mats[m], val, sum, sum);
}
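For anyone unsure what scaleAdd does per element: cv::scaleAdd(src1, alpha, src2, dst) computes dst = alpha * src1 + src2, so passing sum as both src2 and dst accumulates in place. Here is a rough plain-C++ sketch of that same operation, with std::vector standing in for cv::Mat. The names scale_add_ref and weighted_sum are made up for illustration, and this has none of OpenCV's SIMD paths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version of the element-wise operation: dst[i] = alpha * src1[i] + src2[i].
// src2 and dst may be the same vector, which gives in-place accumulation.
void scale_add_ref(const std::vector<float>& src1, float alpha,
                   const std::vector<float>& src2, std::vector<float>& dst)
{
    assert(src1.size() == src2.size());
    dst.resize(src1.size());
    for (std::size_t i = 0; i < src1.size(); ++i)
        dst[i] = alpha * src1[i] + src2[i];
}

// Weighted sum of several equally sized "matrices" (flattened to vectors),
// mirroring the loop in the answer: one scale-and-add per input.
std::vector<float> weighted_sum(const std::vector<std::vector<float>>& mats,
                                const std::vector<float>& weights)
{
    assert(!mats.empty() && mats.size() == weights.size());
    std::vector<float> sum(mats[0].size(), 0.0f);
    for (std::size_t m = 0; m < mats.size(); ++m)
        scale_add_ref(mats[m], weights[m], sum, sum);
    return sum;
}
```

Calling weighted_sum on two small vectors with two weights returns one vector of the same length holding the weighted total.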

Ok, ran the benchmarks.

Original method: 6.17536 s

ScaleAdd method: 2.76857 s

That's about a 55% reduction in run time (roughly 2.2 times faster), whereas the best result in the other answer was about a 46% reduction.
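Percentages like these can mean either "time saved" or "how many times faster", so it helps to be explicit about which one is quoted. A tiny sketch of both calculations (the helper names are hypothetical):

```cpp
#include <cassert>

// Share of the original run time that was saved, in percent.
double percent_time_saved(double t_before, double t_after)
{
    assert(t_before > 0.0);
    return 100.0 * (t_before - t_after) / t_before;
}

// Ratio of old run time to new run time ("N times faster").
double speedup_factor(double t_before, double t_after)
{
    assert(t_after > 0.0);
    return t_before / t_after;
}
```

With the timings above (6.17536 s and 2.76857 s) this gives roughly 55% time saved, or a factor of about 2.2.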


answered 2016-04-01 20:21:02 -0600

Eduardo

updated 2016-04-01 20:29:42 -0600

The following are my guesses as a non-expert.

I think that in the first case, the product val * mats[m] produces a temporary cv::Mat of the full matrix size, whereas mats.at< Type >( r, c ) * val needs only a temporary value of type Type.
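To make that guess concrete, here is a rough plain-C++ sketch of the two patterns, with std::vector standing in for cv::Mat. It assumes the matrix expression really is materialised into a temporary before the add, which I have not verified in the OpenCV sources, and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pattern 1: like `sum += val * mat` if the product is built first.
// Allocates a full-size temporary and walks the data twice.
void accumulate_with_temporary(std::vector<float>& sum,
                               const std::vector<float>& mat, float val)
{
    assert(sum.size() == mat.size());
    std::vector<float> tmp(mat.size());           // extra allocation on every call
    for (std::size_t i = 0; i < mat.size(); ++i)  // pass 1: tmp = val * mat
        tmp[i] = val * mat[i];
    for (std::size_t i = 0; i < sum.size(); ++i)  // pass 2: sum += tmp
        sum[i] += tmp[i];
}

// Pattern 2: like the element-wise loop. One pass, and only a scalar
// temporary per element.
void accumulate_in_place(std::vector<float>& sum,
                         const std::vector<float>& mat, float val)
{
    assert(sum.size() == mat.size());
    for (std::size_t i = 0; i < mat.size(); ++i)
        sum[i] += val * mat[i];
}
```

Both give the same result; the second simply avoids the allocation and the extra trip through memory, which matters more as the matrices grow.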

What library did you use for multithreading your second code? Is it 8 times faster when you compare the second code multi-threaded against your first code multi-threaded, or against your first code single-threaded? Also, what is the size of your matrices? Some optimizations are only effective on large data.


Here are some comparisons / attempts to improve the performance.

Results on my computer (50 iterations, 100 matrices of size 1000x800):

Original method: sum1=[5.0812e+009, 0, 0, 0] ; t1=8.87578 s
Matrix method: sum2=[5.0812e+009, 0, 0, 0] ; t2=17.1907 s
Pointer access: sum3=[5.0812e+009, 0, 0, 0] ; t3=5.40258 s
Pointer access + unroll loop: sum4=[5.0812e+009, 0, 0, 0] ; t4=5.24404 s
Pointer access + unroll loop + parallel_for_: sum5=[5.0812e+009, 0, 0, 0] ; t5=4.79474 s

The best improvement seems to come from switching from multiplying the whole matrix by a scalar to iterating over the elements and performing the multiplication directly on them (17.1907 s vs 8.87578 s).

Switching to pointer access also gives a reasonable improvement (8.87578 s vs 5.40258 s).

The code I used for benchmarking:

#include <opencv2/opencv.hpp>

//Original method
template< typename Type >
void ScaleMatAndSum( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    for( int r = 0; r < mats.rows; r++ )
    {
        for( int c = 0; c < mats.cols; c++ )
        {
            scale.at< Type >( r, c ) += mats.at< Type >( r, c ) * val;
        }
    }
}

//Pointer access
template< typename Type >
void ScaleMatAndSum_ptr( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    Type *ptr_scale_rows;
    const Type *ptr_mat_rows;
    for( int r = 0; r < mats.rows; r++ )
    {
        ptr_scale_rows = scale.ptr<Type>(r);
        ptr_mat_rows = mats.ptr<Type>(r);

        for( int c = 0; c < mats.cols; c++ )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
        }
    }
}

//Pointer access + unroll loop
template< typename Type >
void ScaleMatAndSum_ptr_unroll_loop( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    Type *ptr_scale_rows;
    const Type *ptr_mat_rows;
    for( int r = 0; r < mats.rows; r++ )
    {
        ptr_scale_rows = scale.ptr<Type>(r);
        ptr_mat_rows = mats.ptr<Type>(r);

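        int c = 0;
        // Unrolled by 4; stop before a partial group so we never run past the row.
        for( ; c + 3 < mats.cols; c += 4 )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
            ptr_scale_rows[c+1] += ptr_mat_rows[c+1] * val;
            ptr_scale_rows[c+2] += ptr_mat_rows[c+2] * val;
            ptr_scale_rows[c+3] += ptr_mat_rows[c+3] * val;
        }
        // Handle the remaining 0-3 columns when cols is not a multiple of 4.
        for( ; c < mats.cols; c++ )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
        }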
    }
}

//Pointer access + unroll loop + ParallelLoopBody
template <class Type>
class Parallel_ScaleAndSum: public cv::ParallelLoopBody
{
private:
    cv::Mat m_mat;
    Type m_mul;
    cv::Mat *m_result;

public:
  Parallel_ScaleAndSum(cv::Mat *result, const cv::Mat &mat, const Type &mul)
      : m_mat ...
