Matrix multiplication without memory allocation
Is it possible to speed up the overloaded matrix multiplication operator (*) in OpenCV by passing a preallocated cv::Mat instance with the correct dimensions as the destination into which the result is written?
Something like the existing function:
CV_EXPORTS_W void gemm(InputArray src1, InputArray src2, double alpha,
InputArray src3, double beta, OutputArray dst, int flags = 0);
only simpler. I would like to have something like this:
CV_EXPORTS_W void matmul(InputArray src1, InputArray src2, OutputArray dst);
My concern is performance. Is it possible that
res = m1 * m2;
is as fast as a call to the hypothetical function:
matmul(m1, m2, res);
?