Hello,
I've implemented a parallel image descriptor extractor using OpenCV 3.0's cv::parallel_for_ and the BRISK descriptor extractor.
What I do is split the image into horizontal stripes and compute the descriptors of each stripe in parallel. This is the outer call:
_extractor.init(img, featureCount);
cv::parallel_for_(cv::Range(0, _processorCount), _extractor); // _processorCount == 4
_extractor.buildFinal(keyPoints, features);
Then, in each thread's body, I call _featureExtractor->detectAndCompute(_image, keyPoints, mask); with a mask properly initialized to that thread's horizontal stripe.
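For context, the per-thread mask covers one horizontal band of rows. A minimal sketch of how the row partitioning could work (stripeRows is a hypothetical helper name for illustration, not my actual code; it spreads any remainder rows over the first stripes):

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: compute the half-open row range [begin, end) of
// horizontal stripe `i` out of `stripeCount` for an image with `rows` rows.
// Remainder rows are distributed one each to the first `rows % stripeCount`
// stripes, so the stripes tile the image exactly.
std::pair<int, int> stripeRows(int rows, int stripeCount, int i) {
    int base  = rows / stripeCount;   // rows every stripe gets
    int extra = rows % stripeCount;   // leftover rows to spread out
    int begin = i * base + std::min(i, extra);
    int end   = begin + base + (i < extra ? 1 : 0);
    return {begin, end};
}
```

The mask for stripe i is then all-zero except for rows [begin, end), which are set to 255 before passing it to detectAndCompute.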
However, this implementation doesn't run faster than the serial one; in fact, it runs about 2x slower. Some debugging showed that OpenCV is using the Microsoft Concurrency Runtime as its parallel backend. Moreover, when I print the range passed to each invocation along with cv::getThreadNum(), I get this:
range[0,1] thread [2]
range[3,4] thread [2]
range[2,3] thread [2]
range[1,2] thread [2]
That means my inner code runs on a single thread, four times in sequence, just like a serial implementation. Do you know what's wrong with my approach?
Thank you, Alin