I am using the master branch from the repository (hash: 361eb633f6e841bcda18f970193fc4fb439bc4c8). I have a feature vector consisting of several ordered variables. My responses, on the other hand, are categorical and of type CV_32S. I now want to train an RTrees model for this problem. The documentation of TrainData::create() states that responses of type CV_32S are allowed:
responses – matrix of responses. If the responses are scalar, they should be stored as a single row or as a single column. The matrix should have type CV_32F or CV_32S (in the former case the responses are considered as ordered by default; in the latter case - as categorical)
In the documentation of RTrees I can't find a reason for this to be illegal. However, if I train my RTrees as follows:
#include <iostream>
#include <random>
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
using namespace std;
using namespace cv;
using namespace cv::ml;
int main()
{
    random_device rd;
    mt19937 gen( rd() );
    uniform_real_distribution<> dis( 0, 1 );
    uniform_int_distribution<> dis1( 0, 1 );

    int samples = 100;
    Mat_<float> train( samples, 3 );
    for ( auto & x : train ) { x = dis( gen ); }

    // CASE #1
    //Mat_<int> resp( samples, 1 );
    //for ( auto & x : resp ) { x = dis1( gen ); }

    // CASE #2
    Mat resp( samples, 1, CV_32S );
    for ( auto it = resp.begin<int>(); it != resp.end<int>(); ++it ) { *it = dis1( gen ); }

    // CASE #3
    //Mat_<float> resp( samples, 1 );
    //for ( auto & x : resp ) { x = dis1( gen ); }

    Mat_<char> types( train.cols + 1, 1 );
    types.setTo( cv::Scalar( VAR_ORDERED ) );
    types( train.cols, 0 ) = VAR_CATEGORICAL;

    Ptr<TrainData> tdata = TrainData::create( train, ROW_SAMPLE, resp, noArray(), noArray(), noArray(), types );
    Ptr<RTrees> rf = RTrees::create();
    rf->train( tdata );

    Mat_<float> calc_out;
    cout << "calc error: " << rf->calcError( tdata, false, noArray() ) << endl;

    Mat_<float> pred_out;
    rf->predict( tdata->getTrainSamples(), pred_out );

    int missclass = 0;
    for ( int i = 0; i < pred_out.rows; ++i )
    {
        Mat_<float> r = tdata->getTrainResponses();
        if ( pred_out( i, 0 ) != r( i, 0 ) )
        {
            missclass++;
        }
    }
    cout << "pred error: " << missclass / ( float )samples << endl;

    return 0;
}
A Gist of this can also be found here: LINK
In Case #1 and Case #2 the output is something like the following:
calc error: 46
pred error: 0.17
Only for Case #3 is the error computed correctly:
calc error: 24
pred error: 0.24
Question #1
Is this behavior desired? If so, maybe this should be clarified in the documentation of StatModel, RTrees or TrainData?
The problem seems to be in this part of the StatModel::calcError() method:
...
float val = predict(sample);
float val0 = responses.at<float>(si);
if( isclassifier )
err += fabs(val - val0) > FLT_EPSILON;
...
If responses is of type int, wouldn't this lead to a different val0 than expected? I think this could be fixed by checking the type of responses and switching between at<float> and at<int>.
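To illustrate what I suspect is happening, here is a small standalone sketch without OpenCV. It assumes that at<float> on a CV_32S matrix reads the raw four bytes as a float instead of converting the value, and that floats are IEEE 754. The names IntResponses, readAsFloatBits, readAsLabel and countsAsError are hypothetical helpers made up for this example, not OpenCV functions.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a CV_32S response column: raw 32-bit integers.
using IntResponses = std::vector<std::int32_t>;

// Mimics the assumed behavior of responses.at<float>(si) on CV_32S data:
// the bytes of the integer are reinterpreted as a float, with no numeric
// conversion. The label 1 becomes a tiny denormal value close to 0.
float readAsFloatBits(const IntResponses& responses, std::size_t si)
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits");
    assert(si < responses.size());
    float val0;
    std::memcpy(&val0, &responses[si], sizeof(val0));
    return val0;
}

// Type-aware read, as proposed above: read the element as int and
// convert it numerically, so the label 1 becomes 1.0f.
float readAsLabel(const IntResponses& responses, std::size_t si)
{
    assert(si < responses.size());
    return static_cast<float>(responses[si]);
}

// Same comparison as in the calcError() excerpt for classifiers.
bool countsAsError(float val, float val0)
{
    return std::fabs(val - val0) > FLT_EPSILON;
}
```

If this assumption holds, every val0 read from 0/1 integer labels is practically 0, so a correct prediction of class 1 is counted as an error and a wrong prediction of class 0 is not. That would make calcError report roughly the percentage of samples predicted as class 1, which would fit the mismatch between "calc error" and "pred error" in Case #1 and Case #2.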
Question #2
I was quite confused that calcError returns a value in the range 0 <= x <= 100, although the return type is float. In my opinion, a return value in the range 0 <= x <= 1 would be more appropriate. What do you guys think?
Conclusion
Should this be posted to the issue tracker? What are your opinions on the returned value? If there are changes to be made, I could try to provide a pull request with the necessary fixes.