Pose estimation produces wrong translation vector
Hi,
I'm trying to extract the camera pose from a pair of images using features I extracted with BRISK. The feature points match very well when I display them, and the rotation matrix I get seems reasonable. The translation vector, however, is not.
I'm using the simple method of computing the fundamental matrix, deriving the essential matrix from it, and decomposing that via SVD, as presented in e.g. Hartley & Zisserman (H&Z):
Mat fundamental_matrix =
    findFundamentalMat(poi1, poi2, FM_RANSAC, deviation, 0.9, mask);
// E = K^T * F * K
Mat essentialMatrix = calibrationMatrix.t() * fundamental_matrix * calibrationMatrix;
SVD decomp(essentialMatrix, SVD::FULL_UV);
// W as defined in H&Z for the factorization E = [t]x * R
Mat W = Mat::zeros(3, 3, CV_64F);
W.at<double>(0, 1) = -1;
W.at<double>(1, 0) = 1;
W.at<double>(2, 2) = 1;
// the two possible rotations
Mat R1 = decomp.u * W * decomp.vt;
Mat R2 = decomp.u * W.t() * decomp.vt;
// flip the sign if we got a reflection instead of a rotation
if (determinant(R1) < 0)
    R1 = -1 * R1;
if (determinant(R2) < 0)
    R2 = -1 * R2;
// the translation (up to sign and scale) is the last column of U
Mat trans = decomp.u.col(2);
However, the resulting translation vector is way off, especially the z coordinate: it is usually near (0, 0, 1) regardless of the camera movement I performed while recording the images. Sometimes the first two coordinates seem roughly right, but they are far too small compared to the z coordinate (e.g. I moved the camera mainly in +x and the resulting vector is something like (0.2, 0, 0.98)). Any help would be appreciated.
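Side note for reference: the H&Z decomposition actually yields four candidate poses, (R1, ±u3) and (R2, ±u3); the standard way to pick the physically valid one is to triangulate a match under each candidate and keep the pose that puts the point in front of both cameras (the cheirality test). A rough sketch reusing the variables above:
// the translation from the SVD is only determined up to sign
Mat t1 = decomp.u.col(2);
Mat t2 = -1 * t1;
// four candidates: (R1, t1), (R1, t2), (R2, t1), (R2, t2).
// Triangulate one correspondence with P = K[I|0] and P' = K[R|t]
// for each candidate and keep the one with positive depth in both views.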
If you use the essential matrix to determine the pose of the camera, you are going to get the rotation matrix (3x3) and a translation vector that is only A UNIT VECTOR, so you will only know the direction. You need to scale that vector in order to get the right units.
Thank you for your reply. That much is clear. However, the direction of that unit vector is completely wrong: there is no way I could uniformly scale a vector that points mostly along z to reflect the actual movement, which was mostly along x.
I had a similar problem. First, I installed the latest version of OpenCV (from GitHub); that version has a file called "five-point.cpp" with the findEssentialMat, decomposeEssentialMat and recoverPose functions, which could help you. Second, you might check the functions solvePnP for pose recovery and correctMatches for triangulation; those were very useful for me.
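For later readers, a usage sketch of those functions (assuming matched points pts1/pts2 and intrinsics focal/pp; the exact signatures may vary between OpenCV versions):
// 5-point algorithm (Nister) on the matched image points
Mat mask;
Mat E = findEssentialMat(pts1, pts2, focal, pp, RANSAC, 0.999, 1.0, mask);
// picks, out of the four possible decompositions of E, the pose
// that places the triangulated points in front of both cameras
Mat R, t;
recoverPose(E, pts1, pts2, R, t, focal, pp, mask);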
Again, thanks for the input. I did check out that five-point.cpp file. I could not yet test the findEssentialMat function, as that obviously requires building the whole package, but decomposeEssentialMat and recoverPose yield exactly the same results as my approach. Which is a good thing, I guess. Or not, as it still leaves open where things go horribly wrong :). As for solvePnP: from what I gather, that function tries to estimate the pose from already-calculated 3D object points. I do, however, need the pose to calculate those, or am I mistaken here?
Can you tell me exactly what you need? What is the purpose of your program?
For the time being I would be satisfied if I could read in two images, extract keypoints, estimate the pose between them, and triangulate 3D points for those keypoints. There are plenty of texts covering this, but sadly the pose estimation fails as described above.
Do you have a pair of cameras (a stereo system) or just one camera? If you have a stereo camera it is a simple problem (it can be solved with solvePnP); however, if you only have one camera you are not going to be able to compute the scaled translation at each step, you will only know the direction from the unit vector. The problem you are describing is called visual odometry; there is a good tutorial online about it (google: visual odometry Davide Scaramuzza).
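To see why the scale is lost with one camera: in homogeneous coordinates the projection is x ~ K [R | t] X, and for any s > 0 the scaled pair (s·t, s·X) gives K [R | s·t] (s·X) ~ x, i.e. exactly the same image points, so no monocular measurement can determine s.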
The odometry tip was not too helpful for my current problem; however, it may prove useful in the future. The new findEssentialMat after building the current snapshot did the trick, though: it seems the 5-point algorithm is far better suited for pose extraction than the route via the fundamental matrix. Thanks again!
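For completeness, once R and t are recovered, triangulating the matched keypoints (the goal stated above) can look roughly like this; a sketch assuming pts1/pts2 are the matched points and calibrationMatrix, R, t are CV_64F, with the result only defined up to the unknown global scale:
// camera 1 at the origin: P0 = K [I | 0]
Mat P0 = calibrationMatrix * Mat::eye(3, 4, CV_64F);
// camera 2: P1 = K [R | t]
Mat Rt = Mat::eye(3, 4, CV_64F);
R.copyTo(Rt(Range(0, 3), Range(0, 3)));
t.copyTo(Rt.col(3));
Mat P1 = calibrationMatrix * Rt;
// triangulate; points4D comes back in homogeneous coordinates,
// so divide each column by its fourth entry afterwards
Mat points4D;
triangulatePoints(P0, P1, pts1, pts2, points4D);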
@RaulPL If it is only possible to get a direction (the unit vector) between the first two frames, then between frames 2 and 3 you likewise only get a direction, and the translation vector between frames 1 and 2 is not at the same scale as the one between frames 2 and 3, right?
With a stereo camera system it actually can NOT be solved with solvePnP, assuming you haven't introduced extra real-world length information (e.g. a chessboard or AprilTags): the method for producing the 3D points is triangulation, and triangulation needs the extrinsics in the first place. If you have a chessboard or AprilTags in your images, then you can get the scaled translation, not only with a stereo camera system but also with a monocular one.
@chrisk And yes, you can only get a unit translation vector between frames 1 and 2, as well as between frames 2 and 3.
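A minimal sketch of the chessboard idea mentioned above (assuming a calibrated camera matrix K with distortion coefficients distCoeffs, and a board of boardSize inner corners with squares of squareSize in metric units; all names are placeholders):
// board corners in metric units on the z = 0 plane
std::vector<Point3f> objectPoints;
for (int i = 0; i < boardSize.height; i++)
    for (int j = 0; j < boardSize.width; j++)
        objectPoints.push_back(Point3f(j * squareSize, i * squareSize, 0));
std::vector<Point2f> corners;
if (findChessboardCorners(image, boardSize, corners)) {
    Mat rvec, tvec;
    // tvec comes out in the same metric units as squareSize
    solvePnP(objectPoints, corners, K, distCoeffs, rvec, tvec);
}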