If you know the real-world dimensions of the block and you have detected its corners, you can compute the ratio between a length in pixels and the same length in real-world units.
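As a rough illustration of that ratio, here is a minimal Python sketch. The corner coordinates, the 50 mm edge length, and the helper names `pixels_per_unit` and `pixels_to_units` are all made up for the example; they are not from your setup. It also assumes the block's face is roughly parallel to the image plane and that lens distortion is negligible, otherwise a single ratio will not hold across the image.

```python
import math


def pixels_per_unit(corner_a, corner_b, real_length):
    """Return how many pixels correspond to one real-world unit along an edge.

    corner_a, corner_b: (x, y) pixel coordinates of two detected corners
    real_length: known real-world length of the edge between them (e.g. in mm)
    """
    if real_length <= 0:
        raise ValueError("real_length must be positive")
    pixel_length = math.dist(corner_a, corner_b)  # Euclidean distance in pixels
    return pixel_length / real_length


def pixels_to_units(pixel_distance, scale):
    """Convert a distance measured in pixels to real-world units."""
    return pixel_distance / scale


# Hypothetical detected corners (pixels) of one edge known to be 50 mm long.
top_left = (120.0, 80.0)
top_right = (420.0, 80.0)

scale = pixels_per_unit(top_left, top_right, real_length=50.0)
print(f"scale: {scale} px/mm")
print(f"150 px is about {pixels_to_units(150.0, scale)} mm")
```

The ratio is only valid for measurements in the same plane as the block and at the same distance from the camera.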
One of the most popular books on this topic is "Multiple View Geometry in Computer Vision". Since you have a maths background, you will love it.
I think SLAM-based approaches are currently the most advanced techniques for estimating a camera's position in the real world, but maybe you should start with simpler techniques.
As for Python versus C++, I am not experienced with the Python version, but you can base your decision on these topics: