Generating good training data for Haar cascades

I am trying to build Haar cascades to do OCR of a specific font, with one classifier per character.

I can generate tons of training data just by drawing the font onto images. So the plan is to generate positive training data for each character and use the examples of the other characters as negative training data (let me know if this is a dumb idea, please).
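
For context, this is roughly how I'm generating the samples. It's just a sketch; the font path, character set, and output layout are placeholders:

```python
# Rough sketch of how I plan to generate per-character positives
# (FONT_PATH, CHARS, and OUT_DIR are placeholders).
import os
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "myfont.ttf"          # the specific font I'm targeting
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
OUT_DIR = "positives"
GLYPH_SIZE = 48                   # point size for rendering
CANVAS = GLYPH_SIZE + 16          # a little padding so glyphs aren't clipped

font = ImageFont.truetype(FONT_PATH, GLYPH_SIZE)

for ch in CHARS:
    char_dir = os.path.join(OUT_DIR, ch)
    os.makedirs(char_dir, exist_ok=True)
    # Render white-on-transparent; backgrounds can be composited in later.
    img = Image.new("RGBA", (CANVAS, CANVAS), (0, 0, 0, 0))
    ImageDraw.Draw(img).text((8, 8), ch, font=font, fill=(255, 255, 255, 255))
    img.save(os.path.join(char_dir, f"{ch}_0.png"))
```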

I am wondering how much variation I should put into the training data. Normally I'd just try everything, but I gather these things take days to train (for each character!), so some advice would be good.

So, a few questions:

  • Does the training algorithm recognise that I don't care about transparent pixels, or will it perform better if I superimpose the characters over different backgrounds (see the compositing sketch after this list)?
  • Should I include images where each character is shown with different prefixes and suffixes, or should I just treat each character individually?
  • Should I include images where the character is scaled up and down? I gather the algorithm pretty much ignores size, and scales everything down for efficiency anyway?
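
Regarding the first question, this is roughly what I'd do to superimpose a character over different backgrounds. Again just a sketch; "backgrounds/" stands in for a folder of arbitrary photos larger than the glyph images:

```python
# Sketch of compositing a transparent glyph over a random background crop
# ("backgrounds/" is a placeholder folder of photos larger than the glyphs).
import glob
import random
from PIL import Image

def composite_on_background(glyph_path, bg_paths, out_path):
    glyph = Image.open(glyph_path).convert("RGBA")
    bg = Image.open(random.choice(bg_paths)).convert("RGB")
    # Crop a random patch the same size as the glyph image.
    x = random.randint(0, bg.width - glyph.width)
    y = random.randint(0, bg.height - glyph.height)
    patch = bg.crop((x, y, x + glyph.width, y + glyph.height))
    patch.paste(glyph, (0, 0), mask=glyph)    # use the alpha channel as the mask
    patch.convert("L").save(out_path)         # grayscale, since that's what training uses

bgs = glob.glob("backgrounds/*.jpg")
composite_on_background("positives/A/A_0.png", bgs, "positives/A/A_0_bg.png")
```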

Thanks!

P.S. This question is also on Stack Overflow. Apologies for cross-posting.