For 1.) you might want to experiment with a Sobel filter, which gives you, for example, the image derivative in the x direction. You can then apply a threshold to get a binary image whose pixels mark strong changes in the x direction. Finally, count those pixels in every column: columns with a high count indicate the positions of clear phoneme starts/ends.
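A rough sketch of those three steps in plain Python, assuming the image is a 2D list of grayscale values with time running along x. The function names, `threshold` and `min_count` are made up for illustration and would need tuning on real data. With OpenCV you would normally get the derivative from `cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)` instead of the hand-written loop.

```python
def sobel_x(img):
    """3x3 Sobel derivative in x direction; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (
                (img[y - 1][x + 1] - img[y - 1][x - 1])
                + 2 * (img[y][x + 1] - img[y][x - 1])
                + (img[y + 1][x + 1] - img[y + 1][x - 1])
            )
    return out


def column_edge_counts(img, threshold):
    """Threshold |d/dx| to a binary mask and count the set pixels per column."""
    counts = [0] * len(img[0])
    for row in sobel_x(img):
        for x, value in enumerate(row):
            if abs(value) > threshold:
                counts[x] += 1
    return counts


def boundary_columns(counts, min_count):
    """Columns where enough pixels show a strong change in x direction."""
    return [x for x, c in enumerate(counts) if c >= min_count]


# Toy image: dark on the left, bright on the right (step between columns 3 and 4).
img = [[0, 0, 0, 0, 100, 100, 100, 100] for _ in range(6)]
counts = column_edge_counts(img, threshold=200)
print(counts)
print(boundary_columns(counts, min_count=3))
```

The 3x3 kernel responds on both sides of a step, so a single boundary typically shows up as two adjacent columns. Merging neighbouring columns or taking the local maximum of the counts is one way to get a single position.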
For 2.) you could use something like a histogram of gradients computed over a given region.
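To make the histogram idea a bit more concrete, here is a minimal sketch that bins the gradient orientations of one region, weighted by gradient magnitude. It is a simplified single-cell version, not the full HOG descriptor with cells and block normalization. The function name, the central-difference gradients, the 9 bins and the unsigned 0-180 degree range are all assumptions for illustration.

```python
import math


def gradient_histogram(region, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    h, w = len(region), len(region[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences as a simple gradient estimate.
            gx = region[y][x + 1] - region[y][x - 1]
            gy = region[y + 1][x] - region[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            idx = min(int(angle / bin_width), bins - 1)
            hist[idx] += mag
    total = sum(hist)
    # Normalize so regions of different size or contrast stay comparable.
    return [v / total for v in hist] if total > 0 else hist


# Region with a vertical edge: intensity changes only along x.
vertical_edge = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
# Region with a horizontal edge: intensity changes only along y.
horizontal_edge = [[0] * 6] * 3 + [[100] * 6] * 3

print(gradient_histogram(vertical_edge))
print(gradient_histogram(horizontal_edge))
```

The two toy regions end up in different bins (around 0 degrees versus around 90 degrees), which is the property that makes such a histogram usable as a feature vector for comparing or classifying regions.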