Hi, I am trying to see if I can approach the speech recognition (especially phoneme segmentation) problem using computer vision techniques on the spectrogram. I have some open questions, and I also invite everybody who is interested to get in touch with me for a possible collaboration.
A spectrogram looks like this: the X axis represents time, the Y axis the spectrum of the underlying signal. I would like to extract features using a pure image-processing approach. At first, the feature-extraction parameters, such as thresholds that depend on the SNR, will be set by hand through sliders in a prototype GUI; later I will look for a method to derive these parameters from the context as well.
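For reference, here is a minimal sketch of how I would turn audio into a grayscale "image" to feed the later steps; the sample rate and STFT parameters are just placeholder choices, and the sine tone stands in for real speech:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic signal: one second of a 440 Hz tone (stand-in for real speech).
fs = 16000                          # sample rate in Hz (assumption)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# Magnitude spectrogram: rows = frequency bins, columns = time frames.
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)

# Convert to dB and normalize to [0, 1] so it can be treated
# as a grayscale image by standard image-processing routines.
S_db = 10 * np.log10(Sxx + 1e-12)
img = (S_db - S_db.min()) / (S_db.max() - S_db.min())
print(img.shape)                    # (freq_bins, time_frames)
```

From here on, "the image" means this normalized array.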
So, here are the questions; please keep in mind that I am a complete beginner in image processing:
The human eye can clearly identify five main vertical shapes. Should I look into edge detection to find their start and end points? That could be an algorithm that extracts only lines that are more or less straight, giving me their start and end points; I could then check the slope, intensity, and length of the lines. This is the equivalent of onset detection in audio signals.
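Since the shapes are vertical, one possible shortcut before full 2-D edge detection might be a 1-D edge detector on the column-wise intensity: the rising and falling edges of the per-column mean mark the start and end columns of each bright region. A minimal sketch (the fixed threshold is an assumption, the kind of parameter I would first expose as a slider; the toy image stands in for a real spectrogram):

```python
import numpy as np

def segment_boundaries(img, thresh=0.5):
    """Find (start, end) columns of bright vertical regions in a
    normalized spectrogram image (rows = freq, columns = time)."""
    energy = img.mean(axis=0)              # per-column mean intensity
    active = energy > thresh               # True inside a bright region
    # Edges of the boolean mask: +1 marks an onset, -1 an offset.
    d = np.diff(active.astype(int))
    onsets = np.where(d == 1)[0] + 1
    offsets = np.where(d == -1)[0] + 1
    # Assumes regions are fully inside the image (no clipped edges).
    return list(zip(onsets, offsets))

# Toy image: two bright vertical bands on a dark background.
img = np.zeros((64, 100))
img[:, 10:30] = 1.0
img[:, 50:80] = 1.0
print(segment_boundaries(img))             # [(10, 30), (50, 80)]
```

On real data the energy curve would need smoothing and an SNR-dependent threshold, which is exactly where the hand-tuned parameters come in.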
I feel sufficiently comfortable using an audio algorithm to find the silent (dark) vertical regions. Given a successful recognition in the previous step, I would like to operate on smaller slices of the image, each representing one area of interest. One task would be to identify the slope of the higher-intensity oblique stripes visible in the image (these are the formants of vowels). How can I achieve that? Again, this could be edge detection, but the output of Canny seems to be a set of points with no information on how they are connected.
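One idea I am considering, instead of reconnecting Canny points: estimate the dominant orientation of a patch directly from its intensity gradients using the structure tensor, which gives a single angle per patch with no point-linking step. A sketch, assuming the patch has one dominant stripe direction (the sinusoidal toy patch is a stand-in for a formant region):

```python
import numpy as np
from scipy import ndimage

def dominant_gradient_angle(patch):
    """Dominant orientation of intensity gradients in a patch (radians,
    measured from the time axis), via the averaged structure tensor.
    Stripes run perpendicular to this direction."""
    gx = ndimage.sobel(patch, axis=1)      # gradient along time (x)
    gy = ndimage.sobel(patch, axis=0)      # gradient along frequency (y)
    # Structure-tensor entries averaged over the whole patch.
    Jxx, Jyy, Jxy = (gx * gx).mean(), (gy * gy).mean(), (gx * gy).mean()
    return 0.5 * np.arctan2(2 * Jxy, Jxx - Jyy)

# Toy patch: stripes perpendicular to the 45-degree direction.
yy, xx = np.mgrid[0:64, 0:64]
patch = np.sin(0.5 * (xx + yy))
angle = np.degrees(dominant_gradient_angle(patch))
print(round(angle, 1))                     # close to 45 for this patch
```

The same quantity can be computed per pixel (with local Gaussian averaging instead of a global mean) to get an orientation map, which may be more robust on noisy spectrograms than fitting lines to edge points.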
I don't need a whole recipe, but it would be nice to have suggestions on where to look, so that I can avoid complex postprocessing of the wrong algorithm's output.