Bag of features with dense SIFT and SVM - understanding and implementation
My aim is to detect an underwater object, a badminton racket among others. I have over 160 images of this racket lying underwater. For each image I created a binary mask for the racket (the object I want to detect) and, from that mask, derived a mask for the underwater scenery (rocks, leaves, etc., i.e. objects I don't want to detect). Now I want to use bag of features (BOF) with dense SIFT. What I intend to do:
- Create a visual dictionary: compute dense SIFT on each image twice, once with the racket mask applied (the object I want to detect) and once with the background mask applied (all other underwater objects).
- With the dictionary in place, build the SVM training data: for every image I again compute dense SIFT with the object mask applied (label 1) and with the background mask applied (label 0), and in each case I compute the frequency histogram of visual words from the dictionary.
- Object recognition: this part is tricky for me. My trained SVM knows the visual-word frequencies for the racket (label 1) and for the background (label 0). Now I have a test image of the racket lying underwater among rocks and other things. If I feed the whole image to the SVM, it will see "both frequencies of visual words", because the image contains the racket and the background at the same time. How can I prevent that? My idea is to segment the test image into several (10-50) regions, compute dense SIFT on each region, and run the SVM prediction on each region separately.
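The "frequency of visual words" step in the list above can be sketched as follows. This is a minimal illustration, assuming the descriptors are already available as an `N x D` array (D = 128 for SIFT) and the dictionary as a `K x D` array of cluster centres. The function name `bow_histogram` is hypothetical and not part of any library, and the brute-force distance computation is only meant for small inputs.

```python
import numpy as np


def bow_histogram(descriptors, vocabulary):
    """Return an L1-normalised histogram of visual-word counts.

    descriptors: (N, D) array of local descriptors (e.g. dense SIFT).
    vocabulary:  (K, D) array of visual words (cluster centres).
    Both shapes are assumptions made for this sketch.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    vocabulary = np.asarray(vocabulary, dtype=np.float64)
    k = vocabulary.shape[0]

    # A region without descriptors gets an all-zero histogram.
    if descriptors.size == 0:
        return np.zeros(k)

    # Squared Euclidean distance from every descriptor to every word.
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)

    # Each descriptor votes for its nearest visual word.
    words = dist2.argmin(axis=1)
    hist = np.bincount(words, minlength=k).astype(np.float64)

    # Normalise so that regions with different descriptor counts are comparable.
    return hist / hist.sum()
```

One such histogram per masked area (or per region) would be one training row for the SVM, with label 1 for the racket and 0 for the background.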
Am I right, or have I misunderstood something about the BOF method? If I am wrong, how can I achieve my goal? Once again, I have 160 sets of images at my disposal (original frame, racket mask, background mask). Below is an example of one image set:
My racket:
Racket mask:
Mask for the background:
I tried detecting it with SIFT descriptors, but there was a lot of noise in the output image:
Then I tried BOW. I created the dictionary (dense SIFT on whole images, 5px size), then divided the input images into 100 regions. If a region lay on the racket mask, I computed its dense SIFT and measured the frequency of visual words against my vocabulary (input data to the SVM, label 1). If a region lay on the background mask, I did the same (dense SIFT on the region, frequency of visual words) and labeled it 0.
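The region labeling described above can be sketched roughly like this. It assumes both masks are 2-D arrays of the same size in which non-zero means "inside the mask", and it uses a coverage threshold (`min_cover`) to decide whether a cell counts as racket or background. The threshold and the function name `label_grid_cells` are assumptions for illustration, not the exact rule used in the question. Cells that are only partly racket are skipped instead of being forced into one class.

```python
import numpy as np


def label_grid_cells(racket_mask, background_mask, rows=10, cols=10, min_cover=0.5):
    """Split the image area into rows x cols cells and label the clear ones.

    Returns a list of (row_slice, col_slice, label) with label 1 for racket
    cells and 0 for background cells. Ambiguous cells are left out.
    """
    h, w = racket_mask.shape
    row_edges = np.linspace(0, h, rows + 1, dtype=int)
    col_edges = np.linspace(0, w, cols + 1, dtype=int)

    cells = []
    for r in range(rows):
        for c in range(cols):
            rs = slice(row_edges[r], row_edges[r + 1])
            cs = slice(col_edges[c], col_edges[c + 1])

            # Fraction of the cell covered by each mask.
            racket_cover = np.mean(racket_mask[rs, cs] > 0)
            background_cover = np.mean(background_mask[rs, cs] > 0)

            if racket_cover >= min_cover:
                cells.append((rs, cs, 1))
            elif racket_cover == 0 and background_cover >= min_cover:
                cells.append((rs, cs, 0))
            # Otherwise the cell mixes racket and background: skip it.
    return cells
```

Since a racket is mostly thin frame and strings, `min_cover` probably needs tuning; a lower value keeps more racket cells at the cost of more background pixels in the positive examples.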
When testing my BOW, I divided the test image into 100 regions and did the same thing as in training (dense SIFT on each region, compared against the dictionary). Here is my miserable result:
You can see the shape of the racket, but there are many errors.
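The region-wise test procedure described above amounts to the loop below. It is only a sketch: `describe_region` and `classify` are hypothetical placeholders for "dense SIFT plus visual-word histogram" and "trained SVM prediction", passed in as plain callables so the example does not depend on a particular library.

```python
import numpy as np


def predict_grid(image, describe_region, classify, rows=10, cols=10):
    """Classify each cell of a rows x cols grid and return a label map.

    describe_region: callable mapping an image region to a feature vector
                     (placeholder for dense SIFT + BOW histogram).
    classify:        callable mapping a feature vector to 0 or 1
                     (placeholder for the trained SVM).
    """
    h, w = image.shape[:2]
    row_edges = np.linspace(0, h, rows + 1, dtype=int)
    col_edges = np.linspace(0, w, cols + 1, dtype=int)

    labels = np.zeros((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            region = image[row_edges[r]:row_edges[r + 1],
                           col_edges[c]:col_edges[c + 1]]
            labels[r, c] = classify(describe_region(region))
    return labels
```

The resulting `rows x cols` label map is coarse by construction, so part of the blockiness in the output comes from the grid size itself and not from the classifier.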
Any idea how I can improve my algorithm? If I wasn't clear ...
What is a "racket"? (You're not playing tennis underwater, or are you?) Can you add an example image?
I have updated my question, please have a look :)