I have never dealt with video stabilization myself, but I did deal with the problem of aligning 2 similar images that may differ by a small shift, and I had to perform that alignment really fast (~20 ms per match). So I hope the same concept will work for you. The sequence was:
1) Subsample the images.
2) Convert them to gray.
3) Find points of interest in the images using FAST. By the way, I got much better results without non-maximal suppression.
4) Perform Hough voting between the 2 sets of points of interest to find the best match.
5) Shift one of the images.
Note that:
a) All steps except 4 can be done with OpenCV functions.
b) You can save half of the work on steps 1 - 3 if you keep the results from matching the previous pair of frames, i.e. only the new frame needs to be processed.
c) If an integer shift is enough, then step 5 is just the definition of an ROI, i.e. it takes no time at all.
d) An efficient implementation of Hough voting for step 4 is important. It may take 1 millisecond if you do it right, and 1 second if you do it wrong. Be careful.
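To make note (c) concrete, here is a small sketch of how an integer shift could be turned into a pair of ROIs. The helper name `overlap_rois` and the sign convention (a feature at (x, y) in image A appears at (x + dx, y + dy) in image B) are assumptions for this example.

```python
def overlap_rois(width, height, dx, dy):
    """Return the (x, y, w, h) windows in A and in B that show the same scene.

    Assumed convention: a feature at (x, y) in A is at (x + dx, y + dy) in B.
    """
    w = width - abs(dx)
    h = height - abs(dy)
    if w <= 0 or h <= 0:
        raise ValueError("shift is larger than the image; no overlap")
    # A loses columns/rows on the side that falls outside B, and vice versa.
    roi_a = (max(0, -dx), max(0, -dy), w, h)
    roi_b = (max(0, dx), max(0, dy), w, h)
    return roi_a, roi_b


roi_a, roi_b = overlap_rois(640, 480, dx=3, dy=-2)
print(roi_a, roi_b)  # (0, 2, 637, 478) (3, 0, 637, 478)
```

Cropping both frames to these windows aligns them without copying or resampling any pixels, which is why this step is essentially free.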
Edit (to answer your questions):
A Hough transform that matches a shape to a set of points actually takes points on the shape to perform the match (matching a line is the only exception). Sometimes this is not stated directly, but it is what actually happens. For example, when matching a circle, you choose a discrete number of angles for the match, but that is essentially the same as choosing points on the circle. So any Hough transform (except for line matching) boils down to matching 2 sets of points, and that is what you need to do here.
In order to match 2 sets of points, first allocate an array for voting and set its values to zero (as usual in voting algorithms). Then, for each point from set A and each point from set B, calculate dx and dy between them and increment the appropriate bin in the voting array. When this is done, find the bin with the maximum value. It corresponds to the best shift between those sets of points. In your case you can save part of the work by matching each point of A only to the points of B that are in its neighborhood, because consecutive frames should not be shifted too much.
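The voting described above can be sketched in plain Python as follows. This is one possible way to do it, not the code from my application: the name `best_shift`, the `max_shift` limit and the grid of buckets used for the neighborhood lookup are all assumptions, and integer point coordinates are assumed.

```python
from collections import defaultdict


def best_shift(points_a, points_b, max_shift):
    """Vote for the integer shift (dx, dy) that best maps points_a onto points_b."""
    size = 2 * max_shift + 1
    # Voting array: one bin per candidate shift in [-max_shift, max_shift]^2.
    votes = [[0] * size for _ in range(size)]

    # Bucket B into square cells so each A point only meets nearby B points.
    cell = max(1, max_shift)
    grid = defaultdict(list)
    for bx, by in points_b:
        grid[(bx // cell, by // cell)].append((bx, by))

    for ax, ay in points_a:
        cx, cy = ax // cell, ay // cell
        # The 3x3 block of cells around A covers every B within max_shift.
        for gx in (cx - 1, cx, cx + 1):
            for gy in (cy - 1, cy, cy + 1):
                for bx, by in grid.get((gx, gy), ()):
                    dx, dy = bx - ax, by - ay
                    if abs(dx) <= max_shift and abs(dy) <= max_shift:
                        votes[dy + max_shift][dx + max_shift] += 1

    # The bin with the most votes is the best shift.
    best_count, best_dx, best_dy = -1, 0, 0
    for row in range(size):
        for col in range(size):
            if votes[row][col] > best_count:
                best_count = votes[row][col]
                best_dx, best_dy = col - max_shift, row - max_shift
    return (best_dx, best_dy), best_count


# Hypothetical data: B is A shifted by (3, -2), plus one unrelated point.
points_a = [(10, 10), (30, 12), (50, 40), (22, 35)]
points_b = [(x + 3, y - 2) for x, y in points_a] + [(100, 100)]
print(best_shift(points_a, points_b, max_shift=8))  # ((3, -2), 4)
```

The bucket grid is what keeps the cost close to linear in the number of points instead of |A| x |B|; comparing every pair directly gives the same votes but is where an implementation can become very slow.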
In my application I got good results without use of descriptors, so I didn't bother to check them. But they might be helpful in your case.