I'm working on a lip-reading project that uses the Grid Corpus dataset. The dataset contains 34,000 videos: 34 speakers with 1,000 videos each. Each video is 3 s long and has 75 frames. Every video comes with an align file that lists the words spoken in that video, each preceded by two numbers. What could the numbers in the align file stand for? Example of an align file (here sil stands for silence):
0 12250 sil
12250 19250 set
19250 27250 white
27250 30500 with
30500 36000 p
36000 43250 two
43250 55250 soon
55250 74500 sil
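
To make the question concrete, here is a minimal Python sketch of how I would read such a file. It rests on an unconfirmed guess: that the two numbers are the start and end of each word, and that 1000 units correspond to one video frame. The only basis for that guess is that the last value (74500) is close to 75 frames × 1000. The names parse_align, to_frames and UNITS_PER_FRAME are made up for illustration and are not part of the dataset or any library.

```python
from typing import List, Tuple

# ASSUMPTION (not confirmed by the dataset docs): 1000 align units = 1 video frame.
# This is guessed from 74500 being close to 75 frames * 1000.
UNITS_PER_FRAME = 1000
FRAMES_PER_VIDEO = 75

SAMPLE_ALIGN = """\
0 12250 sil
12250 19250 set
19250 27250 white
27250 30500 with
30500 36000 p
36000 43250 two
43250 55250 soon
55250 74500 sil
"""


def parse_align(text: str) -> List[Tuple[int, int, str]]:
    """Split each non-empty line into (first number, second number, word)."""
    segments = []
    for line in text.strip().splitlines():
        start, end, word = line.split()
        segments.append((int(start), int(end), word))
    return segments


def to_frames(
    segments: List[Tuple[int, int, str]],
    units_per_frame: int = UNITS_PER_FRAME,
) -> List[Tuple[float, float, str]]:
    """Convert the raw numbers to (fractional) frame positions under the assumption above."""
    return [(s / units_per_frame, e / units_per_frame, w) for s, e, w in segments]


segments = parse_align(SAMPLE_ALIGN)
frames = to_frames(segments)

for start_f, end_f, word in frames:
    print(f"{word:>5}: frames {start_f:.2f} to {end_f:.2f}")
```

If the guess is right, "set" would run from about frame 12.25 to frame 19.25, and the final silence would end at frame 74.5 of 75. If the numbers mean something else, only UNITS_PER_FRAME (or the conversion in to_frames) would need to change.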