Publicly available datasets for VSR

Name         | Language | Contents                                       | Speakers | Views
Tulips1      | English  | 4 digits                                       | 12       |
DAVID        | English  | digits, alphabets, VCVCV utterances, etc.      | 124      | frontal, profile
XM2VTS       | English  | 3 sentences                                    | 295      | frontal
AVletters    | English  | 26 alphabet letters                            | 10       | frontal
CUAVE        | English  | 10 digits, continuous 10 digits                | 36       | frontal, profile
GRID         | English  | 4 commands, 4 colors, etc.                     | 34       | frontal
AVletters2   | English  | 26 alphabet letters                            | 5        | frontal
OuluVS       | English  | 10 sentences                                   | 20       | frontal
OuluVS2      | English  | continuous 10 digits, 10 sentences, etc.       | 53       | 5 views
M2TINIT      | Japanese | 503 sentences                                  | 1        | frontal
CENSREC-1-AV | Japanese | continuous digits (1 to 7 digits)              | 93       | frontal

  • Tulips1
    • HP: download zip file (free)
    • Group: University of California San Diego (USA)
    • Released: 1995
    • Language: English
    • Contents: digits("one", "two", "three", "four")
    • File format: pgm(grayscale)
    • Image size: 100x75[pixels]
    • Frame rate: 30fps
    • Speakers: 12
    • Reference:
      J.R.Movellan,
      Visual speech recognition with stochastic networks,
      Advances in Neural Information Processing Systems, vol.7, 1995.

  • DAVID (Digital Audio-Visual Integrated Database)
    • HP: ???
    • Group: University of Wales Swansea (UK)
    • Released: 1996
    • Language: English
    • Contents: digits, alphabets, vowel-consonant-vowel syllable utterances, some video conference commands
    • File format: ???
    • Image size: 640x480[pixels]
    • Frame rate: ???
    • Speakers: 124
    • Reference:
      C.C.Chibelushi, S.Gandon, J.S.D.Mason, F.Deravi, F.D.Johnston,
      Design issues for a digital audio-visual integrated database,
      IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication, pp.7/1-7/7, November 1996.

  • XM2VTS (Extended M2VTS (Multi Modal Verification for Teleservices and Security applications))
    • HP [+]
    • Group: University of Surrey (UK)
    • Released: 1999
    • Language: English
    • Contents: 3 sentences
      • "0 1 2 3 4 5 6 7 8 9"
      • "5 0 6 9 2 8 1 3 7 4"
      • "Joe took fathers green shoe bench out"
    • File format: ???
    • Image size: 720x576[pixels]
    • Frame rate: ???
    • Speakers: 295
    • Reference:
      K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre,
      XM2VTSDB: The extended M2VTS database,
      Second International Conference on Audio and Video-Based Biometric Person Authentication, 1999.

  • AVletters
    • HP: ???
    • Group: University of East Anglia (UK), University of Manchester (UK)
    • Released: 2002
    • Language: English
    • Contents: 26 alphabet letters
    • File format: ???
    • Image size: 376 x 288 [pixels], mouth image: 80 x 60 [pixels]
    • Frame rate: 25 fps
    • Speakers: 10 (5M+5F)
    • Reference:
      I. Matthews, T.Cootes, J. Bangham, S. Cox, and R. Harvey,
      Extraction of visual features for lipreading,
      IEEE Trans. on Pattern Analysis and Machine Intelligence, vol.24, no.2, pp.198-213, 2002.

  • CUAVE (Clemson University Audio-Visual Experiments)
    • HP [+] (free)
    • Group: Clemson University (USA)
    • Released: 2002
    • Language: English
    • Contents: digits ("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
    • File format: ???
    • Image size: 720 x 480 [pixels]
    • Frame rate: 29.97 fps
    • Speakers: 36 (19M+17F)
    • Reference:
      Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy,
      Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus,
      EURASIP Journal on Applied Signal Processing,
      Volume 2002, Issue 11, pp.1189-1201, 2002, doi:10.1155/S1110865702206101.

  • GRID
    • HP [+] (free)
    • Group: University of Sheffield (UK)
    • Released: 2006
    • Language: English
    • Contents: 6 components
      • 4 commands={"bin", "lay", "place", "set"}
      • 4 colors={"blue", "green", "red", "white"}
      • 4 prepositions={"at", "by", "in", "with"}
      • 25 letters (all except "w")
      • 10 digits={"zero" to "nine"}
      • 4 adverbs={"again", "now", "please", "soon"}
    • File format: mpeg(color)
    • Image size: 360 x 288 [pixels], 720 x 576 [pixels]
    • Frame rate: 25 fps
    • Speakers: 34 (18M+16F)
    • Reference:
      M. Cooke, J. Barker, S. Cunningham, and X. Shao,
      An audio-visual corpus for speech perception and automatic speech recognition,
      Journal of the Acoustical Society of America, vol.120, no.5, pp.2421-2424, 2006.
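
The six-component structure listed under Contents can be illustrated with a short Python sketch. It uses only the word lists given in this entry; the assumption that the components are spoken in the order they are listed (command, color, preposition, letter, digit, adverb), and the helper names `grid_sentences` and `count_grid_sentences`, are illustrative and not taken from the corpus documentation.

```python
from itertools import product
from string import ascii_lowercase

# Word lists copied from the GRID "Contents" entry above.
COMMANDS = ["bin", "lay", "place", "set"]
COLORS = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS = [c for c in ascii_lowercase if c != "w"]  # 25 letters, "w" excluded
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
ADVERBS = ["again", "now", "please", "soon"]

# Assumption: one word per component, in the order the entry lists them.
COMPONENTS = [COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS]


def grid_sentences():
    """Yield every word sequence allowed by the six-slot structure."""
    for words in product(*COMPONENTS):
        yield " ".join(words)


def count_grid_sentences():
    """Return the number of distinct sequences without enumerating them."""
    total = 1
    for component in COMPONENTS:
        total *= len(component)
    return total


if __name__ == "__main__":
    print(count_grid_sentences())
    print(next(grid_sentences()))
```

Under these assumptions the structure admits 4 x 4 x 4 x 25 x 10 x 4 word sequences; how many of them each speaker actually recorded is not stated in this entry.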

  • AVletters2
    • HP [+] (free)
    • Group: University of Manchester (UK)
    • Released: 2008
    • Language: English
    • Contents: 26 alphabet letters
    • File format: mov
    • Image size: 1920 x 1080 [pixels]
    • Frame rate: 50 fps
    • Speakers: 5 (5M)
    • Reference:
      S.J. Cox, R. Harvey, Y. Lan, J. Newman, B.J. Theobald,
      The challenge of multispeaker lip-reading,
      International Conference on Auditory-Visual Speech Processing (AVSP2008), pp.179-184, 2008.

  • OuluVS
    • HP [+]
    • Group: University of Oulu (Finland)
    • Released: 2009
    • Language: English
    • Contents: 10 sentences
      • "Hello"
      • "Excuse me"
      • "I am sorry"
      • "Thank you"
      • "Good bye"
      • "See you"
      • "Nice to meet you"
      • "You are welcome"
      • "How are you"
      • "Have a good time"
    • File format: ???
    • Image size: 720 x 576 [pixels]
    • Frame rate: 25 fps
    • Speakers: 20 (17M+3F)
    • Reference:
      G. Zhao, M. Barnard, and M. Pietikainen,
      Lipreading with local spatiotemporal descriptors,
      IEEE Transactions on Multimedia, Vol.11, No.7, pp.1254-1265, 2009.

  • OuluVS2
    • HP: ???
    • Group: University of Oulu (Finland)
    • Released: 2015
    • Language: English
    • Contents: 3 phases
      • Phase 1: utterances of continuous 10-digit strings
      • Phase 2: utterances of 10 sentences
      • Phase 3: utterances of 10 sentences randomly selected from the TIMIT database
    • File format: ???
    • Image size: 1920 x 1080 [pixels], 640 x 480 [pixels]
    • Frame rate: 30 fps, 100 fps
    • Speakers: 53 (40M+13F)
    • Reference:
      I.Anina, Z.Zhou, G.Zhao and M.Pietikainen,
      OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis,
      IEEE International Conference on Automatic Face and Gesture Recognition, 2015

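  • M2TINIT (Multi-Modal Speech Database by Tokyo Institute of Technology and Nagoya Institute of Technology)
    • HP [+] (inquire by e-mail; HDD of 60 GB or more; free)
    • Group: Takao Kobayashi Laboratory, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology; Kitamura-Tokuda Laboratory, Department of Intelligent Information Systems, Nagoya Institute of Technology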
    • Released: 2002
    • Language: Japanese
    • Contents: 503 ATR phonetically balanced sentences
    • File format: ???
    • Image size: 720 x 480 [pixels]
    • Frame rate: 29.97 fps
    • Speakers: 1 (1M)
    • Reference:
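      酒向 慎司, 徳田 恵一, 益子 貴史, 小林 隆夫, 北村 正,
      HMM-based audio-visual text-to-speech synthesis: an image-based approach (in Japanese),
      IPSJ Journal (Information Processing Society of Japan), vol.43, no.7, pp.2169-2176, 2002.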

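  • CENSREC-1-AV (multimodal speech recognition evaluation environment database)
    • HP [+] (inquire by e-mail; free)
    • Group: Working Group on Noisy Speech Recognition Evaluation, Special Interest Group on Spoken Language Processing, Information Processing Society of Japan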
    • Released: 2010
    • Language: Japanese
    • Contents: read-aloud continuous digit strings (1 to 7 digits)
    • File format: color images: bmp (24-bit); near-infrared images: bmp (8-bit)
    • Image size: 81 x 55 [pixels]
    • Frame rate: ??? fps
    • Speakers: 42 (22M+20F), 51 (25M+26F)
    • Reference:
      大西正真, 田村哲嗣, 速水悟,
      Model adaptation for speech recognition focusing on the interaction between audio and visual modalities (in Japanese),
      IEICE Technical Report,
      vol.111, no.97, SP2011-33, pp.17-22, June 2011.