Advances in Intelligent Voice Interaction Technology
Wang Bin (王斌); Wang Yujun (王育军); Cui Jianwei (崔建伟); Meng Erli (孟二利)
Abstract:
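With the arrival of the AIoT era, the number of smart devices, including mobile phones, smart speakers, smart TVs, and wearables, has grown explosively. Because speech is convenient, intelligent voice interaction has become the primary way of connecting people with smart devices: a device can "understand" what the user says and then either execute the corresponding command or give a reasonable reply. Intelligent voice interaction relies on a large body of artificial intelligence technology. This paper first breaks intelligent voice interaction down into its main component technologies, namely speech recognition, natural language understanding, human-machine dialogue, and speech synthesis, and introduces the concepts, progress, and expected future trends of each. It then takes Xiaomi's intelligent assistant "小爱同学" (Xiao Ai) as an example to show how these technologies are applied in real scenarios.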
Keywords: speech recognition; speech synthesis; human-machine dialogue; natural language processing
References:
- [1] Dahl G E, Yu D, Deng L, Acero A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012.
- [2] He Y, Sainath T N, Prabhavalkar R, et al. Streaming End-to-end Speech Recognition for Mobile Devices[C]//ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
- [3] Graves A, Jaitly N. Towards End-To-End Speech Recognition with Recurrent Neural Networks[C]//Proceedings of the 31st International Conference on Machine Learning, 2014.
- [4] Serdyuk D, Wang Y. Towards end-to-end spoken language understanding[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
- [5] Jia Y, Weiss R J, Biadsy F, et al. Direct speech-to-speech translation with a sequence-to-sequence model[J]. 2019. arXiv:1904.06037.
- [6] Mikolov T. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
- [7] Vaswani A, et al. Attention is all you need[C]//NeurIPS, 2017.
- [8] Peters M, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]//NAACL, 2018.
- [9] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//NAACL, 2019.
- [10] Brown T B, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[C]//NeurIPS, 2020.
- [11] Henderson M, Thomson B, Young S. Deep neural network approach for the dialog state tracking challenge[C]//Proceedings of the SIGDIAL 2013 Conference, 2013.
- [12] Cuayáhuitl H, Keizer S, Lemon O. Strategic dialogue management via deep reinforcement learning[J]. 2015. arXiv:1511.08099.
- [13] Ji Z, Lu Z, Li H. An Information Retrieval Approach to Short Text Conversation[J]. Computer Science, 2014.
- [14] Shang L, Lu Z, Li H. Neural Responding Machine for Short-Text Conversation[J]. 2015. arXiv:1503.02364.
- [15] Dong L, Yang N, Wang W, et al. Unified language model pre-training for natural language understanding and generation[J]. Advances in Neural Information Processing Systems, 2019: 13063-13075.
- [16] Wang Y, Skerry-Ryan R, Stanton D, et al. Tacotron: Towards End-to-End Speech Synthesis[J]. 2017. arXiv:1703.10135.
- [17] Ren Y, Ruan Y, Tan X, et al. FastSpeech: Fast, Robust and Controllable Text to Speech[J]. 2019. arXiv:1905.09263.
- [18] Oord A V D, Dieleman S, Zen H, et al. WaveNet: A Generative Model for Raw Audio[J]. 2016. arXiv:1609.03499.
- [19] Kalchbrenner N, Elsen E, Simonyan K, Noury S, Casagrande N, et al. Efficient Neural Audio Synthesis[J]. 2018. arXiv:1802.08435.
- [20] Prenger R, Valle R, Catanzaro B. WaveGlow: A Flow-based Generative Network for Speech Synthesis[J]. 2018. arXiv:1811.00002.