[1] Chen, L.; Li, Z.; Maddox, R. K.; Duan, Z.; and Xu, C. 2018. Lip Movements Generation at a Glance. In Proceedings of the European Conference on Computer Vision (ECCV).
[2] Chen, L.; Maddox, R. K.; Duan, Z.; and Xu, C. 2019. Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7832–7841.
[3] van den Oord, A.; Vinyals, O.; and Kavukcuoglu, K. 2017. Neural Discrete Representation Learning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[4] Prajwal, K. R.; Mukhopadhyay, R.; Philip, J.; Jha, A.; Namboodiri, V. P.; and Jawahar, C. 2019. Towards Automatic Face-to-Face Translation. In Proceedings of the 27th ACM International Conference on Multimedia, 1428–1436.
[5] Prajwal, K. R.; Mukhopadhyay, R.; Namboodiri, V. P.; and Jawahar, C. 2020a. Learning Individual Speaking Styles for Accurate Lip-to-Speech Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Prajwal, K. R.; Mukhopadhyay, R.; Namboodiri, V. P.; and Jawahar, C. 2020b. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, 484–492. New York, NY, USA: Association for Computing Machinery. ISBN 9781450379885.
[7] Song, Y.; Zhu, J.; Li, D.; Wang, A.; and Qi, H. 2019. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), 919–925. International Joint Conferences on Artificial Intelligence Organization.
[8] Zhou, H.; Liu, Y.; Liu, Z.; Luo, P.; and Wang, X. 2019. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. In AAAI Conference on Artificial Intelligence (AAAI).
[9] Ding, S.; and Gutierrez-Osuna, R. 2019. Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion. In Proc. Interspeech 2019, 724–728.
[10] Gao, R.; and Grauman, K. 2021. VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency. arXiv preprint arXiv:2101.03149.
[11] Zhou, H.; Liu, Y.; Liu, Z.; Luo, P.; and Wang, X. 2019. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. In AAAI Conference on Artificial Intelligence (AAAI).
[12] Chou, J.-C.; Yeh, C.-C.; and Lee, H.-Y. 2019. One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. arXiv preprint arXiv:1904.05742.
[13] Li, N.; Liu, S.; Liu, Y.; Zhao, S.; and Liu, M. 2019. Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 6706–6713.
[14] Feng, D.; Yang, S.; Shan, S.; and Chen, X. 2020. Learn an Effective Lip Reading Model without Pain. arXiv preprint arXiv:2011.07557.
[15] Stafylakis, T.; Khan, M. H.; and Tzimiropoulos, G. 2018. Pushing the Boundaries of Audio-Visual Word Recognition Using Residual Networks and LSTMs. Computer Vision and Image Understanding, 176–177: 22–32.
[16] Ma, P.; Wang, Y.; Shen, J.; Petridis, S.; and Pantic, M. 2021. Lip-Reading With Densely Connected Temporal Convolutional Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2857–2866.
[17] Ma, P.; Wang, Y.; Shen, J.; Petridis, S.; and Pantic, M. 2021. Lip-Reading With Densely Connected Temporal Convolutional Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2857–2866.
[18] Garcia, B.; Shillingford, B.; Liao, H.; Siohan, O.; de Pinho Forin Braga, O.; Makino, T.; and Assael, Y. 2019. Recurrent Neural Network Transducer for Audio-Visual Speech Recognition. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[19] Ren, S.; Du, Y.; Lv, J.; Han, G.; and He, S. 2021. Learning From the Master: Distilling Cross-Modal Advanced Knowledge for Lip Reading. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13325–13333.
[20] Zhao, Y.; Xu, R.; Wang, X.; Hou, P.; Tang, H.; and Song, M. 2020. Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 6917–6924.
[21] Chung, S.-W.; Chung, J. S.; and Kang, H.-G. 2019. Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[22] Yang, C.-C.; Fan, W.-C.; Yang, C.-F.; and Wang, Y.-C. F. 2022. Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI).
[23] Chung, J. S.; Senior, A.; Vinyals, O.; and Zisserman, A. 2017. Lip Reading Sentences in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).