
Kerbala Journal for Engineering Sciences


Deep Learning Cross-Modal Learning for Audio-Visual Speech Recognition

    Authors

    • Dalal A. Hammood 1
    • Hayder Jasim Habil 2
    • Ibtehal Shakir Mahmoud 3
    • Effariza Hanafi 4

    1 Department of Computer Technical Engineering, Electrical Engineering Technical College, Middle Technical University (MTU), Baghdad, Iraq

    2 College of Applied Arts, Middle Technical University (MTU), Baghdad, Iraq

    3 Al-Iraqia University, Baghdad, Iraq

    4 Universiti Malaya, Kuala Lumpur, Malaysia


Document Type: Research Article

DOI: 10.63463/kjes1090

Abstract

Relating the spoken content of an utterance across visual and audio streams is a crucial aspect of audio-visual speech recognition (AVSR), with applications in audio-visual correspondence and manipulation tasks such as those addressed by AVE-Net and SyncNet. The technique described in this research uses feature disentanglement to handle these tasks simultaneously. Through cross-modal learning, the model transforms visual or auditory linguistic features into modality-independent representations, and correspondence tasks such as those of AVE-Net and SyncNet can be performed using these derived representations. Furthermore, the audio and visual outputs can be manipulated according to the desired subject identity and linguistic content. We conduct comprehensive experiments on a range of recognition and synthesis tasks, evaluating each task separately, and show that the proposed solution successfully addresses both audio-visual learning problems. On enhanced video the system achieves 91.5% accuracy with 5 frames, rising to 99.03% with 15 frames, outperforming previous methods.
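
To make the cross-modal idea concrete, the following is a minimal PyTorch sketch, for illustration only: the paper's actual architecture, layer sizes, and training objective are not reproduced here, and every name below (AudioEncoder, VisualEncoder, cross_modal_loss, the InfoNCE-style contrastive objective) is an assumption rather than the authors' implementation. Two modality-specific encoders project audio spectrograms and stacks of lip frames into a shared, modality-independent embedding space in which synchronized pairs are pulled together, in the spirit of SyncNet-style correspondence training.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioEncoder(nn.Module):
        """Maps a log-mel spectrogram clip to a normalized linguistic embedding."""
        def __init__(self, n_mels=80, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),          # pool over time
            )
            self.proj = nn.Linear(256, embed_dim)

        def forward(self, x):                     # x: (batch, n_mels, time)
            h = self.net(x).squeeze(-1)           # (batch, 256)
            return F.normalize(self.proj(h), dim=-1)

    class VisualEncoder(nn.Module):
        """Maps a stack of grayscale lip frames to a normalized linguistic embedding."""
        def __init__(self, n_frames=5, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(n_frames, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),          # pool over space
            )
            self.proj = nn.Linear(128, embed_dim)

        def forward(self, x):                     # x: (batch, n_frames, H, W)
            h = self.net(x).flatten(1)            # (batch, 128)
            return F.normalize(self.proj(h), dim=-1)

    def cross_modal_loss(a_emb, v_emb, temperature=0.07):
        """Symmetric InfoNCE: synchronized audio/video pairs along the diagonal
        are positives; every other pairing in the batch is a negative."""
        logits = a_emb @ v_emb.t() / temperature
        targets = torch.arange(a_emb.size(0), device=a_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy batch of 8 synchronized clips: 100 spectrogram steps, 5 lip frames of 96x96.
    audio = torch.randn(8, 80, 100)
    video = torch.randn(8, 5, 96, 96)
    loss = cross_modal_loss(AudioEncoder()(audio), VisualEncoder()(video))
    loss.backward()

In a setup of this kind, sweeping the hypothetical n_frames parameter of the visual encoder mirrors the reported trend that recognition accuracy improves as more lip frames are supplied.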

Keywords

  • CNNs
  • deep learning
  • AVE-Net
  • SyncNet
  • AVSR
Kerbala Journal for Engineering Sciences
Volume 4, Issue 1
March 2024
Pages 1-14
How to cite (APA)

Hammood, D., Habil, H., Mahmoud, I., & Hanafi, E. (2024). Deep Learning Cross-Modal Learning for Audio-Visual Speech Recognition. Kerbala Journal for Engineering Sciences, 4(1), 1-14. doi: 10.63463/kjes1090

