
Kerbala Journal for Engineering Sciences


Deep Learning Cross-Modal Learning for Audio-Visual Speech Recognition

    Authors

    • Dalal A. Hammood 1
    • Hayder Jasim Habil 2
    • Ibtehal Shakir Mahmoud 3
    • Effariza Hanafi 4

    1 Department of Computer Technical Engineering, Electrical Engineering Technical College, Middle Technical University (MTU), Baghdad, Iraq

    2 College of Applied Arts, Middle Technical University (MTU), Baghdad, Iraq

    3 Al-Iraqia University, Baghdad, Iraq

    4 Universiti Malaya, Kuala Lumpur, Malaysia


Document Type: Research Article

DOI: 10.63463/kjes1090

Abstract

Relating the spoken content of an utterance across visual and audio streams is a crucial aspect of audio-visual speech recognition (AVSR), with applications in audio-visual correspondence and manipulation tasks such as those addressed by AVE-Net and SyncNet. The technique described in this research uses feature disentanglement to handle these tasks simultaneously. Through cross-modal learning, the model transforms visual or auditory linguistic features into modality-independent representations, and correspondence tasks such as those of AVE-Net and SyncNet can be performed using these derived representations. Furthermore, the audio and visual outputs can be manipulated according to the desired subject identity and linguistic content. We conduct comprehensive experiments on a range of recognition and synthesis tasks, evaluating each task separately, and show that the proposed solution successfully addresses both audio-visual learning problems. On enhanced video the system achieves 91.5% accuracy with 5 frames, rising to 99.03% with 15 frames, outperforming previous methods.
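
To make the cross-modal idea concrete, the following is a minimal PyTorch sketch, for illustration only: the paper's actual architecture, layer sizes, and training objective are not reproduced here, and every name below (AudioEncoder, VisualEncoder, cross_modal_loss, the InfoNCE-style contrastive objective) is an assumption rather than the authors' implementation. Two modality-specific encoders project audio spectrograms and stacks of lip frames into a shared, modality-independent embedding space in which synchronized pairs are pulled together, in the spirit of SyncNet-style correspondence training.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioEncoder(nn.Module):
        """Maps a log-mel spectrogram clip to a normalized linguistic embedding."""
        def __init__(self, n_mels=80, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),          # pool over time
            )
            self.proj = nn.Linear(256, embed_dim)

        def forward(self, x):                     # x: (batch, n_mels, time)
            h = self.net(x).squeeze(-1)           # (batch, 256)
            return F.normalize(self.proj(h), dim=-1)

    class VisualEncoder(nn.Module):
        """Maps a stack of grayscale lip frames to a normalized linguistic embedding."""
        def __init__(self, n_frames=5, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(n_frames, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),          # pool over space
            )
            self.proj = nn.Linear(128, embed_dim)

        def forward(self, x):                     # x: (batch, n_frames, H, W)
            h = self.net(x).flatten(1)            # (batch, 128)
            return F.normalize(self.proj(h), dim=-1)

    def cross_modal_loss(a_emb, v_emb, temperature=0.07):
        """Symmetric InfoNCE: synchronized audio/video pairs along the diagonal
        are positives; every other pairing in the batch is a negative."""
        logits = a_emb @ v_emb.t() / temperature
        targets = torch.arange(a_emb.size(0), device=a_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy batch of 8 synchronized clips: 100 spectrogram steps, 5 lip frames of 96x96.
    audio = torch.randn(8, 80, 100)
    video = torch.randn(8, 5, 96, 96)
    loss = cross_modal_loss(AudioEncoder()(audio), VisualEncoder()(video))
    loss.backward()

In a setup of this kind, sweeping the hypothetical n_frames parameter of the visual encoder mirrors the reported trend that recognition accuracy improves as more lip frames are supplied.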

Keywords

  • CNNs
  • deep learning
  • AVE-Net
  • SyncNet
  • AVSR
Kerbala Journal for Engineering Sciences
Volume 4, Issue 1
March 2024
Pages 1-14
How to cite (APA)

Hammood, D., Habil, H., Mahmoud, I., & Hanafi, E. (2024). Deep Learning Cross-Modal Learning for Audio-Visual Speech Recognition. Kerbala Journal for Engineering Sciences, 4(1), 1-14. doi: 10.63463/kjes1090

