CONVOLUTIONAL NEURAL NETWORKS IN DETECTING SPEECH ACTIVITY IN A STREAM
https://doi.org/10.53360/2788-7995-2024-4(16)-5
Abstract
The research presented in this article focuses on the development of a system for detecting speech activity in audio streams using convolutional neural networks (CNNs). Speech activity detection plays a crucial role in many modern applications, such as voice-activated assistants, real-time communication platforms, and automated transcription services. The work synthesizes findings from nine key studies, demonstrating the effectiveness of CNNs in handling complex audio data, isolating speech signals from noise, and improving overall detection accuracy.
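As an illustration of the basic setup, the sketch below (PyTorch) shows how a small CNN can label log-mel spectrogram patches as speech or non-speech. The 64-mel by 32-frame patch size, layer widths, and the dummy training step are assumptions for illustration, not the architecture evaluated in the article.

```python
# A minimal sketch (not the paper's actual model) of a CNN that labels
# fixed-length log-mel spectrogram patches as speech vs. non-speech.
# The 64-mel x 32-frame input and layer sizes are assumptions.
import torch
import torch.nn as nn

class SpeechActivityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 64x32 -> 32x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 32x16 -> 16x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 8, 64), nn.ReLU(),
            nn.Linear(64, 1),                       # logit: speech vs. non-speech
        )

    def forward(self, x):                           # x: (batch, 1, 64 mels, 32 frames)
        return self.classifier(self.features(x))

# One training step on a dummy batch, just to show the intended usage.
model = SpeechActivityCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
patches = torch.randn(8, 1, 64, 32)                # 8 spectrogram patches
labels = torch.randint(0, 2, (8, 1)).float()       # 1 = speech, 0 = non-speech
loss = loss_fn(model(patches), labels)
loss.backward()
optimizer.step()
```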
The research emphasizes the architectural advantages of deep CNN models, such as VGG, ResNet, and AlexNet, highlighting their ability to capture intricate audio features and improve performance across various environments. The study also explores techniques like data augmentation and optimization algorithms, which further enhance the robustness and efficiency of these models.
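For instance, waveform-level augmentation can be as simple as mixing noise into a recording at a target signal-to-noise ratio and applying a random time shift. The sketch below (NumPy) illustrates this; the SNR value, shift range, and helper functions are assumptions rather than the exact scheme used in the study.

```python
# Illustrative waveform-level augmentations (additive noise at a target SNR
# and a random circular time shift); parameters here are assumptions.
import numpy as np

def add_noise(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into the signal at the requested signal-to-noise ratio."""
    noise = noise[: len(signal)]
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def random_shift(signal: np.ndarray, max_shift: int) -> np.ndarray:
    """Circularly shift the waveform by a random number of samples."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(signal, shift)

# Example: augment one second of 16 kHz audio (random arrays stand in for real data).
clean = np.random.randn(16000).astype(np.float32)
noise = np.random.randn(16000).astype(np.float32)
augmented = random_shift(add_noise(clean, noise, snr_db=10.0), max_shift=1600)
```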
By assessing different CNN architectures against several evaluation metrics, the research identifies promising directions for future work, such as optimizing CNN models for real-time applications and investigating hybrid architectures. Overall, this research offers valuable insights into the current state of CNN-based speech activity detection and its implications for real-world applications.
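As an example of such a comparison, frame-level accuracy, precision, recall, and F1 can be computed directly from binary speech/non-speech labels. The short sketch below (NumPy) shows one possible way; the example labels are invented purely for illustration.

```python
# A hedged sketch of frame-level evaluation for a speech activity detector:
# accuracy, precision, recall, and F1 from binary frame labels (made-up data).
import numpy as np

def frame_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1])   # 1 = speech frame, 0 = non-speech
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])
print(frame_metrics(y_true, y_pred))
```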
About the Author
N. M. Taubakabyl (Kazakhstan)
Nurlybek Muratbekuly Taubakabyl – Master's Student
010000, Republic of Kazakhstan, Astana, Mangilik El Avenue, C1
References
1. Deep Speech 2: End-to-end speech recognition in English and Mandarin / D. Amodei et al. // arXiv preprint arXiv:1512.02595. – 2015. https://doi.org/10.48550/arXiv.1512.02595.
2. CNN architectures for large-scale audio classification / S. Hershey et al. // 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – 2017. – P. 131-135. https://arxiv.org/pdf/1609.09430.
3. Very deep multilingual convolutional neural networks for LVCSR / T. Sercu et al. // arXiv preprint arXiv:1509.08967. – 2016. https://arxiv.org/pdf/1509.08967.
4. Luo Y. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation / Y. Luo, N. Mesgarani // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2018. – № 27(8). – P. 1256-1266. https://arxiv.org/pdf/1809.07454.
5. Grill T. Two convolutional neural networks for bird detection in audio signals / T. Grill, J. Schlüter // 2017 25th European Signal Processing Conference (EUSIPCO). – 2017. – P. 1764-1768. https://www.ofai.at/~jan.schlueter/pubs/2017_eusipco.pdf.
6. Joint training of deep neural networks for audio-visual automatic speech recognition / Y. Qian et al. // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2017. – № 25(12). – P. 2381-2393. https://arxiv.org/pdf/2205.13293.
7. Vincent E. Performance measurement in blind audio source separation / E. Vincent, R. Gribonval, C. Févotte // IEEE Transactions on Audio, Speech, and Language Processing. – 2006. – № 14(4). – P. 1462-1469. https://inria.hal.science/inria-00544230/document.
8. Convolutional neural networks for speech recognition / O. Abdel-Hamid et al. // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2014. – № 22(10). – P. 1533-1545. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/CNN_ASLPTrans2-14.pdf.
9. Accelerating very deep convolutional networks for classification and detection / X. Zhang et al. // IEEE Transactions on Pattern Analysis and Machine Intelligence. – 2016. – № 38(10). – P. 1943-1955. https://arxiv.org/pdf/1505.06798.
10. VanderPlas J. Python Data Science Handbook: Essential Tools for Working with Data / J. VanderPlas. – O'Reilly Media. https://jakevdp.github.io/PythonDataScienceHandbook.
For citations:
Taubakabyl N.M. CONVOLUTIONAL NEURAL NETWORKS IN DETECTING SPEECH ACTIVITY IN A STREAM. Bulletin of Shakarim University. Technical Sciences. 2024;1(4(16)):33-40. https://doi.org/10.53360/2788-7995-2024-4(16)-5