CONVOLUTIONAL NEURAL NETWORKS IN DETECTING SPEECH ACTIVITY IN A STREAM
https://doi.org/10.53360/2788-7995-2024-4(16)-5
Abstract
The research presented in this article focuses on the development of a system for detecting speech activity in audio streams using convolutional neural networks (CNNs). Speech activity detection plays a crucial role in many modern applications, such as voice-activated assistants, real-time communication platforms, and automated transcription services. The work synthesizes findings from nine key studies, demonstrating the effectiveness of CNNs in handling complex audio data, isolating speech signals from noise, and improving overall detection accuracy.
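As an illustration of the basic setup, the sketch below (PyTorch) shows how a small CNN can label log-mel spectrogram patches as speech or non-speech. The 64-mel by 32-frame patch size, layer widths, and the dummy training step are assumptions for illustration, not the architecture evaluated in the article.

```python
# A minimal sketch (not the paper's actual model) of a CNN that labels
# fixed-length log-mel spectrogram patches as speech vs. non-speech.
# The 64-mel x 32-frame input and layer sizes are assumptions.
import torch
import torch.nn as nn

class SpeechActivityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 64x32 -> 32x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 32x16 -> 16x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 8, 64), nn.ReLU(),
            nn.Linear(64, 1),                       # logit: speech vs. non-speech
        )

    def forward(self, x):                           # x: (batch, 1, 64 mels, 32 frames)
        return self.classifier(self.features(x))

# One training step on a dummy batch, just to show the intended usage.
model = SpeechActivityCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
patches = torch.randn(8, 1, 64, 32)                # 8 spectrogram patches
labels = torch.randint(0, 2, (8, 1)).float()       # 1 = speech, 0 = non-speech
loss = loss_fn(model(patches), labels)
loss.backward()
optimizer.step()
```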
The research emphasizes the architectural advantages of deep CNN models, such as VGG, ResNet, and AlexNet, highlighting their ability to capture intricate audio features and improve performance across various environments. The study also explores techniques like data augmentation and optimization algorithms, which further enhance the robustness and efficiency of these models.
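For instance, waveform-level augmentation can be as simple as mixing noise into a recording at a target signal-to-noise ratio and applying a random time shift. The sketch below (NumPy) illustrates this; the SNR value, shift range, and helper functions are assumptions rather than the exact scheme used in the study.

```python
# Illustrative waveform-level augmentations (additive noise at a target SNR
# and a random circular time shift); parameters here are assumptions.
import numpy as np

def add_noise(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into the signal at the requested signal-to-noise ratio."""
    noise = noise[: len(signal)]
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def random_shift(signal: np.ndarray, max_shift: int) -> np.ndarray:
    """Circularly shift the waveform by a random number of samples."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(signal, shift)

# Example: augment one second of 16 kHz audio (random arrays stand in for real data).
clean = np.random.randn(16000).astype(np.float32)
noise = np.random.randn(16000).astype(np.float32)
augmented = random_shift(add_noise(clean, noise, snr_db=10.0), max_shift=1600)
```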
By assessing different CNN architectures against several evaluation metrics, the research identifies promising directions for future work, such as optimizing CNN models for real-time applications and investigating hybrid architectures. Overall, this research offers valuable insights into the current state of CNN-based speech activity detection and its implications for real-world applications.
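As an example of such a comparison, frame-level accuracy, precision, recall, and F1 can be computed directly from binary speech/non-speech labels. The short sketch below (NumPy) shows one possible way; the example labels are invented purely for illustration.

```python
# A hedged sketch of frame-level evaluation for a speech activity detector:
# accuracy, precision, recall, and F1 from binary frame labels (made-up data).
import numpy as np

def frame_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1])   # 1 = speech frame, 0 = non-speech
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])
print(frame_metrics(y_true, y_pred))
```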
About the Author
N. M. Taubakabyl (Kazakhstan)
Nurlybek Muratbekuly Taubakabyl – Master's Student
010000, Republic of Kazakhstan, Astana, Mangilik El Avenue, C1
References
1. Deep Speech 2: End-to-end speech recognition in English and Mandarin / D. Amodei et al. // arXiv preprint arXiv:1512.02595. – 2015. https://doi.org/10.48550/arXiv.1512.02595.
2. CNN architectures for large-scale audio classification / S. Hershey et al. // 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – 2017. – P. 131-135. https://arxiv.org/pdf/1609.09430.
3. Very deep multilingual convolutional neural networks for LVCSR / T. Sercu et al. // arXiv preprint arXiv:1509.08967. – 2016. https://arxiv.org/pdf/1509.08967.
4. Luo Y. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation / Y. Luo, N. Mesgarani // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2018. – № 27(8). – P. 1256-1266. https://arxiv.org/pdf/1809.07454.
5. Grill T. Two convolutional neural networks for bird detection in audio signals / T. Grill, J. Schlüter // 2017 25th European Signal Processing Conference (EUSIPCO). – 2017. – P. 1764-1768. https://www.ofai.at/~jan.schlueter/pubs/2017_eusipco.pdf.
6. Joint training of deep neural networks for audio-visual automatic speech recognition / Y. Qian et al. // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2017. – № 25(12). – P. 2381-2393. https://arxiv.org/pdf/2205.13293.
7. Vincent E. Performance measurement in blind audio source separation / E. Vincent, R. Gribonval, C. Févotte // IEEE Transactions on Audio, Speech, and Language Processing. – 2006. – № 14(4). – P. 1462-1469. https://inria.hal.science/inria-00544230/document.
8. Convolutional neural networks for speech recognition / O. Abdel-Hamid et al. // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2014. – № 22(10). – P. 1533-1545. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/CNN_ASLPTrans2-14.pdf.
9. Accelerating very deep convolutional networks for classification and detection / X. Zhang et al. // IEEE Transactions on Pattern Analysis and Machine Intelligence. – 2016. – № 38(10). – P. 1943-1955. https://arxiv.org/pdf/1505.06798.
10. VanderPlas J. Python Data Science Handbook: Essential Tools for Working with Data / J. VanderPlas. – O'Reilly Media. https://jakevdp.github.io/PythonDataScienceHandbook.
For citations:
Taubakabyl N.M. CONVOLUTIONAL NEURAL NETWORKS IN DETECTING SPEECH ACTIVITY IN A STREAM. Bulletin of Shakarim University. Technical Sciences. 2024;1(4(16)):33-40. https://doi.org/10.53360/2788-7995-2024-4(16)-5