Exploration of Spontaneous Speech Corpus Development in Urban Agriculture Instructional Videos

Trisna Gelar, Aprianti Nanda

Abstract


Video transcriptions can be generated automatically from the video maker's speech in its original language, but transcription quality depends on the quality of the audio signal and the naturalness of the speaker's voice. In this study, Deep Speech is used to predict characters from acoustic evidence alone, without modeling language rules. The multilingual Common Voice corpus helps Deep Speech transcribe Indonesian; however, it does not cover the specific topic of urban agriculture, so an additional corpus is needed to build acoustic and language models for this domain. A total of 15 popular videos with closed captions and nine e-books on horticulture (fruit, vegetables, and medicinal plants) were curated. The videos were extracted into audio and transcriptions formatted to specification as training data, while the agricultural texts were transformed into a language model used to rescore predictions during recognition. The evaluation shows that the number of training epochs improves transcription performance, and that applying the language model score during prediction improves WER because it helps interpret agricultural terms. Another finding is that the model was unable to predict short words in informal register, particularly those at the end of a sentence.
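As a concrete illustration of the prediction step described above, the sketch below loads a trained Deep Speech acoustic model, attaches a domain language model packaged as an external scorer, transcribes one clip, and measures WER with the jiwer package. This is a minimal sketch under stated assumptions, not the authors' pipeline: the file names (indonesian_ua.pbmm, urban_ag.scorer, sample.wav), the reference caption, and the alpha/beta weights are illustrative placeholders.

    import wave

    import numpy as np
    from deepspeech import Model  # pip install deepspeech
    from jiwer import wer         # pip install jiwer

    # Load the exported acoustic model; Deep Speech expects 16 kHz,
    # 16-bit, mono PCM audio. The file name is a placeholder.
    ds = Model("indonesian_ua.pbmm")

    # Attach the urban-agriculture language model packaged as a scorer.
    # alpha and beta weight the language-model score against the acoustic
    # score during beam-search decoding (values here are illustrative).
    ds.enableExternalScorer("urban_ag.scorer")
    ds.setScorerAlphaBeta(0.93, 1.18)

    # Read one WAV clip into the int16 buffer that stt() expects.
    with wave.open("sample.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    hypothesis = ds.stt(audio)
    reference = "cara menanam cabai di pekarangan"  # hypothetical ground-truth caption

    print("hypothesis:", hypothesis)
    print("WER:", wer(reference, hypothesis))

For the training side, Mozilla's Deep Speech scripts consume CSV manifests with wav_filename, wav_filesize, and transcript columns, which is the natural target format for the audio/caption pairs extracted from the curated videos.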


Keywords


Acoustic models, Corpus exploration, Language models, Spontaneous speech, Urban agriculture



DOI: https://doi.org/10.17509/seict.v3i1.44548



Copyright (c) 2022 Journal of Software Engineering, Information and Communication Technology (SEICT)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Journal of Software Engineering, Information and Communication Technology (SEICT) (e-ISSN: 2774-1699 | p-ISSN: 2744-1656), published by Program Studi Rekayasa Perangkat Lunak, Kampus UPI di Cibiru.

