An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics
Coats, Steven
De Gruyter
Coats, S. (2025). An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics. In L. Cotgrove, L. Herzberg, & H. Lüngen (Eds.), Exploring digitally-mediated communication with corpora (pp. 257–274). De Gruyter. https://doi.org/10.1515/9783111434018-012
https://creativecommons.org/licenses/by/4.0/
© 2025 with the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202507045062
Abstract
Large volumes of audio and video data are accessible through video sharing sites, streaming services, and social media platforms, but until recently, relatively little of this content has been utilized as research data for large-scale studies of grammatical or phonetic variation. This chapter discusses a notebook-based pipeline designed to analyze phonetic data from online video content, made possible by recent advances in language technology such as improvements in automatic speech recognition and forced alignment. It provides an overview of open-source frameworks for working with speech data, noting that while several tools have been developed to handle some or all of these tasks, their installation and setup may be complex and incompatibility issues may arise. Notebook-based pipelines, increasingly used in all fields of data science, offer the advantages of flexibility and adaptability. In this chapter, we introduce the Video Phonetics Pipeline (ViPP) for the extraction and analysis of audio and transcript data from video and streaming sites such as YouTube, X, TikTok, and many others, a pipeline which leverages functions from the open-source Python library yt-dlp to retrieve data, then utilizes the Montreal Forced Aligner to align audio with text. Formants are measured with Praat-Parselmouth, and packages from Python’s standard library can be used for statistical analysis and visualization. The script pipeline, available as a notebook at GitHub and in a Google Colab environment, is customizable. The utility of the pipeline is demonstrated with an example: a consideration of diphthong trajectories in contemporary North American English, based on data from the Corpus of North American Spoken English (CoNASE).
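The three stages named in the abstract (retrieval with yt-dlp, alignment with the Montreal Forced Aligner, formant measurement with Praat-Parselmouth) can be sketched roughly as follows. This is an illustrative outline, not the ViPP notebook's actual code: the function names, the output layout, the five-point sampling scheme, and the `english_us_arpa` model names are all assumptions made for the example, and the third-party packages `yt-dlp` and `praat-parselmouth` (plus an MFA installation and ffmpeg) would need to be available.

```python
from pathlib import Path


def ydl_options(out_dir: str, lang: str = "en") -> dict:
    """Build yt-dlp options: best audio converted to WAV, plus subtitles.

    The output template and language default are assumptions for this sketch.
    """
    return {
        "format": "bestaudio/best",
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "writesubtitles": True,       # uploaded captions, if any
        "writeautomaticsub": True,    # ASR captions as a fallback
        "subtitleslangs": [lang],
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
    }


def download(url: str, out_dir: str) -> None:
    """Retrieve audio and transcript for one video (needs yt-dlp and ffmpeg)."""
    import yt_dlp  # third-party; imported lazily so the helpers run without it

    with yt_dlp.YoutubeDL(ydl_options(out_dir)) as ydl:
        ydl.download([url])


def mfa_align_command(corpus_dir: str, output_dir: str,
                      model: str = "english_us_arpa") -> list:
    """Command line for MFA alignment, e.g. to pass to subprocess.run.

    Using the same name for dictionary and acoustic model is an assumption.
    """
    return ["mfa", "align", corpus_dir, model, model, output_dir]


def trajectory_times(start: float, end: float, n_points: int = 5) -> list:
    """Evenly spaced measurement times strictly inside a vowel interval."""
    step = (end - start) / (n_points + 1)
    return [start + step * (i + 1) for i in range(n_points)]


def formant_trajectory(wav_path: str, start: float, end: float,
                       n_points: int = 5) -> list:
    """Return (F1, F2) in Hz at each sampled time (needs praat-parselmouth)."""
    import parselmouth  # third-party; imported lazily

    formant = parselmouth.Sound(wav_path).to_formant_burg()
    return [
        (formant.get_value_at_time(1, t), formant.get_value_at_time(2, t))
        for t in trajectory_times(start, end, n_points)
    ]
```

In this sketch, the start and end times passed to `formant_trajectory` would come from the vowel intervals in the TextGrid files that the aligner writes; sampling several points per interval rather than a single midpoint is what makes a diphthong trajectory visible.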
Collections
- Open access [38840]