Building a searchable online corpus of Australian and New Zealand aligned speech
Coats, Steven (2024-09-03)
Taylor & Francis
Coats, S. (2024). Building a searchable online corpus of Australian and New Zealand aligned speech. Australian Journal of Linguistics, 1–17. https://doi.org/10.1080/07268602.2024.2368780
https://creativecommons.org/licenses/by/4.0/
© 2024 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The terms on which this article has been published allow the posting of the Accepted Manuscript in a repository by the author(s) or with their consent.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202409065725
Abstract
Advances in automatic speech recognition technology, increases in bandwidth availability, and the widespread use of video streaming and sharing platforms have opened new horizons for corpus phonetics. CoANZSE Audio, a searchable online version of the Corpus of Australian and New Zealand Spoken English, provides access to over 195 million words of transcribed speech from transcripts of videos uploaded to YouTube by councils and other local government entities in Australia and New Zealand. Audio and forced alignment files are also available, making the resource suitable for the investigation of a range of research questions pertaining to morphosyntax, phonetics, and discourse. The resource, which is freely available via login through CLARIN, Europe’s main language resources infrastructure network, was created through the use of open-source tools and software: yt-dlp, a Python library for collecting data from video and streaming websites; the Montreal Forced Aligner, a recent neural network alignment suite; and Parselmouth-Praat, Python bindings for the Praat acoustic analysis software. The website is powered by BlackLab, which combines a powerful search engine based on Apache Lucene with an intuitive web frontend. CoANZSE Audio may be useful for the investigation of regional differentiation of language features, and with additional annotation, differences in feature use according to social or demographic groups. Recent applications have included studies of double modals, a rare syntactic feature, and apology sequences. The nature of the audio and alignment data may make the resource especially suitable for the study of regional phonetic variation. Furthermore, the methods used to create the resource may be of interest to researchers seeking to adopt a pipeline approach for the creation of specialized corpora from publicly available online content.