Using neural networks to generate Finnish word embedding vectors
Tapaninaho, Joonas (2024-04-29)
© 2024 Joonas Tapaninaho. Unless otherwise stated, reuse is permitted under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license (https://creativecommons.org/licenses/by/4.0/). Reuse is permitted provided that the source is appropriately credited and any changes are indicated. The use or reproduction of elements that are not the property of the author or authors may require permission directly from the respective rights holders.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:oulu-202404293028
Abstract
This bachelor’s thesis examines how well word embedding vectors trained with shallow neural networks suit the Finnish language and how well they recognize semantically similar words and synonyms. The embedding vectors compared in the thesis were trained with the Word2Vec and FastText algorithms. The neural networks behind these algorithms differ slightly in how they operate, and the differences show up when the resulting embedding vectors are compared. The thesis aims to define these structural differences and their effects on the embedding vectors precisely, and to describe the historical development of natural language processing from machine learning methods to current neural network-based methods that use deep learning. The common presupposition is that embedding vectors trained with FastText should recognize semantically similar words and synonyms better than those trained with Word2Vec. However, the thesis shows that this is not necessarily the case for Finnish: in addition to the dimension of the embedding vectors, the amount and source of the training data strongly influence the result.
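One well-known structural difference between the two algorithms is that Word2Vec learns a single vector per whole word, whereas FastText also builds on character n-grams of each word, which matters for a heavily inflected language such as Finnish. The following is a minimal illustrative sketch of that idea and of the cosine similarity commonly used to compare embedding vectors. It is not the thesis's code: the function names `char_ngrams` and `cosine_similarity`, the example words, and the n-gram range of 3 to 6 (the range described for FastText) are assumptions chosen for illustration.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams inside the word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# An inflected form ("talossa", "in the house") shares several n-grams
# with its base form ("talo", "house"), so a subword model can relate them.
shared = set(char_ngrams("talo")) & set(char_ngrams("talossa"))
print(sorted(shared))
```

In a subword model the vector of a word is composed from the vectors of its n-grams, so forms that share many n-grams tend to end up close to each other. Whether that helps with synonyms, which often share no characters at all, is exactly the kind of question the thesis investigates.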