OuluREPO – University of Oulu repository

Measuring LDA topic stability from clusters of replicated runs

Mäntylä, Mika V.; Claes, Maëlick; Farooq, Umar (2018-10-11)

 

URL:
https://doi.org/10.1145/3239235.3267435

Association for Computing Machinery

Mika V. Mantyla, Maelick Claes, and Umar Farooq. 2018. Measuring LDA topic stability from clusters of replicated runs. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '18). ACM, New York, NY, USA, Article 49, 4 pages. DOI: https://doi.org/10.1145/3239235.3267435

https://rightsstatements.org/vocab/InC/1.0/
© ACM 2018. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '18). ACM, New York, NY, USA, Article 49, 4 pages. DOI: https://doi.org/10.1145/3239235.3267435.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe201901031265

Abstract

Background: The amount of unstructured and textual data is increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular data analysis method for it. Past work suggests that instability of LDA topics may lead to systematic errors.

Aim: We propose a method that relies on replicated LDA runs and clustering, and that provides a stability metric for the topics.

Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing the ten most probable words. For the clusters, we try multiple stability metrics that show the stability of the topics inside the clusters; of these, we recommend Rank-Biased Overlap.
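The clustering and stability steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance used for K-medoids or how the metric is aggregated per cluster, so the sketch assumes a distance of 1 minus a normalised, truncated Rank-Biased Overlap between top-word lists, and a mean pairwise score per cluster. The toy topics, the function names (`rbo`, `k_medoids`, `cluster_stability`), the persistence value `p=0.9`, and the farthest-first initialisation are all hypothetical choices made for the example.

```python
from itertools import combinations


def rbo(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap, normalised so identical lists score 1.0.

    Assumption: a simplified variant, not the extrapolated RBO form.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    weighted, total_weight = 0.0, 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        weight = p ** (d - 1)
        # Agreement at depth d: shared items among the top d, divided by d.
        weighted += weight * len(seen_a & seen_b) / d
        total_weight += weight
    return weighted / total_weight if total_weight else 0.0


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(meds):
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total distance to the rest.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members)) if members else medoids[c]
            updated.append(best)
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids), medoids


def cluster_stability(topics, labels, cluster):
    """Mean pairwise RBO of the topics in one cluster (0.0 if fewer than two)."""
    members = [t for t, label in zip(topics, labels) if label == cluster]
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(rbo(a, b) for a, b in pairs) / len(pairs)


# Made-up top-word lists standing in for n=3 replicated runs with k=2 topics each.
topics = [
    ["fix", "bug", "crash", "null", "check"],      # run 1, topic 1
    ["test", "add", "update", "build", "file"],    # run 1, topic 2
    ["bug", "fix", "crash", "check", "null"],      # run 2, topic 1
    ["add", "test", "build", "update", "file"],    # run 2, topic 2
    ["fix", "crash", "bug", "null", "pointer"],    # run 3, topic 1
    ["test", "update", "add", "file", "remove"],   # run 3, topic 2
]
k = 2

dist = [[1.0 - rbo(a, b) for b in topics] for a in topics]
labels, medoids = k_medoids(dist, k)
stability = {c: cluster_stability(topics, labels, c) for c in range(k)}

for c in range(k):
    print(f"cluster {c}: medoid words {topics[medoids[c]]}, stability {stability[c]:.3f}")
```

In this sketch each cluster is presented through the top words of its medoid topic, and a stability value close to 1 means the replicated runs ranked the same words in nearly the same order. In a real analysis the top-word lists would come from the n replicated LDA runs rather than from hand-written data.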

Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics.

Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms such as LDA may lead to conclusions that cannot be replicated. Our approach makes LDA stability transparent and complements, rather than replaces, many prior works that focus on LDA parameter tuning.
