From Reinvention to Reuse: An Empirical Example Study On Technical Debt Dataset
Rantala, Leevi; Mäntylä, Mika; Sridharan, Murali (2024-11-27)
Content will be made public: 27.11.2025
Springer
Rantala, L., Mäntylä, M.V., Sridharan, M. (2025). From Reinvention to Reuse: An Empirical Example Study on Technical Debt Dataset. In: Pfahl, D., Gonzalez Huerta, J., Klünder, J., Anwar, H. (eds) Product-Focused Software Process Improvement. PROFES 2024. Lecture Notes in Computer Science, vol 15452. Springer, Cham. https://doi.org/10.1007/978-3-031-78386-9_8
https://rightsstatements.org/vocab/InC/1.0/
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202502071536
Abstract
Self-Admitted Technical Debt (SATD) is a subset of Technical Debt (TD) in which the developer leaves a comment in the source code, thus marking the place where debt has been taken on. Previous research on SATD relies on either creating new datasets or reusing existing ones. One seminal SATD dataset, published by Maldonado et al. [14], contains over 4,000 SATD comments classified into five TD categories. The drawback of the dataset is that it lacks any other information, e.g. static analysis results, which seriously limits its possible use cases. We remedy this situation by reforming the dataset: we combine the original comments with contextual information and static analysis results from the source code and recreate the dataset as an SQLite database. Our reformed dataset contains over 13,000 files, nearly 14,000 classes, almost 100,000 methods, and over 650,000 code violation instances. The reformed dataset enables varied and detailed analyses in the future, which we demonstrate by examining the relationship between SATD comments and code violations. The results show that at the method level, the most important predictors are the total number of code violations and the number of violations labelled as Priority 3 or belonging to the Documentation Rule Set. At the file level, LOC is an important predictor, alongside the number of violations from the Documentation Rule Set or with a Priority 2 classification. Overall, our example study demonstrates the potential of reforming existing datasets.
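To illustrate the kind of method-level analysis the abstract describes, the following is a minimal sketch of how per-method violation counts might be derived from an SQLite database. The schema is an assumption made purely for illustration: the table names `method` and `violation`, the columns `has_satd`, `rule_set` and `priority`, and the sample rows are all hypothetical and do not reflect the actual layout or contents of the reformed dataset.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a HYPOTHETICAL schema and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE method (
            id INTEGER PRIMARY KEY,
            name TEXT,
            has_satd INTEGER  -- 1 if the method carries an SATD comment
        );
        CREATE TABLE violation (
            id INTEGER PRIMARY KEY,
            method_id INTEGER REFERENCES method(id),
            rule_set TEXT,
            priority INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO method VALUES (?, ?, ?)",
        [(1, "parse", 1), (2, "render", 0), (3, "close", 0)],
    )
    conn.executemany(
        "INSERT INTO violation (method_id, rule_set, priority) VALUES (?, ?, ?)",
        [
            (1, "Documentation", 3),
            (1, "Design", 3),
            (1, "Documentation", 2),
            (2, "Documentation", 3),
        ],
    )
    return conn


def violation_features(conn):
    """Return (name, has_satd, total, priority3, documentation) per method."""
    return conn.execute(
        """
        SELECT m.name,
               m.has_satd,
               COUNT(v.id) AS total,
               COALESCE(SUM(v.priority = 3), 0) AS priority3,
               COALESCE(SUM(v.rule_set = 'Documentation'), 0) AS documentation
        FROM method AS m
        LEFT JOIN violation AS v ON v.method_id = m.id
        GROUP BY m.id
        ORDER BY m.id
        """
    ).fetchall()


if __name__ == "__main__":
    for row in violation_features(build_demo_db()):
        print(row)
```

The `LEFT JOIN` keeps methods without any violations in the result, so that they appear with zero counts instead of being dropped. Rows of this shape could then serve as input features for a model that predicts the presence of SATD, although the modelling approach used in the paper is not specified in the abstract.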