Evaluating Text Summarization Techniques and Factual Consistency with Language Models
Islam, Md Moinul; Muhammad, Usman; Oussalah, Mourad (2025-01-16)
IEEE
16.01.2025
M. M. Islam, U. Muhammad and M. Oussalah, "Evaluating Text Summarization Techniques and Factual Consistency with Language Models," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 116-122, doi: 10.1109/BigData62323.2024.10826032.
https://rightsstatements.org/vocab/InC/1.0/
© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202505263918
Abstract
Standard evaluation of automated text summarization (ATS) methods relies on manually crafted golden summaries. With the advances in Large Language Models (LLMs), it is legitimate to ask whether these models can now complement or replace human-crafted summaries. This study examines the effectiveness of several language models (LMs), focusing specifically on preserving factual consistency. Through a thorough assessment of conventional and state-of-the-art performance metrics, such as ROUGE, BLEU, BERTScore, FActScore, and LongDocFACTScore, across diverse datasets, our findings highlight the close relationship between linguistic eloquence and factual accuracy. They suggest that while LLMs such as GPT and LLaMA demonstrate considerable competence in producing concise and contextually aware summaries, difficulties remain in ensuring factual accuracy, particularly in domain-specific settings. Moreover, this work adds to existing knowledge of summarization dynamics and highlights the need for more reliable and tailored evaluation techniques that minimize the probability of factual errors in ATS-generated text. In particular, the findings advance the field by providing a rigorous assessment of the balance between linguistic fluency and factual correctness, highlighting the limitations of current ATS frameworks and metrics for improving the factual reliability of LM-generated summaries.
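As an illustration of the reference-based evaluation the abstract describes, the following minimal sketch scores a candidate summary against a human-written reference with ROUGE and BERTScore using the open-source rouge-score and bert-score packages. The example texts and the chosen metric variants are assumptions for illustration only, not the evaluation pipeline used in the paper; factuality-oriented metrics such as FActScore and LongDocFACTScore work differently (decomposing a summary into atomic facts and verifying them against the source) and are not reproduced here.

```python
# Illustrative sketch only: scores a candidate summary against a reference
# summary with ROUGE and BERTScore. The texts below are made-up examples,
# not data from the paper, and the metric settings are assumptions.
from rouge_score import rouge_scorer        # pip install rouge-score
from bert_score import score as bert_score  # pip install bert-score

reference = "The company reported a 10% rise in quarterly revenue driven by cloud services."
candidate = "Quarterly revenue grew about 10%, mainly because of the cloud business."

# Lexical-overlap metrics: ROUGE-1 and ROUGE-L with stemming enabled.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-1 F1:", round(rouge["rouge1"].fmeasure, 3))
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# Embedding-based metric: BERTScore compares contextual token embeddings,
# so it rewards paraphrases that plain n-gram overlap would miss.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```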
Collections
- Open access [38618]