Semantic-Driven Focused Crawling Using LASER and FAISS: A Novel Approach for Threat Detection and Improved Information Retrieval
Balasubramanian, Prasasthy; Seby, Justin; Kostakos, Panos (2024-05-29)
IEEE
P. Balasubramanian, J. Seby and P. Kostakos, "Semantic-Driven Focused Crawling Using LASER and FAISS: A Novel Approach for Threat Detection and Improved Information Retrieval," 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, United Kingdom, 2023, pp. 1598-1605, doi: 10.1109/TrustCom60117.2023.00218.
https://rightsstatements.org/vocab/InC/1.0/
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202408075251
Abstract
Focused crawling retrieves topic-specific information from the internet by directing the crawl towards a particular subset of webpages. This approach can significantly improve the discovery and extraction of relevant cybersecurity content, thereby enhancing the precision, recall, and overall relevance of the collected data. Current state-of-the-art focused crawlers either depend on basic keyword and phrase matching or rely on complex logic, yet they demonstrate only average efficiency in identifying relevant webpages. This study implements a semantic-driven focused crawler built specifically to discover pages that contain Cyber Threat Intelligence (CTI). The proposed method integrates keyword-based and link-based strategies with LASER embeddings, enabling a more nuanced understanding of web content and improving the efficiency and effectiveness of focused crawling for web page classification. To measure the accuracy and efficiency of the focused crawler, a transformer-based large language model (LLM) classifier was trained to act as a web page classifier that judges the relevance of the pages the crawler has processed. The effectiveness of the proposed approach is assessed using two main metrics: harvest rate and irrelevance ratio. The proposed methodology outperforms existing methodologies, achieving an average harvest rate of 0.87 and an irrelevance ratio of 0.13.
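The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the focused-crawling literature: harvest rate is the fraction of crawled pages judged relevant, and irrelevance ratio is the fraction judged irrelevant. The abstract does not spell out its formulas, so these definitions, the function name `crawl_metrics`, and the sample labels are all hypothetical; the assumed definitions are at least consistent with the reported values, which sum to 1.

```python
def crawl_metrics(relevance_labels):
    """Return (harvest_rate, irrelevance_ratio) for per-page relevance labels.

    relevance_labels: iterable of booleans, one per crawled page, where True
    means the page classifier judged the page relevant. The definitions used
    here are assumptions, not taken from the paper.
    """
    labels = list(relevance_labels)
    total = len(labels)
    if total == 0:
        raise ValueError("no pages were crawled")
    relevant = sum(1 for is_relevant in labels if is_relevant)
    harvest_rate = relevant / total
    irrelevance_ratio = (total - relevant) / total
    return harvest_rate, irrelevance_ratio


# Hypothetical crawl: 87 of 100 pages labelled relevant by the classifier.
sample_labels = [True] * 87 + [False] * 13
harvest_rate, irrelevance_ratio = crawl_metrics(sample_labels)
print(f"harvest rate = {harvest_rate:.2f}, irrelevance ratio = {irrelevance_ratio:.2f}")
```

Under these assumed definitions the two metrics are complementary, so a crawler can be compared on either one; reporting both simply makes the proportion of wasted fetches explicit.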
Collections
- Open access [34343]