A cognitive platform for collecting cyber threat intelligence and real-time detection using cloud computing
Balasubramanian, Prasasthy; Nazari, Sadaf; Kholgh, Danial Khosh; Mahmoodi, Alireza; Seby, Justin; Kostakos, Panos (2025-01-08)
Elsevier
08.01.2025
Prasasthy Balasubramanian, Sadaf Nazari, Danial Khosh Kholgh, Alireza Mahmoodi, Justin Seby, Panos Kostakos, A cognitive platform for collecting cyber threat intelligence and real-time detection using cloud computing, Decision Analytics Journal, Volume 14, 2025, 100545, ISSN 2772-6622, https://doi.org/10.1016/j.dajour.2025.100545
https://creativecommons.org/licenses/by/4.0/
© 2025 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202504172777
Abstract
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. However, for most organizations, collecting actionable CTI remains both a technical bottleneck and a black box. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. This study proposes an efficient cloud-based platform capable of running compute-intensive data pipelines for real-time detection, collection, and sharing of CTI from various online sources. We developed a prototype platform (TSTEM) with a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, the Elasticsearch, Logstash, and Kibana (ELK) stack, Kafka, and Machine Learning Operations (MLOps) to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Moreover, the provisioning, monitoring, and management of the platform are achieved through infrastructure as code (IaC). Custom focus-crawlers collect web content, which a first-level classifier processes to identify potential IOCs. Relevant content advances to a second level for further examination. State-of-the-art natural language processing (NLP) models are used for classification and entity extraction, enhancing the IOC extraction methodology. Our results indicate that these models achieve high accuracy (exceeding 98%) in classification and extraction tasks, and do so in under a minute. The system’s effectiveness is due to a finely tuned IOC extraction method that operates at multiple stages, ensuring precise identification with low false positives.
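The two-level flow described in the abstract (a first-level relevance check, then a second-level extraction step) can be illustrated with a minimal sketch. This is not the TSTEM implementation: the keyword list stands in for the first-level classifier, the regular expressions stand in for the NLP entity extractor, and all names here (`SECURITY_TERMS`, `IOC_PATTERNS`, `is_relevant`, `extract_iocs`, `process`) are hypothetical.

```python
import re

# Assumption: a keyword list stands in for the first-level classifier.
SECURITY_TERMS = ("malware", "phishing", "ransomware", "botnet", "exploit")

# Assumption: regexes stand in for the second-level NLP entity extractor.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def is_relevant(text):
    """Level 1: cheap filter that keeps only security-related content."""
    lowered = text.lower()
    return any(term in lowered for term in SECURITY_TERMS)


def extract_iocs(text):
    """Level 2: pull candidate IOCs out of text that passed level 1."""
    found = {}
    for name, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[name] = matches
    return found


def process(documents):
    """Run both levels and keep only documents that yield at least one IOC."""
    results = []
    for doc in documents:
        if not is_relevant(doc):
            continue  # dropped at level 1
        iocs = extract_iocs(doc)
        if iocs:
            results.append({"text": doc, "iocs": iocs})
    return results


if __name__ == "__main__":
    sample = [
        "New ransomware campaign uses 203.0.113.7 as a drop server",
        "Our cafe opens at 10.0.0.1 o'clock",
        "phishing kit observed, no indicators yet",
    ]
    for hit in process(sample):
        print(hit["iocs"])
```

Running the cheap filter first means the more expensive extraction step only sees content that is likely to be relevant, which is one plausible reason a staged design keeps false positives low.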
Collections
- Open access [37689]