PAC-GPT: A Novel Approach to Generating Synthetic Network Traffic With GPT-3
Kholgh, Danial Khosh; Kostakos, Panos (2023-10-18)
Kholgh, Danial Khosh
Kostakos, Panos
IEEE
18.10.2023
Kholgh, D. K., & Kostakos, P. (2023). PAC-GPT: A Novel Approach to Generating Synthetic Network Traffic With GPT-3. In IEEE Access (Vol. 11, pp. 114936–114951). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/access.2023.3325727.
https://creativecommons.org/licenses/by-nc-nd/4.0/
© 2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.
https://creativecommons.org/licenses/by-nc-nd/4.0/
© 2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.
https://creativecommons.org/licenses/by-nc-nd/4.0/
Julkaisun pysyvä osoite on
https://urn.fi/URN:NBN:fi:oulu-202312113625
https://urn.fi/URN:NBN:fi:oulu-202312113625
Tiivistelmä
Abstract
The application of machine learning models, particularly in cybersecurity, has surged significantly in the past few years. However, the effectiveness of these models is predominantly tethered to the quality and breadth of the training data they ingest. The scarcity of realistic datasets within the cybersecurity field constitutes a considerable challenge to the development of industry-grade tools intended for real-world application scenarios. Specifically, current datasets are either significantly outdated or fall short on both qualitative and quantitative fronts, primarily because many organizations exhibit reluctance in data sharing, stemming from privacy concerns or the potential threat to trade secrets. To address this challenge, the paper introduces PAC-GPT, a novel framework to generate reliable synthetic data for machine learning methods based on Open AI’s Generative Pre-trained Transformer 3 (GPT-3). The core components of this framework are two modules, namely a Flow Generator, which is responsible for capturing and regenerating patterns in a series of network packets, and Packet Generator, which can generate individual network packets given the network flow. We also propose a packet generator based on LLM chaining and then proceed to assess, compare, and evaluate its performance using metrics such as loss, accuracy and success rate, concluding that transformers are a suitable approach for synthetic packet generation with minimal fine-tuning performed. Lastly, a streamlined command line interface (CLI) tool has been devised to facilitate the seamless access of this innovative data generation strategy by professionals from various disciplines.
The application of machine learning models, particularly in cybersecurity, has surged significantly in the past few years. However, the effectiveness of these models is predominantly tethered to the quality and breadth of the training data they ingest. The scarcity of realistic datasets within the cybersecurity field constitutes a considerable challenge to the development of industry-grade tools intended for real-world application scenarios. Specifically, current datasets are either significantly outdated or fall short on both qualitative and quantitative fronts, primarily because many organizations exhibit reluctance in data sharing, stemming from privacy concerns or the potential threat to trade secrets. To address this challenge, the paper introduces PAC-GPT, a novel framework to generate reliable synthetic data for machine learning methods based on Open AI’s Generative Pre-trained Transformer 3 (GPT-3). The core components of this framework are two modules, namely a Flow Generator, which is responsible for capturing and regenerating patterns in a series of network packets, and Packet Generator, which can generate individual network packets given the network flow. We also propose a packet generator based on LLM chaining and then proceed to assess, compare, and evaluate its performance using metrics such as loss, accuracy and success rate, concluding that transformers are a suitable approach for synthetic packet generation with minimal fine-tuning performed. Lastly, a streamlined command line interface (CLI) tool has been devised to facilitate the seamless access of this innovative data generation strategy by professionals from various disciplines.
Kokoelmat
- Avoin saatavuus [34210]