
Synthetic Cyber Knowledge Graphs

Blog Authors: Felecia ML., James S., Donita R.

Team: James S. (Fall 2023 Lead), Donita R., Felecia ML., Chris L., Shanita T., Al J. (Spring 2023 Lead), The University of North Carolina at Pembroke (Fall 2023) and Winston-Salem State University (Spring 2023)

Background

As data continues to grow in cyberspace, it can be difficult to triage the most relevant information given the volume, value, variety, velocity, and veracity (i.e., the 5 Vs of Big Data) of the data.  Developing techniques to effectively discover the knowns and unknowns of information in the cyber domain is challenging because cyber datasets are often limited, private, sensitive, classified, and/or proprietary.  Therefore, how can researchers and data scientists develop algorithms to detect patterns and trends, such as how adversaries change over time or new types of cyber attacks, using restriction-free cyber datasets?

Knowledge graphs (KGs) can be a powerful way to organize, understand, and visualize large datasets.  Leveraging KGs has the benefits of

  1. capturing entities and their relationships;
  2. putting large amounts of information in context;
  3. having an ontology that captures knowledge about the domain.

Furthermore, synthetic data generation, which produces artificial data with the same statistical properties as an actual dataset, can

  1. minimize Personally Identifiable Information (PII) and proprietary or sensitive information;
  2. increase the size of the training dataset;
  3. address data imbalance;
  4. potentially offer a cheaper and more efficient method to expand the training dataset.  

Combining synthetic data generation with knowledge graphs allows for rendering cyber scenarios that have the same characteristics as real cyber events while providing structure for analyzing the synthetically generated dataset.  Researchers from government, industry, and academia gain the capability to generate a synthetic cyber scenario/dataset with which to develop, test, and evaluate AI/ML techniques and algorithms that triage the most relevant information in the cybersecurity realm, without worrying about sensitivities in the data.

Objective

Generate synthetic KGs in STIX 2.1 format that have similar properties to existing commercial, governmental, or academic cyber data/graphs, in order to

  1. let users render a variety of restriction-free example cyber knowledge graphs;
  2. enable tool builders to build better cybersecurity algorithms that extract patterns and trends in the dataset.

STIX (Structured Threat Information Expression) is a language and serialization format used to exchange cyber threat intelligence (CTI). STIX 2.1 allows information to be visually represented as a knowledge graph or stored as JSON.
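To make the two views concrete, here is a minimal sketch in Python (standard library only) that reads a small STIX 2.1 bundle from JSON and turns it into graph nodes and labeled edges. The bundle contents, names, and identifiers are placeholders invented for illustration, and `bundle_to_graph` is a hypothetical helper, not part of any STIX tooling.

```python
import json

# Hypothetical bundle: names and identifiers are placeholders, not project data.
BUNDLE_JSON = """
{
  "type": "bundle",
  "id": "bundle--0b1f6c5e-7a41-4c0e-9d52-3f1a2b4c5d6e",
  "objects": [
    {"type": "campaign", "spec_version": "2.1",
     "id": "campaign--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
     "created": "2023-10-01T00:00:00.000Z",
     "modified": "2023-10-01T00:00:00.000Z",
     "name": "Example Campaign"},
    {"type": "identity", "spec_version": "2.1",
     "id": "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5",
     "created": "2023-10-01T00:00:00.000Z",
     "modified": "2023-10-01T00:00:00.000Z",
     "name": "Example Organization"},
    {"type": "relationship", "spec_version": "2.1",
     "id": "relationship--57b56a43-b8b0-4cba-9deb-34e3e1faed9e",
     "created": "2023-10-01T00:00:00.000Z",
     "modified": "2023-10-01T00:00:00.000Z",
     "relationship_type": "targets",
     "source_ref": "campaign--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
     "target_ref": "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5"}
  ]
}
"""


def bundle_to_graph(bundle):
    """Split a parsed bundle into graph nodes and labeled, directed edges."""
    nodes = {}   # object id -> display label
    edges = []   # (source id, relationship type, target id)
    for obj in bundle["objects"]:
        if obj["type"] == "relationship":
            edges.append(
                (obj["source_ref"], obj["relationship_type"], obj["target_ref"])
            )
        else:
            nodes[obj["id"]] = obj.get("name", obj["type"])
    return nodes, edges


nodes, edges = bundle_to_graph(json.loads(BUNDLE_JSON))
for source, label, target in edges:
    print(f"{nodes[source]} --{label}--> {nodes[target]}")
# prints: Example Campaign --targets--> Example Organization
```

The same JSON therefore serves as both the storage format and the edge list of the knowledge graph: every non-relationship object becomes a node, and every Relationship Object becomes a directed, labeled edge.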

Academic Partnerships

The LAS collaborated with two minority-serving institutions (MSIs) located in North Carolina to research techniques to synthesize new KGs to represent a variety of cyber scenarios: Winston-Salem State University (WSSU) and The University of North Carolina at Pembroke (UNC at Pembroke).

Partnerships were formed with both WSSU and the UNC at Pembroke through the minority-serving institution Cooperative Research and Development Agreement (MSI CRADA), thus allowing the LAS to support building and sustaining a diverse and expert workforce at the NSA.  The Office of Research and Technology Applications (ORTA) created the MSI CRADA, which provides MSIs a means to partner with the NSA on research and development topics such as Internet of Things, Cyber Security, and Secure Composition and System Science.

Through an existing MSI CRADA, the LAS collaborated with WSSU senior design students in Spring 2023 for the exploration phase of this project.  Also, the LAS worked with ORTA to establish an MSI CRADA with the UNC at Pembroke in Spring 2023, which allowed the LAS to partner with the UNC at Pembroke’s Fall 2023 capstone students.

Spring 2023 – Exploration (Phase I)

Dr. Elva Jones
Professor and Chair of Computer Science
Winston-Salem State University

The LAS partnered with Winston-Salem State University (WSSU) senior design students, who tested different input techniques (e.g., command line, Tkinter) to generate synthetic cyber scenarios in STIX 2.1 format.  With mentorship and guidance from the LAS, NSA Cyber experts, and co-advisor Dr. Jones (WSSU), the students used a randomizer to query data stored in an array and create additional synthetic data.  They also created a custom visualizer as a proof-of-concept for the STIX bundle; it provided an interface for verifying, visualizing, and modifying generated STIX 2.1 content, including adding and deleting domain objects in the graph.
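The post does not show the students' randomizer, so the following is only a minimal sketch of the general idea: values are stored in arrays keyed by object type, and a random generator draws from them to produce additional entries. The stored values, the `STORED_VALUES` name, and the `randomize` function are all assumptions made for illustration.

```python
import random

# Hypothetical stored values; the project's actual arrays are not shown in the post.
STORED_VALUES = {
    "attack-pattern": ["Phishing", "Data interception", "Credential stuffing"],
    "malware": ["Example ransomware", "Example keylogger"],
    "identity": ["Example Hospital", "Example Utility", "Example Bank"],
}


def randomize(stored, count, seed=None):
    """Draw `count` (type, name) entries at random from the stored arrays."""
    rng = random.Random(seed)  # passing a seed makes a run repeatable
    picks = []
    for _ in range(count):
        obj_type = rng.choice(sorted(stored))  # pick an object type first
        picks.append({"type": obj_type, "name": rng.choice(stored[obj_type])})
    return picks


for pick in randomize(STORED_VALUES, count=5, seed=2023):
    print(pick["type"], "->", pick["name"])
```

Seeding the generator is a design choice for this sketch: it lets the same synthetic scenario be regenerated later, which is convenient when verifying output in a visualizer.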

Fall 2023 – Proof-of-Concept / Use Case (Phase II)

Dr. Prashanth BusiReddyGari, Ph.D.
Director of Cyber Defense Education Center
Assistant Professor of Computer Science
Program Coordinator of Cybersecurity
University of North Carolina at Pembroke Independent Study Class
Center of Academic Excellence Cyber Defense (CAE-CD) 2023-2028

The LAS partnered with the UNC at Pembroke, leveraging its cyber defense expertise to develop an initial proof-of-concept to generate cyber scenarios in the STIX 2.1 format.  The UNC at Pembroke received its Center of Academic Excellence Cyber Defense designation in early 2023.  With mentorship and guidance from the LAS, an NSA Scrum Master, NSA Cyber experts, and co-advisor Dr. BusiReddyGari (UNC at Pembroke), the students built a dynamic website allowing users to create synthetic data from a cyber campaign and/or cyber scenarios (e.g., emergency vehicles). The students divided this effort into two parts: a frontend effort and a backend effort.

As a CAE-CD institution, the UNC at Pembroke will play a critical role in addressing the growing demand for skilled cybersecurity professionals. The university will continue to enhance its cybersecurity programs and initiatives to ensure that students are well-prepared to tackle the evolving challenges in the field.

The frontend allows users to input a realistic cyber scenario (e.g., emergency vehicles) in the STIX 2.1 format by selecting multiple Domain (18) and Relationship (2) Objects to build their story.  For example, the user can define the STIX 2.1 Domain Objects such as Attack Pattern as “Data interception”, Identity as “AmbuCare Inc” and Campaign as “OperationAttackEmergVehicle” to categorize specific attributes in the cyber scenario.  The defined STIX 2.1 Objects can then be linked by the STIX 2.1 Relationship Object such as “OperationAttackEmergVehicle” Targets “AmbuCare Inc”.  The information is gathered from the frontend and sent to the backend to be expanded synthetically while maintaining the STIX 2.1 Domain and Relationship Objects in the cyber scenario.

Figure: The user creates the emergency vehicle cyber scenario with STIX 2.1 Domain Objects.
Figure: The user defines the STIX 2.1 Relationship Objects for the emergency vehicle cyber scenario.
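As a rough illustration of what the frontend might hand to the backend, the sketch below builds the emergency vehicle example as STIX 2.1 JSON using only the Python standard library. The object names come from the scenario described above; the helper functions `stix_object` and `relate`, and the `identity_class` value, are assumptions for this sketch and not the students' code. In practice a dedicated library such as the OASIS `stix2` Python package could validate these objects.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_object(obj_type, **properties):
    """Build a STIX 2.1 object dict with the common required properties."""
    # STIX timestamps are RFC 3339 in UTC; trim microseconds to milliseconds.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return {
        "type": obj_type,
        "spec_version": "2.1",
        "id": f"{obj_type}--{uuid.uuid4()}",  # identifiers are "type--UUID"
        "created": now,
        "modified": now,
        **properties,
    }


def relate(source, relationship_type, target):
    """Link two objects with a STIX Relationship Object."""
    return stix_object(
        "relationship",
        relationship_type=relationship_type,
        source_ref=source["id"],
        target_ref=target["id"],
    )


# Domain Objects named in the emergency vehicle scenario.
attack_pattern = stix_object("attack-pattern", name="Data interception")
identity = stix_object(
    "identity",
    name="AmbuCare Inc",
    identity_class="organization",  # assumed value, not stated in the post
)
campaign = stix_object("campaign", name="OperationAttackEmergVehicle")

# "OperationAttackEmergVehicle" Targets "AmbuCare Inc"
targets = relate(campaign, "targets", identity)

bundle = {
    "type": "bundle",
    "id": f"bundle--{uuid.uuid4()}",
    "objects": [attack_pattern, identity, campaign, targets],
}
print(json.dumps(bundle, indent=2))
```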

The backend generates synthetic data that mimics the frontend cyber scenario. Once the backend server receives the information in JSON, it parses the JSON to determine the Relationship and Domain Objects from the frontend cyber scenario. Various weights and metrics that reflect real-world cyber scenarios are assigned to the STIX Domain Objects, and these are then used to generate synthetic Domain Objects that mimic the characteristics of an expanded cyber attack space.  The bundle is then created from the newly synthesized KG and can be visualized via the OASIS STIX Visualizer.
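The backend pipeline described above (parse, separate object kinds, weight, synthesize, bundle) can be sketched as follows. This is a minimal illustration under stated assumptions, not the students' implementation: the weights in `TYPE_WEIGHTS` are invented, weighted sampling with `random.choices` stands in for whatever metrics the project uses, and only the generic Relationship Object is handled.

```python
import json
import random
import uuid

# Hypothetical weights: how often each Domain Object type is expanded.
TYPE_WEIGHTS = {"campaign": 1.0, "attack-pattern": 2.0, "identity": 4.0}


def new_id(obj_type, rng):
    """Make a STIX-style identifier (type--UUIDv4) from the seeded generator."""
    return f"{obj_type}--{uuid.UUID(int=rng.getrandbits(128), version=4)}"


def expand_scenario(bundle_json, n_new, seed=None):
    """Parse a scenario bundle and add n_new weighted synthetic Domain Objects."""
    rng = random.Random(seed)
    objects = json.loads(bundle_json)["objects"]
    relationships = [o for o in objects if o["type"] == "relationship"]
    domain = [o for o in objects if o["type"] != "relationship"]
    weights = [TYPE_WEIGHTS.get(o["type"], 1.0) for o in domain]

    synthetic = []
    for i in range(n_new):
        # Pick an existing Domain Object as a template, biased by its weight.
        template = rng.choices(domain, weights=weights, k=1)[0]
        label = template.get("name", template["type"])
        clone = dict(template, id=new_id(template["type"], rng),
                     name=f"{label} (synthetic {i + 1})")
        synthetic.append(clone)
        # Link the clone the same way its template was linked.
        for rel in relationships:
            for end in ("source_ref", "target_ref"):
                if rel[end] == template["id"]:
                    synthetic.append(dict(rel, id=new_id("relationship", rng),
                                          **{end: clone["id"]}))
    return {"type": "bundle", "id": new_id("bundle", rng),
            "objects": objects + synthetic}


# Simulated frontend payload; identifiers are placeholders.
COMMON = {"spec_version": "2.1",
          "created": "2023-10-01T00:00:00.000Z",
          "modified": "2023-10-01T00:00:00.000Z"}
campaign_id = "campaign--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f"
identity_id = "identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5"
frontend_json = json.dumps({
    "type": "bundle",
    "id": "bundle--0b1f6c5e-7a41-4c0e-9d52-3f1a2b4c5d6e",
    "objects": [
        {**COMMON, "type": "campaign", "id": campaign_id,
         "name": "OperationAttackEmergVehicle"},
        {**COMMON, "type": "identity", "id": identity_id,
         "name": "AmbuCare Inc"},
        {**COMMON, "type": "relationship",
         "id": "relationship--57b56a43-b8b0-4cba-9deb-34e3e1faed9e",
         "relationship_type": "targets",
         "source_ref": campaign_id, "target_ref": identity_id},
    ],
})

expanded = expand_scenario(frontend_json, n_new=5, seed=7)
print(len(expanded["objects"]), "objects after expansion")
```

Because each synthetic object inherits the relationships of its template, the expanded graph keeps the shape of the original story (campaigns still target identities) while growing in size.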

Next Steps

The UNC at Pembroke will continue partnering with the LAS via the MSI CRADA to develop the next phase of the Synthetic Cyber Knowledge Graph project.  The UNC at Pembroke will carry out this project, applying deep learning techniques, as part of its 2024-2025 Cyber Capstone course.  The next phase of this project will include:

  • Generating large (e.g., six times the RAM of a computer) synthetic cyber KGs in STIX 2.1 format
  • Incorporating OpenTAXII with the persistence API to allow for the storage, retrieval, and sharing of cyber threats between multiple groups
  • Devising a capability to query the synthetically generated KG and statistically extract patterns and trends, such as attribution; tactics, techniques, and procedures (TTPs); and how adversaries change over time
  • Generating synthetic data based on customizable real-world cyber threat metrics
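The query capability is future work, so the following is only a minimal sketch of one simple statistic such a capability might compute: how often each (source type, relationship type, target type) pattern occurs in a graph. The function name `relationship_patterns` and the sample data are hypothetical, and the identifiers are shortened placeholders rather than real STIX UUIDs.

```python
from collections import Counter


def relationship_patterns(objects):
    """Count (source type, relationship type, target type) triples."""
    counts = Counter()
    for obj in objects:
        if obj.get("type") != "relationship":
            continue
        # A STIX identifier is "type--UUID", so the prefix gives the object type.
        source_type = obj["source_ref"].split("--", 1)[0]
        target_type = obj["target_ref"].split("--", 1)[0]
        counts[(source_type, obj["relationship_type"], target_type)] += 1
    return counts


# Hypothetical relationship objects, trimmed to the fields the query reads.
sample = [
    {"type": "relationship", "relationship_type": "targets",
     "source_ref": "campaign--a", "target_ref": "identity--b"},
    {"type": "relationship", "relationship_type": "targets",
     "source_ref": "campaign--c", "target_ref": "identity--b"},
    {"type": "relationship", "relationship_type": "uses",
     "source_ref": "campaign--a", "target_ref": "attack-pattern--d"},
]

patterns = relationship_patterns(sample)
for pattern, count in patterns.most_common():
    print(count, pattern)
```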