ORBS: Operationalizing Recommender-Based SIGINT
Blake Hartley, Mike Geide, Mattia Shin, PUNCH Cyber Analytics Group
Stephen W., Lori Wachter, Pauline M., Al J., Jacque J., Stephen S.
What our recommender system does
Over the past few years, PUNCH Cyber has been working with LAS to study the possibility of using Machine Learning (ML) recommender system techniques to recommend and prioritize SIGINT for cyber analysts. We’ve found that industry-standard recommender system algorithms can identify data of interest with more than 90% accuracy. These recommendations can be for anywhere from a single analyst to groups of analysts or even whole organizations. Our work until the start of 2023, however, was mostly in R&D and not focused on bringing these recommendations to real analysts.
How we’re bringing it to users
With ORBS, we have been developing a full-scale modular pipeline that can be deployed by customers hoping to leverage our work on their data and systems. Our pipeline has end-to-end support from data ingest to model serving, and as such consists of several systems which allow users on both the front end and back end to manage different stages of the pipeline. To show how the pipeline works, we will walk through its stages using an example batch of data.
The goal of our work to date has been to make recommendations for analysts working with SIGINT data. To explain our process, we want a dataset that can serve as a stand-in for how we’re applying recommender system techniques on the high side. E-commerce websites keep logs of customer browsing and purchasing, so we can imagine a sample dataset consisting of a table of customer orders, with each order having several columns for customer information, order information, shipping information, and so on. One can imagine how this could correspond to logs of analysts interacting with SIGINT data on the high side.
The backbone of the ORBS pipeline is DVC (Data Version Control). DVC allows full runs of the system to be reproduced reliably between systems and manages which portions of the pipeline need to be re-run when changes are made. We break ORBS into a sequence of stages, each of which corresponds to a stage in DVC that can be run independently:
- Data Ingest and Pre-processing
ORBS is designed to ingest raw data in bulk and perform pre-processing at the time of ingestion. Our high-side setup uses local data stores, but any data storage or streaming platform that can interface with DVC (such as AWS or Airflow) can be used. As the raw data is being ingested, we use a set of Python scripts to perform pre-processing. Pre-processing is essentially a transformation that we expect to stay consistent over time, and it accomplishes a variety of goals, such as scrubbing sensitive data, dropping or transforming columns to save space, and converting columns into a format a downstream ML system may require.
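The pre-processing goals above can be illustrated with a minimal sketch on the e-commerce stand-in data. This is not the ORBS code: the column names (`customer_email`, `internal_notes`, `order_total`) and the helper functions are hypothetical, and the unsalted hash is only a placeholder for whatever scrubbing a real deployment would require.

```python
import hashlib

# Hypothetical column names for the e-commerce stand-in dataset.
SENSITIVE_COLUMNS = {"customer_email"}
DROPPED_COLUMNS = {"internal_notes"}


def scrub(value: str) -> str:
    """Replace a sensitive value with a stable token (unsalted hash, illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


def preprocess(row: dict) -> dict:
    """Apply the same fixed transformation to every ingested row."""
    out = {}
    for column, value in row.items():
        if column in DROPPED_COLUMNS:
            continue  # drop columns that are not needed downstream
        if column in SENSITIVE_COLUMNS:
            value = scrub(value)  # scrub sensitive data
        out[column] = value
    # Convert a text column into the numeric format a model expects.
    out["order_total"] = float(out["order_total"])
    return out


raw_row = {
    "customer_email": "pat@example.com",
    "order_total": "19.99",
    "internal_notes": "gift wrap",
}
clean_row = preprocess(raw_row)
```

Because the transformation is deterministic, re-running it on the same raw row yields the same cleaned row, which fits the reproducibility that the pipeline relies on.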
- Feature Engineering
Feature engineering is a blanket term for various manipulations that can be done to a dataset to improve ML training performance. Columns can be split, combined, transformed, added, deleted, and so on. For example, a column representing the time of an e-commerce order could have the format YYYY-MM-DD-HH-MM-SS. This single column could be broken into multiple columns as desired. If all the data is from the same year, the YYYY sub-column could be dropped, while isolating the DD column could make it easier for ML models to pick up day-to-day trends. These features are stored in a feature store, which for ORBS is provided by Feast. This service allows users to manage and serve features as needed for ML algorithms in the following stage.
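The timestamp example can be sketched in a few lines of Python. The column name `order_time`, the output column names, and the `drop_year` flag are assumptions made for illustration; the sketch does not involve Feast.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"  # matches YYYY-MM-DD-HH-MM-SS


def split_timestamp(row: dict, drop_year: bool = False) -> dict:
    """Replace the single order_time column with one column per component."""
    parsed = datetime.strptime(row["order_time"], TIMESTAMP_FORMAT)
    features = {k: v for k, v in row.items() if k != "order_time"}
    features.update(month=parsed.month, day=parsed.day, hour=parsed.hour)
    if not drop_year:
        features["year"] = parsed.year  # keep the year only if it varies
    return features


order = {"order_id": 1001, "order_time": "2023-03-14-09-26-53"}
features = split_timestamp(order, drop_year=True)
```

With `drop_year=True` the year column is omitted, which corresponds to the case where all the data comes from the same year.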
- Model Training
ORBS currently contains a selection of recommender system algorithms that we have studied throughout our research. Examples include XGBoost, TabNet, Random Forest, Collaborative Filtering, Embedding with K-Nearest Neighbors, and Neural Networks. The diversity of these methods means that a variety of different hyperparameters and metrics need to be optimized and studied as models are trained and new data is received. We accomplish this using MLflow, which keeps records of all training runs of these disparate methods and the trained models they produce.
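To make the record-keeping idea concrete, here is a small in-memory stand-in for an experiment tracker. It deliberately does not use the MLflow API; the `Run` and `RunLog` names, the parameter names, and the metric values are all made up for illustration and are not results from ORBS.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    model_name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class RunLog:
    """In-memory stand-in for an experiment tracker (not the MLflow API)."""

    def __init__(self) -> None:
        self.runs = []

    def record(self, model_name: str, params: dict, metrics: dict) -> Run:
        """Store one training run with its hyperparameters and metrics."""
        run = Run(model_name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str) -> Run:
        """Return the run with the highest value of the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        return max(candidates, key=lambda r: r.metrics[metric])


log = RunLog()
# Parameter names and metric values below are placeholders, not real results.
log.record("random_forest", {"n_trees": 100}, {"precision_at_10": 0.61})
log.record("knn_embedding", {"k": 20}, {"precision_at_10": 0.67})
log.record("collab_filter", {"factors": 32}, {"recall_at_10": 0.40})
best_run = log.best("precision_at_10")
```

Keeping parameters and metrics as free-form dictionaries lets one log hold runs from methods that share no hyperparameters, which is the situation the paragraph above describes.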
- Model Serving
The output of our training is a model which can be queried for recommendations as desired. In the e-commerce example, a user may want to input a customer ID and get recommended products that the customer may be interested in. In the high-side R&D phase, we worked with these models directly, but with ORBS we started working with a customer to deploy them in their environment. To this end, we have developed a web interface using the Python framework Beer Garden. The models produced by the ORBS pipeline are sent to the Beer Garden plugin, allowing users to interact with the models via a web API.
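The customer-ID query described above can be sketched with a toy item-based collaborative filter. This is one simple instance of the technique, not the model ORBS deploys, and it does not touch the Beer Garden API; the purchase log and function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase log: (customer_id, product_id) pairs.
PURCHASES = [
    ("c1", "laptop"), ("c1", "mouse"),
    ("c2", "laptop"), ("c2", "mouse"), ("c2", "keyboard"),
    ("c3", "mouse"), ("c3", "keyboard"),
    ("c4", "novel"),
]


def build_model(purchases):
    """Index purchases both ways: products per customer, customers per product."""
    by_customer = defaultdict(set)
    by_product = defaultdict(set)
    for customer, product in purchases:
        by_customer[customer].add(product)
        by_product[product].add(customer)
    return by_customer, by_product


def similarity(buyers_a, buyers_b):
    """Cosine similarity between two products' sets of buyers."""
    shared = len(buyers_a & buyers_b)
    return shared / sqrt(len(buyers_a) * len(buyers_b))


def recommend(model, customer_id, top_n=3):
    """Score unseen products by similarity to what the customer already bought."""
    by_customer, by_product = model
    owned = by_customer.get(customer_id, set())
    scores = {}
    for candidate, buyers in by_product.items():
        if candidate in owned:
            continue
        score = sum(similarity(buyers, by_product[item]) for item in owned)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]


model = build_model(PURCHASES)
recommendations = recommend(model, "c1")
```

A customer with no overlapping purchase history, or an unknown ID, gets an empty list here; a production system would need a fallback for such cold-start cases.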
- Model Evaluation
Once training is complete, we use Evidently to track the health of models as new data arrives and data drift becomes significant. Evidently allows users to manually set thresholds for model performance; when a threshold is crossed, it automatically signals the need to retrain. We also have a web dashboard built on Grafana, which allows analysts to explore data and dig deeper into model results through a web UI without having to get into the weeds of the earlier stages of the pipeline.
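The threshold idea can be illustrated with a simplified drift check. This sketch uses total variation distance on a single categorical column as a stand-in drift measure; it is not Evidently's API or its choice of statistical tests, and the column values and the 0.2 threshold are assumptions.

```python
from collections import Counter


def category_shares(values):
    """Fraction of rows falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(reference, current):
    """Half the summed absolute difference between two category distributions."""
    ref, cur = category_shares(reference), category_shares(current)
    categories = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(c, 0.0) - cur.get(c, 0.0)) for c in categories)


def needs_retraining(reference, current, threshold=0.2):
    """Flag retraining when drift exceeds a user-chosen threshold."""
    return total_variation_distance(reference, current) > threshold


# Hypothetical product-category column at training time vs. in new data.
reference = ["books"] * 50 + ["games"] * 50
stable = ["books"] * 48 + ["games"] * 52
shifted = ["books"] * 20 + ["games"] * 60 + ["toys"] * 20
```

Calling `needs_retraining(reference, stable)` stays below the threshold, while `needs_retraining(reference, shifted)` exceeds it because a new category appears and the existing shares move substantially.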
Post-Deployment Project Development
One of our goals is to make ORBS easy to package and hand off to interested parties without requiring an untenable amount of transition work from us. To this end, part of our work in 2023 was rebuilding the ORBS repository using the Cookiecutter Data Science template as its foundation. By standardizing the project structure, we hope that any data scientist wishing to deploy or develop using ORBS will be able to easily understand and edit our code base. This is also why we modularized the pipeline and leveraged industry-standard platforms like DVC, Feast, MLflow, Evidently, and Grafana. Any developer with a working understanding of one of these platforms should be able to manage any required changes to a module based on that platform. For example, if ML model performance degrades over time, model tuning and training can be managed entirely from within MLflow.
Future Work
While ORBS is currently a fully functional end-to-end ML pipeline, at many points along the pipeline we have made choices that are specific to our data. Moving forward, we hope to build up each module of the project so that users will have more options for handling feature engineering, more ML models for training, more dashboards for results visualization, and more platforms for recommender deployment. We also hope to work with analysts who are using ORBS to learn which of their needs ORBS can fulfill, and to iterate to that end.
This material is based upon work done, in whole or in part, in coordination with the Department of Defense (DoD). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DoD and/or any agency or entity of the United States Government.