
2023 Research Symposium

Each year, LAS undertakes a research program involving partners from a variety of academic, industry, and government communities. The outcomes of these research projects are of interest to our intelligence community stakeholders, as well as the respective communities of our academic and industry partners.

Program

On December 6, 2023, LAS hosted its annual research symposium at NC State’s Talley Student Union in Raleigh, NC. The program featured a keynote speaker, a conversation with LAS leadership, and presentations on LAS research themes followed by interactive posters and demonstrations.


Projects

We invite you to learn about this year’s unclassified research projects. Projects are grouped into three themes: Human-Machine Teaming, Operationalizing AI/ML, and Content Triage.

Human-Machine Teaming

This theme encompasses efforts toward enhancing the effectiveness of human analysts partnering with emerging automated technologies. We consider both the human user’s experience with teaming technologies and the creation of technologies to alleviate pain points and reduce cognitive load for analysts. Though technological innovations have the potential to reduce the cognitive burden associated with processing and managing vast datasets, their successful implementation hinges on a deep understanding of how humans can process the outputs and seamlessly integrate these technologies into their workflows. Machine teammates have significant potential to enhance efficiency on tasks; however, they may not be effective unless grounded in a robust model of how human users carry out tasks and the cognitive needs of the user.

Within this theme, we group projects by the following topic areas:

Modeling Users & Tradecraft

These projects examine what a human user needs and does when making decisions in different tasks and settings.

Eric Ragan, Jeremy Block, Jennifer Cremer, Jascha Swisher, Christine Brugh, Patti K., Sue Mi K., Kenneth T.

The research advances software techniques to identify and visualize patterns in human analysis activities based on analysts’ interaction log data. The vision for effective human-machine teaming for intelligence analysis is to augment human capability with computational support that bolsters analysts’ effectiveness in generating meaningful intelligence. Within this human-machine collaboration, improved understanding of analysts’ strategies makes it possible to refine machine assistance by tailoring the automation to better match the human workflow. Even outside of live analysis, user modeling of analysts’ behaviors can help tool designers prioritize the technical capabilities that most benefit analysts. From a management perspective, meta-analysis of human analysis behaviors can also inform training for new analysts or enable novel forms of feedback for seasoned analysts. Reviewing the strategies of the most effective analysts can yield lessons for others or identify key approaches for tradecraft. To support such needs, the research develops descriptive metrics for different types of analysis behaviors and generates a visual interface to enable meta-analysis of trends. Based on discussions with analysts and analysis of sample interaction logs, the team formulated behavioral metrics that operate on software log data to characterize how analysts explore data in open-ended intelligence investigations. Examples include capturing individual differences in: analysts’ attention to different types of information sources over time; patterns of depth versus breadth exploration; efficiency of queries or information retrieval; and cycles or branching in information exploration. A visual exploration tool facilitates review of similarities and differences in analyst behaviors.
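To make the flavor of such metrics concrete, here is a minimal sketch (not the project’s code; the log format and actions are invented) of computing a depth-versus-breadth score from interaction events:

```python
# Illustrative only: each log event is (analyst_id, action, target).
# Revisiting an already-seen target counts as depth; a first visit as breadth.
log = [
    ("a1", "open_doc", "report_07"),
    ("a1", "query", "panda arrival date"),
    ("a1", "open_doc", "report_07"),   # revisit -> depth
    ("a2", "open_doc", "memo_12"),
    ("a2", "open_doc", "memo_13"),     # new target -> breadth
]

def depth_breadth_ratio(events):
    """Ratio of revisits (depth) to first visits (breadth)."""
    seen, depth, breadth = set(), 0, 0
    for _, _, target in events:
        if target in seen:
            depth += 1
        else:
            breadth += 1
            seen.add(target)
    return depth / max(breadth, 1)

by_analyst = {}
for aid, action, target in log:
    by_analyst.setdefault(aid, []).append((aid, action, target))

for aid, events in by_analyst.items():
    print(aid, "depth/breadth:", round(depth_breadth_ratio(events), 2))
```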

Patti K., Sue Mi K., Susie B., Pauline M., Jessica E., John Slankas, Jascha Swisher, Brent Younce

PandaJam is an observational study designed to elucidate the workflow of language analysts and their cognitive processes when seeking to retrieve relevant information from audio. Using an Elasticsearch-based query tool to explore thousands of hours of audio from the Nixon White House tapes and machine transcriptions of this audio, analysts are given a task to answer four questions about the pandas that China gifted to the U.S. in 1972. Our 2022 pilot study revealed that analysts interacted with machine transcriptions in various ways to help them achieve sub-tasks, or goals, in their analytic process while completing this task. In this context, sub-tasks included a wide range of inquiries that analysts pursued in order to complete the larger task, ranging from efforts to gain a sense of the totality of relevant information available, to finding specific information about a particular entity named in the data. This year, we expanded the pool of participants in the study and again employed a “think-aloud” protocol such that analysts narrated their thought process as they worked through the task. We recorded and transcribed these sessions, and logged users’ interactions with the search tool. We categorized the actions analysts took when using the machine transcriptions, and conducted a qualitative and quantitative review of how analysts used the transcriptions to meet their goals. Our preliminary findings indicate that the actions currently available to analysts for leveraging machine transcriptions are often misaligned with their cognitive goals.

Rob Capra, Jaime Arguello, Bogeum Choi, Mengtian Guo, Jascha Swisher, Liz Richerson, Patti K., Michelle W., Susie B.

Intelligence analysis of audio data involves a triage stage in which data is transcribed, interpreted, and annotated by analysts making operator comments (OCs). OCs are meant to help other analysts make sense of the conversation. Our research focused on two main threads. First, we conducted a study with 30 intelligence analysts that aimed to:

  1. develop a taxonomy of OC types,
  2. understand the challenges analysts face when making/viewing OCs, and
  3. gain insights about possible tools to support these processes.

During the study, participants were asked to make hypothetical OCs using transcripts from the Nixon White House Tapes. Additionally, participants were asked open-ended questions about the challenges they face when making/viewing OCs and tools that might support these processes. Based on a qualitative analysis of hypothetical OCs, we developed a taxonomy of 25 OC types. Most OCs involve entity identification—disambiguating references to people, organizations, events, etc. Other OCs aim to provide important contextual information, explain an idiomatic expression, expand an acronym, and signal that a phrase is a “cover term”. Participants commented on different challenges related to the process of making/viewing OCs. For example, while making OCs, participants commented on challenges related to:

  1. lacking contextual knowledge,
  2. deciding when to make an OC, and
  3. finding the required information to complete an OC.

While viewing OCs, participants commented on challenges related to:

  1. visual clutter,
  2. distinguishing facts from speculations, and
  3. lack of standardization.

Second, based on insights gained from our user study, we developed a prototype system for making OCs. The prototype was designed to add structure to the OC-making process. The prototype enables the analyst to specify the type of OC being made and complete different fields that are relevant to the OC type based on our study results.
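As a concrete (and entirely hypothetical) illustration of what such structure might look like, the sketch below encodes a handful of the OC types named above as a typed schema; the field names are invented, and the real taxonomy contains 25 types:

```python
# Hypothetical schema sketch for structured operator comments (OCs).
from dataclasses import dataclass, field

OC_FIELDS = {
    "entity_identification": ["entity_mention", "resolved_identity", "evidence"],
    "context":               ["background", "source_of_knowledge"],
    "idiom_explanation":     ["expression", "literal_meaning", "intended_meaning"],
    "acronym_expansion":     ["acronym", "expansion"],
    "cover_term":            ["phrase", "suspected_referent", "confidence"],
}

@dataclass
class OperatorComment:
    oc_type: str
    transcript_span: str                 # text the OC is attached to
    fields: dict = field(default_factory=dict)
    speculative: bool = False            # distinguish facts from speculation

    def __post_init__(self):
        missing = [f for f in OC_FIELDS[self.oc_type] if f not in self.fields]
        if missing:
            raise ValueError(f"{self.oc_type} OC is missing fields: {missing}")

oc = OperatorComment(
    oc_type="acronym_expansion",
    transcript_span="...briefed the NSC this morning...",
    fields={"acronym": "NSC", "expansion": "National Security Council"},
)
```

Typing each OC this way directly targets two of the viewing challenges above: required fields reduce the lack of standardization, and the speculative flag separates facts from speculation.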

Antonio Girona, Grace B., James Peters, Wenyuan Wang, R. Jordan Crouser

Intelligence analysis involves the collection, analysis, and interpretation of vast amounts of information from diverse sources to produce accurate and timely insights. Tailored tools hold great promise in providing individualized support, enhancing efficiency, and facilitating the identification of crucial intelligence gaps and trends where traditional tools fail. The effectiveness of tailored tools depends on the analysts’ unique needs and motivations, as well as the broader context in which they operate. This poster describes a series of focus discovery exercises, which revealed a distinct hierarchy of needs for intelligence analysts. This reflection on the balance between competing needs is of particular value in the context of intelligence analysis, where the compartmentalization required for security can make it difficult to ground design patterns in stakeholder values. We hope that this study will enable the development of more effective tools, supporting the well-being and performance of intelligence analysts as well as the organizations they serve.

Calibrating Trust

These projects explore the interactions and interdependencies between human and machine, enabling the human to evaluate information in context.

Cara Widmer, Amy Summerville, Joshua Fiechter, Louis Marti, Christine Brugh, Sue Mi K., Jacque J., Susie B.

Human analysts increasingly rely on automated teammates to make sense of vast troves of data. For these teammates to be useful, they must both be trusted by analysts to produce reliable and usable output and reduce analysts’ cognitive load. One feature of automated teammates that may influence trust, load, and task performance is the degree of alignment between a machine teammate and a user’s cognitive style. In the current work, we investigated how stylistic traits such as Tolerance of Ambiguity (TOA) intersect with the level of detail provided in recommendations presented by a machine teammate – specifically, whether the teammate offered binary recommendations (i.e., “the correct answer is A”) or calibrated likelihoods (i.e., “it is somewhat likely that the correct answer is A”). We evaluated trust and load as latent states via Hidden Markov Models (HMMs), assessing indicators of the latent states and estimating the probabilities of transitioning between them. Additionally, we examined accuracy, correspondence between participants’ responses and the AI recommendation, and self-reported trust. In our first study, participants tended to have higher latent trust when AI recommendations were presented as calibrated likelihoods rather than binaries, suggesting users may trust an AI more when it acknowledges its level of certainty. We observed limited evidence that higher TOA scores led to lower latent trust and higher latent load, but did not find strong evidence that alignment between cognitive style and recommendation format influenced latent trust or load. A follow-up study currently underway aims to replicate these findings and introduce a third, more precise recommendation condition in which the machine teammate provides exact probabilities (i.e., “it is 67% likely that the correct answer is A”) to further investigate the impact of alignment between cognitive style and recommendation format.
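For readers unfamiliar with the modeling approach, the sketch below shows the general shape of such an analysis – illustrative only, using the hmmlearn package and synthetic indicator data rather than the study’s materials. A two-state Gaussian HMM is fit to per-trial behavioral indicators, and the fitted transition matrix gives the probabilities of moving between latent states:

```python
# Requires `pip install hmmlearn`; indicator features here are made up.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# One row per trial, e.g. [response_time, agreement_with_AI]; synthetic data.
low_trust  = rng.normal([8.0, 0.3], 0.5, size=(50, 2))
high_trust = rng.normal([3.0, 0.9], 0.5, size=(50, 2))
X = np.vstack([low_trust, high_trust])

model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                        n_iter=100, random_state=0)
model.fit(X)

states = model.predict(X)              # inferred latent state per trial
print("State means:\n", model.means_)  # interpret states post hoc
print("Transition matrix:\n", model.transmat_)
```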

R. Jordan Crouser, Alvitta Ottley, Jennifer Ha, Syrine Matoussi, Emily Kung, Christine Brugh, Patti K., Tina K.

Prior work has demonstrated that the user’s experience, personality, and cognitive abilities can have significant implications for tools that support data-driven reasoning and decision-making. This project continues to advance an agenda for transforming the one-size-fits-all methodology for the design of interactive visual tools to support analysis, extending this line of inquiry to the investigation of how individual differences influence the interpretation of confidence values that accompany the output of machine learning algorithms. First, this work seeks to enhance decision-making capabilities by systematically investigating individual traits’ role in choosing between accepting automated transcription output or listening to the source audio. Second, we examine methods for communicating confidence values to the user and comparing their impact on analysts’ reliance on transcripts. Finally, we use this information to establish design guidelines for developing decision-making tools that are responsive to the needs of individual analysts.

Generating & Exploring Hypotheses

These analytic workflows integrate machine capabilities to synthesize information for different tasks.

Tim van Gelder, Richard de Rozario, Tim Dwyer, Kadek Satriadi, Christine Brugh, John L., Jacque J., Michele K.

Intelligence analysts frequently develop “stories” or narratives to explain the evidence pertaining to some situation of interest, and must assess the plausibility of competing narratives. This is a form of abductive reasoning, and so we call it narrative abduction. Narrative abduction has previously received very little research attention relative to how common and important it is. In a 2022 project, we did foundational work on narrative abduction. One thing revealed in that work was that narratives, evidence sets, and their relationships can be very complex, which makes narrative abduction cognitively demanding. This 2023 project has focused on how these complex situations might be visualized, in order to better support analysts’ overall judgements. It has three components. First is the specification of a “visual language” – a visualization design, and a set of diagramming conventions, for producing diagrammatic representations of narrative abduction challenges. Second is a software tool for producing and modifying diagrams using this visual language. We are building this authoring tool on top of the open-source diagramming package draw.io. Third, we will be showing how generative AI can be used for auto-generation of parts of narrative visualizations in this tool, thus augmenting narrative abductive reasoning.
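Because draw.io files are plain XML (an mxGraphModel), diagrams in such a visual language can be generated programmatically. The sketch below is speculative, with placeholder styles rather than the project’s actual diagramming conventions:

```python
# Speculative sketch: emit a draw.io-compatible XML file linking narratives
# to supporting/contradicting evidence. Styles and layout are placeholders.
import xml.etree.ElementTree as ET

def narrative_diagram(narratives, evidence, links):
    """links: (narrative, evidence, relation), relation in {'supports', 'contradicts'}."""
    mxfile = ET.Element("mxfile")
    model = ET.SubElement(ET.SubElement(mxfile, "diagram", name="abduction"),
                          "mxGraphModel")
    root = ET.SubElement(model, "root")
    ET.SubElement(root, "mxCell", id="0")
    ET.SubElement(root, "mxCell", id="1", parent="0")

    ids, next_id = {}, 2
    for kind, labels in (("narrative", narratives), ("evidence", evidence)):
        for i, label in enumerate(labels):
            style = "rounded=1;" if kind == "narrative" else "shape=note;"
            cell = ET.SubElement(root, "mxCell", id=str(next_id), value=label,
                                 style=style, vertex="1", parent="1")
            ET.SubElement(cell, "mxGeometry", x=str(80 + 220 * i),
                          y="40" if kind == "narrative" else "220",
                          width="180", height="60", **{"as": "geometry"})
            ids[label] = str(next_id)
            next_id += 1

    for narrative, ev, relation in links:
        color = "#00aa00" if relation == "supports" else "#cc0000"
        edge = ET.SubElement(root, "mxCell", id=str(next_id), edge="1",
                             parent="1", source=ids[ev], target=ids[narrative],
                             style=f"endArrow=classic;strokeColor={color};")
        ET.SubElement(edge, "mxGeometry", relative="1", **{"as": "geometry"})
        next_id += 1
    return ET.tostring(mxfile, encoding="unicode")

xml = narrative_diagram(["Narrative A", "Narrative B"],
                        ["Intercepted memo", "Travel records"],
                        [("Narrative A", "Intercepted memo", "supports"),
                         ("Narrative B", "Travel records", "contradicts")])
open("abduction.drawio", "w").write(xml)  # opens directly in draw.io
```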

Mike G., Patti K., Sue Mi K., Susie B., Skip S., Michele K.

GUESS, or Gathering Unstructured Evidence with Semantic Search, is a prototype tool for the analyst discovery and triage toolbox that allows analysts to directly query data in new ways. GUESS is based on semantic embeddings produced by a pre-trained, off-the-shelf language model and demonstrates the use of these sentence embeddings in a manner that enables effective information retrieval from machine-generated speech-to-text data. Via natural language queries, dynamic topic modeling, and interactive visualizations supported by the underlying machine learning components, GUESS emphasizes a user-centric, human-in-the-loop approach and fosters an iterative conversation with a large corpus of text, resulting in the rapid identification of analytically relevant information that might otherwise be difficult to find. These methods foster alignment of user goals with the capabilities of the technologies employed and the context of the available data, and calibrate trust between the user and the application, thus increasing the likelihood of attaining useful results. In this manner, GUESS creates a tangible example of effective human-machine teaming that can directly inform analyst workflows and decision making.
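The sketch below illustrates the general pattern of semantic search over transcript text using an off-the-shelf sentence-embedding model; the model choice and toy data are assumptions, not details of GUESS itself:

```python
# Requires `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one off-the-shelf choice

# Stand-in for machine-generated speech-to-text output.
transcript_segments = [
    "the shipment arrives at the northern port on Tuesday",
    "we discussed the quarterly budget at length",
    "cargo will be moved inland by truck after unloading",
]

corpus_emb = model.encode(transcript_segments, convert_to_tensor=True)
query_emb = model.encode("when does the delivery reach the harbor",
                         convert_to_tensor=True)

# Cosine-similarity search: matches on meaning, not shared keywords.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), transcript_segments[hit["corpus_id"]])
```

Note that the query shares almost no vocabulary with the top-ranked segment; that gap is exactly what embedding-based retrieval closes relative to keyword search.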

Brent Harrison, Stephen Ware, Anton Vinogradov, Rachelyn Farrell, Patti K., Sue Mi K., Michele K., Sean L., Skip S.

We investigated multiple approaches for using artificial intelligence and machine learning to suggest hypotheses to analysts as they played a serious game that simulates an analytic task. In the game, analysts investigate the emails, calendar events, and server logs of a fictional company to determine which employee committed an insider attack. We explored several ways of using large language models to summarize existing evidence, suggest hypotheses about the crime, and find evidence that either supports or refutes each hypothesis. We also explored two ways of identifying missing evidence based on known events: one uses an LLM to assist an analyst in building a formal logical model of the crime to which belief and intention algorithms can be applied, and the other adapts an LLM to be a storytelling algorithm that crafts stories consistent with known evidence.
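As a rough illustration of the first usage pattern – prompting an LLM to propose hypotheses and tie them to evidence – consider the hedged sketch below; the prompt, model name, and evidence snippets are invented for this example:

```python
# Requires `pip install openai` and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

evidence = [
    "Email: J. Smith requested server-room access on 3/12, outside his role.",
    "Log: badge 0447 entered the server room at 02:13 on 3/13.",
    "Calendar: J. Smith was booked on an overnight flight departing 3/12.",
]

prompt = (
    "You are assisting an insider-threat investigation in a training game.\n"
    "Evidence:\n" + "\n".join(f"- {e}" for e in evidence) +
    "\n\nPropose two competing hypotheses about who committed the attack. "
    "For each, list which evidence items support it and which refute it."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```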

Promoting Cognitive Engagement

These projects leverage AI capabilities to enhance the ability of the human user to interact with complex data.

Patti K., Michele K., Christine Brugh, Lori Wachter, Helen Armstrong, Ned Babbott, Diksha Bahirwani, Sasa Crkvenjas, Adam Noel, Isha Parate, Kayla Rondinelli, Kevin Ward

Stemming from a complex and diverse global environment, voice language analysis is a fundamental need in the production of intelligence for international security. The advent of automatic speech recognition and other AI/ML technologies has not only added more tasks to the language analysts’ workflow, but also unleashed new possibilities for information retrieval and exploration. As the analysts’ toolkit has evolved, various analytics have been developed in isolation and are presented to the analysts in separate tools. This creates an undue burden on the analyst, who must constantly move between different tools or leverage a technology in an environment that wasn’t originally intended for that purpose. This project explores the design of a unified interface environment for language analysts that enables them to take full advantage of new technologies while seamlessly pivoting between the various stages of their workflow. Using human-centered design research methods, students from the N.C. State Master of Graphic & Experience Design Program partnered with LAS to prototype potential interface systems that use the affordances of machine learning to enable voice language analysts to quickly produce reliable and robust intelligence that accurately conveys content, intent, and context. The three resulting prototypes and corresponding scenario videos—each responding to distinct pain points experienced by a specific type of language analyst—demonstrate the potential for analysts to effectively team with artificial intelligence, in particular with models utilizing natural language processing.

Helen Armstrong, Matthew Peterson, Isha Parate, Ashley Anderson, Kayla Rondinelli, Susie B., Sue Mi K., Stephen S., Ken T., Lori Wachter

An intelligence analyst engages with an extensive collection of communication events and data from a considerable number of sources in the process of sensemaking. Analysts face significant difficulty in reconnecting with relationships found among data points from distinct sources when the data is multitudinous, and when they step away at the end of the workday. This project explores how text-to-image generative models might promote analysts’ serendipitous recognition of past data connections, and also how AI may proactively find connections between data sources. We utilized a design investigation model to reveal potential innovations in interface design centered around the “conceptual peg hypothesis,” which asserts that concrete imagery, in this case small pictographs, can serve as a conceptual peg upon which to hang more abstract — and we presume more extensive — information. We identified two phases of visual conceptual peg creation: selection of a target concept, and its subsequent visualization as a pictograph. For these concepts and pictographs, we proposed three criteria: essential, distinctive, and compact. In addition, we revised Scott McCloud’s Big Triangle into the Pictorial Trapezoid, a framework for systematizing semiotic qualities of pictures that interact with our criteria, setting the stage for training an AI to generate effective conceptual pegs. After setting this foundation, we explored extensive interface concepts, analyzed their feature sets with LAS collaborators, and produced scenario videos that demonstrate possible conceptual peg–based interfaces. These interfaces combine features that are dependent upon the conceptual peg premise with others that are independent of it and enhance intelligence analysis by other means. Finally, we analyzed previous research phases to draft forward-thinking research questions — organized as premise-testing, premise-dependent, or premise-adjacent — and for select questions we prepared brief white papers on potential future work, noting expertise beyond design that is needed to answer them, suggesting potential interdisciplinary collaborations.

Susie B., Natalie Kraft, Lori Wachter, Jessica E., Michelle W., Staci K., Sue Mi K., Mike G., Troy W., Skip S., Stephen S., Lithios

The LAS’s Knowledge-MiNER project demonstrates Human-Machine Teaming (HMT) to help automate entity linking and information extraction tasks. Every day, analysts spend hours sifting through information and annotating and saving that information in both unstructured and structured formats. That structured annotation and data curation, mainly in the form of linking entities to a corporate repository and inputting information about those entities based on a schema, is all done manually. In most instances, an analyst is manually linking to the same subset of entities repetitively across text formats, including notes, transcripts, and reports. Knowledge-MiNER aims to harness the power of true HMT to help automate that entity linking task, while concurrently helping analysts to populate structured repositories via information extraction.

At its core, Knowledge-MiNER employs Named Entity Recognition (NER) using a SpaCy model, an underlying knowledge graph (KG) with a predefined ontology, and a MiniBert information extraction Question & Answer (Q&A) model. It begins by analyzing human-written, English-language texts, searching for a select set of entity types. Extracted entities are then cross-referenced with the underlying database. Once the analyst confirms the relevant entities, the ontology is used to pose templated questions to the text using the MiniBert model. The answers are compared to the existing KG, and analysts are presented with only new or conflicting information for analyst validation.
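A rough sketch of this flow appears below. It substitutes generic open models (spaCy’s small English pipeline and a distilled SQuAD question-answering model) for the project’s specific SpaCy and MiniBert models, and a plain dictionary for the knowledge graph:

```python
# Requires `pip install spacy transformers` and
# `python -m spacy download en_core_web_sm`.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

text = ("Maria Lopez, director of Acme Logistics, met suppliers "
        "in Rotterdam last Thursday.")

# Step 1: find candidate entities of selected types.
entities = [(ent.text, ent.label_) for ent in nlp(text).ents
            if ent.label_ in {"PERSON", "ORG", "GPE"}]

# Step 2: pose ontology-driven templated questions about confirmed entities.
templates = {"PERSON": ["What is {}'s role?", "Where did {} travel?"]}
knowledge_graph = {("Maria Lopez", "What is {}'s role?"): "director"}  # stand-in KG

for name, label in entities:
    for template in templates.get(label, []):
        answer = qa(question=template.format(name), context=text)
        # Step 3: surface only new or conflicting facts for analyst validation.
        known = knowledge_graph.get((name, template))
        if known != answer["answer"]:
            print(f"Review: {template.format(name)} -> {answer['answer']!r} "
                  f"(KG has {known!r})")
```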

Knowledge-MiNER moves beyond the human marking of machine-generated data that many analysts are painfully familiar with, and instead asks machines to mark human data. The final stage of human validation ensures data accuracy while allowing the rapid creation of truth-marked data embedded within an analyst’s workflow, where analysts are already intellectually invested and engaged. Starting in early 2024, LAS will be testing Knowledge-MiNER with real analysts on unclassified data and has begun planning to transition Knowledge-MiNER into a corporate tool.

Sean L., Patti K., Aaron W., Brent Younce

Each semester the LAS sponsors several teams of senior undergraduate students from the NC State Computer Science department to work on design & development projects within the LAS’s mission. For this project, the student team developed an application of Large Language Models (LLMs) that attempts to correct the results of Speech-To-Text (STT) algorithms in foreign language transcriptions. Modern media sources produce immense quantities of speech audio recordings every day across the globe, and information producers and consumers both benefit from cross-lingual transcriptions. However, language analysts are overwhelmed in this environment, and in most cases employing their services is cost-prohibitive. Thankfully, machine-learning methods have produced moderately capable STT and machine translation (MT) algorithms which are far faster, and more economical, to deploy. While these solutions are sufficient for some applications, regional dialects and accents are complicating factors, and model accuracy is often lacking even for common languages. These shortcomings limit, and even prohibit, the utility of STT and MT for many applications. Using LLMs, the student team developed a prototype application which utilizes available English translations of foreign speech audio to improve the results of the foreign language STT. This information can also be fed back to the STT algorithm developers as a fresh, in-domain set of “labeled” data to create fine-tuned algorithms, or to improve multi-lingual models through additional training.

Operationalizing AI & ML

LAS research on machine learning (ML) and artificial intelligence (AI) focuses on how machine learning concepts and techniques can still be useful even when working under the constraints of an operational environment. Researchers examine the impact of those constraints on AI/ML performance when applied in operational settings. They also seek methods to mitigate the costs of those constraints, whether they be financial, time, or cognitive resources.

These efforts include creating data from which models can learn, training models, running models in realistic conditions, and sharing models for widespread use.

Operational constraints might include limited amounts of quality data available or competing priorities for subject matter experts who can annotate data for domain-specific tasks; limited computational resources for training models; highly variable conditions in which AI/ML might be deployed; and logistical challenges around running models outside of the team or environment in which they were developed.

Within this theme, we group projects by the following topic areas:

Amplifying Knowledge

Benjamin Bauchwitz, Mary Cummings, Aaron W., Jascha Swisher

Computer vision systems that leverage supervised machine learning techniques are increasingly popular and widely applied. Recent problems with autonomous vehicles and facial recognition programs have highlighted that these systems may have failure modes that are not well understood. One possible source of such problems is the lack of formalization around data annotation, i.e., to what degree do badly labeled or even mislabeled images harm the performance of computer vision systems? Human labeler performance can decline with fatigue and a lack of motivation, while autolabeling software can generate completely wrong labels, with double-digit error rates not unusual for such programs. This effort will develop an error classification framework for human and automated annotation processes, and demonstrate how and to what degree these errors affect convolutional neural network performance in image segmentation tasks. Moreover, it will explore whether human labelers can collaborate with automated annotation tools to achieve performance superior to either alone.

Lillian Thistlethwaite, Ben Radford, Jascha Swisher, Liz Richerson, Mike G., Skip S., Stephen S., Sue Mi K., Susie B.

Identifying events and the key attributes that characterize those events can be a useful tool in intelligence analysis, as it supports an analyst’s ability to answer operational questions such as assessing or predicting potential future actions and threats. The event extraction task, formalized in natural language processing, is typically supported by a closed-domain event ontology and a corresponding hand-labeled training corpus. In contrast, open domain models do not require an ontology or a training corpus, but their model performance is consistently worse. Unfortunately, it is often the case that existing closed-domain event ontologies either do not capture all event types of interest to analysts, or the training corpora developed do not contain enough examples of an event type to train robust models. The effort involved in developing a novel event ontology and training corpus is exceptionally complex, time consuming, expensive, and not conducive to integration with existing analyst workflows. As an LAS 2023 industry partner, we developed DAFEE (Data Augmentation for Event Extraction), an event extraction pipeline that hybridizes the benefits found in both closed- (supervised, performant) and open-domain (unsupervised, automated) approaches. DAFEE leverages several data augmentation approaches to facilitate the modeling of custom event types, without requiring the effortful manual labeling of event mentions. Another important component of this work is the ability to extract events from a stack of communications that may differ substantially from the corpora that pretrained large language models (LLMs) were trained on. Because many event extractors use transformer-based encodings, it is important to tune word embeddings generated by these LLMs to be sensitive to domain-specific word usage and contexts.

Al J., James S., Felecia ML., Donita R., Chris L., Shanita T.

As data in cyberspace continues to grow, triaging the most relevant information becomes increasingly complex. So how can users effectively find the knowns and unknowns of information in the cyber domain, such as attribution, how adversaries change over time, or new types of cyber attacks? Applying knowledge graphs (KGs) to the cyber domain allows users to better understand and visualize the data by

  1. capturing entities and their relationships;
  2. putting large amounts of information in context;
  3. having an ontology that captures knowledge about the domain.

Since existing cyber KGs are either private, sensitive, or limited, it is necessary to synthesize new KGs to represent a variety of scenarios. This is where synthetic data generation comes into play. Synthetically generated KGs exhibit the same characteristics as real graphs: realistic patterns and relationships associated with each entity type in the graph’s ontology. This allows users to render cyber campaigns in a KG to test and evaluate various AI/ML techniques and to develop algorithms that triage the most relevant information.
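A toy sketch of ontology-constrained synthetic KG generation appears below; the entity types, relations, and counts are invented for illustration:

```python
# Requires `pip install networkx`.
import random
import networkx as nx

random.seed(7)
# A miniature ontology: allowed (source type, relation, target type) triples.
ONTOLOGY = [
    ("ThreatActor", "uses", "Malware"),
    ("Malware", "communicates_with", "Infrastructure"),
    ("ThreatActor", "targets", "Organization"),
]

def synthetic_kg(n_per_type=4, n_edges=12):
    g = nx.MultiDiGraph()
    types = {t for s, _, o in ONTOLOGY for t in (s, o)}
    nodes = {t: [f"{t}_{i}" for i in range(n_per_type)] for t in types}
    for t, names in nodes.items():
        g.add_nodes_from(names, entity_type=t)
    for _ in range(n_edges):  # sample edges consistent with the ontology
        s_type, rel, o_type = random.choice(ONTOLOGY)
        g.add_edge(random.choice(nodes[s_type]), random.choice(nodes[o_type]),
                   relation=rel)
    return g

kg = synthetic_kg()
for u, v, data in kg.edges(data=True):
    print(u, f"--{data['relation']}-->", v)
```

A real generator would of course calibrate degree distributions and campaign motifs against observed graphs rather than sampling uniformly, but the ontology-constrained structure is the same.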

Felecia ML., James S., Donita R.

This project develops a process at the Laboratory for Analytic Sciences (LAS) for collaborating with minority-serving institutions (MSIs) in North Carolina on research and development (R&D) projects, supporting one of NSA’s top priorities: building and sustaining a diverse, expert workforce that continues to provide the Nation with competitive advantages. Given its close proximity to many MSIs in North Carolina, the LAS team sought to strengthen NSA’s partnership with MSIs for the mutual benefit of the MSIs and NSA mission customers. The LAS partnered with NSA’s Office of Research and Technology Applications to establish Cooperative Research and Development Agreements (CRADAs) with many MSIs in North Carolina. CRADAs provide those institutions the opportunity to partner collaboratively with the NSA on research projects and to develop capabilities of significant size and impact for the NSA and Intelligence Community. These partnerships have

  1. helped attract skill sets and untapped talent from diverse backgrounds and diversify the NSA workforce;
  2. enhanced research infrastructure and expertise at MSIs;
  3. established a foundation for continued engagement with NSA.

Increasing Efficiency

Will Gleave, Julie Hong, Sam Saltwick, Joe Hackett, Kyle Rose, Stephen W., Lori Wachter, Brent Younce, Al J., Mike G., Troy W., Felecia ML., James S.

Traditional computer vision model development requires extensive data curation and annotation. This process is time consuming, labor intensive, and expensive, in some circumstances requiring up to hundreds of thousands of annotations to train an effective model. Gathering a dataset that is robust to the types of data drift that are often seen in production is also challenging, as developing a model that generalizes well requires a dataset with sufficient diversity. Techniques such as transfer learning, pretraining, knowledge distillation, and the use of larger foundation models have reduced the data annotation burden, allowing performant computer vision models to be trained with smaller labeled datasets. In this work, we explore and quantify the tradeoffs of using different combinations of these state-of-the-art techniques in a model development pipeline and identify conditions leading to performance deterioration.
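To make the annotation-efficiency idea concrete, the sketch below shows the simplest of these techniques, transfer learning: a pretrained backbone is frozen and only a small task head is trained on the smaller labeled dataset. The class count and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():       # freeze pretrained features
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)  # new head: 5 target classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on a fake batch; replace with a real DataLoader.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```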

We use the problem of detecting objects in video footage to comprehensively study and optimize computer vision model development pipelines. We gathered a dataset of open-source videos and annotated a subset of the frames for the task of object detection. We then constructed various model development pipelines with the goal of answering the following questions:

  1. Which model development techniques enable more efficient model development?
  2. Which model development techniques enable performance improvements, and what are the tradeoffs of using those techniques?
  3. What are the best methods for detecting data drift, and how are the models produced by different development pipelines impacted by data drift?
  4. When and how should models be retrained when impacted by data drift?

Kewen Peng, Tim Menzies

In this project, we explore technologies for improving the efficiency of creating and maintaining ML models. Specifically, we design and implement methods to reduce the computational resources and/or labeling cost required to fine-tune large language models. Our project contains experiments conducted in various fields: question answering (QA), semantic textual similarity (STS), and Conversational AI. In QA tasks, we use adaptive sampling and semi-supervised strategy to fine-tune the model. We also use prompt engineering for changing interrogative words for better performance. Selecting a fraction (5%) of data based on question intent compared to random selection shows a 10% improvement. Applying a simple self-training SSL framework shows another 10% improvement. In STS tasks, we leverage active learning to reduce the training cost of LLM. With the certainty-estimation approach, we actively feed the transformer with data points of greatest uncertainty, which tends to cause greater parameter updates within the model. As a result, the labeling cost of training samples is reduced to 90%, and parameter updates during the training phase are reduced to 10%.

Blog: Tackling Challenges in Language Model Training with LAS and NC State

Gedas Bertasius, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Lori Wachter, Stephen W., Aaron W., Mike G., Skip S.

The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are

  1. the spatiotemporal architecture design,
  2. the multimodal fusion schemes,
  3. the pretraining objectives,
  4. the choice of pretraining data,
  5. pretraining and finetuning protocols, and
  6. dataset and model scaling.

Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming the current SOTA by 7.8% and 6.1%, respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA.

Expanding Usability

Troy W., Brent Younce, Michael G., Aaron W.

The LAS Model Deployment Service (MDS) focuses on researching and testing methods for consistently deploying, monitoring, and scaling machine learning (ML) models and pipelines using open-source, cloud-agnostic technologies. MDS’s two primary goals are to serve as a resource for researchers at LAS, and to evaluate technologies and MLOps best practices and share this knowledge with stakeholders. MDS focuses on deploying ML models into Kubernetes using Seldon Core V2, Apache Kafka, Nvidia Triton, and Seldon MLServer. The adoption of Seldon Core V2 as a core pillar of MDS’s technology stack allows the system to separate ML model deployment into three principal components: ML models, servers capable of serving models, and pipelines. These components can be scaled and deployed dynamically: ML models can run replicas on multiple available servers, servers can scale based on load, and pipelines can be added or changed without redeploying models or servers. Multi-model serving in Nvidia Triton and Seldon MLServer provides better resource utilization and hot-swappable models, and eliminates the need to build and store custom containers for every model. Monitoring and metrics are essential to running ML models at scale; out of the box, Seldon Core V2 provides MDS with an extendable set of system health metrics through Prometheus and OpenTelemetry. All of these features should integrate easily with other existing MLOps technologies and allow the system to grow and change as the MLOps landscape evolves. Ultimately, MDS strives to strike a balance between supporting researchers and evaluating scalable solutions, both essential to empowering the use of ML at a production level.

Sean L., Troy W., Aaron W., Brent Younce

Each semester the LAS sponsors several teams of senior undergraduate students from the NC State Computer Science department to work on design & development projects within the LAS’s mission. For this project, the student team implemented, demonstrated, and evaluated the applicability of the orchestration technology Argo Workflows for automating Machine Learning Operations (MLOps) tasks in a Kubernetes cluster. LAS is prototyping an ML Model Deployment Service (MDS) to facilitate the deployment and scaling of ML models in Kubernetes. A critical component of this system will be an orchestration tool to abstract and automate regular manual tasks, which can be very complex; relieving data scientists of the burden of performing these tasks manually is the ultimate payoff. Argo is an open-source project that may prove useful to this end, and the student team evaluated its utility. An overarching goal of this project is also to enable integration of open-source projects in a proper, systematic fashion. The student team created a web interface and API to customize workflows and trigger execution.

Mike G., Patti K., Sue Mi K., Susie B., Skip S., Michele K.

GUESS, or Gathering Unstructured Evidence with Semantic Search, is also featured under this topic area; see the full project description under Generating & Exploring Hypotheses above.

Overcoming Variability

Sambit Bhattacharya, Tivon Brown, Catherine Spooner, Ashley Sutherland, Givante Lewis, Ahzsa Strange, John Slankas, Felecia M., James S., Donita R.

Detecting rare objects poses challenges for Artificial Intelligence (AI) models due to the limited availability of training data, leading to subpar performance. To solve this problem, we developed a method that generates synthetic data which is then used to train an AI object detection model. To create synthetic data for rare objects, we used 3D object modeling software and game engine software for 3D rendering in diverse environments. Metadata (i.e., segmentation, depth, and auto-labeling) was generated automatically using an extension of the game engine created for computer vision tasks. Generative Adversarial Networks (GANs) were used for image-to-image translation to generate additional synthetic data. By integrating synthetic data, we improve the models’ ability to detect and classify rare and uncommon objects. We reduced the effort needed to hand-label data and trained effective models through few-shot learning. We also developed methods to address environmental variability, which is often a reason object detection fails. Our work addresses fundamental questions in specific use cases relevant to the US Intelligence Community. It enhances object detection, saves resources, and has broad applications, improving AI system accuracy and reliability.

Agata Bogacki, David Schumann, Jonathan Walker, Jay Revere, Logan Perry, Patrick Dougherty, John Slankas, Stephen W., Felecia ML., James S., Michael G.

Artificial Intelligence and Machine Learning (AI/ML) models have proven to be valuable in increasing analysts’ efficiency in triaging video content. However, analysts usually require the ability to assess incredibly large data holdings, and existing models often struggle when presented with previously unseen video or image data that was not included in the model training set. This is particularly challenging for the IC, as its mission is often fluid and constantly changing across environments and objectives.

SAS worked with LAS to lower the training data burden and identify ways to efficiently repurpose computer vision models to new situations. Building on previous work focused on model explainability and identifying inconsistencies in model performance, SAS implemented few-shot learning and un-/semi-/self-supervised algorithms on untrained video feeds that identify small objects of interest. The team also explored using synthetic data to adjust to new operational environments and worked with LAS to investigate and develop key metrics (e.g., accuracy, bias, explainability, etc.) to measure model performance. These metrics can then be evaluated to select models that best support a wide range of operational requirements, directly quantifying model performance and providing feedback to end users. Interactive dashboards were developed in SAS Visual Analytics to inform and enable end users to compare and understand the performance of various object detection methods.

Content Triage

This theme has evolved to focus on mission needs for image, audio, and text, while still addressing the historical Content Triage driver: large volumes of data. Content Triage is also pushing the boundaries of machine learning technology by tackling multimodal challenges posed by modern communications systems and data. In 2023, Content Triage research examined new, efficient methods to access and exploit large amounts of data. These new methods include semantic queries that help an analyst better navigate unknown information, improved use of knowledge graph structures, better ways to characterize voice, and flexible detection of sounds and images – all to enable analysts to retrieve unknown insights from data. After all, useful information that systems contain but that no one ever becomes aware of is perhaps the most tragic missed opportunity for the intelligence community’s prodigious efforts.

Content Triage projects are divided into three scopes to explore at our symposium: Sight, Sound, and Search. These innovative projects demonstrate novel ways to address mission challenges around the ever-present need to process and exploit large data volumes.

Sight

Ketan Mayer-Patel, Montek Singh, Andrew Freeman

Due to the enormous data redundancy inherent to surveillance video, it can be difficult and slow for analysts to triage large time spans of surveillance data. We propose a novel system for capturing and analyzing extremely high-rate, high dynamic range surveillance video in an efficient, rate-controlled manner. We build upon our previous LAS work in video object detection, which leverages an asynchronous video representation layer to directly encode motion information in the image representation. Our system can efficiently transcode and compress surveillance footage captured by both framed and event-based camera sensors with up to microsecond temporal precision. We demonstrate a source-agnostic feature detection application which ensures that moving areas of interest maintain a high quality. By contrast, pixels that are not deemed interesting are processed at a much lower rate, thus reducing the computational cost of the application. We also present preliminary results showing that a source-modeled compression scheme for our video representation can exceed the compression performance of state-of-the-art video codecs in scenes with high temporal redundancy.

Felecia ML., James S., Donita R., John Slankas, Ashley S.

What happens when open-source computer vision models such as You Only Look Once (YOLO) try to detect objects such as customized weapons or flags of adversaries in foggy or dark environments? As you may guess, the models tend to perform poorly. Why, you might ask? These objects are considered rare and uncommon – objects that are underrepresented, or not represented at all, in training datasets. Developing object detection models for certain types of objects is hampered by this lack of relevant training data. For example, open-source computer vision models are pre-trained to detect common objects such as people, cars, and pets in perfect lighting and weather conditions. However, rare/uncommon objects, which can be just as important to detect as common objects, may be misidentified or missed altogether. Therefore, the LAS is researching the best approach for increasing the accuracy with which computer vision models detect rare/uncommon objects.
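One plausible remedy, sketched below, is fine-tuning an off-the-shelf YOLO model on a small, partly synthetic dataset of the rare objects; the dataset file and settings are placeholders, not LAS’s actual experimental setup:

```python
# Requires `pip install ultralytics`.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained on COCO's common object classes

# rare_objects.yaml (hypothetical) would list train/val image dirs and the
# rare-object classes, mixing real images with renders in fog and low light.
model.train(data="rare_objects.yaml", epochs=50, imgsz=640)

metrics = model.val()                          # mAP on held-out rare objects
results = model("dark_checkpoint_photo.jpg")   # detect in a new image
results[0].show()
```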

Matthew Wright, Kelly Wu, Saniat Sohrawardi, Jascha Swisher, Aaron W., Jacque J., John L., Michelle W., Candice G.

The rapid evolution of deepfake technology, driven by deep-learning tools, poses a significant and growing threat due to its ability to manipulate media and disseminate false information. This danger was acutely demonstrated during the 2022 Russia-Ukraine conflict when deepfake videos portraying President Putin and President Zelensky surrendering were widely circulated, highlighting the potential for deepfakes to exacerbate disinformation campaigns in times of geopolitical turmoil. As this technology continues to advance, the demand for reliable tools to empower intelligence analysts with the ability to swiftly and accurately evaluate manipulated media becomes paramount.

Existing deepfake detection methods, predominantly relying on complex deep learning models, demonstrate remarkable performance in controlled environments but falter when confronted with real-world online content. Additionally, they often lack transparency, making it challenging for human analysts to comprehend and validate their results effectively.

In response to these challenges, our research focuses on understanding the nuanced needs of intelligence analysts. Through our investigations, we’ve found that analysts seek intuitive explanations for detection methods aligned with traditional approaches, benefiting from baseline expectations to enhance result comprehension. They also desire greater transparency and control over the tools they use, preferring standardized communication of output, which aids analysis and report generation. Our work extends beyond deepfake detection; we’ve also analyzed and standardized the taxonomy of media manipulation detection methods, aiming to facilitate seamless communication between the tool, users, and stakeholders. By addressing these specific needs, our research aims to create user-friendly, transparent, and effective deepfake detection tools, aligning with LAS’s interest in Triaging Manipulated Media and emphasizing the importance of machine-human collaboration in safeguarding national security.

Sound

Donita R., Pauline M., Patti K., Jacque J., Sean L., Tina K.

Although analysts currently have several tools for searching voice data for specific information (e.g., keywords and speakers), they do not currently have practical tools for exploring large repositories of voice data when they have no clues from which to form search queries. This research project aims to provide analysts with algorithms that characterize voice data, giving them a starting point for learning more about a repository when search tools cannot reduce the data to a volume small enough to discover the unknown information it contains.

During 2023 the voice triage team focused on testing our 2022 algorithms on data-in-the-wild, analyzing the results, modifying our code and processes, and re-assessing. We also learned more about the corporate network architecture and designed potential solutions for porting our algorithms to it.

Carlos Busso, Abinay Reddy Naini, Lucas Goncalves, Tina K., Liz Richerson, Donita R.

This project aims to create self-supervised learning (SSL)-based representations for speech emotion recognition (SER) that are not only robust but also generalize to new conditions. Advances in artificial intelligence can significantly impact the effectiveness of the U.S. Intelligence Community (IC) by automatically detecting relevant traits or behaviors, such as emotions, conveyed in massive unlabeled, unconstrained recordings. There is a clear need to improve SER performance for usability purposes in IC operations. Studies using SSL speech representations have demonstrated significantly better performance than conventional methods. The first aim explores existing domain-agnostic SSL speech representations to improve SER performance for the prediction of valence, arousal, and dominance (a minimal sketch of this representation-extraction step follows the list below). We will explore the tradeoff between the computational complexity associated with different model architectures and their performance. The second aim uses SSL to build emotional representations from multimodal domain-dependent pre-text tasks to improve SER performance. During training, the model is trained with multimodal pre-text tasks; it is then used for emotion recognition using only speech. We will create several unimodal and multimodal pre-text tasks carefully designed to

  1. create discriminative SER representations, and
  2. leverage the complementary relationship between acoustic, lexical and facial features.
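The sketch below uses wav2vec 2.0 as one plausible domain-agnostic SSL encoder, kept frozen, with a small regression head for valence, arousal, and dominance; the pooling and head are assumptions, not the project’s design:

```python
# Requires `pip install torch transformers`.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.requires_grad_(False)          # keep the SSL representation fixed

head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 3))

waveform = torch.randn(1, 16000)       # 1 second of 16 kHz audio (stand-in)
with torch.no_grad():
    frames = encoder(waveform).last_hidden_state   # (1, T, 768)
utterance = frames.mean(dim=1)         # simple mean pooling over time
vad = head(utterance)                  # [valence, arousal, dominance] scores
print(vad)
```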

Tina K., Patti K., Jacque J., Donita R., Ed S.

We seek to provide analysts with the ability to triage voice data in audio repositories based on the presence of non-speech sounds (e.g., trains, explosions, and barking dogs) in audio recordings. Users may define specific sounds of interest, or iteratively explore non-speech sounds from a particular set of recordings. Our 2023 work has focused on creating algorithms whose models can be tuned to add more sound classes to the algorithm’s model repository, and on testing these algorithms on data-in-the-wild, that is, messy, real data versus the clean and curated data used for algorithm research. Our data-in-the-wild testing enables us to discover and remedy sources of failure not encountered in the curated corpora. Once we have models that perform well on messy, realistic data, we will port the algorithm to the corporate network and test it on mission data; once again, when we discover sources of failure, we will find solutions. We are currently using VGGish embeddings as features for a deep multi-layer perceptron (MLP), and we will show the results of our research forays.
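The classifier stage can be sketched as follows, assuming precomputed 128-dimensional VGGish embeddings, one per audio frame; the layer sizes and the sound classes shown are illustrative:

```python
import torch
import torch.nn as nn

SOUND_CLASSES = ["train", "explosion", "dog_bark"]  # extendable class list

mlp = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, len(SOUND_CLASSES)),   # widen to add new sound classes
)

embeddings = torch.randn(32, 128)        # stand-in batch of VGGish frames
probs = mlp(embeddings).softmax(dim=1)
print(SOUND_CLASSES[probs[0].argmax()])  # predicted sound for the first frame
```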

Search

Gedas Bertasius, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Lori Wachter, Stephen W., Aaron W., Mike G., Skip S.

This project, an empirical study of video-and-language (VidL) model design that produced the VindLU pretraining recipe, is also featured under this topic area; see the full project description under Increasing Efficiency above.

Stephen W., Brent Younce, Ethan D., Nicole F., Troy W., Aaron W., Tina K., Felecia ML., James S., Sheila B., Stacie K.

The LAS EYERECKON project is creating an unclassified video content processing pipeline to demonstrate the full scope of MLOps, including techniques, tools, workflows, and resources. Using unclassified video content provided by customers, LAS has crafted a demo showcasing how video content can be natively searched and prioritized within the intelligence analyst workflow. Key areas of research for this project include ML-assisted labeling; object detection, recognition, and tracking; processing and storage of ML output; user search capabilities; and summarization of video data (e.g., geolocation-focused).

Blake Hartley, Mike Geide, Mattia Shin, Stephen W., Lori Wachter, Pauline M., Al J., Jacque J., Stephen S.

Recommender systems have seen growing application in the field of cyber security. Over the past three years, PUNCH has applied recommender system and transfer learning theory to SIGINT data and found that these datasets have a vast amount of structure which can be used to train ML models to make predictions and recommendations. We also developed pipelines which can be configured on unclassified datasets and then deployed and tuned on classified data. With ORBS, we worked to create a modular recommender system which could be deployed in the field and configured based on analyst preference and on pipeline results and performance. The first module of our system is an MLOps repository which can be installed locally and used to load data, preprocess columns, engineer features, and train models for recommender systems. The second module is a web UI which takes models produced by the first module and allows users to generate recommendations from them.
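For readers unfamiliar with the model family, the sketch below shows a generic matrix-factorization recommender trained on implicit interactions, the kind of structure-exploiting model described above; it is not PUNCH’s pipeline, and all data is synthetic:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_users, n_items, dim = 50, 40, 8
interactions = (torch.rand(n_users, n_items) > 0.9).float()  # fake history

user_emb = nn.Embedding(n_users, dim)
item_emb = nn.Embedding(n_items, dim)
opt = torch.optim.Adam(list(user_emb.parameters()) +
                       list(item_emb.parameters()), lr=0.05)

# Learn embeddings whose dot products reconstruct the interaction matrix.
for _ in range(200):
    scores = user_emb.weight @ item_emb.weight.T
    loss = F.binary_cross_entropy_with_logits(scores, interactions)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Recommend the unseen items with the highest predicted scores for user 0.
pred = (user_emb.weight @ item_emb.weight.T)[0].detach()
pred[interactions[0].bool()] = float("-inf")  # mask already-seen items
print("Top items for user 0:", pred.topk(3).indices.tolist())
```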

Mike G., Patti K., Sue Mi K., Susie B., Skip S., Michele K.

GUESS, or Gathering Unstructured Evidence with Semantic Search, is also featured under this topic area; see the full project description under Generating & Exploring Hypotheses above.

Sean L., Stephen W., Aaron W., Brent Younce

Each semester the LAS sponsors several teams of senior undergraduate students from the Department of Computer Science at NC State to work on design & development projects within the LAS’s mission. For this project, two student teams over successive semesters developed a web-based application that enables system managers and users to manage data flows and warehouses within their organizations. As data volumes continue to increase, and as demand for derivative transformations of data (such as ML-based vector embeddings) grows, the need to prioritize data of value increases: pipelines and warehouses cannot process and retain data of such volume, velocity, and variety. Specifically, the student-developed application provides managers and users two methods of prioritizing data entering, and within, warehouses for extraction, transformation, loading, indexing, and retention. The first method is manual: managers specify rules defining which data is to be kept, where, and for how long. The second method is more automated, using a machine learning based process that infers which data a set of users may wish to process or retain, and for how long, based on their past activity metrics.
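The first, manual method might look like the following sketch (the rule fields are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class RetentionRule:
    match_source: str      # data source this rule applies to
    destination: str       # warehouse/tier to route matching data to
    retention_days: int    # how long to keep it

RULES = [
    RetentionRule("field_sensor", "hot_store", 30),
    RetentionRule("bulk_archive", "cold_store", 365),
]

def route(record: dict) -> tuple[str, int]:
    """Return (destination, retention_days) for an incoming record."""
    for rule in RULES:
        if record.get("source") == rule.match_source:
            return rule.destination, rule.retention_days
    return "cold_store", 90   # default policy

print(route({"source": "field_sensor", "id": 17}))  # ('hot_store', 30)
```

The second, automated method would effectively learn such rules from users’ past activity instead of having managers write them by hand.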