2025 Research Symposium

Each year, LAS undertakes a research program that involves collaborators from academic, industry, and government communities. The outcomes of these research projects are of interest to our intelligence community stakeholders, as well as the respective communities of our academic and industry partners.

Projects

We invite you to learn about this year’s unclassified research projects. In addition to the abstracts below, our research teams have provided a video or article that explains their motivations, methods, and outcomes. This year’s projects are grouped into the following themes: Audio Sensemaking, Video Sensemaking, AI-Infused Workflows and Analytics, and AI Assessment and Benchmarking.

Audio Sensemaking

Structured Annotation of Audio Data (University of North Carolina at Chapel Hill)

Jaime Arguello, Robert Capra, Bogeum Choi, Jiaming Qu, Jaycee Sansom, Ziyi Wei, Nathan Cavender, Elizabeth Richerson, Christine Brugh, Patricia K., Tim S., Sue Mi K.

Intelligence analysis of audio data involves a triage stage in which data is transcribed, interpreted, and annotated by analysts making operator comments (OCs). Over the past three years, we have worked with our LAS partners on developing a system for making structured OCs. In 2023, we conducted a study in which 30 analysts made hypothetical OCs on transcribed conversations from an unclassified domain. A qualitative analysis of OCs made by participants resulted in a taxonomy of 25 OC categories along with input fields associated with each category. In 2024, we developed a prototype system for making structured OCs. The system enables analysts to categorize their OCs and complete specific input fields deemed relevant to the OC category. In 2025, we expanded the prototype to highlight OC predictions. We developed tools for analysts to select specific OC categories and see locations in a transcript where those types of OCs may need to be made. The interface enables analysts to easily accept and reject predictions. Additionally, analysts can use sliders to adjust the confidence for the predictions they wish to see (e.g., only high- or only low-confidence predictions). We conducted a two-phase study (N=18) in which analysts were asked to make OCs with the prototype with and without the predictions. After each phase, analysts completed questionnaires about workload, difficulty, satisfaction, and perceptions of the OCs made. Analysts also completed an exit interview about their perceptions of the system, the OC predictions, the process of making structured OCs, and the OC taxonomy itself.
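
The abstract above does not include implementation details, but the category- and confidence-based filtering it describes can be sketched roughly as follows; the data structures, field names, and thresholds here are hypothetical, not taken from the prototype.

```python
# Hypothetical sketch of the prediction-filtering behavior described above:
# analysts pick OC categories and a confidence range, and the interface
# surfaces only the matching predictions for accept/reject review.
from dataclasses import dataclass

@dataclass
class OCPrediction:
    transcript_span: tuple[int, int]  # character offsets in the transcript
    category: str                     # one of the 25 OC taxonomy categories
    confidence: float                 # model confidence in [0, 1]

def filter_predictions(predictions, selected_categories, min_conf=0.0, max_conf=1.0):
    """Return only the predictions the analyst has asked to see, in transcript order."""
    visible = [
        p for p in predictions
        if p.category in selected_categories and min_conf <= p.confidence <= max_conf
    ]
    return sorted(visible, key=lambda p: p.transcript_span[0])

# Example: show only high-confidence predictions of one category.
preds = [
    OCPrediction((120, 140), "named entity", 0.91),
    OCPrediction((300, 320), "translation note", 0.42),
]
print(filter_predictions(preds, {"named entity"}, min_conf=0.8))
```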

Tailored Cross-Lingual Conversational Data Summarization and Evaluation (RTX BBN Technologies)

Sammi Sung, Hemanth Kandula, William Hartmann, Matthew Snover, Elizabeth Richerson, Jascha Swisher, Michael G., Paula F., Patricia K., John Nolan

With the increasing capability and availability of LLMs, LLM-based document summarization is being widely adopted. However, summarization evaluation is still a major challenge. LLMs are often deployed based on the qualitative feel of a handful of examples as opposed to a systematic quantitative evaluation. Without metrics, it is impossible to objectively compare summarization systems and measure improvement over time.

In response to this challenge, BBN, in collaboration with LAS, developed a fact-based evaluation pipeline that measures the accuracy (precision) and information coverage (recall) of summaries. For both the summary and the underlying document, we generate a set of facts. We then use an entailment model to determine whether the summarized facts are supported by the document and whether the source facts are covered by the summary. Precision measures whether the information in the summary is supported by the document and helps detect hallucinations. Recall measures the percentage of facts from the document that are carried over to the summary and represents the amount of information lost in the summarization process. 
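
A minimal sketch of the fact-based precision/recall computation described above is given below; extract_facts and entails stand in for the fine-tuned fact extractor and entailment model, and their names and interfaces are assumptions rather than BBN's actual API.

```python
# Rough sketch of the fact-based precision/recall idea described above.
# `extract_facts` and `entails` are placeholders for the fact extractor and
# entailment model; they are assumptions, not the delivered pipeline's API.

def summary_precision_recall(document: str, summary: str, extract_facts, entails):
    doc_facts = extract_facts(document)   # facts asserted by the source document
    sum_facts = extract_facts(summary)    # facts asserted by the summary

    # Precision: fraction of summary facts supported by the document
    # (low precision suggests hallucinated content).
    supported = sum(1 for f in sum_facts if entails(premise=document, hypothesis=f))
    precision = supported / len(sum_facts) if sum_facts else 0.0

    # Recall: fraction of document facts carried over into the summary
    # (low recall means information was lost during summarization).
    covered = sum(1 for f in doc_facts if entails(premise=summary, hypothesis=f))
    recall = covered / len(doc_facts) if doc_facts else 0.0
    return precision, recall
```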

As part of our research, we delivered an evaluation pipeline that supports both general and topic-focused summarization evaluation. The delivered pipeline includes a standalone, fine-tuned fact extractor and the ability to add citations inside the summary. The citations point back to statements within the original document that support the statements in the summary.

We tested our evaluation pipeline on conversational Mandarin speech with a range of off-the-shelf and fine-tuned summarization systems. We were able to demonstrate the ability of our metric to capture the impact of translation and transcription errors, as well as detect error states in summarization systems, outperforming previously proposed metrics. Our metric allows developers to effectively compare summarization systems and strategies without human reference summaries.

Video Sensemaking

Meta-Embeddings for Small Object Image Search (Saint Louis University, Washington University in St. Louis, George Washington University, Temple University)

Abby Stylianou, Robert Pless, Nathan Jacobs, Richard Souvenir, Stephen W., Lori Wachter, Brent Younce

Analysts searching large-scale image databases often need to locate small but distinctive visual elements—such as a unique lamp in a hotel room or a specific vehicle in a surveillance image—within vast and visually diverse collections. Our team focused on improving investigative capabilities for such systems, and in particular on improving the underlying models and operational capabilities of TraffickCam, a deployed image search platform used to support human trafficking investigations at the National Center for Missing and Exploited Children. Our work has focused on improving vision-language model representations to reduce concept entanglement between text and image features, enabling more precise and interpretable retrieval for mixed-modality queries. This year specifically, we introduced Query-Adaptive Retrieval Improvement (QuARI), a technique that learns lightweight, query-specific linear transformations of the embedding space to emphasize features most relevant to a given query. This approach improves fine-grained and small-object retrieval performance without requiring costly re-embedding or fine-tuning, and scales efficiently to millions of images. These advances strengthen the ability to connect investigative queries to relevant imagery, enhancing analysts’ search effectiveness and demonstrating how research innovations in vision-language modeling can translate directly into operational impact in real-world investigative contexts.
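
To make the query-adaptive idea concrete, here is a minimal sketch of reranking with a query-specific linear transform, under the simplifying assumption that a small predictor maps a query embedding to a d x d matrix; the function names and the cosine-scoring choice are illustrative, not the QuARI implementation.

```python
import numpy as np

# Minimal sketch of query-adaptive reranking: a predictor maps the query
# embedding to a linear transform that emphasizes query-relevant features,
# and candidates are rescored in the transformed space.

def query_adaptive_rerank(query_emb, db_embs, predict_transform, top_k=100):
    """Rerank database embeddings using a query-specific linear transform."""
    W = predict_transform(query_emb)   # (d, d) transform for this query
    q = W @ query_emb                  # project the query
    X = db_embs @ W.T                  # project candidate embeddings (n, d)

    # Cosine similarity in the transformed space.
    q = q / np.linalg.norm(q)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    scores = X @ q
    return np.argsort(-scores)[:top_k]

# Toy usage with an identity "transform predictor".
d, n = 8, 1000
rng = np.random.default_rng(0)
idx = query_adaptive_rerank(rng.normal(size=d), rng.normal(size=(n, d)),
                            predict_transform=lambda q: np.eye(d))
```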

3D Reconstruction and Synthetic Data Generation Pipeline for AI Model Training (Fayetteville State University)

Sambit Bhattacharya, Catherine Spooner, Zachary Delaney, Tyuss Handley, Miriam Delgado, Joshua Lockart, Jesse Claiborne, Isara Witten, Bryce Herring, Abdullah Waleed, Michael Backus, Felecia M., James S., Daran L., John Slankas, Jascha Swisher

This project developed an integrated pipeline combining advanced 3D reconstruction with synthetic data generation capabilities to train robust AI object detection models. The work addresses critical challenges in acquiring and labeling real-world data for defense and aerospace applications by creating photorealistic 3D assets and generating diverse synthetic training datasets.

We met initial milestones with Neural Radiance Field (NeRF) algorithms for transforming sparse 2D image collections into high-fidelity 3D meshes. Next, we adopted techniques based on flow-based diffusion transformers and structured latent representations, which demonstrated improved reconstruction quality and efficiency. These methods employ two-stage pipelines that separately generate geometry and synthesize textures, and leverage large-scale pre-trained models rather than per-scene optimization. We have developed a workflow for mesh refinement using Blender, and for object extraction and retopology. This workflow can reduce manual effort for intelligence analysts while producing clean, precise 3D models suitable for synthetic environment integration.

We use the Digital Combat Simulator (DCS) to generate comprehensive synthetic datasets, varying environmental conditions such as weather, lighting scenarios, and camera parameters to create realistic training data. Currently, we rely on high-quality 3D models available in DCS, with our reconstructed models providing a pathway for future custom object integration.
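
As an illustration of how such condition variation might be scripted, the sketch below enumerates weather, time-of-day, and camera settings and samples randomized capture configurations; the parameter names and values are examples, not the project's actual DCS mission settings.

```python
import itertools
import random

# Illustrative enumeration of scene conditions for synthetic image capture.
# The specific parameters and values are assumptions for the sketch.
WEATHER = ["clear", "scattered clouds", "overcast", "rain", "fog"]
TIME_OF_DAY = ["dawn", "noon", "dusk", "night"]
CAMERA = [
    {"altitude_m": 500, "fov_deg": 40},
    {"altitude_m": 2000, "fov_deg": 60},
    {"altitude_m": 8000, "fov_deg": 20},
]

def sample_capture_configs(n, seed=0):
    """Draw n randomized scene configurations for synthetic data generation."""
    all_configs = [
        {"weather": w, "time_of_day": t, **cam}
        for w, t, cam in itertools.product(WEATHER, TIME_OF_DAY, CAMERA)
    ]
    random.Random(seed).shuffle(all_configs)
    return all_configs[:n]

for cfg in sample_capture_configs(3):
    print(cfg)
```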

We have demonstrated the pipeline’s effectiveness through successful training of AI object detection models capable of identifying 30-50 distinct objects across various operational conditions. Beyond detecting objects of military interest, we are developing specialized detectors for aircraft payload identification and aircraft takeoff action recognition, with models being optimized for deployment environments and validated across challenging scenarios.

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering (University of North Carolina at Chapel Hill)

Gedas Bertasius, Md Mohaiminul Islam, Lori Wachter, Brent Younce, Stephen W.

Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model for handling long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, Video-MME, and MLVU. Code and models are available at https://sites.google.com/view/bimba-mll.
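
As a rough illustration of the token-reduction step, the sketch below scores spatiotemporal video tokens and keeps a short subsequence for the LLM; this top-k rule is a much-simplified stand-in, not BIMBA's learned selective-scan mechanism.

```python
import torch

# Much-simplified stand-in for the token-reduction step described above:
# score spatiotemporal video tokens and keep a short, order-preserving
# subsequence for the LLM. BIMBA itself uses a learned selective scan,
# not this top-k rule.

def reduce_video_tokens(tokens: torch.Tensor, scorer: torch.nn.Linear, keep: int):
    """tokens: (num_frames * patches, dim) -> (keep, dim), preserving order."""
    scores = scorer(tokens).squeeze(-1)              # one relevance score per token
    keep_idx = scores.topk(keep).indices.sort().values
    return tokens[keep_idx]

tokens = torch.randn(16_384, 768)                    # e.g., 64 frames x 256 patches
scorer = torch.nn.Linear(768, 1)
reduced = reduce_video_tokens(tokens, scorer, keep=512)
print(reduced.shape)                                 # torch.Size([512, 768])
```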

AI-Infused Workflows and Analytics

Uncovering Analyst Preferences for Layered Explainable Interfaces in Speech-to-Text Model Selection Scenarios to Support Model Agility (North Carolina State University)

Deborah Littlejohn, Valeria López Torres, Amaya Hush, Laurel Zhang, Olha Novikova, Christine Brugh, Lori Wachter, Tim S., Patti K., Jacque J.

This research examines how seniority level influences Language Analysts’ information needs and visual preferences for explainable user interfaces (XUI) to support their decision-making when selecting Speech-to-Text (STT) models for audio sensemaking tasks. Grounded in human-centered design methods and explainable AI (XAI) principles, we developed prototypes for STT model selection interfaces and conducted a structured qualitative study with 28 participants to elicit interpretations of explanation modalities and information architectures. Our results indicated that, in addition to model performance data and other quantitative metrics, analysts across all seniority levels value community reports of usage preferences (e.g., peer-validated model performance, task appropriateness ratings, and model usage frequency) as reliable criteria for selecting STT models. We conclude that leveraging community usage data alongside technical data is key to calibrating trust, as users frequently prioritize real-world operational feedback over abstract performance scores. Analysts also strongly prefer layered explanations that integrate immediate, plain-language rationales with this community input to enhance model agility without contributing to cognitive overload. This research defines a tiered structure for future STT model selection interfaces based on three levels of explanation: Level 1 (Glanceable) prioritizes plain-language, task-specific recommendations; Level 2 (Intermediate) synthesizes technical data with community input; and Level 3 (Technical) is tailored for domain experts requiring granular metrics (e.g., Word Error Rate, training data). Outcomes from this research include a robust framework for layered explanation delivery, and design guidelines for responsive XUIs that dynamically match explanation levels to the analyst’s workflow and information needs, thereby accelerating the audio analysis process.
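
A simple sketch of the three-level structure is shown below; the level contents follow the abstract, while the data model and field names are illustrative assumptions.

```python
from enum import Enum

# Sketch of the tiered explanation structure defined above. The level
# descriptions follow the abstract; the data model itself is illustrative.

class ExplanationLevel(Enum):
    GLANCEABLE = 1     # plain-language, task-specific recommendation
    INTERMEDIATE = 2   # technical data synthesized with community input
    TECHNICAL = 3      # granular metrics (e.g., Word Error Rate, training data)

def explanation_for(model_card: dict, level: ExplanationLevel) -> dict:
    """Return only the fields appropriate for the requested explanation level."""
    if level is ExplanationLevel.GLANCEABLE:
        return {"recommendation": model_card["recommendation"]}
    if level is ExplanationLevel.INTERMEDIATE:
        return {"recommendation": model_card["recommendation"],
                "community_feedback": model_card["community_feedback"]}
    return model_card  # TECHNICAL: everything, including WER and training-data details
```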

Facilitating the Practical Implementation of Improved Explainability and Visual Representation for Confidence and Uncertainty in Speaker Models (North Carolina State University)

Helen Armstrong, Matthew Peterson, Rebecca Planchart, Kweku Baidoo, Tim S., Jacque J., Christine Brugh, Lori Wachter, Patrick J., and others

There are significant challenges inherent to the calibration of trust within human-machine teams in the intelligence community. The visualization of confidence and uncertainty, embedded within a user interface and user experience, should help language analysts appropriately calibrate trust via model transparency and interpretability. Such calibration could enable an analyst to more effectively evaluate model outputs when making a decision. To support the calibration of trust between analysts and speaker models, an effective visualization of confidence and uncertainty must be paired with a user interface and user experience that enable progressive disclosure of layered explanations, as well as a dynamic system enabling analysts to adjust risk parameters in consideration of the larger mission context. This design investigation utilized a novel depth of engagement framework to reconsider what information analysts encounter and when, and to structure visual exploration of confidence scores. We delivered three systematic visualization solutions, including an Arc Gauge.

EvalOps: Advancing Human-Machine Teaming through Rapid, Scalable Evaluation (Worcester Polytechnic Institute, Kenyon College, University of North Carolina at Chapel Hill)

Lane Harrison, Ray Wang, R. Jordan Crouser, Liz Richerson, Christine Brugh, Bo L., Aislinn P., Ed S., Tim S.

Evaluation remains a significant challenge in the design and development of AI-enabled tools for the IC, particularly as systems grow in complexity across automated summarization, recommendation, entity and relationship extraction, and underlying models. While empirical evaluation and management of system components falls under the umbrella of MLOps, there is a substantial gulf between technically performant systems and systems that reliably improve end-user efficiency and experience.

This project has developed EvalOps, a set of workflows for the rapid development of highly instrumented empirical studies that can be deployed immediately with stakeholders for feedback. Where MLOps accelerates the development of models and the management of the architecture that drives systems, EvalOps accelerates the development of interfaces and the management of the stakeholder feedback that shapes systems.

This project develops systems and processes that facilitate efficient, direct stakeholder evaluation of AI-driven interfaces, and supports evaluating whether AI workflows meet analyst needs, skill sets, and constraints. By tackling the technical and workflow challenges of conducting effective empirical studies with AI systems, we aim to more rapidly operationalize the potential of AI and ML techniques for the IC.
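
The highly instrumented studies mentioned above depend on fine-grained interaction logging; the sketch below shows one way such instrumentation might be structured, with an event schema that is an assumption rather than the EvalOps implementation.

```python
import json
import time
import uuid
from pathlib import Path

# Illustrative interaction logger for an instrumented study interface.
# The JSONL event schema here is an assumption, not the EvalOps design.

class StudyLogger:
    def __init__(self, participant_id: str, log_dir: str = "logs"):
        self.participant_id = participant_id
        self.session_id = str(uuid.uuid4())
        self.path = Path(log_dir) / f"{participant_id}_{self.session_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, event_type: str, **details):
        """Append one timestamped interaction event (clicks, queries, ratings, ...)."""
        record = {
            "ts": time.time(),
            "participant": self.participant_id,
            "session": self.session_id,
            "event": event_type,
            "details": details,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

logger = StudyLogger("P01")
logger.log("summary_rated", item_id="doc-42", rating=4)
```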

AI Assessment and Benchmarking

LLMming: Validation of Black-box Generative ML Models for Safety and Qualified Trust (Rockfish Research)

Chris Argenta, Aaron W., John Slankas, Jascha Swisher

Large Language Models (LLMs) and other generative models are already employed in a wide range of domains, often with only anecdotal evidence of suitability or safety. These are generally black-box models, so while they may perform perfectly well on many topics, they may harbor hidden risks with respect to some other topic (intentionally or not). Unfortunately, inspection of the architecture, weights/biases, or benchmark scores is unlikely to uncover such risks. These risks have been particularly salient this year with the release of several new LLMs that challenged previous market expectations.

This project focuses on designing automated methods of validating AI models so that they can be employed with more confidence and with a deeper understanding of their capabilities, performance, risks, and applicability to the tasks at hand. Managing safety also requires managing meaningful changes across model versions and discovering unintended effects resulting from fine-tuning, quantization, unlearning, or constitutional checks. With validation processes, we can establish qualified trust and safety guardrails that ensure suitability for intended uses.

Our results this year include creating and demonstrating an approach to comparative validation techniques for black-box models that targets domains for which we do not have known “right answers”. We have successfully developed tools that generate domain validation prompts, execute sampling on remote and local LLMs, analyze and compare the responses given by various models, aggregate those differences with inferred positions/perspectives, and present/explain the results visually. We have demonstrated our validation process across multiple domains, on multiple foundation models, over multiple versions/variants of the same models, with prompts in different languages, and with different system prompts/personas.
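
A hedged sketch of this comparative-validation loop is shown below: the same generated probe prompts are issued to several black-box models and their sampled responses are collected for comparison; query_model and the similarity measure are placeholders, not the project's tooling.

```python
from collections import defaultdict

# Sketch of a comparative-validation loop: issue the same probe prompts to
# several models and collect sampled responses for downstream comparison.
# `query_model` abstracts over remote and local LLM endpoints and is an
# assumption, as is the `similarity` function.

def run_validation(prompts, models, query_model, n_samples=3):
    """Return {prompt: {model: [responses]}} for later analysis/aggregation."""
    results = defaultdict(dict)
    for prompt in prompts:
        for model in models:
            results[prompt][model] = [query_model(model, prompt) for _ in range(n_samples)]
    return results

def pairwise_disagreement(responses_a, responses_b, similarity):
    """Average dissimilarity between two models' sampled responses for one prompt."""
    pairs = [(a, b) for a in responses_a for b in responses_b]
    return 1.0 - sum(similarity(a, b) for a, b in pairs) / len(pairs)
```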

Automatically Generating Traceable, Multi-Source, and Multimodal Summaries of Crises and Conflict (University of Maryland)

Cody Buntain, Tasmeer Alam, Mohammed Ansari, Mike G., Liz Richerson, Jascha Swisher

The FACTS summarization project advances the state of the art in crisis-event summarization, fusing text and image data gathered from disparate social media and online news sources into crisis-response-relevant daily summaries. In this project, we establish 1) a framework for evaluating multimodal large vision language models (VLMs) in a task-specific context, 2) an assessment of the information density of images versus text in crisis summarization, and 3) an end-to-end pipeline for fusing these data into daily, traceable, multimodal summaries. Our evaluation framework enables multimodal-model comparisons in an information-extraction task, currently using 18 crisis events gathered and summarized as part of a community-driven data challenge at the annual Text Retrieval Conference. We evaluate each model on its ability to extract relevant and valid facts from a collection of text and images, assessing the coverage and redundancy of extracted facts against human- and AI-annotated facts; current results suggest InternVL 2.5 and Google’s Gemma perform well relative to other models, with InternVL 2.5 dominating Gemma in runtime. Analysis of facts extracted from text- and image-based data shows limited overlap in the information provided by these two sources, as only approximately 7% of image-based facts appear supported by text-based facts in our current datasets. We are currently constructing an additional annotation task to validate image-based fact extraction. We will finalize the project with an end-to-end prototype that uses these best-performing models to produce daily, traceable summaries for crises.
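
The cross-modal overlap measurement mentioned above can be sketched minimally as follows; supports stands in for an entailment or matching model and is an assumption, not the project's pipeline.

```python
# Minimal sketch of the cross-modal overlap measurement: the share of
# image-derived facts that are supported by any text-derived fact.
# `supports` is a placeholder for an entailment or matching model.

def image_text_overlap(image_facts, text_facts, supports):
    if not image_facts:
        return 0.0
    supported = sum(
        1 for img_fact in image_facts
        if any(supports(premise=txt_fact, hypothesis=img_fact) for txt_fact in text_facts)
    )
    return supported / len(image_facts)

# A value near 0.07 would correspond to the ~7% overlap reported in the abstract.
```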

SANDGLOW 

Paul N.

Sandglow is a modular, GPU-optimized framework for large-scale multimedia analysis, designed to unify video, image, and audio processing within a single, extensible architecture. The project addresses a key challenge in applied AI research: efficiently extracting, transforming, and interpreting multimodal data without relying on monolithic or proprietary systems.

Sandglow introduces a configurable pipeline capable of frame-level video sampling, distortion-free image scaling, and integrated audio transcription. Its design emphasizes reproducibility, scalability, and transparency, offering a foundation for experimentation in multimodal reasoning and content understanding. By enabling selective frame sampling and distributed GPU processing, Sandglow achieves a balance between computational efficiency and analytical precision.
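
One way such a configurable, modular pipeline might be structured is sketched below; the stage names, config keys, and composition pattern are assumptions, not Sandglow's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative modular pipeline: independent, inspectable stages (frame
# sampling, scaling, transcription) composed from a single configuration.
# Names and keys are assumptions for the sketch.

@dataclass
class PipelineConfig:
    frame_interval_s: float = 1.0    # sample one frame per second
    target_width: int = 1280         # scale while preserving aspect ratio
    transcribe_audio: bool = True

@dataclass
class Pipeline:
    config: PipelineConfig
    stages: list = field(default_factory=list)

    def add_stage(self, name: str, fn: Callable):
        """Register a swappable processing stage."""
        self.stages.append((name, fn))
        return self

    def run(self, media_path: str):
        """Run each stage on the source media; store outputs under the stage name."""
        artifact = {"source": media_path}
        for name, fn in self.stages:
            artifact[name] = fn(media_path, self.config)
        return artifact

# Hypothetical usage:
# pipeline = (Pipeline(PipelineConfig())
#             .add_stage("frames", sample_frames)
#             .add_stage("scaled", scale_frames)
#             .add_stage("transcript", transcribe))
```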

In contrast to existing end-to-end media analysis platforms, Sandglow prioritizes modularity and interpretability—allowing researchers to modify, inspect, or extend individual components. Preliminary benchmarks demonstrate significant time savings in video frame extraction and processing across GPU configurations.

This work contributes to the broader effort of developing adaptable, open frameworks for multimodal AI research. Sandglow’s flexible architecture enables researchers to prototype and evaluate intelligent media-processing pipelines, fostering reproducible, high-performance experimentation in AI-driven perception and understanding.

WOLFSIGHT 

Edward S., Nicole F., Liz Richerson, John Nolan

If you’ve ever watched a movie and understood a plot point based on off-screen noises rather than what the characters are acting out on screen, then your viewing experience relied on multimodal context. When searching multimodal data, users might expect such multimodal context to factor into search results, where each modality provides some amount of information to the search results; that is not always the case. In this work, we aim to implement search capabilities that produce results based on available data modalities without relying exclusively on computationally expensive large multimodal models. We examined the feasibility of extending existing infrastructure for executing text-image based search to support additional data modalities present in video data, with the goal of incorporating context from each modality represented in the data to improve ranking of search results. In both the text-image and video use cases, we employ small models to provide context from each data modality to enable a coarse filter for search results, which are passed to a large multimodal model to combine modalities and rerank results to better reflect relevance to the search query.
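
The two-stage pattern described above, in which cheap per-modality models produce a coarse candidate set and a large multimodal model reranks only that shortlist, can be sketched as follows; the scoring functions and item representation are placeholders, not WOLFSIGHT's implementation.

```python
# Sketch of two-stage multimodal search: cheap per-modality scorers produce a
# coarse shortlist, and an expensive multimodal model reranks only that set.
# Items are assumed to be dicts keyed by available modality; all scorers and
# the reranker are placeholders.

def search(query, items, per_modality_scorers, multimodal_rerank, shortlist_size=100):
    # Stage 1: coarse filter combining cheap scores from each available modality
    # (e.g., text transcript, keyframe embeddings, audio tags).
    def coarse_score(item):
        scores = [scorer(query, item) for modality, scorer in per_modality_scorers.items()
                  if modality in item]
        return max(scores) if scores else 0.0

    shortlist = sorted(items, key=coarse_score, reverse=True)[:shortlist_size]

    # Stage 2: expensive multimodal rerank over the shortlist only.
    return multimodal_rerank(query, shortlist)
```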

Efficient Edge AI for Security

Sambit Bhattacharya, Paul Rodriguez, Abhirup Dasgupta, Johnathankeith Murchison, Miriam Delgado, Anita Amofah, Abdullah Waleed, Waylon Robinson, Felecia M., James S.

We present progress on the Efficient Edge AI for Security project, an initiative for developing edge-deployed artificial intelligence solutions for intelligent crowd behavior analysis and security monitoring. The project addresses critical security challenges by enabling near real-time video analytics directly on resource-constrained edge devices.

The team has successfully developed Sentinel Nexus V2.0, an integrated system architecture featuring edge devices with video capture capabilities, networked connectivity, and a user-friendly interface enabling operators to deploy natural language queries through video language models for efficient data triage. It comprises three components: host client, server middleware, and edge client, which together facilitate device discovery, query transmission, and image retrieval across distributed surveillance networks. The edge subsystem employs a modular three-script architecture where the client listens for commands, the handler interprets requests, and containerized models execute detection tasks.
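
A highly simplified sketch of the edge-client role in this architecture is given below: listen for a query from the middleware, hand it to a handler, and return matching results; the port, message format, and function names are assumptions, not Sentinel Nexus code.

```python
import json
import socket

# Simplified edge-client loop: receive a natural-language query from the
# middleware, delegate to a handler (which would invoke containerized
# detection models), and return matches. Port and message schema are assumed.

def serve_edge_client(handle_query, host="0.0.0.0", port=5050):
    with socket.create_server((host, port)) as server:
        while True:
            conn, _ = server.accept()
            with conn:
                request = json.loads(conn.recv(65536).decode())   # {"query": "..."}
                matches = handle_query(request["query"])          # model inference happens here
                conn.sendall(json.dumps({"matches": matches}).encode())
```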

Initial hardware validation employed edge devices and camera sensors. Performance benchmarking demonstrated significant improvements. Network functionality validation confirmed successful message and image transmission, including testing with open vision-language AI models.

Ongoing and future research focuses on addressing the ephemeral data challenge, developing surveillance-specific adaptations of lightweight vision-language models for edge devices, implementing storage-aware batch processing algorithms that optimize analysis within deletion windows, and creating result aggregation protocols that combine findings from multiple independent edge devices. These advancements will enable analysts to broadcast natural language queries simultaneously across entire surveillance networks, with each device independently processing video archives and returning matching content before deletion cycles eliminate source data.

Dive Deeper

Learn more about the motivation, methods, and outcomes of our 2025 projects.