
PandaJam: Understanding Triage of Speech-to-Text

Patti K., Sue Mi K., Susie B., Jessica E., Pauline M., John S., Jascha S., Brent Y.

PandaJam seeks to understand the workflow of language analysts in order to lay a foundation for building machine teammates for them

Gaining a deeper understanding of analyst workflows is a fundamental step towards implementing human-machine teaming in an operational setting, where analysts and machines work together to achieve outcomes greater than either could achieve independently.

PandaJam is an observational study designed to elucidate the workflow and cognitive processes of language analysts as they seek to retrieve relevant information from audio. In the PandaJam study, analysts use an Elasticsearch-based query tool to explore thousands of hours of audio from the Nixon White House Tapes, along with speech-to-text (STT) transcriptions of that audio. They are given the task of answering four questions about the pandas that China gifted to the U.S. in 1972, and are asked to explain their thought process out loud while their actions in the query tool are recorded.
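In a tool of this kind, a keyword search typically reduces to a full-text query body sent to the search backend. Below is a minimal sketch of what such a request might look like; the field names (`transcript`, `recording_date`) are hypothetical, not the actual schema of the study's tool:

```python
# Sketch of an Elasticsearch-style request body for keyword search over STT.
# Field names ("transcript", "recording_date") are illustrative assumptions.
def build_keyword_query(keywords, date_from=None, date_to=None):
    """Build a query body matching all keywords, optionally date-bounded."""
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"transcript": kw}} for kw in keywords]
            }
        }
    }
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        body["query"]["bool"]["filter"] = [{"range": {"recording_date": date_range}}]
    return body

body = build_keyword_query(["panda"], date_from="1972-02-01", date_to="1972-04-30")
```

The point of the sketch is that everything the analyst wants to find must be expressed through literal keywords, a constraint that becomes central later in this analysis.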

Our 2022 pilot study demonstrated that analysts were limited in their ability to successfully keyword search “noisy” STT, a major barrier to completing the task. This motivated us in 2023 to study, in greater depth, how analysts leverage STT to achieve their sub-tasks, or goals, during the task. (Sub-tasks include the wide range of inquiries that analysts pursued in order to complete the larger task, ranging from efforts to gain a sense of the totality of relevant information available to finding specific information about a particular entity named in the data.) This year, 10 additional analysts with voice language analysis experience participated in our study; all but one had experience using STT in the course of their jobs. Our analysis focused less on whether they could answer the questions “correctly”; instead, we examined the actions they took during the task that directly involved the STT. We developed a taxonomy of these actions and used our qualitative data to associate each type of action with the analysts’ sub-tasks or goals. Finally, we assessed, through both qualitative and quantitative analysis, whether their goals were supported by the actions they took.

“From being presented with a question and having to formulate a query…[to the] actual steps of querying and triaging results and all of that, [it] really mimicked real-world situations.” -PandaJam study participant

Actions, Goals, and (Mis)alignment

A preliminary analysis of 6 of our 10 participants’ data resulted in a taxonomy (Figure 1) of action types where the analyst directly leveraged the STT during the task. We followed Gotz’s approach and defined the actions at an abstract level that can be generalized across a range of tools and tasks where triage of STT occurs.

Query: Using search with parameters that include keyword(s) in STT to bring information into the results page
Explore: Reviewing query results by scanning multiple STT transcripts
Pan: Reviewing query results by scanning multiple STT transcripts around specific keyword(s) included in the original query
Filter: Reviewing query results by scanning multiple STT transcripts around specific keyword(s) not included in the original query
Inspect: Investigating an individual STT transcript
Jump: Investigating an individual STT transcript focused around specific keyword(s)
Bookmark: Saving a specific portion of an individual STT transcript as a reference/time stamp
Annotate: Saving a specific portion of an individual STT transcript as content
Analyze: Creating a word-cloud visualization for multiple STT transcripts in the query results
Examine: Creating a word-cloud visualization for an individual STT transcript
Figure 1: Taxonomy of actions taken by analysts when they directly leveraged the STT in the PandaJam study.
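When coding session logs against a taxonomy like this, it can help to pin the action types down in code. The sketch below is our illustration, not part of the study tooling; the keyword-driven grouping follows directly from the Figure 1 descriptions, where query, pan, filter, and jump all take keyword(s) as a parameter:

```python
from enum import Enum

class Action(Enum):
    """Action types from the Figure 1 taxonomy (illustrative encoding)."""
    QUERY = "query"
    EXPLORE = "explore"
    PAN = "pan"
    FILTER = "filter"
    INSPECT = "inspect"
    JUMP = "jump"
    BOOKMARK = "bookmark"
    ANNOTATE = "annotate"
    ANALYZE = "analyze"
    EXAMINE = "examine"

# Actions whose Figure 1 descriptions involve specific keyword(s).
KEYWORD_DRIVEN = {Action.QUERY, Action.PAN, Action.FILTER, Action.JUMP}
```

Making the keyword-driven subset explicit is useful because, as discussed later, the challenges of keyword use cut across all four of these action types, not just query.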

In our analysis, we also identified the analyst’s goal, or sub-task, each time they took these actions. Each action type may support different goals at different times, and each goal may be supported by different action types. In Figure 2 (below), we outline the analyst goals we encountered in our data, and which actions were used to support them.

Capture a diverse range of STT transcripts to help the analyst get a sense of the totality of information about a topic: Query
Capture a diverse range of STT transcripts in which the analyst can reasonably expect to find specific information about an entity/topic/question: Query
Retrieve a particular STT transcript: Query
Identify individual STT transcripts where different topics are discussed in relation to each other: Pan
Facilitate skimming of relevant sections of STT transcripts: Jump
Refresh memory about the content of an audio record they previously listened to: Inspect
Determine if they have already reviewed or listened to a particular audio record: Jump, Inspect
Determine which section of an audio record to listen to: Jump, Inspect
Review previously evaluated STT transcripts or audio records to facilitate the sensemaking or documentation process: Jump, Inspect
Return to a previous line of inquiry: Jump, Inspect
Determine if a query should be reformulated/validate the soundness of a query: Analyze, Explore
Identify individual STT transcripts that can be excluded from further evaluation: Explore, Pan, Jump
Identify individual STT transcripts that require further evaluation: Explore, Pan, Jump, Inspect, Filter, Analyze
Figure 2: Analyst goals in the PandaJam study that were supported by actions directly leveraging the STT.

The next step in our analysis was to evaluate the degree to which the goals we identified were supported by the actions that leveraged the STT. We were particularly interested in understanding whether

  1. analyst goals would be better supported if the STT were better, or even perfect; or,
  2. the ways analysts were able to interact with STT were fundamentally misaligned with analyst goals, regardless of STT quality.

As expected based on our 2022 work, we observed significant misalignment between the query action and associated goals. One way this misalignment manifested was in analysts’ tendency to struggle with creating what they judged as sound queries based on keywords. According to our preliminary analysis, 36% of queries run by analysts were immediately followed by the analyst running a new query, without examining the initial results.
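A statistic like the 36% figure can be computed from an ordered per-session action log. The sketch below shows one way to do so; the log format (a list of action-type labels) is our assumption, not the study's actual coding pipeline:

```python
def immediate_requery_rate(actions):
    """Fraction of 'query' actions immediately followed by another 'query',
    i.e., the analyst re-queried without examining the initial results.
    `actions` is an ordered list of action-type labels for one session."""
    query_positions = [i for i, a in enumerate(actions) if a == "query"]
    if not query_positions:
        return 0.0
    requeries = sum(
        1 for i in query_positions
        if i + 1 < len(actions) and actions[i + 1] == "query"
    )
    return requeries / len(query_positions)

session = ["query", "query", "explore", "query", "inspect"]
rate = immediate_requery_rate(session)  # 1 of 3 queries immediately re-run
```

A high rate on this measure is one observable signature of the misalignment described above: analysts abandoning queries they judged unsound before even looking at results.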

In this year’s analysis, it became clear that this misalignment was not limited to the query action; keywords were often used by analysts for other actions as well, including pan, filter, and jump. These actions were used to support multiple key goals during the analytic task, as described in Figure 2, and the challenges associated with using keywords for the query action, which are discussed further below, were equally present for each of these actions.

The workflow chronology shows analysts struggled to create keyword queries that supported their goals. They also used keywords for other action types.

Actions that do not rely on keywords but support the same goals as pan, filter, and jump include explore and inspect. In our preliminary analysis, pan, filter, and jump were used more than twice as often as explore and inspect, likely for the simple reason that this was a time-bound task and the latter two actions are much more time-intensive. Because explore and inspect do not rely on keywords, the factors contributing to misalignment are less pronounced; it is easy to imagine that, given perfect STT and unlimited time, these actions could indeed support analyst goals. There seems, therefore, to be a tradeoff between alignment and efficiency in the actions analysts can currently take to support their goals while triaging STT. Ideally, tools would support alignment between analyst actions and goals while also allowing for efficiency, as analysts obviously do not have unlimited time in an operational setting.

Factors Contributing to Misalignment

The qualitative data we gathered on analysts’ keyword searches indicated that the difficulties they encountered would not necessarily be overcome by perfect STT; instead, the analysts’ goals and the search function were fundamentally misaligned. Keyword searches do not allow search terms that account for many factors inherent in spoken conversation, including polysemy, synonymy, ambiguity, and natural language variation among speakers (idiolects). Available tools do not give analysts the same flexibility in shaping query terms that speakers have in shaping conversation; as a result, analysts were often forced to exclude relevant query terms because those terms returned too much information, or to narrow overly broad searches in ways that proved too restrictive to be useful.

Cartoon of a brain saying "Oh wow...we got a lot of stuff here, probably because you could use zoo as not literally a zoo, but saying something was a zoo."
Queries that rely on keyword search do not allow analysts to account for various nuances of spoken language in their queries, including polysemy. Polysemy is the coexistence of many possible meanings for a word or phrase.

As discussed above, these challenges were also present in other actions that analysts took during the task, such as filter. For example, after determining that a very narrow date range was particularly relevant to the task, one participant queried for all records within that date range, then attempted to filter on the word “panda” to find results about that topic within the date range. When the filter action returned no results, she expressed frustration before trying to filter on a semantically related word, “zoo.” When “zoo” also failed to return results, she was forced to pivot to another query.
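The participant's filter sequence amounts to literal string matching over the STT text of an already-retrieved result set. The sketch below (the record format and helper are hypothetical, not the study tool's implementation) also shows the failure mode she hit: when the STT never transcribed the word, the filter returns nothing even if relevant records exist.

```python
def filter_results(results, keyword):
    """Narrow a result set to transcripts whose STT text contains the keyword.
    If noisy STT never transcribed the word, this returns nothing even
    when relevant records exist -- the failure mode described above."""
    kw = keyword.lower()
    return [r for r in results if kw in r["transcript"].lower()]

# Records already narrowed to the relevant date range (illustrative data).
in_range = [
    {"id": 1, "transcript": "discussion of trade policy"},
    {"id": 2, "transcript": "the pandas arrive at the national zoo"},
]
filter_results(in_range, "panda")   # matches record 2 ("pandas" contains "panda")
filter_results(in_range, "bamboo")  # returns [] -- no transcript contains it
```

Nothing in the empty result distinguishes "this topic is absent from the date range" from "the STT misrendered the word," which is why the participant could only pivot to a new query.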

Auxiliary Effects

Despite the misalignment between some actions available to analysts and their goals, there were positive “auxiliary effects” of these actions observed in their workflow. Those who work to augment current capabilities or reimagine implementations of technology should consider how eliminating or enhancing these auxiliary effects might impact analysts and their workflows. As analysts iterated through the actions defined above, we observed that they gradually gained a better overall picture of the data; specifically, they gained a sense of the STT accuracy and developed an understanding of the context in which certain topics were likely to be discussed by the speakers, both of which demonstrably shaped the analysts’ tradecraft and impacted the trajectory of their actions and goals in later steps.

Why This is a Suitable Proxy Task for Language Analysis Work

President Richard Nixon, seated at his desk in the Oval Office, speaks on the phone.
Analysts who participated in the study agreed that the Nixon White House Tapes are a suitable proxy for the types of audio data they typically work with.

The above section includes a veiled call for augmenting current capabilities or reimagining implementations of existing technology, based on this work. It is only fair, then, to question whether the results obtained in this study are indeed applicable to what one might see in actual language analysis work. Feedback from each participant upon completing the task indicates that they are; all participants agreed that the data was a very good proxy for the type of material they work with on the job, with some indicating the STT they encounter on the job is typically better than what they encountered in the study, and others indicating it is typically worse than the STT in the study. As one such participant noted, “If we had good speech-to-text output, I feel like this would be pretty similar to the way that we would roll things.” Given that we were particularly interested in evaluating whether actions that leverage the STT were fundamentally misaligned with analyst goals, the feedback that the task accurately represents how analysts use STT is more significant than feedback on the STT quality.

Some participants expressed mild concerns about the learning curve associated with using an unfamiliar tool in the study; one participant noted they felt they were “fumbling around” and if they had been using a tool they knew, there would have been more “fluidity.” Any lack of “fluidity” on the part of the participants, however, seems to have affected the quantity of actions performed rather than the types of actions themselves; again, analyst feedback indicated the possible actions related to STT were well represented in the study. One participant stated, “I thought that it was a well-structured exercise, and it really captured the workflow, from being presented with a question and having to formulate a query…[to the] actual steps of querying and triaging results and all of that, [it] really mimicked real-world situations.”

Future Work

The above discussion is the result of a preliminary analysis of our qualitative data; we will complete this analysis in 2024, and determine how best to document our complete findings. In addition, the data collected in this study is not only unique, but uniquely rich, and we believe there is potential to tease out additional insights to offer to those who seek to achieve synergy between humans and machines in language analyst workflows, and thus improve analytic outcomes for the intelligence community. We therefore hope to explore other avenues of inquiry using the data already collected and, should the need arise, we have the ability to expand the study to additional language analysts who have expressed interest in participating in this research.