How DARPA Does Big Data

By Nicole Hemsoth

The world lost Philip K. Dick, one of its most profound science fiction authors, in the early eighties, long before the flood of data came down the vast virtual mountain.

It was a sad loss for literature, but it also tore a devastating hole in the hopes of those seeking a modern fortuneteller who could so gracefully grasp the impact of the data-humanity dynamic. Dick foresaw a massively connected society, with all of its challenges, beauties, frights and potential for invasion (or safety, depending on your outlook).

Without dwelling too long in the realm of the fantastic since we’re focusing today on the tangible projects powering the next generation of needs for the wired military and security complex, it’s worth saying now that had he lived long enough, Dick could have seen his world of dreams, data and domination come startlingly to life.

All of the darkness and damnation of the tales aside, the technological messages about the promises of massive, diverse data continue to resonate with eerie accuracy. On the cusp of this real-time data stream reality, the possibility of stitching together new governments, societies, militaries and economies through data and imagination resonates still. Projects underway now at government and military agencies like the Defense Advanced Research Projects Agency (DARPA) are highlighting these possibilities, and keeping the imaginations of those inclined to wonder what is next for society at large keenly tuned in.

DARPA, like other government agencies worldwide, is struggling to keep up with its lava flow of hot military intelligence data. Research and public sector organizations have become experts at finding new ways to create data, so the challenge has been keeping up with it: effectively running fast enough to stay just ahead of the heat, in the hope of understanding its source before the stream hardens and becomes static and useless.

As many are already aware, these challenges were at the heart of the U.S. government’s recent big data drive, where funding was doled out to address barriers to making use of the flood of intelligence, research and military data.

This week we wanted to take a step back and look at how a defense-oriented intelligence and research organization is trying to capture, handle and make the best use of its data flows by highlighting select projects.

Without further delay, let’s begin with the first big intel data project.

Who Needs Precogs When You Have ADAMS?

It’s a sad but relatively commonplace surprise when a soldier or government agent whom others might have thought to be in good mental health suddenly begins making bad decisions, either to the detriment of national security or of those around him. When this happens, the first reaction is often one of awe: “How could something like this happen? How could someone not have known that there was a problem before it got to such a point?”

In other words, in the case of a government that has some of the most sophisticated intelligence-gathering and analysis capabilities, how could anything slip through the cracks?

DARPA is seeking to snag this problem by understanding operative and soldier patterns via network activity and large volumes of data with a $35 million project that has been underway since late 2010.

According to DARPA, the Anomaly Detection at Multiple Scales (ADAMS) program has been designed to “create, adapt and apply technology to anomaly characterization and detection in massive data sets.” The agency says that triggers in the large data would tip them off to possible “insider threats” in which “malevolent (or possibly inadvertent) actions by a trusted individual are detected against a background of everyday network activity.”

DARPA says that the importance of anomaly detection is cemented in the “fact that anomalies in data translate to significant, and often critical actionable information.” They claim that operators in the counter-intelligence community are the target end users for ADAMS insider threat detection technology.

While there are not many details about the actual algorithms or systems used to handle this information, when the project was first announced the agency was seeking an “automated and integrated modeling, correlation, exploitation, prediction, and resource management” system to handle the needs.

Researchers from Georgia Tech are among those who are helping DARPA with its insider threat detection project. Under the leadership of computer scientist Dr. David Bader, the team has been in the midst of a $9 million, 2-year project to create a suite of algorithms that can scan for such anomalies across a diverse pool of data, including email, text messages, file transfers and other forms of data.

To develop new approaches for identifying “insider threats” before an incident occurs, Georgia Tech researchers will have access to massive data sets collected from operational environments where individuals have explicitly agreed to be monitored. The information will include electronically recorded activities, such as computer logins, emails, instant messages and file transfers.

The ADAMS system will be capable of pulling these terabytes of data together and using novel algorithms to quickly analyze the information to discover anomalies.

“We need to bring together high-performance computing, algorithms and systems on an unprecedented scale because we’re collecting a massive amount of information in real time for a long period of time,” explained Bader. “We are further challenged because we are capturing the information at different rates — keystroke information is collected at very rapid rates and other information, such as file transfers, is collected at slower rates.”
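To make the mixed-rate problem Bader describes more concrete, here is a minimal Python sketch. It is not the ADAMS method, whose algorithms are not public in this article. It assumes a simple approach: count events per user and per event type on a common time grid, then flag any stream whose latest count sits far from that stream's own history (a plain z-score). The function names, the one-hour window and the threshold of 3.0 are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def bucket_counts(events, window_seconds):
    """Count events per (user, kind, window index).

    `events` is an iterable of (timestamp_seconds, user, kind) tuples.
    Bucketing puts fast streams (keystrokes) and slow streams (file
    transfers) on the same time grid so each can be judged per window.
    Windows with no events are simply absent in this sketch.
    """
    counts = defaultdict(int)
    for ts, user, kind in events:
        counts[(user, kind, int(ts // window_seconds))] += 1
    return counts


def flag_anomalies(counts, current_window, threshold=3.0, min_history=5):
    """Return sorted (user, kind, count, z) tuples for streams whose count
    in `current_window` deviates from that stream's earlier windows."""
    history = defaultdict(list)
    current = {}
    for (user, kind, window), n in counts.items():
        if window == current_window:
            current[(user, kind)] = n
        elif window < current_window:
            history[(user, kind)].append(n)

    flagged = []
    for key, n in current.items():
        past = history.get(key, [])
        if len(past) < min_history:
            continue  # not enough baseline to judge this stream
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            # Perfectly constant history: any change at all is extreme.
            z = 0.0 if n == mu else float("inf")
        else:
            z = (n - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((key[0], key[1], n, z))
    return sorted(flagged)
```

Because each stream is compared only with its own baseline, a user who always types heavily is not flagged for typing, while a sudden burst of file transfers from the same user would be. A real system would need far richer models than a single z-score; this only illustrates the "anomaly against a background of everyday activity" idea.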

We will hold off for now on speculating about a massive-scale facility (okay, let’s say a nuclear-powered, exascale-type one) that crunches this kind of data on a national scale, and we certainly won’t mention the possibility of being nabbed for a future crime. Needless to say, though, this type of data mining has significant value for just about every business in existence, not to mention every government.

DARPA’s Insight into Data Oversight

There are endless tales from the application and algorithm front to be told, but before we address any more of those, a brief description of the systems that handle the DARPA data deluge is in order.

DARPA’s Insight Program addresses the critical challenges of working with projects like ADAMS, or any other for that matter. After all, without adequate high performance hardware and software systems working in concert across massive, diverse and performance-craving data sets, little is possible. Over the last decade, and especially now in the age of nanomachines and “super-sensors,” governments have become adept at creating data-generating wonders, but there is a hefty analytics cost involved.

The agency’s Insight Program seeks to mitigate those virtual costs of handling the waves of data, especially for the benefit of soldiers in need of real-time sensemaking from it. DARPA is working toward the development of “an adaptable, integrated human-machine Exploitation and Resource Management System (E&RM) System.” They describe this more specifically as a “next generation intelligence, surveillance and reconnaissance (ISR) system that, through the development of semi- and fully automated technologies, can provide real-time or near real-time capabilities in direct support of tactical users on the battlefield.”

This is no small task, of course. For instance, real-time intelligence for soldiers in the field means gathering, meshing and analyzing data from multiple sources of diverse data types (from all those many new, complex sensors, text, image or video, etc.) in addition to melding that data with the info from other systems, including, for example, behavioral discovery and prediction algorithm-derived intelligence.

As Henry Kenyon detailed upon the first word of the Insight Program goals, “The shortcomings of current ISR platforms and systems include a lack of automated tools to interpret, edit and weave data streams into a form useful to human analysts. According to the solicitation, vital information is often lost or overlooked due to the overwhelming flow of incoming data. A lack of integrated human-machine reasoning tools limits the ability of system to use operators’ knowledge and ability to understand complex data.”

If successful, the program will achieve a number of DARPA’s data handling goals, including the ability to replace existing stovepipes with an integrated system that operates across national, theatre and lower-level tactical intelligence systems. This means they would be able to ultimately create a mission and sensor-agnostic system that would work across different theatres of operation and promote greater collaboration between different intelligence and military analyst communities and agencies.

The most recent word is that the program, following a field test earlier this year, showed full functionality of the E&RM System to perform sequence-neutral (i.e., out-of-order) fusion of data from multi-INT sources, as well as graph-based multi-INT fusion.
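The idea of sequence-neutral fusion can be illustrated with a small Python sketch. This is not the E&RM System's design, which the article does not describe. It assumes one simple way to get order independence: keep every report about an entity sorted by the time the event happened rather than the time the report arrived, and derive the fused view from that sorted set. The class name `TrackStore`, the field names and the sample sources are hypothetical.

```python
import bisect
from collections import defaultdict


class TrackStore:
    """Fuses reports about entities so that the result depends only on
    which reports were received, not on the order they arrived in."""

    def __init__(self):
        # entity -> list of (event_time, source, position), kept sorted
        self._reports = defaultdict(list)

    def add(self, entity, event_time, source, position):
        """Insert a report at its event-time position; drop exact duplicates."""
        report = (event_time, source, position)
        reports = self._reports[entity]
        i = bisect.bisect_left(reports, report)
        if i < len(reports) and reports[i] == report:
            return  # same report delivered twice
        reports.insert(i, report)

    def fused(self, entity):
        """Return the fused view of an entity, or None if nothing is known."""
        reports = self._reports.get(entity)
        if not reports:
            return None
        event_time, _, position = reports[-1]  # latest by event time
        return {
            "last_seen": event_time,
            "position": position,
            "sources": sorted({source for _, source, _ in reports}),
            "report_count": len(reports),
        }
```

A late-arriving radar report with an old event time slots in behind a newer video report instead of overwriting it, so the fused position is the same whichever arrives first. Fusing on a graph of related entities, as the field test also did, would need considerably more than this.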

According to DARPA, “The field test also produced a unique, multi-modality, high-fidelity truthed data set which is available to ISR researchers across the Department of Defense and Intelligence Community. This data set, combined with the foundational data set collected in the Fall of 2010, provides an unparalleled 135-terabyte resource to more than 240 users across government, industry and academia.”

It’s All in the Mind’s Eye

One of the more prominent big data projects that has rolled out of DARPA in recent years has been the Mind’s Eye project. Despite its creepy, Big Brother-esque moniker, the project could potentially save the lives of soldiers via the use of a truly smart “camera” system that can use vast amounts of diverse data to describe a landscape or situation without putting human lives at risk.

The technology behind this smart camera hails from machine-based visual intelligence. The program will be able to use these cameras to develop the capability for remote visual intelligence by automating the ability to learn “generally applicable and generative representations of action between objects in a scene directly from visual inputs, and then reason over those learned representations.”

The smart cameras would replace the traditional mode of surveillance, which required dangerous missions that necessitated temporary observation post set-up and constant monitoring of the site. These cameras could take data from visual scenarios and describe in vivid textual detail what they see, what is obstructing their view, and what they are able to reason algorithmically from what is before them. Further, the agency could train these cameras to report only on a select set of activities or flagged actions to minimize the flood of data.

DARPA says this project differs from the other commercial and security applications of machine vision that have found their way to market. Machine vision work to date has made continual progress in recognizing a wide range of objects and their properties, what might be thought of as the nouns in the description of a scene. The agency claims that the focus of Mind’s Eye is to add the perceptual and cognitive underpinnings for recognizing and reasoning about the verbs in those scenes, enabling a more complete narrative of action in the visual experience.

The most recent update on the success of the endeavor claims that in the first 18 months of the program, Mind’s Eye “demonstrated fundamentally new capabilities in visual intelligence, including the ability of automated systems to recognize actions they had never seen, describe observed events using simple text messages, and flag anomalous behaviors.” DARPA is looking ahead to new possibilities, noting that precision and accuracy tweaks, filling temporal gaps (answering “What just happened?” and “What might happen next?”), and answering questions about events in a scene are on their bucket list of improvements.

As one might imagine, one of the biggest hurdles to the program is on the computational horsepower side; the team says they need to lower the computational requirements of visual intelligence to address operational use constraints, such as power requirements for unmanned ground vehicles.

The Coming of XDATA

One of the most highly-publicized examples of DARPA’s focus on big data comes in the form of the XDATA program, which is a broad-based initiative to put the best minds in applied mathematics, computer science and visualization to work to fine-tune and create new tools for managing massive military and intelligence data.

The needs that created the program are simple in theory, and we’ve already touched on the massive sensor problem (many sensors, but without enough mature ways to handle all the diverse data they are feeding). DARPA says that the Department of Defense has been challenged in how it uses, fuses and analyzes all of the data from the many military networks.

The DoD openly said during the launch of the XDATA program that the systems they had in place for processing, handling and analyzing all of the information from multiple intelligence networks were not scaling to fit their needs. Additionally, they noted that the “volume and characteristics of the data and the range of applications for data analysis require a fundamentally new approach to data science, analysis and incorporation into mission planning on timelines consistent with operational tempo.”

DARPA began the XDATA program to develop computational techniques and software tools for processing and analyzing the vast amount of mission-oriented information for Defense activities. As part of this exploration, XDATA aims to address the need for scalable algorithms for processing and visualization of imperfect and incomplete data. And because of the variety of DoD users, XDATA leaders say they anticipate the creation of human-computer interaction tools that could be easily customized for different missions.

Military Reading Machines

Examples of machine reading programs are not unique to large-scale military agencies, but DARPA has high hopes for its program, which takes a different approach. While traditional text processing research has emphasized locating specific text and transforming it into other forms of text (via translation or summaries), the goal has always been to make it readable by humans.

Their program, on the other hand, will involve little to no human interpretation—machines will “learn to read from a few examples and will read to learn what they need in order to answer questions or perform some reasoning task.”

DARPA says that when it comes to operational warfighters, the amount of textual information from reports, email and other communications and the rapid processing required to make it useful is a strain on internal systems. The agency claims that when it comes to the need for this Machine Reading program, “AI offers a promising approach to this problem, however the cost of handcrafting information within the narrow confines of first order logic or other AI formalisms is currently prohibitive for many applications.”

The Machine Reading program seeks to address these challenges by replacing “knowledge engineers” with “self-supervised learning systems” that are able to understand natural text and dump it into the proper AI holding tank for more refined processing and machine reasoning.

The team behind the project has been working on building universal text engines that will capture information from text, then transform it into the formal representations used by artificial intelligence applications. This involves a number of specific elements, including designing a system that can select and annotate text, the creation of model reasoning systems that the reading systems will interact with, and the formulation of question and answer sets with the appropriate protocols to determine progress.

For those interested in the specifics of the algorithms and applications, a detailed paper describing the concepts is available.

On the DARPA Data Horizon

As we have noted in the past, the government as a whole is putting significant emphasis (and money) on the possibilities of big data analytics and systems.

DARPA’s big data portfolio includes a few other noteworthy projects:

The Programming Computation on Encrypted Data (PROCEED)

PROCEED is a research effort that seeks to develop methods that allow computing with encrypted data without first decrypting it, making it more difficult for malware programmers to write viruses.

The Video and Image Retrieval and Analysis Tool (VIRAT)

The VIRAT program aims to develop a system to provide military imagery analysts with the capability to exploit the vast amount of overhead video content being collected. If successful, VIRAT will enable analysts to establish alerts for activities and events of interest as they occur. VIRAT also seeks to develop tools that would enable analysts to rapidly retrieve, with high precision and recall, video content from extremely large video libraries.

Cyber-Insider Threat (CINDER) Program

This effort is looking for new approaches to detect activities consistent with cyber espionage in military computer networks. As a means to expose hidden operations, CINDER will apply various models of adversary missions to “normal” activity on internal networks. CINDER also aims to increase the accuracy, rate and speed with which cyber threats are detected.

The Mission-oriented Resilient Clouds Program

The goal of this effort is to address security challenges inherent in cloud computing by developing technologies to detect, diagnose and respond to attacks, effectively building a “community health system” for the cloud. The program also aims to develop technologies to enable cloud applications and infrastructure to continue functioning while under attack. The loss of individual hosts and tasks within the cloud ensemble would be allowable as long as overall mission effectiveness was preserved.

