David Bader, director of NJIT’s Institute for Data Science, works on computing initiatives that will help people make sense of large, diverse and evolving streams of data from news reports, distributed sensors and lab test equipment, among other sources connected to worldwide networks.
When a patient arrives in an emergency room with high fever, coughing and shivering, the speed of diagnosis and treatment depends on the skills of the medical staff, but also on information. If it’s a rare or newly spreading infectious disease, today’s clinical diagnostic methods may not immediately recognize it. Ready access to a variety of data — disease genomics, geospatial maps of its spread and electronic health records — might even predict it.
“When we track communities, for example during global pandemics or for finding influencers in marketing, using data streams rather than static snapshots enables us to follow relationships that change over time,” says David Bader, director of NJIT’s newly established Institute for Data Science. “We want to see how these relationships change, such as in pandemic spread of disease, by following global transportation, disease spread and social interactions, in near-real time.”
But that level of gathering, processing and assembly requires new computing capabilities: the capacity to search through massive volumes of data from diverse sources, from unstructured texts and social media to passenger lists and patient records, and the ability to perform complex queries at unprecedented speeds. This is particularly important in health care, he notes, where the speed of diagnosis can stop the spread of deadly diseases in their tracks.
“Big data is used to analyze problems related to massive data sets. Today, they are loaded from storage into memory, manipulated and analyzed using high-performance computing (HPC) algorithms, and then returned in a useful format,” he notes. “This end-to-end workflow provides an excellent platform for forensic analysis; there is a critical need, however, for systems that support decision-making with a continuous workflow.”
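The batch, or “forensic,” workflow Bader describes can be pictured with a minimal sketch: load a static snapshot from storage, analyze it in memory, and write the result back out. The file name, column names and the simple aggregate below are hypothetical illustrations, with a basic pandas operation standing in for an HPC algorithm.

```python
# Minimal sketch of the batch ("forensic") workflow: load a snapshot from
# storage, analyze it in memory, return the result in a useful format.
# File name and column names are hypothetical.
import pandas as pd

# 1. Load from storage into memory.
records = pd.read_csv("patient_records_snapshot.csv")

# 2. Manipulate and analyze (a simple aggregate stands in for an HPC algorithm).
cases_by_region = (
    records[records["diagnosis"] == "influenza-like illness"]
    .groupby("region")
    .size()
    .sort_values(ascending=False)
)

# 3. Write the result back out for analysts.
cases_by_region.to_csv("cases_by_region.csv")
print(cases_by_region.head())
```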
Bader has been working with NVIDIA to develop methods that allow people to stream entities, relationships and analytics in continuously updated live feeds through its RAPIDS.ai, an open-source framework for executing end-to-end data science and analytics pipelines entirely on graphics processing units (GPUs). These graph algorithms, which require far more memory access per unit of computation than traditional scientific computing, are used to make sense of large volumes of data from news reports, distributed sensors and lab test equipment, among other sources connected to worldwide networks.
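As a rough illustration of what such a GPU pipeline looks like, the sketch below uses RAPIDS’ cuDF and cuGraph libraries to build a small graph from an edge list and run PageRank as a stand-in “influence” query. It assumes a machine with an NVIDIA GPU and RAPIDS installed, and the toy edge list is invented for the example.

```python
# Sketch of a GPU graph-analytics step with RAPIDS (cuDF + cuGraph).
# Assumes an NVIDIA GPU with RAPIDS installed; the edge list is illustrative.
import cudf
import cugraph

# Edge list representing, say, contacts or interactions between entities.
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 2, 3],
    "dst": [1, 2, 2, 3, 4, 4],
})

# Build the graph entirely on the GPU.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Run an analytic -- PageRank as a stand-in for an "influence" query.
scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).head())
```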
In hardware, HPC systems use custom accelerators that assist with loading and transforming data for particular data science tasks. “For instance, we may only need a few fields from electronic patient records archived in storage. Rather than retrieve the entire library before searching records for the information, we could retrieve only the fields necessary for the query, thus saving significant cost,” he notes, adding that accelerators will move key tasks closer to the hardware and data storage systems.
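A familiar software-level analogue of that field-level retrieval is column pruning in a columnar file format: only the fields a query touches are decoded and loaded. The sketch below shows the idea with Parquet; the file path and field names are hypothetical, and it assumes a Parquet engine such as pyarrow is installed.

```python
# Software analogue of field-level retrieval: read only the columns a query
# needs from a columnar (Parquet) archive instead of loading every field.
# File path and column names are hypothetical; requires a Parquet engine
# such as pyarrow.
import pandas as pd

needed_fields = ["patient_id", "visit_date", "temperature", "diagnosis_code"]

# Only these columns are decoded and moved into memory; the rest stay in storage.
records = pd.read_parquet("ehr_archive.parquet", columns=needed_fields)

feverish = records[records["temperature"] >= 38.0]
print(len(feverish), "records matched the query")
```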
He notes that general-purpose computers are reaching a performance plateau as they hit a ceiling on the number of transistors that can be placed on a chip. While exploiting parallelism with multicore processors provided a path forward for some applications, the target now is higher-performing, specialized chips designed for specific functions, such as Xilinx field-programmable gate arrays for signal processing or Google’s tensor processing unit for machine learning. Cerebras, he adds, is setting records with its 1.2 trillion-transistor Wafer-Scale Engine for accelerating deep learning.
“The hardware-software co-design for analytics is exciting as we enter a new era with the convergence of data science and high-performance computing. As data are created and collected, dynamic graph algorithms make it possible to answer highly specialized and complex relationship queries over the entire data set in near-real time, reducing the latency between data collection and the capability to take action,” Bader says.
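The core idea behind such dynamic graph algorithms can be sketched with a textbook example rather than Bader’s own methods: maintain a data structure incrementally as relationships stream in, so connectivity queries can be answered immediately without reprocessing the full data set. The entity names below are invented for illustration.

```python
# Minimal illustration of the dynamic-graph idea: maintain a structure
# incrementally as new relationships stream in, so queries are answered
# without recomputing over the whole data set. Union-find for connectivity
# is a standard textbook example, not Bader's specific algorithms.

class DynamicConnectivity:
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, a, b):
        """Ingest one new relationship from the stream."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[ra] = rb

    def connected(self, a, b):
        """Answer a relationship query in near-real time."""
        return self._find(a) == self._find(b)

graph = DynamicConnectivity()
for edge in [("patient_17", "clinic_A"), ("clinic_A", "flight_882"), ("flight_882", "patient_45")]:
    graph.add_edge(*edge)

print(graph.connected("patient_17", "patient_45"))  # True: a chain of contacts links them
```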
Such co-designed systems are also meant to be energy-efficient and easy to program, while reducing transaction times by orders of magnitude. The goal is for analysts and data scientists to make queries in their subject domain and receive rapid solutions that execute efficiently, rather than requiring sophisticated programming expertise.
“The variety of data we ingest continues to change and we will see data sources we can’t envision today. We need to create systems that will be able to ingest them,” he says. Where will this all lead? Bader tells the story of one company’s aspiration to turn ranting customers into raving fans by tracking public blogs and social media, matching those posts to its customer records and preemptively responding with fixes and repairs before customers had contacted its support team. A decade ago, Dell reported a 98% resolution rate and a 34% conversion rate turning online “ranters” into “ravers” using social listening.
“We are developing predictive analytics — the use of data to anticipate the future,” Bader says. “Instead of understanding what has happened, we wish to predict what will happen.”
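As a toy illustration of that shift from description to prediction, the sketch below fits a simple trend to a handful of made-up past observations and projects it forward; real predictive analytics involves far richer models and data.

```python
# Toy illustration of predictive analytics: fit a trend to past observations
# and project it forward. The daily case counts are made-up numbers.
import numpy as np

days = np.arange(10)
cases = np.array([3, 4, 6, 9, 13, 18, 26, 35, 50, 68])

# Fit a simple exponential trend (linear in log space) to what has happened...
slope, intercept = np.polyfit(days, np.log(cases), 1)

# ...and predict what will happen a few days ahead.
future_days = np.arange(10, 14)
predicted = np.exp(intercept + slope * future_days)
print(np.round(predicted).astype(int))
```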