NJBDA Talk: Large-Scale Graph Analytics in Arkouda


Date
Apr 30, 2021 11:00 AM — 11:30 AM
Location
VIRTUAL

Oliver Alvarado Rodriguez (presenter), Zhihui Du and David Bader; New Jersey Institute of Technology

Exploratory graph analytics is a much sought-after approach to extracting useful information from graphs. One of its main challenges arises when a graph grows beyond the memory capacity of a typical computer. Solutions are therefore needed that let data scientists handle and analyze large graphs efficiently and quickly on machines with the capacity for massive files. Arkouda is a software package in early development, created to bridge the gap between massively parallel computation and data scientists who want to perform exploratory data analysis (EDA). Communication between its Chapel back-end and Python front-end gives data scientists an easy-to-use interface: they need no knowledge of the underlying Chapel code and can carry out their large-file and graph EDA entirely from Python. In this work, a graph data structure is designed and implemented in the Arkouda framework at both the Chapel back-end and the Python front-end. Its main attractions are a smaller memory footprint and efficient search over adjacent edges. A parallel breadth-first search (BFS) algorithm is also presented to demonstrate how easily parallel algorithms can be implemented in Arkouda to increase EDA productivity with graphs. Lastly, real-world graphs from different domains, such as biology and social networks, are used to evaluate the efficiency of the graph data structure and the BFS algorithm. The benchmarking results show that the Arkouda overhead is almost negligible and that data scientists can use Arkouda for large-scale graph analytics. This work can help further bridge the gap between high-performance computing (HPC) software and data science, creating a framework that is straightforward for all data scientists to use.
All of the code in this project and in Arkouda is open source and can be found on GitHub. This is joint work with Mike Merrill and William Reus. We acknowledge the support of National Science Foundation grant award CCF-2109988.
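
To make the two ideas in the abstract concrete, here is a minimal single-machine sketch in plain Python. It is not the Arkouda API or the authors' implementation: the sorted edge-array layout, the binary-search neighbor lookup, and the function names `build_edge_arrays`, `neighbors`, and `bfs_levels` are all assumptions chosen for illustration. It shows how storing edges as two flat arrays sorted by source vertex keeps memory use compact while still allowing fast lookup of a vertex's adjacent edges, and how a level-by-level BFS can be written on top of that lookup.

```python
from bisect import bisect_left, bisect_right


def build_edge_arrays(edges):
    """Sort directed edges by source so each vertex's out-edges are contiguous.

    Hypothetical layout for illustration: two flat lists, src and dst.
    """
    ordered = sorted(edges)
    src = [u for u, _ in ordered]
    dst = [v for _, v in ordered]
    return src, dst


def neighbors(src, dst, u):
    """Find the out-neighbors of u with two binary searches over src."""
    lo = bisect_left(src, u)
    hi = bisect_right(src, u)
    return dst[lo:hi]


def bfs_levels(src, dst, num_vertices, root):
    """Level-synchronous BFS; returns each vertex's depth, or -1 if unreachable."""
    depth = [-1] * num_vertices
    depth[root] = 0
    frontier = [root]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # In a parallel setting, this loop over the current frontier is the
        # part that would be distributed; here it runs sequentially.
        for u in frontier:
            for v in neighbors(src, dst, u):
                if depth[v] == -1:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


# Small example graph with six vertices; vertex 5 has no edges.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
src, dst = build_edge_arrays(edges)
depth = bfs_levels(src, dst, 6, 0)
print(depth)
```

The frontier-at-a-time structure is what makes BFS a natural fit for parallel execution: all vertices at one level can be expanded independently before the next level begins.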

David A. Bader
Distinguished Professor and Director of the Institute for Data Science

David A. Bader is a Distinguished Professor in the Department of Computer Science at New Jersey Institute of Technology.