Parallel Algorithms for Personalized Communication and Sorting with an Experimental Study (Extended Abstract)


A fundamental challenge for parallel computing is to obtain high-level, architecture-independent algorithms that execute efficiently on general-purpose parallel machines. With the emergence of message-passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of their communication primitives. While existing primitives provide an assortment of collective communication routines, they do not handle the important case in which most or all processors have non-uniformly sized personalized messages to exchange with each other. We first present an algorithm for the h-relation personalized communication problem whose efficient implementation will allow high-performance implementations of a large class of algorithms. We then consider how to use these communication primitives effectively to address the problem of sorting. Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication on the one hand and multiple rounds of all-to-all personalized communication on the other. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Another variation, which uses regular sampling to choose the splitters, has similar performance with deterministically guaranteed bounds on the memory and communication requirements. Both of these variations efficiently handle the presence of duplicates without the overhead of tagging each element. The personalized communication and sorting algorithms presented in this paper have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-2, Cray Research T3D, Meiko Scientific CS-2, and the Intel Paragon.
Our experimental results are consistent with the theoretical analyses and illustrate the scalability and efficiency of our algorithms across different platforms. In fact, they seem to outperform all similar algorithms known to the authors on these platforms, and their performance is invariant over the set of input distributions, unlike that of previous efficient algorithms. Our sorting results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.
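
For readers unfamiliar with the terms used above, the following is a minimal sequential sketch of a generic sample sort that chooses its splitters by regular sampling. It is not the algorithm of this paper: the p processors are simulated as Python lists, the exchange step is an in-memory matrix of messages, and the function names regular_sample_sort and h_relation are hypothetical. In particular, it does not reproduce the two-round regular communication scheme or the duplicate handling described in the abstract. It only illustrates how splitters partition the data and how the size h of the resulting personalized communication (the largest number of elements any processor sends or receives) can be measured.

```python
import bisect

def regular_sample_sort(blocks):
    """Sort data spread over p simulated processors.

    Returns (buckets, send): buckets[j] is the sorted output held by
    processor j, and send[i][j] is the message processor i routes to j.
    """
    p = len(blocks)
    # Step 1: every processor sorts its own block locally.
    local = [sorted(b) for b in blocks]
    # Step 2: every processor takes up to p evenly spaced samples.
    samples = []
    for b in local:
        step = max(1, len(b) // p)
        samples.extend(b[::step][:p])
    samples.sort()
    # Step 3: pick p - 1 splitters at regular intervals in the sample.
    splitters = []
    if samples:
        splitters = [samples[(i * len(samples)) // p] for i in range(1, p)]
    # Step 4: all-to-all personalized exchange with non-uniform message sizes.
    send = [[[] for _ in range(p)] for _ in range(p)]
    for i, b in enumerate(local):
        for x in b:
            j = bisect.bisect_right(splitters, x)  # destination processor
            send[i][j].append(x)
    # Step 5: every destination combines and sorts what it received.
    buckets = [sorted(x for i in range(p) for x in send[i][j])
               for j in range(p)]
    return buckets, send

def h_relation(send):
    """Largest number of elements any one processor sends or receives."""
    p = len(send)
    sent = [sum(len(send[i][j]) for j in range(p)) for i in range(p)]
    received = [sum(len(send[i][j]) for i in range(p)) for j in range(p)]
    return max(sent + received)
```

Concatenating the returned buckets in processor order gives the globally sorted sequence, and the spread of the bucket sizes shows how well the splitters balance the load. In this sketch all copies of a repeated key land in the same bucket, so heavily duplicated inputs can unbalance it.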

Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '96, Padua, Italy, June 24-26, 1996