Date of Original Version



Conference Proceeding

Abstract or Description

The authors describe an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing computer technology. The algorithm used to implement the scans is based on an algorithm for parallel computers. A set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. The authors describe a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 to 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks