|
Organizers |
Multigrain parallel algorithms for eigenvalue calculations
by
Andreas Stathopoulos
College of William and Mary
Coauthors: James R. McCombs
The numerical solution of large, sparse, eigenvalue problems
is central to many scientific and engineering applications.
Krylov iterative methods, which do not modify the matrix, often provide
the only means for solving these problems.
Parallel computing is a critical component in solving large scale applications.
Yet, the performance of existing methods falls short of the
capabilities of today's hardware platforms for scientific computing,
especially massively parallel and extended memory hierarchy computers.
Traditionally, iterative methods have been implemented in a fine grain way
on distributed memory architectures, such as massively parallel processors
(MPPs) and clusters of workstations (COWs).
However, in case of large number of processors in MPPs or high
overhead interconnection networks in COWs, the synchronization costs
are excessive, limiting the scalability of the application.
Block (or multivector - the term multivector is often used
to differentiate from methods that block-decompose the matrix -)
iterative methods increase granularity by having each processor apply
the same operations on a block of vectors.
As a side effect, there are also more CPU operations per memory access,
resulting in better single processor performance.
However, the total number of operations increases with the block size,
so an architecture-dependent, optimal block size must be identified.
Recently, multithreading ideas have also led research in the opposite direction.
By putting more than one process on each processor, communication latencies
and load imbalances can be better hidden.
The major disadvantage of this approach is that local preconditioners based on
domain decomposition become much weaker as fine granularity increases.
The key observation in this research is that large memory sizes
often enable small subgroups of processors of MPPs or COWs to store the
whole matrix.
Also, in many matrix-free applications, such as from materials science,
a matrix-vector product function is available on each processor.
Each of these subgroups can apply the preconditioning step
independently on a different block vector, thus mixing fine and coarse grain.
Moreover, the preconditioning step can involve several steps of some
accelerator, thus, in theory, enabling arbitrary parallel speedups.
As an example, our method using block size of 4 would use each of the quadrants
of a 256 processor COW almost independently to solve a different
preconditioning equation.
Thus the effective parallelization speedup would be that of a 64 processor COW.
Because of the block size of 4, the total number of operations would increase,
but a domain decomposition preconditioner would work with four times larger
domains, counterbalancing some of these effects.
We have developed such a multigrain implementation of the popular
Jacobi-Davidson eigensolver.
Exploiting the advantages of this parallelism required the fine tuning
of several issues:
minimizing the communication required for alternating between different
granularities;
choosing the block size such as the independent section of the algorithm
dominates the computation but it does not increase the number of matrix
vector multiplications considerably;
finding the appropriate preconditioning for the correction equation;
implementing these efficiently in MPI.
Our experiments on small, high latency environments, such as COWS connected through Ethernet, have demonstrated excellent parallel speedups, with only minimal increase in computational work. Similarly, experiments on more tightly coupled multicomputers have shown that execution time can be reduced by using larger number of processors.
Date received: February 20, 2001
Copyright © 2001 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Mathematical Conference Abstracts. Document # cagm-22.