I'm not sure whether this is a bottleneck in applications, but even so, is it interesting to ask which parts of this are GPU-friendly? That is, is there a (sparse) matrix representation that is used on GPUs? And does it make sense to carry out the DAG/tree construction as a sort of "prep" step (on CPU or GPU)?
The initial plain/dense algorithm looks pretty straightforward, but I'm not sure about the tree construction.
Almost no part of the algorithm as specified in the blog post is GPU-friendly, but it can be improved.
The "symbolic" phase is very hard to port to the GPU in the first place and almost always happens on the CPU side. But even then it's hard (but not impossible) even to parallelize it. This is generally OK because it is usually cheaper than the numeric phase anyways. But it can be trouble on workstations where the CPU exists primarily as a device for shoveling data into many GPUs, you don't want to stall their kernel pipelines with your symbolic analysis.
For the numeric phase you need to introduce the concept of either a "front" or a "supernode." This is a technique where multiple columns get batched together, and you get a new elimination tree in terms of those batches of columns rather than individual columns. This turns a lot of irregular memory access (gather/scatter-style updates) into densely addressed memory patterns, often just calls into level-3 BLAS or LAPACK.
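A hedged sketch of why the batching pays off (the names and shapes are mine, not any solver's API): once a supernode's sub-diagonal columns are gathered into one dense panel, the dominant trailing-matrix update collapses into a single dense matrix product, exactly the shape of work that level-3 BLAS and GPUs handle well.

```python
# Illustrative only: one supernodal Schur-complement update becomes a
# single dense product instead of many irregular scatter updates.
import numpy as np

def supernode_update(L_panel: np.ndarray, trailing: np.ndarray) -> None:
    """L_panel: dense (m, w) sub-diagonal block of the factor for one
    supernode of w batched columns. trailing: dense (m, m) trailing
    submatrix, updated in place."""
    # One GEMM/SYRK-shaped call does the work that column-by-column
    # elimination would do with w rounds of gather/scatter.
    trailing -= L_panel @ L_panel.T
```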
There are some sparsity patterns for which the above supernode/multifrontal approach does not work very well, but for many practical cases like PDE simulations it does. The sparsity patterns that defeat it tend to spread nonzeros all over the matrix rather than containing them within a specific band. In practice this means that when you materialize a supernode's dense matrix, most of its numeric entries are just explicit zeros, so you create a lot of extra work that serves no productive purpose.
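A toy version of the resulting trade-off (the function, its name, and the 0.5 threshold are illustrative assumptions, not taken from any real solver):

```python
# Toy heuristic for the failure mode above: if the dense block you'd
# allocate for a candidate supernode is mostly explicit zeros, the
# batching creates more flops than it saves.
def worth_merging(nnz_in_block: int, rows: int, cols: int,
                  min_density: float = 0.5) -> bool:
    return nnz_in_block >= min_density * rows * cols
```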
> I'm not sure whether this is a bottleneck in applications
In the barrier method for solving linear programs, numeric factorization normally takes about 1/3 of the main loop's time (i.e., after all symbolic/presolve steps).
> That is, is there a (sparse) matrix representation that is used on GPUs?
We (COPT) did manage to write a GPU version of Cholesky factorization, and it's faster than our (pretty fast) CPU implementation, although one can argue that the CPU used for benchmarking is not the fastest available, or that it is not as expensive as the GPU.
Currently our barrier implementation is 67% faster on an H200 than on an i7-11700K [1]. Not all of the improvement comes from the Cholesky factorization.
Another GPU Cholesky factorization implementation we are aware of is cuDSS by NVIDIA.
> And does it make sense to carry out the DAG/tree construction as a sort of "prep" step (on CPU or GPU)?
For both the CPU and GPU versions, a symbolic phase is necessary, as we usually expect many factorizations on matrices with the same sparsity pattern. cuDSS has (at least some part of) its symbolic phase on the GPU.
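For a concrete picture of that analyze-once/factor-many workflow, here is a sketch using scikit-sparse's CHOLMOD wrapper, one CPU library that exposes the split explicitly (the toy matrix and loop are mine; COPT's and cuDSS's internals may differ):

```python
# Sketch: pay for the symbolic phase once per sparsity pattern, then
# refactor cheaply as the numeric values change (e.g., per barrier
# iteration). Requires scikit-sparse (pip install scikit-sparse).
import numpy as np
import scipy.sparse as sp
from sksparse.cholmod import analyze

# Toy SPD matrix; in a barrier method only the values change between
# iterations, never the sparsity pattern.
pattern = sp.csc_matrix(np.array([[4.0, 1.0, 0.0],
                                  [1.0, 3.0, 1.0],
                                  [0.0, 1.0, 2.0]]))

symbolic = analyze(pattern)            # ordering + etree: done once
for scale in (1.0, 2.0, 3.0):          # stand-in for barrier iterations
    A_k = (pattern + scale * sp.eye(3)).tocsc()
    factor = symbolic.cholesky(A_k)    # numeric phase only
    x = factor(np.ones(3))             # solve A_k @ x = [1, 1, 1]
```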
[1] https://plato.asu.edu/ftp/lpfeas.html