本页对Eigen提供的密集矩阵分解方法的速度进行了分析for a wide range of square matrices and overconstrained problems.
对于 the features and numerical robustness of 线性求解器和分解, 检查这个 table .
该基准测试在配备有Intel Core i7 @ 2,6 GHz的笔记本电脑上运行,并使用clang进行了编译,并启用了AVX和FMA指令集,但没有多线程。它使用单精度浮点数。对于double,您可以通过将时序乘以2来获得良好的估算。
The square matrices 是对称的, and for the overconstrained 矩阵, the reported timmings 包含了 the cost to compute the symmetric covariance matrix for the first four solvers based on Cholesky and LU, as denoted by the * symbol (top-right corner part of the table). Timings are in milliseconds, and factors are relative to the LLT decomposition which is the fastest but also the least general and robust.
solver/size | 8x8 | 100x100 | 1000x1000 | 4000x4000 | 10000x8 | 10000x100 | 10000x1000 | 10000x4000 |
---|---|---|---|---|---|---|---|---|
LLT | 0.05 | 0.42 | 5.83 | 374.55 | 6.79 * | 30.15 * | 236.34 * | 3847.17 * |
LDLT | 0.07 (x1.3) | 0.65 (x1.5) | 26.86 (x4.6) | 2361.18 (x6.3) | 6.81 (x1) * | 31.91 (x1.1) * | 252.61 (x1.1) * | 5807.66 (x1.5) * |
PartialPivLU | 0.08 (x1.5) | 0.69 (x1.6) | 15.63 (x2.7) | 709.32 (x1.9) | 6.81 (x1) * | 31.32 (x1) * | 241.68 (x1) * | 4270.48 (x1.1) * |
FullPivLU | 0.1 (x1.9) | 4.48 (x10.6) | 281.33 (x48.2) | - | 6.83 (x1) * | 32.67 (x1.1) * | 498.25 (x2.1) * | - |
HouseholderQR | 0.19 (x3.5) | 2.18 (x5.2) | 23.42 (x4) | 1337.52 (x3.6) | 34.26 (x5) | 129.01 (x4.3) | 377.37 (x1.6) | 4839.1 (x1.3) |
ColPivHouseholderQR | 0.23 (x4.3) | 2.23 (x5.3) | 103.34 (x17.7) | 9987.16 (x26.7) | 36.05 (x5.3) | 163.18 (x5.4) | 2354.08 (x10) | 37860.5 (x9.8) |
CompleteOrthogonalDecomposition | 0.23 (x4.3) | 2.22 (x5.2) | 99.44 (x17.1) | 10555.3 (x28.2) | 35.75 (x5.3) | 169.39 (x5.6) | 2150.56 (x9.1) | 36981.8 (x9.6) |
FullPivHouseholderQR | 0.23 (x4.3) | 4.64 (x11) | 289.1 (x49.6) | - | 69.38 (x10.2) | 446.73 (x14.8) | 4852.12 (x20.5) | - |
JacobiSVD | 1.01 (x18.6) | 71.43 (x168.4) | - | - | 113.81 (x16.7) | 1179.66 (x39.1) | - | - |
BDCSVD | 1.07 (x19.7) | 21.83 (x51.5) | 331.77 (x56.9) | 18587.9 (x49.6) | 110.53 (x16.3) | 397.67 (x13.2) | 2975 (x12.6) | 48593.2 (x12.6) |
*: 分解不支持直接对 least-square solving for over-constrained problems, and the reported timing include the cost to form the symmetric covariance matrix .
观察结果:
- LLT始终是最快的解决方案。
- For largely over-constrained problems, Cholesky / LU分解的成本主要由对称协方差矩阵的计算决定。
- For large problem sizes, only the decomposition implementing a cache-friendly blocking strategy scale well. Those include LLT, PartialPivLU, HouseholderQR, and BDCSVD. This explain why for a 4k x 4k matrix, HouseholderQR is faster than LDLT. In the future, LDLT and ColPivHouseholderQR will also implement blocking strategies.
- CompleteOrthogonalDecomposition基于ColPivHouseholderQR,因此可以达到相同的性能水平。
The above table has been generated by the bench/dense_solvers.cpp file, feel-free to hack it to generate a table matching your hardware, compiler, and favorite problem sizes.