Tensor permutation in-place on linearized data #2412
+757
−0
This PR implements LibMatrixReorg.transposeInPlaceTensor() (src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixReorg.java), which performs in-place permutations of tensors linearized within a MatrixBlock. The implementation is inspired by the EITHOT algorithm (Efficient In-Place Transposition for High-Order Tensors) and focuses on minimizing memory overhead by avoiding full-copy allocations.
Implementation Details:
The implementation handles permutations by identifying and applying fundamental primitive patterns:
Primitive 21 (Tensor -> Matrix): Applies when the permutation admits a split index such that both sub-permutations keep their internal order. The dimensions in each sub-permutation are multiplied together to form the new row and column dimensions, and the resulting matrix is transposed with the existing, optimized LibMatrixReorg.transposeInPlaceDenseBrenner() method. Tensors whose permutation can be reduced to a matrix transposition therefore reuse that optimized path.
Primitive 1324 (General Permutation): For more complex high-order permutations, the algorithm resolves the transposition by swapping neighboring data blocks while keeping the remaining dimensions fixed, and applies a cycle-following algorithm to move the blocks.
Cycle Tracking: To keep the implementation general and simple, a single cycle-tracking strategy is used for all cases. The EITHOT paper suggests Catanzaro's algorithm for certain index calculations because it is more efficient in specific cases; this implementation instead uses a simplified index conversion that works for any permutation.
Parallelization Potential: The block-based structure lends itself to future GPU acceleration. EITHOT's parallelization strategies are described in the paper linked below.
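As an illustration of the Primitive 21 reduction and of cycle following, here is a minimal sketch rather than the code in this PR. It assumes a row-major linearization and reads "split index" as the permutation being a rotation [k, ..., n-1, 0, ..., k-1]; the class and method names (PermutationSketch, findSplit, product, transposeInPlace) are hypothetical. The transpose is a plain cycle-following stand-in with a BitSet of visited cells (one bit per element), not transposeInPlaceDenseBrenner().

```java
import java.util.BitSet;

final class PermutationSketch {
    // Returns k if perm == [k, ..., n-1, 0, ..., k-1] with 0 < k < n, else -1.
    // Assumption: this is what "both sub-permutations keep their internal order" means.
    static int findSplit(int[] perm) {
        int n = perm.length;
        if (n < 2) return -1;
        int k = perm[0];
        if (k <= 0 || k >= n) return -1;
        for (int a = 0; a < n; a++)
            if (perm[a] != (k + a) % n) return -1;
        return k;
    }

    // Product of dims[from .. to-1].
    static int product(int[] dims, int from, int to) {
        int p = 1;
        for (int a = from; a < to; a++) p *= dims[a];
        return p;
    }

    // In-place transpose of a row-major rows x cols matrix by following cycles.
    static void transposeInPlace(double[] a, int rows, int cols) {
        int n = rows * cols;
        BitSet done = new BitSet(n);
        for (int start = 0; start < n; start++) {
            if (done.get(start)) continue;
            int cur = start;
            double carried = a[start];
            do {
                // Element at (r, c) moves to (c, r) in the cols x rows result.
                int dest = (cur % cols) * rows + (cur / cols);
                double tmp = a[dest];
                a[dest] = carried;
                carried = tmp;
                done.set(dest);
                cur = dest;
            } while (cur != start);
        }
    }

    public static void main(String[] args) {
        int[] dims = {2, 3, 4};
        int[] perm = {1, 2, 0};            // result shape: 3 x 4 x 2
        double[] data = new double[product(dims, 0, dims.length)];
        for (int i = 0; i < data.length; i++) data[i] = i;
        int k = findSplit(perm);
        if (k > 0) {
            // Rows come from the leading dims, columns from the trailing dims.
            transposeInPlace(data, product(dims, 0, k), product(dims, k, dims.length));
        }
        System.out.println(java.util.Arrays.toString(data));
    }
}
```

With dims {2, 3, 4} and perm {1, 2, 0}, the tensor is treated as a 2 x 12 matrix, and transposing it yields the 3 x 4 x 2 layout without a second data array.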
Testing Framework:
Test Location: src/test/java/org/apache/sysds/test/component/matrix/libMatrixReorg/TransposeInPlaceBrennerTest.java.
Arbitrary Permutations: Verified the algorithm against a wide range of high-dimensional tensor shapes and arbitrary permutation vectors.
Memory Constraints in Validation: While transposeInPlaceTensor() itself is memory-efficient (it buffers only block-sized chunks), the test suite uses an out-of-place reference implementation to verify correctness. Consequently, tests on extremely large tensors may fail with java.lang.OutOfMemoryError: Java heap space because of the memory required for the reference copy.
Scope: Validation currently focuses on verifying permutation logic accuracy across high-order dimensions and varying shapes within standard heap limits.
Validation Logic: A helper method, compareTensorValues(), was added to src/test/java/org/apache/sysds/test/TestUtils.java to compare the result and the reference cell by cell.
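The validation approach described above can be sketched as follows. This is an illustrative stand-in, not the test code in this PR: the names ReferencePermuteSketch, permuteOutOfPlace, and sameCells are hypothetical, the data is assumed to be row-major, and output axis a is assumed to take input axis perm[a] (the PR's convention may differ). Note that the reference allocates a full copy of the tensor, which is the source of the heap pressure mentioned above.

```java
final class ReferencePermuteSketch {
    // Out-of-place permutation of a row-major tensor; allocates a full-size output.
    static double[] permuteOutOfPlace(double[] in, int[] dims, int[] perm) {
        int n = dims.length;
        // Row-major strides of the input tensor.
        int[] inStride = new int[n];
        int s = 1;
        for (int a = n - 1; a >= 0; a--) {
            inStride[a] = s;
            s *= dims[a];
        }
        int[] outDims = new int[n];
        for (int a = 0; a < n; a++) outDims[a] = dims[perm[a]];

        double[] out = new double[in.length];
        int[] idx = new int[n];            // multi-index into the output
        for (int o = 0; o < out.length; o++) {
            int src = 0;
            for (int a = 0; a < n; a++) src += idx[a] * inStride[perm[a]];
            out[o] = in[src];
            // Advance the output multi-index, last axis fastest.
            for (int a = n - 1; a >= 0; a--) {
                if (++idx[a] < outDims[a]) break;
                idx[a] = 0;
            }
        }
        return out;
    }

    // Cell-by-cell comparison with an absolute tolerance.
    static boolean sameCells(double[] x, double[] y, double eps) {
        if (x.length != y.length) return false;
        for (int i = 0; i < x.length; i++)
            if (Math.abs(x[i] - y[i]) > eps) return false;
        return true;
    }

    public static void main(String[] args) {
        double[] in = {1, 2, 3, 4, 5, 6};  // 2 x 3 matrix
        double[] ref = permuteOutOfPlace(in, new int[]{2, 3}, new int[]{1, 0});
        System.out.println(sameCells(ref, new double[]{1, 4, 2, 5, 3, 6}, 0.0));
    }
}
```

A test would run the in-place routine on one copy of the data, run the reference on another, and compare the two arrays with the cell-wise check.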
Link to the paper: https://www.semanticscholar.org/paper/EITHOT%3A-Efficient-In-place-Transposition-of-High-on-Wu-Tu/cf4c177a64e1e271ccf1b742b5cc2efdb77fda9b