Skip to content

refactor: Path compression for DiscrTree#2577

Closed
nomeata wants to merge 11 commits intoleanprover:masterfrom
nomeata:joachim/compress-discrtree
Closed

refactor: Path compression for DiscrTree#2577
nomeata wants to merge 11 commits intoleanprover:masterfrom
nomeata:joachim/compress-discrtree

Conversation

@nomeata
Copy link
Copy Markdown
Collaborator

@nomeata nomeata commented Sep 24, 2023

Trie data structures tend to be very wasteful if one doesn’t apply path compression: If you have a sequence of nodes of degree 1, better put them in a flat array, instead of nested .nodes with singleton arrays.

If applied to the DiscrTree, this would reduce the size of the DiscrTree used by mathlib’s library_search from 217 MB to 41 MB, and the maximum depth from 2596 to 27. The maximum depths was causing stack overflow in mathlib PRs.

@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Sep 24, 2023
ghost pushed a commit to leanprover-community/mathlib4 that referenced this pull request Sep 24, 2023
@ghost ghost added the breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan label Sep 24, 2023
@ghost
Copy link
Copy Markdown

ghost commented Sep 24, 2023

nomeata added a commit to leanprover-community/mathlib4 that referenced this pull request Sep 24, 2023
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Sep 25, 2023

Could someone !bench this?

@Kha
Copy link
Copy Markdown
Member

Kha commented Sep 25, 2023

!bench

@leanprover-bot
Copy link
Copy Markdown
Collaborator

Here are the benchmark results for commit 83e2599.
There were no significant changes against commit 2ac782c.

@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Sep 25, 2023

There were no significant changes against commit 2ac782c.

That’s a bit disappointing. I’ll make mathlib compile, maybe there is something there.

nomeata added a commit to nomeata/aesop that referenced this pull request Sep 25, 2023
ghost pushed a commit to leanprover-community/mathlib4 that referenced this pull request Sep 25, 2023
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Sep 25, 2023

From 237MB to:

$ ls -sh ./build/lib/MathlibExtras/LibrarySearch.extra
58M ./build/lib/MathlibExtras/LibrarySearch.extra

That's at least pretty nice.

@ghost ghost added builds-mathlib CI has verified that Mathlib builds against this PR and removed breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan labels Sep 25, 2023
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Sep 25, 2023

@nomeata nomeata marked this pull request as ready for review September 25, 2023 21:09
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Sep 25, 2023

awaiting-review

Guess we have the data to weigh this idea. The code certainly gets more complex, but the size gains at least for library_search (didn't check the size of other in instances used) are significant, and a 3.5% mathlib compile time improvement is at least nice.

Jovan Gerbscheid pointed out that due to the well-nested structure of key sequences, values are only ever found at the leaves. One could therefore remove the .empty constructor and the recursive argument to .values, and make insertion panic when the key sequence is invalid, for a bit code reduction. OTOH, the current code is a bit more robust, for example when people mess with the key sequences as in leanprover-community/mathlib4#7345, so maybe better like this for now.

@nomeata nomeata force-pushed the joachim/compress-discrtree branch from e1f47a4 to 3a9a19f Compare October 14, 2023 13:03
ghost pushed a commit to leanprover-community/mathlib4 that referenced this pull request Oct 14, 2023
ghost pushed a commit to leanprover-community/mathlib4 that referenced this pull request Oct 16, 2023
@ghost ghost added breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan and removed builds-mathlib CI has verified that Mathlib builds against this PR labels Oct 16, 2023
ghost pushed a commit to leanprover-community/mathlib4 that referenced this pull request Oct 17, 2023
@ghost ghost added builds-mathlib CI has verified that Mathlib builds against this PR and removed breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan labels Oct 18, 2023
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Oct 18, 2023

I updated the branches, and created a new comparison report:
http://speed.lean-fro.org/mathlib4/compare/63a60bc6-4997-4dbe-a363-9ceff22def1c/to/1c8d5e96-3fad-4da2-b00c-4b1a63cdc75b

I’m not sure how to best get a good signal from the report.
This time build wall-clock has improved by 0.9% (probably noise). Instruction counts go down through the bank, but less than 1%. All not very exiciting, not sure if this is worth pursuing, it seems only the size improvement to the library_search and rw? cache still look good.

@kim-em
Copy link
Copy Markdown
Collaborator

kim-em commented Oct 19, 2023

Oh, I think you're underselling. A 1% improvement on Mathlib is still great news, and the size improvements for the caches are a significant QOL win, because they are usually the last to finish when running lake exe cache get.

@kim-em kim-em added awaiting-author Waiting for PR author to address issues and removed awaiting-review Waiting for someone to review the PR labels Nov 2, 2023
@kim-em
Copy link
Copy Markdown
Collaborator

kim-em commented Nov 2, 2023

@nomeata, I'm marking this back to awaiting-author now that there are conflicts.

@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Nov 2, 2023

Absolutely! Although I think I’ll see leanprover-community/batteries#285 (which I also have to update now) through first, so that refactorings like the present only affect one downstream repo, and not three.

@Kha Kha requested review from Kha and leodemoura as code owners November 20, 2023 08:15
@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Dec 2, 2023

I’ll close this. Others are working on DiscrTree more actively these days, have a better sense if this is useful, and if it is, can probably cherry-pick the ideas easily.

@nomeata nomeata closed this Dec 2, 2023
@JovanGerb
Copy link
Copy Markdown
Contributor

@nomeata, do you have interest in reviving this PR in some form? As mathlib and Lean continue to grow, the number of simp lemmas keeps growing, and building the simp discrimination tree (at import time) is taking a significant amount of time when importing all of mathlib. The refactor in this PR would be one way to mitigate this inefficiency.

@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Mar 7, 2026

Do we have evidence that this PR is helping with import times?

@JovanGerb
Copy link
Copy Markdown
Contributor

I measured that for import Mathlib, the simp discrimination tree takes about 0.5s and the instances discrimination tree takes about 0.1s. When building mathlib, the imports are typically smaller, so it will take less time per file, but with over 8000 files this will add up. (see #mathlib4 > Performance cost of environment extensions @ 💬)

And I recall that when this PR was benchmarked against mathlib there was a significant speedup. So I figured that this speedup is from the reduced import times, and that the effect would be even more significant now that mathlib has grown more.

@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Mar 7, 2026

Improving startup time for import Mathlib would be nice. I’ll ask claude to update this PR and we can get some measurement :-)

@nomeata nomeata reopened this Mar 7, 2026
@JovanGerb
Copy link
Copy Markdown
Contributor

An alternative approach would be to make use of a Thunk data structure so as to never compute the entire discrimination tree. However when I tried this, I realized that the implementation would require some more flexibility from the environment extension framework: at import time we would have to first construct the root node of the discrimination tree, and after that is created, map over it with some Thunk.mk. Otherwise we would need to use a Thunk.map for each imported simp lemma, which I somehow imagine to not be efficient.

@nomeata
Copy link
Copy Markdown
Collaborator Author

nomeata commented Mar 7, 2026

Closing this PR in favor of #12838

Some form of lazy loading may possibly also be interesting, of course.

@nomeata nomeata closed this Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting-author Waiting for PR author to address issues builds-mathlib CI has verified that Mathlib builds against this PR toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants