Skip to content

.Net: Fix TextChunker token-counted paragraph merge#14015

Open
pragnyanramtha wants to merge 2 commits into
microsoft:mainfrom
pragnyanramtha:fix-textchunker-token-counter-13713
Open

.Net: Fix TextChunker token-counted paragraph merge#14015
pragnyanramtha wants to merge 2 commits into
microsoft:mainfrom
pragnyanramtha:fix-textchunker-token-counter-13713

Conversation

@pragnyanramtha
Copy link
Copy Markdown

Fixes #13713.

Summary

  • Validate the final short-paragraph merge candidate with the configured TokenCounter
  • Preserve the existing merge behavior when the candidate fits under the paragraph token budget
  • Add custom-token-counter regressions for the over-limit and still-fits cases

Verification

Using local .NET SDK 10.0.100 installed under /tmp/dotnet:

  • PATH=/tmp/dotnet:$PATH DOTNET_ROOT=/tmp/dotnet DOTNET_CLI_TELEMETRY_OPTOUT=1 dotnet test dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --filter TextChunkerTests -> 41 passed
  • reviewer pass: same focused TextChunker test command -> 41 passed
  • PATH=/tmp/dotnet:$PATH DOTNET_ROOT=/tmp/dotnet DOTNET_CLI_TELEMETRY_OPTOUT=1 dotnet format dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --include dotnet/src/SemanticKernel.Core/Text/TextChunker.cs dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs --verify-no-changes --no-restore
  • git diff --check

Copilot AI review requested due to automatic review settings May 15, 2026 19:08
@pragnyanramtha pragnyanramtha requested a review from a team as a code owner May 15, 2026 19:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a bug in TextChunker where the final short-paragraph merge step could exceed the configured token budget when a custom TokenCounter was used.

Changes:

  • Validate the merged paragraph against the token budget before applying the merge.
  • Skip the merge when the candidate exceeds the limit.
  • Add unit tests covering both the over-limit and under-limit scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
dotnet/src/SemanticKernel.Core/Text/TextChunker.cs Guard the final paragraph merge with a token-count check using the configured counter.
dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs Add regression tests for the over-limit and still-fits merge cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@moonbox3 moonbox3 added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel labels May 15, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 94% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by pragnyanramtha's agents

@github-actions github-actions Bot changed the title Fix TextChunker token-counted paragraph merge .Net: Fix TextChunker token-counted paragraph merge May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kernel Issues or pull requests impacting the core kernel .NET Issue or Pull requests regarding .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcount the chunk sizes

3 participants