Skip to content

Performance improvements.. lots of em.#6

Open
boyter wants to merge 14 commits intomasterfrom
claude
Open

Performance improvements.. lots of em.#6
boyter wants to merge 14 commits intomasterfrom
claude

Conversation

@boyter
Copy link
Owner

@boyter boyter commented Feb 23, 2026

No description provided.

@boyter boyter requested a review from Copilot February 23, 2026 08:42
@pr-insights pr-insights bot added M/complexity Normal or medium complexity M/size Normal or medium sized change labels Feb 23, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements significant performance optimizations to the duplicate code detector, focusing on reducing memory allocations and GC pressure while adding new flexibility features.

Changes:

  • Replaces string-based file identification with uint32 IDs throughout the codebase to reduce GC pressure and improve lookup performance
  • Adds gap tolerance feature allowing detection of duplicates with inserted/deleted/modified lines, controlled by --gap-tolerance and --max-gap-bridges flags
  • Introduces single-file mode (--file flag) for comparing one file against the rest of the codebase
  • Adds comprehensive unit test suite (14 tests) covering gap bridging, edge cases, and deduplication logic
  • Optimizes candidate filtering with pre-computed sorted unique hash arrays and pair deduplication via sync.Map
  • Parallelizes file processing in selectFiles() with worker pool pattern
  • Replaces division-based hash reduction with bit masking for faster operation
  • Adds --duplicates-both-ways flag to control whether duplicates are reported from both file perspectives

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
variables.go Adds new global variables for gap tolerance, max gap bridges, duplicates both ways flag, single file path, and file-by-ID map
structs.go Replaces Filename field with ID in duplicateFile struct, adds SortedUniqueHashes field, adds GapCount to duplicateMatch
processor_test.go Comprehensive new test suite with 14 tests covering gap bridging, boundary conditions, and deduplication logic
processor.go Core algorithm changes: gap bridging logic, pair deduplication, ID-based lookups, sharedHashCount optimization, updated output formatting
main.go Adds CLI flags for gap-tolerance, max-gap-bridges, duplicates-both-ways, and file
helper.go Removes unused removeStringDuplicates function
file.go Parallelizes file processing with worker pool, generates unique IDs, computes sorted unique hashes, handles single-file mode
README.md Minor grammar correction (adds comma)
CLAUDE.md New file providing AI assistant guidance about project structure, architecture, and conventions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


## Architecture

Single `main` package, ~610 lines across 7 files. No sub-packages.
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line count is stated as "~610 lines across 7 files" but the actual count is approximately 778 lines across 6 main files (file.go: 219, helper.go: 16, main.go: 126, processor.go: 372, structs.go: 19, variables.go: 26), or 7 files if processor_test.go is included. Consider updating this to reflect the actual size after the performance improvements.

Suggested change
Single `main` package, ~610 lines across 7 files. No sub-packages.
Single `main` package, ~780 lines across 6 main files (7 including tests). No sub-packages.

Copilot uses AI. Check for mistakes.
Comment on lines +148 to 152
if singleFilePath != "" {
abs, err := filepath.Abs(singleFilePath)
if err == nil {
singleFilePath = abs
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If filepath.Abs fails at line 149, singleFilePath remains relative. Later at line 199, filepath.Abs is called on r.file.Location for comparison. If one succeeds and the other fails, or if they resolve to different absolute paths due to different working directories, the single file might not be correctly excluded from the hash index. Consider handling the error at line 149 or storing the result to avoid mismatches.

Copilot uses AI. Check for mistakes.
Comment on lines +196 to 203
extensionFileMap[r.file.Extension] = append(extensionFileMap[r.file.Extension], r.file)
// Don't index the single file — it's the source, not a candidate
if singleFilePath != "" {
absLoc, err := filepath.Abs(r.file.Location)
if err == nil && absLoc == singleFilePath {
continue
}
totalLines++
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single file is excluded from the hash index (lines 198-203) but is still added to extensionFileMap (line 196). This means when processing the single file, it will compare against itself in processFile since candidateID == f.ID won't be skipped until line 110. The sameFile logic at line 111 will catch this, but it's inefficient to include the file in extensionFileMap if it's also the source file being compared.

Copilot uses AI. Check for mistakes.
Comment on lines +119 to 130
if !duplicatesBothWays {
// Pack two uint32 IDs into one uint64 for pair dedup
var pairKey uint64
if f.ID < candidateID {
pairKey = uint64(f.ID)<<32 | uint64(candidateID)
} else {
pairKey = uint64(candidateID)<<32 | uint64(f.ID)
}
if _, seen := processedPairs.LoadOrStore(pairKey, struct{}{}); seen {
continue
}
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When duplicatesBothWays is false and sameFile is true, the pair deduplication logic (lines 119-130) will skip comparing a file to itself. However, this happens after the sameFile check at line 110-117. If processSameFile is true, same-file comparisons should be allowed even when duplicatesBothWays is false. The current logic will incorrectly skip these. The pair deduplication should check for sameFile and skip the deduplication logic in that case.

Copilot uses AI. Check for mistakes.
boyter and others added 3 commits February 23, 2026 20:28
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

M/complexity Normal or medium complexity M/size Normal or medium sized change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants