fix: improve face clustering algorithm #830
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds per-face quality scoring into detection, database, and clustering; introduces conservative, DBSCAN, and hierarchical clustering utilities with adaptive eps and post-merge logic; updates schema, insertion/retrieval, and tests to propagate and validate quality-aware behavior.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Detector as FaceDetector
    participant Quality as FaceQualitySvc
    participant Clustering as ClusteringEngine
    participant DB as Database
    Detector->>Quality: send detected face images
    Quality->>Quality: compute sharpness, brightness, size -> quality scores
    Quality-->>Detector: per-face quality scores
    alt Quality filtering enabled
        Detector->>Detector: drop faces below threshold
    end
    Detector->>Clustering: embeddings + quality metadata
    Clustering->>Clustering: normalize embeddings
    opt auto-eps
        Clustering->>Clustering: calculate_adaptive_eps (k-NN)
    end
    Clustering->>Clustering: run algorithm (conservative/dbscan/hierarchical)
    Clustering->>Clustering: validate/split/merge clusters
    Clustering-->>Detector: labels, cluster_means, cluster_quality_scores
    alt Assign unclustered faces
        Detector->>Clustering: compute similarity to cluster means (quality-aware)
        Clustering-->>Detector: assignment decisions
    end
    Detector->>DB: insert/update faces with embedding, cluster_uuid, quality
    DB-->>DB: persist rows including quality and per-cluster quality_scores
    DB-->>Detector: ack/ids
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/database/faces.py (1)
36-50: Existing deployments require a schema migration to add the `quality` column.

The code uses `COALESCE(quality, 0.5)` in SELECT queries, which gracefully handles NULL values for new rows. However, `CREATE TABLE IF NOT EXISTS` will not modify existing tables, so databases from before this change won't have the `quality` column at all. Queries referencing this column will fail with "no such column" errors.

For existing deployments, run an `ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5;` migration before deploying this code. New deployments from scratch will not be affected.
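A minimal sketch of such a guarded migration, assuming the backend's SQLite database (the `db_path` argument and function name are illustrative, not existing project code):

```python
import sqlite3

def migrate_add_quality_column(db_path: str) -> None:
    """Add the `quality` column to `faces` if it is missing (idempotent)."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info returns one row per existing column; index 1 is the name
        columns = {row[1] for row in conn.execute("PRAGMA table_info(faces)")}
        if "quality" not in columns:
            conn.execute("ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5")
            conn.commit()
    finally:
        conn.close()
```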
🧹 Nitpick comments (7)
backend/tests/test_clustering_algorithm.py (1)
186-233: Test may be non-deterministic due to random embeddings.

The test uses `np.random.rand(512) * 0.01` perturbations, but the comment acknowledges "clusters might or might not merge." This makes the test assertion weak; it only verifies the function runs without errors.

Consider using fixed embeddings with known distances to make the test deterministic and actually verify the merge behavior.
🔎 Suggested improvement for deterministic testing
```diff
 def test_post_merge_combines_close_clusters(
     self, mock_filter, mock_cluster, mock_get_faces
 ):
     """Test that post-merge combines clusters with similar mean embeddings."""
-    # Create embeddings that will form 2 clusters with very similar means
-    base_embedding = np.random.rand(512)
+    # Create deterministic embeddings with known distance
+    base_embedding = np.zeros(512)
+    base_embedding[0] = 1.0  # Unit vector in first dimension
     mock_get_faces.return_value = [
         {
             "face_id": 1,
-            "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+            "embeddings": base_embedding.tolist(),
             "cluster_name": None,
             "quality": 0.8,
         },
         # ... similar changes for other faces
     ]
```

backend/app/utils/face_clusters.py (1)
437-451: Potential division by zero if embedding has zero norm.
`_calculate_cosine_distance` divides by `np.linalg.norm()` without checking for zero-norm vectors. While face embeddings from a neural network should never be zero, a defensive guard would prevent potential `nan` results.

🔎 Suggested defensive guard
```diff
 def _calculate_cosine_distance(embedding_a: NDArray, embedding_b: NDArray) -> float:
     """
     Calculate cosine distance between two embeddings.
     ...
     """
-    norm_a = embedding_a / np.linalg.norm(embedding_a)
-    norm_b = embedding_b / np.linalg.norm(embedding_b)
+    norm_a_val = np.linalg.norm(embedding_a)
+    norm_b_val = np.linalg.norm(embedding_b)
+    if norm_a_val == 0 or norm_b_val == 0:
+        return 2.0  # Maximum distance for degenerate case
+    norm_a = embedding_a / norm_a_val
+    norm_b = embedding_b / norm_b_val
     similarity = np.dot(norm_a, norm_b)
     return 1 - similarity
```

backend/app/utils/clustering_advanced.py (4)
23-64: `calculate_adaptive_eps` is defined but unused within this module.

This function is never called internally; `cluster_faces_dbscan` uses `select_conservative_epsilon` from the conservative module instead (line 100). Consider:

- Using this function where `auto_eps=True` in `cluster_faces_dbscan`, or
- Removing it if `select_conservative_epsilon` is the preferred approach, or
- Documenting it as an alternative utility for external callers.
The current state may confuse maintainers about which epsilon calculation to use.
211-217: Redundant `max(min_samples, 2)` enforcement.

Line 215 applies `max(min_samples, 2)`, but `cluster_faces_dbscan` already enforces this at line 104. While harmless, consider removing the redundancy here to keep the enforcement in one place.

🔎 Proposed fix

```diff
 elif algorithm == "dbscan":
     return cluster_faces_dbscan(
         embeddings,
         eps=eps,
-        min_samples=max(min_samples, 2),
+        min_samples=min_samples,
         auto_eps=auto_eps,
     )
```
267-269: `float("inf")` for single-element cluster density may cause downstream issues.

If these statistics are later aggregated (e.g., averaging densities), `inf` will propagate unexpectedly. Consider using `None` or `np.nan` to explicitly indicate "undefined" rather than infinite density.

🔎 Proposed fix

```diff
 else:
     cluster_diameters.append(0.0)
-    cluster_densities.append(float("inf"))
+    cluster_densities.append(np.nan)  # Undefined for single-element clusters
```
295-296: Hardcoded embedding dimension 512 is fragile.

If the embedding model changes (e.g., to 256 or 1024 dimensions), this will return an incorrectly sized array. Consider raising an error for empty input or accepting the dimension as a parameter.
🔎 Proposed fix
```diff
-def calculate_cluster_mean(embeddings: NDArray) -> NDArray:
+def calculate_cluster_mean(embeddings: NDArray, embedding_dim: int = 512) -> NDArray:
     """
     Calculate the mean embedding for a cluster.

     Args:
         embeddings: Face embeddings for a cluster
+        embedding_dim: Dimension of embeddings (used only for empty input fallback)

     Returns:
         Normalized mean embedding
     """
     if len(embeddings) == 0:
-        return np.zeros(512)  # Default embedding dimension
+        return np.zeros(embedding_dim)
```

backend/app/utils/clustering_conservative.py (1)
117-127: Broad `except Exception` silently swallows all errors.

This could hide genuine issues (e.g., memory errors, numpy bugs). Consider catching a more specific exception or at least logging the failure.
🔎 Proposed fix
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 try:
     sub_labels = clustering.fit_predict(embeddings)

     # Group indices by sub-cluster
     sub_clusters = defaultdict(list)
     for i, label in enumerate(sub_labels):
         sub_clusters[label].append(face_indices[i])

     return [np.array(indices) for indices in sub_clusters.values()]
-except Exception:
+except ValueError as e:
+    # AgglomerativeClustering can raise ValueError for edge cases
+    logger.debug(f"Failed to split cluster: {e}")
     return [face_indices]
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- backend/app/database/faces.py
- backend/app/models/FaceDetector.py
- backend/app/utils/clustering_advanced.py
- backend/app/utils/clustering_conservative.py
- backend/app/utils/face_clusters.py
- backend/app/utils/face_quality.py
- backend/requirements.txt
- backend/test.py
- backend/tests/test_clustering_algorithm.py
- backend/tests/test_face_quality.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/tests/test_face_quality.py (1)
backend/app/utils/face_quality.py (6)
- assess_face_sharpness (15-41)
- assess_face_brightness (44-73)
- assess_face_size (76-102)
- calculate_face_quality (105-156)
- should_include_face (159-170)
- filter_quality_faces (173-184)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
- cluster_util_cluster_all_face_embeddings (194-342)
- _calculate_cosine_distance (437-451)
backend/app/utils/clustering_advanced.py (1)
backend/app/utils/clustering_conservative.py (3)
- cluster_conservative (374-406)
- select_conservative_epsilon (33-77)
- fit_predict (171-207)
backend/app/models/FaceDetector.py (2)
backend/app/utils/face_quality.py (1)
- calculate_face_quality (105-156)

backend/app/database/faces.py (1)

- db_insert_face_embeddings_by_image_id (101-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Backend Tests
🔇 Additional comments (37)
backend/requirements.txt (1)
74-74: LGTM!

The `kneed` dependency is appropriately added to support adaptive epsilon calculation in the clustering utilities. The minimum version constraint `>=0.8.5` allows flexibility for compatible updates.

backend/app/utils/face_quality.py (5)
15-41: LGTM with minor consideration for edge cases.

The sharpness assessment using Laplacian variance is a well-established technique. The normalization to a 0-1 scale with 100 as the empirical threshold is reasonable.

Consider adding a guard for degenerate images (e.g., 0-dimensional or single-pixel images) that could cause issues with `cv2.Laplacian` or `cv2.cvtColor`. However, since this is called after face detection with padding, such edge cases are unlikely in practice.
44-73: LGTM!

The brightness assessment correctly evaluates how close the average brightness is to the optimal midpoint (128). The scoring formula ensures scores range from 0.0 (extreme brightness/darkness) to 1.0 (optimal).

76-102: LGTM!

The size assessment correctly uses the minimum dimension to handle non-square face crops and caps at 1.0 for faces meeting or exceeding the target size.

105-156: LGTM!

Well-designed composite quality function (sketched below) with:
- Weight validation using `np.isclose` for floating-point safety
- Comprehensive return dictionary for debugging and analysis
- Sensible default weights emphasizing sharpness (0.4)
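A minimal sketch of such a weighted composite, assuming component scores already normalized to 0-1 (the function name and signature are illustrative; the review only states a default sharpness weight of 0.4, so the 0.3/0.3 split below is an assumption):

```python
import numpy as np

def composite_quality(sharpness: float, brightness: float, size: float,
                      weights: dict | None = None) -> float:
    """Combine per-metric scores (each 0-1) into a single quality value."""
    # Default weights: sharpness 0.4 per the review; remainder split is assumed
    weights = weights or {"sharpness": 0.4, "brightness": 0.3, "size": 0.3}
    if not np.isclose(sum(weights.values()), 1.0):
        raise ValueError("Quality weights must sum to 1.0")
    return (weights["sharpness"] * sharpness
            + weights["brightness"] * brightness
            + weights["size"] * size)
```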
159-184: LGTM!

Simple and effective filtering utilities. The use of `.get("quality", 0.0)` in `filter_quality_faces` provides a sensible fallback for faces without quality scores.

backend/app/models/FaceDetector.py (3)
37-37: LGTM!

Good addition of the `qualities` list to track per-face quality scores alongside other face metadata.

53-65: LGTM!

Quality assessment is appropriately integrated into the face detection loop. The quality is calculated on the padded face crop (`face_img`), which provides consistent input for the quality metrics.

The debug logging of individual quality components will be valuable for troubleshooting clustering issues.

74-91: LGTM!

Good integration of quality scores into the database insertion path. The aggregate statistics logging (average quality and high-quality count) provides useful visibility into detection quality across images.
backend/tests/test_clustering_algorithm.py (4)
25-51: LGTM!

Comprehensive tests for `_calculate_cosine_distance` covering key cases:
- Identical embeddings → distance 0
- Orthogonal embeddings → distance 1
- Opposite embeddings → distance 2
- Unnormalized inputs handled correctly
57-68: LGTM!

Good practice to test configuration constants. These tests serve as documentation and will fail if someone inadvertently changes critical clustering parameters.

307-369: LGTM!

Good coverage of cluster name determination logic:
- Majority voting correctly selects "John" when it appears more frequently
- `None` is returned when no faces have existing names
267-304: The patching strategy is correct as implemented.

The test properly patches `CLUSTERING_QUALITY_FILTER_ENABLED` and `CLUSTERING_QUALITY_MIN_THRESHOLD` at their definition and usage location (app.utils.face_clusters). Since these constants are read directly within the same module at runtime (lines 224 and 227 of `cluster_util_cluster_all_face_embeddings()`), the `@patch` decorators will correctly intercept the reads, allowing the real `filter_quality_faces()` function to execute with the patched threshold values, as illustrated below.
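A minimal illustration of that patching pattern with `unittest.mock` (the test name and threshold value here are hypothetical; the patch targets are the constants named above):

```python
from unittest.mock import patch

# Patch the constants where they are read: the face_clusters module itself
@patch("app.utils.face_clusters.CLUSTERING_QUALITY_FILTER_ENABLED", True)
@patch("app.utils.face_clusters.CLUSTERING_QUALITY_MIN_THRESHOLD", 0.5)
def test_low_quality_faces_are_filtered():
    # Code inside face_clusters reads the patched values at runtime,
    # so the real filter_quality_faces() runs against threshold 0.5.
    ...
```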
backend/tests/test_face_quality.py (6)

1-19: LGTM!

Well-organized test module with clear imports and class-based test organization. The tests provide good coverage for the face quality assessment utilities.

21-55: LGTM!

Effective sharpness tests using:
- Checkerboard pattern for high sharpness (high edge content)
- Uniform gray image for low sharpness (no edges)
- Grayscale input handling verification
57-86: LGTM!

Brightness tests appropriately cover the full range: optimal (128), very dark (20), and overexposed (240).

88-122: LGTM!

Size assessment tests cover important cases including rectangular images using the minimum dimension.

124-180: LGTM!

Comprehensive tests for the composite quality function including:
- Return structure validation
- Score range validation
- Custom weight support
- Invalid weight error handling
- High-quality face scenario
182-258: LGTM!

Good coverage of filtering utilities including:
- Threshold boundary behavior (exact threshold is inclusive)
- Empty list handling
- All pass/fail scenarios
- Missing quality key defaulting to 0
backend/app/utils/face_clusters.py (3)
35-73: LGTM! Well-documented configuration.

The clustering configuration is comprehensive with clear documentation explaining the rationale behind each parameter. Key highlights:

- `CLUSTERING_MIN_SAMPLES = 2` prevents bridge-point chaining (the core fix)
- Post-merge threshold (0.28) is tighter than DBSCAN eps (0.35) for safety
- Low quality threshold (0.15) is intentionally permissive
288-320: Post-merge logic is correct but has O(n²) complexity.

The nested loop correctly implements pairwise cluster merging based on mean embedding distance. The use of a `used` set prevents double-merging.

For very large numbers of clusters, consider optimizing with spatial indexing, though this is unlikely to be a bottleneck for typical face clustering scenarios.
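A minimal sketch of that pairwise merge pattern, assuming cluster means are already L2-normalized so cosine distance reduces to `1 - dot` (names are illustrative, not the module's exact code):

```python
import numpy as np

def merge_close_clusters(means: dict[int, np.ndarray],
                         threshold: float = 0.28) -> dict[int, list[int]]:
    """Greedy O(n^2) pairwise merge of clusters whose mean embeddings are close."""
    labels = list(means)
    merged: dict[int, list[int]] = {}
    used: set[int] = set()
    for i, a in enumerate(labels):
        if a in used:
            continue
        merged[a] = [a]
        for b in labels[i + 1:]:
            if b in used:
                continue
            # Cosine distance between unit vectors is 1 - dot product
            if 1.0 - float(np.dot(means[a], means[b])) < threshold:
                merged[a].append(b)
                used.add(b)
    return merged
```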
345-434: LGTM!

The incremental cluster assignment function correctly:
- Uses quality-weighted cluster representatives
- Applies quality filtering to unassigned faces
- Enforces similarity threshold before assignment
- Provides helpful debug logging for skipped faces
backend/app/database/faces.py (3)
57-98: LGTM!

The insert function correctly handles the new quality parameter with proper SQL parameter binding.

240-254: LGTM!

Good use of `COALESCE(quality, 0.5)` to handle NULL values gracefully, ensuring backward compatibility with existing data that lacks quality scores.
349-405: LGTM!

The function correctly returns quality scores alongside embeddings for quality-aware clustering operations. The grouping logic properly maintains separate quality lists per cluster.
backend/app/utils/clustering_advanced.py (4)
1-21: LGTM on module structure and imports.

The module docstring clearly explains the purpose and the critical `min_samples >= 2` enforcement. Imports are appropriate for the clustering operations.

67-114: LGTM: `min_samples >= 2` enforcement is correctly implemented.

The function properly normalizes embeddings, enforces the critical `min_samples >= 2` constraint to prevent bridge-point chaining, and delegates epsilon selection appropriately.

117-161: LGTM: hierarchical clustering with complete linkage is a good conservative choice.

The complete linkage ensures all pairs within a cluster meet the distance threshold, aligning with the PR objective of preventing incorrect merges.

313-331: LGTM: backward compatibility alias and exports are well-defined.

The `advanced_face_clustering` alias maintains API compatibility, and `__all__` correctly exports the public interface.

backend/app/utils/clustering_conservative.py (8)
1-23: LGTM: module docstring clearly articulates the conservative clustering philosophy.

The key principles are well-documented: strict thresholds, cluster validation, and cautious merging.
25-77: LGTM: epsilon selection logic is well-designed.

The adaptive percentile selection based on data spread (tight, moderate, wide) is a thoughtful approach that balances clustering quality across different datasets; a sketch of this style of selection follows.
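A minimal sketch of percentile-based epsilon selection in this spirit, assuming normalized embeddings and cosine distances (the spread cutoffs, percentiles, and clamp range are illustrative, not the module's actual values):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def select_eps_by_spread(embeddings: np.ndarray) -> float:
    """Pick eps from the pairwise-distance distribution: wider spread -> stricter eps."""
    dists = pairwise_distances(embeddings, metric="cosine")
    upper = dists[np.triu_indices_from(dists, k=1)]  # unique pairs only
    spread = float(np.std(upper))
    # Wide spread: be conservative (low percentile); tight data: allow more
    percentile = 10 if spread > 0.2 else 20 if spread > 0.1 else 30
    eps = float(np.percentile(upper, percentile))
    return float(np.clip(eps, 0.15, 0.35))  # clamp to a safe range
```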
80-95: LGTM: cluster validation is straightforward and correct.

171-207: LGTM: `fit_predict` implements a well-structured pipeline.

The sequence of normalize → select epsilon → DBSCAN → validate/split → safe merge → relabel is logical and aligns with the two-phase clustering approach described in the PR objectives.

209-246: LGTM: cluster validation and splitting logic is correct.

The 0.8 multiplier on `max_diameter` for sub-cluster splitting (line 230) is a good heuristic to ensure tighter sub-clusters.
248-364: LGTM: safe merge implementation with Union-Find is well-designed.

Key strengths:
- Union-Find with path compression handles transitive merges correctly
- Multiple merge strategies (centroid distance, min pairwise + match ratio, high match ratio) address angle variation
- Pre-merge validation (line 355) prevents creating oversized clusters
This directly addresses the PR objective of reuniting over-split clusters while maintaining safety.
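A minimal sketch of the Union-Find pattern used for transitive merges (a generic disjoint-set structure, not the module's exact implementation):

```python
class UnionFind:
    """Disjoint-set with path compression, suitable for transitive cluster merges."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

# Merging clusters 0-1 and then 1-2 transitively groups all three
uf = UnionFind(3)
uf.union(0, 1)
uf.union(1, 2)
assert uf.find(2) == uf.find(0)
```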
366-371: LGTM: relabeling is straightforward and correct.

374-406: LGTM: convenience function provides a clean API.

Parameters are well-documented and correctly forwarded to `ConservativeFaceClustering`.
Pull request overview
This PR improves the face clustering algorithm to fix issue #793 where faces were incorrectly merged across different people (bridge-point chaining) and split for the same person across different angles.
Key Changes:
- Implements two-phase clustering: DBSCAN with `min_samples=2` to prevent bridge-point chaining, followed by post-cluster mean merge to reunite same-person splits
- Adds face quality assessment module to filter low-quality face detections before clustering
- Introduces conservative and hierarchical clustering algorithms as alternatives to standard DBSCAN
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/app/utils/face_clusters.py | Core clustering logic with post-merge functionality and quality filtering integration |
| backend/app/utils/clustering_advanced.py | Main clustering entry point supporting multiple algorithms (DBSCAN, hierarchical, conservative) |
| backend/app/utils/clustering_conservative.py | Conservative clustering implementation with cluster validation and safe merging |
| backend/app/utils/face_quality.py | New module for assessing face quality based on sharpness, brightness, and size |
| backend/app/models/FaceDetector.py | Integration of quality assessment during face detection |
| backend/app/database/faces.py | Database schema and query updates to store and retrieve quality scores |
| backend/tests/test_clustering_algorithm.py | Comprehensive unit tests for clustering algorithms and post-merge functionality |
| backend/tests/test_face_quality.py | Unit tests for face quality assessment functions |
| backend/test.py | Test script parameter updates (contains bug with min_samples=1) |
| backend/requirements.txt | Added kneed dependency (currently unused) |
```
psutil>=5.9.5
pytest-asyncio>=1.0.0
setuptools==66.1.1
kneed>=0.8.5
```
Copilot AI · Dec 23, 2025
The 'kneed' package is added to requirements.txt but doesn't appear to be used anywhere in the codebase. If this dependency is not needed for the current implementation, it should be removed. If it's intended for future use, consider adding a comment explaining its purpose.
```
kneed>=0.8.5
```
```python
    validate_clusters: Whether to validate and split loose clusters
    auto_eps: Auto-select conservative epsilon
    merge_close_clusters: Whether to merge clusters that are clearly same person
    merge_threshold: Max centroid distance for merging (default: 0.30)
```
Copilot AI · Dec 23, 2025
The docstring states "merge_threshold: Max centroid distance for merging (default: 0.30)" but the actual default value in the function signature is 0.40. The documentation should match the implementation.
```diff
-    merge_threshold: Max centroid distance for merging (default: 0.30)
+    merge_threshold: Max centroid distance for merging (default: 0.40)
```
```python
        mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
        merged[label_i] = list(faces_i)  # Copy the list

        for j in range(i + 1, len(cluster_items)):
            label_j, faces_j = cluster_items[j]
            if label_j in used:
                continue

            mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
```
Copilot AI · Dec 23, 2025
The mean embeddings are calculated without normalization. Since cosine distance is being used (which is scale-invariant), the embeddings should be normalized before computing the mean. Otherwise, the mean of unnormalized embeddings may not be representative. Consider using the calculate_cluster_mean function from clustering_advanced.py which properly normalizes embeddings before and after computing the mean.
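A sketch of the normalize-then-average pattern the reviewer describes (illustrative; `normalized_mean` is not an existing project function):

```python
import numpy as np

def normalized_mean(embeddings: np.ndarray) -> np.ndarray:
    """Normalize each embedding to unit length, average, then re-normalize."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero norms
    mean = unit.mean(axis=0)
    return mean / max(float(np.linalg.norm(mean)), 1e-12)
```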
```python
    fixed_eps=eps,
    distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,
    linkage=HIERARCHICAL_LINKAGE,
    density_refinement=VALIDATE_CLUSTERS,
```
Copilot AI · Dec 23, 2025
The parameter 'density_refinement' is passed to cluster_faces but the conservative clustering algorithm expects 'validate_clusters'. The parameter name mismatch means this setting won't be applied when using the conservative algorithm. Update line 265 to pass 'validate_clusters=VALIDATE_CLUSTERS' or add 'density_refinement' as an alias in the cluster_faces function that maps to 'validate_clusters'.
```diff
-    density_refinement=VALIDATE_CLUSTERS,
+    validate_clusters=VALIDATE_CLUSTERS,
```
| linkage="complete", # Conservative: all pairs must be similar | ||
| ) | ||
| else: | ||
| clustering = AgglomerativeClustering( | ||
| n_clusters=None, | ||
| distance_threshold=distance_threshold, | ||
| metric="cosine", | ||
| linkage="complete", |
Copilot AI · Dec 23, 2025
The 'linkage' parameter is passed from face_clusters.py but is ignored in cluster_faces_hierarchical, which always uses "complete" linkage. If complete linkage is mandatory for correctness, remove the linkage parameter from the caller. Otherwise, update cluster_faces_hierarchical to accept and use a linkage parameter.
| linkage="complete", # Conservative: all pairs must be similar | |
| ) | |
| else: | |
| clustering = AgglomerativeClustering( | |
| n_clusters=None, | |
| distance_threshold=distance_threshold, | |
| metric="cosine", | |
| linkage="complete", | |
| linkage=linkage, # Conservative: all pairs must be similar by default | |
| ) | |
| else: | |
| clustering = AgglomerativeClustering( | |
| n_clusters=None, | |
| distance_threshold=distance_threshold, | |
| metric="cosine", | |
| linkage=linkage, |
backend/test.py (Outdated)
```python
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
```
Copilot AI · Dec 23, 2025
The comment states "min_samples=1 to include all faces" but the PR description and configuration explicitly set min_samples=2 to prevent bridge-point chaining, which is the core fix for issue #793. This test file uses min_samples=1 which reintroduces the exact bug being fixed. The parameters should match the production configuration: eps=0.35 and min_samples=2.
```diff
-# Updated parameters to match production improvements
-# eps=0.2 for tighter clustering, min_samples=1 to include all faces
-dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
+# Parameters aligned with production configuration to prevent bridge-point chaining
+# eps=0.35, min_samples=2
+dbscan = DBSCAN(eps=0.35, min_samples=2, metric="cosine")
```
```python
    max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
    validate_clusters: Whether to validate and split loose clusters
    auto_eps: Auto-select conservative epsilon
    merge_close_clusters: Whether to merge clusters that are clearly same person
    merge_threshold: Max centroid distance for merging (default: 0.30)
```
Copilot AI · Dec 23, 2025
The docstring states "max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)" but the actual default value in the function signature is 0.60. The documentation should match the implementation.
```diff
-    max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
+    max_cluster_diameter: Maximum allowed cluster diameter (default: 0.60)
     validate_clusters: Whether to validate and split loose clusters
     auto_eps: Auto-select conservative epsilon
     merge_close_clusters: Whether to merge clusters that are clearly same person
-    merge_threshold: Max centroid distance for merging (default: 0.30)
+    merge_threshold: Max centroid distance for merging (default: 0.40)
```
```python
            sub_clusters[label].append(face_indices[i])

        return [np.array(indices) for indices in sub_clusters.values()]
    except Exception:
```
Copilot AI · Dec 23, 2025
Using a bare except clause catches all exceptions including SystemExit and KeyboardInterrupt, which can make debugging difficult. Consider catching specific exceptions (e.g., ValueError, RuntimeError) that AgglomerativeClustering might raise, or at minimum use "except Exception:" to avoid catching system exceptions.
```diff
-    except Exception:
+    except (ValueError, RuntimeError):
```
| metric="cosine", | ||
| n_jobs=-1, # Use all available CPU cores | ||
| auto_eps=CLUSTERING_AUTO_EPSILON, | ||
| fixed_eps=eps, |
Copilot AI · Dec 23, 2025
The parameter 'fixed_eps' passed from face_clusters.py (line 262) is not used when algorithm="dbscan". The cluster_faces_dbscan function has an 'eps' parameter, but 'fixed_eps' is passed via **kwargs and ignored. The 'eps' parameter should be used directly instead of 'fixed_eps' in the call from face_clusters.py, or the cluster_faces function should explicitly map fixed_eps to eps before calling cluster_faces_dbscan.
```diff
-    fixed_eps=eps,
+    eps=eps,
```
Actionable comments posted: 1
🧹 Nitpick comments (2)
backend/tests/test_clustering_algorithm.py (2)
65-68: Consider testing the exact threshold value.

The test only verifies the threshold is conservative (< 0.35), but per the PR description, the intended value is 0.28. Consider either:

- Testing the exact value: `assert POST_MERGE_MEAN_DISTANCE_THRESHOLD == 0.28`
- Adding a comment explaining why a range check is preferred over an exact match
This helps catch unintended configuration drift.
184-232: Strengthen post-merge test with deterministic embeddings.

The test uses random embeddings and doesn't verify that clusters are actually merged (lines 228-229 acknowledge this). This means the test could pass even if post-merge logic is broken.
Consider using deterministic embeddings to reliably verify merge behavior:
🔎 Proposed fix
```diff
 def test_post_merge_combines_close_clusters(
     self, mock_filter, mock_cluster, mock_get_faces
 ):
     """Test that post-merge combines clusters with similar mean embeddings."""
-    # Create embeddings that will form 2 clusters with very similar means
-    base_embedding = np.random.rand(512)
+    # Create embeddings with known means that are close (distance < 0.28)
+    # Cluster 0 mean will be ~[1.0, 0, 0, ...]
+    # Cluster 1 mean will be ~[0.95, 0, 0, ...] (cosine distance ~0.05)
+    embedding_1 = np.array([1.0] + [0.0] * 511)
+    embedding_2 = np.array([1.0] + [0.0] * 511)
+    embedding_3 = np.array([0.95] + [0.0] * 511)
+    embedding_4 = np.array([0.95] + [0.0] * 511)
+
     mock_get_faces.return_value = [
         {
             "face_id": 1,
-            "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+            "embeddings": embedding_1.tolist(),
             "cluster_name": None,
             "quality": 0.8,
         },
         {
             "face_id": 2,
-            "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+            "embeddings": embedding_2.tolist(),
             "cluster_name": None,
             "quality": 0.8,
         },
         {
             "face_id": 3,
-            "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+            "embeddings": embedding_3.tolist(),
             "cluster_name": None,
             "quality": 0.8,
         },
         {
             "face_id": 4,
-            "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+            "embeddings": embedding_4.tolist(),
             "cluster_name": None,
             "quality": 0.8,
         },
     ]
     mock_filter.return_value = mock_get_faces.return_value

     # DBSCAN creates 2 separate clusters
     mock_cluster.return_value = np.array([0, 0, 1, 1])

     results = cluster_util_cluster_all_face_embeddings()

-    # Post-merge should combine the 2 clusters into 1
-    # All 4 faces should have the same cluster_uuid
-    # Note: Due to random embeddings, clusters might or might not merge
-    # At minimum, we verify the function ran without errors and returned all faces
+    # Post-merge should combine the 2 clusters into 1
+    # All 4 faces should have the same cluster_uuid
     assert len(results) == 4
-    assert all(r.cluster_uuid is not None for r in results)
+    cluster_uuids = [r.cluster_uuid for r in results]
+    # Verify all faces merged into a single cluster
+    assert len(set(cluster_uuids)) == 1, "Expected all faces to be merged into one cluster"
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/tests/test_clustering_algorithm.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
- cluster_util_cluster_all_face_embeddings (194-342)
- _calculate_cosine_distance (437-451)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (8)
backend/tests/test_clustering_algorithm.py (8)
1-16: LGTM - Test module setup is clean.

The imports and module structure are well-organized. Testing the private `_calculate_cosine_distance` function directly is appropriate for unit testing.
25-52: LGTM - Comprehensive helper function tests.

The cosine distance tests cover all critical cases (identical, orthogonal, opposite, unnormalized) with appropriate assertions and numerical tolerances.

74-85: LGTM - Good edge case coverage.

Properly tests the empty database scenario.

86-109: LGTM - Correctly validates min_samples=2 behavior.

The test properly verifies that a single face becomes noise when `min_samples=2`, preventing bridge-point chaining as per PR objectives.
152-179: LGTM - Proper noise point handling.

Correctly validates that faces labeled as noise (-1) are excluded from results.

233-260: LGTM - Correctly validates post-merge disabled behavior.

Properly verifies that clusters remain separate when post-merge is disabled.

265-303: LGTM - Good integration test for quality filtering.

The test appropriately allows the real `filter_quality_faces` function to run while patching configuration, verifying that low-quality faces (0.3) are filtered before clustering while high-quality faces (0.8, 0.6) pass through.

305-368: LGTM - Cluster name determination tests are solid.

Both tests properly validate:
- Majority voting correctly selects "John" when it appears twice
- Cluster name remains None when no existing names are present
```python
embeddings = [
    np.array([1.0] + [0.0] * 511),       # Distinct embedding 1
    np.array([0.9] + [0.1] * 511),       # Similar to 1
    np.array([0.0, 1.0] + [0.0] * 510),  # Distinct embedding 2
    np.array([0.1, 0.9] + [0.0] * 510),  # Similar to 2
]
```
Fix misleading comments about embedding similarity.
The comments state embeddings are "Similar to 1" and "Similar to 2", but the actual embeddings are not particularly similar:
- `[0.9] + [0.1] * 511` creates `[0.9, 0.1, 0.1, ...]`, which differs significantly from `[1.0, 0.0, 0.0, ...]`
- Similarly for embeddings 3 and 4
Since the clustering result is mocked (line 136), the actual similarity doesn't affect the test, but the comments should be updated to avoid confusion.
🔎 Suggested fix
```diff
 embeddings = [
-    np.array([1.0] + [0.0] * 511),       # Distinct embedding 1
-    np.array([0.9] + [0.1] * 511),       # Similar to 1
-    np.array([0.0, 1.0] + [0.0] * 510),  # Distinct embedding 2
-    np.array([0.1, 0.9] + [0.0] * 510),  # Similar to 2
+    np.array([1.0] + [0.0] * 511),       # Embedding for cluster 0
+    np.array([0.9] + [0.1] * 511),       # Embedding for cluster 0
+    np.array([0.0, 1.0] + [0.0] * 510),  # Embedding for cluster 1
+    np.array([0.1, 0.9] + [0.0] * 510),  # Embedding for cluster 1
 ]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
embeddings = [
    np.array([1.0] + [0.0] * 511),       # Embedding for cluster 0
    np.array([0.9] + [0.1] * 511),       # Embedding for cluster 0
    np.array([0.0, 1.0] + [0.0] * 510),  # Embedding for cluster 1
    np.array([0.1, 0.9] + [0.0] * 510),  # Embedding for cluster 1
]
```
🤖 Prompt for AI Agents
In backend/tests/test_clustering_algorithm.py around lines 119 to 124, the
inline comments claim embeddings are "Similar to 1" and "Similar to 2" but the
arrays shown are not close to the referenced vectors and thus are misleading;
update the comments to accurately describe the embeddings (e.g., "Different but
biased towards first dimension" or "Biased towards first/second dimension" or
simply "Variant of embedding 1/2") so they reflect the actual numeric
relationships, keeping the mocked clustering note intact.
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/app/utils/clustering_conservative.py (2)
141-169: Parameters and documentation look good; consider making the min_samples override more explicit.

The docstring defaults now correctly match the actual parameter defaults (addressing past review comments). The initialization logic is sound.

Line 164 silently overrides `min_samples` to enforce a minimum of 2, which aligns with the PR's goal of preventing bridge-point chaining. While the comment documents this behavior, consider logging a warning when the override occurs to make it more explicit to callers.

🔎 Optional improvement: Add warning for min_samples override
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 class ConservativeFaceClustering:
     def __init__(
         self,
         eps: float = 0.25,
         min_samples: int = 2,
         max_cluster_diameter: float = 0.60,
         validate_clusters: bool = True,
         auto_eps: bool = True,
         merge_close_clusters: bool = True,
         merge_threshold: float = 0.40,
     ):
         ...
         self.eps = eps
+        if min_samples < 2:
+            logger.warning(
+                f"min_samples={min_samples} is too low; enforcing min_samples=2 "
+                "to prevent bridge-point chaining"
+            )
         self.min_samples = max(min_samples, 2)  # Never allow 1
```
248-364: Well-designed merge logic; consider documenting the threshold multipliers.

The safe merge implementation is sophisticated and addresses the PR objective of reuniting same-person splits:
- Union-Find enables transitive merges
- Dual strategies (centroid + pairwise distance) handle angle variations
- Match ratio check prevents spurious merges from a few outlier faces
- Pre-merge validation ensures cluster quality
The hardcoded threshold multipliers (0.8, 0.6) and match ratio thresholds (0.3, 0.5) work well but could benefit from inline comments explaining their rationale or extraction as named constants for clarity.
🔎 Optional: Extract magic numbers as named constants
```diff
 class ConservativeFaceClustering:
+    # Merge strategy thresholds
+    _CENTROID_MERGE_FACTOR = 0.8   # Merge if centroids within 80% of threshold
+    _MIN_DIST_MERGE_FACTOR = 0.6   # Merge if any pair within 60% of threshold
+    _MIN_MATCH_RATIO = 0.3         # At least 30% cross-cluster matches required
+    _HIGH_MATCH_RATIO = 0.5        # High confidence merge threshold
+
     def _safe_merge_clusters(self, embeddings: NDArray, labels: NDArray) -> NDArray:
         ...
-        if centroid_dist < self.merge_threshold * 0.8:
+        if centroid_dist < self.merge_threshold * self._CENTROID_MERGE_FACTOR:
             should_merge = True
-        elif min_dist < self.merge_threshold * 0.6 and match_ratio >= 0.3:
+        elif min_dist < self.merge_threshold * self._MIN_DIST_MERGE_FACTOR and match_ratio >= self._MIN_MATCH_RATIO:
             should_merge = True
-        elif match_ratio >= 0.5 and min_dist < self.merge_threshold:
+        elif match_ratio >= self._HIGH_MATCH_RATIO and min_dist < self.merge_threshold:
             should_merge = True
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/app/utils/clustering_conservative.py
- backend/test.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (5)
backend/test.py (1)
43-45: LGTM! Parameters now align with production configuration.

The updated DBSCAN parameters (eps=0.35, min_samples=2) correctly match the production configuration and address the past review comments. Using min_samples=2 prevents bridge-point chaining between distinct people, which is the core fix for issue #793.
backend/app/utils/clustering_conservative.py (4)
25-127: LGTM! Helper functions are well-implemented.

The utility functions are solid:

- `compute_pairwise_distances` properly handles normalization with division-by-zero protection
- `select_conservative_epsilon` uses adaptive selection with sensible fallbacks and clamping
- `validate_cluster` provides straightforward diameter validation
- `split_loose_cluster` appropriately uses `except Exception:` (addressing the past review concern about bare except)
171-207: LGTM! Well-structured clustering workflow.

The `fit_predict` method implements a clear, logical flow:
- Edge case handling for small datasets
- Normalization and epsilon selection
- Initial DBSCAN clustering
- Optional validation/splitting
- Optional safe merging
- Consecutive relabeling
The implementation aligns well with the PR's two-phase clustering approach.
209-246: LGTM! Validation and splitting logic is sound.

The cluster validation and splitting implementation correctly:
- Validates cluster diameter against threshold
- Splits loose clusters into tighter sub-clusters
- Relabels sub-clusters appropriately
- Marks undersized sub-clusters as noise
This addresses the PR objective of preventing distinct people from being merged.
366-406: LGTM! Clean utility functions.

The relabeling and convenience wrapper functions are straightforward and correctly implemented:

- `_relabel_consecutive` properly maps clusters to consecutive integers while preserving the noise label (-1)
- `cluster_conservative` provides a clean functional interface to the clustering class
Problem
Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.
Solution
Implemented a two-phase clustering approach (a compact sketch follows):

- Phase 1: DBSCAN with `min_samples=2` prevents single faces from acting as bridges between clusters
- Phase 2: a post-cluster mean merge reunites same-person clusters that phase 1 over-split
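A compact sketch of that flow, assuming scikit-learn's DBSCAN; eps=0.35 and the 0.28 merge threshold are the values cited in the review comments above, and `two_phase_cluster` is an illustrative helper, not the PR's actual function:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def two_phase_cluster(embeddings: np.ndarray,
                      eps: float = 0.35,
                      merge_threshold: float = 0.28) -> np.ndarray:
    """Phase 1: strict DBSCAN; Phase 2: merge clusters whose means are close."""
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(embeddings)
    while True:
        ids = sorted(l for l in set(labels) if l != -1)
        merged = False
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                mean_a = embeddings[labels == a].mean(axis=0)
                mean_b = embeddings[labels == b].mean(axis=0)
                cos = np.dot(mean_a, mean_b) / (
                    np.linalg.norm(mean_a) * np.linalg.norm(mean_b))
                if 1.0 - cos < merge_threshold:
                    labels[labels == b] = a  # fold cluster b into cluster a
                    merged = True
                    break
            if merged:
                break
        if not merged:
            return labels
```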
Changes

- `face_clusters.py`: Added post-merge logic and `_calculate_cosine_distance()` helper
- `clustering_conservative.py`: Code cleanup (removed unused functions)
- `clustering_advanced.py`: Code cleanup (removed unused functions)
- `test_clustering_algorithm.py`: Added 16 comprehensive unit tests

Configuration
Benefits
Demo
Before
face_clustering_before.mp4
After
face_clustering_after.mp4
Summary by CodeRabbit
New Features
Improvements
Tests