@SiddharthJiyani (Contributor) commented Dec 23, 2025

Problem

Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.

Solution

Implemented a two-phase clustering approach:

  1. DBSCAN with min_samples=2 - Prevents single faces from acting as bridges between clusters
  2. Post-cluster mean merge - Merges clusters with similar mean embeddings (threshold: 0.28) to fix same-person splits caused by pose/lighting variation
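
A condensed sketch of this flow, using the configuration values listed below (the cosine helper and merge loop are simplified stand-ins for the real face_clusters.py code, which tracks merged clusters rather than doing a single greedy pass):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def _cosine_distance(a, b):
        a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
        return 1 - float(np.dot(a, b))

    def two_phase_cluster(embeddings):
        # Phase 1: with min_samples=2, a single face can no longer act as a
        # bridge that chains two people's clusters together.
        labels = DBSCAN(eps=0.35, min_samples=2, metric="cosine").fit_predict(embeddings)

        # Phase 2: fold together clusters whose mean embeddings sit within
        # 0.28 cosine distance (same person split by pose/lighting).
        ids = sorted(set(labels) - {-1})
        means = {c: embeddings[labels == c].mean(axis=0) for c in ids}
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if _cosine_distance(means[a], means[b]) < 0.28:
                    labels[labels == b] = a  # single greedy pass for brevity
        return labels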

Changes

  • face_clusters.py: Added post-merge logic and _calculate_cosine_distance() helper
  • clustering_conservative.py: Code cleanup (removed unused functions)
  • clustering_advanced.py: Code cleanup (removed unused functions)
  • test_clustering_algorithm.py: Added 16 comprehensive unit tests

Configuration

CLUSTERING_ALGORITHM = "dbscan"
CLUSTERING_MIN_SAMPLES = 2  # Prevents bridge chaining
CLUSTERING_FIXED_EPSILON = 0.35
POST_MERGE_ENABLED = True
POST_MERGE_MEAN_DISTANCE_THRESHOLD = 0.28

Benefits

  • Reduces mis-clustered faces by filtering out poor-quality detections before clustering
  • Adaptive epsilon selection lets the pipeline adjust to different photo collections automatically
  • Quality-weighted cluster representatives are more robust to outlier faces

Demo

Before

face_clustering_before.mp4

After

  • To see the clustering results directly, please jump to 00:40.
face_clustering_after.mp4

Summary by CodeRabbit

  • New Features

    • Per-face quality scoring (sharpness, brightness, size) persisted with faces and exposed in clustering results; quality-driven filtering for clustering and assignments.
    • Multiple clustering strategies (conservative, DBSCAN, hierarchical) with adaptive parameter selection and improved cluster-merge options.
  • Improvements

    • Quality-aware cluster mean computation, assignment of unclustered faces, and post-merge merging to reduce mis-assignments; richer logging.
  • Tests

    • Comprehensive unit tests for clustering flows and face quality utilities.


Copilot AI review requested due to automatic review settings December 23, 2025 17:49
coderabbitai bot commented Dec 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds per-face quality scoring into detection, database, and clustering; introduces conservative, DBSCAN, and hierarchical clustering utilities with adaptive eps and post-merge logic; updates schema, insertion/retrieval, and tests to propagate and validate quality-aware behavior.

Changes

  • Face Quality Assessment & Tests (backend/app/utils/face_quality.py, backend/tests/test_face_quality.py):
    New module computing per-face quality (sharpness, brightness, size), helpers to filter/include faces, and comprehensive unit tests covering metrics, composition, thresholds, and edge cases.
  • Database Schema & Face Operations (backend/app/database/faces.py):
    Added quality REAL DEFAULT 0.5 column; extended FaceData type and select queries to include quality (COALESCE fallback); propagated quality through single and bulk insert functions and cluster-mean aggregation as quality_scores.
  • Face Detection Integration (backend/app/models/FaceDetector.py):
    Imported calculate_face_quality, computed per-face quality during detection, logged quality statistics, and passed the quality list into db_insert_face_embeddings_by_image_id.
  • Conservative Clustering Module (backend/app/utils/clustering_conservative.py):
    New ConservativeFaceClustering class and helpers: pairwise distances, adaptive epsilon selection, cluster validation/splitting, safe merge, relabeling; cluster_conservative wrapper.
  • Advanced Clustering Utilities (backend/app/utils/clustering_advanced.py):
    New dispatcher and algorithms: adaptive eps calculation, DBSCAN and hierarchical variants, unified cluster_faces entry point, cluster stats and mean calculation, and advanced_face_clustering alias.
  • Clustering Workflow & Assignments (backend/app/utils/face_clusters.py):
    Reworked pipeline to use advanced clustering options, optional quality filtering (config-gated), post-merge consolidation, quality-aware cluster representatives, and updated assignment logic/logging.
  • Tests, Dev Script & Requirements (backend/tests/test_clustering_algorithm.py, backend/test.py, backend/requirements.txt):
    New clustering test suite with mocks for post-merge and quality-filtering scenarios; adjusted local test DBSCAN param; added dependency kneed>=0.8.5.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Detector as FaceDetector
    participant Quality as FaceQualitySvc
    participant Clustering as ClusteringEngine
    participant DB as Database

    Detector->>Quality: send detected face images
    Quality->>Quality: compute sharpness, brightness, size -> quality scores
    Quality-->>Detector: per-face quality scores

    alt Quality filtering enabled
        Detector->>Detector: drop faces below threshold
    end

    Detector->>Clustering: embeddings + quality metadata
    Clustering->>Clustering: normalize embeddings
    opt auto-eps
        Clustering->>Clustering: calculate_adaptive_eps (k-NN)
    end
    Clustering->>Clustering: run algorithm (conservative/dbscan/hierarchical)
    Clustering->>Clustering: validate/split/merge clusters
    Clustering-->>Detector: labels, cluster_means, cluster_quality_scores

    alt Assign unclustered faces
        Detector->>Clustering: compute similarity to cluster means (quality-aware)
        Clustering-->>Detector: assignment decisions
    end

    Detector->>DB: insert/update faces with embedding, cluster_uuid, quality
    DB-->>DB: persist rows including `quality` and per-cluster `quality_scores`
    DB-->>Detector: ack/ids

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement

Pre-merge checks

✅ Passed checks (5 passed)

  • Title check ✅ Passed: The title clearly identifies the main change: improving the face clustering algorithm, which is the core focus of this PR addressing issue #793.
  • Linked Issues check ✅ Passed: The PR implements a two-phase clustering approach with DBSCAN and post-merge logic that directly addresses issue #793's requirements: preventing bridge-point chaining and preventing same-person splits through configurable thresholds.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to addressing issue #793's clustering quality problems. The new utility modules (face_quality.py, clustering_conservative.py, clustering_advanced.py) and modifications support the primary clustering improvement objectives.
  • Docstring Coverage ✅ Passed: Docstring coverage is 93.51%, which is sufficient. The required threshold is 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/database/faces.py (1)

36-50: Existing deployments require a schema migration to add the quality column.

The code uses COALESCE(quality, 0.5) in SELECT queries, which gracefully handles NULL values for new rows. However, CREATE TABLE IF NOT EXISTS will not modify existing tables, so databases from before this change won't have the quality column at all. Queries referencing this column will fail with "no such column" errors.

For existing deployments, add an ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5; migration before deploying this code. New deployments from scratch will not be affected.
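
A minimal guard along those lines, assuming the SQLite setup implied by CREATE TABLE IF NOT EXISTS (this snippet is illustrative, not part of the PR):

    import sqlite3

    def ensure_quality_column(conn: sqlite3.Connection) -> None:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = [row[1] for row in conn.execute("PRAGMA table_info(faces)")]
        if "quality" not in cols:
            conn.execute("ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5")
            conn.commit()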

🧹 Nitpick comments (7)
backend/tests/test_clustering_algorithm.py (1)

186-233: Test may be non-deterministic due to random embeddings.

The test uses np.random.rand(512) * 0.01 perturbations, but the comment acknowledges "clusters might or might not merge." This makes the test assertion weak—it only verifies the function runs without errors.

Consider using fixed embeddings with known distances to make the test deterministic and actually verify the merge behavior.

🔎 Suggested improvement for deterministic testing
     def test_post_merge_combines_close_clusters(
         self, mock_filter, mock_cluster, mock_get_faces
     ):
         """Test that post-merge combines clusters with similar mean embeddings."""
-        # Create embeddings that will form 2 clusters with very similar means
-        base_embedding = np.random.rand(512)
+        # Create deterministic embeddings with known distance
+        base_embedding = np.zeros(512)
+        base_embedding[0] = 1.0  # Unit vector in first dimension
         mock_get_faces.return_value = [
             {
                 "face_id": 1,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": base_embedding.tolist(),
                 "cluster_name": None,
                 "quality": 0.8,
             },
             # ... similar changes for other faces
         ]
backend/app/utils/face_clusters.py (1)

437-451: Potential division by zero if embedding has zero norm.

_calculate_cosine_distance divides by np.linalg.norm() without checking for zero-norm vectors. While face embeddings from a neural network should never be zero, a defensive guard would prevent potential nan results.

🔎 Suggested defensive guard
 def _calculate_cosine_distance(embedding_a: NDArray, embedding_b: NDArray) -> float:
     """
     Calculate cosine distance between two embeddings.
     ...
     """
-    norm_a = embedding_a / np.linalg.norm(embedding_a)
-    norm_b = embedding_b / np.linalg.norm(embedding_b)
+    norm_a_val = np.linalg.norm(embedding_a)
+    norm_b_val = np.linalg.norm(embedding_b)
+    if norm_a_val == 0 or norm_b_val == 0:
+        return 2.0  # Maximum distance for degenerate case
+    norm_a = embedding_a / norm_a_val
+    norm_b = embedding_b / norm_b_val
     similarity = np.dot(norm_a, norm_b)
     return 1 - similarity
backend/app/utils/clustering_advanced.py (4)

23-64: calculate_adaptive_eps is defined but unused within this module.

This function is never called internally—cluster_faces_dbscan uses select_conservative_epsilon from the conservative module instead (line 100). Consider:

  1. Using this function where auto_eps=True in cluster_faces_dbscan, or
  2. Removing it if select_conservative_epsilon is the preferred approach, or
  3. Documenting it as an alternative utility for external callers.

The current state may confuse maintainers about which epsilon calculation to use.


211-217: Redundant max(min_samples, 2) enforcement.

Line 215 applies max(min_samples, 2), but cluster_faces_dbscan already enforces this at line 104. While harmless, consider removing the redundancy here to keep the enforcement in one place.

🔎 Proposed fix
     elif algorithm == "dbscan":
         return cluster_faces_dbscan(
             embeddings,
             eps=eps,
-            min_samples=max(min_samples, 2),
+            min_samples=min_samples,
             auto_eps=auto_eps,
         )

267-269: float("inf") for single-element cluster density may cause downstream issues.

If these statistics are later aggregated (e.g., averaging densities), inf will propagate unexpectedly. Consider using None or np.nan to explicitly indicate "undefined" rather than infinite density.

🔎 Proposed fix
         else:
             cluster_diameters.append(0.0)
-            cluster_densities.append(float("inf"))
+            cluster_densities.append(np.nan)  # Undefined for single-element clusters

295-296: Hardcoded embedding dimension 512 is fragile.

If the embedding model changes (e.g., to 256 or 1024 dimensions), this will return an incorrectly sized array. Consider raising an error for empty input or accepting the dimension as a parameter.

🔎 Proposed fix
-def calculate_cluster_mean(embeddings: NDArray) -> NDArray:
+def calculate_cluster_mean(embeddings: NDArray, embedding_dim: int = 512) -> NDArray:
     """
     Calculate the mean embedding for a cluster.
 
     Args:
         embeddings: Face embeddings for a cluster
+        embedding_dim: Dimension of embeddings (used only for empty input fallback)
 
     Returns:
         Normalized mean embedding
     """
     if len(embeddings) == 0:
-        return np.zeros(512)  # Default embedding dimension
+        return np.zeros(embedding_dim)
backend/app/utils/clustering_conservative.py (1)

117-127: Broad except Exception silently swallows all errors.

This could hide genuine issues (e.g., memory errors, numpy bugs). Consider catching a more specific exception or at least logging the failure.

🔎 Proposed fix
+    import logging
+    logger = logging.getLogger(__name__)
+
     try:
         sub_labels = clustering.fit_predict(embeddings)
 
         # Group indices by sub-cluster
         sub_clusters = defaultdict(list)
         for i, label in enumerate(sub_labels):
             sub_clusters[label].append(face_indices[i])
 
         return [np.array(indices) for indices in sub_clusters.values()]
-    except Exception:
+    except ValueError as e:
+        # AgglomerativeClustering can raise ValueError for edge cases
+        logger.debug(f"Failed to split cluster: {e}")
         return [face_indices]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 51043fe.

📒 Files selected for processing (10)
  • backend/app/database/faces.py
  • backend/app/models/FaceDetector.py
  • backend/app/utils/clustering_advanced.py
  • backend/app/utils/clustering_conservative.py
  • backend/app/utils/face_clusters.py
  • backend/app/utils/face_quality.py
  • backend/requirements.txt
  • backend/test.py
  • backend/tests/test_clustering_algorithm.py
  • backend/tests/test_face_quality.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/tests/test_face_quality.py (1)
backend/app/utils/face_quality.py (6)
  • assess_face_sharpness (15-41)
  • assess_face_brightness (44-73)
  • assess_face_size (76-102)
  • calculate_face_quality (105-156)
  • should_include_face (159-170)
  • filter_quality_faces (173-184)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
  • cluster_util_cluster_all_face_embeddings (194-342)
  • _calculate_cosine_distance (437-451)
backend/app/utils/clustering_advanced.py (1)
backend/app/utils/clustering_conservative.py (3)
  • cluster_conservative (374-406)
  • select_conservative_epsilon (33-77)
  • fit_predict (171-207)
backend/app/models/FaceDetector.py (2)
backend/app/utils/face_quality.py (1)
  • calculate_face_quality (105-156)
backend/app/database/faces.py (1)
  • db_insert_face_embeddings_by_image_id (101-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Backend Tests
🔇 Additional comments (37)
backend/requirements.txt (1)

74-74: LGTM!

The kneed dependency is appropriately added to support adaptive epsilon calculation in the clustering utilities. The minimum version constraint >=0.8.5 allows flexibility for compatible updates.
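
For reference, one way kneed could drive that calculation is via the knee of the sorted k-NN distance curve (a sketch only; per the Copilot comment further down, the current code does not actually call kneed yet):

    import numpy as np
    from kneed import KneeLocator
    from sklearn.neighbors import NearestNeighbors

    def knee_eps(embeddings: np.ndarray, k: int = 2, fallback: float = 0.35) -> float:
        # n_neighbors=k+1 because each point is its own nearest neighbour.
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
        dists, _ = nn.kneighbors(embeddings)
        kth = np.sort(dists[:, -1])  # sorted distance to the k-th true neighbour
        knee = KneeLocator(range(len(kth)), kth, curve="convex", direction="increasing").knee
        return float(kth[knee]) if knee is not None else fallback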

backend/app/utils/face_quality.py (5)

15-41: LGTM with minor consideration for edge cases.

The sharpness assessment using Laplacian variance is a well-established technique. The normalization to 0-1 scale with 100 as the empirical threshold is reasonable.

Consider adding a guard for degenerate images (e.g., 0-dimensional or single-pixel images) that could cause issues with cv2.Laplacian or cv2.cvtColor. However, since this is called after face detection with padding, such edge cases are unlikely in practice.


44-73: LGTM!

The brightness assessment correctly evaluates how close the average brightness is to the optimal midpoint (128). The scoring formula ensures scores range from 0.0 (extreme brightness/darkness) to 1.0 (optimal).


76-102: LGTM!

The size assessment correctly uses the minimum dimension to handle non-square face crops and caps at 1.0 for faces meeting or exceeding the target size.


105-156: LGTM!

Well-designed composite quality function with:

  • Weight validation using np.isclose for floating-point safety
  • Comprehensive return dictionary for debugging and analysis
  • Sensible default weights emphasizing sharpness (0.4)
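
Putting those pieces together, a rough equivalent of the composite score (the 0.3/0.3 weights beyond the documented sharpness 0.4, and the 112 px target size, are assumptions; see face_quality.py for the actual values):

    import cv2
    import numpy as np

    def quality_sketch(face_bgr: np.ndarray, target_size: int = 112) -> float:
        gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
        # Sharpness: Laplacian variance, with 100 as the empirical "sharp" mark.
        sharpness = min(cv2.Laplacian(gray, cv2.CV_64F).var() / 100.0, 1.0)
        # Brightness: 1.0 at the 128 midpoint, falling to 0.0 at the extremes.
        brightness = 1.0 - abs(float(gray.mean()) - 128.0) / 128.0
        # Size: minimum dimension relative to target, capped at 1.0.
        size = min(min(face_bgr.shape[:2]) / target_size, 1.0)
        weights = (0.4, 0.3, 0.3)
        assert np.isclose(sum(weights), 1.0)  # mirrors the module's weight validation
        return weights[0] * sharpness + weights[1] * brightness + weights[2] * size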

159-184: LGTM!

Simple and effective filtering utilities. The use of .get("quality", 0.0) in filter_quality_faces provides a sensible fallback for faces without quality scores.

backend/app/models/FaceDetector.py (3)

37-37: LGTM!

Good addition of the qualities list to track per-face quality scores alongside other face metadata.


53-65: LGTM!

Quality assessment is appropriately integrated into the face detection loop. The quality is calculated on the padded face crop (face_img), which provides consistent input for the quality metrics.

The debug logging of individual quality components will be valuable for troubleshooting clustering issues.


74-91: LGTM!

Good integration of quality scores into the database insertion path. The aggregate statistics logging (average quality and high-quality count) provides useful visibility into detection quality across images.

backend/tests/test_clustering_algorithm.py (4)

25-51: LGTM!

Comprehensive tests for _calculate_cosine_distance covering key cases:

  • Identical embeddings → distance 0
  • Orthogonal embeddings → distance 1
  • Opposite embeddings → distance 2
  • Unnormalized inputs handled correctly

57-68: LGTM!

Good practice to test configuration constants. These tests serve as documentation and will fail if someone inadvertently changes critical clustering parameters.


307-369: LGTM!

Good coverage of cluster name determination logic:

  • Majority voting correctly selects "John" when it appears more frequently
  • None is returned when no faces have existing names

267-304: The patching strategy is correct as implemented.

The test properly patches CLUSTERING_QUALITY_FILTER_ENABLED and CLUSTERING_QUALITY_MIN_THRESHOLD at their definition and usage location (app.utils.face_clusters). Since these constants are read directly within the same module at runtime (lines 224 and 227 of cluster_util_cluster_all_face_embeddings()), the @patch decorators will correctly intercept the reads, allowing the real filter_quality_faces() function to execute with the patched threshold values.
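
In other words, the decorators target the module where the constants are read (test name and the 0.5 threshold below are illustrative):

    from unittest.mock import patch

    # Patch where the constants are *read*, not where they were defined/imported.
    @patch("app.utils.face_clusters.CLUSTERING_QUALITY_MIN_THRESHOLD", 0.5)
    @patch("app.utils.face_clusters.CLUSTERING_QUALITY_FILTER_ENABLED", True)
    def test_quality_filter_applies():
        # filter_quality_faces() runs unmocked, against the patched threshold,
        # so faces scoring below 0.5 are dropped before clustering.
        ...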

backend/tests/test_face_quality.py (6)

1-19: LGTM!

Well-organized test module with clear imports and class-based test organization. The tests provide good coverage for the face quality assessment utilities.


21-55: LGTM!

Effective sharpness tests using:

  • Checkerboard pattern for high sharpness (high edge content)
  • Uniform gray image for low sharpness (no edges)
  • Grayscale input handling verification

57-86: LGTM!

Brightness tests appropriately cover the full range: optimal (128), very dark (20), and overexposed (240).


88-122: LGTM!

Size assessment tests cover important cases including rectangular images using the minimum dimension.


124-180: LGTM!

Comprehensive tests for the composite quality function including:

  • Return structure validation
  • Score range validation
  • Custom weight support
  • Invalid weight error handling
  • High-quality face scenario

182-258: LGTM!

Good coverage of filtering utilities including:

  • Threshold boundary behavior (exact threshold is inclusive)
  • Empty list handling
  • All pass/fail scenarios
  • Missing quality key defaulting to 0
backend/app/utils/face_clusters.py (3)

35-73: LGTM! Well-documented configuration.

The clustering configuration is comprehensive with clear documentation explaining the rationale behind each parameter. Key highlights:

  • CLUSTERING_MIN_SAMPLES = 2 prevents bridge-point chaining (the core fix)
  • Post-merge threshold (0.28) is tighter than DBSCAN eps (0.35) for safety
  • Low quality threshold (0.15) is intentionally permissive

288-320: Post-merge logic is correct but has O(n²) complexity.

The nested loop correctly implements pairwise cluster merging based on mean embedding distance. The use of a used set prevents double-merging.

For very large numbers of clusters, consider optimizing with spatial indexing, though this is unlikely to be a bottleneck for typical face clustering scenarios.
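
The shape of that loop, condensed (assuming the module's _calculate_cosine_distance helper is in scope; the real code is in face_clusters.py):

    import numpy as np

    def post_merge(clusters: dict, threshold: float = 0.28) -> dict:
        items = list(clusters.items())  # {label: [faces with "embedding"]}
        merged, used = {}, set()
        for i, (label_i, faces_i) in enumerate(items):
            if label_i in used:
                continue
            mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
            merged[label_i] = list(faces_i)
            for j in range(i + 1, len(items)):
                label_j, faces_j = items[j]
                if label_j in used:
                    continue
                mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
                if _calculate_cosine_distance(mean_i, mean_j) < threshold:
                    merged[label_i].extend(faces_j)  # fold cluster j into i
                    used.add(label_j)
        return merged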


345-434: LGTM!

The incremental cluster assignment function correctly:

  • Uses quality-weighted cluster representatives
  • Applies quality filtering to unassigned faces
  • Enforces similarity threshold before assignment
  • Provides helpful debug logging for skipped faces
backend/app/database/faces.py (3)

57-98: LGTM!

The insert function correctly handles the new quality parameter with proper SQL parameter binding.


240-254: LGTM!

Good use of COALESCE(quality, 0.5) to handle NULL values gracefully, ensuring backward compatibility with existing data that lacks quality scores.


349-405: LGTM!

The function correctly returns quality scores alongside embeddings for quality-aware clustering operations. The grouping logic properly maintains separate quality lists per cluster.

backend/app/utils/clustering_advanced.py (4)

1-21: LGTM on module structure and imports.

The module docstring clearly explains the purpose and the critical min_samples >= 2 enforcement. Imports are appropriate for the clustering operations.


67-114: LGTM—min_samples >= 2 enforcement is correctly implemented.

The function properly normalizes embeddings, enforces the critical min_samples >= 2 constraint to prevent bridge-point chaining, and delegates epsilon selection appropriately.


117-161: LGTM—hierarchical clustering with complete linkage is a good conservative choice.

The complete linkage ensures all pairs within a cluster meet the distance threshold, aligning with the PR objective of preventing incorrect merges.


313-331: LGTM—backward compatibility alias and exports are well-defined.

The advanced_face_clustering alias maintains API compatibility, and __all__ correctly exports the public interface.

backend/app/utils/clustering_conservative.py (8)

1-23: LGTM—module docstring clearly articulates the conservative clustering philosophy.

The key principles are well-documented: strict thresholds, cluster validation, and cautious merging.


25-77: LGTM—epsilon selection logic is well-designed.

The adaptive percentile selection based on data spread (tight, moderate, wide) is a thoughtful approach that balances clustering quality across different datasets.
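
Schematically (the percentiles, spread cut-offs, and clamp bounds here are placeholders, not the module's exact values):

    import numpy as np

    def conservative_eps(pairwise: np.ndarray) -> float:
        spread = float(np.std(pairwise))
        if spread < 0.10:    # tight data: pick a stricter eps
            pct = 10
        elif spread < 0.20:  # moderate spread
            pct = 15
        else:                # wide spread: allow a looser eps
            pct = 20
        return float(np.clip(np.percentile(pairwise, pct), 0.20, 0.40))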


80-95: LGTM—cluster validation is straightforward and correct.


171-207: LGTM—fit_predict implements a well-structured pipeline.

The sequence of normalize → select epsilon → DBSCAN → validate/split → safe merge → relabel is logical and aligns with the two-phase clustering approach described in the PR objectives.


209-246: LGTM—cluster validation and splitting logic is correct.

The 0.8 multiplier on max_diameter for sub-cluster splitting (line 230) is a good heuristic to ensure tighter sub-clusters.


248-364: LGTM—safe merge implementation with Union-Find is well-designed.

Key strengths:

  • Union-Find with path compression handles transitive merges correctly
  • Multiple merge strategies (centroid distance, min pairwise + match ratio, high match ratio) address angle variation
  • Pre-merge validation (line 355) prevents creating oversized clusters

This directly addresses the PR objective of reuniting over-split clusters while maintaining safety.
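
The Union-Find core of that scheme, reduced to its essentials (a generic sketch, not the class's exact methods):

    def find(parent: dict, x: int) -> int:
        # Path compression: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(parent: dict, a: int, b: int) -> None:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra

    # parent = {label: label for label in cluster_labels}
    # union(parent, i, j) for each mergeable pair; find() then yields the final
    # label, so transitive merges (A~B, B~C => A,B,C together) fall out naturally.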


366-371: LGTM—relabeling is straightforward and correct.


374-406: LGTM—convenience function provides a clean API.

Parameters are well-documented and correctly forwarded to ConservativeFaceClustering.

Copilot AI left a comment

Pull request overview

This PR improves the face clustering algorithm to fix issue #793 where faces were incorrectly merged across different people (bridge-point chaining) and split for the same person across different angles.

Key Changes:

  • Implements two-phase clustering: DBSCAN with min_samples=2 to prevent bridge-point chaining, followed by post-cluster mean merge to reunite same-person splits
  • Adds face quality assessment module to filter low-quality face detections before clustering
  • Introduces conservative and hierarchical clustering algorithms as alternatives to standard DBSCAN

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
  • backend/app/utils/face_clusters.py: Core clustering logic with post-merge functionality and quality filtering integration
  • backend/app/utils/clustering_advanced.py: Main clustering entry point supporting multiple algorithms (DBSCAN, hierarchical, conservative)
  • backend/app/utils/clustering_conservative.py: Conservative clustering implementation with cluster validation and safe merging
  • backend/app/utils/face_quality.py: New module for assessing face quality based on sharpness, brightness, and size
  • backend/app/models/FaceDetector.py: Integration of quality assessment during face detection
  • backend/app/database/faces.py: Database schema and query updates to store and retrieve quality scores
  • backend/tests/test_clustering_algorithm.py: Comprehensive unit tests for clustering algorithms and post-merge functionality
  • backend/tests/test_face_quality.py: Unit tests for face quality assessment functions
  • backend/test.py: Test script parameter updates (contains bug with min_samples=1)
  • backend/requirements.txt: Added kneed dependency (currently unused)


psutil>=5.9.5
pytest-asyncio>=1.0.0
setuptools==66.1.1
kneed>=0.8.5
Copilot AI commented Dec 23, 2025

The 'kneed' package is added to requirements.txt but doesn't appear to be used anywhere in the codebase. If this dependency is not needed for the current implementation, it should be removed. If it's intended for future use, consider adding a comment explaining its purpose.

Suggested change
kneed>=0.8.5

validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
Copilot AI commented Dec 23, 2025

The docstring states "merge_threshold: Max centroid distance for merging (default: 0.30)" but the actual default value in the function signature is 0.40. The documentation should match the implementation.

Suggested change
merge_threshold: Max centroid distance for merging (default: 0.30)
merge_threshold: Max centroid distance for merging (default: 0.40)

Comment on lines +299 to +307
mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
merged[label_i] = list(faces_i) # Copy the list

for j in range(i + 1, len(cluster_items)):
label_j, faces_j = cluster_items[j]
if label_j in used:
continue

mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
Copilot AI commented Dec 23, 2025

The mean embeddings are calculated without normalization. Since cosine distance is being used (which is scale-invariant), the embeddings should be normalized before computing the mean. Otherwise, the mean of unnormalized embeddings may not be representative. Consider using the calculate_cluster_mean function from clustering_advanced.py which properly normalizes embeddings before and after computing the mean.

fixed_eps=eps,
distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,
linkage=HIERARCHICAL_LINKAGE,
density_refinement=VALIDATE_CLUSTERS,
Copilot AI commented Dec 23, 2025

The parameter 'density_refinement' is passed to cluster_faces but the conservative clustering algorithm expects 'validate_clusters'. The parameter name mismatch means this setting won't be applied when using the conservative algorithm. Update line 265 to pass 'validate_clusters=VALIDATE_CLUSTERS' or add 'density_refinement' as an alias in the cluster_faces function that maps to 'validate_clusters'.

Suggested change
density_refinement=VALIDATE_CLUSTERS,
validate_clusters=VALIDATE_CLUSTERS,

Comment on lines +150 to +157
linkage="complete", # Conservative: all pairs must be similar
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage="complete",
Copilot AI commented Dec 23, 2025

The 'linkage' parameter is passed from face_clusters.py but is ignored in cluster_faces_hierarchical, which always uses "complete" linkage. If complete linkage is mandatory for correctness, remove the linkage parameter from the caller. Otherwise, update cluster_faces_hierarchical to accept and use a linkage parameter.

Suggested change
linkage="complete", # Conservative: all pairs must be similar
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage="complete",
linkage=linkage, # Conservative: all pairs must be similar by default
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage=linkage,

backend/test.py Outdated
Comment on lines 43 to 45
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
Copilot AI commented Dec 23, 2025

The comment states "min_samples=1 to include all faces" but the PR description and configuration explicitly set min_samples=2 to prevent bridge-point chaining, which is the core fix for issue #793. This test file uses min_samples=1 which reintroduces the exact bug being fixed. The parameters should match the production configuration: eps=0.35 and min_samples=2.

Suggested change
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
# Parameters aligned with production configuration to prevent bridge-point chaining
# eps=0.35, min_samples=2
dbscan = DBSCAN(eps=0.35, min_samples=2, metric="cosine")

Comment on lines 157 to 161
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
Copilot AI commented Dec 23, 2025

The docstring states "max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)" but the actual default value in the function signature is 0.60. The documentation should match the implementation.

Suggested change
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.60)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.40)

sub_clusters[label].append(face_indices[i])

return [np.array(indices) for indices in sub_clusters.values()]
except Exception:
Copilot AI commented Dec 23, 2025

Using a bare except clause catches all exceptions including SystemExit and KeyboardInterrupt, which can make debugging difficult. Consider catching specific exceptions (e.g., ValueError, RuntimeError) that AgglomerativeClustering might raise, or at minimum use "except Exception:" to avoid catching system exceptions.

Suggested change
except Exception:
except (ValueError, RuntimeError):

metric="cosine",
n_jobs=-1, # Use all available CPU cores
auto_eps=CLUSTERING_AUTO_EPSILON,
fixed_eps=eps,
Copilot AI commented Dec 23, 2025

The parameter 'fixed_eps' passed from face_clusters.py (line 262) is not used when algorithm="dbscan". The cluster_faces_dbscan function has an 'eps' parameter, but 'fixed_eps' is passed via **kwargs and ignored. The 'eps' parameter should be used directly instead of 'fixed_eps' in the call from face_clusters.py, or the cluster_faces function should explicitly map fixed_eps to eps before calling cluster_faces_dbscan.

Suggested change
fixed_eps=eps,
eps=eps,

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/tests/test_clustering_algorithm.py (2)

65-68: Consider testing the exact threshold value.

The test only verifies the threshold is conservative (< 0.35), but per the PR description, the intended value is 0.28. Consider either:

  • Testing the exact value: assert POST_MERGE_MEAN_DISTANCE_THRESHOLD == 0.28
  • Adding a comment explaining why a range check is preferred over an exact match

This helps catch unintended configuration drift.


184-232: Strengthen post-merge test with deterministic embeddings.

The test uses random embeddings and doesn't verify that clusters are actually merged (lines 228-229 acknowledge this). This means the test could pass even if post-merge logic is broken.

Consider using deterministic embeddings to reliably verify merge behavior:

🔎 Proposed fix
    def test_post_merge_combines_close_clusters(
        self, mock_filter, mock_cluster, mock_get_faces
    ):
        """Test that post-merge combines clusters with similar mean embeddings."""
-        # Create embeddings that will form 2 clusters with very similar means
-        base_embedding = np.random.rand(512)
+        # Create embeddings with known means that are close (distance < 0.28)
+        # Cluster 0 mean will be ~[1.0, 0, 0, ...]
+        # Cluster 1 mean will be ~[0.95, 0, 0, ...] (cosine distance ~0.05)
+        embedding_1 = np.array([1.0] + [0.0] * 511)
+        embedding_2 = np.array([1.0] + [0.0] * 511)
+        embedding_3 = np.array([0.95] + [0.0] * 511)
+        embedding_4 = np.array([0.95] + [0.0] * 511)
+        
        mock_get_faces.return_value = [
            {
                "face_id": 1,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_1.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 2,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_2.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 3,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_3.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 4,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_4.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
        ]
        mock_filter.return_value = mock_get_faces.return_value
        # DBSCAN creates 2 separate clusters
        mock_cluster.return_value = np.array([0, 0, 1, 1])

        results = cluster_util_cluster_all_face_embeddings()

-        # Post-merge should combine the 2 clusters into 1
-        # All 4 faces should have the same cluster_uuid
-        # Note: Due to random embeddings, clusters might or might not merge
-        # At minimum, we verify the function ran without errors and returned all faces
+        # Post-merge should combine the 2 clusters into 1
+        # All 4 faces should have the same cluster_uuid
        assert len(results) == 4
-        assert all(r.cluster_uuid is not None for r in results)
+        cluster_uuids = [r.cluster_uuid for r in results]
+        # Verify all faces merged into a single cluster
+        assert len(set(cluster_uuids)) == 1, "Expected all faces to be merged into one cluster"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51043fe and f17c161.

📒 Files selected for processing (1)
  • backend/tests/test_clustering_algorithm.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
  • cluster_util_cluster_all_face_embeddings (194-342)
  • _calculate_cosine_distance (437-451)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (8)
backend/tests/test_clustering_algorithm.py (8)

1-16: LGTM - Test module setup is clean.

The imports and module structure are well-organized. Testing the private _calculate_cosine_distance function directly is appropriate for unit testing.


25-52: LGTM - Comprehensive helper function tests.

The cosine distance tests cover all critical cases (identical, orthogonal, opposite, unnormalized) with appropriate assertions and numerical tolerances.


74-85: LGTM - Good edge case coverage.

Properly tests the empty database scenario.


86-109: LGTM - Correctly validates min_samples=2 behavior.

The test properly verifies that a single face becomes noise when min_samples=2, preventing bridge-point chaining as per PR objectives.


152-179: LGTM - Proper noise point handling.

Correctly validates that faces labeled as noise (-1) are excluded from results.


233-260: LGTM - Correctly validates post-merge disabled behavior.

Properly verifies that clusters remain separate when post-merge is disabled.


265-303: LGTM - Good integration test for quality filtering.

The test appropriately allows the real filter_quality_faces function to run while patching configuration, verifying that low-quality faces (0.3) are filtered before clustering while high-quality faces (0.8, 0.6) pass through.


305-368: LGTM - Cluster name determination tests are solid.

Both tests properly validate:

  1. Majority voting correctly selects "John" when it appears twice
  2. Cluster name remains None when no existing names are present

Comment on lines +119 to +124
embeddings = [
np.array([1.0] + [0.0] * 511), # Distinct embedding 1
np.array([0.9] + [0.1] * 511), # Similar to 1
np.array([0.0, 1.0] + [0.0] * 510), # Distinct embedding 2
np.array([0.1, 0.9] + [0.0] * 510), # Similar to 2
]
coderabbitai bot commented

⚠️ Potential issue | 🟡 Minor

Fix misleading comments about embedding similarity.

The comments state embeddings are "Similar to 1" and "Similar to 2", but the actual embeddings are not particularly similar:

  • [0.9] + [0.1] * 511 creates [0.9, 0.1, 0.1, ...] which differs significantly from [1.0, 0.0, 0.0, ...]
  • Similarly for embeddings 3 and 4

Since the clustering result is mocked (line 136), the actual similarity doesn't affect the test, but the comments should be updated to avoid confusion.

🔎 Suggested fix
        embeddings = [
-            np.array([1.0] + [0.0] * 511),  # Distinct embedding 1
-            np.array([0.9] + [0.1] * 511),  # Similar to 1
-            np.array([0.0, 1.0] + [0.0] * 510),  # Distinct embedding 2
-            np.array([0.1, 0.9] + [0.0] * 510),  # Similar to 2
+            np.array([1.0] + [0.0] * 511),  # Embedding for cluster 0
+            np.array([0.9] + [0.1] * 511),  # Embedding for cluster 0
+            np.array([0.0, 1.0] + [0.0] * 510),  # Embedding for cluster 1
+            np.array([0.1, 0.9] + [0.0] * 510),  # Embedding for cluster 1
        ]

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
backend/app/utils/clustering_conservative.py (2)

141-169: Parameters and documentation look good; consider making the min_samples override more explicit.

The docstring defaults now correctly match the actual parameter defaults (addressing past review comments). The initialization logic is sound.

Line 164 silently overrides min_samples to enforce a minimum of 2, which aligns with the PR's goal of preventing bridge-point chaining. While the comment documents this behavior, consider logging a warning when the override occurs to make it more explicit to callers.

🔎 Optional improvement: Add warning for min_samples override
+import logging
+
+logger = logging.getLogger(__name__)
+
 class ConservativeFaceClustering:
     def __init__(
         self,
         eps: float = 0.25,
         min_samples: int = 2,
         max_cluster_diameter: float = 0.60,
         validate_clusters: bool = True,
         auto_eps: bool = True,
         merge_close_clusters: bool = True,
         merge_threshold: float = 0.40,
     ):
         ...
         self.eps = eps
+        if min_samples < 2:
+            logger.warning(
+                f"min_samples={min_samples} is too low; enforcing min_samples=2 "
+                "to prevent bridge-point chaining"
+            )
         self.min_samples = max(min_samples, 2)  # Never allow 1

248-364: Well-designed merge logic; consider documenting the threshold multipliers.

The safe merge implementation is sophisticated and addresses the PR objective of reuniting same-person splits:

  • Union-Find enables transitive merges
  • Dual strategies (centroid + pairwise distance) handle angle variations
  • Match ratio check prevents spurious merges from a few outlier faces
  • Pre-merge validation ensures cluster quality

The hardcoded threshold multipliers (0.8, 0.6) and match ratio thresholds (0.3, 0.5) work well but could benefit from inline comments explaining their rationale or extraction as named constants for clarity.

🔎 Optional: Extract magic numbers as named constants
 class ConservativeFaceClustering:
+    # Merge strategy thresholds
+    _CENTROID_MERGE_FACTOR = 0.8  # Merge if centroids within 80% of threshold
+    _MIN_DIST_MERGE_FACTOR = 0.6  # Merge if any pair within 60% of threshold
+    _MIN_MATCH_RATIO = 0.3        # At least 30% cross-cluster matches required
+    _HIGH_MATCH_RATIO = 0.5       # High confidence merge threshold
+
     def _safe_merge_clusters(self, embeddings: NDArray, labels: NDArray) -> NDArray:
         ...
-                if centroid_dist < self.merge_threshold * 0.8:
+                if centroid_dist < self.merge_threshold * self._CENTROID_MERGE_FACTOR:
                     should_merge = True
-                elif min_dist < self.merge_threshold * 0.6 and match_ratio >= 0.3:
+                elif min_dist < self.merge_threshold * self._MIN_DIST_MERGE_FACTOR and match_ratio >= self._MIN_MATCH_RATIO:
                     should_merge = True
-                elif match_ratio >= 0.5 and min_dist < self.merge_threshold:
+                elif match_ratio >= self._HIGH_MATCH_RATIO and min_dist < self.merge_threshold:
                     should_merge = True
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f17c161 and 34130c1.

📒 Files selected for processing (2)
  • backend/app/utils/clustering_conservative.py
  • backend/test.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (5)
backend/test.py (1)

43-45: LGTM! Parameters now align with production configuration.

The updated DBSCAN parameters (eps=0.35, min_samples=2) correctly match the production configuration and address the past review comments. Using min_samples=2 prevents bridge-point chaining between distinct people, which is the core fix for issue #793.

backend/app/utils/clustering_conservative.py (4)

25-127: LGTM! Helper functions are well-implemented.

The utility functions are solid:

  • compute_pairwise_distances properly handles normalization with division-by-zero protection
  • select_conservative_epsilon uses adaptive selection with sensible fallbacks and clamping
  • validate_cluster provides straightforward diameter validation
  • split_loose_cluster appropriately uses except Exception: (addressing the past review concern about bare except)

171-207: LGTM! Well-structured clustering workflow.

The fit_predict method implements a clear, logical flow:

  1. Edge case handling for small datasets
  2. Normalization and epsilon selection
  3. Initial DBSCAN clustering
  4. Optional validation/splitting
  5. Optional safe merging
  6. Consecutive relabeling

The implementation aligns well with the PR's two-phase clustering approach.


209-246: LGTM! Validation and splitting logic is sound.

The cluster validation and splitting implementation correctly:

  • Validates cluster diameter against threshold
  • Splits loose clusters into tighter sub-clusters
  • Relabels sub-clusters appropriately
  • Marks undersized sub-clusters as noise

This addresses the PR objective of preventing distinct people from being merged.


366-406: LGTM! Clean utility functions.

The relabeling and convenience wrapper functions are straightforward and correctly implemented:

  • _relabel_consecutive properly maps clusters to consecutive integers while preserving noise label (-1)
  • cluster_conservative provides a clean functional interface to the clustering class

github-actions bot added the backend, bug ("Something isn't working"), and medium labels on Dec 24, 2025

Development

Successfully merging this pull request may close these issues.

BUG: Face Clustering Quality Issues
