@SiddharthJiyani (Contributor) commented Dec 23, 2025

Problem

Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.

Solution

Implemented a two-phase clustering approach:

  1. DBSCAN with min_samples=2 - Prevents single faces from acting as bridges between clusters
  2. Post-cluster mean merge - Merges clusters with similar mean embeddings (threshold: 0.28) to fix same-person splits caused by pose/lighting variation
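
A condensed sketch of this flow, using the configuration values listed below (the cosine helper and merge loop are simplified stand-ins for the real face_clusters.py code, which tracks merged clusters rather than doing a single greedy pass):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def _cosine_distance(a, b):
        a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
        return 1 - float(np.dot(a, b))

    def two_phase_cluster(embeddings):
        # Phase 1: with min_samples=2, a single face can no longer act as a
        # bridge that chains two people's clusters together.
        labels = DBSCAN(eps=0.35, min_samples=2, metric="cosine").fit_predict(embeddings)

        # Phase 2: fold together clusters whose mean embeddings sit within
        # 0.28 cosine distance (same person split by pose/lighting).
        ids = sorted(set(labels) - {-1})
        means = {c: embeddings[labels == c].mean(axis=0) for c in ids}
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if _cosine_distance(means[a], means[b]) < 0.28:
                    labels[labels == b] = a  # single greedy pass for brevity
        return labels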

Changes

  • face_clusters.py: Added post-merge logic and _calculate_cosine_distance() helper
  • clustering_conservative.py: Code cleanup (removed unused functions)
  • clustering_advanced.py: Code cleanup (removed unused functions)
  • test_clustering_algorithm.py: Added 16 comprehensive unit tests

Configuration

CLUSTERING_ALGORITHM = "dbscan"
CLUSTERING_MIN_SAMPLES = 2  # Prevents bridge chaining
CLUSTERING_FIXED_EPSILON = 0.35
POST_MERGE_ENABLED = True
POST_MERGE_MEAN_DISTANCE_THRESHOLD = 0.28

Benefits

  • Reduces mis-clustered faces by filtering out poor-quality detections before clustering
  • Adaptive epsilon selection lets the pipeline adjust to different photo collections automatically
  • Quality-weighted cluster representatives are more robust to outlier faces

Demo

Before

face_clustering_before.mp4

After

  • To see the clustering results directly, please jump to 00:40.
face_clustering_after.mp4

Summary by CodeRabbit

  • New Features

    • Per-face quality scoring (sharpness, brightness, size) persisted with faces and exposed in clustering results; quality-driven filtering for clustering and assignments.
    • Multiple clustering strategies (conservative, DBSCAN, hierarchical) with adaptive parameter selection and improved cluster-merge options.
  • Improvements

    • Quality-aware cluster mean computation, assignment of unclustered faces, and post-merge merging to reduce mis-assignments; richer logging.
  • Tests

    • Comprehensive unit tests for clustering flows and face quality utilities.


Copilot AI review requested due to automatic review settings December 23, 2025 17:49
coderabbitai bot commented Dec 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds per-face quality scoring into detection, database, and clustering; introduces conservative, DBSCAN, and hierarchical clustering utilities with adaptive eps and post-merge logic; updates schema, insertion/retrieval, and tests to propagate and validate quality-aware behavior.

Changes

  • Face Quality Assessment & Tests (backend/app/utils/face_quality.py, backend/tests/test_face_quality.py):
    New module computing per-face quality (sharpness, brightness, size), helpers to filter/include faces, and comprehensive unit tests covering metrics, composition, thresholds, and edge cases.
  • Database Schema & Face Operations (backend/app/database/faces.py):
    Added quality REAL DEFAULT 0.5 column; extended FaceData type and select queries to include quality (COALESCE fallback); propagated quality through single and bulk insert functions and cluster-mean aggregation as quality_scores.
  • Face Detection Integration (backend/app/models/FaceDetector.py):
    Imported calculate_face_quality, computed per-face quality during detection, logged quality statistics, and passed the quality list into db_insert_face_embeddings_by_image_id.
  • Conservative Clustering Module (backend/app/utils/clustering_conservative.py):
    New ConservativeFaceClustering class and helpers: pairwise distances, adaptive epsilon selection, cluster validation/splitting, safe merge, relabeling; cluster_conservative wrapper.
  • Advanced Clustering Utilities (backend/app/utils/clustering_advanced.py):
    New dispatcher and algorithms: adaptive eps calculation, DBSCAN and hierarchical variants, unified cluster_faces entry point, cluster stats and mean calculation, and advanced_face_clustering alias.
  • Clustering Workflow & Assignments (backend/app/utils/face_clusters.py):
    Reworked pipeline to use advanced clustering options, optional quality filtering (config-gated), post-merge consolidation, quality-aware cluster representatives, and updated assignment logic/logging.
  • Tests, Dev Script & Requirements (backend/tests/test_clustering_algorithm.py, backend/test.py, backend/requirements.txt):
    New clustering test suite with mocks for post-merge and quality-filtering scenarios; adjusted local test DBSCAN param; added dependency kneed>=0.8.5.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Detector as FaceDetector
    participant Quality as FaceQualitySvc
    participant Clustering as ClusteringEngine
    participant DB as Database

    Detector->>Quality: send detected face images
    Quality->>Quality: compute sharpness, brightness, size -> quality scores
    Quality-->>Detector: per-face quality scores

    alt Quality filtering enabled
        Detector->>Detector: drop faces below threshold
    end

    Detector->>Clustering: embeddings + quality metadata
    Clustering->>Clustering: normalize embeddings
    opt auto-eps
        Clustering->>Clustering: calculate_adaptive_eps (k-NN)
    end
    Clustering->>Clustering: run algorithm (conservative/dbscan/hierarchical)
    Clustering->>Clustering: validate/split/merge clusters
    Clustering-->>Detector: labels, cluster_means, cluster_quality_scores

    alt Assign unclustered faces
        Detector->>Clustering: compute similarity to cluster means (quality-aware)
        Clustering-->>Detector: assignment decisions
    end

    Detector->>DB: insert/update faces with embedding, cluster_uuid, quality
    DB-->>DB: persist rows including `quality` and per-cluster `quality_scores`
    DB-->>Detector: ack/ids

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement

Pre-merge checks

✅ Passed checks (5 passed)

  • Title check ✅ Passed: The title clearly identifies the main change: improving the face clustering algorithm, which is the core focus of this PR addressing issue #793.
  • Linked Issues check ✅ Passed: The PR implements a two-phase clustering approach with DBSCAN and post-merge logic that directly addresses issue #793's requirements: preventing bridge-point chaining and preventing same-person splits through configurable thresholds.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to addressing issue #793's clustering quality problems. The new utility modules (face_quality.py, clustering_conservative.py, clustering_advanced.py) and modifications support the primary clustering improvement objectives.
  • Docstring Coverage ✅ Passed: Docstring coverage is 93.51%, which is sufficient. The required threshold is 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/database/faces.py (1)

36-50: Existing deployments require a schema migration to add the quality column.

The code uses COALESCE(quality, 0.5) in SELECT queries, which gracefully handles NULL values for new rows. However, CREATE TABLE IF NOT EXISTS will not modify existing tables, so databases from before this change won't have the quality column at all. Queries referencing this column will fail with "no such column" errors.

For existing deployments, add an ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5; migration before deploying this code. New deployments from scratch will not be affected.
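
A minimal guard along those lines, assuming the SQLite setup implied by CREATE TABLE IF NOT EXISTS (this snippet is illustrative, not part of the PR):

    import sqlite3

    def ensure_quality_column(conn: sqlite3.Connection) -> None:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = [row[1] for row in conn.execute("PRAGMA table_info(faces)")]
        if "quality" not in cols:
            conn.execute("ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5")
            conn.commit()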

🧹 Nitpick comments (7)
backend/tests/test_clustering_algorithm.py (1)

186-233: Test may be non-deterministic due to random embeddings.

The test uses np.random.rand(512) * 0.01 perturbations, but the comment acknowledges "clusters might or might not merge." This makes the test assertion weak—it only verifies the function runs without errors.

Consider using fixed embeddings with known distances to make the test deterministic and actually verify the merge behavior.

🔎 Suggested improvement for deterministic testing
     def test_post_merge_combines_close_clusters(
         self, mock_filter, mock_cluster, mock_get_faces
     ):
         """Test that post-merge combines clusters with similar mean embeddings."""
-        # Create embeddings that will form 2 clusters with very similar means
-        base_embedding = np.random.rand(512)
+        # Create deterministic embeddings with known distance
+        base_embedding = np.zeros(512)
+        base_embedding[0] = 1.0  # Unit vector in first dimension
         mock_get_faces.return_value = [
             {
                 "face_id": 1,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": base_embedding.tolist(),
                 "cluster_name": None,
                 "quality": 0.8,
             },
             # ... similar changes for other faces
         ]
backend/app/utils/face_clusters.py (1)

437-451: Potential division by zero if embedding has zero norm.

_calculate_cosine_distance divides by np.linalg.norm() without checking for zero-norm vectors. While face embeddings from a neural network should never be zero, a defensive guard would prevent potential nan results.

🔎 Suggested defensive guard
 def _calculate_cosine_distance(embedding_a: NDArray, embedding_b: NDArray) -> float:
     """
     Calculate cosine distance between two embeddings.
     ...
     """
-    norm_a = embedding_a / np.linalg.norm(embedding_a)
-    norm_b = embedding_b / np.linalg.norm(embedding_b)
+    norm_a_val = np.linalg.norm(embedding_a)
+    norm_b_val = np.linalg.norm(embedding_b)
+    if norm_a_val == 0 or norm_b_val == 0:
+        return 2.0  # Maximum distance for degenerate case
+    norm_a = embedding_a / norm_a_val
+    norm_b = embedding_b / norm_b_val
     similarity = np.dot(norm_a, norm_b)
     return 1 - similarity
backend/app/utils/clustering_advanced.py (4)

23-64: calculate_adaptive_eps is defined but unused within this module.

This function is never called internally—cluster_faces_dbscan uses select_conservative_epsilon from the conservative module instead (line 100). Consider:

  1. Using this function where auto_eps=True in cluster_faces_dbscan, or
  2. Removing it if select_conservative_epsilon is the preferred approach, or
  3. Documenting it as an alternative utility for external callers.

The current state may confuse maintainers about which epsilon calculation to use.


211-217: Redundant max(min_samples, 2) enforcement.

Line 215 applies max(min_samples, 2), but cluster_faces_dbscan already enforces this at line 104. While harmless, consider removing the redundancy here to keep the enforcement in one place.

🔎 Proposed fix
     elif algorithm == "dbscan":
         return cluster_faces_dbscan(
             embeddings,
             eps=eps,
-            min_samples=max(min_samples, 2),
+            min_samples=min_samples,
             auto_eps=auto_eps,
         )

267-269: float("inf") for single-element cluster density may cause downstream issues.

If these statistics are later aggregated (e.g., averaging densities), inf will propagate unexpectedly. Consider using None or np.nan to explicitly indicate "undefined" rather than infinite density.

🔎 Proposed fix
         else:
             cluster_diameters.append(0.0)
-            cluster_densities.append(float("inf"))
+            cluster_densities.append(np.nan)  # Undefined for single-element clusters

295-296: Hardcoded embedding dimension 512 is fragile.

If the embedding model changes (e.g., to 256 or 1024 dimensions), this will return an incorrectly sized array. Consider raising an error for empty input or accepting the dimension as a parameter.

🔎 Proposed fix
-def calculate_cluster_mean(embeddings: NDArray) -> NDArray:
+def calculate_cluster_mean(embeddings: NDArray, embedding_dim: int = 512) -> NDArray:
     """
     Calculate the mean embedding for a cluster.
 
     Args:
         embeddings: Face embeddings for a cluster
+        embedding_dim: Dimension of embeddings (used only for empty input fallback)
 
     Returns:
         Normalized mean embedding
     """
     if len(embeddings) == 0:
-        return np.zeros(512)  # Default embedding dimension
+        return np.zeros(embedding_dim)
backend/app/utils/clustering_conservative.py (1)

117-127: Broad except Exception silently swallows all errors.

This could hide genuine issues (e.g., memory errors, numpy bugs). Consider catching a more specific exception or at least logging the failure.

🔎 Proposed fix
+    import logging
+    logger = logging.getLogger(__name__)
+
     try:
         sub_labels = clustering.fit_predict(embeddings)
 
         # Group indices by sub-cluster
         sub_clusters = defaultdict(list)
         for i, label in enumerate(sub_labels):
             sub_clusters[label].append(face_indices[i])
 
         return [np.array(indices) for indices in sub_clusters.values()]
-    except Exception:
+    except ValueError as e:
+        # AgglomerativeClustering can raise ValueError for edge cases
+        logger.debug(f"Failed to split cluster: {e}")
         return [face_indices]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 51043fe.

📒 Files selected for processing (10)
  • backend/app/database/faces.py
  • backend/app/models/FaceDetector.py
  • backend/app/utils/clustering_advanced.py
  • backend/app/utils/clustering_conservative.py
  • backend/app/utils/face_clusters.py
  • backend/app/utils/face_quality.py
  • backend/requirements.txt
  • backend/test.py
  • backend/tests/test_clustering_algorithm.py
  • backend/tests/test_face_quality.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/tests/test_face_quality.py (1)
backend/app/utils/face_quality.py (6)
  • assess_face_sharpness (15-41)
  • assess_face_brightness (44-73)
  • assess_face_size (76-102)
  • calculate_face_quality (105-156)
  • should_include_face (159-170)
  • filter_quality_faces (173-184)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
  • cluster_util_cluster_all_face_embeddings (194-342)
  • _calculate_cosine_distance (437-451)
backend/app/utils/clustering_advanced.py (1)
backend/app/utils/clustering_conservative.py (3)
  • cluster_conservative (374-406)
  • select_conservative_epsilon (33-77)
  • fit_predict (171-207)
backend/app/models/FaceDetector.py (2)
backend/app/utils/face_quality.py (1)
  • calculate_face_quality (105-156)
backend/app/database/faces.py (1)
  • db_insert_face_embeddings_by_image_id (101-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Backend Tests
🔇 Additional comments (37)
backend/requirements.txt (1)

74-74: LGTM!

The kneed dependency is appropriately added to support adaptive epsilon calculation in the clustering utilities. The minimum version constraint >=0.8.5 allows flexibility for compatible updates.
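
For reference, one way kneed could drive that calculation is via the knee of the sorted k-NN distance curve (a sketch only; per the Copilot comment further down, the current code does not actually call kneed yet):

    import numpy as np
    from kneed import KneeLocator
    from sklearn.neighbors import NearestNeighbors

    def knee_eps(embeddings: np.ndarray, k: int = 2, fallback: float = 0.35) -> float:
        # n_neighbors=k+1 because each point is its own nearest neighbour.
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
        dists, _ = nn.kneighbors(embeddings)
        kth = np.sort(dists[:, -1])  # sorted distance to the k-th true neighbour
        knee = KneeLocator(range(len(kth)), kth, curve="convex", direction="increasing").knee
        return float(kth[knee]) if knee is not None else fallback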

backend/app/utils/face_quality.py (5)

15-41: LGTM with minor consideration for edge cases.

The sharpness assessment using Laplacian variance is a well-established technique. The normalization to 0-1 scale with 100 as the empirical threshold is reasonable.

Consider adding a guard for degenerate images (e.g., 0-dimensional or single-pixel images) that could cause issues with cv2.Laplacian or cv2.cvtColor. However, since this is called after face detection with padding, such edge cases are unlikely in practice.


44-73: LGTM!

The brightness assessment correctly evaluates how close the average brightness is to the optimal midpoint (128). The scoring formula ensures scores range from 0.0 (extreme brightness/darkness) to 1.0 (optimal).


76-102: LGTM!

The size assessment correctly uses the minimum dimension to handle non-square face crops and caps at 1.0 for faces meeting or exceeding the target size.


105-156: LGTM!

Well-designed composite quality function with:

  • Weight validation using np.isclose for floating-point safety
  • Comprehensive return dictionary for debugging and analysis
  • Sensible default weights emphasizing sharpness (0.4)
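
Putting those pieces together, a rough equivalent of the composite score (the 0.3/0.3 weights beyond the documented sharpness 0.4, and the 112 px target size, are assumptions; see face_quality.py for the actual values):

    import cv2
    import numpy as np

    def quality_sketch(face_bgr: np.ndarray, target_size: int = 112) -> float:
        gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
        # Sharpness: Laplacian variance, with 100 as the empirical "sharp" mark.
        sharpness = min(cv2.Laplacian(gray, cv2.CV_64F).var() / 100.0, 1.0)
        # Brightness: 1.0 at the 128 midpoint, falling to 0.0 at the extremes.
        brightness = 1.0 - abs(float(gray.mean()) - 128.0) / 128.0
        # Size: minimum dimension relative to target, capped at 1.0.
        size = min(min(face_bgr.shape[:2]) / target_size, 1.0)
        weights = (0.4, 0.3, 0.3)
        assert np.isclose(sum(weights), 1.0)  # mirrors the module's weight validation
        return weights[0] * sharpness + weights[1] * brightness + weights[2] * size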

159-184: LGTM!

Simple and effective filtering utilities. The use of .get("quality", 0.0) in filter_quality_faces provides a sensible fallback for faces without quality scores.

backend/app/models/FaceDetector.py (3)

37-37: LGTM!

Good addition of the qualities list to track per-face quality scores alongside other face metadata.


53-65: LGTM!

Quality assessment is appropriately integrated into the face detection loop. The quality is calculated on the padded face crop (face_img), which provides consistent input for the quality metrics.

The debug logging of individual quality components will be valuable for troubleshooting clustering issues.


74-91: LGTM!

Good integration of quality scores into the database insertion path. The aggregate statistics logging (average quality and high-quality count) provides useful visibility into detection quality across images.

backend/tests/test_clustering_algorithm.py (4)

25-51: LGTM!

Comprehensive tests for _calculate_cosine_distance covering key cases:

  • Identical embeddings → distance 0
  • Orthogonal embeddings → distance 1
  • Opposite embeddings → distance 2
  • Unnormalized inputs handled correctly

57-68: LGTM!

Good practice to test configuration constants. These tests serve as documentation and will fail if someone inadvertently changes critical clustering parameters.


307-369: LGTM!

Good coverage of cluster name determination logic:

  • Majority voting correctly selects "John" when it appears more frequently
  • None is returned when no faces have existing names

267-304: The patching strategy is correct as implemented.

The test properly patches CLUSTERING_QUALITY_FILTER_ENABLED and CLUSTERING_QUALITY_MIN_THRESHOLD at their definition and usage location (app.utils.face_clusters). Since these constants are read directly within the same module at runtime (lines 224 and 227 of cluster_util_cluster_all_face_embeddings()), the @patch decorators will correctly intercept the reads, allowing the real filter_quality_faces() function to execute with the patched threshold values.
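
In other words, the decorators target the module where the constants are read (test name and the 0.5 threshold below are illustrative):

    from unittest.mock import patch

    # Patch where the constants are *read*, not where they were defined/imported.
    @patch("app.utils.face_clusters.CLUSTERING_QUALITY_MIN_THRESHOLD", 0.5)
    @patch("app.utils.face_clusters.CLUSTERING_QUALITY_FILTER_ENABLED", True)
    def test_quality_filter_applies():
        # filter_quality_faces() runs unmocked, against the patched threshold,
        # so faces scoring below 0.5 are dropped before clustering.
        ...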

backend/tests/test_face_quality.py (6)

1-19: LGTM!

Well-organized test module with clear imports and class-based test organization. The tests provide good coverage for the face quality assessment utilities.


21-55: LGTM!

Effective sharpness tests using:

  • Checkerboard pattern for high sharpness (high edge content)
  • Uniform gray image for low sharpness (no edges)
  • Grayscale input handling verification

57-86: LGTM!

Brightness tests appropriately cover the full range: optimal (128), very dark (20), and overexposed (240).


88-122: LGTM!

Size assessment tests cover important cases including rectangular images using the minimum dimension.


124-180: LGTM!

Comprehensive tests for the composite quality function including:

  • Return structure validation
  • Score range validation
  • Custom weight support
  • Invalid weight error handling
  • High-quality face scenario

182-258: LGTM!

Good coverage of filtering utilities including:

  • Threshold boundary behavior (exact threshold is inclusive)
  • Empty list handling
  • All pass/fail scenarios
  • Missing quality key defaulting to 0
backend/app/utils/face_clusters.py (3)

35-73: LGTM! Well-documented configuration.

The clustering configuration is comprehensive with clear documentation explaining the rationale behind each parameter. Key highlights:

  • CLUSTERING_MIN_SAMPLES = 2 prevents bridge-point chaining (the core fix)
  • Post-merge threshold (0.28) is tighter than DBSCAN eps (0.35) for safety
  • Low quality threshold (0.15) is intentionally permissive

288-320: Post-merge logic is correct but has O(n²) complexity.

The nested loop correctly implements pairwise cluster merging based on mean embedding distance. The use of a used set prevents double-merging.

For very large numbers of clusters, consider optimizing with spatial indexing, though this is unlikely to be a bottleneck for typical face clustering scenarios.
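
The shape of that loop, condensed (assuming the module's _calculate_cosine_distance helper is in scope; the real code is in face_clusters.py):

    import numpy as np

    def post_merge(clusters: dict, threshold: float = 0.28) -> dict:
        items = list(clusters.items())  # {label: [faces with "embedding"]}
        merged, used = {}, set()
        for i, (label_i, faces_i) in enumerate(items):
            if label_i in used:
                continue
            mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
            merged[label_i] = list(faces_i)
            for j in range(i + 1, len(items)):
                label_j, faces_j = items[j]
                if label_j in used:
                    continue
                mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
                if _calculate_cosine_distance(mean_i, mean_j) < threshold:
                    merged[label_i].extend(faces_j)  # fold cluster j into i
                    used.add(label_j)
        return merged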


345-434: LGTM!

The incremental cluster assignment function correctly:

  • Uses quality-weighted cluster representatives
  • Applies quality filtering to unassigned faces
  • Enforces similarity threshold before assignment
  • Provides helpful debug logging for skipped faces
backend/app/database/faces.py (3)

57-98: LGTM!

The insert function correctly handles the new quality parameter with proper SQL parameter binding.


240-254: LGTM!

Good use of COALESCE(quality, 0.5) to handle NULL values gracefully, ensuring backward compatibility with existing data that lacks quality scores.


349-405: LGTM!

The function correctly returns quality scores alongside embeddings for quality-aware clustering operations. The grouping logic properly maintains separate quality lists per cluster.

backend/app/utils/clustering_advanced.py (4)

1-21: LGTM on module structure and imports.

The module docstring clearly explains the purpose and the critical min_samples >= 2 enforcement. Imports are appropriate for the clustering operations.


67-114: LGTM—min_samples >= 2 enforcement is correctly implemented.

The function properly normalizes embeddings, enforces the critical min_samples >= 2 constraint to prevent bridge-point chaining, and delegates epsilon selection appropriately.


117-161: LGTM—hierarchical clustering with complete linkage is a good conservative choice.

The complete linkage ensures all pairs within a cluster meet the distance threshold, aligning with the PR objective of preventing incorrect merges.


313-331: LGTM—backward compatibility alias and exports are well-defined.

The advanced_face_clustering alias maintains API compatibility, and __all__ correctly exports the public interface.

backend/app/utils/clustering_conservative.py (8)

1-23: LGTM—module docstring clearly articulates the conservative clustering philosophy.

The key principles are well-documented: strict thresholds, cluster validation, and cautious merging.


25-77: LGTM—epsilon selection logic is well-designed.

The adaptive percentile selection based on data spread (tight, moderate, wide) is a thoughtful approach that balances clustering quality across different datasets.
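
Schematically (the percentiles, spread cut-offs, and clamp bounds here are placeholders, not the module's exact values):

    import numpy as np

    def conservative_eps(pairwise: np.ndarray) -> float:
        spread = float(np.std(pairwise))
        if spread < 0.10:    # tight data: pick a stricter eps
            pct = 10
        elif spread < 0.20:  # moderate spread
            pct = 15
        else:                # wide spread: allow a looser eps
            pct = 20
        return float(np.clip(np.percentile(pairwise, pct), 0.20, 0.40))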


80-95: LGTM—cluster validation is straightforward and correct.


171-207: LGTM—fit_predict implements a well-structured pipeline.

The sequence of normalize → select epsilon → DBSCAN → validate/split → safe merge → relabel is logical and aligns with the two-phase clustering approach described in the PR objectives.


209-246: LGTM—cluster validation and splitting logic is correct.

The 0.8 multiplier on max_diameter for sub-cluster splitting (line 230) is a good heuristic to ensure tighter sub-clusters.


248-364: LGTM—safe merge implementation with Union-Find is well-designed.

Key strengths:

  • Union-Find with path compression handles transitive merges correctly
  • Multiple merge strategies (centroid distance, min pairwise + match ratio, high match ratio) address angle variation
  • Pre-merge validation (line 355) prevents creating oversized clusters

This directly addresses the PR objective of reuniting over-split clusters while maintaining safety.
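
The Union-Find core of that scheme, reduced to its essentials (a generic sketch, not the class's exact methods):

    def find(parent: dict, x: int) -> int:
        # Path compression: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(parent: dict, a: int, b: int) -> None:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra

    # parent = {label: label for label in cluster_labels}
    # union(parent, i, j) for each mergeable pair; find() then yields the final
    # label, so transitive merges (A~B, B~C => A,B,C together) fall out naturally.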


366-371: LGTM—relabeling is straightforward and correct.


374-406: LGTM—convenience function provides a clean API.

Parameters are well-documented and correctly forwarded to ConservativeFaceClustering.

Copilot AI left a comment

Pull request overview

This PR improves the face clustering algorithm to fix issue #793 where faces were incorrectly merged across different people (bridge-point chaining) and split for the same person across different angles.

Key Changes:

  • Implements two-phase clustering: DBSCAN with min_samples=2 to prevent bridge-point chaining, followed by post-cluster mean merge to reunite same-person splits
  • Adds face quality assessment module to filter low-quality face detections before clustering
  • Introduces conservative and hierarchical clustering algorithms as alternatives to standard DBSCAN

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
  • backend/app/utils/face_clusters.py: Core clustering logic with post-merge functionality and quality filtering integration
  • backend/app/utils/clustering_advanced.py: Main clustering entry point supporting multiple algorithms (DBSCAN, hierarchical, conservative)
  • backend/app/utils/clustering_conservative.py: Conservative clustering implementation with cluster validation and safe merging
  • backend/app/utils/face_quality.py: New module for assessing face quality based on sharpness, brightness, and size
  • backend/app/models/FaceDetector.py: Integration of quality assessment during face detection
  • backend/app/database/faces.py: Database schema and query updates to store and retrieve quality scores
  • backend/tests/test_clustering_algorithm.py: Comprehensive unit tests for clustering algorithms and post-merge functionality
  • backend/tests/test_face_quality.py: Unit tests for face quality assessment functions
  • backend/test.py: Test script parameter updates (contains bug with min_samples=1)
  • backend/requirements.txt: Added kneed dependency (currently unused)


psutil>=5.9.5
pytest-asyncio>=1.0.0
setuptools==66.1.1
kneed>=0.8.5
Copilot AI commented Dec 23, 2025

The 'kneed' package is added to requirements.txt but doesn't appear to be used anywhere in the codebase. If this dependency is not needed for the current implementation, it should be removed. If it's intended for future use, consider adding a comment explaining its purpose.

Suggested change
kneed>=0.8.5

validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
Copilot AI commented Dec 23, 2025

The docstring states "merge_threshold: Max centroid distance for merging (default: 0.30)" but the actual default value in the function signature is 0.40. The documentation should match the implementation.

Suggested change
merge_threshold: Max centroid distance for merging (default: 0.30)
merge_threshold: Max centroid distance for merging (default: 0.40)

Comment on lines +299 to +307
mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
merged[label_i] = list(faces_i) # Copy the list

for j in range(i + 1, len(cluster_items)):
label_j, faces_j = cluster_items[j]
if label_j in used:
continue

mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
Copilot AI commented Dec 23, 2025

The mean embeddings are calculated without normalization. Since cosine distance is being used (which is scale-invariant), the embeddings should be normalized before computing the mean. Otherwise, the mean of unnormalized embeddings may not be representative. Consider using the calculate_cluster_mean function from clustering_advanced.py which properly normalizes embeddings before and after computing the mean.

fixed_eps=eps,
distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,
linkage=HIERARCHICAL_LINKAGE,
density_refinement=VALIDATE_CLUSTERS,
Copilot AI commented Dec 23, 2025

The parameter 'density_refinement' is passed to cluster_faces but the conservative clustering algorithm expects 'validate_clusters'. The parameter name mismatch means this setting won't be applied when using the conservative algorithm. Update line 265 to pass 'validate_clusters=VALIDATE_CLUSTERS' or add 'density_refinement' as an alias in the cluster_faces function that maps to 'validate_clusters'.

Suggested change
density_refinement=VALIDATE_CLUSTERS,
validate_clusters=VALIDATE_CLUSTERS,

Comment on lines +150 to +157
linkage="complete", # Conservative: all pairs must be similar
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage="complete",
Copilot AI commented Dec 23, 2025

The 'linkage' parameter is passed from face_clusters.py but is ignored in cluster_faces_hierarchical, which always uses "complete" linkage. If complete linkage is mandatory for correctness, remove the linkage parameter from the caller. Otherwise, update cluster_faces_hierarchical to accept and use a linkage parameter.

Suggested change
linkage="complete", # Conservative: all pairs must be similar
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage="complete",
linkage=linkage, # Conservative: all pairs must be similar by default
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage=linkage,

backend/test.py Outdated
Comment on lines 43 to 45
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
Copilot AI commented Dec 23, 2025

The comment states "min_samples=1 to include all faces" but the PR description and configuration explicitly set min_samples=2 to prevent bridge-point chaining, which is the core fix for issue #793. This test file uses min_samples=1 which reintroduces the exact bug being fixed. The parameters should match the production configuration: eps=0.35 and min_samples=2.

Suggested change
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
dbscan = DBSCAN(eps=0.2, min_samples=1, metric="cosine")
# Parameters aligned with production configuration to prevent bridge-point chaining
# eps=0.35, min_samples=2
dbscan = DBSCAN(eps=0.35, min_samples=2, metric="cosine")

Comment on lines 157 to 161
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
Copilot AI commented Dec 23, 2025

The docstring states "max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)" but the actual default value in the function signature is 0.60. The documentation should match the implementation.

Suggested change
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.55)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.30)
max_cluster_diameter: Maximum allowed cluster diameter (default: 0.60)
validate_clusters: Whether to validate and split loose clusters
auto_eps: Auto-select conservative epsilon
merge_close_clusters: Whether to merge clusters that are clearly same person
merge_threshold: Max centroid distance for merging (default: 0.40)

sub_clusters[label].append(face_indices[i])

return [np.array(indices) for indices in sub_clusters.values()]
except Exception:
Copilot AI commented Dec 23, 2025

Using a bare except clause catches all exceptions including SystemExit and KeyboardInterrupt, which can make debugging difficult. Consider catching specific exceptions (e.g., ValueError, RuntimeError) that AgglomerativeClustering might raise, or at minimum use "except Exception:" to avoid catching system exceptions.

Suggested change
except Exception:
except (ValueError, RuntimeError):

metric="cosine",
n_jobs=-1, # Use all available CPU cores
auto_eps=CLUSTERING_AUTO_EPSILON,
fixed_eps=eps,
Copilot AI commented Dec 23, 2025

The parameter 'fixed_eps' passed from face_clusters.py (line 262) is not used when algorithm="dbscan". The cluster_faces_dbscan function has an 'eps' parameter, but 'fixed_eps' is passed via **kwargs and ignored. The 'eps' parameter should be used directly instead of 'fixed_eps' in the call from face_clusters.py, or the cluster_faces function should explicitly map fixed_eps to eps before calling cluster_faces_dbscan.

Suggested change
fixed_eps=eps,
eps=eps,

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/tests/test_clustering_algorithm.py (2)

65-68: Consider testing the exact threshold value.

The test only verifies the threshold is conservative (< 0.35), but per the PR description, the intended value is 0.28. Consider either:

  • Testing the exact value: assert POST_MERGE_MEAN_DISTANCE_THRESHOLD == 0.28
  • Adding a comment explaining why a range check is preferred over an exact match

This helps catch unintended configuration drift.


184-232: Strengthen post-merge test with deterministic embeddings.

The test uses random embeddings and doesn't verify that clusters are actually merged (lines 228-229 acknowledge this). This means the test could pass even if post-merge logic is broken.

Consider using deterministic embeddings to reliably verify merge behavior:

🔎 Proposed fix
    def test_post_merge_combines_close_clusters(
        self, mock_filter, mock_cluster, mock_get_faces
    ):
        """Test that post-merge combines clusters with similar mean embeddings."""
-        # Create embeddings that will form 2 clusters with very similar means
-        base_embedding = np.random.rand(512)
+        # Create embeddings with known means that are close (distance < 0.28)
+        # Cluster 0 mean will be ~[1.0, 0, 0, ...]
+        # Cluster 1 mean will be ~[0.95, 0, 0, ...] (cosine distance ~0.05)
+        embedding_1 = np.array([1.0] + [0.0] * 511)
+        embedding_2 = np.array([1.0] + [0.0] * 511)
+        embedding_3 = np.array([0.95] + [0.0] * 511)
+        embedding_4 = np.array([0.95] + [0.0] * 511)
+        
        mock_get_faces.return_value = [
            {
                "face_id": 1,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_1.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 2,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_2.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 3,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_3.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
            {
                "face_id": 4,
-                "embeddings": (base_embedding + np.random.rand(512) * 0.01).tolist(),
+                "embeddings": embedding_4.tolist(),
                "cluster_name": None,
                "quality": 0.8,
            },
        ]
        mock_filter.return_value = mock_get_faces.return_value
        # DBSCAN creates 2 separate clusters
        mock_cluster.return_value = np.array([0, 0, 1, 1])

        results = cluster_util_cluster_all_face_embeddings()

-        # Post-merge should combine the 2 clusters into 1
-        # All 4 faces should have the same cluster_uuid
-        # Note: Due to random embeddings, clusters might or might not merge
-        # At minimum, we verify the function ran without errors and returned all faces
+        # Post-merge should combine the 2 clusters into 1
+        # All 4 faces should have the same cluster_uuid
        assert len(results) == 4
-        assert all(r.cluster_uuid is not None for r in results)
+        cluster_uuids = [r.cluster_uuid for r in results]
+        # Verify all faces merged into a single cluster
+        assert len(set(cluster_uuids)) == 1, "Expected all faces to be merged into one cluster"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51043fe and f17c161.

📒 Files selected for processing (1)
  • backend/tests/test_clustering_algorithm.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/test_clustering_algorithm.py (1)
backend/app/utils/face_clusters.py (2)
  • cluster_util_cluster_all_face_embeddings (194-342)
  • _calculate_cosine_distance (437-451)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (8)
backend/tests/test_clustering_algorithm.py (8)

1-16: LGTM - Test module setup is clean.

The imports and module structure are well-organized. Testing the private _calculate_cosine_distance function directly is appropriate for unit testing.


25-52: LGTM - Comprehensive helper function tests.

The cosine distance tests cover all critical cases (identical, orthogonal, opposite, unnormalized) with appropriate assertions and numerical tolerances.


74-85: LGTM - Good edge case coverage.

Properly tests the empty database scenario.


86-109: LGTM - Correctly validates min_samples=2 behavior.

The test properly verifies that a single face becomes noise when min_samples=2, preventing bridge-point chaining as per PR objectives.


152-179: LGTM - Proper noise point handling.

Correctly validates that faces labeled as noise (-1) are excluded from results.


233-260: LGTM - Correctly validates post-merge disabled behavior.

Properly verifies that clusters remain separate when post-merge is disabled.


265-303: LGTM - Good integration test for quality filtering.

The test appropriately allows the real filter_quality_faces function to run while patching configuration, verifying that low-quality faces (0.3) are filtered before clustering while high-quality faces (0.8, 0.6) pass through.


305-368: LGTM - Cluster name determination tests are solid.

Both tests properly validate:

  1. Majority voting correctly selects "John" when it appears twice
  2. Cluster name remains None when no existing names are present

Comment on lines +119 to +124
embeddings = [
np.array([1.0] + [0.0] * 511), # Distinct embedding 1
np.array([0.9] + [0.1] * 511), # Similar to 1
np.array([0.0, 1.0] + [0.0] * 510), # Distinct embedding 2
np.array([0.1, 0.9] + [0.0] * 510), # Similar to 2
]
coderabbitai bot commented

⚠️ Potential issue | 🟡 Minor

Fix misleading comments about embedding similarity.

The comments state embeddings are "Similar to 1" and "Similar to 2", but the actual embeddings are not particularly similar:

  • [0.9] + [0.1] * 511 creates [0.9, 0.1, 0.1, ...] which differs significantly from [1.0, 0.0, 0.0, ...]
  • Similarly for embeddings 3 and 4

Since the clustering result is mocked (line 136), the actual similarity doesn't affect the test, but the comments should be updated to avoid confusion.

🔎 Suggested fix
        embeddings = [
-            np.array([1.0] + [0.0] * 511),  # Distinct embedding 1
-            np.array([0.9] + [0.1] * 511),  # Similar to 1
-            np.array([0.0, 1.0] + [0.0] * 510),  # Distinct embedding 2
-            np.array([0.1, 0.9] + [0.0] * 510),  # Similar to 2
+            np.array([1.0] + [0.0] * 511),  # Embedding for cluster 0
+            np.array([0.9] + [0.1] * 511),  # Embedding for cluster 0
+            np.array([0.0, 1.0] + [0.0] * 510),  # Embedding for cluster 1
+            np.array([0.1, 0.9] + [0.0] * 510),  # Embedding for cluster 1
        ]

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
backend/app/utils/clustering_conservative.py (2)

141-169: Parameters and documentation look good; consider making the min_samples override more explicit.

The docstring defaults now correctly match the actual parameter defaults (addressing past review comments). The initialization logic is sound.

Line 164 silently overrides min_samples to enforce a minimum of 2, which aligns with the PR's goal of preventing bridge-point chaining. While the comment documents this behavior, consider logging a warning when the override occurs to make it more explicit to callers.

🔎 Optional improvement: Add warning for min_samples override
+import logging
+
+logger = logging.getLogger(__name__)
+
 class ConservativeFaceClustering:
     def __init__(
         self,
         eps: float = 0.25,
         min_samples: int = 2,
         max_cluster_diameter: float = 0.60,
         validate_clusters: bool = True,
         auto_eps: bool = True,
         merge_close_clusters: bool = True,
         merge_threshold: float = 0.40,
     ):
         ...
         self.eps = eps
+        if min_samples < 2:
+            logger.warning(
+                f"min_samples={min_samples} is too low; enforcing min_samples=2 "
+                "to prevent bridge-point chaining"
+            )
         self.min_samples = max(min_samples, 2)  # Never allow 1

248-364: Well-designed merge logic; consider documenting the threshold multipliers.

The safe merge implementation is sophisticated and addresses the PR objective of reuniting same-person splits:

  • Union-Find enables transitive merges
  • Dual strategies (centroid + pairwise distance) handle angle variations
  • Match ratio check prevents spurious merges from a few outlier faces
  • Pre-merge validation ensures cluster quality

The hardcoded threshold multipliers (0.8, 0.6) and match ratio thresholds (0.3, 0.5) work well but could benefit from inline comments explaining their rationale or extraction as named constants for clarity.

🔎 Optional: Extract magic numbers as named constants
 class ConservativeFaceClustering:
+    # Merge strategy thresholds
+    _CENTROID_MERGE_FACTOR = 0.8  # Merge if centroids within 80% of threshold
+    _MIN_DIST_MERGE_FACTOR = 0.6  # Merge if any pair within 60% of threshold
+    _MIN_MATCH_RATIO = 0.3        # At least 30% cross-cluster matches required
+    _HIGH_MATCH_RATIO = 0.5       # High confidence merge threshold
+
     def _safe_merge_clusters(self, embeddings: NDArray, labels: NDArray) -> NDArray:
         ...
-                if centroid_dist < self.merge_threshold * 0.8:
+                if centroid_dist < self.merge_threshold * self._CENTROID_MERGE_FACTOR:
                     should_merge = True
-                elif min_dist < self.merge_threshold * 0.6 and match_ratio >= 0.3:
+                elif min_dist < self.merge_threshold * self._MIN_DIST_MERGE_FACTOR and match_ratio >= self._MIN_MATCH_RATIO:
                     should_merge = True
-                elif match_ratio >= 0.5 and min_dist < self.merge_threshold:
+                elif match_ratio >= self._HIGH_MATCH_RATIO and min_dist < self.merge_threshold:
                     should_merge = True
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f17c161 and 34130c1.

📒 Files selected for processing (2)
  • backend/app/utils/clustering_conservative.py
  • backend/test.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (5)
backend/test.py (1)

43-45: LGTM! Parameters now align with production configuration.

The updated DBSCAN parameters (eps=0.35, min_samples=2) correctly match the production configuration and address the past review comments. Using min_samples=2 prevents bridge-point chaining between distinct people, which is the core fix for issue #793.

backend/app/utils/clustering_conservative.py (4)

25-127: LGTM! Helper functions are well-implemented.

The utility functions are solid:

  • compute_pairwise_distances properly handles normalization with division-by-zero protection
  • select_conservative_epsilon uses adaptive selection with sensible fallbacks and clamping
  • validate_cluster provides straightforward diameter validation
  • split_loose_cluster appropriately uses except Exception: (addressing the past review concern about bare except)

171-207: LGTM! Well-structured clustering workflow.

The fit_predict method implements a clear, logical flow:

  1. Edge case handling for small datasets
  2. Normalization and epsilon selection
  3. Initial DBSCAN clustering
  4. Optional validation/splitting
  5. Optional safe merging
  6. Consecutive relabeling

The implementation aligns well with the PR's two-phase clustering approach.


209-246: LGTM! Validation and splitting logic is sound.

The cluster validation and splitting implementation correctly:

  • Validates cluster diameter against threshold
  • Splits loose clusters into tighter sub-clusters
  • Relabels sub-clusters appropriately
  • Marks undersized sub-clusters as noise

This addresses the PR objective of preventing distinct people from being merged.


366-406: LGTM! Clean utility functions.

The relabeling and convenience wrapper functions are straightforward and correctly implemented:

  • _relabel_consecutive properly maps clusters to consecutive integers while preserving noise label (-1)
  • cluster_conservative provides a clean functional interface to the clustering class

github-actions bot added the backend, bug ("Something isn't working"), and medium labels on Dec 24, 2025

Development

Successfully merging this pull request may close these issues.

BUG: Face Clustering Quality Issues
