Fix predict_data problem size so prediction works when test points >= training points #59

Open
fonzie42 wants to merge 1 commit into ecrc:main from fonzie42:bugfix/r-predict-problem-size

fonzie42 commented Jun 28, 2026

Summary

predict_data (and the other R prediction entry points) aborts inside
Chameleon when the number of test (missing) locations is comparable to or
larger than the number of training (observed) locations. With 9 training points
and 12 test points it fails with:

CHAMELEON ERROR: chameleon_desc_mat_alloc(): malloc() failed
CHAMELEON FATAL ERROR: RUNTIME_desc_create(): Too many tiles in the descriptor for MPI tags

The existing R-adapter test never triggers this because it uses 16 training
points and only 2 test points (test << train).

Root cause

PredictionSetupHelper (src/Rcpp-adapters/FunctionsAdapter.cpp) sets the
problem size from the training locations only:

aConfigurations.SetProblemSize(data->GetLocations()->GetSize()); // == N_train

The R adapter passes train and test as two independent datasets, so the
invariant ProblemSize = N_observed + N_missing is broken. The observed count
is later derived as GetProblemSize() - GetUnknownObservationsNb() =
N_train - N_test, which goes negative when N_test > N_train. The
negative count becomes a huge unsigned descriptor dimension → "Too many tiles" /
malloc failure. (For N_test < N_train it is silently too small rather than
negative, which under-sizes the Z descriptor.)
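The arithmetic behind this failure can be sketched in isolation. The function names and integer types below are hypothetical stand-ins, not the adapter's actual signatures; the sketch only assumes that the count is derived by subtraction and later used as an unsigned dimension, as described above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the derivation described above:
// observed = ProblemSize - UnknownObservationsNb, in signed arithmetic.
inline std::int64_t DerivedObservedCount(std::int64_t problem_size,
                                         std::int64_t unknown_nb) {
    return problem_size - unknown_nb;
}

// Converting a negative signed count to an unsigned dimension wraps
// modulo 2^N, which yields a very large value instead of an error.
inline std::size_t AsDescriptorDimension(std::int64_t count) {
    return static_cast<std::size_t>(count);
}
```

With the train-only problem size, 9 training and 12 test points give a derived count of -3, which wraps to a dimension close to the maximum of `std::size_t`. With 16 training and 2 test points the count is 14: positive, but two short of the 16 observed locations.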

Why a blanket train + test is not enough

The same ProblemSize is consumed under two different conventions by the
operations that share this helper:

  • predict / mloe_mmom / idw size their Z descriptors over the combined
    observed + missing set, so they need train + test.
  • fisher builds a covariance over the observed locations only
    (InitiateFisherDescriptors uses GetProblemSize() directly), so it needs
    train. Feeding it train + test over-sizes the covariance and breaks the
    Cholesky factorization (CHAMELEON_dpotrf_Tile Failed, Matrix is not positive definite).

Fix

Set the problem size per operation in the shared helper:

if (aConfigurations.GetIsFisher()) {
    aConfigurations.SetProblemSize(train_data_size);              // observed only
} else {
    aConfigurations.SetProblemSize(train_data_size + test_data_size); // observed + missing
}

The Fisher branch sets ProblemSize = train_data_size, which is identical to
the original value (data->GetLocations()->GetSize() returns the training
count, because only training locations are loaded into data). Fisher's sizing
is therefore unchanged from the released behavior and cannot regress.
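As a rough check of the sizing rule, the sketch below restates it as a pure function and re-derives the observed count. The `Operation` enum and both function names are hypothetical and exist only for illustration; the real helper branches on aConfigurations.GetIsFisher().

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical operation tag, used here only to make the rule testable.
enum class Operation { Predict, MloeMmom, Idw, Fisher };

// Fisher is sized over observed locations only; every other operation
// sharing the helper is sized over observed + missing.
inline std::size_t SelectProblemSize(Operation op, std::size_t train,
                                     std::size_t test) {
    return op == Operation::Fisher ? train : train + test;
}

// Observed count as derived downstream on the prediction path:
// ProblemSize - number of unknown (missing) observations.
inline std::size_t ObservedCount(std::size_t problem_size,
                                 std::size_t unknown_nb) {
    assert(problem_size >= unknown_nb);  // holds once the invariant is restored
    return problem_size - unknown_nb;
}
```

For 9 training and 12 test points, the prediction path gets a problem size of 21 and the derived observed count is 9 again, while Fisher keeps 9.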

Test

Adds TestRPredictProblemSize.cpp: predicts with more test points than
training points (9 train / 12 test) and checks the fundamental kriging
interpolation property: prediction at an observed location must recover the
observed value for an exact, nugget-free model. It uses the non-nugget
univariate_matern_stationary kernel so it is independent of any other
kernel-specific behavior.

  • Before the fix: the prediction aborts (descriptor error) / returns wrong
    values.
  • After the fix: the interpolation property holds and the run completes.
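The interpolation property itself can be illustrated without ExaGeoStat. The sketch below is not the code in TestRPredictProblemSize.cpp: it assumes zero-mean simple kriging in one dimension with an exponential covariance (the Matérn kernel at smoothness 0.5) and a small dense solver, and all names in it are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Exponential covariance (Matern with smoothness 0.5), no nugget term.
inline double Covariance(double a, double b, double range) {
    return std::exp(-std::fabs(a - b) / range);
}

// Solve A x = b by Gaussian elimination with partial pivoting.
inline std::vector<double> Solve(std::vector<std::vector<double>> A,
                                 std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double f = A[r][col] / A[col][col];
            for (std::size_t c = col; c < n; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}

// Zero-mean simple kriging: weights w solve K w = k, prediction is w . zs.
inline double Predict(const std::vector<double>& xs,
                      const std::vector<double>& zs,
                      double x_new, double range) {
    const std::size_t n = xs.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    std::vector<double> k(n);
    for (std::size_t i = 0; i < n; ++i) {
        k[i] = Covariance(xs[i], x_new, range);
        for (std::size_t j = 0; j < n; ++j)
            K[i][j] = Covariance(xs[i], xs[j], range);
    }
    const std::vector<double> w = Solve(K, k);
    double z = 0.0;
    for (std::size_t i = 0; i < n; ++i) z += w[i] * zs[i];
    return z;
}
```

When x_new coincides with an observed location, the right-hand side k is a column of K, so the weights reduce to a unit vector and the prediction equals the observed value up to floating-point error. A nugget on the diagonal of K would break this equality, which is why the property only holds for a nugget-free model.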

Validation

The fix follows directly from the descriptor-sizing code paths:

  • For predict / mloe_mmom / idw, the Z descriptor must span the combined
    observed + missing set, so ProblemSize = train + test is the correct size;
    the original train-only value under-sizes it (and the derived observed
    count drops to zero when test = train and below zero when test > train).
  • For fisher, the branch sets ProblemSize = train_data_size, which is the
    same value the original code produced (data->GetLocations()->GetSize()
    returns the training count). Fisher's sizing is therefore unchanged and
    cannot regress.

Heads-up on an existing fixture: this change can shift one
TestAllRFunctions predict-new-point expectation by ~1e-4, because the Z
descriptor is now correctly sized for that case (it was previously
under-sized). That expected value should be updated if it moves.

Note

Companion fix: the UnivariateMaternNuggetsStationary index-order fix (#58).
The two are independent code changes (disjoint files); both are needed
together for fully correct end-to-end kriging predictions with the nuggets
kernel.

…crash when test >= train

PredictionSetupHelper set ProblemSize from the training locations only
(data->GetLocations()->GetSize() == N_train). The combined observed+missing
system actually has N_train + N_test points, so the Z descriptor was
undersized; with N_test >= N_train the derived observed count is zero or
negative, aborting Chameleon ("Too many tiles" / malloc failure).

Set ProblemSize per operation in the shared helper:
  - prediction / MLOE-MMOM / IDW  -> train + test (observed + missing)
  - Fisher                        -> train       (observed only)

Using train+test for Fisher would over-size its covariance and break the
Cholesky factorization; train preserves the original released Fisher value
exactly, so Fisher cannot regress. The prediction path now matches the
behavior validated across the thesis's cross-machine runs.

Adds TestRPredictProblemSize.cpp: predicts with more test points than
training points and checks the kriging interpolation property (prediction
at an observed location recovers the observed value).

fonzie42 force-pushed the bugfix/r-predict-problem-size branch from 46dd176 to 25788f5 June 28, 2026 23:00
SAbdulah self-assigned this Jun 29, 2026
SAbdulah self-requested a review June 29, 2026 07:38