Fix predict_data problem size so prediction works when test points >= training points by fonzie42 · Pull Request #59 · ecrc/ExaGeoStatCPP

fonzie42 · 2026-06-28T22:52:31Z

Summary

predict_data (and the other R prediction entry points) aborts inside
Chameleon when the number of test (missing) locations is comparable to or
larger than the number of training (observed) locations. With 9 training points
and 12 test points it fails with:

CHAMELEON ERROR: chameleon_desc_mat_alloc(): malloc() failed
CHAMELEON FATAL ERROR: RUNTIME_desc_create(): Too many tiles in the descriptor for MPI tags

The existing R-adapter test never triggers this because it uses 16 training
points and only 2 test points (test << train).

Root cause

PredictionSetupHelper (src/Rcpp-adapters/FunctionsAdapter.cpp) sets the
problem size from the training locations only:

aConfigurations.SetProblemSize(data->GetLocations()->GetSize()); // == N_train

The R adapter passes train and test as two independent datasets, so the
invariant ProblemSize = N_observed + N_missing is broken. The observed count
is later derived as GetProblemSize() - GetUnknownObservationsNb() =
N_train - N_test, which goes negative when N_test > N_train. The
negative count becomes a huge unsigned descriptor dimension → "Too many tiles" /
malloc failure. (For N_test < N_train it is silently too small rather than
negative, which under-sizes the Z descriptor.)
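
The wraparound described above can be reproduced in isolation. The sketch below is not ExaGeoStatCPP code: `ObservedCount` and `AsDescriptorDimension` are hypothetical helpers, and the 64-bit integer types are an assumption about how the derived count ends up as a descriptor dimension.

```cpp
#include <cstdint>

// Hypothetical stand-in for the derived observed count,
// GetProblemSize() - GetUnknownObservationsNb(), in signed arithmetic.
inline long long ObservedCount(long long problem_size, long long unknown_count) {
    return problem_size - unknown_count;
}

// Hypothetical conversion to an unsigned dimension (assumed 64-bit).
// A negative input wraps modulo 2^64 and becomes a huge positive value.
inline std::uint64_t AsDescriptorDimension(long long count) {
    return static_cast<std::uint64_t>(count);
}
```

With the values from the summary, `ObservedCount(9, 12)` is -3, and `AsDescriptorDimension(-3)` is 2^64 - 3 = 18446744073709551613, a dimension no allocator can satisfy. With a problem size of 9 + 12 = 21, `ObservedCount(21, 12)` is 9 again.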

Why a blanket `train + test` is not enough

The same ProblemSize is consumed under two different conventions by the
operations that share this helper:

- predict / mloe_mmom / idw size their Z descriptors over the combined
  observed + missing set, so they need train + test.
- fisher builds a covariance over the observed locations only
  (InitiateFisherDescriptors uses GetProblemSize() directly), so it needs
  train. Feeding it train + test over-sizes the covariance and breaks the
  Cholesky factorization (CHAMELEON_dpotrf_Tile Failed, Matrix is not positive definite).

Fix

Set the problem size per operation in the shared helper:

if (aConfigurations.GetIsFisher()) {
    aConfigurations.SetProblemSize(train_data_size);              // observed only
} else {
    aConfigurations.SetProblemSize(train_data_size + test_data_size); // observed + missing
}

The Fisher branch sets ProblemSize = train_data_size, which is identical to
the original value (data->GetLocations()->GetSize() returns the training
count, because only training locations are loaded into data). Fisher's sizing
is therefore unchanged from the released behavior and cannot regress.
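
As a standalone illustration of the rule above, the selection can be written as a pure function and checked without Chameleon. This is a sketch under stated assumptions: the `Operation` enum, `SelectProblemSize` and `ObservedAfterFix` are hypothetical names introduced here, whereas the actual helper branches on `aConfigurations.GetIsFisher()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical operation tags for the entry points that share the helper.
enum class Operation { Predict, MloeMmom, Idw, Fisher };

// Fisher sizes its covariance over observed locations only;
// every other operation spans observed + missing.
inline std::size_t SelectProblemSize(Operation op, std::size_t train, std::size_t test) {
    if (op == Operation::Fisher) {
        return train;            // observed only
    }
    return train + test;         // observed + missing
}

// Observed count as derived downstream: problem size minus missing count.
// Only meaningful for the operations sized as train + test.
inline std::size_t ObservedAfterFix(std::size_t problem_size, std::size_t test) {
    assert(problem_size >= test);  // holds by construction for train + test
    return problem_size - test;
}
```

For the 9 train / 12 test case, `SelectProblemSize(Operation::Predict, 9, 12)` gives 21 and `ObservedAfterFix(21, 12)` gives 9, while `SelectProblemSize(Operation::Fisher, 9, 12)` stays at 9.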

Test

Adds TestRPredictProblemSize.cpp: predicts with more test points than
training points (9 train / 12 test) and checks the fundamental kriging
interpolation property: prediction at an observed location must recover the
observed value for an exact, nugget-free model. It uses the non-nugget
univariate_matern_stationary kernel so it is independent of any other
kernel-specific behavior.

Before the fix: the prediction aborts (descriptor error) / returns wrong
values.
After the fix: the interpolation property holds and the run completes.
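
The interpolation property itself can be illustrated outside the library. The following is a minimal sketch, not the code in TestRPredictProblemSize.cpp: it assumes zero-mean simple kriging, uses an exponential covariance (the Matérn kernel with smoothness 0.5) in place of univariate_matern_stationary, and solves the system with a small dense Gaussian elimination instead of Chameleon. All names (`Point`, `Covariance`, `Solve`, `Predict`) are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Exponential covariance (Matern with smoothness 0.5), no nugget term.
inline double Covariance(const Point &a, const Point &b, double variance, double range) {
    return variance * std::exp(-std::hypot(a.x - b.x, a.y - b.y) / range);
}

// Solve A w = b by Gaussian elimination with partial pivoting (works on copies).
inline std::vector<double> Solve(std::vector<std::vector<double>> A, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double factor = A[r][col] / A[col][col];
            for (std::size_t c = col; c < n; ++c) A[r][c] -= factor * A[col][c];
            b[r] -= factor * b[col];
        }
    }
    std::vector<double> w(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double sum = b[i];
        for (std::size_t c = i + 1; c < n; ++c) sum -= A[i][c] * w[c];
        w[i] = sum / A[i][i];
    }
    return w;
}

// Simple kriging predictor: k(target)^T K^{-1} z, with K the training covariance.
inline double Predict(const std::vector<Point> &train, const std::vector<double> &z,
                      const Point &target, double variance, double range) {
    const std::size_t n = train.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    std::vector<double> k(n);
    for (std::size_t i = 0; i < n; ++i) {
        k[i] = Covariance(train[i], target, variance, range);
        for (std::size_t j = 0; j < n; ++j)
            K[i][j] = Covariance(train[i], train[j], variance, range);
    }
    const std::vector<double> w = Solve(K, k);  // kriging weights
    double prediction = 0.0;
    for (std::size_t i = 0; i < n; ++i) prediction += w[i] * z[i];
    return prediction;
}
```

When the target coincides with training location i, the vector k equals column i of K, so the weights reduce to the i-th unit vector and the prediction equals z[i] up to rounding. That is the property the new test checks, and it holds only without a nugget, which is why a nugget-free kernel is used.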

Validation

The fix follows directly from the descriptor-sizing code paths:

- For predict / mloe_mmom / idw, the Z descriptor must span the combined
  observed + missing set, so ProblemSize = train + test is the correct size;
  the original train-only value under-sizes it, and the derived observed
  count train - test is zero when test == train and negative when
  test > train.
- For fisher, the branch sets ProblemSize = train_data_size, which is the
  same value the original code produced (data->GetLocations()->GetSize()
  returns the training count). Fisher's sizing is therefore unchanged and
  cannot regress.

Heads-up on an existing fixture: this change can shift one
TestAllRFunctions predict-new-point expectation by ~1e-4, because the Z
descriptor is now correctly sized for that case (it was previously
under-sized). That expected value should be updated if it moves.

Note

Companion fix: the UnivariateMaternNuggetsStationary index-order fix (#58). The two are independent code changes (disjoint files); both are needed
together for fully correct end-to-end kriging predictions with the nuggets
kernel.

…crash when test >= train

PredictionSetupHelper set ProblemSize from the training locations only (data->GetLocations()->GetSize() == N_train). The combined observed+missing system actually has N_train + N_test points, so the Z descriptor was undersized; with N_test >= N_train the derived observed count goes negative, aborting Chameleon ("Too many tiles" / malloc failure).

Set ProblemSize per operation in the shared helper:
- prediction / MLOE-MMOM / IDW -> train + test (observed + missing)
- Fisher -> train (observed only)

Using train+test for Fisher would over-size its covariance and break the Cholesky factorization; train preserves the original released Fisher value exactly, so Fisher cannot regress. The prediction path now matches the behavior validated across the thesis's cross-machine runs.

Adds TestRPredictProblemSize.cpp: predicts with more test points than training points and checks the kriging interpolation property (prediction at an observed location recovers the observed value).

fonzie42 mentioned this pull request Jun 28, 2026

Fix swapped location indices in UnivariateMaternNuggetsStationary covariance #58

Open

fonzie42 force-pushed the bugfix/r-predict-problem-size branch from 46dd176 to 25788f5 on June 28, 2026 23:00

SAbdulah self-assigned this Jun 29, 2026

SAbdulah self-requested a review June 29, 2026 07:38

fonzie42 wants to merge 1 commit into ecrc:main from fonzie42:bugfix/r-predict-problem-size
