Skip to content

Comments

SOLR-17955: OTEL metrics: SplitShardCmd.checkDiskSpace needs conversion#3859

Merged
mlbiscoc merged 9 commits intoapache:mainfrom
mlbiscoc:split-shard-disk
Jan 26, 2026
Merged

SOLR-17955: OTEL metrics: SplitShardCmd.checkDiskSpace needs conversion#3859
mlbiscoc merged 9 commits intoapache:mainfrom
mlbiscoc:split-shard-disk

Conversation

@mlbiscoc
Copy link
Contributor

https://issues.apache.org/jira/browse/SOLR-17955

Splitshard disk space check was broken when switching the /admin/metrics api to prometheus. This fixes that by reading the parsed prometheus output.

Comment on lines 901 to 909
if (indexSize == -1.0) {
log.warn("cannot verify information for parent shard leader");
return;
}
double indexSize = size.doubleValue();

Number freeSize =
(Number) rsp.getResponse()._get(List.of("metrics", freeDiskSpaceMetricName), null);
if (freeSize == null) {
if (freeSize == -1.0) {
log.warn("missing node disk space information for parent shard leader");
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking for here, this should throw instead of just return and continuing with the shard split cmd. Tests were passing because of this. Then we could at least know there was a regression. Was this just return for some specific purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least assert but I'm +1 to just throw if you wish.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule, never log AND throw. Pick one. Here, you choose to throw, so ditch the log. Logging is typically redundant with throwing, since throwing basically always gets logged as well.


SolrResponse resp = req.process(cloudManager.getSolrClient());

double[] sizes = new double[] {-1.0, -1.0}; // [indexSize, freeSize]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why an array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was kind of a work around solution for the lambda filter. I can move off of it if you think it's too messy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is too messy. In this case, use two AtomicDoubles

Comment on lines 901 to 909
if (indexSize == -1.0) {
log.warn("cannot verify information for parent shard leader");
return;
}
double indexSize = size.doubleValue();

Number freeSize =
(Number) rsp.getResponse()._get(List.of("metrics", freeDiskSpaceMetricName), null);
if (freeSize == null) {
if (freeSize == -1.0) {
log.warn("missing node disk space information for parent shard leader");
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least assert but I'm +1 to just throw if you wish.

SolrRequest.METHOD.GET, "/admin/metrics", SolrRequest.SolrRequestType.ADMIN, params);
req.setResponseParser(new InputStreamResponseParser("prometheus"));

SolrResponse resp = req.process(cloudManager.getSolrClient());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to talk to some arbitrary node, not necessarily the node that has the specific core we are going to split. To do that, you can get the URL from the Replica, truncate to the node, and use Http2SolrClient.requestWithBaseUrl

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exceptions make sense actually since Solr provides an option to disable the size check. So a user can disable them if they like.

Basically approving but a couple minor improvements. And needs a changelog.


SolrResponse resp = req.process(cloudManager.getSolrClient());
var cloudClient = (CloudHttp2SolrClient) cloudManager.getSolrClient();
var http2Client = (Http2SolrClient) cloudClient.getHttpClient();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after you sync from main, please rename this var without the 2. IMO we shouldn't have this number 2 in our client var names

@mlbiscoc
Copy link
Contributor Author

Made last few changes on 46149ca that I forgot to do.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor improvement request but LGTM otherwise.
Thanks for fixing two issues in one here :-)


var cloudClient = (CloudHttp2SolrClient) cloudManager.getSolrClient();
var http2Client = (Http2SolrClient) cloudClient.getHttpClient();
var httpClient = (HttpJettySolrClient) cloudClient.getHttpClient();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you no longer even need to cast... but whatever.


SolrResponse resp =
http2Client.requestWithBaseUrl(parentShardLeader.getBaseUrl(), req::process);
SolrResponse resp = httpClient.requestWithBaseUrl(parentShardLeader.getBaseUrl(), req::process);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be surprised if we deprecate the method you are calling. It's overloaded; the other one is defined on the base client, is used somewhat widely, and returns a NamedList. Looking at the code here, you only want a NamedList, not SolrResponse specifically.

- name: Matthew Biscocho
- name: David Smiley
links:
- name: SOLR-17955
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe reference both JIRA issues that this addresses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created another changelog entry instead

@dsmiley
Copy link
Contributor

dsmiley commented Jan 24, 2026

@mlbiscoc this is waiting for > a month; missed 10.0

Copy link
Contributor Author

@mlbiscoc mlbiscoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is waiting for > a month; missed 10.0

Oof sorry! completely forgot about this...

Merged in main to get this back up to date. Will merge by EOD unless you still have concerns.

- name: Matthew Biscocho
- name: David Smiley
links:
- name: SOLR-17955
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created another changelog entry instead

@mlbiscoc mlbiscoc merged commit 70d8caa into apache:main Jan 26, 2026
5 checks passed
mlbiscoc added a commit that referenced this pull request Jan 26, 2026
…eeds conversion (#3859)

Fixes split shard which was disabled after moving metrics to OTEL. Split shard now parses and reads prometheus metrics for disk space on the correct node for the split shard command. Also a minor change to throw when failing to get the index size and free size metrics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants