SOLR-17955: OTEL metrics: SplitShardCmd.checkDiskSpace needs conversion#3859
SOLR-17955: OTEL metrics: SplitShardCmd.checkDiskSpace needs conversion#3859mlbiscoc merged 9 commits intoapache:mainfrom
Conversation
| if (indexSize == -1.0) { | ||
| log.warn("cannot verify information for parent shard leader"); | ||
| return; | ||
| } | ||
| double indexSize = size.doubleValue(); | ||
|
|
||
| Number freeSize = | ||
| (Number) rsp.getResponse()._get(List.of("metrics", freeDiskSpaceMetricName), null); | ||
| if (freeSize == null) { | ||
| if (freeSize == -1.0) { | ||
| log.warn("missing node disk space information for parent shard leader"); | ||
| return; | ||
| } |
There was a problem hiding this comment.
I'm thinking for here, this should throw instead of just return and continuing with the shard split cmd. Tests were passing because of this. Then we could at least know there was a regression. Was this just return for some specific purpose?
There was a problem hiding this comment.
At least assert but I'm +1 to just throw if you wish.
There was a problem hiding this comment.
As a general rule, never log AND throw. Pick one. Here, you choose to throw, so ditch the log. Logging is typically redundant with throwing, since throwing basically always gets logged as well.
|
|
||
| SolrResponse resp = req.process(cloudManager.getSolrClient()); | ||
|
|
||
| double[] sizes = new double[] {-1.0, -1.0}; // [indexSize, freeSize] |
There was a problem hiding this comment.
It was kind of a work around solution for the lambda filter. I can move off of it if you think it's too messy.
There was a problem hiding this comment.
it is too messy. In this case, use two AtomicDoubles
| if (indexSize == -1.0) { | ||
| log.warn("cannot verify information for parent shard leader"); | ||
| return; | ||
| } | ||
| double indexSize = size.doubleValue(); | ||
|
|
||
| Number freeSize = | ||
| (Number) rsp.getResponse()._get(List.of("metrics", freeDiskSpaceMetricName), null); | ||
| if (freeSize == null) { | ||
| if (freeSize == -1.0) { | ||
| log.warn("missing node disk space information for parent shard leader"); | ||
| return; | ||
| } |
There was a problem hiding this comment.
At least assert but I'm +1 to just throw if you wish.
| SolrRequest.METHOD.GET, "/admin/metrics", SolrRequest.SolrRequestType.ADMIN, params); | ||
| req.setResponseParser(new InputStreamResponseParser("prometheus")); | ||
|
|
||
| SolrResponse resp = req.process(cloudManager.getSolrClient()); |
There was a problem hiding this comment.
this is going to talk to some arbitrary node, not necessarily the node that has the specific core we are going to split. To do that, you can get the URL from the Replica, truncate to the node, and use Http2SolrClient.requestWithBaseUrl
dsmiley
left a comment
There was a problem hiding this comment.
The exceptions make sense actually since Solr provides an option to disable the size check. So a user can disable them if they like.
Basically approving but a couple minor improvements. And needs a changelog.
|
|
||
| SolrResponse resp = req.process(cloudManager.getSolrClient()); | ||
| var cloudClient = (CloudHttp2SolrClient) cloudManager.getSolrClient(); | ||
| var http2Client = (Http2SolrClient) cloudClient.getHttpClient(); |
There was a problem hiding this comment.
after you sync from main, please rename this var without the 2. IMO we shouldn't have this number 2 in our client var names
|
Made last few changes on 46149ca that I forgot to do. |
dsmiley
left a comment
There was a problem hiding this comment.
Just a minor improvement request but LGTM otherwise.
Thanks for fixing two issues in one here :-)
|
|
||
| var cloudClient = (CloudHttp2SolrClient) cloudManager.getSolrClient(); | ||
| var http2Client = (Http2SolrClient) cloudClient.getHttpClient(); | ||
| var httpClient = (HttpJettySolrClient) cloudClient.getHttpClient(); |
There was a problem hiding this comment.
I suspect you no longer even need to cast... but whatever.
|
|
||
| SolrResponse resp = | ||
| http2Client.requestWithBaseUrl(parentShardLeader.getBaseUrl(), req::process); | ||
| SolrResponse resp = httpClient.requestWithBaseUrl(parentShardLeader.getBaseUrl(), req::process); |
There was a problem hiding this comment.
I wouldn't be surprised if we deprecate the method you are calling. It's overloaded; the other one is defined on the base client, is used somewhat widely, and returns a NamedList. Looking at the code here, you only want a NamedList, not SolrResponse specifically.
| - name: Matthew Biscocho | ||
| - name: David Smiley | ||
| links: | ||
| - name: SOLR-17955 |
There was a problem hiding this comment.
maybe reference both JIRA issues that this addresses
There was a problem hiding this comment.
I created another changelog entry instead
|
@mlbiscoc this is waiting for > a month; missed 10.0 |
mlbiscoc
left a comment
There was a problem hiding this comment.
this is waiting for > a month; missed 10.0
Oof sorry! completely forgot about this...
Merged in main to get this back up to date. Will merge by EOD unless you still have concerns.
| - name: Matthew Biscocho | ||
| - name: David Smiley | ||
| links: | ||
| - name: SOLR-17955 |
There was a problem hiding this comment.
I created another changelog entry instead
…eeds conversion (#3859) Fixes split shard which was disabled after moving metrics to OTEL. Split shard now parses and reads prometheus metrics for disk space on the correct node for the split shard command. Also a minor change to throw when failing to get the index size and free size metrics.
https://issues.apache.org/jira/browse/SOLR-17955
Splitshard disk space check was broken when switching the /admin/metrics api to prometheus. This fixes that by reading the parsed prometheus output.