Skip to content

Conversation

@yeya24
Copy link
Contributor

@yeya24 yeya24 commented Oct 1, 2025

We are testing xfunctions in our set up and here is one issue we hit when testing with super sparse metrics.

Here is the problem we got and I created a simple test to reproduce the problem and added to this PR.

  • For a sparse series, assume it has 2 samples. 1 ingested at 2025-09-07T06:47:37Z (let's call it t1) and another ingested at 2025-09-07T09:24:37Z (let's call it t2)
  • When xincrease only selects only 1 sample, I expect xincrease(http_requests_total[1m]) > 0 to have value at the steps when sparse metrics were ingested.
  • xincrease value is 0 for the sample at t2 if my query start time is before 2025-09-07T07:48:38Z. If my query start time is equal to or after 2025-09-07T07:48:38Z, xincrease shows the correct value.

2025-09-07T07:48:38Z is kind of the magic timestamp because it is t1 + 60m + 1m + 1s, where 60m is the ext lookback delta and 1m is the select range in http_requests_total[1m].

The issue in extendedRate is in this line. If the query start time is within [t1, t1 + ext lookback delta + select range], metricAppearedTs for the series is set to t1 according to this. Then if stepTime-offset <= until condition will not meet when evaluating another sample at a later timestamp.

	// This effectively injects a "zero" series for xincrease if we only have one sample.
	// Only do it for some time when the metric appears the first time.
	until := selectRange + metricAppearedTs
	if isCounter && !isRate && sameVals {
		// Make sure we are not at the end of the range.
		if stepTime-offset <= until {
			return samples[0].V.F, nil, nil
		}
	}

Proposed fix for this issue:

For each step, when there is only 1 sample selected within the select range, it is expected forxincrease to always return the sample value. It shouldn't be impacted by the start time of the query and if there is any sample close enough.

Current logic. Depends on first sample timestamp.

	until := selectRange + metricAppearedTs
	if isCounter && !isRate && sameVals {
		// Make sure we are not at the end of the range.
		if stepTime-offset <= until {
			return samples[0].V.F, nil, nil
		}
	}

Proposed logic, this doesn't rely on first sample timestamp. Also when there is only one sample within the select range, return its value. When there are more than 1 sample but they have the same value, return 0.

	if isCounter && !isRate && sameVals {
		var sampleInRange *Sample
		for _, sample := range samples {
			if sample.T >= rangeStart && sample.T <= rangeEnd {
				// More than one sample in range and same value, return 0.
				if sampleInRange != nil {
					return 0, nil, nil
				}
				sampleInRange = &sample
			}
		}
		if sampleInRange != nil {
			return sampleInRange.V.F, nil, nil
		}
	}

@alvinlin123
Copy link

@fpetkovski @GiedriusS would appreciate if you can take a look on this one. IMO I agree with Ben that for sparse samples, if we have a single sample within the user-given-time-range sample, then we should just return the sample value without considering the metricAppearedTs. It is pretty hard to understand from user perspective on why I see different result for xincrease() for sparse sample if I change my query start timestamp by a mere 1s.

@GiedriusS
Copy link
Member

I didn't work on this for a very long time and I don't remember all the details.

This function is "our own" so we can kind of do what we want. In the original proposal there were at least 3 different implementations IIRC. In my opinion, this either way is a heuristic and it is not guaranteed to be "correct" in all cases. As far as I remember, the extended Select() time range and this whole logic was exactly because we want to guess whether some metric existed in the past or not.

The key part to me is this:

to have value at the steps when sparse metrics were ingested.

How do you define "ingested"? The Prometheus team is also trying to solve this with the created timestamp functionality.

Your proposed change looks fine, I can't see anything wrong right now but I will think about cases where it might not work correctly.

Could you also elaborate a bit on how/why the way it works right now blocks some of your work?

Also, should we invest any more time into this when prometheus/prometheus#16457 modifiers are gaining traction in the Prometheus world? Do you see a future for this function & code?

@yeya24
Copy link
Contributor Author

yeya24 commented Oct 8, 2025

Thanks @GiedriusS. This behavior looks like a bug from our users' perspective. It is not really blocking us but we'd like to understand the right behavior.

How do you define "ingested"? The Prometheus team is also trying to solve this with the created timestamp functionality.

Yeah I think created timestamp seems the right way to go and at least it solves our usecase.

Also agree that we should try to support the new modifiers in the thanos engine to support this and align with the community with the right semantics. For xrate it is quite hard to actually define its behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants