receive: CPU Saturation #4270
-
But you can see this without gaps in Prometheus, yes?
-
There are no gaps in Prometheus. Do you know what could cause the error below?
-
I do not know. That error causes Prometheus to retry sending data though, so it's unlikely to be that. Could you try running a
-
I tried setting receive.replication-factor to 1, and the error never appeared again.
-
That's pretty dangerous though: it means that if any one instance is not working, your entire cluster is down.
-
I am experimenting with the master branch to see if there will be problems. Is there a way to take a node offline and have the cluster still work normally while receive.replication-factor is still 1?
-
Replication factor is exactly what allows us to have partial downtime in a cluster and still function, so obviously we need to ensure that it works reliably. I'll tentatively mark this as a bug while we continue to investigate the issue and hopefully resolve it :)
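As a reference point, here is a minimal sketch of what the replication factor buys you, based on the Thanos receive flags that appear in the manifests shared later in this thread; treat the comments as a rough description of the quorum behaviour, not an exact specification:

```yaml
# Sketch only: relevant receive flags and what they imply for availability.
# With a replication factor of 3, each series is fanned out to 3 receivers in
# the hashring and the write is acknowledged once a write quorum (2 of the 3
# copies) succeeds, so a single receiver can be offline without failing
# remote writes. With a replication factor of 1 there is no slack: any
# unreachable receiver fails writes for the series that hash to it.
- args:
    - receive
    - --receive.replication-factor=3
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
```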
-
I changed receive.replication-factor back to 3 and the problem reappeared. The first figure is from Thanos; it is missing data. The second figure is from Prometheus; its data is complete.
receive log
receive config
Maybe my last log excerpt was incomplete.
-
I'll use the master branch to verify the problem first. Is the current receive not suitable for production environments with very large data volumes?
-
The master branch still has the problem.
receive config:
receive log:
-
We are actively ingesting millions of active series and are not experiencing this, so I believe the culprit is somewhere else. I currently don't have time to look into this further, but maybe @kakkoyun @squat or @bwplotka do.
-
Experiencing exactly the same error with replication factor 3. @chaimch - any success on this?
-
Did someone find a fix for this?
-
I'm actually facing this issue with Thanos v0.16.0.
-
This error happens when you set the wrong
-
Same issue.
-
Same issue with Thanos v0.17.2.
-
For me the error stopped happening after upgrading to a build of master (image:
-
I have spent the whole afternoon today trying to figure this out, and this behavior seems to be an incompatibility with newer versions of Prometheus. The error below only happens with Prometheus 2.23 or higher.
I just downgraded my Prometheus pod to version v2.22.2 and everything looks fine now. I am running Thanos v0.17.2.
-
I also got this error on Prometheus 1.19.3 and Thanos 0.17.0.
-
I've been hit by this issue too. I've not fully gotten to the bottom of it and there could easily be multiple reasons for it. I've been testing out the Receiver in a local
Initially I thought the problem was fixed after I downgraded Prom from v2.24 to v2.15.2, however this may not have been the case, since the problem described above has since appeared again.
Where I have noticed a definite pattern is with resource allocation. Since I'm running the cluster locally I'm keeping things pretty lean. Nothing is OOMing though. If I give Receivers a limit of 256Mi memory and 100m CPU, then the steps I've outlined above reliably reproduce the issue. I suspect that because the CPU is so low it's probably the main culprit, but I've not confirmed that. When the receivers are too constrained they seem to take too long to join the ring, and then the entire ring becomes unstable. Sometimes killing all receivers at the same time, so that they all restart around the same time, sorts things out. I thought for a while that the live/ready probes were potentially playing a part too, but I've removed them from the picture and can replicate simply by constraining resources.
-
Thanks for those items. Looks like it's painful enough to give it more priority. 🤗
Yes, I doubt it has anything to do with the Prometheus version, except that maybe it is something to do with the remote write configuration (sharding, queues, bigger remote write batches, etc.). I would look at that.
That is waaaaaaay too low. Receiver is like Prometheus, just even more CPU-consuming because of the replication and forwarding. CPU saturation is a common problem, but it's expected if you want to cope with the load. There are a couple of things it would be amazing to clarify.
If saturation is the problem, make sure you:
Does that help?
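Since remote-write sharding and queueing are called out above as something worth looking at, here is a hedged sketch of the Prometheus `remote_write` queue settings that are usually tuned in this situation. The URL is a placeholder (it only mirrors the 19291 remote-write port used in the manifests in this thread) and the numbers are illustrative, not recommendations; size them against your own ingest rate:

```yaml
remote_write:
  - url: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive  # placeholder endpoint
    queue_config:
      min_shards: 1              # lower bound on parallel senders
      max_shards: 30             # cap parallelism so receivers are not overwhelmed
      max_samples_per_send: 500  # batch size per request
      capacity: 2500             # samples buffered per shard
      batch_send_deadline: 5s    # flush a partial batch after this long
      min_backoff: 30ms          # retry backoff when the receiver returns errors
      max_backoff: 5s
```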
-
Thanks for the feedback 👍 Yep, I fully intend to allocate a lot more resources. At the moment I'm really just trying to understand potential failure modes before attempting to move towards production. Increasing the replication factor is also something I'm intending to do.
One thing that is bothering me a bit, though, is that we've got HA Prom pairs writing to Thanos clusters, and this will end up meaning each series is replicated 4x. I understand that there is experimental support for deduplicating series with identical data points, but I'm worried it's maybe a little too bleeding edge at the moment. If this isn't done, though, there'll obviously be no guarantee that a single Receiver doesn't host all replicas for a particular series.
I've not spent time yet explicitly trying to cause data loss. I don't believe I'm seeing any when a Receiver goes offline - I haven't confirmed 100% yet, but so far it's been looking like (with a replication factor of 1) points are just queued up in the Proms and delivered whenever the Receiver ring becomes stable again.
Something I have noticed, though, is that perhaps 50% of the time after a Receiver gets a SIGTERM and tries to flush its block, it fails. It's not always the case, but much of the time the logs state it's because the block has been closed (or something like that) and it can't read it. The data is still on disk as far as the Receiver is concerned, so no data loss. Sorry, I'm probably getting too far off the point of this issue now :) *Edit
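On the HA-pair duplication point, the usual way the extra copies are collapsed is at query time via replica labels on Thanos Query, rather than in the Receiver itself. A hypothetical sketch, assuming the Prometheus pairs carry a `prometheus_replica` external label and the receivers add a `replica` label as in the StatefulSet shared later in this thread:

```yaml
# Thanos Query args (sketch): labels listed here are treated as replica
# markers and deduplicated away, so the HA Prometheus copies and the
# receive-side copies collapse into a single series in query results.
- args:
    - query
    - --query.replica-label=prometheus_replica  # assumed Prometheus external label
    - --query.replica-label=replica             # matches --label=replica on receive
```

This only changes what the querier returns; it does not reduce how many copies are stored.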
-
Just curious if anyone has seen this issue with 0.18.0? I'm going to give it a shot early next week.
-
Hi, I am using v0.18.0. I don't know if there is any mistake in my config file. The YAML file looks like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
app.kubernetes.io/version: v0.18.0
name: thanos-receive
namespace: thanos
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
serviceName: thanos-receive
template:
metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
app.kubernetes.io/version: v0.18.0
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-receive
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-receive
namespaces:
- thanos
topologyKey: kubernetes.io/hostname
weight: 100
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-receive
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-receive
namespaces:
- thanos
topologyKey: topology.kubernetes.io/zone
weight: 100
containers:
- args:
- receive
- --log.level=info
- --log.format=logfmt
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --remote-write.address=0.0.0.0:19291
- --receive.replication-factor=2
- --objstore.config=$(OBJSTORE_CONFIG)
- --tsdb.path=/var/thanos/receive
- --label=replica="$(NAME)"
- --label=receive="true"
- --tsdb.retention=15d
- --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
- --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
- |-
--tracing.config="config":
"sampler_param": 2
"sampler_type": "ratelimiting"
"service_name": "thanos-receive"
"type": "JAEGER"
env:
- name: NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
key: thanos.yaml
name: thanos-objectstorage
image: core.harbor.panda.com/thanos/thanos:v0.18.0
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: 10902
scheme: HTTP
periodSeconds: 30
name: thanos-receive
ports:
- containerPort: 10901
name: grpc
- containerPort: 10902
name: http
- containerPort: 19291
name: remote-write
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: 10902
scheme: HTTP
periodSeconds: 5
resources:
limits:
cpu: 2
memory: 1024Mi
requests:
cpu: 0.5
memory: 512Mi
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/thanos/receive
name: data
readOnly: false
- mountPath: /var/lib/thanos-receive
name: hashring-config
terminationGracePeriodSeconds: 900
volumes:
- configMap:
name: hashring
name: hashring-config
volumeClaimTemplates:
- metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
name: data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
          storage: 10Gi

The hashring file looks like this:

{
  "hashrings.json": "[
    {
      \"hashring\": \"soft-tenants\",
      \"endpoints\": [
        \"thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901\",
        \"thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901\",
        \"thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901\"
      ]
    }
  ]"
}

And the receive log looks like this:

level=error ts=2021-02-01T02:46:59.386889118Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
-
Hey @yuzijiang718, as suggested above, have you tried allocating more resources?
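For anyone following along, a rough illustration of what "more resources" might look like for a receive container; the numbers below are purely illustrative assumptions and should be sized against your own active-series count and ingest rate:

```yaml
resources:
  requests:
    cpu: "2"        # receive is CPU-heavy: it ingests, replicates, and forwards
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi     # leave headroom for the TSDB head and block cutting
```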
-
Hello, I'm using the same configuration and I have enough memory and CPU (usage is always below the limits). Here is my config:
And here are the logs:
The readiness and liveness probes periodically fail in both environments (minikube and a real cluster).
-
I have an update on my experience with this issue that might help someone. When running Receivers within a vanilla k8s cluster I saw only the issues I'd mentioned in my previous comment, and generally speaking things are fine when Receivers have enough resources.
I recently tried Receivers in a cluster which had Istio installed, and the Receiver pods had Istio sidecars injected into them. The TL;DR is that I saw the same behaviour that was mentioned in the initial post in this thread, i.e. periods of data loss. It's still too early to say with 100% certainty, but today I have prevented Istio sidecars from being injected into the Receiver pods and so far things are working as expected. I don't know whether the OP was using a service mesh or not. At this point I can only speculate as to what was going on when Istio was in the picture. During this test almost all Prom remote-write settings were at their defaults. What I observed, however, was:
There may well be some things that can be done to keep Receivers inside a service mesh, maybe disabling retries and setting short connection timeouts. I've not tried these things. I suppose my point, though, is that there are issues that can happen if you're using one.
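For anyone who wants to take the mesh out of the picture the same way, the standard Istio mechanism is a pod-template annotation; a sketch of the relevant excerpt (other meshes have their own equivalents):

```yaml
# StatefulSet pod template (excerpt): opt the receive pods out of sidecar injection.
template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "false"
```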
-
I am facing this issue with Thanos v0.20.1 and Prometheus v2.11.0. In my case the Thanos receive pods end up using all the CPU available on the node and throw these errors. Has anyone managed to tackle the issue or found any workarounds? @kakkoyun
-
image: docker.io/bitnami/thanos:0.21.1-scratch-r2
normal:
abnormal:
How was it finally resolved?
-
Thanos, Prometheus and Golang version used:
Object Storage Provider:
What happened:

Lost data, please see below
What you expected to happen:
No data loss
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know:
Environment: