receive: CPU Saturation #4270
-
But you can see this without gaps in Prometheus, yes?
-
There are no gaps in Prometheus. Do you know what could cause the error below?
-
I do not know. That error causes Prometheus to retry sending data though, so it's unlikely to be that. Could you try running a
-
I tried setting receive.replication-factor to 1, and the error never appeared again.
-
That's pretty dangerous though: it means that if any one instance is not working, your entire cluster is down.
-
I am experimenting with the master branch to see if there will be problems. Is there a way to take a node offline and have the cluster still work normally while receive.replication-factor is still 1?
-
Replication factor is exactly what allows us to have partial downtime in a cluster and still function, so obviously we need to ensure that it works reliably. I'll tentatively mark this as a bug while we continue to investigate the issue and hopefully resolve it :)
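As a reference point, here is a minimal sketch of what the replication factor buys you, based on the Thanos receive flags that appear in the manifests shared later in this thread; treat the comments as a rough description of the quorum behaviour, not an exact specification:

```yaml
# Sketch only: relevant receive flags and what they imply for availability.
# With a replication factor of 3, each series is fanned out to 3 receivers in
# the hashring and the write is acknowledged once a write quorum (2 of the 3
# copies) succeeds, so a single receiver can be offline without failing
# remote writes. With a replication factor of 1 there is no slack: any
# unreachable receiver fails writes for the series that hash to it.
- args:
    - receive
    - --receive.replication-factor=3
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
```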
-
I changed receive.replication-factor back to 3 and the problem reappeared. The first figure is from Thanos; it is missing data. The second figure is from Prometheus; its data is complete.
receive log
receive config
Maybe my last log excerpt was incomplete.
-
I'll use the master branch to verify the problem first. Is the current receive not suitable for production environments with very large data volumes?
-
The master branch still has the problem.
receive config:
receive log:
-
We are actively ingesting millions of active series and are not experiencing this, so I believe the culprit is somewhere else. I currently don't have time to look into this further, but maybe @kakkoyun @squat or @bwplotka do.
-
Experiencing exactly the same error with replication factor 3. @chaimch - any success on this?
-
Did someone find a fix for this?
-
I'm actually facing this issue with Thanos v0.16.0.
-
This error happens when you set the wrong
-
Same issue.
-
Same issue with Thanos v0.17.2.
-
For me the error stopped happening after upgrading to a build of master (image:
-
I have spent the whole afternoon today trying to figure this out, and this behavior seems to be an incompatibility with newer versions of Prometheus. The error below only happens with Prometheus 2.23 or higher.
I just downgraded my Prometheus pod to version v2.22.2 and everything looks fine now. I am running Thanos v0.17.2.
-
I also got this error on Prometheus 1.19.3 and Thanos 0.17.0.
-
I've been hit by this issue too. I've not fully gotten to the bottom of it and there could easily be multiple reasons for it. I've been testing out the Receiver in a local
Initially I thought the problem was fixed after I downgraded Prom from v2.24 to v2.15.2, however this may not have been the case, since the problem described above has since appeared again.
Where I have noticed a definite pattern is with resource allocation. Since I'm running the cluster locally I'm keeping things pretty lean. Nothing is OOMing though. If I give Receivers a limit of 256Mi memory and 100m CPU, then the steps I've outlined above reliably reproduce the issue. I suspect that because the CPU is so low it's probably the main culprit, but I've not confirmed that. When the receivers are too constrained they seem to take too long to join the ring, and then the entire ring becomes unstable. Sometimes killing all receivers at the same time, so that they all restart around the same time, sorts things out. I thought for a while that the live/ready probes were potentially playing a part too, but I've removed them from the picture and can replicate simply by constraining resources.
-
Thanks for those items. Looks like it's painful enough to give it more priority. 🤗
Yes, I doubt it has anything to do with the Prometheus version, except that maybe it is something to do with the remote write configuration (sharding, queues, bigger remote write batches, etc.). I would look at that.
That is waaaaaaay too low. Receiver is like Prometheus, just even more CPU-consuming because of the replication and forwarding. CPU saturation is a common problem, but it's expected if you want to cope with the load. There are a couple of things it would be amazing to clarify.
If saturation is the problem, make sure you:
Does that help?
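Since remote-write sharding and queueing are called out above as something worth looking at, here is a hedged sketch of the Prometheus `remote_write` queue settings that are usually tuned in this situation. The URL is a placeholder (it only mirrors the 19291 remote-write port used in the manifests in this thread) and the numbers are illustrative, not recommendations; size them against your own ingest rate:

```yaml
remote_write:
  - url: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive  # placeholder endpoint
    queue_config:
      min_shards: 1              # lower bound on parallel senders
      max_shards: 30             # cap parallelism so receivers are not overwhelmed
      max_samples_per_send: 500  # batch size per request
      capacity: 2500             # samples buffered per shard
      batch_send_deadline: 5s    # flush a partial batch after this long
      min_backoff: 30ms          # retry backoff when the receiver returns errors
      max_backoff: 5s
```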
-
Thanks for the feedback 👍 Yep, I fully intend to allocate a lot more resources. At the moment I'm really just trying to understand potential failure modes before attempting to move towards production. Increasing the replication factor is also something I'm intending to do.
One thing that is bothering me a bit, though, is that we've got HA Prom pairs writing to Thanos clusters, and this will end up meaning each series is replicated 4x. I understand that there is experimental support for deduplicating series with identical data points, but I'm worried it's maybe a little too bleeding edge at the moment. If this isn't done, though, there'll obviously be no guarantee that a single Receiver doesn't host all replicas for a particular series.
I've not spent time yet explicitly trying to cause data loss. I don't believe I'm seeing any when a Receiver goes offline - I haven't confirmed 100% yet, but so far it's been looking like (with a replication factor of 1) points are just queued up in the Proms and delivered whenever the Receiver ring becomes stable again.
Something I have noticed, though, is that perhaps 50% of the time after a Receiver gets a SIGTERM and tries to flush its block, it fails. It's not always the case, but much of the time the logs state it's because the block has been closed (or something like that) and it can't read it. The data is still on disk as far as the Receiver is concerned, so no data loss. Sorry, I'm probably getting too far off the point of this issue now :) *Edit
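On the HA-pair duplication point, the usual way the extra copies are collapsed is at query time via replica labels on Thanos Query, rather than in the Receiver itself. A hypothetical sketch, assuming the Prometheus pairs carry a `prometheus_replica` external label and the receivers add a `replica` label as in the StatefulSet shared later in this thread:

```yaml
# Thanos Query args (sketch): labels listed here are treated as replica
# markers and deduplicated away, so the HA Prometheus copies and the
# receive-side copies collapse into a single series in query results.
- args:
    - query
    - --query.replica-label=prometheus_replica  # assumed Prometheus external label
    - --query.replica-label=replica             # matches --label=replica on receive
```

This only changes what the querier returns; it does not reduce how many copies are stored.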
-
Just curious if anyone has seen this issue with 0.18.0? I'm going to give it a shot early next week.
-
Hi, I am using v0.18.0. I don't know if there is any mistake in my config file. The YAML file looks like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
app.kubernetes.io/version: v0.18.0
name: thanos-receive
namespace: thanos
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
serviceName: thanos-receive
template:
metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
app.kubernetes.io/version: v0.18.0
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-receive
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-receive
namespaces:
- thanos
topologyKey: kubernetes.io/hostname
weight: 100
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-receive
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-receive
namespaces:
- thanos
topologyKey: topology.kubernetes.io/zone
weight: 100
containers:
- args:
- receive
- --log.level=info
- --log.format=logfmt
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --remote-write.address=0.0.0.0:19291
- --receive.replication-factor=2
- --objstore.config=$(OBJSTORE_CONFIG)
- --tsdb.path=/var/thanos/receive
- --label=replica="$(NAME)"
- --label=receive="true"
- --tsdb.retention=15d
- --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
- --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
- |-
--tracing.config="config":
"sampler_param": 2
"sampler_type": "ratelimiting"
"service_name": "thanos-receive"
"type": "JAEGER"
env:
- name: NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
key: thanos.yaml
name: thanos-objectstorage
image: core.harbor.panda.com/thanos/thanos:v0.18.0
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: 10902
scheme: HTTP
periodSeconds: 30
name: thanos-receive
ports:
- containerPort: 10901
name: grpc
- containerPort: 10902
name: http
- containerPort: 19291
name: remote-write
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: 10902
scheme: HTTP
periodSeconds: 5
resources:
limits:
cpu: 2
memory: 1024Mi
requests:
cpu: 0.5
memory: 512Mi
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/thanos/receive
name: data
readOnly: false
- mountPath: /var/lib/thanos-receive
name: hashring-config
terminationGracePeriodSeconds: 900
volumes:
- configMap:
name: hashring
name: hashring-config
volumeClaimTemplates:
- metadata:
labels:
app.kubernetes.io/component: database-write-hashring
app.kubernetes.io/instance: thanos-receive
app.kubernetes.io/name: thanos-receive
name: data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
          storage: 10Gi

The hashring file looks like this:

{
  "hashrings.json": "[
    {
      \"hashring\": \"soft-tenants\",
      \"endpoints\": [
        \"thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901\",
        \"thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901\",
        \"thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901\"
      ]
    }
  ]"
}

And the receive log looks like this:

level=error ts=2021-02-01T02:46:59.386889118Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
-
Hey @yuzijiang718, as suggested above, have you tried allocating more resources?
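For anyone following along, a rough illustration of what "more resources" might look like for a receive container; the numbers below are purely illustrative assumptions and should be sized against your own active-series count and ingest rate:

```yaml
resources:
  requests:
    cpu: "2"        # receive is CPU-heavy: it ingests, replicates, and forwards
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi     # leave headroom for the TSDB head and block cutting
```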
-
Hello, I'm using the same configuration and I have enough memory and CPU (usage is always below the limits). Here is my config:
And here are the logs:
The readiness and liveness probes periodically fail in both environments (minikube and a real cluster).
-
I have an update on my experience with this issue that might help someone. When running Receivers within a vanilla k8s cluster I saw only the issues I'd mentioned in my previous comment, and generally speaking things are fine when Receivers have enough resources.
I recently tried Receivers in a cluster which had Istio installed, and the Receiver pods had Istio sidecars injected into them. The TL;DR is that I saw the same behaviour that was mentioned in the initial post in this thread, i.e. periods of data loss. It's still too early to say with 100% certainty, but today I have prevented Istio sidecars from being injected into the Receiver pods and so far things are working as expected. I don't know whether the OP was using a service mesh or not. At this point I can only speculate as to what was going on when Istio was in the picture. During this test almost all Prom remote-write settings were at their defaults. What I observed, however, was:
There may well be some things that can be done to keep Receivers inside a service mesh, maybe disabling retries and setting short connection timeouts. I've not tried these things. I suppose my point, though, is that there are issues that can happen if you're using one.
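For anyone who wants to take the mesh out of the picture the same way, the standard Istio mechanism is a pod-template annotation; a sketch of the relevant excerpt (other meshes have their own equivalents):

```yaml
# StatefulSet pod template (excerpt): opt the receive pods out of sidecar injection.
template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "false"
```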
-
I am facing this issue with Thanos v0.20.1 and Prometheus v2.11.0. In my case the Thanos receive pods end up using all the CPU available on the node and throw these errors. Has anyone managed to tackle the issue or found any workarounds? @kakkoyun
-
image: docker.io/bitnami/thanos:0.21.1-scratch-r2
normal:
abnormal:
How was it finally resolved?
-
Thanos, Prometheus and Golang version used:
Object Storage Provider:
What happened:

Lost data, please see below
What you expected to happen:
No data loss
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know:
Environment: