Saturday, May 11, 2024

[Solved] prometheus-kube-prometheus-prometheus-rulefiles group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"


We recently hit an error with Prometheus, which is deployed using the Prometheus Operator on an EFS volume shared across multiple pods.

caller=group.go:104 level=error component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0/monitoring-kube-prometheus-kubernetes-resources.yaml group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"

caller=head.go:176 level=error component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 30172821


The logs were enough to identify the cause: the chunk data had become corrupted. When Prometheus restarts it replays the WAL, and startup fails when it reaches that particular chunk sequence. This is the primary cause of the failure.
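To check whether a pod is hitting the same failure, the two messages above can be filtered out of the Prometheus logs. This is a minimal sketch: the function name find_tsdb_errors is made up for illustration, and the pod and namespace in the usage comment are placeholders you would replace with your own.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus log lines on stdin and prints only
# the lines containing one of the two error messages quoted in this post.
# Exits non-zero (grep's behaviour) when neither message is present.
find_tsdb_errors() {
    grep -E 'TSDB not ready|out of sequence m-mapped chunk'
}

# Usage (placeholders, not real names):
#   kubectl logs <prometheus-pod> -n <namespace> -c prometheus | find_tsdb_errors
```

If the filter prints nothing, the crash loop probably has a different cause and the fix below may not apply.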

Solution :-

There is no simple way to recover from this kind of chunk corruption. Generally only about 2 hours of data is affected. Because the chunks are corrupted, some data loss is unavoidable, and the longer you delay, the more data you lose. The only solution in this case is to delete the WAL and then restart Prometheus. It will create a fresh WAL and will not replay the old chunks, so some data is lost (2 hours in our case), but it starts normally. If you don't do this, Prometheus will keep restarting and will never be able to replay the WAL completely. So we went ahead and deleted the WAL, and when we started Prometheus it worked perfectly fine.

Alternatively, you can move the data to a backup location instead of deleting it. Either way, make sure the WAL directory is completely empty before restarting.

rm -rf /prometheus/wal/
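If you would rather keep the old WAL than delete it, the move-to-backup alternative could be sketched as below. The function name backup_and_clear_wal and the backup path in the usage comment are assumptions for illustration; only /prometheus comes from the command above. It relies on Prometheus recreating the WAL directory when it starts.

```shell
#!/bin/sh
# Hypothetical helper: moves <data_dir>/wal into <backup_dir> instead of
# deleting it, so no old WAL is left for Prometheus to replay on start.
backup_and_clear_wal() {
    data_dir="$1"
    backup_dir="$2"

    # Nothing to do if there is no WAL directory.
    if [ ! -d "$data_dir/wal" ]; then
        echo "no WAL directory under $data_dir" >&2
        return 1
    fi

    # Refuse to overwrite or nest inside an earlier backup.
    if [ -e "$backup_dir/wal" ]; then
        echo "$backup_dir/wal already exists" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    mv "$data_dir/wal" "$backup_dir/wal"
}

# Usage (the backup path is a placeholder):
#   backup_and_clear_wal /prometheus /some/backup/location
```

The end state for Prometheus is the same as with rm -rf, but the old WAL files stay available for inspection.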

