Saturday, May 11, 2024

[Solved] Error 503 Service Unavailable on the Rolling Deployment of Service in EKS Cluster


Kubernetes, including EKS, uses the rolling update strategy by default to deploy applications with zero downtime. This generally works fine, but ahead of a recent major change we tested the application deployment in our staging environment before taking it to production, to find out whether there would be any downtime and then schedule the deployment accordingly.

We ran a curl request against the application every second while the deployment was in progress. The result was unusual: the application, written in Kotlin/Java, returned 503 Service Unavailable during the deployment. That was not expected, because we were using a rolling deployment and assumed zero downtime.
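The once-per-second check can be sketched roughly as below. This is a minimal sketch, not the exact script we used: the function names `is_downtime` and `watch_endpoint` and the example URL are hypothetical, and it assumes that any response outside 2xx/3xx (including curl's `000` when no connection could be made) counts as downtime.

```shell
# Return success (0) when the status code should count as downtime.
# Assumption: only 2xx and 3xx responses are considered healthy.
is_downtime() {
  case "$1" in
    2??|3??) return 1 ;;
    *) return 0 ;;
  esac
}

# Probe a URL once per second for a fixed number of attempts
# and print every status code, marking the failed ones.
watch_endpoint() {
  url="$1"
  attempts="$2"
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; 000 means no response
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url")
    if is_downtime "$code"; then
      failures=$((failures + 1))
      echo "$(date +%T) $code DOWN"
    else
      echo "$(date +%T) $code"
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "failures: $failures / $attempts"
}
```

Start it just before triggering the rollout, for example `watch_endpoint "https://staging.example.com/health" 300` (placeholder URL), and look for `DOWN` lines while the pods are being replaced.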


For an application to really update with zero downtime, it has to meet some requirements. To mention a few of them:

1. The application should handle graceful shutdown.
2. The application should implement readiness and liveness probes correctly.
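As an illustration of these two requirements, the snippet below writes a Deployment patch that adds probes, a preStop delay and a termination grace period. It is only a sketch under assumptions: the container name `my-app`, the `/healthz` path, port 8080 and all timing values are hypothetical and need to match your own application, and the preStop hook assumes the image ships a `sleep` binary.

```shell
# Write a hypothetical strategic-merge patch for a Deployment.
PATCH_FILE="${PATCH_FILE:-rolling-patch.yaml}"

cat > "$PATCH_FILE" <<'EOF'
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod before a new one is ready
      maxSurge: 1
  template:
    spec:
      # Must be longer than the preStop delay plus the app's own shutdown time
      terminationGracePeriodSeconds: 45
      containers:
        - name: my-app            # hypothetical container name
          readinessProbe:         # gates traffic to the pod
            httpGet:
              path: /healthz      # hypothetical health endpoint
              port: 8080
            periodSeconds: 5
          livenessProbe:          # restarts the container if it hangs
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the pod can be removed from endpoints first
                command: ["sleep", "10"]
EOF

echo "wrote $PATCH_FILE"
```

The file could then be applied with something like `kubectl patch deployment my-app --patch-file rolling-patch.yaml`, where `my-app` is again a placeholder for the Deployment name.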

Solution :-

In our case, further examination showed that graceful shutdown was missing for this particular service in the staging environment. More specifically, the ConfigMap for the application was missing, which caused the 503 Service Unavailable errors during testing. After we created the ConfigMap, the issue was resolved.

This gives a hint about what can go wrong. Taking these requirements into account is important for zero downtime.

[Solved] prometheus-kube-prometheus-prometheus-rulefiles group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"


We recently faced an error with Prometheus, which is deployed using the Prometheus Operator on an EFS volume shared across multiple pods.

caller=group.go:104 level=error component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0/monitoring-kube-prometheus-kubernetes-resources.yaml group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"

caller=head.go:176 level=error component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 30172821


The information in the log was sufficient: it shows that the chunk data got corrupted. When Prometheus restarts, it replays the WAL, and when it reaches that particular chunk sequence it fails. This is the primary cause of the failure.

Solution :-

There is no simple way to recover from this kind of chunk corruption. Generally the WAL holds only about 2 hours of data. Since the chunks are corrupted, you will lose some data, and the longer you wait, the more data you lose. The only solution in this case is to delete the WAL and then restart Prometheus. It creates a fresh WAL and does not replay the old chunks, so some data is lost (2 hours in our case), but it starts normally. If you don't do this, Prometheus keeps restarting and never manages to replay the WAL completely. So we went ahead and deleted the WAL, and when we started Prometheus again it worked perfectly fine.

Alternatively, you can first move all the data to a backup location if you want, but make sure this directory ends up completely empty:

rm -rf /prometheus/wal/
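If you would rather keep the old WAL for later inspection than delete it, the move to a backup location could look like the sketch below. The function name `backup_and_clear_wal` and the backup path are hypothetical, and it assumes Prometheus is not running while the directory is moved.

```shell
# Move the WAL directory into a timestamped backup folder.
# After this, the data directory no longer contains wal/, which has
# the same effect for Prometheus as the rm -rf above.
backup_and_clear_wal() {
  data_dir="$1"      # e.g. /prometheus
  backup_root="$2"   # hypothetical backup location
  if [ ! -d "$data_dir/wal" ]; then
    echo "no WAL directory in $data_dir" >&2
    return 1
  fi
  dest="$backup_root/wal-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$backup_root" || return 1
  mv "$data_dir/wal" "$dest" || return 1
  # Print where the old WAL ended up
  echo "$dest"
}
```

For example, `backup_and_clear_wal /prometheus /backup/prometheus` (the backup path is a placeholder) prints the location of the saved WAL. Restoring it will not fix the corruption; it only preserves the files.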

Saturday, May 4, 2024

[Solved] ERROR: Rancher must be ran with the --privileged flag when running outside of Kubernetes


While running the Rancher Docker container on an Ubuntu server, I saw the container crashing very frequently. Checking the logs showed the following error over and over:

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:latest
ERROR: Rancher must be ran with the --privileged flag when running outside of Kubernetes


When you install Rancher as a Docker container in a test environment, where you don't need identity verification using SSL, it is essential to pass the --privileged flag.

Solution :-

Run the following command to overcome the issue:

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher:latest