
Saturday, June 8, 2024


[Solved] Deleting Argocd Config makes the resource invisible but does not delete it completely and hangs in Argocd

 Error:-

Recently, while deprecating resources from Argocd, specifically the Ingress, we removed the ApplicationSet from Argocd, after which Argocd tried to remove the resources. However, while setting up the Ingress we had configured delete_protection.enabled=true, and this flag was never removed. At this point Argocd was not able to remove the resource because deletion protection was still enabled; instead it started showing the resource as Missing in the Argocd dashboard, while multiple events showed that the resource could not be deleted because protection was enabled.

Because we were also using Keda, the Keda validation errors spiked while the Ingress was stuck in deletion: Keda kept trying to create the resource but was not able to either.

Cause:-

The obvious cause of the error was that delete_protection.enabled should have been set to false before removing the AppSet, so that Argocd could delete the resource successfully. Since that was not done, Argocd hung in the deleting state but could not actually delete the resource because of the flag.
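For reference, if the Ingress is managed by the AWS Load Balancer Controller (an assumption, the post does not name the controller), this flag is typically driven by the load-balancer-attributes annotation on the Ingress manifest, where AWS spells the attribute deletion_protection.enabled. A minimal sketch of the manifest change that should have been committed and synced first (name and namespace are placeholders):

# Sketch only, assuming the AWS Load Balancer Controller manages this Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service-ingress        # placeholder name
  namespace: example-namespace         # placeholder namespace
  annotations:
    # Flip this to false and let Argocd sync it before removing the AppSet
    alb.ingress.kubernetes.io/load-balancer-attributes: deletion_protection.enabled=false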

At this point we tried to disable the flag from the AWS dashboard on the ALB behind the Ingress, which was successful; however, it did not cause any lasting change, since Argocd already had the config and would try to sync it back, and because multiple microservices reference the same Ingress it cannot be changed manually at the same time across all services.
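The same toggle done from the console can also be done with the AWS CLI; a minimal sketch (the load balancer ARN is a placeholder), keeping in mind that, exactly as above, Argocd will sync the value right back:

# Disable deletion protection on the ALB directly (placeholder ARN).
# Note: Argocd will revert this on its next sync, as described above.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account-id:loadbalancer/app/example/1234567890 \
  --attributes Key=deletion_protection.enabled,Value=false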

Even if you somehow change it manually, Argocd will revert the config, since it syncs back any drift to maintain the desired state. So the only option is to roll back the change, set delete_protection.enabled to false, and then delete again. But this was also not possible, because while Argocd is stuck trying to delete, it cannot sync in new changes; the previous sync has to complete before a new sync will happen.
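One possible way to get past the "previous sync must finish first" problem (an assumption on our part, not something described in this post) is to terminate the in-flight operation with the Argocd CLI so that a new sync or rollback can be queued:

# Terminate the stuck operation for the application (app name is a placeholder).
argocd app terminate-op example-app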

So the last viable option is to delete the Ingress using kubectl; however, in this case if you try to delete it, the delete hangs and never completes. So I tried to delete it forcefully with a grace period of 0, like

kubectl delete ingress <ingress-name> -n <namespace> --force --grace-period=0

However, even this did not finish and did not sort out the problem.

Monday, June 3, 2024

[Solved] Gitlab remote: ERROR: Your SSH key has expired.

 Error:-

We have been using Gitlab, and recently I got the following error while pushing my change to the Gitlab repository:


remote:
remote: ========================================================================
remote:
remote: ERROR: Your SSH key has expired.
remote:
remote: ========================================================================
remote:
fatal: Could not read from remote repository.


Cause:-

The reason for this failure was that I created the SSH key about a year ago, and Gitlab by default applies a security policy under which your SSH key expires after 1 year. This is built in to improve security in Gitlab; because that 1-year period had elapsed, it gives the error that the SSH key has expired.
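You can confirm that the failure is at the SSH authentication layer rather than in your local Git setup by testing the connection directly; a small check, assuming the gitlab.com SaaS host (swap in your self-managed host if needed):

# Test SSH authentication against Gitlab; an expired key is rejected here as well.
ssh -T git@gitlab.com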

Saturday, May 11, 2024

[Solved] Error 503 Service Unavailable on the Rolling Deployment of Service in EKS Cluster

 Error:-

Kubernetes, including EKS, uses the rolling deployment strategy by default to deploy applications with zero downtime. Generally this works perfectly fine, but during a recent major change we tested our application deployment in the staging environment before deploying it to production, to determine whether there would be any downtime and then arrange the deployment accordingly.

What we did was run a curl request at a 1-second interval, and what we observed was quite unusual: the application, written in Kotlin/Java, returned 503 Service Unavailable during the deployment, which was not expected, since we were using a rolling deployment and therefore expected zero downtime.
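A minimal sketch of such a 1-second curl probe (the URL is a placeholder for the staging endpoint):

# Hit the service once per second and print only the HTTP status code (placeholder URL).
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" https://staging.example.com/healthz
  sleep 1
done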


Cause:-

To deploy an application that truly updates with zero downtime, the application should meet some requirements. To mention a few of them (a minimal sketch follows the list):

1. the application should handle graceful shutdown
2. the application should implement readiness and liveness probes correctly
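A Deployment-level sketch of these two points; the names, image, port, and probe paths are assumptions, and the probe endpoints must actually exist in the application:

# Sketch only: placeholder names/paths; assumes the app serves /ready and /live on port 8080.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      terminationGracePeriodSeconds: 30   # give in-flight requests time to finish
      containers:
        - name: app
          image: example/app:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:                  # pod only receives traffic once this passes
            httpGet:
              path: /ready
              port: 8080
          livenessProbe:                   # restart the container if it stops responding
            httpGet:
              path: /live
              port: 8080
          lifecycle:
            preStop:                       # short delay so the endpoint is removed before shutdown begins
              exec:
                command: ["sh", "-c", "sleep 5"]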

Solution :-

In our case, on further examination we found that graceful shutdown was missing for this particular service in the staging environment. More specifically, the ConfigMap for this application was missing, which caused the 503 Service Unavailable issue during testing; after creating the ConfigMap the issue got resolved.

This gives a useful hint about what could be wrong; considering these things is important for achieving zero downtime.

[Solved] prometheus-kube-prometheus-prometheus-rulefiles group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"

 Error:-

We recently faced an error with Prometheus, which is deployed using the Prometheus Operator on an EFS volume shared across multiple pods.


caller=group.go:104 level=error component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0/monitoring-kube-prometheus-kubernetes-resources.yaml group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"

caller=head.go:176 level=error component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 30172821


Cause:-

The information relayed in the log is sufficient: it shows that the chunk data got corrupted, and when Prometheus restarts it replays the WAL and fails once it reaches that particular chunk sequence. This is the primary cause of the failure.


Solution :-

For the above chunk data corruption there is no simple way of recovering. The WAL generally holds only about 2 hours of data. Since the chunks and data are corrupted, you will have some data loss, and the longer you delay, the greater the loss will be. The only solution in this case is to delete the WAL and then restart Prometheus: it will create a fresh WAL, it won't replay the old chunks, so some data loss will occur (in our case about 2 hours), and it will start normally that way. If you don't do this, Prometheus will keep restarting and will never be able to replay the WAL completely. So we went ahead, deleted the WAL, and when we started Prometheus it worked perfectly fine.

Alternatively, you can move the data to a backup location if you want to keep it; just make sure this directory ends up completely empty:

rm -rf /prometheus/wal/
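If you prefer the backup route mentioned above, a minimal sketch (the backup path, namespace, and pod name are placeholders; it assumes you are either inside the Prometheus container or running kubectl against the cluster):

# Move the WAL aside instead of deleting it outright (placeholder backup path).
mv /prometheus/wal /prometheus/wal.bak-$(date +%F)

# Or run the same move inside the Prometheus pod (placeholder namespace/pod name).
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- mv /prometheus/wal /prometheus/wal.bak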