Saturday, June 8, 2024
[Solved] Deleting the Argocd config hides the resource but does not delete it completely, leaving Argocd stuck in a Deleting state
Error:-
Because we were using Keda, Keda validation errors also spiked while the Ingress was stuck in deletion: Keda kept trying to create the resource but could not complete either.
Cause:-
The obvious cause of the error was that delete_protection.enabled should have been set to false first, before removing the Appset, so the resource could be deleted successfully. Since that was not done, Argocd hung in the Deleting state but could not delete the resource because of the flag.
At this point I tried disabling the flag from the AWS dashboard on the ALB directly, which succeeded, but it did not help: Argocd already had the config and would sync it back. And since multiple microservices share the same Ingress, the flag cannot be changed manually across all services at the same time.
Even if you somehow change it manually, Argocd will revert the config, since it syncs back any drift to maintain the desired state. So the only option is to roll back the change, set delete_protection.enabled to false in Git, and then delete again. But this was also not possible: while Argocd is stuck deleting, it cannot sync in new changes. The previous sync has to complete before a new sync will happen.
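For reference, the dashboard step above can also be done with the AWS CLI. Note the actual ELBv2 attribute key is `deletion_protection.enabled`; the ARN below is a placeholder, and as described above Argocd will sync the change back unless it is also made in Git:

```shell
# Hypothetical ARN; look it up with `aws elbv2 describe-load-balancers`.
LB_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188"

# Disable deletion protection on the ALB.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "$LB_ARN" \
  --attributes Key=deletion_protection.enabled,Value=false
```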
So the last viable option was to delete the Ingress using kubectl. However, a plain delete also hung and never completed, so I tried a forced delete with a zero grace period:
kubectl delete ingress <ingress-name> --force --grace-period=0
However, even this would not finish or sort out the problem.
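When even --force --grace-period=0 hangs, the usual culprit is a finalizer still attached to the object (Argocd puts resources-finalizer.argocd.argoproj.io on its Applications, and the AWS Load Balancer Controller adds its own finalizer to Ingresses). The original notes do not spell this step out, but a common way to unblock a stuck delete is to clear the finalizers; a sketch with placeholder names:

```shell
# Inspect the finalizers blocking deletion (resource/namespace are placeholders).
kubectl get ingress my-service -n staging \
  -o jsonpath='{.metadata.finalizers}'

# Remove all finalizers so the stuck delete can complete.
# Use with care: this skips the controller's cleanup logic, so any
# cloud resources it manages (like the ALB) may be left behind.
kubectl patch ingress my-service -n staging \
  --type merge -p '{"metadata":{"finalizers":null}}'
```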
Monday, June 3, 2024
[Solved] Gitlab remote: ERROR: Your SSH key has expired.
Error:-
remote:
remote: ========================================================================
remote:
remote: ERROR: Your SSH key has expired.
remote:
remote: ========================================================================
remote:
fatal: Could not read from remote repository.
Cause:-
The reason for this failure was that I created the SSH key about a year ago, and GitLab applies a default security policy under which an SSH key expires after 1 year. This is built in to improve security in GitLab; because that 1-year period had elapsed, Git reports the error that the SSH key has expired.
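The fix is to generate a fresh key and upload the new public key to GitLab. A minimal sketch; the file path and email comment are placeholders, and -N "" sets an empty passphrase just for the example:

```shell
# Make sure the .ssh directory exists.
mkdir -p ~/.ssh

# Generate a new ed25519 key pair (path and comment are examples).
ssh-keygen -t ed25519 -C "you@example.com" -f ~/.ssh/id_ed25519_gitlab -N "" -q

# Print the public key; paste it into GitLab -> Preferences -> SSH Keys,
# and set a new (or no) expiration date there.
cat ~/.ssh/id_ed25519_gitlab.pub

# Then verify access with: ssh -T git@gitlab.com
```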
Saturday, May 11, 2024
[Solved] Error 503 Service Unavailable on the Rolling Deployment of Service in EKS Cluster
Error:-
Cause:-
To deploy an application that truly updates with zero downtime, the application has to meet some requirements. To mention a few of them:
1. The application should handle graceful shutdown (drain in-flight requests on SIGTERM).
2. The application should implement readiness and liveness probes correctly.
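In Kubernetes terms, requirement 2 maps to probe definitions on the container, and requirement 1 is usually paired with a preStop hook so the load balancer can deregister the pod before SIGTERM arrives. A sketch; all names, ports, and paths here are assumptions, not taken from the actual service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service               # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels: { app: my-service }
  template:
    metadata:
      labels: { app: my-service }
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: my-service:latest   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:            # gate traffic until the app is ready
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:             # restart the container if it wedges
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 10
            periodSeconds: 10
          lifecycle:
            preStop:                 # give the LB time to deregister the pod
              exec:
                command: ["sleep", "10"]
```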
Solution :-
In our case, on further examination we found that graceful shutdown was missing for this particular service in the staging environment. More specifically, the ConfigMap for the application was missing, which is what caused the 503 Service Unavailable during testing; after creating the ConfigMap the issue was resolved.
This gives us a hint about what could be wrong: getting these requirements right is essential for zero downtime.
[Solved] prometheus-kube-prometheus-prometheus-rulefiles group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready"
Error:-
caller=group.go:104 level=error component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0/monitoring-kube-prometheus-kubernetes-resources.yaml group=kubernetes-resources msg="Failed to get Querier" err="TSDB not ready" caller=head.go:176 level=error component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 30172821
Cause:-
The information in the log was sufficient: the chunk data got corrupted, and when Prometheus restarts it replays the WAL; when the replay reaches that particular chunk sequence it fails. This is the primary cause of the failure.
Solution :-
There is no simple way to recover from this kind of chunk corruption. The WAL generally holds only about the last 2 hours of data, so with the chunks corrupted some data loss is unavoidable, and the longer you delay, the greater the loss. The only solution is to delete the WAL and then restart Prometheus: it will create a fresh WAL, skip replaying the old corrupted chunks, and start normally, at the cost of losing the un-flushed data (about 2 hours in our case). If you don't do this, Prometheus will keep restarting, never able to replay the WAL completely. So we went ahead, deleted the WAL, and when we started Prometheus it worked perfectly fine.
Alternatively, you can move all the data to a backup location first if you want to keep it; just make sure this directory ends up completely empty:
rm -rf /prometheus/wal/
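A hedged sketch of the backup-first variant. It assumes a kube-prometheus-stack install in the monitoring namespace with the default /prometheus data directory; the Prometheus CR name is hypothetical, and the operator manages replicas, so we pause via the CR rather than the StatefulSet:

```shell
# Stop Prometheus first so nothing writes to the WAL.
kubectl -n monitoring patch prometheus kube-prometheus-prometheus \
  --type merge -p '{"spec":{"replicas":0}}'

# From a node or a debug pod with the PVC mounted, move the WAL aside
# instead of deleting it, so it can be inspected or discarded later.
mv /prometheus/wal /prometheus/wal.bak.$(date +%Y%m%d%H%M%S)

# Scale Prometheus back up; it recreates an empty WAL on startup.
kubectl -n monitoring patch prometheus kube-prometheus-prometheus \
  --type merge -p '{"spec":{"replicas":1}}'
```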