
Tuesday, November 29, 2022

[Solved] sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1'

 

Issue:- 

When launching a container from the application image, the application needs to connect to the MySQL database running on the host machine. But when you try to connect using localhost or 127.0.0.1, you get the following error.

Error:- 

Effect:-

The application container went down because the application was not able to connect to the MySQL database.

Resolution:-
Since the MySQL database is running on the host machine, it makes sense to use --network=host, which disables Docker's bridge networking and uses the host network instead. Your Docker container will then be able to connect to the host database because both are on the same network.

docker run -d --network=host project_app1:latest      
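
For example, assuming a hypothetical database user, you can first verify connectivity to the host MySQL from a throwaway container on the host network before starting the application (the user name and port here are placeholders):

 docker run --rm -it --network=host mysql:8 mysql -h 127.0.0.1 -P 3306 -u appuser -p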

Saturday, October 22, 2022

[Solved] Failed to pull image rpc error: code = Unknown desc = context deadline exceeded

  

Issue:- 

When creating a pod from the tomcat:9 image in minikube, got the following error.


Error:- 

   Warning  Failed     58s                  kubelet            Failed to pull image "tomcat:9": rpc error: code = Unknown desc = context deadline exceeded 
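
This error means the image pull timed out before it could finish, usually because of a slow or blocked connection to the registry. One common workaround, assuming minikube is running with the Docker runtime, is to pull the large image manually on the minikube node so the kubelet does not hit its pull deadline, and then recreate the pod:

 minikube ssh docker pull tomcat:9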



Thursday, October 13, 2022

[Solved] warning: containerd.io.rpm: error: Failed dependencies:container-selinux >= 2:2.74 is needed by containerd.io

 

Issue:- 

When installing the containerd rpm on CentOS 7, you get a dependency error related to container-selinux that prevents containerd from being installed.

Error:- 

 [root@kubemaster ~]# rpm -ivh containerd.io-1.6.8-3.1.el7.x86_64.rpm
warning: containerd.io-1.6.8-3.1.el7.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
error: Failed dependencies:
	container-selinux >= 2:2.74 is needed by containerd.io-1.6.8-3.1.el7.x86_64 


Effect:-

Was not able to install containerd on CentOS 7.
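
On CentOS 7 the missing container-selinux package normally ships in the extras repository, so one way past this dependency error is to install it first and then retry the rpm install (assuming the extras repo is enabled):

 yum install -y container-selinux
 rpm -ivh containerd.io-1.6.8-3.1.el7.x86_64.rpm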

[Solved] No package containerd available.

 

Issue:- 

When installing containerd on CentOS 7 using the yum package manager, it gives the error mentioned below.

Error:- 

 No package containerd available.  


Effect:-

Was not able to install containerd on CentOS 7.

Resolution:-
Download the rpm for the containerd from the following link

https://download.docker.com/linux/centos/7/x86_64/stable/Packages/

 wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.6.8-3.1.el7.x86_64.rpm  
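
Once the rpm has been downloaded, installing it with yum instead of rpm lets yum resolve the container-selinux dependency automatically, for example:

 yum install -y ./containerd.io-1.6.8-3.1.el7.x86_64.rpm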

Wednesday, September 28, 2022

Generating token in kubernetes using kubeadm command for adding the worker nodes

Issue:- Kubeadm provides you with a join command when you first create a Kubernetes cluster. But what if you don't have that token handy later, when you need to add worker nodes to increase the cluster capacity?

Solution:- You can run the following command to generate the full join command, which can be used to add worker nodes to the master in the future.

 [centos@kubemaster ~]$ kubeadm token create --print-join-command  

kubeadm join 172.31.98.106:6443 --token ix1ien.29glfz1p04d7ymtd --discovery-token-ca-cert-hash sha256:1f202db500d698032d075433176dd62f5d0074453daa12ccdfffd637a966a771

Once the token has been generated, you can run the command on the worker node to add it to the Kubernetes cluster.
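
If you only want to check which tokens already exist and when they expire, you can also run:

 [centos@kubemaster ~]$ kubeadm token list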

[Solved] Persistentvolume claim pending while installing the Elasticsearch using Helm

 


Issue:- 

When installing Elasticsearch using Helm, the Elasticsearch containers fail to start: the PersistentVolumeClaims for the multi-master nodes go into the Pending state, and the containers remain in the Pending state as well.

Error:- 


Persistent volume claim remains in the pending state

Effect:-

Was not able to install Elasticsearch as the persistent volume claim was not ready for Elasticsearch.
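
To see why the claim is stuck in Pending, a good starting point is to describe the PersistentVolumeClaim and check whether the cluster actually has a default StorageClass or a matching PersistentVolume to bind to (the claim name below is just an example of what the chart generates):

 kubectl get pvc
 kubectl describe pvc elasticsearch-master-elasticsearch-master-0
 kubectl get storageclass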

[Solved] stacktrace":ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];

 

Issue:- 

When installing Elasticsearch using Helm, the Elasticsearch container fails with the exception AccessDeniedException[/usr/share/elasticsearch/data/nodes];

Error:- 

"cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "uncaught exception in thread [main]",
"stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];"

Effect:-

Was not able to install Elasticsearch; the Elasticsearch pod keeps crashing again and again because the health check does not pass and the failing liveness probe restarts the pod repeatedly.

Resolution:-
Follow the following steps to resolve the issue

1. The issue occurs because the elasticsearch user does not have permission on the /usr/share/elasticsearch/data/nodes directory.

2. But you cannot directly use the kubectl exec command, as the Elasticsearch container does not support sh or bash for this, and even if you fixed it manually the issue would come back whenever the pod gets replaced.

3. So, in order to resolve this issue, you need to use init containers and runAsUser so that the elasticsearch user ends up with the proper permissions. I was able to create the following workaround (Helm values) for this issue:
replicas: 1
minimumMasterNodes: 1

volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
extraInitContainers: |
   - name: create
     image: busybox:1.35.0
     command: ['mkdir', '-p', '/usr/share/elasticsearch/data/nodes/']
     securityContext:
       runAsUser: 0
     volumeMounts:
      - mountPath: /usr/share/elasticsearch/data
        name: elasticsearch-master
   - name: file-permissions
     image: busybox:1.35.0
     command: ['chown', '-R', '1000:1000', '/usr/share/elasticsearch/']
     securityContext:
        runAsUser: 0
     volumeMounts:
      - mountPath: /usr/share/elasticsearch/data
        name: elasticsearch-master

Explanation:-

Here we have added extraInitContainers using the busybox image: the first init container creates the /usr/share/elasticsearch/data/nodes/ directory, and the second runs with a securityContext of runAsUser: 0 and chowns /usr/share/elasticsearch/ to UID/GID 1000. The volume is mounted inside the Elasticsearch container at /usr/share/elasticsearch/data, so the correct permissions are in place afterwards. You should not get the access-denied error again, and the pod should run fine this time.
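
Assuming these values are saved in a file such as values.yaml and the chart comes from the elastic Helm repository, the workaround can be applied like this (the repository and release names are just examples):

 helm repo add elastic https://helm.elastic.co
 helm install elasticsearch elastic/elasticsearch -f values.yaml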

Monday, July 18, 2022

[Solved] too early for operation, device not yet seeded or device model not acknowledged

 

Issue:- 

When installing terragrunt using snap, got the following error.

Error:- 

error: too early for operation, device not yet seeded or device model not acknowledged

Effect:-

Was not able to install terragrunt as the installation failed immediately.

 [root@aafe920be71c ~]# snap install terragrunt
error: too early for operation, device not yet seeded or device model not acknowledged
	

Resolution:-

Follow the following steps to resolve the issue

1. Check the status of the snapd.seeded service, which was inactive in my case:
 [root@aafe920be71c ~]# systemctl status snapd.seeded.service
● snapd.seeded.service - Wait until snapd is fully seeded
   Loaded: loaded (/usr/lib/systemd/system/snapd.seeded.service; disabled; vendor preset: disabled)
   Active: inactive (dead) 
 

2. Now start the snapd.seeded service and check its status again:
 [root@aafe920be71c ~]# systemctl start snapd.seeded.service
 [root@aafe920be71c ~]# systemctl status snapd.seeded.service
● snapd.seeded.service - Wait until snapd is fully seeded
   Loaded: loaded (/usr/lib/systemd/system/snapd.seeded.service; disabled; vendor preset: disabled)
   Active: active (exited) since Mon 2022-07-18 16:12:34 UTC; 2s ago
  Process: 6425 ExecStart=/usr/bin/snap wait system seed.loaded (code=exited, status=0/SUCCESS)
 Main PID: 6425 (code=exited, status=0/SUCCESS)

Jul 18 16:12:33 aafe920be71c.mylabserver.com systemd[1]: Starting Wait until snapd is fully seeded...
Jul 18 16:12:34 aafe920be71c.mylabserver.com systemd[1]: Started Wait until snapd is fully seeded.

[root@aafe920be71c ~]# snap install terragrunt
2022-07-18T16:13:10Z INFO Waiting for automatic snapd restart...
terragrunt 0+git.ae675d6 from dt9394 (terraform-snap) installed

Explanation:-

The snapd.seeded service was not running, due to which snap failed to install the package and gave the above error.
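
Since the unit is shown as disabled above, you may also want to enable it so that it comes up automatically after a reboot:

 [root@aafe920be71c ~]# systemctl enable snapd.seeded.service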


Wednesday, June 29, 2022

[Resolved] ERROR Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1': (org.apache.kafka.common.utils.KafkaThread) java.lang.OutOfMemoryError: Java heap space

 

Issue:- 

When trying to delete a topic in the Amazon MSK Kafka cluster, got the following error.

Error:- 

 ERROR Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1': (org.apache.kafka.common.utils.KafkaThread)
java.lang.OutOfMemoryError: Java heap space

Effect:-

Was not able to delete the Topic in the MSK kafka cluster due to the above error message.

 ERROR Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1': (org.apache.kafka.common.utils.KafkaThread)
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
	at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
	at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
	at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112)
	at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
	at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
	at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:535)
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1131)

Resolution:-

Follow the following steps

1. The issue was resolved by pointing the command at the client.properties file (which references the truststore) using the --command-config option.

2. Now run the kafka-topics.sh command again with --command-config and verify that the topic gets deleted:
./kafka-topics.sh --bootstrap-server b-1.test-kafka.q15lx0.c10.kafka.us-west-2.amazonaws.com:9094,b-2.test-kafka.q15lx0.c10.kafka.us-west-2.amazonaws.com:9094,b-3.test-kafka.q15lx0.c10.kafka.us-west-2.amazonaws.com:9094 --delete --topic <topic-name>  --command-config  /Users/amittal/kafka/kafka_2.12-2.2.1/bin/client.properties

Explanation:-

You might have the truststore in your home directory, but you need to reference it in your command with --command-config; otherwise the client will fail to connect to the Kafka cluster and you won't be able to delete the topic from the Amazon MSK cluster.
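
For reference, a minimal client.properties for an MSK TLS listener typically looks like the snippet below; the truststore path is just a placeholder and must point to your own JKS truststore:

 security.protocol=SSL
 ssl.truststore.location=/Users/amittal/kafka/kafka.client.truststore.jks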

Saturday, June 4, 2022

[Resolved] default.svc.cluster.local: Name or service not known

 

Issue:- 

After creating a service, when I tried to verify whether the DNS name for the service was getting resolved or not, I got the following error.

Error:- 

 my-service.default.svc: Name or service not known

Effect:-

I was unable to confirm whether the service DNS name was actually resolving or whether there was some other issue, as the service itself was not accessible via curl or the browser.

 [centos@kubemaster service]$ nslookup my-service.default.svc  
 -bash: nslookup: command not found  
 [centos@kubemaster service]$ dig nslookup my-service.default.svc  
 -bash: dig: command not found  
 [centos@kubemaster service]$ ping nslookup my-service.default.svc  
 ping: my-service.default.svc: Name or service not known  
 [centos@kubemaster service]$ ping my-service.default.svc  
 ping: my-service.default.svc: Name or service not known  

Resolution:-

Follow the following steps

1. Create a pod with the DNS utils installed on it for making the nslookup command work inside the pod
 kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml  
 

2. Now run the nslookup command on the DNS name and verify if its getting resolved or not
[centos@kubemaster service]$ kubectl exec -it dnsutils -- nslookup my-service.default.svc  
 Server:          10.96.0.10  
 Address:     10.96.0.10#53  
 Name:     my-service.default.svc.cluster.local  
 Address: 10.111.144.147  

Explanation:-

Previously I was trying to resolve the DNS name on the host network, but CoreDNS only works inside the Kubernetes cluster (the pod network), not on the host network. That is why you cannot use the traditional way of resolving DNS with the nslookup or dig commands from the host. So we deployed a pod with dnsutils installed in it, ran the nslookup command from inside the pod, and printed the result directly on stdout. You can use this approach to resolve the DNS name and verify whether it is working fine or not. Also, you only need to specify the name up to .svc, as Kubernetes appends cluster.local itself.

[Resolved] groupVersion shouldn't be empty

 

Issue:- 

When creating simple resources like a Pod, ReplicaSet, or Deployment, got the groupVersion error specified below.

Error:- 

 groupVersion shouldn't be empty

Effect:-

Not able to create the resource because of the above error

 apiversion: v1
 kind: Pod
 metadata:
   name: pod2
 spec:
   containers:
   - name: c1
     image: nginx

Resolution:-

If you look at the above configuration closely, you will find that apiversion has been specified incorrectly. It should have been apiVersion. Just a difference in capitalization can cause that error. The same error will occur if you forget to mention apiVersion in the configuration or misspell it. The configuration below will work fine.

 apiVersion: v1
 kind: Pod
 metadata:
   name: pod2
 spec:
   containers:
   - name: c1
     image: nginx
 

Explanation:-

apiVersion is a hardcoded field name in Kubernetes. So if you misspell it, omit it, or get the capitalization wrong, it will give the above error.
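
If you are ever unsure which apiVersion a resource expects, kubectl can tell you, for example:

 kubectl explain pod | head -4
 kubectl api-resources | grep -i deployment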

Sunday, May 29, 2022

[Resolved] Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.

 

Issue:- 

The issue is with the dashboard service. When deploying the dashboard service using the YAML manifest in Kubernetes, it gives the following error.

Error:- 

 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.

Effect:-

Because the dashboard service is not able to reach the dashboard-metrics-scraper service, the dashboard UI does not load and times out after some time.
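
A reasonable first check is whether the dashboard-metrics-scraper pods and service are actually up; with the standard dashboard manifest they live in the kubernetes-dashboard namespace (adjust the namespace if you deployed it elsewhere):

 kubectl get pods -n kubernetes-dashboard
 kubectl get svc -n kubernetes-dashboard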

[Resolved] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

Issue:- 

When installing the metrics server in Kubernetes, getting the following error.

Error:- 

 Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

Effect:-

Due to the above error the metrics server will not work:

[centos@kubemaster dashboard]$ kubectl top nodes
W0529 10:18:25.234815   13218 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
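
To narrow down why the metrics API is unavailable, check whether the metrics-server pods are running in kube-system and look at their logs (the label and deployment name below assume the upstream metrics-server manifest):

 kubectl -n kube-system get pods -l k8s-app=metrics-server
 kubectl -n kube-system logs deploy/metrics-server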

Wednesday, May 18, 2022

[Resolved] An error occurred (Throttling) when calling the DescribeLoadBalancers operation (reached max retries: 4): Rate exceeded

 

Issue:- 

If you have a big infrastructure with a lot of automation in place, your AWS API calls can hit the rate limits, which results in errors like the one below.

Error:- 

 An error occurred (Throttling) when calling the DescribeLoadBalancers operation (reached max retries: 4): Rate exceeded

Effect:-

The command or script you ran fails because the rate limit for calls to the AWS API was reached and the retries were exhausted. If you are running the command manually you can simply run it again, but if it is part of a script without its own retry logic, the failure becomes a much bigger problem.
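
One common mitigation is to let the AWS CLI retry more aggressively before giving up, for example by exporting the standard retry environment variables before running the command or script:

 export AWS_RETRY_MODE=adaptive
 export AWS_MAX_ATTEMPTS=10
 aws elb describe-load-balancers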

Monday, May 9, 2022

[Resolved] from setuptools_rust import RustExtension ModuleNotFoundError: No module named 'setuptools_rust'

 

Issue:- 

Issue with the cryptography package and Rust during the Ansible installation on CentOS.

Error:- 

  Downloading https://files.pythonhosted.org/packages/3d/5f/addb8b91fd356792d28e59a8275fec833323cb28604fb3a497c35d7cf0a3/cryptography-37.0.1.tar.gz (585kB)
    100% |████████████████████████████████| 593kB 2.0MB/s
  Complete output from command python setup.py egg_info:
    =============================DEBUG ASSISTANCE==========================
    If you are seeing an error here please try the following to
    successfully install cryptography: Upgrade to the latest pip and try
    again. This will fix errors for most users.
    See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
    =============================DEBUG ASSISTANCE==========================
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-nfv80r3s/cryptography/setup.py", line 14, in <module>
        from setuptools_rust import RustExtension
    ModuleNotFoundError: No module named 'setuptools_rust'

Effect:-

Ansible Installation failed while using pip with the above error.

Resolution:-

 #pip install --upgrade pip 


Explanation:-

Basically the issue occurs because an outdated version of pip is being used for the Ansible installation.

Try upgrading pip first and then install Ansible again using pip; it should succeed.
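
In other words, the minimal sequence that usually gets past this error looks like:

 pip install --upgrade pip
 pip install ansible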

[Resolved] Error response from daemon: invalid MountType: "=bind"

 

Issue:- 

Unable to deploy the visualizer service in the Docker Swarm

Error:- 

  Error response from daemon: invalid MountType: "=bind"

Effect:-

Was not able to deploy the visualizer service; the following command failed with the above error:

# docker service create --name=viz --publish=8080:8080/tcp --constraint=node.role==manager --mount=type==bind,src=/var/run/docker.sock,dst=/var/run/docker.sock dockersamples/visualizer

Resolution:-

# docker service create --name=viz --publish=8080:8080/tcp --constraint=node.role==manager --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock dockersamples/visualizer

Explanation:-

You need to use type=bind (a single equals sign) in the --mount flag, not type==bind; the extra = makes Docker treat the mount type as "=bind", which is invalid.

Wednesday, March 9, 2022

[Solved] Intermittent / burst logs in the Newrelic / ELK

 

Issue:- 

Although the application was writing logs continuously and the shipper was shipping them, logs were missing for particular periods and bursts of logs with spikes were being observed in Newrelic/ELK.

Error:- 

The following graph showed the intermittent gaps and the subsequent bursts of logs in ELK.


Effect:-

Due to the non-availability of the logs it was becoming difficult to troubleshoot issues, as the logs were getting delayed and sometimes missed out entirely.

Resolution:-

Printing only the error logs, or the logs actually required for troubleshooting, helps to overcome this issue.

Explanation:-

More than 1 million log events were being posted per hour, due to which the disk was becoming a bottleneck and events were being pushed into Newrelic/ELK in bursts.

Lowering the log volume so that only the error logs, or the logs required for troubleshooting, are printed should help overcome this issue of intermittent logs in Newrelic/ELK.