
Wednesday, October 31, 2018

[Solved] Stderr: VBoxManage: error: The virtual machine 'master_default_1540967069723_95784' has terminated unexpectedly during startup with exit code 1 (0x1)

Error:-
There was an error while executing `VBoxManage`, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.

Command: ["startvm", "cddac55c-debe-470d-bb0a-d5badf0c19af", "--type", "gui"]

Stderr: VBoxManage: error: The virtual machine 'master_default_1540967069723_95784' has terminated unexpectedly during startup with exit code 1 (0x1)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MachineWrap, interface IMachine


Solution:-

1. Here is a brief summary of what I was doing: I installed VirtualBox on my macOS machine using brew, then installed Vagrant and tried to bring up the VM using Vagrant, which resulted in the above error.

2. The problem occurs because macOS doesn't allow external applications to load kernel extensions without explicit approval, due to which the VirtualBox installation fails on macOS.

3. To resolve this issue, download the VirtualBox installer from the VirtualBox site and use the installer instead of brew to install VirtualBox.
http://download.virtualbox.org/virtualbox/5.2.20

4. Once the installation fails, open the Security & Privacy settings and click the Allow option shown for the blocked system software.

5. Once you have clicked the Allow option, try reinstalling, and the installation should succeed this time.

6. Now run vagrant up again and it will work.
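
If you want to confirm that the fix took effect, here is a quick sanity check (assuming a default VirtualBox install; kext bundle IDs may vary by version):

 kextstat | grep -i org.virtualbox    # the VBoxDrv kernel extension should now be loaded
 vagrant up                           # should boot the VM without the startvm error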

Installing VirtualBox on the Mac using brew

Use the following command to install VirtualBox on the Mac using brew:

 brew cask install virtualbox  

Thursday, October 25, 2018

Managing Multiple VPC in Organization

If you are managing a very large infrastructure that spans multiple public clouds and private datacenters, and you have a large number of external integrations with merchants over tunnels, it is good to maintain the network details for all the public clouds (AWS VPCs), private datacenters, and so on, so that there is no overlap between your account and the account of some other team with whom you might have to peer or create a tunnel at a later point in time. It is good to maintain a wiki page for this, and every time new infrastructure is created, always update the wiki.

For AWS you can prepare an Excel sheet with the following fields to relay the information correctly to other teams (a sample row follows the list):-
1. Network details
2. CIDR
3. Broadcast IP
4. Netmask
5. Location
6. Comments
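
For example, a filled-in row could look like this (all values are illustrative):

 Network details: prod-vpc-mumbai
 CIDR:            10.20.0.0/16
 Broadcast IP:    10.20.255.255
 Netmask:         255.255.0.0
 Location:        AWS ap-south-1
 Comments:        Peered with shared-services VPC; do not allocate 10.20.0.0/16 elsewhere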

For private datacenters, enter the following details:-
1. Subnet
2. Mask
3. Subnet Details
4. VLAN ID
5. Zone/VLAN
6. Gateway

Enable or Disable passphrase on id_rsa key file

From a security point of view, it's always good to enter a passphrase whenever you generate an ssh key for server access, as it helps prevent unauthorised access in case your key is compromised. It is also commonly an audit requirement, since it acts as a form of two-factor authentication: both the passphrase and the key are required to access the server.

You can also enable Google authentication, in which case a passcode is generated in an application such as Google Authenticator; apart from the passphrase and key, a person accessing the server would then need to enter the Google Authenticator code as well, increasing security even further. I covered this in my previous post below.

In case you forgot to enable a passphrase and want to enable it now, use the following command to add one without affecting your existing key file:

ssh-keygen -p  -f ~/.ssh/id_rsa

Simply enter your passphrase twice, and from now on, every time you ssh to a server with your key you will need to enter this passphrase.

In case you want to remove the passphrase, use the following command:

ssh-keygen -p

Enter your old passphrase, then leave the new passphrase blank. This basically overrides your previous passphrase with a blank one.
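
If you prefer to do this non-interactively (for example from a script), ssh-keygen accepts the old and new passphrases as flags; the passphrase values here are placeholders:

 # add a passphrase to an existing key without regenerating it
 ssh-keygen -p -P '' -N 'MyNewPassphrase' -f ~/.ssh/id_rsa

 # remove the passphrase again
 ssh-keygen -p -P 'MyNewPassphrase' -N '' -f ~/.ssh/id_rsa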

Wednesday, October 3, 2018

Elasticsearch monitoring

What is Elasticsearch?
  • Elasticsearch is an open source distributed document store and search engine that stores and retrieves data structures in near real-time.
  • Elasticsearch represents data in the form of structured JSON documents, and makes full-text search accessible via RESTful API and web clients for languages like PHP, Python, and Ruby.
A few key areas to monitor Elasticsearch in Datadog:
  • Search and indexing performance
  • Memory and garbage collection
  • Host-level system and network metrics
  • Cluster health and node availability
  • Resource saturation and errors

Search and indexing performance:
Search Performance Metrics:
  1. Query load: Monitoring the number of queries currently in progress can give you a rough idea of how many requests your cluster is dealing with at any particular moment in time.
  2. Query latency: Though Elasticsearch does not explicitly provide this metric, monitoring tools can help you use the available metrics to calculate the average query latency by sampling the total number of queries and the total elapsed time at regular intervals (see the sketch after this list).
  3. Fetch latency: The second part of the search process, the fetch phase, should typically take much less time than the query phase. If you notice this metric consistently increasing, it could indicate a problem with slow disks, enriching of documents (highlighting relevant text in search results, etc.), or requesting too many results.
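
As a rough sketch of the query latency calculation above, you can sample the cumulative search counters from the nodes stats API at two points in time and divide the deltas (localhost:9200 assumes a locally reachable node):

 # cumulative counters: indices.search.query_total and query_time_in_millis
 curl -s 'localhost:9200/_nodes/stats/indices/search?pretty'

 # average query latency ~= delta(query_time_in_millis) / delta(query_total)
 # between two successive samples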


Indexing Performance Metrics:
  1. Indexing latency:
If you notice the latency increasing, you may be trying to index too many documents at one time (Elasticsearch's documentation recommends starting with a bulk indexing size of 5 to 15 megabytes and increasing slowly from there).
Solution:
If you are planning to index a lot of documents and you don't need the new information to be immediately available for search, you can optimize for indexing performance over search performance by decreasing the refresh frequency until you are done indexing (see the sketch after this list).
  2. Flush latency:
If you see this metric increasing steadily, it could indicate a problem with slow disks; this problem may escalate and eventually prevent you from adding new information to your index.
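
A minimal sketch of the refresh frequency tweak mentioned in point 1 (the index name my_index is illustrative):

 # disable the periodic refresh during a heavy bulk load
 curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{"index": {"refresh_interval": "-1"}}'

 # restore the default once indexing is done
 curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{"index": {"refresh_interval": "1s"}}'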

Important points for Elasticsearch Optimizations

Points to take care of before creating the cluster:

  • Volume of data
  • Nodes and capacity planning.
  • Balancing, High availability, Shards Allocation.
  • Understanding the queries that clusters will serve.


Config walk-through: 

cluster.name:
Represents the name of the cluster; it should be the same across all nodes in the cluster.

node.name:
Represents the name of the particular node in the cluster. It must be unique for every node, and it is good to use the hostname.

path.data:
Location where Elasticsearch needs to store the index data on disk. If you are planning to handle a huge amount of data in the cluster, it is good to point this to a separate EBS volume instead of the root volume.

path.logs:
Location where Elasticsearch needs to store the server startup, indexing, and other logs. It is also good to store these on a volume other than the root volume.

bootstrap.memory_lock:
This is an important config in the ES config file. It needs to be set to "true". This config locks the amount of heap memory configured in JAVA_ARGS to Elasticsearch. If it is not configured, the OS may swap ES data out to disk, and in turn garbage collections may take more than a minute instead of milliseconds. This directly affects node status, and chances are high that nodes may drop out of the cluster.

network.host:
This config sets both network.bind.host and network.publish.host. Since we are configuring ES as a cluster, bind and publish shouldn't be localhost or a loopback address.

discovery.zen.ping.unicast.hosts:
This config needs to hold the resolvable hostnames of all the nodes in the ES cluster.

Never.. ever.. enable multicast ping discovery. It creates unwanted ping checks for node discovery across the infrastructure (say you have 5 nodes in the cluster; multicast pings all 100 servers in the infra. It's bad). It is also deprecated in Elasticsearch 5.x.

discovery.zen.minimum_master_nodes:
The number of master-eligible nodes that need to be live for electing the leader. The quorum can be calculated as (N/2)+1, where N is the count of master-eligible nodes. For a three-master-node cluster, the quorum is 2. This option is mandatory to avoid split brain.

index.number_of_shards:
Set the number of shards (splits) of an index (5 by default).

index.number_of_replicas:
Set the number of replicas (additional copies) of an index (1 by default).
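
Putting the options above together, a minimal elasticsearch.yml for one node of a three-node cluster could look like this (names, paths, and hostnames are illustrative):

 cluster.name: my-es-cluster
 node.name: es-node-1
 path.data: /data/elasticsearch
 path.logs: /var/log/elasticsearch
 bootstrap.memory_lock: true
 network.host: 0.0.0.0
 discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]
 discovery.zen.minimum_master_nodes: 2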

Cluster topology:
Cluster topology can be defined mainly with these two configs: node.data and node.master.


 node.data | node.master | State
 ----------+-------------+----------------------------------------------------------------
 false     | true        | Only serves as a master-eligible node; no data is saved there
 false     | false       | Works as a load balancer for queries and aggregations
 true      | true        | Master-eligible node that also saves data in the location "path.data"
 true      | false       | Only serves as a data node


There is a difference between master and master-eligible nodes. Setting node.master alone only makes the node master eligible. When the cluster starts, ES itself elects one of the master-eligible nodes to become the master node. We can get the current master node from the ES API "/_cat/nodes?v". Any cluster-related anomalies will be logged in this master node's log only.
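
For reference, the elected master is the node marked with * in the master column (the sample output below is illustrative):

 curl -s 'localhost:9200/_cat/nodes?v'
 # ip         heap.percent ram.percent ... node.role master name
 # 10.0.0.1   45           90          ... mdi       *      es-node-1
 # 10.0.0.2   40           85          ... mdi       m      es-node-2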

Cluster optimization for stability and performance:

  • Enable memory locking (bootstrap.memory_lock) in elasticsearch.yml.
  • Set MAX_LOCKED_MEMORY=unlimited and ES_HEAP_SIZE (xmx and xms) to half of the memory on the server in /etc/default/elasticsearch.
  • Also configure MAX_OPEN_FILES with 16K so it won't hit its limit in the long run (a sample of these settings follows this list).
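
As a sketch, the corresponding lines in /etc/default/elasticsearch on a 16 GB server might look like this (values are illustrative):

 ES_HEAP_SIZE=8g                # roughly half of the RAM on the box
 MAX_LOCKED_MEMORY=unlimited    # required for bootstrap.memory_lock to take effect
 MAX_OPEN_FILES=16384           # file descriptor limit for the elasticsearch process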

Issue sending Email from the Ec2 instances

I recently configured postfix on an EC2 instance and tried sending mail with all the security group and NACL rules in place. Although I was initially able to telnet to the Google email servers on port 25, I soon started seeing "no connection" error messages in the logs; ultimately I was not able to telnet at all, and the mails were neither going out nor being received by the recipient.
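
A quick way to reproduce the connectivity check (the host here is one of Google's public MX servers):

 telnet aspmx.l.google.com 25
 # a healthy connection prints a 220 SMTP banner; a throttled instance just hangs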

This problem only occurs on EC2 instances. This is because Amazon throttles traffic on port 25 for all EC2 instances by default. However, it is possible to get this throttling removed for an EC2 instance on port 25.

To remove this limitation, you need to create a DNS A record in Route53 pointing to the instance used as the mail server, such as the postfix host.

With the root account, open the following link:

https://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/ec2-email-limit-rdns-request

Provide your use case for sending mail. You will then need to provide a reverse DNS record for AWS to create, since reverse DNS queries are used by receiving mail servers to verify the authenticity of the sending mail servers; proper reverse resolution increases the chances of the mail being delivered to the inbox rather than spam.
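
Once AWS has created the reverse DNS entry, you can verify it with a PTR lookup (the IP and hostname are illustrative):

 dig -x 203.0.113.25 +short
 # expected output: mail.example.com.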

Once the request is approved by AWS Support, you will receive a notification from the support team that the throttle limitation has been removed from your EC2 instance, and you can send mail to any recipient based on your use case.

SSH upgrade on Ubuntu for PCI Compliance

In case your security team raises a concern regarding upgrading the OpenSSH server version on your Ubuntu servers, kindly check the OpenSSH version shipped for your distribution release before making any changes, as this can affect overall reachability to the server.

The following are the latest OpenSSH versions based on the distribution release:
OpenSSH 6.6 is the most recent version on Ubuntu 14.04.
OpenSSH 7.2 is the most recent version on Ubuntu 16.04.
OpenSSH 7.6 is the most recent version on Ubuntu 18.04.
OpenSSH 7.6 is supported on Ubuntu 18.04 only, and Ubuntu 14.04 is not compatible with it. That is why it is not upgraded during the patching activity.
Like all other distributions, Ubuntu backports security fixes for vulnerabilities so that application compatibility doesn't break by changing versions across distribution releases.
Don't make any changes to your server that are not compatible with your distribution release.
When asked, simply provide the version of Ubuntu you are running.
This can be verified from the links below as well.
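
To check what you are actually running locally, the following commands work on Ubuntu:

 ssh -V                             # prints the installed OpenSSH version
 apt-cache policy openssh-server    # shows the installed and candidate package versions
 apt-get changelog openssh-server   # lists the security fixes Ubuntu has backported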