Wednesday, October 9, 2019

[Solved] CannotPullContainerError: no space left on device

By default, the AWS ECS service does not take free disk space on ECS instances into account when placing new tasks. It uses only CPU and memory resources for task placement. When an instance's disk fills up, ECS still tries to start new tasks on it, and they fail with the error “CannotPullContainerError: no space left on device”. Overfilled instances stay active in the cluster until a regular cluster roll replaces all instances.

The correct way to handle task placement is to let ECS know about free disk space by setting a custom attribute on the instance, and then to set a placement constraint in the task definition (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-constraints.html). Once a custom attribute indicates disk usage, the task definition can be configured so that tasks are not placed on instances whose used disk space is greater than a configured threshold.

This can be achieved with a shell script that monitors free space and deregisters an instance when needed. The script needs to be run from the system cron every 5 minutes. It reads disk usage from the output of the 'df' command and sets the ECS instance attribute 'SpaceUsedPercent'. If used disk space is greater than the threshold (85%), the script sets the ECS instance status to draining. When the running container count then drops to 3 or fewer, the script deregisters the container instance from the ECS cluster and updates the CloudWatch metric 'deregisteredLowSpaceInstances'. When an instance has been inactive in the cluster for more than 10 minutes, Spotinst terminates it.
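The decision logic described above can be sketched in a few lines. This is only an illustration in Python, not the original shell script: the function names, the sample 'df -P' output, and the action labels are assumptions made for the example, and the actual AWS calls are left out so the logic can be read and tested on its own.

```python
from typing import List

ATTRIBUTE_NAME = "SpaceUsedPercent"   # attribute name taken from the post
THRESHOLD_PERCENT = 85                # threshold taken from the post
MAX_RUNNING_BEFORE_DEREGISTER = 3     # container count taken from the post

# Hypothetical sample of `df -P` output, used only for illustration.
SAMPLE_DF = """\
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/xvda1        30832548 27749293   3083255      90% /
tmpfs               507444        0    507444       0% /dev/shm
"""


def parse_used_percent(df_output: str, mount_point: str = "/") -> int:
    """Return the used-space percentage for mount_point from `df -P` output."""
    # Skip the header line; each data line has six whitespace-separated fields.
    for line in df_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 6 and fields[5] == mount_point:
            return int(fields[4].rstrip("%"))
    raise ValueError(f"mount point {mount_point!r} not found in df output")


def plan_actions(
    used_percent: int,
    running_containers: int,
    threshold: int = THRESHOLD_PERCENT,
    max_running: int = MAX_RUNNING_BEFORE_DEREGISTER,
) -> List[str]:
    """Return the ordered steps the monitoring script would take on one run."""
    # The attribute is refreshed on every run so ECS sees current usage.
    actions = [f"set-attribute {ATTRIBUTE_NAME}={used_percent}"]
    if used_percent > threshold:
        actions.append("drain")
        # Deregister only once few enough containers are still running.
        if running_containers <= max_running:
            actions.append("deregister")
            actions.append("put-metric deregisteredLowSpaceInstances")
    return actions


if __name__ == "__main__":
    used = parse_used_percent(SAMPLE_DF)
    for step in plan_actions(used, running_containers=5):
        print(step)
```

In a real script, each step would presumably map to an ECS or CloudWatch API call (for example the ECS PutAttributes, UpdateContainerInstancesState and DeregisterContainerInstance actions, and CloudWatch PutMetricData); the mapping is not shown in the original post and is stated here as an assumption.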

If you are using Spotinst to run your containers, Spotinst can take care of this for you. If you are not using Spotinst, you will need additional logic to terminate the instance, create a new one using the AWS CLI, and add it to the running cluster.
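For clusters without Spotinst, one possible approach is sketched below. It assumes the ECS instances belong to an EC2 Auto Scaling group whose launch setup already registers new instances with the cluster, which the post does not state. Under that assumption, terminating the instance without decrementing the desired capacity lets the group launch a replacement. The function name and the instance ID are hypothetical.

```python
from typing import List


def replace_instance_command(instance_id: str) -> List[str]:
    """Build the AWS CLI call that terminates an instance in its Auto Scaling group.

    Assumption: the instance is managed by an Auto Scaling group. Because the
    desired capacity is not decremented, the group starts a replacement.
    """
    return [
        "aws", "autoscaling", "terminate-instance-in-auto-scaling-group",
        "--instance-id", instance_id,
        "--no-should-decrement-desired-capacity",
    ]


if __name__ == "__main__":
    # Hypothetical instance ID; print the command instead of running it.
    print(" ".join(replace_instance_command("i-0123456789abcdef0")))
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True) on a host with the AWS CLI installed and suitable IAM permissions.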

Create a new revision of the task definition, scroll down to 'Constraint', and add a new constraint.
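The exact constraint expression is not given in the text of this post. As a rough sketch, a 'memberOf' constraint written in the ECS cluster query language against the 'SpaceUsedPercent' attribute might look like the fragment built below. The 85% limit mirrors the threshold used by the monitoring script, the helper name is made up for the example, and it is worth verifying against the AWS documentation how ECS compares attribute values before relying on a numeric comparison.

```python
import json


def disk_space_constraint(max_used_percent: int = 85,
                          attribute: str = "SpaceUsedPercent") -> dict:
    """Build a placement constraint limiting tasks to instances under the limit."""
    return {
        "type": "memberOf",
        # Assumed expression; the original post does not show it.
        "expression": f"attribute:{attribute} <= {max_used_percent}",
    }


# Fragment as it would appear in the task definition JSON.
task_definition_fragment = {"placementConstraints": [disk_space_constraint()]}

if __name__ == "__main__":
    print(json.dumps(task_definition_fragment, indent=2))
```

The same expression can be typed into the 'Constraint' field in the console instead of editing the JSON by hand.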

Now ECS will also take this additional constraint into account when placing tasks.


