Chaos Engineering With Docker EE

Why Chaos Engineering?

Before we define Chaos Engineering or explain why it has become important, let’s look at the traditional approach. Most applications and configurations would be put under stress testing to find their breaking point. This primarily helped assure the operations team that the provisioned capacity was enough for the anticipated workload. The tests were relatively simple to do. But over time a couple of things have changed:

Systems have become more and more complex.
Workloads can change abruptly, and scaling up and down is now a necessity.

There is also a philosophical shift happening in the way IT operations teams think:

Servers are disposable: earlier, the basic deployment units (in most cases physical or virtual servers) were treated like “pets”, and configuration changes would turn each one into a snowflake. Now, with configuration management tools, servers are disposable like “cattle” and can be rebuilt from scratch whenever the configuration changes (also known as Phoenix Servers).
Failures have been accepted as business as usual; outages have not. I am not trying to force you to accept system failures, but most IT operations teams today acknowledge that things will go wrong. Simply put, one needs to be prepared for it.
Because of the explosion of the internet, services are no longer limited by geography. Workloads are no longer predictable, and they are bound to go beyond the breaking point of a single server; it is just a matter of time and chance.
The complexity of applications has increased many times over. Today’s applications are not just three-tier deployments. A rendered web page might rely on tens or, in some cases, hundreds of microservices in the backend. The only way to test the resiliency of such a system is to inject random failures on purpose.

All of this has convinced IT operations leads that the best way to be prepared for an outage is to simulate one. If you are not convinced yet, you may want to read about studies of how much a business can lose because of an infrastructure outage.

How do you go about it?

So what should your strategy be? I believe the easiest way is to introduce unit testing and integration testing for infrastructure and architecture components too, just as you do for application code. For any High Availability or Disaster Recovery approach you have implemented, you should have a test case. For example, if you have a cluster with 2 nodes, your test case could be to shoot down one of the nodes. Yes, you read that right. I am suggesting that you take down a node. There is no other way to test high availability than to simulate failure. Similarly, you can test scalability by injecting slowness and network congestion.
There are many popular examples and inspirations for chaos injection. The most popular ones are:

Generic guidelines are available on Principles of Chaos Engineering
Netflix’s Chaos Monkey, which performs various kinds of chaos injection, e.g. introducing slowness in the network, killing EC2 instances, or detaching the network or disks from EC2 instances
Netflix’s Chaos Kong, which is not open sourced yet but is a nice inspiration and aspiration for anyone embarking on chaos engineering within their enterprise
Facebook’s Project Storm

Those who practice chaos engineering by trying to break their own systems have been well rewarded in times of outages. The best example is how Netflix weathered the storm by preparing for the worst.

How does that translate to the container world?

Today, a lot of new applications and services are being deployed as containers. If you are starting out with Chaos Engineering in Docker, there are many different mechanisms and tools at your disposal.
Before we get into tools, let’s look at some of the basic features of Docker which should be helpful to you.

1. Docker Service

It is often better to deploy your application as a Swarm service instead of deploying it as standalone containers. If you are using Kubernetes, it is likewise better to deploy your application as a service. Both definitions are declarative and define the desired state of the service. This really helps maintain the uptime of your application, as the orchestrator always tries to keep the service available.

Example

In this example, I am going to use a Dockerfile to build a new image and then I will be using it to deploy a new service. The example is executed against a Docker UCP cluster from a client node (with docker cli and UCP Client Bundle).
Set up a Docker build file, Dockerfile-nohc:

FROM nginx:latest
RUN apt-get -qq update
COPY index.html /usr/share/nginx/html
EXPOSE 80 443
CMD ["nginx", "-g", "daemon off;"]

Build your image

sh-4.2$ docker image build -t $dtr_url/development/tweet_to_us:demoMay -f Dockerfile-nohc .
Sending build context to Docker daemon 4.096kB
ip-10-100-2-106: Step 1/4 : FROM nginx:latest
ip-10-100-2-106:  ---> ae513a47849c
ip-10-100-2-106: Step 2/4 : COPY index.html /usr/share/nginx/html
ip-10-100-2-106:  ---> Using cache
ip-10-100-2-106:  ---> b97207424f3a
ip-10-100-2-106: Step 3/4 : EXPOSE 80 443
ip-10-100-2-106:  ---> Using cache
ip-10-100-2-106:  ---> bfe4f59a2094
ip-10-100-2-106: Step 4/4 : CMD nginx -g daemon off;
ip-10-100-2-106:  ---> Using cache
ip-10-100-2-106:  ---> cb79c6283bb5
ip-10-100-2-106: Successfully built cb79c6283bb5
ip-10-100-2-106: Successfully tagged dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay

Now we need to push the image to a repository (DTR or Docker Hub), so that it is available to all nodes:

sh-4.2$ docker image push $dtr_url/development/tweet_to_us:demoMay
The push refers to a repository [dtr.ashnikdemo.com:12443/development/tweet_to_us]
c75bed55c5fa: Pushed
7ab428981537: Mounted from development/tweet-to-us
82b81d779f83: Mounted from development/tweet-to-us
d626a8ad97a1: Mounted from development/tweet-to-us
demoMay: digest: sha256:08090c853df56ceee495fb95537ac9f2c81cf8718e5fc76c513ba1d8e7d145f0 size: 1155

Now we will start a service using this image:

sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
pq6eojqprru4ctw0ib0lwfmj6

This request asks the Swarm cluster to set up the service with --mode=replicated and --replicas=2, i.e. Swarm will try to maintain two tasks for this service at all times unless the user requests otherwise. You can inspect the tasks running for the service with the docker service ps command:

ID             NAME         IMAGE                                                      NODE             DESIRED STATE   CURRENT STATE           ERROR   PORTS
zzq1jgolcc2o   twet-app.1   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   ip-10-100-2-67   Running         Running 3 minutes ago
zlkf4ejuxus8   twet-app.2   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   ip-10-100-2-93   Running         Running 3 minutes ago

As you can see, there are two tasks running, and the service is set up with a VIP that load-balances between the two containers/tasks.

sh-4.2$ docker service inspect --format='{{.Endpoint}}' twet-app
{{vip [{ tcp 80 8080 ingress}]} [{ tcp 80 8080 ingress}] [{f80zlxoy56y20ql48o3v9aiwo 10.255.0.225/16}]}
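
The replica check described above can be scripted so that it becomes a repeatable test case. The following is a minimal Python sketch, not part of the original setup. It assumes the task list has been produced in a pipe-separated form, for example with docker service ps --format "{{.Name}}|{{.DesiredState}}|{{.CurrentState}}" twet-app; the function name count_running_tasks is hypothetical.

```python
def count_running_tasks(ps_output):
    """Count tasks whose desired and current state are both 'Running'.

    Expects one task per line in the assumed form
    NAME|DESIRED_STATE|CURRENT_STATE.
    """
    running = 0
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, desired, current = line.split("|", 2)
        if desired == "Running" and current.startswith("Running"):
            running += 1
    return running


# Sample input mirroring the two tasks shown above.
sample = """\
twet-app.1|Running|Running 3 minutes ago
twet-app.2|Running|Running 3 minutes ago
"""
print(count_running_tasks(sample))
```

A chaos test can compare this count with the requested number of replicas before and after injecting a failure.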

Let’s try to kill one of the underlying containers and see if Swarm is able to maintain the declarative state we had requested:

603c7f8940fe   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   "nginx -g 'daemon …"   7 minutes ago   Up 7 minutes   80/tcp, 443/tcp   ip-10-100-2-67/twet-app.1.zzq1jgolcc2oyucexn4j9u9pq
54aa164ea509   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   "nginx -g 'daemon …"   7 minutes ago   Up 7 minutes   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.2.zlkf4ejuxus851onp4i2t143p

sh-4.2$
sh-4.2$ docker container kill 603c7f8940fe
603c7f8940fe
sh-4.2$
sh-4.2$
sh-4.2$ docker service ps twet-app

ID             NAME             IMAGE                                                      NODE             DESIRED STATE   CURRENT STATE           ERROR                         PORTS
sp4hz64oytu0   twet-app.1       dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   ip-10-100-2-67   Running         Running 2 seconds ago
zzq1jgolcc2o    \_ twet-app.1   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   ip-10-100-2-67   Shutdown        Failed 7 seconds ago    "task: non-zero exit (137)"
zlkf4ejuxus8   twet-app.2       dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   ip-10-100-2-93   Running         Running 8 minutes ago

As you can see, the container 603c7f8940fe was backing one of the tasks of our service twet-app, and once we killed it, Swarm restored the desired state by starting another task.
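
To turn this kill-and-recover observation into an automated test, one option is to poll the running-task count until it is back at the desired value. The sketch below is illustrative only: wait_for_replicas and its parameters are hypothetical names, and the function that reports the running count is injected so the logic can be exercised without a cluster.

```python
import time


def wait_for_replicas(get_running_count, desired, timeout_s=60.0, poll_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until get_running_count() reaches `desired` or the timeout expires.

    Returns the seconds waited on success, or None on timeout.
    """
    start = clock()
    while True:
        if get_running_count() >= desired:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)


# Demo with a fake cluster that recovers on the third poll (no Docker needed).
observed = iter([1, 1, 2])
fake_now = [0.0]


def fake_sleep(seconds):
    # Advance the fake clock instead of really sleeping.
    fake_now[0] += seconds


recovered_after = wait_for_replicas(lambda: next(observed), desired=2,
                                    sleep=fake_sleep,
                                    clock=lambda: fake_now[0])
print(recovered_after)
```

In a real run, get_running_count would wrap a call to docker service ps, and the returned duration gives a rough recovery time to track across test runs.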
Note: Pushing the image to a repository is needed when you are running a distributed setup. As you can see above, the build was done on one of the nodes of the Swarm cluster (ip-10-100-2-106), so the image would be available only on that node. Hence, if we were to run the service without pushing the image to a repository, there is a good chance that all tasks would get started on the same node (ip-10-100-2-106), i.e. the only node that has the image, or that different nodes would run different images (left over from different builds). Swarm does a good job of reminding us about this. Here is what happened when I tried to run the service without pushing the image:

sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
image dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay could not be accessed on a registry to record
its digest. Each node will access dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay independently,
possibly leading to different nodes running different
versions of the image.
t46gb1wi3tc7xs2j08egzcut1

2. Health Checks

Docker allows you to use a healthcheck to keep tabs on the health of running containers. The healthcheck can either be baked into your image at build time using the HEALTHCHECK instruction in the Dockerfile, or defined at runtime using the --health-cmd option with docker service create or docker container run.
To quote the Docker documentation:

The HEALTHCHECK instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.

Note: The HEALTHCHECK feature was added in Docker 1.12.

Build time example of HEALTHCHECK

To make use of this feature, we will add a new instruction to our Dockerfile:

HEALTHCHECK --interval=30s --timeout=3s --retries=2 \
  CMD python /usr/share/nginx/html/healthcheck.py || exit 1

This means that the healthcheck command python /usr/share/nginx/html/healthcheck.py will first run 30 seconds after the container starts, and every 30 seconds after that. Each check times out after 3 seconds, and after 2 consecutive failures the container is declared unhealthy.
We will have to add a few new files to support HEALTHCHECK:

healthcheck.py – our own little piece of code to check the health of container.
healthcheck.html
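
The article does not list the contents of healthcheck.py, so the following is only a guess at what such a script could look like, written for Python 3. It assumes the check fetches healthcheck.html from the local nginx and expects the body to be the word "healthy"; the URL, the expected text and the function names are assumptions, and the failure message simply mirrors the one that appears in the inspect output later in this article.

```python
import urllib.request

EXPECTED = "healthy"                       # assumed marker text
URL = "http://localhost/healthcheck.html"  # assumed probe URL


def evaluate(body, expected=EXPECTED):
    """Return (exit_code, message) for a fetched page body."""
    if body.strip() == expected:
        return 0, ""
    return 1, ('The content of the healthcheck did not match. '
               'Expected Content-"%s", we got: %s' % (expected, body))


def main(url=URL, timeout_s=2.0):
    """Fetch the page and translate the result into an exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts derive from OSError
        print("healthcheck request failed: %s" % exc)
        return 1
    code, message = evaluate(body)
    if message:
        print(message)  # Docker records this text in the health log
    return code
```

When used as the HEALTHCHECK command, the script would finish by passing the result of main() to sys.exit, so that a non-zero exit code marks the check as failed.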

Now we will build and push the image

sh-4.2# docker image build --no-cache -t $dtr_url/development/tweet_to_us:demoMay_Healthcheck -f Dockerfile .
Sending build context to Docker daemon 7.68kB
Step 1/7 : FROM nginx:latest
 ---> b175e7467d66
Step 2/7 : RUN apt-get -qq update
 ---> Running in 152a3156632c
 ---> 2a6be94d9a04
Removing intermediate container 152a3156632c
Step 3/7 : RUN apt-get -qq --allow-downgrades --allow-remove-essential --allow-change-held-packages install python > /dev/null
 ---> Running in 56a5b9141aaf
debconf: delaying package configuration, since apt-utils is not installed
 ---> 99605506e79f
Removing intermediate container 56a5b9141aaf
Step 4/7 : COPY healthcheck.html healthcheck.py index.html /usr/share/nginx/html/
 ---> b1b93b73d0fa
Removing intermediate container dab1d03a75e4
Step 5/7 : EXPOSE 80 443
 ---> Running in 50b63022a6c3
 ---> 4297f32f769b
Removing intermediate container 50b63022a6c3
Step 6/7 : HEALTHCHECK --interval=30s --timeout=3s --retries=2 CMD python /usr/share/nginx/html/healthcheck.py || exit 1
 ---> Running in 1a1a9cd1f139
 ---> 042010177008
Removing intermediate container 1a1a9cd1f139
Step 7/7 : CMD nginx -g daemon off;
 ---> Running in 767a8098f177
 ---> 9153fcd78222
Removing intermediate container 767a8098f177
Successfully built 9153fcd78222
Successfully tagged dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
sh-4.2# docker image push dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
The push refers to a repository [dtr.ashnikdemo.com:12443/development/tweet_to_us]
a60e00b623bb: Pushed
2e83bcd5bc8d: Pushed
5134599f00a1: Pushed
77e23640b533: Pushed
757d7bb101da: Pushed
3358360aedad: Pushed
demoMay_Healthcheck: digest: sha256:a4fb4fd2733e37ae7282148ccb497aac4c2fc18a74aa8e950271ffd648b07da8 size: 1579

Now once we deploy the service, initially the health status would be starting until the first healthcheck is initiated

sh-4.2$ docker service rm twet-app
twet-app
sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
lbfmu7vxa1i6arfmpstzq3rer
sh-4.2$ docker container ls | grep -i twet

1feb5ed8e0b6   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   23 seconds ago   Up 23 seconds (health: starting)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   24 seconds ago   Up 23 seconds (health: starting)   80/tcp, 443/tcp   ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2

After the first healthcheck, the status changes to healthy:

sh-4.2$ docker container ls | grep -i twet

1feb5ed8e0b6   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   About a minute ago   Up About a minute (healthy)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   About a minute ago   Up About a minute (healthy)   80/tcp, 443/tcp   ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2

Testing Healthcheck and self-healing

Now let’s force a disruption by connecting to one of the containers and changing the content of healthcheck.html:

sh-4.2$ docker container exec -it 1feb5ed8e0b6 bash
root@1feb5ed8e0b6:/# echo test > /usr/share/nginx/html/healthcheck.html
root@1feb5ed8e0b6:/# exit

Soon (in about 1 minute given our interval, timeout and retries configuration in the Dockerfile), the container will be reported unhealthy and replaced with a new container to run the task

sh-4.2$ docker container ls | grep -i twet

1feb5ed8e0b6   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   3 minutes ago    Up 13 minutes (unhealthy)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.2.xskxfd9n6e39wlghpm0k7tphr
6ccb4d691fe9   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   13 minutes ago   Up 13 minutes (healthy)     80/tcp, 443/tcp   ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2

sh-4.2$ docker container ls | grep -i twet

efc5b969ef8c   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   42 seconds ago   Up 36 seconds (healthy)     80/tcp, 443/tcp   ip-10-100-2-93/twet-app.2.xskxfd9n6e39wlghpm0k7tphr
1feb5ed8e0b6   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago   Exited (0) 41 seconds ago                     ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago   Up 11 minutes (healthy)     80/tcp, 443/tcp   ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2
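
The "about 1 minute" estimate above follows from the healthcheck settings. As a rough sketch, assuming Docker runs each check one interval after the previous check completes and marks the container unhealthy after the configured number of consecutive failures, the time from fault to the unhealthy state can be bounded as below. The function name is hypothetical, and the numbers ignore the extra time Swarm needs to schedule the replacement task.

```python
def unhealthy_detection_window(interval_s, timeout_s, retries):
    """Return (best_case, worst_case) seconds from fault to 'unhealthy'.

    Best case: the fault lands just before a check and every check fails
    immediately. Worst case: the fault lands just after a check starts and
    every failing check runs into the timeout.
    """
    best = (retries - 1) * interval_s
    worst = retries * (interval_s + timeout_s)
    return best, worst


# Settings from the Dockerfile: --interval=30s --timeout=3s --retries=2
print(unhealthy_detection_window(30, 3, 2))
```

For the values used in this article the window is roughly 30 to 66 seconds, which is consistent with the observed behaviour.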

Runtime definition of Healthcheck

You can also override the healthcheck command, its interval, its timeout and its retries while creating the service:

sh-4.2$ docker service rm twet-app
twet-app

sh-4.2$ docker service create -d --name=twet-app \
  --mode=replicated --replicas=2 --publish 8080:80 \
  --health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
  --health-interval 10s \
  --health-retries 2 \
  --health-timeout 30ms \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

mj4lel34whrvupscq8sjt7g5m

Disable healthcheck

At runtime, while creating a service, you can disable the healthcheck with the --no-healthcheck option. That will suppress any healthcheck defined in the image:

sh-4.2$ docker service create -d --name=twet-app \
  --mode=replicated --replicas=2 --publish 8080:80 \
  --no-healthcheck \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

nxkv39smzq5k0o9tgwmc74t2c

If the base image you are going to use has a HEALTHCHECK defined, you can also disable the healthcheck at build time using HEALTHCHECK NONE.

Checking the status

You can use the docker container inspect command to further review the state of your containers and the detailed output of the healthcheck command, e.g. in case of a timeout error:

sh-4.2$ docker container inspect --format='{{json .State.Health}}' a9486dc964af
{"Status":"starting","FailingStreak":1,"Log":[{"Start":"2018-05-12T17:18:01.698087531Z","End":"2018-05-12T17:18:01.728282187Z","ExitCode":-1,"Output":"Health check exceeded timeout (30ms)"}]}

In case of failures:

sh-4.2$ docker container inspect --format='{{json .State.Health}}' 62ad34709fb8
{"Status":"healthy","FailingStreak":1,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}

sh-4.2$ docker container inspect --format='{{json .State.Health}}' 62ad34709fb8
{"Status":"unhealthy","FailingStreak":2,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"},{"Start":"2018-05-12T17:29:13.871399894Z","End":"2018-05-12T17:29:13.948097443Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}

In case of no failures:

sh-4.2$ docker container inspect --format='{{json .State.Health}}' 181d566f6aa9
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2018-05-12T17:25:06.184822447Z","End":"2018-05-12T17:25:06.262241844Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:25:26.262408086Z","End":"2018-05-12T17:25:26.338883823Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:25:46.339143953Z","End":"2018-05-12T17:25:46.416973058Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:26:06.417170336Z","End":"2018-05-12T17:26:06.495295881Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:26:26.495482044Z","End":"2018-05-12T17:26:26.572278146Z","ExitCode":0,"Output":""}]}

Note: The output will contain a friendly message if one is printed by your healthcheck command.
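
When these checks are scripted, the JSON printed by docker container inspect can be reduced to the few fields a test usually asserts on. The sketch below is a minimal illustration, not part of the original setup; summarize_health is a hypothetical helper, and the sample string is the timeout case shown above.

```python
import json


def summarize_health(health_json):
    """Return (status, failing_streak, last_output) from .State.Health JSON."""
    health = json.loads(health_json)
    log = health.get("Log") or []
    # The most recent probe result is the last entry of the log.
    last_output = log[-1]["Output"] if log else ""
    return health["Status"], health["FailingStreak"], last_output


sample = ('{"Status":"starting","FailingStreak":1,"Log":['
          '{"Start":"2018-05-12T17:18:01.698087531Z",'
          '"End":"2018-05-12T17:18:01.728282187Z","ExitCode":-1,'
          '"Output":"Health check exceeded timeout (30ms)"}]}')
print(summarize_health(sample))
```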

3. Tooling and Automation

Now that we have covered the basic building blocks of chaos engineering with Docker, let’s take a look at some tools. Pumba is a fairly new but quite promising tool for chaos orchestration. The best thing is that it works well with a Swarm cluster; you just need to point it to the manager node. We can easily get it to work with the Docker UCP Client Bundle.
Example
First, we need to set up an isolated network in which we will deploy our application and test it:

docker network create -d overlay tweet-app-net

Now let’s set up a service using the healthcheck from the previous examples:

docker service create -d --name=twet-app --network tweet-app-net \
  --mode=replicated --replicas=2 --publish 8080:80 \
  --health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
  --health-interval 20s \
  --health-retries 2 \
  --health-timeout 200ms \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

Let’s ensure that the service has been started properly with requested number of replicas which are healthy

sh-4.2$ docker container ls | grep -i twet

75b2bf6f219d   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   27 seconds ago       Up 21 seconds (healthy)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
393355d083fb   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   About a minute ago   Up 59 seconds (healthy)   80/tcp, 443/tcp   ip-10-100-2-67/twet-app.2.6uueh28nxj7btpfzffeq40f6b

Now let’s use Pumba to randomly kill some containers under the service

export SVC_NAME=twet-app
pumba --random kill $(docker service ps --no-trunc \
  --filter "desired-state=Running" \
  ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')

You will see output confirming that the container has been killed:

sh-4.2$ pumba --random kill $(docker service ps --no-trunc \
  --filter "desired-state=Running" \
  ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
INFO[0000] Kill containers
INFO[0003] Killing /twet-app.2.6uueh28nxj7btpfzffeq40f6b (393355d083fbd33d8247e6cf9dcdb36046000764547db776b405bb4c37ef7438) with signal SIGKILL
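
The awk expression in the command above builds container names by joining the task name and the full task ID with a dot, which matches the name Swarm gives to the container backing each task (for example twet-app.2.6uueh28nxj7btpfzffeq40f6b in the log line above). The same idea as a Python sketch, in case you prefer to drive Pumba from a test script; container_names is a hypothetical helper and the IMAGE values in the sample are placeholders.

```python
def container_names(ps_output):
    """Skip the header row and emit NAME.ID for each task line.

    Same idea as: awk '{if (NR!=1) {print $2"."$1}}'
    (blank lines are skipped here instead of producing a lone dot).
    """
    names = []
    for line in ps_output.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:
            names.append(fields[1] + "." + fields[0])
    return names


# Trimmed sample of `docker service ps --no-trunc` output.
sample = """\
ID                          NAME         IMAGE
im7f7qm2xh6fk6uqla462qzia   twet-app.1   example/image:tag
6uueh28nxj7btpfzffeq40f6b   twet-app.2   example/image:tag
"""
print(container_names(sample))
```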

You will notice that as soon as the container is killed, the Swarm manager tries to restore the desired state, i.e. 2 healthy replicas:

sh-4.2$ docker container ls | grep -i twet

75b2bf6f219d   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   42 seconds ago   Up 36 seconds (healthy)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia

sh-4.2$ docker container ls | grep -i twet

75b2bf6f219d   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   44 seconds ago   Up 38 seconds (healthy)   80/tcp, 443/tcp   ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia

sh-4.2$
sh-4.2$ docker container ls | grep -i twet

dfb8b7ebc559
75b2bf6f219d

dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

“nginx -g ‘daemon …”
“nginx -g ‘daemon …”

7 seconds ago
47 seconds ago

Up 1 seconds (health: starting)
Up 41 seconds (healthy)

80/tcp, 443/tcp

ip-10-100-2-67/twet-app.2.7505c070tcyk14wudbdbufy3t
ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia

You can also stop or remove a container using the various commands provided by Pumba.
You can also use the --interval option to run the command at a regular interval for stress testing, e.g. to run the same kill command every 10 minutes:

export SVC_NAME=twet-app
pumba --random --interval 10m kill $(docker service ps --no-trunc \
  --filter "desired-state=Running" \
  ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')

Network delay

Let’s first take example of a simple setup with a single node.

docker swarm init

Set up the service by running this command against the single manager node of your newly initiated Swarm cluster:

sh-4.2# docker service create -d --name=twet-app --network tweet-app-net \
  --mode=replicated --replicas=2 --publish 8080:80 \
  --health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
  --health-interval 10s \
  --health-retries 2 \
  --health-timeout 100ms \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

image dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck could not be accessed on a registry to record
its digest. Each node will access dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck independently,
possibly leading to different nodes running different
versions of the image.
pb7teb13m2oczlp8rub0wjkdo

Fire a pumba command to introduce delays:

sh-4.2# pumba --random netem --interface lo --duration 60s \
  --tc-image gaiadocker/iproute2 delay \
  --time 10 jitter 100 \
  --distribution normal $(docker service ps --no-trunc \
  --filter "desired-state=Running" \
  ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
INFO[0000] netem: delay for containers
INFO[0000] Running netem command '[delay 10ms 10ms 20.00]' on container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c for 1m0s
INFO[0000] Start netem for container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c on 'lo' with command '[delay 10ms 10ms 20.00]'

Monitor the status of the containers running for the service:

sh-4.2# docker container ls | grep twet-app

2eb65f467edc   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   About a minute ago   Up About a minute         80/tcp, 443/tcp   twet-app.2.eyxe2yo9fs928b6x2oa3q26m0
031b9ec08b50   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago       Up 11 minutes (healthy)   80/tcp, 443/tcp   twet-app.1.k6wfqvzyn5p7ka60witz92msv

You will notice that because of the network delays introduced by Pumba, the containers are failing the healthcheck:

sh-4.2# docker container ls | grep twet-app

2eb65f467edc   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   About a minute ago   Up About a minute         80/tcp, 443/tcp   twet-app.2.eyxe2yo9fs928b6x2oa3q26m0
031b9ec08b50   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago       Up 11 minutes (healthy)   80/tcp, 443/tcp   twet-app.1.k6wfqvzyn5p7ka60witz92msv

Soon the unhealthy container would be removed:

sh-4.2# docker container ls | grep twet-app

031b9ec08b50   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago   Up 11 minutes (healthy)   80/tcp, 443/tcp   twet-app.1.k6wfqvzyn5p7ka60witz92msv

And it will be replaced with a new container:

sh-4.2# docker container ls | grep twet-app

2c1453b4d680   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   20 seconds ago   Up Less than a second (health: starting)   80/tcp, 443/tcp   twet-app.2.oc900fzo7a7v1kgz2kt45h1pr
031b9ec08b50   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago   Up 11 minutes (healthy)                    80/tcp, 443/tcp   twet-app.1.k6wfqvzyn5p7ka60witz92msv

As soon as the healthcheck is executed, it will turn into a healthy one:

sh-4.2# docker container ls | grep twet-app

2c1453b4d680   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   20 seconds ago   Up 15 seconds (healthy)   80/tcp, 443/tcp   twet-app.2.oc900fzo7a7v1kgz2kt45h1pr
031b9ec08b50   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   "nginx -g 'daemon …"   11 minutes ago   Up 11 minutes (healthy)   80/tcp, 443/tcp   twet-app.1.k6wfqvzyn5p7ka60witz92msv
sh-4.2#

While the container is being replaced, you will notice that the pumba command fails (as the container it was attached to is gone):

INFO[0060] Stopping netem on container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c
INFO[0060] Stop netem for container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c on 'lo'
ERRO[0060] Error response from daemon: cannot join network of a non running container: 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c
ERRO[0060] Error response from daemon: cannot join network of a non running container: 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c

As you can see, Pumba was able to introduce network delay, and the HEALTHCHECK in the image (or --health-cmd at the service level) helped us replace the containers that were slowing down. At this time, this is the most that Pumba and Swarm can do. I am hoping that in time Swarm service healthchecks will allow us to define auto-scaling policies too.
Now, if we are running against a UCP setup or any “true” Swarm cluster that has worker and manager nodes, the pumba netem command will not work when you fire it from a client. This is unlike the kill command (or most of the other Pumba commands), which do work against a Swarm cluster. I came up with a simple workaround.

Pumba in a container

Well, you can run Pumba in a container, as the example on its GitHub page shows.

# once in a 10 seconds, try to kill (with `SIGTERM` signal) all containers named **hp(something)**
# on same Docker host, where Pumba container is running
$ docker run -d -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba pumba --interval 10s kill --signal SIGTERM ^hp

This means that we can create a service that runs on each node in the Swarm cluster and executes the pumba netem command. We need to change the entrypoint of the service and mount /var/run/docker.sock of the local node into the container so that Pumba has access to the Docker daemon on each node.
The pumba command should target only containers that belong to your service, so you need to pass a list of containers to the pumba entrypoint command.

export container_list=$(docker service ps --no-trunc --filter "desired-state=Running" ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')

The command should inject delay only on a specific interface, i.e. the one used by HEALTHCHECK.

export netem_interface=lo

Now let’s run our pumba netem service

docker service create -d --restart-condition none --mode global --name pumba-netem-delay \
  --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
  --entrypoint "pumba --random netem --interface ${netem_interface} --duration 60s \
  --tc-image gaiadocker/iproute2 delay \
  --time 10 jitter 100 \
  --distribution normal ${container_list}" \
  gaiaadm/pumba

The effect will be the same as in the previous example, which we ran on a single-node Swarm cluster.
If you are scripting this, wait for the netem duration to elapse and then clean up the Swarm service:

sleep 60
docker service rm pumba-netem-delay

Simulate Packet loss

To be added

The bold test – Node failure

One of the reasons to run your containers in a Swarm cluster is to ensure fault tolerance against node failures. Let’s simulate a node failure and see how the Docker UCP manager handles it.
Let’s first list various tasks of our application:

docker service ps twet-app

The output would be something like the following, giving you details of the number of tasks, their IDs and the nodes on which they are running:

ID             NAME         IMAGE                                                                  NODE             DESIRED STATE   CURRENT STATE            ERROR   PORTS
s8iib1ue7nrd   twet-app.1   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   ip-10-100-2-67   Running         Running 42 minutes ago
oiafp1o6klxx   twet-app.2   dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck   ip-10-100-2-93   Running         Running 42 minutes ago

For the purpose of our testing, let’s fail one of the nodes, say ip-10-100-2-67.
Since I am running in AWS, I will find the instance ID of the server and reboot it. We can run docker node ls before and after the reboot to note the node status:

sh-4.2$ docker node ls

ID                            HOSTNAME                                         STATUS   AVAILABILITY   MANAGER STATUS
3du1xn000h3jz3t2fcx9lcvdl     ip-10-100-2-38                                   Ready    Active
ag12n6ejw7ztf0yqpsao4208u     ip-10-100-2-115                                  Ready    Active
awql5xr67h0jmxjllzfohqqy2 *   ip-10-100-2-15                                   Ready    Active         Reachable
e2soqi2u67nfnoxop8mgfvm7a     ip-10-100-2-169                                  Ready    Active         Reachable
i2fjh10bx31ij6i3q2jvzwjco     ip-10-100-2-40                                   Ready    Active         Leader
lpi6z3np5vp83vatmh51d3i59     ip-10-100-2-106                                  Ready    Active
m4j4g27conj199uciw98k5h1b     ip-10-100-2-67                                   Ready    Active
mzwnamamkze7yagqa602tmd71     ip-10-100-2-93                                   Ready    Active
o3z2xxqo90mm4dlnpq632zorj     ip-10-100-2-66                                   Ready    Active
yczj5bg55l37xfkugkwwyc5ji     ip-10-100-2-70.ap-southeast-1.compute.internal   Ready    Active

sh-4.2$ aws ec2 describe-instances --filters "Name=network-interface.private-dns-name,Values=ip-10-100-2-67.ap-southeast-1.compute.internal" | grep -i InstanceId
"InstanceId": "i-0db2edf9253157f97",
sh-4.2$ aws ec2 reboot-instances --instance-ids i-0db2edf9253157f97

sh-4.2$ docker node ls

ID                            HOSTNAME                                         STATUS   AVAILABILITY   MANAGER STATUS
3du1xn000h3jz3t2fcx9lcvdl     ip-10-100-2-38                                   Ready    Active
ag12n6ejw7ztf0yqpsao4208u     ip-10-100-2-115                                  Ready    Active
awql5xr67h0jmxjllzfohqqy2     ip-10-100-2-15                                   Ready    Active         Reachable
e2soqi2u67nfnoxop8mgfvm7a *   ip-10-100-2-169                                  Ready    Active         Reachable
i2fjh10bx31ij6i3q2jvzwjco     ip-10-100-2-40                                   Ready    Active         Leader
lpi6z3np5vp83vatmh51d3i59     ip-10-100-2-106                                  Ready    Active
m4j4g27conj199uciw98k5h1b     ip-10-100-2-67                                   Down     Active
mzwnamamkze7yagqa602tmd71     ip-10-100-2-93                                   Ready    Active
o3z2xxqo90mm4dlnpq632zorj     ip-10-100-2-66                                   Ready    Active
yczj5bg55l37xfkugkwwyc5ji     ip-10-100-2-70.ap-southeast-1.compute.internal   Ready    Active

As you can see, the node became unavailable once the reboot was executed.
In order to maintain the desired state of the service with 2 replicas, the Swarm manager starts a new container on one of the surviving nodes:

ID
7e2icqhk0n54
s8iib1ue7nrd
oiafp1o6klxx

NAME
twet-app.1
_ twet-app.1
twet-app.2

IMAGE
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

NODE
ip-10-100-2-93
m4j4g27conj199uciw98k5h1b
ip-10-100-2-93

DESIRED STATE
Running
Shutdown
Running

CURRENT STATE
Running 12 seconds ago
Running 50 seconds ago

ERROR
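
A node-failure test can assert that, after the reboot, no task that is supposed to be running is still placed on a node that is not Ready. The following sketch shows that check on plain data; the function name and the tuple layout are hypothetical, and hostnames are used for every task even though the listing above shows the node ID for the old task.

```python
def tasks_stranded_on_down_nodes(tasks, node_status):
    """Return names of tasks that should run but sit on a non-Ready node.

    tasks: list of (name, node, desired_state) tuples
    node_status: dict mapping hostname -> status ("Ready", "Down", ...)
    """
    return [name for name, node, desired in tasks
            if desired == "Running" and node_status.get(node) != "Ready"]


# Node states after the reboot of ip-10-100-2-67.
nodes = {"ip-10-100-2-67": "Down", "ip-10-100-2-93": "Ready"}

# Task placement before Swarm reacts, and after it has rescheduled.
before = [("twet-app.1", "ip-10-100-2-67", "Running"),
          ("twet-app.2", "ip-10-100-2-93", "Running")]
after = [("twet-app.1", "ip-10-100-2-93", "Running"),
         ("twet-app.1", "ip-10-100-2-67", "Shutdown"),
         ("twet-app.2", "ip-10-100-2-93", "Running")]

print(tasks_stranded_on_down_nodes(before, nodes))
print(tasks_stranded_on_down_nodes(after, nodes))
```

An empty list after the failure indicates that Swarm has moved every desired task to a surviving node.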

Sameer Kumar – Senior Solution Architect

Sameer Kumar is Database Solution Architect working with Ashnik. He has worked on many complex setups and migration assignments for some of the key customers from Retail, BFSI and Telecom Sector. Sameer is a certified PostgreSQL and EDB Postgres Plus Advanced Server Professional. He is also a certified Postgres Trainer and has delivered many trainings for public and corporate batches. He is well versed with other RDBMS e.g. DB2, Oracle, and SQL Server and is also trained on NoSQL technologies viz MongoDB. He has worked closely with customer and helped them build analytics platform on NoSQL databases and migrate from RDBMS to MongoDB. And while he’s in the free mode, he loves to take his cycle around Singapore for a spin.
