Design of mite
Mite is designed as a distributed system made up of several components.
Some are required, and others are optional. The components communicate
over TCP, so you can run them anywhere that they can talk to each other
in this fashion: as ordinary processes on a single machine, in
containers (on a single machine or in a cloud container orchestration
environment like Kubernetes) or on different machines entirely.
The components that make up mite are:
controller*: manages a test scenario and feeds tasks to the runners.
runner: responsible for injecting the test load into the target system
duplicator*: a message router between the controller/runner and their
downstream components
collector: logs messages appearing on the mite message bus to a file
receiver: dispatches incoming messages to connected processors
recorder*: listens for special messages on the bus and records them
to a file.
stats: aggregates raw messages from the controller and runner into
statistical summaries
exporter*: (aka prometheus exporter) listens for aggregations from the
stats component, and exposes these via HTTP for a time series database
(Prometheus) to scrape
The components marked with an asterisk are singletons; the rest can be
scaled up to meet the demand of the test. The communication pathways
between the components are represented in the following diagram:
Distributed components
In order to achieve high throughput, it’s necessary to split mite into
discrete components and distribute them across cores/machines. Mite as
an entity consists of a number of these components, each with their own
responsibilities for test control, execution and/or data collection.
Duplicator
The mite duplicator is responsible for ingesting messages from multiple
sources and then duplicating them to other components. It ensures that
messages that arrive at an incoming socket are distributed exactly once
to each of N outgoing sockets. It is usually the first component to
start. The controller and runners feed their logs into the duplicator
and cannot run without it. Only one duplicator can run at a time.
To start the duplicator with one inbound socket (--message-socket) and
two outbound sockets (the remaining arguments) for the stats and
collector components, run the following.
mite duplicator --message-socket=tcp://0.0.0.0:14302 tcp://127.0.0.1:14303 tcp://127.0.0.1:14304
Stats
The mite stats component aggregates statistics out of the messages
passed on from the duplicator. As many stats daemons can be run as is
necessary to achieve the desired throughput. It can be passed an
optional in and out socket, but in our case we’ll just run it with the
default sockets. The default in socket is tcp://127.0.0.1:14304, which
we previously told our duplicator to output on. The default output
socket is on port 14305.
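With those defaults, starting a stats daemon should need no extra
arguments (flags for overriding the sockets are omitted here):
mite stats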
Collector
The mite collector is an optional component designed to write the raw
logs to files. It defaults to writing 10000 log lines to a file before
rolling and timestamping the file. The most recent file is named
current. Again, as many of these can be run as is necessary to achieve
throughput. However, each collector must be passed a different
--collector-dir, to avoid them overwriting each other’s logs. The log
lines will be round-robined across all the collectors’ output
directories.
mite collector --collector-dir mite-logs
Recorder
The mite recorder is an optional component designed to record data to
disk. It can be used for storing values created during a test: it
handles messages of type data_created to write data and purge_data to
delete old recorded values. There is an optional --recorder-dir
argument to specify a folder where the files are created.
The logs are written in msgpack format; the repo includes a
cat_msgpack.py script for quickly dumping their contents (but the most
common usage will be to read them in programmatically in test journey
files). We could also have specified an optional inbound socket, but
here we are relying on the default of tcp://127.0.0.1:14303, previously
specified as an output socket for the duplicator.
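Mirroring the collector example, starting the recorder on the default
socket should look something like this (the directory name is an
arbitrary example):
mite recorder --recorder-dir mite-recordings  # directory name is arbitrary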
Receiver
The mite receiver component is a generic mechanism that is not limited
to fulfilling a single role (e.g. stats or collector). This makes it
easier to change a mite pipeline without introducing new processes and
network components. A mite.cli.receiver instance can have multiple
custom processors connected to it as either listeners or raw listeners,
and it dispatches incoming messages to them.
mite receiver tcp://127.0.0.1:14303 \
--processor=my.custom.processors:StatsProcessor \
--processor=my.custom.processors:CollectorProcessor
mite receiver tcp://127.0.0.1:14310 \
--processor=my.custom.processors:PrintProcessor
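The first invocation above attaches two custom processors to a single
receiver reading from the duplicator’s 14303 output socket, so each
incoming message is dispatched to both; the second runs an independent
receiver on a separate socket with its own processor.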
Prometheus Exporter
The prometheus exporter provides an HTTP metrics endpoint for the time
series database to scrape, pulling metrics from the stats components.
In our case, the stats components will output on their default socket
and the exporter is configured to read from there by default.
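With everything on default sockets, the exporter can typically be
started without further arguments; the subcommand name below is a
sketch based on the mite CLI, so check your installed version if it
differs:
mite prometheus_exporter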
There is more about data visualization on the next page (Data visualization).
Runners
The mite runners are the components responsible for injecting the load
into the system under test. As many of these can be created as is
necessary for the volume of load you are injecting, but for optimum
performance you should make sure that each has a whole CPU core on
which to run. The runner needs two arguments: a socket it can use to
talk to the controller, and a message socket it can use to send
messages to the duplicator. In the instance below, we’ll let it use the
defaults of 14301 for communicating with the controller and 14302 for
messages out to the duplicator.
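With those defaults, each runner should be startable with no arguments;
start one process per CPU core you want to dedicate to injection:
mite runner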
Controller
The last component to run is the mite controller. It dictates the
scenario to run, loads and distributes the config to the runners, and
is responsible for managing the work that the runners are doing. As all
our components are set up to use default sockets, we just have to
specify the scenario to run, in the format of a Python importable
module and a name in that module (separated by a colon).
mite controller mite.example:scenario
Useful topologies
While it’s true that increasing the heft of a test scenario increases
the load on all mite’s components, nowhere is the increase more apparent
than on the runners, the component directly responsible for injecting
the load on a 1:1 basis into the target system. Therefore, it makes the
most sense to separate the runners from the other components of mite.
This both avoids interference between the components and makes it easy
to scale the resources devoted to the runners.
Resource usage
In the following sections, we describe our “real-world” usage of mite.
This infrastructure is somewhat overprovisioned for our needs, but our
focus has been on the performance of the applications which we test.
While not wishing to be profligate, we believe it would be a mistake to
shrink the injection infrastructure close to its performance limits.
When resource constraints appear in an NFT exercise, they should stem
from the system under test and not the test apparatus. Anything other
than the most occasional exception to this rule is an indication that
the NFT is not efficiently organized.
In fact, we have reason to believe that the peak performance of mite, in
terms of maximum throughput per CPU and memory devoted to running mite,
is significantly above what is implied by these numbers. Furthermore,
our resource usage already compares favorably with other performance
engineering teams in Sky using different tools, even without having
performed a dedicated performance tuning of our injection
infrastructure.
Single machine
In the doc/example/ directory, you will find a docker-compose.yml file
which will deploy a full mite stack, along with supporting programs, on
your local machine. While impractical for running a significant load
injection exercise (due to the performance limitations of a single
machine), this serves to illustrate the mechanics of getting the
components to talk to each other and the outside world. For a concrete,
configuration-level view of setting up a mite pipeline, the reader is
referred to the files in that directory. In the sections that follow,
we will discuss a few more abstract considerations for deploying mite in
different kinds of environments.
Virtual machines
At Sky, we have run mite in a configuration with 4 virtual machines. One,
with 16GB of memory and 16 cores, hosts the controller, duplicator, stats,
exporter, and (if warranted) collector and recorder. The other three have
8GB of memory and 16 cores each, and each host 16 runner processes. Our
system under test consists of a 1:1 replica of the production environment,
deployed with each weekly release candidate. The underlying hardware is
in a corporate datacentre (though it could just as easily correspond to
servers rented from a colocation facility or VM provider). We regard
this as fairly typical of a traditional NFT setup in a medium to large
tech company.
We have used this infrastructure to inject load of up to 12k tps into
our system under test across a variety of journeys, including some which
simulate full user interaction with the platform, i.e. signin → modify
data → signout.
(Note that the provisioning of our test injection infrastructure is also
undercharacterized above: far more important than memory for the runners
is the bandwidth from them to the system under test – which is also less
straightforward to quantify than VM size. Our injectors and
system under test are colocated in the same datacentres, both on the
inside of the corporate firewall. This provides ample bandwidth for our
use case.)
We hope that this description of our usage will provide you with an idea
of the scale of infrastructure which mite requires to run, and will help
you to architect your deployment as well.
The Cloud
In addition to the traditional VM-based deployment described above, we
have also used mite in a “cloud” environment – specifically in a
Kubernetes cluster. As above, this is provisioned by the company, but
it could just as easily be part of a hosted Kubernetes offering such as
GKE.
In addition to the difference in the space into which the applications
are deployed, this environment also comes with a different release
cadence: continuous delivery is used with nightly NFT runs (recycling
the resources that are used to run CI testing during the day as
developers work on the code). Finally, the environment also has NFRs
that are roughly an order of magnitude larger than the traditional
VM-based one.
Mite as a distributed system made of discrete units is in many ways
well-adapted to such an environment. We have deployed it into the cluster
with the following resource allocations:
Component  | Replicas | CPU  | Memory
-----------|----------|------|-------
Controller | 1        | 2    | 500MB
Duplicator | 1        | 2    | 100MB
Exporter   | 1        | 0.25 | 100MB
Runner     | 50       | 1    | 500MB
Stats      | 20       | 1    | 50MB
For injecting loads of up to 22k tps, we have found 50 runners and 20
stats to be more than sufficient. (We have noted that the abstract
“CPU” is more performant in this environment than in the VMs in the
previous section.) As before, the network bandwidth used by mite in
this environment is not characterized; we have not run into problems
with our assumption that all the relevant pipes are fat enough for
within-cluster communication of the scale that we require.