Extreme Testing Demo
Introduction
There has been some development on the extreme testing project. A project is being created from existing open source components, and there has been some thought of modifying the QA spec based on the demo. Sampath Priyankara, Oladeji Olawoye and Gautam Divgi have been working on the code. Below are some very manual instructions on how to run a small demo.
CAUTION!!!
- This is the pre-pre-pre-...-alpha version of Eris. Please do not take it to be something that will look like the final version or even the alpha version. There are many manual steps that need to be automated and several other stories/blueprints yet to come.
- This is a very simple example, but it should provide the idea of how this can be expanded on.
- The example below demonstrates a haproxy failure. If you are not using haproxy, this can very easily be converted into an example for other failures as well.
- The code is Python 2.7 only at the moment. The authors are definitely looking at >= 3.5 compatibility.
- We're just doing control plane failures at the moment - not data plane.
Environment
Hardware Stack
TODO: document the type of hardware used and its configuration, possibly with a "tested on" hardware list; the network configuration if any specific hardware/connectivity is used; and maybe the ip address schemes in their own section.
Software Stack
- Ubuntu 16.04
- Python >= 2.7.12 and <= 2.7.14, with virtualenv (installed separately or bundled with your Python installation)
Project Setup
Assumptions
The demo assumes you have a working Ubuntu 16.04 environment with Python >= 2.7.12 and <= 2.7.14. It also assumes virtualenv is available, either installed separately or bundled with your Python installation.
The demo also assumes you have created a new user called eris_test for this purpose.
Download
The source code is available on GitHub at
https://github.com/gautamdivgi/eris
Installing eris code
Create eris virtualenv
Eris will not run as root and should definitely not install packages into the default Python installation. To run eris, we want to create a virtualenv in the user's login environment. At the login shell, create the virtualenv:
$ virtualenv eris
Then activate the virtualenv and install ansible & rally.
$ source eris/bin/activate
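The install command itself is not captured above; with the virtualenv active it is presumably a pip install along the lines of:

# assumed pip invocation -- the demo only states the resulting versions
(eris) $ pip install 'ansible==2.4.*' 'rally==0.9.1'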
This should install ansible 2.4 and rally 0.9.1
Get the source code
The current source code is at https://github.com/gautamdivgi/eris, which is a fork of the eris repo. This will change shortly; for now, clone this repo:
(eris) $ git clone https://github.com/gautamdivgi/eris.git
(eris) $ cd eris
This step will provide all the latest source code.
Create additional directories
Some additional directories are needed. They basically store private keys for ssh, inventory information, etc. A sketch of the creation commands follows the directory descriptions below.
(eris) $ cd eris
The description of the directories is as follows
config: Any configuration files like the target zone setup, rally deployment configs, ansible configs, etc.
etc: Data files such as ids, passwords, private keys, etc. This should be a protected directory with rwx access for the user only.
tests: The place where all the tests happen.
tests/inventory: This one is a bit of a hack (well, even more so than the rest of the process here). Sometimes when your inventory is derived from either an http(s) or an ssh connection it can get time consuming to generate the inventory every time. So, this is where the inventory json is pre-generated and stored for use. You really shouldn't need this if you are using a file based inventory (described later).
tests/playbooks: All the test cases go here. Yes - they're ansible playbooks (for now at least).
tests/scenarios: The directory for the rally scenarios.
tests/sched_jobs: For running jobs at specified intervals or time triggers eris uses a scheduler. This scheduler is necessary because the linux/unix "at" may or may not be installed with certain distributions as it is considered a security risk. The scheduler lives and dies with the test case and polls for jobs in this directory. This is where the jobs for failure injection and monitoring are scheduled.
tests/task_out: The output of the scheduled tasks.
tests/templates: The various jinja2 templates used by the playbooks. Primarily these are jobs to be scheduled and failure injection.
tests/tmp: Log files, stdout, stderr of background tasks and everything else goes here.
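A minimal sketch of that creation step, assuming the layout above (not a script from the repo; the chmod reflects the protection requirement noted for etc):

# create the eris working directories described above
(eris) $ mkdir -p config etc tests/inventory tests/playbooks tests/scenarios tests/sched_jobs tests/task_out tests/templates tests/tmp
# etc holds keys and credentials, so restrict it to the owning user
(eris) $ chmod 700 etc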
Copy source files
Some source files need to be copied into the correct directories. This is another manual pain point that obviously needs to go away.
(eris) $ cd eris
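The exact copy commands are not captured above. Assuming the templates and playbooks sit at the top of the repo checkout (see the Source Files table at the end for the destinations), the step would look roughly like:

# hypothetical copy step -- adjust the source paths to the actual repo layout
(eris) $ cp templates/*.j2 tests/templates/
(eris) $ cp playbooks/*.yml tests/playbooks/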
Now the eris project is set up and we can go into test case & scenario setup. This step is essentially doing what an automated setup script or setup.py would do.
Configuring Eris
Assumptions
- You have an already installed OpenStack cloud (not devstack)
- You have root private key access to all the machines - compute & control.
- You have a map of the deployment from which you can create a JSON file (or you have installed via Fuel).
Creating the inventory
The eris configuration file
The eris inventory is created by the erisinv script. However, since we would like the script to be run directly on the ansible or ansible-playbook command line as a dynamic inventory, the program cannot take configuration files or variables on its command line. As per ansible dynamic inventory rules, the only option it takes is "--list". The rest of the information needs to come through environment variables. In the case of eris, the inventory sources all information from an environment variable called ERIS_CONFIG_FILE. This variable points to a configuration JSON file that gives erisinv all the information needed to create an ansible dynamic inventory. This document will not describe how an ansible dynamic inventory behaves or what it looks like; for that, refer to Developing Ansible Dynamic Inventories. Here we describe how to create the configuration file for a file based inventory. The inventories work off a plugin method, so creating more varieties is possible. However, that documentation will come later.
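As a quick sanity check, erisinv can be invoked by hand exactly as ansible would invoke it (the config path here is illustrative):

# point erisinv at the config, then request the full inventory as ansible does
(eris) $ export ERIS_CONFIG_FILE=/home/eris_test/eris/config/eris_test_config.json
(eris) $ ./bin/erisinv --list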
File based inventories
The eris config file for a file based inventory is the simplest. It looks roughly like the following (the paths and ssh variables here are placeholders for illustration):

{
    "deployment_map": "/home/eris_test/eris/config/test_deployment.json",
    "deployment_ssh": {
        "ansible_ssh_user": "root",
        "ansible_ssh_private_key_file": "/home/eris_test/eris/etc/id_rsa"
    }
}
The field descriptions are given below. For now, always provide fully qualified paths in the file.
deployment_map: This is a simple JSON array (described below) on how to describe the target deployment.
deployment_ssh: All the ansible ssh variables needed to access every aspect of the deployment.
The deployment map for file based inventories
The deployment map for file based inventories is simple. It contains an array of nodes (servers, VMs, etc.) along with their ip addresses and any special properties ansible should know of. An example (with placeholder values) looks like:

[
    {
        "name": "haproxy-1",
        "ip": "192.168.10.11",
        "mac": "52:54:00:ab:cd:01",
        "type": "vm",
        "groups": ["aic-haproxy"],
        "ansible_ssh_variables": {}
    }
]
The field descriptions are:
groups: An array of groups that the node/host belongs to.
ip: The ip address of the host. Prefer ip addresses to dns FQDNs. This is because if a failure injection impacts the path to the DNS, we should still be able to control the host via a direct ip address path.
mac: The MAC address of the host
name: The ansible alias for the host.
type: The type of host - currently only "vm" or "bare-metal" is supported. This could be expanded to have racks, switches, etc.
ansible_ssh_variables: Any ssh variables for the host that will override the default variables provided in the configuration file.
Creating the configuration files
Create the deployment map and the configuration file in the /home/eris_test/eris/config directory. Modify the ERIS_CONFIG_FILE variable to point to the configuration, and add that export to either your .bashrc or your virtualenv activate script. The deployment JSON should be saved with the name specified in the "deployment_map" field of eris_test_config.json.
export ERIS_CONFIG_DIR=${HOME}/eris/config
export ERIS_CONFIG_FILE=${ERIS_CONFIG_DIR}/eris_test_config.json
Save your private keys in /home/eris_test/eris/etc. The private keys specified in all configuration files above should point to this directory. Also, always provide the fully qualified path name in all configuration files.
Rally deployment JSON
Eris will use rally to run load on the system. It will create a rally deployment as part of the test. In the /home/eris_test/eris/config directory, create a rally deployment JSON. An example (rally's standard ExistingCloud format; the endpoint and credentials are placeholders) is:

{
    "type": "ExistingCloud",
    "auth_url": "http://keystone.fuel.local:5000/v3/",
    "region_name": "RegionOne",
    "admin": {
        "username": "admin",
        "password": "CHANGE_ME",
        "project_name": "admin",
        "user_domain_name": "Default",
        "project_domain_name": "Default"
    }
}
Use a deployment that works for you; OpenStack/rally is very well documented and we do not intend to recreate that here. In this case we are not supplying a rally config, so rally will pick a default sqlite3 db in the /tmp directory. While this works for a demo, it will change in the future.
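For reference, the create_deployment step in the playbooks shown later presumably wraps the standard rally CLI call, which by hand would be (sre1 being the deployment name used in the authors' lab):

# register the existing cloud with rally
(eris) $ rally deployment create --file sre1.json --name sre1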
Setting up the scenario
Rally load generation
The load generation will be a simple nova_boot_and_delete. The eris playbook currently injects failures based on time; hence, the scenario is run for a duration as opposed to an iteration count. The sample used here follows rally's standard boot-and-delete sample with the constant_for_duration runner (the duration, concurrency and image/flavor values below are illustrative):

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "cirros.*" %}
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": { "name": "{{flavor_name}}" },
                "image": { "name": "{{image_name}}" }
            },
            "runner": {
                "type": "constant_for_duration",
                "duration": 1800,
                "concurrency": 5
            },
            "context": {
                "users": {
                    "tenants": 1,
                    "users_per_tenant": 1
                }
            }
        }
    ]
}
Keep in mind that the duration is the load run time and does not represent the elapsed time of the entire test. This is important when choosing times to inject failures. Create this file as boot-and-delete-for-duration.json in the eris/tests/scenarios directory.
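The run_rally_task step in the playbook then presumably reduces to starting this scenario with the standard rally CLI (run from eris/tests):

# start the duration-based load; rally prints per-action stats on completion
(eris) $ rally task start scenarios/boot-and-delete-for-duration.json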
Scheduler commands
The sched_daemon takes commands to schedule jobs either one-time (useful for failure injection) or on a recurring basis (useful for gathering monitoring data). Currently, we have 1 failure injection (a haproxy failure) and 3 monitors (top, netstat and free). However, as will be seen, this can easily be expanded. The life of the scheduler starts with the test when the ansible playbook is run and ends with the playbook. The way to have the scheduler act on a specific command or job is to copy the file into the tests/sched_jobs directory. All the commands that the scheduler runs need to be set up in the tests/templates directory.
Scheduler command JSON
There are two types of commands: control commands and job commands. The control commands offer rudimentary control of the scheduler - i.e. stop the scheduler or kill all the jobs in the scheduler.
The control command has only 2 fields
type: "CMD"
command: "STOP" or "KILL"
The job commands for one-time jobs have the following fields
type: "JOB"
outdir: The output directory for the job. Typically templatized (and for the purposes of this test tests/task_out)
run_at: The command will be run at "run_at" seconds from now.
cmd: The command to run (typically an ansible command with specific targets)
tag: A user friendly tag for the job
The job commands for recurring jobs have the following fields
type: "JOB"
outdir: The output directory for the job. Typically templatized (and for the purposes of this test tests/task_out)
run_at: The command will be run at "run_at" seconds from now.
repeat: Repeat every "repeat" seconds
until: Stop running after "until" seconds from now
cmd: The command to run (typically an ansible command with specific targets)
tag: A user friendly tag for the job
Control commands
tests/templates/stop_command.j2
{ "type": "CMD", "command": "STOP" }

tests/templates/kill_command.j2
{ "type": "CMD", "command": "KILL" }
Copying either of these into the tests/sched_jobs directory will cause the scheduler to act on them. This is normally done via the template command inside an ansible playbook.
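For illustration, such a task might look like the sketch below; the test_dir variable and the sched_jobs destination follow the directory layout above, though the exact task in the repo may differ:

# render the stop command template into the directory the scheduler polls
- name: 'stop_the_scheduler'
  template:
    src: "{{ test_dir }}/templates/stop_command.j2"
    dest: "{{ test_dir }}/sched_jobs/stop_command.json"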
Monitors
tests/templates/free_monitor.j2
{ "type": "JOB", "outdir": "{{ outdir }}", "run_at": {{ run_at }}, "repeat": {{ repeat_every }}, "until": {{ until_time }}, "cmd": ["ansible", "aic-haproxy", "-i", "{{ inv_bin }}", "-m", "command", "-a", "free"], "tag": "free-monitor" }

tests/templates/top_monitor.j2
{ "type": "JOB", "outdir": "{{ outdir }}", "run_at": {{ run_at }}, "repeat": {{ repeat_every }}, "until": {{ until_time }}, "cmd": ["ansible", "aic-haproxy", "-i", "{{ inv_bin }}", "-m", "command", "-a", "top -b -n1"], "tag": "top-monitor" }

tests/templates/netstat_monitor.j2
{ "type": "JOB", "outdir": "{{ outdir }}", "run_at": {{ run_at }}, "repeat": {{ repeat_every }}, "until": {{ until_time }}, "cmd": ["ansible", "aic-haproxy", "-i", "{{ inv_bin }}", "-m", "command", "-a", "netstat -a"], "tag": "netstat-monitor" }
The free, top and netstat monitors look at the target haproxy nodes to track memory, process and network activity. As can be seen, they use the ansible command module under the covers. The template parameters need to be specified in the test playbook. The "aic-haproxy" in these is a group in the inventory; specify a group, or templatize via ansible which groups need to be monitored. Beware of having too many ssh connections, especially via a jump host.
Failure injection
The failure injection scenario terminates all haproxies in front of the control services. This is done to check whether the HA mechanism can recover from this extreme scenario.
tests/templates/haproxy_failure.j2
{ "type": "JOB", "outdir": "{{ outdir }}", "run_at": {{ hapxy_fail_run_at }}, "cmd": ["ansible", "{{ targets }}", "-i", "{{ inv_bin }}", "-m", "command", "-a", "pkill -{{ kill_signal }} -f haproxy"], "tag": "haproxy-kill-all" }
Running the Failure Scenario
Creating the YML for the failure scenario
The YAML for the failure scenario is pasted below. We generally create two versions of this: a dry-run version and the real failure injection. The difference is that the dry run version only creates the various files from the templates and tests whether all variables are present, while the actual version runs the failure injection. The two YAMLs are provided one after the other to enable a comparison. In general, the standard procedure for all tests will follow this algorithm:
- Setup the environment variables (including failure injection times, etc.)
- Start the monitors
- Start the rally load
- Schedule the failure injection
- Wait for completion.
- Stop the scheduler.
- <Not Implemented>: From the data collected by the monitors and rally compute KPI and signal a pass/fail for the test.
tests/playbooks/rally_playbook_dry_run.yml
---
#- name: 'start_scheduler'
- name: 'start_free_monitor'
- name: 'start_top_monitor'
- name: 'start_netstat_monitor'
- name: 'create_rally_db'
- name: 'check_deployment'
#- name: 'run_rally_task'
- name: 'create_hapxy_fail'
- name: 'stop_the_scheduler'

tests/playbooks/rally_playbook_hapxy_fail.yml
---
- name: 'start_scheduler'
- name: 'start_free_monitor'
- name: 'start_top_monitor'
- name: 'start_netstat_monitor'
- name: 'create_rally_db'
- name: 'create_deployment'
- name: 'check_deployment'
- name: 'run_rally_task'
- name: 'create_hapxy_fail'
- name: 'stop_the_scheduler'
Once everything is set up in the playbooks directory, you are ready to run the test.
Things to check before running
- Check that ERIS_CONFIG_FILE and ERIS_CONFIG_DIR have been set to the correct configuration you need to use.
- Ensure you are in the eris virtualenv - i.e. you've run source eris/bin/activate.
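A quick way to verify the first point (the path shown is the one used in this demo); the (eris) prompt prefix itself indicates the virtualenv is active:

(eris) $ echo ${ERIS_CONFIG_FILE}
/home/eris_test/eris/config/eris_test_config.json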
Running the failure injection
Running the failure injection is simple.
To run a dry run use:
(eris) $ cd eris/tests
(eris) $ ansible-playbook --extra-vars="test_dir=/home/eris_test/eris/tests" -i ../bin/erisinv playbooks/rally_playbook_dry_run.yml
To run an actual failure injection scenario use
(eris) $ cd eris/tests
(eris) $ ansible-playbook --extra-vars="test_dir=/home/eris_test/eris/tests" -i ../bin/erisinv playbooks/rally_playbook_hapxy_fail.yml
What you should expect
The rally transaction log should show mostly successful transactions - until the haproxies fail. However, if you have a correct HA setup, the haproxies will recover and the rally transaction log should start showing successes again. Computing the KPI to signal success/failure of a scenario is a critical element and will be the next point of our development.
What actually happened (Some Results)
The failure injection was run and the results were somewhat surprising. Although the control plane recovered pretty well thanks to the pacemaker/corosync clustering of the haproxies, the rally SLA showed only a 0.5% success rate. Because of the constant concurrency requirement of the task, rally tries to keep up the concurrency even when the connection can't be established. The connection failure is a fast-fail event; hence, rally generates a huge number of requests that all fail very fast and skew the numbers very badly.
The log file will have a slew of entries that look like
2017-10-30 11:52:34.087 30513 INFO rally.task.runner [-] Task 74e50633-b50c-4b05-b086-3ecc769cf4c5 | ITER: 7835 END: Error ConnectFailure: Unable to establish connection to http://keystone.fuel.local:5000/v3/: HTTPConnectionPool(host='keystone.fuel.local', port=5000): Max retries exceeded with url: /v3/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80a29b6210>: Failed to establish a new connection: [Errno 111] Connection refused',))
The results from rally show the 0.5% SLA
+-----------------------------------------------------------------------------------------------------------------------+
|                                                  Response Times (sec)                                                 |
+--------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action             | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| nova.boot_server   | 0.002     | 0.002        | 0.005        | 0.006        | 20.435    | 0.085     | 0.5%    | 7845  |
| nova.delete_server | 2.733     | 3.515        | 5.261        | 5.585        | 6.36      | 3.797     | 97.3%   | 37    |
| total              | 0.002     | 0.002        | 0.005        | 0.006        | 23.95     | 0.103     | 0.5%    | 7845  |
|  -> duration       | 0.002     | 0.002        | 0.005        | 0.006        | 22.95     | 0.098     | 0.5%    | 7845  |
|  -> idle_duration  | 0.0       | 0.0          | 0.0          | 0.0          | 1.0       | 0.005     | 0.5%    | 7845  |
+--------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
This case actually highlights the dangers of not using a good characteristic load when running destructive (or performance) scenarios, and the importance of modeling the input well.
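One way to avoid this particular skew, assuming rally's standard runners, is an open-loop runner such as rps, which issues requests at a fixed arrival rate instead of refilling a fixed concurrency pool, so fast-fail errors do not multiply the request count. A sketch of the runner stanza (values illustrative):

{
    "runner": {
        "type": "rps",
        "rps": 2,
        "times": 3600
    }
}

This does not make the failures disappear; it keeps the request count representative of the intended load so the SLA numbers stay meaningful.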
Source Files
File name | Location | Source |
---|---|---|
test_config.json | /home/eris_test/eris/config | Create one based on configuration - sample in section "Creating the inventory" |
test_deployment.json | /home/eris_test/eris/config | Create one based on configuration - sample in section "Creating the inventory" |
rally_test_deployment.json (as sre1.json - the deployment name used in the authors' lab) | /home/eris_test/eris/config | |
boot-and-delete-for-duration.json | /home/eris_test/eris/tests/scenarios | |
free_monitor.j2 | /home/eris_test/eris/tests/templates | |
top_monitor.j2 | /home/eris_test/eris/tests/templates | |
netstat_monitor.j2 | /home/eris_test/eris/tests/templates | |
haproxy_failure.j2 | /home/eris_test/eris/tests/templates | |
stop_command.j2 | /home/eris_test/eris/tests/templates | |
kill_command.j2 | /home/eris_test/eris/tests/templates | |
rally_playbook_dry_run.yml | /home/eris_test/eris/tests/playbooks | |
rally_playbook_hapxy_fail.yml | /home/eris_test/eris/tests/playbooks | |