Eris Talking Points

Introduction

Heading into the PTG in Denver, we may want to have a consistent presentation of why we are making certain technology choices and why a new project is needed. This space is for collaboratively putting that presentation together.

Current Discussion Points

Sampath's thread on the openstack-operators mailing list

Gautam's Analysis: The Rally devs definitely seem to be asking us to look at Rally (which we've already done, and that's a good thing (smile)).

Current Rally Roadmap

The current rally roadmap is here - https://docs.google.com/a/mirantis.com/spreadsheets/d/16DXpfbqvlzMFaqaXAcJsBzzpowb_XpymaK2aFY2gA2g/edit#gid=0

The roadmap as of 08/16/2017 (for those who can't access Google Docs from their company intranet)

The current readthedocs - http://docs.xrally.xyz/projects/openstack/en/latest/

The current feature requests - http://docs.xrally.xyz/projects/openstack/en/latest/feature_requests.html

A quick analysis shows that the main blocker is this - task-data-struct-refactor (https://blueprints.launchpad.net/rally/+spec/refactor-db-schema). The blueprint was opened in 04/2016 and is currently pending approval.

Also, it is not clear how long a lot of the "in-progress" tasks will take.

Some Options

  • Option 1: Develop Ansible orchestration around a single Rally.
    • Easiest option; no refactoring of existing Rally code.
    • Scenarios are still Rally; VM workloads are still in Rally.
    • Scenario and runner functions can be exposed as Ansible plugins/modules (a sketch follows this list).
    • Can there be an efficient Rally-to-Ansible-plugin interface? Probably not.
    • That interface can cause inefficiency, especially when calculating metrics.
    • There may be community push-back.
      • But a lot of the Rally features we need are "blocked" behind a major Rally refactoring.
  • Option 2: Go our own route with our own project.
    • Re-use Rally code initially, or limit Rally use to specific situations/scenarios.
    • We could refactor into Ansible; that will take time, but it could be an eventual goal.
    • We would not be limited by Rally.
    • Open question: is Rally really not a good fit, or is it just a matter of refactoring it/changing its architecture?
    • Either way, we need to agree on a clear presentation of why a separate project is needed and why Rally is not a good fit for all the functionality.
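
As a rough sketch of the "scenario and runner functions as Ansible modules" idea above, a minimal custom Ansible module could simply shell out to the Rally CLI. The module name, parameter names, and file paths below are hypothetical; this only illustrates the shape of the interface, not a settled design.

```python
#!/usr/bin/python
# Hypothetical Ansible module (e.g. library/rally_task.py) wrapping the Rally CLI
# so that a playbook can launch a Rally task like any other Ansible step.
from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(
        argument_spec=dict(
            task_file=dict(type='str', required=True),   # Rally task YAML/JSON
            deployment=dict(type='str', required=True),  # existing Rally deployment
        )
    )
    cmd = ['rally', 'task', 'start', module.params['task_file'],
           '--deployment', module.params['deployment']]
    rc, stdout, stderr = module.run_command(cmd)
    if rc != 0:
        module.fail_json(msg='rally task failed', rc=rc, stderr=stderr)
    # Raw stdout is returned as-is; pulling per-iteration metrics back out would
    # mean parsing Rally's task results, which is exactly the inefficiency
    # concern raised in the list above.
    module.exit_json(changed=True, stdout=stdout)


if __name__ == '__main__':
    main()
```

A playbook would then invoke this like any other task (passing a task file and deployment name), and the orchestration concerns (parallelism, ordering relative to fault injection) would live entirely in Ansible.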

Project Name

Project: Eris

Inspiration: Greek goddess of strife, discord and destruction

Talking points for Sydney Summit

Guiding Principles (as always)

Community: If there is a project that seems like a good fit and is open source, we just use it and participate actively in its development. However, it needs to align with OpenStack community capabilities, i.e. it should be able to absorb Python plugins and should expose a Python API.

Technical:

  • Agentless execution: attempt not to install a bunch of extra agents. Data from agents can also be suspect during a failure injection.
  • Extensible: should be able to plug in both proprietary and open source tools.
  • Feedback: allow feedback between every component of the test (e.g. feedback from a VM evacuate operation to inject a fault, feedback from a metrics component to change load and inject a fault, etc.).
  • Non-deterministic: success or failure is decided on the basis of KPIs, not fixed assertions.
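
As a minimal sketch of the agentless principle: sample a metric from a controller over plain SSH, with nothing installed on the target. The host name and command here are placeholders, and key-based SSH access is assumed.

```python
import subprocess


def mem_available_kb(host: str) -> int:
    """Read MemAvailable on a remote host over plain SSH (no agent required)."""
    out = subprocess.run(['ssh', host, 'grep MemAvailable /proc/meminfo'],
                         check=True, capture_output=True, text=True).stdout
    # Output looks like: "MemAvailable:    8123456 kB"
    return int(out.split()[1])


if __name__ == '__main__':
    print(mem_available_kb('controller-01'))  # hypothetical controller host
```

The same pattern works for any command-line probe (rabbitmqctl, systemctl is-active, etc.), which is the kind of lightweight, SSH-based monitoring discussed in the summary further down.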

Eris Scope (from forum posting)

The entire scope of "extreme testing" is fairly large and splits into 3 major parts:

1. References: Extreme testing is non-deterministic. Such testing is generally only meaningful with respect to the following references.

Architecture: Software and hardware architecture of the deployed cloud.

Workload: Proposed workload injected into the control and data plane.

KPI/SLO: The KPIs/SLOs that are measured for the reference architecture(s) under the reference workload(s).

2. Test Suite: What test scenarios are we trying to get done?

Control Plane Performance: Benchmarks for control plane performance should be derived from an Eris test suite.

Data Plane Performance: Benchmarks for data plane performance should be derived from an Eris test suite.

Resiliency to Failure: Failure injection scenarios coupled with control and data plane load should provide KPI on the cloud installation’s resiliency to failure.

Resource scale limits: Identify limits of how much we can scale resources. Examples include: what is the max memory for VMs? How many computes can we support? How many subnets can be created? What is the max size of a cinder volume? How many cinder volumes, etc.?

Resource concurrency limits: Identify limits of how many concurrent operations can be handled per resource. Examples include: Reconfiguring a network on a large tenant of 300+ VMs – how many concurrent operations can the single subnet handle?

Operational readiness: This has different meanings for open source Eris vs. an operator's version. For open source Eris this will include a smoke test of a specific number of tests to run at the OpenStack QA gate. For an operator it will include an expanded set of tests for running in production (including destructive tests).

3. Frameworks & Tools: How do we enable the test suite?

Repeatable Experiments: Eris should have the capability to create repeatable experiments and reliably reproduce results of non-deterministic test scenarios.

Test creation: Eris should have the capability to create test cases using an open specification like YAML or JSON and encourage test case reuse.
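
A strawman of what such a spec could look like, tying together the three references from the scope above (architecture, workload, KPI/SLO). Every key and value is invented for illustration; PyYAML is used here only to show that the spec stays plain data.

```python
import yaml  # PyYAML, assumed available

# Invented schema: the keys below are a strawman, not a settled format.
SPEC = """
name: evacuate-under-load
architecture: 3-controller-ha          # reference architecture label
workload:
  control_plane: {tool: rally, task: boot-and-delete.yaml, concurrency: 10}
  data_plane: {tool: shaker, scenario: l3_east_west}
faults:
  - {type: kill_service, service: mariadb, host: controller-01, after_sec: 120}
slo:
  api_error_rate_max: 0.02             # at most 2% of API calls may fail
  api_p95_latency_max_sec: 5.0
"""

spec = yaml.safe_load(SPEC)
print(spec['name'], spec['slo']['api_p95_latency_max_sec'])
```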

Test orchestration: Eris should have the capability to orchestrate test scenarios of various types (distributed load, underlay faults, etc.)

Extensibility: The framework should be extensible for various open source and proprietary tools (e.g. a plugin to use HP Performance Center instead of OpenStack/Rally for load injection, a plugin by Juniper for router fault injection, cLCP support, vLCP support, etc.)
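
For the extensibility point, the OpenStack-conventional route would be stevedore entry points, but the bare minimum is a small plugin contract plus a registry that the orchestrator looks tools up in. The class and registry names here are placeholders.

```python
import abc

_REGISTRY = {}


def register(name):
    """Decorator that registers a fault-injection backend under a short name."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap


class FaultInjector(abc.ABC):
    """Contract any backend (os-faults, a vendor router tool, ...) implements."""

    @abc.abstractmethod
    def inject(self, target: str, fault: str) -> None:
        """Inject the named fault on the named target."""


@register('noop')
class NoopInjector(FaultInjector):
    def inject(self, target: str, fault: str) -> None:
        print(f'would inject {fault!r} on {target}')


# The orchestrator only ever sees the registry, never concrete tool classes.
_REGISTRY['noop']().inject('controller-01', 'kill mariadb')
```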

Automation: The entire test orchestration and validation should be automated by the orchestration mechanism (no eyeballing graphs to check for success/failure, it should be determined by automated KPI/SLO comparison)
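
A sketch of the "no eyeballing graphs" point: the verdict is a pure function of measured samples and declared SLO thresholds. The thresholds and sample shape below are invented.

```python
import statistics


def slo_verdict(latencies_sec, errors, total_calls,
                p95_max_sec=5.0, error_rate_max=0.02):
    """Compare measured data against SLO thresholds and return (ok, details)."""
    p95 = statistics.quantiles(latencies_sec, n=20)[-1]  # 95th percentile
    error_rate = errors / total_calls
    ok = p95 <= p95_max_sec and error_rate <= error_rate_max
    return ok, {'p95_sec': p95, 'error_rate': error_rate}


ok, details = slo_verdict([0.8, 1.1, 0.9, 4.2, 1.0] * 20, errors=1, total_calls=100)
print('PASS' if ok else 'FAIL', details)
```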

Simulators & Emulators: Competent simulators and emulators to provide scale testing (e.g. 10,000 compute nodes, 1 million VMs, etc.).


Note: Matt Riedemann (Nova PTL) was looking at resource scale testing and developer tools to enable it (and maybe even resource concurrency testing). His forum topic (http://forumtopics.openstack.org/cfp/details/55) was declined in favor of ours, so we will have the Nova crowd in our room as well.

Technology Review & Choices

Tools Surveyed

Rally (choice to generate control plane load)

Shaker (choice to generate data plane load)

Browbeat (benchmarking - combines Rally + Shaker with various scenario YAMLs; comes closest to "failure injection" but it doesn't do any failure injection)

Cloud99 (not maintained at the moment - very rudimentary functionality; uses Rally to generate load)

os-faults (Need to evaluate if we need this - why not just use ansible directly?)

Jepsen (not Python/no Python API)


Summary

  1. There are currently no emulation or simulation tools that can provide the required level of scale and concurrency testing. However, with OpenStack looking to scale to massive clouds this is critical. A parallel can be drawn with networking research: one does not need the Internet to simulate a new routing algorithm across thousands of nodes - Mininet or ns-2/ns-3 on a laptop is enough.
  2. Current KPI calculation mechanisms rely on models of API success rate & response times for the control plane (Rally) and various iperf metrics (Shaker). KPI calculation needs to be more flexible - e.g. "RabbitMQ memory consumption shows no upward trend at the end of a scenario" should be usable as a success criterion.
  3. No feedback: Current mechanisms for failure injection work at the software level (os-faults has IPMI faults, but we haven't seen examples of their use) and are generally single failure injections/recoveries. We need more capacity for feedback & complex orchestration of faults, load and metrics (e.g. kill MySQL/MariaDB when memory usage approaches a certain limit and change the load characteristics, or inject packet loss on the target host during a VM evacuate, etc.); see the sketch after this list.
  4. Disparate load generation: Rally does control plane single-scenario load, Shaker does data plane iperf/iperf3 load. We need more complex multi-scenario loads and traffic profiles to generate realistic load on the control and data plane. Discussion point: where do we add features for multi-scenario & distributed load generation - outside Rally or within Rally? It is easier at the moment outside Rally, i.e. the orchestrator runs several Rally scenarios in parallel, but control is limited and statistical in nature.
  5. Monitoring: Monitoring is limited to a per-tool view (Rally knows about its transactions, Shaker knows about its network stats). No tool provides cloud-wide stats. Yes, we can install StackLight or Fluentd/Prometheus, but that is too heavyweight for just testing. We are looking at something lightweight and over SSH, like HP SiteScope.
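
For the feedback point (item 3 above), here is a rough sketch of "kill MariaDB when controller memory runs low", reusing the agentless SSH sampler from the guiding principles and triggering the fault with a plain Ansible ad-hoc call. The host name, inventory path, and threshold are placeholders.

```python
import subprocess
import time

THRESHOLD_KB = 512 * 1024  # hypothetical trigger: less than 512 MB available


def mem_available_kb(host):
    out = subprocess.run(['ssh', host, 'grep MemAvailable /proc/meminfo'],
                         check=True, capture_output=True, text=True).stdout
    return int(out.split()[1])


def kill_mariadb(host):
    # Ad-hoc Ansible call using the stock 'service' module; -b escalates privileges.
    subprocess.run(['ansible', host, '-i', 'inventory.ini', '-b',
                    '-m', 'service', '-a', 'name=mariadb state=stopped'],
                   check=True)


def watch_and_inject(host='controller-01', poll_sec=10):
    """Poll memory over SSH and inject the fault once the threshold is crossed."""
    while mem_available_kb(host) >= THRESHOLD_KB:
        time.sleep(poll_sec)
    kill_mariadb(host)
```

In a fuller orchestration this loop would run alongside the Rally/Shaker load, and the injection timestamp would be fed into the KPI/SLO evaluation rather than handled by hand.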