Extreme Testing: Vision/Architecture/Tasks Whiteboard
Gautam's comments
- Sampath's post on the openstack-operators list and responses
- http://lists.openstack.org/pipermail/openstack-operators/2017-August/014072.html
- Most of the responses suggest using Rally with os-faults - which is one of the main directions we are heading in. So that's good.
- Rally has a bunch of features on the roadmap that we need, but those are blocked for one reason or another. https://docs.google.com/a/mirantis.com/spreadsheets/d/16DXpfbqvlzMFaqaXAcJsBzzpowb_XpymaK2aFY2gA2g/edit#gid=0
- Multi scenarios is blocked
- Distributed workload is blocked
- There's a lot of discussion around some refactoring - which is allegedly super-hard because it's breaking just about everything...
- They plan to add various features taken care of by Shaker today - like VM workloads.
- Question: We may need to engage with the Rally folks at the PTG on a proposed resolution.
- Concern - Rally is building a fairly extensive orchestration mechanism
- But there are other orchestration mechanisms that may be better suited - e.g. Ansible (or anything else - not picky!!!)
Sample Workload used for performance testing
Admin: Create tenant & users
Net: Create network
Subnet: Create subnet in "net"
ListNet: list networks
BootVM: Create an instance
ListVM: List instances/VMs
Volume: create a cinder volume
ListVol: List cinder volumes
Attach: Attach cinder volumes to VM
Reboot: Reboot VM
Detach: Detach Cinder volume from VM
DelVM: delete VM
DelVol: Delete cinder volume
DelSubn: Delete subnet
DelNet: Delete network
There are also some Heat templates used, but this is pretty much what happens for most things. A rough sketch of this sequence as a script appears below.
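As a concrete illustration, here is a minimal sketch of the workload above as a plain openstacksdk script. The cloud name "mycloud", the "cirros" image, the "m1.tiny" flavor, and all resource names are assumptions; tenant/user creation is assumed done out of band, and error handling is omitted.

```python
# Minimal sketch of the sample workload using openstacksdk.
# Assumptions: a clouds.yaml entry named "mycloud", a "cirros" image and an
# "m1.tiny" flavor exist.  Admin (tenant & user creation) is done out of band.
import openstack

conn = openstack.connect(cloud="mycloud")

# Net / Subnet / ListNet
net = conn.network.create_network(name="xt-net")
subnet = conn.network.create_subnet(network_id=net.id, name="xt-subnet",
                                    ip_version=4, cidr="10.0.0.0/24")
list(conn.network.networks())

# BootVM / ListVM
image = conn.compute.find_image("cirros")
flavor = conn.compute.find_flavor("m1.tiny")
server = conn.compute.create_server(name="xt-vm", image_id=image.id,
                                    flavor_id=flavor.id,
                                    networks=[{"uuid": net.id}])
server = conn.compute.wait_for_server(server)
list(conn.compute.servers())

# Volume / ListVol / Attach / Reboot / Detach
volume = conn.create_volume(size=1, name="xt-vol", wait=True)
list(conn.block_storage.volumes())
conn.attach_volume(server, volume, wait=True)
conn.compute.reboot_server(server, "SOFT")
conn.detach_volume(server, volume, wait=True)

# DelVM / DelVol / DelSubn / DelNet
conn.compute.delete_server(server)
conn.compute.wait_for_delete(server)
conn.delete_volume(volume.id, wait=True)
conn.network.delete_subnet(subnet)
conn.network.delete_network(net)
```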
Stack to be used: IMHO - let's use either Mitaka or Newton. I'm partial towards Newton, but most may be on Mitaka today.
^^ In the community we have to focus on master or RC1-3. At best the n-2 release is possible, because n-3 is EOL; Mitaka went EOL as of this release. (sampath)
Sundar's Comments
- Define reference architecture
- Define reference workload
- Define stack
- Define core components to test
Edited on 8/1/17
- Phase 1 (Focus on Openstack Summit Presentation - A small deliverable, Demo and PPT for high level Vision)
- Define reference architecture
- stick to one reference architecture and release
- Make assumptions when necessary and justify why they were made
- Pick the components to include as part of the ref architecture - the minimum required to sustain a meaningful test for phase 1
- Pick one component that we are going to test - Ceph/Neutron/etc. - for phase 1
- Define reference workload
- Simple is good. Pick a load for the ONE component being tested
- Need not be representative of real life for phase 1
- Define Test Framework (for phase 1)
- Define the test framework for the ONE component being tested (highlight only the relevant components in the big-picture diagram)
- Define discovery
- Define load injection parameters
- Define failure injection
- Define what metrics will be gathered
- Test Manager (just define/setup for extensibility for phase 1)
- (LAB) Perform test
- Setup Env
- Setup discovery (setup of environment - the map)
- Introduce load
- Introduce failure
- Gather metrics/Documents and findings
- Introduce test manager and automate the above testing, perform a testing to validate test manager
- Deliverables
- Draft reference architecture
- Draft workload definition
- Draft MOP of env setup and parameters
- Draft Test plan
- Draft results, metrics gathered and follow ups
- Draft Vision Document/PPT
- Phase 2 (focus on Extending Phase 1 - Goal is to make a Specific Phase 1 to a Generic model)
- Define architecture
- Same as phase 1
- Add a component to test
- Define ref workload
- Simple is good. Extend the existing workload or add one for the CURRENT component being tested
- Need not be representative for Phase 2
- Define Framework (As above in Phase 1)
- Accommodate the new component being tested
- Enable it under Test Manager
- Now enable automated testing through the Test Manager for BOTH components
- (LAB) Perform test
- Should be able to test for both Phase 1 and 2
- Document results
- Phase 3
Problem Statement (Why needed?)
- Robustness/Resiliency of OpenStack at the CI/CD gates
- Developers can share re-creatable performance and resiliency experiments
- Easy to define scenarios and execute
- Should be verifiable via KPI and automated on reference architecture. There shouldn't be any eyeballing of graphs
- Better monitoring, SLA and failure models on reference deployment architectures
Vision Statement (How accomplished? – 5 main points already outlined in Wiki)
- Define test workflow
- Test frequency – the entire suite cannot be run at each check-in
- Find the minimal set that are critical for testing at each CI/CD gate
- Define test framework
- Test Tools (preferably agentless: not ideal to have "failure injection" agents running on production/test sites)
- Deployment
- Create repeatable experiments
- Discovery
- Discover a site – everything, including h/w, underlay, etc.
- Load injection (Control & Data Plane)
- Rally: For control plane load injection
- Shaker: For data plane load injection
- Failure injection
- os-faults: For failure injection (usage sketch after this list)
- Metrics gathering
- Need new tools – should be agentless/SSH/Ansible based.
- Pipe metrics to test manager
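For the failure-injection piece of the framework, a minimal os-faults sketch (assuming the "universal" cloud-management driver and a single controller reachable over SSH; the IP, credentials, and service definition below are placeholders):

```python
# Minimal os-faults sketch: connect to one controller over SSH and inject
# failures.  The node IP, SSH credentials and service definition are
# placeholder assumptions for illustration.
import os_faults

config = {
    "cloud_management": {"driver": "universal"},
    "node_discover": {
        "driver": "node_list",
        "args": [
            {"ip": "192.0.2.10",
             "auth": {"username": "stack",
                      "private_key_file": "~/.ssh/os_faults"}},
        ],
    },
    "services": {
        "nova-api": {"driver": "system_service",
                     "args": {"service_name": "nova-api",
                              "grep": "nova-api"}},
    },
}

cloud = os_faults.connect(config)
cloud.verify()  # check that all discovered nodes are reachable

# Inject a control-plane failure, then a node-level one
os_faults.human_api(cloud, "restart nova-api service")
os_faults.human_api(cloud, "reboot one node")
```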
- Orchestration
- Ansible – create Rally, Shaker, os-faults, and metrics-gathering plugins
- Test Manager
- Resiliency Studio/Jenkins (AT&T Proposed – time to open source is Aug – Sept 2018)
- Start orchestrator runs
- Collect metrics
- Incorporate capability for SLA plugins
- SLA plugins will decide whether a test is a success or failure (see the interface sketch after this list)
- Interact with GitHub/CI-CD environments
- Provide detailed logs and metrics for data
- Create bugs
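As a strawman for the SLA-plugin hook in the Test Manager, one possible shape of the interface. The class and method names are assumptions, not an existing API; the point is only that a plugin receives the gathered metrics and alone decides pass/fail.

```python
# Hypothetical SLA plugin interface for the test manager (names are
# assumptions).  The manager feeds each plugin the metrics gathered for a
# run; the plugin decides pass/fail and explains itself for logs/bugs.
import abc


class SLAPlugin(abc.ABC):
    """Decides whether one test run met its SLA."""

    @abc.abstractmethod
    def evaluate(self, metrics: dict) -> bool:
        """Return True if the run passes, False otherwise."""

    def details(self) -> str:
        """Human-readable explanation for logs/bug reports."""
        return ""


class ApiLatencySLA(SLAPlugin):
    """Example plugin: 90th-percentile API latency must stay under a bound."""

    def __init__(self, max_p90_seconds: float = 2.0):
        self.max_p90 = max_p90_seconds
        self._p90 = None

    def evaluate(self, metrics: dict) -> bool:
        durations = sorted(metrics.get("api_durations", []))
        if not durations:
            return False
        self._p90 = durations[int(0.9 * (len(durations) - 1))]
        return self._p90 <= self.max_p90

    def details(self) -> str:
        return "p90 latency %.2fs (bound %.2fs)" % (self._p90 or -1, self.max_p90)
```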
- Developer Tools
- Goal:
- Push as much as possible as far left in the development cycle
- Minimize resource utilization (financial & computational)
- Data Center Emulator
- Emulate reference architectures
- Run performance and failure injection scenarios
- Mathematically extrapolate for acceptable limits
- Define test scope & scenarios
- KPI/SLA
- What metrics are part of the KPI matrix
- Examples: Control Plane – API response time, Success Rate, RabbitMQ connection distribution, CPU/Memory utilization, I/O rate, etc.
- Examples: Data Plane – throughput, vrouter failure characteristics, storage failure characteristics, memory congestion, scheduling latency, etc.
- What are the various bounds?
- Examples: Control Plane - RabbitMQ connection distribution should be uniform within a certain std. deviation; API response times are log-normally distributed and not acceptable past the 90th percentile; etc. (see the sketch after this list)
- Realistic Workload Models for control & data plane
- Realistic KPI models from operators
- Realistic outage scenarios
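To make one of the bounds above concrete, a sketch of the RabbitMQ connection-distribution check. The per-node counts and the 10% relative-std-deviation threshold are made up for illustration; only the Python statistics module is used.

```python
# Sketch of the "RabbitMQ connection distribution should be uniform within a
# certain std. deviation" bound.  Counts and the 0.10 threshold are made-up
# inputs for illustration.
import statistics


def connections_are_balanced(conns_per_node, max_rel_stddev=0.10):
    """Pass if the relative std. deviation of connection counts per RabbitMQ
    node is within the allowed bound (i.e. load is roughly uniform)."""
    mean = statistics.mean(conns_per_node)
    if mean == 0:
        return False
    rel_stddev = statistics.pstdev(conns_per_node) / mean
    return rel_stddev <= max_rel_stddev


# Example: counts collected from three controllers via rabbitmqctl/monitoring
print(connections_are_balanced([812, 790, 805]))   # True  - balanced
print(connections_are_balanced([1500, 420, 487]))  # False - skewed
```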
- Automated Test Case Generation
- Are there design & deployment templates that can be supplied so that an initial suite of scenarios is automatically generated?
- Top-Down assessment methodology to generate the scenarios – shouldn't burden developers with "paperwork".
- Performance
- Control Plane
- Data Plane
- Destructive
- Failure injection
- Scale
- Scale resources (cinder volumes, subnets, VMs, etc.)
- Concurrency
- Multiple requests on the same resource
- Operational Readiness
- What are we looking for here – just a shakeout to ensure a site is ready for operation? May be a subset of performance & resiliency tests?
- Define reference architectures
- What are the reference architectures?
- H/W variety – where is it located?
- Deployment toolset for creating repeatable experiments – there is ENoS for container based deployments, what about other types?
- Deployment, Monitoring & Alerting templates
- Implementation Priorities
- Tackle Control Plane & Software Faults (rally + os-faults)
- Most code already there – need more plugins
- os-faults: More fault injection scenarios (degradations, core dumps, etc.)
- Rally: Randomized triggers, SLA extensions (e.g. a t-test with p-values; plugin sketch after this list)
- Metrics gathering plugin
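One way the "SLA extensions (e.g. t-test with p-values)" item could look as a custom Rally SLA plugin. The plugin name, config schema, and use of scipy are assumptions; only the rally.task.sla plugin mechanism itself is Rally's.

```python
# Hypothetical Rally SLA plugin: one-sided one-sample t-test of iteration
# durations against a baseline mean.  Name, schema and thresholds are
# assumptions; only the sla.SLA plugin mechanism is Rally's own.
from rally.task import sla
from scipy import stats


@sla.configure(name="duration_ttest")
class DurationTTest(sla.SLA):
    """Fail if durations are significantly slower than a known baseline."""

    CONFIG_SCHEMA = {
        "type": "object",
        "properties": {
            "baseline_mean": {"type": "number", "minimum": 0.0},
            "p_value": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        },
        "required": ["baseline_mean"],
        "additionalProperties": False,
    }

    def __init__(self, criterion_value):
        super(DurationTTest, self).__init__(criterion_value)
        self.baseline = criterion_value["baseline_mean"]
        self.alpha = criterion_value.get("p_value", 0.05)
        self.durations = []

    def _recompute(self):
        if len(self.durations) >= 2:
            t_stat, p_two_sided = stats.ttest_1samp(self.durations,
                                                    self.baseline)
            # Only a significant slowdown (mean above baseline) is a failure.
            self.success = not (t_stat > 0 and p_two_sided / 2 < self.alpha)
        return self.success

    def add_iteration(self, iteration):
        if not iteration.get("error"):
            self.durations.append(iteration["duration"])
        return self._recompute()

    def merge(self, other):
        self.durations.extend(other.durations)
        return self._recompute()

    def details(self):
        return ("One-sided t-test vs baseline %.2fs over %d iterations: %s"
                % (self.baseline, len(self.durations), self.status()))
```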
- Shaker enhancements (rally + shaker + os-faults)
- os-faults hooks mechanisms
- Storage, CPU/Memory (also cases with sr-iov, dpdk, etc.)
- os-faults for data plane software failures (cinder driver, vrouter, kernel stalls, etc.)
- Develop SLA measurement hooks
- os-faults underlay enhancements & data center emulator
- os-faults: Underlay crash & degradation code
- Data center emulator with ENoS to model underlay & software
- Traffic models & KPI measurement
- Realistic traffic models (CP + DP) and software to emulate the models
- Real KPI and scaled KPI to measure in virtual environments