2018.02.01 Specialty ERIS

Web meeting: https://zoom.us/j/579402487


Goals

  • plan PTG attendance
  • coordinate on next steps

Discussion items

1. Started the meeting with a conversation about how most people will assume this effort is about choosing between various software tools; instead, it is (and should be) about tracking, measurement, and metrics: what to watch and how to baseline from each run.

2. PTG Approach (Feb 26 - March 2, Dublin, Ireland) - From the previous meeting and plans, we hoped to have the ERIS project at a point where we could bring it back to the QA team and other interested parties and increase contributions.

https://www.openstack.org/ptg/

We cannot travel to Dublin, so the PTG will not be where we come back to engage with the QA team or the Self-healing SIG. Sampath says we can fall back to meeting with the QA team on IRC.

3. Technical Discussion Points (Gautam)

This is a rough (not comprehensive) collection of some knowledge gained today; obviously all points are open to fairly rigorous debate (smile).

  1. Getting the inventory structure right is critical. What we've observed when creating the YAMLs is that our test cases end up very specific to the installation, which means someone has to rework a bunch of variables in the test cases every time. At what level should we target test cases: should it be "run memory failure on server-1.abc.com" because that happens to be a rabbitmq server, or should it just be "run memory failure on any one rabbitmq server"? In other words, tests can be "service specific" or "server specific"; both are needed, so both need to be accounted for. The problem is that OpenStack uses so many components that manually entering all of them is not practical (see the inventory sketch after this list).
  2. Tests take a long time to run. A one-minute full-CPU failure doesn't have much effect; we need sustained high CPU consumption to observe any sort of degradation.
  3. There is a tool called "stress-ng" that provides a lot of the degradations (not outright failures): full CPU, full memory, disk I/O burn, etc. The only problem is that it's a C program, not easily replicated in Python (my earlier comment on the GIL and its impact on what we want to do applies here). It's easy enough to create an Ansible module around it (see the module sketch after this list), but we may need to distribute compatible versions based on the distro used, because a lot of companies / operations folks don't want this program on production (or test) servers.
  4. Metrics collection - all Ansible modules that inject failures need a standardized output reporting when failure injection started and when it ended; the module sketch after this list shows one possible shape.
  5. The mathematics of automating this reliably can get complex (if we want any level of statistical consistency). I'm currently working through some collected data and will try to present it in two weeks' time.
  6. openstack/rally makes process control and progress control hard. It's a good, flexible tool for load generation, but sometimes process aborts don't work well. There is also no easy way to wrap Rally in an Ansible module, and controlling the load-generation process externally can become difficult. We need Rally to harden its state machine and control mechanism (or we have to do it for them (smile)); a sketch of external process control follows this list.
  7. Point #6 applies to the sched_daemon as well.
  8. We should explore an alternate SSH path. Sometimes connection limits are set on SSH daemons, and if we use one as a proxy on a large site we could trip those limits. It may be something as simple as a paramiko-based server with dynamically generated keys; that way it could be more secure as well (a sketch follows this list). Steps would be:
    1. Generate keypairs.
    2. Deploy paramiko-based Python code via Ansible; it comes up on a random port with the public key.
    3. The ERIS inventory uses that for any SSH communication.
    4. Once tests are done, delete the keys and servers.
    5. Don't leave a trace behind.
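
To make point 1 a little more concrete, below is a minimal sketch (not an agreed design; the inventory layout, group names, and the resolve_target() helper are all hypothetical) of how a test case could say "any one rabbitmq server" and have it resolved against the inventory at run time instead of hard-coding a hostname:

    # Minimal sketch: resolve a service-level target ("any one rabbitmq server")
    # instead of hard-coding server-1.abc.com. The inventory is shown as a plain
    # dict here; in practice it would be loaded from the inventory YAML.
    import random

    INVENTORY = {
        'rabbitmq':     ['server-1.abc.com', 'server-2.abc.com'],
        'nova_compute': ['compute-1.abc.com', 'compute-2.abc.com'],
    }

    def resolve_target(target):
        """Turn a test-case target spec into concrete hostnames."""
        if 'host' in target:                    # server-specific test case
            return [target['host']]
        hosts = INVENTORY[target['service']]    # service-specific test case
        if target.get('count') == 'any':        # "...on any one rabbitmq server"
            return [random.choice(hosts)]
        return list(hosts)                      # default: every member of the service

    # "run memory failure on any one rabbitmq server"
    print(resolve_target({'service': 'rabbitmq', 'count': 'any'}))
    # "run memory failure on server-1.abc.com"
    print(resolve_target({'host': 'server-1.abc.com'}))

Whether that resolution belongs in the test case format, the scheduler, or the inventory tooling is exactly the kind of thing that is open for debate.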
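
For points 2-4, here is a rough sketch of what an Ansible module wrapping stress-ng could look like, including the kind of standardized start/end record every failure-injection module would emit. The module name eris_stress, its parameters, and the eris_injection output block are made up for illustration; only the basic stress-ng stressor and --timeout options are assumed:

    #!/usr/bin/python
    # Hypothetical "eris_stress" module: wraps stress-ng to inject sustained
    # resource degradation and reports a standardized start/end record so that
    # every failure-injection module emits metrics in the same shape.
    from datetime import datetime, timezone
    from ansible.module_utils.basic import AnsibleModule

    def main():
        module = AnsibleModule(argument_spec=dict(
            stressor=dict(type='str', default='cpu', choices=['cpu', 'vm', 'io']),
            workers=dict(type='int', default=0),      # 0 = one per online CPU
            timeout=dict(type='str', default='300s'), # sustained, not a 1-minute blip
        ))
        cmd = ['stress-ng', '--%s' % module.params['stressor'],
               str(module.params['workers']), '--timeout', module.params['timeout']]

        started = datetime.now(timezone.utc).isoformat()
        rc, out, err = module.run_command(cmd)        # blocks for the duration
        ended = datetime.now(timezone.utc).isoformat()

        result = dict(
            changed=True,
            # standardized metrics block, identical across all failure modules
            eris_injection=dict(type='degradation', tool='stress-ng',
                                start=started, end=ended, rc=rc),
        )
        if rc != 0:
            module.fail_json(msg=err, **result)
        module.exit_json(**result)

    if __name__ == '__main__':
        main()

The key idea for point 4 is the eris_injection block: if every failure module returns the same keys, the metrics side can line injection windows up against whatever is being watched and baselined, regardless of which failure was run.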
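
For point 6, until Rally's own abort path is hardened, one workaround is to control the run entirely from outside: start "rally task start" in its own process group and tear the whole group down if it overruns a deadline. A sketch under those assumptions (the task file name and deadline values are placeholders):

    # Sketch: external process control around "rally task start".
    # Runs Rally in its own process group so a hung run can be torn down
    # from outside; deadlines and file names are placeholders.
    import os
    import signal
    import subprocess

    def run_rally_task(task_file, deadline_s=3600):
        proc = subprocess.Popen(
            ['rally', 'task', 'start', task_file],
            preexec_fn=os.setsid,           # new process group -> kill children too
        )
        try:
            return proc.wait(timeout=deadline_s)
        except subprocess.TimeoutExpired:
            # Fall back to terminating the whole process group.
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            try:
                proc.wait(timeout=60)
            except subprocess.TimeoutExpired:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            return None

    # run_rally_task('eris_load.yaml')  # hypothetical task file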
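
For point 8, a minimal sketch of the key-handling side of the idea: generate a throwaway keypair for the run, point the ERIS SSH traffic at whatever paramiko-based endpoint gets deployed, and destroy the keys afterwards. The username, port range, and file paths are hypothetical; only the paramiko calls themselves are standard:

    # Sketch for point 8: dynamically generated, throwaway SSH keys for an
    # ERIS-only side channel. Paths, ports, and the deployment step are all
    # hypothetical assumptions.
    import os
    import random
    import paramiko

    def make_ephemeral_keypair(workdir):
        """Step 1: generate a keypair that exists only for this test run."""
        key = paramiko.RSAKey.generate(2048)
        priv_path = os.path.join(workdir, 'eris_ephemeral_key')
        key.write_private_key_file(priv_path)
        pub_line = '%s %s' % (key.get_name(), key.get_base64())
        return priv_path, pub_line

    def pick_random_port():
        """Step 2 (partial): the deployed paramiko server would bind here."""
        return random.randint(20000, 60000)

    def connect_with_ephemeral_key(host, port, priv_path):
        """Step 3: the ERIS inventory points SSH traffic at the side channel."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, port=port, username='eris',
                       key_filename=priv_path, look_for_keys=False)
        return client

    def destroy_keys(priv_path):
        """Steps 4-5: once tests are done, leave no trace behind."""
        os.remove(priv_path)

The paramiko-based server itself (step 2) and how Ansible ships the public key to it still need a real design pass; this only covers generating and disposing of the keys.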
4. Action Items from the previous meeting - not covered in today's meeting, but carried forward to the next.