Project status: We determined in the December meeting that we needed to get the initial project in place before inviting more collaborators. Let's check on progress and enumerate the remaining tasks.
PTG Approach (Feb 26 - March 2, Dublin, Ireland) - From the previous meeting and plans, we hoped to have the ERIS project at the point where we can bring it back to the QA team and other interested parties and increase contributions.
It seems likely we can get some hours during the first two days of QA time if we ask. Sampath Priyankara agreed to follow up with the QA team to ask.
The main action item is for each of us to check with our companies to see if they will send us.
Overlap with the Self Healing SIG. On the previous call, Sampath said he would check with Adam Spears to see whether the PTG was a good place to overlap.
A good description of this related SIG shows where the need for Destructive Testing is complementary; we would share and enhance efforts to bring out use cases. Basically, each operator would bring their use case and be the tester of it.
This is a rough collection (not comprehensive) of knowledge gained today (obviously all points are open for fairly rigorous debate):
1. Getting the inventory structure right is critical. What we've observed when creating the YAMLs is that our test cases are very specific to the installation, which means someone has to rework a bunch of variables in the test cases every time. At what level should we target test cases: should it be "run memory failure on server-1.abc.com" because that host is a rabbitmq server, or just "run memory failure on any 1 rabbitmq server"? So tests can be "service specific" or "server specific". Both are needed, so both need to be accounted for. The problem is that OpenStack uses so many components that manually entering all of them is not practical. A sketch of resolving a service-level target against an inventory follows below.
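A minimal sketch of resolving a service-specific target to a concrete host, assuming a simple dict-style inventory (the group names, hostnames, and the resolve_target helper are hypothetical illustrations, not an agreed ERIS interface):

    import random

    # Hypothetical inventory: service groups mapped to concrete hosts.
    INVENTORY = {
        "rabbitmq": ["server-1.abc.com", "server-2.abc.com"],
        "nova-compute": ["compute-1.abc.com", "compute-2.abc.com"],
    }

    def resolve_target(target):
        """Resolve a service-specific target like 'any:rabbitmq' to one host;
        pass server-specific targets (exact hostnames) through unchanged."""
        if target.startswith("any:"):
            group = target.split(":", 1)[1]
            return random.choice(INVENTORY[group])
        return target

    # Service specific: picks any one rabbitmq server.
    print(resolve_target("any:rabbitmq"))
    # Server specific: names the exact host.
    print(resolve_target("server-1.abc.com"))

The point is that test cases written against the "any:&lt;service&gt;" form would not need reworking per installation; only the inventory changes.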
2. Tests take a long time to run. A one-minute full-CPU failure doesn't have much effect; we need sustained high CPU consumption to observe any sort of degradation.
3. There is a tool called "stress-ng" which provides a lot of the degradations (not failures), like full CPU, full memory, disk I/O burn, etc. The only problem is that it's a C program and not easily replicated in Python (my earlier comment on the GIL and its impact on what we want to do applies here). It's easy enough to create an Ansible module around it, but we may need to distribute compatible versions based on the distro used, because a lot of companies / operations folks don't want this program on production (or test) servers. A sketch of such a wrapper follows below.
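A minimal sketch of an Ansible module wrapping stress-ng for a sustained CPU degradation (the module name and parameter names are assumptions, not an agreed interface, and the stress-ng binary is assumed to already be installed on the target host):

    #!/usr/bin/python
    # Hypothetical eris_stress module: sustained CPU degradation via stress-ng.
    from ansible.module_utils.basic import AnsibleModule

    def main():
        module = AnsibleModule(
            argument_spec=dict(
                workers=dict(type='int', default=0),       # 0 = one stressor per CPU
                timeout_sec=dict(type='int', default=300), # sustained load, not a 1-minute blip
            )
        )
        cmd = [
            'stress-ng',
            '--cpu', str(module.params['workers']),
            '--timeout', '%ds' % module.params['timeout_sec'],
            '--metrics-brief',
        ]
        rc, out, err = module.run_command(cmd)
        if rc != 0:
            module.fail_json(msg='stress-ng failed', rc=rc, stderr=err)
        module.exit_json(changed=True, stdout=out)

    if __name__ == '__main__':
        main()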
4. Metrics collection: all Ansible modules that run a failure need to have a standardized output that provides metrics on when failure injection started and when it ended. A sketch of one possible result format follows below.
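A minimal sketch of one possible standardized result payload (the field names are assumptions, not an agreed schema):

    import json
    import time

    def run_with_metrics(inject, *args, **kwargs):
        """Wrap a failure-injection callable and record start/end timestamps."""
        started_at = time.time()
        result = inject(*args, **kwargs)
        ended_at = time.time()
        return {
            "failure": getattr(inject, "__name__", "unknown"),
            "started_at": started_at,              # epoch seconds
            "ended_at": ended_at,
            "duration_sec": ended_at - started_at,
            "result": result,
        }

    if __name__ == "__main__":
        # Stand-in for a real failure injection: just sleep for one second.
        print(json.dumps(run_with_metrics(time.sleep, 1), indent=2))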
5. The mathematics of automating this reliably can get complex (if we want to do it with any level of statistical consistency). I'm currently working through some data I collected and will try to present it in two weeks' time.
6. openstack/rally makes process control and progress control hard. It's a good tool and flexible for load generation, but sometimes process aborts don't work well. There is also no easy way to wrap Rally in an Ansible module, and controlling the load-generation process externally can become difficult. We need Rally to harden its state machine and control mechanism (or we have to do it for them). A sketch of external process control follows below.
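A minimal sketch of driving Rally externally as a subprocess with a hard timeout (the task file path is a placeholder); this only demonstrates generic external process control, not a fix for Rally's own state machine:

    import subprocess

    def run_rally_task(task_file, timeout_sec=1800):
        """Start a Rally task and force-terminate it if it overruns."""
        proc = subprocess.Popen(
            ["rally", "task", "start", task_file],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        try:
            out, err = proc.communicate(timeout=timeout_sec)
        except subprocess.TimeoutExpired:
            # Abort from the outside: SIGTERM first, then SIGKILL if ignored.
            proc.terminate()
            try:
                out, err = proc.communicate(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()
                out, err = proc.communicate()
        return proc.returncode, out.decode(), err.decode()

    if __name__ == "__main__":
        rc, out, err = run_rally_task("load-scenario.yaml", timeout_sec=600)
        print("rally exited with", rc)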
7. Point #6 applies to the sched_daemon as well.
8. We should explore an alternate SSH path. Sometimes there are connection limits set on SSH daemons, and if we use one as a proxy on a large site we could trip those limits. It may be something as simple as a Paramiko-based server with dynamically generated keys; that way it could be more secure as well. The steps would be (see the sketch after this list):
Generate keypairs
Deploy Paramiko-based Python code via Ansible; it comes up on a random port with the public key
The Eris inventory uses that endpoint for any SSH communication.
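A minimal sketch of the Paramiko-based server side (the ErisServer class name is hypothetical; command handling is omitted, this only shows dynamic key generation, binding a random port, and public-key auth):

    import socket

    import paramiko

    class ErisServer(paramiko.ServerInterface):
        """Accept only the dynamically generated client public key."""
        def __init__(self, allowed_key):
            self.allowed_key = allowed_key

        def get_allowed_auths(self, username):
            return "publickey"

        def check_auth_publickey(self, username, key):
            if key.get_base64() == self.allowed_key.get_base64():
                return paramiko.AUTH_SUCCESSFUL
            return paramiko.AUTH_FAILED

        def check_channel_request(self, kind, chanid):
            if kind == "session":
                return paramiko.OPEN_SUCCEEDED
            return paramiko.OPEN_FAILED_ADMINISTRATIVELY_PROHIBITED

    # 1. Generate keypairs dynamically (host key and client key).
    host_key = paramiko.RSAKey.generate(2048)
    client_key = paramiko.RSAKey.generate(2048)

    # 2. Come up on a random free port (port 0 lets the OS choose).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 0))
    sock.listen(1)
    port = sock.getsockname()[1]
    print("listening on port", port)
    print("authorized client public key:", client_key.get_base64())

    # 3. Accept one connection and run the SSH transport over it.
    conn, addr = sock.accept()
    transport = paramiko.Transport(conn)
    transport.add_server_key(host_key)
    transport.start_server(server=ErisServer(client_key))
    channel = transport.accept(timeout=60)

In practice the client private key and the chosen port would be written back into the Eris inventory so that subsequent SSH communication goes through this endpoint instead of the site's shared SSH proxy.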