[clue] Performing RCAs

foo7775 at comcast.net
Wed Apr 15 11:36:55 MDT 2015


Hi all, 

I'm hoping to get some good suggestions on how to improve my ability to perform root cause analysis when problems occur. At the moment, my primary method is to go through logs (/var/log/messages, etc.) in the hope that something was logged that will let me say "OK, _this_ is what caused the service to stop/the problem to occur/etc." But as many of you know, all too often there simply isn't anything logged.

I am aware of the historical data provided by the 'sar' utility, and that's definitely helpful up to a point. I've also tried to start an effort to ensure that 'sysstat' and 'collectl' are installed on all of our production servers. Still, I'm fairly sure that many of you know a number of other things that would be helpful to me.

One thing that's really frustrating to me is that the management team will often insist on knowing the cause of an event when, from everything I can tell, there's simply *nothing* there to say why it occurred. I'm hoping that a number of you might be able to help me drastically reduce the number of times I have to say "I don't know why <foo> occurred."

Thanks all, 

T. 

