Kynetx Operational Best Practices
Kynetx Operational Best Practice Mantras
- Design systems which are free of Single Points of Failure (SPOF) from the start.
- Disks are cheap, buy lots of them.
- RAID is not a four letter word. All production volumes should be RAID1 or RAID5 volumes.
- If you can't measure it, you can't manage it!
- Who's watching the watcher? All monitoring systems should themselves be monitored.
- A backup tape kept on site does not a Disaster Recovery Plan make.
- If it can fail, it will fail; it is just a matter of time.
- Plan to survive random failure events!
- If it happens once, it will never happen again. If it happens twice, it is guaranteed to happen a third time!
- There is no such thing as a coincidence in IT. There is ALWAYS a root cause to every issue/incident, and if you don't fix it, it will come back to haunt you.
Application of the Kynetx Operational Best Practice Mantras
Design systems which are free of Single Points of Failure (SPOF) from the start.
- All production systems have redundant components (dual power supplies, chipkill memory, dual disk controllers) so that the failure of a single component cannot disable the system.
Disks are cheap, buy lots of them.
- Production systems are populated to their maximum disk capacity to provide the highest possible storage density in locally attached storage.
RAID is not a four letter word. All production volumes should be RAID1 or RAID5 volumes.
- All production systems use RAID1, or RAID5 on the DB servers, for their storage to optimize disk I/O and resiliency.
If you can't measure it, you can't manage it!
- System performance data is gathered with sar and stored on a central logging server for analysis.
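As an illustration of the kind of analysis the central logging server makes possible, the sketch below averages CPU idle time per host from semicolon-separated records. It assumes a layout similar to what `sadf -d` from the sysstat package exports, with the hostname as the first field and %idle as the last. The sample data, the column order, and the function name `mean_idle_by_host` are hypothetical and are not taken from the Kynetx setup.

```python
from collections import defaultdict

# Hypothetical sample records (assumed layout, not real Kynetx data).
SAMPLE = """\
# hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
web1;600;2010-01-01 00:10:01 UTC;-1;10.00;0.00;5.00;1.00;0.00;84.00
web1;600;2010-01-01 00:20:01 UTC;-1;20.00;0.00;5.00;1.00;0.00;74.00
db1;600;2010-01-01 00:10:01 UTC;-1;40.00;0.00;10.00;10.00;0.00;40.00
"""


def mean_idle_by_host(text):
    """Return {hostname: mean %idle}, reading %idle from the last field."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the commented header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        samples[fields[0]].append(float(fields[-1]))
    return {host: sum(values) / len(values) for host, values in samples.items()}


if __name__ == "__main__":
    for host, idle in sorted(mean_idle_by_host(SAMPLE).items()):
        print(f"{host}: {idle:.1f}% idle")
```

A per-host average like this is only a starting point; the same loop could just as easily track peaks or compare against a threshold.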
Who's watching the watcher? All monitoring systems should themselves be monitored.
- The Nagios monitoring process is monitored by monit, which sends an alert to the IT Operations staff if the Nagios process dies.
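The idea of watching the watcher can be sketched as a small check-and-alert routine. This is not monit configuration and does not reflect how monit works internally; it is a Python illustration only, and the function names, the list of running commands, and the `alert` callback are all assumptions made for the example.

```python
def is_running(process_name, running_commands):
    """Return True if any running command's executable is named process_name."""
    for command in running_commands:
        parts = command.split()
        # Compare only the executable's base name, ignoring its directory and arguments.
        if parts and parts[0].rsplit("/", 1)[-1] == process_name:
            return True
    return False


def check_watcher(process_name, running_commands, alert):
    """Call alert(message) if the monitored process is missing; return True if healthy."""
    if is_running(process_name, running_commands):
        return True
    alert(f"{process_name} is not running")
    return False


if __name__ == "__main__":
    # Hypothetical process snapshot in which the monitoring daemon has died.
    commands = ["/usr/sbin/sshd -D", "/usr/bin/monit"]
    check_watcher("nagios", commands, alert=print)
```

Passing the alert mechanism in as a callback keeps the check itself independent of how the staff is notified (email, pager, or otherwise).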
A backup tape kept on site does not a Disaster Recovery Plan make.
- All business-critical servers are backed up on a weekly full / daily incremental schedule. Each week, a full backup set is removed from the data center and stored securely off site.
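A weekly full / daily incremental rotation can be expressed in a few lines. The sketch below assumes the full backup runs on Sunday, which the policy above does not specify; the constant `FULL_BACKUP_WEEKDAY` and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: the weekly full backup runs on Sunday (Monday is 0 in date.weekday()).
FULL_BACKUP_WEEKDAY = 6


def backup_type(day, full_weekday=FULL_BACKUP_WEEKDAY):
    """Return 'full' on the weekly full-backup day, otherwise 'incremental'."""
    return "full" if day.weekday() == full_weekday else "incremental"


def week_schedule(start, full_weekday=FULL_BACKUP_WEEKDAY):
    """Return (date, backup type) pairs for the seven days beginning at start."""
    days = (start + timedelta(days=offset) for offset in range(7))
    return [(day, backup_type(day, full_weekday)) for day in days]


if __name__ == "__main__":
    for day, kind in week_schedule(date(2010, 1, 4)):
        note = " (set goes off site)" if kind == "full" else ""
        print(f"{day.isoformat()} {kind}{note}")
```

Any seven consecutive days contain exactly one full backup, which is the set that leaves the data center.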
If it can fail, it will fail; it is just a matter of time.
- Kynetx keeps a supply of spare parts on site in the event of a component failure. In addition, all production servers carry 24x7 manufacturer service contracts with four (4) hour replacement.
Beware the "Fulling/Windley power failure tests." Plan to survive them!
- If there is a big red button, it will be pushed just to see how everyone reacts. The Kynetx IT Operations team performs a dry run of each incident response plan every six (6) months. This way, we are confident in our ability to plan and execute them, so our clients can be confident in our ability to plan and execute them as well.
If it happens once, it will never happen again. If it happens twice, it is guaranteed to happen a third time!
- This is our guiding principle when dealing with failures or issues. At Kynetx we believe that we must get it right the first time, so we don't allow times two and three to have a chance of happening.
There is no such thing as a coincidence in IT. There is ALWAYS a root cause to every issue/incident, and if you don't fix it, it will come back to haunt you.
- At Kynetx, we do not believe in coincidences. Every issue and incident has a root cause that can be identified and prevented. When an incident occurs, a full Root Cause Analysis (RCA) is performed, published, and sent to all impacted parties. Our goal is to ensure that we learn from history and do not repeat it.
