Wednesday, May 12, 2021

Different Fault Tolerance Options At System/Patch Level In Linux

 


As we all know, patching is an essential and integral part of any IT infrastructure, whether it consists of cloud-based virtual systems, on-premise virtual systems, or physical servers running in a dedicated data center. Patch management has become an important topic in corporate IT organizations and business offices. It is basically the process of acquiring, testing, and installing code changes (patches) to system software and applications.


As we make such changes, there are times when we need a strong fallback method in case of failures after patching. This method should be resilient and should bring the infrastructure back to the steady state it was in before patching. So, let's briefly talk about some of the industry best practices in this space. We are excluding any third-party tools or practices outside of the native Linux infrastructure.


The following are the possible fault-tolerance options at system/patch level.


[1] System Snapshot from hypervisor or cloud platform for virtual systems.


This is one of the most commonly used and recommended approaches for a virtual or cloud-based estate. The snapshot is initiated and completed on the virtual infrastructure side or on the cloud side, as desired. It can be considered one of the best practices in this area.


 Pros:

  - Industry recommended practice

  - Easy to execute and restore from either hypervisor or cloud level.

  - System-level fault tolerance.

  

 Cons:

  - Requires additional storage space on the backend infrastructure.

  - Any changes made after the snapshot was taken are lost when it is restored.


  

[2] System-level fault tolerance using native LVM snapshots with the Boom utility in RHEL 7.5 onwards.


There are many times when we need a simple, native solution in Linux that can save the system state (a snapshot) and restore it quickly later. One such solution is the LVM snapshot. This feature lets us capture a snapshot of the root file system (/) and later revert changes using that snapshot. The only prerequisites are that the root file system (/) is on an LVM logical volume and that there is free space available within the volume group. There are many use cases for this. One is to restore the system state after making changes that turned out to be undesired or unexpected. The other main use case is to restore the system state after unsuccessful patching. This is another best-practice method.
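As a rough illustration of the flow described above, here is a minimal Python sketch that only builds (and by default just prints) the commands involved. The volume group name rhel, the logical volume name root, the snapshot name root_pre_patch, the 2G size and the boot entry title are all assumptions for illustration, and the sketch also assumes that a boom OS profile for the host already exists. Please verify the options against the man pages on your system before running anything as root.

```python
import shlex
import subprocess


def snapshot_commands(vg, root_lv, snap_name, size, title):
    """Return the commands that snapshot the root LV and add a boot entry."""
    return [
        # Show the free space in the volume group; the snapshot needs room here.
        ["vgs", "--noheadings", "-o", "vg_free", vg],
        # Create a copy-on-write snapshot of the root logical volume.
        ["lvcreate", "-s", f"{vg}/{root_lv}", "-n", snap_name, "-L", size],
        # Register a boot entry that boots from the snapshot.
        ["boom", "create", "--title", title, "--rootlv", f"{vg}/{snap_name}"],
    ]


def revert_commands(vg, snap_name):
    """Return the command that merges the snapshot back into its origin."""
    return [["lvconvert", "--merge", f"{vg}/{snap_name}"]]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # All names and sizes below are hypothetical examples.
    run(snapshot_commands("rhel", "root", "root_pre_patch", "2G",
                          "Pre-patch root snapshot"))
    run(revert_commands("rhel", "root_pre_patch"))
```

With dry_run left at its default, nothing is changed on the system. When reverting for real, note that the merge of an in-use root volume is deferred until the volume is next activated, so a reboot is normally needed after the lvconvert step.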


 Pros:

  - Natively supported on RHEL 7.5 and later.

  - Could provide a second level of system fault-tolerance on top of system snapshot.

 

 Cons:

  - Requires additional disk space in the volume group on each system.

  - Requires an enterprise Linux distribution such as RHEL.


References: https://www.redhat.com/en/blog/boom-booting-rhel-lvm-snapshots

https://access.redhat.com/solutions/3772101

https://access.redhat.com/solutions/3750001  


[3] Package level roll back using "dnf history" or "yum history" commands.


This is one of the easiest native methods of restoring a package state. Most system administration teams are already familiar with these commands, so the approach should not be complicated to implement.
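For illustration, the following minimal Python sketch builds the history commands one would typically use. The transaction ID 42 is a made-up example (the real ID comes from the "history list" output), and the helper name history_command is our own. In short, "undo" reverses the given transaction, while "rollback" reverts everything done after the given transaction.

```python
import shlex

# Subcommands of "dnf history" / "yum history" that take a transaction ID.
ACTIONS_NEEDING_ID = ("info", "undo", "rollback")


def history_command(action, transaction_id=None, tool="dnf"):
    """Build a 'dnf history' (or 'yum history') command line as a list."""
    if action == "list":
        return [tool, "history", "list"]
    if action not in ACTIONS_NEEDING_ID:
        raise ValueError(f"unsupported action: {action}")
    if transaction_id is None:
        raise ValueError(f"'{action}' needs a transaction ID")
    return [tool, "history", action, str(transaction_id)]


if __name__ == "__main__":
    # Transaction ID 42 is hypothetical; take the real one from "history list".
    for cmd in (
        history_command("list"),
        history_command("info", 42),
        history_command("undo", 42),
    ):
        print(shlex.join(cmd))
```

The sketch only prints the commands. On older systems, pass tool="yum" to get the equivalent yum command lines.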


 Pros:

  - Native Linux feature.

  - Easy to implement and execute. 

  - No additional disk space is needed, since this is a native feature of yum/dnf, which keeps its own record of package transaction history.

 

 Cons:

  - Doesn't provide system level fault-tolerance.

  - Some packages, such as selinux, selinux-policy-*, kernel, and glibc, can't be rolled back using this feature.

  

  Reference: https://access.redhat.com/solutions/64069



[4] Most trusted, but not the best, method using ReaR (best suited for disaster recovery).

ReaR (Relax-and-Recover) is a good fit for implementing a bare-metal disaster recovery solution, and for image migration as well. ReaR is a leading open-source disaster recovery solution. It is a modular framework with many ready-to-go workflows for common situations.

There are other open-source tools available for implementing a DR solution in Linux, such as Mondo Rescue, Bacula, Disaster Recovery Linux Manager, Clonezilla, etc.
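To give an idea of what a basic ReaR setup looks like, here is a minimal Python sketch that renders a small configuration of the kind usually placed in /etc/rear/local.conf, together with the two main commands. The NFS URL nfs://backup.example.com/srv/rear is a hypothetical placeholder, and the choice of an ISO rescue image with a NETFS backup is just one common combination, not the only one.

```python
import shlex


def render_rear_config(backup_url, output="ISO", backup="NETFS"):
    """Return the text of a minimal ReaR configuration file."""
    lines = [
        f"OUTPUT={output}",            # format of the bootable rescue image
        f"BACKUP={backup}",            # backup method; NETFS uses a network share
        f'BACKUP_URL="{backup_url}"',  # where the backup archive is stored
    ]
    return "\n".join(lines) + "\n"


# Create the rescue image and the backup on the running system.
BACKUP_COMMAND = ["rear", "-v", "mkbackup"]
# Restore the system; run from inside the booted ReaR rescue environment.
RECOVER_COMMAND = ["rear", "-v", "recover"]


if __name__ == "__main__":
    # The NFS server and path are hypothetical examples.
    print(render_rear_config("nfs://backup.example.com/srv/rear"), end="")
    print(shlex.join(BACKUP_COMMAND))
    print(shlex.join(RECOVER_COMMAND))
```

The sketch writes nothing to disk and runs nothing. It only shows the shape of the configuration and the order of the commands.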

 Pros:

  - Open-source Linux tool.

  - No subscription or license needed.

  - Provides a complete backup solution well suited to disaster scenarios.

 

 Cons:

  - Needs additional infrastructure or external servers to implement the ReaR solution.
