Using ROSA and AWS EFS for disaster recovery – An unpopular solution
In this blog, we'll explore an unpopular solution for disaster recovery using ROSA and AWS.
A disaster can strike at any time and take many forms, such as accidental data deletion or a major cloud service outage across an entire region. Numerous “freemium” and commercial products are available to mitigate this risk.
In the event of a disaster, clients usually ask, “How do I make sure business can resume as quickly as possible with the least amount of downtime and data loss?”
This article describes how to use Red Hat OpenShift Service on AWS (ROSA) with AWS EFS to provide static volume provisioning at a reasonable cost.
Persistent-volume replication by itself is insufficient in the big picture of disaster recovery, once you consider all the other services that applications need to communicate with in order to operate properly. One application might have to interact with a third-party vendor API, a virtual machine application, an AWS RDS data store, and so forth. A regional disaster recovery (DR) strategy has to account for some, if not all, of these essential dependencies.
This solution is best suited for workloads whose storage performance needs can be met by NFS.
This solution can be used in any of the following situations:
- Data security
- Migration of applications
- Replacing a cluster because of an incorrect setup (if you are considering replacing your cluster, please get in touch with Red Hat support first!)
- Data proximity, allowing workloads deployed in different regions to access read-only data
- Regional failover caused by an outage of a region or of an essential service in that region
- Safeguarding against unintentional data loss, for instance when an entire persistent volume or storage service is inadvertently deleted
- Wargaming business continuity
Read this article from Amazon if you need a refresher on cloud disaster recovery.
Best practices for security
AWS EFS is a shared NFS storage service. It is essential to apply security best practices to the EFS instance, including file permissions and access controls, to ensure that application teams can access only their designated directories. Only team members with elevated credentials should have access to the entire EFS directory tree. You may, for instance, restrict particular access types to particular IP ranges, AWS principals, roles, and so forth.
The implementation portion of this article uses rather loose policies purely as an example. In a production environment, you should implement more stringent controls while making sure application teams can still access their data without difficulty.
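As one hedged illustration (not the exact policy used later in this article), an EFS file system policy can limit client access to a specific IAM role; the file system ID, account ID, and role name below are hypothetical placeholders:

```bash
# Attach a file system policy so that only a specific IAM role may mount
# and write to this EFS file system (IDs and ARNs are hypothetical).
aws efs put-file-system-policy \
  --file-system-id fs-0abc1234 \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/app-team-role" },
      "Action": [
        "elasticfilesystem:ClientMount",
        "elasticfilesystem:ClientWrite"
      ],
      "Resource": "arn:aws:elasticfilesystem:us-east-1:111122223333:file-system/fs-0abc1234"
    }]
  }'
```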
Summary of the solution
I’ll describe the solution in two phases: before the disaster and after it.
Phase I: preparation for a disaster
In the preparation phase, we create the recovery plan that will be implemented in case of an emergency, working toward recovery point objective (RPO) and recovery time objective (RTO) goals. Remember that such a plan needs to be tested on a regular basis to find any weaknesses that may have developed over time as technologies mature and reach end of life.
During this phase, the Primary region (Region A) is the target of most, if not all, application deployments and network traffic.
At a high level, the procedure looks like this:
- Provision the Primary OpenShift cluster in Region A.
- Provision the Secondary OpenShift cluster in Region B. If the RPO and RTO account for the time required to provision a new ROSA cluster and apply day-2 configurations, this step can be postponed until a disaster happens.
- Self-managed OpenShift and ROSA Classic take roughly 45 to 60 minutes to provision.
- ROSA with Hosted Control Planes (HCP) takes roughly 15 to 30 minutes.
- Applying day-2 configuration with GitOps shouldn't take longer than 15 minutes.
- Provision EFS-Primary in Region A. Apply the relevant SecurityGroup rules to enable NFS traffic from the bastion/CI-CD host and the OpenShift-Primary cluster.
- Provision EFS-Secondary in Region B. Apply the appropriate SecurityGroup rules to allow NFS traffic from the bastion/CI-CD host and the OpenShift-Secondary cluster.
- Enable replication from EFS-Primary to EFS-Secondary (a CLI sketch follows this list).
- Set up the AWS EFS CSI Driver Operator if you also plan to enable dynamic volume provisioning. I advise against using it except for non-critical workloads; dynamic volumes created from a StorageClass are not covered by this solution.
- Provide an automated method (for example, Ansible) for app teams, or tenants, to request static persistent volumes on OpenShift-Primary.
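For the SecurityGroup and replication steps above, the AWS CLI calls might look roughly like the sketch below; the security group IDs, file system ID, bastion IP, and regions are hypothetical placeholders, not values from this article:

```bash
# Allow NFS (TCP 2049) into the EFS mount-target security group from the
# cluster worker nodes' security group and from the bastion/CI-CD host.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0efs11111 --protocol tcp --port 2049 --source-group sg-0workers22222
aws ec2 authorize-security-group-ingress \
  --group-id sg-0efs11111 --protocol tcp --port 2049 --cidr 10.0.10.5/32

# Enable replication from EFS-Primary (Region A) to a new file system in
# Region B; the destination becomes EFS-Secondary and remains read-only
# while replication is active.
aws efs create-replication-configuration \
  --source-file-system-id fs-0abc1234 \
  --destinations Region=us-west-2 \
  --region us-east-1
```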
The playbook volume-create.yaml functions as follows:
- Collect the necessary user inputs, such as namespace, ocp_primary_login_command, efs_primary_hostname, business_unit, ocp_primary_cluster_name, application_name, pvc_name, pvc_size, and AWS credentials.
- Verify user input for things like character lengths, permissions to administer the OpenShift cluster, and more.
- Create the volume directory tree on EFS-Primary as follows: /<prefix>/<business_unit>/<application_name>/<namespace>/<pvc_name>
- Using the prepared PV/PVC template, substitute parameters such as <volume_name>, <volume_namespace>, <volume_nfs_server>, and <volume_nfs_path>, then save the generated manifest to a directory local to the repository.
- Apply the PV/PVC manifest to OpenShift-Primary, creating the namespace first if it doesn't already exist (an example manifest is sketched after this list).
- Commit the PV/PVC manifest and push it to a Git repository. The path to the PV/PVC manifest file is <playbook-dir>/PV-PVCs/primary/<business_unit>/<ocp_primary_cluster_name>/<application_name>/<namespace>/pv-pvc_<pvc_name>.yaml
- Integrate the volume-create.sh procedure into an appropriate continuous integration pipeline, with user inputs supplied as job parameters.
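To make the create flow concrete, here is a minimal sketch of the directory creation and of the static PV/PVC pair that such a template might render. The file system hostname, directory path, names, and size are hypothetical placeholders:

```bash
# Create the volume directory tree on EFS-Primary from the bastion host
# (assumes the bastion can already reach EFS-Primary over NFS).
sudo mount -t nfs4 fs-0abc1234.efs.us-east-1.amazonaws.com:/ /mnt/efs-primary
sudo mkdir -p /mnt/efs-primary/volumes/bu1/myapp/myapp-ns/data-pvc

# Create the target namespace if it does not already exist.
oc create namespace myapp-ns --dry-run=client -o yaml | oc apply -f -

# Apply a statically provisioned PV/PVC pair to OpenShift-Primary.
oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: bu1-myapp-myapp-ns-data-pvc
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: fs-0abc1234.efs.us-east-1.amazonaws.com   # EFS-Primary hostname
    path: /volumes/bu1/myapp/myapp-ns/data-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: myapp-ns
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ""   # static binding, no StorageClass
  volumeName: bu1-myapp-myapp-ns-data-pvc
EOF
```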
- Put in place an automated procedure (using Ansible, for instance) to restore the static volumes onto OpenShift-Secondary.
The playbook volume-restore.yaml functions as follows:
- Collect the necessary user inputs, such as ocp_secondary_login_command, efs_primary_hostname, efs_secondary_hostname, and Git credentials.
- Stop the replication process and wait for EFS-Secondary to become write-enabled.
- Recursively scan the PV-PVCs/primary/* directory to list all volume manifests used for OpenShift-Primary. In each persistent-volume manifest, replace the EFS-Primary hostname with the EFS-Secondary hostname (a sketch follows this list).
- Apply the resulting secondary PV/PVC manifests to OpenShift-Secondary.
- Commit and push the secondary PV/PVC files to Git. The PV/PVC manifest file path is <playbook-dir>/PV-PVCs/secondary/<business_unit>/<cluster_name>/<application_name>/<namespace>/pv-pvc_<pvc_name>.yaml
- Integrate the volume-restore.sh procedure into an appropriate continuous integration pipeline, with user inputs supplied as job parameters.
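As a minimal sketch of the hostname swap and re-apply steps, assuming hypothetical EFS hostnames and the repository layout described above (the actual playbook may differ):

```bash
# Hypothetical EFS hostnames; substitute your own.
EFS_PRIMARY="fs-0abc1234.efs.us-east-1.amazonaws.com"
EFS_SECONDARY="fs-0def5678.efs.us-west-2.amazonaws.com"

# Rewrite each primary PV/PVC manifest to point at EFS-Secondary, save it
# under PV-PVCs/secondary/, and apply it to OpenShift-Secondary
# (assumes oc is already logged in to the Secondary cluster).
find PV-PVCs/primary -name 'pv-pvc_*.yaml' | while read -r manifest; do
  target="${manifest/primary/secondary}"
  mkdir -p "$(dirname "$target")"
  sed "s/${EFS_PRIMARY}/${EFS_SECONDARY}/g" "$manifest" > "$target"
  oc apply -f "$target"
done
```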
- Run the volume-create.sh pipeline on OpenShift-Primary to provision a few persistent volumes for testing.
- Install a few stateful apps on OpenShift-Primary that have static volumes.
- Check that pods can mount and write data to their corresponding persistent volumes at the given directory.
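A quick way to verify the mount and write path from one of the test applications; the namespace, deployment name, and mount path here are hypothetical:

```bash
# Confirm the PVC is bound to its static PV.
oc -n myapp-ns get pvc data-pvc

# Write a test file through the pod's mounted volume and read it back.
oc -n myapp-ns exec deploy/myapp -- sh -c 'echo "dr-test $(date)" >> /data/dr-test.txt'
oc -n myapp-ns exec deploy/myapp -- cat /data/dr-test.txt
```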
Phase II: recovery following the disaster
During the recovery phase, network traffic is redirected, applications are redeployed (if they haven't been already), and the Secondary region takes over and becomes the Primary region.
At a high level, the procedure looks like this:
- Provision OpenShift-Secondary in Region B, if not already done.
- If needed, integrate OpenShift-Secondary with the EFS-Secondary instance.
- Verify network connectivity.
- Optionally, enable dynamic volume provisioning as well. Dynamic volumes should be reserved for non-critical workloads that do not require regional disaster recovery.
- Run the volume-restore.sh pipeline to restore the static volumes onto OpenShift-Secondary.
This process scans the /PV-PVCs/primary/<cluster_name>/* directory, where <cluster_name> is the primary cluster name, and generates a matching PV/PVC for every manifest it finds. The resulting volume manifests are saved under /PV-PVCs/secondary/<cluster_name>/*.
- Redeploy your applications on OpenShift-Secondary. For deployment times to meet DR RPO/RTO targets, it is advisable to store application container images in an external image registry and to have a GitOps solution such as ArgoCD manage the Continuous Deployment and Delivery (CD) portion of the CI/CD flow.
- Reroute network traffic to the Secondary region (a DNS-based sketch follows this list).
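How traffic is rerouted depends entirely on your network setup. As one hedged illustration, a DNS-based failover with Route 53 might look roughly like this; the hosted zone ID, record name, and load balancer hostname are hypothetical placeholders:

```bash
# Point the application's DNS record at the Secondary region's load balancer.
cat > reroute.json <<'EOF'
{
  "Comment": "DR failover to the Secondary region",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "router-secondary.us-west-2.elb.amazonaws.com" }]
    }
  }]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0EXAMPLE \
  --change-batch file://reroute.json
```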
Implementation
If you think this approach would be helpful, review the precise implementation steps involved. The prerequisites are as follows:
- A fundamental understanding of NFS, AWS, and OpenShift
- AWS credentials
- Permission to provision and replicate EFS services
- Primary and Secondary ROSA clusters
- A bastion host with the following software packages installed: python3.11, python3.11-pip, ansible, aws-cli-v2, nfs-utils, nmap, unzip, and openshift-cli
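On a RHEL-based bastion, installing these prerequisites might look roughly like the sketch below; package names can vary by distribution and repository, and the AWS CLI v2 is typically installed from Amazon's release archive rather than from dnf:

```bash
# Install the tooling used by the playbooks and pipelines (names may vary).
sudo dnf install -y python3.11 python3.11-pip ansible nfs-utils nmap unzip

# AWS CLI v2 is usually installed from the official bundle.
curl -sSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip -q awscliv2.zip && sudo ./aws/install

# The openshift-cli (oc) binary can be downloaded from the OpenShift clients mirror.
```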
In summary
We’ve shown how to use ROSA, AWS EFS, static volume provisioning, and Ansible automation to accomplish regional failover of application state on Kubernetes. Incorporating Event-Driven Ansible can improve this strategy further by reducing or eliminating the need for human interaction in the volume-create and volume-restore cycle. A similar approach could be applied on Microsoft Azure with Azure Red Hat OpenShift (ARO) and Azure Files.