A guide to understanding how node-level tuning works on OpenShift
The objective of this blog post is to give an overview of the node-level tuning that OpenShift performs by default, as well as the additional options cluster administrators have to use node-level tuning to customize the platform’s performance to their needs.
Because Red Hat OpenShift Container Platform (RHOCP) ships with sensible defaults for general-purpose workloads, administrators typically do not need to worry about node-level tuning on OpenShift. There are circumstances, though, in which additional tuning is required to boost workload performance. While cluster administrators will typically apply this as a post-installation (day 2) configuration change, it may even be required at cluster installation time (day 1).
Option 0: Do nothing
Similar to Red Hat Enterprise Linux, the OpenShift platform is preconfigured by default for general-purpose workloads. The Node Tuning Operator (NTO), one of the core OpenShift operators, is largely responsible for system tuning.
The parent OpenShift profile’s many tunables raise certain kernel limits. This improves the system’s behavior under increased system load and growing cluster size. The changes, however, come mostly at the cost of higher kernel memory usage.
[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}

[selinux]
avc_cache_threshold=8192                   # rhbz#1548428, PR10027

[net]
nf_conntrack_hashsize=1048576              # PR413 (the default limit is too low for OpenShift)

[sysctl]
net.ipv4.ip_forward=1                      # Forward packets between interfaces
kernel.pid_max=>4194304                    # PR79, for large-scale workloads; systemd sets kernel.pid_max to 4M since v243
fs.aio-max-nr=>1048576                     # PSAP-900
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2           # rhbz#1758552 pod communication due to ARP failures
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536    # rhbz#1384746 gc_thresh3 limits no. of nodes/routes
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144                    # rhbz#1793714 ElasticSearch (logging)

[sysfs]
/sys/module/nvme_core/parameters/io_timeout=4294967295
/sys/module/nvme_core/parameters/max_retries=10

[scheduler]
# see rhbz#1979352; exclude containers from aligning to house keeping CPUs
cgroup_ps_blacklist=/kubepods\.slice/
# workaround for rhbz#1921738
runtime=0
Most of the OpenShift profile is based on the throughput-performance profile, the default profile recommended for servers. On top of it, we add further performance and functional tunables. The functional ones:
- Fix OpenShift pod communication problems caused by ARP failures between a node and its pods
- Increase vm.max_map_count to enable a clean start of Elasticsearch pods
- Adjust kernel.pid_max for heavy workloads (also helps with pod density)
The performance tunables:
- Permit large clusters and more than a thousand routes
- Increase the size of the Netfilter connection-tracking hash table and its maximum number of entries
- Improve node performance (CPU utilization) by raising the SELinux AVC cache threshold (1, 2)
- Allow more virtual machines (VMs) to run on the RHOCP nodes
- Prevent the TuneD [scheduler] plug-in from aligning containers with the housekeeping CPUs
- Disable the TuneD [scheduler] plug-in’s dynamic behavior
In recent RHOCP releases, the openshift-control-plane profile simply inherits from the OpenShift parent profile. The two fs.inotify settings in the openshift-node profile are functional parameters that currently just replicate the values already set by the Machine Config Operator (MCO), so that they are applied before kubelet starts. The net.ipv4.tcp_fastopen=3 setting reduces network latency by enabling data exchange during the sender’s initial TCP SYN on both client and server connections.
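For reference, the openshift-node child profile looks roughly like the sketch below. The exact summary text and fs.inotify values may differ between releases; the default Tuned object shipped in the openshift-cluster-node-tuning-operator namespace is authoritative.

[main]
summary=Optimize systems running OpenShift nodes
include=openshift

[sysctl]
net.ipv4.tcp_fastopen=3               # lower latency on client and server connections
fs.inotify.max_user_watches=65536     # mirrors the value set by MCO, applied before kubelet starts
fs.inotify.max_user_instances=8192    # mirrors the value set by MCO, applied before kubelet starts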
Option 1: I need some custom tuning
Following the Linux kernel documentation for the ixgb driver, here is an example Tuned CR for tuning a system with a 10 Gigabit Intel(R) network interface card for throughput.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-network-tuning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Increase throughput for NICs using ixgb driver
      include=openshift-node

      [sysctl]
      ### CORE settings (mostly for socket and UDP effect)
      # Set maximum receive socket buffer size.
      net.core.rmem_max = 524287
      # Set maximum send socket buffer size.
      net.core.wmem_max = 524287
      # Set default receive socket buffer size.
      net.core.rmem_default = 524287
      # Set default send socket buffer size.
      net.core.wmem_default = 524287
      # Set maximum amount of option memory buffers.
      net.core.optmem_max = 524287
      # Set number of unprocessed input packets before kernel starts dropping them.
      net.core.netdev_max_backlog = 300000
    name: openshift-network-tuning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 20
    profile: openshift-network-tuning
Note that only the core network settings are included, for brevity.
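The recommend section above applies the profile to every node carrying the worker role. If only some nodes have the ixgb NIC, the profile can instead be matched against a custom node label, with nodes opted in via oc label node. The label name below (example.com/ixgb-nic) is purely illustrative:

recommend:
- match:
  - label: example.com/ixgb-nic
    value: "true"
  priority: 20
  profile: openshift-network-tuning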
Option 2: I need low-latency/real-time tuning
Specialized workloads such as the Telco 5G Core User Plane Function (UPF), Financial Services Industry (FSI) applications, and some High-Performance Computing (HPC) workloads call for low-latency/real-time tuning. Such tuning, though, comes with trade-offs: reduced overall throughput when employing the real-time kernel, additional power consumption, and statically dividing your system into housekeeping and workload partitions. Static partitioning interferes with the OpenShift Kubernetes platform’s efficient use of compute resources and may oversubscribe the housekeeping partition. Partitioning must take place at many different levels and through the coordination of many distinct elements:
- Separating management pods from workload pods
- Running workloads in Guaranteed pods
- Keeping system processes off the workload CPUs
- Moving kernel threads to the housekeeping CPUs
- Moving the NICs’ IRQs to the housekeeping CPUs
In addition to partitioning, there are various methods for lowering software latency:
- Using the real-time kernel
- Using huge pages (per NUMA node) to reduce the cost of TLB misses
- Disabling CPU load balancing for DPDK
- Disabling the CPU CFS quota
- Optionally disabling hyperthreading to reduce latency variation
- BIOS tuning
All of the above tasks can be done manually, but great care must be taken to perform them coherently. This is where the Performance Profile controller of NTO enters the picture. It serves as an orchestrator, relieving users of the burden of manual configuration, and ensures that all components involved in the tasks above (the kernel, TuneD, Kubelet [CPU, Topology and Memory Managers], and CRI-O) are configured correctly according to a given PerformanceProfile.
An example PerformanceProfile for a single-node OpenShift deployment looks like this:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "2-31,34-63"
    reserved: "0-1,32-33"
  globallyDisableIrqLoadBalancing: false
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  net:
    userLevelNetworking: false
    devices: []
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "best-effort"
  realTimeKernel:
    enabled: true
The PerformanceProfile above assigns CPUs 2-31 and 34-63 to low-latency workloads, while the remaining 4 CPUs are reserved for housekeeping. The reserved CPUs may not always be enough to handle device interrupts, which is why the example sets globallyDisableIrqLoadBalancing to false to permit interrupt processing on the isolated (tuned and prepared for latency-sensitive workloads) CPUs. IRQ load balancing can still be turned off for the CPUs of individual pods by using the irq-load-balancing.crio.io and cpu-quota.crio.io annotations set to "disable". Additionally, this example profile allocates 16 1 GiB huge pages on NUMA node 0, selects Topology Manager’s best-effort NUMA alignment policy, and enables the real-time kernel. It does not limit the number of NIC queues to the number of reserved CPUs. The PerformanceProfile controller also implicitly applies further system changes. For instance:
- Setting the CPU Manager policy to static to allow exclusive CPU allocation
- Requiring whole physical core allocation when the topology policy is single-numa-node or restricted
- Setting the CPU Manager reconcile period (the shorter the reconcile period, the faster the CPU Manager prevents non-Guaranteed pods from running on isolated CPUs, at the cost of more system resources)
- Setting the Memory Manager policy to pin memory and huge pages closer to the allocated CPUs (when the topology policy is restricted or single-numa-node)
- Creating a high-performance handler and a runtime class for CRI-O (note the runtimeClassName in the pod specification below; a sketch of the generated RuntimeClass follows this list)
- Enabling and starting stalld through TuneD
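To illustrate the runtime class point, the controller generates a RuntimeClass roughly like the sketch below. The name and handler shown here are assumptions based on common NTO behavior (the name typically derives from the PerformanceProfile name, here performance); check the cluster for the generated object before referencing it.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: performance-performance   # typically performance-<PerformanceProfile name>
handler: high-performance         # CRI-O runtime handler configured by the controller

This is the value to put into runtimeClassName in the pod specification below.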
At this point, the reader might think that the PerformanceProfile contains all the configuration necessary to achieve low latency. This is not the case, though. We must also make the necessary modifications to the low-latency workload’s pod specification: specifying the preconfigured runtime class, adding the user-requested CRI-O annotations, and ensuring that the pod lands in the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Disable CFS cpu quota accounting
    cpu-quota.crio.io: "disable"
    # Disable CPU balance with CRIO
    cpu-load-balancing.crio.io: "disable"
    # Opt-out from interrupt handling
    irq-load-balancing.crio.io: "disable"
spec:
  # Map to the correct performance class
  runtimeClassName: get-from-performance-profile
  ...
  containers:
  - name: container-name
    image: image-registry/image
    resources:
      limits:
        memory: "2Gi"
        cpu: "16"
  ...
It is important to remember that any custom Kubelet configuration will be overwritten by NTO’s Performance Profile controller. However, it is possible to annotate the PerformanceProfile with the custom Kubelet settings instead. Similarly, it is also possible to add an additional TuneD configuration that replaces or builds on top of the one generated by the Performance Profile controller.
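As a sketch of the latter, a supplemental Tuned CR can layer extra sysctls on top of the generated low-latency profile. This assumes the generated profile follows the openshift-node-performance-<PerformanceProfile name> naming convention (the profile above is named performance) and that the patch is given a numerically lower, i.e. higher, priority so it wins; the sysctl shown is only a placeholder for whatever needs overriding:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Patch on top of the profile generated for the PerformanceProfile
      include=openshift-node-performance-performance

      [sysctl]
      # Illustrative override; replace with the tunables you actually need
      vm.stat_interval=10
    name: performance-patch
  recommend:
  - match:
    - label: node-role.kubernetes.io/master
    priority: 19
    profile: performance-patch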
Configuring the RHOCP system through PerformanceProfiles introduces further partitioning. This makes sense for DPDK applications that process network packets in user space and cannot tolerate hardware interrupts, and it applies in a similar way to other latency-sensitive applications. But the additional partitioning comes at a price: the OS and/or RHOCP management pods may not perform well on the reserved cores, or the reserved cores may simply not be enough. Therefore, careful planning and testing are always required when partitioning RHOCP in this manner.
Summary
RHOCP administrators have a variety of options for optimizing the performance of their nodes. There are a few important things to keep in mind. First, should we consider tuning during cluster installation, or can it be done after the cluster has been installed? The great bulk of tuning tasks can be completed after installation; custom tuning performed by NTO and Tuned profiles falls into this category. Second, can we afford to strictly partition our RHOCP cluster to eliminate noisy neighbors in exchange for some CPUs being over- or underutilized? If so, consider using PerformanceProfiles from NTO. Finally, when deciding on our node-level tuning strategy for OpenShift, we must consider how frequently our nodes will reboot.