A guide to understanding how node-level tuning works on OpenShift
The objective of this blog post is to give an overview of the node-level tuning that OpenShift performs by default, as well as the additional options cluster administrators have to use node-level tuning to customize the platform’s performance to their needs.
Because Red Hat OpenShift Container Platform (RHOCP) ships with sensible defaults for general-purpose workloads, administrators typically do not need to worry about node-level tuning on OpenShift. There are circumstances, though, in which additional tuning is required to boost workload performance. While cluster administrators will typically apply this as a post-installation (day 2) configuration change, it may even be required at cluster installation time (day 1).
Option 0: Do nothing
Similar to Red Hat Enterprise Linux, the OpenShift platform is preconfigured by default for general-purpose workloads. The Node Tuning Operator (NTO), one of the core OpenShift operators, is largely responsible for system tuning.
The parent OpenShift profile’s many tunables raise certain kernel limits. This improves the system’s behavior under increased system load and growing cluster size. The changes, however, come mostly at the cost of higher kernel memory usage.
[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}

[selinux]
avc_cache_threshold=8192                   # rhbz#1548428, PR10027

[net]
nf_conntrack_hashsize=1048576              # PR413 (the default limit is too low for OpenShift)

[sysctl]
net.ipv4.ip_forward=1                      # Forward packets between interfaces
kernel.pid_max=>4194304                    # PR79, for large-scale workloads; systemd sets kernel.pid_max to 4M since v243
fs.aio-max-nr=>1048576                     # PSAP-900
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2           # rhbz#1758552 pod communication due to ARP failures
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536    # rhbz#1384746 gc_thresh3 limits no. of nodes/routes
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144                    # rhbz#1793714 ElasticSearch (logging)

[sysfs]
/sys/module/nvme_core/parameters/io_timeout=4294967295
/sys/module/nvme_core/parameters/max_retries=10

[scheduler]
# see rhbz#1979352; exclude containers from aligning to house keeping CPUs
cgroup_ps_blacklist=/kubepods\.slice/
# workaround for rhbz#1921738
runtime=0
Most of the OpenShift profile is based on the throughput-performance profile, the default profile recommended for servers. On top of it, we add further performance and functional tunables. The functional ones:
- Fix OpenShift pod communication problems caused by ARP failures between a node and its pods
- Increase vm.max_map_count to enable a clean start of Elasticsearch pods
- Adjust kernel.pid_max for heavy workloads (also helps with pod density)
The performance tunables:
- Permit large clusters and more than a thousand routes
- Increase the size of the Netfilter connection-tracking hash table and its maximum number of entries
- Improve node performance (CPU utilization) by raising the SELinux AVC cache threshold (1, 2)
- Allow more virtual machines (VMs) to run on the RHOCP nodes
- Prevent the TuneD [scheduler] plug-in from aligning containers with the housekeeping CPUs
- Disable the TuneD [scheduler] plug-in’s dynamic behavior
In recent RHOCP releases, the openshift-control-plane profile simply inherits from the OpenShift parent profile. The two fs.inotify settings in the openshift-node profile are functional parameters that currently just replicate the values already set by the Machine Config Operator (MCO), so that they are applied before kubelet starts. The net.ipv4.tcp_fastopen=3 setting reduces network latency by enabling data exchange during the sender’s initial TCP SYN on both client and server connections.
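For reference, the openshift-node child profile looks roughly like the sketch below. The exact summary text and fs.inotify values may differ between releases; the default Tuned object shipped in the openshift-cluster-node-tuning-operator namespace is authoritative.

[main]
summary=Optimize systems running OpenShift nodes
include=openshift

[sysctl]
net.ipv4.tcp_fastopen=3               # lower latency on client and server connections
fs.inotify.max_user_watches=65536     # mirrors the value set by MCO, applied before kubelet starts
fs.inotify.max_user_instances=8192    # mirrors the value set by MCO, applied before kubelet starts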
Option 1: I need some custom tuning
Following the Linux kernel documentation for the ixgb driver, here is an example Tuned CR for tuning a system with a 10 Gigabit Intel(R) network interface card for throughput.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-network-tuning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Increase throughput for NICs using ixgb driver
      include=openshift-node

      [sysctl]
      ### CORE settings (mostly for socket and UDP effect)
      # Set maximum receive socket buffer size.
      net.core.rmem_max = 524287
      # Set maximum send socket buffer size.
      net.core.wmem_max = 524287
      # Set default receive socket buffer size.
      net.core.rmem_default = 524287
      # Set default send socket buffer size.
      net.core.wmem_default = 524287
      # Set maximum amount of option memory buffers.
      net.core.optmem_max = 524287
      # Set number of unprocessed input packets before kernel starts dropping them.
      net.core.netdev_max_backlog = 300000
    name: openshift-network-tuning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 20
    profile: openshift-network-tuning
Note that only the core network settings are included, for brevity.
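The recommend section above applies the profile to every node carrying the worker role. If only some nodes have the ixgb NIC, the profile can instead be matched against a custom node label, with nodes opted in via oc label node. The label name below (example.com/ixgb-nic) is purely illustrative:

recommend:
- match:
  - label: example.com/ixgb-nic
    value: "true"
  priority: 20
  profile: openshift-network-tuning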
Option 2: I need low-latency/real-time tuning
Specialized workloads such as the Telco 5G Core User Plane Function (UPF), Financial Services Industry (FSI) applications, and some High-Performance Computing (HPC) workloads call for low-latency/real-time tuning. Such tuning, though, comes with trade-offs: reduced overall throughput when employing the real-time kernel, additional power consumption, and statically dividing your system into housekeeping and workload partitions. Static partitioning interferes with the OpenShift Kubernetes platform’s efficient use of compute resources and may oversubscribe the housekeeping partition. Partitioning must take place at many different levels and through the coordination of many distinct elements:
- Separating management pods from workload pods
- Running workloads in Guaranteed pods
- Keeping system processes off the workload CPUs
- Moving kernel threads to the housekeeping CPUs
- Moving the NICs’ IRQs to the housekeeping CPUs
In addition to partitioning, there are various methods for lowering software latency:
- Using the real-time kernel
- Using huge pages (per NUMA node) to reduce the cost of TLB misses
- Disabling CPU load balancing for DPDK
- Disabling the CPU CFS quota
- Optionally disabling hyperthreading to reduce latency variation
- BIOS tuning
All of the above tasks can be done manually, but great care must be taken to perform them coherently. This is where the Performance Profile controller of NTO enters the picture. It serves as an orchestrator, relieving users of the burden of manual configuration, and ensures that all components involved in the tasks above (the kernel, TuneD, Kubelet [CPU, Topology and Memory Managers], and CRI-O) are configured correctly according to a given PerformanceProfile.
An example PerformanceProfile for a single-node OpenShift deployment looks like this:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "2-31,34-63"
    reserved: "0-1,32-33"
  globallyDisableIrqLoadBalancing: false
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  net:
    userLevelNetworking: false
    devices: []
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "best-effort"
  realTimeKernel:
    enabled: true
The PerformanceProfile above assigns CPUs 2-31 and 34-63 to low-latency workloads, while the remaining 4 CPUs are reserved for housekeeping. The reserved CPUs may not always be enough to handle device interrupts, which is why the example sets globallyDisableIrqLoadBalancing to false to permit interrupt processing on the isolated (tuned and prepared for latency-sensitive workloads) CPUs. IRQ load balancing can still be turned off for the CPUs of individual pods by using the irq-load-balancing.crio.io and cpu-quota.crio.io annotations set to "disable". Additionally, this example profile allocates 16 1 GiB huge pages on NUMA node 0, selects Topology Manager’s best-effort NUMA alignment policy, and enables the real-time kernel. It does not limit the number of NIC queues to the number of reserved CPUs. The PerformanceProfile controller also implicitly applies further system changes. For instance:
- Setting the CPU Manager policy to static to allow exclusive CPU allocation
- Requiring whole physical core allocation when the topology policy is single-numa-node or restricted
- Setting the CPU Manager reconcile period (the shorter the reconcile period, the faster the CPU Manager prevents non-Guaranteed pods from running on isolated CPUs, at the cost of more system resources)
- Setting the Memory Manager policy to pin memory and huge pages closer to the allocated CPUs (when the topology policy is restricted or single-numa-node)
- Creating a high-performance handler and a runtime class for CRI-O (note the runtimeClassName in the pod specification below; a sketch of the generated RuntimeClass follows this list)
- Enabling and starting stalld through TuneD
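To illustrate the runtime class point, the controller generates a RuntimeClass roughly like the sketch below. The name and handler shown here are assumptions based on common NTO behavior (the name typically derives from the PerformanceProfile name, here performance); check the cluster for the generated object before referencing it.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: performance-performance   # typically performance-<PerformanceProfile name>
handler: high-performance         # CRI-O runtime handler configured by the controller

This is the value to put into runtimeClassName in the pod specification below.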
At this point, the reader might think that the PerformanceProfile contains all the configuration necessary to achieve low latency. This is not the case, though. We must also make the necessary modifications to the low-latency workload’s pod specification: specifying the preconfigured runtime class, adding the user-requested CRI-O annotations, and ensuring that the pod lands in the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Disable CFS cpu quota accounting
    cpu-quota.crio.io: "disable"
    # Disable CPU balance with CRIO
    cpu-load-balancing.crio.io: "disable"
    # Opt-out from interrupt handling
    irq-load-balancing.crio.io: "disable"
spec:
  # Map to the correct performance class
  runtimeClassName: get-from-performance-profile
  ...
  containers:
  - name: container-name
    image: image-registry/image
    resources:
      limits:
        memory: "2Gi"
        cpu: "16"
  ...
It is important to remember that any custom Kubelet configuration will be overwritten by NTO’s Performance Profile controller. However, it is possible to annotate the PerformanceProfile with the custom Kubelet settings instead. Similarly, it is also possible to add an additional TuneD configuration that replaces or builds on top of the one generated by the Performance Profile controller.
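As a sketch of the latter, a supplemental Tuned CR can layer extra sysctls on top of the generated low-latency profile. This assumes the generated profile follows the openshift-node-performance-<PerformanceProfile name> naming convention (the profile above is named performance) and that the patch is given a numerically lower, i.e. higher, priority so it wins; the sysctl shown is only a placeholder for whatever needs overriding:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Patch on top of the profile generated for the PerformanceProfile
      include=openshift-node-performance-performance

      [sysctl]
      # Illustrative override; replace with the tunables you actually need
      vm.stat_interval=10
    name: performance-patch
  recommend:
  - match:
    - label: node-role.kubernetes.io/master
    priority: 19
    profile: performance-patch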
Configuring the RHOCP system through PerformanceProfiles introduces further partitioning. This makes sense for DPDK applications that process network packets in user space and cannot tolerate hardware interrupts, and it applies in a similar way to other latency-sensitive applications. But the additional partitioning comes at a price: the OS and/or RHOCP management pods may not perform well on the reserved cores, or the reserved cores may simply not be enough. Therefore, careful planning and testing are always required when partitioning RHOCP in this manner.
Summary
RHOCP administrators have a variety of options for optimizing the performance of their nodes. There are a few important things to keep in mind. First, should we consider tuning during cluster installation, or can it be done after the cluster has been installed? The great bulk of tuning tasks can be completed after installation; custom tuning performed by NTO and Tuned profiles falls into this category. Second, can we afford to strictly partition our RHOCP cluster to eliminate noisy neighbors in exchange for some CPUs being over- or underutilized? If so, consider using PerformanceProfiles from NTO. Finally, when deciding on our node-level tuning strategy for OpenShift, we must consider how frequently our nodes will reboot.