Improved Horizontal Scalability

March 25, 2026 · 8 min read

Philipp Stehle

Summary

In this blog post, I will give you an overview of the new Horizontal Scalability feature of OpenBao, its (current) limitations and planned future developments. In the second part, I will show some benchmarks to see in which cases the new feature helps (spoiler: it works best in read-heavy workloads, but doesn't improve write-heavy workloads).

Recap on Scalability Terminology

Let's quickly recap what "horizontal" means in the context of scalability: To scale a service up (i.e., allow it to handle more load) there are two options: Give each instance more resources (CPU, memory, network speed, disk capacity, ...) or increase the number of instances. Increasing instance size is referred to as "vertical" and increasing the number of instances is referred to as "horizontal" scaling.

Whether this is successful, depends on the problem at hand and the implementation. There are some problems that scale very well - both horizontally and vertically - like serving static files. Given that the server is implemented reasonably well, either doubling the resources per instance or doubling the number of instances should both double the number of static files you can serve. Once you start to introduce mutability scalability starts to suffer, especially if you want to provide strong consistency guarantees.

Recently released Horizontal Scalability Features

With OpenBao v2.5.0, we introduced what we call "read scalability". To ensure High Availability, OpenBao has supported "standby nodes" since we forked from Vault. One of these nodes would automatically take over active duty, if the current leader nodes goes down (planned or unplanned). We extended this feature to allow the "standby nodes" to become "read-only nodes", meaning they can now serve read requests on their own, while write requests are still forwarded to the leader.

This has some limits:

The obvious being: If you have mostly write requests with only a few read-only requests, then read scalability won't help.

info
Keep in mind that a read request on the API level (e.g., HTTP GET) might still require a write to the storage, e.g. reading a dynamic secret from the database plugin will require a write to storage, because after its TTL has expired, it needs to be deleted and for this OpenBao has to do "bookkeeping".
Currently, only the Raft storage backend is supported. Support for PostgreSQL is in the works.

There are also some caveats:

It might require changes to your load-balancer setup: If no traffic is routed to the "read-only nodes", then they won't remove any load from the primary.
Data read from a standby node might be stale, if it has recently been updated on the primary. If this is unacceptable to you in general, you can set disable_standby_reads = true in your configuration to disable reads from standby. Or, if this is acceptable most of the time, you can ensure to read from the primary in cases where you need the latest data.

Horizontal Scalability Outlook

Consistent writing in a distributed fashion is an inherently hard problem. If we were set out to solve this in a general purpose way, we'd be doomed to fail. We need to look out for properties of our specific case, that we can exploit: Currently, there is one leader node per cluster, which will handle all write request. We are planning to change this and have leader nodes per namespace, which will handle all write requests for their namespace. As write requests between sibling namespaces are independent, there is no risk of violating data integrity (a property we can exploit 🎉).

This again has an obvious limitation: if you do not use namespaces or if most of your traffic is within a single namespace (e.g. you have a test and prod namespace and prod accounts for 99% of the load) this won't help. But let's take one step at a time and maybe it is even good enough, because a use cases is either small enough to avoid scalability problems or big enough to require namespaces anyway.

Now it is time to look at some benchmarks.

Benchmarks

Setup

I have set up two 3-node OpenBao clusters on Azure, one running OpenBao 2.4.4 and one running OpenBao 2.5.1. Each cluster has a dedicated load balancer, but the load balancer for 2.4.4 is configured to direct traffic only to the primary node and the load balancer for the 2.5.1 cluster will distribute the traffic among all nodes.

I ran two benchmarks, one for the KV engine and one for the PKI engine. I ran both of them for 5 minutes per cluster using the benchmark-openbao tool.

KV Engine Benchmark

The benchmark aims for 500 requests per second, with 90% reads and 10% writes.

Full benchmark-openbao Config

benchmark_kv.hcl
duration = "5m"
cleanup  = true
workers = 2000
rps = 500

disable_keep_alive = true

test "kvv1_write" "writes" {
  weight = 10
  config {
    numkvs = 10
    kvsize = 100
  }
}

test "kvv1_read" "reads" {
  weight = 90
  config {
    numkvs = 10
    kvsize = 100
  }
}

If we take a look at the per-node CPU usage, we can clearly see the effects. The first plot shows the CPU usage of the 2.4.4 cluster over a 15 minute timespan (average per minute) with the 5 minute benchmark roughly centered. Before and after the benchmark, we can see the active node hovering around 4% CPU usage, while the standbys use around 2%. During the benchmark it raises to around 60% for the primary and 8% for the standbys.

CPU usage of 2.4.4 during the benchmark run

The next plot shows the same data for the 2.5.1 cluster. In the first plot we could clearly see which line is the primary, but here all lines are very similar. Before and after the benchmark the CPU usage hovers around 4% jumping up to 30 to 35% for all nodes during the benchmark.

CPU usage of 2.5.1 during the benchmark run

This results in better latencies:

operation	version	mean	95th%	99th%	count
reads	2.5.1	5.75 ms	19.55 ms	36.94 ms	1349402
	2.4.4	7.92 ms	28.69 ms	51.64 ms	1350523
writes	2.5.1	58.08 ms	105.90 ms	239.21 ms	150598
	2.4.4	87.21 ms	201.23 ms	355.96 ms	149477

Averaged Results

While the Graphs above show a single benchmark run, the tables are the combined numbers of several benchmark runs.

PKI Engine Benchmark

The benchmarks aims for only 5 requests per second, generating RSA 2048 certificates with no_store = true, which makes this a read-only operation. Generating a (RSA) private key is quite heavy on the CPU, therefore the low number of requests.

Full benchmark-openbao Config

benchmark_pki_rsa2048_no_store.hcl
duration = "5m"
cleanup  = true
workers = 15
rps = 5

disable_keep_alive = true

test "pki_issue" "pki_issue" {
    weight = 100
    config {
        setup_delay="2s"
        root_ca {
            common_name = "benchmark.test Root Authority"
            key_type = "rsa"
            key_bits = "2048"
        }
        intermediate_csr {
            common_name = "benchmark.test Intermediate Authority"
            key_type = "rsa"
            key_bits = "2048"
        }
        role {
            ttl = "10s"
            no_store = true
            generate_lease = false
            key_type = "rsa"
            key_bits = "2048"
        }
    }
}

Performance Tip

If you are using the PKI engine and issue a fair amount of certificates, you should consider using Certificate Signing Requests (CSRs) instead of generating the private key on the OpenBao cluster. Also, using ECDSA or Ed25519 instead of RSA where possible will improve the performance, see "Key types matter"

Here again we have two graphs, first from the 2.4.4 cluster followed by the 2.5.1 cluster. During idle the results are the same as before.

On 2.4.4 cluster the primary spikes to 100% CPU usage, while the standbys don't see any change (because in contrast to the KV benchmark there are no writes happening, so they have no raft updates to apply).

CPU usage of 2.4.4 during the benchmark run

On the 2.5.1 cluster the load is spread pretty well at roughly 40%.

CPU usage of 2.5.1 during the benchmark run

This time we also see better latencies, but additionally the 2.4.4 cluster failed to even fulfill the 5 requests per second target and some requests (0.73%) even failed.

version	mean	95th%	99th%	count	rate	success
2.5.1	566.77 ms	1976.84 ms	3575.30 ms	15000	5.00	100.00 %
2.4.4	11668.62 ms	27152.59 ms	38203.77 ms	3990	1.33	99.27 %

Missing Benchmarks

I decided against showing a write heavy benchmark, like PKI with no_store = false or KV with 100% write. Not because I want to hide the fact that our read scalability feature does not help here, but because it would be boring to look at. The results for both versions would be similar, the only change you could see is that for 2.5.1 the standby nodes use a little more CPU while the cluster is idle (as we have seen before).

Summary​

Recap on Scalability Terminology​

Recently released Horizontal Scalability Features​

Horizontal Scalability Outlook​

Benchmarks​

Setup​

KV Engine Benchmark​

PKI Engine Benchmark​

Missing Benchmarks​