Implementation of Invalidation with Raft
Note that the PostgreSQL-specific portions were not implemented. Support for Raft landed in v2.5.0.
Summary
As part of the changes to allow read-enabled standby nodes, we need to trigger
the invalidation hooks which allow caches and plugins to detect changes to
cached entries. Such a hook is currently absent from the physical.Backend
interface; this design adds one.
Problem Statement
Various layers in OpenBao perform caching:
- The sdk/physical/cache.go engine, placed above the actual physical storage implementation.
- Various core components, such as the policy store, namespace store, identity store, and others.
- The plugins themselves, such as PKI or Transit, which cache parsed representations of keys to avoid the overhead of re-parsing frequently used key material.
The current design supports an invalidate function Invalidate(ctx, key);
notably, storage isn't present in this call as it is simply meant to
invalidate the given (storage) key in the relevant caches and is not used
to reload the new value.
However, physical.Backend has no hook allowing vault.Core to build and call
its invalidation router.
This design proposes such a mechanism, which HA backends must implement, allowing replicated HA storage engines to send storage change events to OpenBao. With Raft, the committed-log-entry approach lends itself naturally to invalidations. Additionally, we propose a mechanism for allowing PostgreSQL to send these events itself.
User-facing Description
There are no user-facing changes in this MR.
Technical Description
Physical
We extend the physical.HABackend interface as follows:
```go
type InvalidateFunc func(key string)

type HABackend interface {
	// ... existing methods ...

	HookInvalidate(hook InvalidateFunc)
}
```
HookInvalidate(...) accepts a single function, which vault.Core will
implement, taking only a key of a written (updated or deleted) entry. Note
that unlike the sdk.Backend implementation, InvalidateFunc does not take
a context as that is assumed to be supplied by vault.Core (one derived from
the present active context).
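A hypothetical HA backend might wire the hook as follows; mockHABackend and applyReplicatedWrite are illustrative names for this sketch, not part of the proposal:

```go
package main

import "fmt"

// InvalidateFunc mirrors the proposed physical-layer hook type.
type InvalidateFunc func(key string)

// mockHABackend is an illustrative HA backend showing how an
// implementation might store and fire the hook.
type mockHABackend struct {
	invalidate InvalidateFunc
}

// HookInvalidate registers the single invalidation callback that
// vault.Core supplies.
func (b *mockHABackend) HookInvalidate(hook InvalidateFunc) {
	b.invalidate = hook
}

// applyReplicatedWrite simulates a committed write arriving from
// another node; the hook fires only once the write is visible.
func (b *mockHABackend) applyReplicatedWrite(key string) {
	// ... persist the entry locally ...
	if b.invalidate != nil {
		b.invalidate(key)
	}
}

func main() {
	var seen []string
	b := &mockHABackend{}
	b.HookInvalidate(func(key string) { seen = append(seen, key) })
	b.applyReplicatedWrite("logical/abc/roles/web")
	fmt.Println(seen)
}
```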
This function is assumed to be fast and is called only after all updates in a transaction have finished and are assumed to be visible. That is, it must not be called while a write transaction is still pending commit.
This function is not called on the node on which the write occurred. If a
GRPC-backed cross-cluster write capability is created in the future, the
higher level vault.Core function would need to be called manually by the
GRPC interface.
Vault Core
Within the Core function, we'll implement the physical.InvalidateFunc with
the direct approach for the time being:
```go
func (c *Core) Invalidate(key string) {
	// Route invalidation based on key:
	ctx := activeContext
	if strings.HasPrefix(key, namespaceBarrierPrefix) {
		// lookup namespace and adjust ctx appropriately
	}

	// Dispatch to the appropriate subsystem:
	//
	//  1. A plugin backend.
	//  2. Namespace store
	//  3. Token store
	//  4. Quota manager
	//  5. Audit broker
	//  6. Expiration manager
	//  7. Policy store
	//  8. Identity store
	//  9. Login MFA store
}
```
This function will be set as the invalidation hook for the physical backend when it is HA-enabled.
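The prefix-based dispatch step above can be sketched as follows; the prefixes here are simplified placeholders for illustration, not OpenBao's real storage paths:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical prefixes; the real constants live in vault.Core and
// differ in detail.
const (
	policyPrefix = "sys/policy/"
	tokenPrefix  = "sys/token/"
)

// routeInvalidation sketches the dispatch step of Core.Invalidate:
// pick a subsystem from the key prefix and hand it the trimmed key.
// Here we return a label instead of calling into a subsystem.
func routeInvalidation(key string) string {
	switch {
	case strings.HasPrefix(key, policyPrefix):
		return "policy:" + strings.TrimPrefix(key, policyPrefix)
	case strings.HasPrefix(key, tokenPrefix):
		return "token:" + strings.TrimPrefix(key, tokenPrefix)
	default:
		// Anything else is assumed to belong to a mounted plugin backend.
		return "mount:" + key
	}
}

func main() {
	fmt.Println(routeInvalidation("sys/policy/default")) // policy:default
}
```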
Raft Support
This will be hooked from raft.FSM.ApplyBatch(...) after the batch has been
applied and committed.
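A simplified sketch of that hook placement, with a toy log-entry type standing in for the real Raft FSM's protobuf-encoded operations:

```go
package main

import "fmt"

// logEntry is a simplified stand-in for a Raft log operation.
type logEntry struct {
	op  string // "put" or "delete"
	key string
}

type fsm struct {
	invalidate func(key string)
	store      map[string][]byte
}

// ApplyBatch sketches the proposed hook placement: invalidations fire
// only after every entry in the batch has been applied, so readers
// never observe a half-applied transaction.
func (f *fsm) ApplyBatch(entries []logEntry) {
	for _, e := range entries {
		switch e.op {
		case "put":
			f.store[e.key] = []byte{}
		case "delete":
			delete(f.store, e.key)
		}
	}
	// The batch is now committed and visible; notify the core.
	if f.invalidate != nil {
		for _, e := range entries {
			f.invalidate(e.key)
		}
	}
}

func main() {
	var keys []string
	f := &fsm{store: map[string][]byte{}, invalidate: func(k string) { keys = append(keys, k) }}
	f.ApplyBatch([]logEntry{{"put", "a"}, {"delete", "b"}})
	fmt.Println(keys) // [a b]
}
```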
PostgreSQL Support
Only a single viable mechanism exists to support invalidation in PostgreSQL:
building an additional table, openbao_wal_store, to which entries to be
invalidated are written, with explicit state maintained via GRPC-based
invalidation notifications from standby nodes back to the active leader.
Because standby nodes can wait until their replica is up-to-date before
joining the cluster, we only need to maintain WAL entries for actively
connected nodes and can fully prune the WAL when leadership changes.
Likewise, for standby nodes which exist on an out-of-date replica, we can add
a maximum lifetime for WAL entries and remove any entry which a standby node
does not confirm quickly enough; this missed standby can then be forced to
wait for its replica to catch up again before rejoining the WAL mechanism.
This table has the schema:
```sql
CREATE TABLE openbao_wal_store (
    idx  BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    path TEXT COLLATE "C" NOT NULL
);
```
On periodic heartbeats, standby nodes send their PostgreSQL WAL state to the leader, which verifies whether it has pruned the OpenBao WAL table past that point; if so, the standby is prompted to restart and wait for its replica to catch up. Subsequently, when an invalidation is processed, the standby node uses the same RPC mechanism to inform the leader that the invalidation has been processed.
Rationale and Alternatives
There are other alternatives to the PostgreSQL changes:
- Using LISTEN/NOTIFY. This does not work across PostgreSQL replicas, which we otherwise currently support, and would not work with PGBouncer or similar, as notifications are best-effort.
- Using a GRPC call from the active node to a standby with a generation value sent over the wire. This requires significant changes to the storage layer to allow GRPC calls invoked by the physical.Backend implementation, and it is not durable: a standby node at time X may receive a GRPC message with a future generation ID, crash at time X+1, restart at time X+2 and load the previous value into cache, receive the replicated entry at time X+3, and never be invalidated. This would require the active node to store a log of all past GRPC events sent and the standby to confirm it has seen them. This two-way ACK approach does not help with net-new servers, which again may be loaded with out-of-date storage and would thus require past invalidations to be sent as well.
- Hooking the native replication log: this allows OpenBao to know about all changes in PostgreSQL and to decode storage entries. For background, see pg_replication_slot, logical decoding, and standby servers. Note that the default wal_level, replica, is not sufficient to enable this feature. There is a Go library we can use to handle this. However, this configuration change, and the lack of support for notifications on PostgreSQL replicas, means that we cannot easily support a horizontally-scalable setup across disconnected (from an OpenBao credential PoV) PostgreSQL replicas as we can today. This makes this option a poor fit.
- Modifying our schema to include an updated_at column. On startup, we can query the highest value and then periodically check for invalidations which have occurred. This approach does not work for two major reasons: it fails to track deletes, and it does not guarantee a sort order, making it hard to determine which events have been seen without keeping track on both the active and standby nodes.
Each of these has issues.
Downsides
This complexity is necessary to enable read-enabled standby nodes. Memory usage in OpenBao and storage pressure on PostgreSQL may temporarily spike under write-heavy workloads.
Security Implications
None; this operates at the physical layer and does not change any security attributes. Invalidation should be reliable and should not introduce cache-consistency issues.
User/Developer Experience
n/a
Unresolved Questions
n/a
Related Issues
Proof of Concept
n/a