Autopilot
Autopilot enables automated workflows for managing Raft clusters. The current feature set includes 3 main features: Server Stabilization, Dead Server Cleanup and State API.
Server stabilization
Server stabilization helps to retain the stability of the Raft cluster by safely
joining new voting nodes to the cluster. When a new voter node is joined to an
existing cluster, autopilot adds it as a non-voter instead, and waits for a
pre-configured amount of time to monitor it's health. If the node remains to be
healthy for the entire duration of stabilization, then that node will be
promoted as a voter. The server stabilization period can be tuned using
server_stabilization_time
(see below).
Dead server cleanup
Dead server cleanup automatically removes nodes deemed unhealthy from the
Raft cluster, avoiding the manual operator intervention. This feature can be
tuned using the cleanup_dead_servers
, dead_server_last_contact_threshold
,
and min_quorum
(see below).
State API
State API provides detailed information about all the nodes in the Raft cluster in a single call. This API can be used for monitoring for cluster health.
Follower health
Follower node health is determined by 2 factors.
- Its ability to heartbeat to leader node at regular intervals. Tuned using
last_contact_threshold
(see below). - Its ability to keep up with data replication from the leader node. Tuned using
max_trailing_logs
(see below).
Default configuration
By default, Autopilot is enabled in OpenBao, although dead server cleanup is not enabled by default.
Autopilot exposes a configuration API to manage its behavior. Autopilot gets initialized with the following default values. If these default values do not meet your expected autopilot behavior, don't forget to set them to your desired values.
-
cleanup_dead_servers
-false
- This controls whether to remove dead servers from
the Raft peer list periodically or when a new server joins. This requires that
min-quorum
is also set.
- This controls whether to remove dead servers from
the Raft peer list periodically or when a new server joins. This requires that
-
dead_server_last_contact_threshold
-24h
- Limit on the amount of time
a server can go without leader contact before being considered failed. This
takes effect only when
cleanup_dead_servers
is set.
- Limit on the amount of time
a server can go without leader contact before being considered failed. This
takes effect only when
-
min_quorum
- This doesn't default to anything and should be set to the expected number of voters in your cluster whencleanup_dead_servers
is set astrue
.- Minimum number of servers that should always be present in a cluster. Autopilot will not prune servers below this number.
-
max_trailing_logs
-1000
- Amount of entries in the Raft Log that a server can be behind before being considered unhealthy.
-
last_contact_threshold
-10s
- Limit on the amount of time a server can go without leader contact before being considered unhealthy.
-
server_stabilization_time
-10s
- Minimum amount of time a server must be in a healthy state before it can become a voter. Until that happens, it will be visible as a peer in the cluster, but as a non-voter, meaning it won't contribute to quorum.
Note: Autopilot in OpenBao does similar things to what autopilot does in Consul. However, the configuration in these 2 systems differ in terms of default values and thresholds; some additional parameters might also show up in OpenBao in comparison to Consul as well. Autopilot in OpenBao and Consul use different technical underpinnings requiring these differences, to provide the autopilot functionality.