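There are many articles about Patroni online, but most of them only walk through a manual setup of a Patroni environment (the same is true of the various WeChat groups) and say nothing about how Patroni's parameters are used, how failover works, or how to exercise it in practice. This article starts from the cluster built in "Automated Patroni installation on Ubuntu 20: building a Patroni cluster in one minute", then tests and verifies failover hands-on. It simulates real failures and uses the failover logs to analyze how failover is carried out and what effect it has.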
0. Patroni cluster status
ubuntu11 is the primary, and ubuntu12 and ubuntu13 are replicas. Every test below starts from this layout: ubuntu11 as primary, ubuntu12 and ubuntu13 as replicas.
- root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Leader | running | 6 | | | | |
- | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming | 6 | 0/F000348 | 0 | 0/F000348 | 0 |
- | ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 6 | 0/F000348 | 0 | 0/F000348 | 0 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
Because of the test goals, automatic restart of the Patroni systemd service is disabled (Restart=no):
- postgres@ubuntu11:~$ cat /etc/systemd/system/patroni.service
- [Unit]
- Description=Patroni
- After=network.target etcd.service
- Wants=etcd.service
- [Service]
- Type=simple
- User=postgres
- Group=postgres
- Environment="TZ=Asia/Shanghai"
- Environment="PYTHONUNBUFFERED=1"
- ExecStart=/usr/local/bin/patroni /usr/local/pgsql17/patroni/patroni.yml
- ExecReload=/bin/kill -HUP $MAINPID
- ExecStop=/bin/kill -TERM $MAINPID
- #Restart=on-failure
- Restart=no
- RestartSec=10
- TimeoutStartSec=120
- TimeoutStopSec=60
- LimitNOFILE=65536
- StandardOutput=null
- StandardError=journal
- SyslogIdentifier=patroni
- [Install]
- WantedBy=multi-user.target
- postgres@ubuntu11:~$
Log on the primary ubuntu11. The cluster state is polled every 10 seconds; the polling interval is set by the loop_wait parameter.
- 2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:47:17,145 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:47:27,189 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-21 08:47:37,153 INFO: no action. I am (ubuntu11), the leader with the lock
Log on the replica ubuntu12:
- 2026-05-21 08:46:47,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:46:57,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:07,671 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:17,632 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:27,675 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:37,139 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:47,233 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
Log on the replica ubuntu13:
- 2026-05-21 08:46:57,643 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:07,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:17,647 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:27,688 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:37,155 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:47,255 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:47:57,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
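The 10-second polling interval can be checked directly against the log timestamps. The following is a minimal sketch, not part of the original setup: it parses the ubuntu11 log lines quoted above and computes the gap between consecutive HA loop runs. The helper names `parse_ts` and `intervals` are made up for this illustration.

```python
from datetime import datetime

# Log lines copied from the ubuntu11 primary log above.
LOG_LINES = [
    "2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:17,145 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:27,189 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:37,153 INFO: no action. I am (ubuntu11), the leader with the lock",
]


def parse_ts(line):
    """The first 23 characters hold 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def intervals(lines):
    """Seconds between consecutive log lines."""
    stamps = [parse_ts(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


gaps = intervals(LOG_LINES)
mean_gap = sum(gaps) / len(gaps)
print("gaps (s):", [round(g, 3) for g in gaps])
print("mean gap (s):", round(mean_gap, 3))
```

Every gap lands within a few hundredths of a second of 10, which matches a loop_wait of 10 seconds.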
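1. Automatic failover scenario 1: the primary's OS is healthy, but the Patroni service fails
The primary is healthy; its Patroni service is stopped to simulate a primary failure.
- root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list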
- + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Leader | running | 6 | | | | |
- | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming | 6 | 0/F000348 | 0 | 0/F000348 | 0 |
- | ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 6 | 0/F000348 | 0 | 0/F000348 | 0 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- root@ubuntu11:/usr/local/patroni_install# systemctl stop patroni
- root@ubuntu11:/usr/local/patroni_install#
Cluster status as seen from the replica ubuntu12. The original primary is now in the stopped state:
- root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Replica | stopped | | unknown | | unknown | |
- | ubuntu12 | 192.168.152.122:9000 | Leader | running | 7 | | | | |
- | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 7 | 0/100001A8 | 0 | 0/100001A8 | 0 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
The former replica ubuntu12 becomes the new primary. Its log:
- ......
- 2026-05-21 08:55:27,680 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:55:37,723 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:55:47,683 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:55:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-21 08:56:06,109 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
- 2026-05-21 08:56:06,169 INFO: promoted self to leader by acquiring session lock
- 2026-05-21 08:56:06,169 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-21 08:56:06,172 INFO: updated leader lock during promote
- server promoting
- 2026-05-21 08:56:07,185 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-21 08:56:07,195 INFO: Assigning synchronous standby status to ['ubuntu13']
- server signaled
- 2026-05-21 08:56:09,324 INFO: Synchronous standby status assigned to ['ubuntu13']
- 2026-05-21 08:56:09,369 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-21 08:56:17,196 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-21 08:56:27,187 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-21 08:56:37,242 INFO: no action. I am (ubuntu12), the leader with the lock
- ......
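Failover flow in this scenario:
Patroni on the primary ubuntu11 is stopped manually to simulate a failure -> Patroni on ubuntu11 deletes the leader key in the DCS as it shuts down -> the replica ubuntu12 finds the DCS without a leader on its next HA loop run (at most one loop_wait later) -> it acquires the lock and becomes Leader -> it promotes the local PostgreSQL to primary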
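The flow above can be sketched with a toy in-memory model. This is only an illustration of the idea and not Patroni's code: `ToyDCS` and `ha_cycle` are hypothetical names, and the real HA loop also considers replication lag, synchronous mode, and tags before a replica competes for the lock.

```python
class ToyDCS:
    """In-memory stand-in for the leader key in a DCS such as etcd (illustrative only)."""

    def __init__(self):
        self.leader = None  # name of the member holding the leader key, or None

    def acquire(self, member):
        # Create-if-absent: succeeds only when nobody holds the key.
        if self.leader is None:
            self.leader = member
            return True
        return False

    def release(self, member):
        # A graceful shutdown removes the key right away instead of waiting for the TTL.
        if self.leader == member:
            self.leader = None


def ha_cycle(dcs, member):
    """One heavily simplified HA loop iteration."""
    if dcs.leader == member:
        return "leader"
    if dcs.leader is None and dcs.acquire(member):
        return "promote"
    return "follow"


dcs = ToyDCS()
dcs.acquire("ubuntu11")
before = ha_cycle(dcs, "ubuntu12")  # leader key present, keep following
dcs.release("ubuntu11")             # models "systemctl stop patroni" on ubuntu11
after = ha_cycle(dcs, "ubuntu12")   # key is gone, ubuntu12 takes it
third = ha_cycle(dcs, "ubuntu13")   # ubuntu13 finds the key already taken
print(before, after, third)
```

Because the key is removed explicitly, the replica does not have to wait for any lease to expire, which is why this scenario fails over faster than the power-loss case.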
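2. Automatic failover scenario 2: the primary server loses power
ubuntu11 is powered off from the hypervisor with "Power Off" (not "Shut Down Guest") to simulate a sudden power loss. This scenario requires a solid understanding of the lease lifetime, that is, the ttl parameter (default 30 seconds).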

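Cluster status before and after the power-off, listed from ubuntu13 (as the shell prompt shows); in the second listing ubuntu12 is the new primary:
- root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list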
- + Cluster: pg_cluster_wy_prod (7642212398676862997) ----+----+-------------+-----+------------+-----+------------------------+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag | Tags |
- +----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
- | ubuntu11 | 192.168.152.121:9000 | Leader | running | 10 | | | | | failover_priority: 100 |
- | ubuntu12 | 192.168.152.122:9000 | Replica | streaming | 10 | 0/C000000 | 0 | 0/C000358 | 0 | failover_priority: 80 |
- | ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 10 | 0/C000380 | 0 | 0/C000380 | 0 | failover_priority: 60 |
- +----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
- root@ubuntu13:/usr/local/patroni_install#
- root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7642212398676862997) ---------+----+-------------+-----+------------+-----+-----------------------+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag | Tags |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
- | ubuntu12 | 192.168.152.122:9000 | Leader | running | 11 | | | | | failover_priority: 80 |
- | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 11 | 0/C000688 | 0 | 0/C000688 | 0 | failover_priority: 60 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
- root@ubuntu13:/usr/local/patroni_install#
Patroni log on the new primary ubuntu12:
- 2026-05-22 13:53:59,956 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:54:10,451 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:54:20,026 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:54:30,458 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:54:40,461 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:54:50,499 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:55:00,456 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 13:55:10,457 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)###### ubuntu11 was powered off at about this point
- 2026-05-22 13:55:20,498 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)###### Why does ubuntu11 still look healthy here? Because its lease has not expired yet
- 2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
- 2026-05-22 13:55:32,114 INFO: promoted self to leader by acquiring session lock
- 2026-05-22 13:55:32,114 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-22 13:55:32,115 INFO: updated leader lock during promote
- 2026-05-22 13:55:33,137 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-22 13:55:33,193 INFO: Assigning synchronous standby status to ['ubuntu13']
- 2026-05-22 13:55:35,316 INFO: Synchronous standby status assigned to ['ubuntu13']
- 2026-05-22 13:55:35,322 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 13:55:35,377 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 13:55:45,324 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 13:55:55,367 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 13:56:05,329 INFO: no action. I am (ubuntu12), the leader with the lock
Checking the role of the new primary through psql:
- postgres=#
- postgres=#
- postgres=# select now(),pg_is_in_recovery(); ########################### the original primary ubuntu11 is powered off here; the query is then run repeatedly
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:10.849473+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:11.665724+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- ------------------------------+-------------------
- 2026-05-22 13:55:12.32947+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:13.017149+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:13.799962+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:14.902866+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:15.672331+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:16.435662+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:17.070935+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:17.816528+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:18.546785+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:19.393943+08 | t
- (1 row)
- #......中间省略掉......
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:29.759037+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:30.417626+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:31.089604+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:31.775459+08 | t
- (1 row)
- postgres=# select now(),pg_is_in_recovery(); ########################### about 22 seconds later, the new primary is finally promoted
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:32.400935+08 | f
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- ------------------------------+-------------------
- 2026-05-22 13:55:33.27183+08 | f
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:33.950342+08 | f
- (1 row)
- postgres=# select now(),pg_is_in_recovery();
- now | pg_is_in_recovery
- -------------------------------+-------------------
- 2026-05-22 13:55:34.758651+08 | f
- (1 row)
- postgres=# postgres=#
- postgres-# postgres=#
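The "about 22 seconds" figure can be reproduced from the psql output itself. The sketch below is a small illustration that uses a subset of the (timestamp, pg_is_in_recovery) pairs shown above; the samples elided in the session are left out here as well, and the helper names are invented for this example.

```python
from datetime import datetime

# (now(), pg_is_in_recovery) pairs copied from the psql session above.
SAMPLES = [
    ("2026-05-22 13:55:10.849473", True),   # first query, around the power-off
    ("2026-05-22 13:55:19.393943", True),
    ("2026-05-22 13:55:29.759037", True),
    ("2026-05-22 13:55:31.775459", True),
    ("2026-05-22 13:55:32.400935", False),  # first query answered by a primary
    ("2026-05-22 13:55:34.758651", False),
]


def parse(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def promotion_window(samples):
    """Return (last time seen in recovery, first time seen out of recovery)."""
    last_standby = None
    for text, in_recovery in samples:
        when = parse(text)
        if in_recovery:
            last_standby = when
        else:
            return last_standby, when
    return last_standby, None


start = parse(SAMPLES[0][0])
last_t, first_f = promotion_window(SAMPLES)
elapsed = (first_f - start).total_seconds()
print(f"promoted between {last_t.time()} and {first_f.time()}")
print(f"{elapsed:.2f} s after the first sample")
```

The promotion falls between two consecutive samples less than a second apart, a little under 22 seconds after the first query.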
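Putting the logs above together, ttl can be understood by looking at the timeline:
1. 2026-05-22 13:55:10,457: as noted above, the original primary ubuntu11 was powered off at about this time.
2. 2026-05-22 13:55:20,498: the Patroni log on ubuntu12 still reports ubuntu11 as a healthy leader.
3. 2026-05-22 13:55:32.400935: querying the new primary shows that pg_is_in_recovery has only now changed to f, meaning the failover has succeeded.
Do the logs contradict what was actually done? ubuntu11 clearly lost power at 13:55:10, so why did the check at 13:55:20 still find it healthy, and why did the new primary only start working at 13:55:32?
The reason is the lease. ubuntu11 lost power at 13:55:10, and a few seconds earlier (within one loop_wait, which defaults to 10 seconds) Patroni on ubuntu11 had renewed the leader key in etcd. Each renewal keeps the key valid for another 30 seconds, so the lease only expired at about 13:55:30 or shortly after. At 13:55:20, therefore, Patroni on the replica that would later take over still saw a leader key that had not expired.
Only in the next check cycle, at about 13:55:30, did that replica find that the original primary was gone: "2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))". Why is the log stamped 13:55:32, two seconds after 13:55:30? Because of connect timeout=2.
That is what the ttl parameter in Patroni really means.
Failover flow in this scenario:
Power off ubuntu11 to simulate a primary failure -> 10 seconds later, ubuntu11's leader lease is still valid (although ubuntu11 is already down) -> 10 seconds later, the lease is still valid (ubuntu11 is still down) -> 10 seconds later, ubuntu12 detects that the leader lease has expired -> it takes the leader key and promotes the local PostgreSQL to primary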
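The timeline can also be worked backwards from the log. The sketch below is a rough reconstruction under one stated assumption: that ubuntu12's failed REST call to ubuntu11 started at the moment the leader key disappeared, so that the lease expiry is the WARNING timestamp minus the 2-second connect timeout. The ubuntu11 log for this run is not shown in the article, so the renewal time is inferred, not observed.

```python
from datetime import datetime, timedelta

TTL = timedelta(seconds=30)             # ttl default quoted in the article
CONNECT_TIMEOUT = timedelta(seconds=2)  # "connect timeout=2" from the WARNING line


def ts(text):
    """Parse a Patroni log timestamp such as '2026-05-22 13:55:32,106'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


# Timestamps taken from ubuntu12's Patroni log above.
checks_seeing_leader = [ts("2026-05-22 13:55:10,457"), ts("2026-05-22 13:55:20,498")]
warning_at = ts("2026-05-22 13:55:32,106")

# ASSUMPTION: the failing REST call started when the leader key disappeared.
lease_expired_at = warning_at - CONNECT_TIMEOUT
implied_last_renewal = lease_expired_at - TTL

still_valid = [check < lease_expired_at for check in checks_seeing_leader]
print("lease expired at about", lease_expired_at.time())
print("implied last renewal at about", implied_last_renewal.time())
print("leader key visible at the two earlier checks:", still_valid)
```

Under that assumption the last renewal happened around 13:55:00, roughly ten seconds before the power-off, and both earlier checks fall inside the 30-second lease.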
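So to make Patroni's failover more responsive, reduce ttl, which shortens the leader key lease, and also reduce loop_wait so that the leader key is checked more often. Together these speed up failure detection and failover. Be aware, though, that lowering these two parameters can lead to unexpected failovers when the network is jittery.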
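When tuning these values it helps to sanity-check them first. The sketch below is a simplified model and not a Patroni tool. It applies the rule given in the Patroni documentation, loop_wait + 2 * retry_timeout <= ttl, and estimates how long a crashed leader's lease can outlive the crash. retry_timeout (default 10) does not appear in this article and is taken from the Patroni documentation; the function names are hypothetical.

```python
def timing_problems(ttl, loop_wait, retry_timeout):
    """Return a list of human-readable problems with the three timing settings."""
    problems = []
    # Rule from the Patroni documentation: loop_wait + 2 * retry_timeout <= ttl.
    total = loop_wait + 2 * retry_timeout
    if total > ttl:
        problems.append(f"loop_wait + 2*retry_timeout = {total} exceeds ttl = {ttl}")
    return problems


def lease_expiry_window(ttl, loop_wait):
    """Seconds between a hard crash of the leader and expiry of its lease.

    Simplified model: the leader renews once per loop_wait, so the crash
    happens somewhere between 0 and loop_wait seconds after the last renewal.
    """
    return (ttl - loop_wait, ttl)


defaults = dict(ttl=30, loop_wait=10, retry_timeout=10)
aggressive = dict(ttl=20, loop_wait=5, retry_timeout=10)

print("defaults:", timing_problems(**defaults), lease_expiry_window(30, 10))
print("aggressive:", timing_problems(**aggressive), lease_expiry_window(20, 5))
```

With the defaults the model gives a window of 20 to 30 seconds, which is consistent with the roughly 22 seconds observed in scenario 2. The second set of values shows that ttl cannot be lowered on its own: retry_timeout has to come down with it.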
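3. Automatic failover scenario 3: the primary is network-partitioned
iptables rules on both replicas drop all traffic to and from ubuntu11 (192.168.152.121), for example iptables -A OUTPUT -d 192.168.152.121 -j DROP.
Replica 1:
- root@ubuntu12:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP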
- root@ubuntu12:/usr/local/patroni_install# sudo iptables -A INPUT -s 192.168.152.121 -j DROP
- root@ubuntu12:/usr/local/patroni_install#
Replica 2:
- root@ubuntu13:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP
- root@ubuntu13:/usr/local/patroni_install# sudo iptables -A INPUT -s 192.168.152.121 -j DROP
- root@ubuntu13:/usr/local/patroni_install#
The network partition is now in place.
1. The new primary: ubuntu12 has successfully taken over as primary.
- 2026-05-22 14:47:38,980 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 14:47:49,402 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 14:47:58,941 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 14:48:09,491 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 14:48:19,441 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
- 2026-05-22 14:48:31,104 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f8988fdb1f0>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
- 2026-05-22 14:48:31,183 INFO: promoted self to leader by acquiring session lock
- 2026-05-22 14:48:31,187 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-22 14:48:31,239 INFO: updated leader lock during promote
- 2026-05-22 14:48:32,206 INFO: Lock owner: ubuntu12; I am ubuntu12
- 2026-05-22 14:48:32,214 INFO: Assigning synchronous standby status to ['ubuntu13']
- 2026-05-22 14:48:34,337 INFO: Synchronous standby status assigned to ['ubuntu13']
- 2026-05-22 14:48:34,385 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 14:48:42,245 INFO: no action. I am (ubuntu12), the leader with the lock
- 2026-05-22 14:48:52,256 INFO: no action. I am (ubuntu12), the leader with the lock
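Note that failover under a network partition behaves like the power-loss case above. In the new primary's log it looks as if little more than 10 seconds pass between detection and failover. In reality, the original primary's last renewal of the leader key extended it by 30 seconds (ttl). So the future primary ran its first two checks after the partition had already formed, yet it did not take over until the original primary's leader key lease expired. This matches the previous scenario; it was tested in detail and is not repeated here.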
2. The original primary:
At this point the original primary can no longer reach ubuntu12 or ubuntu13. Note these log lines:
2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
2026-05-22 14:48:17,258 INFO: Demoting self (offline)
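The order of events across the two nodes can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the last "leader with the lock" line on ubuntu11 (14:47:58,948, from the ubuntu11 log in this section) is treated as the last successful lease renewal, and ttl is the default 30 seconds. The comparison is about when the demotion was initiated; the same log only reports "demoted self" as complete at 14:49:01,400.

```python
from datetime import datetime, timedelta

TTL = timedelta(seconds=30)  # ttl default quoted in the article


def ts(text):
    """Parse a Patroni log timestamp such as '2026-05-22 14:48:17,258'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


# ASSUMPTION: the last "leader with the lock" line marks the last renewal.
last_renewal = ts("2026-05-22 14:47:58,948")   # ubuntu11 log
demote_started = ts("2026-05-22 14:48:17,258")  # ubuntu11: "Demoting self (offline)"
promoted = ts("2026-05-22 14:48:31,183")        # ubuntu12: "promoted self to leader"

lease_expiry = last_renewal + TTL
gap = (promoted - demote_started).total_seconds()

print("estimated lease expiry:", lease_expiry.time())
print("demotion started before lease expiry:", demote_started < lease_expiry)
print("promotion happened after lease expiry:", lease_expiry < promoted)
print(f"demotion started {gap:.3f} s before the promotion")
```

Under these assumptions the old primary starts stepping down about 14 seconds before the new primary is promoted, and the promotion only happens once the old lease has run out.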
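After the partition, the original primary automatically demotes itself to a read-only standby, so no dual-primary or split-brain situation arises. It also keeps trying to reach the etcd cluster members on ubuntu12 and ubuntu13 (the log keeps growing and is not pasted in full), so that it can rejoin the cluster automatically once the network recovers.
- 2026-05-22 14:47:38,896 INFO: no action. I am (ubuntu11), the leader with the lock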
- 2026-05-22 14:47:38,963 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-22 14:47:48,912 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-22 14:47:58,948 INFO: no action. I am (ubuntu11), the leader with the lock
- 2026-05-22 14:48:08,903 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 14:48:12,244 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3332171243333355)")
- 2026-05-22 14:48:12,244 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:48:12,244 INFO: Retrying on http://192.168.152.123:2379
- 2026-05-22 14:48:13,913 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2b0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:48:13,913 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:48:13,913 INFO: Retrying on http://192.168.152.122:2379
- 2026-05-22 14:48:15,583 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2e0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:48:15,583 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:48:17,253 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b520>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:48:17,256 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
- 2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
- 2026-05-22 14:48:17,258 INFO: Demoting self (offline)
- 2026-05-22 14:48:18,932 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b970>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:00,355 INFO: postmaster pid=3525
- 2026-05-22 14:49:01,400 INFO: demoted self because DCS is not accessible and I was a leader
- 2026-05-22 14:49:01,403 WARNING: Loop time exceeded, rescheduling immediately.
- 2026-05-22 14:49:01,405 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 14:49:01,405 INFO: establishing a new patroni heartbeat connection to postgres
- 2026-05-22 14:49:04,749 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33254870033331)")
- 2026-05-22 14:49:04,749 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:04,749 INFO: Retrying on http://192.168.152.123:2379
- 2026-05-22 14:49:06,419 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04bfa0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:06,419 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:06,419 INFO: Retrying on http://192.168.152.122:2379
- 2026-05-22 14:49:08,089 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e451c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:08,089 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:09,758 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53fa90>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:11,417 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.00350682653891)")
- 2026-05-22 14:49:11,417 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:13,086 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04ba60>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:13,088 ERROR: Error communicating with DCS
- 2026-05-22 14:49:13,088 INFO: DCS is not accessible
- 2026-05-22 14:49:13,088 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 14:49:13,090 WARNING: Loop time exceeded, rescheduling immediately.
- 2026-05-22 14:49:14,757 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b820>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:14,763 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 14:49:18,103 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3331819403333234)")
- 2026-05-22 14:49:18,103 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:18,103 INFO: Retrying on http://192.168.152.122:2379
- 2026-05-22 14:49:19,773 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e450d0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:19,773 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:19,773 INFO: Retrying on http://192.168.152.123:2379
- 2026-05-22 14:49:21,441 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45370>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:21,442 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:23,112 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45670>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:23,114 ERROR: Error communicating with DCS
- 2026-05-22 14:49:23,114 INFO: DCS is not accessible
- 2026-05-22 14:49:23,114 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 14:49:23,115 WARNING: Loop time exceeded, rescheduling immediately.
- 2026-05-22 14:49:24,784 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45b50>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:24,790 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 14:49:28,128 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.332673626333379)")
- 2026-05-22 14:49:28,128 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:28,128 INFO: Retrying on http://192.168.152.123:2379
- 2026-05-22 14:49:29,799 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c220>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:29,799 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:29,799 INFO: Retrying on http://192.168.152.122:2379
- 2026-05-22 14:49:31,469 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c41f460>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:31,469 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:33,138 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53f9a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:34,794 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.1904628130222932)")
- 2026-05-22 14:49:34,794 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:36,464 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b0a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:36,468 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 14:49:36,468 ERROR: Error communicating with DCS
- 2026-05-22 14:49:36,468 INFO: DCS is not accessible
- 2026-05-22 14:49:36,470 WARNING: Loop time exceeded, rescheduling immediately.
- 2026-05-22 14:49:38,140 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b8b0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:38,145 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 14:49:41,485 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33319548833335)")
- 2026-05-22 14:49:41,485 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:41,485 INFO: Retrying on http://192.168.152.122:2379
- 2026-05-22 14:49:43,153 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45fa0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:43,153 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:43,153 INFO: Retrying on http://192.168.152.123:2379
- 2026-05-22 14:49:44,823 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e459d0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:44,823 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:46,493 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45ac0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:48,150 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.4432267216546961)")
- 2026-05-22 14:49:48,150 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 14:49:49,819 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45100>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 14:49:49,821 ERROR: Error communicating with DCS
- 2026-05-22 14:49:49,821 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 14:49:49,821 INFO: DCS is not accessible
Remove the network partition on ubuntu12:
- root@ubuntu12:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
- root@ubuntu12:/usr/local/patroni_install# sudo iptables -D INPUT -s 192.168.152.121 -j DROP
Remove the network partition on ubuntu13 as well:
- root@ubuntu13:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
- root@ubuntu13:/usr/local/patroni_install# sudo iptables -D INPUT -s 192.168.152.121 -j DROP
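The two `iptables -D` commands delete rules that must match, field for field, the rules that created the partition. A minimal sketch of that symmetry follows. The helper name `partition_rules` is hypothetical, and the `-A` (append) form is an assumption about how the rules were originally added; the script only builds the command lines and does not run them.

```python
from typing import List


def partition_rules(peer_ip: str, action: str) -> List[List[str]]:
    """Build the iptables commands that block or unblock traffic with peer_ip.

    action is "add" (iptables -A, assumed to be how the partition was created)
    or "remove" (iptables -D, as used in the commands above).
    """
    flags = {"add": "-A", "remove": "-D"}
    if action not in flags:
        raise ValueError(f"unknown action: {action!r}")
    flag = flags[action]
    return [
        # Drop everything this host sends to the peer.
        ["iptables", flag, "OUTPUT", "-d", peer_ip, "-j", "DROP"],
        # Drop everything this host receives from the peer.
        ["iptables", flag, "INPUT", "-s", peer_ip, "-j", "DROP"],
    ]


if __name__ == "__main__":
    for cmd in partition_rules("192.168.152.121", "remove"):
        print(" ".join(cmd))
```

Because `-D` removes a rule only when its specification matches an existing rule, generating both forms from the same function keeps them consistent.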
The previously isolated ubuntu11 automatically rejoins the cluster as a replica:
- root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Replica | streaming | 5 | 0/60043F0 | 0 | 0/60043F0 | 0 |
- | ubuntu12 | 192.168.152.122:9000 | Leader | running | 5 | | | | |
- | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 5 | 0/60043F0 | 0 | 0/60043F0 | 0 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
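When repeating this kind of test, it can help to check the member roles programmatically instead of reading the table by eye. The following is a rough sketch that parses the pretty-printed table shown above. The function name `parse_patronictl_list` is hypothetical, and the parser assumes the column layout of this particular output.

```python
from typing import Dict, List

SAMPLE = """\
+ Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
| Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Replica | streaming | 5 | 0/60043F0 | 0 | 0/60043F0 | 0 |
| ubuntu12 | 192.168.152.122:9000 | Leader | running | 5 | | | | |
| ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 5 | 0/60043F0 | 0 | 0/60043F0 | 0 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
"""


def parse_patronictl_list(text: str) -> List[Dict[str, str]]:
    """Turn the table printed by `patronictl list` into one dict per member."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Only lines of the form "| ... |" carry data; "+---" lines are borders.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
            continue
        # The header has two "Lag" columns; the dict keeps only the last one.
        rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    members = parse_patronictl_list(SAMPLE)
    leaders = [m["Member"] for m in members if m["Role"] == "Leader"]
    print("leaders:", leaders)
    for m in members:
        print(m["Member"], m["Role"], m["State"], "TL", m["TL"])
```

A healthy cluster should report exactly one `Leader`, with every other member in the `streaming` state on the same timeline.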
The log on ubuntu11 shows that it ran pg_rewind automatically and then rejoined the cluster as a replica:
- 2026-05-22 15:02:40,971 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c4c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 15:02:42,626 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.3500901345867078)")
- 2026-05-22 15:02:42,626 INFO: Reconnection allowed, looking for another server.
- 2026-05-22 15:02:44,295 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5cb80>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 15:02:44,296 ERROR: Error communicating with DCS
- 2026-05-22 15:02:44,297 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
- 2026-05-22 15:02:44,297 INFO: DCS is not accessible
- 2026-05-22 15:02:44,298 WARNING: Loop time exceeded, rescheduling immediately.
- 2026-05-22 15:02:45,967 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b3a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 15:02:45,975 INFO: Lock owner: ubuntu11; I am ubuntu11
- 2026-05-22 15:02:47,985 ERROR: failed to update leader lock
- 2026-05-22 15:02:47,994 INFO: not promoting because failed to update leader lock in DCS
- 2026-05-22 15:02:47,994 INFO: Lock owner: ubuntu12; I am ubuntu11
- 2026-05-22 15:02:48,001 INFO: Local timeline=4 lsn=0/70000A0
- 2026-05-22 15:02:48,027 INFO: primary_timeline=5
- 2026-05-22 15:02:48,030 INFO: primary: history=1 0/504F580 no recovery target specified
- 2 0/6003D20 no recovery target specified
- 3 0/6003EC0 no recovery target specified
- 4 0/6004148 no recovery target specified
- 2026-05-22 15:02:48,049 INFO: running pg_rewind from ubuntu12
- 2026-05-22 15:02:49,312 INFO: running pg_rewind from dbname=postgres user=rewind_user host=192.168.152.122 port=9000 target_session_attrs=read-write
- 2026-05-22 15:02:50,305 INFO: pg_rewind exit code=0
- 2026-05-22 15:02:50,305 INFO: stdout=
- 2026-05-22 15:02:50,305 INFO: stderr=pg_rewind: servers diverged at WAL location 0/6004148 on timeline 4
- pg_rewind: rewinding from last common checkpoint at 0/6004038 on timeline 4
- pg_rewind: Done!
- 2026-05-22 15:02:50,307 WARNING: Postgresql is not running.
- 2026-05-22 15:02:50,308 INFO: Lock owner: ubuntu12; I am ubuntu11
- 2026-05-22 15:02:50,319 INFO: pg_controldata:
- pg_control version number: 1700
- Catalog version number: 202406281
- Database system identifier: 7642589780522937440
- Database cluster state: in archive recovery
- pg_control last modified: Fri May 22 15:02:50 2026
- Latest checkpoint location: 0/6004340
- Latest checkpoint's REDO location: 0/60042E8
- Latest checkpoint's REDO WAL file: 000000050000000000000006
- Latest checkpoint's TimeLineID: 5
- Latest checkpoint's PrevTimeLineID: 5
- Latest checkpoint's full_page_writes: on
- Latest checkpoint's NextXID: 0:762
- Latest checkpoint's NextOID: 24576
- Latest checkpoint's NextMultiXactId: 1
- Latest checkpoint's NextMultiOffset: 0
- Latest checkpoint's oldestXID: 731
- Latest checkpoint's oldestXID's DB: 1
- Latest checkpoint's oldestActiveXID: 762
- Latest checkpoint's oldestMultiXid: 1
- Latest checkpoint's oldestMulti's DB: 1
- Latest checkpoint's oldestCommitTsXid: 0
- Latest checkpoint's newestCommitTsXid: 0
- Time of latest checkpoint: Fri May 22 14:53:31 2026
- Fake LSN counter for unlogged rels: 0/3E8
- Minimum recovery ending location: 0/60043F0
- Min recovery ending loc's timeline: 5
- Backup start location: 0/0
- Backup end location: 0/0
- End-of-backup record required: no
- wal_level setting: replica
- wal_log_hints setting: on
- max_connections setting: 100
- max_worker_processes setting: 8
- max_wal_senders setting: 10
- max_prepared_xacts setting: 0
- max_locks_per_xact setting: 64
- track_commit_timestamp setting: off
- Maximum data alignment: 8
- Database block size: 8192
- Blocks per segment of large relation: 131072
- WAL block size: 8192
- Bytes per WAL segment: 16777216
- Maximum length of identifiers: 64
- Maximum columns in an index: 32
- Maximum size of a TOAST chunk: 1996
- Size of a large-object chunk: 2048
- Date/time type storage: 64-bit integers
- Float8 argument passing: by value
- Data page checksum version: 1
- Mock authentication nonce: 3587dd0ff212f7ed05a16aa24aa1d6a6f187f55d5d6a2e158ce45327a7e55005
- 2026-05-22 15:02:50,320 INFO: Lock owner: ubuntu12; I am ubuntu11
- 2026-05-22 15:02:50,367 INFO: starting as a secondary
- 2026-05-22 15:02:50,368 INFO: closed patroni connections to postgres
- 2026-05-22 15:02:50,738 INFO: postmaster pid=3952
- 2026-05-22 15:02:51,774 INFO: Lock owner: ubuntu12; I am ubuntu11
- 2026-05-22 15:02:51,774 INFO: establishing a new patroni heartbeat connection to postgres
- 2026-05-22 15:02:51,795 INFO: Local timeline=5 lsn=0/60043F0
- 2026-05-22 15:02:51,803 INFO: primary_timeline=5
- 2026-05-22 15:02:51,812 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
- 2026-05-22 15:02:52,281 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
- 2026-05-22 15:03:02,819 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
- 2026-05-22 15:03:12,778 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
- 2026-05-22 15:03:22,777 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
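The log above explains why a rewind was needed: ubuntu11 was still on timeline 4 at LSN 0/70000A0, while the new primary's history says timeline 4 ended at 0/6004148, so ubuntu11 had written WAL that the new primary never saw. A simplified sketch of that comparison follows. The names `parse_lsn` and `needs_rewind` are hypothetical, and the strict "greater than" rule is an assumption for illustration; Patroni's real check handles more cases.

```python
from typing import List, Tuple

# (timeline, LSN at which that timeline ended), copied from the
# "primary: history=" lines in the log above.
HISTORY: List[Tuple[int, str]] = [
    (1, "0/504F580"),
    (2, "0/6003D20"),
    (3, "0/6003EC0"),
    (4, "0/6004148"),
]


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/70000A0' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def needs_rewind(local_timeline: int, local_lsn: str,
                 history: List[Tuple[int, str]]) -> bool:
    """Return True if the local WAL goes past the point where its timeline ended.

    Simplified assumption: the node has diverged only when its LSN is strictly
    greater than the switch point recorded for its timeline.
    """
    switch_points = dict(history)
    if local_timeline not in switch_points:
        raise ValueError(f"timeline {local_timeline} is not in the primary's history")
    return parse_lsn(local_lsn) > parse_lsn(switch_points[local_timeline])


if __name__ == "__main__":
    # Values from the log: "Local timeline=4 lsn=0/70000A0"
    print(needs_rewind(4, "0/70000A0", HISTORY))
```

With the values from the log the check reports a divergence, which matches the pg_rewind message "servers diverged at WAL location 0/6004148 on timeline 4".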
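4, Automatic failover scenario 4: forcibly deleting the primary's PostgreSQL data files
This almost never happens in practice unless someone does it on purpose. Still, what happens if the PostgreSQL data files of a running primary are forcibly deleted?
The test below shows that after the data files are deleted: 1, the Patroni cluster fails over automatically (because the primary can no longer serve requests); 2, the original primary automatically clones a copy of the data from the cluster and runs as a replica.
- ###### Cluster in its normal state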
- root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Sync Standby | streaming | 6 | 0/70002D0 | 0 | 0/70002D0 | 0 |
- | ubuntu12 | 192.168.152.122:9000 | Replica | streaming | 6 | 0/70002D0 | 0 | 0/70002D0 | 0 |
- | ubuntu13 | 192.168.152.123:9000 | Leader | running | 6 | | | | |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- ###### Forcibly delete the data files of the original primary (ubuntu13)
- root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7642589780522937440) ----------------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Leader | running | 7 | | | | |
- | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming | 7 | 0/7000410 | 0 | 0/7000410 | 0 |
- | ubuntu13 | 192.168.152.123:9000 | Replica | creating replica | | unknown | | unknown | |
- +----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
- ###### Cluster back to normal
- root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
- + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
- | Member | Host | Role | State | TL | Receive LSN | Lag | Replay LSN | Lag |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
- | ubuntu11 | 192.168.152.121:9000 | Leader | running | 7 | | | | |
- | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming | 7 | 0/7000410 | 0 | 0/7000410 | 0 |
- | ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 7 | 0/9000000 | 0 | 0/9000000 | 0 |
- +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
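Log from the original primary (ubuntu13). Note the line "replica has been created using basebackup": the damaged node automatically takes a base backup from the new leader, recovers, and rejoins the cluster, as hard to kill as a cockroach. A large database is another matter, of course, because a full base backup then takes much longer.
- 2026-05-22 16:39:25,903 INFO: no action. I am (ubuntu13), the leader with the lock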
- 2026-05-22 16:39:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:39:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:39:55,902 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:15,902 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:25,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:35,947 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:45,904 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:40:55,899 INFO: Lock owner: ubuntu13; I am ubuntu13
- 2026-05-22 16:40:57,570 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd2642bc7c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
- 2026-05-22 16:40:57,573 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:05,902 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:25,905 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:35,944 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:45,903 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:41:55,908 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:05,945 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:15,905 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:25,902 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:42:55,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:43:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:43:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:43:25,904 INFO: no action. I am (ubuntu13), the leader with the lock
- 2026-05-22 16:43:35,906 INFO: Lock owner: ubuntu13; I am ubuntu13
- 2026-05-22 16:43:35,958 INFO: Leader key released
- 2026-05-22 16:43:35,961 INFO: released leader key voluntarily as data dir empty and currently leader
- 2026-05-22 16:43:35,961 INFO: Lock owner: None; I am ubuntu13
- 2026-05-22 16:43:36,003 INFO: waiting for leader to bootstrap
- 2026-05-22 16:43:36,016 INFO: Lock owner: ubuntu11; I am ubuntu13
- 2026-05-22 16:43:36,018 INFO: trying to bootstrap from leader 'ubuntu11'
- 2026-05-22 16:43:36,134 ERROR: Error when fetching backup: pg_basebackup exited with code=1
- 2026-05-22 16:43:36,135 WARNING: Trying again in 5 seconds
- 2026-05-22 16:43:37,069 INFO: Lock owner: ubuntu11; I am ubuntu13
- 2026-05-22 16:43:37,116 INFO: bootstrap from leader 'ubuntu11' in progress
- 2026-05-22 16:43:42,134 INFO: replica has been created using basebackup
- 2026-05-22 16:43:42,135 INFO: bootstrapped from leader 'ubuntu11'
- 2026-05-22 16:43:42,531 INFO: postmaster pid=25110
- 2026-05-22 16:43:43,580 INFO: Lock owner: ubuntu11; I am ubuntu13
- 2026-05-22 16:43:43,580 INFO: establishing a new patroni heartbeat connection to postgres
- 2026-05-22 16:43:43,602 INFO: Local timeline=7 lsn=0/9000000
- 2026-05-22 16:43:43,628 INFO: primary_timeline=7
- 2026-05-22 16:43:43,665 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:43:47,112 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:43:57,127 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:07,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:17,625 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:27,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:37,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:47,627 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:44:57,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:45:07,587 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
- 2026-05-22 16:45:17,619 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
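The log timestamps also show how long the re-clone took on this small test database. Below is a small sketch that measures the time between two log lines; the helper names `log_time` and `seconds_between` are hypothetical, and the code assumes every line starts with the `YYYY-MM-DD HH:MM:SS,mmm` timestamp seen in the log above (without the leading "- " of this listing).

```python
from datetime import datetime

LOG = """\
2026-05-22 16:43:35,958 INFO: Leader key released
2026-05-22 16:43:36,018 INFO: trying to bootstrap from leader 'ubuntu11'
2026-05-22 16:43:36,134 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2026-05-22 16:43:36,135 WARNING: Trying again in 5 seconds
2026-05-22 16:43:42,134 INFO: replica has been created using basebackup
"""


def log_time(line: str) -> datetime:
    """Parse the 23-character timestamp at the start of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(log: str, start_marker: str, end_marker: str) -> float:
    """Seconds from the first line containing start_marker to the next line
    containing end_marker."""
    start = None
    for line in log.splitlines():
        if start is None:
            if start_marker in line:
                start = log_time(line)
        elif end_marker in line:
            return (log_time(line) - start).total_seconds()
    raise ValueError("start or end marker not found in the log")


if __name__ == "__main__":
    elapsed = seconds_between(LOG, "Leader key released",
                              "replica has been created using basebackup")
    print(f"{elapsed:.3f} seconds from releasing the leader key to a finished base backup")
```

In this run the interval includes one failed `pg_basebackup` attempt and the 5-second retry delay, so most of the elapsed time is waiting rather than copying data.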
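5, Summary
This article used realistic failures to stress-test the high availability of a Patroni failover cluster. Patroni handled each of them and kept the cluster available. The tests also verified in practice the roles that the ttl and loop_wait parameters play during failover, and gave the author a deeper understanding of both.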
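As a companion to the ttl and loop_wait discussion, the Patroni documentation gives a rule that ties these parameters together with retry_timeout: loop_wait + 2 * retry_timeout <= ttl. A minimal sketch of that check follows. The function name is hypothetical, and the values 30/10/10 are Patroni's documented defaults, assumed here because only loop_wait = 10 is stated for this cluster.

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the documented rule: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


if __name__ == "__main__":
    # Assumed defaults: ttl=30, loop_wait=10, retry_timeout=10.
    print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))
    # Lowering ttl alone breaks the rule.
    print(timing_is_consistent(ttl=20, loop_wait=10, retry_timeout=10))
```

Checking the rule before changing any of the three values helps avoid a leader lock that expires before the leader has had a fair chance to renew it.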