Posted by user 云卷云舒 on 2026-5-22 15:59:35

PostgreSQL high-availability cluster: Patroni automatic failover testing

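There are many articles about Patroni online, but the vast majority only walk through a manual installation and stop once a Patroni environment is up; the same is true of the various WeChat groups. They say nothing about how the Patroni parameters are used, how failover works, or how to exercise it in practice. This article uses the procedure from "Automated Patroni installation on Ubuntu 20: build a Patroni cluster in one minute" to bring up a cluster quickly, then tests and verifies failover hands-on. It simulates real failures and uses the failover logs to analyze how failover is carried out and what effect it has.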
 

0, Patroni cluster state

ubuntu11 is the primary; ubuntu12 and ubuntu13 are replicas. Each test below starts from this architecture: ubuntu11 as primary, ubuntu12 and ubuntu13 as replicas.
root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  6 |             |     |            |     |
| ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
| ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
For the purposes of this test, automatic restart is disabled in the Patroni systemd service unit (Restart=no), so a stopped Patroni stays stopped:
postgres@ubuntu11:~$ cat /etc/systemd/system/patroni.service
[Unit]
Description=Patroni
After=network.target etcd.service
Wants=etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
Environment="TZ=Asia/Shanghai"
Environment="PYTHONUNBUFFERED=1"

ExecStart=/usr/local/bin/patroni /usr/local/pgsql17/patroni/patroni.yml
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -TERM $MAINPID

#Restart=on-failure
Restart=no
RestartSec=10
TimeoutStartSec=120
TimeoutStopSec=60

LimitNOFILE=65536

StandardOutput=null
StandardError=journal
SyslogIdentifier=patroni

[Install]
WantedBy=multi-user.target
postgres@ubuntu11:~$
Log on the primary ubuntu11. The cluster state is polled every 10 seconds; the polling interval is set by the loop_wait parameter.
2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:47:17,145 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:47:27,189 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-21 08:47:37,153 INFO: no action. I am (ubuntu11), the leader with the lock
Log on the replica ubuntu12
2026-05-21 08:46:47,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:46:57,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:07,671 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:17,632 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:27,675 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:37,139 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:47,233 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
Log on the replica ubuntu13
2026-05-21 08:46:57,643 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:07,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:17,647 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:27,688 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:37,155 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:47,255 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-21 08:47:57,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11) 
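The 10-second rhythm of the heartbeat lines can be checked directly from the timestamps. The following is a minimal sketch, not part of Patroni: the helper names `parse_timestamp` and `heartbeat_intervals` are made up for illustration, and the sample lines are copied from the ubuntu11 log above.

```python
from datetime import datetime

# Sample heartbeat lines copied from the ubuntu11 log above.
LOG_LINES = [
    "2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def heartbeat_intervals(lines):
    """Return the gaps, in seconds, between consecutive log lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(heartbeat_intervals(LOG_LINES))  # [10.003, 10.045, 9.958]
```

The gaps stay within a few tens of milliseconds of 10 seconds, which matches a loop_wait of 10.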
1, Automatic failover scenario 1: the primary's OS is healthy, the Patroni service fails

With the primary in a healthy state, stop the Patroni service on it to simulate a primary failure.
root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  6 |             |     |            |     |
| ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
| ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
root@ubuntu11:/usr/local/patroni_install# systemctl stop patroni
root@ubuntu11:/usr/local/patroni_install#
Cluster state as seen from the replica ubuntu12; the original primary is now stopped:
root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Replica      | stopped   |    |     unknown |     |    unknown |     |
| ubuntu12 | 192.168.152.122:9000 | Leader       | running   |  7 |             |     |            |     |
| ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming |  7 |  0/100001A8 |   0 | 0/100001A8 |   0 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
The former replica ubuntu12 has become the new primary. Its log:
......
2026-05-21 08:55:27,680 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:55:37,723 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:55:47,683 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:55:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-21 08:56:06,109 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
2026-05-21 08:56:06,169 INFO: promoted self to leader by acquiring session lock
2026-05-21 08:56:06,169 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-21 08:56:06,172 INFO: updated leader lock during promote
server promoting
2026-05-21 08:56:07,185 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-21 08:56:07,195 INFO: Assigning synchronous standby status to ['ubuntu13']
server signaled
2026-05-21 08:56:09,324 INFO: Synchronous standby status assigned to ['ubuntu13']
2026-05-21 08:56:09,369 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-21 08:56:17,196 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-21 08:56:27,187 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-21 08:56:37,242 INFO: no action. I am (ubuntu12), the leader with the lock
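......
The failover flow in this scenario:
Stop Patroni on the primary ubuntu11 by hand to simulate a failure -> Patroni on ubuntu11 deletes the leader key in the DCS as it shuts down -> the replica ubuntu12 detects, no later than its next loop_wait cycle, that the DCS has no leader -> ubuntu12 acquires the lock and becomes Leader -> ubuntu12 promotes its local PostgreSQL to primary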
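The role of the leader key in this flow can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: `ToyLeaderKey` is a hypothetical class, not Patroni or etcd code, and time is passed in as plain numbers of seconds instead of being read from a clock.

```python
class ToyLeaderKey:
    """In-memory stand-in for a DCS leader key with a TTL (hypothetical)."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def holder(self, now):
        """Return the current owner, or None if the key is absent or expired."""
        if self.owner is not None and now < self.expires_at:
            return self.owner
        return None

    def acquire(self, node, ttl, now):
        """Create the key only if nobody currently holds it."""
        if self.holder(now) is None:
            self.owner, self.expires_at = node, now + ttl
            return True
        return False

    def release(self, node, now):
        """Clean shutdown: the owner deletes its own key."""
        if self.holder(now) == node:
            self.owner = None
            return True
        return False


# Clean stop (this scenario): the key is deleted, so takeover needs no TTL wait.
key = ToyLeaderKey()
key.acquire("ubuntu11", ttl=30, now=0)
blocked = key.acquire("ubuntu12", ttl=30, now=5)      # False: key is held
key.release("ubuntu11", now=8)                        # like "systemctl stop patroni"
took_over = key.acquire("ubuntu12", ttl=30, now=8.1)  # True

# For contrast, an unclean stop: nobody deletes the key, so it must expire first.
crashed = ToyLeaderKey()
crashed.acquire("ubuntu11", ttl=30, now=0)
early = crashed.acquire("ubuntu12", ttl=30, now=8.1)  # False: lease still valid
late = crashed.acquire("ubuntu12", ttl=30, now=30)    # True: lease expired

print(blocked, took_over, early, late)  # False True False True
```

The point of the model is the difference between the two cases: a clean stop frees the key at once, while an unclean stop leaves it in place until the TTL runs out.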
2, Automatic failover scenario 2: the primary server loses power

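ubuntu11 is powered off with the VM's "Power Off" action (not "Shut Down Guest") to simulate a sudden loss of power. Understanding this scenario requires a solid grasp of the lease lifetime, that is, the ttl parameter (default 30 seconds).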
https://img2024.cnblogs.com/blog/380271/202605/380271-20260522125733266-193642037.png
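Cluster state before and after the failover to the new primary ubuntu12 (as the prompt shows, the commands were run on ubuntu13)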
root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642212398676862997) ----+----+-------------+-----+------------+-----+------------------------+
| Member   | Host                 | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag | Tags                   |
+----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
| ubuntu11 | 192.168.152.121:9000 | Leader  | running   | 10 |             |     |            |     | failover_priority: 100 |
| ubuntu12 | 192.168.152.122:9000 | Replica | streaming | 10 |   0/C000000 |   0 |  0/C000358 |   0 | failover_priority: 80  |
| ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 10 |   0/C000380 |   0 |  0/C000380 |   0 | failover_priority: 60  |
+----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
root@ubuntu13:/usr/local/patroni_install#
root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642212398676862997) ---------+----+-------------+-----+------------+-----+-----------------------+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag | Tags                  |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
| ubuntu12 | 192.168.152.122:9000 | Leader       | running   | 11 |             |     |            |     | failover_priority: 80 |
| ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 11 |   0/C000688 |   0 |  0/C000688 |   0 | failover_priority: 60 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
root@ubuntu13:/usr/local/patroni_install#
Patroni log on the new primary ubuntu12
2026-05-22 13:53:59,956 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:54:10,451 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:54:20,026 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:54:30,458 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:54:40,461 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:54:50,499 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:55:00,456 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 13:55:10,457 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11) ###### power to ubuntu11 was cut at about this point
2026-05-22 13:55:20,498 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11) ###### why is ubuntu11 still reported as healthy here? Because the lease held by ubuntu11 has not yet expired
2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
2026-05-22 13:55:32,114 INFO: promoted self to leader by acquiring session lock
2026-05-22 13:55:32,114 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-22 13:55:32,115 INFO: updated leader lock during promote
2026-05-22 13:55:33,137 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-22 13:55:33,193 INFO: Assigning synchronous standby status to ['ubuntu13']
2026-05-22 13:55:35,316 INFO: Synchronous standby status assigned to ['ubuntu13']
2026-05-22 13:55:35,322 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-22 13:55:35,377 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-22 13:55:45,324 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-22 13:55:55,367 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-22 13:56:05,329 INFO: no action. I am (ubuntu12), the leader with the lock
Checking the role of the new primary through psql
postgres=#
postgres=#
postgres=# select now(),pg_is_in_recovery();            ########################### power to the original primary ubuntu11 is cut here; the query is then repeated continuously
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:10.849473+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:11.665724+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
             now            | pg_is_in_recovery
------------------------------+-------------------
2026-05-22 13:55:12.32947+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:13.017149+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:13.799962+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:14.902866+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:15.672331+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:16.435662+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:17.070935+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:17.816528+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:18.546785+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:19.393943+08 | t
(1 row)

#......middle part omitted......

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:29.759037+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:30.417626+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:31.089604+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:31.775459+08 | t
(1 row)

postgres=# select now(),pg_is_in_recovery();            ########################### only after about 22 seconds is the new primary actually promoted
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:32.400935+08 | f
(1 row)

postgres=# select now(),pg_is_in_recovery();
             now            | pg_is_in_recovery
------------------------------+-------------------
2026-05-22 13:55:33.27183+08 | f
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:33.950342+08 | f
(1 row)

postgres=# select now(),pg_is_in_recovery();
            now            | pg_is_in_recovery
-------------------------------+-------------------
2026-05-22 13:55:34.758651+08 | f
(1 row)

postgres=#
Putting the logs above together, the meaning of ttl becomes clear when the events are laid out on a timeline:
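1, 2026-05-22 13:55:10,457: as noted above, power to the original primary ubuntu11 was cut at about this time.
2, 2026-05-22 13:55:20,498: the Patroni log on ubuntu12 still reports ubuntu11 as healthy.
3, 2026-05-22 13:55:32.400935: querying pg_is_in_recovery on the new primary shows that it has only now changed to f, that is, the failover has succeeded.
Do the logs contradict what was actually done? ubuntu11 clearly lost power at 13:55:10, so why did the check at 13:55:20 still report it as healthy, and why did the new primary only start working at 13:55:32? Is this a contradiction?
It is not. Power was cut at about 13:55:10, and a few seconds before that (at most one loop_wait earlier; loop_wait defaults to 10 seconds) Patroni on ubuntu11 had renewed the leader key in etcd. Each renewal pushes the expiry 30 seconds (ttl) into the future, so the lease had not yet expired and would only do so at around 13:55:30 or later. At 13:55:20, therefore, Patroni on the replica that was about to take over still saw a valid leader key.
Only in the next check cycle, at about 13:55:30, did the check find the original primary to be abnormal: "2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))". Why is this line stamped 13:55:32, 2 seconds after 13:55:30? Because of connect timeout=2.
That is what the ttl parameter in Patroni really means.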
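The timeline reasoning can be reproduced with a little date arithmetic. The sketch below is illustrative only: the function name `lease_expiry_window` is made up, 13:55:10 is the approximate power-off time taken from the notes above, and ttl and loop_wait are set to the defaults of 30 and 10 seconds mentioned in the text.

```python
from datetime import datetime, timedelta

TTL = timedelta(seconds=30)        # ttl default quoted in the text
LOOP_WAIT = timedelta(seconds=10)  # loop_wait default quoted in the text


def lease_expiry_window(power_off, ttl=TTL, loop_wait=LOOP_WAIT):
    """Earliest and latest lease expiry if the leader dies at power_off.

    The last renewal happened at some point within the loop_wait before
    the crash, and each renewal keeps the leader key alive for ttl.
    """
    earliest = power_off - loop_wait + ttl
    latest = power_off + ttl
    return earliest, latest


power_off = datetime(2026, 5, 22, 13, 55, 10)          # approximate, from the notes
promoted = datetime(2026, 5, 22, 13, 55, 32, 114000)   # "promoted self to leader" line

earliest, latest = lease_expiry_window(power_off)
print(earliest.time(), latest.time())          # 13:55:30 13:55:40
print((promoted - power_off).total_seconds())  # 22.114
```

The observed promotion at 13:55:32 falls inside the computed window, after allowing the 2-second connect timeout that precedes it in the log.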
 
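The failover flow in this scenario:
Cut power to ubuntu11 to simulate a primary failure -> 10 seconds later the leader lease of ubuntu11 is still valid (although ubuntu11 is already down) -> 10 seconds later the leader lease of ubuntu11 is still valid (although ubuntu11 is already down) -> 10 seconds later ubuntu12 detects that the leader key has expired -> ubuntu12 takes the leader key and promotes its local PostgreSQL to primary
So to make Patroni failover more responsive, reduce ttl, which shortens the lease on the leader key, and also reduce loop_wait, so that the leader key is checked more often. Both speed up failure detection and failover. Be aware, though, that lowering these two parameters may cause unexpected failovers when the network jitters.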
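The trade-off can be made concrete with a small helper. This is a rough sketch, not Patroni code: `check_timing` is a hypothetical function, the bounds it returns ignore REST API and connect timeouts, and the consistency rule `loop_wait + 2 * retry_timeout <= ttl` is the guideline given in the Patroni documentation as far as I recall it, so verify it against the documentation for your version. retry_timeout is a third Patroni setting that the text above does not discuss.

```python
def check_timing(ttl, loop_wait, retry_timeout):
    """Return (ok, best_case, worst_case) for an unclean leader failure.

    ok         -- whether loop_wait + 2 * retry_timeout <= ttl holds
    best_case  -- seconds until the leader key expires if the leader died
                  just before its next renewal (ttl - loop_wait)
    worst_case -- seconds until expiry if it died right after renewing (ttl)
    """
    ok = loop_wait + 2 * retry_timeout <= ttl
    return ok, ttl - loop_wait, ttl


print(check_timing(ttl=30, loop_wait=10, retry_timeout=10))  # (True, 20, 30)
print(check_timing(ttl=20, loop_wait=5, retry_timeout=5))    # (True, 15, 20)
print(check_timing(ttl=20, loop_wait=10, retry_timeout=10))  # (False, 10, 20)
```

The second call uses hypothetical tighter values: the leader key expires sooner, but there is also less slack for a slow or briefly unreachable DCS before a healthy leader loses its lock.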
3, Automatic failover scenario 3: the primary is cut off by a network partition

Use iptables on both replicas to drop all traffic to and from the primary, 192.168.152.121 (an OUTPUT rule with -d and an INPUT rule with -s).
Replica 1
root@ubuntu12:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP
root@ubuntu12:/usr/local/patroni_install# sudo iptables -A INPUT -s 192.168.152.121 -j DROP
root@ubuntu12:/usr/local/patroni_install#
Replica 2
root@ubuntu13:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP
root@ubuntu13:/usr/local/patroni_install# sudo iptables -A INPUT -s 192.168.152.121 -j DROP
root@ubuntu13:/usr/local/patroni_install#
The network partition is now in place.
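To end the test, the same rules have to be removed again. The sketch below only builds the command strings and prints them; it does not run iptables. The helper name `partition_rules` is made up for illustration. It relies on the fact that `iptables -D` followed by the same rule specification deletes a rule that was added with `-A`.

```python
def partition_rules(peer_ip, action="-A"):
    """Build the iptables commands that block (-A) or unblock (-D) one peer."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A (append) or -D (delete)")
    return [
        f"iptables {action} OUTPUT -d {peer_ip} -j DROP",
        f"iptables {action} INPUT -s {peer_ip} -j DROP",
    ]


PRIMARY_IP = "192.168.152.121"  # ubuntu11, as in the commands above

if __name__ == "__main__":
    # First the rules that create the partition, then the ones that undo it.
    for cmd in partition_rules(PRIMARY_IP, "-A") + partition_rules(PRIMARY_IP, "-D"):
        print(cmd)
```

Run the printed `-D` commands as root on each replica to restore connectivity to the primary.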
1, On the new primary: ubuntu12 has successfully taken over as primary
2026-05-22 14:47:38,980 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 14:47:49,402 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 14:47:58,941 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 14:48:09,491 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 14:48:19,441 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
2026-05-22 14:48:31,104 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f8988fdb1f0>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
2026-05-22 14:48:31,183 INFO: promoted self to leader by acquiring session lock
2026-05-22 14:48:31,187 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-22 14:48:31,239 INFO: updated leader lock during promote
2026-05-22 14:48:32,206 INFO: Lock owner: ubuntu12; I am ubuntu12
2026-05-22 14:48:32,214 INFO: Assigning synchronous standby status to ['ubuntu13']
2026-05-22 14:48:34,337 INFO: Synchronous standby status assigned to ['ubuntu13']
2026-05-22 14:48:34,385 INFO: no action. I am (ubuntu12), the leader with the lock
2026-05-22 14:48:42,245 INFO: no action. I am (ubuntu12), the leader with the lock
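2026-05-22 14:48:52,256 INFO: no action. I am (ubuntu12), the leader with the lock
Note that failover under a network partition behaves like the power-loss case above. The new primary's log shows only a little over 10 seconds between detecting the problem and completing the failover, but the original primary's last renewal of the leader key had extended it by 30 seconds (ttl). So during the new primary's first two checks after the partition formed, the partition already existed but the new primary had not yet taken over; it could do so only once the lease on the original primary's leader key expired. This is the same as in the previous scenario. It was tested in detail and is not repeated here.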
2, On the original primary:
At this point the original primary can no longer connect to ubuntu12 and ubuntu13. Note these log lines:
2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
2026-05-22 14:48:17,258 INFO: Demoting self (offline)
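One way to read these lines is through etcd's majority rule. The sketch below is a simplification and assumes a three-member etcd cluster with one member per node, which is what the port 2379 endpoints on all three hosts in the full log suggest. `has_quorum` is a hypothetical helper, not an etcd or Patroni API.

```python
def has_quorum(reachable_members, cluster_size):
    """A Raft-based store such as etcd needs a strict majority of members."""
    return reachable_members > cluster_size // 2


CLUSTER_SIZE = 3  # assumption: one etcd member each on ubuntu11, ubuntu12, ubuntu13

print(has_quorum(1, CLUSTER_SIZE))  # False: ubuntu11 alone on its side
print(has_quorum(2, CLUSTER_SIZE))  # True: ubuntu12 and ubuntu13 together
```

Under that assumption the isolated side cannot renew its leader key, while the side with two members can still elect and record a new leader.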
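After the partition, the original primary automatically demotes itself to a read-only state, so dual primaries (split brain) do not occur. It also keeps trying to connect to the etcd members on ubuntu12 and ubuntu13 (the log keeps growing; only part of it is pasted here), so that it can rejoin the cluster automatically once the network recovers.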
2026-05-22 14:47:38,896 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-22 14:47:38,963 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-22 14:47:48,912 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-22 14:47:58,948 INFO: no action. I am (ubuntu11), the leader with the lock
2026-05-22 14:48:08,903 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 14:48:12,244 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3332171243333355)")
2026-05-22 14:48:12,244 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:48:12,244 INFO: Retrying on http://192.168.152.123:2379
2026-05-22 14:48:13,913 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2b0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:48:13,913 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:48:13,913 INFO: Retrying on http://192.168.152.122:2379
2026-05-22 14:48:15,583 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2e0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:48:15,583 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:48:17,253 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b520>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:48:17,256 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
2026-05-22 14:48:17,258 INFO: Demoting self (offline)
2026-05-22 14:48:18,932 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b970>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:00,355 INFO: postmaster pid=3525
2026-05-22 14:49:01,400 INFO: demoted self because DCS is not accessible and I was a leader
2026-05-22 14:49:01,403 WARNING: Loop time exceeded, rescheduling immediately.
2026-05-22 14:49:01,405 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 14:49:01,405 INFO: establishing a new patroni heartbeat connection to postgres
2026-05-22 14:49:04,749 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33254870033331)")
2026-05-22 14:49:04,749 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:04,749 INFO: Retrying on http://192.168.152.123:2379
2026-05-22 14:49:06,419 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04bfa0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:06,419 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:06,419 INFO: Retrying on http://192.168.152.122:2379
2026-05-22 14:49:08,089 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e451c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:08,089 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:09,758 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53fa90>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:11,417 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.00350682653891)")
2026-05-22 14:49:11,417 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:13,086 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04ba60>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:13,088 ERROR: Error communicating with DCS
2026-05-22 14:49:13,088 INFO: DCS is not accessible
2026-05-22 14:49:13,088 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 14:49:13,090 WARNING: Loop time exceeded, rescheduling immediately.
2026-05-22 14:49:14,757 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b820>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:14,763 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 14:49:18,103 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3331819403333234)")
2026-05-22 14:49:18,103 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:18,103 INFO: Retrying on http://192.168.152.122:2379
2026-05-22 14:49:19,773 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e450d0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:19,773 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:19,773 INFO: Retrying on http://192.168.152.123:2379
2026-05-22 14:49:21,441 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45370>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:21,442 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:23,112 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45670>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:23,114 ERROR: Error communicating with DCS
2026-05-22 14:49:23,114 INFO: DCS is not accessible
2026-05-22 14:49:23,114 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 14:49:23,115 WARNING: Loop time exceeded, rescheduling immediately.
2026-05-22 14:49:24,784 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45b50>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:24,790 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 14:49:28,128 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.332673626333379)")
2026-05-22 14:49:28,128 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:28,128 INFO: Retrying on http://192.168.152.123:2379
2026-05-22 14:49:29,799 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c220>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:29,799 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:29,799 INFO: Retrying on http://192.168.152.122:2379
2026-05-22 14:49:31,469 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c41f460>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:31,469 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:33,138 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53f9a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:34,794 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.1904628130222932)")
2026-05-22 14:49:34,794 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:36,464 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b0a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:36,468 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 14:49:36,468 ERROR: Error communicating with DCS
2026-05-22 14:49:36,468 INFO: DCS is not accessible
2026-05-22 14:49:36,470 WARNING: Loop time exceeded, rescheduling immediately.
2026-05-22 14:49:38,140 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b8b0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:38,145 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 14:49:41,485 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33319548833335)")
2026-05-22 14:49:41,485 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:41,485 INFO: Retrying on http://192.168.152.122:2379
2026-05-22 14:49:43,153 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45fa0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:43,153 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:43,153 INFO: Retrying on http://192.168.152.123:2379
2026-05-22 14:49:44,823 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e459d0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:44,823 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:46,493 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45ac0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:48,150 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.4432267216546961)")
2026-05-22 14:49:48,150 INFO: Reconnection allowed, looking for another server.
2026-05-22 14:49:49,819 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45100>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 14:49:49,821 ERROR: Error communicating with DCS
2026-05-22 14:49:49,821 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 14:49:49,821 INFO: DCS is not accessible

Clear the network partition on Ubuntu 12:
root@ubuntu12:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
root@ubuntu12:/usr/local/patroni_install# sudo iptables -D INPUT -s 192.168.152.121 -j DROP

Clear the network partition on Ubuntu 13 as well:
root@ubuntu13:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
root@ubuntu13:/usr/local/patroni_install# sudo iptables -D INPUT -s 192.168.152.121 -j DROP
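The cleanup commands above come in pairs: one rule for outbound traffic to the isolated node and one for inbound traffic from it. `iptables -D` only removes a rule whose specification matches the one that was added, so generating both from one place avoids typos. A minimal sketch; `partition_rules` is a hypothetical helper, and the use of `-A` for adding the rules is an assumption, since this part of the post only shows the `-D` side:

```python
from typing import List


def partition_rules(peer_ip: str, action: str = "-D") -> List[str]:
    """Build the iptables commands that append (-A) or delete (-D)
    the DROP rules for traffic to and from peer_ip."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A (append) or -D (delete)")
    return [
        # Drop packets this host sends to the peer.
        f"iptables {action} OUTPUT -d {peer_ip} -j DROP",
        # Drop packets this host receives from the peer.
        f"iptables {action} INPUT -s {peer_ip} -j DROP",
    ]


if __name__ == "__main__":
    # Print the cleanup commands for the isolated node used in this test.
    for command in partition_rules("192.168.152.121"):
        print("sudo " + command)
```

Run on both ubuntu12 and ubuntu13, the printed commands are the same four lines used above to clear the partition.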
Once the partition is cleared, the previously isolated Ubuntu11 automatically rejoins the cluster as a replica:
root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Replica      | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
| ubuntu12 | 192.168.152.122:9000 | Leader       | running   |  5 |             |     |            |     |
| ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
root@ubuntu12:/usr/local/patroni_install#
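The Ubuntu11 log that follows contains everything needed to see why a rewind is required: the node reports `Local timeline=4 lsn=0/70000A0`, while the new primary is on timeline 5 and its history says timeline 4 ended at `0/6004148`. The old primary had therefore written WAL past the point where the new primary branched off. A minimal sketch of that comparison, with the values copied from the log; `parse_lsn` and `has_diverged` are hypothetical helper names, and this is a simplification rather than Patroni's actual rewind check:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/6004148' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Timeline history reported by the new primary in the log:
# timeline number -> LSN at which that timeline ended.
PRIMARY_HISTORY = {1: "0/504F580", 2: "0/6003D20", 3: "0/6003EC0", 4: "0/6004148"}


def has_diverged(local_timeline: int, local_lsn: str,
                 primary_timeline: int, history: dict) -> bool:
    """Return True if the local node wrote WAL beyond the point where
    the primary left the local node's timeline."""
    if local_timeline > primary_timeline:
        raise ValueError("local timeline is ahead of the primary; not handled here")
    if local_timeline == primary_timeline:
        # Same timeline: this sketch treats the node as a plain follower.
        return False
    switch_point = parse_lsn(history[local_timeline])
    return parse_lsn(local_lsn) > switch_point


if __name__ == "__main__":
    # Before the rewind: timeline 4, LSN past the switch point.
    print(has_diverged(4, "0/70000A0", 5, PRIMARY_HISTORY))
    # After the rewind: timeline 5, same LSN as the other replica.
    print(has_diverged(5, "0/60043F0", 5, PRIMARY_HISTORY))
```

With the values from the log the first call reports divergence, which agrees with pg_rewind's own message `servers diverged at WAL location 0/6004148 on timeline 4`.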
Log on Ubuntu11: it automatically runs pg_rewind and then joins the cluster as a replica.
2026-05-22 15:02:40,971 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c4c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 15:02:42,626 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.3500901345867078)")
2026-05-22 15:02:42,626 INFO: Reconnection allowed, looking for another server.
2026-05-22 15:02:44,295 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5cb80>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 15:02:44,296 ERROR: Error communicating with DCS
2026-05-22 15:02:44,297 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2026-05-22 15:02:44,297 INFO: DCS is not accessible
2026-05-22 15:02:44,298 WARNING: Loop time exceeded, rescheduling immediately.
2026-05-22 15:02:45,967 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b3a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 15:02:45,975 INFO: Lock owner: ubuntu11; I am ubuntu11
2026-05-22 15:02:47,985 ERROR: failed to update leader lock
2026-05-22 15:02:47,994 INFO: not promoting because failed to update leader lock in DCS
2026-05-22 15:02:47,994 INFO: Lock owner: ubuntu12; I am ubuntu11
2026-05-22 15:02:48,001 INFO: Local timeline=4 lsn=0/70000A0
2026-05-22 15:02:48,027 INFO: primary_timeline=5
2026-05-22 15:02:48,030 INFO: primary: history=1        0/504F580        no recovery target specified
2        0/6003D20        no recovery target specified
3        0/6003EC0        no recovery target specified
4        0/6004148        no recovery target specified
2026-05-22 15:02:48,049 INFO: running pg_rewind from ubuntu12
2026-05-22 15:02:49,312 INFO: running pg_rewind from dbname=postgres user=rewind_user host=192.168.152.122 port=9000 target_session_attrs=read-write
2026-05-22 15:02:50,305 INFO: pg_rewind exit code=0
2026-05-22 15:02:50,305 INFO:stdout=
2026-05-22 15:02:50,305 INFO:stderr=pg_rewind: servers diverged at WAL location 0/6004148 on timeline 4
pg_rewind: rewinding from last common checkpoint at 0/6004038 on timeline 4
pg_rewind: Done!

2026-05-22 15:02:50,307 WARNING: Postgresql is not running.
2026-05-22 15:02:50,308 INFO: Lock owner: ubuntu12; I am ubuntu11
2026-05-22 15:02:50,319 INFO: pg_controldata:
pg_control version number: 1700
Catalog version number: 202406281
Database system identifier: 7642589780522937440
Database cluster state: in archive recovery
pg_control last modified: Fri May 22 15:02:50 2026
Latest checkpoint location: 0/6004340
Latest checkpoint's REDO location: 0/60042E8
Latest checkpoint's REDO WAL file: 000000050000000000000006
Latest checkpoint's TimeLineID: 5
Latest checkpoint's PrevTimeLineID: 5
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:762
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 731
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 762
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Fri May 22 14:53:31 2026
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/60043F0
Min recovery ending loc's timeline: 5
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 1
Mock authentication nonce: 3587dd0ff212f7ed05a16aa24aa1d6a6f187f55d5d6a2e158ce45327a7e55005

2026-05-22 15:02:50,320 INFO: Lock owner: ubuntu12; I am ubuntu11
2026-05-22 15:02:50,367 INFO: starting as a secondary
2026-05-22 15:02:50,368 INFO: closed patroni connections to postgres
2026-05-22 15:02:50,738 INFO: postmaster pid=3952
2026-05-22 15:02:51,774 INFO: Lock owner: ubuntu12; I am ubuntu11
2026-05-22 15:02:51,774 INFO: establishing a new patroni heartbeat connection to postgres
2026-05-22 15:02:51,795 INFO: Local timeline=5 lsn=0/60043F0
2026-05-22 15:02:51,803 INFO: primary_timeline=5
2026-05-22 15:02:51,812 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
2026-05-22 15:02:52,281 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
2026-05-22 15:03:02,819 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
2026-05-22 15:03:12,778 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
2026-05-22 15:03:22,777 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)

4, Automatic failover scenario 4: forcibly deleting the primary's PostgreSQL data files

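This situation almost never occurs in practice unless someone does it deliberately. So what happens if the PostgreSQL data files of a running primary are forcibly deleted?

The test below shows that after the data files are forcibly deleted: 1, the patroni cluster fails over automatically (because the primary can no longer serve requests); 2, the original primary automatically clones a copy of the data from the cluster and runs as a replica.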
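The two outcomes described above can be read as a short sequence of decisions taken by the node whose data directory disappeared. The sketch below is a toy model of that sequence, written only from the messages the original primary logs in this test; `next_step` and its return strings are hypothetical and do not reproduce Patroni's implementation:

```python
from typing import Optional


def next_step(data_dir_empty: bool, node: str, lock_owner: Optional[str]) -> str:
    """Toy model of what a node with an empty data directory does next,
    given who currently holds the leader lock."""
    if not data_dir_empty:
        return "continue normal HA loop"
    if lock_owner == node:
        # A leader without data cannot serve, so it gives up the lock.
        return "release leader key"
    if lock_owner is None:
        # Nobody to copy from yet: wait until another node is promoted.
        return "wait for a leader"
    # Another node is leader: rebuild the local data directory from it.
    return f"clone from leader '{lock_owner}' with pg_basebackup"


if __name__ == "__main__":
    # Lock owner as seen by ubuntu13 over three consecutive checks.
    for owner in ("ubuntu13", None, "ubuntu11"):
        print(next_step(True, "ubuntu13", owner))
```

The three printed steps line up with the log lines `released leader key voluntarily as data dir empty and currently leader`, `waiting for leader to bootstrap`, and `trying to bootstrap from leader 'ubuntu11'` in the log of the original primary.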
###### Cluster in its normal state
root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Sync Standby | streaming |  6 |   0/70002D0 |   0 |  0/70002D0 |   0 |
| ubuntu12 | 192.168.152.122:9000 | Replica      | streaming |  6 |   0/70002D0 |   0 |  0/70002D0 |   0 |
| ubuntu13 | 192.168.152.123:9000 | Leader       | running   |  6 |             |     |            |     |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
root@ubuntu11:~#
root@ubuntu11:~#
root@ubuntu11:~#
root@ubuntu11:~#
###### After forcibly deleting the original primary's data files
root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642589780522937440) ----------------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State            | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Leader       | running          |  7 |             |     |            |     |
| ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming        |  7 |   0/7000410 |   0 |  0/7000410 |   0 |
| ubuntu13 | 192.168.152.123:9000 | Replica      | creating replica |    |     unknown |     |    unknown |     |
+----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
root@ubuntu11:~#
###### Cluster back to normal
root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
+ Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  7 |             |     |            |     |
| ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  7 |   0/7000410 |   0 |  0/7000410 |   0 |
| ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  7 |   0/9000000 |   0 |  0/9000000 |   0 |
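+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+

Log on the original primary. Note the line "replica has been created using basebackup": the damaged node automatically takes a base backup from the cluster, recovers on its own, and rejoins the cluster, as tough as a cockroach that cannot be killed. With a large database it is a different story, of course, because the base backup has to copy the whole data directory.
2026-05-22 16:39:25,903 INFO: no action. I am (ubuntu13), the leader with the lock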
2026-05-22 16:39:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:39:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:39:55,902 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:15,902 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:25,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:35,947 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:45,904 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:40:55,899 INFO: Lock owner: ubuntu13; I am ubuntu13
2026-05-22 16:40:57,570 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd2642bc7c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
2026-05-22 16:40:57,573 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:05,902 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:25,905 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:35,944 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:45,903 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:41:55,908 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:05,945 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:15,905 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:25,902 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:42:55,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:43:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:43:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:43:25,904 INFO: no action. I am (ubuntu13), the leader with the lock
2026-05-22 16:43:35,906 INFO: Lock owner: ubuntu13; I am ubuntu13
2026-05-22 16:43:35,958 INFO: Leader key released
2026-05-22 16:43:35,961 INFO: released leader key voluntarily as data dir empty and currently leader
2026-05-22 16:43:35,961 INFO: Lock owner: None; I am ubuntu13
2026-05-22 16:43:36,003 INFO: waiting for leader to bootstrap
2026-05-22 16:43:36,016 INFO: Lock owner: ubuntu11; I am ubuntu13
2026-05-22 16:43:36,018 INFO: trying to bootstrap from leader 'ubuntu11'
2026-05-22 16:43:36,134 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2026-05-22 16:43:36,135 WARNING: Trying again in 5 seconds
2026-05-22 16:43:37,069 INFO: Lock owner: ubuntu11; I am ubuntu13
2026-05-22 16:43:37,116 INFO: bootstrap from leader 'ubuntu11' in progress
2026-05-22 16:43:42,134 INFO: replica has been created using basebackup
2026-05-22 16:43:42,135 INFO: bootstrapped from leader 'ubuntu11'
2026-05-22 16:43:42,531 INFO: postmaster pid=25110
2026-05-22 16:43:43,580 INFO: Lock owner: ubuntu11; I am ubuntu13
2026-05-22 16:43:43,580 INFO: establishing a new patroni heartbeat connection to postgres
2026-05-22 16:43:43,602 INFO: Local timeline=7 lsn=0/9000000
2026-05-22 16:43:43,628 INFO: primary_timeline=7
2026-05-22 16:43:43,665 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:43:47,112 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:43:57,127 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:07,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:17,625 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:27,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:37,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:47,627 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:44:57,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:45:07,587 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
2026-05-22 16:45:17,619 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11) 
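
How long the rebuild took can be measured from the timestamps in the log above. A small sketch that parses the Patroni log timestamp format and computes the intervals; the three lines are copied from the log, while `log_time` and `seconds_between` are hypothetical helper names:

```python
from datetime import datetime

# Patroni log lines start with 'YYYY-MM-DD HH:MM:SS,mmm' (23 characters).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_time(line: str) -> datetime:
    """Parse the leading timestamp of a Patroni log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def seconds_between(start_line: str, end_line: str) -> float:
    """Elapsed seconds between two log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


RELEASED = "2026-05-22 16:43:35,958 INFO: Leader key released"
CLONED = "2026-05-22 16:43:42,134 INFO: replica has been created using basebackup"
FOLLOWING = ("2026-05-22 16:43:43,665 INFO: no action. I am (ubuntu13), "
             "a secondary, and following a leader (ubuntu11)")

if __name__ == "__main__":
    print(f"key released -> replica cloned:  {seconds_between(RELEASED, CLONED):.3f} s")
    print(f"key released -> following leader: {seconds_between(RELEASED, FOLLOWING):.3f} s")
```

For this test the two intervals come to about 6.2 and 7.7 seconds. The log also shows that the first pg_basebackup attempt failed and was retried after 5 seconds, which appears to account for most of that time. The measurement starts at the release of the leader key, not at the moment the files were deleted, which the log does not record.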

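5, Summary

This article used three realistic failures to rigorously test the high availability of a patroni failover cluster. patroni handled every failure tested here and kept the cluster available. The tests also verified in practice the roles that the ttl and loop_wait parameters play during failover, and gave the author a much deeper understanding of both parameters.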
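
On the two parameters mentioned above: the Patroni documentation ties them together with a third one, retry_timeout, through the rule loop_wait + 2 * retry_timeout <= ttl. A minimal sketch that checks a set of values against that rule; `timeouts_consistent` is a hypothetical helper, and the numbers are Patroni's documented defaults, used here as an assumption because only loop_wait (10 seconds) is confirmed for this cluster:

```python
def timeouts_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the rule from the Patroni documentation:
    loop_wait + 2 * retry_timeout <= ttl (all values in seconds)."""
    return loop_wait + 2 * retry_timeout <= ttl


# Patroni's documented defaults (assumed, not read from this cluster's patroni.yml).
DEFAULTS = {"ttl": 30, "loop_wait": 10, "retry_timeout": 10}

if __name__ == "__main__":
    print(timeouts_consistent(**DEFAULTS))
    # Lowering ttl alone breaks the rule: 10 + 2 * 10 > 20.
    print(timeouts_consistent(ttl=20, loop_wait=10, retry_timeout=10))
```

When tuning ttl for faster failover, loop_wait or retry_timeout usually has to be lowered with it so that the rule still holds.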