PostgreSQL High-Availability Cluster: Patroni Automatic Failover Testing

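There are many articles about Patroni online, but the vast majority only walk through a manual installation and stop once a Patroni environment is running; the same is true of what circulates in various WeChat groups. How Patroni's parameters are used, how failover works, and how to test it hands-on all go unmentioned. This post uses the cluster built in "Automated Patroni installation on Ubuntu 20: build a Patroni cluster in one minute", then tests and verifies failover hands-on: it simulates real failures and uses the resulting logs to analyze how failover is carried out and what effect it has.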
 

0. Patroni cluster state

ubuntu11 is the primary; ubuntu12 and ubuntu13 are replicas. Every test below starts from this layout, with ubuntu11 as primary and ubuntu12 and ubuntu13 as replicas.
  root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  6 |             |     |            |     |
  | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
  | ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
Because of what is being tested, automatic restart is turned off in Patroni's systemd unit (Restart=no), so systemd does not bring Patroni back by itself.
  postgres@ubuntu11:~$ cat /etc/systemd/system/patroni.service
  [Unit]
  Description=Patroni
  After=network.target etcd.service
  Wants=etcd.service
  [Service]
  Type=simple
  User=postgres
  Group=postgres
  Environment="TZ=Asia/Shanghai"
  Environment="PYTHONUNBUFFERED=1"
  ExecStart=/usr/local/bin/patroni /usr/local/pgsql17/patroni/patroni.yml
  ExecReload=/bin/kill -HUP $MAINPID
  ExecStop=/bin/kill -TERM $MAINPID
  #Restart=on-failure
  Restart=no
  RestartSec=10
  TimeoutStartSec=120
  TimeoutStopSec=60
  LimitNOFILE=65536
  StandardOutput=null
  StandardError=journal
  SyslogIdentifier=patroni
  [Install]
  WantedBy=multi-user.target
  postgres@ubuntu11:~$
Log on the primary, ubuntu11. Patroni polls the cluster state every 10 seconds; the polling interval is set by the loop_wait parameter.
  2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:47:17,145 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:47:27,189 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-21 08:47:37,153 INFO: no action. I am (ubuntu11), the leader with the lock
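The 10-second cadence can be read straight off the log timestamps. The following is a minimal sketch that computes the gaps between consecutive entries; `heartbeat_intervals` is a hypothetical helper written for this post, and the sample lines are copied from the ubuntu11 log above.

```python
from datetime import datetime

# First four entries copied from the ubuntu11 log above.
LOG_LINES = [
    "2026-05-21 08:46:37,142 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:47,145 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:46:57,190 INFO: no action. I am (ubuntu11), the leader with the lock",
    "2026-05-21 08:47:07,148 INFO: no action. I am (ubuntu11), the leader with the lock",
]


def heartbeat_intervals(lines):
    """Return the gaps, in seconds, between consecutive log timestamps."""
    # The timestamp is the first 23 characters: "YYYY-MM-DD HH:MM:SS,mmm".
    stamps = [datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f") for line in lines]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


intervals = heartbeat_intervals(LOG_LINES)
print([round(gap, 3) for gap in intervals])
```

Each gap should come out close to the configured loop_wait, with a few tens of milliseconds of jitter.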
Log on replica ubuntu12
  2026-05-21 08:46:47,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:46:57,628 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:07,671 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:17,632 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:27,675 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:37,139 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:47,233 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
Log on replica ubuntu13
  2026-05-21 08:46:57,643 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:07,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:17,647 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:27,688 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:37,155 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:47,255 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:47:57,696 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
 
1. Automatic failover scenario 1: the primary's OS is healthy, but the Patroni service fails

With the primary healthy, stop the Patroni service on the primary to simulate a primary failure.
  root@ubuntu11:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  6 |             |     |            |     |
  | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
  | ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  6 |   0/F000348 |   0 |  0/F000348 |   0 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  root@ubuntu11:/usr/local/patroni_install# systemctl stop patroni
  root@ubuntu11:/usr/local/patroni_install#
Cluster state as seen from replica ubuntu12; the original primary is now stopped.
  root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7641831362696373502) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Replica      | stopped   |    |     unknown |     |    unknown |     |
  | ubuntu12 | 192.168.152.122:9000 | Leader       | running   |  7 |             |     |            |     |
  | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming |  7 |  0/100001A8 |   0 | 0/100001A8 |   0 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
The former replica ubuntu12 becomes the new primary. Its log:
  ......
  2026-05-21 08:55:27,680 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:55:37,723 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:55:47,683 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:55:57,681 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-21 08:56:06,109 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
  2026-05-21 08:56:06,169 INFO: promoted self to leader by acquiring session lock
  2026-05-21 08:56:06,169 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-21 08:56:06,172 INFO: updated leader lock during promote
  server promoting
  2026-05-21 08:56:07,185 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-21 08:56:07,195 INFO: Assigning synchronous standby status to ['ubuntu13']
  server signaled
  2026-05-21 08:56:09,324 INFO: Synchronous standby status assigned to ['ubuntu13']
  2026-05-21 08:56:09,369 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-21 08:56:17,196 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-21 08:56:27,187 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-21 08:56:37,242 INFO: no action. I am (ubuntu12), the leader with the lock
  ......
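The failover sequence in this scenario:
Patroni on the primary ubuntu11 is stopped manually to simulate a failure -> Patroni on ubuntu11 deletes the leader key from the DCS as it shuts down -> at its next check, within one loop_wait interval, replica ubuntu12 finds that the DCS has no leader -> ubuntu12 acquires the lock and becomes Leader -> ubuntu12 promotes its local PostgreSQL to primary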
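The difference between a clean stop (the key is deleted at once) and a crash (the key lingers until its lease runs out) can be illustrated with a toy model. This is only a sketch: `ToyLeaderKey` is a hypothetical in-memory stand-in for the DCS leader key, not Patroni's or etcd's actual implementation, and time is a plain number of seconds.

```python
class ToyLeaderKey:
    """Toy in-memory stand-in for the DCS leader key (not Patroni's real code)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires_at = None

    def holder(self, now):
        """Return the current owner, or None if the key is absent or expired."""
        if self.owner is not None and now < self.expires_at:
            return self.owner
        return None

    def acquire_or_renew(self, member, now):
        """Take the key if it is free, or extend the lease if we already own it."""
        if self.holder(now) in (None, member):
            self.owner, self.expires_at = member, now + self.ttl
            return True
        return False

    def release(self, member, now):
        """Graceful shutdown: the owner deletes its own key immediately."""
        if self.holder(now) == member:
            self.owner = self.expires_at = None


# Clean stop: ubuntu11 releases the key at t=8, so ubuntu12 can take it at once.
key = ToyLeaderKey(ttl=30)
key.acquire_or_renew("ubuntu11", now=0)
key.release("ubuntu11", now=8)
took_over_at_8 = key.acquire_or_renew("ubuntu12", now=8)

# Crash: nobody releases the key, so ubuntu12 is blocked until the lease expires.
crashed = ToyLeaderKey(ttl=30)
crashed.acquire_or_renew("ubuntu11", now=0)
blocked_at_8 = crashed.acquire_or_renew("ubuntu12", now=8)
took_over_at_30 = crashed.acquire_or_renew("ubuntu12", now=30)

print(took_over_at_8, blocked_at_8, took_over_at_30)
```

The clean-stop path is why scenario 1 fails over within a single polling interval, without waiting for the ttl.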
2. Automatic failover scenario 2: the primary server loses power

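ubuntu11 is switched off with the virtual machine's "Power Off" action (not "Shut Down Guest") to simulate a sudden loss of power. Understanding this scenario requires a solid grasp of the lease lifetime, i.e. the ttl parameter (30 seconds by default).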

Cluster state before and after the failover (the command was run on ubuntu13; ubuntu12 is the new primary)
  root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642212398676862997) ----+----+-------------+-----+------------+-----+------------------------+
  | Member   | Host                 | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag | Tags                   |
  +----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
  | ubuntu11 | 192.168.152.121:9000 | Leader  | running   | 10 |             |     |            |     | failover_priority: 100 |
  | ubuntu12 | 192.168.152.122:9000 | Replica | streaming | 10 |   0/C000000 |   0 |  0/C000358 |   0 | failover_priority: 80  |
  | ubuntu13 | 192.168.152.123:9000 | Replica | streaming | 10 |   0/C000380 |   0 |  0/C000380 |   0 | failover_priority: 60  |
  +----------+----------------------+---------+-----------+----+-------------+-----+------------+-----+------------------------+
  root@ubuntu13:/usr/local/patroni_install#
  root@ubuntu13:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642212398676862997) ---------+----+-------------+-----+------------+-----+-----------------------+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag | Tags                  |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
  | ubuntu12 | 192.168.152.122:9000 | Leader       | running   | 11 |             |     |            |     | failover_priority: 80 |
  | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming | 11 |   0/C000688 |   0 |  0/C000688 |   0 | failover_priority: 60 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+-----------------------+
  root@ubuntu13:/usr/local/patroni_install#
Patroni log on the new primary, ubuntu12
  2026-05-22 13:53:59,956 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:54:10,451 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:54:20,026 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:54:30,458 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:54:40,461 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:54:50,499 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:55:00,456 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 13:55:10,457 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)  ###### power to ubuntu11 is cut at about this point
  2026-05-22 13:55:20,498 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)  ###### why is ubuntu11 still seen as healthy here? Because its lease has not expired yet
  2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
  2026-05-22 13:55:32,114 INFO: promoted self to leader by acquiring session lock
  2026-05-22 13:55:32,114 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-22 13:55:32,115 INFO: updated leader lock during promote
  2026-05-22 13:55:33,137 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-22 13:55:33,193 INFO: Assigning synchronous standby status to ['ubuntu13']
  2026-05-22 13:55:35,316 INFO: Synchronous standby status assigned to ['ubuntu13']
  2026-05-22 13:55:35,322 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 13:55:35,377 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 13:55:45,324 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 13:55:55,367 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 13:56:05,329 INFO: no action. I am (ubuntu12), the leader with the lock
Checking the role of the new primary through psql
  postgres=#
  postgres=#
  postgres=# select now(),pg_is_in_recovery();            ########################### power to the original primary ubuntu11 is cut here, then the query is repeated
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:10.849473+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:11.665724+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
               now              | pg_is_in_recovery
  ------------------------------+-------------------
  2026-05-22 13:55:12.32947+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:13.017149+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:13.799962+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:14.902866+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:15.672331+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:16.435662+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:17.070935+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:17.816528+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:18.546785+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:19.393943+08 | t
  (1 row)
  #......intermediate output omitted......
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:29.759037+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:30.417626+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:31.089604+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:31.775459+08 | t
  (1 row)
  postgres=# select now(),pg_is_in_recovery();            ########################### only after about 22 seconds is the new primary actually promoted
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:32.400935+08 | f
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
               now              | pg_is_in_recovery
  ------------------------------+-------------------
  2026-05-22 13:55:33.27183+08 | f
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:33.950342+08 | f
  (1 row)
  postgres=# select now(),pg_is_in_recovery();
                now              | pg_is_in_recovery
  -------------------------------+-------------------
  2026-05-22 13:55:34.758651+08 | f
  (1 row)
  postgres=#
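Putting the logs above together, the meaning of ttl becomes clear once the events are laid out in time order:
1. 2026-05-22 13:55:10,457: as noted above, this is roughly when power was cut on the original primary ubuntu11.
2. 2026-05-22 13:55:20,498: the Patroni log on ubuntu12 still reports ubuntu11 as a healthy leader.
3. 2026-05-22 13:55:32.400935: querying pg_is_in_recovery() on the new primary shows that the value has only now changed to f, i.e. the failover has succeeded.
Do the logs contradict what was actually done? ubuntu11 clearly lost power at 13:55:10, so why does the check at 13:55:20 still see a healthy leader, and why does the new primary only start working at 13:55:32?
The reason is the lease. Patroni on ubuntu11 last renewed the leader key in etcd shortly before the power cut, up to one loop_wait earlier (loop_wait defaults to 10 seconds), i.e. at about 13:55:00. Each renewal makes the lease valid for another 30 seconds (ttl), so the lease did not expire until about 13:55:30 or later. At 13:55:20, Patroni on the replica that eventually took over therefore still found a leader key that had not expired.
Only in the next check cycle, at about 13:55:30, did ubuntu12 find the original primary unreachable: "2026-05-22 13:55:32,106 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f893566d880>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))". Why is the entry stamped 13:55:32, two seconds after 13:55:30? Because of connect timeout=2: the request to ubuntu11 had to time out before the warning was logged.
That is what the ttl parameter in Patroni really means.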
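The timeline above can be checked with a little date arithmetic. The sketch below assumes, as the explanation does, that the lease expires ttl seconds after the last renewal and that the last renewal fell somewhere in the loop_wait window before the power cut; the exact renewal time is not in the logs, so only a range can be computed.

```python
from datetime import datetime, timedelta

TTL = timedelta(seconds=30)        # default ttl, as stated in this post
LOOP_WAIT = timedelta(seconds=10)  # default loop_wait, as stated in this post

# Approximate moment power was cut on ubuntu11 (from the annotated log).
power_off = datetime(2026, 5, 22, 13, 55, 10)

# Assumption: the last renewal happened within one loop_wait before the power cut.
earliest_renewal = power_off - LOOP_WAIT
latest_renewal = power_off

# The lease therefore expires somewhere in this window.
earliest_expiry = earliest_renewal + TTL
latest_expiry = latest_renewal + TTL

# "promoted self to leader" on ubuntu12 was logged at 13:55:32,114.
observed_promotion = datetime(2026, 5, 22, 13, 55, 32, 114000)

print("lease expires between", earliest_expiry.time(), "and", latest_expiry.time())
print("promotion inside window:", earliest_expiry <= observed_promotion <= latest_expiry)
```

Under these assumptions the promotion logged at 13:55:32 falls inside the expected expiry window, which is consistent with the explanation given above.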
 
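The failover sequence in this scenario:
Power to ubuntu11 is cut to simulate a primary failure -> 10 seconds later the leader lease held by ubuntu11 is still valid (although ubuntu11 is already down) -> 10 seconds later the lease is still valid (ubuntu11 is still down) -> 10 seconds later ubuntu12 detects that the leader lease has expired -> ubuntu12 takes the leader key and promotes its local PostgreSQL to primary
So to make Patroni fail over faster, reduce ttl (shorten the lease on the leader key) and also reduce loop_wait (check the leader key more often). Both make failure detection and failover more responsive, but be aware that setting them too low can trigger unexpected failovers during network jitter.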
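When lowering these values, Patroni's documentation asks that the settings keep the relation loop_wait + 2 * retry_timeout <= ttl. retry_timeout is not discussed in this post; its default of 10 seconds is taken from the Patroni documentation, and `check_timing` below is a hypothetical helper, not part of Patroni. A minimal sketch of checking a candidate combination before applying it:

```python
def check_timing(ttl, loop_wait, retry_timeout):
    """Return True if loop_wait + 2 * retry_timeout <= ttl holds (all in seconds)."""
    return loop_wait + 2 * retry_timeout <= ttl


# Defaults: 10 + 2 * 10 = 30, which just satisfies ttl = 30.
defaults_ok = check_timing(ttl=30, loop_wait=10, retry_timeout=10)

# Shrinking ttl alone breaks the relation: 5 + 2 * 5 = 15 > 10.
too_tight_ok = check_timing(ttl=10, loop_wait=5, retry_timeout=5)

# Shrinking all three together keeps it valid: 5 + 2 * 5 = 15 <= 20.
tuned_ok = check_timing(ttl=20, loop_wait=5, retry_timeout=5)

print(defaults_ok, too_tight_ok, tuned_ok)
```

The example values for the tuned case are illustrative only; suitable numbers depend on how stable the network between the nodes and the DCS is.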
3. Automatic failover scenario 3: the primary is cut off by a network partition

On both replicas, iptables rules drop all traffic to and from the primary (192.168.152.121), for example iptables -A OUTPUT -d 192.168.152.121 -j DROP.
Replica 1
  root@ubuntu12:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP
  root@ubuntu12:/usr/local/patroni_install# sudo iptables -A INPUT  -s 192.168.152.121 -j DROP
  root@ubuntu12:/usr/local/patroni_install#
Replica 2
  root@ubuntu13:/usr/local/patroni_install# sudo iptables -A OUTPUT -d 192.168.152.121 -j DROP
  root@ubuntu13:/usr/local/patroni_install# sudo iptables -A INPUT  -s 192.168.152.121 -j DROP
  root@ubuntu13:/usr/local/patroni_install#
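The same two rules are needed on each replica, and they have to be deleted again to heal the partition. As a small sketch, the helper below builds both command lines for a given peer; `partition_rules` is a hypothetical name, and the only iptables behavior relied on is that -D with the same rule specification removes a rule previously added with -A. The commands are only printed here, not executed.

```python
def partition_rules(peer_ip, action="-A"):
    """Build iptables commands that block traffic to and from peer_ip.

    action is "-A" to append the rules or "-D" to delete them again.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A or -D")
    return [
        ["iptables", action, "OUTPUT", "-d", peer_ip, "-j", "DROP"],
        ["iptables", action, "INPUT", "-s", peer_ip, "-j", "DROP"],
    ]


# Commands to undo the partition on a replica once the test is finished.
for command in partition_rules("192.168.152.121", action="-D"):
    print(" ".join(command))
```

Run as root on ubuntu12 and ubuntu13, the printed commands would remove the DROP rules added above.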
The network partition is now in place.
1. On the new primary: ubuntu12 has successfully taken over as primary.
  2026-05-22 14:47:38,980 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 14:47:49,402 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 14:47:58,941 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 14:48:09,491 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 14:48:19,441 INFO: no action. I am (ubuntu12), a secondary, and following a leader (ubuntu11)
  2026-05-22 14:48:31,104 WARNING: Request failed to ubuntu11: GET http://192.168.152.121:8008/patroni (HTTPConnectionPool(host='192.168.152.121', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f8988fdb1f0>, 'Connection to 192.168.152.121 timed out. (connect timeout=2)')))
  2026-05-22 14:48:31,183 INFO: promoted self to leader by acquiring session lock
  2026-05-22 14:48:31,187 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-22 14:48:31,239 INFO: updated leader lock during promote
  2026-05-22 14:48:32,206 INFO: Lock owner: ubuntu12; I am ubuntu12
  2026-05-22 14:48:32,214 INFO: Assigning synchronous standby status to ['ubuntu13']
  2026-05-22 14:48:34,337 INFO: Synchronous standby status assigned to ['ubuntu13']
  2026-05-22 14:48:34,385 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 14:48:42,245 INFO: no action. I am (ubuntu12), the leader with the lock
  2026-05-22 14:48:52,256 INFO: no action. I am (ubuntu12), the leader with the lock
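Note that, as with the power-loss case, failover under a network partition is not as quick as the new primary's log suggests. In that log only a little over 10 seconds pass between detection and failover, but the original primary's last renewal of the leader key extended the lease by 30 seconds (ttl). During the new primary's first two checks after the partition formed, the partition already existed but the new primary had not yet taken over; it could do so only once the original primary's lease on the leader key had expired. This is the same behavior as in the previous scenario, so the detailed test is not repeated here.
2. On the original primary:
The original primary can no longer reach ubuntu12 or ubuntu13. Note these log entries: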
2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
2026-05-22 14:48:17,258 INFO: Demoting self (offline)
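Once partitioned, the original primary automatically demotes itself to a read-only state, so no dual-primary or split-brain situation arises. It also keeps retrying its connections to the etcd members on ubuntu12 and ubuntu13 (the log keeps growing; only part of it is shown), so that it can rejoin the cluster automatically once the network recovers.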
  2026-05-22 14:47:38,896 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-22 14:47:38,963 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-22 14:47:48,912 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-22 14:47:58,948 INFO: no action. I am (ubuntu11), the leader with the lock
  2026-05-22 14:48:08,903 INFO: Lock owner: ubuntu11; I am ubuntu11
  2026-05-22 14:48:12,244 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3332171243333355)")
  2026-05-22 14:48:12,244 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:48:12,244 INFO: Retrying on http://192.168.152.123:2379
  2026-05-22 14:48:13,913 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2b0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:48:13,913 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:48:13,913 INFO: Retrying on http://192.168.152.122:2379
  2026-05-22 14:48:15,583 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b2e0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:48:15,583 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:48:17,253 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b520>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:48:17,256 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  2026-05-22 14:48:17,257 ERROR: Error communicating with DCS
  2026-05-22 14:48:17,258 INFO: demoting self because DCS is not accessible and I was a leader
  2026-05-22 14:48:17,258 INFO: Demoting self (offline)
  2026-05-22 14:48:18,932 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b970>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:00,355 INFO: postmaster pid=3525
  2026-05-22 14:49:01,400 INFO: demoted self because DCS is not accessible and I was a leader
  2026-05-22 14:49:01,403 WARNING: Loop time exceeded, rescheduling immediately.
  2026-05-22 14:49:01,405 INFO: Lock owner: ubuntu11; I am ubuntu11
  2026-05-22 14:49:01,405 INFO: establishing a new patroni heartbeat connection to postgres
  2026-05-22 14:49:04,749 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33254870033331)")
  2026-05-22 14:49:04,749 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:04,749 INFO: Retrying on http://192.168.152.123:2379
  2026-05-22 14:49:06,419 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04bfa0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:06,419 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:06,419 INFO: Retrying on http://192.168.152.122:2379
  2026-05-22 14:49:08,089 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e451c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:08,089 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:09,758 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53fa90>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:11,417 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.00350682653891)")
  2026-05-22 14:49:11,417 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:13,086 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04ba60>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:13,088 ERROR: Error communicating with DCS
  2026-05-22 14:49:13,088 INFO: DCS is not accessible
  2026-05-22 14:49:13,088 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  2026-05-22 14:49:13,090 WARNING: Loop time exceeded, rescheduling immediately.
  2026-05-22 14:49:14,757 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b820>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:14,763 INFO: Lock owner: ubuntu11; I am ubuntu11
  2026-05-22 14:49:18,103 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.3331819403333234)")
  2026-05-22 14:49:18,103 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:18,103 INFO: Retrying on http://192.168.152.122:2379
  2026-05-22 14:49:19,773 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e450d0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:19,773 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:19,773 INFO: Retrying on http://192.168.152.123:2379
  2026-05-22 14:49:21,441 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45370>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:21,442 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:23,112 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45670>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:23,114 ERROR: Error communicating with DCS
  2026-05-22 14:49:23,114 INFO: DCS is not accessible
  2026-05-22 14:49:23,114 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  2026-05-22 14:49:23,115 WARNING: Loop time exceeded, rescheduling immediately.
  2026-05-22 14:49:24,784 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45b50>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:24,790 INFO: Lock owner: ubuntu11; I am ubuntu11
  2026-05-22 14:49:28,128 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.332673626333379)")
  2026-05-22 14:49:28,128 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:28,128 INFO: Retrying on http://192.168.152.123:2379
  2026-05-22 14:49:29,799 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c220>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:29,799 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:29,799 INFO: Retrying on http://192.168.152.122:2379
  2026-05-22 14:49:31,469 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c41f460>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:31,469 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:33,138 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c53f9a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:34,794 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.1904628130222932)")
  2026-05-22 14:49:34,794 INFO: Reconnection allowed, looking for another server.
  2026-05-22 14:49:36,464 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b0a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2026-05-22 14:49:36,468 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  2026-05-22 14:49:36,468 ERROR: Error communicating with DCS
  2026-05-22 14:49:36,468 INFO: DCS is not accessible
  2026-05-22 14:49:36,470 WARNING: Loop time exceeded, rescheduling immediately.
  2026-05-22 14:49:38,140 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b8b0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  75. 2026-05-22 14:49:38,145 INFO: Lock owner: ubuntu11; I am ubuntu11
  76. 2026-05-22 14:49:41,485 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=3.33319548833335)")
  77. 2026-05-22 14:49:41,485 INFO: Reconnection allowed, looking for another server.
  78. 2026-05-22 14:49:41,485 INFO: Retrying on http://192.168.152.122:2379
  79. 2026-05-22 14:49:43,153 ERROR: Request to server http://192.168.152.122:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45fa0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  80. 2026-05-22 14:49:43,153 INFO: Reconnection allowed, looking for another server.
  81. 2026-05-22 14:49:43,153 INFO: Retrying on http://192.168.152.123:2379
  82. 2026-05-22 14:49:44,823 ERROR: Request to server http://192.168.152.123:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.152.123', port=2379): Max retries exceeded with url: /v3/lease/keepalive (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e459d0>, 'Connection to 192.168.152.123 timed out. (connect timeout=1.6666666666666667)'))")
  83. 2026-05-22 14:49:44,823 INFO: Reconnection allowed, looking for another server.
  84. 2026-05-22 14:49:46,493 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45ac0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  85. 2026-05-22 14:49:48,150 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.4432267216546961)")
  86. 2026-05-22 14:49:48,150 INFO: Reconnection allowed, looking for another server.
  87. 2026-05-22 14:49:49,819 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e45100>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  88. 2026-05-22 14:49:49,821 ERROR: Error communicating with DCS
  89. 2026-05-22 14:49:49,821 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  90. 2026-05-22 14:49:49,821 INFO: DCS is not accessible
Remove the network partition on ubuntu12:
  root@ubuntu12:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
  root@ubuntu12:/usr/local/patroni_install# sudo iptables -D INPUT  -s 192.168.152.121 -j DROP
Remove the network partition on ubuntu13 as well:
  root@ubuntu13:/usr/local/patroni_install# sudo iptables -D OUTPUT -d 192.168.152.121 -j DROP
  root@ubuntu13:/usr/local/patroni_install# sudo iptables -D INPUT  -s 192.168.152.121 -j DROP
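The cleanup commands above differ from rule creation only in the `-D` flag. As a minimal sketch (assuming the partition was created with the matching `-A` rules, which this section does not show; the helper name `partition_rules` is hypothetical), the command pairs can be generated from one place so that the add and delete rules always match:

```python
def partition_rules(peer_ip, action):
    """Build the iptables commands that block or unblock traffic with peer_ip.

    action is "add" (-A, create the partition) or "delete" (-D, heal it).
    The function only builds argument lists; it does not execute anything.
    """
    flags = {"add": "-A", "delete": "-D"}
    if action not in flags:
        raise ValueError("action must be 'add' or 'delete'")
    flag = flags[action]
    return [
        ["iptables", flag, "OUTPUT", "-d", peer_ip, "-j", "DROP"],
        ["iptables", flag, "INPUT", "-s", peer_ip, "-j", "DROP"],
    ]


# Print the cleanup commands for the isolated primary's address.
for cmd in partition_rules("192.168.152.121", "delete"):
    print(" ".join(cmd))
```

The lists can be passed to `subprocess.run` as root on each node; they are only printed here so the sketch stays side-effect free.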
The previously isolated ubuntu11 automatically rejoins the cluster as a replica.
  root@ubuntu12:/usr/local/patroni_install# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Replica      | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
  | ubuntu12 | 192.168.152.122:9000 | Leader       | running   |  5 |             |     |            |     |
  | ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
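Reading the table by eye works for a single test, but the same check can be scripted when a test is repeated. The following is a minimal sketch that parses the text printed by `patronictl list` (with any listing numbers removed) and checks that there is exactly one running leader and that every other member is streaming on the same timeline. The function names `parse_patronictl_list` and `is_healthy` are hypothetical, and the definition of "healthy" is an assumption chosen for this test, not something patroni defines.

```python
def parse_patronictl_list(text):
    """Parse the ASCII table printed by `patronictl list` into a list of dicts."""
    members = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip borders, the cluster title line and shell prompts
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] == "Member":
            continue  # header row
        members.append(
            {"member": cells[0], "role": cells[2], "state": cells[3], "tl": cells[4]}
        )
    return members


def is_healthy(members):
    """One running leader, all other members streaming, everyone on one timeline."""
    leaders = [m for m in members if m["role"] == "Leader"]
    others = [m for m in members if m["role"] != "Leader"]
    same_timeline = len({m["tl"] for m in members}) == 1
    return (
        len(leaders) == 1
        and leaders[0]["state"] == "running"
        and all(m["state"] == "streaming" for m in others)
        and same_timeline
    )


SAMPLE = """\
| Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
| ubuntu11 | 192.168.152.121:9000 | Replica      | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
| ubuntu12 | 192.168.152.122:9000 | Leader       | running   |  5 |             |     |            |     |
| ubuntu13 | 192.168.152.123:9000 | Sync Standby | streaming |  5 |   0/60043F0 |   0 |  0/60043F0 |   0 |
"""

members = parse_patronictl_list(SAMPLE)
print([m["member"] for m in members if m["role"] == "Leader"])
print(is_healthy(members))
```

A member that is still in the `creating replica` state has an empty TL column, so it fails both the streaming check and the timeline check until it has caught up.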
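Log on ubuntu11: once the partition is removed it fails to update the leader lock, sees ubuntu12 as the lock owner, automatically runs pg_rewind, and then rejoins the cluster as a replica.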
  1. 2026-05-22 15:02:40,971 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5c4c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  2. 2026-05-22 15:02:42,626 ERROR: Request to server http://192.168.152.121:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='192.168.152.121', port=2379): Read timed out. (read timeout=1.3500901345867078)")
  3. 2026-05-22 15:02:42,626 INFO: Reconnection allowed, looking for another server.
  4. 2026-05-22 15:02:44,295 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2027e5cb80>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  5. 2026-05-22 15:02:44,296 ERROR: Error communicating with DCS
  6. 2026-05-22 15:02:44,297 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
  7. 2026-05-22 15:02:44,297 INFO: DCS is not accessible
  8. 2026-05-22 15:02:44,298 WARNING: Loop time exceeded, rescheduling immediately.
  9. 2026-05-22 15:02:45,967 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f205c04b3a0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  10. 2026-05-22 15:02:45,975 INFO: Lock owner: ubuntu11; I am ubuntu11
  11. 2026-05-22 15:02:47,985 ERROR: failed to update leader lock
  12. 2026-05-22 15:02:47,994 INFO: not promoting because failed to update leader lock in DCS
  13. 2026-05-22 15:02:47,994 INFO: Lock owner: ubuntu12; I am ubuntu11
  14. 2026-05-22 15:02:48,001 INFO: Local timeline=4 lsn=0/70000A0
  15. 2026-05-22 15:02:48,027 INFO: primary_timeline=5
  16. 2026-05-22 15:02:48,030 INFO: primary: history=1        0/504F580        no recovery target specified
  17. 2        0/6003D20        no recovery target specified
  18. 3        0/6003EC0        no recovery target specified
  19. 4        0/6004148        no recovery target specified
  20. 2026-05-22 15:02:48,049 INFO: running pg_rewind from ubuntu12
  21. 2026-05-22 15:02:49,312 INFO: running pg_rewind from dbname=postgres user=rewind_user host=192.168.152.122 port=9000 target_session_attrs=read-write
  22. 2026-05-22 15:02:50,305 INFO: pg_rewind exit code=0
  23. 2026-05-22 15:02:50,305 INFO:  stdout=
  24. 2026-05-22 15:02:50,305 INFO:  stderr=pg_rewind: servers diverged at WAL location 0/6004148 on timeline 4
  25. pg_rewind: rewinding from last common checkpoint at 0/6004038 on timeline 4
  26. pg_rewind: Done!
  27. 2026-05-22 15:02:50,307 WARNING: Postgresql is not running.
  28. 2026-05-22 15:02:50,308 INFO: Lock owner: ubuntu12; I am ubuntu11
  29. 2026-05-22 15:02:50,319 INFO: pg_controldata:
  30.   pg_control version number: 1700
  31.   Catalog version number: 202406281
  32.   Database system identifier: 7642589780522937440
  33.   Database cluster state: in archive recovery
  34.   pg_control last modified: Fri May 22 15:02:50 2026
  35.   Latest checkpoint location: 0/6004340
  36.   Latest checkpoint's REDO location: 0/60042E8
  37.   Latest checkpoint's REDO WAL file: 000000050000000000000006
  38.   Latest checkpoint's TimeLineID: 5
  39.   Latest checkpoint's PrevTimeLineID: 5
  40.   Latest checkpoint's full_page_writes: on
  41.   Latest checkpoint's NextXID: 0:762
  42.   Latest checkpoint's NextOID: 24576
  43.   Latest checkpoint's NextMultiXactId: 1
  44.   Latest checkpoint's NextMultiOffset: 0
  45.   Latest checkpoint's oldestXID: 731
  46.   Latest checkpoint's oldestXID's DB: 1
  47.   Latest checkpoint's oldestActiveXID: 762
  48.   Latest checkpoint's oldestMultiXid: 1
  49.   Latest checkpoint's oldestMulti's DB: 1
  50.   Latest checkpoint's oldestCommitTsXid: 0
  51.   Latest checkpoint's newestCommitTsXid: 0
  52.   Time of latest checkpoint: Fri May 22 14:53:31 2026
  53.   Fake LSN counter for unlogged rels: 0/3E8
  54.   Minimum recovery ending location: 0/60043F0
  55.   Min recovery ending loc's timeline: 5
  56.   Backup start location: 0/0
  57.   Backup end location: 0/0
  58.   End-of-backup record required: no
  59.   wal_level setting: replica
  60.   wal_log_hints setting: on
  61.   max_connections setting: 100
  62.   max_worker_processes setting: 8
  63.   max_wal_senders setting: 10
  64.   max_prepared_xacts setting: 0
  65.   max_locks_per_xact setting: 64
  66.   track_commit_timestamp setting: off
  67.   Maximum data alignment: 8
  68.   Database block size: 8192
  69.   Blocks per segment of large relation: 131072
  70.   WAL block size: 8192
  71.   Bytes per WAL segment: 16777216
  72.   Maximum length of identifiers: 64
  73.   Maximum columns in an index: 32
  74.   Maximum size of a TOAST chunk: 1996
  75.   Size of a large-object chunk: 2048
  76.   Date/time type storage: 64-bit integers
  77.   Float8 argument passing: by value
  78.   Data page checksum version: 1
  79.   Mock authentication nonce: 3587dd0ff212f7ed05a16aa24aa1d6a6f187f55d5d6a2e158ce45327a7e55005
  80. 2026-05-22 15:02:50,320 INFO: Lock owner: ubuntu12; I am ubuntu11
  81. 2026-05-22 15:02:50,367 INFO: starting as a secondary
  82. 2026-05-22 15:02:50,368 INFO: closed patroni connections to postgres
  83. 2026-05-22 15:02:50,738 INFO: postmaster pid=3952
  84. 2026-05-22 15:02:51,774 INFO: Lock owner: ubuntu12; I am ubuntu11
  85. 2026-05-22 15:02:51,774 INFO: establishing a new patroni heartbeat connection to postgres
  86. 2026-05-22 15:02:51,795 INFO: Local timeline=5 lsn=0/60043F0
  87. 2026-05-22 15:02:51,803 INFO: primary_timeline=5
  88. 2026-05-22 15:02:51,812 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
  89. 2026-05-22 15:02:52,281 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
  90. 2026-05-22 15:03:02,819 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
  91. 2026-05-22 15:03:12,778 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
  92. 2026-05-22 15:03:22,777 INFO: no action. I am (ubuntu11), a secondary, and following a leader (ubuntu12)
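The log explains why pg_rewind was needed: ubuntu11 was still on timeline 4 at LSN 0/70000A0, while the new primary's history shows that timeline 4 ended at 0/6004148. The old primary had therefore written WAL past the point where the timelines forked. The comparison can be sketched as follows. This is a simplified illustration with hypothetical function names (`lsn_to_int`, `parse_history`, `needs_rewind`), not patroni's actual implementation.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '0/70000A0' to an integer byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(text):
    """Parse timeline history lines 'tli  switch_lsn  reason' into {tli: switch_lsn}."""
    history = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            history[int(parts[0])] = lsn_to_int(parts[1])
    return history


def needs_rewind(local_tli, local_lsn, primary_tli, history):
    """Return True when the local node has WAL that the primary's history lacks."""
    if local_tli > primary_tli:
        return True  # local node is on a timeline the primary never reached
    if local_tli == primary_tli:
        return False  # same timeline, nothing diverged (simplification)
    switch_point = history.get(local_tli)
    if switch_point is None:
        return True  # local timeline is not an ancestor of the primary's timeline
    return lsn_to_int(local_lsn) > switch_point


# Values copied from the ubuntu11 log above.
HISTORY = """\
1        0/504F580        no recovery target specified
2        0/6003D20        no recovery target specified
3        0/6003EC0        no recovery target specified
4        0/6004148        no recovery target specified
"""

history = parse_history(HISTORY)
print(needs_rewind(4, "0/70000A0", 5, history))
```

With the values from the log the function returns `True`, which matches the `running pg_rewind from ubuntu12` line. Had ubuntu11 stopped at or before 0/6004148, it could have followed the new primary without a rewind.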
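  4, Automatic failover scenario 4: forcibly deleting the primary's PostgreSQL data files

This almost never happens in practice unless it is done deliberately. So what happens when the PostgreSQL data files of a running primary are forcibly deleted?

The test below shows that after the data files are deleted: 1, the patroni cluster fails over automatically (because the primary can no longer serve requests); 2, the original primary automatically clones a fresh copy of the data from the cluster and runs as a replica.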
  ###### Cluster in its normal state before the test
  root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Sync Standby | streaming |  6 |   0/70002D0 |   0 |  0/70002D0 |   0 |
  | ubuntu12 | 192.168.152.122:9000 | Replica      | streaming |  6 |   0/70002D0 |   0 |  0/70002D0 |   0 |
  | ubuntu13 | 192.168.152.123:9000 | Leader       | running   |  6 |             |     |            |     |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  ###### After forcibly deleting the original primary's data files
  root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642589780522937440) ----------------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State            | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Leader       | running          |  7 |             |     |            |     |
  | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming        |  7 |   0/7000410 |   0 |  0/7000410 |   0 |
  | ubuntu13 | 192.168.152.123:9000 | Replica      | creating replica |    |     unknown |     |    unknown |     |
  +----------+----------------------+--------------+------------------+----+-------------+-----+------------+-----+
  ###### Cluster back to normal
  root@ubuntu11:~# patronictl -c /usr/local/pgsql17/patroni/patroni.yml list
  + Cluster: pg_cluster_wy_prod (7642589780522937440) ---------+----+-------------+-----+------------+-----+
  | Member   | Host                 | Role         | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
  | ubuntu11 | 192.168.152.121:9000 | Leader       | running   |  7 |             |     |            |     |
  | ubuntu12 | 192.168.152.122:9000 | Sync Standby | streaming |  7 |   0/7000410 |   0 |  0/7000410 |   0 |
  | ubuntu13 | 192.168.152.123:9000 | Replica      | streaming |  7 |   0/9000000 |   0 |  0/9000000 |   0 |
  +----------+----------------------+--------------+-----------+----+-------------+-----+------------+-----+
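Log from the original primary (ubuntu13). Patroni releases the leader key voluntarily once it finds the data directory empty, then bootstraps from the new leader ubuntu11. The first pg_basebackup attempt exits with code=1 and is retried, after which the log reports `replica has been created using basebackup`. In other words, the damaged node automatically takes a base backup from the current leader, restores itself and rejoins the cluster, as hard to kill as a cockroach. It is a different story when the database is large, since the whole data set has to be copied again.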
  1. 2026-05-22 16:39:25,903 INFO: no action. I am (ubuntu13), the leader with the lock
  2. 2026-05-22 16:39:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
  3. 2026-05-22 16:39:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
  4. 2026-05-22 16:39:55,902 INFO: no action. I am (ubuntu13), the leader with the lock
  5. 2026-05-22 16:40:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
  6. 2026-05-22 16:40:15,902 INFO: no action. I am (ubuntu13), the leader with the lock
  7. 2026-05-22 16:40:25,901 INFO: no action. I am (ubuntu13), the leader with the lock
  8. 2026-05-22 16:40:35,947 INFO: no action. I am (ubuntu13), the leader with the lock
  9. 2026-05-22 16:40:45,904 INFO: no action. I am (ubuntu13), the leader with the lock
  10. 2026-05-22 16:40:55,899 INFO: Lock owner: ubuntu13; I am ubuntu13
  11. 2026-05-22 16:40:57,570 ERROR: Failed to get list of machines from http://192.168.152.122:2379/v3: MaxRetryError("HTTPConnectionPool(host='192.168.152.122', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd2642bc7c0>, 'Connection to 192.168.152.122 timed out. (connect timeout=1.6666666666666667)'))")
  12. 2026-05-22 16:40:57,573 INFO: no action. I am (ubuntu13), the leader with the lock
  13. 2026-05-22 16:41:05,902 INFO: no action. I am (ubuntu13), the leader with the lock
  14. 2026-05-22 16:41:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
  15. 2026-05-22 16:41:25,905 INFO: no action. I am (ubuntu13), the leader with the lock
  16. 2026-05-22 16:41:35,944 INFO: no action. I am (ubuntu13), the leader with the lock
  17. 2026-05-22 16:41:45,903 INFO: no action. I am (ubuntu13), the leader with the lock
  18. 2026-05-22 16:41:55,908 INFO: no action. I am (ubuntu13), the leader with the lock
  19. 2026-05-22 16:42:05,945 INFO: no action. I am (ubuntu13), the leader with the lock
  20. 2026-05-22 16:42:15,905 INFO: no action. I am (ubuntu13), the leader with the lock
  21. 2026-05-22 16:42:25,902 INFO: no action. I am (ubuntu13), the leader with the lock
  22. 2026-05-22 16:42:35,943 INFO: no action. I am (ubuntu13), the leader with the lock
  23. 2026-05-22 16:42:45,901 INFO: no action. I am (ubuntu13), the leader with the lock
  24. 2026-05-22 16:42:55,901 INFO: no action. I am (ubuntu13), the leader with the lock
  25. 2026-05-22 16:43:05,944 INFO: no action. I am (ubuntu13), the leader with the lock
  26. 2026-05-22 16:43:15,901 INFO: no action. I am (ubuntu13), the leader with the lock
  27. 2026-05-22 16:43:25,904 INFO: no action. I am (ubuntu13), the leader with the lock
  28. 2026-05-22 16:43:35,906 INFO: Lock owner: ubuntu13; I am ubuntu13
  29. 2026-05-22 16:43:35,958 INFO: Leader key released
  30. 2026-05-22 16:43:35,961 INFO: released leader key voluntarily as data dir empty and currently leader
  31. 2026-05-22 16:43:35,961 INFO: Lock owner: None; I am ubuntu13
  32. 2026-05-22 16:43:36,003 INFO: waiting for leader to bootstrap
  33. 2026-05-22 16:43:36,016 INFO: Lock owner: ubuntu11; I am ubuntu13
  34. 2026-05-22 16:43:36,018 INFO: trying to bootstrap from leader 'ubuntu11'
  35. 2026-05-22 16:43:36,134 ERROR: Error when fetching backup: pg_basebackup exited with code=1
  36. 2026-05-22 16:43:36,135 WARNING: Trying again in 5 seconds
  37. 2026-05-22 16:43:37,069 INFO: Lock owner: ubuntu11; I am ubuntu13
  38. 2026-05-22 16:43:37,116 INFO: bootstrap from leader 'ubuntu11' in progress
  39. 2026-05-22 16:43:42,134 INFO: replica has been created using basebackup
  40. 2026-05-22 16:43:42,135 INFO: bootstrapped from leader 'ubuntu11'
  41. 2026-05-22 16:43:42,531 INFO: postmaster pid=25110
  42. 2026-05-22 16:43:43,580 INFO: Lock owner: ubuntu11; I am ubuntu13
  43. 2026-05-22 16:43:43,580 INFO: establishing a new patroni heartbeat connection to postgres
  44. 2026-05-22 16:43:43,602 INFO: Local timeline=7 lsn=0/9000000
  45. 2026-05-22 16:43:43,628 INFO: primary_timeline=7
  46. 2026-05-22 16:43:43,665 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  47. 2026-05-22 16:43:47,112 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  48. 2026-05-22 16:43:57,127 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  49. 2026-05-22 16:44:07,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  50. 2026-05-22 16:44:17,625 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  51. 2026-05-22 16:44:27,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  52. 2026-05-22 16:44:37,584 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  53. 2026-05-22 16:44:47,627 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  54. 2026-05-22 16:44:57,579 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  55. 2026-05-22 16:45:07,587 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
  56. 2026-05-22 16:45:17,619 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)
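The timestamps in the log also show how long the recovery took. As a minimal sketch (the helper names `log_time` and `seconds_between` are hypothetical, and the sample lines are copied from the log above with the listing numbers removed), the interval between two log events can be computed like this:

```python
from datetime import datetime


def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a patroni log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first with end_marker."""
    start = next(log_time(l) for l in lines if start_marker in l)
    end = next(log_time(l) for l in lines if end_marker in l)
    return (end - start).total_seconds()


LOG = [
    "2026-05-22 16:43:35,958 INFO: Leader key released",
    "2026-05-22 16:43:36,018 INFO: trying to bootstrap from leader 'ubuntu11'",
    "2026-05-22 16:43:42,134 INFO: replica has been created using basebackup",
    "2026-05-22 16:43:43,665 INFO: no action. I am (ubuntu13), a secondary, and following a leader (ubuntu11)",
]

# Time from giving up leadership to a finished base backup.
print(seconds_between(LOG, "Leader key released", "replica has been created"))
# Time from giving up leadership to running as a healthy replica.
print(seconds_between(LOG, "Leader key released", "a secondary, and following"))
```

For this small test database the node is back as a replica within about eight seconds of releasing the leader key. The figure reflects only this test run and will grow with the size of the data directory.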
 

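5, Summary

 This article put a patroni failover cluster through demanding tests using three realistic failure types. In every case tested, patroni handled the failure and kept the cluster available. The tests also verified hands-on how the ttl and loop_wait parameters affect failover, which gave the author a much deeper understanding of both.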
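The two parameters are not independent. Patroni's documentation states that when changing `loop_wait`, `retry_timeout` or `ttl`, the rule `loop_wait + 2 * retry_timeout <= ttl` should be followed. A minimal sketch of that check is shown below. The function name `check_timing` is hypothetical, and the numbers are patroni's documented defaults rather than the values configured on this test cluster.

```python
def check_timing(ttl, loop_wait, retry_timeout):
    """Return True when loop_wait + 2 * retry_timeout <= ttl holds."""
    return loop_wait + 2 * retry_timeout <= ttl


# Patroni's documented defaults (assumed here, not read from this cluster).
print(check_timing(ttl=30, loop_wait=10, retry_timeout=10))
# Lowering ttl alone to speed up failover breaks the rule.
print(check_timing(ttl=20, loop_wait=10, retry_timeout=10))
```

The second call returns `False`: shortening `ttl` for faster failover only works if `loop_wait` or `retry_timeout` is reduced along with it.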