PostgreSQL 10 Streaming Replication with PGPool-II for HA Primary/Standby Failover

Notes

This only documents the setup and a brief test run.

No guarantee that a straight Ctrl+C / Ctrl+V copy-paste will run end to end.

Operating system: CentOS-7.6.1810 x86_64

PostgreSQL version: 10.8

PGPool version: 4.0.5

VM spec: 1 CPU, 2 GB RAM, 20 GB system disk

For the postgres user, the PGDATA variable defaults to /var/lib/pgsql/10/data

No separate data disk is used here; adjust as needed.

Service availability is provided by the watchdog switching a VIP via the ARP protocol; load balancing is not covered here!

Note: some public-cloud environments do not support ARP-based VIPs.

Architecture diagrams

(Figure: how PGPool manages the backend PostgreSQL databases internally)

(Figure: clients accessing PostgreSQL through PGPool)

Environment preparation

Host inventory

Hostname    IP address      Role      Port
vip         172.16.80.200   VIP       -
pg1         172.16.80.201   master    5432
pg2         172.16.80.202   slave     5432

Prepare a streaming-replication primary/standby cluster

See this link: PostgreSQL 10 streaming-replication primary/standby cluster setup

Edit the hosts file

Add static name resolution for every server's IP address on each host:

172.16.80.200 vip
172.16.80.201 pg1
172.16.80.202 pg2

Grant system-command permissions

The failover script needs root privileges to run ip and arping, so set the setuid bit on them:

chmod u+s /usr/sbin/ip /usr/sbin/arping
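If you want to confirm the setuid bit actually took effect, a small check like the following works (`has_setuid` is a hypothetical helper for illustration, not part of the original setup):

```python
import os
import stat

def has_setuid(path: str) -> bool:
    """Return True if the setuid bit (chmod u+s) is set on path."""
    return bool(os.stat(path).st_mode & stat.S_ISUID)

# On the pgpool hosts you would check the real binaries, e.g.:
# has_setuid('/usr/sbin/ip') and has_setuid('/usr/sbin/arping')
```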

Set up SSH keys

  • Set a password for the postgres user

passwd postgres

  • Switch to the postgres user

su - postgres

  • Generate an SSH key

ssh-keygen -t ecdsa -b 521 -N '' -f ~/.ssh/id_ecdsa

  • Configure passwordless SSH login

ssh-copy-id pg1
ssh-copy-id pg2

PGPool

Create the PGPool health-check user

create user pgpool_check with password 'pgpool_password';
grant all privileges on database postgres to pgpool_check;

Install PGPool

PGPool is available in the YUM repository provided by the PostgreSQL community, so just install it directly:

yum install -y pgpool-II-10-4.0.5-1.rhel7 pgpool-II-10-extensions-4.0.5-1.rhel7

PGPool also ships a WebUI for easier management; install it if you need it:

yum install -y pgpoolAdmin

Configure PGPool

The configuration directory is /etc/pgpool-II-10

Configure pcp.conf

pcp.conf holds the usernames and passwords for pgpool-II's own PCP management interface

  • Hash the password with the pg_md5 command

pg_md5 pgpool_password

  • Example output

4aa0cb9673e84b06d4c8a848c80eb5d0

  • Add the entry to pcp.conf

postgres:4aa0cb9673e84b06d4c8a848c80eb5d0
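In plain mode (no flags), pg_md5 should simply emit the MD5 hex digest of the password; a sketch of the equivalent computation (that equivalence is my assumption about pg_md5's plain mode, not stated in the original):

```python
import hashlib

def pcp_hash(password: str) -> str:
    """MD5 hex digest of the password, as pg_md5 prints in plain mode (assumed)."""
    return hashlib.md5(password.encode()).hexdigest()

# a ready-made pcp.conf line for the postgres PCP user
print(f"postgres:{pcp_hash('pgpool_password')}")
```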

Configure pool_hba.conf

pool_hba.conf plays the same role as pg_hba.conf in PostgreSQL, so you can copy the contents of pg_hba.conf over

# TYPE  DATABASE        USER            ADDRESS                 METHOD
# By default, connections from the database host itself (socket, 127.0.0.1, ::1) skip authentication and log in directly
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
host all all 172.16.80.0/24 md5
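The ADDRESS column of the host lines is matched by CIDR, first matching rule wins; a minimal sketch of that matching with Python's ipaddress module (a simplified illustration, not pool_hba's actual parser):

```python
from ipaddress import ip_address, ip_network

# (address, method) pairs in the order the host lines above appear (simplified)
RULES = [
    ("127.0.0.1/32", "trust"),
    ("::1/128", "trust"),
    ("172.16.80.0/24", "md5"),
]

def auth_method(client_ip: str) -> str:
    """Return the METHOD of the first rule whose network contains client_ip."""
    addr = ip_address(client_ip)
    for cidr, method in RULES:
        net = ip_network(cidr)
        if addr.version == net.version and addr in net:
            return method
    return "reject"  # no matching line: the connection is refused

print(auth_method("172.16.80.55"))  # a LAN client has to use md5
print(auth_method("127.0.0.1"))     # loopback is trusted
```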

Configure pgpool.conf

pgpool.conf can be based on /etc/pgpool-II-10/pgpool.conf.sample

A few points need attention!

master node

# ----------------------------
# pgPool-II configuration file
# ----------------------------

#------------------------------------------------------------------------------
# CONNECTIONS
#------------------------------------------------------------------------------
listen_addresses = '*'
port = 9999
socket_dir = '/tmp'
listen_backlog_multiplier = 2
serialize_accept = off
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/tmp'
# - Backend Connection Settings -
backend_hostname0 = 'pg1'
backend_port0 = 5432
# has no effect when load_balance_mode is off
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/10/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'pg2'
backend_port1 = 5432
# has no effect when load_balance_mode is off
backend_weight1 = 1
backend_data_directory1 = '/var/lib/pgsql/10/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
# - Authentication -
enable_pool_hba = on
# password file used by pool_hba.conf authentication
pool_passwd = 'pool_passwd'
authentication_timeout = 30
allow_clear_text_frontend_auth = off
# - SSL Connections -
ssl = off
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off

#------------------------------------------------------------------------------
# POOLS
#------------------------------------------------------------------------------

# - Concurrent session and pool size -
# num_init_children * max_pool is the number of concurrent client connections PGPool can accept; size it so that
# (max_pool * num_init_children) < (max_connections - superuser_reserved_connections)
num_init_children = 32
max_pool = 4
# - Life time -
child_life_time = 600
child_max_connections = 0
connection_life_time = 0
client_idle_limit = 0

#------------------------------------------------------------------------------
# LOGS
#------------------------------------------------------------------------------
# - Where to log -
log_destination = 'stderr'
# - What to log -
log_line_prefix = '%t PID=%p: '
log_connections = on
log_hostname = on
log_statement = on
log_per_node_statement = on
log_client_messages = on
log_standby_delay = 'none'

# - Syslog specific -
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'

# - Debug -
log_error_verbosity = verbose
#client_min_messages = notice
#log_min_messages = warning

#------------------------------------------------------------------------------
# FILE LOCATIONS
#------------------------------------------------------------------------------
pid_file_name = '/var/run/pgpool-II-10/pgpool.pid'
logdir = '/var/log/pgpool-II-10'

#------------------------------------------------------------------------------
# CONNECTION POOLING
#------------------------------------------------------------------------------
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
#reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'


#------------------------------------------------------------------------------
# REPLICATION MODE
#------------------------------------------------------------------------------
replication_mode = off
replicate_select = off
insert_lock = on
lobj_lock_table = ''

# - Degenerate handling -
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off

#------------------------------------------------------------------------------
# LOAD BALANCING MODE
#------------------------------------------------------------------------------
load_balance_mode = off
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'currval,lastval,nextval,setval'
black_query_pattern_list = ''
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
disable_load_balance_on_write = 'transaction'

#------------------------------------------------------------------------------
# MASTER/SLAVE MODE
#------------------------------------------------------------------------------
master_slave_mode = on
master_slave_sub_mode = 'stream'
# - Streaming -
sr_check_period = 3
sr_check_user = 'repluser'
sr_check_password = 'repluser_password'
sr_check_database = 'postgres'
# replication-delay threshold in WAL bytes; 0 disables the check ('always' is not a valid value here)
delay_threshold = 0
# - Special commands -
follow_master_command = ''

#------------------------------------------------------------------------------
# HEALTH CHECK GLOBAL PARAMETERS
#------------------------------------------------------------------------------
health_check_period = 10
health_check_timeout = 3
health_check_user = 'pgpool_check'
health_check_password = 'pgpool_password'
health_check_database = 'postgres'
health_check_max_retries = 3
health_check_retry_delay = 1
connect_timeout = 3

#------------------------------------------------------------------------------
# HEALTH CHECK PER NODE PARAMETERS (OPTIONAL)
#------------------------------------------------------------------------------
#health_check_period0 = 0
#health_check_timeout0 = 20
#health_check_user0 = 'nobody'
#health_check_password0 = ''
#health_check_database0 = ''
#health_check_max_retries0 = 0
#health_check_retry_delay0 = 1
#connect_timeout0 = 10000

#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------
failover_command = '/etc/pgpool-II-10/failover_stream.sh -d %d -h %h -p %p -D %D -M %M -m %m -H %H -P %P -r %r -R %R > /var/lib/pgsql/10/data/log/failover.log'
failback_command = ''
failover_on_backend_error = on
detach_false_primary = off
search_primary_node_timeout = 60

#------------------------------------------------------------------------------
# ONLINE RECOVERY
#------------------------------------------------------------------------------
recovery_user = 'nobody'
recovery_password = ''
recovery_1st_stage_command = ''
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0

#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------
# - Enabling -
use_watchdog = on
# - Connection to upstream servers -
trusted_servers = ''
ping_path = '/bin'
# - Watchdog communication Settings -
wd_hostname = 'pg1'
wd_port = 9000
wd_priority = 1
wd_authkey = 'pgpool_watchdog'
wd_ipc_socket_dir = '/tmp'

# - Virtual IP control Setting -
delegate_IP = '172.16.80.200'
if_cmd_path = '/usr/sbin'
if_up_cmd = 'ip addr add $_IP_$/24 dev ens33 label ens33:0'
if_down_cmd = 'ip addr del $_IP_$/24 dev ens33'
arping_path = '/usr/sbin'
arping_cmd = 'arping -U $_IP_$ -w 1'
# - Behavior on escalation Setting -
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''
# - Watchdog consensus settings for failover -
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
# - Lifecheck Setting -
# -- common --
wd_monitoring_interfaces_list = 'ens33'
wd_lifecheck_method = 'heartbeat'
wd_interval = 3
# -- heartbeat mode --
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'pg2'
heartbeat_destination_port0 = 9694
heartbeat_device0 = 'ens33'
# -- query mode --
wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''
# - Other pgpool Connection Settings -
other_pgpool_hostname0 = 'pg2'
other_pgpool_port0 = 9999
other_wd_port0 = 9000

#------------------------------------------------------------------------------
# OTHERS
#------------------------------------------------------------------------------
relcache_expire = 0
relcache_size = 256
check_temp_table = on
check_unlogged_table = on

#------------------------------------------------------------------------------
# IN MEMORY QUERY MEMORY CACHE
#------------------------------------------------------------------------------
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 67108864
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 409600
memqcache_cache_block_size = 1048576
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''
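The sizing rule noted next to num_init_children and max_pool can be checked with a few lines; the PostgreSQL defaults used below (max_connections = 100, superuser_reserved_connections = 3) are assumptions for illustration, not values from this setup:

```python
def pool_client_slots(num_init_children: int, max_pool: int) -> int:
    """Maximum concurrent client connections PGPool will accept."""
    return num_init_children * max_pool

def backend_can_serve(num_init_children: int, max_pool: int,
                      max_connections: int = 100,
                      superuser_reserved_connections: int = 3) -> bool:
    """True if every pooled connection can get a backend slot."""
    return (pool_client_slots(num_init_children, max_pool)
            < max_connections - superuser_reserved_connections)

# With the values in this pgpool.conf (32 * 4 = 128 slots), a stock
# PostgreSQL max_connections = 100 would be too small; raise it accordingly.
print(pool_client_slots(32, 4), backend_can_serve(32, 4))
```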

slave node

You can copy the master node's configuration as-is.

Just make sure the following entries are changed accordingly:

wd_hostname = 'pg2'
heartbeat_destination0 = 'pg1'
other_pgpool_hostname0 = 'pg1'

Generate the pool_passwd file

Add the passwords of the users that connect to the backend PostgreSQL databases to /etc/pgpool-II-10/pool_passwd

You will be prompted for a password; enter the password of the corresponding PostgreSQL database user

pg_md5 -p -m -u postgres pool_passwd
pg_md5 -p -m -u appuser pool_passwd

Adjust the permissions

chmod a+r /etc/pgpool-II-10/pool_passwd
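With -m, pg_md5 should store PostgreSQL-style md5 entries, i.e. "md5" followed by md5(password + username); a sketch of that entry format (my understanding of the md5 format, not stated in the original):

```python
import hashlib

def pool_passwd_entry(username: str, password: str) -> str:
    """Build a pool_passwd line in PostgreSQL md5 format (assumed: md5(password + username))."""
    digest = hashlib.md5((password + username).encode()).hexdigest()
    return f"{username}:md5{digest}"

print(pool_passwd_entry("appuser", "secret"))
```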

Create the failover script

/etc/pgpool-II-10/failover_stream.sh

#!/bin/bash
set -ex
# parse arguments
while getopts 'd:D:h:H:m:M:p:P:r:R:' OPT; do
case $OPT in
d)
NODE_ID="$OPTARG"
;;
D)
DATABASE_CLUSTER_PATH="$OPTARG"
;;
h)
HOSTNAME="$OPTARG"
;;
H)
NEW_MASTER_HOSTNAME="$OPTARG"
;;
m)
NEW_MASTER_NODE_ID="$OPTARG"
;;
M)
OLD_MASTER_NODE_ID="$OPTARG"
;;
p)
PORT_NUM="$OPTARG"
;;
P)
OLD_PRIMARY_NODE_ID="$OPTARG"
;;
r)
NEW_MASTER_PORT_NUM="$OPTARG"
;;
R)
NEW_MASTER_DATABASE_CLUSTER_PATH="$OPTARG"
;;
?)
echo "Usage: `basename $0` -d NODE_ID -D DATABASE_CLUSTER_PATH -h HOSTNAME -H NEW_MASTER_NODE_HOSTNAME -m NEW_MASTER_NODE_ID -M OLD_MASTER_NODE_ID -p PORT_NUM -P OLD_PRIMARY_NODE_ID -r NEW_MASTER_PORT_NUM -R NEW_MASTER_DATABASE_PATH"
exit 1
;;
esac
done

# dump all the variables for the log
echo "HOSTNAME: ${HOSTNAME}"
echo "NODE_ID: ${NODE_ID}"
echo "PORT_NUM: ${PORT_NUM}"
echo "DATABASE_CLUSTER_PATH: ${DATABASE_CLUSTER_PATH}"
echo "OLD_MASTER_NODE_ID: ${OLD_MASTER_NODE_ID}"
echo "OLD_PRIMARY_NODE_ID: ${OLD_PRIMARY_NODE_ID}"
echo "NEW_MASTER_HOSTNAME: ${NEW_MASTER_HOSTNAME}"
echo "NEW_MASTER_NODE_ID: ${NEW_MASTER_NODE_ID}"
echo "NEW_MASTER_PORT_NUM: ${NEW_MASTER_PORT_NUM}"
echo "NEW_MASTER_DATABASE_CLUSTER_PATH: ${NEW_MASTER_DATABASE_CLUSTER_PATH}"

# PostgreSQL home directory, default is /usr/pgsql-10
PGHOME='/usr/pgsql-10'
# PGDATA defaults to /var/lib/pgsql/10/data; here we use the backend_data_directory defined in pgpool.conf
PGDATA=${NEW_MASTER_DATABASE_CLUSTER_PATH}
# assemble the promote command
TRIGGER_COMMAND="$PGHOME/bin/pg_ctl promote -D $PGDATA"
# pg_ctl promote makes the standby server exit recovery and switch to read/write mode
/usr/bin/ssh -T ${NEW_MASTER_HOSTNAME} ${TRIGGER_COMMAND}
exit 0;
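The flag-to-placeholder mapping in the script can be exercised in isolation; a minimal sketch (the parse function below is a stripped-down illustration, not part of the real script):

```shell
#!/bin/bash
# Re-implements just the option parsing of failover_stream.sh to show
# which pgpool placeholder lands in which variable.
parse() {
    local OPT OPTIND
    local NODE_ID='' NEW_MASTER_HOSTNAME=''
    while getopts 'd:D:h:H:m:M:p:P:r:R:' OPT; do
        case $OPT in
            d) NODE_ID="$OPTARG" ;;              # %d: failed node id
            H) NEW_MASTER_HOSTNAME="$OPTARG" ;;  # %H: new master hostname
        esac
    done
    echo "node=$NODE_ID new_master=$NEW_MASTER_HOSTNAME"
}

parse -d 0 -h pg1 -H pg2   # prints: node=0 new_master=pg2
```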

Add execute permission

chmod a+rx /etc/pgpool-II-10/failover_stream.sh

Start PGPool

systemctl enable pgpool-II-10.service
systemctl start pgpool-II-10.service

Verify PGPool

  • Connect to PGPool with psql

psql -U postgres -h vip -p 9999

  • List the PGPool nodes
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
0 | pg1 | 5432 | up | 0.500000 | primary | 2 | true | 0 | 2019-05-26 20:40:44
1 | pg2 | 5432 | up | 0.500000 | standby | 0 | false | 0 | 2019-05-26 20:40:44
(2 rows)

PGPool node reboot

  • Find the host currently holding the watchdog VIP

ip address

  • Check the PGPool node info before the reboot
postgres=# show pool_nodes;
-[ RECORD 1 ]------+--------------------
node_id | 0
hostname | pg1
port | 5432
status | up
lb_weight | 0.500000
role | primary
select_cnt | 7
load_balance_node | true
replication_delay | 0
last_status_change | 2019-05-26 20:40:44
-[ RECORD 2 ]------+--------------------
node_id | 1
hostname | pg2
port | 5432
status | up
lb_weight | 0.500000
role | standby
select_cnt | 0
load_balance_node | false
replication_delay | 0
last_status_change | 2019-05-26 20:40:44
  • Log in to the host holding the VIP and reboot it directly

reboot
  • Check the node info again

After the host holding the VIP has rebooted:

the SQL connection reports that it is unavailable,

then, driven by the watchdog, the VIP automatically fails over to the other PGPool node,

and the SQL connection reconnects successfully!

postgres=# show pool_nodes;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=# show pool_nodes;
-[ RECORD 1 ]------+--------------------
node_id | 0
hostname | pg1
port | 5432
status | up
lb_weight | 0.500000
role | primary
select_cnt | 0
load_balance_node | true
replication_delay | 0
last_status_change | 2019-05-26 20:46:48
-[ RECORD 2 ]------+--------------------
node_id | 1
hostname | pg2
port | 5432
status | down
lb_weight | 0.500000
role | standby
select_cnt | 0
load_balance_node | false
replication_delay | 0
last_status_change | 2019-05-26 20:46:42

PGPool process killed

  • Find the host currently holding the watchdog VIP

ip address

  • Check the PGPool node info
postgres=# \x
Expanded display is on.
postgres=# show pool_nodes;
-[ RECORD 1 ]------+--------------------
node_id | 0
hostname | pg1
port | 5432
status | up
lb_weight | 0.500000
role | primary
select_cnt | 0
load_balance_node | true
replication_delay | 0
last_status_change | 2019-05-26 20:46:48
-[ RECORD 2 ]------+--------------------
node_id | 1
hostname | pg2
port | 5432
status | down
lb_weight | 0.500000
role | standby
select_cnt | 0
load_balance_node | false
replication_delay | 0
last_status_change | 2019-05-26 20:46:42
  • Find the PGPool process

ps -ef | grep pgpool
# sample output
postgres 12641 1 0 20:39 ? 00:00:00 /usr/pgpool-10/bin/pgpool -f /etc/pgpool-II-10/pgpool.conf -n -D

  • Kill the PGPool process

kill -9 12641

  • PGPool process log

You can see that PGPool's watchdog reacts almost immediately:

May 26 20:53:14 pg2 pgpool[9607]: LOG:  watchdog node state changed from [STANDBY] to [JOINING]
May 26 20:53:14 pg2 pgpool[9607]: LOCATION: watchdog.c:6360
May 26 20:53:18 pg2 pgpool[9607]: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
May 26 20:53:18 pg2 pgpool[9607]: LOCATION: watchdog.c:6360
May 26 20:53:19 pg2 pgpool[9607]: LOG: I am the only alive node in the watchdog cluster
May 26 20:53:19 pg2 pgpool[9607]: HINT: skipping stand for coordinator state
May 26 20:53:19 pg2 pgpool[9607]: LOCATION: watchdog.c:5231
May 26 20:53:19 pg2 pgpool[9607]: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]
May 26 20:53:19 pg2 pgpool[9607]: LOCATION: watchdog.c:6360
May 26 20:53:19 pg2 pgpool[9607]: LOG: I am announcing my self as master/coordinator watchdog node
May 26 20:53:19 pg2 pgpool[9607]: LOCATION: watchdog.c:5420
May 26 20:53:23 pg2 pgpool[9607]: LOG: I am the cluster leader node
May 26 20:53:23 pg2 pgpool[9607]: DETAIL: our declare coordinator message is accepted by all nodes
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: watchdog.c:5454
May 26 20:53:23 pg2 pgpool[9607]: LOG: setting the local node "pg2:9999 Linux pg2" as watchdog cluster master
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: watchdog.c:7087
May 26 20:53:23 pg2 pgpool[9607]: LOG: I am the cluster leader node. Starting escalation process
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: watchdog.c:5473
May 26 20:53:23 pg2 pgpool[9607]: LOG: escalation process started with PID:10118
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: watchdog.c:6000
May 26 20:53:23 pg2 pgpool[9607]: LOG: new IPC connection received
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: watchdog.c:3147
May 26 20:53:23 pg2 pgpool[9607]: LOG: watchdog: escalation started
May 26 20:53:23 pg2 pgpool[9607]: LOCATION: wd_escalation.c:93
May 26 20:53:27 pg2 pgpool[9607]: LOG: successfully acquired the delegate IP:"172.16.80.200"
May 26 20:53:27 pg2 pgpool[9607]: DETAIL: 'if_up_cmd' returned with success
May 26 20:53:27 pg2 pgpool[9607]: LOCATION: wd_if.c:169
May 26 20:53:27 pg2 pgpool[9607]: LOG: watchdog escalation process with pid: 10118 exit with SUCCESS.
May 26 20:53:27 pg2 pgpool[9607]: LOCATION: watchdog.c:2976

Simulate a master node outage

  • Check the node info
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
0 | pg1 | 5432 | up | 0.500000 | primary | 0 | true | 0 | 2019-05-26 20:58:11
1 | pg2 | 5432 | up | 0.500000 | standby | 0 | false | 0 | 2019-05-26 20:59:22
(2 rows)
  • Power off the master node directly

shutdown -h now

  • Check the PGPool log
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG:  new IPC connection received
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:3147
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: watchdog received the failover command from local pgpool-II on IPC interface
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:2570
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:2491
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: we have got the consensus to perform the failover
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: DETAIL: 1 node(s) voted in the favor
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:2363
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOG: Pgpool-II parent process has received failover request
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOCATION: pgpool_main.c:1594
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: new IPC connection received
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:3147
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: received the failover indication from Pgpool-II on IPC interface
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:2716
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOG: watchdog is informed of failover start by the main process
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10801 USER=[No Connection] DB=[No Connection]: LOCATION: watchdog.c:2784
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOG: starting degeneration. shutdown host pg1(5432)
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOCATION: pgpool_main.c:1873
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOG: Restart all children
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOCATION: pgpool_main.c:2023
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOG: execute command: /etc/pgpool-II-10/failover_stream.sh 0 pg1 5432 /var/lib/pgsql/10/data 0 1 pg2 0 5432 /var/lib/pgsql/10/data > /var/lib/pgsql/10/data/log/failover.log
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10800 USER=[No Connection] DB=[No Connection]: LOCATION: pgpool_main.c:3070
May 26 21:06:02 pg2 pgpool[10800]: + getopts d:D:h:H:m:M:p:P:r:R: OPT
May 26 21:06:02 pg2 pgpool[10800]: + echo 'HOSTNAME: pg2'
May 26 21:06:02 pg2 pgpool[10800]: + echo 'NODE_ID: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'PORT_NUM: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'DATABASE_CLUSTER_PATH: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'OLD_MASTER_NODE_ID: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'OLD_PRIMARY_NODE_ID: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'NEW_MASTER_HOSTNAME: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'NEW_MASTER_NODE_ID: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'NEW_MASTER_PORT_NUM: '
May 26 21:06:02 pg2 pgpool[10800]: + echo 'NEW_MASTER_DATABASE_CLUSTER_PATH: '
May 26 21:06:02 pg2 pgpool[10800]: + PGHOME=/usr/pgsql-10
May 26 21:06:02 pg2 pgpool[10800]: + PGDATA=
May 26 21:06:02 pg2 pgpool[10800]: + TRIGGER_COMMAND='/usr/pgsql-10/bin/pg_ctl promote -D '
May 26 21:06:02 pg2 pgpool[10800]: + /usr/bin/ssh -T /usr/pgsql-10/bin/pg_ctl promote -D
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10871 USER=postgres DB=postgres: WARNING: failover/failback is in progress
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10871 USER=postgres DB=postgres: DETAIL: executing failover or failback on backend
May 26 21:06:02 pg2 pgpool[10800]: 2019-05-26 21:06:02 PID=10871 USER=postgres DB=postgres: HINT: In a moment you should be able to reconnect to the database
  • Check the PGPool nodes

You can see that PGPool automatically promoted the standby node to primary:
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
0 | pg1 | 5432 | down | 0.500000 | standby | 0 | false | 0 | 2019-05-26 21:09:59
1 | pg2 | 5432 | up | 0.500000 | primary | 0 | true | 0 | 2019-05-26 21:12:56
(2 rows)
  • Start the old master node again

  • On the old master, log in to its local PostgreSQL

psql -U postgres

  • Check pg_is_in_recovery()

On the old master, PostgreSQL reports f: it has not automatically become a standby of pg2

postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
  • Fix it manually

Switch user:

su - postgres
cd $PGDATA

Create a recovery.conf file:

# pin the timeline to latest to keep primary/standby divergence as small as possible
recovery_target_timeline = 'latest'
# keeps PostgreSQL in continuous recovery:
# it keeps requesting WAL from the primary defined in primary_conninfo
# and replays that WAL into the local database
standby_mode = on
# connection settings for the primary
primary_conninfo = 'host=pg2 port=5432 user=repluser password=repluser_password'
# this file triggers the end of streaming replication; do not create it by hand
trigger_file = '/var/lib/pgsql/10/data/pg.trigger'

Restart PostgreSQL:

pg_ctl restart

Log in to the database and check the state again; it has now changed to t:

postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

Manually rejoin the PGPool cluster

When prompted, enter the password of the user defined earlier in pcp.conf

  • Parameters:
    • -h: the PCP IP address; use the VIP here
    • -p: the PCP port, 9898 by default
    • -U: the PCP user, as defined in /etc/pgpool-II-10/pcp.conf
    • -n: the node ID, which you can look up with show pool_nodes
    • -d: print debug output

pcp_attach_node -U postgres -h vip -p 9898 -n 0 -d
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4

Check the PGPool node status: pg1 shows waiting at first, and turns to up after a short while

postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change
---------+----------+------+---------+-----------+---------+------------+-------------------+-------------------+---------------------
0 | pg1 | 5432 | up | 0.500000 | standby | 0 | false | 4200 | 2019-05-26 21:19:05
1 | pg2 | 5432 | up | 0.500000 | primary | 0 | true | 0 | 2019-05-26 21:12:56
(2 rows)

Maintenance operations

A PGPool node shows down

  • Check whether PostgreSQL started correctly
  • Check whether $PGDATA contains recovery.conf or recovery.done; normally only one of the two exists
    • recovery.done makes the database start as a primary
    • recovery.conf makes the database start as a standby
    • on failover, recovery.conf is renamed to recovery.done
  • If you have confirmed the down node should come back as a standby, rename recovery.done to recovery.conf and restart the database
    • verify that it can replicate from the new primary

A PGPool node shows waiting

  • The node is still catching up on replication
  • PGPool is still spawning child processes to connect to the node; the more children, the longer this takes
  • If it stays waiting for a long time, try restarting the PGPool service

Re-adding a node to PGPool

If both primary and standby databases are healthy but the node status is still down, you need to attach the node to the cluster by hand

  • Run pcp_attach_node and enter the password defined in pcp.conf to rejoin the node
  • Parameters:
    • -h: the PCP IP address; use the VIP here
    • -p: the PCP port, 9898 by default
    • -U: the PCP user, as defined in /etc/pgpool-II-10/pcp.conf
    • -n: the node ID, which you can look up with show pool_nodes
    • -d: print debug output

pcp_attach_node -h vip -p 9898 -U postgres -n NODE_ID -d
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4