Redis and MySQL high availability with consul

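consul is an open-source tool from HashiCorp (the company that also developed Vagrant). It is written in Go, lightweight, and provides service discovery and configuration for distributed systems. Compared with similar products, it offers a more "one-stop" solution: consul has a built-in KV store, service registration/discovery, health checks, an HTTP + DNS API, a Web UI, and more. Other mainstream open-source products for service discovery and configuration are ZooKeeper and etcd.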

Its typical deployment structure is shown below:

http://www.toname.cn/wp-content/uploads/2018/11/6e3fff84056f69d03fdd6e51e78dad5b.png

Installation and usage:

Test environment (in production, deploy 3 or 5 consul servers):

consul server:192.168.0.10

consul client:192.168.0.20,192.168.0.30,192.168.0.40

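consul is very easy to install: download it from https://www.consul.io/downloads.html and unpack it. It is a single binary with no other files and no dependencies. I am using version 0.9.2 here. After downloading, unpack the file into /usr/local/bin and it is ready to use. Install it on all four servers listed above.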

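Create the directories on all four machines: one for the configuration files, one for the data, and one for the redis and mysql health check scripts.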

mkdir /etc/consul.d/ -p && mkdir /data/consul/ -p

mkdir /data/consul/shell -p

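Then write the relevant parameters into a configuration file. This is optional, since the parameters can also be passed directly on the command line, but that is harder to manage.
Configuration file of the consul server (192.168.0.10); for the meaning of each parameter, see the official documentation or the reference links at the end of this article: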

[root@db-server-yayun-01 ~]# cat /etc/consul.d/server.json
{
  "data_dir": "/data/consul",
  "datacenter": "dc1",
  "log_level": "INFO",
  "server": true,
  "bootstrap_expect": 1,
  "bind_addr": "192.168.0.10",
  "client_addr": "192.168.0.10",
  "ui": true
}
[root@db-server-yayun-01 ~]#

Configuration file of the consul clients (192.168.0.20, 192.168.0.30, 192.168.0.40):

[root@db-server-yayun-02 ~]# cat /etc/consul.d/client.json
{
  "data_dir": "/data/consul",
  "enable_script_checks": true,
  "bind_addr": "192.168.0.20",
  "retry_join": ["192.168.0.10"],
  "retry_interval": "30s",
  "rejoin_after_leave": true,
  "start_join": ["192.168.0.10"]
}
[root@db-server-yayun-02 ~]#

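The configuration files of the three clients are almost identical; the only difference is bind_addr, which must be set to each server's own IP. My test environment consists of virtual machines with multiple network interfaces, so the address has to be specified explicitly; otherwise you could bind to 0.0.0.0.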
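
Since only bind_addr changes from host to host, the client file can be generated rather than edited by hand. A minimal sketch, assuming the same settings as the client.json shown above; the function name write_client_config and the scratch output directory are illustrative and not part of the original setup:

```shell
#!/bin/bash
# Hypothetical helper: render client.json for one host.
# $1 = output directory, $2 = this host's bind address, $3 = consul server address
write_client_config() {
    local dir=$1 bind_ip=$2 server_ip=$3
    mkdir -p "$dir"
    cat > "$dir/client.json" <<EOF
{
  "data_dir": "/data/consul",
  "enable_script_checks": true,
  "bind_addr": "$bind_ip",
  "retry_join": ["$server_ip"],
  "retry_interval": "30s",
  "rejoin_after_leave": true,
  "start_join": ["$server_ip"]
}
EOF
}

# Example: render the file for 192.168.0.30 into a scratch directory.
out_dir=$(mktemp -d)
write_client_config "$out_dir" 192.168.0.30 192.168.0.10
cat "$out_dir/client.json"
```

On a real host the first argument would be /etc/consul.d.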
Now start the consul server first:

nohup consul agent -config-dir=/etc/consul.d > /data/consul/consul.log &

查看日志:

[root@db-server-yayun-01 consul]# cat consul.log
==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> Starting Consul agent...
==> Consul agent running!
    Version: 'v0.9.2'
    Node ID: '5e612623-ec5b-386c-19be-d38876a9a46f'
    Node name: 'db-server-yayun-01'
    Datacenter: 'dc1'
    Server: true (bootstrap: true)
    Client Addr: 192.168.0.10 (HTTP: 8500, HTTPS: -1, DNS: 8600)
    Cluster Addr: 192.168.0.10 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
    2017/12/09 09:49:53 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:192.168.0.10:8300 Address:192.168.0.10:8300}]
    2017/12/09 09:49:53 [INFO] raft: Node at 192.168.0.10:8300 [Follower] entering Follower state (Leader: "")
    2017/12/09 09:49:53 [INFO] serf: EventMemberJoin: db-server-yayun-01.dc1 192.168.0.10
    2017/12/09 09:49:53 [INFO] serf: EventMemberJoin: db-server-yayun-01 192.168.0.10
    2017/12/09 09:49:53 [INFO] agent: Started DNS server 192.168.0.10:8600 (udp)
    2017/12/09 09:49:53 [INFO] consul: Adding LAN server db-server-yayun-01 (Addr: tcp/192.168.0.10:8300) (DC: dc1)
    2017/12/09 09:49:53 [INFO] consul: Handled member-join event for server "db-server-yayun-01.dc1" in area "wan"
    2017/12/09 09:49:53 [INFO] agent: Started DNS server 192.168.0.10:8600 (tcp)
    2017/12/09 09:49:53 [INFO] agent: Started HTTP server on 192.168.0.10:8500
    2017/12/09 09:50:00 [ERR] agent: failed to sync remote state: No cluster leader
    2017/12/09 09:50:00 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/12/09 09:50:00 [INFO] raft: Node at 192.168.0.10:8300 [Candidate] entering Candidate state in term 2
    2017/12/09 09:50:00 [INFO] raft: Election won. Tally: 1
    2017/12/09 09:50:00 [INFO] raft: Node at 192.168.0.10:8300 [Leader] entering Leader state
    2017/12/09 09:50:00 [INFO] consul: cluster leadership acquired
    2017/12/09 09:50:00 [INFO] consul: New leader elected: db-server-yayun-01
    2017/12/09 09:50:00 [INFO] consul: member 'db-server-yayun-01' joined, marking health alive
    2017/12/09 09:50:03 [INFO] agent: Synced node info

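The log shows (HTTP: 8500, HTTPS: -1, DNS: 8600): the HTTP port defaults to 8500 and is used by reload and the web UI; the DNS port is 8600 and is used for DNS resolution. The log also shows that this machine is the leader (consul: New leader elected: db-server-yayun-01), because it is the only server. That is why a production environment must run 3 or 5 servers.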
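
The reason for 3 or 5 servers is Raft's majority rule: a cluster of n servers needs floor(n/2)+1 of them to elect a leader and commit writes. A small sketch of that arithmetic (the function names are illustrative):

```shell
#!/bin/bash
# Number of servers that must agree (a strict majority of n).
raft_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while the cluster stays available.
raft_fault_tolerance() {
    echo $(( $1 - $(raft_quorum "$1") ))
}

for n in 1 2 3 4 5; do
    echo "servers=$n quorum=$(raft_quorum "$n") tolerated_failures=$(raft_fault_tolerance "$n")"
done
```

An even server count tolerates no more failures than the odd count just below it, which is why 3 or 5 is the usual choice.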

Next, start the three clients; the start command is the same on all three. Then look at the log of one of them:

nohup consul agent -config-dir=/etc/consul.d > /data/consul/consul.log &

[root@db-server-yayun-02 consul]# cat /data/consul/consul.log
==> Starting Consul agent...
==> Joining cluster...
    Join completed. Synced with 1 initial agents
==> Consul agent running!
    Version: 'v0.9.2'
    Node ID: '0ec901ab-6c66-2461-95e6-50a77a28ed72'
    Node name: 'db-server-yayun-02'
    Datacenter: 'dc1'
    Server: false (bootstrap: false)
    Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600)
    Cluster Addr: 192.168.0.20 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
    2017/12/09 10:06:10 [INFO] serf: EventMemberJoin: db-server-yayun-02 192.168.0.20
    2017/12/09 10:06:10 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2017/12/09 10:06:10 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2017/12/09 10:06:10 [INFO] agent: Started HTTP server on 127.0.0.1:8500
    2017/12/09 10:06:10 [INFO] agent: (LAN) joining: [192.168.0.10]
    2017/12/09 10:06:10 [INFO] agent: Retry join is supported for: aws azure gce softlayer
    2017/12/09 10:06:10 [INFO] agent: Joining cluster...
    2017/12/09 10:06:10 [INFO] agent: (LAN) joining: [192.168.0.10]
    2017/12/09 10:06:10 [INFO] serf: EventMemberJoin: db-server-yayun-01 192.168.0.10
    2017/12/09 10:06:10 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2017/12/09 10:06:10 [INFO] consul: adding server db-server-yayun-01 (Addr: tcp/192.168.0.10:8300) (DC: dc1)
    2017/12/09 10:06:10 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2017/12/09 10:06:10 [INFO] agent: Join completed. Synced with 1 initial agents
    2017/12/09 10:06:10 [INFO] agent: Synced node info

The log shows agent: Join completed. Synced with 1 initial agents, as well as Server: false (bootstrap: false). That is the difference between a client and a server.
Now run the following commands to look at the cluster:

[root@db-server-yayun-02 ~]# consul members
Node                Address            Status  Type    Build  Protocol  DC
db-server-yayun-01  192.168.0.10:8301  alive   server  0.9.2  2         dc1
db-server-yayun-02  192.168.0.20:8301  alive   client  0.9.2  2         dc1
db-server-yayun-03  192.168.0.30:8301  alive   client  0.9.2  2         dc1
db-server-yayun-04  192.168.0.40:8301  alive   client  0.9.2  2         dc1
[root@db-server-yayun-02 ~]#
[root@db-server-yayun-02 ~]# consul operator raft list-peers
Node                ID                 Address            State   Voter  RaftProtocol
db-server-yayun-01  192.168.0.10:8300  192.168.0.10:8300  leader  true   2
[root@db-server-yayun-02 ~]#
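
For scripted checks, the consul members table can be summarized with awk. A minimal sketch; the function name count_alive is illustrative, and the sample input is the output shown above, fed in as text so that the snippet runs without a cluster:

```shell
#!/bin/bash
# Count the members of a given type whose status is "alive".
# Reads `consul members` output on stdin; columns are Node, Address, Status, Type, ...
count_alive() {
    awk -v type="$1" 'NR > 1 && $3 == "alive" && $4 == type { n++ } END { print n + 0 }'
}

sample='Node                Address            Status  Type    Build  Protocol  DC
db-server-yayun-01  192.168.0.10:8301  alive   server  0.9.2  2         dc1
db-server-yayun-02  192.168.0.20:8301  alive   client  0.9.2  2         dc1
db-server-yayun-03  192.168.0.30:8301  alive   client  0.9.2  2         dc1
db-server-yayun-04  192.168.0.40:8301  alive   client  0.9.2  2         dc1'

echo "servers: $(printf '%s\n' "$sample" | count_alive server)"
echo "clients: $(printf '%s\n' "$sample" | count_alive client)"
```

Against a live cluster the same function would be used as consul members | count_alive client.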

Now let's look at the web UI. consul ships with its own UI, which is very lightweight. Open: http://192.168.0.10:8500/ui/

https://images2017.cnblogs.com/blog/609710/201712/609710-20171209101554468-1208752157.png

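At this point the consul cluster is up and running. Simple, isn't it? However, as the screenshot above shows, the client nodes have not registered any services yet (0 services). That is what the rest of this article covers: how exactly do we get high availability for redis and mysql? Let's begin.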

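Consul use case 1 (redis sentinel)
(1) In a Redis sentinel architecture, sentinels are deployed on the servers, but the application teams do not use the jedis sentinel driver at the app level to discover the Redis master automatically; they connect directly to the master's IP. When the master dies and another redis node takes over as the new master, the application configuration has to be changed by hand to point to the new master.
(2) The Redis client driver has no read/write splitting configuration yet, so there is currently no good way to load-balance reads across slaves. Our applications all support read/write splitting, so this does not affect us.
(3) Consul can meet both needs by configuring two DNS services. One is the master service, which uses consul's own health checking and probing to discover the new master automatically. The other is a slave service, which relies on DNS itself to round-robin across the IPs of the redis instances in the slave role.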

The architecture is as follows:

https://images2017.cnblogs.com/blog/609710/201712/609710-20171209102332984-1109262425.png

 

MySQL can be made highly available in the same way; MHA plays the same role as sentinel. The architecture is as follows:

https://images2017.cnblogs.com/blog/609710/201712/609710-20171209115744777-856648252.png

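Below I describe how redis high availability is implemented. I will not go through mysql in detail, but I will post the health check scripts used for mysql; the approach is the same.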

Consul service definition (Redis)

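The consul cluster is already set up above: the server is 192.168.0.10 and the clients are .20 through .40. We use .20 as the redis master and .30 and .40 as redis slaves. Now define the services (both definitions must exist on .20, .30 and .40):

The configuration files on .20, .30 and .40 are as follows; apart from address, which must be changed to the address of the respective server, they are identical.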

[root@db-server-yayun-02 consul.d]# pwd
/etc/consul.d
[root@db-server-yayun-02 consul.d]# ll
total 12
-rw-r--r--. 1 root root 221 Dec 9 09:44 client.json
-rw-r--r--. 1 root root 319 Dec 9 10:48 r-6029-redis-test.json
-rw-r--r--. 1 root root 321 Dec 9 10:48 w-6029-redis-test.json
[root@db-server-yayun-02 consul.d]#

Service definition file for the master:

[root@db-server-yayun-02 consul.d]# cat w-6029-redis-test.json
{
  "services": [
    {
      "name": "w-6029-redis-test",
      "tags": [
        "master-test-6029"
      ],
      "address": "192.168.0.20",
      "port": 6029,
      "checks": [
        {
          "script": "/data/consul/shell/check_redis_master.sh 6029 ",
          "interval": "15s"
        }
      ]
    }
  ]
}
[root@db-server-yayun-02 consul.d]#

Service definition file for the slave:

[root@db-server-yayun-02 consul.d]# cat r-6029-redis-test.json
{
  "services": [
    {
      "name": "r-6029-redis-test",
      "tags": [
        "slave-test-6029"
      ],
      "address": "192.168.0.20",
      "port": 6029,
      "checks": [
        {
          "script": "/data/consul/shell/check_redis_slave.sh 6029 ",
          "interval": "15s"
        }
      ]
    }
  ]
}
[root@db-server-yayun-02 consul.d]#

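Once every agent has registered, there are two corresponding domain names:
w-6029-redis-test.service.consul (resolves to the single master IP)
r-6029-redis-test.service.consul (resolves to the two slave IPs; each client request gets one at random)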

Here "script": "/data/consul/shell/check_redis_slave.sh 6029 " means that a health check is run against redis on port 6029. For more on health checks, see the official documentation.

[root@db-server-yayun-03 shell]# pwd
/data/consul/shell
[root@db-server-yayun-03 shell]# ll
total 16
-rwxr-xr-x. 1 root root 480 Dec 9 10:56 check_mysql_master.sh
-rwxr-xr-x. 1 root root 3004 Dec 9 10:55 check_mysql_slave.sh
-rwxr-xr-x. 1 root root 254 Dec 9 10:51 check_redis_master.sh
-rwxr-xr-x. 1 root root 379 Dec 9 10:51 check_redis_slave.sh
[root@db-server-yayun-03 shell]#

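The /data/consul/shell directory contains four scripts, used for the redis and mysql health checks. The scripts are fairly simple. Roughly: if there is only a master, both reads and writes go to the master; if slaves are available, reads go to the slaves. If replication on a slave is broken or lagging, the slave service will not be registered.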
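
The slave rule described above (no reads from a slave whose replication is broken or lagging) could be expressed for MySQL roughly as follows. This is an illustration only and not the contents of check_mysql_slave.sh: the function name, the 30-second lag threshold, and the choice to parse SHOW SLAVE STATUS\G text are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch: decide slave health from `SHOW SLAVE STATUS\G` text on stdin.
# $1 = maximum tolerated Seconds_Behind_Master (assumed default: 30).
# Returns 0 (passing) or 2 (critical), the exit codes a consul script check reads.
mysql_slave_ok() {
    local max_lag=${1:-30} status io sql lag
    status=$(cat)
    io=$(printf '%s\n' "$status" | awk -F': ' '/Slave_IO_Running:/ { print $2 }')
    sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running:/ { print $2 }')
    lag=$(printf '%s\n' "$status" | awk -F': ' '/Seconds_Behind_Master:/ { print $2 }')
    # Both replication threads must be running.
    [ "$io" = "Yes" ] && [ "$sql" = "Yes" ] || return 2
    # NULL, empty or non-numeric lag means replication is not healthy.
    case $lag in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$lag" -le "$max_lag" ] || return 2
    return 0
}

# Sample input standing in for real mysql output.
healthy='  Slave_IO_Running: Yes
  Slave_SQL_Running: Yes
  Seconds_Behind_Master: 3'
if printf '%s\n' "$healthy" | mysql_slave_ok 30; then
    echo "slave passes"
fi
```

In a real check the input would come from something like mysql -e 'SHOW SLAVE STATUS\G' piped into mysql_slave_ok, with the function's return value used as the script's exit code.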

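[root@db-server-yayun-03 shell]# cat check_redis_master.sh
#!/bin/bash
# Usage: check_redis_master.sh <port> [password]
myport=$1
auth=$2
# No password given: use a placeholder so that -a still receives an argument.
if [ ! -n "$auth" ]
then
    auth='\"\"'
fi
comm="/usr/local/bin/redis-cli -p $myport -a $auth "
# Count the role:master lines in INFO Replication (1 on a master, 0 otherwise).
role=`echo 'INFO Replication'|$comm |grep -Ec 'role:master'`
# Print the replication info so that it appears as the check output in consul.
echo 'INFO Replication'|$comm
# Exit code 2 marks the check as critical.
if [ $role -ne 1 ]
then
    exit 2
fi
[root@db-server-yayun-03 shell]#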

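[root@db-server-yayun-03 shell]# cat check_redis_slave.sh
#!/bin/bash
# Usage: check_redis_slave.sh <port> [password]
myport=$1
auth=$2
# No password given: use a placeholder so that -a still receives an argument.
if [ ! -n "$auth" ]
then
    auth='\"\"'
fi
comm="/usr/local/bin/redis-cli -p $myport -a $auth "
# 2 when this instance is a slave whose link to the master is up.
role=`echo 'INFO Replication'|$comm |grep -Ec '^role:slave|^master_link_status:up'`
# 2 when this instance is a master with no connected slaves.
single=`echo 'INFO Replication'|$comm |grep -Ec '^role:master|^connected_slaves:0'`
# Print the replication info so that it appears as the check output in consul.
echo 'INFO Replication'|$comm
# Critical unless this is a healthy slave or a master without any slaves.
if [ $role -ne 2 -a $single -ne 2 ]
then
    exit 2
fi
[root@db-server-yayun-03 shell]#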
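
The decision made by check_redis_slave.sh can be isolated into a function that only looks at INFO replication text, which makes the rule easy to exercise without a running redis. A sketch; the function name read_check is illustrative, and the grep patterns are the ones from the script above:

```shell
#!/bin/bash
# Decide whether an instance should serve reads, given `INFO replication`
# text on stdin. Returns 0 (passing) or 2 (critical).
read_check() {
    local info role_hits single_hits
    info=$(tr -d '\r')    # redis-cli output uses CRLF line endings
    role_hits=$(printf '%s\n' "$info" | grep -Ec '^role:slave|^master_link_status:up')
    single_hits=$(printf '%s\n' "$info" | grep -Ec '^role:master|^connected_slaves:0')
    if [ "$role_hits" -eq 2 ] || [ "$single_hits" -eq 2 ]; then
        return 0
    fi
    return 2
}

# A master that still has slaves must not answer for the read service.
if printf 'role:master\r\nconnected_slaves:2\r\n' | read_check; then
    echo "master with slaves: serves reads"
else
    echo "master with slaves: does not serve reads"
fi

# A master with no slaves left takes over the reads as well.
if printf 'role:master\r\nconnected_slaves:0\r\n' | read_check; then
    echo "master alone: serves reads"
else
    echo "master alone: does not serve reads"
fi
```

Consul treats exit code 0 as passing, 1 as warning, and any other code as critical, so returning 2 takes the instance out of the DNS answers for the read service.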

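"name": "r-6029-redis-test" is the service name, which becomes the domain name; the default suffix is service.consul, and the consul part can be changed with the domain parameter. Once the configuration files are in place, install redis and set up master-slave replication (omitted here). When replication is working, reload consul. The redis info output: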

127.0.0.1:6029> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.0.40,port=6029,state=online,offset=6786,lag=0
slave1:ip=192.168.0.30,port=6029,state=online,offset=6786,lag=1
master_repl_offset:6786
repl_backlog_active:1
repl_backlog_size:67108864
repl_backlog_first_byte_offset:2
repl_backlog_histlen:6785
127.0.0.1:6029>

Reload consul (on the three clients, i.e. .20 to .40):

[root@db-server-yayun-02 ~]# consul reload
Configuration reload triggered
[root@db-server-yayun-02 ~]#

Check the consul log on one of the servers (.20):

[root@db-server-yayun-02 consul]# tail -f consul.log
    2017/12/09 10:09:59 [INFO] serf: EventMemberJoin: db-server-yayun-04 192.168.0.40
    2017/12/09 11:14:55 [INFO] Caught signal: hangup
    2017/12/09 11:14:55 [INFO] Reloading configuration...
    2017/12/09 11:14:55 [INFO] agent: Synced service 'r-6029-redis-test'
    2017/12/09 11:14:55 [INFO] agent: Synced service 'w-6029-redis-test'
    2017/12/09 11:14:55 [INFO] agent: Synced check 'service:w-6029-redis-test'
    2017/12/09 11:15:00 [WARN] agent: Check 'service:r-6029-redis-test' is now critical
    2017/12/09 11:15:15 [WARN] agent: Check 'service:r-6029-redis-test' is now critical
    2017/12/09 11:15:30 [WARN] agent: Check 'service:r-6029-redis-test' is now critical
    2017/12/09 11:15:45 [WARN] agent: Check 'service:r-6029-redis-test' is now critical

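Both services, r-6029-redis-test and w-6029-redis-test, have been registered, but only w-6029-redis-test, the write service, is passing its check: redis on server .20 is the master, so the slave service naturally cannot pass. Let's look at the web UI.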
https://images2017.cnblogs.com/blog/609710/201712/609710-20171209111949855-381276474.png

Each of the three client nodes has registered 2 services. Our custom check output is visible as well:

https://images2017.cnblogs.com/blog/609710/201712/609710-20171209112315386-2017226683.png

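Now let's resolve the names via DNS and see whether we get what we want. We registered two services, r-6029-redis-test and w-6029-redis-test, which gives us two domain names: r-6029-redis-test.service.consul and w-6029-redis-test.service.consul. Let's check with dig: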
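
A query of this kind can be issued along the following lines. This is a sketch: it assumes the consul DNS interface at 192.168.0.10 port 8600, as reported in the server log earlier, and the function names are illustrative. The name builder follows consul's [tag.]name.service.domain lookup format.

```shell
#!/bin/bash
# Build the DNS name consul serves for a service, optionally narrowed by tag.
# $1 = service name, $2 = optional tag, $3 = optional domain (default: consul)
consul_fqdn() {
    local service=$1 tag=$2 domain=${3:-consul}
    if [ -n "$tag" ]; then
        echo "$tag.$service.service.$domain"
    else
        echo "$service.service.$domain"
    fi
}

# Query the consul DNS interface for a service (same arguments as consul_fqdn).
# Only defined here; call it on a host that has dig and can reach the server.
resolve_service() {
    dig @192.168.0.10 -p 8600 "$(consul_fqdn "$@")" +short
}

echo "read name:  $(consul_fqdn r-6029-redis-test)"
echo "write name: $(consul_fqdn w-6029-redis-test)"
```

Running resolve_service r-6029-redis-test on one of the hosts then lists the addresses that currently pass the read check.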

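The read domain r-6029-redis-test.service.consul resolves to two servers, so reads can be load-balanced across the slaves. What about the write domain?

As expected, it resolves to .20. What happens if we shut down one of the slaves?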

[root@db-server-yayun-03 ~]# ifconfig eth1 | grep -oP '(?<=inet addr:)\S+'
192.168.0.30
[root@db-server-yayun-03 ~]# pgrep -fl redis-server | awk '{print $1}' | xargs kill
[root@db-server-yayun-03 ~]#

127.0.0.1:6029> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=192.168.0.40,port=6029,state=online,offset=8200,lag=0
master_repl_offset:8200
repl_backlog_active:1
repl_backlog_size:67108864
repl_backlog_first_byte_offset:2
repl_backlog_histlen:8199
127.0.0.1:6029>

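Only one slave is left. Let's dig the read domain again:

The other machine has been removed from the answer. What if I now shut down the slave on .40 as well?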

[root@db-server-yayun-04 shell]# ifconfig eth1 | grep -oP '(?<=inet addr:)\S+'
192.168.0.40
[root@db-server-yayun-04 shell]# pgrep -fl redis-server | awk '{print $1}' | xargs kill
[root@db-server-yayun-04 shell]#

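Now redis has no usable slave, so both reads and writes go to the master.

That covers the basic tests. Next, we combine consul with sentinel to achieve high availability. I restore the previous environment: .20 is the master, .30 and .40 are slaves, and .10 runs sentinel. In production, sentinel should also be deployed on 3 or 5 nodes. A sentinel is already running on .10 on port 36029, so I simply add monitoring for port 6029 on .20.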

127.0.0.1:36029> sentinel monitor my-test-6029 192.168.0.20 6029 1
OK
127.0.0.1:36029>
127.0.0.1:36029> info Sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=my-test-6029,status=ok,address=192.168.0.20:6029,slaves=2,sentinels=1
127.0.0.1:36029>

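Check again that the read and write domains resolve correctly, now that the environment is restored:

Everything is back to normal. Now shut down the redis master: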

[root@db-server-yayun-02 ~]# ifconfig eth1 | grep -oP '(?<=inet addr:)\S+'
192.168.0.20
[root@db-server-yayun-02 ~]# pgrep -fl redis-server | awk '{print $1}' | xargs kill

Look at the sentinel info:

127.0.0.1:36029> info Sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=my-test-6029,status=ok,address=192.168.0.30:6029,slaves=2,sentinels=1
127.0.0.1:36029>
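
To read the current master out of this output in a script, the master line can be split on its separators. A minimal sketch; the function name is illustrative, and the sample line is the one shown above:

```shell
#!/bin/bash
# Print "ip:port" of the master that sentinel reports for the given master name.
# Reads `INFO Sentinel` output on stdin.
sentinel_master_addr() {
    # Splitting on "=" and "," puts the name in field 2 and the address in field 6.
    tr -d '\r' | awk -F'[=,]' -v name="$1" '/^master[0-9]+:/ && $2 == name { print $6 }'
}

sample='# Sentinel
sentinel_masters:1
master0:name=my-test-6029,status=ok,address=192.168.0.30:6029,slaves=2,sentinels=1'

echo "current master: $(printf '%s\n' "$sample" | sentinel_master_addr my-test-6029)"
```

Against the sentinel used here, the input would come from redis-cli -p 36029 INFO Sentinel.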

The master is now .30. Let's dig the domains:

OK, this is exactly the result we wanted. Finally, a few words about DNS.

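On the app side, a DNS server IP must be configured to resolve names with the consul suffix. There are three options for DNS resolution and forwarding:
1. Keep the existing internal DNS server and have it forward names with the consul suffix to the consul server (this is what we use in production)
2. Point all DNS traffic at the consul DNS server, and use the recursors setting to forward names without the consul suffix to the original DNS server
3. Forward with dnsmasq: server=/consul/10.16.X.X#8600 resolves names with the consul suffix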
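
For the dnsmasq option, the forwarding rule has a fixed shape, so it can be generated per environment. A small sketch; the function name is illustrative, and 192.168.0.10 is the test server from this article rather than the elided production address:

```shell
#!/bin/bash
# Print the dnsmasq rule that sends *.consul lookups to a consul DNS endpoint.
# $1 = address of a consul agent's DNS interface, $2 = port (default: 8600)
dnsmasq_consul_rule() {
    local ip=$1 port=${2:-8600}
    echo "server=/consul/$ip#$port"
}

# Example with the test server from this article.
dnsmasq_consul_rule 192.168.0.10
```

The printed line goes into the dnsmasq configuration, after which dnsmasq has to be restarted to pick it up.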

Our internal DNS runs bind. The consul website has an example of how to set up forwarding with bind: https://www.consul.io/docs/guides/forwarding.html. We also load-tested consul's DNS and found no performance problems:

https://images2017.cnblogs.com/blog/609710/201712/609710-20171209120138558-1039531277.png

 

References:

https://www.consul.io/docs/

https://book-consul-guide.vnzmi.com/

http://www.liangxiansen.cn/2017/04/06/consul/

https://www.cnblogs.com/gomysql

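Summary:

For mysql and redis running multiple instances per machine, consul provides high availability very well, in combination with MHA or sentinel of course. Its biggest advantages are that it is lightweight, convenient and simple. If the application supports read/write splitting, it is even more convenient: losing one or more slaves does not affect the service.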
