Zookeeper 通知更新可靠吗？解读源码找答案！

栏目: 服务器 · 发布时间: 6年前

内容简介：导读：遇到Keepper通知更新无法收到的问题，思考节点变更通知的可靠性，通过阅读源码解析了解到zk Watch的注册以及触发的机制，本地调试运行模拟zk更新的不可靠的场景以及得出相应的解决方案。过程很曲折，但问题的根本原因也水落石出了，本文最后陈述了更新无法收到的根本原因，希望对其他人有所帮助。-----------------------------------------

导读：

遇到Keepper通知更新无法收到的问题，思考节点变更通知的可靠性，通过阅读源码解析了解到zk Watch的注册以及触发的机制，本地调试运行模拟zk更新的不可靠的场景以及得出相应的解决方案。

过程很曲折，但问题的根本原因也水落石出了，本文最后陈述了更新无法收到的根本原因，希望对其他人有所帮助。-----------------------------------------

通常Zookeeper是作为配置存储、分布式锁等功能被使用，配置读取如果每一次都是去Zookeeper server读取效率是非常低的，幸好Zookeeper提供节点更新的通知机制，只需要对节点设置Watch监听，节点的任何更新都会以通知的方式发送到Client端。

如上图所示：应用Client通常会连接上某个ZkServer，forPath不仅仅会读取Zk 节点zkNode的数据（通常存储读取到的数据会存储在应用内存中，例如图中Value），而且会设置一个Watch，当zkNode节点有任何更新时，ZkServer会发送notify，Client运行Watch来才走出相应的事件相应。这里假设操作为更新Client本地的数据。这样的模型使得配置异步更新到Client中，而无需Client每次都远程读取，大大提高了读的性能，（图中的re-regist重新注册是因为对节点的监听是一次性的，每一次通知完后，需要重新注册）。但这个Notify是可靠的吗？如果通知失败，那岂不是Client永远都读取的本地的未更新的值？

由于现网环境定位此类问题比较困难，因此本地下载源码并模拟运行ZkServer & ZkClient来看通知的发送情况。

1、git 下载源码 https://github.com/apache/zookeeper

2、cd 到路径下，运行ant eclipse 加载工程的依赖。

3、导入Idea中。

https://stackoverflow.com/questions/43964547/how-to-import-zookeeper-source-code-to-idea

查看相关问题和步骤。

首先运行ZkServer。QuorumPeerMain是Server的启动类。这个可以根据bin下ZkServer.sh找到入口。注意启动参数配置参数文件，指定例如启动端口等相关参数。

在此之前，需要设置相关的断点。

首先我们要看client设置监听后，server是如何处理的

ZkClient 是使用Nio的方式与ZkServer进行通信的，Zookeeper的线程模型中使用两个线程：

SendThread专门成立的请求的发送，请求会被封装为Packet（包含节点名称、Watch描述等信息）类发送给Sever。

EventThread则专门处理SendThread接收后解析出的Event。

ZkClient 的主要有两个Processor，一个是SycProcessor负责Cluster之间的数据同步（包括集群leader选取）。另一个是叫FinalRuestProcessor，专门处理对接受到的请求（Packet）进行处理。

//ZookeeperServer 的processPacket方法专门对收到的请求进行处理。
    public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOException {
        // We have the request, now process and setup for next
        InputStream bais = new ByteBufferInputStream(incomingBuffer);
        BinaryInputArchive bia = BinaryInputArchive.getArchive(bais);
        RequestHeader h = new RequestHeader();
        h.deserialize(bia, "header");
        // Through the magic of byte buffers, txn will not be
        // pointing
        // to the start of the txn
        incomingBuffer = incomingBuffer.slice();
        //鉴权请求处理
        if (h.getType() == OpCode.auth) {
            LOG.info("got auth packet " + cnxn.getRemoteSocketAddress());
            AuthPacket authPacket = new AuthPacket();
            ByteBufferInputStream.byteBuffer2Record(incomingBuffer, authPacket);
            String scheme = authPacket.getScheme();
            ServerAuthenticationProvider ap = ProviderRegistry.getServerProvider(scheme);
            Code authReturn = KeeperException.Code.AUTHFAILED;
            if(ap != null) {
                try {
                    authReturn = ap.handleAuthentication(new ServerAuthenticationProvider.ServerObjs(this, cnxn), authPacket.getAuth());
                } catch(RuntimeException e) {
                    LOG.warn("Caught runtime exception from AuthenticationProvider: " + scheme + " due to " + e);
                    authReturn = KeeperException.Code.AUTHFAILED;
                }
            }
            if (authReturn == KeeperException.Code.OK) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Authentication succeeded for scheme: " + scheme);
                }
                LOG.info("auth success " + cnxn.getRemoteSocketAddress());
                ReplyHeader rh = new ReplyHeader(h.getXid(), 0,
                        KeeperException.Code.OK.intValue());
                cnxn.sendResponse(rh, null, null);
            } else {
                if (ap == null) {
                    LOG.warn("No authentication provider for scheme: "
                            + scheme + " has "
                            + ProviderRegistry.listProviders());
                } else {
                    LOG.warn("Authentication failed for scheme: " + scheme);
                }
                // send a response...
                ReplyHeader rh = new ReplyHeader(h.getXid(), 0,
                        KeeperException.Code.AUTHFAILED.intValue());
                cnxn.sendResponse(rh, null, null);
                // ... and close connection
                cnxn.sendBuffer(ServerCnxnFactory.closeConn);
                cnxn.disableRecv();
            }
            return;
        } else {
            
            if (h.getType() == OpCode.sasl) {
                Record rsp = processSasl(incomingBuffer,cnxn);
                ReplyHeader rh = new ReplyHeader(h.getXid(), 0, KeeperException.Code.OK.intValue());
                cnxn.sendResponse(rh,rsp, "response"); // not sure about 3rd arg..what is it?
                return;
            }
            else {
                Request si = new Request(cnxn, cnxn.getSessionId(), h.getXid(),
                  h.getType(), incomingBuffer, cnxn.getAuthInfo());
                si.setOwner(ServerCnxn.me);
                // Always treat packet from the client as a possible
                // local request.
                setLocalSessionFlag(si);
                //交给finalRequestProcessor处理
                submitRequest(si);
            }
        }
        cnxn.incrOutstandingRequests(h);
    }

FinalRequestProcessor 对请求进行解析，Client连接成功后，发送的exist命令会落在这部分处理逻辑。

zkDataBase 由zkServer从disk持久化的数据建立而来，上图可以看到这里就是添加监听Watch的地方。

然后我们需要了解到，当Server收到节点更新事件后，是如何触发Watch的。

首先了解两个概念，FinalRequestProcessor处理的请求分为两种，一种是事务型的，一种非事务型，exist 的event-type是一个非事物型的操作，上面代码中是对其处理逻辑，对于事物的操作，例如SetData的操作。则在下面代码中处理。

private ProcessTxnResult processTxn(Request request, TxnHeader hdr,
                                        Record txn) {
        ProcessTxnResult rc;
        int opCode = request != null ? request.type : hdr.getType();
        long sessionId = request != null ? request.sessionId : hdr.getClientId();
        if (hdr != null) {
            //hdr 为事物头描述，例如SetData的操作就会被ZkDataBase接管操作，
            //因为是对Zk的数据存储机型修改
            rc = getZKDatabase().processTxn(hdr, txn);
        } else {
            rc = new ProcessTxnResult();
        }
        if (opCode == OpCode.createSession) {
            if (hdr != null && txn instanceof CreateSessionTxn) {
                CreateSessionTxn cst = (CreateSessionTxn) txn;
                sessionTracker.addGlobalSession(sessionId, cst.getTimeOut());
            } else if (request != null && request.isLocalSession()) {
                request.request.rewind();
                int timeout = request.request.getInt();
                request.request.rewind();
                sessionTracker.addSession(request.sessionId, timeout);
            } else {
                LOG.warn("*****>>>>> Got "
                        + txn.getClass() + " "
                        + txn.toString());
            }
        } else if (opCode == OpCode.closeSession) {
            sessionTracker.removeSession(sessionId);
        }
        return rc;
    }

这里设置了断点，就可以拦截对节点的更新操作。

这两个设置了断点，就可以了解到Watch的设置过程。

接下来看如何启动Zookeeper的Client。ZookeeperMain为Client的入口，同样在bin/zkCli.sh中可以找到。注意设置参数，设置Server的连接地址。

修改ZookeeperMain方法，设置对节点的Watch监听。

public ZooKeeperMain(String args[]) throws IOException, InterruptedException, KeeperException {
        cl.parseOptions(args);
        System.out.println("Connecting to " + cl.getOption("server"));
        connectToZK(cl.getOption("server"));
        while (true) {
            // 模拟注册对/zookeeper节点的watch监听
            zk.exists("/zookeeper", true);
            System.out.println("wait");
        }
    }

启动Client。

由于我们要观察节点变更的过程，上面这个Client设置了对节点的监听，那么我们需要另外一个cleint对节点进行更改，这个我们只需要在命令上进行就可以了。

此时命令行的zkClient更新了/zookeeper节点，Server此时会停在setData事件的处理代码段。

public Stat setData(String path, byte data[], int version, long zxid,
            long time) throws KeeperException.NoNodeException {
        Stat s = new Stat();
        DataNode n = nodes.get(path);
        if (n == null) {
            throw new KeeperException.NoNodeException();
        }
        byte lastdata[] = null;
        synchronized (n) {
            lastdata = n.data;
            n.data = data;
            n.stat.setMtime(time);
            n.stat.setMzxid(zxid);
            n.stat.setVersion(version);
            n.copyStat(s);
        }
        // now update if the path is in a quota subtree.
        String lastPrefix = getMaxPrefixWithQuota(path);
        if(lastPrefix != null) {
          this.updateBytes(lastPrefix, (data == null ? 0 : data.length)
              - (lastdata == null ? 0 : lastdata.length));
        }
        //触发watch监听
        dataWatches.triggerWatch(path, EventType.NodeDataChanged);
        return s;
    }

此时，我们重点关注的类出现了。WatchManager

package org.apache.zookeeper.server;

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * This class manages watches. It allows watches to be associated with a string
 * and removes watchers and their watches in addition to managing triggers.
 */
class WatchManager {
    private static final Logger LOG = LoggerFactory.getLogger(WatchManager.class);
    //存储path对watch的关系
    private final Map<String, Set<Watcher>> watchTable =
        new HashMap<String, Set<Watcher>>();
    //存储watch监听了哪些path节点
    private final Map<Watcher, Set<String>> watch2Paths =
        new HashMap<Watcher, Set<String>>();

    synchronized int size(){
        int result = 0;
        for(Set<Watcher> watches : watchTable.values()) {
            result += watches.size();
        }
        return result;
    }
    //添加监听
    synchronized void addWatch(String path, Watcher watcher) {
        Set<Watcher> list = watchTable.get(path);
        if (list == null) {
            // don't waste memory if there are few watches on a node
            // rehash when the 4th entry is added, doubling size thereafter
            // seems like a good compromise
            list = new HashSet<Watcher>(4);
            watchTable.put(path, list);
        }
        list.add(watcher);

        Set<String> paths = watch2Paths.get(watcher);
        if (paths == null) {
            // cnxns typically have many watches, so use default cap here
            paths = new HashSet<String>();
            watch2Paths.put(watcher, paths);
        }
        paths.add(path);
    }
    //移除
    synchronized void removeWatcher(Watcher watcher) {
        Set<String> paths = watch2Paths.remove(watcher);
        if (paths == null) {
            return;
        }
        for (String p : paths) {
            Set<Watcher> list = watchTable.get(p);
            if (list != null) {
                list.remove(watcher);
                if (list.size() == 0) {
                    watchTable.remove(p);
                }
            }
        }
    }

    Set<Watcher> triggerWatch(String path, EventType type) {
        return triggerWatch(path, type, null);
    }
    //触发watch
    Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
        WatchedEvent e = new WatchedEvent(type,
                KeeperState.SyncConnected, path);
        Set<Watcher> watchers;
        synchronized (this) {
            watchers = watchTable.remove(path);
            if (watchers == null || watchers.isEmpty()) {
                if (LOG.isTraceEnabled()) {
                    ZooTrace.logTraceMessage(LOG,
                            ZooTrace.EVENT_DELIVERY_TRACE_MASK,
                            "No watchers for " + path);
                }
                return null;
            }
            for (Watcher w : watchers) {
                Set<String> paths = watch2Paths.get(w);
                if (paths != null) {
                    paths.remove(path);
                }
            }
        }
        for (Watcher w : watchers) {
            if (supress != null && supress.contains(w)) {
                continue;
            }
            //通知发送
            w.process(e);
        }
        return watchers;
    }
}

重点关注triggerWatch的方法，可以发现watch被移除后，即往watch中存储的client信息进行通知发送。

@Override
    public void process(WatchedEvent event) {
        ReplyHeader h = new ReplyHeader(-1, -1L, 0);
        if (LOG.isTraceEnabled()) {
            ZooTrace.logTraceMessage(LOG, ZooTrace.EVENT_DELIVERY_TRACE_MASK,
                                     "Deliver event " + event + " to 0x"
                                     + Long.toHexString(this.sessionId)
                                     + " through " + this);
        }

        // Convert WatchedEvent to a type that can be sent over the wire
        WatcherEvent e = event.getWrapper();

        sendResponse(h, e, "notification");
    }

没有任何确认机制，不会由于发送失败，而回写watch。

结论:

到这里，可以知道watch的通知机制是不可靠的，zkServer不会保证通知的可靠抵达。虽然zkclient与zkServer端是会有心跳机制保持链接，但是如果通知过程中断开，即时重新建立连接后，watch的状态是不会恢复。

现在已经知道了通知是不可靠的，会有丢失的情况，那ZkClient的使用需要进行修正。

本地的存储不再是一个静态的等待watch更新的状态，而是引入缓存机制，定期的去从Zk主动拉取并注册Watch（ZkServer会进行去重，对同一个Node节点的相同时间类型的Watch不会重复）。

另外一种方式是，Client端收到断开连接的通知，重新注册所有关注节点的Watch。但作者遇到的现网情况是client没有收到更新通知的同时，也没有查看到连接断开的错误信息。这块仍需进一步确认。水平有限，欢迎指正 :D

在StackOverFlow上的提问有了新进展:

https://stackoverflow.com/questions/49328151/is-zookeeper-node-change-notification-reliable-under-situations-of-connection-lo

原来官方文档已经解释了在连接断开的时候，client对watch的一些恢复操做，ps：原来上面我提到的客户端的策略已经官方实现。。。

客户端会通过心跳保活，如果发现断开了连接，会重新建立连接，并发送之前对节点设置的watch以及节点zxid，如果zxid与服务端的小则说明断开期间有更改，那么server会触发通知。

这么来看，Zookeeper的通知机制至少在官方的文档说明上是可靠的，至少是有相应机制去保证。ps：除Exist watch外。但是本人遇到的问题仍未解开。。后悔当初没有保留现场，深入发掘。计划先把实现改回原来的，后续进一步验证。找到原因再更新这里。

最终结论更新！

通过深入阅读apache的zk论坛以及源码，有一个重要的信息。

上面提到的连接断开分为recoverble以及unrecoverble两种场景，这两种的区别主要是基于Session的有效期，所有的client操作包括watch都是和Session关联的，当Session在超时过期时间内，重新成功建立连接，则watch会在连接建立后重新设置。但是当Session Timeout后仍然没有成功重新建立连接，那么Session则处于Expire的状态。下面连接讲述了这个过程

How should I handle SESSION_EXPIRED?

这种情况下，ZookeeperClient会重新连接，但是Session将会是全新的一个。同时之前的状态是不会保存的。

private void conLossPacket(Packet p) {
        if (p.replyHeader == null) {
            return;
        }
        switch (state) {
        case AUTH_FAILED:
            p.replyHeader.setErr(KeeperException.Code.AUTHFAILED.intValue());
            break;
        case CLOSED:
            // session关闭状态，直接返回。
            p.replyHeader.setErr(KeeperException.Code.SESSIONEXPIRED.intValue());
            break;
        default:
            p.replyHeader.setErr(KeeperException.Code.CONNECTIONLOSS.intValue());
        }
        // 如果session未过期，这里进行session的状态（watches）会重新注册。
        finishPacket(p);
    }

1、什么是zookeeper的会话过期？

一般来说，我们使用zookeeper是集群形式，如下图，client和zookeeper集群(3个实例)建立一个会话session。

在这个会话session当中，client其实是随机与其中一个zk provider建立的链接，并且互发心跳heartbeat。zk集群负责管理这个session，并且在所有的provider上维护这个session的信息，包括这个session中定义的临时数据和监视点watcher。

如果再网络不佳或者zk集群中某一台provider挂掉的情况下，有可能出现connection loss的情况，例如client和zk provider1连接断开，这时候client不需要任何的操作(zookeeper api已经给我们做好了)，只需要等待client与其他provider重新连接即可。这个过程可能导致两个结果：

1）在session timeout之内连接成功

这个时候client成功切换到连接另一个provider例如是provider2，由于zk在所有的provider上同步了session相关的数据，此时可以认为无缝迁移了。

2）在session timeout之内没有重新连接

这就是session expire的情况，这时候zookeeper集群会任务会话已经结束，并清除和这个session有关的所有数据，包括临时节点和注册的监视点Watcher。

在session超时之后，如果client重新连接上了zookeeper集群，很不幸，zookeeper会发出session expired异常，且不会重建session，也就是不会重建临时数据和watcher。

我们实现的ZookeeperProcessor是基于Apache Curator的Client封装实现的。

Apache Curator 错误处理机制

它对于Session Expire的处理是提供了处理的监听注册ConnectionStateListner，当遇到Session Expire时，执行使用者要做的逻辑。（例如：重新设置Watch）遗憾的是，我们没有对这个事件进行处理，因此连接是一致断开的，但是！我们应用仍然会读到老的数据！

在这里，我们又犯了另外一个错误，本地缓存了zookeeper的节点数据。。其实zookeeperClient已经做了本地缓存的机制，但是我们有加了一层（注：这里也有一个原因，是因为zk节点的数据时二进制的数组，业务要使用通常要反序列化，我们这里的缓存是为了减少反序列化带来的开销！），正式由于我们本地缓存了，因此即使zk断开了，仍然读取了老的值！

至此，谜团已经全部解开，看来之前的实现有许多姿势是错误的，导致后续出现了各种奇怪的BUG 。现在处理的方案，是监听Reconnect的通知，当收到这个通知后，主动让本地缓存失效（这里仍然做了缓存，是因为减少反序列化的开销，zkClient的缓存只是缓存了二进制，每次拿出来仍然需要反序列化）。代码：

curatorFramework.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework client, ConnectionState newState) {

                switch (newState) {
                    case CONNECTED:
                        break;
                    case RECONNECTED:
                        LOG.error("zookeeper connection reconnected");
                        System.out.println("zookeeper connection reconnected");
                        //本来使用invalidateAll，但是这个会使得cache所有缓存值同时失效
                        //如果关注节点比较多，导致同时请求zk读值，可能服务会瞬时阻塞在这一步
                        //因此使用guava cache refresh方法，异步更新，更新过程中，
                        //老值返回，知道更新完成
                        for (String key : classInfoMap.keySet()) {
                            zkDataCache.refresh(key);
                        }

                        break;
                    case LOST:
                       // session 超时，断开连接，这里不要做任何操作，缓存保持使用
                        LOG.error("zookeeper connection lost");
                        System.out.println("zookeeper connection lost");
                        break;
                    case SUSPENDED:
                        break;
                    default:
                        break;
                }

            }
        });

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网