Zookeeper - All you need to know
- 6 minsHistorically, each application was a single program running on a single computer with a single CPU. Today, things have changed. In the Big Data and Cloud Computing world, applications are made up of many indepndant programs running on an ever-changing set of computers.
Coordinating the actions of these independant programs is far more difficult than writing a single program to run on a single computer.
Zookeeper was designed to be a robust service that enables application developers to mainly focus on their application logic rather than coordination. It exposes a simple API, inspired by the filesystem API, that allows developers to implement common coordination tasks.
Where is Zookeeper used?
- Apache HBase
- Apache Kafka
- Apache Solr
- Yahoo! Fetching Services
- Facebook Messenger
CAP Theorem
It is impossible for a distributed data store to simultaneously provide more than two out of the following three gurantees.
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response - without the gurantee that it contains the most recent write
- Partition tolerance: the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
It implies that in the presence of a network partition, one has to choose between Consistency and Availability
Zookeeper is designed with mostly Consistency and Availability in mind. It also provides read-only capability in the presence of network partition. What that means is, the clients connected to a znode, which was part of an ensamble/cluster, can keep communicating with the znode (even after network partition) and will keep receiving stale, read-only data.
Paxos and Virtual Synchrony algorithms have been perticularly influential in the design of Zookeeper
Zookeeper is a CP system
Definitions
znode
Every node in a Zookeeper tree is referred to as a znode
watch
A watch is a one-shot operation, which means that it triggers one notification
quorum
Zookeeper quorum is the minimum number of servers that have to be running and available in order for Zookeeper to work
zxid
A zxid is a long (64 bit) integer split into two parts: the epoch and the counter. Each part has 32 bits. Simply put, zxid is the timestamp of the last change
leader
The leader is the central point for handling all requests that change the Zookeeper system. It acts as a sequencer and establishes the order of updates to the Zookeeper state.
follower
A follower receive and vote on the updates proposed by the leader to guarantee that updates to the state survive crashes
split-brain
Two subsets of servers making progress simultaneously. In this scenario, a single cluster can have multiple Leaders. Zookeeper strongly advises against split-brain. Needless to say, a split-brain points to a faulty configuration
Myth-buster: Zookeeper is not for bulk storage
znode in detail
Persistent znode
A persistent znode /path can be deleted only through a call to delete/remove (there is a slight difference between the two)
By default, all znodes are persistent znodes unless otherwise stated
Ephemeral znode
An ephemeral znode is deleted if the client that created it crashes or simply closes the connection to Zookeeper.
Sequential znode
A sequential znode is assigned a unique, monotonically increasing integer
There are four options for the mode of a znode: persistent, ephemeral, persistent_sequential and ephemeral_sequential
watch in detail
To replace client polling, Zookeeper has opted for a mechanism based on notifications. Clients register with Zookeeper to receive notifications of changes ro znodes. Registering to receive a notification for a given znode consists of setting a watch.
Watches are one time triggers and are sent asynchronously to the watchers/client.
Watches are set while reading data and triggered while writing data.
Notifications are delivered to a client before any other change is made to the same node
Watches only tell that something has changed, it does not talk about what has changed.
A client can set a watch for:
- changes to the data of a znode
- changes to the children of a znode
- znode being created or deleted
quorum in detail
This number is the minimum number of servers that have to store client’s data before telling the client that its data is safely stored.
We should always shoot for an odd number of servers. Typically, (2F + 1)
servers. Where F
is the number of server failures the cluster can tolerate
Versions
A couple of operations in the Zookeeper API can be executed conditionally: setData
and delete
. Both calls take a version as an input parameter and the operation succeeds only if the version passed by the client matches the current version on the server.
The use of version is important when multiple clients might be trying to perform operations over the same znode.
Sessions
Before executing any request against a Zookeeper ensemble, a client must establish a session with the Zookeeper service.
As soon as a client is connected to a server and a session established, the session is replicated with the leader.
Sessions offer order grantees, which means, requests in a session are executed in FIFO
order. However, a client can have multiple concurrent sessions, in which case FIFO
ordering is not preserved across sessions.
Sessions may be moved to a different server of the client has not heard from its current server for some time.
Moving the session to a different server, transparently, is handled by the client library
All operations a client submits to Zookeeper are associated to a session. When a session ends for any reason, the ephemeral nodes created dusring that session disappear.
The Zookeeper ensamble is the one responsible for declaring session expired, not the client. The client may choose to close the session, however.
Session handling on the client side
On the client side, if it has heard nothing from the server at the end of 1/3rd of t
, it sends a heartbeat message to the server. At 2/3rd of t
the client starts looking for another server, and it has another 1/3rd of t
to find one.
Stay tuned for season 2
There is so much more to Zookeeper than this. Lets not overwhelm ourselves with information. Take time to process all this and in the meanwhile, allow me to write another blog in the continuation.
I will cover the following topics in next post:
- Types of servers
- Leader election
- How to configure Zookeeper in standalone mode and in ensemble mode
- Notification messages
- Logs
See you soon!