Kafka的offset管理

栏目: IT技术 · 发布时间: 5年前

内容简介:Kafka的offset管理

对Kafka offset的管理,一直没有进行系统的总结,这篇文章对它进行分析。

 

什么是offset

offset是consumer position,Topic的每个Partition都有各自的offset.

Keeping track of what has been consumed, is, surprisingly, one of the key performance points of a messaging system.  
  
Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker either records that fact locally immediately or it may wait for acknowledgement from the consumer. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else this state could go. Since the data structure used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.  
  
What is perhaps not obvious, is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent not consumed when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.  
  
Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by one consumer at any given time. This means that the position of consumer in each partition is just a single integer, the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed. This makes the equivalent of message acknowledgements very cheap.  
  
There is a side benefit of this decision. A consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if the consumer code has a bug and is discovered after some messages are consumed, the consumer can re-consume those messages once the bug is fixed.  
 

消费者需要自己保留一个offset,从kafka 获取消息时,只拉去当前offset 以后的消息。Kafka 的scala/java 版的client 已经实现了这部分的逻辑,将offset 保存到zookeeper 上

 

1. auto.offset.reset

What to do when there is no initial offset in ZooKeeper or if an offset is out of range:
  • smallest : automatically reset the offset to the smallest offset
  • largest : automatically reset the offset to the largest offset
  • anything else: throw exception to the consumer

如果Kafka没有开启Consumer,只有Producer生产了数据到Kafka中,此后开启Consumer。在这种场景下,将auto.offset.reset设置为largest,那么Consumer会读取不到之前Produce的消息,只有新Produce的消息才会被Consumer消费

 

2. auto.commit.enable(例如true,表示offset自动提交到Zookeeper)

If true, periodically commit to ZooKeeper the offset of messages already fetched by the consumer. This committed offset will be used when the process fails as the position from which the new consumer will begin

 

3. auto.commit.interval.ms(例如60000,每隔1分钟offset提交到Zookeeper)

The frequency in ms that the consumer offsets are committed to zookeeper.

 

问题:如果在一个时间间隔内,没有提交offset,岂不是要重复读了?

 

4. offsets.storage

Select where offsets should be stored (zookeeper or kafka).默认是Zookeeper

 

5. 基于offset的重复读

The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.

 

6. Kafka的可靠性保证(消息消费和Offset提交的时机决定了At most once和At least once语义)

At Most Once:

At Least Once:


Kafka的offset管理
 

Kafka默认实现了At least once语义


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

标签: kafka的offset管理

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

任正非传

任正非传

孙力科 / 浙江人民出版社 / 2017-5-2 / 39.80

编辑推荐: 超权威、超丰富、超真实的任正非传记 亲述任正非跌宕起伏、传奇精彩的一生 ◆知名财经作家孙力科,历时十年,数十次深入华为,采访华为和任正非历程中各个关键人物,几度增删,创作成此书 ◆全书展现了任正非从出生至今70多年的人生画卷,从入伍到退役进国企,从艰难创业到开拓海外市场,囊括其人生道路上各个关键点,时间跨度之长,内容之丰富,前所未有 ◆迄今为止,任正非一生......一起来看看 《任正非传》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

MD5 加密
MD5 加密

MD5 加密工具

RGB CMYK 转换工具
RGB CMYK 转换工具

RGB CMYK 互转工具