Reducing search indexing latency to one second



One of the key metrics for a search system is the indexing latency, the amount of time it takes for new information to become available in the search index. This metric is important because it determines how quickly new results show up. Not all search systems need to update their contents quickly. In a warehouse inventory system, for example, one daily update to its search index might be acceptable. At Twitter -- where people are constantly looking for the answer to “what’s happening” -- real-time search is a must.

Until mid-2019, the indexing latency for Twitter’s search system was around 15 seconds. This was fast enough to power relevance-based product features such as the ranked Home timeline, which orders Tweets by relevance. Since a Tweet's relevance depends on factors such as how much engagement it gets, there is less need for instant indexing. Use cases requiring a much lower indexing latency, however, couldn’t be powered by our search infrastructure. For example, we couldn’t use this same search system to retrieve Tweets for a person’s profile page, where people generally expect to see their Tweets appear the moment they are published.

Two main limitations prevented us from lowering our indexing latency:

  1. Some fields (bits of information associated with each Tweet) are not available when a Tweet is created. For example, a fully resolved URL usually provides a much better ranking signal than a shortened URL like http://t.co/foo. However, resolving new URLs takes time, and our old system required us to index all fields for a Tweet at the same time, so we needed to wait for all these delayed fields to become available.
  2. Most features in the Twitter application prioritize the newest relevant Tweets. Therefore, we sort our index by the time a Tweet was created. This would be easy to do if our systems received events that were strictly ordered. However, because our search system is decoupled from Tweet creation, the search system doesn’t necessarily receive Tweets in chronological order. To overcome this limitation, we added a buffer in our ingestion pipeline that stored all incoming Tweets for a few seconds, and sorted them in strictly increasing order of created time before sending them to our indexing system.
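The sorting buffer described in the second limitation can be sketched roughly as follows. This is a minimal illustration of the idea, not Twitter's actual ingestion code; the class name, hold time, and clock injection are all assumptions made for the example:

```python
import heapq
import time

BUFFER_SECONDS = 3.0  # assumed hold time; the real pipeline held Tweets "for a few seconds"

class SortingBuffer:
    """Hold incoming Tweets briefly, then release them in created-at order."""

    def __init__(self, now=time.monotonic):
        self.now = now          # injectable clock, useful for testing
        self.heap = []          # min-heap ordered by (created_at, arrival_time)

    def add(self, created_at, tweet):
        heapq.heappush(self.heap, (created_at, self.now(), tweet))

    def drain_ready(self):
        # Emit Tweets in strictly increasing created-at order, but only once
        # the earliest one has been buffered long enough. A Tweet delayed
        # longer than the buffer window would still end up out of order --
        # the same failure mode the article describes for the old system.
        ready = []
        while self.heap and self.now() - self.heap[0][1] >= BUFFER_SECONDS:
            created_at, _, tweet = heapq.heappop(self.heap)
            ready.append((created_at, tweet))
        return ready
```

Note that every Tweet pays the full buffer delay before it can be indexed, which is exactly the latency cost the new design removes.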

Overcoming these limitations required major changes to our ingestion pipeline and our indexing system, but we believe the results were worth the effort. Tweets are now available for searching within one second of creation, which allows us to power product features with strict real-time requirements, such as real-time conversations or the profile pages. Let's take a closer look at how we've achieved this.

Posting lists

The core of almost all search systems is a data structure called an inverted index. An inverted index is designed to quickly answer questions like "Which documents have the word cat in them?". It does this by keeping a map from terms to posting lists. A term is typically a word, but is sometimes a conjunction, phrase, or number. A posting list is a list of document identifiers (or document IDs) containing the term. The posting list often includes extra information, like the position in the document where the term appears, or payloads to improve the relevance of our ranking algorithms.
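The term-to-posting-list mapping can be illustrated with a toy inverted index. This is a minimal sketch for clarity, not Twitter's implementation; here each posting stores a (document ID, position) pair:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: maps each term to a posting list of documents."""

    def __init__(self):
        # term -> list of (doc_id, position) postings
        self.postings = defaultdict(list)

    def add_document(self, doc_id, text):
        # Index every term with its position in the document.
        for position, term in enumerate(text.lower().split()):
            self.postings[term].append((doc_id, position))

    def lookup(self, term):
        # Answer "which documents contain this term, and where?"
        return self.postings.get(term, [])

index = InvertedIndex()
index.add_document(1, "the cat sat")
index.add_document(2, "a dog ran")
index.add_document(3, "the cat ran")
print(index.lookup("cat"))  # [(1, 1), (3, 1)]
```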

The search systems at Twitter process hundreds of thousands of queries per second and most involve searching through posting lists of thousands of items, making the speed at which we can iterate through a posting list a critical factor in serving queries efficiently. For example, consider how many Tweets contain the word “the.”

Document identifiers

We use Lucene as our core indexing technology. In standard Lucene, an index is subdivided into chunks called segments, and document IDs are Java integers. The first document indexed in a particular segment is assigned an ID of 0, and new document IDs are assigned sequentially. When searching through a segment, the search starts at the lowest document IDs in the segment and proceeds to the highest IDs in the segment.

To support our requirement of searching for the newest Tweets first, we diverge from standard Lucene and assign document IDs from high to low: the first document in a segment is assigned a maximum ID (determined by how large we want our Lucene segment to be), and each new document gets a smaller document ID. This lets us traverse documents so that we retrieve the newest Tweets first, and terminate queries after we examine a client-specified number of hits. This decision is critical in reducing the amount of time it takes to evaluate a search query and therefore in letting us scale to extremely high request rates.
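The early-termination idea can be sketched in a few lines. This is an illustrative simplification (the function name and list representation are assumptions, not Lucene's API): because newer Tweets get smaller IDs, a plain forward scan over an ascending posting list visits the newest Tweets first and can stop as soon as enough hits are found:

```python
def search_newest_first(posting_list, max_hits):
    """posting_list: doc IDs in ascending order; smaller ID = newer Tweet.

    Returns the newest max_hits matches without scanning the whole list.
    """
    hits = []
    for doc_id in posting_list:
        hits.append(doc_id)
        if len(hits) >= max_hits:
            break  # early termination: we already have the newest Tweets
    return hits

# IDs were assigned from high to low, so the smallest IDs are the newest.
print(search_newest_first([3, 17, 42, 99, 250], 3))  # [3, 17, 42]
```

For a huge posting list like the one for “the,” stopping after a handful of hits instead of scanning millions of entries is what keeps query latency bounded.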

When we were using sorted streams of incoming Tweets, it was easy to assign document IDs: the first Tweet in a segment would get the ID of size of the segment minus one, the second Tweet would get the size of the segment minus two, and so on, until we got to document ID 0. However, this document ID assignment scheme doesn’t work when the incoming stream isn’t sorted by the time a Tweet was created. In order to remove the delay added by sorting, we needed to come up with a new scheme.

In the new document ID assignment scheme, each Tweet is assigned a document ID based on the time that it was created. We needed to fit our document IDs into a 31-bit space, because Lucene uses positive Java integers as document IDs. Each document ID is unique within a segment, and our segments are usually writable for about 12 hours. We decided to allocate 27 bits to store timestamps with millisecond granularity, which is enough for a segment to last a bit more than 37 hours. We use the last four bits as a counter for the number of Tweets with the same timestamp. This means that if we get more than 2^4 (16) Tweets with the same millisecond timestamp, then some of them will be assigned a document ID that is in a slightly incorrect order. In practice, this is extremely rare, and we decided that this downside was acceptable: we often ran into a similar situation in the old system when a Tweet was delayed for more than 15 seconds, which also resulted in the assignment of an unordered document ID.
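One way to pack such an ID, sketched under the assumptions stated in the text (27 timestamp bits, 4 counter bits, newer Tweets getting smaller IDs; the function and constant names are illustrative):

```python
TIMESTAMP_BITS = 27   # millisecond offsets: 2^27 ms is a bit more than 37 hours
COUNTER_BITS = 4      # up to 2^4 = 16 Tweets per millisecond stay in order

def assign_doc_id(offset_ms, counter):
    """Pack a 31-bit doc ID from a millisecond offset into the segment
    and a per-millisecond counter. Subtracting the packed value from the
    maximum ID makes newer Tweets receive *smaller* document IDs, matching
    the high-to-low assignment described above."""
    assert 0 <= offset_ms < (1 << TIMESTAMP_BITS)
    assert 0 <= counter < (1 << COUNTER_BITS)
    max_id = (1 << (TIMESTAMP_BITS + COUNTER_BITS)) - 1  # 2^31 - 1
    return max_id - ((offset_ms << COUNTER_BITS) | counter)
```

With this packing, a Tweet created later (larger offset, or larger counter within the same millisecond) always gets a smaller ID, so a forward scan still sees the newest Tweets first.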

How it used to work: Unrolled linked lists

For the past eight years, the search systems at Twitter used a prepend-only unrolled linked list as the data structure backing the posting lists. This allowed us to avoid the overhead of storing a pointer for every value and vastly improved the speed of traversing a posting list because it was cache friendly. (An unrolled linked list is a linked list with multiple values per link — there is a good description on Wikipedia.)

In our old implementation, the linked list started out with a single value, and we allocated exponentially larger nodes each time the list needed to grow.
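A minimal sketch of that old structure, assuming doubling node sizes (the actual growth factor and node layout are not specified here, so the details below are illustrative):

```python
class Node:
    """One link of the unrolled list, holding up to `capacity` values."""

    def __init__(self, capacity, next_node=None):
        self.values = []
        self.capacity = capacity
        self.next = next_node

class UnrolledList:
    """Prepend-only unrolled linked list with exponentially growing nodes."""

    def __init__(self):
        self.head = Node(capacity=1)  # the list starts with a single value

    def prepend(self, value):
        if len(self.head.values) == self.head.capacity:
            # Head is full: allocate an exponentially larger node in front.
            self.head = Node(self.head.capacity * 2, next_node=self.head)
        self.head.values.insert(0, value)

    def __iter__(self):
        # Traversal visits the most recently prepended values first, which
        # is cache friendly: many values are read per pointer dereference.
        node = self.head
        while node is not None:
            yield from node.values
            node = node.next

lst = UnrolledList()
for v in [1, 2, 3, 4, 5]:
    lst.prepend(v)
print(list(lst))  # [5, 4, 3, 2, 1]
```

Packing many values per node is what makes traversal fast: most steps read adjacent memory rather than chasing a pointer, and only a logarithmic number of nodes are ever allocated.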
