Reducing search indexing latency to one second


One of the key metrics for a search system is the indexing latency, the amount of time it takes for new information to become available in the search index. This metric is important because it determines how quickly new results show up. Not all search systems need to update their contents quickly. In a warehouse inventory system, for example, one daily update to its search index might be acceptable. At Twitter -- where people are constantly looking for the answer to “what’s happening” -- real-time search is a must.

Until mid-2019, the indexing latency for Twitter’s search system was around 15 seconds. It was fast enough to power relevance-based product features such as the ranked Home timeline, which delivers Tweets based on their relevance. Since determining a Tweet's relevance is based on factors such as how much engagement it gets, there is less need for instant indexing. Use cases requiring a much lower indexing latency, however, couldn’t be powered by our search infrastructure. For example, we couldn’t use this same search system to retrieve Tweets for a person’s profile page where people generally expect to see their Tweets appear the moment they are published.

Two main limitations prevented us from lowering our indexing latency:

  1. Some fields (bits of information associated with each Tweet) are not available when a Tweet is created. For example, a fully resolved URL usually provides a much better ranking signal than a shortened URL like http://t.co/foo. However, resolving new URLs takes time, and our old system required us to index all fields for a Tweet at the same time, so we needed to wait for all these delayed fields to become available.
  2. Most features in the Twitter application prioritize the newest relevant Tweets. Therefore, we sort our index by the time a Tweet was created. This would be easy to do if our systems received events that were strictly ordered. However, because our search system is decoupled from Tweet creation, the search system doesn’t necessarily receive Tweets in chronological order. To overcome this limitation, we added a buffer in our ingestion pipeline that stored all incoming Tweets for a few seconds, and sorted them in strictly increasing order of created time before sending them to our indexing system (a sketch of such a buffer follows this list).
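To make the second limitation concrete, a re-ordering buffer of this kind could look like the following sketch. It is a reconstruction from the description above, not Twitter's actual code; the `ReorderingBuffer` and `Tweet` names and the five-second delay are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of a re-ordering buffer, not Twitter's actual code.
// Incoming Tweets are held for a fixed delay and released in created-time
// order, so the downstream indexer sees a strictly sorted stream.
final class ReorderingBuffer {
    record Tweet(long id, long createdAtMillis) {}

    private static final long DELAY_MILLIS = 5_000; // hypothetical "few seconds"

    private final PriorityQueue<Tweet> buffer =
            new PriorityQueue<>(Comparator.comparingLong(Tweet::createdAtMillis));

    void add(Tweet tweet) {
        buffer.add(tweet);
    }

    // Releases every buffered Tweet that is older than the delay window,
    // in strictly increasing order of creation time.
    List<Tweet> drain(long nowMillis) {
        List<Tweet> ready = new ArrayList<>();
        while (!buffer.isEmpty()
                && buffer.peek().createdAtMillis() <= nowMillis - DELAY_MILLIS) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```

The delay bounds how far out of order a Tweet can arrive and still be indexed in the right place, which is also exactly why this buffer put a floor of several seconds on indexing latency.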

Overcoming these limitations required major changes to our ingestion pipeline and our indexing system, but we believe the results were worth the effort. Tweets are now available for searching within one second of creation, which allows us to power product features with strict real-time requirements, such as real-time conversations or profile pages. Let’s take a closer look at how we’ve achieved this.

Posting lists

The core of almost all search systems is a data structure called an inverted index. An inverted index is designed to quickly answer questions like “Which documents have the word cat in them?”. It does this by keeping a map from terms to posting lists. A term is typically a word, but is sometimes a conjunction, phrase, or number. A posting list is a list of document identifiers (or document IDs) containing the term. The posting list often includes extra information, like the position in the document where the term appears, or payloads to improve the relevance of our ranking algorithms.
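To make the structure concrete, here is a minimal, self-contained sketch of an inverted index. The class and method names are illustrative, not part of any real search system, and term positions stand in for the richer payloads mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory inverted index (illustrative, not Twitter's code).
// Each term maps to a posting list; here a posting carries the document ID
// and the term's position within that document.
final class InvertedIndex {
    record Posting(int docId, int position) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(int docId, String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                    .add(new Posting(docId, pos));
        }
    }

    // Answers: "Which documents have the word cat in them?"
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }
}
```

Indexing “the cat sat” as document 1 and then looking up “cat” would return a single posting with docId 1 and position 1.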

The search systems at Twitter process hundreds of thousands of queries per second, and most of them involve searching through posting lists with thousands of items. This makes the speed at which we can iterate through a posting list a critical factor in serving queries efficiently. For example, consider how many Tweets contain the word “the.”

Document identifiers

We use Lucene as our core indexing technology. In standard Lucene, an index is subdivided into chunks called segments, and document IDs are Java integers. The first document indexed in a particular segment is assigned an ID of 0, and new document IDs are assigned sequentially. When searching through a segment, the search starts at the lowest document IDs in the segment and proceeds to the highest IDs in the segment.

To support our requirement of searching for the newest Tweets first, we diverge from standard Lucene and assign document IDs from high to low: the first document in a segment is assigned a maximum ID (determined by how large we want our Lucene segment to be), and each new document gets a smaller document ID. This lets us traverse documents so that we retrieve the newest Tweets first, and terminate queries after we examine a client-specified number of hits. This decision is critical in reducing the amount of time it takes to evaluate a search query and therefore in letting us scale to extremely high request rates.

When we were using sorted streams of incoming Tweets, it was easy to assign document IDs: the first Tweet in a segment would get an ID equal to the size of the segment minus one, the second Tweet would get an ID equal to the size of the segment minus two, and so on, down to document ID 0. However, this document ID assignment scheme doesn’t work when the incoming stream isn’t sorted by the time a Tweet was created. In order to remove the delay added by sorting, we needed to come up with a new scheme.
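With a sorted stream, that old assignment amounts to a simple decrementing counter per segment, as in this sketch (the class name is made up for illustration):

```java
// Old scheme (sketch): with a time-sorted input stream, a per-segment
// decrementing counter gives each newer Tweet a smaller ID, so Lucene's
// ascending-ID traversal visits the newest Tweets first.
final class DecrementingDocIdAssigner {
    private int nextId;

    DecrementingDocIdAssigner(int segmentSize) {
        this.nextId = segmentSize - 1; // first Tweet: segment size minus one
    }

    int assign() {
        if (nextId < 0) {
            throw new IllegalStateException("segment is full");
        }
        return nextId--; // each subsequent Tweet gets the next smaller ID
    }
}
```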

In the new document ID assignment scheme, each Tweet is assigned a document ID based on the time that it was created. We needed to fit our document IDs into a 31-bit space, because Lucene uses positive Java integers as document IDs. Each document ID is unique within a segment, and our segments are usually writable for about 12 hours. We decided to allocate 27 bits to store timestamps with millisecond granularity, which is enough for a segment to last for a bit more than 37 hours. We use the last four bits as a counter for the number of Tweets with the same timestamp. This means that if we get more than 2⁴ (16) Tweets with the same millisecond timestamp, then some of them will be assigned a document ID that is in a slightly incorrect order. In practice, this is extremely rare, and we decided that this downside was acceptable because we often ran into a similar situation in the old system when a Tweet was delayed for more than 15 seconds, which also resulted in the assignment of an unordered document ID.
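A sketch of this packing follows. The class name and the window anchoring are assumptions; in particular, measuring the millisecond offset from the end of the segment's time window is our inference for keeping newer Tweets on smaller document IDs, consistent with the high-to-low assignment described earlier.

```java
// New scheme (sketch, not Twitter's code): a 31-bit document ID packs a
// 27-bit millisecond offset with a 4-bit same-millisecond counter. The
// offset is measured from the end of the segment's time window so that
// newer Tweets receive smaller IDs.
final class TimeBasedDocIdAssigner {
    private static final int COUNTER_BITS = 4;
    private static final long MAX_OFFSET = (1L << 27) - 1; // ~37.3 hours in ms

    private final long windowEndMillis; // end of the segment's writable window
    private long lastCreatedAt = -1;
    private int counter = 0;

    TimeBasedDocIdAssigner(long windowStartMillis) {
        this.windowEndMillis = windowStartMillis + MAX_OFFSET;
    }

    // Assumes Tweets sharing a millisecond timestamp arrive consecutively;
    // a production system would need a stronger uniqueness guarantee.
    int assign(long createdAtMillis) {
        long offset = windowEndMillis - createdAtMillis; // newer => smaller
        if (offset < 0 || offset > MAX_OFFSET) {
            throw new IllegalArgumentException("Tweet outside segment window");
        }
        if (createdAtMillis == lastCreatedAt) {
            counter++; // a 17th Tweet in one millisecond carries into the
                       // timestamp bits: unique, but slightly out of order
        } else {
            lastCreatedAt = createdAtMillis;
            counter = 0;
        }
        return (int) ((offset << COUNTER_BITS) + counter);
    }
}
```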

How it used to work: Unrolled linked lists

For the past eight years, the search systems at Twitter used a prepend-only unrolled linked list as the data structure backing the posting lists. This allowed us to avoid the overhead of storing a pointer for every value and vastly improved the speed of traversing a posting list because it was cache friendly. (An unrolled linked list is a linked list with multiple values per link — there is a good description on Wikipedia.)

In our old implementation, the linked list started out with a single value, and we allocated exponentially larger nodes each time the list needed to grow.
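A minimal sketch of such a structure, assuming integer values and made-up names (the production posting lists also store positions and payloads), might look like this:

```java
import java.util.function.IntConsumer;

// Sketch of a prepend-only unrolled linked list. New values go into the
// head node; when it fills up, a node twice as large is allocated and
// linked in front of it.
final class UnrolledIntList {
    private static final class Node {
        final int[] values;
        int used;         // number of slots filled in this node
        final Node next;  // older, smaller node

        Node(int capacity, Node next) {
            this.values = new int[capacity];
            this.next = next;
        }
    }

    private Node head = new Node(1, null); // the list starts with a single value

    void prepend(int value) {
        if (head.used == head.values.length) {
            head = new Node(head.values.length * 2, head); // exponential growth
        }
        head.values[head.used++] = value;
    }

    // Visits values newest-first: newest node first, newest slot first.
    void forEachNewestFirst(IntConsumer action) {
        for (Node node = head; node != null; node = node.next) {
            for (int i = node.used - 1; i >= 0; i--) {
                action.accept(node.values[i]);
            }
        }
    }
}
```

Because node sizes double, a posting list with n entries needs only O(log n) allocations, and traversal mostly scans contiguous arrays, which is what makes the layout cache friendly.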
