Reducing search indexing latency to one second


One of the key metrics for a search system is the indexing latency, the amount of time it takes for new information to become available in the search index. This metric is important because it determines how quickly new results show up. Not all search systems need to update their contents quickly. In a warehouse inventory system, for example, one daily update to its search index might be acceptable. At Twitter -- where people are constantly looking for the answer to “what’s happening” -- real-time search is a must.

Until mid-2019, the indexing latency for Twitter’s search system was around 15 seconds. It was fast enough to power relevance-based product features such as the ranked Home timeline, which delivers Tweets based on their relevance. Since determining a Tweet's relevance is based on factors such as how much engagement it gets, there is less need for instant indexing. Use cases requiring a much lower indexing latency, however, couldn’t be powered by our search infrastructure. For example, we couldn’t use this same search system to retrieve Tweets for a person’s profile page where people generally expect to see their Tweets appear the moment they are published.

Two main limitations prevented us from lowering our indexing latency:

  1. Some fields (bits of information associated with each Tweet) are not available when a Tweet is created. For example, a fully resolved URL usually provides a much better ranking signal than a shortened URL like http://t.co/foo. However, resolving new URLs takes time, and our old system required us to index all fields for a Tweet at the same time, so we needed to wait for all these delayed fields to become available.
  2. Most features in the Twitter application prioritize the newest relevant Tweets. Therefore, we sort our index by the time a Tweet was created. This would be easy to do if our systems received events that were strictly ordered. However, because our search system is decoupled from Tweet creation, the search system doesn’t necessarily receive Tweets in chronological order. To overcome this limitation, we added a buffer in our ingestion pipeline that stored all incoming Tweets for a few seconds, and sorted them in strictly increasing order of created time before sending them to our indexing system (a sketch of such a buffer follows this list).
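To make the second limitation concrete, a re-ordering buffer of this kind could look like the following sketch. It is a reconstruction from the description above, not Twitter's actual code; the `ReorderingBuffer` and `Tweet` names and the five-second delay are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of a re-ordering buffer, not Twitter's actual code.
// Incoming Tweets are held for a fixed delay and released in created-time
// order, so the downstream indexer sees a strictly sorted stream.
final class ReorderingBuffer {
    record Tweet(long id, long createdAtMillis) {}

    private static final long DELAY_MILLIS = 5_000; // hypothetical "few seconds"

    private final PriorityQueue<Tweet> buffer =
            new PriorityQueue<>(Comparator.comparingLong(Tweet::createdAtMillis));

    void add(Tweet tweet) {
        buffer.add(tweet);
    }

    // Releases every buffered Tweet that is older than the delay window,
    // in strictly increasing order of creation time.
    List<Tweet> drain(long nowMillis) {
        List<Tweet> ready = new ArrayList<>();
        while (!buffer.isEmpty()
                && buffer.peek().createdAtMillis() <= nowMillis - DELAY_MILLIS) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```

The delay bounds how far out of order a Tweet can arrive and still be indexed in the right place, which is also exactly why this buffer put a floor of several seconds on indexing latency.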

Overcoming these limitations required major changes to our ingestion pipeline and our indexing system, but we believe the results were worth the effort. Tweets are now available for searching within one second of creation, which allows us to power product features with strict real-time requirements, such as real-time conversations or profile pages. Let’s take a closer look at how we’ve achieved this.

Posting lists

The core of almost all search systems is a data structure called an inverted index. An inverted index is designed to quickly answer questions like “Which documents have the word cat in them?”. It does this by keeping a map from terms to posting lists. A term is typically a word, but is sometimes a conjunction, phrase, or number. A posting list is a list of document identifiers (or document IDs) containing the term. The posting list often includes extra information, like the position in the document where the term appears, or payloads to improve the relevance of our ranking algorithms.
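To make the structure concrete, here is a minimal, self-contained sketch of an inverted index. The class and method names are illustrative, not part of any real search system, and term positions stand in for the richer payloads mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory inverted index (illustrative, not Twitter's code).
// Each term maps to a posting list; here a posting carries the document ID
// and the term's position within that document.
final class InvertedIndex {
    record Posting(int docId, int position) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(int docId, String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                    .add(new Posting(docId, pos));
        }
    }

    // Answers: "Which documents have the word cat in them?"
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }
}
```

Indexing “the cat sat” as document 1 and then looking up “cat” would return a single posting with docId 1 and position 1.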

The search systems at Twitter process hundreds of thousands of queries per second, and most of them involve searching through posting lists with thousands of items. This makes the speed at which we can iterate through a posting list a critical factor in serving queries efficiently. For example, consider how many Tweets contain the word “the.”

Document identifiers

We use Lucene as our core indexing technology. In standard Lucene, an index is subdivided into chunks called segments, and document IDs are Java integers. The first document indexed in a particular segment is assigned an ID of 0, and new document IDs are assigned sequentially. When searching through a segment, the search starts at the lowest document IDs in the segment and proceeds to the highest IDs in the segment.

To support our requirement of searching for the newest Tweets first, we diverge from standard Lucene and assign document IDs from high to low: the first document in a segment is assigned a maximum ID (determined by how large we want our Lucene segment to be), and each new document gets a smaller document ID. This lets us traverse documents so that we retrieve the newest Tweets first, and terminate queries after we examine a client-specified number of hits. This decision is critical in reducing the amount of time it takes to evaluate a search query and therefore in letting us scale to extremely high request rates.

When we were using sorted streams of incoming Tweets, it was easy to assign document IDs: the first Tweet in a segment would get an ID equal to the size of the segment minus one, the second Tweet would get an ID equal to the size of the segment minus two, and so on, down to document ID 0. However, this document ID assignment scheme doesn’t work when the incoming stream isn’t sorted by the time a Tweet was created. In order to remove the delay added by sorting, we needed to come up with a new scheme.
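With a sorted stream, that old assignment amounts to a simple decrementing counter per segment, as in this sketch (the class name is made up for illustration):

```java
// Old scheme (sketch): with a time-sorted input stream, a per-segment
// decrementing counter gives each newer Tweet a smaller ID, so Lucene's
// ascending-ID traversal visits the newest Tweets first.
final class DecrementingDocIdAssigner {
    private int nextId;

    DecrementingDocIdAssigner(int segmentSize) {
        this.nextId = segmentSize - 1; // first Tweet: segment size minus one
    }

    int assign() {
        if (nextId < 0) {
            throw new IllegalStateException("segment is full");
        }
        return nextId--; // each subsequent Tweet gets the next smaller ID
    }
}
```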

In the new document ID assignment scheme, each Tweet is assigned a document ID based on the time that it was created. We needed to fit our document IDs into a 31-bit space, because Lucene uses positive Java integers as document IDs. Each document ID is unique within a segment, and our segments are usually writable for about 12 hours. We decided to allocate 27 bits to store timestamps with millisecond granularity, which is enough for a segment to last for a bit more than 37 hours. We use the last four bits as a counter for the number of Tweets with the same timestamp. This means that if we get more than 2⁴ (16) Tweets with the same millisecond timestamp, then some of them will be assigned a document ID that is in a slightly incorrect order. In practice, this is extremely rare, and we decided that this downside was acceptable because we often ran into a similar situation in the old system when a Tweet was delayed for more than 15 seconds, which also resulted in the assignment of an unordered document ID.
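A sketch of this packing follows. The class name and the window anchoring are assumptions; in particular, measuring the millisecond offset from the end of the segment's time window is our inference for keeping newer Tweets on smaller document IDs, consistent with the high-to-low assignment described earlier.

```java
// New scheme (sketch, not Twitter's code): a 31-bit document ID packs a
// 27-bit millisecond offset with a 4-bit same-millisecond counter. The
// offset is measured from the end of the segment's time window so that
// newer Tweets receive smaller IDs.
final class TimeBasedDocIdAssigner {
    private static final int COUNTER_BITS = 4;
    private static final long MAX_OFFSET = (1L << 27) - 1; // ~37.3 hours in ms

    private final long windowEndMillis; // end of the segment's writable window
    private long lastCreatedAt = -1;
    private int counter = 0;

    TimeBasedDocIdAssigner(long windowStartMillis) {
        this.windowEndMillis = windowStartMillis + MAX_OFFSET;
    }

    // Assumes Tweets sharing a millisecond timestamp arrive consecutively;
    // a production system would need a stronger uniqueness guarantee.
    int assign(long createdAtMillis) {
        long offset = windowEndMillis - createdAtMillis; // newer => smaller
        if (offset < 0 || offset > MAX_OFFSET) {
            throw new IllegalArgumentException("Tweet outside segment window");
        }
        if (createdAtMillis == lastCreatedAt) {
            counter++; // a 17th Tweet in one millisecond carries into the
                       // timestamp bits: unique, but slightly out of order
        } else {
            lastCreatedAt = createdAtMillis;
            counter = 0;
        }
        return (int) ((offset << COUNTER_BITS) + counter);
    }
}
```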

How it used to work: Unrolled linked lists

For the past eight years, the search systems at Twitter used a prepend-only unrolled linked list as the data structure backing the posting lists. This allowed us to avoid the overhead of storing a pointer for every value and vastly improved the speed of traversing a posting list because it was cache friendly. (An unrolled linked list is a linked list with multiple values per link — there is a good description on Wikipedia.)

In our old implementation, the linked list started out with a single value, and we allocated exponentially larger nodes each time the list needed to grow.
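A minimal sketch of such a structure, assuming integer values and made-up names (the production posting lists also store positions and payloads), might look like this:

```java
import java.util.function.IntConsumer;

// Sketch of a prepend-only unrolled linked list. New values go into the
// head node; when it fills up, a node twice as large is allocated and
// linked in front of it.
final class UnrolledIntList {
    private static final class Node {
        final int[] values;
        int used;         // number of slots filled in this node
        final Node next;  // older, smaller node

        Node(int capacity, Node next) {
            this.values = new int[capacity];
            this.next = next;
        }
    }

    private Node head = new Node(1, null); // the list starts with a single value

    void prepend(int value) {
        if (head.used == head.values.length) {
            head = new Node(head.values.length * 2, head); // exponential growth
        }
        head.values[head.used++] = value;
    }

    // Visits values newest-first: newest node first, newest slot first.
    void forEachNewestFirst(IntConsumer action) {
        for (Node node = head; node != null; node = node.next) {
            for (int i = node.used - 1; i >= 0; i--) {
                action.accept(node.values[i]);
            }
        }
    }
}
```

Because node sizes double, a posting list with n entries needs only O(log n) allocations, and traversal mostly scans contiguous arrays, which is what makes the layout cache friendly.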
