Paper Summary: Lessons from Giant-Scale Services


Today, I read a paper titled Lessons from Giant-Scale Services. The paper was written by Eric Brewer, the person behind the CAP theorem, and was published in 2001. It helps the reader build a mental model for thinking about the availability of large-scale distributed systems.

The paper starts by defining the problem statement.

 High availability is a major driving requirement behind giant-scale system design. Other infrastructures — such as the telephone, rail, and water systems — aim for perfect availability, a goal that should apply to IP-based infrastructure services as well. All these systems plan for component failures and natural disasters, but information systems must also deal with constantly evolving features and unpredictable growth. 

As I wrote in an earlier blog on the CAP theorem,

 Availability means every request receives a non-error response from a non-failing node in a reasonable time. It is expected that the client may not receive the most recent write. 

We typically measure availability in terms of uptime like 99.99%. As mentioned in the paper, uptime is defined as

uptime = (MTBF - MTTR) / MTBF

MTBF is the mean time between failures.

MTTR is the mean time to repair, i.e., the time it takes to recover from a failure.

For high uptime, we should either have a high MTBF or a low MTTR.
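To make the trade-off concrete, here is a minimal sketch (my own helper, not from the paper) of the uptime formula:

```python
def uptime(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: (MTBF - MTTR) / MTBF."""
    return (mtbf_hours - mttr_hours) / mtbf_hours

# With a failure every 1000 hours, cutting recovery time from
# 1 hour to 6 minutes moves us from "three nines" toward "four nines".
print(uptime(1000, 1.0))  # 0.999
print(uptime(1000, 0.1))  # ≈ 0.9999
```

Note how halving MTTR improves uptime exactly as much as doubling MTBF, yet is usually far cheaper to engineer.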

 Following this equation, we can improve uptime either by reducing the frequency of failures or reducing the time to fix them. Although the former is more pleasing aesthetically, the latter is much easier to accomplish with evolving systems. 

When you are building complex distributed systems, it is difficult to avoid failures, so MTBF will be low. The only way to achieve high availability is to make sure we can recover from failures quickly, keeping MTTR low. That requires a highly automated deployment and recovery process, which in turn requires investment in platform engineering.

 In addition, new features tend to reduce MTBF but have relatively little impact on MTTR, which makes it more stable. Thus, giant-scale systems should focus on improving MTTR and simply apply best effort to MTBF. 

The author then goes on to define two more terms: yield and harvest.

 yield = queries completed / queries offered 

The author makes the case that yield is more useful in practice than uptime because it maps directly to the user experience: it tells us how many users were actually affected when queries went unfulfilled.

 Being down for a second when there are no queries has no impact on users or yield, but reduces uptime. Similarly, being down for one second at peak and off-peak times generates the same uptime, but vastly different yields because there might be an order-of-magnitude difference in load between the peak second and the minimum-load second. Thus we focus on yield rather than uptime. 
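The quote above can be turned into a toy calculation (my own illustration, with made-up load numbers):

```python
# Hypothetical load: 1000 queries in the peak second, 10 in the off-peak second.
load = [1000, 10]

def uptime_and_yield(down_seconds):
    """Return (uptime, yield) when the system is down for the given seconds."""
    up_fraction = 1 - len(down_seconds) / len(load)
    completed = sum(q for i, q in enumerate(load) if i not in down_seconds)
    return up_fraction, completed / sum(load)

print(uptime_and_yield({0}))  # down at peak:    uptime 0.5, yield ≈ 0.01
print(uptime_and_yield({1}))  # down off-peak:   uptime 0.5, yield ≈ 0.99
```

Same uptime in both cases, yet a two-orders-of-magnitude difference in how many queries actually completed.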

Harvest is defined by the following formula

 harvest = data available / complete data 

Harvest measures how much of the complete data set is reflected in the answer. Why this matters will become clear when we talk about replication and partitioning.
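As a small illustration (my own sketch, not from the paper), harvest for a store whose data is spread evenly across partitions:

```python
def harvest(partitions_up: int, partitions_total: int) -> float:
    """Fraction of the complete data set reflected in an answer,
    assuming data is spread evenly across partitions."""
    return partitions_up / partitions_total

# A database spread over 4 partitions, with 1 partition down:
print(harvest(3, 4))  # 0.75: answers reflect three quarters of the data
```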

 A perfect system would have 100 percent yield and 100 percent harvest. That is, every query would complete and would reflect the entire database. 

The author makes the case that through our design we can influence whether faults impact yield or harvest.

Replicated systems lose yield when faults occur: the surviving replicas still have access to all the data, so harvest is not impacted. I think this is similar to CP systems in the CAP theorem.

On the other hand, partitioned systems lose harvest when faults occur: they lose access to part of the data, although the remaining queries still complete. I think this is similar to AP systems in the CAP theorem.
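A minimal model of the two designs (my own sketch; it assumes data is spread evenly and every node contributes equal capacity):

```python
def replicated(nodes: int, failed: int):
    """Every node holds the full data set: a failure costs capacity (yield)."""
    harvest = 1.0                      # survivors still see all the data
    yield_ = (nodes - failed) / nodes  # but capacity shrinks
    return harvest, yield_

def partitioned(nodes: int, failed: int):
    """Each node holds 1/nodes of the data: a failure costs data (harvest)."""
    harvest = (nodes - failed) / nodes  # part of the data is gone
    yield_ = 1.0                        # remaining queries still complete
    return harvest, yield_

print(replicated(4, 1))   # (1.0, 0.75)
print(partitioned(4, 1))  # (0.75, 1.0)
```

Losing one node out of four hurts either yield or harvest by the same 25 percent; the design choice is which one you give up.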

The last major point covered in the paper is the DQ Principle.

 Data per query × queries per second → constant
 The intuition behind this principle is that the system’s overall capacity tends to have a particular physical bottleneck, such as total I/O bandwidth or total seeks per second, which is tied to data movement. The DQ value is the total amount of data that has to be moved per second on average, and it is thus bounded by the underlying physical limitation. At the high utilization level typical of giant-scale systems, the DQ value approaches this limitation.
 Overall, DQ normally scales linearly with the number of nodes, which means a small test cluster is a good predictor for DQ changes on the production system.

The DQ principle is especially useful for data-intensive systems.

 Replicas maintain D and reduce Q (and thus yield), while partitions keep Q constant and reduce D (and thus harvest).
 The DQ principle gives us new options for graceful degradation: We can either limit Q (capacity) to maintain D, or we can reduce D and increase Q. We can focus on harvest through admission control (AC), which reduces Q, or on yield through dynamic database reduction, which reduces D, or we can use a combination of the two. Temporarily cutting the effective database size in half, for instance, should roughly double our capacity. 
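The "cut the database in half, double the capacity" arithmetic falls out directly. A sketch with a hypothetical DQ budget (the numbers are made up for illustration):

```python
DQ_BUDGET = 1_000_000  # data moved per second at saturation (made-up number)

def max_qps(data_per_query: int) -> float:
    """With D x Q roughly constant, query capacity is DQ / D."""
    return DQ_BUDGET / data_per_query

full_harvest = max_qps(100)  # 10,000 qps reflecting the whole database
half_harvest = max_qps(50)   # drop half the data -> 20,000 qps
print(full_harvest, half_harvest)
```

Admission control moves along the other axis: it holds D fixed and caps Q, preserving harvest at the cost of yield.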

Finally, the author discusses three different deployment strategies:

  1. Fast reboot
  2. Rolling upgrade
  3. Big flip

Big flip is today's blue-green deployment. I have seen many organisations prefer blue-green deployment over rolling upgrades because of the latter's limitations.
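If I remember the paper's argument correctly, all three strategies give up roughly the same total DQ during an upgrade; they differ only in how that loss is spread over time. A back-of-the-envelope sketch with made-up numbers:

```python
def dq_loss(strategy: str, nodes: int = 10, window: int = 60,
            dq_per_node: int = 100) -> float:
    """Total capacity-seconds lost while upgrading the whole cluster.
    `window` is a hypothetical time to upgrade one batch of nodes."""
    if strategy == "fast_reboot":  # whole cluster down for one window
        return nodes * dq_per_node * window
    if strategy == "rolling":      # one node down at a time, for n windows
        return nodes * (dq_per_node * window)
    if strategy == "big_flip":     # half the cluster down, for two windows
        return (nodes / 2) * dq_per_node * (2 * window)
    raise ValueError(strategy)

losses = {s: dq_loss(s) for s in ("fast_reboot", "rolling", "big_flip")}
print(losses)  # same total loss for all three, distributed differently
```

The choice between them is therefore less about total cost and more about whether you can tolerate a short full outage, a long window of mixed versions, or a halved capacity.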

