Enabling HDFS Federation Having 1B File System Objects


Today’s successful organizations are data-driven. At eBay, we have thousands of engineers, analysts, and data scientists crunching petabytes of data every day to provide a seamless user experience. Through the use of data, we can execute at a massive scale and connect our millions of users to global ecommerce.

Apache Hadoop has been one of the most popular framework choices for Big Data analytics at eBay. We use Hadoop to generate value from data: to improve the search experience, to identify and optimize relevant advertisements, to enrich our product catalogs, and to perform clickstream analysis that helps us understand how eBay customers leverage our marketplace. The Analytics Data Infrastructure (ADI) team at eBay is responsible for providing highly available and reliable Hadoop clusters to all customers. Currently, the team maintains 10+ Hadoop clusters ranging from hundreds to thousands of nodes. In this blog, we explore how we enabled federation on one of our largest Hadoop clusters, with 4,000+ nodes, 1B file system objects, and 170PB of storage, all with the goal of improving our customer experience.

Problem Statement

One of our largest Hadoop clusters reached the magic number of 1B total file system objects. The NameNode process was running with 200GB of memory. Once the file system object count exceeded 950M, we began observing various scaling issues:

1)    The NameNode process was running at almost 90 percent memory utilization, leading to frequent full GCs. Since we were already running with 200GB of memory, we had reached the physical memory limit of an enterprise-class node, with no further scope for increasing memory.

2)    As shown below, we faced NameNode failovers every other day due to heavy GC activity. We observed six failovers in seven days, and the cluster was becoming unstable.

[Graph: NameNode failovers over seven days]

3)    The overall RPC processing time of the NameNode increased significantly, resulting in slowness for various SLA-bound jobs.

4)    We were running out of RPC threads on the NameNode, which hindered our ability to add more DataNodes to the cluster (the handler pool is sized as shown in the sketch below).
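
For reference, the NameNode's RPC handler pool is sized through hdfs-site.xml. Below is a minimal sketch; the values are illustrative, not our production settings:

    <!-- hdfs-site.xml (illustrative values) -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <!-- Threads serving client RPCs; the default is only 10 -->
      <value>200</value>
    </property>
    <property>
      <name>dfs.namenode.service.handler.count</name>
      <!-- Separate pool for DataNode and other service RPCs; only used
           when dfs.namenode.servicerpc-address is configured -->
      <value>100</value>
    </property>

Even with a generously sized pool, every additional DataNode adds heartbeat and block-report traffic, so a single namespace eventually saturates these handlers.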

Under these conditions, we were not able to provide a highly available and reliable cluster. This forced us to pursue a scale-out solution for the name service by enabling federation on our cluster. Below, we discuss the View FileSystem, namespace selection (the most important design decision), and the basic ideas and outcome of NameNode federation. We then conclude with some of our learnings and future plans.

NameNode Federation

Hadoop NameNode federation allows horizontal scaling of the name service. A federated cluster contains several NameNodes, or namespaces, each of which acts independently of the others. These independent NameNodes are federated, meaning they require no coordination among themselves. The DataNodes are used as common storage by all the NameNodes, and each DataNode registers with every NameNode in the cluster. For more information, see the Apache Hadoop documentation on the NameNode federation design.
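
At the configuration level, federation boils down to declaring multiple nameservices in hdfs-site.xml. The sketch below is a minimal two-namespace example; the nameservice IDs (ns1, ns2) and hostnames are hypothetical, not our actual topology:

    <!-- hdfs-site.xml (hypothetical two-namespace federation) -->
    <property>
      <name>dfs.nameservices</name>
      <!-- Two independent, federated namespaces -->
      <value>ns1,ns2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns2</name>
      <value>nn2.example.com:8020</value>
    </property>

DataNodes read dfs.nameservices and register with every NameNode listed; with NameNode HA (which we run), each nameservice would additionally declare its NameNode pair via dfs.ha.namenodes.<nsId>.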

[Figure: HDFS Federation architecture, with independent NameNodes sharing the DataNodes as common storage]

View FileSystem

The View File System (ViewFs) provides a way to manage multiple Hadoop file system namespaces (or namespace volumes). It is particularly useful for clusters that have multiple NameNodes, and hence multiple namespaces, in HDFS Federation. ViewFs is analogous to the client-side mount tables in some Unix/Linux systems. It can be used to create personalized namespace views as well as per-cluster common views. The team was particularly interested in ViewFs because it allows our customers to run their existing code on the federated cluster with only configuration changes. For more information, see the Apache Hadoop ViewFs documentation.
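
The client-side switch to ViewFs happens in core-site.xml: fs.defaultFS points at a named mount table rather than a single NameNode, and each top-level path is linked to a namespace. A minimal sketch, where the cluster name "clusterX" and the /data path are hypothetical:

    <!-- core-site.xml (hypothetical mount table) -->
    <property>
      <name>fs.defaultFS</name>
      <!-- Clients resolve all paths through the mount table named "clusterX" -->
      <value>viewfs://clusterX</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.clusterX.link./data</name>
      <!-- /data in the view maps to /data in namespace ns1 -->
      <value>hdfs://ns1/data</value>
    </property>

Because resolution is purely client-side, existing jobs keep working once their configuration is updated; no application code changes are required.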

[Figure: ViewFs client-side mount table mapping a unified view onto multiple namespaces]

Selection of Namespace

One of the important decisions when enabling federation is the selection of the namespace. As shown in the table below, we had a couple of folders that were candidates for a separate namespace. The obvious choice was the "/PathA" folder, which had the highest storage utilization. However, NameNode performance (in terms of RPC processing time, GC activity, and so on) depends directly on the number of file system objects it must keep in memory and the volume of operations performed on those objects. As the table shows, the "/PathB" folder used only 13.6PB of storage, but it had the highest number of file system objects and operations. We therefore made "/PathB" a separate namespace and placed all the remaining folders under another single namespace; the mount-table sketch after the table illustrates the split.

[Table: storage used, file system object count, and operation count for the candidate folders /PathA and /PathB]
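
Translated into mount-table terms, the decision looks roughly like the sketch below (the nameservice IDs are the hypothetical ns1/ns2 from earlier; every remaining top-level folder gets a link like the /PathA one):

    <!-- core-site.xml (hypothetical mount-table split) -->
    <property>
      <!-- /PathB gets its own dedicated namespace -->
      <name>fs.viewfs.mounttable.clusterX.link./PathB</name>
      <value>hdfs://ns2/PathB</value>
    </property>
    <property>
      <!-- /PathA and all other top-level folders stay on the original namespace -->
      <name>fs.viewfs.mounttable.clusterX.link./PathA</name>
      <value>hdfs://ns1/PathA</value>
    </property>

Note that on Hadoop 2.7.x each top-level folder needs an explicit link entry; a catch-all fallback mount only arrived in later releases, if we recall correctly.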

Federation Outcome

NameNode Heap Usage

The most important benefit of federation is that the overall metadata is divided across NameNodes, reducing per-node memory usage. As shown in the graph below, memory usage of the existing NameNode dropped from 146GB to 88-95GB after the federation rollout.

[Graph: NameNode heap usage before and after the federation rollout]

NameNode Metadata Size

As heap memory usage drops, each NameNode process can store more metadata with a smaller memory requirement. The table and graph below show that the new federated cluster currently holds almost the same, or slightly more, metadata than the old cluster did.

[Table and graph: metadata size of the old cluster vs. the new federated cluster]

Though the total number of file system objects in the new cluster is a little higher than in the old one, overall heap usage is almost halved. As shown in the graph below, heap usage for the two federated NameNodes is 85GB and 96GB.

[Graph: heap usage of the two federated NameNodes]

For the old cluster, it was 182GB even after full GC.

[Graph: heap usage of the old cluster's NameNode]

Based on current metrics, we can easily double the number of file system objects in the new cluster without impacting performance.

GC Time for Each Cycle

The amount of time spent in each GC cycle also matters when measuring application performance. As shown in the graph below, GC time is around 100 ms for the federated NameNodes.

[Graph: GC time per cycle for the federated NameNodes]

For the old cluster, the time was around 200 ms.

[Graph: GC time per cycle for the old cluster]

Average Operation Time

Federation also had a direct impact on the average time of operations performed by the NameNode. One of the most important and costly NameNode operations is "GetBlocks". As shown in the graph below, this operation takes around 50ms to 125ms on the federated NameNodes.

[Graph: GetBlocks operation time for the federated NameNodes]

For similar operations in our older cluster, it took 300ms to 400ms.

[Graph: GetBlocks operation time for the old cluster]

FS Image Size

The FS image contains all the metadata information used in the event of a NameNode restart. At restart, the entire image is loaded into memory while block reports from the DataNodes are processed; the NameNode stays in safe mode, and no write operations are allowed during this time. The FS image size depends purely on the amount of metadata. The federated NameNodes now have FS images almost half the size of the old cluster's, which reduces NameNode startup time.

[Graph: FS image size for the old cluster and the federated NameNodes]

Learnings

While enabling federation on our cluster, we came across four main issues:

1)    The default move operation in ViewFs is very strict: by default, ViewFs does not allow mv operations across mount points, even when both mount points resolve to the same namespace. This caused issues with Hive/Spark queries. The solution was to apply the HADOOP-14455 patch (see the configuration sketch after this list).

2)    Spark/Hive jobs failed because of an ORC write error on ViewFs. This was resolved with the HIVE-10790 patch.

3)    We observed higher iowait caused by the "du" command, since HDFS now has to scan two 256 * 256 block pool directory trees, resulting in higher system load.

4)    All existing Hive/Spark table locations had to be updated in the Hive Metastore; we changed every location from hdfs:// to viewfs:// during the rollout. Additionally, if you have a firewall in your environment, make sure all connectivity issues are resolved for the newly added federated NameNodes.
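
For the first issue, the HADOOP-14455 patch, as we understand it, makes the rename restriction configurable. A sketch of the relaxed setting:

    <!-- core-site.xml (rename strategy introduced by HADOOP-14455) -->
    <property>
      <name>fs.viewfs.rename.strategy</name>
      <!-- Allow mv across mount points as long as both sides resolve to the
           same underlying filesystem; the default SAME_MOUNTPOINT rejects
           any rename that crosses a mount point -->
      <value>SAME_FILESYSTEM_ACROSS_MOUNTPOINT</value>
    </property>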

After enabling NameNode federation, we were able to provide a more reliable and performant cluster for our customers. Our namespace selection turned out to be near-perfect: we reduced heap usage, RPC processing time, and GC time by half for both NameNodes. We are now confident that we can add more federated NameNodes in the future to grow our cluster capacity beyond 5,000 nodes.

Appendix

At eBay, we run our custom build of Hadoop 2.7.3 on 5,000+ nodes of our custom-designed hardware (ODM), using Kubernetes in assimilation mode.

