A distributed Java-based file system for storing large volumes of data
HDFS and YARN form the data management layer of Apache Hadoop. YARN is the architectural center of Hadoop, the resource management framework that enables the enterprise to process data in multiple ways simultaneously—for batch, interactive and real-time data workloads on one shared dataset. YARN provides the resource management and HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
These specific features ensure that data is stored efficiently in a Hadoop cluster and that it is highly available:
| Feature | Description |
|---|---|
| Rack awareness | Considers a node’s physical location when allocating storage and scheduling tasks |
| Minimal data motion | Hadoop moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides, which significantly reduces network I/O and provides very high aggregate bandwidth. |
| Utilities | Dynamically diagnose the health of the file system and rebalance the data on different nodes |
| Rollback | Allows operators to bring back the previous version of HDFS after an upgrade, in case of human or systemic errors |
| Standby NameNode | Provides redundancy and supports high availability (HA) |
| Operability | HDFS requires minimal operator intervention, allowing a single operator to maintain a cluster of thousands of nodes |
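To make rack awareness more concrete, here is a minimal Python sketch of a rack-aware placement decision. It is loosely modelled on the commonly described HDFS default for three replicas (first on the writer's node, second on a different rack, third on another node of that second rack). The function names, the topology dictionary and the three-replica count are illustrative assumptions, not HDFS APIs or anything specified in the text above.

```python
import random
from typing import Dict, List, Optional


def rack_of(topology: Dict[str, List[str]], node: str) -> str:
    """Return the rack that hosts the given node (hypothetical helper)."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise KeyError(f"unknown node: {node}")


def place_replicas(topology: Dict[str, List[str]], writer_node: str,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Choose three nodes: the writer's node plus two nodes on one other rack."""
    rng = rng or random.Random()
    local_rack = rack_of(topology, writer_node)
    # The remote rack needs at least two nodes to hold replicas 2 and 3.
    remote_racks = sorted(rack for rack, nodes in topology.items()
                          if rack != local_rack and len(nodes) >= 2)
    if not remote_racks:
        raise ValueError("no remote rack with two or more nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(sorted(topology[remote_rack]), 2)
    return [writer_node, second, third]
```

With this layout, losing a single rack never removes every copy of a block, and only one of the three writes has to cross racks.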
An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.
The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes.
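The block arithmetic can be illustrated with a short Python sketch. The 128 MB block size comes from the text; the replication factor of 3 is an assumption (a common default, not stated above), and the function names are hypothetical.

```python
from typing import List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size mentioned above
REPLICATION = 3                 # assumption: a common default, not from the text


def block_layout(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block a file of file_size bytes needs."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block holds only the leftover bytes; it is not padded.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Total bytes consumed across DataNodes once every block is replicated."""
    return file_size * replication
```

Under these assumptions, a 300 MB file is split into two 128 MB blocks and one 44 MB block, and it consumes 900 MB of raw cluster storage.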
The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
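The replica monitoring described above can be sketched as a toy in-memory model. This is not the NameNode's actual implementation: the block map, the node names and both function names are hypothetical, and real HDFS also weighs rack placement and node load when choosing targets.

```python
from typing import Dict, List, Set, Tuple


def find_under_replicated(block_map: Dict[str, Set[str]],
                          live_nodes: Set[str],
                          target: int) -> Dict[str, int]:
    """Map each block to its number of missing replicas, counting live nodes only."""
    missing = {}
    for block, holders in block_map.items():
        live_holders = holders & live_nodes
        if len(live_holders) < target:
            missing[block] = target - len(live_holders)
    return missing


def plan_re_replication(block_map: Dict[str, Set[str]],
                        live_nodes: Set[str],
                        target: int) -> List[Tuple[str, str, str]]:
    """Return (block, source_node, destination_node) copies that restore the target."""
    plan = []
    for block in sorted(block_map):
        live_holders = block_map[block] & live_nodes
        if not live_holders or len(live_holders) >= target:
            continue  # either no surviving source to copy from, or already healthy
        source = sorted(live_holders)[0]
        candidates = sorted(live_nodes - live_holders)
        needed = target - len(live_holders)
        for destination in candidates[:needed]:
            plan.append((block, source, destination))
    return plan
```

In this model a failed DataNode simply disappears from `live_nodes`, and the plan lists the copies needed to bring each affected block back to the target count.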
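The NameNode does not send requests directly to DataNodes. Instead, it sends instructions in its replies to the heartbeats that those DataNodes send. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node.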
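As a rough illustration of this reply-to-heartbeat pattern, the following Python sketch queues instructions per DataNode and hands them over only when that DataNode checks in. The class name, method names and command strings are hypothetical and do not reflect the actual HDFS wire protocol.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class NameNodeSketch:
    """Toy model: commands wait in a queue until the target DataNode heartbeats."""

    def __init__(self) -> None:
        self._pending: Dict[str, Deque[str]] = defaultdict(deque)

    def queue_command(self, datanode_id: str, command: str) -> None:
        """Record an instruction; nothing is pushed to the DataNode here."""
        self._pending[datanode_id].append(command)

    def handle_heartbeat(self, datanode_id: str) -> List[str]:
        """Reply to a heartbeat with all pending commands, then clear the queue."""
        queue = self._pending[datanode_id]
        commands = list(queue)
        queue.clear()
        return commands
```

Because the NameNode only ever answers heartbeats, it needs no outbound connections to DataNodes, and a DataNode that stops sending heartbeats simply receives no further instructions.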
In HDP 2.2, the rolling upgrade feature and the underlying HDFS High Availability configuration enable Hadoop operators to upgrade the cluster software and restart upgraded services, without taking the entire cluster down.
Since its first deployment at Yahoo in 2006, HDFS has established itself as the de facto scalable, reliable and robust file system for big data. Along the way, it has addressed several fundamental challenges of distributed storage at unparalleled scale and with enterprise rigor.
The Apache community continues to innovate. For example, a new initiative called Ozone introduces an object store that extends HDFS beyond a file system, toward a more versatile, object-enabled enterprise storage layer for use cases such as storing all the photos uploaded to Facebook or all the email attachments in Gmail.
Fortune 1000 CIOs see the Hadoop cluster’s storage and compute resources as a valuable infrastructure for running both Hadoop and non-Hadoop applications and services. This emerging trend of PaaS-on-Hadoop, propelled by YARN, opens up the Hadoop infrastructure to new use cases. Object store is a natural fit for the storage component of a PaaS model, and the Apache community is thrilled to work on adding these new capabilities to HDFS and Apache Hadoop.
The Apache Hadoop HDFS team is working on the following improvements:
|Reliable and Secure Operations||
|Scalability and Efficiency||
|Support for Heterogeneous Hardware||
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, HAWQ, Zeppelin, Atlas, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.