This tutorial explains the current Hortonworks Sandbox architecture. Starting with HDP 2.6.5, a new Sandbox structure makes it possible to instantiate two single-node clusters (i.e. HDP and HDF) within a single Sandbox, combining the best features of the Data-At-Rest and Data-In-Motion methodologies in a single environment. Have a look at the graphical representation of the Sandbox below: it shows where the Sandbox exists in relation to the outside world. The instance depicted is of the Connected Data Architecture (CDA); if you are not yet familiar with the concept of CDA, do not worry, we will review it in a later section.
At a high level, the Sandbox is a Linux (CentOS 7) virtual machine that leverages Docker to host different Sandbox distributions, namely HDP or HDF. To orchestrate communication between the outside world and the Sandbox, a reverse proxy server, NGINX, is containerized and configured to open only the ports needed to the outside, enabling us to interact granularly with each container.
Prerequisites
- Downloaded and deployed the Hortonworks Data Platform (HDP) or Hortonworks DataFlow (HDF) Sandbox
- Sandbox Deployment and Install Guide
- Learning the Ropes of the HDP Sandbox
- Learning the Ropes of the HDF Sandbox
- Basic understanding of Docker Containers
In the Docker architecture above, a Docker registry is a service used for storing Docker images, such as Docker Hub. The Docker host is the computer Docker runs on. Diving deeper into the host, you can see the Docker daemon, which is used to create and manage Docker objects such as images, containers, networks and volumes. The user or client interacts with the Docker daemon via the Docker client's Command Line Interface (CLI). The Docker daemon is a long-running program, also known as a server, and the CLI uses Docker's REST API to interact with the daemon. As you can observe, the Docker Engine is a client-server application comprised of the Docker CLI client, the REST API and the Docker daemon.
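You can observe this client-server split for yourself with the standard Docker CLI: `docker version` reports the client and the daemon (server) as separate components (the exact versions shown will of course differ on your machine):

```shell
# The Docker CLI (client) sends this request over Docker's REST API
# to the long-running daemon and prints both sides' versions.
docker version

# Example output shape (versions will differ):
# Client:
#  Version: ...
# Server:
#  Engine:
#   Version: ...
```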
In this new architecture, NGINX is used as a reverse proxy server. Traditionally, a proxy server is an intermediary that forwards traffic from multiple clients to servers on the internet. In contrast, a reverse proxy server resides behind a firewall and directs incoming requests to specific back-end servers; in our case these servers are the HDP and HDF containers.
Why a Reverse Proxy Server is Needed
One of the biggest obstacles to overcome with this architecture is keeping ports consistent and reducing conflicts between containers as much as possible; for example, we wanted to keep Ambari UI on port 8080 across any Sandbox. The best solution is to keep the default ports as they are but distinguish the back-end server by domain name. This is why in this build the Sandboxes are addressed as:
sandbox-hdp.hortonworks.com:<PORT> and sandbox-hdf.hortonworks.com:<PORT>
This allows us to maintain consistency across different Sandboxes and avoid conflicts. When CDA is deployed, Ambari UI is reachable for different Sandboxes at the same time by specifying the domain name of the Sandbox we are trying to reach, for example sandbox-hdp.hortonworks.com:8080 and sandbox-hdf.hortonworks.com:8080.
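Conceptually, this domain-based routing can be sketched as a pair of NGINX `server` blocks that select the back-end container by `server_name`. The snippet below is an illustrative sketch, not the Sandbox's actual configuration; in particular, the upstream container hostnames in `proxy_pass` are assumptions:

```
# Illustrative sketch of domain-based routing (not the Sandbox's actual config).
# Requests arriving on port 8080 are routed to the HDP or HDF container
# based on the Host header, so both Ambari UIs can share the same port.

server {
    listen 8080;
    server_name sandbox-hdp.hortonworks.com;
    location / {
        proxy_pass http://sandbox-hdp:8080;   # assumed container hostname
        proxy_set_header Host $host;
    }
}

server {
    listen 8080;
    server_name sandbox-hdf.hortonworks.com;
    location / {
        proxy_pass http://sandbox-hdf:8080;   # assumed container hostname
        proxy_set_header Host $host;
    }
}
```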
Cool stuff, right? Now let's take a look at where our containers are in relation to our virtual environment.
View Running Containers
If you would like to see the running Sandbox container and proxy, you must log on to the host. You may choose to follow along; however, this is not necessary.
Native Docker Sandbox
The Sandbox may also run on Docker native to the host operating system; for example, rather than running a VM to instantiate the containers, you interact with the Docker daemon directly. In this Docker architecture for the Sandbox, you interact directly with the Docker environment, as your native operating system is the host for the Sandboxes.
If you are using VMWare or VirtualBox, you may log on to either the Sandbox or the host. Here is a complete list of the open TCP ports for SSH services:
| Destination | TCP Port for SSH |
|---|---|
| VM – VirtualBox | 2200 |
| VM – VMWare | 22 |
If you are running the VirtualBox VM:
```
# SSH on to the VirtualBox Virtual Machine
ssh email@example.com -p 2200
```
Or if you are using VMWare:
```
# SSH on to the VMWare Virtual Machine
ssh firstname.lastname@example.org -p 22
```
Note: The default password is hadoop.
Now that you are in the Virtual Machine hosting the containers we can see what Docker images are ready for deployment:
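The standard Docker CLI command for listing the images available on a host is `docker images`. The output shape below is illustrative only; the repository names, tags and sizes on your Sandbox will differ:

```shell
# List the Docker images available on the host
docker images

# Example output shape (repositories, tags and sizes will differ):
# REPOSITORY                TAG      IMAGE ID      CREATED       SIZE
# ...                       ...      ...           ...           ...
```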
Furthermore, we can see what containers are currently running by using the following command:
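Assuming the standard Docker CLI, running containers are listed with `docker ps`:

```shell
# List currently running containers; add -a to also include stopped ones
docker ps
```

Each row of the output corresponds to one container, with the columns described in the table below.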
If you started out with HDP, you will see two containers running. The first is the NGINX proxy container, along with a list of open ports and where they are being forwarded. Since HDP was used as the base in this example, it is listed as a running container.
Here is some context on the information displayed:
| Column | Description |
|---|---|
| CONTAINER ID | ID given to an instantiated image by Docker. |
| IMAGE | The executable package from which your container has been instantiated. |
| COMMAND | Command used to instantiate your container; typically this is the path of an initialization script. |
| CREATED | How long ago the container was created. |
| STATUS | The container's state; a container may be created, restarting, running, paused, exited or dead. |
| PORTS | Open ports. Note that the proxy container also tells us where ports are being forwarded to. |
| NAMES | The container name, e.g. "sandbox-hdp" & "sandbox-proxy". |
When CDA has been deployed both HDP and HDF are displayed as running containers:
HDP vs HDF
Data-In-Motion is the concept that data is collected as a flow or stream from all kinds of different devices. While the data moves along this flow, components that NiFi calls "processors" modify, transform, aggregate and route it. Data-In-Motion covers most of the pre-processing stage when building a Big Data application. For example, data engineers format raw data into an improved schema so that data scientists can focus on analyzing and visualizing the data.
Data-At-Rest is the concept that data is not moving and is stored in a database or robust data store, residing in distributed data storage such as the Hadoop Distributed File System (HDFS). Instead of bringing the data to the query, the query is sent to the data to obtain meaningful insights. Data processing and analysis on this staged data takes place when building Big Data applications.
What is CDA?
Hortonworks Connected Data Architecture (CDA) is composed of both Hortonworks DataFlow (HDF) and Hortonworks DataPlatform (HDP) sandboxes and allows you to play with both data-in-motion and data-at-rest frameworks simultaneously.
As data comes in from the edge, it is collected, curated and analyzed in real time, on premise or in the cloud, using the HDF framework. Once the data has been captured, you can convert your Data-In-Motion into Data-At-Rest with the HDP framework to gain further insights.
How CDA is made possible in the sandbox
In order for HDF to send data into HDP, both Sandboxes need to be set up to communicate with each other. If you would like to know more about the deployment of CDA, check out the Sandbox Deployment and Install Guide under the Advanced Topics section. When CDA is enabled, a script internal to the Sandbox determines which base you started with and calls the Docker daemon to instantiate the image of the complementary Sandbox flavor (e.g. HDP installs HDF, and HDF installs HDP).
In the image below we used HDP as our base and launched the initialization script for CDA. As you can see all the needed components for HDF are being loaded into a new container:
A custom Docker network was created between the running containers through the Docker Engine. This is one of the many advantages of containers: inside the Docker Engine, containers can communicate directly with each other through a Docker bridge network, making it possible for the clusters to communicate.
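Assuming standard Docker networking, you can inspect this from the host. The commands below list the networks Docker knows about and show which containers are attached to one; `bridge` is the default network name, so substitute the Sandbox's custom network name as appropriate:

```shell
# List the Docker networks on the host; a custom network appears
# alongside the default bridge, host and none networks.
docker network ls

# Show the containers attached to a network and their internal IPs
# (replace "bridge" with the custom network's name).
docker network inspect bridge
```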
Congratulations, you have learned a great deal about the structure of our Sandbox and how the HDP and HDF single-node clusters are implemented. You have also learned what CDA is and how it can be used to capture insights from both Data-At-Rest and Data-In-Motion, as well as how inter-container communication is made possible by Docker's internal network and communication with the outside world is handled via NGINX. Now that you know the internal workings of CDA on the Sandbox, put your understanding into practice with these great CDA-ready tutorials:
- Analyze IOT Weather Station Data via Connected Data Architecture
- Real-Time Event Processing in NiFi, SAM, Schema Registry, and SuperSet
- Deploy Machine Learning Models using Spark Structured Streaming