Hands-On Tour of Apache Spark in 5 Minutes

Introduction

In this tutorial, we provide an overview of Apache Spark and its relationship with Scala, Zeppelin notebooks, interpreters, Datasets, and DataFrames. To keep things simple, we use an Apache Zeppelin notebook as our development environment.

Zeppelin gives us a pre-configured environment in which to run Spark code written in Scala and SQL, a few basic shell commands, pre-written Markdown directions, and an HTML-formatted table.

To make things fun and interesting, we will introduce a dataset about the comedy TV show Silicon Valley and perform some basic operations on it with Spark in Zeppelin.

Prerequisites

Outline

Concepts

Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads.

Spark Datasets are strongly typed distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python.

New to Scala?

Throughout this tutorial we will use basic Scala syntax.

To learn more about Scala, see this excellent introductory tutorial.
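As a rough orientation, the basic Scala syntax used in Spark notebooks tends to look like the sketch below. It is plain Scala with no Spark dependency. The `Episode` case class, the `square` function, and the placeholder titles are assumptions made up for illustration; they are not taken from the tutorial notebook.

```scala
// An immutable value (val) and a mutable variable (var)
val greeting: String = "Hello, Spark"
var counter = 0
counter += 1

// A function with a typed parameter and an explicit return type
def square(x: Int): Int = x * x

// A case class: a lightweight, immutable record type (hypothetical schema)
case class Episode(season: Int, number: Int, title: String)

// Placeholder rows for illustration only, not real episode data
val episodes = List(
  Episode(1, 1, "placeholder title one"),
  Episode(1, 2, "placeholder title two"),
  Episode(2, 1, "placeholder title three")
)

// Collections are transformed with higher-order functions and lambdas
val seasonOneTitles = episodes.filter(e => e.season == 1).map(e => e.title)
val squares = (1 to 5).map(square).toList
```

The same `filter` and `map` style carries over to Spark Datasets, which is one reason Scala reads naturally in Spark code.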

New to Zeppelin?

If you haven’t already, check out the Hortonworks Apache Zeppelin page as well as the Getting Started with Apache Zeppelin tutorial. You will find the official Apache Zeppelin page here.

New to Spark?

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

If you would like to learn more about Apache Spark visit:

What are Interpreters?

Zeppelin notebooks support various interpreters, which allow you to perform many operations on your data. Below are just a few of the operations you can perform with Zeppelin interpreters:

  • Ingestion
  • Munging
  • Wrangling
  • Visualization
  • Analysis
  • Processing

These are some of the interpreters that will be utilized throughout our various Spark tutorials.

Interpreter    Description
%spark2        Spark interpreter to run Spark 2.x code written in Scala
%spark2.sql    Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
%sh            Shell interpreter to run shell commands, such as moving files
%angular       Angular interpreter to run Angular and HTML code
%md            Markdown interpreter for displaying formatted text, links, and images

Note the % at the beginning of each interpreter. Each paragraph needs to start with % followed by the interpreter name.

Learn more about Zeppelin interpreters.

What are Datasets and DataFrames?

Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases, and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. The key difference between the two is that Datasets are strongly typed.

Complex manipulations are possible on Datasets and DataFrames, but they are beyond the scope of this quick guide.

Learn more about Datasets and DataFrames.
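To make the typed versus untyped distinction concrete, here is a minimal sketch in script style, as it would appear in a Zeppelin paragraph or spark-shell. It assumes Spark 2.x (`spark-sql`) is on the classpath. In Zeppelin the `spark` session is normally provided by the interpreter; a local session is created here only so the example is self-contained. The `Episode` schema and its rows are hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session for illustration; Zeppelin usually predefines `spark`
val spark = SparkSession.builder()
  .appName("dataset-vs-dataframe")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical record type and placeholder rows (not the tutorial's dataset)
case class Episode(season: Int, number: Int, title: String)

val episodesDS: Dataset[Episode] = Seq(
  Episode(1, 1, "placeholder title one"),
  Episode(1, 2, "placeholder title two"),
  Episode(2, 1, "placeholder title three")
).toDS()

// Dataset: fields are accessed as typed members, checked at compile time
val seasonOneDS: Dataset[Episode] = episodesDS.filter(e => e.season == 1)

// DataFrame: the same data as untyped rows; columns are referenced by name
// and resolved at runtime
val episodesDF: DataFrame = episodesDS.toDF()
val seasonOneDF: DataFrame = episodesDF.filter($"season" === 1)

// A DataFrame can be converted back to a typed Dataset when the schema matches
val typedAgain: Dataset[Episode] = episodesDF.as[Episode]
```

A misspelled field such as `e.seasn` fails at compile time in the Dataset version, whereas a misspelled column name in the DataFrame version only fails when the query is analyzed at runtime.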

Apache Spark in 5 Minutes Notebook Overview

We will download and ingest an external dataset about the Silicon Valley Show episodes into a Spark Dataset and perform basic analysis, filtering, and word count.

After a series of transformations, applied to the Datasets, we will define a temporary view (table) such as the one below.

[Image: DataFrame contents table]

You will be able to explore those tables via SQL queries like the ones below.

[Image: complex SQL query graph]
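The temporary view and SQL steps described above might look roughly like the following sketch. It assumes Spark 2.x on the classpath and creates a local session for self-containment (Zeppelin normally provides `spark`). The view name `episodes`, the column names, and the rows are hypothetical placeholders, not the notebook's actual schema.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; Zeppelin usually predefines `spark`
val spark = SparkSession.builder()
  .appName("temp-view-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Placeholder rows with hypothetical column names
val episodes = Seq(
  (1, 1, "placeholder title one"),
  (1, 2, "placeholder title two"),
  (2, 1, "placeholder title three")
).toDF("season", "number", "title")

// Register the DataFrame as a temporary view so SQL can refer to it by name
episodes.createOrReplaceTempView("episodes")

// Query the view with SQL; the result is again a DataFrame
val perSeason = spark.sql(
  """SELECT season, COUNT(*) AS episode_count
    |FROM episodes
    |GROUP BY season
    |ORDER BY season""".stripMargin)

perSeason.show()
```

In a Zeppelin notebook, the SQL text alone could instead go into its own paragraph starting with `%spark2.sql`, which also makes the built-in chart views available for the result.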

Once you have a handle on the data and have performed a basic word count, we will add a few more steps for a more sophisticated word count analysis like the one below.

[Image: improved word count sample]
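As a hedged illustration of what a basic and a slightly refined word count could look like on a Spark Dataset, consider the sketch below. It assumes Spark 2.x on the classpath and a locally created session. The input strings and the `stopWords` set are made-up placeholders, and stop-word removal is just one possible refinement; the notebook's own steps may differ.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session for illustration; Zeppelin usually predefines `spark`
val spark = SparkSession.builder()
  .appName("word-count-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Placeholder text, not real episode summaries
val summaries: Dataset[String] = Seq(
  "The team builds the app",
  "The app ships"
).toDS()

// Basic word count: lowercase, split on non-word characters, drop empty tokens
val words: Dataset[String] = summaries
  .flatMap(line => line.toLowerCase.split("\\W+"))
  .filter(word => word.nonEmpty)

val wordCounts: Dataset[(String, Long)] = words.groupByKey(identity).count()

// One possible refinement: ignore very common words (hypothetical stop-word list)
val stopWords = Set("the", "a", "an")
val improvedCounts: Dataset[(String, Long)] = words
  .filter(word => !stopWords.contains(word))
  .groupByKey(identity)
  .count()

improvedCounts.show()
```

Filtering before grouping keeps the shuffle smaller, since discarded words never reach the aggregation step.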

By the end of this tutorial, you should have a basic understanding of Spark and an appreciation for its powerful and expressive APIs with the added bonus of a developer friendly Zeppelin notebook environment.

Import the Notebook

Import the Apache Spark in 5 Minutes notebook into your Zeppelin environment. (If you run into any issues at any point, check out the Getting Started with Apache Zeppelin tutorial.)

To import the notebook, go to the Zeppelin home screen.

1. Click Import note

2. Select Add from URL

3. Copy and paste the following URL into the Note URL field

# Getting Started Apache Spark in 5 Minutes Notebook

https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/hands-on-tour-of-apache-spark-in-5-minutes/assets/Getting%20Started%20_%20Apache%20Spark%20in%205%20Minutes.json

4. Click on Import Note

Once your notebook is imported, open it from the Zeppelin home screen:

5. Click Getting Started

6. Select Apache Spark in 5 Minutes

Once the Apache Spark in 5 Minutes notebook is up, follow all the directions within the notebook to complete the tutorial.

Summary

We hope that you were able to run this short introductory notebook successfully and that it has made you interested and excited enough to explore Spark with Zeppelin further.
