Since seeing it with the eye was impossible, each felt it in the dark with the palm of his hand.

Rumi

Introduction to Hadoop

Big Data, Hadoop & more

Save Button?

Floppy Disk

But things changed!

Increase in

Sources (Users, Clients), Forms

Flood is coming

MySQL, MariaDB, Access, SQL Server, Oracle

Time out on expensive servers...

Overload

You must change your technologies

What makes Big Data

Implicit and Explicit

Forums, Messengers, Comments, Q/A, Reviews

Share ...!?

Did you just pause that?

Share, view, follow

These are only the visible parts of big data

Logs

Clicks

Applications

Servers (Web, Email, Proxy, ...)

Systems (journalctl)

Hardware (access points)

Life

Main source

Mysterious and valuable part!

Digital footprint

A simple tweet

How can we process these data?

Scalability

Low reliability, but inexpensive and practically unlimited

Organizations

Saturation

Bottlenecks such as disk I/O and load average

No matter how much memory I have...

Now we know what we are looking for!

Client server architecture

RAID

Google

GFS, ???

Doug Cutting

Apache Lucene (IR)

Apache Nutch (Web Crawler)

White paper

MapReduce

It's just a framework, C4.5

Hadoop comes into play!

Naming

Hortonworks definition

An open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware

Two sides of Hadoop

GFS, MapReduce

HDFS, Hadoop MapReduce

What we do with data

  • Save
  • Transfer
  • Join
  • Index
  • Analytics
  • Aggregate
  • Visualize

Ecosystem

HDFS | YARN | MapReduce | Pig | Hive | Tez | Storm | Spark | Oozie | ZooKeeper | Others

HDFS

Allows us to distribute the storage
Makes all the hard drives look like a single huge hard disk
Keeps redundant copies of the data
No single point of failure

YARN

Manages the cluster's resources
Decides what gets to run tasks, and when
Tracks which nodes are available
We build applications on top of it

MapReduce

Got separated from YARN: resource management moved out of MapReduce into YARN

Pig

No Java or Python needed: Pig Latin is a scripting language with an SQL-like feel
Transforms the script into something that can be run on MapReduce

Hive

SQL
Makes the data look like an RDBMS

Tez

Hive on Tez is faster than Hive on MapReduce

Spark

Sits at the same level as MapReduce, on top of YARN or Mesos
Python, Java, Scala
Fast
Active development
Handles SQL queries
Machine learning
Handles streaming data

Storm

Processes streaming data
Sensors, logs
Spark Streaming does the same thing
Can update a machine learning model
Updates data as it comes in
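
The "update as it comes" idea can be sketched in plain Python. This is not the Storm or Spark Streaming API; `RunningMean`, `process_stream`, and the sensor names are hypothetical and only illustrate processing one event at a time without storing the whole stream.

```python
class RunningMean:
    """Keeps a mean that is updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to keep earlier values around.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def process_stream(events):
    """Consume (sensor, value) events and yield the updated mean per sensor."""
    stats = {}
    for sensor, value in events:
        current = stats.setdefault(sensor, RunningMean()).update(value)
        yield sensor, current


# Hypothetical sensor readings arriving one by one.
events = [("s1", 10.0), ("s2", 4.0), ("s1", 20.0)]
updates = list(process_stream(events))
print(updates)
```

Each incoming event changes the state immediately, which is the difference from a batch job that waits for all the data before computing anything.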

Oozie

Schedules jobs

Complicated steps
Load into Hive, query using Spark, then transfer to HBase.

ZooKeeper

Coordinates everything on the cluster
Tracks which nodes are up or down
Many of these apps rely on ZooKeeper

Data Ingestion

Sqoop

Brings relational data into Hadoop; talks to databases over ODBC/JDBC

Flume

Transports web logs into Hadoop (Spark, Storm)

Kafka

Like Flume but more general: collects data from a cluster of PCs, web servers, or whatever and broadcasts it into the Hadoop cluster.

Distributions

Hortonworks Sandbox

A Hive example

TTY

Apache Ambari

Data visualizer

Star Wars

Complicated stuff, we don't have to worry about the details

Separated from MapReduce in Hadoop v2

Run MapReduce alternatives on it (Tez)

Idea: split the computation across the cluster

Maintain data locality (Integrated with HDFS, where data lives)
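
The data-locality idea can be sketched as a toy scheduler: prefer a node that already stores the block, and fall back to any free node otherwise. This is not YARN's real scheduler; `assign_tasks`, the node names, and the slot counts are hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the input untouched
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that already stores this block.
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with a free slot (data must travel).
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                break  # cluster is full; remaining tasks would have to wait
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster state.
locations = {"b0": ["node1", "node2"], "b1": ["node1"], "b2": ["node1"]}
slots = {"node1": 2, "node2": 1, "node3": 1}
plan = assign_tasks(locations, slots)
print(plan)
```

Here the third task cannot run on `node1` because its slots are used up, so it runs elsewhere and the block has to be read over the network.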

YARN

We can have multiple resource managers

Tez (still YARN)

Allows our big data to be stored across the entire cluster in a distributed and reliable manner.

Handling large files

Breaking data into blocks - 128 MB

Keeps multiple copies of these blocks (in a clever way)

Allows us to use regular computers (no special hardware needed)
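
A small sketch of the block arithmetic, assuming the 128 MB block size above and a replication factor of 3 (an assumption here; the slides do not state it). The round-robin placement is a stand-in for illustration, not HDFS's actual rack-aware policy, and the function and node names are hypothetical.

```python
BLOCK_SIZE_MB = 128  # block size from the slide
REPLICATION = 3      # assumption: a commonly used replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if rest:
        sizes.append(rest)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(1000)  # a 1000 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
raw_storage_mb = sum(blocks) * REPLICATION

print(len(blocks), blocks[-1], raw_storage_mb)
print(placement[0])
```

Because every block lives on several machines, losing one regular computer does not lose any data; the price is that the cluster stores each file several times over.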

HDFS

HA HDFS

Client <---> ZooKeeper

Only one NameNode is active at a time

HDFS Federation

Subdirectories -> namespace volume -> each NameNode manages one namespace volume (dumpe2fs)
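
One way to picture federation is a lookup table from directory prefixes to the NameNode that owns them. The sketch below is only an illustration; `MOUNT_TABLE`, `resolve_namenode`, and the NameNode names are hypothetical and not part of any Hadoop API.

```python
# Hypothetical mapping: each top-level directory is one namespace volume.
MOUNT_TABLE = {
    "/user": "namenode-a",
    "/logs": "namenode-b",
    "/tmp": "namenode-c",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for `path` (longest prefix wins)."""
    best = None
    for prefix in mount_table:
        # Match the directory itself or anything below it, not "/userdata".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace volume covers {path}")
    return mount_table[best]


print(resolve_namenode("/user/alice/data.csv"))
print(resolve_namenode("/logs"))
```

Each NameNode then only has to keep the metadata for its own subtree in memory.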

Map: Transform the data that we care about

Shuffle and sort

Reduce: Aggregate Data
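
The three phases can be sketched in plain Python with a word count. This runs in a single process and is not the Hadoop MapReduce API; the function names and sample lines are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word we care about."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, then order by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


lines = ["big data big cluster", "data lives on the cluster"]
result = run_job(lines)
print(result)
```

On a real cluster the mappers and reducers would run on different nodes and the shuffle would move data between them; the flow of keys and values stays the same.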