Since seeing it with the eye was impossible, each one felt it with his palm in that darkness

_Rumi

Introduction to Hadoop

Big Data, Hadoop & more

Save Button?

Floppy Disk

But things changed!

Increase in

Sources (Users, Clients), Forms

Flood is coming

MySQL, MariaDB, Access, SQL Server, Oracle

Time out on expensive servers...

Overload

You must change your technologies

What makes Big Data

Implicit and Explicit

Forums, Messengers, Comments, Q/A, Reviews

Share ...!?

Did you just pause that?

Share, view, follow

These are only the visible parts of Big Data

Logs

Clicks

Applications

Servers (Web, Email, Proxy, ...)

Systems _journalctl

Hardware _{Access point}

Life

Main source

Mysterious and valuable part!

Digital footprint

A simple tweet

How can we process these data?

Scalability

Low fault resistance, but inexpensive and practically unlimited

Organizations

Saturation

Bottlenecks such as disk I/O and load average

No matter how much memory I have...

Now we know what we are looking for!

Client server architecture

RAID

Google

GFS, ???

Doug Cutting

Apache Lucene _(IR)

Apache Nutch _{(Web Crawler)}

White paper

MapReduce

It's just a framework, C4.5

Hadoop comes into play!

Naming

Hortonworks definition

An open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware

Two sides of Hadoop

GFS, MapReduce

HDFS, Hadoop MapReduce

What we do with data

Save
Transfer
Join
Index
Analytics
Aggregate
Visualize

Ecosystem

HDFS

Allows us to distribute the storage
All hard drives look like a single huge hard disk
Keeps copies of the data
No single point of failure
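
The HDFS points above (distributed storage, copies of data, no single point of failure) can be illustrated with a toy sketch. It is not HDFS code: the tiny block size, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses large blocks, 128 MB by default, and rack-aware placement). Only the replication factor of 3 matches the HDFS default.

```python
BLOCK_SIZE = 4     # bytes per block in this toy; HDFS defaults to 128 MB
REPLICATION = 3    # matches the default HDFS replication factor


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption of this sketch, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed_node: str) -> bool:
    """A file stays readable if every block has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


data = b"hello hadoop"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different machines, losing any single node still leaves at least two copies of each block.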

YARN

Manages the resources
Decides what gets to run tasks and when
Which node is available
We build applications on top of it
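
A minimal sketch of the idea above: a resource manager tracks which node has free capacity and decides which request runs where. This is a toy model, not the YARN API. The `Node` and `ContainerRequest` classes, the application names, and the first-fit FIFO policy are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int


@dataclass
class ContainerRequest:
    app: str
    memory_mb: int


def schedule(requests, nodes):
    """Place each request, in arrival order, on the first node with room.

    Returns (assignments, pending): assignments is a list of
    (app, node name) pairs; pending holds requests that did not fit.
    """
    assignments, pending = [], []
    for req in requests:
        for node in nodes:
            if node.free_memory_mb >= req.memory_mb:
                node.free_memory_mb -= req.memory_mb  # reserve the memory
                assignments.append((req.app, node.name))
                break
        else:
            pending.append(req)  # no node is available right now
    return assignments, pending


nodes = [Node("node1", 4096), Node("node2", 2048)]
requests = [
    ContainerRequest("spark-job", 3072),
    ContainerRequest("hive-query", 2048),
    ContainerRequest("mr-job", 2048),
]
assignments, pending = schedule(requests, nodes)
```

Here the third request has to wait, because neither node has 2048 MB left after the first two are placed.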

MapReduce

Got separated from YARN: resource management moved out of MapReduce into YARN
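
The MapReduce programming model itself can be sketched in a few lines of plain Python, using the classic word count. This runs in a single process and only mimics the map, shuffle, and reduce phases; the function names are illustrative and are not part of the Hadoop MapReduce API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["big data", "Big Data Hadoop"])
```

On a cluster, the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network so that each reducer sees all values for its keys.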


Pig

No Java or Python? A scripting language (Pig Latin) that resembles SQL
Transforms the script into something that can be run on MapReduce


Hive

SQL
Makes the data look like an RDBMS
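
To illustrate the idea of putting a table-like view over raw records and querying it with SQL, the sketch below uses Python's built-in `sqlite3` module as a stand-in. It is not Hive and does not use HiveQL; the sample lines, the `events` table, and its column names are hypothetical.

```python
import sqlite3

# Hypothetical raw log lines, as they might sit in files on a cluster.
RAW_LINES = ["alice,click", "bob,view", "alice,view", "alice,click"]


def events_per_user(lines):
    """Load raw comma-separated lines into a table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (username TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        (line.split(",") for line in lines),
    )
    rows = conn.execute(
        "SELECT username, COUNT(*) FROM events "
        "GROUP BY username ORDER BY username"
    ).fetchall()
    conn.close()
    return rows


result = events_per_user(RAW_LINES)
```

The point is only the workflow: the analyst writes a familiar `SELECT ... GROUP BY` query, and the engine underneath decides how to execute it.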


Tez

Hive on Tez is faster than Hive on MapReduce


Spark

Sits at the same level as MapReduce, on top of YARN or Mesos
Python, Java, Scala
Fast
Active development
Handles SQL queries
Machine learning
Handles streaming data
