Hadoop

[http://hadoop.apache.org/common/docs/current/hdfs_design.html#Introduction HDFS Design]

Should we really move to Hadoop?

The answer to this question is yes.


 * Reasons why we need to move to Hadoop:


 * 1) We can't run the business on age-old technologies.
 * 2) We are losing legacy expertise in the market pretty fast, and 80% of the business rules / batch processing is built on mainframe technologies.
 * 3) Mainframes are becoming a costly affair for companies; the cost of the servers is something that needs to be considered.
 * There were days when the business used to wait weeks to months to get data processed before making a decision, but these days business is very dynamic: decisions have to be made every hour, if not every day, to capture the customer.
 * As an example of pricing online goods, take Amazon: they get data from other sites and update their prices based on their competitors.
 * ** So we can't run the business with age-old technologies. **
 * Finding mainframe gurus or expertise has become very difficult, whereas most of the business rules in most organizations are written on the mainframe. When a CR (change request) comes, it is extremely difficult to do the impact analysis and implement the changes.
 * ** We don't see real experts who can deep-dive into the system to analyse the impact. **


 * The cost of mainframe systems is measured in MIPS and MSUs. Considering a hardware cost per MIPS of around $1,500 to $3,000 per year, for an average capacity of 2,000 MIPS the hardware cost alone comes to a whopping $3M to $6M annually! A saving of 100 MIPS per year translates to a saving of $150,000 to $300,000 in dollar terms.
 * ** So any small saving on MIPS is a huge benefit to the organization. **

MIPS: Million Instructions Per Second
 * MIPS is used to describe the speed of a computer processor.

MSU: Million Service Units (one MSU = 6 MIPS)


 * Organizations are looking at three technologies to replace their mainframe systems, namely:
 * Hadoop (exploring options to migrate the mainframe to Pig/Hive and Ruby)
 * DataStage/Informatica (ETL systems) (I worked on migrating mainframe systems to DataStage)
 * SAP systems (my brother is working on migrating mainframe systems to SAP)

Here is a reference to the open-source / licensed tools that can be used to auto-convert JCL (Job Control Language) to Unix / Hadoop ecosystems.

http://www.uvsoftware.ca/vsejcl.htm#9A0

Hadoop Class - ORAPRO

__Anatomy of a File Read__

File read = getting the metadata + getting the data.

Getting the metadata:
 * 1) The client calls the open method on HDFS.
 * 2) HDFS invokes the name node's open method, submitting the file name over RPC.
 * 3) The name node performs validations:
 * 4) Does the file exist in the metadata?
 * 5) Does the client have permission to read the file? (If anything fails, it throws an exception: invalid input path exception or permission denied.)
 * 6) On success, the name node returns an FSDataInputStream, which contains the block information, data node locations, and replication.
 * 7) The blocks are sorted based on proximity, and the client starts reading the data from the data blocks.
 * 8) In order to read data from the data nodes, the client opens a number of threads, e.g. 40 (configurable).
 * 9) The client reads the data blocks and writes them to temp files.
 * 10) The client identifies the order of the blocks using the sync marker, which is 4 bytes of information.
 * 11) The temp files are rewritten into the original file.
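
Below is a minimal sketch of the client side of this read path using the standard HDFS Java API; the path /user/demo/input.txt is hypothetical, and the block lookup, proximity sorting, and data node reads all happen behind open() and read().

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws IOException {
        // Loads core-site.xml / hdfs-site.xml from the classpath, so
        // fs.defaultFS points the client at the name node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() is the RPC described above: the name node validates the
        // path and permissions, then returns block locations wrapped in
        // an FSDataInputStream (or throws if validation fails).
        Path path = new Path("/user/demo/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int n;
            // Each read() pulls bytes from the closest data node holding
            // the current block; the stream switches data nodes between
            // blocks transparently.
            while ((n = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, n);
            }
        }
        fs.close();
    }
}
}}}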

__Proximity (Replication and Rack Awareness)__
 * 1) In order to compute the distance between nodes, the name node requires rack-awareness information.
 * 2) By default, Hadoop follows a flat topology.
 * 3) In order to accommodate a large number of machines, the name node follows a tree topology.
 * 4) To accomplish the tree topology, nodes are clubbed together within a rack.
 * 5) One way to calculate the distance between nodes would be to measure actual (TCP/IP) packet flow.
 * 6) But measuring TCP/IP packets is not static in nature; it is a cumbersome operation.
 * 7) So the Hadoop engineers used the rack-awareness information to compute the distance: distance(R1N1, R1N2) < distance(R1N1, R2N4). A toy version of this calculation follows.
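
A toy version of that rack-based distance, assuming locations written as "/rack/node" (hypothetical naming, not the real NetworkTopology class): distance is the number of hops up to the closest common ancestor and back down, so same node = 0, same rack = 2, different rack = 4.

{{{
#!java
public class TopologyDistanceSketch {
    static int distance(String a, String b) {
        // "/R1/N1".split("/") -> ["", "R1", "N1"]
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        // Find how deep the two paths agree (the common ancestor).
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        // Hops from each node up to the common ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/R1/N1", "/R1/N1")); // 0: same node
        System.out.println(distance("/R1/N1", "/R1/N2")); // 2: same rack
        System.out.println(distance("/R1/N1", "/R2/N4")); // 4: different rack
    }
}
}}}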

__Block Placement Policy__

If a data center or cluster goes down we still have the data, because we maintain 3 copies across geographically separate locations.
 * 1) Replication: choosing a data node for each copy of a block.
 * 2) The policy suggests that the first copy of a block should go to a data node which is nearest to the client.
 * 3) The first replicated copy should go to a data node which can handle the following failures:
 * 4) Data center failure (natural disaster)
 * 5) Cluster failure
 * 6) Rack failure
 * 7) Data node failure

Rack failures: handled by keeping the replica in another rack; data node failure is covered by the same rack-failure placement.

Replica 2 of block 1 should go to a data node chosen for read performance; it is placed within the same rack as the previous replica. A toy sketch of these choices follows.
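
A toy sketch of the placement choices above, not the real BlockPlacementPolicyDefault; node names like "R1N1" are hypothetical.

{{{
#!java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    static List<String> placeThreeReplicas(String clientNode,
                                           String otherRackNode,
                                           String otherRackPeer) {
        List<String> replicas = new ArrayList<>();
        replicas.add(clientNode);    // copy 1: nearest to the client (fast write)
        replicas.add(otherRackNode); // copy 2: different rack (survives rack failure)
        replicas.add(otherRackPeer); // copy 3: same rack as copy 2 (read performance)
        return replicas;
    }

    public static void main(String[] args) {
        // Client sits on R1N1; the other rack offers R2N3 and R2N4.
        System.out.println(placeThreeReplicas("R1N1", "R2N3", "R2N4"));
        // [R1N1, R2N3, R2N4]
    }
}
}}}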


__Secondary Name Node Checkpoint (Backup Mechanism)__

The secondary name node (SNN) connects to the primary name node (NN) and invokes a method called rollEditLog.

The name node stops writing metadata to the edits log and creates a new file (edits.new).

The secondary name node pulls the fsimage and edits log using HTTP GET; to serve them over HTTP, Hadoop ships with an embedded Jetty web server.

The secondary name node pulls both files into memory, deserializes them, and merges (replays) the edits log; if we compare the NN with the SNN at this point, their namespaces are equal. After the merge operation, the secondary name node invokes rollFsImage on the name node instance, submits the merged checkpoint (fsimage.ckpt), and keeps a backup copy inside the secondary name node.

The name node deletes the edits log, renames edits.new to edits, and maintains the fsimage versions.

In the event of the NN going down, there is always a chance that the edits.new file is lost as well; to overcome this, the edits log can also be maintained on NFS (v3) or on Quorum Journal nodes.
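
The merge step is just "replay the edit operations over the last snapshot". A toy model of that idea, using plain Java maps and hypothetical operation names (not the real FSImage/FSEditLog classes):

{{{
#!java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointMergeSketch {
    public static void main(String[] args) {
        // fsimage: the last serialized snapshot of the namespace (path -> metadata).
        Map<String, String> fsimage = new HashMap<>();
        fsimage.put("/user/demo/a.txt", "blocks=[blk_1]");

        // edits: operations recorded since that snapshot was written.
        List<String[]> edits = List.of(
                new String[] {"ADD", "/user/demo/b.txt", "blocks=[blk_2]"},
                new String[] {"DELETE", "/user/demo/a.txt", ""});

        // Merge = replay each edit against the in-memory image; the result
        // is the new checkpoint (fsimage.ckpt) shipped back to the name node.
        for (String[] op : edits) {
            if (op[0].equals("ADD")) {
                fsimage.put(op[1], op[2]);
            } else if (op[0].equals("DELETE")) {
                fsimage.remove(op[1]);
            }
        }
        System.out.println(fsimage); // {/user/demo/b.txt=blocks=[blk_2]}
    }
}
}}}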

28th April 2017: HBASE: HBase is a NoSQL database on top of Hadoop. HBase stores its data on HDFS; there is no other file system for HBase. HBase is a column-family-oriented datastore that works on large amounts of data. It is an implementation of Google's BigTable design: it supports dynamic schemas, horizontal sharding, sparse data stores, multi-dimensional sorted data, compression, and real-time updates.

1st May 2017
 || Row Oriented || Column Oriented ||
 || Reads amount to full table scans || Reads are not full table scans (only the needed columns are read) ||
 || Compression on heterogeneous objects (whole rows) is not that effective || Compression on homogeneous objects (single columns) is much more effective ||
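
A minimal sketch of writing and reading one cell with the standard HBase Java client; the table name "users" and column family "info" are hypothetical and would have to be created first (e.g. via the HBase shell).

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key + column family + qualifier + value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Read it back; only the requested column family is touched,
            // which is the column-oriented advantage from the table above.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
}}}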
 * Hadoop Configurations: we are in the era of configuration. Hadoop ships with default configurations in the *-default.xml files; for client/site-specific configurations we have the *-site.xml files. In a Hadoop cluster, we will find the configurations under $HADOOP_HOME/etc/hadoop.

core-site.xml: I/O-related settings, name node and file system information. hdfs-site.xml: replication, permissions, the name node's fsimage directory, and the data node's data block directories on the operating system.
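
A small sketch of how a client sees these layers through the standard Configuration class: default values are overridden by site files, which are in turn overridden by programmatic settings.

{{{
#!java
import org.apache.hadoop.conf.Configuration;

public class ConfSketch {
    public static void main(String[] args) {
        // new Configuration() first loads core-default.xml (shipped in the
        // Hadoop jar), then core-site.xml from the classpath; site values
        // override the defaults.
        Configuration conf = new Configuration();

        // fs.defaultFS comes from core-site.xml (or the built-in default).
        System.out.println(conf.get("fs.defaultFS"));

        // Client-side override: highest precedence, wins over both XML layers.
        conf.set("dfs.replication", "2");
        System.out.println(conf.get("dfs.replication")); // 2
    }
}
}}}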

Hadoop Modes: Hadoop can be run in three different modes:
 * 1) Fully distributed mode
 * 2) Pseudo-distributed mode
 * 3) Standalone mode

Fully distributed: dedicated machines + dedicated services. Pseudo-distributed: a single machine with dedicated services (each daemon runs as its own process). Standalone: a single machine, no dedicated services, no HDFS services (everything runs in one JVM against the local file system).
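
As a sketch, which mode client code effectively runs in is mostly a function of fs.defaultFS; the ports below are conventional examples, not requirements.

{{{
#!java
import org.apache.hadoop.conf.Configuration;

public class ModeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standalone: fs.defaultFS keeps its built-in default file:///,
        // so all I/O goes to the local file system and no daemons run.
        // Pseudo-distributed: core-site.xml points it at local daemons,
        // e.g. hdfs://localhost:9000.
        // Fully distributed: it points at a remote name node,
        // e.g. hdfs://namenode-host:8020.
        System.out.println(conf.get("fs.defaultFS", "file:///"));
    }
}
}}}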