What are the 5 Phases of Real-Time? - answer-1) Data Distillation
2) Model Development
3) Validation and Deployment
4) Real-Time Scoring
5) Model Refresh
SQOOP - answer--SQL + Hadoop = Sqoop
-To import data from relational databases into Hadoop
-To export data to relational databases fr...
GFS - answer--Google File System
-designed to solve the issues with distributed systems
What does GFS store? - answer-Stores large volumes of data; distributed MapReduce
processes that data
When was Hadoop published? - answer-The underlying design was published in 2003-2004
(Google's GFS and MapReduce papers). Based on the solution used by Google in the 1990s
What led to Hadoop? - answer-Doug Cutting's open source project: Nutch
What is the Hadoop solution? - answer--bring computation to the data rather than
bringing data to the computation
-Distribute computing to where data is stored
-run computations where data resides
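The "bring computation to the data" idea can be sketched as a toy scheduler. This is only an illustration, not Hadoop's actual scheduler; the names `assign_tasks`, `block_locations`, and `free_nodes` are hypothetical, and each node is assumed to run one task.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign each block's task to a free node, preferring nodes that hold the block.

    block_locations: dict mapping block id -> list of nodes storing a replica
    free_nodes: list of nodes with spare capacity (one task each in this toy model)
    Returns a dict mapping block id -> (node, is_local).
    """
    available = list(free_nodes)
    assignments = {}
    for block, holders in block_locations.items():
        # Prefer a node that already stores the data (no network transfer).
        local = [n for n in available if n in holders]
        if local:
            node, is_local = local[0], True
        elif available:
            # Fall back to any free node; the block would have to be copied over.
            node, is_local = available[0], False
        else:
            break  # no capacity left in this toy model
        available.remove(node)
        assignments[block] = (node, is_local)
    return assignments


blocks = {"b1": ["node1", "node2"], "b2": ["node2", "node3"], "b3": ["node1", "node3"]}
result = assign_tasks(blocks, ["node1", "node2", "node3"])
for block, (node, is_local) in result.items():
    print(block, "->", node, "(local)" if is_local else "(remote)")
```

In this example every task lands on a node that already stores its block, so no data moves across the network.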
Apache's Hadoop Definition - answer-The Apache Hadoop software library is a framework
that allows for the distributed processing of large scale data sets across clusters of
computers using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is designed to detect
and handle failures at the application layer, so delivering a highly available service on
top of a cluster of computers, each of which may be prone to failures.
What does Hadoop Include? - answer--Hadoop Distributed File System (HDFS)
-Hadoop YARN
-Hadoop Common
Hadoop Distributed File System (HDFS)? - answer--A distributed file system that provides
high-throughput access to application data
-is a distributed, scalable, and portable file system written in Java for the Hadoop
framework
-can store any type of file
-data is automatically split into chunks and replicated for high availability
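A minimal sketch of "split into chunks and replicate", assuming a simple round-robin placement. Real HDFS uses much larger blocks (128 MB by default in recent versions) and rack-aware placement; `split_into_blocks` and `place_replicas` are hypothetical names used only for illustration.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Round-robin placement: each block is stored on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


data = b"x" * 1000
blocks = split_into_blocks(data, 256)  # toy block size for the example
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block.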
Hadoop YARN? - answer-- A framework for job scheduling and cluster resource
management
-Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
-manages the cluster resources for job processing
Hadoop Common? - answer-The common utilities that support the other Hadoop modules
Hadoop - Processing, storing, and analyzing large volumes of data - answer--Software:
handles distribution of data and handling of failures
-Hardware: handles storage of data and processing power
Hadoop is distributed - answer-a Hadoop cluster can have several machines
Hadoop is scalable - answer-can add more machines to the cluster (proportionally adds
capacity)
Hadoop is Fault-Tolerant - answer-can recover from hardware failures
-Master re-assigns work
-Data replicates by default on 3 machines
-Nodes that recover rejoin the cluster automatically
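The recovery behavior can be sketched as follows: when a node fails, its replicas are dropped and each affected block is copied to another live node until it is back at 3 replicas. This is a simplified model, not Hadoop's implementation; `handle_node_failure` is a hypothetical helper.

```python
def handle_node_failure(placement, failed, live_nodes, replication=3):
    """Remove the failed node from every block and re-replicate onto live nodes."""
    repaired = {}
    for block, holders in placement.items():
        survivors = [n for n in holders if n != failed]
        # Candidates are live nodes that do not already hold this block.
        candidates = [n for n in live_nodes if n not in survivors]
        missing = replication - len(survivors)
        repaired[block] = survivors + candidates[:missing]
    return repaired


placement = {"b1": ["n1", "n2", "n3"], "b2": ["n2", "n3", "n4"]}
repaired = handle_node_failure(placement, failed="n2", live_nodes=["n1", "n3", "n4"])
print(repaired)
```

After the failure of `n2`, both blocks are back to three replicas on the remaining nodes.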
Hadoop is Open Source - answer--overseen by Apache
-close to 100 committers from companies like Cloudera, Hortonworks, etc.
Hadoop MapReduce - answer--processing framework to process the data
-other processing frameworks are now also available
- A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input
and the output of the job are stored in a file-system.
MapReduce Process - answer-A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then the input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system.
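The map, sort, and reduce steps can be illustrated with a word count run in a single Python process. This is a sketch of the programming model only, not the Hadoop API; the function names are hypothetical, and on a real cluster the map calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped):
    """Group values by key and sort the keys, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def run_job(chunks):
    mapped = []
    for chunk in chunks:  # each chunk is independent, so this loop could run in parallel
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))


counts = run_job(["big data big", "data cluster", "big cluster"])
print(counts)  # {'big': 3, 'cluster': 2, 'data': 2}
```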
Japan is seeking ________, while India craves _____________ and ___________. The leaders of
both countries, _______________(India) and _______________(Japan), are also working to
counter the growing regional influence of _________ -- an important economic partner to
both but also historically a rival. - answer-Japan is seeking growth markets, while India
craves advanced technology and foreign investment. The leaders of both countries,
Narendra Modi (India) and Shinzo Abe (Japan), are also working to counter the growing
regional influence of China -- an important economic partner to both but also
historically a rival.
Reports suggest that the big data and analytics market in India will grow approximately __
times, to ______ by 2020 - answer-8 times, to $16 billion
Japanese companies are using India as a _____________ ______ to expand into Africa, and
service providers are expanding from Japan into India - answer-Manufacturing base
Hadoop "Ecosystem" - answer--Tools built around the core Hadoop
-All ecosystem tools are open source
-Tools are designed to extend Hadoop's functionality
-New tools are added all the time
Hadoop Ecosystem projects included in Cloudera's CDH: - answer--Spark, HBase, Hive,
Impala, Parquet, Sqoop, Flume/Kafka, Solr, Hue, Sentry
Spark - answer-in-memory and streaming processing framework
HBase - answer-NoSQL database built on HDFS
Hive - answer-SQL processing engine designed for batch workloads
Impala - answer-SQL Query Engine designed for BI workloads
Parquet - answer-Columnar data storage format
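What "columnar" means can be shown with plain Python structures. This only illustrates the row-versus-column layout; it does not read or write the Parquet file format, and the sample records and `to_columnar` helper are made up for the example.

```python
# Row-oriented layout: one record per entry (sample data is hypothetical).
rows = [
    {"id": 1, "city": "Pune", "sales": 120},
    {"id": 2, "city": "Osaka", "sales": 340},
    {"id": 3, "city": "Pune", "sales": 90},
]


def to_columnar(rows):
    """Regroup row records so that all values of one column sit together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)
total_sales = sum(columns["sales"])  # an aggregate reads only the one column it needs
print(columns)
print(total_sales)
```

Keeping each column's values together is what lets analytic queries skip the columns they do not use.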
Sqoop - answer-Data movement/ETL to and from RDBMS
Flume, Kafka - answer-streaming data ingestion
Solr - answer-text search functionality
Hue - answer-web based user interface for Hadoop
Sentry - answer-an authorization tool for managing security
Hadoop Is? - answer--Scalable, for parallel/distributable problems (no dependencies
across data)
-A write once, read many solution (vs. an RDBMS for frequent writes and updates)
Hadoop is not? - answer--A database (random access)
-Interactive OLAP (for the moment)
-Updates to files
-Nonparallel work
-Many small files
-Low latency
What do most organizations prefer? - answer--An enterprise-ready distribution of Hadoop
that is: Tested thoroughly, supported, and integrates well with Hadoop projects and
other key software like ETL tools and databases.
Most widely used enterprise-ready Hadoop distributions? - answer-Cloudera,
Hortonworks, and MapR
A cluster? - answer-a group of computers working together
a node? - answer-is an individual computer in that cluster
Two kinds of nodes? - answer--Master node (Name Node)