HADOOP CERTIFICATION EXAM
For data in motion. Powered by Apache NiFi. 1) real-time - add, trace, adjust; 2) integrated - common input, output, transformation; 3) secure - security rules, encryption, traceability; 4) adaptive - adapts the data flow, scalable; if a connection is poor, it thins down the data it sends - answer-Hortonworks DataFlow (HDF)
A user-driven process of searching for patterns or specific items in a data set. Data discovery
applications use visual tools such as geographical maps, pivot-tables, and heat-maps to make
the process of finding patterns or specific items rapid and intuitive. Data discovery may leverage
statistical analysis and data mining. Ex. web log analysis, online ad placement, claims notes mining -
answer-Data discovery
Ex. sensor data ingest - answer-ETL onboard
Ex. individual driver histories - answer-Active archive
Perishable insights - answer-Data in motion
Historical insights - answer-Data at rest
Supports data discovery, single view, predictive analytics - answer-Actionable intelligence
A Single View application aggregates data from multiple sources into a central repository to
create a single view of anything — of customers, inventory, systems - answer-Single view
Offers the leading platform for Operational Intelligence. It enables the curious to look closely at
what others ignore—machine data—and find what others never see: insights that can help make
your company more productive, profitable, competitive and secure - answer-Splunk
An open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab, open sourced in 2010, and later became an Apache project - answer-Apache Spark
Real-time event processing for sensor and business activity monitoring. A free and open source
distributed realtime computation system. Storm makes it easy to reliably process unbounded
streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language. Ingests millions of events per second. Managed with Ambari. Horizontally scalable. Fixed, low latency and continuous processing for very
high frequency streaming data. - answer-Apache Storm
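Storm's native API is Java, but components can be written in other languages through its multi-lang protocol. A minimal sketch of a bolt, assuming the third-party streamparse Python library; the class name and stream contents are hypothetical:

    # Sketch of a running word-count bolt; assumes streamparse is installed
    # and the topology feeds this bolt tuples whose first value is a word.
    from streamparse import Bolt

    class WordCountBolt(Bolt):
        def initialize(self, storm_conf, context):
            # Per-task state; Storm calls this once when the bolt starts.
            self.counts = {}

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] = self.counts.get(word, 0) + 1
            # Emit the running count downstream; streamparse acks by default.
            self.emit([word, self.counts[word]])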
Data operating system. Cluster resource management. 2013 - includes batch, interactive and realtime. At the core of the Hortonworks Data Platform (HDP) for data at rest. Centralized platform for: 1) operations - cluster management, whether one data lake or many clusters; 2) governance - data lifecycle management, modeling with metadata, lineage capability; 3) security - roles or data tags, encryption at rest and in motion, authentication. Includes data functions for: batch, machine learning, search, interactive, streaming - answer-YARN
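As a sketch of what centralized resource management means in practice: a framework such as Spark asks YARN's ResourceManager for containers instead of managing its own machines. This assumes PySpark is installed and HADOOP_CONF_DIR points at a configured cluster; the app name and executor count are placeholders:

    from pyspark.sql import SparkSession

    # Run Spark as a YARN workload; YARN allocates the executor containers.
    spark = (SparkSession.builder
             .appName("yarn-sketch")                    # placeholder name
             .master("yarn")
             .config("spark.executor.instances", "2")   # placeholder sizing
             .getOrCreate())
    print(spark.sparkContext.uiWebUrl)
    spark.stop()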
SQL:2011 for analytics - answer-Hive on YARN
Data at rest. Powered by Open Enterprise Hadoop. 1) Open - open source; 2) Central - YARN at core; 3) Interoperable - existing technology, skills; 4) Ready - enterprise-ready regarding operations, governance, security; development efforts include: 1) data management; 2) data access; 3) governance and integration; 4) operations; 5) security - answer-Hortonworks Data Platform (HDP)
An open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. Integrated component of HDP. Agile analytics using data science notebooks; includes geospatial, entity resolution; wide array of data sources; RDD sharing, HDFS memory tier. A newer approach than the SQL workloads handled by Hive. Data access engine for fast, large-scale data processing. Designed for iterative, in-memory computations and interactive data mining. APIs for Scala, Java, Python. Spark SQL, Spark Streaming, MLlib, GraphX - can run as a YARN workload - can run on a single data set in Hadoop. - answer-Apache Spark at Scale
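A minimal sketch of the iterative, in-memory style described above, assuming PySpark is installed; the HDFS path is a placeholder:

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")
    # Cache the data in memory once, then reuse it across multiple passes;
    # without cache(), each pass would re-read the file from HDFS.
    lines = sc.textFile("hdfs:///data/events.txt").cache()  # placeholder path
    for threshold in (10, 100, 1000):
        print(threshold, lines.filter(lambda l: len(l) > threshold).count())
    sc.stop()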
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel. - answer-Resilient
Distributed Dataset (RDD)
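To illustrate the abstraction, a small PySpark sketch: the collection is split across partitions, transformations return new (immutable) RDDs, and actions trigger the parallel computation:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-sketch")
    rdd = sc.parallelize(range(10), numSlices=4)   # 4 partitions
    squares = rdd.map(lambda x: x * x)             # transformation: a new RDD
    print(squares.reduce(lambda a, b: a + b))      # action: runs in parallel
    sc.stop()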
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system
written in Java for the Hadoop framework. A Hadoop cluster has nominally a single namenode
plus a cluster of datanodes, although redundancy options are available for the namenode due to
its criticality. Each datanode serves up blocks of data over the network using a block protocol
specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other. - answer-Hadoop Distributed File
System (HDFS)
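A client-side sketch of talking to the namenode and datanodes, assuming the pyarrow package with libhdfs available on the client; the namenode host is a placeholder:

    from pyarrow import fs

    # Connect to the namenode; host and port are placeholders.
    hdfs = fs.HadoopFileSystem(host="namenode-host", port=8020)

    # Writes are split into blocks that datanodes serve over the network.
    with hdfs.open_output_stream("/tmp/example.txt") as f:
        f.write(b"hello hdfs\n")

    with hdfs.open_input_stream("/tmp/example.txt") as f:
        print(f.read())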
SQL interface to Hadoop data. Most widely used SQL engine in Hadoop Community. Alternative
to Spark at Scale. Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis. While initially developed by Facebook,
Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a
software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web
Services. Enables transactions and SQL:2011. - answer-Apache Hive on YARN
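A minimal sketch of issuing HiveQL from a client, assuming the third-party PyHive package and a running HiveServer2 endpoint; the host and table names are placeholders:

    from pyhive import hive

    # Connect to HiveServer2; host, port, and table are placeholders.
    conn = hive.connect(host="hiveserver2-host", port=10000)
    cur = conn.cursor()
    cur.execute("SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip")
    for ip, hits in cur.fetchall():
        print(ip, hits)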
90 who innovate for Hortonworks; serve as customer advocates and contribute to roadmapping - answer-Apache Hadoop Committers
Includes: integrated customer portal, knowledge base, on-demand training, SmartSense
(machine learning and predictive analytics on customer cluster); proactively optimizes your
cluster - answer-Hortonworks customer support
Machine learning is a subfield of computer science that evolved from the study of pattern
recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel
defined machine learning as a "Field of study that gives computers the ability to learn without
being explicitly programmed". - answer-Machine learning
Includes: answers; knowledge base; code hub; sandbox; tutorials; events - answer-Hortonworks
Community Connection
Hortonworks' partner program - answer-Hortonworks Partnerworks
One node, mini-cluster HDP that runs in VM on laptop with tutorials and sample data sets -
answer-Hortonworks Sandbox
- answer-Hortonworks Blog