EECS 485 Final Exam with complete solution
Map Reduce System provides - 1
Automatic parallelization & distribution
Map Reduce System provides - 2
Fault-tolerance
Map Reduce System provides - 3
Status and monitoring tools
Map Reduce System provides - 4
Clean abstraction for programmers
Ma...
EECS 485 Final Exam with complete solution
Map Reduce System provides - 1
Automatic parallelization & distribution
Map Reduce System provides - 2
Fault-tolerance
Map Reduce System provides - 3
Status and monitoring tools
Map Reduce System provides - 4
Clean abstraction for programmers
Map Reduce is Popular because - 1
Easy to write distributed programs
Map Reduce is Popular because - 2
Built-in reliability on large clusters
Map Reduce is Popular because - 3
Byte streams, not relations
Map Reduce is Popular because - 4
Schema-later, or schema-never
Map Reduce is Popular because - 5
Your choice of programming languages
Map Reduce is Popular because - 6
Hadoop relatively easy to administer
Mappers can work...
Mappers can work independently
Reducers can work...
Reducers can work independently
Grouper does not have to...
Count
Map Reduce Optimization 1
the mappers don't need write the words down, just "1" for each word of a certain length
Map Reduce Optimization 2
mappers do part of the reducer work: number of 1-letter words, 2-letter words, etc
Master work
Dividing the work
Grouper work
Grouping the values by keys
Work of the mappers and reducers
differs with problem
MR Computation
Input key/value pairs
Output different key/value pairs
Map Function
Takes an input pair and produces a set of intermediate key/value pairs
MapReduce library
,Groups together all intermediate values associated with the same intermediate key and
passes them to the Reduce function
Reduce function
Accepts an intermediate key and a list of values for that key.
Merges together these values to form a possibly smaller set of values.
Map and reduce have related types
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v2)
MR output is smaller
computing summary statistics
MR output is larger
computing some kind of data structure for downstream use
Output value produced per reduce invocation
Zero or one
Execution Pipeline
Stage 1, partition input data and run map() on many machines
Then group intermediate data by intermediate key
Stage 2, partition intermediate data by key and run reduce() on many machines
Output is whatever reduce() emits
Shuffle/Sort
Default: hash(key)%R
- Break input into M chunks
- Process each chunk w/ map process
- Group-by map output keys
- Place key-groups into R chunks
- Process each chunk w/ reduce process
- reduce fn's outputs go to disk
What can be a MapReduce program?
- URL counting in logs
- Inverted index construction for search engines, Sorting
- Massive image conversion, others
How do we know if a machine goes down?
Heartbeat messages tell master which machines are online
,Map Worker Dies
- Just restart that task on a different box
- You lose the map() work, but no big deal
Reduce Worker Dies
- Restart the reducer, using output from source mappers
Relational database management system (RDBMS) machine dies
Query is restarted
What about slow, not dead, machines?
- Speculative execution for stragglers
- Kill the 2nd-place finisher
What about data placement?
Use GFS across cluster disks; start tasks where the target data already lies
Isn't the intermediate data size large?
- Use a local reducer called a Combiner at each map
- Compress data between map and reduce
Mapper and Reducer are...
Stateless - no side effects
Scalability and fault-tolerance achieved by optimizing the execution engine once
Use it many times by writing different map and reduce functions for different
applications
Map and reduce functions inspired by functions of the same name in...
Lisp programming language
Functional programming
Computation as the evaluation of mathematical functions
MR is easy to...
Parallelize
Simple Distributed System Failures
Power loss
Process crash
Complex Distributed System Failures
- Slow or unresponsive application
- Example: process crash
Fail-Stop, Asynchronous environment
, - Components may fail (stop working) and recover
- Components may take arbitrarily long to respond
- Others may not be able to tell when a component fails
- Examples: network partition, hung process
Byzantine
- Components may fail in arbitrary ways
- Examples: hacker compromises a component
Primary-Backup fault tolerance
- Both primary and backup should commit modifications before acknowledging
- Client will never lose data for operations it observes to complete
- Client is uncertain about operations in progress. Must recheck with new primary to see
if it completed.
Asynchronous environment 1
Network can delay messages arbitrarily
Asynchronous environment 2
Network can partition
Asynchronous environment 3
Servers can crash and take a long time to recover
Asynchronous environment 4
Servers can hang and stop responding
Asynchronous environment 5
Servers can run very slow and appear to be crashed
In async environment...
Hard to differentiate slow and failed components
Async means...
primary-backup not great in practice
Paxos
Goal: nodes all agree on one result
Tolerates fail-stop failures in synchronous environ.
Requires 2f+1 nodes to tolerate f faults (Primary-backup only needed f+1)
Example: 5 Paxos nodes can tolerate 2 such faults
Basic (synod) protocol Phase 1
- Leaders propose ballot #.
- Others (acceptors) adopt ballot if highest # so far.
- Leader waits for f+1 adoptions
The benefits of buying summaries with Stuvia:
Guaranteed quality through customer reviews
Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.
Quick and easy check-out
You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.
Focus on what matters
Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!
Frequently asked questions
What do I get when I buy this document?
You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.
Satisfaction guarantee: how does it work?
Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.
Who am I buying these notes from?
Stuvia is a marketplace, so you are not buying this document from us, but from seller katoinyambi96. Stuvia facilitates payment to the seller.
Will I be stuck with a subscription?
No, you only buy these notes for $15.99. You're not tied to anything after your purchase.