Big Data Summary


Notes for the Big Data course. It contains the slides and explanation of them.

Date: December 27, 2022 (2022/2023), 101 pages

Big Data
Lesson 1

Introduction

What is Big Data?

● No fixed definition, the concept changes over time

● Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes, Zettabytes, …

● In the past:

● Storage was expensive

● Only the most crucial data was preserved

● Most companies did no more than consult historical data, rather than analyse it



Storing the Data

● Recent trends:

● Storage is (relatively) cheap and easy

● Companies and governments preserve huge amounts of data

● There is a lot more data being generated

● Customer information, historical purchases, click logs, search histories, patient
histories, financial transactions, GPS trajectories, usage logs,
images/audio/video, sensor data, …

● More and more companies and governments rely on data analysis

● Recommender systems, next event prediction, fraud detection, predictive
maintenance, image recognition, COVID contact tracing, …



Making Data Useful

● However:

● Data analysis is computationally intensive and expensive

● Examples

● Online recommender systems: require instant results


● Frequent pattern mining: time complexity exponential in the number of
different items, independent of the number of transactions (e.g., market basket
analysis)

● Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample (e.g., Wikipedia tagging)

● Subspace clustering: exponential number of possible sets of dimensions in
which clusters could be found (e.g., customer segmentation)
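
To make the exponential growth in the examples above concrete, here is a minimal Python sketch. It counts the non-empty subsets of n elements, which is the size of the search space whether the elements are items (itemsets), labels (label combinations) or dimensions (subspaces). The function names and the tiny item list are illustrative assumptions, not part of the course material.

```python
from itertools import combinations


def count_nonempty_subsets(n: int) -> int:
    """Number of non-empty subsets of n distinct elements: 2^n - 1."""
    return 2 ** n - 1


def enumerate_itemsets(items):
    """Brute-force list of every non-empty itemset (only feasible for tiny n)."""
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


if __name__ == "__main__":
    # Hypothetical market-basket items, chosen only for illustration.
    items = ["bread", "milk", "eggs", "beer"]
    print(len(enumerate_itemsets(items)))  # 15 candidate itemsets for 4 items
    for n in (10, 20, 40):
        print(n, count_nonempty_subsets(n))
```

Note that the count depends only on the number of distinct elements, not on the number of transactions or samples, which matches the remark about frequent pattern mining.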

So what is Big Data?

● Dependent on the use case

● Data becomes Big Data when it becomes too large or too complex to be analyzed with
traditional data analysis software

● Analysis becomes too slow or too unreliable

● Systems become unresponsive

● Day-to-day business is impacted



Three aspects of Big Data

● Volume

● The actual quantity of data that is gathered

● Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/transaction, …

● Variety

● The different types of data that are gathered

● Some attributes may be numeric, others textual

● Structured vs. unstructured data

● Irregular timing

● Sensor data may arrive at regular time intervals, while the accompanying log data
is irregular

● The variety of the data increases the complexity of the analysis of the data

● Velocity

● The speed at which new data is coming in and the speed at which data must be handled

● May result in irrecoverable bottlenecks


What can we do about it?

● Invest in hardware

● Store more data

● Process the data faster

● Typically (sub)linearly faster – doesn’t help much if an algorithm has exponential
complexity

● Design intelligent algorithms to speed up the analysis

● Specifically make use of available hardware resources

● Provide good approximate results at a fraction of the cost/time

● Take longer to build a model that can then be used on-the-fly

● We focus on the latter: designing intelligent algorithms
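
A small worked example of why faster hardware alone does not help much with exponential complexity. The sketch assumes a hypothetical algorithm that needs exactly 2^n steps for n items. Under that assumption, a machine that is k times faster can only handle about log2(k) extra items in the same amount of time.

```python
import math


def extra_items_for_speedup(speedup: float) -> float:
    """Extra items an assumed 2^n-step algorithm can handle in the same
    wall-clock time when the hardware becomes `speedup` times faster.

    Solving 2^(n + x) = speedup * 2^n for x gives x = log2(speedup).
    """
    return math.log2(speedup)


if __name__ == "__main__":
    for k in (2, 10, 100, 1000):
        extra = extra_items_for_speedup(k)
        print(f"{k:>5}x faster hardware -> about {extra:.1f} extra items")
```

Even hardware that is 1000 times faster buys only about 10 additional items under this assumption, which is why the notes turn to smarter algorithms.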



Parallel computing

Goal: leveraging the full potential of your multicore, multiprocessor, multicomputer system

● If you have to process large amounts of data it would be a shame not to use all n cores of a
CPU.

● If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the
cloud to do some computations that take 10 hours, but then what?

Hardware has a lot of potential that algorithms don't always make use of.
Nowadays, even a single computer typically has multiple processors, and each processor has
multiple cores, so parallelization is already possible on a single machine.
Parallelization comes into play even more when you have multiple computers or use cloud
computing.
If you have to process large amounts of data and you have multiple cores, multiple processors
and multiple computers at your disposal, then you should make the most of them and parallelize
your work as much as possible.

The goal of parallel processing is to reduce computation time.




● Algorithms are typically designed to solve a problem in a serial fashion. To fully leverage the
power of your multicore CPU you need to adapt your algorithm: split your problem into smaller
parts that can be executed in parallel

● We can’t always expect to parallelize every part of the algorithm. However, in some cases it is
almost trivial to split the entire problem into smaller parts that can run in parallel; such
problems are called embarrassingly parallel

● In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time

Parallel processing can’t reduce the theoretical complexity of an algorithm, but it
can still cut down on runtimes.
How do we do this? We try to split the problem (the computation, the code) into smaller parts
that can be executed independently in parallel. The key word is independently: if one part
depends on another (one process has to wait for the output of the previous
process), then they can’t run in parallel. So it is not always possible to parallelize your
processes, but many processes are easily parallelized.
If you can fully parallelize the work, then the ultimate goal is linear speedup, which means that
the speedup achieved is proportional to the number of processes that you
are running. E.g., if you are running 10 processes, then your total run time should be 10 times
shorter than the original.
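
A minimal sketch of an embarrassingly parallel computation in Python, using the standard-library `concurrent.futures.ProcessPoolExecutor`. The task (a sum of squares), the helper names and the choice of 4 workers are illustrative assumptions. The point is only that each chunk can be processed without waiting for any other chunk.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous parts of nearly equal size."""
    size, remainder = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Independent unit of work: needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers):
    """Run sum_of_squares on each chunk in its own process, then combine."""
    chunks = split_into_chunks(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(100_000))
    print(parallel_sum_of_squares(data, n_workers=4))
```

In practice the measured speedup is usually somewhat below linear, because starting processes and copying the chunks to them takes time as well.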



Parallel computation
● Instruction level parallelism (pipelining, out-of-order execution) is completely transparent to
the user

● Task parallelism: multiple tasks are applied on the same data in parallel

● Data parallelism: a calculation is performed in parallel on many different data chunks




Two main types of parallelization:
- In task parallelism you run multiple tasks on the same data in parallel
- In data parallelism you split the data and run the same task on different
chunks (parts) of the data
Regardless of the type of parallelism, the goal is to shorten the run time. In an
ideal world you get linear speedup.
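
The difference between the two types can be sketched in a few lines of Python. The example uses threads from the standard library only to keep the sketch short; the chosen tasks (min, max, sum) and the function names are illustrative assumptions. For CPU-bound work in CPython, processes rather than threads are generally needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def task_parallel(data):
    """Task parallelism: different tasks (min, max, sum) run on the SAME data."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(func, data) for name, func in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


def data_parallel(data, n_chunks):
    """Data parallelism: the SAME task (sum) runs on different chunks.

    Assumes data is non-empty.
    """
    chunk_size = -(-len(data) // n_chunks)  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1, 101))
    print(task_parallel(data))
    print(data_parallel(data, n_chunks=4))
```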
