
Summary Data Engineering


This Data Engineering summary contains the course material with extra notes in grey, and includes my answers to the example exam and to the example questions given during the course. It also contains questions from the exam itself. This document is very handy for learning in a structured way (highly st...


Preview: 4 of 190 pages

  • 21 May 2020
  • 15 June 2020
  • 190 pages
  • 2019/2020
  • Summary
julievantroyen
Data Engineering 2019-2020
Table of contents – Data Engineering 2019-2020

Course 1
1.1 Intro
1.1.A Defining data engineering
1.1.B Course topics
1.1.C Class format, lab sessions, exam and project
1.2 Basic computer architecture and operating systems
1.2.A Basic computer architecture
1.2.B Operating system (OS) level
1.3 File formats
1.3.A Human-readable file formats
1.3.A.1 CSV
1.3.A.2 XML
1.3.A.3 JSON
1.3.B Not human-readable and compressed file formats
1.4 Python concepts

Course 2
2.1 Basic computer architecture and operating systems (OS)
2.2 Intro to computer networks
2.2.A Important network applications: Web – HTTP
2.2.B Important network applications: DNS
2.2.C Lab sessions
2.3 Regular expressions (regex)
2.3.A Definition and general application
2.3.B Regular expressions in Python
2.3.C Gone wrong
2.3.D Concluding remarks
Summary

Course 3
3.1 Basic Linux
3.1.A Linux
3.1.B Linux command line instructions (file manipulation)
3.1.C JQ
3.2 Cloud services
3.2.A Defining cloud services
3.2.B Core AWS services
3.2.C Storage infrastructure
3.2.D Database services
3.2.E Cloud architecture example
Summary




Course 4
4.1 Algorithms and complexity
4.1.A Sorting
4.2 Basic data structures
4.2.A Collections or containers
A.1 List
A.2 Set
A.3 Map
4.2.B Trees
4.2.C Hash tables
Summary

Course 5
Databases
5.1 Data, data, data
5.2 Evolution of databases
5.3 Relational databases
5.4 Types of databases
5.4.A Type 1: production database
5.4.B Type 2: analytical database
5.5 NoSQL data stores
5.6 Big Data

Course 6 & 7
6. Parallel and distributed computing
6.1 Parallel computing
6.1.A Communication patterns
6.1.B Examples
6.1.C Analysis of speedup
6.1.D Dependencies
6.2 Distributed computing
6.3 Use cases
7. Map reduce
7.1 Map reduce
7.2 Map-Reduce example
7.3 SQL operations
7.4 Hadoop
7.5 Shuffling
7.6 Matrix operations
7.7 Summary
7.8 Spark
7.9 The debit example on Spark
7.10 Indexing web pages using Spark
7.11 Spark functions
7.11 Use cases

Course 8 & 9: Gdelt project




Course 10
10. Web APIs
10.1 REST API
10.2 Designing a REST API
10.3 Demo
10.4 API access
10.5 Microservices
10.6 Summary

Course 11: closing remarks
11.1 Choose your technology stack
11.2 Streaming
11.3 Sampling
11.4 Filtering
11.5 Streaming technology
11.6 Data warehouses
11.7 Unstructured data
11.8 Web APIs

Example exam

Quick review of course 1-10

Gdelt project




COURSE 1

1.1 INTRO

1.1.A DEFINING DATA ENGINEERING
Defining a data engineer by differentiating the role from that of a data scientist
A data scientist's principal role is to find value or discover new opportunities in the company's data, or to fulfill business needs using that data. The data scientist/analyst uses the company's tools and infrastructure together with his/her knowledge of basic mathematics, machine learning and statistics.

The role of the data engineer is to provide the data scientist with the software infrastructure for fetching and processing the data, so that the data scientist can easily explore the data and gain insight into it. He/she is also responsible for deploying new models and applications, typically making use of a workflow management platform.

Extract/Transform/Load (ETL)
Besides supporting data science, the data engineer is more generally responsible for the processing of data.

The data engineer is responsible for implementing the interfaces that are necessary for managing the data flow and keeping the data available for analysis.

The data architect is usually the person responsible for the design of the whole system.

Typically there are many different data sources within the company. To enable data scientists to gain insight into that data and generate value, all that data should be accessible in a central repository in some uniform format.

Figure: data from several data sources is extracted, transformed and loaded into a data warehouse.

The data pipeline
The data pipeline is the set of processes that automatically extract data from different sources, transform it into some uniform format and store it in a central place.
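
The three stages of such a pipeline can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two inline sources, their field names, the `payments` table and the in-memory SQLite database (standing in for the central repository) are all hypothetical and do not come from the course material.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source 1: a CSV export (inline so the sketch is self-contained).
CSV_SOURCE = "id,name,amount\n1,Alice,10.50\n2,Bob,7.25\n"

# Hypothetical source 2: a JSON feed that uses different field names.
JSON_SOURCE = '[{"customer_id": 3, "full_name": "Carol", "total": "3.00"}]'


def extract_csv(text):
    """Extract: read the CSV source as a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read the JSON source as a list of dictionaries."""
    return json.loads(text)


def transform(csv_rows, json_rows):
    """Transform: map both sources onto one uniform (id, name, amount) format."""
    uniform = [(int(r["id"]), r["name"], float(r["amount"])) for r in csv_rows]
    uniform += [
        (int(r["customer_id"]), r["full_name"], float(r["total"]))
        for r in json_rows
    ]
    return uniform


def load(rows, connection):
    """Load: store the uniform rows in the central repository."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER, name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()


def run_pipeline(connection):
    """Run extract, transform and load in order and return the loaded rows."""
    rows = transform(extract_csv(CSV_SOURCE), extract_json(JSON_SOURCE))
    load(rows, connection)
    return rows


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    run_pipeline(warehouse)
    print(warehouse.execute("SELECT * FROM payments ORDER BY id").fetchall())
```

Keeping extract, transform and load as separate functions means a new source only needs its own extract function plus a mapping in the transform step; the load step stays unchanged.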

The data pipeline can also contain production models made by data scientists. Depending on the requirements, these models have to run in real time, once per hour, once per day, ...
Data engineers need to maintain this data flow and ensure its availability and quality:
● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
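
As an illustration of the "monitor, log and solve errors" and "handle duplicate, incorrect or corrupted data" tasks in the list above, the following sketch cleans a batch of records. The record layout (`id`, `amount`) and the rule that a negative amount counts as incorrect are assumptions made for this example only.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def clean(records):
    """Drop duplicate, incorrect and corrupted records; log each rejection."""
    seen_ids = set()
    valid = []
    for record in records:
        try:
            record_id = int(record["id"])
            amount = float(record["amount"])
        except (KeyError, TypeError, ValueError) as error:
            # Corrupted: a field is missing or cannot be parsed.
            log.warning("corrupted record %r: %r", record, error)
            continue
        if amount < 0:
            # Incorrect: parses fine but violates the (assumed) business rule.
            log.warning("incorrect record %r: negative amount", record)
            continue
        if record_id in seen_ids:
            # Duplicate: this id was already accepted earlier in the batch.
            log.info("duplicate record skipped: id=%s", record_id)
            continue
        seen_ids.add(record_id)
        valid.append({"id": record_id, "amount": amount})
    return valid


if __name__ == "__main__":
    raw = [
        {"id": "1", "amount": "10.5"},
        {"id": "1", "amount": "10.5"},  # duplicate
        {"id": "2", "amount": "-4"},    # incorrect
        {"id": "3", "amount": "abc"},   # corrupted value
        {"id": "4"},                    # corrupted, missing field
        {"id": "5", "amount": "2"},
    ]
    print(clean(raw))
```

Rejected records are logged rather than silently dropped, so the data engineer can monitor how much data is lost at this step and trace errors back to their source.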

Workflow Management Platform
The image shows how we manage this data. We split up the data in parts, and each split is a step. You don't have to do every step yourself (no need to reinvent the wheel every time).




Figure: DAG configuration and monitoring @PrediCube
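
The idea of a workflow expressed as a DAG (directed acyclic graph) of steps can be sketched with Python's standard library. This is not the platform shown in the figure; the step names and functions are hypothetical, and a real workflow management platform would add scheduling, retries and monitoring on top of this ordering logic.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    """Hypothetical first step: produce some raw data."""
    return [3, 1, 2]


def transform(data):
    """Hypothetical second step: bring the data into a uniform order."""
    return sorted(data)


def load(data):
    """Hypothetical last step: pretend to store the data."""
    return f"stored {len(data)} rows"


# Each step maps to the set of steps it depends on.
DAG = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

STEPS = {"extract": extract, "transform": transform, "load": load}


def run(dag, steps):
    """Run every step after its dependencies, passing their results along."""
    results = {}
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        inputs = [results[dep] for dep in sorted(dag[name])]
        results[name] = steps[name](*inputs)
    return order, results


if __name__ == "__main__":
    order, results = run(DAG, STEPS)
    print(order)
    print(results["load"])
```

Because the dependencies are declared as data, a step can be added or replaced without touching the code that decides the execution order.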
