Summary Big data management & Analytics. Grade: 8.8

Summary of the course BDMA. Grade achieved: 8.8

Preview of 4 out of 84 pages (7 December 2020, 2019/2020).

Summary Big Data Management and Analytics
Book Data Science
Chapter 1
Data science involves principles, processes, and techniques for understanding phenomena via the
(automated) analysis of data. The ultimate goal is improving decision making.

Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data,
rather than purely on intuition. There are two sorts of decisions:

(1) Decisions for which “discoveries” need to be made within data
(2) Decisions that repeat, especially at massive scale, and so decision-making can benefit from
even small increases in decision-making accuracy based on data analysis.
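
The second kind of decision can be illustrated with simple arithmetic. The sketch below uses entirely hypothetical figures (the number of decisions, the value of a correct decision, and both accuracy levels are assumptions) to show how a one-percentage-point improvement adds up when a decision repeats at scale.

```python
# Hypothetical figures, chosen only for illustration.
decisions_per_year = 10_000_000   # e.g. offers targeted at customers
value_per_correct = 0.50          # assumed value (in euros) of one correct decision


def yearly_value(accuracy: float) -> float:
    """Expected yearly value when a fraction `accuracy` of decisions is correct."""
    return decisions_per_year * accuracy * value_per_correct


baseline = yearly_value(0.70)
improved = yearly_value(0.71)     # one percentage point better
gain = improved - baseline
print(f"Extra value per year: {gain:,.0f} euros")
```

With these assumed numbers the single percentage point is worth tens of thousands of euros per year, which is why repeated decisions reward even small gains in accuracy.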




There is a lot to data processing that is not data science, despite the impression one might get from
the media. Data engineering and processing are critical to support data science, but they are more
general.

  • Data science needs access to data and it often benefits from sophisticated data engineering
    that data processing technologies may facilitate, but these technologies are not data science
    technologies per se.
  • Data processing technologies are very important for many data-oriented business tasks that
    do not involve extracting knowledge or data-driven decision-making, such as efficient
    transaction processing, modern web system processing, and online advertising campaign
    management.

Big data essentially means datasets that are too large for traditional data processing systems, and
that therefore require new processing technologies. Big data technologies are used for:

  • Data engineering
  • Data mining
  • But, most often: data processing in support of data mining techniques and other data science
    activities

A fundamental strategy of data science is to acquire the necessary data, even at a cost. Once we view
data as a business asset, we should think about whether and how much we are willing to invest in it.

Four fundamental concepts of data science:

1. Extracting useful knowledge from data to solve business problems can be treated
systematically by following a process with reasonably well-defined stages.
2. From a large mass of data, information technology can be used to find informative
descriptive attributes of entities of interest.
3. If you look too hard at a set of data, you will find something, but it might not generalize
beyond the data you’re looking at (overfitting).
4. Formulating data mining solutions and evaluating the results involves thinking carefully
about the context in which they will be used.

Chapter 2
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised
versus unsupervised data mining.

An important principle of data science is that data mining is a process with fairly well-understood
stages.

Examples of data mining algorithm tasks:

1. Classification and class probability estimation attempt to predict, for each individual in a
population, which of a (small) set of classes this individual belongs to (e.g. “Among all the
customers of MegaTelCo, which are likely to respond to a given offer?”). In this example the
two classes could be called “will respond” and “will not respond”.
   a. A closely related task is scoring, or class probability estimation. A scoring model
   applied to an individual produces, instead of a class prediction, a score representing
   the probability that the individual belongs to each class.
2. Regression (“value estimation”) attempts to estimate or predict, for each individual, the
numerical value of some variable for that individual. An example regression question would
be: “How much will a given customer use the service?”
   a. Regression is related to classification, but the two are different. Informally,
   classification predicts whether something will happen, whereas regression predicts
   how much something will happen.
3. Similarity matching attempts to identify similar individuals based on data known about
them. Similarity matching can be used directly to find similar entities. For example, IBM is
interested in finding companies similar to their best business customers, in order to focus
their sales force on the best opportunities.
4. Clustering attempts to group individuals in a population together by their similarity, but not
driven by any specific purpose. An example clustering question would be: “Do our customers
form natural groups or segments?”
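
To make the difference between these outputs concrete, the sketch below shows what scoring, classification, regression, and similarity matching each return for a toy customer table. The customers, the two features (minutes used, number of complaints), and the hand-set weights are all assumptions made for illustration; nothing is actually learned from data here.

```python
import math

# Toy customers: (monthly minutes used, number of complaints). Hypothetical data.
customers = {
    "ann": (120.0, 0),
    "bob": (480.0, 3),
    "cat": (450.0, 2),
    "dan": (100.0, 1),
}


def score_response(minutes: float, complaints: int) -> float:
    """Scoring: estimated probability that the customer responds to an offer.
    The weights are made up, not fitted to data."""
    z = 0.01 * minutes - 1.0 * complaints - 2.0
    return 1.0 / (1.0 + math.exp(-z))


def classify_response(minutes: float, complaints: int, threshold: float = 0.5) -> str:
    """Classification: turn the score into one of two classes."""
    if score_response(minutes, complaints) >= threshold:
        return "will respond"
    return "will not respond"


def estimate_usage(minutes_last_month: float) -> float:
    """Regression: predict a numeric value (next month's minutes).
    A made-up linear rule stands in for a fitted model."""
    return 0.9 * minutes_last_month + 20.0


def most_similar(name: str) -> str:
    """Similarity matching: nearest other customer by Euclidean distance."""
    target = customers[name]
    others = (n for n in customers if n != name)
    return min(others, key=lambda n: math.dist(customers[n], target))


for name, features in customers.items():
    print(name, round(score_response(*features), 2), classify_response(*features))
print("most similar to bob:", most_similar("bob"))
```

Note that the distance is computed on raw features, so minutes dominate complaints; a real application would usually rescale the attributes first.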


Supervised versus unsupervised methods: A vital part in the early stages of the data mining process
is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to
produce a precise definition of a target variable.

  • Consider two similar questions we might ask about a customer population. The first is: “Do
    our customers naturally fall into different groups?” Here no specific purpose or target has
    been specified for the grouping. When there is no such target, the data mining problem is
    referred to as unsupervised.
      o Example: clustering.
  • Contrast this with a slightly different question: “Can we find groups of customers who have
    particularly high likelihoods of canceling their service soon after their contracts expire?” Here
    there is a specific target defined: will a customer leave when her contract expires? In this
    case, segmentation is being done for a specific reason. This is called a supervised data mining
    problem.
      o Examples: classification and regression.
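
The contrast can be sketched on one toy dataset. In the unsupervised half, customers are grouped purely by similarity of usage with a small one-dimensional k-means (a technique picked here for illustration; the notes only name clustering in general). In the supervised half, a target `churned` is defined and segments are compared on it. All records and field names are hypothetical.

```python
from statistics import mean

# Hypothetical customers: monthly usage in minutes, contract type, churn flag.
records = [
    {"usage": 50, "contract": "monthly", "churned": True},
    {"usage": 60, "contract": "monthly", "churned": True},
    {"usage": 70, "contract": "yearly", "churned": False},
    {"usage": 400, "contract": "yearly", "churned": False},
    {"usage": 420, "contract": "monthly", "churned": True},
    {"usage": 450, "contract": "yearly", "churned": False},
]


def kmeans_1d(values, k=2, iterations=10):
    """Unsupervised: group values by similarity only; no target is used.
    Requires k >= 2."""
    ordered = sorted(values)
    # Spread the initial centers evenly over the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups


def churn_rate_by(attribute):
    """Supervised: segment by an attribute and compare the target across segments."""
    rates = {}
    for value in {r[attribute] for r in records}:
        segment = [r["churned"] for r in records if r[attribute] == value]
        rates[value] = sum(segment) / len(segment)
    return rates


clusters = kmeans_1d([r["usage"] for r in records])
rates = churn_rate_by("contract")
print("usage clusters:", clusters)
print("churn rate by contract:", rates)
```

The clusters describe how customers group by usage without saying anything about churn, whereas the second function only makes sense once the target has been defined.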

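Cross Industry Standard Process for Data Mining (CRISP-DM)

The CRISP-DM process diagram shows six stages (business understanding, data understanding, data
preparation, modeling, evaluation, and deployment) and makes explicit the fact that iteration is the
rule rather than the exception. Going through the process once without having solved the problem is,
generally speaking, not a failure.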

Business Understanding

Initially, it is vital to understand the problem to be solved. This may seem obvious, but business
projects seldom come pre-packaged as clear and unambiguous data mining problems. Often
recasting the problem and designing a solution is an iterative process of discovery. The process
model represents this as cycles within a cycle, rather than as a simple linear process. The initial
formulation may not be complete or optimal so multiple iterations may be necessary for an
acceptable solution formulation to appear. In this first stage, the design team should think carefully
about the use scenario – What exactly do we want to do?

Data Understanding

If solving the business problem is the goal, the data comprise the available raw material from which
the solution will be built. It is important to understand the strengths and limitations of the data
because rarely is there an exact match with the problem. A critical part of the data understanding
phase is estimating the costs and benefits of each data source and deciding whether further
investment is merited. In data understanding we need to dig beneath the surface to uncover the
structure of the business problem and the data that are available, and then match them to one or
more data mining tasks.

Data Preparation

The data preparation phase often proceeds along with data understanding. In this phase the data are
manipulated and converted into forms that yield better results. Typical examples of data preparation
are converting data to tabular format, removing or inferring missing values, and converting data to
different types.
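
The three typical preparation steps can be sketched with the standard library alone. The raw records, the field names, and the choice of mean imputation are assumptions made for this example; other ways of inferring missing values are equally possible.

```python
from statistics import mean

# Hypothetical raw records: everything is a string, and one age is missing.
raw = [
    {"name": "ann", "age": "34", "plan": "basic"},
    {"name": "bob", "age": "", "plan": "premium"},
    {"name": "cat", "age": "46", "plan": "basic"},
]

# 1. Convert types: age strings become integers, empty strings become None.
ages = [int(r["age"]) if r["age"] else None for r in raw]

# 2. Infer missing values: replace None with the mean of the known ages.
known = [a for a in ages if a is not None]
fill = mean(known)
ages = [a if a is not None else fill for a in ages]

# 3. Convert to tabular format: one row per customer, one column per attribute,
#    with the categorical plan encoded as a 0/1 flag.
header = ["name", "age", "is_premium"]
table = [[r["name"], age, int(r["plan"] == "premium")] for r, age in zip(raw, ages)]

print(header)
for row in table:
    print(row)
```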

Modeling

The output of modeling is some sort of model or pattern capturing regularities in the data. The
modeling stage is the primary place where data mining techniques are applied to the data.

Evaluation

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain
confidence that they are valid and reliable before moving on. Equally important, the evaluation stage
also serves to help ensure that the model satisfies the original business goals. Recall that the primary
goal of data science for business is to support decision making.

A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual
business context may reveal that it still produces too many false alarms to be economically feasible.
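
A short calculation shows how this can happen when the event of interest is rare. The figures below are hypothetical (one million cases, 0.1% of them true positives, and a model that is right 99% of the time on both classes); they are not taken from the course material.

```python
# Hypothetical population, chosen only for illustration.
population = 1_000_000
positives = 1_000                      # 0.1% of cases are real events
negatives = population - positives

# Assume the model is correct on 99% of positives and 99% of negatives.
true_alarms = positives * 99 // 100    # 990 real events flagged
false_alarms = negatives * 1 // 100    # 9,990 harmless cases flagged
true_negatives = negatives - false_alarms

accuracy = (true_alarms + true_negatives) / population
precision = true_alarms / (true_alarms + false_alarms)

print(f"accuracy:  {accuracy:.2%}")
print(f"alarms:    {true_alarms + false_alarms}")
print(f"precision: {precision:.2%}")
```

Under these assumptions the model is 99% accurate, yet roughly nine out of ten alarms are false, so the cost of investigating each alarm decides whether deployment is economically feasible.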

Deployment

In deployment the results of data mining—and increasingly the data mining techniques themselves—
are put into real use in order to realize some return on investment. The clearest cases of deployment
involve implementing a predictive model in some information system or business process.

The main difference between data mining and other analytics techniques is that data mining focuses
on the automated search for knowledge, patterns, or regularities from data.

Chapter 3
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute
selection.

Supervised segmentation: how can we segment the population into groups that differ from each
other with respect to some quantity of interest?

  • One of the fundamental ideas of data mining: finding or selecting important, informative
    variables or “attributes” of the entities described by the data.
      o Information is a quantity that reduces uncertainty about something.
  • Finding informative attributes is also the basis for a widely used predictive modeling
    technique called tree induction. Tree induction incorporates the idea of supervised
    segmentation in an elegant manner, repeatedly selecting informative attributes.
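
One common way to make "informative" measurable, assumed here rather than stated in the notes above, is entropy-based information gain: an attribute is informative to the extent that splitting on it reduces the uncertainty (entropy) of the target. The sketch below uses a hypothetical four-row dataset and selects the attribute with the highest gain, which is the basic step that tree induction repeats.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting `rows` on `attribute`."""
    parent = entropy([r[target] for r in rows])
    children = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        children += len(subset) / len(rows) * entropy(subset)
    return parent - children


# Hypothetical customers; "churn" is the target variable.
rows = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "yearly", "region": "north", "churn": "no"},
    {"contract": "yearly", "region": "south", "churn": "no"},
]

gains = {a: information_gain(rows, a, "churn") for a in ("contract", "region")}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

In this toy data the contract type separates churners perfectly while the region tells us nothing, so a tree would split on contract first.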

Supervised data mining can be divided into classification and regression.

Supervised learning is model creation where the model describes a relationship between a set of
selected variables (attributes or features) and a predefined variable called the target variable. The
model estimates the value of the target variable as a function (possibly a probabilistic function) of
the features.
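
As a minimal sketch of a model that "estimates the value of the target variable as a function of the features", the example below fits a straight line by ordinary least squares to a single feature. The technique, the data, and the variable names (`months`, `spend`) are assumptions chosen for brevity; the notes do not prescribe a particular model.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature.
    Returns the fitted model as a function of the feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return lambda x: a + b * x


# Hypothetical training data: feature = months as a customer,
# target = monthly spend in euros.
months = [1, 2, 3, 4]
spend = [12.0, 14.0, 16.0, 18.0]

model = fit_line(months, spend)
print(model(5))  # estimate of the target for an unseen feature value
```

The returned function is the model: it maps a feature value to an estimate of the target, learned from examples where the target was known.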

