Document made by: Bob Lokhorst
1 Lecture 1: Introduction to Data Analytics
1.1 Introduction and Motivation – The basic information for Data Analytics
1.2 Managerial Decision Making
1.3 Decision Support System
1.4 Business Intelligence (BI)
1.5 Business Analytics and Big Data
1.6 Data Science
2 Lecture 2: Data warehousing and Visual Analytics
2.1 Database System
2.2 Data warehousing
2.3 Data warehouse Architectures
2.4 Getting Access to Data
2.5 Online Analytical Processing
2.6 Data Warehousing and Big Data
3 Lecture 3: Database Concepts and Data Modelling
3.1 Database concepts
3.2 Database Components
3.3 Data Modelling
3.4 Relationships
3.5 Additional ER Modelling Aspects (Voluntary, may increase knowledge)
3.6 Databases and Big Data
4 Lecture 4: Data retrieval
4.1 ERD Transformation
4.2 SQL Overview
4.3 Basic SQL Commands
4.4 Executing SQL Statements
4.5 Sub-Queries and Set Operators
5 Lecture 5: overview of data mining
5.1 Overview of Data Mining
5.2 Statistics and Data Mining
5.3 Classification Methods
5.4 Quality of Classification methods
5.5 Decision Trees
5.6 Cluster Analysis
5.7 An Example of a Clustering Algorithm
5.8 Association Rule Mining, Software, and Concluding Remarks
6 Lecture 6: Process Mining
6.1 Business Process Modelling
6.2 Process Mining Basics
6.3 Audit Standards and Novel Audit Data Analytics
6.4 Examples of Process Mining
6.5 Limitations for Using Process Mining
6.6 Outlook to Deep Data Analytics (Voluntary)
7 Lecture 7: Text mining
7.1 Text Mining Basics
7.2 Text Mining Core Concepts
7.3 Natural Language Processing
7.4 The Text Mining Process
7.5 Sentiment Analysis (Voluntary)
8 Workshops A, B & C
8.1 Workshop A, Data visualization and modeling – Microstrategy
8.2 Workshop B – Data retrieval and Mining
8.3 Workshop C – Process Mining and Text Mining
9 Something extra related to the test exam
1 Lecture 1: Introduction to Data Analytics
1.1 Introduction and Motivation – The basic information for Data Analytics
1.1.1 New Technologies Affect Accounting
Why do we need data analytics and professional skills? Since the 20th century, the accounting
profession has been shifting. Many tasks that accountants used to perform are no longer done by them
but are automated. Some of the technologies behind this shift were developed in
digital storage and databases.
70 years ago, bookkeeping meant recording transactions in books. That is all gone; everything is digital now. ERP
systems support and automate the processing of a company's data. The data we extract from these ERP systems is
the basis for our reporting: we get everything from the system.
In the 21st century, this has developed even further. Everything is now stored in cloud services.
These cloud services run big data centers filled with servers, on which
everything is stored. Before cloud services existed, companies relied on a local IT department that
had its own server for storing data. (So internal storage with an IT department moved to cloud-based
services.)
Furthermore, we currently have robotic process
automation, where robots automatically
process digital documents in the system. The
underlying artificial intelligence (AI) is a self-learning
mechanism: the system keeps feeding itself with
information and keeps learning daily.
The picture on the right shows an overview of the
characteristics of the 20th century and the 21st century
related to data and data storage.
1.1.2 Developments in the Profession
So, how does all of this impact the accounting profession?
Business intelligence (BI), Big Data analytics, and artificial intelligence are used extensively by firms,
especially for internal reporting and decision making. Business intelligence and data warehouses
form the foundation of today's corporate reporting.
Audit firms have also started using data analytics to support and
automate auditing (e.g. PwC Halo, EY Helix, KPMG Clara).
1.1.3 The impact of Technological Innovations
Furthermore, we can look at the impact of technological innovations. Cloud-based services give
all kinds of firms access to digital capabilities that were previously only accessible to large
companies.
(1) Virtually all data is digital and accessible.
(2) Software takes over the task of processing and recording business transactions as well as traditional
bookkeeping activities.
1.1.4 The Changing Role of the Accountant
Looking at the first four points, tasks like “identifying”,
“measuring”, “recording”, and “communicating” are
already done by software.
So, we see that the traditional role of accounting has been almost
fully taken over by digital tools and services. This means
that there is a shift in the accounting profession.
The role of accountants will probably become
more that of a business advisor. This is not a bad
thing, as you can then focus on the more important work
rather than the boring, repetitive tasks.
1.1.5 Why do we need this course?
So, coming back to the question from the beginning of the document: why do we need this course?
The reason is the following:
- Big Data (large, fast-accumulating, heterogeneous data) is causing information overload among
decision-makers. So, someone must be able to make sense of the data by using suitable techniques capable of
dealing with Big Data via advanced data analytics. But there is a problem: there is a severe
shortage of professionals with sufficient data analysis skills, especially in the accounting profession.
➢ Data scientist might be one of the most important jobs of the next decade(s).
1.2 Managerial Decision Making
1.2.1 Information for Managerial Decision Making
How are management and decision-making related to each other?
- Management is a process by which organizational goals are achieved by using resources.
- Decision-making is selecting the best solution from two or more alternatives.
➢ To select the best solution, management requires sufficient information.
1.2.2 Decision-Making Process
In business, managers usually make decisions by following a four-step process:
Intelligence: Define the problem (or opportunity).
Design: Construct a model that describes the real-world problem, defines evaluation criteria,
and searches for alternative solutions.
Choice: Compare, choose, and recommend a potential solution to the problem.
Implementation: Implement the chosen solution.
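The Design and Choice steps can be illustrated with a minimal sketch. It assumes a simple weighted-scoring model, which is only one of many possible ways to compare alternatives; the site names, criteria, scores, and weights below are all hypothetical and not taken from the lecture.

```python
# Hypothetical alternatives for a warehouse location decision, scored 1-10
# on each evaluation criterion (all numbers are made up for illustration).
alternatives = {
    "Site A": {"cost": 6, "accessibility": 9, "capacity": 7},
    "Site B": {"cost": 8, "accessibility": 6, "capacity": 6},
    "Site C": {"cost": 9, "accessibility": 5, "capacity": 8},
}

# Design step: the evaluation criteria and their assumed weights (sum to 1).
weights = {"cost": 0.5, "accessibility": 0.3, "capacity": 0.2}


def weighted_score(scores, weights):
    """Return the weighted sum of the criterion scores."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


# Choice step: compare the alternatives and recommend the best one.
totals = {name: weighted_score(scores, weights) for name, scores in alternatives.items()}
best = max(totals, key=totals.get)

for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {total:.2f}")
print("Recommended:", best)
```

With these assumed numbers, Site C gets the highest total score. In practice, the hard part is the Design step: deciding which criteria and weights describe the real-world problem well enough.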
1.2.3 Decision-Making Process (Simon, 1977)
So, the key steps in managerial decision-making are
intelligence, design, choice, and implementation. The graph
on the right shows these steps in more detail.
The final step is the implementation of the
solution, which either succeeds or fails. When it
fails, you need to go back through the steps to determine
where something went wrong.
Under “Choice” we see the term “sensitivity analysis”,
which means varying some important parameters to see whether
the outcome would still be the same.
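Sensitivity analysis can be sketched in a few lines of code. The example below is a minimal illustration, assuming a simple profit model and two hypothetical machines; the names, prices, and costs are invented. It varies one uncertain parameter (expected sales volume) and checks whether the chosen alternative stays the same.

```python
def profit(units_sold, price, variable_cost, fixed_cost):
    """Simple profit model: contribution margin times volume minus fixed cost."""
    return units_sold * (price - variable_cost) - fixed_cost


# Two hypothetical alternatives (all numbers are invented for illustration).
options = {
    "Machine A": {"variable_cost": 6.0, "fixed_cost": 20_000},
    "Machine B": {"variable_cost": 4.0, "fixed_cost": 45_000},
}
price = 10.0


def best_option(units_sold):
    """Return the option with the highest profit for a given sales volume."""
    return max(options, key=lambda name: profit(units_sold, price, **options[name]))


# Sensitivity analysis: vary the uncertain parameter (expected sales volume)
# and check whether the chosen solution stays the same.
base_units = 15_000
results = {}
for change in (-0.2, -0.1, 0.0, 0.1, 0.2):
    units = round(base_units * (1 + change))
    results[change] = best_option(units)
    print(f"{change:+.0%} sales -> {units} units -> choose {results[change]}")

is_robust = len(set(results.values())) == 1
print("Choice is robust to sales volume:", is_robust)
```

With these assumed numbers, Machine B is the best choice in the base case, but Machine A becomes better when sales drop by 20%. The choice is therefore sensitive to the sales estimate, which is exactly what this kind of analysis is meant to reveal.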
1.2.4 Models
Decision-making processes involve at least one model. A model is a simplified
representation or abstraction of reality. Modeling is a combination of art and science.
But what are the benefits of these models? Why do we actually use them?
1. Manipulating a model is much easier than manipulating a real system.
2. Simulation is easier and does not interfere with the organization’s daily operations.
3. Compression of time: years of operations can be simulated in minutes or seconds.
4. The cost is much lower than experiments conducted on a real system.
5. The consequences of making mistakes are less severe.
6. Mathematical models enable the analysis of a very large number of possible solutions.
7. Models enhance and reinforce learning and training.
8. Models and solution methods are readily available.
1.2.5 Decision Support Framework (Gorry and Scott-Morton, 1971)
On the left, we see that there are different types of
control and different degrees of structure. An example of a rather
structured problem is deciding the location of
a warehouse. Using this support framework, we can
distinguish between top-management and
lower-management decision-making.
- The MIS domain is mainly related to structured decisions.
MIS stands for management information systems.
MIS mainly refers to the computerized
databases and reporting processes that managers use
to measure the effectiveness of their organization,
departments, teams, etc. (from ERP systems, for
example).
- Decision support systems (DSS domain) can be used
to support unstructured and semi-structured decisions.
For example, evaluating the credit rating of a potential business partner is a semi-structured type of
decision. There are some structured aspects, for example the company's profits and targets.
But there are also unstructured ones: maybe it is a startup company whose profits are not that good
(yet), but whose CEO is very smart and has a lot of potential. In this case, the decision is hard to fully
structure.
1.2.6 An Early Decision Support Framework
Degree of Structuredness (Simon, 1977)
• Highly structured (programmable)
• Semi-structured
• Highly unstructured (non-programmable)
Types of Control (Anthony, 1965)
• Strategic planning (top-level, long-range)
• Management control (tactical planning)
• Operational control
1.3 Decision Support System
1.3.1 What is a system?
So, what exactly is a system? Think about the climate or a solar system: their components work
together. A system is a set of two or more interrelated components interacting
to achieve a goal. Furthermore, a system can be identified by the fact that it:
- Has a boundary
(The solar system ends with the last planet; after that there is only “space”, so
this might be the boundary of our solar system)
- Has inputs and outputs
(Light from other stars reaches our solar system and our sun shines back; this is the case for
systems in general)
- Interacts with its environment
- Is governed by processes, rules, and procedures
(In a solar system, the laws of gravity, for example, define how the components interact)
In decision support systems, we have components such as an operating system and hardware.
1.3.2 Data vs Information
Data is not the same as information. Data are facts that are collected, recorded,
stored, and processed. Data on its own is insufficient for decision-making.
Information, on the other hand, is processed data used in decision-making. However, it's
important to know that too much information makes it more, not less, difficult to make
decisions. This is known as ‘data overload’ or ‘information overload’.
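The step from data to information can be illustrated with a small sketch. The transaction records below are hypothetical, and summing revenue per product is just one assumed way of processing raw data into something a decision-maker can use.

```python
from collections import defaultdict

# Data: raw facts about individual sales transactions (hypothetical records).
transactions = [
    {"date": "2021-01-04", "product": "Laptop", "amount": 900.0},
    {"date": "2021-01-04", "product": "Mouse", "amount": 25.0},
    {"date": "2021-01-05", "product": "Laptop", "amount": 950.0},
    {"date": "2021-01-06", "product": "Mouse", "amount": 20.0},
    {"date": "2021-01-06", "product": "Monitor", "amount": 300.0},
]

# Processing: organize the data so that it becomes meaningful for a decision,
# here the total revenue per product.
revenue_per_product = defaultdict(float)
for transaction in transactions:
    revenue_per_product[transaction["product"]] += transaction["amount"]

# Information: a short ranked summary a manager can act on.
ranking = sorted(revenue_per_product.items(), key=lambda item: item[1], reverse=True)
for product, revenue in ranking:
    print(f"{product}: {revenue:.2f}")
```

The five raw records say little on their own; the ranked summary answers a concrete question (which product brings in the most revenue?). Keeping the summary short also helps to avoid the information overload mentioned above.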
A quiz question from the lecture relates to this: what is
information?
A – Data that has been organized and processed so that it’s meaningful
B – Raw facts about transactions
C – The same as data
D – Potentially useful facts when processed on time.
1.3.3 The concept of Decision Support Systems (DSS)
What is a DSS? A DSS is an interactive computer-based system that helps decision-makers utilize data
and models to solve unstructured problems (Gorry and Scott-Morton, 1971).
A few important aspects of DSS are:
- They couple the intellectual resources of individuals (expertise) with the computational capabilities
of the computer to improve the quality of decisions.
- They primarily emerged from science.
1.4 Business Intelligence (BI)
1.4.1 Evolution of Computerized Decision Support to Business Intelligence and Data Science
We must take into consideration that a new ‘name’ is a new phenomenon (DSS, EEIS, BI, Analytics, BD).
Above there is a timeline with different technologies that were introduced over the years. It shows how
they are related to each other. For example, it shows the relation between Data Science and
Business Intelligence, and the difference between Business Intelligence and Decision Support Systems
(DSS).
1.4.2 Business intelligence
Business Intelligence is an evolution of decision support concepts over time.
Before: Executive Information System (EIS/DSS)
Now: Everybody’s Information System (BI)
BI systems are enhanced with additional visualizations, alerts, and performance measurement
capabilities. Business Intelligence primarily emerged from the industry.
Definition of Business Intelligence
Business Intelligence combines architectures, tools, databases, analytical tools, applications, and
methodologies. It is a content-free expression, which means that it has a different definition for different
people. Computer scientists, for example, have a different understanding of BI than
accountants.
The major objective of business intelligence is to enable easy access to data (and models) and to make it
easy for business managers to analyze the data.
Furthermore, it helps to transform data into information, to improve decisions, and finally to
implement actions.
Business Intelligence Architecture
A BI system has four major components:
• a data warehouse with its source data
• business analytics (a collection of tools for manipulating, mining, and analyzing the data)
• business performance management (BPM) capabilities for monitoring and analyzing
performance
• a user interface (e.g. a dashboard)
Typical functionalities of business intelligence
Typical key functionalities are:
- Report delivery and alerting
- Enterprise reporting
- Cube analysis
- Ad hoc queries
- Statistics and data mining
Differences between Decision Support System (DSS) and Business Intelligence (BI)
Decision Support Systems                             | Business Intelligence tools
May or may not use a data warehouse                  | Implies use of a data warehouse
Directly support specific decision making            | Provide information to support decision making indirectly
Oriented towards analysts                            | Executive and strategy orientation
Customized solutions for very unstructured problems  | Commercially available tools
Mostly developed in the academic world               | Developed mostly by software companies
Many tools that BI uses are also considered DSS tools (data mining, predictive analytics)
1.5 Business Analytics and Big Data
Business Analytics
Business Analytics is a combination of
(1) computer technology
(2) management science techniques
(3) statistics (to solve problems)
The types of business analytics are usually categorized as:
• Descriptive Analytics (backward-looking)
• Predictive Analytics (forecasts)
• Prescriptive Analytics (future-oriented, and also gives recommendations on what you should do)
On the right, we see an overview of the three types of
business analytics. It shows which questions are asked in
each type, which enablers make the analysis possible, and what
the outcomes of each type of business analytics are.
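The three types of analytics can be contrasted in one small sketch. The sales figures, the straight-line trend model, and the 10% stock buffer rule are all assumptions made for illustration; real business analytics tools use far richer models.

```python
from statistics import mean

# Hypothetical monthly unit sales for the past six months.
sales = [100, 110, 120, 130, 140, 150]

# Descriptive analytics (backward-looking): what happened?
average_sales = mean(sales)

# Predictive analytics (forecast): what will happen?
# Fit a least-squares straight line through months 0..5 and extend it by one month.
months = list(range(len(sales)))
mean_x, mean_y = mean(months), mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x
forecast_next_month = intercept + slope * len(sales)

# Prescriptive analytics (recommendation): what should we do?
# Assumed rule: order enough stock to cover the forecast plus a 10% buffer.
current_stock = 40
recommended_order = max(0, round(forecast_next_month * 1.10) - current_stock)

print(f"Average monthly sales so far: {average_sales}")
print(f"Forecast for next month: {forecast_next_month:.0f}")
print(f"Recommended order quantity: {recommended_order}")
```

Each step builds on the previous one: the description summarizes the past, the prediction extends the trend, and the prescription turns the forecast into a concrete action.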
Alternative classification
1.5.1 Big Data
The growing availability of information leads to Big Data
(e.g. through personal devices connected to the Internet and equipped with digital sensors). The term
Big Data has been used with several inconsistent meanings, and it lacks a formal definition.
So, how do we define Big Data? There are many different ways in which Big Data is defined:
“data is the new oil, the source for corporate energy and differentiation in the 21st century” (ECM, 2011)
“seriously massive and often highly complex sets of information” (Microsoft Research, 2013).
“when the processing capacity of conventional database systems is exceeded” (Dumbill 2013)
“a cultural, technological, and scholarly phenomenon” (Boyd and Crawford, 2012, p. 663)
➢ “Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require
specific Technology and Analytical Methods for its transformation into Value.” (De Mauro et al. 2016)
Examples related to the three criteria of Big Data:
(1) Volume: Volume refers to the quantity of data to be stored. For example, Walmart deals
with big data: it handles more than 1 million customer transactions every hour, importing more
than 2.5 petabytes of data into its databases.
(2) Velocity: Velocity refers to the speed with which data is generated. High-velocity data is
generated at such a pace that it requires distinct (distributed) processing techniques. An example
of data that is generated with high velocity would be Twitter messages or Facebook posts.
(3) Variety: Variety refers to the structured, unstructured, and semi-structured data
that is gathered from multiple sources. While in the past data could only be collected from
spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos,
videos, audio, social media posts, and much more. An example of a high-variety data set would
be the CCTV audio and video files that are generated at various locations in a city.
1.6 Data Science
1.6.1 What is a data scientist?
A data scientist is a high-ranking professional with the training and curiosity to make discoveries in the
world of big data. Data scientists are responsible for forming theories, testing hunches, and finding patterns to
make predictions. They understand how to fish out answers to important business questions from today’s tsunami
of unstructured information. Furthermore, they can bring structure to large quantities of formless data
and make analysis possible. They can identify rich data sources and join them with other, potentially
incomplete data sources. They also display information visually and communicate the patterns they find in a clear
and compelling way. A data scientist is a hybrid of data hacker, analyst, communicator, and trusted adviser.
Do such individuals exist? No, these people don’t exist. It is a description of a group, not of
an individual. You need a group.
1.6.2 Data Science and BI
So, there is no sharp dividing line between
Data Science and BI; they overlap with
each other. The main difference is that BI
is primarily focused on descriptive analysis, while
data science focuses more on
prescriptive analysis. (Also see the graph
on the right.)
1.6.3 Ingredients of Data Science
This is a visualization of what makes up
data science; it has a lot of aspects. We
will focus on a couple of these ingredients during
this course. The ones we will focus on are
marked in red. These are:
- Privacy security law & ethics
- Visualization & Visual analytics
- Databases
- Predictive analytics
- Process mining
- Data mining
Some questions related to these red circles could be:
- What are databases?
- How can we carry out visualization and visual analytics?
- What role do privacy, security, law, and ethics play?
- How do we use data mining and process mining?
- How does predictive analytics work?
1.6.4 The focus of this course
Data analysts require knowledge of:
- Business economics
- Statistics (not really in this course)
- Computer science
In the course, we are applying tools that already exist. It is important
to understand the foundation of the techniques and what their
limitations are. We need to know this to be able to interpret
outcomes when we use analytics.