Exam Questions Data Engineering
Introduction and file formats
1) How are integers, decimal numbers, text and images stored in a computer? Give an example
of binary encoding for each type.
What is binary encoding?
• Binary encoding is a method of representing data using the binary number system, with two
digits: 0 and 1.
• Binary encoding is used to convert various types of data, such as numbers, text, and images,
into a format that can be easily stored, processed, and transmitted by computers.
• In binary encoding, each digit is referred to as a "bit" (short for binary digit), and a group of 8
bits is called a "byte." These bits are combined to represent different types of data by assigning
specific meanings to sequences of 0s and 1s.
1) Integer:
Computers work with bits or Boolean values (0/1). Integer numbers are typically stored using a fixed
number of bits, which represent the binary equivalent of the integer value.
• The first bit is used for the sign of the integer: 1 is negative and 0 is positive
o For example, consider an 8-bit binary representation:
▪ 1 Sign bit and 7 Magnitude bits
▪ 8-bit: 00110101 = +53 (positive)
▪ 8-bit: 10110101 = -53 (negative)
• Without a sign bit (unsigned), you need N bits to represent any integer between 0 and 2^N - 1.
• E.g. with 8 bits (one byte) you can store all unsigned integers between 0 and 2^8 - 1 = 255.
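The sign-bit scheme above can be sketched in a few lines of Python. The helper names to_sign_magnitude and from_sign_magnitude are made up for this illustration. Note that most real hardware stores negative integers in two's complement rather than with a plain sign bit; the sketch follows the simpler scheme described in these notes.

```python
def to_sign_magnitude(value: int, bits: int = 8) -> str:
    """Encode an integer as 1 sign bit followed by (bits - 1) magnitude bits."""
    magnitude_bits = bits - 1
    if abs(value) > 2 ** magnitude_bits - 1:
        raise ValueError(f"{value} does not fit in {bits} sign-magnitude bits")
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{magnitude_bits}b")


def from_sign_magnitude(pattern: str) -> int:
    """Decode a sign-magnitude bit string back into an integer."""
    magnitude = int(pattern[1:], 2)
    return -magnitude if pattern[0] == "1" else magnitude


print(to_sign_magnitude(53))             # 00110101
print(to_sign_magnitude(-53))            # 10110101
print(from_sign_magnitude("10110101"))   # -53
print(2 ** 8 - 1)                        # 255, largest unsigned 8-bit value
```

With one bit spent on the sign, 8 bits cover -127 to +127 instead of 0 to 255.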
2) Decimal numbers:
Decimal numbers, i.e. real numbers with a fractional part, are usually stored using floating-point
representation. This breaks the number down into three parts: sign bit, exponent and mantissa. To
interpret the bits you need to know how many are reserved for the exponent and how many for the mantissa.
A floating-point number consists of 3 parts (here 32 bits in total):
• Sign bit
• Exponent (scaling factor)
• Mantissa (the significant digits, which are multiplied by the scaling factor)
Example (simplified): - 1 000 0001 = -1 * 10^6 * 1.0000001. Floating point can represent very large
and very small numbers using a fixed number of bits. However, the precision is limited: most fractions
and all irrational numbers are rounded, for example x = 1/3 may be stored as 0.333333333332.
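To look at real bit patterns, the sketch below assumes the common IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits). These notes do not name that standard, so treat it as an assumption. The helper float32_bits is a made-up name for this illustration.

```python
import struct


def float32_bits(x: float) -> str:
    """Return the 32-bit pattern of x as 'sign exponent mantissa'."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes
    # as an unsigned integer so the individual bits can be printed.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(float32_bits(-1.0))  # 1 01111111 00000000000000000000000
print(float32_bits(0.5))   # 0 01111110 00000000000000000000000

# Rounding: 1/3 cannot be stored exactly, and 32 bits keep fewer digits
# than Python's default 64-bit float.
stored = struct.unpack(">f", struct.pack(">f", 1 / 3))[0]
print(stored == 1 / 3)     # False
print(0.1 + 0.2 == 0.3)    # False
```

In this layout the exponent is stored with an offset of 127, which is why -1.0 = -1 * 2^0 shows the exponent bits 01111111.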
3) Text:
Text is stored in computers using character encoding schemes such as ASCII, UTF-8, or UTF-16. Each
character is assigned a unique binary code, allowing the computer to represent and process text.
• A text is a sequence of characters (a string). In single-byte encodings such as ASCII, each
character is encoded as one byte of memory using an encoding table.
o 1 byte = 8 bits
o E.g. in ASCII: 'A' = 65 = 01000001
o Typical book = 250 pages * 300 words/page * 5 characters/word = 375,000 bytes ≈ 0.4 MB
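A short Python sketch of character encoding, using only the built-in str.encode method. The sample string "Data" is arbitrary. It also shows that UTF-8 needs more than one byte for characters outside the ASCII table, so "one byte per character" holds only for single-byte encodings.

```python
text = "Data"

# ASCII: every character maps to exactly one byte.
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes))                         # [68, 97, 116, 97]
print([format(b, "08b") for b in ascii_bytes])
# ['01000100', '01100001', '01110100', '01100001']

# UTF-8: ASCII characters still take 1 byte, others take more.
print(len("a".encode("utf-8")))                  # 1
print(len("é".encode("utf-8")))                  # 2

# Decoding with the same table restores the original string.
print(ascii_bytes.decode("ascii"))               # Data

# Size estimate for a typical book at one byte per character.
book_bytes = 250 * 300 * 5
print(book_bytes)                                # 375000
```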
4) Image:
Images are stored using various formats, and the most basic form involves encoding pixel values. Each
pixel's colour is represented using a combination of red, green, and blue (RGB) values, which are stored
using binary codes.
• Matrix of pixels: each pixel represented by 3 numbers between 0 and 255 for red, green and
blue intensity.
• 4K image = 3840 × 2160 pixels * 3 bytes/pixel = 24,883,200 bytes ≈ 24.9 MB (uncompressed)
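The pixel encoding and the size estimate can be checked with a small calculation. The orange example pixel is an arbitrary choice for this illustration, and the size is for raw, uncompressed pixel data.

```python
# One pixel: three intensities between 0 and 255, one byte each.
pixel = (255, 128, 0)  # (red, green, blue), an orange tone
print(" ".join(format(channel, "08b") for channel in pixel))
# 11111111 10000000 00000000

# Raw size of a 4K image: width * height pixels, 3 bytes per pixel.
width, height, bytes_per_pixel = 3840, 2160, 3
size_bytes = width * height * bytes_per_pixel
print(size_bytes)        # 24883200
print(size_bytes / 1e6)  # 24.8832 (MB)
```

File formats such as PNG or JPEG compress this data, so files on disk are usually much smaller than the raw size.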
2) What is encoding and decoding? Explain and give an example.
Encoding and decoding are processes used to convert data from one representation or format to
another.
Encoding:
Encoding refers to the process of converting data or information into a specific format or
representation that is suitable for a particular purpose, such as storage, transmission, or processing.
The encoding process often involves transforming the original data into a standardized or compressed
format that can be easily interpreted or utilized by a computer or system.
Examples:
• Binary Encoding
• Categorical values encoding
• Run length encoding
• Huffman encoding
Decoding:
Decoding is the reverse process of encoding. It involves converting encoded data back into its original
form or format, making it understandable or usable by humans or systems. The decoding process
typically requires knowledge of the encoding scheme or algorithm used to transform the data initially.
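As a worked example, here is a minimal sketch of run length encoding, one of the schemes listed above. Each run of identical characters is replaced by the character and its count. The function names rle_encode and rle_decode are chosen for this illustration.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Encode: collapse each run of equal characters into (character, count)."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Decode: expand each (character, count) pair back into a run."""
    return "".join(char * count for char, count in pairs)


original = "AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # AAAABBBCCD
```

Decoding only works because the decoder knows the scheme: a list of (character, count) pairs. The scheme pays off on data with long runs; on text without repeats the encoded form is larger than the original.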
3) We saw three different data models for representing data. Name and provide a short
summary of each data model.
1) Relational model:
• Consists of tables with rows (or tuples / records) and columns (or attributes)
• Each column contains simple atomic values such as string, integer, float or date.
o Each column has a type: INTEGER, DATETIME, VARCHAR
o Values in a column have the same type
o Each column is declared either NULL (value may be missing) or NOT NULL (value is mandatory)
• Two types of tables:
o Entities, e.g. persons, groups (in general: objects)
o Relations between entities, e.g. part-of, has-a, has-many, linked-to (relations between
different tables).
• Each table can be saved as a Comma-Separated Values (CSV) file, or all tables can be stored
together in a relational database.
• NULL is used if a value is unknown or not available
• A table has an attribute with a unique value for each row: the primary key
• A table references (or links to) another table using a foreign key
• Tables can be combined by joining on primary and foreign key
• The database checks the schema to ensure each attribute is of the correct type!
o E.g. see pwp
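The relational model can be tried out with Python's built-in sqlite3 module. The tables team and person and their contents are invented for this sketch. One caveat: SQLite is lenient about column types, so the example demonstrates the NOT NULL and foreign key checks rather than strict type checking; other relational databases are usually stricter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only if asked

conn.executescript("""
CREATE TABLE team (
    team_id INTEGER PRIMARY KEY,               -- unique value per row
    name    VARCHAR NOT NULL                   -- value is mandatory
);
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    name       VARCHAR NOT NULL,
    birth_date DATETIME,                       -- may be NULL (unknown)
    team_id    INTEGER REFERENCES team(team_id)  -- foreign key
);
""")

conn.execute("INSERT INTO team VALUES (1, 'Data Engineering')")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [(1, "Ada", "1990-01-01", 1), (2, "Bob", None, 1)],
)

# Combine both tables by joining foreign key to primary key.
rows = conn.execute("""
    SELECT person.name, team.name
    FROM person JOIN team ON person.team_id = team.team_id
    ORDER BY person.person_id
""").fetchall()
print(rows)  # [('Ada', 'Data Engineering'), ('Bob', 'Data Engineering')]

# A reference to a team that does not exist is rejected.
try:
    conn.execute("INSERT INTO person VALUES (3, 'Eve', NULL, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)  # True
```

Bob's birth date is stored as NULL because it is unknown, which the schema allows for that column but not for name.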