Taaltheorie en Taalverwerking Samenvatting 1
Lecture 1a
Reasons for studying natural language processing (NLP) within AI:
You want a computer to communicate with users in their terms.
There is a vast store of information recorded in natural language that can be accessible via
computers: news, government reports, social media etc.
Three main types of applications:
o Enabling human-computer communication
o Improving human-human communication
o Doing useful processing of text or speech
Formal (FLs) and Natural (NLs) languages:
Formal (computer) languages: e.g. Java, Prolog, Python, HTML
Natural (human) languages: e.g. Dutch, English, Spanish, German, Japanese
‘Languages’ that represent the behavior of a machine or a system: e.g. think about
‘communicating’ with a vending machine via coin insertions and button presses: pressButton1
Syntax trees:
Syntax: Describes the structural properties of the language. Natural language is much more
complicated than the formal languages used for the artificial languages of logics and computer
programs.
Natural language semantics:
Semantics: Represents the meaning of words and sentences. Logic is a good candidate.
Consider the sentence:
o Every student has access to a computer
The meaning of this can be expressed as two different logical formulas:
o ∀x.(student(x) ⇒ ∃y.(computer(y) ∧ hasAccessTo(x, y)))
Every student has access to their own computer
o ∃y.(computer(y) ∧ ∀x.(student(x) ⇒ hasAccessTo(x, y)))
There is a computer to which every student has access to
Problem: How can (either of) these formulas be mechanically generated from a syntax tree for
the original sentence?
FLs and NLs:
There are close relationships between FLs and NLs, but also important differences:
o FLs can be pinned down by a precise definition.
o NLs are fluid, fuzzy at the edges and constantly evolving.
o NLs are riddled with ambiguity at all levels.
This is normally avoidable in FLs.
,Ambiguity in Natural Language:
Phonological ambiguity: ‘an ice lolly’ vs. ‘a nice lolly’
Lexical ambiguity: ‘bass’ has at least two meanings: fish and musical instrument
Syntactic ambiguity: two possible syntax trees for ‘complaints about referees multiplying
o Could mean:
There are more and more complaints about referees
Referees are multiplying and there are complaints about that
Semantic ambiguity: ‘Please use all available doors while boarding the train’ vs. ‘Please fill all
sections in the questionnaire’
o First sentence: Doesn’t mean that you personally have to use every single door
o Second sentence: Does mean that you have to fill in every single section
Levels of Language Complexity:
Some languages features are ‘more complex’ (harder to describe, harder to process) than
others. We can classify languages on a scale of complexity known as the Chomsky Hierarchy:
o Regular languages: Those whose phrases can be ‘recognized’ by a finite state machine
o Context-free languages: Many aspects of NLs can be described at this level (also used
for most programming languages)
o Context-sensitive languages: Some NLs involve features at this level of complexity
o Recursively enumerable languages: All languages that can in principle be defined via
mechanical rules.
Formal Languages:
A formal language is a set of strings
Each string is composed of symbols from a set called an alphabet (or a vocabulary)
o Examples of alphabets:
English letters: Σ = {a, b, c . . . z}
Decimal digits: Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Programming language ‘tokens’: Σ = {if, while, x, ==}
‘Primitive’ actions performed by a machine or system, e.g. a vending machine
Σ = {insert50c, pressButton1, ...}
Words in (some fragment of) a natural language: Σ = {the, an, a, dog, cat,
sleeps}
o Examples of strings over alphabets:
Let Σ1 = {0, 1} be an alphabet. Then 01101, 000001, 1101 are strings over Σ1.
— in fact, all binary numbers are strings over Σ1
Let Σ2 = {a, b, c, d, e, f, g} be an alphabet. Then bee, dad, cabbage, and face
are strings over Σ2, as are fffff and agagag.
Let Σ3 = {ba, ca, fa, ce, fe, ge} be an alphabet. Then face is a string over Σ3 but
bee, dad or cabbage are not.
Let Σ4 = {♠, ©, ♣} be an alphabet. Then ♠♠ and ♣©♣ are strings over Σ4.
The length of a string is the number of token symbols from the alphabet it contains
o Examples:
The length of ‘face’ over Σ2 = {a, b, c, d, e, f, g} is 4
The length of ‘face’ over Σ3 = {ba, ca, fa, ce ,fe, ge} is 2
The string of length 0 is called the empty string, denoted by ε (epsilon).
Given a string s, a substring of s is a string formed by taking contiguous symbols of s in the
order in which they occur in s.
An initial substring is called prefix and a final substring a suffix.
, o Let unthinkable be a string over Σ = {a, b, c . . . x, y, z} Then, ε, un, unth, unthinkable
are prefixes, while ε, e, able, thinkable, and unthinkable are suffixes. Other substrings
include nthi, inka, bl.
Σ* denotes the set of all strings over an alphabet Σ
o Σ is always infinite, regardless of the number of symbols Σ contains.
We may now define a formal language L over an alphabet Σ as any subset of Σ*: L ⊆ Σ*
o Examples: let Σ = (a, b, c … x, y, z}. Then Σ* is the set of strings over the Latin alphabet
and the following subsets of Σ* are possible formal languages:
The set of strings consisting of consonants (medeklinker) only
The set of strings containing at least one vowel (klinker) and one consonant
The set of strings whose length is less than 9 symbols
The set {one, two, tree, four, five, six, seven, eight, nine, ten}
The set of all English words
The empty set
Ways to define a Formal Language:
Given an alphabet Σ and the infinite set Σ* of formal languages it can give rise to, how can we
select a particular formal language?
o Direct mathematical definition: Σ = {a, b, c}, L1 = {aa, bb, abc, abcc}, L2 = {a nbn |n > 0}
o Formalisms (formal expressions and grammars): sets of rules
o Automata: computational devices for computing languages
Specify some machine for testing whether a string is legal or not
Formalisms and automata allow us to distinguish a formal language of interest (a set of
strings) from other possible languages over a given alphabet
o They capture the patterns that characterize a language
o As such, they act as a definition of the language they capture
From an abstract point of view, a natural language, like Dutch, is a set of strings
(sounds/letters/words etc.)
Therefore, formalisms and automata can help us to model aspects of natural languages
Regular expressions:
Regular expressions are a formal notation for characterizing sets of strings that follow a fairly
simple regular pattern.
We can construct regular expressions over an alphabet Σ as follows:
(these regular expressions are written in mathematical notation)
o We often ignore the dot in concatenation, and simply write ab
o Disjunction (or union) may be written as a|b a+b or a∪b
o a+ is the set of a-strings with at least one a (same as a*a or aa*)
o an can be used to abbreviate the concatenation of a with itself n times
o the notation Σ* can be seen as abbreviating (a|b|…)* for any symbol a, b, … in Σ
Examples: Let Σ = {a, b, c … x, y, z}
o me(o)*w mew, meow, meooow, meooooooooooow etc.