Summary

Summary of all lectures


Course
Text Mining (L_PABAALG002)

Institution
Vrije Universiteit Amsterdam (VU)

I made this brief summary and got an 8.5. I hope you will too!


Uploaded on October 10, 2023
Number of pages 16
Written in 2021/2022
Type Summary

Education
Artificial Intelligence

gideonrouwendaal


Lecture 1: Introduction
Computational linguistics: algorithms that model language data, e.g., similarity, information value
and sequence probabilities (mathematical view)

Natural Language Processing (NLP): engineering to address aspects of natural language, e.g.,
tokenization, lemmatization, compound splitting, syntactic splitting, entity detection, sentiment
analysis… (engineering view)

NLP Toolkits: software packages and resources that provide and/or combine collections of NLP
modules

Language applications: machine translation, summarization, chat bots, text mining

Text mining: from unstructured text to structured data (information or knowledge)

Lecture 2: Linguistics and Natural Language Processing
Subdiscipline | Medium or unit | Natural language model
Phonetics, phonology | Sounds | Automatic speech recognition
Morphology | Words, word formation | Part-of-speech taggers, lemmatizers, compound splitters
Syntax | Sentences, grammatical structure and function | Syntactic parsers, chunkers
Semantics | Meaning | Semantic parsers
Pragmatics | Language use in context | Context and domain models

Methods: introspection, behaviorism, empirical (experimental and stochastic), mathematical models
Resources: lexicons (dictionary as database), grammars, data collections and annotations, data models

We use minimal information to express a lot (e.g., on hearing "riots in Amsterdam", people know exactly
which riots are meant). Without context, data (spoken words etc.) is difficult to understand.

- Morphology: the study of the form and structure of words. Words are composed of morphemes.
A morpheme is the smallest meaning-bearing unit (e.g., talked consists of two morphemes: talk
(activity) and -ed (past)).

Different types of morphemes:

- Free morphemes: occur independently (e.g., boy, sing)
- Bound morphemes: attached to another morpheme, and cannot be used independently
(English -s: boys, Dutch -s/en: appels/appelen)
- Affix: prefixes (e.g., gelopen), infixes (e.g., burgemeesterspost), suffixes (e.g., loopje)

Some other basic terms:

- Root or Base: an un-analysable morpheme, expressing the basic lexical content of a word.
Also defined as ‘what is left of a complex form when affixes are stripped’
- Stem: consists of at least a root and can also contain one or more derivational affixes
(“aardigste” → “aardig” / “aard”)
- Lemma: an entry in a dictionary: the singular form for nouns (“stemmetje” → “stem”) and the
infinitive form for verbs (“stemde” → “stemmen”)

The difference between a stem and a lemma is that a stem does not have to be an actual word, whereas
a lemma is always an actual word of the language.

Words have a part-of-speech (PoS), which specifies the typical phrase structures in which they can be
the head.

- Open class (open to word formation and neologisms): Noun (N, boat), Verb (V, float),
Adjective (A, large/fast), Adverb (very/largely). New words are invented every day and other words
are forgotten. There are millions of open class words if we include specialized language.
- Closed class (you cannot invent a new closed class word): Pronoun (PRN, he/him/…),
Preposition (P, in, at, from…). These words are relatively fixed and change slowly over generations;
they form a small set of fewer than a hundred words.

Word modification: given a root, base or stem, derive different forms.

- Inflection: expresses syntactic properties such as person (1, 2, 3), number (singular/plural),
gender, tense…
- Derivation: changes semantic and grammatical properties, e.g., incapable.
- Compounding: “beach head”.
- Combinations: aircraft-carriers.

Word formation is very productive, so our lexicon is potentially infinite: the number of
unseen compounds detected in German and Dutch newspapers grows linearly with the number of
newspapers over time. The number of names for new chemical compounds and proteins grows rapidly
every year, and new products are launched every year.

Zipfian distribution (Zipf's law): the frequency of a word in a frequency-ranked list is approximately
equal to the frequency of the most frequent word divided by the word's rank. The most frequent words
also tend to be short and to have many different meanings.
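
As a rough illustration of Zipf's law, the sketch below counts the words in a toy sentence and compares each observed frequency with the prediction "frequency of the top word divided by rank". The helper name `zipf_table` and the sample text are assumptions made for this example; real corpora follow the law only approximately, and a text this small cannot show it convincingly.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, observed, predicted) rows, most frequent word first."""
    counts = Counter(tokens).most_common()
    top_frequency = counts[0][1]
    rows = []
    for rank, (word, observed) in enumerate(counts, start=1):
        predicted = top_frequency / rank  # Zipf: f(rank) is roughly f(1) / rank
        rows.append((rank, word, observed, predicted))
    return rows

# Hypothetical toy text, only meant to show the bookkeeping.
text = "the cat saw the dog and the dog saw the bird"
for rank, word, observed, predicted in zipf_table(text.split()):
    print(f"{rank:>2} {word:<5} observed={observed} predicted={predicted:.2f}")
```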

Lexicon of forms: lists all common base forms with their part-of-speech, inflectional paradigm
(plural, singular, person, tense) and typical (conventional) derived forms. Inflectional paradigms
(-s, -ed) and derivational morphemes (-ation, -ity, -ly).

Morphology in computational linguistics: analyzing complex words and defining their component parts
(anti+dis+establishment+…). Analysis of the grammatical information encoded in words: part-of-speech
= VERB and inflectional information = [PERSON 3, NUMBER singular, TENSE present]. Obtaining the
stem or root: to reduce the size of the data and to find the word in the lexicon.
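
A minimal sketch of this kind of analysis: strip an inflectional suffix, look the remainder up in a lexicon, and report the grammatical information the suffix encodes. The tiny `LEXICON`, the suffix table and the function name `analyze` are hypothetical and only cover the examples shown; real analyzers handle spelling changes, derivation and much larger lexicons.

```python
# Hypothetical mini-lexicon: base form -> part of speech.
LEXICON = {"talk": "VERB", "walk": "VERB", "boy": "NOUN"}

# (suffix, part of speech of the stem) -> inflectional information (toy set).
SUFFIX_FEATURES = {
    ("ed", "VERB"): {"TENSE": "past"},
    ("s", "VERB"): {"PERSON": 3, "NUMBER": "singular", "TENSE": "present"},
    ("s", "NOUN"): {"NUMBER": "plural"},
}

def analyze(word):
    """Return (stem, pos, features), or None if no lexicon entry is found."""
    if word in LEXICON:
        return word, LEXICON[word], {}
    for (suffix, pos), features in SUFFIX_FEATURES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Accept the split only if the stem is a known word of the right class.
            if LEXICON.get(stem) == pos:
                return stem, pos, features
    return None  # unknown word, even after stripping suffixes

for word in ("talked", "walks", "boys", "blorp"):
    print(word, "->", analyze(word))
```

Note that the same suffix -s receives a different analysis depending on the part-of-speech of the stem, which is why the lexicon lookup is part of the procedure.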

Part-of-speech tagging: the task is to assign a part-of-speech category to every token and to add the
lemma. The main challenge is data sparseness for specific languages and domains. PoS-tagging has
an accuracy of around 95-96% over all tokens when training and testing. Remaining issues: long-distance
dependencies/genuine ambiguities, annotation errors and unknown words. A relatively high
proportion of sentences contains at least one error. These errors can propagate: a wrong PoS may lead
to a wrong word sense, named entity…
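
A back-of-the-envelope calculation shows why many sentences contain at least one tagging error even when token accuracy is high. The sketch assumes, unrealistically, that errors on different tokens are independent; the sentence lengths are arbitrary choices for illustration and the function name is hypothetical.

```python
def sentence_error_probability(token_accuracy, sentence_length):
    """Chance that at least one token in a sentence is tagged wrongly,
    assuming independent per-token errors (a simplification)."""
    return 1 - token_accuracy ** sentence_length

for length in (10, 20, 30):
    p = sentence_error_probability(0.96, length)
    print(f"{length} tokens: {p:.0%} of sentences with at least one error")
```

Under these assumptions, more than half of all 20-token sentences contain at least one wrongly tagged token at 96% token accuracy.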

Multiword expressions: fixed idioms (an apple a day keeps the doctor away), less fixed idioms
(shooting from the hip), slots (X, let alone Y), collocations (running engine, running a programme)
and selectional restrictions (a glass of …).

- Syntax: we experience a sentence as a complete grammatical structure. We can freely
combine words into phrases or constituents and we have a strong intuition about the
grammaticality of these structures within a sentence.

Phrase: a word or a group of words which functions as a single unit within a grammatical hierarchy.
A phrase is built around a head lexical item and has a certain syntactic behaviour:

- she → Noun phrase (NP, the head is a pronoun)
- a very beautiful morning → NP (the head is a noun)
- chases the cat → Verb phrase (VP, the head is a verb)

The head of a phrase is the element that determines the syntactic function of the whole phrase.

Syntactic elements

Phrasal categories: Noun phrase (NP), Prepositional phrase (PP), Verb phrase (VP), Adverbial phrase
(AdvP), Adjectival phrase (AP)
Lexical categories: Noun (N), Pronoun (Pr), Adjective (A), Adverb (Adv), Verb (V), Preposition (P)

A phrase structure can be nested. The nesting is hierarchical and consists of head-modifier relations.
For example:

- Very nice = Adjective phrase or AP (head is an adjective (A))
- A very nice looping = NP (head is a noun (N))
- Performs a nice looping = VP (head is a verb (V))
- With a long stick = Prepositional phrase or PP (head is a preposition (P))
- The cow performs a very nice looping with a long stick = Sentence (S)
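
The nesting can be made concrete with a small data structure. The sketch below encodes the example sentence as nested tuples of the form (label, children…) and recovers the words of every phrase with a given label. The bracketing is one possible analysis (attaching the PP to the VP is a choice, not the only option), the determiner label `Det` is an added assumption because the category list in this summary does not include one, and the helper names are hypothetical.

```python
# Each node is (label, child, child, ...); a leaf child is a plain string.
sentence = (
    "S",
    ("NP", ("Det", "the"), ("N", "cow")),
    ("VP",
        ("V", "performs"),
        ("NP", ("Det", "a"), ("AP", ("Adv", "very"), ("A", "nice")), ("N", "looping")),
        ("PP", ("P", "with"), ("NP", ("Det", "a"), ("A", "long"), ("N", "stick")))),
)

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Yield the word sequence of every phrase carrying the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1:]:
        yield from phrases(child, label)

print(list(phrases(sentence, "NP")))
print(list(phrases(sentence, "AP")))
```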

Phrase functions: subject, object, main verb, modifier, adjunct… Phrase functions and the different
categories can be modelled inside a syntactic tree.

Grammatical subject: shows agreement with the main verb

Grammatical objects: obligatory NPs or PPs needed to form a grammatical sentence

[Figure: syntax tree with dependency labels]

The most important types of predicates in terms of obligatory arguments
(the complementation = what is needed to obtain a grammatical
structure):

Valency | Predicate | Complementation | Example
Intransitive | walk.v | NP.subject | The cow walks
Transitive | perform.v | NP.subject, NP.direct object | The cow performs a looping
Transitive | count.v | NP.subject, PP(on).pp-object | The cow is hoping for a big applause
Transitive | be.v | NP.subject, NP.object/AP.object | This cow is a phenomenon / this cow is phenomenal
Ditransitive | give.v | NP.subject, NP.direct object, NP.indirect object | The cow gives the spectators an unforgettable day

A lexicon provides a list of verbs with their complementation patterns.
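
Such a lexicon can be sketched as a mapping from verbs to their obligatory arguments. The entries below are taken from the table above; the dictionary layout and the function `missing_arguments` are assumptions for illustration only, not a description of any particular parser.

```python
# Verb -> obligatory argument slots (its complementation pattern).
COMPLEMENTATION = {
    "walk": ["NP.subject"],
    "perform": ["NP.subject", "NP.direct object"],
    "give": ["NP.subject", "NP.direct object", "NP.indirect object"],
}

def missing_arguments(verb, found_arguments):
    """Return the obligatory arguments of the verb that were not found."""
    required = COMPLEMENTATION[verb]
    return [slot for slot in required if slot not in found_arguments]

# "The cow performs" lacks its direct object, so the structure is incomplete.
print(missing_arguments("perform", ["NP.subject"]))
# "The cow walks" has everything an intransitive verb needs.
print(missing_arguments("walk", ["NP.subject"]))
```

An empty result means that all obligatory arguments are present, so the structure is grammatical as far as complementation is concerned.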

Phrase structure parsers: look up the words of a sentence in the lexicon to find a candidate for a main
verb. Get the obligatory arguments of the verb. Match the structure of surrounding phrases with the
