Chapter 7Formal Languagesnotes.en.pdf:134-137

Phrase-Structure Grammars

⟨N, Σ, P, S⟩ — the workhorse of syntax

Def 7.2.1PSGDef 7.2.2Derives ⇒Def 7.2.3Derives* ⇒*Def 7.2.4L(G)
Concept

A phrase-structure grammar (PSG) is a finite recipe that can describe an
infinite language. It is the workhorse of syntax in formal language theory.

Definition 7.2.1 (notes p. 134): A PSG is a 4-tuple

G=N,Σ,P,SG = \langle N, \Sigma, P, S \rangle

where:
- NN — finite set of non-terminal symbols (placeholders, "variables").
- Σ\Sigma — finite set of terminal symbols (the actual alphabet of the language); NΣ=N \cap \Sigma = \emptyset.
- PP — finite set of production rules hbh \to b (or hbh \Rightarrow b), where
h(ΣN)N(ΣN)h \in (\Sigma \cup N)^* N (\Sigma \cup N)^* (head, contains at least one
nonterminal) and b(ΣN)b \in (\Sigma \cup N)^* (body).
- SNS \in N — the start symbol (also sentence symbol).

A production rule hbh \to b says "rewrite hh as bb". We may write
hb1b2bnh \to b_1 \mid b_2 \mid \dots \mid b_n as a shorthand for nn rules with the
same head.

Derivation (Definition 7.2.2): ss derives tt in one step, written
sGts \Rightarrow_G t, if there is a production rule hbh \to b and strings u,vu, v
such that s=uhvs = uhv and t=ubvt = ubv. We write G\Rightarrow^*_G for the
reflexive-transitive closure (zero or more steps).

Sentential form (Definition 7.2.3): any ss reachable from SS via G\Rightarrow^*_G.
A sentence is a sentential form with no nonterminals — i.e., a string in
Σ\Sigma^*.

Language generated (Definition 7.2.4): L(G)L(G) is the set of all sentences of GG.

Two grammars are equivalent (Definition 7.2.5) iff they generate the same
language.

Animation — derivation steps
Transcript — click a line to jump7 cues
  1. 0.0sGrammar Derivation Steps
  2. 0.8sGrammar: E -> E + E | n.
  3. 2.1sStart with the single nonterminal E.
  4. 3.0sApply E -> E + E; we now have E + E.
  5. 4.1sApply E -> n on the left; we have n + E.
  6. 5.2sApply E -> n on the right; we have n + n.
  7. 6.3sHighlight n + n: this string is in the language.
Lexical vs structural rules

A PSG with rule set PP is often split into two parts:

  • Lexical rules (Def 2.10): produce individual words / tokens / lexemes from characters. Example: Det → the | a | an.
  • Structural / syntactic rules (Def 2.11): combine words into phrases. Example: NP → Det N | Pron.

A lexicon (or vocabulary, Def 2.11) is the set of terminal strings produced by the lexical rules — i.e. the 'words' of the language. The syntactic categories (Def 2.12) are the nonterminals of the structural rules (e.g. NP, VP, S).

This split mirrors how real NLP systems work: a lexer/tokeniser first chops the input into tokens, then a parser builds the phrase structure.

Parsing = syntax analysis

Definition 2.7. Parsing is syntax analysis — given a string ss and a grammar GG, decide whether sL(G)s \in L(G) and produce a parse tree if so.

Parsing is the inverse of derivation: derivation goes from SS outward to a string; parsing goes from a string back to SS (and the tree of how).

Worked example
Step 0 of 2
Practice — score 100% to advance
Multiple choice
Q1
What are the four components of a PSG?
Q2
In a production rule hbh \to b, the head hh must contain at least…
Q3
sGts \Rightarrow^*_G t means…
Q4
L(G)L(G) is the set of…
Q5
Two grammars are equivalent iff…
Q6
Which is a valid production rule in a PSG?
Q7
A 'lexical rule' typically produces:
Fill in the blank
Q1
A PSG is the 4-tuple N,Σ,P,\langle N, \Sigma, P, \rangle.
Q2
The relation G\Rightarrow^*_G is the ___-transitive closure of G\Rightarrow_G.
Q3
L(G)L(G) is the set of all ___ of GG.
Q4
Parsing goes from a ___ back to the start symbol SS.
Order the steps
Arrange these proof steps in the correct order using the arrows.
1
Write production rules hbh \to b in PP
2
Pick the start symbol SNS \in N
3
Apply rules from SS until only terminals remain
4
Choose nonterminals NN and terminals Σ\Sigma
Loading…