The CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm is a parsing algorithm for context-free grammars. It determines whether a string can be generated by a grammar and, if so, how it can be generated.
Context-Free Grammars
A context-free grammar (CFG) is a formal grammar whose production rules are of the form:
$$A → α$$
where $A$ is a single non-terminal symbol and $α$ is a string of terminals and/or non-terminals (possibly empty).
Example of a Context-Free Grammar
Consider the following grammar for well-balanced parentheses:
$$S → SS | (S) | ε$$
Where:
- $S$ is the start symbol
- $ε$ represents the empty string
This grammar can generate strings like "()", "(())", "()()", etc.
Chomsky Normal Form
For the CYK algorithm to work, the grammar must be in Chomsky Normal Form (CNF). In CNF, all production rules are of one of the following forms:
- $A → BC$ where $B$ and $C$ are non-terminal symbols
- $A → a$ where $a$ is a terminal symbol
- $S → ε$ (only if $S$ is the start symbol and it doesn't appear on the right side of any rule)
Converting to Chomsky Normal Form
Converting a CFG to CNF involves several steps:
-
Eliminate the start symbol from right-hand sides
- Create a new start symbol $S_0$ and add the rule $S_0 → S$
-
Eliminate ε-productions
- For each rule $A → ε$, remove it and modify other rules accordingly
-
Eliminate unit productions
- Replace rules like $A → B$ with the rules for $B$
-
Convert remaining rules to CNF form
- Break rules with more than two non-terminals
- Replace terminals in mixed rules
The CYK Algorithm
The CYK algorithm uses dynamic programming to determine if a string can be generated by a grammar in CNF.
Algorithm Steps
Given a grammar $G$ in CNF and a string $w = a_1 a_2 ... a_n$:
-
Create a triangular table $T$ where $T[i,j]$ represents the set of non-terminals that can generate the substring $a_i ... a_j$.
-
Base case: For each $i$ from 1 to $n$, set $T[i,i]$ to the set of non-terminals $A$ such that $A → a_i$ is a rule in the grammar.
-
Inductive case: For each length $l$ from 2 to $n$, for each starting position $i$ from 1 to $n-l+1$:
- Set $j = i+l-1$ (ending position)
- For each split point $k$ from $i$ to $j-1$:
- For each rule $A → BC$:
- If $B$ is in $T[i,k]$ and $C$ is in $T[k+1,j]$, add $A$ to $T[i,j]$
- For each rule $A → BC$:
-
The string $w$ is in the language of $G$ if and only if the start symbol is in $T[1,n]$.
Time Complexity
The CYK algorithm has a time complexity of $O(n^3|G|)$, where:
- $n$ is the length of the input string
- $|G|$ is the size of the grammar (number of rules)
Example
Let's trace through the CYK algorithm with a simple example:
Grammar in CNF:
- $S → AB | BC$
- $A → BA | a$
- $B → CC | b$
- $C → AB | a$
Input string: $abaa$
Step 1: Fill the diagonal
- $T[1,1] = {A, C}$ (because $A → a$ and $C → a$)
- $T[2,2] = {B}$ (because $B → b$)
- $T[3,3] = {A, C}$ (because $A → a$ and $C → a$)
- $T[4,4] = {A, C}$ (because $A → a$ and $C → a$)
Step 2: Fill the table for substrings of length 2
For $i=1, j=2$:
- Check if any non-terminal can generate $ab$
- $A → BA$: Is $B$ in $T[1,1]$ and $A$ in $T[2,2]$? No.
- $C → AB$: Is $A$ in $T[1,1]$ and $B$ in $T[2,2]$? Yes, so add $C$ to $T[1,2]$.
- $T[1,2] = {C}$
(Continue filling the table...)
Step 3: Check if the start symbol is in $T[1,n]$
If $S$ is in $T[1,4]$, then the string $abaa$ is in the language of the grammar.
Applications
The CYK algorithm is used in:
- Natural Language Processing: Parsing sentences according to a grammar
- Compiler Design: Syntax analysis phase
- RNA Secondary Structure Prediction: Determining the folding of RNA molecules
Limitations and Extensions
- The requirement for CNF can be cumbersome
- Extensions exist for probabilistic context-free grammars
- Earley's algorithm is an alternative that doesn't require CNF
Conclusion
The CYK algorithm is a powerful tool for parsing context-free languages. Despite its cubic time complexity, its simplicity and elegance make it a fundamental algorithm in formal language theory and computational linguistics.