The CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm is a parsing algorithm for context-free grammars. It determines whether a string can be generated by a grammar and, if so, how it can be generated.

Context-Free Grammars

A context-free grammar (CFG) is a formal grammar whose production rules are of the form:

$$A → α$$

where $A$ is a single non-terminal symbol and $α$ is a string of terminals and/or non-terminals (possibly empty).

Example of a Context-Free Grammar

Consider the following grammar for well-balanced parentheses:

$$S → SS | (S) | ε$$

Where:

$S$ is the start symbol
$ε$ represents the empty string

This grammar can generate strings like "()", "(())", "()()", etc.

Chomsky Normal Form

For the CYK algorithm to work, the grammar must be in Chomsky Normal Form (CNF). In CNF, all production rules are of one of the following forms:

$A → BC$ where $B$ and $C$ are non-terminal symbols
$A → a$ where $a$ is a terminal symbol
$S → ε$ (only if $S$ is the start symbol and it doesn't appear on the right side of any rule)

Converting to Chomsky Normal Form

Converting a CFG to CNF involves several steps:

Eliminate the start symbol from right-hand sides
- Create a new start symbol $S_0$ and add the rule $S_0 → S$
Eliminate ε-productions
- For each rule $A → ε$, remove it and modify other rules accordingly
Eliminate unit productions
- Replace rules like $A → B$ with the rules for $B$
Convert remaining rules to CNF form
- Break rules with more than two non-terminals
- Replace terminals in mixed rules

The CYK Algorithm

The CYK algorithm uses dynamic programming to determine if a string can be generated by a grammar in CNF.

Algorithm Steps

Given a grammar $G$ in CNF and a string $w = a_1 a_2 ... a_n$:

Create a triangular table $T$ where $T[i,j]$ represents the set of non-terminals that can generate the substring $a_i ... a_j$.
Base case: For each $i$ from 1 to $n$, set $T[i,i]$ to the set of non-terminals $A$ such that $A → a_i$ is a rule in the grammar.
Inductive case: For each length $l$ from 2 to $n$, for each starting position $i$ from 1 to $n-l+1$:
- Set $j = i+l-1$ (ending position)
- For each split point $k$ from $i$ to $j-1$:
  - For each rule $A → BC$:
    - If $B$ is in $T[i,k]$ and $C$ is in $T[k+1,j]$, add $A$ to $T[i,j]$
The string $w$ is in the language of $G$ if and only if the start symbol is in $T[1,n]$.

Time Complexity

The CYK algorithm has a time complexity of $O(n^3|G|)$, where:

$n$ is the length of the input string
$|G|$ is the size of the grammar (number of rules)

Example

Let's trace through the CYK algorithm with a simple example:

Grammar in CNF:

$S → AB | BC$
$A → BA | a$
$B → CC | b$
$C → AB | a$

Input string: $abaa$

Step 1: Fill the diagonal

$T[1,1] = {A, C}$ (because $A → a$ and $C → a$)
$T[2,2] = {B}$ (because $B → b$)
$T[3,3] = {A, C}$ (because $A → a$ and $C → a$)
$T[4,4] = {A, C}$ (because $A → a$ and $C → a$)

Step 2: Fill the table for substrings of length 2

For $i=1, j=2$:

Check if any non-terminal can generate $ab$
$A → BA$: Is $B$ in $T[1,1]$ and $A$ in $T[2,2]$? No.
$C → AB$: Is $A$ in $T[1,1]$ and $B$ in $T[2,2]$? Yes, so add $C$ to $T[1,2]$.
$T[1,2] = {C}$

(Continue filling the table...)

Step 3: Check if the start symbol is in $T[1,n]$

If $S$ is in $T[1,4]$, then the string $abaa$ is in the language of the grammar.

Applications

The CYK algorithm is used in:

Natural Language Processing: Parsing sentences according to a grammar
Compiler Design: Syntax analysis phase
RNA Secondary Structure Prediction: Determining the folding of RNA molecules

Limitations and Extensions

The requirement for CNF can be cumbersome
Extensions exist for probabilistic context-free grammars
Earley's algorithm is an alternative that doesn't require CNF

Conclusion

The CYK algorithm is a powerful tool for parsing context-free languages. Despite its cubic time complexity, its simplicity and elegance make it a fundamental algorithm in formal language theory and computational linguistics.