Project description

For a long time, data-driven parsing was restricted to probabilistic context-free grammars (PCFGs), while work in other areas of statistical natural language processing (NLP) concentrated on n-gram models. Only recently has a tendency emerged to incorporate more syntactic information. Building on this, we will pursue the challenging and promising objective of using formalisms beyond PCFGs in statistical NLP.

In data-driven parsing, discontinuities and long-distance dependencies present difficulties because they do not exhibit the locality of CFGs. A few recent approaches therefore use extensions of CFG that can deal with these phenomena. A comparable development can be observed in statistical machine translation (SMT), which faces similar problems. Following the first word-based and phrase-based translation models, an increasing number of syntactic aspects are being taken into account in order to learn accurate alignment models. This project contributes to these two closely related research areas. We will develop new models and algorithms for probabilistic parsing and for SMT, using formalisms such as Linear Context-Free Rewriting Systems (LCFRS) and Range Concatenation Grammars (RCG), extensions of CFG that combine aspects of synchronous grammars with the capacity to describe discontinuities.