Extraction of Syntactic Translation Models from Parallel Data using Syntax from Source and Target Languages
Date of Original Version
Abstract or Table of Contents
We propose a generic rule induction framework that is informed by syntax from both sides of a parsed parallel corpus, expressed as sets of structural, boundary, and labeling constraints. Factoring syntax in this manner allows the framework to work with independent annotations from multiple resources rather than requiring a single syntactic structure. We then explore the lexical coverage of translation models learned in different scenarios, using syntax from one side versus both sides, and look specifically at how the non-isomorphic nature of the parse trees for the two languages affects coverage. We propose a novel technique for restructuring target-side parse trees that generates alternate isomorphic target trees while preserving the syntactic boundaries of constituents that were aligned in the original parse trees. We also show that combining rules extracted by restructuring syntactic trees on both sides produces significantly better translation models. The improved precision and coverage of our syntax tables compensate in particular for the lack of lexical coverage in syntax-based machine translation approaches.
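
To make the coverage question concrete, the following is a minimal sketch of how a boundary constraint from one side versus both sides might filter candidate phrase pairs. It is not the extraction procedure described in this work. It uses the standard word-alignment consistency check from phrase-based extraction, and every name in it (`is_consistent`, `satisfies_constraints`, the toy sentence pair, its alignment links, and its bracketings) is a hypothetical chosen for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # token span: inclusive start, exclusive end
Link = Tuple[int, int]  # word alignment link: (source index, target index)


def is_consistent(src: Span, tgt: Span, links: Set[Link]) -> bool:
    """Return True if no link crosses the span-pair boundary and at
    least one link lies inside it (standard alignment consistency)."""
    has_inside_link = False
    for s, t in links:
        s_in = src[0] <= s < src[1]
        t_in = tgt[0] <= t < tgt[1]
        if s_in != t_in:  # one end inside, the other outside
            return False
        if s_in:
            has_inside_link = True
    return has_inside_link


def satisfies_constraints(
    src: Span,
    tgt: Span,
    links: Set[Link],
    src_constituents: Set[Span],
    tgt_constituents: Set[Span],
    use_source_syntax: bool = True,
    use_target_syntax: bool = True,
) -> bool:
    """Apply the alignment check plus optional boundary constraints:
    a span must match a parse constituent on each side whose syntax is used."""
    if not is_consistent(src, tgt, links):
        return False
    if use_source_syntax and src not in src_constituents:
        return False
    if use_target_syntax and tgt not in tgt_constituents:
        return False
    return True


# Hypothetical toy pair: "the red house" / "la maison rouge".
links = {(0, 0), (1, 2), (2, 1)}
# Assumed source bracketing: (the (red house))
src_constituents = {(0, 3), (1, 3), (0, 1), (1, 2), (2, 3)}
# Assumed target bracketing: ((la maison) rouge), not isomorphic to the source
tgt_constituents = {(0, 3), (0, 2), (0, 1), (1, 2), (2, 3)}

candidate = ((1, 3), (1, 3))  # "red house" / "maison rouge"
source_only = satisfies_constraints(
    *candidate, links, src_constituents, tgt_constituents,
    use_target_syntax=False,
)
both_sides = satisfies_constraints(
    *candidate, links, src_constituents, tgt_constituents,
)
```

In this toy case the pair "red house" / "maison rouge" passes when only source-side boundaries are enforced but is rejected when both sides are, because the assumed target bracketing has no constituent spanning "maison rouge". This is the kind of coverage loss from non-isomorphic trees that restructuring the target-side tree is meant to recover.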