This archive contains automatic *high-precision* alignments for 283,588 predicate pairs that occur in our previously published data set of comparable texts (Roth & Frank, 2012a). Each file in this archive provides a list of aligned predicates for a given pair of newswire sources in the English Gigaword Fifth Edition (Parker et al., 2011) -- e.g., "afp-apw.out" contains all predicate alignments between documents from "Agence France-Presse" and "Associated Press Worldstream". Each alignment is specified in the following format:

[DOCID1],[SENTENCEID1],[TOKENID1],[WORDFORM] (tab) [DOCID2],[SENTENCEID2],[TOKENID2],[WORDFORM]

The document IDs are the original IDs used in Gigaword. Sentence and token IDs refer to annotations generated automatically with Stanford CoreNLP [2]. We additionally provide the word form of each predicate so that automatic sanity checks can be performed.
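As a rough illustration of the format above, the following Python sketch parses one alignment line and performs the word-form sanity check. The names `Predicate`, `parse_predicate`, `parse_alignment` and `word_form_matches` are hypothetical and not part of the archive, and the example line is invented. The sketch assumes that document IDs contain no commas, and it keeps sentence and token IDs as strings because this description does not state whether they are counted from 0 or from 1. The `tokens` lookup is also an assumption: it stands for whatever mapping you build from your own CoreNLP output.

```python
from typing import Dict, NamedTuple, Tuple


class Predicate(NamedTuple):
    doc_id: str
    sentence_id: str  # kept as a string; the indexing base is not assumed here
    token_id: str
    word_form: str


def parse_predicate(field: str) -> Predicate:
    # Split on the first three commas only, so that a word form
    # containing a comma is kept in one piece.
    parts = field.split(",", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got: {field!r}")
    return Predicate(*parts)


def parse_alignment(line: str) -> Tuple[Predicate, Predicate]:
    # The two predicates of an alignment are separated by a single tab.
    left, right = line.rstrip("\r\n").split("\t")
    return parse_predicate(left), parse_predicate(right)


def word_form_matches(pred: Predicate,
                      tokens: Dict[Tuple[str, str, str], str]) -> bool:
    # `tokens` is a hypothetical lookup from (doc ID, sentence ID, token ID)
    # to the word form found in your own tokenized copy of Gigaword.
    key = (pred.doc_id, pred.sentence_id, pred.token_id)
    return tokens.get(key) == pred.word_form


# Invented example line, for illustration only.
example = "AFP_ENG_20050101.0001,3,7,said\tAPW_ENG_20050101.0042,1,4,stated"
first, second = parse_alignment(example)
```

A mismatch reported by `word_form_matches` most likely indicates that the local tokenization differs from the one used to create the alignments.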

All alignments were created automatically using a modified version of our clustering approach described in Roth & Frank (2012b): instead of tuning the F1-score on the development set, we tuned F0.33, i.e., we weighted precision three times higher than recall. We additionally used some similarity measures that have not yet been published. On our previously published test set, this modified method achieved a precision of 86.2% at a recall of 29.1%.
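For readers unfamiliar with the weighted F-measure, the sketch below shows the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), under which a beta below 1 favours precision. The function name `f_beta` is hypothetical, and the code only illustrates the general measure; it is not the tuning procedure used to create the alignments.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard weighted F-measure; beta < 1 favours precision."""
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall as reported above for the published test set.
f_033 = f_beta(0.862, 0.291, beta=0.33)
f_1 = f_beta(0.862, 0.291, beta=1.0)
```

With beta = 0.33, the high precision dominates the score, whereas the balanced F1 is pulled down by the low recall.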

If you want to use the predicate alignments for your own work, please cite Roth & Frank (2012a). In case of questions, please do not hesitate to get in touch with the first author at mroth@cl.uni-heidelberg.de

--
Robert Parker, David Graff, Junbo Kong, Ke Chen and Kazuaki Maeda (2011). English Gigaword Fifth Edition. Linguistic Data Consortium, Philadelphia.

Michael Roth and Anette Frank (2012a). Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, Canada.

Michael Roth and Anette Frank (2012b). Aligning predicates across monolingual comparable texts using graph-based clustering. Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), Jeju, Republic of Korea.