Title from disc label.
Data type: Text.
Data sources: Newsgroups, newswire, weblogs.
Applications: Handwriting recognition, machine translation.
"LDC2013T09".
Authors: David Lee, Safa Ismael, Stephen Grimes, Dave Doermann, Stephanie Strassel, Zhiyi Song.
Summary:
"MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase 2 Training Set contains all training data created by the Linguistic Data Consortium to support Phase 2 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output."
This resource is supported by the Institute of Museum and Library Services under the provisions of the Library Services and Technology Act as administered by the State Library of Iowa.