Interesting links

From Slavko Zitnik's research wiki


Frameworks differ somewhat from libraries. Developing an application with a framework means writing custom plugins, accepting the framework's architecture and following its guidelines completely (e.g. its document representation). Libraries can seem much less overwhelming, as they can easily be integrated into your application.

GATE (General Architecture for Text Engineering)

GATE is a comprehensive text processing framework. It has been actively developed at the University of Sheffield since 1995.

GATE roughly consists of Language (e.g. lexicons, corpora, ontologies), Processing (algorithms) and Visual (GUI) resources. The text is represented in documents as content, annotations and features.
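As a rough illustration of this document model, the sketch below represents a document as text content plus a list of annotations, each carrying a type, character offsets and a feature map. It is plain Python, not GATE's Java API, and all names (`Document`, `Annotation`, `annotate`, `covered_text`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document content with arbitrary features."""
    ann_type: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Text content plus stand-off annotations and document-level features."""
    content: str
    annotations: list = field(default_factory=list)
    features: dict = field(default_factory=dict)

    def annotate(self, ann_type, start, end, **features):
        # Store the annotation separately from the text (stand-off markup).
        ann = Annotation(ann_type, start, end, dict(features))
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        # Recover the text span an annotation points at.
        return self.content[ann.start:ann.end]


doc = Document("GATE is developed in Sheffield.", features={"lang": "en"})
start = doc.content.index("Sheffield")
loc = doc.annotate("Location", start, start + len("Sheffield"), source="gazetteer")
print(doc.covered_text(loc), loc.features)
```

Keeping annotations separate from the content means several processing steps can add layers over the same text without modifying it.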

I think GATE is mainly known for ANNIE and JAPE. ANNIE (A Nearly-New Information Extraction System) is a rule-based IE system that makes extensive use of gazetteer lists and JAPE. JAPE (Java Annotation Patterns Engine) provides a powerful language for defining rules, actions and annotations for all kinds of text processing. GATE also includes a large collection of CREOLE (Collection of REusable Objects for LanguagE) plugins, some of which are similar to UIMA's plugins.

Ontologies are supported through a high-level API for representing, manipulating and reasoning with OWL-Lite ontologies.

Machine learning modules are also part of GATE. It integrates LibSVM and a Weka interface for entity and relation extraction.

GATE tools are also accessible as a service.

UIMA (Unstructured Information Management Architecture)

UIMA comprises software systems that analyze large volumes of unstructured information in order to discover knowledge relevant to an end user. The framework includes a number of specific annotators (e.g. HMM tagger, Snowball algorithm, Alchemy, OpenCalais), which are written in Java or C++.

UIMA also provides capabilities to wrap components as network services and can scale to very large volumes. The UIMA framework was used by IBM's Watson in the Jeopardy! challenge in 2011. Additional UIMA components can be found in different repositories.

Stanford CoreNLP

Stanford CoreNLP is a set of natural language tools for raw English language text, maintained by The Stanford Natural Language Processing Group. It provides foundational building blocks for higher level text understanding applications.

It integrates all their NLP tools for the English language, including PoS tagger, Named Entity Recognizer and coreference resolution system.

The developer defines a workflow of annotators and then processes text in a pipeline. The user can set many different annotation parameters or even develop a custom annotator.
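The pipeline idea can be sketched in a few lines. The example below is plain Python, not the CoreNLP API: each annotator is a hypothetical function that reads a shared annotation dictionary and adds its own layer, and the pipeline simply runs the annotators in the configured order. The toy tokenizer and sentence splitter are deliberately naive.

```python
import re


def tokenize(ann):
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    ann["tokens"] = re.findall(r"\w+|[^\w\s]", ann["text"])


def split_sentences(ann):
    # Naive splitter: close a sentence after ".", "!" or "?".
    # Depends on the "tokens" layer, so it must run after tokenize.
    sentences, current = [], []
    for token in ann["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    ann["sentences"] = sentences


def run_pipeline(text, annotators):
    # Each annotator enriches the same annotation dictionary in turn.
    ann = {"text": text}
    for annotator in annotators:
        annotator(ann)
    return ann


result = run_pipeline("UIMA scales well. GATE has JAPE.", [tokenize, split_sentences])
print(result["sentences"])
```

A custom annotator in this sketch is just another function added to the list, provided the layers it depends on are produced earlier in the pipeline.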


OpenNLP

OpenNLP is a library intended to be an organizational center for open source projects related to natural language processing. It includes a variety of Java-based NLP tools for sentence detection, tokenization, PoS tagging, chunking and parsing, named entity recognition and coreference resolution. OpenNLP is currently (July 2011) in the incubation phase under Apache.

As already mentioned, its main goal is to bring together as many tools as possible, such as the Stanford parser, corpora management and UIMA integration.


The Natural Language Toolkit (NLTK) is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing. It has the same aim as OpenNLP but offers less functionality.


DepPattern Toolkit is a linguistic package providing a grammar compiler, PoS taggers, and dependency-based parsers for several languages. It is implemented in Perl and comes with parsers for five languages: English, Spanish, Galician, French, and Portuguese. Three demos are available: a basic demo in which a few sentences can be analyzed, a grammar-based demo in which users supply their own grammar, and an advanced demo for analyzing text files.


The FreeLing package consists of a library providing language analysis services. It is written in C++.

FreeLing is designed to be used as an external library by any application requiring this kind of service. Nevertheless, a simple main program is also provided as a basic interface to the library, which enables the user to analyze text files from the command line. Main functions are: text tokenization, sentence splitting, morphological analysis, suffix treatment, tokenization of clitic pronouns, flexible multiword recognition, contraction splitting, probabilistic prediction of unknown word categories, named entity detection, recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.), PoS tagging, chart-based shallow parsing, named entity classification, WordNet-based sense annotation and disambiguation, rule-based dependency parsing, and nominal coreference resolution.

Currently supported languages are Spanish, Catalan, Galician, Italian, English, Russian, Portuguese, Welsh and Asturian.


Whatswrong is an NLP visualization library written in Java. Its main features are: visualization of syntactic dependency graphs, semantic dependency graphs (a la CoNLL 2008), chunks (such as syntactic chunks, NER chunks and SRL chunks), bilingual alignments, and BioNLP events, proteins and locations, plus a generic format for loading and visualizing your own data; comparison of gold standard trees with your generated trees (e.g. highlighting false positive and false negative dependency edges); and searching corpora for sentences with certain attributes using powerful search expressions.
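The gold-versus-predicted comparison comes down to set differences over dependency edges. The sketch below is plain Python and does not use the Whatswrong API; the function name `compare_edges` and the toy edges are illustrative assumptions.

```python
def compare_edges(gold, predicted):
    """Return (false_positives, false_negatives) between two edge sets."""
    gold, predicted = set(gold), set(predicted)
    # False positives: predicted but not in gold.
    # False negatives: in gold but not predicted.
    return predicted - gold, gold - predicted


# Edges are (head, dependent, label) triples over token indices; 0 is the root.
gold = {(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "dobj")}
predicted = {(2, 1, "nsubj"), (0, 2, "root"), (1, 3, "dobj")}

false_pos, false_neg = compare_edges(gold, predicted)
print("false positives:", sorted(false_pos))
print("false negatives:", sorted(false_neg))
```

Here the parser attached token 3 to the wrong head, so the same mistake shows up once as a false positive edge and once as a false negative edge.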

It supports reading the following formats: CoNLL 2000, 2002, 2003, 2004, 2006 and 2008, Lisp S-Expressions, Malt-Tab format, markov thebeast format, BioNLP 2009 Shared Task format.

Graphics can be exported to EPS format. The library also provides an API for incorporating NLP visualization into your application.


Breeze is a continuation of the projects formerly known as ScalaNLP and Scalala. It is written entirely in Scala.

It consists of numerical processing, machine learning, and natural language processing tools. Its primary focus is on being generic, clean, and powerful without sacrificing (much) efficiency.

The most important parts are: breeze-math (linear algebra and numerics routines), breeze-process (libraries for processing text and managing data pipelines) and breeze-learn (machine learning, statistics and optimization).


Factorie is a continuation of an older, very popular factor graph tool. Factorie is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Key features are named entity recognition, entity resolution, relation extraction, parsing, schema matching, ontology alignment, and latent-variable generative models, including latent Dirichlet allocation.

Entity Extraction

Apache cTAKES

Apache cTAKES is a natural language processing system for extracting information from the clinical free text of electronic medical records. It is based on the UIMA framework and offers event discovery, UMLS classification, negation detection, uncertainty detection and time expression discovery. cTAKES is based on what was previously known as Open Health Natural Language Processing.

Stanford NER

Stanford NER is a machine-learned named entity recognizer. It was first used in the 2003 CoNLL and BioCreative shared tasks, and in 2004 in the BioNLP task at COLING. It uses MEMMs and CRFs with rich features, including parse trees, web data and entity labels from previous runs.

On the CoNLL 2003 English news test set, the CRF classifier achieves an F1 score of 87.94%, with 88.21% precision and 87.68% recall.
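The reported F1 score is consistent with the precision and recall figures, since F1 is their harmonic mean. A small check in Python (the function name `f1_score` is only illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(88.21, 87.68)
print(round(f1, 2))  # 87.94
```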

The current version, 1.2.1 (19 June 2011), includes trained 3-, 4- and 7-class models (covering Location, Person, Organization, Misc, Time, Money, Percent and Date). You can also retrain the classifier for your own domain. It is written in Java 1.5 and licensed under the GNU GPLv2.

TEES 2.0 (Turku Event Extraction System)

Turku Event Extraction System (TEES) is a free and open source natural language processing system developed for the extraction of events and relations from biomedical text. It is written mostly in Python, and should work in generic Unix/Linux environments.

Biomedical event extraction refers to the automatic detection of molecular interactions from research articles. Events can accurately represent complex statements, e.g. the sentence “Protein A causes protein B to bind protein C” produces the event CAUSE(A, BIND(B, C)). Such formal structures can be processed with computational methods, allowing large-scale analysis of the literature, as well as applications such as pathway construction.
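To make the nested structure concrete, the sketch below models an event as a type plus arguments that are either entity names or other events, and reproduces the CAUSE(A, BIND(B, C)) example. This is a hypothetical Python representation for illustration, not the data format used by TEES.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    """An event with a type and arguments (entity names or nested events)."""
    kind: str
    args: Tuple[Union[str, "Event"], ...]

    def __str__(self):
        return f"{self.kind}({', '.join(str(a) for a in self.args)})"


def entities(event):
    """Collect entity names from an event, descending into nested events."""
    found = []
    for arg in event.args:
        found.extend(entities(arg) if isinstance(arg, Event) else [arg])
    return found


# "Protein A causes protein B to bind protein C"
bind = Event("BIND", ("B", "C"))
cause = Event("CAUSE", ("A", bind))
print(cause)            # CAUSE(A, BIND(B, C))
print(entities(cause))  # ['A', 'B', 'C']
```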

Models for predicting the targets of the shared tasks listed below are provided with TEES and can be used on any unannotated text. TEES has been evaluated in the following shared tasks: BioNLP 2009 Shared Task (1st place), BioNLP 2011 Shared Task (1st place in 4/8 tasks, the only system to participate in all tasks) and the DDI 2011 (drug-drug interactions) Challenge (4th place, at 96% of the performance of the best system).

Relation Extraction

RelEx - Dependency Relationship Extractor

RelEx is an English semantic dependency relationship extractor. It is built on the Carnegie-Mellon Link Grammar parser and uses a series of graph rewriting rules to identify subject, object, indirect object and other syntactic relationships. It generates trees similar to those of the Stanford parser (RelEx is four times faster, and in "compatibility mode" it produces the same output). It returns word properties (e.g. number, tense, PoS) and binary relations between words or phrases. It also attempts basic entity extraction and lists antecedent candidates (useful for coreference resolution), for which it uses the Hobbs pronoun resolution algorithm.

RelEx is part of the open source OpenCog project, which is implemented in C++. Applying RelEx to one million MEDLINE abstracts yielded 150,000 extracted relations, with both precision and recall at 80%.

jSRE (java Simple Relation Extraction)

jSRE is a relation extraction tool. It was built in 2006 and funded by the X-Media project.

It uses a supervised method (SVM) with a combination of kernel functions. It can handle relations within a sentence and exploits the local contexts around the pre-identified entities. As pre-processing steps it requires tokenization, sentence splitting, PoS tagging and lemmatization.

It is available as Java source code under the Apache License v2.

Coreference resolution


Reconcile is a state-of-the-art automatic coreference resolution system. It was developed as a test bed for researchers to implement new ideas. It can process MUC, ACE and raw text files. It is configured through a config file that defines pre-processing steps, feature extractors, NER models, paths, outputs and so on. The final output is an XML file containing the text tagged with coreferences and all selected features.

It was tested on the MUC-6, MUC-7 and ACE05 datasets, achieving an F1 score of up to 70.77% on MUC-6. A model trained on a small dataset is supplied with the software package.

The source code is written in Java and available under the GNU GPL. It uses the Weka toolkit, the Berkeley Parser and Stanford NER.

Stanford Deterministic Coreference Resolution System

The Stanford NLP Group's coreference resolution system was the top-ranked system at the CoNLL-2011 shared task.

On the MUC-6 dataset it achieved an F1 score of up to 78.4%. On the conllst2011 test dataset it achieved a BLANC F1 of 74.0.

Ontologies, Taxonomies

Probase, 2.6 million concepts
YAGO, 150k concepts
CYC, 120k concepts
WordNet, 25k concepts


CoNLL '03

The CoNLL '03 dataset concentrates on four types of named entities: persons, organizations, locations and miscellaneous names (those that do not belong to the previous three groups). There are training/development/test splits for German and English. The best F-score achieved in the shared task was 88.76 ± 0.7. To build the dataset you need the Reuters Corpus, which can be obtained free of charge.

CoNLL '02

CoNLL '02 defines the same task on different data (Spanish, Dutch), with the same named entity types as CoNLL '03. The organizers were especially interested in methods that can use additional unannotated data to improve their performance (for example co-training). The data contains POS tags and NER annotations.

Cora IE

Cora Information Extraction dataset contains research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Reuters Corpora

RCV1 and RCV2 (Reuters Corpus Volume 1 and 2) are intended for use in research and development of natural language processing, information retrieval, and machine learning systems. RCV1 is significantly larger than the older, well-known Reuters-21578 collection, which is heavily used in the text classification community.

RCV1 contains about 2.5 GB of uncompressed English news stories from 20 August 1996 to 19 August 1997. RCV2 is a multilingual corpus from the same time frame.


MUC-6 tasks included named entity recognition, coreference resolution, template elements and scenario templates (traditional information extraction).

CMU seminars

The dataset contains 48 emailed seminar announcements, with labeled segments for speaker, title, time, sentence, header and body. Labeled by Dayne Freitag.

OntoNotes Release 4.0

OntoNotes dataset is available through LDC only. It was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

ACE 2004

The objective of the ACE (Automatic Content Extraction) program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

RDC (Relation Detection and Characterization), the task I am interested in, "involves the identification of relations between entities. This task was added in Phase 2 of ACE. The current definition of RDC targets physical relations including Located, Near and Part-Whole; social/personal relations including Business, Family and Other; a range of employment or membership relations; relations between artifacts and agents (including ownership); affiliation-type relations like ethnicity; relationships between persons and GPEs like citizenship; and finally discourse relations. For every relation, annotators identify two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes. Relations that are supported by explicit textual evidence are distinguished from those that depend on contextual inference on the part of the reader."

ACE 2005

ACE 2005 (English SpatialML Annotations Version 2) was developed by researchers at The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events. The corpus contains 210065 total words and 17821 unique words.


For the MUC-7 (Message Understanding Conference 7) datasets, the dry run (and training) data consists of air crash scenarios and the formal run consists of missile launch scenarios. The final version mainly updates the Template Relations portion of the guidelines.

Brown Corpus

The Corpus contains POS-tagged texts.


This directory contains the NEWSWIRE development test data for the NIST 1999 IE-ER Evaluation. The files were taken from the subdirectory: /ie_er_99/english/devtest/newswire/*.ref.nwt and filenames were shortened. The dataset contains tagged PERSONS, DATES, ORGANIZATIONS, NUMBERS, LOCATIONS.

European Parliament

This is a sample of the European Parliament Proceedings Parallel Corpus 1996-2006. This sample contains 10 documents for 11 languages: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish. The text is untagged.


New York Times data set contains 150 business articles from New York Times. The articles were crawled from the NYT website between November 2009 and January 2010. After sentence splitting and tokenization, the Stanford NER tagger was used to identify PER and ORG named entities from each sentence. For named entities that contain multiple tokens we concatenated them into a single token. We then took each pair of (PER, ORG) entities that occur in the same sentence as a single candidate relation instance, where the PER entity is treated as ARG-1 and the ORG entity is treated as ARG-2.

The Wikipedia data set was previously created by Aron Culotta et al. Since the original data set did not contain the annotation information we need, we re-annotated it. Similarly, we performed sentence splitting, tokenization and NER tagging, and took pairs of (PER, PER) entities occurring in the same sentence as a candidate relation instance. We always treat the first PER entity as ARG-1 and the second PER entity as ARG-2.
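The candidate generation described for both data sets can be sketched as follows. This is an illustrative Python reconstruction, not the original authors' code: the function names, the example sentence and the underscore used to join multi-token names are all assumptions.

```python
from itertools import combinations, product


def per_org_candidates(entities):
    """All (PER, ORG) pairs in one sentence; PER is ARG-1, ORG is ARG-2."""
    persons = [text for text, label in entities if label == "PER"]
    orgs = [text for text, label in entities if label == "ORG"]
    return list(product(persons, orgs))


def per_per_candidates(entities):
    """All (PER, PER) pairs in one sentence; the earlier PER is ARG-1."""
    persons = [text for text, label in entities if label == "PER"]
    return list(combinations(persons, 2))


# Entities of one made-up sentence, in textual order; multi-token names
# are already joined into a single token (the separator is an assumption).
sentence = [("Tim_Cook", "PER"), ("Apple", "ORG"), ("Steve_Jobs", "PER")]
print(per_org_candidates(sentence))
print(per_per_candidates(sentence))
```

Every pair is only a candidate; whether a relation actually holds is decided afterwards by annotation.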

A human annotator manually went through each candidate relation instance to decide (1) whether there is a relation between the two arguments and (2) whether there is an explicit sequence of words describing the relation held by ARG-1 and ARG-2.

The dataset can be downloaded here. There are 536 instances (208 P, 328 N) with 140 distinct descriptors in the NYT dataset and 700 instances (122 P, 578 N) with 70 distinct descriptors in the Wikipedia dataset.

Web pages

Andrew McCallum
NLTK Corpora