Software and datasets

From Slavko Zitnik's research wiki

Revision as of 20:49, 5 August 2022

Some useful research results we have been working on. GitHub repositories that are not packaged as standalone software components are not listed here.

Software

nutIE

nutIE (codename) will be an end-to-end information extraction toolkit. It will consist of a self-contained, runnable web application (GUI) and a Scala library for programmatic access.

The tool currently supports data import and visualization, as well as model training and evaluation, for the coreference resolution task.

The project currently consists of two separate subprojects:

  • Web-based management part: nutIE Web
  • Backend with a REST API and a programmatic Scala library for use in third-party projects: nutIE Core


Lemmagen4J

I have ported Lemmagen v3.0 (http://lemmatise.ijs.si/) from C# to Java. The Eclipse project is available here: Lemmagen4J.zip.

See the Train and Test classes and the rest of the code for documentation. To build a Slovene model, you can use the Slovene part of the MULTEXT-EAST dataset.

You can read more about Lemmagen in the author's paper: Lemmagen Paper, 2010.

Merging and matching framework

A framework for matching and merging data using semantics. It implements attribute resolution, collective entity resolution and redundancy elimination techniques with various metrics and approaches. Download the project along with the datasets here: Data Merging framework, October 2011.

Read more: Žitnik S., Šubelj L., Lavbič D., Vasilecas O., Bajec M. (2013). General Context-Aware Data Matching and Merging Framework in Informatica, vol. 24, num. 1, pp. 119-152. Article

Datasets

Slavko Public Facebook

The network contains public data crawled from March to May 2012. It contains 51,394,379 nodes and 136,048,906 edges (4,431,920 double edges). There are 373,476 checked nodes and 50,687,612 unchecked nodes (situation in tables at the end). The non-anonymized data contains the real Facebook ID, name and username of every user in the network. For more, see the license file next to the network: Anonymized Facebook network (http://zitnik.si/temp/FacebookNetwork_SlavkoZitnik_public.zip). Some basic analysis is available as homework 3 for the Big Networks class (http://pajek.imfm.si/doku.php?id=event:events), taught by prof. dr. Vladimir Batagelj (http://vlado.fmf.uni-lj.si/).

Slovene news

Slovene news v1 is tagged according to the standard BIO scheme. The corpus contains annotated entities (B-PER (131), I-PER (74), B-ORG (162), I-ORG (158), O (5508)), relation descriptors (B-REL (32), I-REL (24), O (5977)) and coreference descriptors (B-COREF (274), I-COREF (249), O (5510)). The dataset was lemmatized and POS-tagged using a Slovene POS tagger (http://xn--oznaevalnik-qnb.xn--slovenina-qfb73g.eu/Vsebine/Sl/ProgramskaOprema/Oblikoslovni.aspx). The dataset contains 285 sentences with 6034 tokens.
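
For readers unfamiliar with the BIO scheme, the following is a minimal Python sketch of how a sequence of such tags can be grouped into labelled spans. It works on an already extracted list of tags; the column layout of the .tsv file is not described here, so reading the file is left out. The function name bio_spans is hypothetical and is not part of the dataset or of any tool listed on this page.

```python
from typing import List, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans; end is exclusive.

    Hypothetical helper for illustration only.
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated as a span start too.
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans
```

Treating a stray I- tag as the start of a span is one possible design choice; a stricter reader could reject such sequences instead.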

Slovene news v2 is an upgraded version of v1 that contains CoNLL-2012-like coreference tags, with documents separated by ###.
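
Splitting the v2 file back into documents could look roughly like the Python sketch below. It assumes that a separator is a line starting with ###, which is a guess, since the exact file layout is not documented on this page. The function name split_documents is hypothetical.

```python
from typing import Iterable, List


def split_documents(lines: Iterable[str], separator: str = "###") -> List[List[str]]:
    """Split lines into documents at separator lines (hypothetical helper).

    Assumption: a line that starts with `separator` marks a document boundary.
    """
    docs: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(separator):
            # Keep the finished document only if it has some content.
            if any(item.strip() for item in current):
                docs.append(current)
            current = []
        else:
            # Blank lines are kept, since they may mark sentence boundaries.
            current.append(line)
    if any(item.strip() for item in current):
        docs.append(current)
    return docs
```

With a local copy of the file, the function can be called on an open file handle, for example `split_documents(open("Rtvslo_dec2011_v2.tsv", encoding="utf-8"))`; the UTF-8 encoding is also an assumption.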

The material can be used for research purposes only.