Submissions/Wikidata Toolkit: A Java library for working with Wikidata

This is an accepted submission for Wikimania 2014.

Submission no. 5053
Title of the submission

Wikidata Toolkit: A Java library for working with Wikidata

Type of submission (discussion, hot seat, panel, presentation, tutorial, workshop)

Tutorial

Author of the submission

Markus Krötzsch (contact author), Michael Günther (co-presenter), Julian Mendez (co-presenter)

E-mail address

markus.kroetzsch@tu-dresden.de

Username

Markus Krötzsch (talk)

Country of origin

Germany

Affiliation, if any (organisation, company etc.)

TU Dresden

Personal homepage or blog

http://korrekt.org/

Abstract (at least 300 words to describe your proposal)

Wikidata Toolkit is a Java library that greatly simplifies using data from Wikidata or other Wikibase installations in your programs. It provides data structures to mirror all Wikibase data in Java, and convenient facilities to load, manipulate, analyse, and query such data. The primary goal of the project is to enable new and innovative applications around Wikidata, and thus to serve a wider community of developers, researchers, and practitioners who are eager to take advantage of this new data resource.

However, the project is very recent and thus not yet widely known. Development is supported by an Individual Engagement Grant of the WMF that runs from February to August 2014, so the initial funding phase will have only just finished at the time of Wikimania. This makes it an ideal time to present the features, provide help to current users, and discuss next steps with the community.

This tutorial provides a practical introduction to Wikidata Toolkit for the working Java developer. The goal is to explain the overall architecture and programming facilities that the library provides, and to enable participants to develop their own data-driven applications.

The planned structure of the tutorial is as follows:

  • Feature overview: what Wikidata Toolkit can do for you
  • Main components: which parts do you actually need
  • The Wikidata data model for the working developer
  • My first application: a data-driven equivalent of "Hello World"
  • Towards serious applications: further examples explained
  • Performance considerations: how big a machine you might need
  • Wikidata Toolkit workshop: bring your own questions

Because the overall time is quite short, extensive hands-on sessions are not included, but a few developers will be around to help with practical problems, including during the breaks after the tutorial.

Although Java is the programming language used by Wikidata Toolkit, the tutorial should also be of interest to developers working in other languages. On the one hand, the toolkit can still be a valuable resource for pre-processing data to be used in other software. On the other hand, it provides reference implementations of several key mechanisms and data structures that are useful for working with Wikidata.


Track
  • Technology, Interface & Infrastructure
Length of session (if other than 30 minutes, specify how long)
30 minutes
Will you attend Wikimania if your submission is not accepted?

Yes

Slides or further information (optional)
    • The class WikimaniaExample with the main method that sets up dump processing:
package org.wikidata.wdtk.examples;

import org.wikidata.wdtk.dumpfiles.DumpProcessingController;
import org.wikidata.wdtk.dumpfiles.MwRevision;
import org.wikidata.wdtk.dumpfiles.StatisticsMwRevisionProcessor;

public class WikimaniaExample {

	public static void main(String[] args) {
		ExampleHelpers.configureLogging();

		// Controller object for processing dumps:
		DumpProcessingController dumpProcessingController = new DumpProcessingController(
				"wikidatawiki");
		dumpProcessingController.setOfflineMode(true);

		// Example processor for item documents:
		WikimaniaDocumentProcessor documentProcessor = new WikimaniaDocumentProcessor();
		dumpProcessingController.registerEntityDocumentProcessor(
				documentProcessor, MwRevision.MODEL_WIKIBASE_ITEM, true);

		// Another processor for statistics & time keeping:
		dumpProcessingController.registerMwRevisionProcessor(
				new StatisticsMwRevisionProcessor("statistics", 10000), null,
				true);

		dumpProcessingController.processMostRecentMainDump();

		documentProcessor.storeResults();
	}

}
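The main class above follows a callback pattern: processors are registered with a controller, which then streams over the data once and hands each document to every registered processor. The following stand-alone sketch illustrates that pattern only. It does not use Wikidata Toolkit, and all names in it (DispatchSketch, MiniController, RecordProcessor, CountingProcessor) are hypothetical and chosen for illustration; it is not how the library itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch with hypothetical names; it does not use Wikidata Toolkit.
final class DispatchSketch {

	/** Callback interface: one method call per record in the stream. */
	interface RecordProcessor {
		void processRecord(String record);
	}

	/** Minimal controller that forwards every record to all registered processors. */
	static final class MiniController {
		private final List<RecordProcessor> processors = new ArrayList<>();

		void register(RecordProcessor processor) {
			processors.add(processor);
		}

		/** Iterates over the records once, in order. */
		void processAll(Iterable<String> records) {
			for (String record : records) {
				for (RecordProcessor processor : processors) {
					processor.processRecord(record);
				}
			}
		}
	}

	/** Example processor that only counts what it sees. */
	static final class CountingProcessor implements RecordProcessor {
		int count = 0;

		@Override
		public void processRecord(String record) {
			count++;
		}
	}

	public static void main(String[] args) {
		MiniController controller = new MiniController();
		CountingProcessor counter = new CountingProcessor();
		List<String> seen = new ArrayList<>();

		controller.register(counter);
		controller.register(seen::add); // a second, independent processor

		// Stand-in for a dump: three made-up record identifiers
		controller.processAll(Arrays.asList("Q1", "Q2", "Q42"));

		System.out.println(counter.count + " records, last: "
				+ seen.get(seen.size() - 1));
	}
}
```

Because each processor keeps its own state, several analyses can share a single pass over the data, which matters when the input is a large dump file.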
    • The class WikimaniaDocumentProcessor, which computes the average life expectancy per birth year and writes it to a CSV file:
package org.wikidata.wdtk.examples;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.datamodel.interfaces.Statement;
import org.wikidata.wdtk.datamodel.interfaces.StatementGroup;
import org.wikidata.wdtk.datamodel.interfaces.TimeValue;
import org.wikidata.wdtk.datamodel.interfaces.Value;
import org.wikidata.wdtk.datamodel.interfaces.ValueSnak;

public class WikimaniaDocumentProcessor implements EntityDocumentProcessor {
	// Number of item documents seen so far
	long countItems = 0;

	// Indexed by birth year: sum of life spans in years, and number of people
	final long[] lifeSpans = new long[2100];
	final long[] peopleCount = new long[2100];

	@Override
	public void processItemDocument(ItemDocument itemDocument) {
		this.countItems++;

		int birthYear = Integer.MIN_VALUE;
		int deathYear = Integer.MIN_VALUE;

		for (StatementGroup sg : itemDocument.getStatementGroups()) {
			// P569 is "birth date"
			if ("P569".equals(sg.getProperty().getId())) {
				for (Statement s : sg.getStatements()) {
					if (s.getClaim().getMainSnak() instanceof ValueSnak) {
						Value v = ((ValueSnak) s.getClaim().getMainSnak())
								.getValue();
						if (v instanceof TimeValue) {
							birthYear = (int) ((TimeValue) v).getYear();
							break;
						}
					}
				}
			}
			// P570 is "death date"
			if ("P570".equals(sg.getProperty().getId())) {
				for (Statement s : sg.getStatements()) {
					if (s.getClaim().getMainSnak() instanceof ValueSnak) {
						Value v = ((ValueSnak) s.getClaim().getMainSnak())
								.getValue();
						if (v instanceof TimeValue) {
							deathYear = (int) ((TimeValue) v).getYear();
							break;
						}
					}
				}
			}
		}

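		// Only count plausible data; the upper bound on birthYear keeps the
		// array index within range
		if (birthYear != Integer.MIN_VALUE && deathYear != Integer.MIN_VALUE
				&& birthYear >= 1200 && birthYear < lifeSpans.length) {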
			if (deathYear > birthYear && deathYear - birthYear < 130) {
				lifeSpans[birthYear] += (deathYear - birthYear);
				peopleCount[birthYear]++;
			}
		}
	}

	@Override
	public void processPropertyDocument(PropertyDocument propertyDocument) {
		// Property documents are not needed for this analysis
	}

	@Override
	public void finishProcessingEntityDocuments() {
		// Nothing to do here; results are written by storeResults()
	}

	public void storeResults() {
		try (PrintStream out = new PrintStream(new FileOutputStream(
				"results.csv"))) {
			for (int i = 0; i < lifeSpans.length; i++) {
				if (peopleCount[i] != 0) {
					out.println(i + "," + (double) lifeSpans[i]
							/ peopleCount[i] + "," + peopleCount[i]);
				}
			}
		} catch (IOException e) {
			System.err.println("Could not write results.csv: "
					+ e.getMessage());
		}

	}

}
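The aggregation step in the processor above can also be tried without a dump file. The following is a minimal stand-alone sketch, assuming the same plausibility rules (birth year from 1200 onwards, death after birth, life span below 130 years) and the same CSV row layout of year, average life span, and number of people. The class LifeSpanSketch and its methods addPerson and csvRow are hypothetical names introduced here, not part of Wikidata Toolkit.

```java
// Stand-alone sketch with hypothetical names; it does not use Wikidata Toolkit.
final class LifeSpanSketch {
	static final int MIN_YEAR = 1200; // lower bound used in the example above
	static final int MAX_AGE = 130; // life spans of 130 years or more are rejected

	// Indexed by birth year, as in the example above
	final long[] lifeSpans = new long[2100];
	final long[] peopleCount = new long[2100];

	/** Records one person; returns false if the pair of years is rejected. */
	boolean addPerson(int birthYear, int deathYear) {
		if (birthYear < MIN_YEAR || birthYear >= lifeSpans.length) {
			return false; // outside the range covered by the arrays
		}
		if (deathYear <= birthYear || deathYear - birthYear >= MAX_AGE) {
			return false; // implausible life span
		}
		lifeSpans[birthYear] += deathYear - birthYear;
		peopleCount[birthYear]++;
		return true;
	}

	/** Formats one CSV row; only meaningful if peopleCount[year] is not 0. */
	String csvRow(int year) {
		double average = (double) lifeSpans[year] / peopleCount[year];
		return year + "," + average + "," + peopleCount[year];
	}

	public static void main(String[] args) {
		LifeSpanSketch sketch = new LifeSpanSketch();
		sketch.addPerson(1900, 1980); // 80 years
		sketch.addPerson(1900, 1970); // 70 years
		sketch.addPerson(1900, 2100); // rejected: 200 years
		sketch.addPerson(1100, 1150); // rejected: born before 1200
		System.out.println(sketch.csvRow(1900)); // prints 1900,75.0,2
	}
}
```

Keeping the sums and counts in separate arrays means the average is only computed once, when a row is written, so the per-document work stays at a couple of array updates.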
Special requests
  • Must leave on Sunday, so presentation should be on Friday or Saturday if at all possible
  • This talk should be given later than Lydia's Wikidata keynote and not in parallel to any Wikidata talk in the Open Data track.


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with a hash and four tildes. (# ~~~~).

  1. --Sannita (talk) 22:14, 31 March 2014 (UTC)
  2. Bene* (talk) 14:05, 1 April 2014 (UTC)
  3. Tpt (talk) 14:27, 4 April 2014 (UTC)
  4. Promelior (talk) 14:07, 31 July 2014 (UTC)
  5. I will be your session host Edwardx (talk) 18:02, 31 July 2014 (UTC)
  6. Maximilianklein (talk) 15:45, 7 August 2014 (UTC)