Wikipedia has a sister project called Wikidata. You can reach it from the left sidebar of any Wikipedia page.
Wikidata takes Wikipedia’s data and organizes it into a machine-processable database for easy access. For example, which is the most populous region in Shanxi Province?
Answering this kind of question with Wikipedia alone is very time-consuming, because you have to extract the information from each entry by hand. With Wikidata, a single query returns the answer (see the Shanxi example later in this article). Because the data is structured, a machine can query it.
However, Wikidata is not a relational database but an RDF database, and its query language is not SQL but SPARQL. I have learned the basics of RDF and SPARQL, and this article is a set of study notes that demonstrates how to query Wikidata.
Meaning of RDF
As we all know, relational databases are the most widely used databases today, abstracting data into a tabular relationship of rows and columns.
However, the real world is less like a table and more like a network. Various things are connected together in a web through intricate relationships.
In mathematics, a network is called a graph. Each thing is a node of the graph, and each relationship between two things is an edge that connects their nodes. A database that stores data as a graph is called a graph database.
RDF (Resource Description Framework) is a standard way of describing graph data. It describes the relationships between things as “triples”.
A “triple” is the core concept of RDF, which refers to two things and the relationship between them, syntactically rendered as “subject + predicate + object”.
For example, “the sky’s color is blue” is an RDF triple: “sky” (subject) and “blue” (object) are two things connected by a color relation (predicate).
RDF requires that predicates (i.e., relationships between things) be clearly defined. Think of it this way: once the predicate is given, you can use the subject to look up the object, or the object to look up the subject. For example, given the color relationship, you can ask the database which things are blue, or what color the sky is.
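The two-way lookup can be sketched in a few lines of Python. This is only a toy model, not how an RDF database is implemented; the facts and the function names `subjects` and `objects` are made up for illustration.

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
# The facts below are illustrative, not taken from any real dataset.
triples = [
    ("sky", "color", "blue"),
    ("sea", "color", "blue"),
    ("grass", "color", "green"),
]

def subjects(predicate, obj):
    """Given a predicate and an object, return the matching subjects."""
    return [s for s, p, o in triples if p == predicate and o == obj]

def objects(subject, predicate):
    """Given a subject and a predicate, return the matching objects."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(subjects("color", "blue"))  # which things are blue?
print(objects("sky", "color"))    # what color is the sky?
```

Because the predicate is fixed and well defined, the same list of facts answers both questions.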
RDF requires that each set of predicates have an explicit URL that distinguishes it from other sets. RDF officially defines a set of common predicates at the URL http://www.w3.org/1999/02/22-rdf-syntax-ns#.
To use a set of predicates, refer to its URL so that others know which set is meant.
RDF allows you to declare a prefix that stands for a URL. The URL of the official predicates above is usually represented by the prefix rdf.
Each URL can contain multiple predicates, written in the form “prefix:predicate”. For example, the official “type” predicate, which specifies the type of the subject, is written rdf:type.
The sentence “Xiao Ming is a student”, written as an RDF triple, takes the form: Xiao Ming rdf:type student.
Because rdf:type is so common, RDF syntax allows it to be abbreviated to a, so that “Xiao Ming is a student” can also be expressed as: Xiao Ming a student.
Note that each RDF triple ends with a period (.) to separate it from the next triple.
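To make the prefix mechanism concrete, here is a minimal Python sketch of how a prefixed name could be expanded into a full URL. The rdf namespace URL is the official one; the empty-prefix namespace `http://example.org/` and the helper name `expand` are assumptions made only for this illustration.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Prefix table: "rdf" is the official namespace; the empty prefix points at a
# made-up namespace (http://example.org/) used only in this sketch.
PREFIXES = {
    "rdf": RDF_NS,
    "": "http://example.org/",
}

def expand(term):
    """Expand 'prefix:name' (or the shorthand 'a') into a full URL."""
    if term == "a":
        term = "rdf:type"  # 'a' is shorthand for rdf:type
    prefix, _, name = term.partition(":")
    return PREFIXES[prefix] + name

print(expand("rdf:type"))
print(expand("a"))
print(expand(":student"))
```

Both `rdf:type` and `a` expand to the same URL, which is why the two spellings of the triple mean the same thing.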
Example of RDF syntax
Here is an example to demonstrate how RDF defines the relationship between things.
The passage above is natural-language text. We start by drawing its relationships as a network diagram.
In the above example, there are two URLs, indicating that two sets of predicates are used. One set is the official predicates, represented by the prefix rdf; the other set is self-defined, with an empty prefix, which makes it the default prefix.
The sentence “The Beatles are a band with John Lennon, Paul McCartney, Ringo Starr, and George Harrison.” corresponds to the following triples.
In the above example, :member and the other relations are all predicates. Since these triples have the same subject, RDF allows them to be merged.
In the above code, the triples with the same subject are merged: the subject is written once, the triples are separated by semicolons, and the last one is terminated by a period.
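As a rough illustration of the merged form, the following Python sketch groups triples by subject and prints them with semicolons and a final period. The identifiers (`:The_Beatles`, `:Band`, `:John_Lennon`, and so on) and the function `merge_by_subject` are hypothetical names chosen for this sketch; they are not necessarily the ones used in the original example.

```python
from collections import defaultdict

# Triples for the sentence about The Beatles (identifiers are hypothetical).
triples = [
    (":The_Beatles", "rdf:type", ":Band"),
    (":The_Beatles", ":member", ":John_Lennon"),
    (":The_Beatles", ":member", ":Paul_McCartney"),
    (":The_Beatles", ":member", ":Ringo_Starr"),
    (":The_Beatles", ":member", ":George_Harrison"),
]

def merge_by_subject(triples):
    """Render triples so that each subject is written only once."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append(f"{p} {o}")
    blocks = []
    for s, pairs in grouped.items():
        indent = " " * (len(s) + 1)  # align continuation lines under the predicate
        blocks.append(s + " " + (" ;\n" + indent).join(pairs) + " .")
    return "\n".join(blocks)

print(merge_by_subject(triples))
```

The subject appears once, every predicate-object pair but the last ends with a semicolon, and the last ends with a period.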
The rest of the RDF triples are as follows.
SPARQL Query Language
SPARQL is a query language for RDF databases, with a syntax much like SQL. The core idea is to extract the matching subjects or objects from triples according to a given predicate.
The syntax of a SPARQL query is as follows.
In the above code, <variables> is the subject or object to be extracted, and <graph pattern> is the triple pattern to be queried.
For example, query all albums in the database.
In the above code, ?album is a variable. You can name it anything you like, but the first character must be a question mark (?). The query condition is that the variable ?album is the subject, the predicate is rdf:type, and the object is :Album. The object carries the empty prefix, indicating that it is defined by the current database.
To return every variable that appears in the pattern, replace the variable list with an asterisk (*). The keyword WHERE can be omitted in a SELECT query, as can the period after the last triple, so the above query can also be written as follows.
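What a single triple pattern does can be imitated in plain Python. The sketch below is a toy matcher, not a SPARQL engine; the sample data and the function name `select` are assumptions made for illustration, and it does not handle a variable that appears twice in one pattern.

```python
# Toy data; the album and song identifiers are illustrative only.
triples = [
    (":Please_Please_Me", "rdf:type", ":Album"),
    (":Imagine", "rdf:type", ":Album"),
    (":Love_Me_Do", "rdf:type", ":Song"),
]

def select(pattern):
    """Match one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value   # a variable matches anything
            elif term != value:
                break                   # a constant must match exactly
        else:
            results.append(binding)     # no mismatch: keep this row
    return results

# Similar in spirit to: SELECT ?album WHERE { ?album rdf:type :Album }
print(select(("?album", "rdf:type", ":Album")))
```

Each result row is a mapping from variable name to the value it was bound to, which is roughly what a SELECT query returns.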
In addition to the album, if you also want to return the artist of each album, add a second variable.
In the above code, the variable ?artist must be the object of a triple whose subject is ?album and whose predicate is the album’s artist relation.
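Adding a second pattern that reuses ?album amounts to a join on that variable. The following Python sketch shows the idea under stated assumptions: the predicate name `:artist`, the sample data, and the helpers `match` and `select` are all hypothetical, and a real SPARQL engine evaluates patterns far more efficiently.

```python
# Toy data; ":artist" is an assumed predicate name for this sketch.
triples = [
    (":Please_Please_Me", "rdf:type", ":Album"),
    (":Please_Please_Me", ":artist", ":The_Beatles"),
    (":Imagine", "rdf:type", ":Album"),
    (":Imagine", ":artist", ":John_Lennon"),
    (":Love_Me_Do", "rdf:type", ":Song"),
]

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # An already-bound variable must keep the same value.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def select(patterns):
    """Join several triple patterns on their shared variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

rows = select([
    ("?album", "rdf:type", ":Album"),
    ("?album", ":artist", "?artist"),
])
for row in rows:
    print(row["?album"], row["?artist"])
```

Because ?album appears in both patterns, only artists attached to something that is also an album are returned.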
Example of Wikidata query: Most populous area in Shanxi Province
The following example explores SPARQL syntax further by using Wikidata to find out which region of Shanxi Province is the most populous.
First, go to the Wikidata website (https://www.wikidata.org/), and search for “Shanxi” in the search field at the top of the page. Alternatively, Wikipedia’s “Shanxi Province” page has a link to Wikidata in the left column.
Then, go to the page for Shanxi Province.
Once there, look at the URL of the page, which has the form https://www.wikidata.org/wiki/Q46913.
The Q46913 at the end of the URL is the identifier of the Shanxi Province entry in Wikidata. It will serve as the subject of the query later.
Next, scroll down the page and find the section “contains administrative territorial entity”, which lists the regions under Shanxi Province.
Click on the title “contains administrative territorial entity” to go to its page, and look at the URL there as well.
The last part of that URL, P150, is the identifier of the predicate “contains administrative territorial entity”.
Now you can start the query. Go to Wikidata’s online query page, https://query.wikidata.org/.
In the query box, enter the following SPARQL statement.
The above code asks for the variable ?area, which must be the object of a triple whose subject is Shanxi Province (wd:Q46913) and whose predicate is “contains administrative territorial entity” (wdt:P150). The prefix wd indicates a Wikidata entry, while the prefix wdt indicates a predicate defined by Wikidata.
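Based on the description above, the query is presumably a single triple pattern like the one in the Python string below; treat it as a reconstruction rather than the original text. The sketch also shows how the same query could be turned into a request URL for the query service's SPARQL endpoint, using only the standard library. The helper name `build_url` is an assumption.

```python
from urllib.parse import urlencode

# SPARQL endpoint behind query.wikidata.org.
ENDPOINT = "https://query.wikidata.org/sparql"

# Reconstructed from the prose: subject wd:Q46913, predicate wdt:P150.
QUERY = """
SELECT ?area
WHERE {
  wd:Q46913 wdt:P150 ?area .
}
"""

def build_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

print(QUERY)
print(build_url(QUERY))
```

Pasting the text of `QUERY` into the query box is equivalent; the URL form is only useful when calling the service from a program.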
Click the triangular Run button in the left sidebar to get the results of the query at the bottom of the page.
As the results show, the returned entries are only identifiers. Modify the query to add a column of text labels.
The above code adds a returned variable ?areaLabel, which is the text label of the previous variable ?area (linked by the predicate rdfs:label), and adds a FILTER statement so that only Chinese labels are returned.
Run this query and you will see the Chinese names for each area.
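A query matching this description might look like the string produced below. The exact filter expression is an assumption: `LANG(?areaLabel) = "zh"` is one common way to keep labels in a single language, and the original may have used a different form. The function name `area_query` is hypothetical.

```python
def area_query(lang="zh"):
    """Build a query that returns each area with its label in one language."""
    return f"""
SELECT ?area ?areaLabel
WHERE {{
  wd:Q46913 wdt:P150 ?area .
  ?area rdfs:label ?areaLabel .
  FILTER(LANG(?areaLabel) = "{lang}")
}}
"""

print(area_query())      # Chinese labels
print(area_query("en"))  # English labels instead
```

Changing the language code is enough to get the same list with labels in another language, provided the entries have labels in it.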
Next, add a population variable ?popTotal that returns the total population of each region.
Run this code and you will see the population total.
Then, add an order by clause to sort the results by population in descending order.
The results of the run are as follows.
Finally, add a limit 1 clause to return only the first row.
This gives us the most populous area in Shanxi Province.
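The three steps (population, sorting, limit) can be sketched as one small Python builder that assembles the query text. The population property `wdt:P1082` does not appear in the text above; it is Wikidata's "population" property and is used here as an assumption about what the original query contained. The function name `population_query` is hypothetical.

```python
def population_query(order=False, limit=None):
    """Assemble the Shanxi population query step by step."""
    lines = [
        "SELECT ?area ?areaLabel ?popTotal",
        "WHERE {",
        "  wd:Q46913 wdt:P150 ?area .",
        "  ?area wdt:P1082 ?popTotal .",   # assumed: P1082 = population
        "  ?area rdfs:label ?areaLabel .",
        '  FILTER(LANG(?areaLabel) = "zh")',
        "}",
    ]
    if order:
        lines.append("ORDER BY DESC(?popTotal)")  # largest population first
    if limit is not None:
        lines.append(f"LIMIT {limit}")            # keep only the top rows
    return "\n".join(lines)

print(population_query())                     # step 1: add the population
print(population_query(order=True))           # step 2: sort descending
print(population_query(order=True, limit=1))  # step 3: keep the first row
```

Each call prints a query that can be pasted into the query box as is.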
Example of Wikidata query: Programmer directory
Here is another example: finding all the programmers included in Wikidata.
Run this query and you will see the list of programmers.
Note that only programmers with a Chinese name are returned here. A programmer whose entry has no Chinese name in the database will not appear in the results.
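For readers who prefer to run such a query from a program, here is a hedged Python sketch. The identifiers `wdt:P106` (occupation) and `wd:Q5482740` (programmer) are assumptions about what the original query used, since they are not given in the text; check them on the Wikidata site before relying on them. The helper `run_query` and the User-Agent string are made up for this sketch, and the function is defined but deliberately not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed identifiers: P106 = occupation, Q5482740 = programmer.
PROGRAMMER_QUERY = """
SELECT ?programmer ?programmerLabel
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .
  FILTER(LANG(?programmerLabel) = "zh")
}
"""

def run_query(query):
    """Send a query to the Wikidata endpoint and return rows as dicts."""
    url = "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"}
    )
    request = Request(url, headers={"User-Agent": "wikidata-tutorial-sketch/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    # Standard SPARQL JSON results: one dict per row, keyed by variable name.
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in data["results"]["bindings"]
    ]

# Usage (requires network access):
# rows = run_query(PROGRAMMER_QUERY)
```

The FILTER line is what drops programmers without a Chinese label.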
Then, query the main achievements of each programmer.
The results of the run are as follows.
Some programmers have multiple achievements. For example, John Carmack has two, “Doom” and “Quake”. You can use the GROUP BY clause to combine them.
In the above code, the GROUP_CONCAT function merges multiple ?notableworkLabel values into a new column.
The result is as follows.
In the results above, John Carmack’s two works have been combined into one cell.
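The effect of GROUP BY with GROUP_CONCAT can be mimicked in Python as a sketch. The rows are illustrative only (the second Carmack title is given by its usual English name, Quake), and `group_concat` is a hypothetical helper, not a Wikidata API. In SPARQL the separator is passed as `GROUP_CONCAT(?x; separator=", ")`.

```python
from collections import defaultdict

# One row per (programmer, notable work), as an ungrouped query would return.
rows = [
    ("John Carmack", "Doom"),
    ("John Carmack", "Quake"),
    ("Linus Torvalds", "Linux"),
]

def group_concat(rows, separator=", "):
    """Group rows by name and join each group's works into one string."""
    grouped = defaultdict(list)
    for name, work in rows:
        grouped[name].append(work)
    return {name: separator.join(works) for name, works in grouped.items()}

for name, works in group_concat(rows).items():
    print(name, "|", works)
```

Three input rows become two output rows, one per programmer.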
Next, add a photo for each person.
In the above code, the returned values include a photo variable ?image. Since not everyone has a photo, the photo requirement is placed in an OPTIONAL block, indicating that this item is optional.
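The difference OPTIONAL makes can be sketched in Python as the difference between a left join and an inner join. The names, the image file name, and the two helper functions are hypothetical and exist only to show the behavior.

```python
people = ["John Carmack", "Linus Torvalds"]

# Hypothetical data: only one person has a photo on record.
images = {"John Carmack": "carmack.jpg"}

def with_required(people, images):
    """Photo as a normal pattern: people without one are dropped."""
    return [(name, images[name]) for name in people if name in images]

def with_optional(people, images):
    """Photo inside OPTIONAL: everyone is kept, missing photos stay empty."""
    return [(name, images.get(name)) for name in people]

print(with_required(people, images))
print(with_optional(people, images))
```

Without OPTIONAL, adding the photo pattern would silently shrink the list to people who have a photo.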
After getting the query result, switch the result view from table to image grid.
The photos are now displayed.
Finally, we want to know where they are from. Wikidata provides their place of birth.
Switch the view to map.
You can now see where these programmers were born on the world map.
This tutorial ends here. There are many more ways to query Wikidata; to keep learning, click the Examples button in the header of the query page to see the official examples.