Wikipedia has a sister project called “Wikidata” (Wikidata). You can click on it from the left sidebar of Wikipedia.

“Wikidata takes all of Wikipedia’s data and organizes it into a machine-processable database for easy access. For example, which is the most populous region in Shanxi Province?

This kind of question is very time-consuming to query in Wikipedia, and you have to manually extract information from each entry. However, Wikidata can execute just one command and return the answer (see later for details). Because it provides structured data, it can be queried by machine.

However, Wikidata is not a relational database, but an RDF database; the query language is not SQL, but SPARQL. I learned a little bit of RDF and SPARQL in a rough way, and this article is a learning note to demonstrate how to use Wikidata to query information.

Meaning of RDF

As we all know, relational databases are the most widely used databases today, abstracting data into a tabular relationship of rows and columns.

However, the real world is less like a table and more like a network. Various things are connected together in a web through intricate relationships.

A network is called a graph in mathematics, and each thing is a node of a graph, and the relationship between the nodes is the edge that connects them together. If a database stores data as a graph, it is called a graph database.

RDF is a way of describing graph databases, or a protocol for using them. It describes the direct relationship between things in a “triple” fashion.

A “triple” is the core concept of RDF, which refers to two things and the relationship between them, syntactically rendered as “subject + predicate + object”.

1
The sky is blue.

The above sentence is an RDF triplet.” sky" (subject) and “blue” (object) are two things that are connected by a color relation (predicate).

RDF requires that predicates (i.e., relationships between things) must be clearly defined. Think of it this way, if the predicate is given, you can use the subject to query the object, or the object to query the subject. For example, if the color relationship is given, then the following query can be made to the database.

1
2
3
Query 1: sky + color = ?

Query 2: ? + color = blue

RDF requires that each set of predicates must have an explicit URL that distinguishes between the different predicates, and RDF officially defines a set of common predicates with the following URLs.

https://www.w3.org/1999/02/22-rdf-syntax-ns

When you use a URL, just refer to it so that others know which set of predicates is being used.

RDF allows you to specify a prefix that represents the URL address, such as the URL of the official predicate above, which is usually represented by the prefix rdf.

1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>

Each URL can contain multiple predicates, distinguished by the “prefix : predicate” form. For example, the official definition of a “type” predicate, which specifies the type of the subject, can be represented by rdf:type.

1
XiaoMing is a student.

The above sentence, written as an RDF triplet, is of the following form.

1
2
3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>

Xiaoming rdf:type Student.

Since rdf:type is a common predicate, RDF allows it to be abbreviated to a, so that “Xiao Ming is a student” can in turn be expressed as Xiao Ming a student.

1
2
3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>

Xiaoming a student.

Note that each RDF triplet ends with an English period to distinguish multiple triples.

Example of RDF syntax

Here is an example to demonstrate how RDF defines the relationship between things.

1
The Beatles were a band consisting of John Lennon, Paul McCartney, Ringo Starr and George Harrison, all artists, who released an album in 1963, 《Please Please Me》, which contained the single 《Love Me Do》, 125 seconds long.

The above passage is a natural language text. We start by drawing a network relationship diagram.

然后,它被转换为一个RDF三元组。首先,给出谓词的URL,以及相应的前缀。

1
2
PREFIX : <http://foo.com/tutorial/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>

In the above example, there are two URLs, indicating that two sets of predicates are used. One set is the official predicate, represented using the prefix rdf; the other set is self-defined, with an empty prefix, indicating that this is the default prefix.

The phrase “The Beatles are a band with John Lennon, Paul McCartney, Ringo Starr, and George Harrison.” corresponds to the following triplet.

1
2
3
4
5
6
Beatles rdf:type Band .
Beatles :name "Beatles" .
Beatles :member John_Lennon .
Beatles :member Paul_McCartney .
Beatles :member Ringo_Starr .
Beatles :member George_Harrison .

In the above example, rdf:type, :name, and :member are all predicates. Since these triples have the same subject, RDF allows them to be merged.

1
2
3
Beatles a band ;
      :name "Beatles" ;
      :member John_Lennon, Paul_McCartney, George_Harrison, Ringo_Starr .

In the above code, the triples with the same subject are written in a merged way, each triple is separated by a semicolon, and the last triple is terminated by a period.

The rest of the RDF triples are as follows.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
John_Lennon      a artists .
Paul_McCartney   a artists .
Ringo_Starr      a artists .
George_Harrison  a artists .
Please_Please_Me a album;
                 :name "Please Please Me" ;
                 :date "1963" ;
                 :artist "artists" ;
                 :track Love_Me_Do .
Love_Me_Do       a Song ;
                 :name "Love Me Do" ;
                 :length 125 .

SPARQL Query Language

SPARQL is a query language for RDF databases, much like the syntax of SQL. The core idea is to extract the eligible subject or object from a triple according to the given predicate verb.

The syntax of a SPARQL query is as follows.

1
2
3
4
SELECT <variables>
WHERE {
   <graph pattern>
}

In the above code, <variables> is the subject or object to be extracted, and <graph pattern> is the triple pattern to be queried.

For example, query all albums in the database.

1
2
3
4
SELECT ?album
WHERE {
   ?album rdf:type :Album .
}

In the above code, ?album is a variable with any name you want, but the first character must be a question mark ? . The query condition is that the variable ?album is the subject, and according to the predicate rdf:type, the object :Album can be obtained. This object also has a prefix indicating that this is defined by the current database.

The variable can be replaced by an asterisk * if all the records that match the condition are returned, and the keyword WHERE can be omitted inside the SELECT query, as well as the period at the end of the last triple, so the above query can also be written as the following.

1
SELECT * { ?album a Album }

In addition to the album name, if you want to return the artist of the album, you can add a variable ?artist.

1
2
3
4
5
SELECT ?album ?artist
{
   ?album a :Album .
   ?album :artist ?artist .
}

In the above code, the variable ?artist must be the object of ?album (subject) and :artist (predicate).

Example of Wikidata query: Most populated areas in Shanxi Province

The following is a further study of SPARQL syntax by looking up “which is the most populous region in Shanxi Province” through Wikidata.

First, go to the Wikidata website (https://www.wikidata.org/), and search for “Shanxi” in the search field at the top of the page. Alternatively, Wikipedia’s “Shanxi Province” page has a link to Wikidata in the left column.

Then, go to the page of Shanxi Province.

At this point, look for the URL of this page.

1
https://www.wikidata.org/wiki/Q46913

The Q46913 at the end of the URL above is the number (i.e. the subject) of the entry for Shanxi Province in the wiki data, which will be used later.

Next, scroll down the page and find the section “contains administrative territorial entity”, which lists the regions under Shanxi Province.

Click on the title “contains administrative territorial entity” to go to its page, and look for the URL as well.

1
https://www.wikidata.org/wiki/Property:P150

The last part of the URL above, P150, is the number of the predicate verb “administrative entity included”.

Now, you can start the query. Go to Wikidata’s online query page query.wikidata.org

In the query box, enter the following SPARQL statement.

1
2
3
4
SELECT ?area
WHERE {
   wd:Q46913  wdt:P150 ?area .
}

The above code requires the return of the variable ?area, which must satisfy the subject Shanxi Province (wd:Q46913) and the predicate administrative entity contained (wdt:P150). The prefix wd indicates that this is a wiki data entry, while the prefix wdt indicates that this is a predicate relationship defined by the wiki data.

Click the triangular Run button in the left sidebar to get the results of the query at the bottom of the page.

As you can see from the above image, the returned entries are numbered. Modify the query statement to add a column of text labels.

1
2
3
4
5
6
7
8
SELECT 
  ?area
  ?areaLabel
WHERE {
   wd:Q46913  wdt:P150 ?area .
   ?area rdfs:label ?areaLabel .
   FILTER(LANGMATCHES(LANG(?areaLabel), "zh-CN")) 
}

In the above code, add a returned variable ?areaLabel which is the text label of the previous variable ?area (satisfying the predicate rdfs:label), and add a filter statement FILTER which requires only Chinese labels to be returned.

Run this query and you will see the Chinese names for each area.

Next, add a population variable ?popTotal that returns the total number of people in each region.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT 
  ?area 
  ?areaLabel 
  ?popTotal
WHERE {
   wd:Q46913  wdt:P150 ?area .
   ?area rdfs:label ?areaLabel .
   FILTER(LANGMATCHES(LANG(?areaLabel), "zh-CN")) 

   ?area wdt:P1082 ?popTotal .
}

Run this code and you will see the population total.

Then, add a sort clause order by to sort by the population in reverse order.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT 
  ?area 
  ?areaLabel 
  ?popTotal
WHERE {
   wd:Q46913  wdt:P150 ?area .
   ?area rdfs:label ?areaLabel .
   FILTER(LANGMATCHES(LANG(?areaLabel), "zh-CN")) 

   ?area wdt:P1082 ?popTotal .
}

The results of the run are as follows.

Finally, add a limit 1 clause to return only the first piece of data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SELECT 
  ?area 
  ?areaLabel 
  ?popTotal
WHERE {
   wd:Q46913  wdt:P150 ?area .
   ?area rdfs:label ?areaLabel .
   FILTER(LANGMATCHES(LANG(?areaLabel), "zh-CN")) 

   ?area wdt:P1082 ?popTotal .
}
ORDER BY desc(?popTotal)
limit 1

This gives us the most populated area in Shanxi Province.

Example of a wiki data query: Programmer Directory

Here is another example to find out all programmers of Wikipedia revenue.

1
2
3
4
5
6
7
8
9

SELECT 
  ?programmer 
  ?programmerLabel
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .  
  FILTER (LANGMATCHES(LANG(?programmerLabel), "zh-CN"))
}

In the above code, Q5482740 is the programmer and P106 is the occupation.

Run this query and you will see the list of programmers.

Note that only programmers with Chinese names are returned here. If there is no Chinese name of the programmer inside the database, it will not be returned here.

Then, query the main achievements of each programmer.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14

SELECT 
  ?programmer 
  ?programmerLabel 
  ?notableworkLabel
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .  
  FILTER (LANGMATCHES(LANG(?programmerLabel), "zh-CN"))

  ?programmer wdt:P800 ?notablework .  
  ?notablework rdfs:label ?notableworkLabel .  
  FILTER(LANGMATCHES(LANG(?notableworkLabel), "zh-CN"))
}

The results of the run are as follows.

Some programmers have multiple achievements, for example, John Carmack has two achievements, “Doom” and “Thor’s Hammer”. You can use the GROUP BY clause to combine them together.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15

SELECT 
  ?programmer 
  ?programmerLabel 
  (GROUP_CONCAT(?notableworkLabel; separator="; ") AS ?works)
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .  
  FILTER(LANGMATCHES(LANG(?programmerLabel), "zh-CN"))

  ?programmer wdt:P800 ?notablework .  
  ?notablework rdfs:label ?notableworkLabel .  
  FILTER (LANGMATCHES(LANG(?notableworkLabel), "zh-CN"))
}
GROUP BY ?programmer ?programmerLabel

In the above code, the GROUP_CONCAT function is used to merge multiple ?notableworkLabel variables into a new column works.

The result is as follows.

In the picture above, “Doom” and “Thor’s Hammer” have been combined into one cell.

Next, add an avatar photo for each person.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
SELECT 
  ?programmer
  ?programmerLabel 
  (GROUP_CONCAT(?notableworkLabel; separator="; ") AS ?works) 
  ?image
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .  
  FILTER(LANGMATCHES ( LANG ( ?programmerLabel ), "zh-CN"))

  ?programmer wdt:P800 ?notablework .  
  ?notablework rdfs:label ?notableworkLabel .  
  FILTER (LANGMATCHES ( LANG ( ?notableworkLabel ), "zh-CN"))

  OPTIONAL {?programmer wdt:P18 ?image}
}
GROUP BY ?programmer ?programmerLabel ?image

In the above code, the return value adds a photo variable ?image. Since not everyone has a photo, the photo requirement is placed in the OPTIONAL condition, indicating that this item is optional.

After getting the query result, switch the table view of the result to the image grid.

At this point, the photo is ready to be displayed.

Finally, we want to know where they are from, and the wiki data provides their place of birth.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
SELECT ?programmer 
  ?programmerLabel 
  (GROUP_CONCAT(?notableworkLabel; separator="; ") AS ?works) 
  ?image
  ?cood
WHERE {
  ?programmer wdt:P106 wd:Q5482740 .
  ?programmer rdfs:label ?programmerLabel .  
  FILTER(LANGMATCHES ( LANG ( ?programmerLabel ), "zh-CN"))

  ?programmer wdt:P800 ?notablework .  
  ?notablework rdfs:label ?notableworkLabel .  
  FILTER (LANGMATCHES ( LANG ( ?notableworkLabel ), "zh-CN"))

  OPTIONAL {?programmer wdt:P18 ?image}

  OPTIONAL {
    ?programmer wdt:P19 ?birthplace .
    ?birthplace wdt:P625 ?cood .
  }
}
GROUP BY ?programmer ?programmerLabel ?image ?cood

上面的代码在返回值中加入了坐标变量cood,首先查询了程序员的出生地,然后查询了出生地的地理坐标。

运行查询后,坐标出现在默认的表视图中。

Switch the view to map.

This is where you can see where these programmers are located on the world map.

This tutorial ends here, there are many more ways to query wiki data, you can continue learning by clicking the Examples button in the header of Query Page to see the official examples provided.


Reference https://www.ruanyifeng.com/blog/2020/02/sparql.html