
Parse Birth And Death Dates From Wikipedia?

I'm trying to write a Python program that can search Wikipedia for people's birth and death dates. For example, Albert Einstein was born on 14 March 1879 and died on 18 April 1955.

Solution 1:

Consider using a library such as BeautifulSoup or lxml to parse the HTML/XML response.

You may also want to take a look at Requests, which has a much cleaner API for making requests.


Here is working code using Requests, BeautifulSoup and re. It is arguably not the best solution here, but it is flexible and can be extended to similar problems:

import re
import requests
from bs4 import BeautifulSoup

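# Wikimedia asks API clients to send a descriptive User-Agent; this one is a placeholder.
headers = {'User-Agent': 'birth-death-example/0.1'}
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'

res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "xml")  # the "xml" parser requires lxml
wikitext = soup.revisions.get_text()

# Capture everything between the template name and the closing braces,
# e.g. "|df=yes|1879|3|14" from {{Birth date|df=yes|1879|3|14}}.
birth_re = re.search(r'Birth date(.*?)\}\}', wikitext, re.IGNORECASE)
# Drop the rest of the template name, then keep positional fields only
# (named parameters such as df=yes contain '=').
birth_data = [f.strip() for f in birth_re.group(1).split('|')[1:] if '=' not in f]
birth_year = birth_data[0]
birth_month = birth_data[1]
birth_day = birth_data[2]

death_re = re.search(r'Death date(.*?)\}\}', wikitext, re.IGNORECASE)
death_data = [f.strip() for f in death_re.group(1).split('|')[1:] if '=' not in f]
death_year = death_data[0]
death_month = death_data[1]
death_day = death_data[2]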

Following @JBernardo's suggestion, here is a better answer for this particular use case, using JSON data and mwparserfromhell:

import requests
import mwparserfromhell

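# Wikimedia asks API clients to send a descriptive User-Agent; this one is a placeholder.
headers = {'User-Agent': 'birth-death-example/0.1'}
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'

res = requests.get(url, headers=headers)
# res.json() is a method, and dict views are not indexable in Python 3, so take the first page with next(iter(...)).
page = next(iter(res.json()["query"]["pages"].values()))
text = page["revisions"][0]["*"]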
wiki = mwparserfromhell.parse(text)

birth_data = wiki.filter_templates(matches="Birth date")[0]
birth_year = birth_data.get(1).value
birth_month = birth_data.get(2).value
birth_day = birth_data.get(3).value

death_data = wiki.filter_templates(matches="Death date")[0]
death_year = death_data.get(1).value
death_month = death_data.get(2).value
death_day = death_data.get(3).value

Solution 2:

First: The Wikipedia API allows the use of JSON instead of XML, and that will make things much easier.

Second: There's no need for an HTML/XML parser at all. The content is not HTML, and the container doesn't need to be either. What you need to parse is the wiki markup inside the "revisions" field of the JSON.

Check some Wiki parsers here


What seems to be confusing here is that the API lets you request a certain format (XML or JSON), but that is just a container for text in the real format you want to parse:

This one: {{Birth date|df=yes|1879|3|14}}

With one of the parsers provided in the link above, you will be able to do that.
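
To make the structure of that template concrete, here is a minimal sketch that pulls the date out of a single flat template using only the standard library. The helper name parse_date_template is hypothetical, and the sketch assumes the year, month and day are the first three positional parameters. A real wiki parser is still the better choice for whole articles, since it copes with nested templates and links.

```python
from datetime import date


def parse_date_template(template: str) -> date:
    """Parse a flat '{{Birth date|...}}'-style template into a date.

    Hypothetical helper: assumes numeric year, month and day are the
    first three positional parameters and that nothing is nested.
    """
    inner = template.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError(f"not a template: {template!r}")
    # Drop the braces, then split into the template name and its parameters.
    name, *params = inner[2:-2].split("|")
    # Named parameters such as 'df=yes' contain '='; positional ones do not.
    positional = [p.strip() for p in params if "=" not in p]
    year, month, day = (int(p) for p in positional[:3])
    return date(year, month, day)


birth = parse_date_template("{{Birth date|df=yes|1879|3|14}}")
print(birth.isoformat())  # 1879-03-14
```

Filtering on "=" rather than relying on fixed positions means the result does not change if df=yes is moved or omitted.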

Solution 3:

First, use pywikipedia. It allows you to query article text, template parameters etc. through a high-level abstract interface. Second, I would go with the Persondata template (look towards the end of the article). Also, in the long term, you might be interested in Wikidata, which will take several months to introduce, but it will make most metadata in Wikipedia articles easily queryable.

Solution 4:

The persondata template is deprecated now, and you should instead access Wikidata. See Wikidata:Data access. My earlier (now deprecated) answer from 2012 was as follows:

What you should do is parse the {{persondata}} template found in most biographical articles. There are existing tools for easily extracting such data programmatically. With your existing knowledge and the other helpful answers, I am sure you can make that work.

Solution 5:

One alternative in 2019 is to use the Wikidata API, which, among other things, exposes biographical data like birth and death dates in a structured format that is very easy to consume without any custom parsers. Many Wikipedia articles depend on Wikidata for their info, so in many cases this will be the same as if you were consuming Wikipedia data.

For example, look at the Wikidata page for Albert Einstein and search for "date of birth" and "date of death"; you will find they are the same as in Wikipedia. Every entity in Wikidata has a list of "claims", which are pairs of "properties" and "values". To know when Einstein was born and died, we only need to search the list of claims for the appropriate properties, in this case P569 and P570. To do this programmatically, it's best to access the entity as JSON, which you can do with the following URL structure:

https://www.wikidata.org/wiki/Special:EntityData/Q937.json

And as an example, here is what the claim P569 states about Einstein:

"P569":[{"mainsnak":{"property":"P569","datavalue":{"value":{"time":"+1879-03-14T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"},"type":"time"},"datatype":"time"},"type":"statement",

You can learn more about accessing Wikidata in this article, and more specifically about how dates are structured in Help:Dates.
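
As a rough sketch of how the pieces above fit together, the following reads a time claim such as P569 out of an entity and converts it to a Python date. The function names, the User-Agent string and the trimmed SAMPLE_ENTITY are illustrative assumptions, not part of any Wikidata client library. The sketch only handles day-precision dates (precision 11) in the common era, and it uses urllib from the standard library so that it has no third-party dependencies.

```python
import json
from datetime import date
from urllib.request import Request, urlopen

# Sample trimmed from the P569 claim shown above; a real entity has many more keys.
SAMPLE_ENTITY = {
    "claims": {
        "P569": [{"mainsnak": {"datavalue": {"value": {
            "time": "+1879-03-14T00:00:00Z", "precision": 11}}}}],
    }
}


def fetch_entity(entity_id: str) -> dict:
    """Download one entity via Special:EntityData (needs network access)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
    # The User-Agent string is a placeholder; replace it with your own.
    req = Request(url, headers={"User-Agent": "birth-death-example/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)["entities"][entity_id]


def claim_date(entity: dict, prop: str):
    """Return the first usable value of a time property as a date, or None."""
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim["mainsnak"]
        if "datavalue" not in snak:  # "unknown value" / "no value" snaks
            continue
        value = snak["datavalue"]["value"]
        # 11 means day precision; coarser values carry no real day.
        # Negative (BCE) years cannot be represented by datetime.date.
        if value["precision"] < 11 or value["time"].startswith("-"):
            continue
        # The time looks like '+1879-03-14T00:00:00Z'; keep only the date part.
        year, month, day = value["time"].lstrip("+").split("T")[0].split("-")
        return date(int(year), int(month), int(day))
    return None


print(claim_date(SAMPLE_ENTITY, "P569"))  # 1879-03-14
```

With network access, einstein = fetch_entity("Q937") followed by claim_date(einstein, "P569") and claim_date(einstein, "P570") would look up both dates for the live entity.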
