I recently ran across a post on r/boardgames that looked at the historical dynamics of BGG Top 60 Games rankings. The author took advantage of BGG's XML API to access stats for games from 2007 to now. I found this quite curious, as BGG has an abundance of data available on games; they report numbers of copies owned and desired, average ratings, weighted ratings, standard deviation of ratings, etc. I figured this would be a good opportunity to do some data extraction and manipulation. My original plan was to query the XML API for the information I needed and then import that into MS SQL Server to play around with it a little.
After looking into the API a bit deeper (aka wasting an hour trying to make it return the data I wanted), it became clear that its limitations would require a little bit of automation to access the information I needed. I quick Python script is all that is required to fetch the names and gameIds of the top \(n\) games in the current BGG ranking. I had to reinstall Python 3.4.1, since I didn't have it on my laptop. I ran into a bit of an issue as pip didn't seem to work out of the box as advertised. I had to go to Start/Edit System Environmental Variables and add the C:\Python34\Scripts path to ROOT to make pip work from the command prompt. For the method I used to scrape data from the BGG website, the following Python libraries need to be installed:
The following Pyhton script gets the names and game ids for the top \(n\) ranked games:
Once we have the game ids we can use python to scrub for further details. However, as I wanted to get the data to MS SQL Server, I just wanted to save the XML pages with the historical ranking data for the top 20 games on BGG. Those included details about the game (publishing year, publisher, number of players, play length, etc.) as well as daily ranking statistics from 2006-03-18. To obtain the XML pages, I did the following, using urllib.request's urlopen method:
Once I had the XML files with the historical ranking data all that was left was to get them into a database. My approach was to use MS SQL's OPENROWSET() OPENXML() functions. Below you can see the full query I used. The same result can be achieved by using the value() and nodes() methods (and would probably be shorter), but I shall save that for another day.
The result is a very neat database with all sorts of historical data for the top 20 games on BGG. Once I figure out what I want to do with the data, I will probably extend it to cover the top 50 or 100 games. Updates with interesting results and graphs should follow soon!