Playing with the BGG XML API and MS SQL Server

Ivan Vanchev

Thursday, August 21, 2014

I recently ran across a post on r/boardgames that looked at the historical dynamics of BGG Top 60 Games rankings. The author took advantage of BGG's XML API to access stats for games from 2007 to now. I found this quite curious, as BGG has an abundance of data available on games; they report numbers of copies owned and desired, average ratings, weighted ratings, standard deviation of ratings, etc. I figured this would be a good opportunity to do some data extraction and manipulation. My original plan was to query the XML API for the information I needed and then import that into MS SQL Server to play around with it a little.

After looking into the API a bit deeper (aka wasting an hour trying to make it return the data I wanted), it became clear that its limitations would require a little bit of automation to access the information I needed. I quick Python script is all that is required to fetch the names and gameIds of the top \(n\) games in the current BGG ranking. I had to reinstall Python 3.4.1, since I didn't have it on my laptop. I ran into a bit of an issue as pip didn't seem to work out of the box as advertised. I had to go to Start/Edit System Environmental Variables and add the C:\Python34\Scripts path to ROOT to make pip work from the command prompt. For the method I used to scrape data from the BGG website, the following Python libraries need to be installed:

  • Beautiful Soup
  • requests
  • numpy
  • pandas
The following Pyhton script gets the names and game ids for the top \(n\) ranked games:


                  from bs4 import BeautifulSoup
                  import requests
                  import xml.etree.ElementTree as ElementTree
                  import numpy as np
                  import pandas as pd

                  #set number of games to fetch
                  nGames = 200
                  nGamesPerPage = 100
                  urlBase = 'http://www.boardgamegeek.com/browse/boardgame/page/'

                  #get the top nGames games from the ranking
                  def get_top_games(ngames=nGames):
                      # number of pages to scrape
                      nPages = round(ngames, -2) // nGamesPerPage

                      games = []
                      for i in range(1, nPages + 1):
                          print("Reading page {0} / {1}".format(i, nPages))
                          url = urlBase + str(i)
                          page = requests.get(url, timeout=120)
                          soup = BeautifulSoup(page.content)
                          # items are found by 'id=results_objectname*' attribute in 'div' tag
                          cntr = 1
                          for _ in range(nGamesPerPage):
                              item = soup.find('div', {'id': 'results_objectname' + str(cntr)})
                              if item:
                                  title = item.a.contents[0]
                                  # some games don't have a publishing year
                                  try:
                                      year = item.span.contents[0][1:-1]
                                  except AttributeError:
                                      year = np.NaN
                                  games.append((title, year))
                                  cntr += 1
                      return games[:ngames]

                  topgames = get_top_games()
                  print("{0} games found.".format(len(topgames)))

                  #get the relevant game ids
                  def get_games_id(games):
                      games = pd.DataFrame.from_records(games, columns=['title', 'year'])

                      def get_game_id(x):
                          url = "http://www.boardgamegeek.com/xmlapi/search?search={0}&exact=1".format(x.title)
                          result = requests.get(url, timeout=120)
                          result = ElementTree.fromstring(result.text)
                          if len(result) == 1:
                              x['id'] = result[0].attrib['objectid']
                          # nothing found? then do inexact search and match manually afterwards
                          elif len(result) == 0:
                              url = "http://www.boardgamegeek.com/xmlapi/search?search={0}".format(x.title)
                              result = requests.get(url, timeout=120)
                              result = ElementTree.fromstring(result.text)
                              for found in result:
                                  if found[0].text == x.title:
                                      x['id'] = found.attrib['objectid']
                                      break
                          # multiple results found? then try to match on year
                          else:
                              for found in result:
                                  if found[0].text == x.title:
                                      if len(found) > 1 and found[1].text == x.year:
                                          x['id'] = found.attrib['objectid']
                                          break
                          
                          # print("id {0} found for {1}".format(x.id, x.title))
                          return x

                  games = games.apply(get_game_id, axis=1)
    
                  return games

                  gamedata = get_games_id(topgames)
                  print("{0} ids found".format(len(gamedata[gamedata.id.notnull()].index)))
                

Once we have the game ids we can use python to scrub for further details. However, as I wanted to get the data to MS SQL Server, I just wanted to save the XML pages with the historical ranking data for the top 20 games on BGG. Those included details about the game (publishing year, publisher, number of players, play length, etc.) as well as daily ranking statistics from 2006-03-18. To obtain the XML pages, I did the following, using urllib.request's urlopen method:


                  from urllib.request import urlopen

                  urlBase = "http://www.boardgamegeek.com/xmlapi/boardgame/12333,25613,31260,120677,3076,124742,102794,72125,2651,96848,84876,35677,121921,28720,126163,115746,18602,68448,40834,62219?historical=1&page="
                  nPages = 30

                  for i in range(1,nPages+1):
                      print("Reading page {0}/{1}".format(i, nPages))
                      url = urlBase + str(i)
                      s = urlopen(url)
                      contents = s.read()
                      filename = "top20page" + str(i) + ".xml"

                      file = open(filename, "wb")
                      file.write(contents)
                      file.close()
                

Once I had the XML files with the historical ranking data all that was left was to get them into a database. My approach was to use MS SQL's OPENROWSET() OPENXML() functions. Below you can see the full query I used. The same result can be achieved by using the value() and nodes() methods (and would probably be shorter), but I shall save that for another day.


                  USE master;
                  GO

                  CREATE DATABASE bggstats;
                  GO

                  USE bggstats;
                  GO

                  CREATE TABLE readXML
                  (
                    Id int IDENTITY PRIMARY KEY,
                    XMLData XML,
                    LoadedDate DATE)

                  INSERT INTO readXML(XMLData, LoadedDate)
                  SELECT CONVERT(XML, BulkColumn) AS BulkColumn, GETDATE()
                  FROM OPENROWSET(BULK 'C:\Path\To\Your\BGG XML Data\top20.xml', SINGLE_BLOB) AS x;

                  CREATE TABLE bggTop20
                  (
                    gameId int NOT NULL PRIMARY KEY,
                    yearPub int,
                    minPlayers int,
                    maxPlayers int,
                    playTime int,
                    age int,
                    name nvarchar(60))

                  CREATE TABLE bggRankHistory
                  (
                    gameId int NOT NULL FOREIGN KEY REFERENCES bggTop20(gameId),
                    rankDate date NULL,
                    rankValue int NULL,
                    rankId int NULL,
                    usersRated int NULL,
                    avgRating float NULL,
                    wghtAvgRating float NULL,
                    stdevRating float NULL,
                    owned int NULL,
                    trading int NULL,
                    wanting int NULL,
                    wishing int NULL,
                    numComments int NULL,
                    numWeights int NULL,
                    avgWeights float NULL)

                  DECLARE @XML AS XML, @hDoc AS INT, @SQL NVARCHAR (MAX)

                  SELECT @XML = XMLData FROM readXML

                  EXEC sp_xml_preparedocument @hDoc OUTPUT, @XML

                  INSERT INTO bggTop20 (gameId, yearPub, minPlayers, maxPlayers, playTime, age, name)
                  SELECT gameId, yearPub, minPlayers, maxPlayers, playTime, age, name
                  FROM OPENXML(@hDoc, 'boardgames/boardgame/statistics')
                  WITH
                  (
                    gameId [int] '../@objectid',
                    yearPub [int] '../yearpublished',
                    minPlayers [int] '../minplayers',
                    maxPlayers [int] '../maxplayers',
                    playTime [int] '../playingtime',
                    age [int] '../age',
                    name [nvarchar](60) '../name')

                  INSERT INTO bggRankHistory (gameId, rankDate, rankValue, rankId, usersRated, avgRating, wghtAvgRating, stdevRating, owned, trading, wanting, wishing, numComments, numWeights, avgWeights)
                  SELECT gameId, rankDate, rankValue, rankId, usersRated, avgRating, wghtAvgRating, stdevRating, owned, trading, wanting, wishing, numComments, numWeights, avgWeights
                  FROM OPENXML(@hDoc, 'boardgames/boardgame/statistics/ratings/ranks/rank')
                  WITH
                  (
                    gameId [int] '../../../../@objectid',
                    rankDate [date] '../../@date',
                    rankValue [int] '@value',
                    rankId [int] '@id',
                    usersRated [int] '../../usersrated', 
                    avgRating [float] '../../average',
                    wghtAvgRating [float] '../../bayesaverage',
                    stdevRating [float] '../../stddev',
                    owned [int] '../../owned',
                    trading [int] '../../trading',
                    wanting [int] '../../wanting',
                    wishing [int] '../../wishing',
                    numComments [int] '../../numcomments',
                    numWeights [int] '../../numweights',
                    avgWeights [float] '../../averageweight') 

                  EXEC sp_xml_removedocument @hDoc
                  GO
                

The result is a very neat database with all sorts of historical data for the top 20 games on BGG. Once I figure out what I want to do with the data, I will probably extend it to cover the top 50 or 100 games. Updates with interesting results and graphs should follow soon!


000webhost logo