I love going to the cinema. I usually go once a week to my local Cineworld multiplex. Their website has changed a few times over the years. Generally the changes have all been improvements, and my local cinema's listing is good. However, every cinema website I have found lacks an important view of the listings data: chronological order rather than film title order. Why is this useful, you ask? Well, if I want to go to the cinema on a particular evening, I really don't care what films are showing outside my allocated timeslot. I want to be able to easily see which films I have not already seen are going to start between, say, 7 and 8.
So the first thing I did was email their webmaster, but I got no response. So I decided to have a go at it myself. To show a different view of the data I needed an actual data source, so I emailed a few other sites to ask where they got their listings data. Surprise, surprise: no response from them either. They say if you want something done, do it yourself. So I did.
I wrote a script that downloads the raw HTML pages from the Cineworld website, parses them and produces an XML file with the listings data. Initially it was a Perl script that was rather hacky and very prone to breakage. It then got rewritten using a few CPAN modules to parse the HTML and use XPath to search for the relevant bits of data. I then rewrote it in a more general form in Python.
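To give a feel for the scrape step, here is a minimal sketch of turning listings HTML into XML. It is not my actual script: the sample markup, the class names and the `listings`/`film`/`showing` element names are all made up for illustration, and it uses only the Python standard library rather than XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical markup; the real Cineworld pages look different.
SAMPLE_HTML = """
<div class="film"><h3>Film A</h3>
  <span class="time">19:30</span><span class="time">21:45</span></div>
<div class="film"><h3>Film B</h3>
  <span class="time">18:15</span></div>
"""


class ListingParser(HTMLParser):
    """Collects film titles and start times from the sample markup."""

    def __init__(self):
        super().__init__()
        self.films = []     # each entry is [title, [times]]
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "film":
            self.films.append(["", []])
        elif tag == "h3" and self.films:
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "time" and self.films:
            self._field = "time"

    def handle_endtag(self, tag):
        # Any closing tag ends the current text field.
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "title":
            self.films[-1][0] = text
        else:
            self.films[-1][1].append(text)


def listings_to_xml(html):
    """Parse the HTML and return the listings as an XML string."""
    parser = ListingParser()
    parser.feed(html)
    root = ET.Element("listings")
    for title, times in parser.films:
        film = ET.SubElement(root, "film", title=title)
        for start in times:
            ET.SubElement(film, "showing", time=start)
    return ET.tostring(root, encoding="unicode")


xml_text = listings_to_xml(SAMPLE_HTML)
```

A parser like this is tied to the exact page layout, which is why this kind of scraper breaks whenever the site changes.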
Currently it is run from cron each night to fetch my local cinema's listings and create the XML file. I'm not sure if cinemas use XML or have any standards, so I went with my own DTD, which heavily reflects the data that Cineworld expose.
Having an XML file with the raw data is all very cool, but it does not really solve the problem. So next I created a transform that displays all the showings in chronological order and outputs an HTML file that is easily viewable in both a normal browser and a phone browser (be warned, the resultant HTML is quite large). Neither browser has good enough XML/XSLT brains to do the transform itself, so the output is pre-generated each night at the end of the scrape process.
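The idea behind the chronological view can be sketched in a few lines. This is plain Python rather than the transform I actually use, and the XML layout (`listings`, `film`, `showing` with a zero-padded 24-hour `time` attribute) is an assumption made up for the example, not my real DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical listings file; element and attribute names are assumptions.
SAMPLE_XML = """<listings>
  <film title="Film A"><showing time="19:30"/><showing time="21:45"/></film>
  <film title="Film B"><showing time="18:15"/><showing time="19:10"/></film>
</listings>"""


def chronological_showings(xml_text):
    """Return (time, title) pairs sorted by start time."""
    root = ET.fromstring(xml_text)
    showings = [
        (showing.get("time"), film.get("title"))
        for film in root.findall("film")
        for showing in film.findall("showing")
    ]
    # Zero-padded HH:MM strings sort correctly as plain text.
    return sorted(showings)


def render_html(showings):
    """Render the sorted showings as a bare-bones HTML table."""
    rows = "\n".join(
        f"<tr><td>{escape(start)}</td><td>{escape(title)}</td></tr>"
        for start, title in showings
    )
    return f"<html><body><table>\n{rows}\n</table></body></html>"


showings = chronological_showings(SAMPLE_XML)
page = render_html(showings)

# Only the films starting between 7 and 8 in the evening.
evening = [s for s in showings if "19:00" <= s[0] < "20:00"]
```

Once the showings are a flat list sorted by time, picking out a timeslot is a simple filter, which is exactly the view the cinema websites don't offer.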
Recently I found out, with the help of Google, that Cineworld themselves publish more of the raw listings data on their site. It seems to be quite a new thing: 03/03/2009 according to the readme. A directory listing shows all the available exports, but the main one seems to be listings.xml. I'm not sure why they split the data into so many XML files. The DTD is quite similar to mine, but I'm not sure I like all of their choices. Plus they don't actually export all the data; for instance there is no link to a thumbnail of the movie, which is a show stopper for me.