XULRunner and Crowbar – Crawling of sorts?

This was going to be a tutorial on getting these two things running to achieve everything I want. Sadly, I can’t work out how to get the last step working, which is to navigate the returned Ajax page so that I can extract different information.

As such, this is more a guide to getting the two things installed and working – if you have any more luck than I did getting Ajax navigation working, let me know!

XULRunner

First things first: I downloaded the Windows version of XULRunner (look in the runtimes directory!) from:

http://releases.mozilla.org/pub/mozilla.org/xulrunner/releases/

(Unpacking takes a while: the 8.23MB download contained 302 items totalling 18.8MB!)

Crowbar

Not such a simple download for the uninitiated. It’s not actually released, so its files live in a Subversion repository and you’ll need a Subversion client to download it. I don’t have one on the machine I’m working on, so another post will cover the ins and outs of downloading Crowbar with Subversion.

All Downloaded and Unpacked – Onwards we go!

Back to the instructions here, which tell me, once I’ve done all this, to open a command prompt (thankfully a place I’m familiar with) and run:

c:\> %XULRUNNER_HOME%\xulrunner.exe --install-app %CROWBAR%\xulapp
c:\> cd %CROWBAR%\xulapp
c:\> %XULRUNNER_HOME%\xulrunner.exe application.ini

Windows Firewall blocked the program, but that was kind of expected, so I unblocked it.
I now have a Crowbar window and an Error Console. Apparently I can use Crowbar by visiting:

http://127.0.0.1:10000/

On doing so, a nice little web window pops up, similar to a web proxy, asking me what page I want to fetch.
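Before pointing any script at Crowbar, it can help to confirm that something is actually listening on that address. The snippet below is a minimal sketch of such a check in Python. The function name `is_listening` is made up for illustration, and the host and port are simply the ones from the URL above. It only proves that a TCP connection is accepted, not that the program answering is Crowbar.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 10000, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    The defaults mirror the address quoted in the post; the function
    name is hypothetical and not part of Crowbar or XULRunner.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False right after starting XULRunner, the firewall prompt mentioned above is the first thing worth checking.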

I entered the address of my Ajax-based page and, the next thing I know, I’m being presented with all the source code for that page, which includes all the output from the JavaScript that wouldn’t be there when I did a PHP curl GET on the page!

Now apparently I can run this using curl (why can I see myself having to install a fair bit of software on my laptop to get this all working over there?).
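As a rough illustration of what a scripted request might look like, here is a minimal Python sketch that posts a target address to the local Crowbar endpoint and reads back the rendered source. Loud assumption: the form field names `url` and `delay` are not confirmed anywhere in this post, so check them against the Crowbar documentation before relying on them. The function names and the default delay are also made up for the example.

```python
import urllib.parse
import urllib.request

# Address taken from the post; everything else below is illustrative.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(target_url: str, delay_ms: int = 3000) -> urllib.request.Request:
    """Build a POST request asking Crowbar to load target_url.

    ASSUMPTION: Crowbar reads form fields named "url" and "delay"
    (milliseconds to wait for scripts to run). Verify before use.
    """
    form = urllib.parse.urlencode({"url": target_url, "delay": str(delay_ms)})
    return urllib.request.Request(
        CROWBAR_ENDPOINT, data=form.encode("ascii"), method="POST"
    )


def fetch_rendered(target_url: str, delay_ms: int = 3000, timeout: float = 60.0) -> str:
    """Send the request and return whatever Crowbar responds with, as text."""
    request = build_crowbar_request(target_url, delay_ms)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered("http://example.com/")` with Crowbar running should, under those assumptions, return the same source that the web form shows. The request-building step is kept separate so it can be inspected without a running Crowbar instance.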

OK, so all well and good: we’ve fetched one page. But that page has a dropdown box on it that forces the entire page to change, so how do I go about “Crowbarring” my way around that?
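This does not solve the navigation problem, but one partial step is to at least list what the dropdown offers, using the source that Crowbar returns. The sketch below uses Python’s standard `html.parser` to collect the value and label of every option inside a select element. The class and function names are made up for illustration. Whether each option corresponds to something that can be fetched through Crowbar again depends entirely on how the page is built, and is an untested assumption.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (value, label) pairs for <option> tags inside any <select>."""

    def __init__(self):
        super().__init__()
        self.options = []        # finished (value, label) pairs
        self._in_select = False
        self._current = None     # [value, text parts] while inside an <option>

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self._in_select = True
        elif tag == "option" and self._in_select:
            self._close_option()  # tolerate a missing </option>
            self._current = [dict(attrs).get("value"), []]

    def handle_endtag(self, tag):
        if tag == "option":
            self._close_option()
        elif tag == "select":
            self._close_option()
            self._in_select = False

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def _close_option(self):
        if self._current is not None:
            value, parts = self._current
            label = "".join(parts).strip()
            # An option without a value attribute submits its label text.
            self.options.append((value if value is not None else label, label))
            self._current = None


def list_options(html: str):
    """Return every dropdown option found in the given HTML source."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return parser.options
```

If choosing an option only changes a query string or form field, the listed values might be enough to build follow-up requests. If the change happens purely in script, they will not be.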

With little documentation I can’t see a way… Back to the drawing/scraping board?

This entry was posted on Sunday, November 30th, 2008 at 10:50 am and is filed under Programming.


Comments (6)

 

  1. […] post is just a reference point for another post, a Subversion client is needed for downloading Crowbar, so I downloaded TortoiseSVN available […]

  2. Dan says:

    FireWatir might be a better tool for you:

    http://wiki.openqa.org/display/WTR/FireWatir

    Dan

  3. Keiron says:

    I eventually resorted to outsourcing it via rentacoder to an absolutely excellent coder in the US.

    He provided exactly what I needed in PHP to the spec I provided with no extensions or the like! Was really pleased with his work – I think he was kind of surprised when I had no complaints or changes that needed making as well!

  4. John says:

    Keiron,

    So you managed to crawl your pages with PHP using curl and Crowbar?!
    I would love to see that source code, man. I’m having a bit of a tough time curling my way into some JavaScript, and even with Crowbar installed and ready to go I don’t seem to get the results I want.

    What did the curl line you used to call Javascript pages look like?

    All the best,

    ~ John

  5. Keiron says:

    Hi John,

    I outsourced it in the end as I needed to get it done quickly. Interestingly, I have another project coming up that may need it, so I need to dig out the source code. I’ll let you know once I can put together some decent examples!

  6. John says:

    Excellent! I’m glad to hear you got it sorted in the end.
    Crowbar is an interesting program… you’d think that reading/interpreting JavaScript would be something that all web spiders would be able to do – yet NONE of them do it, for the simple reason that understanding JavaScript and the DOM requires, essentially, a full browser. Building a full browser into a spidering engine is overkill for just this little bit of added functionality – but when you need to scrape JS, you need to scrape JS!
    As a result, the only program that will allow you to do headless JS processing is XULRunner/Crowbar… but Crowbar doesn’t understand cookies!
    If your outsourced method doesn’t do it, I guess the only other options are either to modify Crowbar to understand cookies and send them to XULRunner – OR – to point the Crowbar proxy at another proxy which can inject cookies into the headers, and send/receive/modify cookies that way.

    Both ways would work, and the latter way would probably be more extensible, but it’s not very neat… not to mention both would require me to know how to code applications for XULRunner (which I think is actually all JavaScript and a bit of C using a bridging library, but still… all I can code in is PHP and HTML :P).
    I really don’t fancy doing that – so this outsourced source code of yours, depending on how it works, could *really* help me out. Certainly save me a LOT of time!!

    So yeah, hehe, thanks a lot for your help – and thanks a lot for this blog! It’s probably the only resource on what Crowbar is that exists besides the under-loved and cryptic Crowbar homepage!! Good job!
