Comments on: XULRunner and Crowbar – Crawling of sorts?

By: John

John — Thu, 02 Sep 2010 14:19:45 +0000

Excellent! I’m glad to hear you got it sorted in the end.
Crowbar is an interesting program… you’d think that reading/interpreting JavaScript would be something that all webspiders would be able to do – yet NONE of them do it for the simple reason that understanding JavaScript and the DOM model requires, essentially, a full browser. Building a full browser into a spidering engine is overkill for just this little bit of added functionality – but when you need to scrape JS, you need to scrape JS!
As a result, the only program that will allow you to do headless JS processing is XULRunner/Crowbar… but Crowbar doesn’t understand cookies!
If your outsourced method doesn’t do it, I guess the only other option is either to modify Crowbar to understand cookies and send them to XULRunner, or to point the Crowbar proxy at another proxy which can inject cookies into the headers, and send/receive/modify cookies that way.

Both ways would work, and the latter way would probably be more extensible, but it’s not very neat… not to mention both would require me to know how to code applications for XULRunner (which I think is actually all JavaScript and a bit of C using a bridging library, but still… all I can code in is PHP and HTML :P).
I really don’t fancy doing that – so this outsourced source code of yours, depending on how it works, could *really* help me out. Certainly save me a LOT of time!!
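
For anyone weighing up the second option (a proxy that injects cookies into the request headers), the core step is small. Below is a minimal Python sketch of just that step, not a working proxy and not anything Crowbar provides. The function name `inject_cookies` and its `headers`/`cookies` arguments are hypothetical, chosen for illustration, and it assumes the request headers are available as a plain dict.

```python
from http.cookies import SimpleCookie


def inject_cookies(headers, cookies):
    """Return a copy of `headers` with `cookies` merged into the Cookie header.

    `headers` maps request header names to values and `cookies` maps cookie
    names to values (both hypothetical shapes for this sketch). An injected
    cookie replaces an existing cookie of the same name.
    """
    jar = SimpleCookie()

    # Header names are case-insensitive, so look for any spelling of "Cookie".
    existing_key = next((k for k in headers if k.lower() == "cookie"), None)
    if existing_key is not None:
        jar.load(headers[existing_key])

    # Add or overwrite the cookies we want the upstream server to see.
    for name, value in cookies.items():
        jar[name] = value

    merged = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

    # Rebuild the headers with a single, normalised Cookie entry.
    result = {k: v for k, v in headers.items() if k.lower() != "cookie"}
    if merged:
        result["Cookie"] = merged
    return result
```

A proxy sitting between Crowbar and the target site would call something like this on each outgoing request before forwarding it. Capturing `Set-Cookie` headers from responses would need a matching step in the other direction.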

So yeah, hehe, thanks a lot for your help – and thanks a lot for this blog! It’s probably the only resource on what Crowbar is that exists besides the under-loved and cryptic Crowbar homepage!! Good job!

By: Keiron

Keiron — Thu, 02 Sep 2010 09:26:52 +0000

In reply to John.

Hi John,

I outsourced it in the end as I needed to get it done quickly. Interestingly, I have another project coming up that may need it, so I need to dig out the source code. I’ll let you know once I can define some decent examples!

By: John

John — Fri, 27 Aug 2010 10:58:41 +0000

Keiron,

So you managed to crawl your pages with PHP using cURL and Crowbar?!
I would love to see that source code, man. I’m having a bit of a tough time curling my way into some JavaScript, and even with Crowbar installed and ready to go I don’t seem to get the results I want.

What did the curl line you used to call JavaScript pages look like?

All the best,

~ John
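
The thread never shows the actual call, so the following is only a rough Python sketch of how such a request might be built. It rests on assumptions that are not confirmed anywhere in this thread: that Crowbar is running locally on `http://127.0.0.1:10000/`, and that it accepts form-encoded `url` and `delay` parameters. The helper name `build_crowbar_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: address of a locally running Crowbar instance.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(target_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build (but do not send) a POST request asking Crowbar to render a page.

    `url` and `delay` are assumed parameter names: the page to load, and how
    long to wait in milliseconds for its JavaScript to run.
    """
    body = urlencode({"url": target_url, "delay": delay_ms}).encode("ascii")
    return Request(endpoint, data=body, method="POST")
```

If those assumptions hold, passing the result to `urllib.request.urlopen` should return the page as rendered after its scripts have run. The same form-encoded body could be sent from curl with `--data`, or from PHP via its cURL extension.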

By: Keiron

Keiron — Wed, 14 Jan 2009 17:06:55 +0000

In reply to Dan.

I eventually resorted to outsourcing it via rentacoder to an absolutely excellent coder in the US.

He provided exactly what I needed in PHP to the spec I provided with no extensions or the like! Was really pleased with his work – I think he was kind of surprised when I had no complaints or changes that needed making as well!

By: Dan

Dan — Wed, 14 Jan 2009 15:51:19 +0000

FireWatir might be a better tool for you:

http://wiki.openqa.org/display/WTR/FireWatir

Dan

By: Using Subversion to get Crowbar | Skillett.com

Using Subversion to get Crowbar | Skillett.com — Sun, 30 Nov 2008 10:51:43 +0000

[…] post is just a reference point for another post, a Subversion client is needed for downloading Crowbar, so I downloaded TortoiseSVN available […]