From the category archives:
Programming
XULRunner and Crowbar – Crawling of sorts?
This was going to be a tutorial on getting these two things running to do everything I want. Sadly, I can’t work out how to get the last step working: navigating the Ajax page that comes back so that I can extract different information.
As such, this is more of a guide to getting the two things installed and working. If you have any more luck than I did getting Ajax navigation to work, let me know!
XULRunner
First things first: I downloaded the Windows version of XULRunner (look in the runtimes directory!) from:
http://releases.mozilla.org/pub/mozilla.org/xulrunner/releases/
(Unpacking takes a while: the 8.23MB download contained 302 items totalling 18.8MB!)
Crowbar
Not such a simple download for the uninitiated. It’s not actually released, so its files live in a Subversion repository and you’ll need a Subversion client to download them. I don’t have one on the machine I’m working on, so another post will cover the ins and outs of downloading Crowbar with Subversion.
All Downloaded and Unpacked – Onwards we go!
Back to the Crowbar instructions, which tell me, once I’ve done all this, to open a command prompt (thankfully a place I’m familiar with) and run:
c:\> %XULRUNNER_HOME%\xulrunner.exe --install-app %CROWBAR%\xulapp
c:\> cd %CROWBAR%\xulapp
c:\> %XULRUNNER_HOME%\xulrunner.exe application.ini
Windows Firewall blocked the program, but that was kind of expected, so I unblocked it.
I now have a Crowbar window and an Error Console. Apparently I can use Crowbar by visiting:
http://127.0.0.1:10000/
On doing so, a nice little web window pops up similar to a web proxy, asking me what page I want to fetch.
I entered the address of my Ajax-based page and the next thing I know, I’m being presented with all the source code for that page, including all the output from the JavaScript that wasn’t there when I fetched the page with PHP and curl!
Now apparently I can run this using curl (why can I see me having to install a fair bit of software on my laptop to get this all working over there?).
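For anyone who would rather script this step than use the web form, here is a minimal Python sketch of the same request. It assumes Crowbar accepts a form-encoded POST on the address above with a `url` field and a `delay` field (in milliseconds); those field names are my recollection of the Crowbar instructions rather than something verified here, and the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Address of the local Crowbar server, as given in the post.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(target_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build (but do not send) a POST asking Crowbar to load target_url.

    ASSUMPTION: the form field names 'url' and 'delay' are not confirmed
    by this post; check them against the Crowbar instructions.
    """
    body = urlencode({"url": target_url, "delay": delay_ms}).encode("ascii")
    return Request(endpoint, data=body, method="POST")


def fetch_rendered(target_url, delay_ms=3000):
    """Send the request and return whatever Crowbar responds with as text."""
    request = build_crowbar_request(target_url, delay_ms)
    with urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered("http://example.com/")` only works while Crowbar is running locally. Building the request separately from sending it means the form encoding can be checked without a server.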
OK, so all well and good: we’ve fetched one page. But that page has a dropdown box on it that forces the entire page to change, so how do I go about “Crowbarring” my way around that?
With so little documentation, I can’t see a way… Back to the drawing/scraping board?
{ 3 comments }
Using Subversion to get Crowbar
This post is just a reference point for another post. A Subversion client is needed for downloading Crowbar, so I downloaded TortoiseSVN.
A quick install with the default options and I was away.
I created a new folder on my desktop called Crowbar, went into it and right-clicked (TortoiseSVN is an Explorer extension, so there’s no program to run directly!).
TortoiseSVN -> Export
Then I pasted in the trunk URL from the Crowbar instructions page.
{ 3 comments }
Ajax Crawling with PHP and Curl?
I apologise now if I put off some of the more general readers with this post, but I’ve struck upon a bit of a problem!
I have some automated code that I wrote using PHP and curl that retrieves a mountain of information from a website, inserts the data into a MySQL database, does some statistical analysis on it and then presents me with a nice little report. It’s wonderful: doing the process manually would take maybe 2-3 hours every day, whereas as it is I wake up every day to a nice report sat in my inbox with all the information in it.
Now I have a problem. The website that I crawl to get this information is converting to Ajax, and this presents me with a huge problem…
Web Spiders
Web spiders, for the most part, grab a page from a server, make a list of the links in the page and then go off and repeat the process on every link they’ve found (each one triggering a different database or variable call in a website).
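That fetch, list, repeat loop can be sketched in a few lines. This is a minimal illustration in Python rather than the PHP and curl code this post is about. The `fetch` argument is a stand-in for whatever downloads a page, and the `example.test` site is invented so the sketch runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) returns HTML, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory, used in place of real HTTP.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">Home</a>',
}

print(crawl("http://example.test/", SITE.get))
```

In real use, `fetch` would wrap an HTTP client. Passing it in as a function keeps the crawl logic testable on its own.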
Ajax isn’t quite so easy: most of the page isn’t even in the HTML! It’s inserted afterwards by JavaScript, mostly when the user clicks something. The user doesn’t navigate to a different page but stays on the same one while the JavaScript refreshes it, so our crawler can’t make a list of links. Some websites allow for this by having a “lite” version; the one I’m using doesn’t.
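To make the problem concrete, here is a small Python sketch with invented markup. It counts the `<a>` tags a plain HTML parser can see in a page as the server sends it, and again in the same page after a browser has run the script. Both page strings are hypothetical examples, not taken from the site I crawl.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> start tags that exist as real markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_anchors(html):
    counter = AnchorCounter()
    counter.feed(html)
    return counter.count


# What the server sends: the links only exist inside a script string.
AJAX_PAGE = """
<html><body>
<div id="results"></div>
<script>
  document.getElementById("results").innerHTML =
      '<a href="/report/1">Report 1</a><a href="/report/2">Report 2</a>';
</script>
</body></html>
"""

# What the document looks like once a browser has executed the script.
RENDERED_PAGE = """
<html><body>
<div id="results"><a href="/report/1">Report 1</a><a href="/report/2">Report 2</a></div>
</body></html>
"""

print(count_anchors(AJAX_PAGE), count_anchors(RENDERED_PAGE))
```

The parser treats the contents of `<script>` as opaque text, so it finds no links in the first page and two in the second. A crawler working from the raw response is in the same position.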
Thinking like a Human
We need to make our crawler think and act like a human. Sounds easy enough, right? You’ve written a crawler before, so surely you can do that?
Wrong! I can’t think of any logical way to get PHP to do this for me!
Any crawler process would need to be able to see events and states in the document that a real user might click on.
Some reading around this problem (and believe me, I have done some) suggests that the easiest way of doing this is to create an Ajax-enabled, event-driven reader. Heck, we use one of these every day of the week: it’s your web browser, folks, whether it be IE, Firefox, Opera, Safari and so on.
Using the Browser
There are a couple of tools around that seem to use the browser: Watir (using Ruby) and Crowbar (which uses a Mozilla-based browser).
Does anyone have any other bright ideas before I spend hours fighting with yet another new technology?
{ 9 comments }
Mozilla Prism!
I was reading John’s blog today and spotted a link to Mozilla Labs. I’ve always liked some of the Google Labs projects and I’m an avid fan of both Firefox and Thunderbird, so what better way to see some of the things that Mozilla have got on their minds?
I’d only read about one project before I started typing, but already it strikes me as something remarkably simple and such a good idea!
Prism notes that:
Personal computing is currently in a state of transition. While traditionally users have interacted mostly with desktop applications, more and more of them are using web applications. But the latter often fit awkwardly into the document-centric interface of web browsers. And they are surrounded with controls–like back and forward buttons and a location bar–that have nothing to do with interacting with the application itself.
So Prism will let you add these applications straight into your Start Menu (an eBay application link, a Facebook application link and so on).
You can see the benefits for some users already! Is anyone using this?
{ 8 comments }
CSS Style overload – Remove unused CSS Styles
I design sites (not well most of the time!) and often pinch bits and bobs from the various CSS stylesheets I’ve created over the years. I normally end up with something that looks at least half decent!
However, all that cutting and pasting leaves my stylesheets in a mess, with tonnes of unused CSS styles!
I spotted a reference to Dust-Me Selectors today on San Baldo, and using this fantastic Firefox extension I have managed, in the space of minutes, to reduce one stylesheet from an unmanageable 600 lines to a mere 200!
It extracts all the selectors from all the stylesheets on the page you’re viewing, then analyzes that page to see which of those selectors are not used. The data is then stored in your user preferences, so that as you continue to navigate around a site, selectors will be crossed off the list as they’re encountered.
You’ll end up with a profile of which selectors are not used anywhere on the site.
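The crossing-off idea is easy to sketch. What follows is not how the extension is implemented; it is a deliberately simplified Python illustration that only understands bare tag, `.class` and `#id` selectors and will trip over things like `@media` blocks or descendant selectors. The stylesheet and pages in the usage lines are made up.

```python
import re
from html.parser import HTMLParser


class PageInventory(HTMLParser):
    """Records which tag names, classes and ids occur on one page."""

    def __init__(self):
        super().__init__()
        self.tags, self.classes, self.ids = set(), set(), set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())
            elif name == "id" and value:
                self.ids.add(value)


def extract_selectors(css):
    """Pull the selectors out of a stylesheet (simple rules only)."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)  # drop comments
    selectors = []
    for group in re.findall(r"([^{}]+)\{", css):
        selectors.extend(s.strip() for s in group.split(",") if s.strip())
    return selectors


def is_used(selector, page):
    """ASSUMPTION: selector is a bare tag, .class or #id, nothing fancier."""
    if selector.startswith("."):
        return selector[1:] in page.classes
    if selector.startswith("#"):
        return selector[1:] in page.ids
    return selector.lower() in page.tags


def unused_selectors(css, pages):
    """Cross selectors off the list as each page is 'visited'."""
    remaining = extract_selectors(css)
    for html in pages:
        page = PageInventory()
        page.feed(html)
        remaining = [s for s in remaining if not is_used(s, page)]
    return remaining


# Hypothetical stylesheet and two pages of a site.
CSS = """
h1 { color: red; }
.menu, .old-banner { margin: 0; }
#footer { padding: 1em; }
/* legacy */
#sidebar { float: left; }
"""
PAGES = ['<h1 class="menu">Home</h1>', '<div id="footer"></div>']

print(unused_selectors(CSS, PAGES))
```

Whatever is left in the list after every page has been fed through is a candidate for deletion, which is the same profile the extension builds up as you browse.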
{ 13 comments }
Updated CakePHP Baking
It appears that the CakeBaker may renew my interest in CakePHP, having highlighted some projects that use it.
- Island Cruises looks fantastic!
- Dishola, looks very tempting!
- PokerInside, could be onto a winner!
So you obviously can do more than just make a notes application with CakePHP; now it’s just a case of what I do with it?!
{ 0 comments }
A first bite of CakePHP
I’ve long been interested in a rapid-development PHP framework to speed up my development time. I’m well aware that starting to use one could slow me down in the interim, but once I was fully up to speed with any of the major frameworks I would probably find it very easy.
So I dug into the CakePHP manual today. I’ll be honest, I found the manual heavy going, but I’ve always felt like that about any programming language; it’s far easier to learn by example. SitePoint have a tutorial that gives you a real first bite. I found the example really easy, but then it’s an example and you’re supposed to!
Technorati has little on CakePHP, other than people referring to the screencasts available on the CakePHP site.
Well, I’ve got my funky little notes application, but can I make something that really works with CakePHP? We’ll see, but I can see it being a steep learning curve. I may have to invest in CakePHP Recipes (£17.86 on Amazon) as a starting point.

{ 2 comments }
