From the category archives:

PHP

Ajax Crawling with PHP and Curl?

by Keiron on November 29, 2008

I apologise now if I put off some of the more general readers with this post, but I’ve struck upon a bit of a problem!

I have some automated code, written using PHP and Curl, that retrieves a mountain of information from a website, inserts the data into a MySQL database, does some statistical analysis on it and then presents me with a nice little report. It’s wonderful: doing the process manually would take maybe 2-3 hours every day, whereas as it is I wake up every day to a nice report sat in my inbox with all the information in it.

Now the website that I crawl to get this information is converting to Ajax, and that presents me with a huge problem.

Web Spiders

Web spiders, for the most part, grab a page from a server, make a list of the links in the page and then go off and repeat the process on every link they’ve found (each one triggering a different database or variable call in the website).
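The grab, list, repeat loop described above can be sketched roughly as follows. This is an illustration in Python rather than my actual PHP and Curl code, and everything named here (`LinkCollector`, `extract_links`, `crawl`, the `FAKE_SITE` dictionary and the `example.test` URLs) is made up for the example. The "site" is an in-memory dictionary so the sketch runs without touching the network; a real spider would fetch each page over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch` maps a URL to its HTML, or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved, skip it
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical stand-in for a real website (assumption: no network access).
FAKE_SITE = {
    "http://example.test/": '<a href="/stats?day=1">Day 1</a> '
                            '<a href="/stats?day=2">Day 2</a>',
    "http://example.test/stats?day=1": '<a href="/">Home</a>',
    "http://example.test/stats?day=2": "<p>No links here</p>",
}

if __name__ == "__main__":
    for page in crawl("http://example.test/", FAKE_SITE.get):
        print(page)
```

The `seen` set is what stops the spider going round in circles when pages link back to each other, as the Day 1 page does here.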

Ajax isn’t quite so easy: most of the page isn’t even in the HTML! It’s inserted afterwards by JavaScript, mostly when the user clicks something, so the user never navigates to a different page but stays on the same one while the JavaScript refreshes it (hence our crawler can’t make a list of the links!). Some websites allow for this by offering a “lite” version; the one I’m using doesn’t :(
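To make the difference concrete, here is a small sketch (again in Python for illustration, not my PHP code) that runs the same kind of link extraction over two hypothetical pages. `STATIC_PAGE`, `AJAX_PAGE`, `loadReport` and `find_hrefs` are invented for the example and are not taken from the site I crawl.

```python
from html.parser import HTMLParser


class AnchorFinder(HTMLParser):
    """Records the href of each <a> tag present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_hrefs(html):
    """Return every link a plain HTML-only crawler could see."""
    finder = AnchorFinder()
    finder.feed(html)
    return finder.hrefs


# Hypothetical old-style page: the link is right there in the markup.
STATIC_PAGE = '<ul><li><a href="/report?id=1">Report 1</a></li></ul>'

# Hypothetical Ajax-style page: clicking the span runs JavaScript that
# fetches the report and writes it into the empty div. No <a> tags exist.
AJAX_PAGE = """
<span onclick="loadReport(1)">Report 1</span>
<div id="content"></div>
<script>
function loadReport(id) { /* an XMLHttpRequest fills #content here */ }
</script>
"""

if __name__ == "__main__":
    print(find_hrefs(STATIC_PAGE))
    print(find_hrefs(AJAX_PAGE))
```

The second page gives the crawler nothing to follow, because the only "link" is a click handler that exists purely in JavaScript.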

Thinking like a Human

We need to make our crawler think and act like a human. Sounds easy enough, right? You’ve written a crawler before, so surely you can do that?

Wrong! I can’t think of any logical way to get PHP to do this for me!

Any crawler process would need to be able to see events and states in the document that a real user might click on.

Some reading around this problem (and believe me, I have done plenty) suggests that the easiest way of doing this is to create an Ajax-enabled, event-driven reader. Heck, we use one of these every day of the week (it’s your web browser, folks, whether it be IE, Firefox, Opera, Safari etc.).

Using the Browser

There are a couple of tools around that seem to use the browser: Watir (using Ruby) and Crowbar (which uses a Mozilla-based browser).

Does anyone have any other bright ideas before I spend hours fighting with yet another new technology?


Updated CakePHP Baking

by Keiron on February 18, 2007

It appears that CakeBaker may renew my interest in CakePHP, having highlighted some projects that use it.

So you obviously can do more than just make a notes application with CakePHP; now it’s just a question of what I do with it!


A first bite of CakePHP

by Keiron on February 18, 2007

I’ve long been interested in a rapid development PHP framework to speed up my development time. I’m well aware that starting to use one could slow me down in the interim, but once I was fully up to speed with any of the major frameworks I would probably find it very easy.

So I dug my head into the CakePHP manual today. I’ll be honest, I found the manual heavy going, but I’ve always felt like that about any programming language; it’s far easier to learn by example. SitePoint have a tutorial that gives you a real first bite. I found the example really easy, but then it’s an example and you’re supposed to!

Technorati has little on CakePHP, other than people referring to the screencasts available on the CakePHP site.

Well, I’ve got my funky little notes application, but can I make something that really works with CakePHP? We’ll see, but I can see it being a steep learning curve. I may have to invest in CakePHP Recipes (£17.86 on Amazon) as a starting point.


The Slug! in Newcastle