
Ajax Crawling with PHP and Curl?

I apologise now if I put off some of the more general readers with this post, but I’ve struck upon a bit of a problem!

I have some automated code, written in PHP and cURL, that retrieves a mountain of information from a website, does some statistical analysis on it, and then presents me with a nice little report (having inserted the data into a MySQL database). It’s wonderful: doing the process manually would take maybe 2-3 hours every day, whereas as it is I wake up to a nice report sitting in my inbox every day with all the information in it.
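
For the curious, here’s a minimal sketch of the kind of fetch involved; the URL is made up for illustration, and the parsing, MySQL and email steps are elided:

    <?php
    // Minimal sketch: fetch a page with cURL, the way the report script does.
    // The URL is hypothetical; swap in the real target.
    $ch = curl_init('http://example.com/stats/daily');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow any redirects
    $html = curl_exec($ch);
    curl_close($ch);

    // ...parse $html, run the stats, insert into MySQL, email the report...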

Now I have a problem: the website that I crawl to get this information is converting to Ajax, and that presents me with a huge problem.

Web Spiders

Web spiders, for the most part, grab a page from a server, make a list of the links in that page, and then go off and repeat the process on every link they’ve found (each one triggering a different database or variable call on the website).
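
In rough terms, that classic spider loop looks something like this sketch (the start URL is hypothetical, and the link filtering is deliberately naive):

    <?php
    // Sketch of the classic spider loop: fetch, list the links, queue, repeat.
    $queue = array('http://example.com/');
    $seen  = array();

    while (!empty($queue)) {
        $url = array_shift($queue);
        if (isset($seen[$url])) continue;
        $seen[$url] = true;

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false) continue;

        $dom = new DOMDocument();
        @$dom->loadHTML($html);                  // @ silences tag-soup warnings
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if (strpos($href, 'http') === 0) {   // naive: absolute links only
                $queue[] = $href;
            }
        }
    }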

Ajax isn’t quite so easy: most of the page isn’t even in the HTML! It’s inserted afterwards by JavaScript, mostly when the user clicks something. The user doesn’t navigate to a different page at all; they stay on the same page and let the JavaScript refresh the content (hence our crawler can’t make a list of the links!). Some websites allow for this by having a “lite” version; the one I’m using doesn’t 🙁
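
One partial workaround, for what it’s worth: if you watch the page in something like Firebug, you can sometimes spot the URL that the XMLHttpRequest actually calls, and hit that endpoint directly with cURL. A sketch, with an entirely made-up endpoint:

    <?php
    // Sketch: skip the browser and call the endpoint the JavaScript would call.
    // The endpoint, parameter and response format are all hypothetical.
    $ch = curl_init('http://example.com/ajax/getData.php?page=2');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Some endpoints only answer requests that look like XHR:
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    $response = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($response, true);        // many such endpoints return JSON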

Thinking like a Human

We need to make our crawler think and act like a human. Sounds easy enough, right? You’ve written a crawler before; surely you can do that?

Wrong! I can’t think of any logical way to get PHP to do this for me!

Any crawler process would need to be able to see events and states in the document that a real user might click on.

Some reading around this problem (and believe me, I have) suggests that the easiest possible way of doing this is to use an AJAX-enabled, event-driven reader. Heck, we all use one of these every day of the week (it’s your web browser, folks, whether it be IE, Firefox, Opera, Safari, etc.).

Using the Browser

There are a couple of tools around that seem to use the browser: Watir (which uses Ruby) and Crowbar (which uses a Mozilla-based browser).
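
As I understand it, Crowbar runs as a local web service (a headless Gecko listening on port 10000 by default) and hands back the DOM after the JavaScript has run, so in principle you can drive it from PHP with the same cURL calls as ever. A sketch, assuming the default port and the url/delay parameters I’ve seen mentioned:

    <?php
    // Sketch: ask a locally running Crowbar instance to render an Ajax page.
    // Assumes Crowbar's default port (10000) and its url/delay query parameters.
    $target = urlencode('http://example.com/ajax-heavy-page');
    $ch = curl_init("http://127.0.0.1:10000/?url=$target&delay=3000");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $rendered = curl_exec($ch);                  // the DOM *after* the JavaScript ran
    curl_close($ch);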

Does anyone have any other bright ideas before I spend hours fighting with yet another new technology?

This entry was posted on Saturday, November 29th, 2008 at 4:37 pm and is filed under PHP.


Comments (10)


  1. Magicroundabout says:

    I briefly Tweeted about the possibility of using a synthetic transaction tool to do this, but I think you’re already thinking along those lines with the tools you’ve linked to.

    There may be better GUI-driven tools around, but it may be a case of what you’re willing to pay for.

    In the meantime, looks like you might be off to Ruby school!

  2. Keiron says:

    I’ve tried the Crowbar approach with a little success, but I can’t achieve exactly what I want. The problem is that I need this to be completely automated, with no intervention! And then it needs to be able to insert the data into a database for analysis later as well 🙁

    I’m willing to pay for it, if it can achieve all of those things!

  3. marvellin says:

    Did you already succeed in doing this?

  4. Keiron says:

    Yes, after a fashion!

  5. marvellin says:

    So what do you use now to get this done? Watir, or still Crowbar? Could you try to teach the general steps? 😉
    Thanks, I’m actually in the same situation.

  6. Keiron says:

    I forget how the code works now, but it did what I needed in the end!

    We’ve just launched http://www.office-automation.co.uk based on feedback from clients on that sort of work. If you’ve got something you’d like us to take a look at, then let me know!

  7. marvellin says:

    wow, that was really useless. bookmark killed

  8. Keiron says:

    lol, as you may have noted:

    • I replied at five to midnight; I wasn’t going to go looking for the code then!
    • The post is over a year old; I can’t remember where I stored the code, or how I implemented it.

    I’ll go looking for the code when I’m on my own machine and see if I can make a generic enough post about it to fit most scenarios!

    That said, I have just launched Office Automation because there are a lot of people out there who could use this sort of functionality and don’t have the coding skills to make their lives easier!

  9. Keiron says:

    Looking back at the code that’s running in the cron job, I think I implemented it almost entirely in cURL, and it worked!

  10. Shah Khalid says:

    I’ve been working on the same problem for a long time, but there is no tool developed so far that can crawl AJAX applications efficiently. For now I’m trying to crawl the states of the page using some of the related desired information, but it will take some time.

    regards
