Wednesday, 15 May 2013

Working With The Scraped Data

Web Scraping With PHP & CURL [Part 1] was pretty short and simple, so I thought I’d follow it up rather quickly with Part 2 – Working With The Scraped Data.

In this part, we’re going to create a function that works with the data we scraped in Part 1. We’ll use it to scrape a specific section of data from the page, and to break the page up into sections that we can iterate over, scraping multiple sections of similar data into an array for further use.

Also, we’re going to introduce a couple of modifications to our cURL PHP function.

Then, we’ll put everything together using a real world example.

The Scraping Function

In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.
   
<?php
    // Defining the basic scraping function
    function scrape_between($data, $start, $end){
        $data = stristr($data, $start); // Stripping all data from before $start (case-insensitive match)
        if ($data === false) {
            return "";  // $start was not found, so there is nothing to scrape
        }
        $data = substr($data, strlen($start));  // Stripping $start
        $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape (case-insensitive match)
        if ($stop === false) {
            return "";  // $end was not found, so there is nothing to scrape
        }
        $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
        return $data;   // Returning the scraped data from the function
    }
?>

The comments in the function should explain it pretty clearly, but just to clarify further:

1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, at which point you wish to scrape from), $end (string, at which point you wish to finish scraping).

2. stristr() is used to strip all data from before the $start position.

3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.

4. stripos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.

5. The data we wanted scraped, in $data, is returned from the function.
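To see those steps on a concrete input, here’s a small sketch that walks through them by hand. The $html string and the intermediate variable names are made up for illustration and aren’t part of the function itself.

```php
<?php
// Hypothetical sample input, standing in for a downloaded page
$html  = "<html><head><TITLE>Example Page</TITLE></head><body>Hello</body></html>";
$start = "<title>";
$end   = "</title>";

// Step 2: drop everything before $start (stristr() ignores case)
$after_start = stristr($html, $start);
// $after_start now begins with "<TITLE>Example Page</TITLE>..."

// Step 3: drop $start itself
$without_start = substr($after_start, strlen($start));
// $without_start now begins with "Example Page</TITLE>..."

// Step 4: find $end (stripos() also ignores case) and cut everything from there on
$stop   = stripos($without_start, $end);
$result = substr($without_start, 0, $stop);

// Step 5: this is the value the function would return
echo $result . "\n"; // prints "Example Page"
```

Note that the tags are matched regardless of case, so an upper-case TITLE tag in the page is still found by a lower-case $start.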

In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.

Modifying The CURL Function

Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function from Part 1.
   
<?php  
    // Defining the basic cURL function
    function curl($url) {
        // Assigning cURL options to an array
        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
            CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
            CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
            CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) to wait while trying to connect
            CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time (in seconds) for cURL to execute the request
            CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
            CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
        );
        
        $ch = curl_init();  // Initialising cURL
        curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

If you look at the function above, it may seem rather different to the one we created in Part 1. However, it’s essentially the same, just with a few minor tweaks.

The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().
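To make the comparison concrete, here’s a minimal sketch that sets the same two options both ways. The URL is a placeholder, the variable names are just for illustration, and no request is actually sent. It assumes PHP’s cURL extension is installed.

```php
<?php
// Hypothetical URL, used only to fill in CURLOPT_URL; nothing is downloaded here
$url = "http://www.example.com/";

// One option at a time, using curl_setopt()
$ch_single = curl_init();
$ok_single = curl_setopt($ch_single, CURLOPT_RETURNTRANSFER, TRUE)
    && curl_setopt($ch_single, CURLOPT_URL, $url);

// The same options, set in one call from an array
$options = array(
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_URL => $url,
);
$ch_array = curl_init();
$ok_array = curl_setopt_array($ch_array, $options);   // TRUE only if every option was set successfully

var_dump($ok_single, $ok_array);
```

Both approaches leave the handle configured the same way. The array version just keeps all of the settings in one place, which gets more useful as the list of options grows.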

Aside from that, and the extra settings introduced, this function is exactly the same. So not really the same, but, yeah…

The extra cURL settings that have been added are CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS, CURLOPT_USERAGENT. They are explained in the comments of the function above.

Putting It All Together

We place both of those functions in our PHP script and we can use them like so:
   
<?php
    $scraped_page = curl("http://www.imdb.com");    // Downloading IMDB home page to variable $scraped_page
    $scraped_data = scrape_between($scraped_page, "<title>", "</title>");   // Scraping downloaded data in $scraped_page for content between <title> and </title> tags
    
    echo $scraped_data; // Echoing $scraped data, should show "The Internet Movie Database (IMDb)"
?>

As you can see, this small scraper visits the IMDb website, downloads the page and scrapes the page title from between the ‘title’ tags, then echoes the result.

Scraping Multiple Data Points From A Web Page

Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.

So, we’ll expand on this a bit and scrape multiple data points from a web page.

For this, we’ll still be using IMDb as our target site. However, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.

First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:

http://www.imdb.com/search/title?title=goodfellas

In this URL, goodfellas is the keyword being searched for.

title is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless, as it’s only going to return a single value that we’d actually want. So, instead, let’s try searching by genre:

http://www.imdb.com/search/title?genres=action

Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.
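If you’d rather build these URLs in code than type them out, a sketch along these lines would do it. The build_search_url() helper is hypothetical (it isn’t part of this series); it just wraps PHP’s built-in http_build_query() function around the search page address shown above.

```php
<?php
// Hypothetical helper: builds an IMDb title-search URL from attribute => value pairs
function build_search_url($attributes) {
    $base = "http://www.imdb.com/search/title";   // The search page used in this article
    return $base . "?" . http_build_query($attributes);   // http_build_query() URL-encodes the keys and values
}

echo build_search_url(array("title" => "goodfellas")) . "\n";
echo build_search_url(array("genres" => "action")) . "\n";
```

The upside of http_build_query() is that it URL-encodes the values for you, so a keyword containing a space comes out with a + in its place rather than breaking the URL.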

Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script, which returns a page containing the list of results we want to scrape.

Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.

   
<?php
    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the URL we want to scrape to the variable $url
    $results_page = curl($url); // Downloading the results page using our curl() function
    
    $results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">"); // Scraping out only the middle section of the results page that contains our results
    
    $separate_results = explode("<td class=\"image\">", $results_page);   // Exploding the results into separate parts in an array
        
    $results_urls = array();    // Initialising the array that will hold the scraped URLs

    // For each separate result, scrape the URL
    foreach ($separate_results as $separate_result) {
        if ($separate_result != "") {
            $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title="); // Scraping the relative URL of the result and prepending the IMDb domain - Adding this full URL to our URL array
        }
    }
    
    print_r($results_urls); // Printing out our array of URLs we've just scraped
?>

Now with an explanation of what’s happening here, if it’s not already clear.

1. Assigning the search results page URL we want to scrape to the $url variable.

2. Downloading the results page using our curl() function.

3. Here, on line 5, we are scraping out just the section of results we need, stripping away the header, sidebar, etc.

4. We need to identify each search result by a common string that can be used to explode the results. This string, which every result has, is <td class="image">. We use this to explode the results into the array $separate_results.

5. For each separate result, if it’s not empty, we scrape the URL data from between the start point of href=" and the end point of " title= and add it to our $results_urls array. But, in the process, because IMDb uses relative path URLs instead of full path, we need to prepend http://www.imdb.com to our result to give us a full URL which can later be used.

6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
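If you want to try the explode-and-iterate steps without hitting IMDb, here’s a sketch that runs them on a made-up snippet of HTML. The $sample markup and its two film paths are invented for illustration, and scrape_between() is repeated (with checks for a missing $start or $end) so the sketch runs on its own.

```php
<?php
// A copy of scrape_between(), with checks for a missing $start or $end
function scrape_between($data, $start, $end){
    $data = stristr($data, $start);
    if ($data === false) { return ""; }
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    if ($stop === false) { return ""; }
    return substr($data, 0, $stop);
}

// Hypothetical results markup, standing in for the section scraped from the results page
$sample = "<div id=\"main\"><table>"
    . "<tr><td class=\"image\"><a href=\"/title/tt0000001/\" title=\"First Film\">img</a></td></tr>"
    . "<tr><td class=\"image\"><a href=\"/title/tt0000002/\" title=\"Second Film\">img</a></td></tr>"
    . "</table></div>";

// Splitting on the string that every result shares
$parts = explode("<td class=\"image\">", $sample);
// $parts[0] is the markup before the first result, so it holds no URL

$urls = array();
foreach ($parts as $part) {
    $path = scrape_between($part, "href=\"", "\" title=");   // Relative path of the result, or "" if there is none
    if ($path != "") {
        $urls[] = "http://www.imdb.com" . $path;   // Prepending the domain to get a full URL
    }
}

print_r($urls);
```

One thing this makes visible is that explode() always returns the text before the first match as its first element. In this sketch that piece contains no link, so checking that something was actually scraped before adding it keeps a bare http://www.imdb.com entry out of the array.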

That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs. So, up next time we’re going to cover traversing the pages of a website to scrape data from multiple pages and organise the data in a logical structure.

Source: http://www.jacobward.co.uk/working-with-the-scraped-data-part-2/
