How to create a webcatching program?


It's the Information Age, and information can't be contained; it grows by the second. No single site, be it Yahoo! or any other, can meet the needs of today's visitor. The Internet is a whole new medium, and keeping track of hundreds of portal start-ups is annoying, yet you still end up wandering from site to site. Bookmarks make life simpler, agreed, but easier? Hmm...

Today I'll discuss how to grab content from all these happening places. I'll use Yahoo! NewsPod as the example, probably the most successful script for grabbing news from Yahoo!

The base methodology behind "grabbing" content is regular expressions (regex), which let you pull a specific portion out of the page you have fetched.
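
To make the idea concrete, here is a minimal sketch of a regex "grab" on a small in-memory string. The sample markup and the `<ul>` markers are made up for illustration and are not real Yahoo! output; the point is only how a capture group returns the text between two known markers.

```php
<?php
// Hypothetical sample markup, not real Yahoo! output.
$html = '<div><b>Top Stories</b><ul><li>First</li><li>Second</li></ul><p>more...</p></div>';

// "#" is the delimiter, so "/" needs no escaping. The "i" flag ignores case,
// "s" lets "." match newlines, and "(.*?)" stops at the first closing marker.
if (preg_match('#<ul>(.*?)</ul>#is', $html, $match)) {
    // $match[0] holds the whole match, $match[1] the captured group.
    echo $match[1], "\n";
} else {
    echo "markers not found\n";
}
```

Run against the sample string, this prints `<li>First</li><li>Second</li>`: everything between the two markers, without the markers themselves.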

Linear working:
(1) A socket connection (made behind the scenes) to http://www.yahoo.com/index.html
(2) Reading a predetermined (or file-size-determined) number of bytes of the page and storing them in a variable
(3) Searching for the part of interest and storing it in another variable
(4) Sorting out the result further for extra customization
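
Steps (1) and (2) can also be wrapped in a small helper that caps the download size and sets a timeout. This is only a sketch under assumptions: the function name `fetchPrefix`, the 10-second timeout and the user-agent string are my own choices for illustration, and remote URLs only work when `allow_url_fopen` is enabled.

```php
<?php
// Hypothetical helper: return at most $maxBytes from $url, or null on failure.
function fetchPrefix(string $url, int $maxBytes = 15000, float $timeout = 10.0): ?string
{
    // The "http" options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout, 'user_agent' => 'webcatcher-example/0.1'],
    ]);

    // Offset 0 and a length cap: read from the start, stop after $maxBytes.
    // "@" hides the warning on failure; the false return is handled below.
    $data = @file_get_contents($url, false, $context, 0, $maxBytes);

    return $data === false ? null : $data;
}
```

A call such as `$read = fetchPrefix("http://www.yahoo.com/index.html");` would then give either the first 15000 bytes of the page or `null`, which is easy to test for before running any search on it.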

<?php

$open = fopen("http://www.yahoo.com/index.html", "r");

if ($open === false) {
    die("Could not open the Yahoo! page.");
}

// The fopen call above opens the Yahoo! page through PHP's HTTP stream wrapper
// (allow_url_fopen must be enabled). fopen takes the page as the 1st argument and
// the opening mode ("r", read-only, in this case) as the 2nd.

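$read = stream_get_contents($open, 15000);

// stream_get_contents reads up to 15000 bytes of the page into $read. A single
// fread() call on a network stream can return as soon as one packet arrives,
// so it may stop well short of the requested length.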

fclose($open); 

// close the socket connection 

$search = preg_match('#</b></td><td width=100%>(.*?)</td></tr><tr><td\s+'
    . 'align=right colspan=2><br> <a href=r/xn><small>more\.\.\.</small>#is', $read, $printing);
if (!$search) { die("Headline block not found; the page layout may have changed."); }

// The magic line. eregi() was removed in PHP 7, so this uses preg_match() with the
// "i" (case-insensitive) and "s" (dot matches newlines) flags on the $read variable.
// It captures everything between the 1st instance of "</b></td><td width=100%>" and
// the 1st following instance of "</td></tr><tr><td align=right ...". In my experience
// the delimiters need to be complete HTML tags for the search to succeed, though
// logically that should not matter. On a match, the captured text lands in $printing[1].

$printing[1] = str_replace("href=\"/homer/?", "href=\"", $printing[1]);
$printing[1] = str_replace("</td></tr><tr><td valign=top><b> ", "", $printing[1]);
$printing[1] = str_replace(" </b></td><td>", "", $printing[1]);
$printing[1] = str_replace("<small>", "", $printing[1]);
$printing[1] = str_replace("</small>", "", $printing[1]); 

// The block above is plain string manipulation. The 1st line strips the /homer/?
// redirect prefix from each link; the last 4 remove the table and <small> markup that
// every headline has in common, so that each headline can be moved into its own array
// element without dragging the Yahoo! interface along.

$content = $printing[1];
$content = explode("·", $content); 

// This is where the headlines actually get split into array elements: explode()
// cuts the string at every "·" separator.

$headlines = sizeof($content); 

// counting the array elements 

for ($i = 0; $i < $headlines; $i++) {
    print "News HeadLine (No. " . ($i + 1) . ") : $content[$i]\n";
}

// this loops through each headline retrieved and prints it

?>
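
The clean-up and split steps can be tried offline on a sample string, without touching the network. The fragment below is invented for illustration (real Yahoo! markup will differ), and the trimming and filtering of empty pieces is an extra I am assuming is wanted, not something the script above does.

```php
<?php
// Hypothetical captured fragment; real markup will differ.
$fragment = '<small> <a href="/homer/?http://example.com/a">Story A</a> · '
          . '<a href="/homer/?http://example.com/b">Story B</a> </small>';

// str_replace() accepts parallel arrays, so one call covers every rule.
$fragment = str_replace(
    ['href="/homer/?', '<small>', '</small>'],
    ['href="',         '',        ''],
    $fragment
);

// Split on the separator, trim each piece, and drop empty leftovers.
$pieces    = array_map('trim', explode('·', $fragment));
$headlines = array_values(array_filter($pieces, 'strlen'));

foreach ($headlines as $i => $headline) {
    echo 'News headline (No. ', $i + 1, ') : ', $headline, "\n";
}
```

With the sample fragment this yields two headlines, each a clean link with the `/homer/?` prefix removed.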
Grabbing news is just one of many applications: sockets and regex can be used to pull out current, accurate and customizable data.
