Site Grabber


I made a small console application called Site Grabber that grabs all the HTML files from a website. If you have ever seen screen scraping utilities before, then this one is no different. You start it up by passing the URL you want to start off with as a parameter:

SiteGrabber.exe http://www.yourdomain.com/

The utility will go out to that URL and download the source code of the HTML returned. It saves that code to your hard drive. It then parses out all links from the source code and converts them to full URLs - just in case they are only relative URLs. For example, if one of the links on the page is “News.htm”, then it will change the link to “http://www.yourdomain.com/News.htm”.
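The link-conversion step can be sketched in a few lines of Python. This is not the program's actual source, only an illustration using the standard library; the names LinkParser and extract_links are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return every link in html as a full URL, resolved against page_url."""
    parser = LinkParser()
    parser.feed(html)
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return [urljoin(page_url, href) for href in parser.hrefs]


links = extract_links("http://www.yourdomain.com/", '<a href="News.htm">News</a>')
print(links)
```

With the example from above, the relative link "News.htm" comes back as "http://www.yourdomain.com/News.htm".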

My program then goes through all the links and repeats the process. It has enough brains to remember which links it has visited before, and it can also limit how many levels of links it follows from the starting page. So if page1.htm points to page2.htm, then it will only go to page2.htm if that isn’t already nested too deep. If you can’t catch on, don’t worry. I’m not always that great at explaining myself.
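If code reads easier than my explanation, here is a rough Python sketch of the idea. It is an assumption about how such a loop could look, not the program's real code: crawl and get_links are hypothetical names, and get_links stands in for "download the page and return its links".

```python
from collections import deque


def crawl(start_url, max_depth, get_links):
    """Visit pages breadth-first, never twice, and never deeper than max_depth.

    get_links is a hypothetical callback that returns the full URLs
    found on a page; a depth of 0 means only the start page is visited.
    """
    visited = {start_url}            # links we have already seen
    queue = deque([(start_url, 0)])  # (url, how many links deep it is)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue                 # nested too deep, do not follow its links
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny fake site: page1 links to page2, which links back and on to page3.
site = {
    "page1.htm": ["page2.htm"],
    "page2.htm": ["page1.htm", "page3.htm"],
    "page3.htm": [],
}
print(crawl("page1.htm", 1, lambda url: site.get(url, [])))
```

With a depth of 1 the sketch visits page1.htm and page2.htm but stops before page3.htm, and the link back to page1.htm is skipped because it was already visited.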

The program can also be restricted so that it does not follow external links - meaning that if you are on www.yourdomain.com, and a link points to www.anotherdomain.com, then it will not go to that other domain unless you explicitly tell it to do so. Here is the small help text that shows up if you get the arguments wrong:

SiteGrabber.exe <url> <depth> <offsite>

name    default  description
------- -------- -------------------------------------------------
url     Required The url that you wish to download. (required)
depth   0        The number of links to follow from the main page.
offsite false    Do you wish to follow links off-site?
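To make the defaults and the off-site check concrete, here is a small Python sketch. It is an illustration only: parse_args and is_offsite are made-up names, comparing host names is just one way to decide what counts as off-site, and treating the text "true" as the way to switch offsite on is an assumption.

```python
from urllib.parse import urlparse


def parse_args(argv):
    """Apply the defaults from the help text: depth 0, offsite false."""
    if not argv:
        raise SystemExit("SiteGrabber.exe <url> <depth> <offsite>")
    url = argv[0]
    depth = int(argv[1]) if len(argv) > 1 else 0
    # Assumption: offsite is switched on by passing the word "true".
    offsite = (argv[2].lower() == "true") if len(argv) > 2 else False
    return url, depth, offsite


def is_offsite(start_url, link):
    """True when link points to a different host than start_url."""
    return urlparse(link).netloc.lower() != urlparse(start_url).netloc.lower()


url, depth, offsite = parse_args(["http://www.yourdomain.com/"])
print(url, depth, offsite)
print(is_offsite(url, "http://www.anotherdomain.com/index.htm"))
```

A crawler built this way would skip any link where is_offsite returns True unless offsite was switched on.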

Now … I wonder if I could get some cash for this program. I have tons of scripts on the net, but I have never actively pursued cash for them before. It would probably have to be made a bit more friendly if I wanted to sell it. Bah! There are fewer than 200 lines of code in this program anyway.
