A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process.
Many applications, most notably search engines, crawl websites every day to find up-to-date data.
Most web crawlers save a copy of each visited page so they can index it later; the rest scan pages only for specific content, such as email addresses (for spam).
How does it work?
A crawler needs a starting point, which is a web address (a URL).
To browse the internet, the crawler uses the HTTP protocol, which lets it talk to web servers and download data from them or upload data to them.
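For example, fetching a single page over HTTP can be sketched in a few lines of Python (http://example.com is only a placeholder starting URL for the example):

```python
import urllib.request

# Download one page over HTTP; the URL here is just a placeholder.
with urllib.request.urlopen("http://example.com") as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # show the beginning of the downloaded HTML
```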
The crawler fetches this URL and then scans the page for hyperlinks (the A tag in HTML).
It then fetches those links and continues in the same way.
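To make that loop concrete, here is a rough sketch in Python of how such a crawler could work. The LinkCollector class, the crawl function, and the max_pages limit are illustrative names chosen for this example, not part of any particular library:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a single page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch a page, collect its links, then visit them."""
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        visited += 1
        print(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

# crawl("http://example.com")  # placeholder starting point
```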
That is the basic idea. How the crawler proceeds from there depends entirely on the purpose of the software itself.
If we only want to grab emails, we would search the text of each web page (including its hyperlinks) for email addresses. This is the easiest type of software to...
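As an illustration of the email-grabbing case, a simple Python sketch could scan the downloaded HTML with a regular expression. The extract_emails helper and the pattern below are assumptions made for this example; the pattern is deliberately loose and will miss or over-match some addresses:

```python
import re

# A simple, deliberately loose pattern for spotting email-like strings in page text.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(html):
    """Return the unique email-like strings found in a page's raw HTML."""
    return set(EMAIL_PATTERN.findall(html))

# Example: extract_emails('<a href="mailto:someone@example.com">contact</a>')
# returns {'someone@example.com'}.
```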