What is Crawling?
Crawling is the process by which a search engine's bots (crawlers) discover and gather URLs to prepare them for indexing. Given a webpage as a starting point, a crawler traces all the valid links on that page, and as it moves from link to link it sends data about those pages back to Google's servers. It is therefore crucial that the pages on your website can be crawled: a page that cannot be crawled cannot be indexed.
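To make that loop concrete, here is a minimal sketch in Python of a breadth-first crawler: it starts from a seed URL, fetches each page, extracts its links, and queues unseen ones for later visits. The seed URL, the page limit, and the use of Python's standard library are illustrative assumptions, not a description of how Google's crawler actually works.

```python
# Minimal crawl-loop sketch: start from a seed URL, fetch each page,
# extract its links, and queue the ones not seen before.
# The seed URL and page limit below are illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch pages, record their links, follow the new ones."""
    queue, seen, crawled = deque([seed_url]), {seed_url}, 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # e.g. the server was down when the crawler visited the link
        crawled += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        print(f"crawled {url}: found {len(extractor.links)} links")


if __name__ == "__main__":
    crawl("https://example.com")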
However, not all links can be crawled, for one or more of the following reasons:
- The server was down when the crawler tried to visit the link.
- The link is generated by JavaScript, so crawlers that do not execute JavaScript cannot discover it.
- The link's URL is disallowed by the site's robots.txt file.
- The link carries a "nofollow" directive, either on the anchor itself or page-wide (both this and the robots.txt check are sketched in code after this list).
- No other page links to it and it is not listed in a sitemap.xml file; such a page is known as an orphan page.
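The robots.txt and nofollow cases are ones a crawler can check programmatically before following a link. The sketch below, assuming Python's standard urllib.robotparser and a hypothetical user-agent string "ExampleBot", shows one way to test whether a URL is disallowed and whether an anchor's rel attribute contains nofollow.

```python
# Two pre-crawl checks: is the URL allowed by robots.txt, and does the
# anchor's rel attribute say "nofollow"? The user-agent string and the
# example URLs are illustrative assumptions.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url, user_agent="ExampleBot"):
    """Return True if the site's robots.txt permits crawling this URL."""
    parsed = urlparse(url)
    parser = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)


def is_nofollow(rel_attribute):
    """Return True if an anchor's rel attribute tells crawlers not to follow it."""
    return "nofollow" in (rel_attribute or "").lower().split()


if __name__ == "__main__":
    print(allowed_by_robots("https://example.com/some/page"))
    print(is_nofollow("nofollow noopener"))  # True
```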
There are three cons to crawlers:
- Crawlers cannot differentiate data: they crawl your entire website regardless of which data you actually want to obtain.
- Crawled URLs are static: to get new information, or to have an updated page or site shown in search engines, it must be recrawled (a sketch of a cheap change check follows this list).
- Crawling can take a lot of time.
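As one hedged illustration of what a recrawl involves: before re-fetching and reprocessing a page, a crawler can ask the server whether anything changed since the last visit by sending a conditional request with the ETag it stored earlier. The URL and ETag value below are made up for the example; a real crawler would keep this state per URL.

```python
# Sketch of a recrawl change check using a conditional HTTP request.
# The URL and stored ETag are illustrative assumptions.
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def needs_reindexing(url, previous_etag=None):
    """Return True if the page changed since the crawl that recorded previous_etag."""
    headers = {"If-None-Match": previous_etag} if previous_etag else {}
    try:
        response = urlopen(Request(url, headers=headers), timeout=10)
    except HTTPError as err:
        if err.code == 304:  # 304 Not Modified: the stored copy is still current
            return False
        raise
    # A 200 response means fresh content; its new ETag would be stored for next time.
    return response.status == 200


if __name__ == "__main__":
    print(needs_reindexing("https://example.com/", previous_etag='"abc123"'))
```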
You may also be interested in learning more about what indexing is here.