How device detection can stem the effects of crawlers

In the last decade, websites have been forced to adapt to emerging technologies and devices. Increased smartphone and tablet use has posed new challenges for businesses and for website optimisation.

Website interfaces must be aesthetically pleasing and fully functional across large desktop screens, tablets and small-screen smartphones. As well as design and functionality obstacles, there are now software applications such as embedded video players, along with the codecs they depend on, that make website experiences even more difficult to control.

The World Wide Web is full of information – if you want to know something, you can almost certainly find answers through a search engine. But how do search engines recommend the most relevant pages from the trillions that exist? The answer lies with web crawlers.

A web crawler is an Internet bot that systematically browses the World Wide Web with the aim of indexing websites for search engines. Search engines such as Google, MSN and Yahoo! use web crawling, otherwise known as spidering, to update their own content and their indexes of other sites’ web content. The bot copies the pages it visits for processing by a search engine, which indexes them so that user searches are more efficient.

Crawlers: Good vs bad

Many legitimate sites use ‘spidering’ as a means of providing up-to-date data. Google is renowned for having the most active web crawlers, closely followed by Bing. It is normal to see crawlers visiting a website, and if a web page is to be indexed by search engines, crawlers are necessary. Google’s level of activity is increasing due to the introduction of its mobile-friendly search algorithm – a bid to encourage businesses to update their websites in line with the mobile browsing revolution. Crawlers, and the effect they have, are therefore becoming increasingly important.

Spidering techniques can be just as damaging as they are helpful. Crawlers can be used to automate maintenance tasks on websites, but negative practices include harvesting email addresses from web pages for spamming purposes and submitting spam comments to website forms or blogs.

Crawlers can consume resources on the systems they visit and enter sites without explicit approval. Issues concerning the consumption of website resources, the damaging effect on page load time leading to a loss of revenue, and the wasted presentation of paid advertising all come into play when large collections of pages are accessed without permission.

Server-side concerns

Bots and crawlers are also capable of skewing server logs, damaging the validity of web traffic analysis. A bot that is ‘spidering’ an entire site may distort logs if a recognised user-agent is not supplied, making its requests difficult to distinguish from those of regular users. Increased server load will suggest a surge in website visitors; however, when website owners check services such as Google Analytics, they will find that the massive increase in traffic has not registered.
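As a rough illustration – not tied to any particular analytics product – the Python sketch below separates probable crawler requests from human ones by inspecting the user-agent field of a combined-format access log. The signature list and function names are illustrative, and because user-agents can be spoofed or omitted, this kind of filtering is only ever approximate.

import re

# Illustrative substrings found in the user-agents of well-known crawlers.
# Real deployments rely on much larger, regularly updated signature sets.
KNOWN_BOT_SIGNATURES = ["googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider"]

# In the combined log format, the user-agent is the final quoted field on each line.
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')

def is_probable_bot(log_line: str) -> bool:
    """Return True when the request's user-agent matches a known crawler signature."""
    match = USER_AGENT_PATTERN.search(log_line)
    if not match:
        # A request with no user-agent at all is itself a common sign of automated traffic.
        return True
    user_agent = match.group(1).lower()
    return any(signature in user_agent for signature in KNOWN_BOT_SIGNATURES)

def split_traffic(log_lines):
    """Separate human-looking requests from probable crawler requests."""
    humans, bots = [], []
    for line in log_lines:
        (bots if is_probable_bot(line) else humans).append(line)
    return humans, bots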

For marketers, this false representation can have a damaging effect on evaluation techniques and could lead to long-term revenue loss if particular marketing and advertising outputs are wrongly assumed to be a success. The reality is that the leads simply will not surface, because the perceived audience activity is the result of crawlers.

Crawlers can be particularly over-active on websites with a lot of content, such as news sites. Aggressive ‘spidering’ can overload servers, and the additional traffic can inconvenience other site visitors, resulting in slower page load times and potential site crashes. Crawlers are capable of retrieving data far more quickly and in greater depth than human searchers, but the number of page requests they make can have a crippling effect on the performance of a website.

The performance effect

A recent study conducted by Forrester Consulting suggests that two seconds is the new threshold for how long an average online user expects a web page to take to load. If a single crawler is performing multiple requests per second and/or downloading large files, a server will struggle to serve requests from crawlers and website visitors alike. Sites will slow down and functionality could be significantly affected.

Crawlers can span large portions of web-space over short periods of time, encroaching on bandwidth limits. Bottlenecks can arise locally through high bandwidth consumption, particularly if a bot is in frequent or permanent use, or if it is used during network peak times. Slow-down will be exacerbated if the frequency of page requests is left unregulated.

How do you stop a crawler?

Controlling web traffic and stopping crawler requests without having a detrimental effect on SEO is incredibly difficult. Options include setting a crawl delay for all search engines, disallowing all search engines from crawling, and disallowing crawlers from accessing particular files; however, all of these can impact search visibility and user experience. But there are ways to manipulate crawlers into thinking they have reached a page that does not exist.
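As an illustration of those directive-based options, the hypothetical robots.txt below combines a crawl delay, a block on particular directories and a full block on one named crawler; the paths and the bot name are invented for the example.

# Hypothetical robots.txt illustrating the options above
User-agent: *
Crawl-delay: 10        # ask crawlers to wait ten seconds between requests
Disallow: /private/    # keep crawlers out of a particular directory
Disallow: /downloads/  # keep crawlers away from large files

User-agent: BadBot
Disallow: /            # exclude one named crawler from the entire site

It is worth noting that Crawl-delay is a non-standard directive that some major crawlers, Google among them, ignore, and that robots.txt as a whole is only honoured by well-behaved bots – which is where detection-based approaches come in.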

Device detection solutions can identify what device is being used to access a website, recognising differences in screen size, browser type, chipset and thousands of other specifications. Device detection is a solution capable of improving user experience, enhancing analysis and delivering easy deployment. A device detection specialist can also identify when a bot or crawler is trying to access a website and, in response, will not serve the crawler a viable page.
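The sketch below shows that idea in minimal form, assuming a crude user-agent keyword check as a stand-in for a full device detection database; the function names are illustrative rather than any vendor’s actual API. Crawler requests receive a lightweight placeholder instead of the normal, advert-carrying page.

# Crude stand-in for a full device detection database of crawler signatures.
CRAWLER_TOKENS = ("bot", "crawler", "spider", "slurp")

def is_crawler(user_agent: str) -> bool:
    """Placeholder for a device detection lookup that flags crawler user-agents."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)

def render_full_page() -> str:
    # In a real site this would assemble the normal page, including paid adverts.
    return "<html><body>Full page with adverts for human visitors.</body></html>"

def handle_request(headers: dict):
    """Serve the full page to genuine visitors; give crawlers a minimal response."""
    if is_crawler(headers.get("User-Agent", "")):
        # Lightweight placeholder: no adverts are wasted on non-human traffic.
        return 200, "<html><body></body></html>"
    return 200, render_full_page()

A commercial detection database can make this distinction far more precisely, and keep it up to date, in a way that a hand-maintained keyword list cannot.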

Many web pages host adverts to secure additional revenue. When crawlers and bots penetrate a web page they see it just as a human visitor would, and are served the adverts and banners the site would ordinarily display. However, guidelines surrounding the use of bots mean clicking on advertisements is strictly disallowed – there will be no increased revenue to reflect the increased traffic.

Once device detection identifies that a page request comes from a bot or crawler, the page is not served, removing the threat of crawler fraud and wasted marketing effort. By refusing to serve adverts that offer no revenue return, websites are able to safeguard their assets and monitor advertising more effectively.

A device detection solution empowers websites to minimise crawler effects by only serving web pages to genuine visitors, ensuring servers are not overloaded and page requests are quickly answered. Negative user experiences can deter individuals from visiting a website again and could result in revenue loss for businesses.

As robotic software and ‘spidering’ techniques become more active in pursuit of new mobile-friendly detection objectives, website proprietors must consider taking action to counteract the negative effects of indexing, protecting both their customers’ user experience and the reputation of their business.

James Rosewell, CEO and Founder, 51Degrees
