Development of a bot/web crawler detection system
I am trying to build a system for my company that checks for unusual or abusive usage patterns (mainly web scrapers).
Currently, the logic I have implemented parses the HTTP access logs and uses the following parameters to estimate the likelihood that a user is a scraper or bot:
It checks the ratio of HTTP POST to GET requests for each IP
It calculates the ratio of unique URLs to the total number of hits (sparsity) for each IP
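For reference, the two parameters above can be sketched roughly as follows. This is a minimal illustration rather than the actual production code: it assumes the access logs are in Common Log Format, and the names `LOG_LINE` and `per_ip_features` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: Common Log Format, e.g.
# 10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 123
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*"')


def per_ip_features(lines):
    """Return per-IP hit count, POST/GET ratio and unique-URL ratio."""
    stats = defaultdict(lambda: {"GET": 0, "POST": 0, "hits": 0, "urls": set()})
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        ip, method, url = match.groups()
        entry = stats[ip]
        entry["hits"] += 1
        entry["urls"].add(url)
        if method in ("GET", "POST"):
            entry[method] += 1

    features = {}
    for ip, entry in stats.items():
        features[ip] = {
            "hits": entry["hits"],
            # Undefined when the IP sent no GET requests at all.
            "post_get_ratio": entry["POST"] / entry["GET"] if entry["GET"] else None,
            # Close to 1.0 means almost every hit was a different URL.
            "unique_url_ratio": len(entry["urls"]) / entry["hits"],
        }
    return features
```

Calling `per_ip_features(open("access.log"))` (the file name is a placeholder) returns a dictionary keyed by IP, which can then be compared against whatever thresholds are chosen for blocking.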
Based on the above two parameters, we try to block any IP showing unusual behaviour, but these two parameters alone have not been sufficient for bot detection. Thus I would like to know:
Are there any other parameters which can be included to improve the detection?
I found a paper published in the ACM library that uses a Bayesian approach to detect crawlers. Has anyone used this? How effective is it?
Stack Overflow and other high-traffic sites have systems of this kind deployed. What logic do they follow to keep unwanted spammers and crawlers away in real time?