{"id":7594,"date":"2015-09-01T17:24:03","date_gmt":"2015-09-01T17:24:03","guid":{"rendered":"https:\/\/unknownerror.org\/index.php\/2015\/09\/01\/how-to-filter-duplicate-requests-based-on-url-in-scrapy-open-source-projects-scrapy-scrapy\/"},"modified":"2015-09-01T17:24:03","modified_gmt":"2015-09-01T17:24:03","slug":"how-to-filter-duplicate-requests-based-on-url-in-scrapy-open-source-projects-scrapy-scrapy","status":"publish","type":"post","link":"https:\/\/unknownerror.org\/index.php\/2015\/09\/01\/how-to-filter-duplicate-requests-based-on-url-in-scrapy-open-source-projects-scrapy-scrapy\/","title":{"rendered":"how to filter duplicate requests based on url in scrapy-open source projects scrapy\/scrapy"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/graph.facebook.com\/100000186350353\/picture?type=large\" \/> <strong>Manoj Sahu<\/strong><\/p>\n<p>https:\/\/github.com\/scrapinghub\/scrapylib\/blob\/master\/scrapylib\/deltafetch.py<\/p>\n<p>This file might help you. It builds a database of unique deltafetch keys from the URL that the user passes in scrapy.Request(meta={&#8216;deltafetch_key&#8217;:unique_url_key}). 
This lets you skip requests for URLs you have already visited in the past.<\/p>\n<p>A sample MongoDB implementation using deltafetch.py:<\/p>\n<pre><code>        # Inside process_spider_output(response, result, spider); r is each\n        # Request or Item the spider yields.\n        if isinstance(r, Request):\n            # Namespace the key by spider name so spiders do not collide.\n            key = self._get_key(r) + spider.name\n\n            # Drop the request if its key was stored on an earlier run.\n            if self.db['your_collection_to_store_deltafetch_key'].find_one({\"_id\": key}):\n                spider.log(\"Ignoring already visited: %s\" % r, level=log.INFO)\n                continue\n        elif isinstance(r, BaseItem):\n            # An item was scraped: record the key of the request that produced it.\n            key = self._get_key(response.request) + spider.name\n            try:\n                self.db['your_collection_to_store_deltafetch_key'].insert({\"_id\": key, \"time\": datetime.now()})\n            except DuplicateKeyError:  # from pymongo.errors\n                spider.log(\"Key already stored: %s\" % key, level=log.ERROR)\n        yield r\n<\/code><\/pre>\n<p>e.g. for id = 345: scrapy.Request(url, meta={&#8216;deltafetch_key&#8217;: 345}, callback=parse)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Manoj Sahu https:\/\/github.com\/scrapinghub\/scrapylib\/blob\/master\/scrapylib\/deltafetch.py This file might help you. It builds a database of unique deltafetch keys from the URL that the user passes in scrapy.Request(meta={&#8216;deltafetch_key&#8217;:unique_url_key}). This lets you skip requests for URLs you have already visited in the past. 
A sample mongodb implementation using deltafetch.py if isinstance(r, Request): key = self._get_key(r) key = [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-7594","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7594","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/comments?post=7594"}],"version-history":[{"count":0,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7594\/revisions"}],"wp:attachment":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/media?parent=7594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/categories?post=7594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/tags?post=7594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}