{"id":7655,"date":"2015-09-24T12:26:38","date_gmt":"2015-09-24T12:26:38","guid":{"rendered":"https:\/\/unknownerror.org\/index.php\/2015\/09\/24\/running-scrapy-spiders-in-a-celery-task-open-source-projects-scrapy-scrapy\/"},"modified":"2015-09-24T12:26:38","modified_gmt":"2015-09-24T12:26:38","slug":"running-scrapy-spiders-in-a-celery-task-open-source-projects-scrapy-scrapy","status":"publish","type":"post","link":"https:\/\/unknownerror.org\/index.php\/2015\/09\/24\/running-scrapy-spiders-in-a-celery-task-open-source-projects-scrapy-scrapy\/","title":{"rendered":"Running Scrapy spiders in a Celery task-open source projects scrapy\/scrapy"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/8cf904d96da73e2c5e857899a8000009?s=128&amp;d=identicon&amp;r=PG\" \/> <strong>byoungb<\/strong><\/p>\n<p>Okay here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen&#8217;s code located here http:\/\/snippets.scrapy.org\/snippets\/13\/<\/p>\n<p>First the <code>tasks.py<\/code> file<\/p>\n<pre><code>from celery import task\n\n@task()\ndef crawl_domain(domain_pk):\n    from crawl import domain_crawl\n    return domain_crawl(domain_pk)\n<\/code><\/pre>\n<p>Then the <code>crawl.py<\/code> file<\/p>\n<pre><code>from multiprocessing import Process\nfrom scrapy.crawler import CrawlerProcess\nfrom scrapy.conf import settings\nfrom spider import DomainSpider\nfrom models import Domain\n\nclass DomainCrawlerScript():\n\n    def __init__(self):\n        self.crawler = CrawlerProcess(settings)\n        self.crawler.install()\n        self.crawler.configure()\n\n    def _crawl(self, domain_pk):\n        domain = Domain.objects.get(\n            pk = domain_pk,\n        )\n        urls = []\n        for page in domain.pages.all():\n            urls.append(page.url())\n        self.crawler.crawl(DomainSpider(urls))\n        self.crawler.start()\n        self.crawler.stop()\n\n    def crawl(self, domain_pk):\n        p = Process(target=self._crawl, args=[domain_pk])\n        p.start()\n        p.join()\n\ncrawler = DomainCrawlerScript()\n\ndef domain_crawl(domain_pk):\n    crawler.crawl(domain_pk)\n<\/code><\/pre>\n<p>The trick here is the &#8220;from multiprocessing import Process&#8221; this gets around the &#8220;ReactorNotRestartable&#8221; issue in the Twisted framework. So basically the Celery task calls the &#8220;domain_crawl&#8221; function which reuses the &#8220;DomainCrawlerScript&#8221; object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant but I did do this for a reason in my setup with multiple versions of python [my django webserver is actually using python2.4 and my worker servers use python2.7])<\/p>\n<p>In my example here &#8220;DomainSpider&#8221; is just a modified Scrapy Spider that takes a list of urls in then sets them as the &#8220;start_urls&#8221;.<\/p>\n<p>Hope this helps!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>byoungb Okay here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. 
Hope this helps!