{"id":7921,"date":"2015-11-09T17:41:19","date_gmt":"2015-11-09T17:41:19","guid":{"rendered":"https:\/\/unknownerror.org\/index.php\/2015\/11\/09\/how-to-run-scrapy-from-within-a-python-script-open-source-projects-scrapy-scrapy\/"},"modified":"2015-11-09T17:41:19","modified_gmt":"2015-11-09T17:41:19","slug":"how-to-run-scrapy-from-within-a-python-script-open-source-projects-scrapy-scrapy","status":"publish","type":"post","link":"https:\/\/unknownerror.org\/index.php\/2015\/11\/09\/how-to-run-scrapy-from-within-a-python-script-open-source-projects-scrapy-scrapy\/","title":{"rendered":"How to run Scrapy from within a Python script"},"content":{"rendered":"<p>I&#8217;m new to Scrapy and I&#8217;m looking for a way to run it from a Python script. I found two sources that explain this:<\/p>\n<p>http:\/\/tryolabs.com\/Blog\/2011\/09\/27\/calling-scrapy-python-script\/<\/p>\n<p>http:\/\/snipplr.com\/view\/67006\/using-scrapy-from-a-script\/<\/p>\n<p>I can&#8217;t figure out where I should put my spider code or how to call it from the main function. Please help. This is the example code:<\/p>\n<pre><code># This snippet runs Scrapy spiders from a script, independent of scrapyd or the scrapy command-line tool. \n# \n# The multiprocessing library is used to work around a limitation of Twisted: an already stopped reactor (and with it a Scrapy crawl) cannot be restarted in the same process.\n# \n# [Here](http:\/\/groups.google.com\/group\/scrapy-users\/browse_thread\/thread\/f332fc5b749d401a) is the mailing-list discussion for this snippet. 
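# \n# Note: the snippet above targets very old Scrapy releases; scrapy.conf, scrapy.xlib.pydispatch and the CrawlerProcess install\/configure API were removed in later versions. In newer Scrapy the supported way to run a spider from a script is scrapy.crawler.CrawlerProcess, which starts and stops the Twisted reactor for you, so no multiprocessing workaround is needed. A minimal sketch, assuming MySpider is a placeholder for your own spider class defined in the same file:\n# \n#     from scrapy.crawler import CrawlerProcess\n#     process = CrawlerProcess({'USER_AGENT': 'my-bot'})  # settings dict is optional\n#     process.crawl(MySpider)\n#     process.start()  # blocks here until the crawl finishes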
\n\n#!\/usr\/bin\/python\nimport os\nos.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports\n\nfrom scrapy import log, signals, project\nfrom scrapy.xlib.pydispatch import dispatcher\nfrom scrapy.conf import settings\nfrom scrapy.crawler import CrawlerProcess\nfrom multiprocessing import Process, Queue\n\nclass CrawlerScript():\n\n    def __init__(self):\n        self.crawler = CrawlerProcess(settings)\n        if not hasattr(project, 'crawler'):\n            self.crawler.install()\n        self.crawler.configure()\n        self.items = []\n        dispatcher.connect(self._item_passed, signals.item_passed)\n\n    def _item_passed(self, item):\n        self.items.append(item)\n\n    def _crawl(self, queue, spider_name):\n        spider = self.crawler.spiders.create(spider_name)\n        if spider:\n            self.crawler.queue.append_spider(spider)\n        self.crawler.start()\n        self.crawler.stop()\n        queue.put(self.items)\n\n    def crawl(self, spider):\n        queue = Queue()\n        p = Process(target=self._crawl, args=(queue, spider,))\n        p.start()\n        p.join()\n        return queue.get(True)\n\n# Usage\nif __name__ == \"__main__\":\n    log.start()\n\n    \"\"\"\n    This example runs spider1 and then spider2 three times. \n    \"\"\"\n    items = list()\n    crawler = CrawlerScript()\n    items.append(crawler.crawl('spider1'))\n    for i in range(3):\n        items.append(crawler.crawl('spider2'))\n    print items\n\n# Snippet imported from snippets.scrapy.org (which no longer works)\n# author: joehillen\n# date  : Oct 24, 2010\n<\/code><\/pre>\n<p>Thank you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m new to Scrapy and I&#8217;m looking for a way to run it from a Python script. 
I found 2 sources that explain this: http:\/\/tryolabs.com\/Blog\/2011\/09\/27\/calling-scrapy-python-script\/ http:\/\/snipplr.com\/view\/67006\/using-scrapy-from-a-script\/ I can&#8217;t figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code: # This [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-7921","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7921","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/comments?post=7921"}],"version-history":[{"count":0,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7921\/revisions"}],"wp:attachment":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/media?parent=7921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/categories?post=7921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/tags?post=7921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}