{"id":7619,"date":"2015-09-21T02:23:54","date_gmt":"2015-09-21T02:23:54","guid":{"rendered":"https:\/\/unknownerror.org\/index.php\/2015\/09\/21\/scrapinghub-portia\/"},"modified":"2015-09-21T02:23:54","modified_gmt":"2015-09-21T02:23:54","slug":"scrapinghub-portia","status":"publish","type":"post","link":"https:\/\/unknownerror.org\/index.php\/2015\/09\/21\/scrapinghub-portia\/","title":{"rendered":"scrapinghub\/portia"},"content":{"rendered":"<p>Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.<\/p>\n<h2>Anatomy of a Portia Project<\/h2>\n<p>A project in Portia generally consists of one or more spiders.<\/p>\n<h3>Spider<\/h3>\n<p>A spider is a crawler for a particular website. The configuration of a spider is split into three sections:<\/p>\n<ul>\n<li><strong>Initialization<\/strong><\/li>\n<li><strong>Crawling<\/strong><\/li>\n<li><strong>Extraction<\/strong><\/li>\n<\/ul>\n<p>The <strong>Initialization<\/strong> section is used to set up the spider when it\u2019s first launched. Here you can define the start URLs and login credentials.<\/p>\n<p>The <strong>Crawling<\/strong> section is used to configure how the spider will behave when it encounters URLs. You can choose how links are followed and whether to respect nofollow. You can visualise the effects of the crawling rules using the <strong>Overlay blocked links<\/strong> option; this will highlight links that will be followed in green, and links that won\u2019t be followed in red.<\/p>\n<p>The <strong>Extraction<\/strong> section lists the templates for this spider.<\/p>\n<h3>Templates<\/h3>\n<p>When the crawler visits a page, the page is matched against each template, and templates with more annotations take precedence over those with fewer. 
If the page matches a template, data will be extracted using the template\u2019s annotations to yield an item (assuming all required fields are filled). Templates exist within the context of a spider and are made up of annotations which define the elements you wish to extract from a page. Within the template you define the item you want to extract as well as mark any fields that are required for that item.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-extractors.png\" \/><\/p>\n<p>You can also add extractors to item fields. Extractors allow you to further refine information extracted from a page using either regexes or a predefined type. For example, let\u2019s say there\u2019s an element on the page that contains a phone number, but it also contains some other text that you don\u2019t need. You could use a regular expression that matches phone numbers and add it as an extractor to the relevant field; upon extracting that field only the phone number would be stored.<\/p>\n<p>You may need to use more than one template even if you\u2019re only extracting a single item type. For example, an e-commerce site may have a different layout for books than it does for audio CDs, so you would need to create a template for each layout. See Tips for Working with Multiple Templates for more info.<\/p>\n<h3>Annotations<\/h3>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-annotation-creation.png\" \/><\/p>\n<p>An annotation defines the location of a piece of data on the web page and how it should be used by the spider. Typically an annotation maps some element on the page to a particular field of an item, but there is also the option to mark the data as being required without storing the data in an item. 
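<\/p>
<p>The regex extractors described in the Templates section can be approximated in plain Python. The sketch below is illustrative only (the pattern and function name are hypothetical, not Portia\u2019s actual API); it shows how a regex extractor keeps only the matching part of a field\u2019s raw text:<\/p>

```python
import re

# Hypothetical extractor pattern for US-style phone numbers.
PHONE_RE = re.compile(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")

def apply_extractor(raw_text):
    """Return only the part of the raw field text that matches the pattern."""
    match = PHONE_RE.search(raw_text)
    return match.group(0) if match else None

# The annotated element contains extra text around the phone number;
# the extractor strips it so only the number is stored in the field.
print(apply_extractor("Call us today: 555-123-4567 (Mon-Fri)"))
```

<p>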
It\u2019s possible to map an attribute of an element instead of its content if required; for example, you can map the <code>href<\/code> attribute of an anchor rather than its text.<\/p>\n<h3>Items<\/h3>\n<p>An item refers to a single item of data scraped from the target website. A common example of an item would be a product for sale on an e-commerce website. It\u2019s important to differentiate between an <strong>item<\/strong> and an <strong>item definition<\/strong>; in Portia an item definition or item type refers to the schema of an item rather than the item itself. For example, <code>book<\/code> would be an item definition, and a specific book scraped from the website would be an item. An item definition consists of multiple fields, so using the example of a product you might have fields named <code>name<\/code>, <code>price<\/code>, <code>manufacturer<\/code> and so on. Annotations are used to extract data from the page into each of these fields.<\/p>\n<p>If you want to ensure that certain fields are extracted for an item, you can set the <strong>Required<\/strong> flag on a field, which will discard the item if that field is missing. Duplicate items are removed by default. In some cases you may have fields whose value can vary despite belonging to the same item, in which case you can mark them as <strong>Vary<\/strong>, which will ignore those fields when checking for duplicates. It\u2019s important to use <strong>Vary<\/strong> only when necessary, as misuse could easily lead to duplicate items being stored. The <code>url<\/code> field is a good example of where <strong>Vary<\/strong> is useful, as the same item may have multiple URLs. 
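<\/p>
<p>As a rough illustration of how <strong>Vary<\/strong> interacts with duplicate detection (a hypothetical sketch, not Portia\u2019s actual implementation), duplicates can be detected by comparing every field except those marked Vary:<\/p>

```python
# Hypothetical sketch: duplicate detection that ignores fields marked Vary.
VARY_FIELDS = {"url"}  # assumed Vary configuration for this item type

def dedupe_key(item):
    """Build a hashable key from all fields not marked Vary."""
    return tuple(sorted((k, v) for k, v in item.items() if k not in VARY_FIELDS))

def remove_duplicates(items):
    seen, unique = set(), []
    for item in items:
        key = dedupe_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# The same book reached via two different URLs counts as one item.
books = [
    {"name": "Dune", "price": "9.99", "url": "http://example.com/dune"},
    {"name": "Dune", "price": "9.99", "url": "http://example.com/sci-fi/dune"},
]
print(len(remove_duplicates(books)))  # 1
```

<p>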
If the <code>url<\/code> field wasn\u2019t marked as <strong>Vary<\/strong>, each duplicate item would be seen as unique because its URL would differ.<\/p>\n<h2>Creating a Portia Project<\/h2>\n<p>To create a new project, begin by entering the site\u2019s URL in the navigation bar at the top of the page and clicking <code>Start<\/code>. This should create a new project along with a spider for the website, and you should see the loaded web page:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-new-project.png\" \/><\/p>\n<p>The spider can be configured on the right. The start pages are the URLs the spider will visit when beginning a new crawl. Portia can be used as a web browser, and you can navigate to the pages you want to extract data from and create new templates. To define the data you wish to extract from a page, click the <code>Annotate this page<\/code> button, which will create a new template and allow you to annotate the page.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-annotation.png\" \/><\/p>\n<p>You will now be able to define annotations by highlighting or clicking elements on the page. When annotating, a context menu will appear allowing you to map an element\u2019s attribute or content to a particular item field. Should you want to add a new item field without having to go into the item editor, you can use the <code>-create new-<\/code> option in the field drop-down. If you want to mark an element as having to exist on the page without storing its data, you can select <code>-just required-<\/code> instead of a field. 
It\u2019s important to note that when using <code>-just required-<\/code>, only the existence of the element will be checked rather than its content.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-item-editor.png\" \/><\/p>\n<p>Once you are finished annotating, you can mark any fields that are required by going into the item editor under <code>Extracted item<\/code>. As mentioned earlier, if the item appears in several locations and some fields differ despite being the same item, you can also tick <code>Vary<\/code> on any relevant fields to exclude them from duplicate detection.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/unknownerror.org\/opensource\/scrapinghub\/portia\/docs\/_static\/portia-extracted-items.png\" \/><\/p>\n<p>You can now confirm that your template works by clicking <code>Continue browsing<\/code>. The page should reload and a pop-up should appear showing you the items extracted from the page. When visiting a page in Portia, the whole extraction process is performed by the spider using the set of currently defined templates. This allows you to check that data will be extracted from the page before running the spider against the whole website.<\/p>\n<p>If you have created a template around one page where the data extracts successfully, but you visit a similar page and no item is extracted, then it\u2019s likely that the page has a different layout or is missing some fields. In this case you would simply click <code>Annotate this page<\/code> again to create a new template for the page, and then annotate it the same way you did with the other page. 
See Tips for Working with Multiple Templates for more details on how multiple templates are used within a single spider.<\/p>\n<p>Once you\u2019ve confirmed that your spider works and extracts data properly, your project is ready to run or deploy.<\/p>\n<h2>Advanced Use of Annotations<\/h2>\n<h3>Multiple Fields<\/h3>\n<p>It\u2019s possible to extract multiple fields using a single annotation if there are several properties you want to extract from an element. For example, if there was an anchor link on the page, you could map the <code>href<\/code> attribute containing the URL to one field, and the text to another. You can view a particular annotation\u2019s settings by either clicking the cog in the annotation pop-up window or by clicking the cog beside the annotation in the <code>Annotations<\/code> section of the template configuration. There you will find an <code>Attribute mappings<\/code> section where you can define additional mappings for the selected annotation.<\/p>\n<h3>Variants<\/h3>\n<p>It\u2019s common for a single item to come in a number of variations, e.g. different sizes such as small, medium and large. Each variation will likely have its own annotation for one or more fields, and you will want to keep each variation\u2019s value. In situations like this you can use variants to make sure each value is stored. Each annotation you define has a variant selected, the default being <code>Base<\/code>, which refers to the base item. To assign an annotation to a variant, simply select the variant you want the annotation to use in its options or under the <code>Annotations<\/code> section in the template settings.<\/p>\n<p>Consider the following scenario where variants would be useful:<\/p>\n<p>You want to scrape an e-commerce website that sells beds, and some beds come in multiple sizes, e.g. 
<code>Single<\/code>, <code>Double<\/code>, <code>Queen<\/code>, <code>King<\/code>. The product page for each bed has a table of prices for each size, like so:<\/p>\n<table>\n<tr>\n<td>Single<\/td>\n<td>$300<\/td>\n<\/tr>\n<tr>\n<td>Double<\/td>\n<td>$500<\/td>\n<\/tr>\n<tr>\n<td>Queen<\/td>\n<td>$650<\/td>\n<\/tr>\n<tr>\n<td>King<\/td>\n<td>$800<\/td>\n<\/tr>\n<\/table>\n<p>The rest of the data you want to extract (product name, description etc.) is common across all sizes. In this case, you would annotate the common data to the base item and create the fields <code>size<\/code> and <code>price<\/code>. You would then annotate the <code>Single<\/code> cell as variant 1 of <code>size<\/code>, and the <code>$300<\/code> cell as variant 1 of <code>price<\/code>, followed by annotating <code>Double<\/code> as variant 2 of <code>size<\/code>, <code>$500<\/code> as variant 2 of <code>price<\/code> and so on. It\u2019s worth noting that in this case, it wouldn\u2019t be necessary to create a variant for each row; usually it is enough to annotate only the first and last row of the table as Portia will automatically create variants for rows in between.<\/p>\n<h3>Partial Annotations<\/h3>\n<p>Partial annotations can be used to extract some part of text which exists as part of a common pattern. For example, if an element contained the text <code>Price: $5.00<\/code>, you could highlight the <code>5.00<\/code> part and map it to a field. The <code>Price: $<\/code> part would be matched but removed before extracting the field. In order to create a partial annotation, all you need to do is highlight the text the way you would normally, by clicking and dragging the mouse. The annotation window will pop up and you will be able to map it to a field the same way you would with a normal annotation.<\/p>\n<p>There are some limitations to partial annotations. As mentioned in the previous paragraph, the text must be part of a pattern. 
For example, let\u2019s say an element contains the following text:<\/p>\n<pre><code>Veris in temporibus sub Aprilis idibus habuit concilium Romarici montium\n<\/code><\/pre>\n<p>One of the pages visited by the crawler contains the following text in the same element:<\/p>\n<pre><code>Cui dono lepidum novum libellum arido modo pumice expolitum?\n<\/code><\/pre>\n<p>If you had annotated <code>Aprilis<\/code> in the template, nothing would have matched because the surrounding text differs from the content being matched against. However, if the following text had instead appeared in the same element:<\/p>\n<pre><code>Veris in temporibus sub Januarii idibus habuit concilium Romarici montium\n<\/code><\/pre>\n<p>The word <code>Januarii<\/code> would have been extracted, because its surrounding text matches the text surrounding the original annotation in the template.<\/p>\n<h2>Tips for Working with Multiple Templates<\/h2>\n<p>It\u2019s often necessary to use multiple templates within one spider, even if you\u2019re only extracting one item type. Some pages containing the same item type may have a different layout or fields missing, and you will need to accommodate those pages by creating a template for each layout variation.<\/p>\n<p>The more annotations a template has, the more specific the extracted data, and therefore the lower the chance of a false positive. For this reason, templates with more annotations take precedence over those with fewer annotations. If several templates contain an equal number of annotations, then those templates will be tried in the order they were created, from first to last. In other words, templates are tried in order of number of annotations first, and age second.<\/p>\n<p>If you are working with a large number of templates, it may be difficult to ensure the correct template is applied to the right page. It\u2019s best to keep templates as strict as possible to avoid any false matches. 
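<\/p>
<p>The precedence rules above amount to a simple ordering, sketched here with hypothetical template records (Portia\u2019s internal representation will differ):<\/p>

```python
# Hypothetical template records: annotation count plus creation order.
templates = [
    {"name": "books", "annotations": 4, "created": 2},
    {"name": "cds", "annotations": 6, "created": 3},
    {"name": "generic", "annotations": 4, "created": 1},
]

# More annotations first; within a tie, earlier-created templates first.
ordered = sorted(templates, key=lambda t: (-t["annotations"], t["created"]))
print([t["name"] for t in ordered])  # ['cds', 'generic', 'books']
```

<p>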
It\u2019s useful to take advantage of the <code>-just required-<\/code> option and annotate elements that will always appear on matching pages to reduce the number of false positives.<\/p>\n<p>Consider the following example:<\/p>\n<p>We have an item type with the fields <code>name<\/code>, <code>price<\/code>, <code>description<\/code> and <code>manufacturer<\/code>, where <code>name<\/code> and <code>price<\/code> are required fields. We have created a template with annotations for each of those fields. Upon running the spider, many items are correctly scraped; however, there are a large number of items where the manufacturer field contains the description, and the description field is empty. This has been caused by some pages having a different layout:<\/p>\n<p>Layout A:<\/p>\n<table>\n<tr>\n<td>name<\/td>\n<td>price<\/td>\n<\/tr>\n<tr>\n<td>manufacturer<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<\/tr>\n<tr>\n<td>description<\/td>\n<\/tr>\n<\/table>\n<p>Layout B:<\/p>\n<table>\n<tr>\n<td>name<\/td>\n<td>price<\/td>\n<\/tr>\n<tr>\n<td>description<\/td>\n<\/tr>\n<\/table>\n<p>As you can see, the problem lies with the fact that in layout B the description is where the manufacturer would be, and because <code>description<\/code> is not a required field, the template created for layout A will also match layout B. Creating a new template for layout B won\u2019t be enough to fix the problem, as layout A\u2019s template would contain more annotations and be tried first.<\/p>\n<p>Instead we need to modify layout A\u2019s template, and mark the <code>description<\/code> annotation as <strong>Required<\/strong>. 
With this added constraint, items displayed with layout B will not be matched by layout A\u2019s template due to the missing <code>description<\/code> field, so the spider will proceed to layout B\u2019s template, which will extract the data successfully.<\/p>\n<h2>Running Portia<\/h2>\n<h3>Running Portia with Vagrant (Recommended)<\/h3>\n<p>Check out the repository:<\/p>\n<pre><code>git clone https:\/\/github.com\/scrapinghub\/portia\n<\/code><\/pre>\n<p>You will need both Vagrant and VirtualBox installed.<\/p>\n<p>Run the following in Portia\u2019s directory:<\/p>\n<pre><code>vagrant up\n<\/code><\/pre>\n<p>This will launch an Ubuntu virtual machine, build Portia and start the <code>slyd<\/code> server. You\u2019ll then be able to access Portia at <code>http:\/\/localhost:8000\/static\/index.html<\/code>. You can stop the <code>slyd<\/code> server using <code>vagrant suspend<\/code> or <code>vagrant halt<\/code>. To run <code>portiacrawl<\/code> you will need to SSH into the virtual machine by running <code>vagrant ssh<\/code>.<\/p>\n<h3>Running Portia Locally<\/h3>\n<p>If you would like to run Portia locally, you should create an environment with virtualenv:<\/p>\n<pre><code>virtualenv YOUR_ENV_NAME --no-site-packages\nsource YOUR_ENV_NAME\/bin\/activate\ncd YOUR_ENV_NAME\n<\/code><\/pre>\n<p>Now clone this repository into that env:<\/p>\n<pre><code>git clone https:\/\/github.com\/scrapinghub\/portia.git\n<\/code><\/pre>\n<p>and inside this env install the required packages:<\/p>\n<pre><code>cd portia\npip install -r requirements.txt\npip install -e .\/slybot\n<\/code><\/pre>\n<p>To run Portia, start slyd:<\/p>\n<pre><code>cd slyd\ntwistd -n slyd\n<\/code><\/pre>\n<p>Portia should now be running on port 9001 and you can access it at <code>http:\/\/localhost:9001\/static\/index.html<\/code>.<\/p>\n<h6>Missing Dependencies on Linux<\/h6>\n<p>When running Portia on Ubuntu or Debian systems you may need to install the following dependencies:<\/p>\n<pre><code>sudo 
apt-get install libxml2-dev libxslt-dev python-dev zlib1g-dev libffi-dev libssl-dev\n<\/code><\/pre>\n<h3>Running Portia with Docker<\/h3>\n<p>Check out the repository:<\/p>\n<pre><code>git clone https:\/\/github.com\/scrapinghub\/portia\n<\/code><\/pre>\n<p>If you are on a Linux machine you will need Docker installed; if you are using a Windows or Mac OS X machine you will need boot2docker.<\/p>\n<p>After following the appropriate instructions above, the Portia image can be built using the command below:<\/p>\n<pre><code>docker build -t portia .\n<\/code><\/pre>\n<p>Portia can be run using the command below:<\/p>\n<pre><code>docker run -i -t --rm \\\n-v \/data:\/app\/slyd\/data:rw \\\n-p 9001:9001 \\\n--name portia \\\nportia\n<\/code><\/pre>\n<p>Portia will now be running on port 9001 and you can access it at <code>http:\/\/localhost:9001\/static\/index.html<\/code>. Projects will be stored in the folder that you mount into Docker.<\/p>\n<p>To run <code>portiacrawl<\/code> add <code>\/app\/slybot\/bin\/portiacrawl [SPIDER] [OPTIONS]<\/code> to the command above.<\/p>\n<p>:warning: <code>For Windows the path must be of the form \/<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages. 
Anatomy of a Portia Project A project in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-7619","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7619","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/comments?post=7619"}],"version-history":[{"count":0,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/7619\/revisions"}],"wp:attachment":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/media?parent=7619"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/categories?post=7619"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/tags?post=7619"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}