Test Case Title
TC9.16 – Web Scraping to get data from web pages
A user can:
A Snap4City user registered on the Snap4City portal.
You have to be an AreaManager.
You can request access to the tool at https://www.snap4city.org/drupal/node/431
Expected successful result
An integrated suite of tools, at least for what we can have so far.
This page provides a general guide to using the Portia web scraping tool. The version considered in this guide is the one included in the Docker snap4city-portia image: v2.1. The original version of the tool, created by Scrapinghub, has the same user interface but uses different backend technologies. Further information on the differences between the two versions can be found on the GitHub repository page or in the technical manual.
Portia is a web tool for web scraping based on two phases: (i) training and (ii) execution of scripts that automatically crawl information from a given website. The tool provides a web interface to the scraper engine that executes the script. With the web scraper tool it is possible to automate and periodically perform the collection of information from a range of different websites. This guide shows how to train a script and how to run it to collect the results.
The web scraping tool has to be enabled for your account. Please read the following page to learn how to obtain access to a Scraping development environment from the main menu: https://www.snap4city.org/drupal/node/431
The following screen shows the project selection (collection of one or more spiders). You can create a new project or open an existing one and continue training.
Let us now create a new project, "Wikinews", and confirm to open the screen of the current project. In the central text box, insert the URL of the start page from which the script should start (see the following figure).
WARNING: it is recommended to open the page in your browser first and copy the initial URL from the address bar.
Select the main PROJECT and the SPIDER in the list on the left. The tool then opens the preview of the main web page.
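The reason for copying the URL from the address bar is that a start page needs a complete address, including the scheme. As a minimal sketch, the check below verifies that a candidate start URL has an http(s) scheme and a host name. The helper name `is_valid_start_url` and the example URL are assumptions made for illustration; they are not part of Portia.

```python
from urllib.parse import urlparse


def is_valid_start_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# A URL copied from the address bar includes the scheme.
print(is_valid_start_url("https://en.wikinews.org/wiki/Main_Page"))
# A hand-typed address without the scheme is rejected.
print(is_valid_start_url("en.wikinews.org/wiki/Main_Page"))
```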
With the "New Spider" button you can create a new script for web page crawling. The script must be trained: by pressing the "New Sample" button you can instruct the script to collect the interesting elements of the page as follows.
To this end, it is necessary to create a new annotation for each class of elements you intend to collect. Suppose you want to collect all the news links in the right sidebar of wikinews.en. Clicking + creates a new annotation; to associate it with a news link, simply click on the link. The link will be highlighted:
To associate the remaining news links with the annotation, use the symbol on the next element. The results will be as follows. Note: to get coherent and efficient annotations, we suggest associating visually similar elements. Annotations follow CSS classes and HTML tags.
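To illustrate what it means for an annotation to follow CSS classes and HTML tags, the sketch below collects the links of all `<a>` elements that carry a given class, using only the Python standard library. This is not how Portia works internally; the markup and the class name `news-link` are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


# Hypothetical markup: the class names are assumptions for this example.
SAMPLE_HTML = """
<ul>
  <li><a class="news-link" href="/wiki/Story_one">Story one</a></li>
  <li><a class="news-link featured" href="/wiki/Story_two">Story two</a></li>
  <li><a class="nav" href="/wiki/Help">Help</a></li>
</ul>
"""

collector = LinkCollector("news-link")
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Elements that share a tag and class are matched together, which is why annotating visually similar elements tends to give consistent results.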
An element can be removed from an annotation with the “-” symbol; when the page has multiple annotations, use the dedicated tool instead.
The right sidebar of the preview page shows a preview of the JSON output, while the Extracted Items tab shows an organized summary of the properties of the selected sample. It is also possible to add more annotations to the same page; this is recommended when you want to retrieve information contained in semantically different elements.
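Once the JSON output is available, it can be inspected with a few lines of code. The sketch below assumes a list of items, each with a `url` field and a `title` field holding the values of one annotation. This shape and these field names are assumptions for illustration; the real output depends on the annotations defined during training.

```python
import json

# Hypothetical output: the field names "url" and "title" are assumptions.
RAW = """[
  {"url": "https://example.org/a", "title": ["Story one"]},
  {"url": "https://example.org/b", "title": ["Story two", "Story three"]}
]"""


def summarize(raw_json: str) -> dict:
    """Map each item's URL to the list of values of its 'title' field."""
    items = json.loads(raw_json)
    return {item["url"]: item.get("title", []) for item in items}


summary = summarize(RAW)
for url, titles in summary.items():
    # Print each page address with the number of values extracted from it.
    print(url, len(titles))
```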
Tools on the Sidebar
The sidebar on the left side allows you to work on different aspects of the entire project:
- Project: lets you download a .zip with the project files. It also offers a Scrapy-compatible export to generate a file compatible with another instance of Portia.
- Start Pages: lets you add other start pages. Note that the spider specified in the spider section will be used. If the pages are not "compatible", it is recommended to create a dedicated spider.
- Link crawling: lets you specify the policy for following links found on the page. You can follow all links without distinction, follow only those of the current domain (default), follow no links, or specify a custom policy through a regular expression.
- Sample Pages: contains a list of the sample pages used during the training phase. You can resume and edit the training on a page, delete it, or specify a new one.
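The four link crawling policies listed above can be sketched as a single decision function. This is an illustration of the logic only, not Portia's implementation; the policy names `all`, `domain`, `none` and `regex` and the function `should_follow` are hypothetical.

```python
import re
from urllib.parse import urlparse


def should_follow(link, start_url, policy="domain", pattern=None):
    """Decide whether a link is followed under one of four policies."""
    if policy == "all":
        return True
    if policy == "none":
        return False
    if policy == "domain":
        # Follow only links on the same host as the start page.
        return urlparse(link).netloc == urlparse(start_url).netloc
    if policy == "regex":
        if pattern is None:
            raise ValueError("the 'regex' policy needs a pattern")
        return re.search(pattern, link) is not None
    raise ValueError(f"unknown policy: {policy}")


START = "https://en.wikinews.org/wiki/Main_Page"
LINKS = [
    "https://en.wikinews.org/wiki/Story_one",
    "https://www.example.org/other",
]

# With the default policy only the link on the start page's domain is kept.
followed = [link for link in LINKS if should_follow(link, START)]
print(followed)
```

A regular expression gives the finest control, for example restricting the crawl to article pages under a given path.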
In browsing mode the top bar appears as follows:
1. Status icon indicates that you are in browsing mode
2. Current address bar
3. Link highlighting: press it to see which links will be followed and which will not, according to the policy specified in the left sidebar.
4. Add / remove the current page to the start pages
5. Switches to creating or modifying a sample
During training the top bar appears as follows:
1. Status icon indicates that you are in training mode
2. Enable / disable style sheet rendering (CSS)
3. Magic tool
4. Add / remove the current page to the start pages
5. It allows creating or modifying a sample on the current page
Script execution can only be done via API call, see technical manual for details.
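As a rough sketch of what such an API call could look like from Python, the code below builds (without sending) an HTTP POST request that names a project and a spider. The endpoint URL and the payload fields `project` and `spider` are assumptions made for illustration; the actual API is described in the technical manual and may differ.

```python
import json
from urllib.request import Request


def build_run_request(api_url: str, project: str, spider: str) -> Request:
    """Build a POST request asking a scraping service to run a spider.

    The payload layout is hypothetical, not the documented API.
    """
    payload = json.dumps({"project": project, "spider": spider}).encode("utf-8")
    return Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and spider name, used only to show the call.
request = build_run_request(
    "http://portia.example.org/api/run", "Wikinews", "en.wikinews.org"
)
print(request.get_method(), request.full_url)
```

The request could then be sent with `urllib.request.urlopen(request)` once the real endpoint and parameters are known.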
Executing Web Scraping as a MicroService
Once the web scraping container is created, the tool can be executed from the IOT Application as follows:
The produced JSON data can also be ingested in MyKPI or stored in another part of the solution, as you prefer.