Python library¶
Installation¶
To install, just run:
pip install aduana
It will automatically compile the C library and wrap it using CFFI. Not all parts of the C library are accessible from Python, only those needed by the Frontera backends.
Apart from the Python module, it will also install two scripts:
- aduana-server.py
- aduana-server-cert.py
These scripts are used with the Distributed spider backend.
Using Scrapy/Frontera with Aduana¶
Check the Frontera documentation for general instructions about setting up Scrapy, Frontera and custom backends. The workflow specific to Aduana is:
Set the backend, either as:
BACKEND = 'aduana.frontera.Backend'
or if you want to make a distributed crawl with multiple spiders as:
BACKEND = 'aduana.frontera.WebBackend'
Set additional options, for example:
PAGE_DB_PATH = 'test-crawl'
SCORER = 'HitsScorer'
USE_SCORES = True
Run your spider:
scrapy crawl your_spider_here
Single spider backend¶
This backend is the easiest one to run and works by calling the wrapped C library directly. To use it, set the backend as:
BACKEND = 'aduana.frontera.Backend'
Additionally, the following settings are also used by this backend:
PAGE_DB_PATH
String with the path where you want the PageDB to be stored. Note that Aduana will actually create two directories: the one specified by PAGE_DB_PATH, and another with the suffix _bfs appended. This second directory contains the database necessary for the operation of the (best first) scheduler. If this setting is not specified, or it is set to None, the directory will be generated randomly, with suffix frontera_, and it will be automatically deleted when the spider is closed.
SCORER
Strategy to use to compute final page scores. Can be one of the following:
None
'HitsScorer'
'PageRankScorer'
USE_SCORES
Set to True if you want the scorer, when it is HITS or PageRank based, to merge the content scores with the link-based scores. Default is False.
SOFT_CRAWL_LIMIT
When a domain reaches this limit of crawls per second, Aduana will try to make requests to other domains. Default is 0.25.
HARD_CRAWL_LIMIT
When a domain reaches this limit of crawls per second, Aduana will stop making new requests for this domain. Default is 100.
PAGE_RANK_DAMPING
If the scorer is PageRank then set the damping to this value. Default is 0.85.
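Putting these settings together, a Frontera settings module for this backend might look like the following sketch (the path and scorer choice are illustrative):

```python
# frontera_settings.py -- illustrative values for the single spider backend
BACKEND = 'aduana.frontera.Backend'

PAGE_DB_PATH = 'test-crawl'      # Aduana also creates 'test-crawl_bfs'
SCORER = 'PageRankScorer'        # or 'HitsScorer' or None
USE_SCORES = True                # merge content scores with link scores
PAGE_RANK_DAMPING = 0.85         # only used when the scorer is PageRank

SOFT_CRAWL_LIMIT = 0.25          # crawls/s before preferring other domains
HARD_CRAWL_LIMIT = 100           # crawls/s before pausing a domain
```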
Distributed spider backend¶
This backend allows using several spiders simultaneously, possibly on different computers, to improve CPU and network performance. It works by having a central server that the spiders connect to through a REST API.
The first thing you need to do is launch the server:
aduana-server.py --help
usage: aduana-server.py [-h] [--seeds [SEEDS]] [settings]
Start Aduana server.
positional arguments:
settings Path to python module containing server settings
optional arguments:
-h, --help show this help message and exit
--seeds [SEEDS] Path to seeds file
Once the server is running, press Ctrl-C to stop it.
The server settings are specified in a separate file that is passed as a positional argument to the aduana-server.py script. The reason is that these settings are shared by all spiders that connect to the server.
The following server settings have the same meaning as the ones in the Single spider backend.
PAGE_DB_PATH
SCORER
USE_SCORES
SOFT_CRAWL_LIMIT
HARD_CRAWL_LIMIT
PAGE_RANK_DAMPING
Additionally the following settings are available:
SEEDS
Path to the seeds file, where each line is a different URL. This setting has no default and is mandatory. It can be specified/overridden with the --seeds option when launching the server.
DEFAULT_REQS
If the client does not specify the desired number of requests, serve this number. Default is 10.
ADDRESS
Server will listen on this address. Default '0.0.0.0'.
PORT
Server will listen on this port. Default 8000.
PASSWDS
A dictionary mapping login name to password. If None then all connections will be accepted. Notice that it uses BasicAuth, which sends login data in plain text. If security is a concern, it is advised to use this option along with SSL_KEY and SSL_CERT. Default value for this setting is None.
SSL_KEY
Path to SSL key file. If this setting is used then SSL_CERT must be set too, and all communications between server and clients will be encrypted using HTTPS. Default None.
SSL_CERT
Path to SSL certificate. Default None.
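For reference, a minimal server settings module might look like this sketch (the file paths and credentials are illustrative; SEEDS must point to an existing file):

```python
# server-config.py -- illustrative settings passed to aduana-server.py
SEEDS = 'seeds.txt'              # mandatory: one URL per line
PAGE_DB_PATH = 'server-crawl'
SCORER = 'HitsScorer'

ADDRESS = '0.0.0.0'
PORT = 8000
DEFAULT_REQS = 10                # URLs served when the client omits 'n'

# Optional BasicAuth and encryption (illustrative file names):
PASSWDS = {'test': '123'}
SSL_KEY = 'key.pem'
SSL_CERT = 'cert.pem'
```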
The Frontera settings to use this backend are:
BACKEND = 'aduana.frontera.WebBackend'
Additionally, the following settings are also used by this backend:
SERVER_NAME
Address of the server. Default 'localhost'.
SERVER_PORT
Server port number. Default 8000.
SERVER_CERT
Path to the server certificate. If this option is set, the backend will try to connect to the server using HTTPS. Default None.
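On the spider side, the corresponding Frontera settings could be sketched like this (host name and certificate path are illustrative):

```python
# web_settings.py -- illustrative settings for the distributed backend
BACKEND = 'aduana.frontera.WebBackend'

SERVER_NAME = 'localhost'
SERVER_PORT = 8000
SERVER_CERT = 'cert.pem'         # set to use HTTPS; None for plain HTTP
```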
WebBackend REST API¶
There are two messages exchanged between the spiders and the server.
Crawled
When a spider crawls a page it sends a POST message to /crawled. The body is a JSON dictionary with the following fields:
- url: The URL of the crawled page, ASCII encoded. This is the only mandatory field.
- score: a floating point number. If omitted, it defaults to zero.
- links: a list of links, where each element is a pair made from a link URL and a link score.
An example message:
{
    "url": "http://scrapinghub.com",
    "score": 0.5,
    "links": [
        ["http://scrapinghub.com/professional-services/", 1.0],
        ["http://scrapinghub.com/platform/", 0.5],
        ["http://scrapinghub.com/pricing/", 0.8],
        ["http://scrapinghub.com/clients/", 0.9]
    ]
}
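As a sketch, a spider-side client could build this message body with the standard json module (the URLs and scores below are illustrative; the resulting body would then be POSTed to /crawled with any HTTP client):

```python
import json

# Illustrative /crawled message body; only "url" is mandatory.
payload = {
    "url": "http://scrapinghub.com",
    "score": 0.5,
    "links": [                       # pairs of [link URL, link score]
        ["http://scrapinghub.com/platform/", 0.5],
        ["http://scrapinghub.com/pricing/", 0.8],
    ],
}
body = json.dumps(payload)           # JSON text to send as the POST body
```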
Request
When the spider needs to know which pages to crawl next, it sends a GET message to /request. The query string accepts an optional parameter n with the maximum number of URLs. If not specified, the default value from the server settings will be used. The response is a JSON-encoded list of URLs. Example (pip install httpie):
$ http --auth test:123 --verify=no https://localhost:8000/request n==3
HTTP/1.1 200 OK
Date: Tue, 23 Jun 2015 08:40:46 GMT
content-length: 120
content-type: application/json

[
    "http://www.reddit.com/r/MachineLearning/",
    "http://www.datanami.com/",
    "http://venturebeat.com/tag/machine-learning/"
]
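The same request can be prepared from Python. The sketch below only builds the URL and the BasicAuth header with the standard library (the host and credentials are illustrative), leaving the actual HTTPS call to whatever client the spider uses:

```python
import base64
from urllib.parse import urlencode

def build_request(base_url, n=None, user=None, password=None):
    """Return (url, headers) for a GET to the server's /request endpoint."""
    url = base_url.rstrip('/') + '/request'
    if n is not None:
        url += '?' + urlencode({'n': n})   # optional cap on returned URLs
    headers = {}
    if user is not None:                   # BasicAuth, as the server expects
        token = base64.b64encode(f'{user}:{password}'.encode()).decode()
        headers['Authorization'] = 'Basic ' + token
    return url, headers

url, headers = build_request('https://localhost:8000', n=3,
                             user='test', password='123')
# url is 'https://localhost:8000/request?n=3'
```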
Running the examples¶
To run the single spider example just go to the example directory, install the requirements and run the crawl:
cd example
pip install -r requirements.txt
scrapy crawl example
To run the distributed spider example we need to dance a little more:
Go to the example directory:
cd example
Generate a server certificate:
aduana-server-cert.py
Launch the server:
aduana-server.py server-config.py
Go to the example directory in another terminal and then:
scrapy crawl -s FRONTERA_SETTINGS=example.frontera.web_settings example