C Library

This section is aimed at developers who want to understand the architecture of the library in order to extend it.

The library can be compiled and installed independently of the Python bindings. To build and install:

cd lib
mkdir debug
cd debug
cmake .. -DCMAKE_BUILD_TYPE=Debug
make && sudo make install

When trying to understand some code I like to start with the data structures that form its inputs and outputs. CrawledPage is the input of Aduana and the best place to start.

CrawledPage

Data structures

struct CrawledPage

The information that comes with a crawled page.

Public Members

char *url

ASCII, null terminated string for the page URL

PageLinks *links

List of links inside this page

double time

Number of seconds since epoch

float score

A number giving an idea of the page content’s value

char *content_hash

A hash to detect content change since last crawl. Arbitrary byte sequence

size_t content_hash_length

Number of bytes of the content_hash

The utility of CrawledPage::url and CrawledPage::links is quite obvious; the others need an explanation:

  • CrawledPage::time: this is used to compute how often a page changes; it would also be useful for a revisiting schedule to know how much time has passed since the page was crawled.
  • CrawledPage::score: one of the objectives of Aduana is to guide the crawl towards interesting pages. Since the definition of interesting is application dependent, each crawler can provide a measure of how interesting it found the page to be. How exactly this number is used depends on which scorer we are going to use. This field is not mandatory and Aduana can actually be configured to ignore it.
  • CrawledPage::content_hash: in order to detect whether a page has changed, this hash is compared with the hash previously stored for the same page. If the hash has changed we consider that the page has changed. Notice that the content hash is also application dependent: some applications may consider that a page has changed only if there are new links, others may consider it changed if the body text, after stripping HTML tags, has changed, etc. This field can be ignored too, in which case pages are considered unchanging.

C is not known for its powerful and flexible data structures. To store a variable number of links per crawled page we implement the following resizable array: each time we run out of allocated memory, the reserved size is doubled.

struct PageLinks

A (resizable) array of page links.

Initially: n_links = 0, m_links = PAGE_LINKS_MIN_LINKS

Always: 0 <= n_links <= m_links

Public Members

LinkInfo *link_info

Array of LinkInfo

size_t n_links

Number of items inside link_info

size_t m_links

Maximum number of items that can be stored inside link_info

PAGE_LINKS_MIN_LINKS

Initially we reserve this number of links: allocate at least this amount of memory for link_info
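To make the doubling strategy concrete, here is a minimal sketch. It is an illustration, not the library's code: the *Sketch types stand in for LinkInfo and PageLinks, and the literal 10 stands in for PAGE_LINKS_MIN_LINKS.

#include <stdlib.h>

typedef struct { char *url; float score; } LinkInfoSketch;
typedef struct {
    LinkInfoSketch *link_info; /* array of links */
    size_t n_links;            /* items in use */
    size_t m_links;            /* items allocated */
} PageLinksSketch;

static int page_links_sketch_add(PageLinksSketch *pl, LinkInfoSketch li) {
    if (pl->n_links == pl->m_links) {
        /* Out of room: double the reserved memory. */
        size_t m = pl->m_links ? 2 * pl->m_links : 10;
        LinkInfoSketch *tmp = realloc(pl->link_info, m * sizeof(*tmp));
        if (tmp == NULL)
            return -1; /* allocation failed; array left untouched */
        pl->link_info = tmp;
        pl->m_links = m;
    }
    pl->link_info[pl->n_links++] = li;
    return 0;
}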

Finally, each link carries not only a URL but also a score. The score gives an idea of how good the (maybe uncrawled) link is, according to the web crawler. Think of the link score as an approximation to CrawledPage::score when we have not crawled the link yet.

struct LinkInfo

The information that comes with a link inside a crawled page.

The link score is used to decide which links should be crawled next. It is application dependent and typically computed by looking at the text surrounding the link.

Public Members

char *url

ASCII, null terminated string for the page URL

float score

An estimated value of the link score

Constructor/Destructor

CrawledPage *crawled_page_new(const char *url)

Create a new CrawledPage

url is copied, so the CrawledPage keeps its own copy.

Default values are used for the remaining fields.

Return
NULL if failure, otherwise a newly allocated CrawledPage

void crawled_page_delete(CrawledPage *cp)

Delete a CrawledPage created with crawled_page_new

Set content hash

int crawled_page_set_hash(CrawledPage *cp, const char *hash, size_t hash_length)

Set content hash

The hash is a new copy

int crawled_page_set_hash128(CrawledPage *cp, char *hash)

Set content hash from a 128-bit hash

int crawled_page_set_hash64(CrawledPage *cp, uint64_t hash)

Set content hash from a 64-bit hash

int crawled_page_set_hash32(CrawledPage *cp, uint32_t hash)

Set content hash from a 32-bit hash
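Putting the constructor and hash setters together, a minimal usage sketch might look as follows. The header name is an assumption; adjust it to your installation.

#include <stdint.h>
#include "page_db.h" /* assumed header; adjust to your installation */

int main(void) {
    /* Build the input structure a crawler hands over to Aduana. */
    CrawledPage *cp = crawled_page_new("http://example.com/index.html");
    if (cp == NULL)
        return 1;
    cp->score = 0.7f;                                          /* optional content score */
    crawled_page_set_hash64(cp, UINT64_C(0x0123456789ABCDEF)); /* content hash */
    /* ... hand cp over to page_db_add or bf_scheduler_add ... */
    crawled_page_delete(cp);
    return 0;
}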

PageInfo

Data structures

This structure contains everything we know about a given page; it is updated as new CrawledPages arrive.

And here it is:

struct PageInfo

The information we keep about crawled and uncrawled pages

PageInfo structures are created by the PageDB; that is why there are no public constructors/destructors available.

Public Members

char *url

A copy of either CrawledPage::url or CrawledPage::links[i]

uint64_t linked_from

The page that first linked this one

double first_crawl

First time this page was crawled

double last_crawl

Last time this page was crawled

size_t n_changes

Number of content changes detected between first and last crawl

size_t n_crawls

Number of times this page has been crawled. Can be zero if it has been observed just as a link

float score

A copy of the same field at the last crawl

size_t content_hash_length

Number of bytes in PageInfo::content_hash

char *content_hash

Byte sequence with the hash of the last crawl

Constructor/Destructor

There is no constructor available for this structure. The reason is that they are automatically created from the info inside CrawledPage when page_db_add() is called.

void page_info_delete(PageInfo *pi)

Destroys the PageInfo if not NULL, otherwise does nothing

Functions

int page_info_print(const PageInfo *pi, char *out)

Write printed representation of PageInfo.

This function is intended mainly for debugging and development. The representation is: first_crawl last_crawl n_crawls n_changes url

Each field is separated with a space. The string is null terminated. We use the following format for each field:

  • first_crawl: the standard fixed size (24 bytes) as output by ctime. For example: Mon Jan 1 08:01:59 2015
  • last_crawl: the same as first_crawl
  • n_crawls: to ensure a fixed size representation this value is converted to double and represented in exponential notation with two digits. It therefore always has a length of 8 bytes: 1.21e+01
  • n_changes: the same as n_crawls
  • url: the only variable length field. However, it is truncated at 512 bytes.

Return
size of representation or -1 if error
Parameters
  • pi -

    The PageInfo to be printed

  • out -

    The output buffer, which must be at least 580 bytes long
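For example, a minimal sketch, assuming pi points at a valid PageInfo and stdio.h is included:

char buf[580];                      /* minimum documented buffer size */
if (page_info_print(pi, buf) != -1)
    puts(buf);                      /* fixed-size fields, then the URL */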

float page_info_rate(const PageInfo *pi)

Estimate the change rate of the given page. Returns -1.0 if no valid rate can be computed, otherwise a valid nonnegative change rate.

PageDB

This is one of the main components of the library. Here we store all the PageInfo structures and the links between pages.

The first thing to understand is that there are two different ways to refer to a given page, using either the URL hash or the index. Both ways of addressing the page are linked in the hash2idx database.

URL hash

The URL hash is computed using the following function:

uint64_t page_db_hash(const char *url)

Hash function used to convert from URL to hash.

The hash is a 64 bit number where the first 32 bits are a hash of the domain and the last 32 bits are a hash of the full URL. In this way all URLs with the same domain get grouped together in the database. This has some good consequences:

  1. We can access all pages inside a domain by accessing the first of them in the database and moving sequentially.
  2. When streaming links this improves locality since pages in the same domain tend to have similar links.

When a new CrawledPage arrives we compute the hash of CrawledPage::url and use this as the key inside the hash2info database, to retrieve the associated PageInfo. If no entry is found inside the database a new one is created. We do the same with each one of the links inside CrawledPage::links.

The following two functions are useful to extract the different parts of the hash.

uint32_t page_db_hash_get_domain(uint64_t hash)

Extract the domain hash from the full hash

uint32_t page_db_hash_get_url(uint64_t hash)

Extract the URL hash from the full hash
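For example, a small sketch of how the hash and its accessors fit together (assuming the library headers are included):

uint64_t h = page_db_hash("https://en.wikipedia.org/wiki/Main_Page");
uint32_t domain_hash = page_db_hash_get_domain(h); /* first 32 bits: the domain */
uint32_t url_hash    = page_db_hash_get_url(h);    /* last 32 bits: the full URL */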

Index

We could store links between pages using their URL hash, for example, as a list of pairs of the form:

004619df1e9191ff 004619df1eb839e2
004619df1e9191ff 004619df1f1a5477
004619df01e223ae 00115773f1ea355c
...

However the hashing would spoil one interesting property of links: locality. Locality means that pages usually link to pages inside their same domain. For example, here are the first links extracted from the front page of Wikipedia:

https://en.wikipedia.org/wiki/Main_Page#mw-head
https://en.wikipedia.org/wiki/Main_Page#p-search
https://en.wikipedia.org/wiki/Wikipedia
https://en.wikipedia.org/wiki/Free_content
https://en.wikipedia.org/wiki/Encyclopedia
https://en.wikipedia.org/wiki/Wikipedia:Introduction
https://en.wikipedia.org/wiki/Special:Statistics
https://en.wikipedia.org/wiki/English_language

Locality can also arise when several links go out to the same domain, different from that of the originating page. For example, of the 135 links on the front page of Hacker News more than 100 stayed on the same domain, but there were also the following groups:

http://www.ycombinator.com/
http://www.ycombinator.com/apply/

https://github.com/blog/2024-read-only-deploy-keys
https://github.com/whamtet/Excel-REPL
https://github.com/tadast/switching-to-contracting-uk/blob/master/README.md
https://github.com/HackerNews/API

Instead of storing links using the URL hash, we assign each page an integer index: it starts at zero with the first page and is automatically incremented each time a new page is added to the database. Links are then stored as lists where the first element is the index of the originating page and the rest of the elements are the indices of the outgoing links. For example, taken from a real crawl:

7 1243 1245 1251 1254 1260 1262 1263
  1264 1267 1269 1271 1274 1275 1276
  1277 1280 1283 1286 1289 1291 1295
  1309 1311 ...

Since we want to be able to perform big crawls with billions of pages we use 64 bit integers for the indices, which means they still take as much space as the URL hashes. However, these links are delta-encoded: starting at the second element of the list we subtract the previous one:

7 2 6 3 6 2 1 1 3 2 2 3 1 1 1 3 3 3 3 2 4 14 2 ...

Finally we use varint encoding for each integer. As you can see in the above example each link then requires just 8 bits, instead of the 64 bits (or 32 bits, if we could somehow reuse the domain part of the hash) that URL hashing would require.
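As an illustration of the scheme (not the library's actual encoder), here is a sketch of delta plus varint encoding over a sorted index list:

#include <stddef.h>
#include <stdint.h>

/* Delta-encode a sorted list of indices, then varint-encode each delta:
   7 bits per output byte, with the high bit as a continuation flag. */
static size_t encode_links(const uint64_t *idx, size_t n, uint8_t *out) {
    size_t pos = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t delta = idx[i] - prev; /* first element is stored as-is */
        prev = idx[i];
        do {
            uint8_t byte = delta & 0x7F;
            delta >>= 7;
            out[pos++] = delta ? (uint8_t)(byte | 0x80) : byte;
        } while (delta);
    }
    return pos; /* bytes written: small deltas take a single byte */
}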

Having indices instead of hashes is also convenient for the PageRank and HITS algorithms. They can store the page scores in arrays where the position of each page is just its index. Fast O(1) access time greatly improves the speed of the computation when working with billions of pages. Besides, locality also helps access speed, even when working in-memory.

The index for a given page is automatically created when page_db_add() is called.

Data structures

struct PageDB

Page database.

We are really talking about 4 different key/value databases:

  • info: contains fixed size information about the whole database. Right now it just contains the number of pages stored.
  • hash2idx: maps URL hash to index. Indices are consecutive identifiers for every page. This allows mapping pages to elements inside arrays.
  • hash2info: maps URL hash to a PageInfo structure.
  • links: maps page index to link indices. This allows fast streaming of all the links inside the database.

Public Members

char *path

Path to the database directory

TxnManager *txn_manager

The transaction manager counts the number of read and write transactions active and is capable of safely performing a database resize

DomainTemp *domain_temp

Track the most crawled domains

int persist

If true, do not delete files after deleting object

enum PageDBError

Values:

page_db_error_ok = 0

No error

page_db_error_memory

Error allocating memory

page_db_error_invalid_path

File system error

page_db_error_internal

Unexpected error

page_db_error_no_page

A page was requested but could not be found

Constructor/Destructor

PageDBError page_db_new(PageDB **db, const char *path)

Creates a new database and stores data inside path

Return
0 if success, otherwise the error code
Parameters
  • db -

In case of page_db_error_memory *db could be NULL. In case of other failures memory is nevertheless allocated so that the error code and message can be accessed.

  • path -

Path to directory. If it doesn't exist it will be created. If it exists and a database is already present, operations will resume with the existing database. Note that you must have read, write and execute permissions for the directory.

PageDBError page_db_delete(PageDB *db)

Close database

Close database, delete files if it should not be persisted, and free memory

Add page

PageDBError page_db_add(PageDB *db, const CrawledPage *page, PageInfoList **page_info_list)

Update PageDB with a new crawled page

It performs the following actions:

  • Compute the page hash
  • If the page is not already in the database:
    • Generate a new ID and store it in hash2idx
    • Create a new PageInfo and store it in hash2info
  • If already present, update the PageInfo inside hash2info
  • For each link:
    • Compute the hash
    • If already present in the database, just retrieve the ID
    • If not present:
      • Generate a new ID and store it in hash2idx
      • Create a new PageInfo and store it in hash2info
  • Create or overwrite the page ID -> link IDs mapping inside the links database

Return
0 if success, otherwise the error code
Parameters
  • db -

    The database to update

  • page -

    The information of the crawled page

  • page_info_list -

If not NULL this function will allocate and populate a new PageInfoList containing the PageInfo of the updated pages. It is your responsibility to call page_info_list_delete when you no longer need this structure.
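A minimal usage sketch, with error handling abbreviated (on failure, the error code and message can be read through the returned db):

PageDB *db = NULL;
if (page_db_new(&db, "./crawl-data") != 0)
    /* handle error */;

CrawledPage *cp = crawled_page_new("http://example.com");
PageInfoList *pil = NULL;
if (cp && page_db_add(db, cp, &pil) == 0)
    page_info_list_delete(pil); /* caller owns the returned list */
crawled_page_delete(cp);
page_db_delete(db);             /* also removes files, unless persist is set */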

Get info from database

PageDBError page_db_get_info(PageDB *db, uint64_t hash, PageInfo **pi)

Retrieve the PageInfo stored inside the database.

Beware that if the page is not found this will signal success, but the returned PageInfo will be NULL

PageDBError page_db_get_idx(PageDB *db, uint64_t hash, uint64_t *idx)

Get index for the given URL

PageDBError page_db_get_scores(PageDB *db, MMapArray **scores)

Build a MMapArray with all the scores

float page_db_get_domain_crawl_rate(PageDB *db, uint32_t domain_hash)

Get crawl rate for the given domain

Database settings

void page_db_set_persist(PageDB *db, int value)

Set persist option for database

PageDBError page_db_set_domain_temp(PageDB *db, size_t n_domains, float window)

Set domain temperature tracking options

Export database

These functions are used by the page_db_dump command line utility.

PageDBError page_db_info_dump(PageDB *db, FILE *output)

Dump database to file in human readable format

PageInfoList

This structure exists only because page_db_add() needs a way of returning which pages had their info created or modified. This information is necessary for schedulers. It is just a linked list, so we will not comment further on it.

Data structures

struct PageInfoList

A linked list of PageInfo (and hash), to be returned by page_db_add

Public Members

uint64_t hash

Hash inside the hash2info database

PageInfo *page_info

Info inside the hash2info database

struct PageInfoList *next

A pointer to the next element, or NULL

Constructor/Destructor

PageInfoList *page_info_list_new(PageInfo *pi, uint64_t hash)

Create a new PageInfoList, with just one element.

Return
A pointer to the first element of the list, or NULL if failure
Parameters
  • pi -

    The PageInfo to add. From this point it is the property of the list, so deleting the list deletes this element.

  • hash -

    The page hash, as used inside the hash2info database

void page_info_list_delete(PageInfoList *pil)

Deletes the list and all its contents

Functions

PageInfoList *page_info_list_cons(PageInfoList *pil, PageInfo *pi, uint64_t hash)

Add a new element to the head of the list.

Return
A pointer to the first element of the list, or NULL if failure
Parameters
  • pi -

    The PageInfo to add. From this point it is the property of the list, so deleting the list deletes this element.

  • hash -

    The page hash, as used inside the hash2info database

LinkStream

Maybe the most interesting stream coming out of PageDB is the link stream, because it is the main interface between PageDB and the different scorers, like PageRank and HITS. This stream outputs a list of Link, which are just pairs of from and to indices. Right now, because of the way links are stored inside the database, the stream groups together all the links with the same from index; however, this could change in the future, and it is actually not required by the current PageRank or HITS implementations.

The reason for using a link stream is that when billions of pages are crawled the links database can grow to several hundred megabytes.

Data structures

struct PageDBLinkStream

Public Members

PageDB *page_db

PageDB where the links database is stored

MDB_cursor *cur

Cursor to the links database

uint64_t from

Current page

uint64_t *to

A list of links

size_t n_to

Number of links

size_t i_to

Current position inside to

size_t m_to

Allocated memory for to. It must be that n_to <= m_to.

size_t n_diff

Number of links that go to a different domain

int only_diff_domain

If true only links that go to a different domain will be streamed

Constructor/Destructor

Delete the link stream and free any transaction held inside the database.

Functions

These functions take a void pointer as state because they must agree with the following interfaces:

typedef StreamState( LinkStreamNextFunc)(void *state, Link *link)

for

Get next element inside stream.

Return
link_stream_state_next if success

and

typedef StreamState( LinkStreamResetFunc)(void *state)

for

Rewind stream to the beginning
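A hedged sketch of how a consumer might drain a stream through this interface. Link is assumed to expose from and to members, as described above; state, next and reset are the void pointer and function pointers of a concrete stream such as PageDBLinkStream.

#include <inttypes.h>
#include <stdio.h>

/* Print every edge of a link stream through the generic interface. */
static void dump_links(void *state, LinkStreamNextFunc *next,
                       LinkStreamResetFunc *reset) {
    Link link;
    reset(state); /* rewind to the beginning */
    while (next(state, &link) == link_stream_state_next)
        printf("%" PRIu64 " -> %" PRIu64 "\n",
               (uint64_t)link.from, (uint64_t)link.to);
}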

HashInfoStream

Data structures

This is used by the command line utility page_db_find, which iterates over all the pages and returns which ones have their URL matching some regexp.

struct HashInfoStream

Stream over HashInfo inside PageDB

Public Members

MDB_cursor *cur

Cursor to info database

Constructor/Destructor

PageDBError hashinfo_stream_new(HashInfoStream **st, PageDB *db)

Create a new stream

void hashinfo_stream_delete(HashInfoStream *st)

Free stream

Functions

StreamState hashinfo_stream_next(HashInfoStream *st, uint64_t *hash, PageInfo **pi)

Get next element in stream

HashIdxStream

This is used in two different places. The first one is the command line utility page_db_links, which returns which pages link to, or are linked from, a given page.

The other, more important, use case is inside schedulers which, after page scores are updated, need to iterate over all of them to see which ones have changed enough to be rescheduled.

Data structures

struct HashIdxStream

Stream over hash/index pairs inside PageDB

Public Members

MDB_cursor *cur

Cursor to the hash2idx database

Constructor/Destructor

PageDBError hashidx_stream_new(HashIdxStream **st, PageDB *db)

Create a new stream

void hashidx_stream_delete(HashIdxStream *st)

Free stream

Functions

StreamState hashidx_stream_next(HashIdxStream *st, uint64_t *hash, size_t *idx)

Get next element in stream
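A sketch of a typical iteration; the stream_state_next value name is an assumption modeled on the link_stream_state_next value mentioned earlier.

HashIdxStream *st;
if (hashidx_stream_new(&st, db) == 0) {
    uint64_t hash;
    size_t idx;
    while (hashidx_stream_next(st, &hash, &idx) == stream_state_next) {
        /* e.g. look up the page's new score at position idx and reschedule */
    }
    hashidx_stream_delete(st);
}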

DomainTemp

This is used inside PageDB to track how many times the most frequently crawled domains have been crawled. This information is in turn used by the scheduler, which will try not to serve requests for the most crawled domains.

Ideally, for each domain we would store a (growing) list of timestamps marking when some page in the domain was crawled. With this list in hand we could answer questions like: how many times has the domain been crawled in the last 60 seconds? Instead we make the following approximation: imagine that we store only how many times the domain has been crawled in the last \(T\) seconds. We don't know how those crawls were distributed in that time: they could all be clustered at the beginning, or follow some strange pattern. Instead we will assume they are evenly distributed.

Now, if some time \(t\) elapses without any more crawls, how many crawls remain in the time window?

Since the \(n\) crawls are evenly distributed there are \(n/T\) crawls per second, so \(n\frac{t}{T}\) of them have moved out of the time window:

\[n(t_0 + t) - n(t_0) = -n(t_0)\frac{t}{T}\]

If \(t \to dt\) then we have the following differential equation:

\[\frac{dn}{dt} = -\frac{1}{T}n\]

The solution of the above equation is obviously:

\[n(t) = n(0)e^{-\frac{t}{T}}\]

\(n\) then evolves by decaying exponentially between crawls and jumping by one each time a new crawl is counted.

[Figure: evolution of \(n\) with a time window of just 2 seconds and crawls at instants 1, 2.5, 2.6, 2.7, 4 and 5]
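In code, the cool-down is just an exponential decay of the stored counter; a minimal sketch:

#include <math.h>

/* n(t) = n(0) * exp(-t/T): decay the crawl-count estimate 'temp' after
   'elapsed' seconds, with time window 'window' (= T). */
static float cool_down(float temp, float elapsed, float window) {
    return temp * expf(-elapsed / window);
}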

Data structures

struct DomainTemp

Tracks how “hot” the most crawled domains are.

We want to avoid crawling the same domain repeatedly. For this purpose this structure tracks how many times a domain has been crawled within the specified time window. For performance reasons an approximation of the actual number of crawls is maintained. Under certain assumptions it can be shown that if \(n\) is the number of crawls for a domain, it follows the (cool down) differential equation:

\[ \frac{dn}{dt} = -\frac{1}{T}n \]

where \(T\) is the time window.

Public Members

DomainTempEntry *table

An array of domain/temperature pairs

size_t length

Length of DomainTemp::table

float time

Last time temperatures were updated

float window

Time window to consider in the cooldown

struct DomainTempEntry

Associate a domain hash with a temperature

Public Members

uint32_t hash

Domain hash

float temp

Domain temperature: an estimation of how many times the domain has been crawled in the time window

Constructor/Destructor

DomainTemp *domain_temp_new(size_t length, float window)

Create a new domain temp tracking structure

Return
A pointer to the new struct, or NULL if failure
Parameters
  • length -

    Maximum number of domains to track

  • window -

    Time window

void domain_temp_delete(DomainTemp *dh)

Free memory

Functions

void domain_temp_update(DomainTemp *dh, float t)

Updates temperatures up to the current time t

void domain_temp_heat(DomainTemp *dh, uint32_t hash)

Adds another count to domain.

If the domain is already tracked its counter is incremented. If the domain is not present then we try to initialize it in an empty slot. If no empty slot is available then the domain with the fewest crawls is replaced with the new domain, provided its counter is below 1.

float domain_temp_get(DomainTemp *dh, uint32_t hash)

Gets domain temp

Error handling

Errors are signaled in the following ways:

  • For functions not returning pointers, 0 means success and any other value some kind of failure. Usually an enumeration of error codes is defined; otherwise -1 is used as the failure code.
  • For functions returning pointers, failure is signaled by returning a null pointer.
  • If the causes of error are varied enough, the structures inside this library contain an Error structure, which holds the error code and an error message. The error message usually resembles a stack trace, to aid in debugging the problem.

Data structures

MAX_ERROR_LENGTH

Maximum length of error message

struct Error

Public Members

pthread_mutex_t mtx

Make operations on errors atomic.

If an error occurs while operating on this mutex it will be silently ignored

int code

Error code, depends on the application but 0 always signals no error

char message[MAX_ERROR_LENGTH+1]

A descriptive message associated with the error code. If no error then it contains “NO ERROR”

Constructor/Destructor

void error_init(Error *error)

Initialize structure.

Error code is set to 0 and message to “NO ERROR”.

void error_destroy(Error *error)

Clean up. Will NOT free error

Error *error_new(void)

Allocate and initialize a new error structure

void error_delete(Error *error)

Destroy and free an error structure

Functions

void error_set(Error *error, int code, const char *msg)

Set error.

If an error is already present then do nothing. If you want to overwrite an already existing error then first call error_clean

void error_clean(Error *error)

Clean error.

Error code is set to 0 and the message to NO ERROR.

void error_add(Error *error, const char *msg)

Add a descriptive message to the existing message, leaving the error code as is

const char *error_message(const Error *error)

Return error message if error, otherwise NULL

int error_code(const Error *error)

Return error code
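A sketch of the intended pattern: set the root cause once, then add context as the error propagates up the call chain (stdio.h assumed included).

Error *err = error_new();
error_set(err, -1, "mdb_put failed");      /* root cause */
error_add(err, "page_db_add");             /* caller adds its frame */
if (error_code(err) != 0)
    fprintf(stderr, "error: %s\n", error_message(err));
error_delete(err);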

TxnManager

Data structures

struct TxnManager

Transaction Manager.

LMDB has several restrictions on the operations it allows from multiple threads, and some of these restrictions must be enforced by the application code. In particular:

  1. Some operations require that no transactions in the same process are active, for example mdb_env_set_mapsize.
  2. Some operations require that no write transactions are active. For example, although it is not documented, mdb_env_info seems to crash if write transactions are active.

This structure tracks the number of read and write transactions active inside the process and allows blocking until all of them are aborted or committed.

Public Members

MDB_env *env

LMDB environment where transactions happen

InvSemaphore txn_counter_read

Counter of read transactions

InvSemaphore txn_counter_write

Counter of write transactions

struct InvSemaphore

Inverse Semaphore.

An inverse semaphore blocks when the count is greater than zero (a regular semaphore blocks when the count is at zero).

enum TxnManagerError

Values:

txn_manager_error_ok = 0

No error

txn_manager_error_internal

Unexpected error

txn_manager_error_memory

Error allocating new memory

txn_manager_error_thread

Error inside pthreads

txn_manager_error_mdb

Error inside LMDB

Constructor/Destructor

TxnManagerError txn_manager_new(TxnManager **tm, MDB_env *env)

Allocate a new TxnManager

Return
0 if success, otherwise error code.
Parameters
  • tm -

    The new transaction manager.

  • env -

The LMDB environment where transactions will be opened, aborted or committed.

TxnManagerError txn_manager_delete(TxnManager *tm)

Destroy and free manager

Functions

The following functions are wrappers around the corresponding ones in LMDB. They will increment/decrement automatically the read and write transactions counters.

TxnManagerError txn_manager_begin(TxnManager *tm, int flags, MDB_txn **txn)

Begin a new transaction.

Return
0 if success, otherwise error code.
Parameters
  • tm -

  • flags -

    The flags that you pass to LMDB’s mdb_txn_begin. These flags will be checked for MDB_RDONLY to decide which transaction counter to increment. This operation will block if an environment resize is in progress.

  • txn -

    New transaction.

TxnManagerError txn_manager_commit(TxnManager *tm, MDB_txn *txn)

Commit transaction.

The corresponding counter will be decremented

TxnManagerError txn_manager_abort(TxnManager *tm, MDB_txn *txn)

Abort transaction.

The corresponding counter will be decremented

The following function is the main reason for the existence of TxnManager.

TxnManagerError txn_manager_expand(TxnManager *tm)

Check if the environment must be resized. If this is the case then resize it.

This call always blocks until there are no active write transactions. It additionally blocks until there are no active read transactions, but only if a resize is actually necessary.

If a resize happens then creation of new read and write transactions will be blocked until it finishes.

MDB_MINIMUM_FREE_PAGES

Parameter associated with txn_manager_expand.

The mmap is resized when the remaining free space is less than this amount.
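Putting the wrappers together, a hedged sketch of a write operation under the manager (lmdb.h and the library headers assumed included):

static void write_something(TxnManager *tm) {
    MDB_txn *txn;
    if (txn_manager_expand(tm) != 0)   /* resize the map if almost full */
        return;
    if (txn_manager_begin(tm, 0, &txn) != 0)
        return;
    int rc = 0; /* ... mdb_put / mdb_get calls using txn ... */
    if (rc == 0)
        txn_manager_commit(tm, txn);   /* decrements the write counter */
    else
        txn_manager_abort(tm, txn);    /* also decrements the counter */
}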

BFScheduler

Data structures

BF_SCHEDULER_DEFAULT_SIZE

Size of the mmap to store the schedule

BF_SCHEDULER_DEFAULT_PERSIST

Default value for BFScheduler::persist

struct BFScheduler

BestFirst scheduler.

As its name implies, this scheduler follows a greedy strategy to decide which page to crawl next. It maintains an ordered list of uncrawled pages; to pick the next page to crawl it takes the page with the highest score and removes it from the top of the list.

The key is then to assign valid scores to the pages. If no scorer is set up this scheduler will use the score provided when the page is crawled. Alternatively a scorer can be set up; see for example page_rank_scorer_setup or hits_scorer_setup.

Public Members

PageDB *page_db

Page database

The page database is neither created nor destroyed by the scheduler. The rationale is that the scheduler can be changed while using the same PageDB. The schedule is “attached” to the PageDB.

Scorer *scorer

The scorer used to get page scores.

If not set up, PageInfo::score will be used

TxnManager *txn_manager

The scheduler state is maintained inside an LMDB environment

char *path

Path to the env

It is built by appending _bfs to the PageDB::path

int persist

If true, do not delete files after deleting object

float max_soft_domain_crawl_rate

Soft limit on the number of crawls per second per domain

float max_hard_domain_crawl_rate

Hard limit on the number of crawls per second per domain

enum BFSchedulerError

Values:

bf_scheduler_error_ok = 0

No error

bf_scheduler_error_memory

Error allocating memory

bf_scheduler_error_invalid_path

File system error

bf_scheduler_error_internal

Unexpected error

bf_scheduler_error_thread

Error inside the threading library

Constructor/Destructor

BFSchedulerError bf_scheduler_new(BFScheduler **sch, PageDB *db)

Allocate memory and create a new scheduler

Return
0 if success, otherwise the error code
Parameters
  • sch -

    Where to create it. *sch can be NULL in case of memory error

  • db -

    PageDB to attach. Remember it will not be created nor destroyed by the scheduler

void bf_scheduler_delete(BFScheduler *sch)

Delete scheduler.

It may or may not delete associated disk files depending on the BFScheduler::persist flag

Input/Output

BFSchedulerError bf_scheduler_add(BFScheduler *sch, const CrawledPage *page)

Add a new crawled page

It will add the page also to the PageDB.

Return
0 if success, otherwise the error code
Parameters
  • sch -

  • page -

BFSchedulerError bf_scheduler_request(BFScheduler *sch, size_t n_pages, PageRequest **request)

Ask the scheduler for the next pages to be crawled

Return
0 if success, otherwise the error code
Parameters
  • sch -

  • n_pages -

    Maximum number of pages to return

  • request -

    Where the new page request is returned
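A sketch of the resulting crawl loop; PageRequest's members are not described in this document, so consuming the request is left as a comment:

CrawledPage *cp = crawled_page_new("http://example.com");
if (cp && bf_scheduler_add(sch, cp) == 0) { /* also updates the PageDB */
    PageRequest *req = NULL;
    if (bf_scheduler_request(sch, 10, &req) == 0) {
        /* ... crawl the returned URLs, feed the results back via
           bf_scheduler_add, and free the request ... */
    }
}
crawled_page_delete(cp);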

Update scores

BF_SCHEDULER_UPDATE_BATCH_SIZE

Size of the batch used in updating the schedule.

Updating the schedule involves starting a write transaction. However, write transactions coming from multiple threads are serialized. Since adding new pages to the schedule and returning requests also start write transactions, the update thread could block these more critical operations. To avoid this we keep write transactions short by splitting the update into batches.

BF_SCHEDULER_UPDATE_NUM_PAGES

Don’t update scores until this amount of new pages has arrived

BF_SCHEDULER_UPDATE_PER_PAGES

Don’t update scores until this percentage of new pages has arrived

BFSchedulerError bf_scheduler_update_start(BFScheduler *sch)

Start the update thread.

The update thread will run periodically the scorer, in case there is one, to recompute page scores.

BFSchedulerError bf_scheduler_update_stop(BFScheduler *sch)

Stop the update thread

Settings

void bf_scheduler_set_persist(BFScheduler *sch, int value)

Set persist option for scheduler

BF_SCHEDULER_CRAWL_RATE_STEPS

Number of steps to take between soft and hard crawl rate limit

BFSchedulerError bf_scheduler_set_max_domain_crawl_rate(BFScheduler *sch, float max_soft_crawl_rate, float max_hard_crawl_rate)

Set BFScheduler::max_soft_domain_crawl_rate and BFScheduler::max_hard_domain_crawl_rate

Scorer

struct Scorer

Scorers are responsible for computing a measure, between 0 and 1, of the relevance of a given page.

In order to be used in different schedulers they must obey the following interface.

Public Members

void *state

Scorer specific state

ScorerUpdateFunc *update

Update scorer

ScorerAddFunc *add

Add new page to scorer

ScorerGetFunc *get

Get a page score

typedef int( ScorerUpdateFunc)(void *state)

Scorer update function interface

typedef int( ScorerAddFunc)(void *state, const PageInfo *page_info, float *score)

Scorer add page function interface

typedef int( ScorerGetFunc)(void *state, size_t idx, float *score_old, float *score_new)

Scorer get page score function

To see concrete implementations have a look at PageRankScorer and HitsScorer.

PageRankScorer

Data structures

PAGE_RANK_SCORER_USE_CONTENT_SCORES

Default value for PageRankScorer::use_content_scores

PAGE_RANK_SCORER_PERSIST

Default value for PageRankScorer::persist

struct PageRankScorer

Public Members

PageRank *page_rank

Implementation of the PageRank algorithm

PageDB *page_db

Database with crawl information

Error *error

Error status

int persist

If true files will not be removed by page_rank_scorer_delete

int use_content_scores

If true use content scores inside PageRank algorithm

enum PageRankScorerError

Values:

page_rank_scorer_error_ok = 0

No error

page_rank_scorer_error_memory

Error allocating memory

page_rank_scorer_error_internal

Unexpected error

page_rank_scorer_error_precision

Could not achieve precision in maximum number of loops

Constructor/Destructor

PageRankScorerError page_rank_scorer_new(PageRankScorer **prs, PageDB *db)

Create new scorer

PageRankScorerError page_rank_scorer_delete(PageRankScorer *prs)

Delete scorer.

Files will be deleted unless PageRankScorer::persist is true

Functions

int page_rank_scorer_add(void *state, const PageInfo *page_info, float *score)

Add new page to scorer.

Function signature complies with Scorer::add

int page_rank_scorer_get(void *state, size_t idx, float *score_old, float *score_new)

Access PageRank scorer as with page_rank_get.

Function signature complies with Scorer::get

int page_rank_scorer_update(void *state)

Update scores.

Function signature complies with Scorer::update

void page_rank_scorer_setup(PageRankScorer *prs, Scorer *scorer)

Given a Scorer fill its fields with the necessary info

Settings

void page_rank_scorer_set_persist(PageRankScorer *prs, int value)

Sets PageRankScorer::persist

void page_rank_scorer_set_use_content_scores(PageRankScorer *prs, int value)

Sets PageRankScorer::use_content_scores

void page_rank_scorer_set_damping(PageRankScorer *prs, float value)

Sets PageRankScorer::page_rank::damping

HitsScorer

Data structures

HITS_SCORER_USE_CONTENT_SCORES

Default value for HitsScorer::use_content_scores

HITS_SCORER_PERSIST

Default value for HitsScorer::persist

struct HitsScorer

Public Members

Hits *hits

Implementation of the HITS algorithm

PageDB *page_db

Database with crawl information

Error *error

Error status

int persist

If true files will not be removed by hits_scorer_delete

int use_content_scores

If true use content scores inside the HITS algorithm

enum HitsScorerError

Values:

hits_scorer_error_ok = 0

No error

hits_scorer_error_memory

Error allocating memory

hits_scorer_error_internal

Unexpected error

hits_scorer_error_precision

Could not achieve precision in maximum number of loops

Constructor/Destructor

HitsScorerError hits_scorer_new(HitsScorer **hs, PageDB *db)

Create new scorer

HitsScorerError hits_scorer_delete(HitsScorer *hs)

Delete scorer.

Files will be deleted unless HitsScorer::persist is true

Functions

int hits_scorer_add(void *state, const PageInfo *page_info, float *score)

Add new page to scorer.

Function signature complies with Scorer::add

int hits_scorer_get(void *state, size_t idx, float *score_old, float *score_new)

Access HITS scorer as with hits_get_authority.

Function signature complies with Scorer::get

int hits_scorer_update(void *state)

Update scores.

Function signature complies with Scorer::update

void hits_scorer_setup(HitsScorer *hs, Scorer *scorer)

Given a Scorer fill its fields with the necessary info

Settings

void hits_scorer_set_persist(HitsScorer *hs, int value)

Sets HitsScorer::persist

void hits_scorer_set_use_content_scores(HitsScorer *hs, int value)

Sets HitsScorer::use_content_scores

PageRank

Data structures

PAGE_RANK_DEFAULT_DAMPING

Default PageRank::damping

PAGE_RANK_DEFAULT_MAX_LOOPS

Default PageRank::max_loops

PAGE_RANK_DEFAULT_PRECISION

Default PageRank::precision

PAGE_RANK_DEFAULT_PERSIST

Default PageRank::persist

struct PageRank

Implementation of the PageRank algorithm.

See for example Wikipedia.

Additionally, it allows merging the original, purely link based algorithm with page content scores.

Public Members

MMapArray *out_degree

Number of outgoing links.

If page content scores are used then this array actually contains the aggregated scores of all the outgoing links.

MMapArray *value1

PageRank value, old iteration

MMapArray *value2

PageRank value, new iteration

size_t n_pages

Number of pages

char *path_out_degree

Path to the out degree mmap array file

char *path_pr

Path to page rank mmap array file

Error *error

Error status

float damping

Probability of making a random page jump: 1.0 - damping

MMapArray *scores

External computed scores associated with the pages

float total_score

Total score

size_t max_loops

If greater than 0 stop computation even if precision was not achieved

float precision

Stop iteration when the largest change in any page score is below this threshold

int persist

If true, do not delete files after deleting

enum PageRankError

Values:

page_rank_error_ok = 0

No error

page_rank_error_memory

Error allocating memory

page_rank_error_internal

Unexpected error

page_rank_error_precision

Could not achieve precision in maximum number of loops

Constructor/Destructor

PageRankError page_rank_new(PageRank **pr, const char *path, size_t max_vertices)

Create a new structure.

Return
0 if success, otherwise an error code.
Parameters
  • pr -

    The new structure is returned here. NULL if memory error.

  • path -

    Directory where all files will be stored.

  • max_vertices -

    Initial hint of the number of pages.

PageRankError page_rank_delete(PageRank *pr)

Free memory and close associated resources.

Files will be deleted or not depending on the value of PageRank::persist.

Functions

PageRankError page_rank_set_n_pages(PageRank *pr, size_t n_pages)

Reserve memory for the specified number of pages

PageRankError page_rank_compute(PageRank *pr, void *link_stream_state, LinkStreamNextFunc *link_stream_next, LinkStreamResetFunc *link_stream_reset)

Compute PageRank score for all pages.

The algorithm makes random access to page scores and sequential access to the links.

Return
0 if success, otherwise an error code.

PageRankError page_rank_get(const PageRank *pr, size_t idx, float *score_old, float *score_new)

Get PageRank score associated to a given page.

Return
0 if success, otherwise an error code.

void page_rank_set_persist(PageRank *pr, int value)

Set value of PageRank::persist
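A sketch of running the algorithm over a link stream; the stream triple (state, next, reset) comes from a concrete implementation such as PageDBLinkStream, whose constructor is not part of this document:

PageRank *pr = NULL;
if (page_rank_new(&pr, "./pagerank", 1000) == 0) {
    if (page_rank_compute(pr, stream_state,
                          stream_next, stream_reset) == 0) {
        float old_score, new_score;
        page_rank_get(pr, 0, &old_score, &new_score); /* page with index 0 */
    }
    page_rank_delete(pr);
}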

Hits

Data structures

HITS_DEFAULT_MAX_LOOPS

Default Hits::max_loops

HITS_DEFAULT_PRECISION

Default Hits::precision

HITS_DEFAULT_PERSIST

Default Hits::persist

struct Hits

Implementation of the HITS algorithm.

See for example Wikipedia.

Additionally, it allows merging the original, purely link based algorithm with page content scores. The idea is that authority scores are distributed back to the hub according to the content scores. For example, imagine that page A links to B, C and D and the content/authority scores are:

  • B: 0.5 / 0.1
  • C: 0.1 / 1.0
  • D: 0.9 / 0.5

Then the hub score of A would be computed as:

Hub(A) = 0.5*0.1 + 0.1*1.0 + 0.9*0.5

Public Members

MMapArray *h1

Hub score, previous iteration

MMapArray *h2

Hub score, current iteration

MMapArray *a1

Authority score, previous iteration

MMapArray *a2

Authority score, current iteration

char *path_h1

Path to mmap file of Hits::h1

char *path_h2

Path to mmap file of Hits::h2

size_t n_pages

Number of pages

Error *error

Error status

MMapArray *scores

External computed scores associated with the pages

size_t max_loops

If greater than 0 stop computation even if precision was not achieved

float precision

Stop iteration when the largest change in any page score is below this threshold

int persist

If true, do not delete files after deleting object

enum HitsError

Values:

hits_error_ok = 0

No error

hits_error_memory

Error allocating memory

hits_error_internal

Unexpected error

hits_error_precision

Could not achieve precision in maximum number of loops

Constructor/Destructor

HitsError hits_new(Hits **hits, const char *path, size_t max_vertices)

Create a new structure.

Return
0 if success, otherwise an error code.
Parameters
  • hits -

    The new structure is returned here. NULL if memory error.

  • path -

    Directory where all files will be stored.

  • max_vertices -

    Initial hint of the number of pages.

HitsError hits_delete(Hits *hits)

Free memory and close associated resources.

Files will be deleted or not depending on the value of Hits::persist.

Functions

HitsError hits_set_n_pages(Hits *hits, size_t n_pages)

Reserve memory for the specified number of pages

HitsError hits_compute(Hits *hits, void *link_stream_state, LinkStreamNextFunc *link_stream_next, LinkStreamResetFunc *link_stream_reset)

Compute HITS score for all pages.

The algorithm makes random access to page scores and sequential access to the links.

Return
0 if success, otherwise an error code.

HitsError hits_get_hub(const Hits *pr, size_t idx, float *score_old, float *score_new)

Get hub score associated to a given page.

Return
0 if success, otherwise an error code.
Parameters
  • pr -

  • idx -

    Page index.

  • score_old -

    Score on the previous call to hits_compute.

  • score_new -

    Score on the last call to hits_compute.

HitsError hits_get_authority(const Hits *pr, size_t idx, float *score_old, float *score_new)

Get authority score associated to a given page.

Return
0 if success, otherwise an error code.
Parameters
  • pr -

  • idx -

    Page index.

  • score_old -

    Score on the previous call to hits_compute.

  • score_new -

    Score on the last call to hits_compute.

void hits_set_persist(Hits *hits, int value)

Set value of Hits::persist

MMapArray

Data structures

struct MMapArray

A memory mapped array

Public Members

char *mem

Pointer to data

int fd

File descriptor for data

char *path

Path to data file

size_t n_elements

Number of elements

size_t element_size

Size of each element

int persist

If true, do not delete files after deleting object

enum MMapArrayError

Values:

mmap_array_error_ok = 0

No error

mmap_array_error_memory

Error allocating memory

mmap_array_error_internal

Unexpected error

mmap_array_error_mmap

Error with a mmap operation (creation, unmapping, advise...)

mmap_array_error_file

Error manipulating the file system

mmap_array_error_out_of_bounds

Tried to access array past boundaries

Constructor/Destructor

MMapArrayError mmap_array_new(MMapArray **marr, const char *path, size_t n_elements, size_t element_size)

Create a new MMapArray

Return
0 if success, otherwise the error code (also available in marr if not NULL)
Parameters
  • marr -

    Will be changed to point to the newly allocated structure, or NULL if failure

  • path -

    Path to the associated file. Can be NULL in which case the mapping is made anonymous.

  • n_elements -

    Number of elements (can be changed later with mmap_array_resize)

  • element_size -

    Number of bytes of each element

MMapArrayError mmap_array_delete(MMapArray *marr)

Delete MMapArray

If the structure cannot be deleted, the memory will not be freed

Return
0 if success, otherwise the error code (also available in marr)

Functions

MMapArrayError mmap_array_advise(MMapArray *marr, int flag)

Advise memory use pattern

It accepts any flag that madvise accepts

Return
0 if success, otherwise the error code (also available in marr)

MMapArrayError mmap_array_sync(MMapArray *marr, int flag)

Force memory-disk synchronization

It accepts any flag that msync accepts

Return
0 if success, otherwise the error code (also available in marr)

void *mmap_array_idx(MMapArray *marr, size_t n)

Returns pointer to the array element

Return
In case of failure it will return NULL. The error code is available in marr

MMapArrayError mmap_array_set(MMapArray *marr, size_t n, const void *x)

Set array element value

Return
0 if success, otherwise the error code (also available in marr)

void mmap_array_zero(MMapArray *marr)

Set all elements of array to zero

MMapArrayError mmap_array_resize(MMapArray *marr, size_t n_elements)

Change number of elements

The new memory is initialized to 0

Return
0 if success, otherwise the error code (also available in marr)
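A short usage sketch: a file-backed array of 1000 floats, using only the functions documented above.

MMapArray *marr = NULL;
if (mmap_array_new(&marr, "./scores.bin", 1000, sizeof(float)) == 0) {
    float x = 0.5f;
    mmap_array_set(marr, 0, &x);                  /* write element 0 */
    float *y = (float *)mmap_array_idx(marr, 0);  /* pointer to element 0 */
    if (y != NULL) {
        /* *y is now 0.5f */
    }
    mmap_array_delete(marr);                      /* removes the file unless persist is set */
}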