Seeks Configuration

Seeks has several configurable elements:

  • the proxy,
  • the websearch plugin,
  • the image websearch plugin,
  • the personalization plugin,
  • the query capture plugin,
  • the URI capture plugin.


Proxy configuration

The proxy is a hack of Privoxy, so its configuration file looks similar.

Additional options include:

plugin directory (2.5)

Specifies the directory in which Seeks looks for compiled plugins, which it then loads dynamically. Default value: Unset. In that case the directory is forced through the command line or automatically set to the base directory.

plugindir /usr/local/lib/seeks

plugin activation (2.6)

Specifies which dynamically loaded plugins are activated. Default: Unset, in which case all loaded plugins are activated. The 'httpserv' plugin is not loaded by default.

activated-plugin websearch
activated-plugin blocker
activated-plugin img_websearch
activated-plugin uri-capture
activated-plugin query-capture
activated-plugin cf
#activated-plugin httpserv
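
As an illustration of the file layout shown above, here is a minimal Python sketch that collects directives from such a file. It is not the parser Seeks uses; it only assumes what the examples on this page suggest, namely one 'keyword value' directive per line, '#' starting a comment, and keywords such as activated-plugin being repeatable. The function name parse_directives is made up for this example.

```python
from collections import defaultdict


def parse_directives(text):
    """Collect 'keyword value' directives, skipping blank lines and # comments.

    Assumption: one directive per line, as in the examples on this page.
    """
    directives = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # keyword, then the rest of the line
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[keyword].append(value)
    return directives


sample = """\
plugindir /usr/local/lib/seeks
activated-plugin websearch
activated-plugin blocker
#activated-plugin httpserv
"""

config = parse_directives(sample)
print(config["activated-plugin"])  # the commented-out httpserv line is ignored
```

Repeated keywords are kept as a list, so the commented-out httpserv line simply does not show up among the activated plugins.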

automatic disabling of proxy (2.8)

Controls whether the proxy is automatically disabled when the HTTP server plugin is running. Value is 0 or 1. Default: 1, the proxy is automatically disabled when the HTTP server plugin is running.

automatic-proxy-disable 1

user database file (2.9)

The user database contains user data such as issued queries and clicked URLs for personalization. This option specifies where and in which file the user db should be stored.

Default: Unset. The db is then located in $HOME/.seeks/seeks_user.db.

user-db-file /path/to/file.db
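
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule stated on this page (explicit user-db-file if set, otherwise $HOME/.seeks/seeks_user.db); the function resolve_user_db and its parameters are hypothetical and not part of Seeks.

```python
import os


def resolve_user_db(user_db_file=None, home=None):
    """Return the user db path: the configured file, or the documented default."""
    if user_db_file:
        return user_db_file
    # Assumption: $HOME is the current user's home directory.
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".seeks", "seeks_user.db")


print(resolve_user_db("/path/to/file.db"))
print(resolve_user_db())
```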

user database remote address (2.10)

Seeks provides the ability to connect to a remote database to store data and reuse data (e.g. for web search personalization).

Default: Unset. Seeks then uses a local file as its database (see 2.9).

user-db-address 127.0.0.1:9000

user database startup check (2.11)

Whether to check the user db at startup for removing old records. On nodes with large amounts of content this makes startup slow, so this option allows one to deactivate the check at startup.

Default: 1. Startup check is activated.

user-db-startup-check 1

user database optimization (2.12)

Whether to optimize the user db at startup and shutdown. Optimization can take a long time on very large databases and thus be detrimental in certain cases. Typically, it makes rebooting a node with a large user database take several minutes.

Default: 1. Optimization is activated.

user-db-optimize 1

user database support for large file (2.13)

Whether to activate support for large user databases (larger than 2 GB).

Default: 0. No support for large db file.

user-db-large 0

URL pointing to the source code (2.14)

AGPLv3 requires a pointer to the source code to be given to every user of the program. This option sets the URL where the source code of the Seeks node can be read.

Default: http://seeks.git.sourceforge.net/git/gitweb.cgi?p=seeks/seeks;a=tree

url-source-code url/to/source_code

Websearch plugin configuration

The websearch plugin configuration file is

src/plugins/websearch/websearch-config

All the following configuration options and their default values are to be found in this file.

websearch language

The option search-language defines the websearch language of preference. Default: en

search-language fr

Sets the language to French.

When set to auto, the language is detected automatically from the HTTP headers. The comments in the configuration file give auto as the default.

number of results per page

Maximum number of websearch results on a single page. Default: 10

search-results-page 10

websearch engine selection

The option search-engine allows the selection of a set of search engines and feeds to get results from. Entries take the form:

search-engine <engine> <url> <name> <default | nodefault>
  • 'engine' specifies a feed / search engine parser. Every engine can support up to 10 URLs.
  • 'url' specifies a URL to query from. The URL can contain parameters, such as %query for specifying where the user query should be placed.
  • 'name' specifies the name of the feed, to be called from the API.
  • 'default' indicates that the engine and URL are used by default when no engines or URLs are specified to the API.

Examples:

search-engine google http://www.google.com/search?q=%query&start=%start&num=%num&hl=%lang&ie=%encoding&oe=%encoding google default
search-engine twitter http://search.twitter.com/search.atom?q=%query&page=%start&rpp=%num twitter nodefault http://identi.ca/api/search.atom?q=%query&page=%start&rpp=%num identica nodefault
search-engine opensearch_rss http://plone.org/search_rss?SearchableText=%query plosone nodefault

Then to access the engines through the API:

&engines=twitter:twitter
&engines=twitter:identica
&engines=twitter

The latter returns both twitter and identica feeds.
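
To make the entry format and the engines selector concrete, here is a small Python sketch. It is an illustration under assumptions, not the Seeks implementation: it assumes that an entry is the engine name followed by one or more (url, name, default | nodefault) triples, as in the twitter example above, that a selector is either 'engine' or 'engine:name', and that placeholders are replaced textually. The functions parse_search_engine, select_feeds and expand_url, and the default values passed for %lang and %encoding, are made up for this example.

```python
from urllib.parse import quote_plus


def parse_search_engine(line):
    """Parse 'search-engine <engine>' followed by (url, name, flag) triples."""
    tokens = line.split()
    if len(tokens) < 5 or tokens[0] != "search-engine" or (len(tokens) - 2) % 3:
        raise ValueError("malformed search-engine line: " + line)
    engine, rest = tokens[1], tokens[2:]
    feeds = []
    for i in range(0, len(rest), 3):
        url, name, flag = rest[i:i + 3]
        if flag not in ("default", "nodefault"):
            raise ValueError("expected default or nodefault, got: " + flag)
        feeds.append({"engine": engine, "url": url, "name": name,
                      "default": flag == "default"})
    return feeds


def select_feeds(feeds, engines=None):
    """Resolve one selector: 'engine', 'engine:name', or None for defaults."""
    if not engines:
        return [f for f in feeds if f["default"]]
    engine, _, name = engines.partition(":")
    return [f for f in feeds
            if f["engine"] == engine and (not name or f["name"] == name)]


def expand_url(template, query, start=0, num=10, lang="en", encoding="utf-8"):
    """Fill the %-placeholders; the user query is URL-encoded and filled last."""
    values = {"%start": str(start), "%num": str(num), "%lang": lang,
              "%encoding": encoding, "%query": quote_plus(query)}
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


lines = [
    "search-engine google http://www.google.com/search?q=%query&start=%start"
    "&num=%num&hl=%lang&ie=%encoding&oe=%encoding google default",
    "search-engine twitter http://search.twitter.com/search.atom?q=%query"
    "&page=%start&rpp=%num twitter nodefault"
    " http://identi.ca/api/search.atom?q=%query&page=%start&rpp=%num"
    " identica nodefault",
]
feeds = [feed for line in lines for feed in parse_search_engine(line)]

print([f["name"] for f in select_feeds(feeds, "twitter")])
print([f["name"] for f in select_feeds(feeds, "twitter:identica")])
print(expand_url(select_feeds(feeds)[0]["url"], "seeks project"))
```

With no selector, only the feeds flagged default are returned, which mirrors the description of the 'default' field above.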

websearch cache expiration (seconds)

Minimum number of seconds that search results are kept in the system cache for reuse, update, etc. while not being used. The cache is per query, and the delay is reset every time a live query is accessed. Default: 300

query-context-delay 300
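
The expiration rule can be pictured with the following Python sketch. It is a simplified model, not Seeks code: it assumes one timestamp per query that is refreshed on every access, and a periodic sweep that drops queries left unused for at least the configured delay. The class QueryCache and its methods are hypothetical.

```python
import time


class QueryCache:
    """Per-query cache whose expiry timer restarts on every access."""

    def __init__(self, delay=300, clock=time.monotonic):
        self.delay = delay      # plays the role of query-context-delay
        self.clock = clock
        self._entries = {}      # query -> (results, last access time)

    def put(self, query, results):
        self._entries[query] = (results, self.clock())

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        results, _ = entry
        self._entries[query] = (results, self.clock())  # access resets the timer
        return results

    def sweep(self):
        """Drop queries unused for at least `delay` seconds; return their keys."""
        now = self.clock()
        expired = [q for q, (_, last) in self._entries.items()
                   if now - last >= self.delay]
        for q in expired:
            del self._entries[q]
        return expired


cache = QueryCache(delay=300)
cache.put("seeks", ["result 1", "result 2"])
print(cache.get("seeks"))
```

Passing the clock as a parameter keeps the sketch easy to exercise without waiting for real time to pass.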

enabling thumbnails

The following option enables the insertion of thumbnails from http://www.thumbshots.com, for websearch result URLs. Default: 0

enable-thumbs 1

enables them.

enabling javascript

Enabling javascript on the websearch results pages enables keyboard shortcuts, and will allow all sorts of dynamic treatments in the future. Default: 0

enable-js 1

enables javascript.

enabling background content analysis

This option enables the background download of the content pointed to by websearch results. Running this option makes Seeks slower and more bandwidth demanding than the default behavior. However, the content aware system has more features, such as better aggregation of websearch snippets from multiple search engines, preemptive caching of webpages pointed to by websearch results and accurate automated similarity analysis and clustering of the results. Default: 0

enable-content-analysis 1

activates the analysis of content in real-time.

connection and transfer timeouts

The options below control the connection and transfer timeouts to the search engines, and to other pages (typically for content analysis).

Default: 3

se-connect-timeout 3

connection timeout to search engines, in seconds.

Default: 5

se-transfer-timeout 5

transfer timeout when connecting to a search engine, in seconds.

Default: 1

ct-connect-timeout 1

connection timeout when fetching content for analysis & caching, in seconds.

Default: 3

ct-transfer-timeout 3

transfer timeout when fetching content for analysis & caching, in seconds.

highlighting the most discriminative words

This option is applicable to version 0.2.2-SOLO and above. It enables a more discriminative highlighting of words in result snippets. The highlighted words are those that best distinguish a snippet from all other snippets in the results.

Default: 1

extended-highlight 1

Enables discriminative highlighting.

background proxy setting

Sets a proxy through which to fetch the background URLs Seeks needs, to grab search engine results and content, as required.

Default:

background-proxy-addr your_proxy:your_port

show node's IP on rendered pages

Whether to render the node's IP address in the info bar.

Default: 0

show-node-ip 0

personalization on / off

Personalizes the result ranks based on user data from past searches and proxy usage.

Default: 1

personalized-ranking 1

message in panel

Message to be displayed in a panel next to the search results. Plain text and text with HTML tags are supported.

Default:

result-message Beware, you are using a remote Seeks node

dynamic UI (JSON-based)

Enables the dynamic UI (JSON-based). Do not use this UI if your browser is console-based and / or does not support javascript. Default: 0

dynamic-ui 0

User Interface theme

User Interface theme identifier. 'original' is the historical theme. 'compact' is a button-based theme that leaves more room for results. Custom themes can be designed easily. Default: compact

ui-theme compact

Number of recommendations in results

Maximum number of recommended queries in results. Default: 13

num-recommended-queries 13

Websearch patterns

Seeks supports regexp patterns to either group or eliminate some results.

In the source repository, pattern files are found in
src/plugins/websearch/patterns

The following files exist

audio  file_doc  forum  pdf  qi_patterns  reject  video
  • Files audio, file_doc, forum, pdf & video are used by Seeks to group results automatically by type.
  • File qi_patterns is used by Seeks to intercept queries in proxy mode.
  • File reject is used by Seeks to eliminate some results from queries. This file is empty by default. Adding regexp rules to the reject file makes it possible to filter results by URL, e.g. the author rejects any result from experts-exchange.com.
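
The effect of a reject rule can be sketched in Python as follows. This is only an illustration: the exact pattern syntax accepted by Seeks is not described on this page, so the rule below uses Python's re syntax, and both the rule and the helper functions load_patterns and filter_results are hypothetical. Treating '#' lines as comments is also an assumption.

```python
import re


def load_patterns(text):
    """Compile one regexp per non-empty, non-comment line."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def filter_results(urls, reject_patterns):
    """Keep only the URLs that match none of the reject patterns."""
    return [u for u in urls
            if not any(p.search(u) for p in reject_patterns)]


# Hypothetical content of a reject file, written in Python regexp syntax.
reject_file = r"""
# drop anything hosted on experts-exchange.com or one of its subdomains
^https?://([^/]+\.)?experts-exchange\.com(/|$)
"""

urls = [
    "http://www.experts-exchange.com/questions/1",
    "http://www.seeks-project.info/",
    "http://example.org/experts-exchange.com-review",
]
kept = filter_results(urls, load_patterns(reject_file))
print(kept)
```

Anchoring the rule on the host part keeps pages that merely mention the domain in their path, as the third URL shows.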

Image Websearch Configuration

The configuration file in sources is src/plugins/img_websearch/img-websearch-config

Image websearch engine selection

The option img-search-engine allows the selection of a set of search engines and feeds to get results from. Entries take the form:

img-search-engine <engine> <url> <name> <default | nodefault>
  • 'engine' specifies a feed / search engine parser. Every engine can support up to 10 URLs.
  • 'url' specifies a URL to query from. The URL can contain parameters, such as %query for specifying where the user query should be placed.
  • 'name' specifies the name of the feed, to be called from the API.
  • 'default' indicates that the engine and URL are used by default when no engines or URLs are specified to the API.

Examples:

img-search-engine google_img http://www.google.com/images?q=%query&gbv=1&start=%start&hl=%lang&ie=%encoding&oe=%encoding google_img default
img-search-engine flickr http://www.flickr.com/search/?q=%query&page=%start flickr default

enabling image background content analysis

Enables background download of image thumbnails and their analysis for detecting identical and near-identical images. Expected to be both slower and more bandwidth demanding than when not activated.

Default: 0

img-content-analysis 1

activates the analysis of images in real-time.

number of results per page

Maximum number of image results on a single page.

Default: 60

img-per-page 60

enabling safe search of images

Enables safe search of images (1 for on, 0 for off).

Default: 1

safe-search 1

Collaborative Filter Configuration

The configuration file in sources is src/plugins/cf/cf-config

domain names weight in filter

Weight given to domain names (as opposed to exact URLs). This is a measure of trust in / likelihood of an already visited domain holding good / interesting URLs (with respect to content).

Default: 0.3

domain-name-weight 0.3

records cache timeout

Personalization fetches records, possibly from remote databases. This option sets the timeout on cached remote records, in seconds.

Default: 600

record-cache-timeout 600

list of peers in the collaborative filtering ring

Static list of peers for collaborative filtering. One or more lines of the form: cf-peer address port (sn | bsn | tt)

  • 'sn' for HTTP transport to Seeks node (current default),
  • 'bsn' for HTTP batch transport to Seeks node (soon to be default),
  • 'tt' for serving a peer user database with Tokyo Tyrant.

Default:

cf-peer http://www.seeks.fr bsn
cf-peer http://seeks-project.info/search_exp.php bsn
cf-peer http://seeks-project.info/search.php bsn

Time between checks on peers availability

Time interval in seconds between two checks on dead peers. Default:

dead-peer-check 300

Number of retries before considering a peer is dead

Number of retries before marking a peer as dead. Default:

dead-peer-retries 3
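
How dead-peer-retries and dead-peer-check might interact can be sketched in Python. This is a guess at the general mechanism, not the Seeks implementation: it assumes a peer is marked dead after the configured number of consecutive failed attempts, and that a dead peer becomes due for another check once the check interval has elapsed. The class PeerMonitor and its methods are hypothetical.

```python
import time


class PeerMonitor:
    """Track consecutive failures per peer and schedule re-checks of dead peers."""

    def __init__(self, retries=3, check_interval=300, clock=time.monotonic):
        self.retries = retries                # plays the role of dead-peer-retries
        self.check_interval = check_interval  # plays the role of dead-peer-check
        self.clock = clock
        self.failures = {}    # peer -> consecutive failed attempts
        self.dead_since = {}  # peer -> time of the last failure while dead

    def report(self, peer, ok):
        """Record the outcome of one attempt to reach a peer."""
        if ok:
            self.failures.pop(peer, None)
            self.dead_since.pop(peer, None)
            return
        self.failures[peer] = self.failures.get(peer, 0) + 1
        if self.failures[peer] >= self.retries:
            self.dead_since[peer] = self.clock()

    def is_dead(self, peer):
        return peer in self.dead_since

    def due_for_check(self):
        """Dead peers whose check interval has elapsed."""
        now = self.clock()
        return [p for p, t in self.dead_since.items()
                if now - t >= self.check_interval]


monitor = PeerMonitor(retries=3, check_interval=300)
for _ in range(3):
    monitor.report("http://www.seeks.fr", ok=False)
print(monitor.is_dead("http://www.seeks.fr"))
```

A successful attempt clears both counters, so a peer that comes back is used again immediately.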

URL check on posted content

Posted URL check. Checks whether a posted URL exists, and tries to retrieve the page and its title. Default:

post-url-check 1

Default user agent used for checking on posted URLs

'user-agent' header for checking URLs. When the above option is enabled, and no 'user-agent' header is passed to Seeks, this is the default header to be used. Default:

post-ua Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

Default similarity radius for posted queries

Posted query similarity radius. Defines the similarity radius of the query to which the posted URL is to be attached. Default:

post-radius 5

Query Capture Configuration

This plugin captures queries and clicked URIs in order to feed the user DB with data for filtering and personalization.

The configuration file in sources is src/plugins/query_capture/query-capture-config

Maximum radius for similar query generation

Maximum radius of the generated halo of queries around every original query. Higher radius means more storage and more data from which to make recommendations and ratings. Recommended values are between 0 and 8.

Default: 5

query-max-radius 5
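
Why a larger radius means more storage can be seen with a toy model in Python. Important assumption: this page does not define how the halo is generated, so the sketch assumes, purely for illustration, that a query at radius r is the original query with r words removed. The function query_halo is hypothetical and is not the Seeks algorithm.

```python
from itertools import combinations


def query_halo(query, max_radius):
    """Map each radius to the sub-queries obtained by dropping that many words.

    Assumption for illustration only: radius = number of words removed.
    """
    words = query.split()
    halo = {}
    for radius in range(max_radius + 1):
        keep = len(words) - radius
        if keep < 1:
            break  # nothing left to keep beyond this radius
        halo[radius] = [" ".join(c) for c in combinations(words, keep)]
    return halo


halo = query_halo("seeks websearch proxy", 5)
for radius, queries in halo.items():
    print(radius, queries)
```

Under this model the number of stored variants grows combinatorially with the length of the query, which is consistent with the advice above to keep the radius small.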

Result clicks interception mode

Two modes:

  • 'redirect' is the default, and captures clicks on search results by pointing the clicked URL through the Seeks node first, and then redirecting to the URL.
  • 'capture' is detection of clicks from search results by the proxy itself. It is slightly heavier, but avoids the redirection.

Default: redirect

mode-intercept redirect

Protection against abusive redirections

Protection against abusive use of the URL redirection scheme. The 'redirect' mode-intercept for query capture uses a URL redirection scheme, similar to a proxy. On public nodes this scheme can be abused to hit the nodes with redirection calls for URLs that do not appear among search results. This option activates a minimal protection.

Default: 0

protected-redirection 0

Interval between two sweeps of stored query in user DB

Sets the interval of time in seconds between two sweeps of old query records in the user DB. It is recommended to set sweeps every few days, weeks or months, not every second, as every sweep requires a traversal of the full DB.

Default: 2592000 # one month.

query-sweep-cycle 2592000

Retention of stored queries, in seconds

Sets the retention period of query records, in seconds.

Default: 31104000 # one year

query-retention 31104000

URI capture configuration

The configuration file in sources is src/plugins/uri_capture/uri-capture-config

Interval between two sweeps of stored URIs in user DB

Sets the interval of time in seconds between two sweeps of old URI records in the user DB. It is recommended to set sweeps every few days, weeks or months, not every second, as every sweep requires a traversal of the full DB.

Default: 2592000 # one month.

uc-sweep-cycle 2592000

Retention of URIs in seconds

Sets the retention period of URI records, in seconds.

Default: 15552000 # six months.

uc-retention 15552000
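
The default values on this page are easier to read once converted to days. The short Python sketch below does the arithmetic and shows what a retention sweep amounts to in principle. The helper names days and sweep, and the record layout (a key mapped to a timestamp in seconds), are assumptions made for this example and do not describe the Seeks user DB.

```python
DAY = 24 * 60 * 60  # 86400 seconds


def days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / DAY


def sweep(records, now, retention):
    """Keep the records whose age does not exceed the retention period."""
    return {key: ts for key, ts in records.items() if now - ts <= retention}


print(days(2592000))   # sweep cycle: 30 days, i.e. "one month"
print(days(15552000))  # URI retention: 180 days, i.e. "six months"
print(days(31104000))  # query retention: 360 days, i.e. "one year"

records = {"old-uri": 0, "recent-uri": 100 * DAY}
print(sweep(records, now=200 * DAY, retention=15552000))
```

The "month" and "year" on this page are therefore 30-day months and a 360-day year.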