Seeks Configuration
Seeks has several configurable elements:
- the proxy,
- the websearch plugin,
- the image websearch plugin,
- the personalization plugin,
- the query capture plugin,
- the URI capture plugin.
Proxy configuration
The proxy is derived from Privoxy, so its configuration file looks similar.
Additional options include:
plugin directory (2.5)
Specifies the directory in which Seeks looks for compiled plugins, which it then loads dynamically. Default value: Unset. The directory is either forced through the command line or automatically set to the base directory.
plugindir /usr/local/lib/seeks
plugin activation (2.6)
Specifies the dynamically loaded plugins that are activated. Default: Unset. All loaded plugins are activated. 'httpserv' plugin is not loaded by default.
activated-plugin websearch
activated-plugin blocker
activated-plugin img_websearch
activated-plugin uri-capture
activated-plugin query-capture
activated-plugin cf
#activated-plugin httpserv
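To illustrate how directives such as these can be read, here is a minimal Python sketch. It assumes the Privoxy-style layout suggested above: one "directive value" pair per line, with '#' marking a commented-out line. The function name parse_config is hypothetical and this is not Seeks's own parser.

```python
from collections import defaultdict


def parse_config(text):
    """Collect 'directive value' lines; lines starting with '#' are ignored.

    Assumption: one directive per line, Privoxy-style. Repeated directives
    (e.g. activated-plugin) are gathered into a list.
    """
    options = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out directives
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        options[key].append(value)
    return options


sample = """\
plugindir /usr/local/lib/seeks
activated-plugin websearch
activated-plugin blocker
#activated-plugin httpserv
automatic-proxy-disable 1
"""

config = parse_config(sample)
print(config["activated-plugin"])
```

In this sketch the commented-out httpserv line is skipped, so only websearch and blocker are listed as activated.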
automatic disabling of proxy (2.8)
Controls whether the proxy is disabled automatically while the HTTP server plugin is running. Value is 0 or 1. Default: 1, the proxy is automatically disabled when the HTTP server plugin is running.
automatic-proxy-disable 1
user database file (2.9)
The user database contains user data such as issued queries and clicked URLs for personalization. This option specifies where and in which file the user db should be stored.
Default: Unset. The db is then stored in the default location, $HOME/.seeks/seeks_user.db.
user-db-file /path/to/file.db
user database remote address (2.10)
Seeks provides the ability to connect to a remote database to store data and reuse data (e.g. for web search personalization).
Default: Unset. Seeks then uses a local file as its database (see 2.9).
user-db-address 127.0.0.1:9000
user database startup check (2.11)
Whether to check the user db at startup for removing old records. On nodes with large amounts of content this makes startup slow, so this option allows one to deactivate the check at startup.
Default: 1. Startup check is activated.
user-db-startup-check 1
user database optimization (2.12)
Whether to optimize the user db at startup and shutdown. Optimization can take a long time on very large databases and thus be detrimental in certain cases. Typically, optimization makes restarting a node with a large user database take several minutes.
Default: 1. Database optimization at startup.
user-db-optimize 1
user database support for large file (2.13)
Whether to activate support for large user database files (larger than 2 GB).
Default: 0. No support for large db file.
user-db-large 0
URL pointing to the source code (2.14)
AGPLv3 requires a pointer to the source code to be given to every user of the program. This option sets the URL where the source code of the Seeks node can be read.
Default: http://seeks.git.sourceforge.net/git/gitweb.cgi?p=seeks/seeks;a=tree
url-source-code url/to/source_code
Websearch plugin configuration
The websearch plugin configuration file is
src/plugins/websearch/websearch-config
All the following configuration options and their default values are to be found in this file.
websearch language
The option search-language defines the websearch language of preference.
Default: auto (the language is detected automatically, based on HTTP headers)
search-language fr
Sets the language to French.
number of results per page
Maximum number of websearch results on a single page. Default: 10
search-results-page 10
websearch engine selection
The option search-engine allows the selection of a set of search engines and feeds to get results from.
Entries take the form:
search-engine <engine> <url> <name> <default | nodefault>
- 'engine' specifies a feed / search engine parser. Every engine can support up to 10 URLs.
- 'url' specifies a URL to query from. The URL can contain parameters, such as %query for specifying where the user query should be placed.
- 'name' specifies the name of the feed, to be called from the API.
- 'default' indicates that this engine and URL are used when no engine or URL is specified to the API; 'nodefault' means they are used only when requested explicitly.
Examples:
search-engine google http://www.google.com/search?q=%query&start=%start&num=%num&hl=%lang&ie=%encoding&oe=%encoding google default
search-engine twitter http://search.twitter.com/search.atom?q=%query&page=%start&rpp=%num twitter nodefault
search-engine twitter http://identi.ca/api/search.atom?q=%query&page=%start&rpp=%num identica nodefault
search-engine opensearch_rss http://plone.org/search_rss?SearchableText=%query plosone nodefault
Then to access the engines through the API:
&engines=twitter:twitter
&engines=twitter:identica
&engines=twitter
The latter returns both twitter and identica feeds.
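As an illustration of the entry format, the following Python sketch splits a search-engine line into its four fields and fills in the URL placeholders. The helper names parse_engine and expand are hypothetical, and the URL-encoding of values with quote_plus is an assumption; how Seeks itself substitutes and encodes %query, %start and %num is not described here.

```python
from urllib.parse import quote_plus


def parse_engine(line):
    """Split 'search-engine <engine> <url> <name> <default|nodefault>'."""
    directive, engine, url, name, flag = line.split()
    if directive != "search-engine" or flag not in ("default", "nodefault"):
        raise ValueError("not a valid search-engine entry: " + line)
    return {"engine": engine, "url": url, "name": name,
            "default": flag == "default"}


def expand(url, params):
    """Replace %name placeholders with URL-encoded values (assumed encoding)."""
    # Longest names first, so one placeholder can never clobber another.
    for key in sorted(params, key=len, reverse=True):
        url = url.replace("%" + key, quote_plus(str(params[key])))
    return url


entry = parse_engine(
    "search-engine twitter "
    "http://search.twitter.com/search.atom?q=%query&page=%start&rpp=%num "
    "twitter nodefault"
)
query_url = expand(entry["url"], {"query": "free software", "start": 1, "num": 10})
print(entry["name"], entry["default"])
print(query_url)
```

Under these assumptions the printed URL is http://search.twitter.com/search.atom?q=free+software&page=1&rpp=10.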
websearch cache expiration (seconds)
Minimum number of seconds that unused search results are kept in the system cache for reuse, update, etc. The cache is per query, and its expiration timer is reset every time a live query is accessed. Default: 300
query-context-delay 300
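The expiration behaviour described above can be sketched in Python as follows. This is an illustrative model only, not Seeks's implementation: the class name QueryCache, its methods and the injectable clock are all hypothetical.

```python
import time


class QueryCache:
    """Per-query cache whose expiry timer restarts on every access (sketch)."""

    def __init__(self, delay=300, clock=time.monotonic):
        self.delay = delay      # plays the role of query-context-delay
        self.clock = clock      # injectable, so the sketch can be tested
        self._entries = {}      # query -> (results, time of last access)

    def put(self, query, results):
        self._entries[query] = (results, self.clock())

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        results, last_access = entry
        if self.clock() - last_access >= self.delay:
            del self._entries[query]    # unused for too long: drop it
            return None
        # The query is still live: accessing it resets the timer.
        self._entries[query] = (results, self.clock())
        return results


now = [0.0]
cache = QueryCache(delay=300, clock=lambda: now[0])
cache.put("seeks", ["result 1", "result 2"])
now[0] = 200.0
first = cache.get("seeks")    # within the delay, timer restarts at t=200
now[0] = 450.0
second = cache.get("seeks")   # 250 s since last access, still cached
now[0] = 800.0
third = cache.get("seeks")    # 350 s since last access, expired
```

The point of the sketch is that expiry is measured from the last access, not from the time the results were first stored.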
enabling thumbnails
The following option enables the insertion of thumbnails from http://www.thumbshots.com, for websearch result URLs. Default: 0
enable-thumbs 1
enables them.
enabling javascript
Enabling javascript on the websearch results pages enables keyboard shortcuts, and will allow all sorts of dynamic processing in the future. Default: 0
enable-js 1
enables javascript.
enabling background content analysis
This option enables the background download of the content pointed to by websearch results. Running with this option makes Seeks slower and more bandwidth-demanding than the default behavior. In exchange, the content-aware system offers more features, such as better aggregation of websearch snippets from multiple search engines, preemptive caching of the webpages pointed to by websearch results, and accurate automated similarity analysis and clustering of the results. Default: 0
enable-content-analysis 1
activates the analysis of content in real-time.
connection and transfer timeouts
The options below control the connection and transfer timeouts to the search engines, and to other pages (typically for content analysis).
Default: 3
se-connect-timeout 3
connection timeout to search engines, in seconds.
Default: 5
se-transfer-timeout 5
transfer timeout when connecting to a search engine, in seconds.
Default: 1
ct-connect-timeout 1
connection timeout when fetching content for analysis & caching, in seconds.
Default: 3
ct-transfer-timeout 3
transfer timeout when fetching content for analysis & caching, in seconds.
highlighting the most discriminative words
This option is applicable to version 0.2.2-SOLO and above. It enables a more discriminative highlighting of words in result snippets. The highlighted words are those that best distinguish a snippet from all other snippets in the results.
Default: 1
extended-highlight 1
Enables discriminative highlighting.
background proxy setting
Sets a proxy through which to fetch the background URLs Seeks needs, to grab search engine results and content, as required.
Default:
background-proxy-addr your_proxy:your_port
show node's IP on rendered pages
Renders the node IP address in the info bar, or not.
Default: 0
show-node-ip 0
personalization on / off
Personalizes the result ranks based on user data from past searches and proxy usage.
Default: 1
personalized-ranking 1
message in panel
Message to be viewed in a panel next to the search results. Supported are plain text and text with html tags.
Default:
result-message Beware, you are using a remote Seeks node
dynamic UI (JSON-based)
Enables the dynamic ui (JSON-based). Do not use this UI if your browser is console-based and / or does not support javascript. default: 0
dynamic-ui 0
User Interface theme
User Interface theme identifier. 'original' is the historical theme. 'compact' is a button-based theme that leaves more room for results. Custom themes can be designed easily. default: compact
ui-theme compact
Number of recommendations in results
Maximum number of recommended queries in results. Default: 13
num-recommended-queries 13
Websearch patterns
Seeks supports regexp patterns to either regroup or eliminate some results.
In the source repository, pattern files are found in src/plugins/websearch/patterns
The following files exist:
audio file_doc forum pdf qi_patterns reject video
- Files audio, file_doc, forum, pdf & video are used by Seeks to group results automatically by type.
- File qi_patterns is used by Seeks to intercept queries in proxy mode.
- File reject is used by Seeks to eliminate some results from queries. This file is empty by default. Adding regexp rules to the reject file makes it possible to filter results by URL, e.g. the author rejects any result from experts-exchange.com.
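The effect of reject rules can be sketched in Python as follows. The file syntax assumed here (one regular expression per line, '#' for comments) is a guess made for illustration, and Python's re module may differ from the regexp dialect Seeks uses; the function names are hypothetical.

```python
import re


def load_patterns(text):
    """Compile one regexp per non-empty, non-comment line (assumed syntax)."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def filter_results(urls, patterns):
    """Keep only the URLs that match none of the reject patterns."""
    return [url for url in urls
            if not any(p.search(url) for p in patterns)]


# Hypothetical content for the reject file.
reject_rules = r"""
# drop every result hosted on this domain
experts-exchange\.com
"""

urls = [
    "http://www.experts-exchange.com/Q_123.html",
    "http://www.seeks-project.info/",
]
kept = filter_results(urls, load_patterns(reject_rules))
print(kept)
```

With the single rule above, only the seeks-project.info URL remains in the result list.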
Image Websearch Configuration
The configuration file in sources is src/plugins/img_websearch/img-websearch-config
Image websearch engine selection
The option img-search-engine allows the selection of a set of image search engines and feeds to get results from.
Entries take the form:
img-search-engine <engine> <url> <name> <default | nodefault>
- 'engine' specifies a feed / search engine parser. Every engine can support up to 10 URLs.
- 'url' specifies a URL to query from. The URL can contain parameters, such as %query for specifying where the user query should be placed.
- 'name' specifies the name of the feed, to be called from the API.
- 'default' indicates that the engine and url are default engines when no engine or URLs are specified to the API.
Examples:
img-search-engine google_img http://www.google.com/images?q=%query&gbv=1&start=%start&hl=%lang&ie=%encoding&oe=%encoding google_img default
img-search-engine flickr http://www.flickr.com/search/?q=%query&page=%start flickr default
enabling image background content analysis
Enables background download of image thumbnails and their analysis for detecting identical and near-identical images. Expected to be both slower and more bandwidth-demanding than when not activated.
Default: 0
img-content-analysis 1
activates the analysis of images in real-time.
number of results per page
Maximum number of image results on a single page.
Default: 60
img-per-page 60
enabling safe search of images
Enables safe search of images (1 for on, 0 for off).
Default: 1
safe-search 1
Collaborative Filter Configuration
The configuration file in sources is src/plugins/cf/cf-config
domain names weight in filter
Weight given to domain names (as opposed to exact URLs). This is a measure of trust: how likely an already visited domain is to hold other good or interesting URLs (with respect to content).
Default: 0.3
domain-name-weight 0.3
records cache timeout
Personalization fetches records, possibly from remote databases. Timeout on cached remote records, in seconds.
Default: 600
record-cache-timeout 600
list of peers in the collaborative filtering ring
Static list of peers for collaborative filtering. One or more lines, of the form: cf-peer address port (sn | bsn | tt)
- 'sn' for HTTP transport to Seeks node (current default),
- 'bsn' for HTTP batch transport to Seeks node (soon to be default),
- 'tt' for serving a peer user database with Tokyo Tyrant.
Default:
cf-peer http://www.seeks.fr bsn
cf-peer http://seeks-project.info/search_exp.php bsn
cf-peer http://seeks-project.info/search.php bsn
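A small Python sketch of how such peer lines could be read is given below. The stated form includes a port, while the default entries above give only a URL and a transport, so the sketch accepts both shapes; this tolerance is an assumption, as are the function name parse_cf_peer and the second sample line (a hypothetical local Tokyo Tyrant peer).

```python
TRANSPORTS = ("sn", "bsn", "tt")


def parse_cf_peer(line):
    """Parse 'cf-peer address [port] transport' (port treated as optional)."""
    tokens = line.split()
    if len(tokens) not in (3, 4) or tokens[0] != "cf-peer":
        raise ValueError("not a cf-peer entry: " + line)
    transport = tokens[-1]
    if transport not in TRANSPORTS:
        raise ValueError("unknown transport: " + transport)
    port = int(tokens[2]) if len(tokens) == 4 else None
    return {"address": tokens[1], "port": port, "transport": transport}


http_peer = parse_cf_peer("cf-peer http://www.seeks.fr bsn")
# Hypothetical entry: address and port are made up for the example.
tt_peer = parse_cf_peer("cf-peer 127.0.0.1 1978 tt")
print(http_peer)
print(tt_peer)
```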
Time between checks on peers availability
Time interval in seconds between two checks on dead peers. Default:
dead-peer-check 300
Number of retries before considering a peer is dead
Number of retries before marking a peer as dead. Default:
dead-peer-retries 3
URL check on posted content
Posted URL check. Checks whether a posted URL exists, and tries to retrieve the page and its title. Default:
post-url-check 1
Default user agent used for checking on posted URLs
'user-agent' header used when checking a posted URL. When the above option is enabled and no 'user-agent' header is passed to Seeks, this is the header used by default. Default:
post-ua Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Default similarity radius for posted queries
Posted query similarity radius. Defines the similarity radius of the query to which the posted URL is to be attached. Default:
post-radius 5
Query Capture Configuration
This plugin captures queries and clicked URIs in order to feed the user DB with data for filtering and personalization.
The configuration file in sources is src/plugins/query_capture/query-capture-config
Maximum radius for similar query generation
Maximum radius of the generated halo of queries around every original query. Higher radius means more storage and more data from which to make recommendations and ratings. Recommended values are between 0 and 8.
Default: 5
query-max-radius 5
Result clicks interception mode
Two modes:
- 'redirect' is the default. It captures clicks on search results by routing the clicked URL through the Seeks node first, which then redirects to the URL.
- 'capture' detects clicks on search results in the proxy itself. It is slightly heavier, but avoids the redirection.
Default: redirect
mode-intercept redirect
Protection against abusive redirections
Protection against abusive use of the URL redirection scheme. The 'redirect' value of mode-intercept for query capture uses a URL redirection scheme, similar to a proxy. On public nodes this scheme can be abused to hit the nodes with redirection calls for URLs that do not appear among search results. This option activates a minimal protection.
Default: 0
protected-redirection 0
Interval between two sweeps of stored query in user DB
Sets the interval of time in seconds between two sweeps of old query records in the user DB. It is recommended to schedule sweeps every few days, weeks or months, not every few seconds, as every sweep requires a traversal of the full DB.
Default: 2592000 # one month.
query-sweep-cycle 2592000
Retention of stored queries, in seconds
Sets the retention of records, in seconds.
Default: 31104000 # one year
uc-retention 31104000
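To make the two values concrete, the Python sketch below converts them to days and shows what a sweep amounts to: every sweep cycle, records older than the retention period are dropped. The in-memory dictionary of timestamps and the function name sweep are hypothetical and stand in for the real user DB traversal.

```python
DAY = 24 * 60 * 60            # seconds in one day

QUERY_SWEEP_CYCLE = 2592000   # default sweep interval from above
QUERY_RETENTION = 31104000    # default retention from above

sweep_days = QUERY_SWEEP_CYCLE // DAY       # 30 days, about one month
retention_days = QUERY_RETENTION // DAY     # 360 days, about one year


def sweep(records, now, retention=QUERY_RETENTION):
    """Return only the records whose age does not exceed the retention.

    records maps a record key to its creation time in seconds (hypothetical
    in-memory stand-in for the user DB).
    """
    return {key: stamp for key, stamp in records.items()
            if now - stamp <= retention}


now = 40000000
records = {
    "old query": now - 400 * DAY,     # older than the retention period
    "recent query": now - 10 * DAY,   # well within the retention period
}
remaining = sweep(records, now)
print(sweep_days, retention_days, sorted(remaining))
```

Since a sweep visits every record, its cost grows with the size of the DB, which is why long sweep cycles are recommended.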
URI capture configuration
The configuration file in sources is src/plugins/uri_capture/uri-capture-config
Interval between two sweeps of stored URIs in user DB
Sets the interval of time in seconds between two sweeps of old URI records in the user DB. It is recommended to schedule sweeps every few days, weeks or months, not every few seconds, as every sweep requires a traversal of the full DB.
Default: 2592000 # one month.
uc-sweep-cycle 2592000
Retention of URIs in seconds
Sets the retention of records, in seconds.
Default: 15552000 # six months.
uc-retention 15552000
