User:Samuel2252/sandbox

This is a draft only; it is not finished.

= Norconex =

Norconex is a North American information technology company specializing in Enterprise Search professional services and software development (both commercial and open-source).

The company was founded in 2007 by Pascal Essiembre, its president. Norconex is headquartered in Gatineau, Quebec, Canada.

Overview
Norconex positions itself as search technology-independent, working with various commercial search technologies and vendors as well as open-source search solutions. Being based in Canada's National Capital Region helped Norconex grow by establishing a strong presence in the Canadian federal sector. Norconex customers also come from various commercial sectors, including manufacturing, publishing, health, technology, and legal.

Open-source project
Norconex is very active in the open-source community and is involved with multiple open-source projects.

In 2013, Norconex released its first contribution to the open-source community with its HTTP web crawler, the Norconex HTTP Collector. Since then, the company has released multiple open-source components, including its crawlers (the Norconex HTTP Collector and the Norconex Filesystem Collector), its Importer module, and multiple Committers compatible with well-known search engines and repositories for storing documents.

Services

 * Enterprise Search Consulting
 * Enterprise Search Design
 * Big Data
 * Taxonomy Editing
 * Data Analysis

Features

 * Multi-threaded.
 * Supports full and incremental crawls.
 * Supports different hit intervals according to different schedules.
 * Can crawl millions of documents on a single server of average capacity.
 * Extract text out of many file formats (HTML, PDF, Word, etc.)
 * Extract metadata associated with documents.
 * Supports pages rendered with JavaScript.
 * Language detection.
 * Many content and metadata manipulation options.
 * OCR support on images and PDFs.
 * Page screenshots.
 * Extract page "featured" image.
 * Translation support.
 * Dynamic title generation.
 * Configurable crawling speed.
 * URL normalization.
 * Detects modified and deleted documents.
 * Supports different frequencies for re-crawling certain pages.
 * Supports various web site authentication schemes.
 * Supports sitemap.xml (including "lastmod" and "changefreq").
 * Supports robot rules.
 * Supports canonical URLs.
 * Can filter documents based on URL, HTTP headers, content, or metadata.
 * Can treat embedded documents as distinct documents.
 * Can split a document into multiple documents.
 * Can store crawled URLs in different database engines.
 * Can re-process or delete URLs no longer linked by other crawled pages.
 * Supports different URL extraction strategies for different content types.
 * Fires more than 20 crawler event types for custom event listeners.
 * Date parsers/formatters to match your source/target repository dates.
 * Can create hierarchical fields.
 * Supports scripting languages for manipulating documents.
 * Reference XML/HTML elements using simple DOM tree navigation.
 * Supports external commands to parse or manipulate documents.
 * Supports crawling with your favorite browser (using WebDriver).
 * Supports "If-Modified-Since" for more efficient crawling.
 * Follow URLs from HTML or any other document format.
 * Can detect and report broken links.
 * Can send crawled content to multiple target repositories at once.
 * Many others.

Architecture
Norconex HTTP Collector was built entirely using Java. A single Collector installation is responsible for launching one or multiple crawler threads, each with their own configuration.

Each step of the crawler life-cycle is configurable and overridable. Developers can provide their own interface implementations for most steps undertaken by the crawler. The default implementations cover a vast array of crawling use cases and are built on stable products such as Apache Tika and Apache Derby. The following figure is a high-level representation of a URL life cycle from the crawler's perspective.

The Importer and Committer modules are separate Apache-licensed Java libraries distributed with the Collector.

The Importer module parses incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or simply filter documents based on their new format. While the Collector is heavily dependent on the Importer module, the latter can be used on its own as a general-purpose document parser.

The Committer module is responsible for directing the parsed data to a target repository of choice. Developers can write custom implementations, allowing the Norconex HTTP Collector to be used with any search engine or repository. Multiple Committer implementations already exist for well-known search engines and repositories.

Minimum Configuration
While the Norconex HTTP Collector can be configured programmatically, it also supports XML configuration files. Apache Velocity is used to parse configuration files; using Velocity directives permits configuration re-use amongst different Collector installations, as well as variable substitution.
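As an illustration of variable substitution (the file name, variable names, and identifiers below are hypothetical examples, not taken from an official distribution), a configuration template can reference Velocity variables that are resolved at load time, letting several installations share one template:

```xml
<!-- shared-crawler.xml: a hypothetical template re-used by several installations. -->
<!-- ${workDir} and ${startUrl} are Velocity variables, typically supplied by a   -->
<!-- separate variables/properties file placed alongside the configuration.       -->
<httpcollector id="My Collector">
  <progressDir>${workDir}/progress</progressDir>
  <logsDir>${workDir}/logs</logsDir>
  <crawlers>
    <crawler id="My Crawler">
      <startURLs>
        <url>${startUrl}</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```

Each environment can then provide its own values for `workDir` and `startUrl` without duplicating the rest of the configuration.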

The following code is the minimum XML configuration for the current 2.x version. See the documentation for more complex configurations.
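A minimal configuration along these lines (a sketch only; the identifiers and start URL are placeholders, not values from the official documentation) would look like:

```xml
<httpcollector id="Minimum Config HTTP Collector">
  <crawlers>
    <crawler id="Sample Crawler">
      <startURLs>
        <url>http://www.example.com/</url>
      </startURLs>
      <!-- Limit depth so a quick test does not crawl the entire site. -->
      <maxDepth>0</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```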