Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, "N-Gram-Based
Text Categorization". It was primarily developed for language
guessing, a task on which it is known to perform with near-perfect
accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this
with the fingerprints of a number of documents of which the categories
are known. The categories of the closest matches are output as the
classification. A fingerprint is a list of the most frequent n-grams
occurring in a document, ordered by frequency. Fingerprints are
compared with a simple out-of-place metric. See the article for more
details.
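
The technique described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not Libtextcat's
code or API: the function names (fingerprint, out_of_place, classify),
the n-gram lengths of 1 to 5, the cutoff of 400 n-grams, and the
penalty for missing n-grams are all assumptions made for this example.

```python
import re
from collections import Counter


def fingerprint(text, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams, most frequent first.

    max_n and top_k are illustrative defaults, not values taken from Libtextcat.
    """
    counts = Counter()
    # Keep runs of letters only and pad each word with underscores so that
    # word-initial and word-final n-grams are distinguishable.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, cat_fp, max_penalty=400):
    """Sum, over the document's n-grams, how far each rank is from its rank
    in the category fingerprint; n-grams missing there cost max_penalty."""
    cat_rank = {gram: rank for rank, gram in enumerate(cat_fp)}
    total = 0
    for rank, gram in enumerate(doc_fp):
        if gram in cat_rank:
            total += abs(rank - cat_rank[gram])
        else:
            total += max_penalty
    return total


def classify(text, category_fps):
    """Return the category whose fingerprint is closest to that of text."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))


if __name__ == "__main__":
    # Toy training samples; real fingerprints are built from much larger texts.
    profiles = {
        "english": fingerprint("the cat sat on the mat and looked at the dog"),
        "dutch": fingerprint("de kat zat op de mat en keek naar de hond"),
    }
    print(classify("the dog looked at the cat", profiles))
```

A lower out-of-place score means a closer match, so classify simply
picks the category with the smallest score. With training samples this
small the guess is not reliable; the sketch is meant to show the shape
of the computation only.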
Considerable effort went into making this implementation fast and
efficient. The language guesser processes over 100 documents/second on
a simple PC, which makes it practical for many uses. It was developed
for use in our webcrawler and search engine software, in which it
handles millions of documents a day.