Lucene++ - a full-featured, c++ search engine
API Documentation
This is the Porter stemming algorithm, coded up as thread-safe ANSI C by the author.
#include <PorterStemmer.h>
Public Member Functions | |
PorterStemmer () | |
virtual | ~PorterStemmer () |
virtual String | getClassName () |
boost::shared_ptr< PorterStemmer > | shared_from_this () |
bool | stem (CharArray word) |
bool | stem (wchar_t *b, int32_t k) |
In stem(b, k), b is a wchar_t pointer, and the string to be stemmed runs from b[0] to b[k] inclusive. b[k+1] may be '\0', but this is not required. The stemmer adjusts the characters b[0] ... b[k] and stores the new end-point of the string, k'. Stemming never increases word length, so 0 <= k' <= k. | |
wchar_t * | getResultBuffer () |
int32_t | getResultLength () |
Public Member Functions inherited from Lucene::LuceneObject | |
virtual | ~LuceneObject () |
virtual void | initialize () |
Called directly after instantiation to create objects that depend on this object being fully constructed. | |
virtual LuceneObjectPtr | clone (const LuceneObjectPtr &other=LuceneObjectPtr()) |
Return clone of this object. | |
virtual int32_t | hashCode () |
Return hash code for this object. | |
virtual bool | equals (const LuceneObjectPtr &other) |
Return whether two objects are equal. | |
virtual int32_t | compareTo (const LuceneObjectPtr &other) |
Compare two objects. | |
virtual String | toString () |
Returns a string representation of the object. | |
Public Member Functions inherited from Lucene::LuceneSync | |
virtual | ~LuceneSync () |
virtual SynchronizePtr | getSync () |
Return this object's synchronization lock. | |
virtual LuceneSignalPtr | getSignal () |
Return this object's signal. | |
virtual void | lock (int32_t timeout=0) |
Lock this object using an optional timeout. | |
virtual void | unlock () |
Unlock this object. | |
virtual bool | holdsLock () |
Returns true if this object is currently locked by the current thread. | |
virtual void | wait (int32_t timeout=0) |
Wait for signal using an optional timeout. | |
virtual void | notifyAll () |
Notify all threads waiting for signal. | |
Static Public Member Functions | |
static String | _getClassName () |
Protected Member Functions | |
bool | cons (int32_t i) |
Returns true if b[i] is a consonant. ('b' means 'z->b', but here and below we drop 'z->' in comments.) | |
int32_t | m () |
Measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence, then <c><v> gives 0, <c>vc<v> gives 1, <c>vcvc<v> gives 2, and so on. | |
bool | vowelinstem () |
Return true if 0,...j contains a vowel. | |
bool | doublec (int32_t j) |
Return true if j,(j-1) contain a double consonant. | |
bool | cvc (int32_t i) |
Return true if i-2, i-1, i has the form consonant - vowel - consonant and also if the second c is not w, x or y. This is used when trying to restore an e at the end of a short word. | |
bool | ends (const wchar_t *s) |
Returns true if 0,...k ends with the string s. | |
void | setto (const wchar_t *s) |
Sets (j+1),...k to the characters in the string s, readjusting k. | |
void | r (const wchar_t *s) |
void | step1ab () |
step1ab() gets rid of plurals and -ed or -ing. | |
void | step1c () |
Turns terminal y to i when there is another vowel in the stem. | |
void | step2 () |
Maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0. | |
void | step3 () |
Deals with -ic-, -full, -ness, etc., using a similar strategy to step2. | |
void | step4 () |
Takes off -ant, -ence etc., in context <c>vcvc<v>. | |
void | step5 () |
Removes a final -e if m() > 1, and changes -ll to -l if m() > 1. | |
Protected Member Functions inherited from Lucene::LuceneObject | |
LuceneObject () | |
Protected Attributes | |
wchar_t * | b |
int32_t | k |
int32_t | j |
int32_t | i |
bool | dirty |
Protected Attributes inherited from Lucene::LuceneSync | |
SynchronizePtr | objectLock |
LuceneSignalPtr | objectSignal |
This is the Porter stemming algorithm, coded up as thread-safe ANSI C by the author.
It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked DEPARTURE.
See also http://www.tartarus.org/~martin/PorterStemmer
The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!
Release 2 (the more old-fashioned, non-thread-safe version may be regarded as release 1.)
Lucene::PorterStemmer::PorterStemmer | ( | ) |
virtual Lucene::PorterStemmer::~PorterStemmer | ( | ) |
virtual |
static String Lucene::PorterStemmer::_getClassName | ( | ) |
inline static |
bool Lucene::PorterStemmer::cons | ( | int32_t | i | ) |
protected |
Returns true if b[i] is a consonant. ('b' means 'z->b', but here and below we drop 'z->' in comments.)
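The description above does not spell out how the letter y is treated. In Porter's reference implementation, y counts as a consonant at the start of a word and otherwise only when the preceding letter is a vowel. The following is a minimal sketch of that rule, assuming Lucene++ follows the reference code here; isConsonant is a hypothetical free function that works on a std::wstring rather than the member buffer b.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for cons(i): true if w[i] is a consonant.
bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u':
            return false;
        case L'y':
            // 'y' is a consonant at position 0, otherwise only when the
            // previous letter is a vowel.
            return i == 0 || !isConsonant(w, i - 1);
        default:
            return true;
    }
}
```

Under this rule the y in "toy" is a consonant, while the first y in "syzygy" is a vowel.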
bool Lucene::PorterStemmer::cvc | ( | int32_t | i | ) |
protected |
Return true if i-2, i-1, i has the form consonant - vowel - consonant and also if the second c is not w, x or y. This is used when trying to restore an e at the end of a short word.
E.g. cav(e), lov(e), hop(e), crim(e), but snow, box, tray.
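The consonant - vowel - consonant test can be sketched as a standalone function. This is an illustration only, not the Lucene++ source: endsCvc and isConsonant are hypothetical names, and the treatment of y in isConsonant is assumed from Porter's reference implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for cons(i); the 'y' rule is an assumption taken
// from Porter's reference code.
bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u': return false;
        case L'y': return i == 0 || !isConsonant(w, i - 1);
        default: return true;
    }
}

// Hypothetical stand-in for cvc(i): w[i-2], w[i-1], w[i] must be
// consonant, vowel, consonant, and w[i] must not be w, x or y.
bool endsCvc(const std::wstring& w, int i) {
    if (i < 2 || i >= static_cast<int>(w.size())) return false;
    if (!isConsonant(w, i) || isConsonant(w, i - 1) || !isConsonant(w, i - 2)) {
        return false;
    }
    return w[i] != L'w' && w[i] != L'x' && w[i] != L'y';
}
```

With this sketch, the stems cav, lov, hop and crim pass the test, whereas snow, box and tray are rejected because of their final letter.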
bool Lucene::PorterStemmer::doublec | ( | int32_t | j | ) |
protected |
Return true if j,(j-1) contain a double consonant.
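A double consonant means the same consonant appearing at positions j-1 and j. A minimal sketch, with hypothetical names (doubleConsonant, isConsonant) and the y rule assumed from Porter's reference implementation:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for cons(i).
bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u': return false;
        case L'y': return i == 0 || !isConsonant(w, i - 1);
        default: return true;
    }
}

// Hypothetical stand-in for doublec(j): true if w[j-1] and w[j] are the
// same letter and that letter is a consonant.
bool doubleConsonant(const std::wstring& w, int j) {
    if (j < 1 || j >= static_cast<int>(w.size())) return false;
    if (w[j] != w[j - 1]) return false;
    return isConsonant(w, j);
}
```

For example, the tt in "matt" qualifies, but the ee in "mee" does not, since e is a vowel.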
bool Lucene::PorterStemmer::ends | ( | const wchar_t * | s | ) |
protected |
Returns true if 0,...k ends with the string s.
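The suffix test can be sketched on a std::wstring as follows. Note one assumption that is not stated above: in Porter's reference implementation a successful match also sets j to the index of the last character before the suffix, and this sketch mirrors that through an output parameter. The name endsWith and its signature are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ends(s): true if w[0..k] ends with s. On
// success, j is set to the index just before the suffix (assumed from
// Porter's reference code); on failure, j is left unchanged.
bool endsWith(const std::wstring& w, int k, const std::wstring& s, int& j) {
    assert(k >= 0 && k < static_cast<int>(w.size()));
    const int len = static_cast<int>(s.size());
    if (len > k + 1) return false;  // suffix longer than the word
    if (w.compare(k - len + 1, len, s) != 0) return false;
    j = k - len;
    return true;
}
```

For "caresses" with k = 7 and s = "sses", the match succeeds and j becomes 3, so the stem before the suffix is "care".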
virtual String Lucene::PorterStemmer::getClassName | ( | ) |
inline virtual |
wchar_t * Lucene::PorterStemmer::getResultBuffer | ( | ) |
int32_t Lucene::PorterStemmer::getResultLength | ( | ) |
int32_t Lucene::PorterStemmer::m | ( | ) |
protected |
Measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:
<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3 ...
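Equivalently, the measure is the number of times a vowel sequence is followed by a consonant sequence. The sketch below counts vowel-to-consonant transitions over a whole string rather than over the range 0 to j of the member buffer; measure and isConsonant are hypothetical names, and the y rule is assumed from Porter's reference implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for cons(i).
bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u': return false;
        case L'y': return i == 0 || !isConsonant(w, i - 1);
        default: return true;
    }
}

// Hypothetical stand-in for m(): counts the vc pairs in w by counting
// each position where a consonant directly follows a vowel.
int measure(const std::wstring& w) {
    int count = 0;
    bool prevVowel = false;
    for (int i = 0; i < static_cast<int>(w.size()); ++i) {
        const bool consonant = isConsonant(w, i);
        if (consonant && prevVowel) ++count;
        prevVowel = !consonant;
    }
    return count;
}
```

With this sketch, "tree" and "by" have measure 0, "trouble" and "ivy" have measure 1, and "private" and "orrery" have measure 2, which matches the examples in Porter's paper.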
void Lucene::PorterStemmer::r | ( | const wchar_t * | s | ) |
protected |
void Lucene::PorterStemmer::setto | ( | const wchar_t * | s | ) |
protected |
Sets (j+1),...k to the characters in the string s, readjusting k.
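As an illustration of the index arithmetic, the following sketch replaces everything after position j with s and recomputes the inclusive end index k. It works on a std::wstring, so the string is resized, whereas the member function presumably overwrites a raw buffer in place; setTo and its parameters are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for setto(s): w[j+1..] becomes s, and k is
// updated to the new inclusive end index.
void setTo(std::wstring& w, int j, int& k, const std::wstring& s) {
    assert(j >= -1 && j < static_cast<int>(w.size()));
    w.replace(j + 1, std::wstring::npos, s);
    k = j + static_cast<int>(s.size());
}
```

For "ponies" with j = 2 (the position before the suffix "ies"), calling setTo with "i" yields "poni" and k = 3.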
boost::shared_ptr< PorterStemmer > Lucene::PorterStemmer::shared_from_this | ( | ) |
inline |
bool Lucene::PorterStemmer::stem | ( | CharArray | word | ) |
bool Lucene::PorterStemmer::stem | ( | wchar_t * | b, |
int32_t | k | ||
) |
In stem(b, k), b is a wchar_t pointer, and the string to be stemmed runs from b[0] to b[k] inclusive. b[k+1] may be '\0', but this is not required. The stemmer adjusts the characters b[0] ... b[k] and stores the new end-point of the string, k'. Stemming never increases word length, so 0 <= k' <= k.
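Because k and k' are inclusive end indices rather than lengths, an 8-character word such as "caresses" is passed with k = 7. The sketch below shows only this index convention; fromInclusiveEnd is a hypothetical helper that is not part of Lucene++, and it makes no assumption about what getResultLength() returns.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Lucene++: copies b[0] ... b[kEnd],
// where kEnd is an inclusive end index as used by stem(b, k).
std::wstring fromInclusiveEnd(const wchar_t* b, int kEnd) {
    assert(b != nullptr && kEnd >= 0);
    return std::wstring(b, b + kEnd + 1);  // +1 because kEnd is inclusive
}
```

If the new end-point for "caresses" were k' = 5, the stemmed word would be the six characters b[0] ... b[5], that is "caress".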
void Lucene::PorterStemmer::step1ab | ( | ) |
protected |
step1ab() gets rid of plurals and -ed or -ing, e.g.:
caresses -> caress, ponies -> poni, ties -> ti, caress -> caress, cats -> cat
feed -> feed, agreed -> agree, disabled -> disable
matting -> mat, mating -> mate, meeting -> meet, milling -> mill, messing -> mess
meetings -> meet
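The examples above can be reproduced with a standalone sketch of this step. It is modelled on Porter's reference implementation and operates on a std::wstring by value; it is not the Lucene++ source, and every function name below is hypothetical. It also assumes the caller only passes words of three or more letters, as the reference stemmer does.

```cpp
#include <cassert>
#include <string>

bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u': return false;
        case L'y': return i == 0 || !isConsonant(w, i - 1);
        default: return true;
    }
}

// Measure of w: number of vowel-to-consonant transitions.
int measure(const std::wstring& w) {
    int count = 0;
    bool prevVowel = false;
    for (int i = 0; i < static_cast<int>(w.size()); ++i) {
        const bool consonant = isConsonant(w, i);
        if (consonant && prevVowel) ++count;
        prevVowel = !consonant;
    }
    return count;
}

bool hasVowel(const std::wstring& w) {
    for (int i = 0; i < static_cast<int>(w.size()); ++i) {
        if (!isConsonant(w, i)) return true;
    }
    return false;
}

bool endsWith(const std::wstring& w, const std::wstring& s) {
    return w.size() >= s.size() &&
           w.compare(w.size() - s.size(), s.size(), s) == 0;
}

bool endsDoubleConsonant(const std::wstring& w) {
    const int n = static_cast<int>(w.size());
    return n >= 2 && w[n - 1] == w[n - 2] && isConsonant(w, n - 1);
}

bool endsCvc(const std::wstring& w) {
    const int i = static_cast<int>(w.size()) - 1;
    if (i < 2 || !isConsonant(w, i) || isConsonant(w, i - 1) ||
        !isConsonant(w, i - 2)) {
        return false;
    }
    return w[i] != L'w' && w[i] != L'x' && w[i] != L'y';
}

std::wstring step1ab(std::wstring w) {
    // Plurals: -sses -> -ss, -ies -> -i, and a single final -s is dropped.
    if (endsWith(w, L"sses")) {
        w.erase(w.size() - 2);
    } else if (endsWith(w, L"ies")) {
        w.replace(w.size() - 3, 3, L"i");
    } else if (w.size() >= 2 && w.back() == L's' && w[w.size() - 2] != L's') {
        w.pop_back();
    }
    // -eed -> -ee, but only when the stem before it has measure > 0.
    if (endsWith(w, L"eed")) {
        if (measure(w.substr(0, w.size() - 3)) > 0) w.pop_back();
        return w;
    }
    // -ed and -ing are removed only when the remaining stem has a vowel.
    std::wstring::size_type cut = 0;
    if (endsWith(w, L"ed")) cut = 2;
    else if (endsWith(w, L"ing")) cut = 3;
    if (cut == 0 || !hasVowel(w.substr(0, w.size() - cut))) return w;
    w.erase(w.size() - cut);
    // Tidy up the stem: restore an e, or undouble a final consonant.
    if (endsWith(w, L"at") || endsWith(w, L"bl") || endsWith(w, L"iz")) {
        w += L'e';
    } else if (endsDoubleConsonant(w)) {
        const wchar_t ch = w.back();
        if (ch != L'l' && ch != L's' && ch != L'z') w.pop_back();
    } else if (measure(w) == 1 && endsCvc(w)) {
        w += L'e';
    }
    return w;
}
```

The final branch is where the consonant - vowel - consonant test is used: "hoping" is first cut to "hop" and then restored to "hope", while "meeting" stays "meet" because the letters before the final t are two vowels.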
void Lucene::PorterStemmer::step1c | ( | ) |
protected |
Turns terminal y to i when there is another vowel in the stem.
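A minimal sketch of this rule on a std::wstring, assuming "the stem" means everything before the final y. The function names are hypothetical and the y handling in isConsonant is assumed from Porter's reference implementation.

```cpp
#include <cassert>
#include <string>

bool isConsonant(const std::wstring& w, int i) {
    assert(i >= 0 && i < static_cast<int>(w.size()));
    switch (w[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u': return false;
        case L'y': return i == 0 || !isConsonant(w, i - 1);
        default: return true;
    }
}

bool hasVowel(const std::wstring& w) {
    for (int i = 0; i < static_cast<int>(w.size()); ++i) {
        if (!isConsonant(w, i)) return true;
    }
    return false;
}

// Hypothetical stand-in for step1c(): a final y becomes i when the part
// of the word before it contains a vowel.
std::wstring step1c(std::wstring w) {
    if (!w.empty() && w.back() == L'y' &&
        hasVowel(w.substr(0, w.size() - 1))) {
        w.back() = L'i';
    }
    return w;
}
```

So "happy" becomes "happi", while "sky" is left alone because "sk" contains no vowel.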
void Lucene::PorterStemmer::step2 | ( | ) |
protected |
Maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.
void Lucene::PorterStemmer::step3 | ( | ) |
protected |
Deals with -ic-, -full, -ness, etc., using a similar strategy to step2.
void Lucene::PorterStemmer::step4 | ( | ) |
protected |
Takes off -ant, -ence etc., in context <c>vcvc<v>.
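void Lucene::PorterStemmer::step5 | ( | ) |
protected |
Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.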
bool Lucene::PorterStemmer::vowelinstem | ( | ) |
protected |
Return true if 0,...j contains a vowel.
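wchar_t* Lucene::PorterStemmer::b |
protected |
bool Lucene::PorterStemmer::dirty |
protected |
int32_t Lucene::PorterStemmer::i |
protected |
int32_t Lucene::PorterStemmer::j |
protected |
int32_t Lucene::PorterStemmer::k |
protected |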