Lucene++ - a full-featured, c++ search engine
API Documentation


Lucene::PorterStemmer Class Reference

This is the Porter stemming algorithm, coded up as thread-safe ANSI C by the author.

#include <PorterStemmer.h>


Public Member Functions

 PorterStemmer ()
 
virtual ~PorterStemmer ()
 
virtual String getClassName ()
 
boost::shared_ptr< PorterStemmer > shared_from_this ()
 
bool stem (CharArray word)
 
bool stem (wchar_t *b, int32_t k)
 In stem(b, k), b is a wchar_t pointer, and the string to be stemmed runs from b[0] to b[k] inclusive. b[k+1] may be '\0', but this is not required. The stemmer adjusts the characters b[0] ... b[k] in place and stores the new end-point of the string, k'. Stemming never increases word length, so 0 <= k' <= k.
 
wchar_t * getResultBuffer ()
 
int32_t getResultLength ()
 
- Public Member Functions inherited from Lucene::LuceneObject
virtual ~LuceneObject ()
 
virtual void initialize ()
 Called directly after instantiation to create objects that depend on this object being fully constructed.
 
virtual LuceneObjectPtr clone (const LuceneObjectPtr &other=LuceneObjectPtr())
 Return clone of this object.
 
virtual int32_t hashCode ()
 Return hash code for this object.
 
virtual bool equals (const LuceneObjectPtr &other)
 Return whether two objects are equal.
 
virtual int32_t compareTo (const LuceneObjectPtr &other)
 Compare two objects.
 
virtual String toString ()
 Returns a string representation of the object.
 
- Public Member Functions inherited from Lucene::LuceneSync
virtual ~LuceneSync ()
 
virtual SynchronizePtr getSync ()
 Return this object synchronize lock.
 
virtual LuceneSignalPtr getSignal ()
 Return this object signal.
 
virtual void lock (int32_t timeout=0)
 Lock this object using an optional timeout.
 
virtual void unlock ()
 Unlock this object.
 
virtual bool holdsLock ()
 Returns true if this object is currently locked by current thread.
 
virtual void wait (int32_t timeout=0)
 Wait for signal using an optional timeout.
 
virtual void notifyAll ()
 Notify all threads waiting for signal.
 

Static Public Member Functions

static String _getClassName ()
 

Protected Member Functions

bool cons (int32_t i)
 Returns true if b[i] is a consonant. ('b' means 'z->b', but here and below we drop 'z->' in comments.)
 
int32_t m ()
 Measures the number of consonant sequences between 0 and j.
 
bool vowelinstem ()
 Return true if 0,...j contains a vowel.
 
bool doublec (int32_t j)
 Return true if j,(j-1) contain a double consonant.
 
bool cvc (int32_t i)
 Return true if i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y. This is used when trying to restore an e at the end of a short word.
 
bool ends (const wchar_t *s)
 Returns true if 0,...k ends with the string s.
 
void setto (const wchar_t *s)
 Sets (j+1),...k to the characters in the string s, readjusting k.
 
void r (const wchar_t *s)
 
void step1ab ()
 Gets rid of plurals and -ed or -ing.
 
void step1c ()
 Turns terminal y to i when there is another vowel in the stem.
 
void step2 ()
 Maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
 
void step3 ()
 Deals with -ic-, -ful, -ness etc. Uses a similar strategy to step2.
 
void step4 ()
 Takes off -ant, -ence etc., in context <c>vcvc<v>.
 
void step5 ()
 Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
 
- Protected Member Functions inherited from Lucene::LuceneObject
 LuceneObject ()
 

Protected Attributes

wchar_t * b
 
int32_t k
 
int32_t j
 
int32_t i
 
bool dirty
 
- Protected Attributes inherited from Lucene::LuceneSync
SynchronizePtr objectLock
 
LuceneSignalPtr objectSignal
 

Detailed Description

This is the Porter stemming algorithm, coded up as thread-safe ANSI C by the author.

It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked DEPARTURE.

See also http://www.tartarus.org/~martin/PorterStemmer

The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!

Release 2 (the more old-fashioned, non-thread-safe version may be regarded as release 1.)

Constructor & Destructor Documentation

◆ PorterStemmer()

Lucene::PorterStemmer::PorterStemmer ( )

◆ ~PorterStemmer()

virtual Lucene::PorterStemmer::~PorterStemmer ( )
virtual

Member Function Documentation

◆ _getClassName()

static String Lucene::PorterStemmer::_getClassName ( )
inline static

◆ cons()

bool Lucene::PorterStemmer::cons ( int32_t  i)
protected

Returns true if b[i] is a consonant. ('b' means 'z->b', but here and below we drop 'z->' in comments.)
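
The consonant test can be illustrated with a minimal standalone sketch. It follows the definition in Porter's published algorithm (a, e, i, o, u are vowels; y is a consonant only at the start of the word or after a vowel) and is not the Lucene++ source. The name isConsonant and the std::wstring parameter are assumptions made for illustration; the class itself works on its internal buffer b.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical free-function version of cons(): true if word[i] is a consonant.
// Assumes lower-case input, as the Porter algorithm does.
bool isConsonant(const std::wstring& word, std::size_t i) {
    assert(i < word.size());
    switch (word[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u':
            return false;
        case L'y':
            // y counts as a consonant at position 0, otherwise only
            // when the preceding letter is a vowel.
            return i == 0 ? true : !isConsonant(word, i - 1);
        default:
            return true;
    }
}
```

Under this definition the y in "toy" is a consonant, while the first y in "syzygy" is treated as a vowel.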

◆ cvc()

bool Lucene::PorterStemmer::cvc ( int32_t  i)
protected

Return true if i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y. This is used when trying to restore an e at the end of a short word.

For example, cav(e), lov(e), hop(e) and crim(e) match, but snow, box and tray do not.
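
A possible standalone rendering of this check is sketched below. The names endsCvc and isConsonant are hypothetical, and the functions take a std::wstring rather than reading the member buffer b; the logic mirrors the rule stated above and is not taken from the Lucene++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: Porter's consonant test on lower-case input.
bool isConsonant(const std::wstring& word, std::size_t i) {
    assert(i < word.size());
    switch (word[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u':
            return false;
        case L'y':
            return i == 0 ? true : !isConsonant(word, i - 1);
        default:
            return true;
    }
}

// True if word[i-2], word[i-1], word[i] is consonant - vowel - consonant
// and the final consonant is not w, x or y.
bool endsCvc(const std::wstring& word, std::size_t i) {
    assert(i < word.size());
    if (i < 2) return false;  // fewer than three letters available
    if (!isConsonant(word, i) || isConsonant(word, i - 1) || !isConsonant(word, i - 2)) {
        return false;
    }
    const wchar_t last = word[i];
    return last != L'w' && last != L'x' && last != L'y';
}
```

With these definitions, endsCvc(L"hop", 2) holds, so a trailing e can be restored, whereas endsCvc(L"box", 2) does not.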

◆ doublec()

bool Lucene::PorterStemmer::doublec ( int32_t  j)
protected

Return true if j,(j-1) contain a double consonant.

◆ ends()

bool Lucene::PorterStemmer::ends ( const wchar_t *  s)
protected

Returns true if 0,...k ends with the string s.

◆ getClassName()

virtual String Lucene::PorterStemmer::getClassName ( )
inline virtual

◆ getResultBuffer()

wchar_t * Lucene::PorterStemmer::getResultBuffer ( )

◆ getResultLength()

int32_t Lucene::PorterStemmer::getResultLength ( )

◆ m()

int32_t Lucene::PorterStemmer::m ( )
protected

Measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence, then:

<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
...
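
The measure can be computed by skipping an optional leading consonant run and then counting each vowel run that is followed by a consonant run. The sketch below is a standalone illustration of that idea, not the Lucene++ code; measure and isConsonant are hypothetical names, and the word and the index j are passed in explicitly instead of being read from the members b and j.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: Porter's consonant test on lower-case input.
bool isConsonant(const std::wstring& word, std::size_t i) {
    assert(i < word.size());
    switch (word[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u':
            return false;
        case L'y':
            return i == 0 ? true : !isConsonant(word, i - 1);
        default:
            return true;
    }
}

// Counts the vowel-consonant (vc) pairs in word[0..j], inclusive.
int measure(const std::wstring& word, std::size_t j) {
    assert(j < word.size());
    int n = 0;
    std::size_t i = 0;
    // Skip the optional leading consonant sequence.
    while (i <= j && isConsonant(word, i)) ++i;
    while (i <= j) {
        // Consume a vowel sequence.
        while (i <= j && !isConsonant(word, i)) ++i;
        if (i > j) break;  // trailing vowels do not add to the count
        ++n;               // a consonant follows the vowels: one more vc
        while (i <= j && isConsonant(word, i)) ++i;
    }
    return n;
}
```

For example, measure(L"tree", 3) is 0, measure(L"trouble", 6) is 1 and measure(L"oaten", 4) is 2.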

◆ r()

void Lucene::PorterStemmer::r ( const wchar_t *  s)
protected

◆ setto()

void Lucene::PorterStemmer::setto ( const wchar_t *  s)
protected

Sets (j+1),...k to the characters in the string s, readjusting k.
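
How ends() and setto() cooperate can be shown with a small standalone sketch. It assumes, as in Porter's reference C code, that a successful ends() records the last index of the remaining stem in j; the page itself does not state this. SuffixBuffer and current() are hypothetical names, and a std::wstring stands in for the raw wchar_t buffer the class uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the stemmer's b, k and j members.
struct SuffixBuffer {
    std::wstring b;  // working buffer
    int k;           // index of the last character, inclusive
    int j;           // index of the last stem character, set by ends()

    explicit SuffixBuffer(const std::wstring& word)
        : b(word), k(static_cast<int>(word.size()) - 1), j(k) {
        assert(!word.empty());
    }

    // True if b[0..k] ends with s; on success j marks the end of the stem.
    bool ends(const std::wstring& s) {
        const int len = static_cast<int>(s.size());
        if (len > k + 1) return false;
        if (b.compare(static_cast<std::size_t>(k - len + 1),
                      static_cast<std::size_t>(len), s) != 0) {
            return false;
        }
        j = k - len;
        return true;
    }

    // Replaces b[j+1..k] with s and readjusts k.
    void setto(const std::wstring& s) {
        b.replace(static_cast<std::size_t>(j + 1),
                  static_cast<std::size_t>(k - j), s);
        k = j + static_cast<int>(s.size());
    }

    std::wstring current() const {
        return b.substr(0, static_cast<std::size_t>(k + 1));
    }
};
```

Starting from "ponies", a call to ends(L"ies") followed by setto(L"i") leaves "poni" in the buffer with k equal to 3.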

◆ shared_from_this()

boost::shared_ptr< PorterStemmer > Lucene::PorterStemmer::shared_from_this ( )
inline

◆ stem() [1/2]

bool Lucene::PorterStemmer::stem ( CharArray  word)

◆ stem() [2/2]

bool Lucene::PorterStemmer::stem ( wchar_t *  b, int32_t  k )

In stem(b, k), b is a wchar_t pointer, and the string to be stemmed runs from b[0] to b[k] inclusive. b[k+1] may be '\0', but this is not required. The stemmer adjusts the characters b[0] ... b[k] in place and stores the new end-point of the string, k'. Stemming never increases word length, so 0 <= k' <= k.
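
Because k and k' are inclusive end-points rather than lengths, callers need an off-by-one adjustment in both directions. The helpers below sketch that conversion only; lastIndex and stemFromEndpoint are hypothetical names and do not call into Lucene++. How k' is actually retrieved from the stemmer is not described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Index of the last character of a non-empty word: the value to pass as k.
int32_t lastIndex(const std::wstring& word) {
    assert(!word.empty());
    return static_cast<int32_t>(word.size()) - 1;
}

// Rebuilds the stem from the adjusted buffer and the inclusive end-point k'.
std::wstring stemFromEndpoint(const wchar_t* b, int32_t newEnd) {
    assert(b != nullptr && newEnd >= 0);
    // An inclusive end-point of newEnd means newEnd + 1 characters.
    return std::wstring(b, static_cast<std::size_t>(newEnd) + 1);
}
```

For the word "cats", k is 3; if the stemmer reports k' = 2, the stem is the first three characters, "cat".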

◆ step1ab()

void Lucene::PorterStemmer::step1ab ( )
protected

Gets rid of plurals and -ed or -ing. For example:

caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat

feed -> feed
agreed -> agree
disabled -> disable

matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess

meetings -> meet

◆ step1c()

void Lucene::PorterStemmer::step1c ( )
protected

Turns terminal y to i when there is another vowel in the stem.
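
As a minimal sketch of this rule, the function below returns a copy of the word with a final y changed to i when any earlier letter is a vowel. It is a standalone illustration based on the description above, not the Lucene++ implementation, which edits its buffer in place; step1cSketch and isConsonant are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: Porter's consonant test on lower-case input.
bool isConsonant(const std::wstring& word, std::size_t i) {
    assert(i < word.size());
    switch (word[i]) {
        case L'a': case L'e': case L'i': case L'o': case L'u':
            return false;
        case L'y':
            return i == 0 ? true : !isConsonant(word, i - 1);
        default:
            return true;
    }
}

// Turns a terminal y into i if the part before it contains a vowel.
std::wstring step1cSketch(std::wstring word) {
    if (word.size() < 2 || word.back() != L'y') return word;
    for (std::size_t i = 0; i + 1 < word.size(); ++i) {
        if (!isConsonant(word, i)) {
            word.back() = L'i';
            break;
        }
    }
    return word;
}
```

With this sketch "happy" becomes "happi", while "sky" is left unchanged because the stem "sk" contains no vowel.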

◆ step2()

void Lucene::PorterStemmer::step2 ( )
protected

Maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.

◆ step3()

void Lucene::PorterStemmer::step3 ( )
protected

Deals with -ic-, -ful, -ness etc. Uses a similar strategy to step2.

◆ step4()

void Lucene::PorterStemmer::step4 ( )
protected

Takes off -ant, -ence etc., in context <c>vcvc<v>.

◆ step5()

void Lucene::PorterStemmer::step5 ( )
protected

Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.

◆ vowelinstem()

bool Lucene::PorterStemmer::vowelinstem ( )
protected

Return true if 0,...j contains a vowel.

Field Documentation

◆ b

wchar_t* Lucene::PorterStemmer::b
protected

◆ dirty

bool Lucene::PorterStemmer::dirty
protected

◆ i

int32_t Lucene::PorterStemmer::i
protected

◆ j

int32_t Lucene::PorterStemmer::j
protected

◆ k

int32_t Lucene::PorterStemmer::k
protected

The documentation for this class was generated from the following file:

clucene.sourceforge.net