How to use the synonym feature in word (finding English synonyms through a word document)
2025-04-03
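The project is an iterative, second-round development on top of a fairly old code base. It originally used the Paoding (庖丁) analyzer, which has not been updated for a long time and has compatibility problems with newer Lucene versions, so before tackling synonyms I replaced the Paoding Chinese segmenter with the word Chinese segmenter.
The synonym-handling code is mainly based on these two posts: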
https://blog.csdn.net/yax405/article/details/43246237
https://blog.csdn.net/winnerspring/article/details/37567739
Part of the code is shown below:
package org.apdplat.word.lucene;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.lucene.analysis.util.FilesystemResourceLoader;
import org.apache.lucene.util.Version;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SynonymsAnalyzer extends Analyzer {
    private static final Logger LOGGER = LoggerFactory.getLogger(SynonymsAnalyzer.class);
    private Segmentation segmentation = null;

    public SynonymsAnalyzer() {
        this.segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.BidirectionalMinimumMatching);
    }

    public SynonymsAnalyzer(String segmentationAlgorithm) {
        try {
            SegmentationAlgorithm sa = SegmentationAlgorithm.valueOf(segmentationAlgorithm);
            this.segmentation = SegmentationFactory.getSegmentation(sa);
        } catch (Exception e) {
            // Unknown algorithm name: fall back to the default algorithm.
            this.segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.BidirectionalMinimumMatching);
        }
    }

    public SynonymsAnalyzer(SegmentationAlgorithm segmentationAlgorithm) {
        this.segmentation = SegmentationFactory.getSegmentation(segmentationAlgorithm);
    }

    public SynonymsAnalyzer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    private static SynonymFilterFactory factory = null;

    // Lazily builds the shared synonym filter factory on first use.
    protected static SynonymFilterFactory getSynonymsFactory() {
        if (factory == null) {
            Map<String, String> paramsMap = new HashMap<>();
            Version ver = Version.LUCENE_5_5_1;
            paramsMap.put("luceneMatchVersion", ver.toString());
            paramsMap.put("synonyms", "./synonyms.txt");
            paramsMap.put("expand", "true");
            factory = new SynonymFilterFactory(paramsMap);
            try {
                // synonyms.txt is resolved relative to this directory.
                FilesystemResourceLoader loader = new FilesystemResourceLoader(Paths.get("D:/RobotK/tomcat/synonyms"));
                factory.inform(loader);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return factory;
    }

    // Rebuilds the factory so that changes to synonyms.txt are picked up.
    public static void reloadSynonymFilterFactory() {
        Map<String, String> paramsMap = new HashMap<>();
        Version ver = Version.LUCENE_5_5_1;
        paramsMap.put("luceneMatchVersion", ver.toString());
        paramsMap.put("synonyms", "./synonyms.txt");
        paramsMap.put("expand", "true");
        factory = new SynonymFilterFactory(paramsMap);
        try {
            FilesystemResourceLoader loader = new FilesystemResourceLoader(Paths.get("D:/RobotK/tomcat/synonyms"));
            factory.inform(loader);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static LowerCaseFilterFactory caseFactory = null;

    public static LowerCaseFilterFactory getCaseFilterFactory() {
        if (caseFactory == null) {
            Version ver = Version.LUCENE_5_5_1;
            Map
            // (the listing is cut off here in the source)
Other posts consulted:
https://blog.csdn.net/u011066470/article/details/60963439
http://www.hankcs.com/program/java/lucene-synonymfilterfactory.html
https://blog.csdn.net/yax405/article/details/43246237
https://blog.csdn.net/liyantianmin/article/details/59485799
https://github.com/ysc/word/blob/master/src/main/java/org/apdplat/word/lucene/ChineseWordAnalyzer.java
https://blog.csdn.net/winnerspring/article/details/37567739
https://my.oschina.net/apdplat/blog/228619
http://www.hankcs.com/program/java/lucene-synonymfilterfactory.html
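In Lucene 4.6, Chinese synonyms are easy to implement with SynonymFilterFactory: a few lines of code and a synonym dictionary are enough. The same dictionary can also provide a degree of spelling correction in Lucene, which improves the search experience. In the example below we load a synonym dictionary from disk and turn the sentence "其實hankcs似好人" into a token stream for indexing, correcting the misspelling "似->是" along the way.
First, the synonym dictionary, located at ./data/synonyms.txt: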
我,俺,hankcs
似,is,are => 是
好人,好心人,熱心人
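The dictionary above uses two rule formats:
Expandable synonyms, separated by ","
For example, "我,俺,hankcs" declares these three words to be synonyms, and any one of them can be expanded into all three. If expand is set to false, all three are instead replaced by the first word, "我".
Non-expandable synonyms, contracted with "=>"
For example, "似,is,are => 是" declares these three words to be synonymous and, regardless of the expand parameter, replaces each of them with "是".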
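To make the two formats concrete, here is a minimal plain-JDK sketch that reads such rule lines into a lookup table. It is not Lucene's own parser: the class name `SynonymRuleSketch`, the `parse` method and the handling of `#` comment lines are assumptions made for illustration, and it ignores multi-word terms and escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a simplified reader for the two rule formats, not Lucene's parser. */
final class SynonymRuleSketch {

    /** Parses rule lines into a map from each input term to the terms that replace it. */
    static Map<String, List<String>> parse(List<String> lines, boolean expand) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int arrow = line.indexOf("=>");
            if (arrow >= 0) {
                // Explicit mapping: every term on the left is replaced by the right-hand side.
                List<String> outputs = splitTerms(line.substring(arrow + 2));
                for (String input : splitTerms(line.substring(0, arrow))) {
                    rules.put(input, outputs);
                }
            } else {
                // Equivalence group: expand to the whole group, or collapse to its first term.
                List<String> group = splitTerms(line);
                if (group.isEmpty()) {
                    continue;
                }
                List<String> outputs = expand ? group : group.subList(0, 1);
                for (String input : group) {
                    rules.put(input, outputs);
                }
            }
        }
        return rules;
    }

    /** Splits a comma-separated list, trimming whitespace and dropping empty entries. */
    private static List<String> splitTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.split(",")) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("我,俺,hankcs", "似,is,are => 是", "好人,好心人,熱心人");
        System.out.println(parse(lines, true));   // groups expand to all members
        System.out.println(parse(lines, false));  // groups collapse to their first term
    }
}
```

With `expand` set to true, every member of a comma-separated group maps to the whole group; with false, it maps only to the first term. A `=>` rule produces the same mapping either way.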
Next, the loading code:
package com.hankcs.test;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.FilesystemResourceLoader;
import org.apache.lucene.util.Version;
import org.apache.uima.annotator.WhitespaceTokenizer;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/**
 * @author hankcs
 */
public class TestSynonyms {
    private static void displayTokens(TokenStream ts) throws IOException {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String token = termAttr.toString();
            System.out.print(offsetAttribute.startOffset() + "-" + offsetAttribute.endOffset() + "[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String testInput = "其實 hankcs 似 好人";
        Version ver = Version.LUCENE_46;
        Map
Output:
0-2[其實] 3-9[我] 3-9[俺] 3-9[hankcs] 10-11[是] 12-14[好人] 12-14[好心人] 12-14[熱心人]
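Because 我, 俺 and hankcs mean the same thing, they are emitted as equivalent terms with identical offsets, 3->9. That span is determined by the length of the original word, hankcs.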
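The offset behaviour can be reproduced without Lucene. The sketch below is a simplification under stated assumptions: the class `OffsetSketch` and its `expand` method are hypothetical, the input is split on single spaces, and the rule table is written out by hand to mirror the dictionary shown earlier. Every term produced for a token reuses the start and end offsets of that token.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: shows that synonyms inherit the offsets of the token they came from. */
final class OffsetSketch {

    /** Splits on single spaces and renders each output term as start-end[term]. */
    static String expand(String input, Map<String, List<String>> rules) {
        StringBuilder out = new StringBuilder();
        int start = 0;
        for (String token : input.split(" ")) {
            int end = start + token.length();
            // Tokens without a rule pass through unchanged; the others are replaced by the rule output.
            for (String term : rules.getOrDefault(token, List.of(token))) {
                out.append(start).append('-').append(end)
                   .append('[').append(term).append("] ");
            }
            start = end + 1; // step over the single separating space
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Hand-written rule table (an assumption) equivalent to the dictionary with expand=true.
        Map<String, List<String>> rules = Map.of(
                "hankcs", List.of("我", "俺", "hankcs"),
                "似", List.of("是"),
                "好人", List.of("好人", "好心人", "熱心人"));
        System.out.println(expand("其實 hankcs 似 好人", rules));
    }
}
```

For this input the sketch prints the same offsets and terms as the Lucene output above, because all three terms generated from hankcs are tagged with the 3-9 span of the original token.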
https://blog.csdn.net/yax405/article/details/43246237
https://blog.csdn.net/u010366796/article/details/44937025
http://www.voidcn.com/article/p-txqtdabn-bbo.html
http://blog.csdn.net/winnerspring/article/details/37521101
http://www.voidcn.com/article/p-pjrzypvg-bbo.html
http://blog.csdn.net/winnerspring/article/details/37567739
http://blog.csdn.net/hu948162999/article/details/41283597
http://www.voidcn.com/article/p-xrordklc-bah.html
https://iamyida.iteye.com/blog/2197355
https://cloud.tencent.com/info/034aa996312ba4928c57ae831d6acedf.html
Indexing synonyms in Lucene
public interface SynonymEngine {
    String[] getSynonyms(String key);
}
public class SynonymEngineImpl implements SynonymEngine {

    private static HashMap
public class SynonymFilter extends TokenFilter {
    private SynonymEngine engine;
    private CharTermAttribute ct;
    private PositionIncrementAttribute pt;
    private Stack
public class SynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public SynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    @Override
    public TokenStream tokenStream(String s, Reader reader) {
        // Chain: StandardTokenizer -> StandardFilter -> LowerCaseFilter -> StopFilter -> SynonymFilter
        return new SynonymFilter(
                new StopFilter(Version.LUCENE_35,
                        new LowerCaseFilter(Version.LUCENE_35,
                                new StandardFilter(Version.LUCENE_35,
                                        new StandardTokenizer(Version.LUCENE_35, reader))),
                        StopAnalyzer.ENGLISH_STOP_WORDS_SET),
                engine);
    }
}
public class TestSynonym {
    private RAMDirectory directory;

    @Test
    public void init() {
        directory = new RAMDirectory();
        SynonymEngine engine = new SynonymEngineImpl();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, new SynonymAnalyzer(engine));
        String content = "The quick brown fox jumps over the lazy dog";

        try {
            // Index one document through the synonym analyzer.
            IndexWriter writer = new IndexWriter(directory, config);
            Document doc = new Document();
            doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Search for "pooch", a word that does not occur in the original text.
            IndexReader reader = IndexReader.open(directory);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs docs = searcher.search(new TermQuery(new Term("content", "pooch")), 10);
            for (ScoreDoc sd : docs.scoreDocs) {
                Document d = searcher.doc(sd.doc);
                System.out.println(d.get("content"));
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
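The test searches for "pooch" even though the indexed sentence only contains "dog". The following plain-JDK sketch illustrates why that can match: each synonym is recorded at the same token position as the word it stands for. The synonym table here is hypothetical (the SynonymEngineImpl listing above is cut off, so its real entries are unknown), the class `PositionIndexSketch` is invented for illustration, and stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a toy positional index in which synonyms share the original word's position. */
final class PositionIndexSketch {

    // Hypothetical synonym table; these entries are assumptions, not the article's data.
    static final Map<String, String[]> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("quick", new String[] {"fast", "speedy"});
        SYNONYMS.put("jumps", new String[] {"leaps", "hops"});
        SYNONYMS.put("dog", new String[] {"canine", "pooch"});
    }

    /** Maps each term to the token positions at which it occurs. */
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            postings.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            for (String synonym : SYNONYMS.getOrDefault(word, new String[0])) {
                // Same position as the original word: the analogue of a position increment of 0.
                postings.computeIfAbsent(synonym, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index("The quick brown fox jumps over the lazy dog");
        System.out.println("dog   -> " + postings.get("dog"));
        System.out.println("pooch -> " + postings.get("pooch"));
    }
}
```

Both lookups return the same position list, so a single-term query for the synonym finds the document exactly as a query for the original word would.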
http://www.itzk.com/b/1109/581983.shtml
Lucene's synonym analyzer explained
This analyzer wraps StandardAnalyzer with a SynonymFilter. As terms are fed into the filter, it buffers them and uses a stack to hold each term's synonyms.[code]
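public class SynonymFilter extends TokenFilter {
    public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";
    private Stack synonymStack;
    private SynonymEngine engine;

    public SynonymFilter(TokenStream in, SynonymEngine engine) {
        super(in);
        synonymStack = new Stack(); // buffers pending synonyms
        this.engine = engine;
    }

    public Token next() throws IOException {
        if (synonymStack.size() > 0) { // synonyms of the current word are still pending: emit one
            return (Token) synonymStack.pop();
        }
        Token token = input.next(); // read the next word
        if (token == null) {
            return null;
        }
        addAliasesToStack(token); // store the new word's synonyms
        return token;
    }

    private void addAliasesToStack(Token token) throws IOException {
        String[] synonyms = engine.getSynonyms(token.termText());
        if (synonyms == null) return;
        for (int i = 0; i < synonyms.length; i++) {
            Token synToken = new Token(synonyms[i], token.startOffset(), token.endOffset(), TOKEN_TYPE_SYNONYM);
            synToken.setPositionIncrement(0); // same position as the original word
            synonymStack.push(synToken);
        }
    }
}
[/code]
The interface below is the key piece. It can be implemented freely; its job is to return the array of synonyms for s.
[code]
public interface SynonymEngine {
    String[] getSynonyms(String s) throws IOException;
}
[/code]
Use this analyzer with care: at query time there is no need to list every synonym. The following example
[code]
Query query = QueryParser.parse("\"fox jumps\"", "content", synonymAnalyzer);
Hits hits = searcher.search(query);
[/code]
goes wrong and finds no results. QueryParser does not take position increments into account, so the zero position increment that marks a token as a synonym is lost, and the phrase "fox jumps" is interpreted, with the synonyms simply appended, as "fox jumps hops leaps".

https://stackoverrun.com/cn/q/4705011

Question:
I'm trying to figure out how Lucene's analyzers work. My question is how Lucene handles synonyms. Here is the situation: we have single-word and multi-word synonyms.
Single word: foo = bar
Multi-word: foo bar = foobar
For the single-word case: does Lucene expand the indexed record or not? My guess is that if a query contains a word like "foo", it also adds "bar" to the query. I don't know whether the same happens at index time.
For the multi-word case: does Lucene expand both the query and the index? For example, if we have "foo bar", does it add foobar to the index or the query?
My second question: Lucene takes a token stream and feeds it to filters such as the lowercase filter. How does Lucene find multi-word terms? For instance, how does it discover that "foo bar" is a multi-word term?

Answer (femtoRgon, 24 Jun 2013):
SynonymFilter can optionally keep the original word and add the synonym to the TokenStream, by setting keepOrig = true (see SynonymMap.Builder.add()). This behavior can cause problems with PhraseQueries and the like; see the caveats in the SynonymFilter documentation.
If you use the same Analyzer for querying and indexing, then queries and the documents written to the index are of course processed in the same way. SynonymFilter with keepOrig set to true is one of the few analysis components that is often applied differently at query time and at index time, but that depends entirely on your implementation.
As for how it is implemented, the source code is available to you.

Comments:
Mr.Boy, 24 Jun 2013: How does it handle multi-word synonyms, such as "new york" = "ny" or wal mart = wal-mart = walmart, given that it filters token by token? I don't see how it finds multi-word synonyms.
femtoRgon, 24 Jun 2013: Are you confused about its behavior, or do you want to know how the implementation handles the token stream? If the latter, that is why I linked to the source code. If the former, it greedily searches for the longest match it can find from a given position (that is, if you have the rules 'foo' -> 'bar' and 'foo bar' -> 'foobar', then 'foo bar' becomes 'foobar', not 'bar bar'). I don't believe it supports something like 'wal mart = wal-mart = walmart' (a synonym rule has one input and one output). If there is anything specific you want to ask, go ahead.
My question is how it handles the token stream. I would guess the synonym filter receives tokens one at a time and is stateless. For example, if the current token is "new", how does it check the next token to see whether it is "york"?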
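The greedy longest-match behaviour described in the answer can be sketched in plain Java. This is only an illustration of the idea, assuming the filter may buffer a few tokens of lookahead before deciding; it is not how Lucene's SynonymFilter is actually written, and the class `LongestMatchSketch`, its `rewrite` method and the `maxRuleWords` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: greedy longest-match rewriting over a token list with bounded lookahead. */
final class LongestMatchSketch {

    /** Rewrites tokens using the longest rule that matches at each position. */
    static List<String> rewrite(List<String> tokens, Map<String, String> rules, int maxRuleWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            String replacement = null;
            // Look ahead up to maxRuleWords tokens and try the longest candidate first.
            for (int n = Math.min(maxRuleWords, tokens.size() - i); n >= 1; n--) {
                String candidate = String.join(" ", tokens.subList(i, i + n));
                if (rules.containsKey(candidate)) {
                    matched = n;
                    replacement = rules.get(candidate);
                    break;
                }
            }
            if (matched == 0) {
                out.add(tokens.get(i)); // no rule starts here: pass the token through
                i++;
            } else {
                out.add(replacement);
                i += matched; // consume every token covered by the rule
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rules = Map.of("foo", "bar", "foo bar", "foobar");
        System.out.println(rewrite(List.of("foo", "bar"), rules, 2)); // longest rule wins
        System.out.println(rewrite(List.of("foo", "baz"), rules, 2)); // falls back to the one-word rule
    }
}
```

The point relevant to the last question is that such a filter cannot be stateless: to decide whether "new" begins a longer match it has to hold the token back until it has seen what follows.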
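Copyright notice: the content of this article was submitted by internet users and the copyright belongs to the original authors. This site does not own the copyright and assumes no corresponding legal liability. If you find content on this site that appears to be plagiarized or inaccurate, please contact us at jiasou666@gmail.com; once verified, this site will remove the infringing content within 24 hours.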