
#1 | Posted 2010-06-30 11:59

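Lucene is not a complete full-text search application. It is a full-text indexing engine toolkit written in Java that can easily be embedded in all kinds of applications to give them full-text indexing and search.

About the author: Lucene's contributor, Doug Cutting, is a veteran full-text indexing and search expert. He was the main developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system), later worked as a senior system architect at Excite, and is currently doing research on low-level Internet infrastructure. His goal in contributing Lucene was to make it possible to add full-text search to all kinds of small and medium-sized applications.

History: Lucene was first published on the author's own site, www.lucene.com, then on SourceForge, and at the end of 2001 it became a subproject of the Apache Software Foundation's Jakarta project: http://jakarta.apache.org/lucene/

Many Java projects already use Lucene as their back-end full-text indexing engine.

Eclipse: a Java-based open development platform; the full-text index of its help system uses Lucene.

For Chinese users, the biggest concern is whether Lucene supports Chinese full-text search. Thanks to Lucene's well-designed architecture, supporting Chinese only requires extending its lexical analysis (analyzer) interface.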
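To make the "extend the lexical analysis interface" point a bit more concrete, here is a minimal sketch in plain Java (no Lucene dependency) of the simplest possible Chinese tokenization: every CJK character becomes its own token, while runs of Latin letters or digits stay together. The names `CharTokenizerSketch` and `tokenize` are made up for illustration; this only approximates how a per-character analyzer behaves and is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Lucene class.
class CharTokenizerSketch {

    // Splits text into tokens: one token per CJK ideograph, one token per run
    // of letters/digits (lower-cased); every other character is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean cjk = Character.UnicodeBlock.of(c)
                    == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
            if (cjk) {
                flush(run, tokens);
                tokens.add(String.valueOf(c));
            } else if (Character.isLetterOrDigit(c)) {
                run.append(Character.toLowerCase(c));
            } else {
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Emits the pending letter/digit run, if any, as one token.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("搜索引擎 hello Lucene 2010"));
    }
}
```

With this kind of tokenization a term in the index is a single Chinese character, which is worth keeping in mind when reading the query results in the demos.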
Below is an introductory demo, tested with Lucene 3.0.2.
package org.surpass.test;

import java.io.IOException;

import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

public class OneDemo {
private static String queryKey = "索*";
private static Directory path = new RAMDirectory();

public static void main(String[] args) {
OneDemo fd = new OneDemo ();
// Build the index
fd.createLuceneIndex();
System.out.println("-------------------");
// Run the test search. Iterating over the returned hits (TopDocs/ScoreDoc in 3.0) gives each
// matching Document, and the Document gives access to its Field values.
TopDocs topDocs = fd.searchLuceneIndex(queryKey);
if (topDocs != null) {
System.out.println("Hits: " + topDocs.totalHits);
// Print the results
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int i = 0; i < scoreDocs.length; i++) {
try {
Searcher search = new IndexSearcher(path);
// scoreDocs is an array, so the current hit has to be selected with [i]
Document targetDoc = search.doc(scoreDocs[i].doc);
System.out.println("Content: " + targetDoc.toString());
System.out.println(scoreDocs[i].score);
} catch (Exception e) {
e.printStackTrace();
}
   System.out.println("===========================");
}
}
}

/**
   * Create the index
   */
public void createLuceneIndex() {
try {
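// IndexWriter builds the index, which is comparable to a table in a database
// IndexWriter(index directory, analyzer, whether to overwrite an existing index, maximum field length)
IndexWriter iwriter = new IndexWriter(path, new ChineseAnalyzer(),
true, IndexWriter.MaxFieldLength.LIMITED);
// A Document is a bit like one row of a database table
Document doc1 = new Document();
// A Field is similar to a column in a database
// Store {YES: store the value, NO: do not store it}; older releases also had COMPRESS (compressed storage for long text or binary data), which is no longer available in 3.0
// Index {NO: not indexed, ANALYZED (formerly TOKENIZED): tokenized and indexed, NOT_ANALYZED (formerly UN_TOKENIZED): indexed without tokenizing,
// NOT_ANALYZED_NO_NORMS (formerly NO_NORMS): indexed without tokenizing and without norms, which saves the byte per document that norms normally take for the field}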
Field field1 = new Field("content", "搜索引擎", Store.YES,
Index.ANALYZED);
doc1.add(field1);
Document doc2 = new Document();
Field field2 = new Field("content", "創建索引", Store.YES,
Index.ANALYZED);
doc2.add(field2);
iwriter.addDocument(doc1);
iwriter.addDocument(doc2);
iwriter.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (LockObtainFailedException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

/**
   * Search the index
   *
   * @param word
   *          the search keyword
   * @return the top matching documents, or null if the search failed
   */
public TopDocs searchLuceneIndex(String word) {
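// Query: Lucene provides several commonly used Query types:
// TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery,
// PhrasePrefixQuery (MultiPhraseQuery in 3.0), FuzzyQuery, RangeQuery (TermRangeQuery in 3.0), SpanQuery.
// A Query describes how the field being searched should be matched,
// e.g. fuzzy, semantic, phrase, range or combined (boolean) queries. There is also QueryParser,
// which builds the different Query types from a query string, and MultiFieldQueryParser, which searches several fields for the same keyword.
// IndexSearcher determines which index directory is searched; it is a bit like choosing which indexed table of a database to query and how its columns are examined.
// With an IndexSearcher and a Query you can retrieve the results you need.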
Query query = new WildcardQuery(new Term("content", word));
Searcher search = null;
try {
search = new IndexSearcher(path);
return search.search(query, 5);
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
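
The demo's "索*" query finds both documents because the analyzer indexes each Chinese character as a separate term, and a wildcard pattern is matched against individual terms rather than against the whole field text. Here is a rough sketch of that idea in plain Java: a tiny inverted index (term to document ids) plus a glob matcher in which `*` matches any sequence and `?` matches exactly one character. `WildcardSketch` and its methods are hypothetical names for illustration, and this is not how Lucene implements WildcardQuery internally.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of term-level wildcard matching; not Lucene code.
class WildcardSketch {
    // term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();

    // Indexes a document one character per term, as a per-character analyzer would.
    void add(int docId, String text) {
        for (int i = 0; i < text.length(); i++) {
            String term = String.valueOf(text.charAt(i));
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Ids of all documents that have at least one term matching the pattern.
    Set<Integer> search(String pattern) {
        Set<Integer> hits = new TreeSet<Integer>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (matches(pattern, 0, e.getKey(), 0)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }

    // Glob match: '*' = any sequence (including empty), '?' = exactly one character.
    static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) {
            return si == s.length();
        }
        char c = p.charAt(pi);
        if (c == '*') {
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) {
                    return true;
                }
            }
            return false;
        }
        if (si < s.length() && (c == '?' || c == s.charAt(si))) {
            return matches(p, pi + 1, s, si + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        WildcardSketch index = new WildcardSketch();
        index.add(0, "搜索引擎");
        index.add(1, "創建索引");
        System.out.println(index.search("索*"));   // both documents contain the term 索
        System.out.println(index.search("搜索*")); // no single-character term starts with 搜索
    }
}
```

The second search comes back empty for the same reason multi-character wildcard patterns find nothing with a per-character analyzer: there is simply no term in the index that long.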


surpass_li (moderator)

#2 | Posted 2010-06-30 13:31

Search test
package org.surpass.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Search {
Date startTime, endTime;

/**
   * Where the index is stored; in this example it is kept in memory.
   */
private Directory path = new RAMDirectory();

/**
   * Create the index
   */
public void createLuceneIndex() {
IndexWriter writer;
try {
writer = new IndexWriter(path, new ChineseAnalyzer(), true,
IndexWriter.MaxFieldLength.LIMITED);
Document docA = new Document();
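// A Field is like a column in a database: the first argument is the column name, the second is its value, and the last two are enums (JDK 1.5) that configure how the field is stored and indexed
// Field.Store controls whether the original value is stored in the index so that it can be returned with the search results
Field fieldA = new Field("content", "搜索引擎19:58:25", Store.YES,
Index.ANALYZED);
// Add the column (fieldA) to a row (docA)
docA.add(fieldA);
// English test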
docA.add(new Field("content", "hello lucene ,I love you",
Store.YES, Index.ANALYZED));
docA.add(new Field("lastModifyTime", "2010個人 19:58:25", Store.YES,
Index.ANALYZED));

Document docB = new Document();
// As above: column name, column value, then the Store and Index options
Field fieldB = new Field("content", "創建索引", Store.YES,
Index.ANALYZED);
// Add the column (fieldB) to a row (docB)
docB.add(fieldB);
docB.add(new Field("content", "i live in shanghai.i come from cn",
Store.YES, Index.ANALYZED));
docB.add(new Field("lastModifyTime", "2020個人", Store.YES,
Index.ANALYZED));
Document docC = new Document();
Field fieldC = new Field("content", "19:58:25", Store.YES,
Index.ANALYZED);
// Add the column (fieldC) to a row (docC)
docC.add(fieldC);
docC.add(new Field("content", "this is a test demo", Store.YES,
Index.ANALYZED));
docC.add(new Field("lastModifyTime", "2010", Store.YES,
Index.ANALYZED));

writer.addDocument(docA);
writer.addDocument(docB);

writer.addDocument(docC);

// When indexing a large amount of data, optimize the index (this merges its segments) so that searches run faster
writer.optimize();

// As with a database connection: once opened, it has to be closed when you are done
writer.close();

} catch (Exception e) {
e.printStackTrace();
}
}

/**
   * Create an index from a file
   */
public void createIndexByFile() {
IndexWriter writer;
try {
File file = new File("test.txt");
String filePath = file.getAbsolutePath();
// println rather than printf, so a '%' in the path or the file content is not read as a format specifier
System.out.println("filePath:====" + filePath);
System.out
.println("====================================================");
String content = file2String(filePath, "UTF-8");
System.out.println("content:====" + content);
System.out
.println("====================================================");
writer = new IndexWriter(path, new ChineseAnalyzer(), true,
IndexWriter.MaxFieldLength.LIMITED);

Document docA = new Document();

Field fieldA = new Field("content", content, Field.Store.YES,
Field.Index.ANALYZED);
docA.add(new Field("path", filePath, Field.Store.YES,
Field.Index.NOT_ANALYZED));
docA.add(fieldA);

writer.addDocument(docA);

// When indexing a large amount of data, optimize the index (this merges its segments) so that searches run faster
writer.optimize();

// As with a database connection: once opened, it has to be closed when you are done
writer.close();

} catch (Exception e) {
e.printStackTrace();
}
}

private String file2String(String fileName, String charset)
throws Exception {
BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(fileName), charset));
try {
StringBuilder builder = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
// readLine() strips the line terminator, so put one back to keep words on adjacent lines apart
builder.append(line).append('\n');
}
return builder.toString();
} finally {
// Always release the file handle, even if reading fails
reader.close();
}
}

/**
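   * Comparable to the condition after WHERE in SQL. Wildcard searches (WildcardQuery) are not recommended, since a pattern that starts with a wildcard has to be checked against every term.
   */
private Query wildcardQuery() {
// where username = 'lucene' and password='apache'
// ? matches exactly one character; * matches zero or more characters
// The pattern is matched against individual terms, and ChineseAnalyzer indexes each Chinese character as its own term, so:
// "*搜*" finds one document, "*索*" finds two, and "*搜索*" and "*索引*" find none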
Term term = new Term("content", "*索*");
return new WildcardQuery(term);
}

// Based on Lucene's tokenization: a TermQuery matches exactly one term, i.e. a single Chinese character (with ChineseAnalyzer) or a single English word
public Query termQuery() {
Term term = new Term("content", "come");
// Term term = new Term("content", "搜");
return new TermQuery(term);
}

/**
   * "Smart" search: let QueryParser build the query from a query string
   *
   * @return
   */
public Query queryParser() {
QueryParser queryParser = new QueryParser(Version.LUCENE_30,
"content", new ChineseAnalyzer());
try {
return queryParser.parse("搜索  擎");
} catch (Exception e) {
e.printStackTrace();
}
return null;
}

/**
   * AND/OR search
   *
   * @return
   */
public Query booleanQuery() {
Term term1 = new Term("content", "索");
Term term2 = new Term("content", "搜");

TermQuery tempQuery1 = new TermQuery(term1);
TermQuery tempQuery2 = new TermQuery(term2);

// Personally, I think JoinQuery would be a better name for it
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(tempQuery1, BooleanClause.Occur.MUST);
booleanQuery.add(tempQuery2, BooleanClause.Occur.SHOULD);
return booleanQuery;
}

/**
   * Multi-keyword (phrase) search
   *
   * @return
   */
public Query phraseQuery() {
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.setSlop(1);
phraseQuery.add(new Term("content", "搜"));
phraseQuery.add(new Term("content", "擎"));
return phraseQuery;
}

/**
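   * Range search
   *
   * @return
   */
public Query rangeQuery() {
// TermRangeQuery matches every term that sorts (as a string) between the two bounds,
// both inclusive here. A SpanTermQuery would match only a single term.
// The class name is fully qualified so that no extra import is needed.
return new org.apache.lucene.search.TermRangeQuery("lastModifyTime",
"20100603", "20150808", true, true);
}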

public void search() {
try {
// Comparable to "select * from tableName" in SQL
IndexSearcher search = new IndexSearcher(path);

startTime = new Date();
// The abstract query object
Query query = queryParser();
// query = wildcardQuery();
//query = termQuery();
//query = phraseQuery();
//query = booleanQuery();

// The search result is conceptually very similar to a JDBC result set. Why is that?
// Lucene's API mirrors many JDBC concepts
TopDocs topDocs = search.search(query, 5);
if (topDocs != null) {
System.out.println("Hits: " + topDocs.totalHits);
// Print the results
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int i = 0; i < scoreDocs.length; i++) {
try {
Document targetDoc = search.doc(scoreDocs[i].doc);
System.out.println("Content: " + targetDoc.toString());
System.out.println(scoreDocs[i].score);
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("===========================");
}
}

endTime = new Date();

System.out.println("Search took "
+ (endTime.getTime() - startTime.getTime()) + " ms");

} catch (Exception e) {
e.printStackTrace();
}
}

/**
   * @param args
   */
public static void main(String[] args) {
Search search = new Search();
search.createLuceneIndex();
// search.createIndexByFile();

search.search();
}

}

Create test.txt by hand (the file name used in createIndexByFile()); its content is the text to be indexed.
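
On booleanQuery(): the MUST clause decides which documents match at all, while the SHOULD clause is optional and only pushes documents that also contain it further up. A rough plain-Java sketch of that behaviour follows. `BooleanSketch` is a made-up name, and the "ranking" here simply counts SHOULD matches; it is only an assumption-laden stand-in for Lucene's real scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of MUST/SHOULD semantics; not Lucene code, and not its scoring formula.
class BooleanSketch {

    // Keeps the documents that contain every MUST term, then orders them so that
    // documents matching more SHOULD terms come first (the sort is stable).
    static List<String> search(List<String> docs, List<String> must, final List<String> should) {
        List<String> hits = new ArrayList<String>();
        for (String doc : docs) {
            if (count(doc, must) == must.size()) {
                hits.add(doc);
            }
        }
        Collections.sort(hits, new Comparator<String>() {
            public int compare(String a, String b) {
                return count(b, should) - count(a, should);
            }
        });
        return hits;
    }

    // Number of the given terms that occur in the document. A substring test is used,
    // which equals a term test when every character is indexed as its own term.
    static int count(String doc, List<String> terms) {
        int n = 0;
        for (String term : terms) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("創建索引", "搜索引擎", "19:58:25");
        // MUST 索, SHOULD 搜: the same clauses as booleanQuery()
        System.out.println(search(docs, Arrays.asList("索"), Arrays.asList("搜")));
    }
}
```

Both documents containing 索 are returned, and the one that also contains 搜 is listed first; the document without 索 is dropped regardless of the SHOULD clause.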


#3 | Posted 2010-06-30 14:51

Field.Store explained
Store
   YES: store the value
   NO: do not store the value
The source code is as follows:

/** Store the original field value in the index. This is useful for short texts
   * like a document's title which should be displayed with the results. The
   * value is stored in its original form, i.e. no analyzer is used before it is
   * stored.
   */
YES {
   @Override
   public boolean isStored() { return true; }
},

/** Do not store the field value in the index. */
NO {
   @Override
   public boolean isStored() { return false; }
};
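
To show what the two options mean in practice, here is a small plain-Java sketch of the difference between "indexed" and "stored": a field can be searchable without its original text being retrievable. `StoreSketch` and its methods are hypothetical and only imitate the idea; the `boolean store` parameter stands in for Store.YES / Store.NO.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of "stored" versus "indexed"; not Lucene code.
class StoreSketch {
    // "field:term" -> ids of the documents containing that term (the searchable part)
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
    // doc id -> original values of the fields that were added with store = true
    private final Map<Integer, Map<String, String>> stored = new HashMap<Integer, Map<String, String>>();

    // Always indexes the whitespace-separated terms; keeps the original value only if store is true.
    void addField(int docId, String field, String value, boolean store) {
        for (String term : value.split("\\s+")) {
            String key = field + ":" + term;
            Set<Integer> docs = index.get(key);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                index.put(key, docs);
            }
            docs.add(docId);
        }
        if (store) {
            Map<String, String> fields = stored.get(docId);
            if (fields == null) {
                fields = new HashMap<String, String>();
                stored.put(docId, fields);
            }
            fields.put(field, value);
        }
    }

    // Ids of the documents whose field contains the term.
    Set<Integer> search(String field, String term) {
        Set<Integer> docs = index.get(field + ":" + term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    // Original value of a field, or null if it was not stored.
    String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        StoreSketch s = new StoreSketch();
        s.addField(0, "title", "hello lucene", true);     // like Store.YES
        s.addField(0, "body", "full text search", false); // like Store.NO
        System.out.println(s.search("body", "search")); // the body is still searchable
        System.out.println(s.get(0, "title"));          // the title can be shown with the results
        System.out.println(s.get(0, "body"));           // the body text cannot be retrieved
    }
}
```

This matches the advice in the Javadoc above: store short values you want to display (a title), and skip storing large bodies that only need to be searchable.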


#4 | Posted 2010-06-30 14:53

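Lucene's built-in analyzers handle Chinese word segmentation very poorly, frankly to a painful degree, so I recommend using IKAnalyzer for Chinese segmentation.
IKAnalyzer is an excellent Chinese word segmenter.
Here is how the official documentation introduces it:
It uses its own "forward-iteration finest-granularity segmentation algorithm" and can process 600,000 characters per second.
It uses a multi-subprocessor analysis mode that supports segmenting English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation) and Chinese words (including personal names and place names).
Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported.
IKQueryParser, a query parser optimized for Lucene full-text search, uses an ambiguity-analysis algorithm to optimize how the query keywords are combined, which can greatly improve Lucene's hit rate.
1. Deploying IKAnalyzer: put IKAnalyzer3.X.jar in the project's lib directory, and place IKAnalyzer.cfg.xml and ext_stopword.dic in the root of the source directory.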
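
For anyone wondering what "finest-granularity" dictionary segmentation means in contrast to the per-character splitting of the built-in analyzer, here is a rough plain-Java sketch: at every position it emits all dictionary words that start there, and falls back to a single character only where no word covers the position. This is just my simplified reading of the idea, with a made-up dictionary and made-up names (`SegmentSketch`, `segment`); it is not IKAnalyzer's actual algorithm or API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of dictionary-based segmentation; not IKAnalyzer code.
class SegmentSketch {

    // Emits every dictionary word found at every start position. A character
    // that is not covered by any emitted word becomes a token on its own.
    static List<String> segment(String text, Set<String> dict, int maxWordLength) {
        List<String> tokens = new ArrayList<String>();
        int coveredUntil = 0; // end (exclusive) of the furthest-reaching word emitted so far
        for (int start = 0; start < text.length(); start++) {
            boolean found = false;
            for (int len = 1; len <= maxWordLength && start + len <= text.length(); len++) {
                String candidate = text.substring(start, start + len);
                if (dict.contains(candidate)) {
                    tokens.add(candidate);
                    coveredUntil = Math.max(coveredUntil, start + len);
                    found = true;
                }
            }
            if (!found && start >= coveredUntil) {
                tokens.add(text.substring(start, start + 1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A tiny, made-up dictionary for the example
        Set<String> dict = new HashSet<String>(Arrays.asList("搜索", "引擎", "搜索引擎", "索引"));
        System.out.println(segment("搜索引擎", dict, 4));
    }
}
```

Because whole words end up in the index, a term query for a word such as 搜索 can match directly, which is not possible when every character is its own term.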


#5 | Posted 2010-06-30 14:56

Searching the index and highlighting the hits with Highlighter
Part of the code is as follows:
// Highlighting setup; query, search (the IndexSearcher) and hits (the ScoreDoc[] of the result) come from the search code and are not shown here
   Analyzer analyzer = new IKAnalyzer(); // the analyzer
   SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>","</B>"); // highlight format: the prefix and suffix wrapped around each highlighted term
   Highlighter highlighter = new Highlighter(simpleHtmlFormatter,new QueryScorer(query));
   highlighter.setTextFragmenter(new SimpleFragmenter(150)); // fragment size in characters; like a search engine, show only an excerpt rather than the whole text
   for(int i=0;i<hits.length;i++){
         Document doc = search.doc(hits[i].doc);
         TokenStream tokenStream = analyzer.tokenStream("",new StringReader(doc.get("content")));
         String str = highlighter.getBestFragment(tokenStream, doc.get("content"));
         System.out.println(str);
   }
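
If it helps to see what the fragmenter and formatter are doing conceptually, here is a much simplified plain-Java sketch: cut the text into fixed-size chunks, pick the chunk with the most query-term occurrences, and wrap each occurrence in the prefix and suffix. `HighlightSketch` and `bestFragment` are hypothetical names, and this is only an approximation of the idea, not the Highlighter's real implementation (which works on token streams and scores fragments differently).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of fragment selection and highlighting; not Lucene code.
class HighlightSketch {

    // Returns the best fixed-size chunk of the text with every term occurrence
    // wrapped in pre/post, or null if no term occurs anywhere in the text.
    static String bestFragment(String text, List<String> terms, int fragmentSize,
                               String pre, String post) {
        String best = null;
        int bestScore = 0;
        for (int start = 0; start < text.length(); start += fragmentSize) {
            String chunk = text.substring(start, Math.min(text.length(), start + fragmentSize));
            int score = 0;
            for (String term : terms) {
                score += occurrences(chunk, term);
            }
            if (score > bestScore) { // on a tie the earlier chunk wins
                bestScore = score;
                best = chunk;
            }
        }
        if (best == null) {
            return null;
        }
        for (String term : terms) {
            if (term.length() > 0) {
                // Naive replacement: terms that overlap the tags or each other are not handled
                best = best.replace(term, pre + term + post);
            }
        }
        return best;
    }

    // Counts non-overlapping occurrences of term in s; empty terms count as zero.
    static int occurrences(String s, String term) {
        if (term.length() == 0) {
            return 0;
        }
        int n = 0;
        for (int i = s.indexOf(term); i >= 0; i = s.indexOf(term, i + term.length())) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. Full-text search with Lucene is fast.";
        System.out.println(bestFragment(text, Arrays.asList("search"), 150, "<B>", "</B>"));
    }
}
```

Note that a term split across a chunk boundary is missed by this sketch, which is one reason the real thing works on tokens rather than raw character offsets.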