A Complete Guide to Intelligent Document Search with Spring Boot and Elasticsearch

Updated: 2025-08-01 09:14:37 Author: 墨夶

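Spring Boot is a framework for quickly developing, running, and deploying Spring applications, and Elasticsearch is an open-source, distributed full-text search engine. This article walks through a complete setup for intelligent document search built on the two; use it as a reference if you need something similar.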

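1. Background and Technology Choices

Intelligent search over document content is a frequent requirement in enterprise applications. For example:

Automatically extract text after a PDF/Word document is uploaded
Support Chinese word segmentation and fuzzy matching
Highlight the matched keywords in search results

Technology choices

Technology	Role
Spring Boot	Quickly build the microservice
Elasticsearch	Full-text search and highlighting
Jieba analysis plugin	Chinese word segmentation
Ingest Attachment Processor Plugin	Document content extraction (PDF, Word, etc.)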

2. Environment Setup

2.1 Maven dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot web starter -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Elasticsearch client (Spring Data Elasticsearch) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>

    <!-- File handling utilities (Apache POI) -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>

    <!-- Jieba analysis plugin, coordinates as given in the original article.
         Analysis runs inside Elasticsearch once the server-side plugin is installed (section 3.1),
         so the application itself may not need this dependency. -->
    <dependency>
        <groupId>com.nlp</groupId>
        <artifactId>elasticsearch-analysis-jieba</artifactId>
        <version>7.17.0</version>
    </dependency>
</dependencies>

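2.2 Configuration file

# application.yml
# Property names follow Spring Boot 2.x; Spring Boot 2.6+ prefers spring.elasticsearch.uris.
# The transport-client properties spring.data.elasticsearch.cluster-name / cluster-nodes
# were removed in Spring Boot 2.3 and are not needed with the REST client, so they are omitted.
spring:
  elasticsearch:
    rest:
      uris: http://localhost:9200
      username: elastic
      password: your_password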

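3. Core Implementation Steps

3.1 Install the Elasticsearch plugins

Ingest Attachment Processor Plugin

# Install the plugin (local Elasticsearch)
elasticsearch-plugin install ingest-attachment

# Install the plugin (inside a Docker container)
docker exec -it elasticsearch bin/elasticsearch-plugin install ingest-attachment

Note: the plugin version must match the Elasticsearch version. Restart Elasticsearch for the plugin to take effect.

Jieba Chinese analysis plugin

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-jieba/releases/download/v7.17.0/elasticsearch-analysis-jieba-7.17.0.zip

3.2 Create the document extraction pipeline

An Elasticsearch ingest pipeline processes the uploaded file content automatically at index time.

3.2.1 Define the pipeline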

PUT _ingest/pipeline/attachment-extract
{
  "description": "Extract attachment content",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    }
  ]
}

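Key points:

The attachment processor parses the Base64-encoded file content into text, stored under attachment.content.
The remove processor deletes the original Base64 field so that only the extracted text is kept.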
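
Before wiring the pipeline into the application, it can help to exercise it through Elasticsearch's _simulate endpoint. The sketch below only builds the request with the JDK (Java 11+) and does not send it. The class name PipelineSimulation and the base URL http://localhost:9200 are assumptions for illustration, not part of the original article.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the original article.
final class PipelineSimulation {

    // Builds the JSON body for the _simulate endpoint: one document whose
    // "content" field carries the Base64-encoded file bytes.
    static String simulateBody(byte[] fileBytes) {
        String base64 = Base64.getEncoder().encodeToString(fileBytes);
        // Standard Base64 uses only [A-Za-z0-9+/=], so no JSON escaping is needed.
        return "{\"docs\":[{\"_source\":{\"content\":\"" + base64 + "\"}}]}";
    }

    // Builds (but does not send) the HTTP request for the attachment-extract pipeline.
    static HttpRequest simulateRequest(String baseUrl, byte[] fileBytes) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_ingest/pipeline/attachment-extract/_simulate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(simulateBody(fileBytes)))
                .build();
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(simulateBody(sample));
        // Assumed local node address.
        System.out.println(simulateRequest("http://localhost:9200", sample).uri());
    }
}
```

Sending the request with java.net.http.HttpClient (plus credentials, if security is enabled) should return the document as the pipeline would store it, which makes it easy to confirm that attachment.content is populated.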

3.3 Define the index and mapping

The index mapping and settings determine how data is stored and how text is analyzed.

3.3.1 Create the index

PUT /fileinfo
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "fileName": { "type": "text" },
      "fileType": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "jieba" }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "jieba": {
          "type": "custom",
          "tokenizer": "jieba_tokenizer"
        }
      }
    }
  }
}

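Note: attachment.content needs a Chinese-aware analyzer. Without one, Elasticsearch falls back to the standard analyzer, which splits Chinese text into single characters and noticeably degrades full-text matching. The tokenizer named in the settings (jieba_tokenizer here) must be one that the installed plugin actually registers, so check the plugin's documentation.

3.4 Document processing in Java

3.4.1 File upload endpoint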

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.util.*;

@RestController
@RequestMapping("/api/files")
public class FileUploadController {

    // IndexRequest (with a pipeline) is executed by RestHighLevelClient,
    // so the client is injected directly rather than ElasticsearchRestTemplate.
    @Autowired
    private RestHighLevelClient client;

    @PostMapping("/upload")
    public ResponseEntity<String> uploadFile(@RequestParam("file") MultipartFile file) throws IOException {
        // 1. Encode the file as Base64
        String base64Content = Base64.getEncoder().encodeToString(file.getBytes());

        // 2. Build the document
        String id = UUID.randomUUID().toString();
        Map<String, Object> document = new HashMap<>();
        document.put("id", id);
        document.put("fileName", file.getOriginalFilename());
        document.put("fileType", getFileType(file.getOriginalFilename()));
        document.put("content", base64Content);  // Base64 payload consumed by the pipeline

        // 3. Index the document through the pipeline
        IndexRequest request = new IndexRequest("fileinfo")
                .id(id)
                .setPipeline("attachment-extract")  // key step: bind the pipeline
                .source(document);

        client.index(request, RequestOptions.DEFAULT);

        return ResponseEntity.ok("File indexed successfully");
    }

    // Returns the lower-case file extension (e.g. "pdf"), or "unknown" if there is none.
    private String getFileType(String filename) {
        int dot = (filename == null) ? -1 : filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "unknown";
        }
        return filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}

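Code notes:

Base64.getEncoder() converts the file bytes to a Base64 string so the binary file can travel inside a JSON document, which is the input the attachment processor expects.
setPipeline("attachment-extract") routes the document through the predefined pipeline.
The final index() call sends the request to Elasticsearch.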
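
Base64 inflates the payload by roughly a third, which matters because Elasticsearch rejects HTTP requests larger than http.max_content_length (100mb by default). The sketch below estimates the encoded size before uploading. Base64Sizing and its method names are hypothetical, and the check is approximate because the JSON document also carries the other fields.

```java
import java.util.Base64;

// Hypothetical helper for illustration; not part of the original article.
final class Base64Sizing {

    // Padded Base64 emits 4 characters for every 3 input bytes, rounding up.
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    // True if the encoded payload alone stays within the given request-size budget.
    static boolean fitsBudget(long rawBytes, long maxRequestBytes) {
        return encodedLength(rawBytes) <= maxRequestBytes;
    }

    public static void main(String[] args) {
        byte[] sample = new byte[10];
        // Compare the estimate with what the JDK encoder actually produces.
        System.out.println(encodedLength(sample.length));
        System.out.println(Base64.getEncoder().encodeToString(sample).length());
    }
}
```

A controller could call fitsBudget(file.getSize(), limit) before encoding and reject oversized uploads early instead of letting the index request fail.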

3.5 Full-text search with highlighting

3.5.1 Search endpoint

// Lives in FileUploadController. Assumes a RestHighLevelClient field named `client` is injected there,
// plus imports from org.elasticsearch.action.search, org.elasticsearch.client, org.elasticsearch.index.query,
// org.elasticsearch.search, org.elasticsearch.search.builder and org.elasticsearch.search.fetch.subphase.highlight.
@GetMapping("/search")
public ResponseEntity<Map<String, Object>> searchFiles(@RequestParam String keyword) throws IOException {
    // 1. Build the query
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(QueryBuilders.matchQuery("attachment.content", keyword)
            .analyzer("jieba")  // analyze the keyword with Jieba
            .fuzziness("AUTO"));

    // 2. Enable highlighting
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.field("attachment.content").preTags("<mark>").postTags("</mark>");
    sourceBuilder.highlighter(highlightBuilder);

    // 3. Execute the search
    SearchRequest searchRequest = new SearchRequest("fileinfo");
    searchRequest.source(sourceBuilder);
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);

    // 4. Extract the highlighted fragment for each hit
    List<Map<String, Object>> results = new ArrayList<>();
    for (SearchHit hit : response.getHits().getHits()) {
        Map<String, Object> source = hit.getSourceAsMap();
        Map<String, HighlightField> highlights = hit.getHighlightFields();
        HighlightField contentHighlight = highlights.get("attachment.content");
        if (contentHighlight != null && contentHighlight.fragments().length > 0) {
            source.put("highlight", contentHighlight.fragments()[0].string());
        }
        results.add(source);
    }

    return ResponseEntity.ok(Collections.singletonMap("results", results));
}

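Key points:

matchQuery("attachment.content", keyword) analyzes the keyword and matches it against the content field.
HighlightBuilder sets the highlight tags (here <mark>).
Each search result carries a highlight field containing the highlighted fragment.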
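
The search code keeps only fragments()[0], while Elasticsearch can return several fragments per field (up to five by default). A possible way to use more of them is to join a few fragments into one snippet. The helper below is a hypothetical sketch over plain strings; in the controller the inputs would come from calling string() on each element of HighlightField.fragments().

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the original article.
final class HighlightSnippets {

    // Joins up to maxFragments highlighted fragments into a single snippet,
    // separated by " ... ". Returns an empty string when there is nothing to join.
    static String joinFragments(String[] fragments, int maxFragments) {
        if (fragments == null || fragments.length == 0) {
            return "";
        }
        return Arrays.stream(fragments)
                .limit(Math.max(0, maxFragments))
                .map(String::trim)
                .collect(Collectors.joining(" ... "));
    }

    public static void main(String[] args) {
        String[] fragments = {"first <mark>match</mark> here", " second <mark>match</mark> "};
        System.out.println(joinFragments(fragments, 3));
    }
}
```

Because the fragments contain raw document text next to the <mark> tags, a front end that renders them as HTML should escape or sanitize the rest of the text.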

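4. Performance Optimization and Caveats

4.1 Caching

Elasticsearch cache: the shard request cache (request_cache) avoids recomputing repeated queries. It is enabled by default, but by default it only caches requests with size=0 (counts and aggregations), not the returned hits.
Application-level cache: cache the results of frequent searches in Redis.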
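
To illustrate the application-level idea without a Redis dependency, here is an in-process stand-in: a small LRU cache with a time-to-live, keyed by search keyword. It is a swapped-in technique rather than the Redis setup the article suggests, and SearchResultCache and all of its members are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical in-memory stand-in for a Redis cache; not part of the original article.
final class SearchResultCache<V> {

    private static final class CachedValue<T> {
        final T value;
        final long expiresAtMillis;

        CachedValue(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock;
    private final Map<String, CachedValue<V>> entries;

    SearchResultCache(int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder=true keeps the least recently used key first, so it is evicted first.
        this.entries = new LinkedHashMap<String, CachedValue<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedValue<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(String keyword, V results) {
        entries.put(keyword, new CachedValue<>(results, clock.getAsLong() + ttlMillis));
    }

    // Returns the cached results, or null if absent or expired.
    synchronized V get(String keyword) {
        CachedValue<V> cached = entries.get(keyword);
        if (cached == null) {
            return null;
        }
        if (clock.getAsLong() >= cached.expiresAtMillis) {
            entries.remove(keyword);
            return null;
        }
        return cached.value;
    }

    public static void main(String[] args) {
        SearchResultCache<String> cache =
                new SearchResultCache<>(100, 60_000, System::currentTimeMillis);
        cache.put("red wine", "[cached results]");
        System.out.println(cache.get("red wine"));
    }
}
```

A short TTL keeps stale results bounded after new documents are indexed. With several application instances a shared cache such as Redis avoids each node warming its own copy.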

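4.2 Pagination and filtering

// Pagination example
sourceBuilder.from(0).size(10);  // 10 results per page
// Sort by time. createTime is not part of the mapping in section 3.3.1: add it as a date field
// and set it at upload time, otherwise Elasticsearch rejects the sort on an unmapped field.
sourceBuilder.sort(SortBuilders.fieldSort("createTime").order(SortOrder.DESC));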
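
The from value is usually derived from a 1-based page number. Elasticsearch also rejects requests where from + size exceeds index.max_result_window (10,000 by default), so it is worth guarding against that in the application. The Paging class below is a hypothetical sketch of both steps.

```java
// Hypothetical helper for illustration; not part of the original article.
final class Paging {

    // Default value of the index.max_result_window setting.
    static final int MAX_RESULT_WINDOW = 10_000;

    // Converts a 1-based page number into the "from" offset and
    // rejects pages that would exceed the result window.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        long from = (long) (page - 1) * size;
        if (from + size > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds " + MAX_RESULT_WINDOW);
        }
        return (int) from;
    }

    public static void main(String[] args) {
        System.out.println(fromOffset(3, 10));
    }
}
```

In the search endpoint this would be used as sourceBuilder.from(Paging.fromOffset(page, size)).size(size). For deeper paging, Elasticsearch's search_after is the usual alternative.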

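4.3 Security and fault tolerance

File type validation: reject unexpected file types at upload.
Exception handling: catch ElasticsearchException and return a friendly error message.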
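
File type validation can combine an extension allow-list with a look at the file's leading bytes, since the extension alone is easy to fake. The sketch below accepts only PDF (which starts with "%PDF-") and DOCX (a ZIP archive, which starts with "PK\x03\x04"). UploadValidator is a hypothetical name, and legacy .doc files are deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original article.
final class UploadValidator {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};  // "PK\x03\x04"

    // Returns the lower-case extension without the dot, or "" if there is none.
    static String extensionOf(String filename) {
        if (filename == null) {
            return "";
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "";
        }
        return filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Accepts a file only if its extension is allowed AND its leading bytes match that type.
    static boolean isAllowed(String filename, byte[] head) {
        switch (extensionOf(filename)) {
            case "pdf":
                return startsWith(head, PDF_MAGIC);
            case "docx":
                return startsWith(head, ZIP_MAGIC);
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isAllowed("report.pdf", pdfHead));
    }
}
```

In the upload endpoint, isAllowed(file.getOriginalFilename(), file.getBytes()) could run before the Base64 step, returning HTTP 400 for rejected files.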

5. Putting the Code Together

5.1 Configuration class (Elasticsearch connection)

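import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;

@Configuration
public class ElasticsearchConfig {

    @Value("${spring.elasticsearch.rest.uris}")
    private String esUri;

    @Bean
    public RestHighLevelClient elasticsearchClient() {
        // HttpHost.create parses scheme, host, and port from a URI such as http://localhost:9200.
        // Splitting the value on ":" would yield "http" and "//localhost" instead of host and port.
        // Note: this bean does not apply the configured username/password.
        return new RestHighLevelClient(RestClient.builder(HttpHost.create(esUri)));
    }

    @Bean
    public ElasticsearchRestTemplate elasticsearchRestTemplate(RestHighLevelClient client) {
        return new ElasticsearchRestTemplate(client);
    }
}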
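
The configured value is a full URI such as http://localhost:9200, which contains two colons, so it has to be decomposed as a URI rather than as a host:port pair. The sketch below shows that decomposition using only the JDK. EsEndpoint is a hypothetical class, and falling back to port 9200 when the URI omits a port is an assumption based on Elasticsearch's default HTTP port.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original article.
final class EsEndpoint {

    final String scheme;
    final String host;
    final int port;

    private EsEndpoint(String scheme, String host, int port) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
    }

    // Parses a full URI such as http://localhost:9200 into its parts.
    static EsEndpoint parse(String uri) {
        URI parsed = URI.create(uri.trim());
        if (parsed.getScheme() == null || parsed.getHost() == null) {
            throw new IllegalArgumentException("Expected a full URI like http://host:9200, got: " + uri);
        }
        // Assumption: use Elasticsearch's default HTTP port when none is given.
        int port = parsed.getPort() == -1 ? 9200 : parsed.getPort();
        return new EsEndpoint(parsed.getScheme(), parsed.getHost(), port);
    }

    public static void main(String[] args) {
        EsEndpoint endpoint = parse("http://localhost:9200");
        System.out.println(endpoint.scheme + " " + endpoint.host + " " + endpoint.port);
    }
}
```

The three parts map directly onto the HttpHost(host, port, scheme) constructor used by the REST client.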

5.2 Sample response with highlighting

{
  "results": [
    {
      "id": "123",
      "fileName": "進口紅酒.pdf",
      "fileType": "pdf",
      "attachment": {
        "content": "這款紅酒產自法國波爾多地區，口感醇厚..."
      },
      "highlight": "這款紅酒產自法國波爾多地區，<mark>口感醇厚</mark>..."
    }
  ]
}

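6. The Document Search Loop, End to End

Step	Core code/config	Purpose
1. Dependencies	pom.xml	Bring in Elasticsearch and the analysis plugin
2. Pipeline definition	PUT _ingest/pipeline/attachment-extract	Extract file content automatically
3. Index mapping	PUT /fileinfo	Define field types and analysis rules
4. File upload	FileUploadController.uploadFile()	Encode the file as Base64 and index it
5. Full-text search	FileUploadController.searchFiles()	Jieba-analyzed search with highlighting

7. Call to Action: Try It Yourself

"Document search no longer has to be hard. Build your own intelligent search system now!"

Try the basics: upload a PDF and verify that its content is extracted.
Tune segmentation: add a custom Jieba dictionary to improve match accuracy.
Extend the search: add filters by file type and time range.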
