Techniques for Efficiently Splitting Text Files in Java

Updated: May 13, 2025 08:29:57 Author: 提前退休的java猿

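This article explains the principle and application of zero-copy in Java. By comparing the performance of a conventional approach with a zero-copy approach, it shows how to use FileChannel's transferTo method to split large text files efficiently, keeping lines intact while significantly improving processing speed.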

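Preface

Whenever I heard about zero-copy, it sounded advanced and far removed from my own work.

I only ever saw it mentioned as something a framework uses; Netty, for example, uses zero-copy.

Then I read an article that demystified zero-copy for me: it turns out I can use it in my own work too. Today I am sharing that article with you.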

A common but inefficient example

When we need to split a text file into chunks of a maximum size, we might be tempted to write code like this:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    private static final long maxFileSizeBytes = 10 * 1024 * 1024; // default: 10 MB

    public void split(Path inputFile, Path outputDir) throws IOException {
        if (!Files.exists(inputFile)) {
            throw new IOException("Input file does not exist: " + inputFile);
        }
        if (Files.size(inputFile) == 0) {
            throw new IOException("Input file is empty: " + inputFile);
        }

        Files.createDirectories(outputDir);

        try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
            int fileIndex = 0;
            long currentSize = 0;
            BufferedWriter writer = null;
            try {
                writer = newWriter(outputDir, fileIndex++);

                String line;
                while ((line = reader.readLine()) != null) {
                    // Allocates a new String and byte[] per line just to measure its size.
                    // UTF-8 matches the charset used by the reader and writer.
                    byte[] lineBytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
                    // Roll over to the next part once this line would exceed the limit.
                    if (currentSize + lineBytes.length > maxFileSizeBytes) {
                        writer.close();
                        writer = newWriter(outputDir, fileIndex++);
                        currentSize = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    currentSize += lineBytes.length;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }

    // Opens the output file for the part with the given index.
    private BufferedWriter newWriter(Path dir, int index) throws IOException {
        Path filePath = dir.resolve("part_" + index + ".txt");
        return Files.newBufferedWriter(filePath);
    }

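Efficiency analysis

This code is technically correct, but it is very inefficient at splitting a large file into chunks.

It performs many heap allocations (one or more per line), creating and discarding large numbers of temporary objects (strings, byte arrays).
There is also a less obvious problem: it copies the data through several buffers and performs context switches between user mode and kernel mode.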

In detail:

BufferedReader:

Calls read() on the underlying FileReader or InputStreamReader.
Data is copied from kernel space into a user-space buffer.
The data is then decoded into Java Strings (heap allocation).

getBytes():

Converts the String into a new byte[], which means more heap allocation.

BufferedWriter:

Takes byte/char data from user space.
Calls write(), which in turn copies the data from user space to kernel space.
The data is eventually flushed to disk.

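As a result, the data moves back and forth between kernel space and user space several times, with extra heap churn along the way. Besides the garbage-collection pressure, this has the following consequences:

Memory bandwidth is wasted on copying between buffers.
CPU utilization is high for what is essentially a disk-to-disk transfer.
The operating system could have handled the bulk copy directly (via DMA or optimized I/O), but the Java code prevents this by routing the data through user-space logic.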
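
To make the copy path described above concrete, here is a minimal sketch of a conventional stream copy. The class and method names (UserSpaceCopy, copy) are illustrative assumptions and do not come from the original article. Every block of data is read from the kernel into a Java byte[] and then handed back to the kernel, which is the round trip the analysis above refers to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the original article.
final class UserSpaceCopy {

    /** Copies source to target through a heap buffer and returns the number of bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        byte[] buffer = new byte[8 * 1024]; // user-space buffer that every byte passes through
        long total = 0;
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = in.read(buffer)) != -1) { // kernel space -> user space
                out.write(buffer, 0, n);          // user space -> kernel space
                total += n;
            }
        }
        return total;
    }
}
```

The line-by-line splitter shown earlier does the same thing, with additional String and byte[] allocations for every line.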

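An efficient solution

So how do we avoid the problems above?

The answer is to use zero copy wherever possible, that is, to avoid leaving kernel space as much as possible. In Java this can be done with the FileChannel method long transferTo(long position, long count, WritableByteChannel target). On operating systems that support it, the bytes go from the source file to the target channel without passing through a user-space buffer, and the call also benefits from the operating system's I/O optimizations.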
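
As a minimal sketch of the transferTo call on its own, the following copies a whole file from one channel to another. The class name ZeroCopyFileCopy is a hypothetical name chosen for illustration. The loop is there because transferTo is allowed to move fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of the original article.
final class ZeroCopyFileCopy {

    /** Copies the whole source file to target with FileChannel.transferTo; returns bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may transfer fewer bytes than requested, so loop until done.
            while (position < size) {
                long moved = in.transferTo(position, size - position, out);
                if (moved <= 0) {
                    break; // nothing more could be transferred
                }
                position += moved;
            }
            return position;
        }
    }
}
```

No byte[] is allocated in this code; whether the transfer really bypasses user space depends on the operating system and file system.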

The catch is that this method operates on blocks of bytes, so it can break lines apart. To solve this, we need a strategy that keeps lines intact even though the file is processed by moving byte ranges.

Without that constraint it would be easy: call transferTo for each chunk, advancing position by position = position + maxFileSize, until no more data can be transferred.
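
The naive chunking just described can be sketched as follows. FixedSizeSplitter, its split method and the part_N file names are assumptions made for this illustration. The sketch deliberately ignores line boundaries, so a part can end in the middle of a line.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original article.
final class FixedSizeSplitter {

    /** Splits source into parts of at most maxBytes each, ignoring line boundaries. */
    static List<Path> split(Path source, Path outputDir, long maxBytes) throws IOException {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> parts = new ArrayList<>();
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                long end = Math.min(position + maxBytes, size);
                Path part = outputDir.resolve("part_" + parts.size());
                try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                    // Loop because transferTo may move fewer bytes than requested.
                    while (position < end) {
                        long moved = in.transferTo(position, end - position, out);
                        if (moved <= 0) {
                            throw new IOException("Transfer stalled at offset " + position);
                        }
                        position += moved;
                    }
                }
                parts.add(part);
            }
        }
        return parts;
    }
}
```

For the 15-byte input "aaaa\nbbbb\ncccc\n" and a 6-byte limit, the first part contains "aaaa\nb": the second line is cut after its first character, which is the problem a line-aware boundary search has to solve.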

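To keep lines intact, we need to determine where the last complete line ends in each chunk of bytes. To do so, we first go to the chunk's intended end and then scan backwards to find the preceding newline. This gives us the exact byte count for the chunk and ensures that it ends with a complete, unbroken line. This is the only part of the code that allocates a buffer and copies data, and because these operations should be minimal, the performance impact is expected to be negligible.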

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

public void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
            FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            var outputFilePath = tempDir.resolve("_part" + fileCounter);
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                    FileChannel outputChannel = fos.getChannel()) {
                // transferTo may move fewer bytes than requested, so loop until the chunk is done
                long transferPosition = position;
                while (transferPosition < endPosition) {
                    long transferred = inputChannel.transferTo(
                            transferPosition, endPosition - transferPosition, outputChannel);
                    if (transferred <= 0) {
                        throw new IOException("Transfer stalled at offset " + transferPosition);
                    }
                    transferPosition += transferred;
                }
            }

            position = endPosition;
            fileCounter++;
        }
    }
}

// Returns the offset just past the last '\n' found in [startPosition, maxPosition).
private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition)
        throws IOException {
    long originalPosition = raf.getFilePointer();

    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;

        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);

            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);

            // readFully guarantees the whole window is read, so no bytes are skipped
            raf.readFully(buffer, 0, bytesToRead);

            // Search backwards through the buffer for newline
            for (int i = bytesToRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesToRead;
        }

        throw new IllegalArgumentException("File cannot be split. No newline found between offsets "
                + startPosition + " and " + maxPosition + ".");
    } finally {
        raf.seek(originalPosition);
    }
}

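The findLastLineEndBeforePosition method has some limitations. Specifically, it only looks for \n as the line terminator (Unix-style line endings), very long lines can cause many backward-read iterations, and a file that contains a line longer than maxSizePerFileInBytes cannot be split. It is, however, a good fit for scenarios such as splitting access log files, which typically have short lines and a large number of entries.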
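
The boundary logic and its limitations are easier to see on a small in-memory input. The sketch below mirrors the backward scan on a byte array instead of a file; LineBoundaryDemo, lastLineEnd and chunkEnds are hypothetical names introduced for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory illustration; not part of the original article.
final class LineBoundaryDemo {

    /** Returns the offset just past the last '\n' in data[start, max). */
    static int lastLineEnd(byte[] data, int start, int max) {
        for (int i = max - 1; i >= start; i--) {
            if (data[i] == '\n') {
                return i + 1;
            }
        }
        // Same failure mode as the file-based version: a line longer than the limit.
        throw new IllegalArgumentException("No newline found in [" + start + ", " + max + ")");
    }

    /** Computes the end offset of every chunk a line-preserving split would produce. */
    static List<Integer> chunkEnds(byte[] data, int maxChunk) {
        List<Integer> ends = new ArrayList<>();
        int position = 0;
        while (position < data.length) {
            int target = Math.min(position + maxChunk, data.length);
            // Only search for a newline when the chunk does not reach the end of the data.
            int end = target < data.length ? lastLineEnd(data, position, target) : target;
            ends.add(end);
            position = end;
        }
        return ends;
    }
}
```

For the 12-byte input "aaa\nbbbb\ncc\n" and a 7-byte limit, chunkEnds returns [4, 9, 12], so every chunk ends right after a newline. For "aaaaaaaaaa\nb\n" with a 4-byte limit it throws IllegalArgumentException, because the first line does not fit into a single chunk.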

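Performance analysis

In theory, splitting files with zero copy should be faster than the conventional approach; now it is time to measure how much faster. I ran some benchmarks for both implementations, and these are the results.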

Benchmark                                                    Mode  Cnt           Score      Error   Units
FileSplitterBenchmark.splitFile                              avgt   15        1179.429 ±   54.271   ms/op
FileSplitterBenchmark.splitFile:·gc.alloc.rate               avgt   15        1349.613 ±   60.903  MB/sec
FileSplitterBenchmark.splitFile:·gc.alloc.rate.norm          avgt   15  1694927403.481 ± 6060.581    B/op
FileSplitterBenchmark.splitFile:·gc.count                    avgt   15         718.000             counts
FileSplitterBenchmark.splitFile:·gc.time                     avgt   15         317.000                 ms
FileSplitterBenchmark.splitFileZeroCopy                      avgt   15          77.352 ±    1.339   ms/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate       avgt   15          23.759 ±    0.465  MB/sec
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate.norm  avgt   15     2555608.877 ± 8644.153    B/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.count            avgt   15          10.000             counts
FileSplitterBenchmark.splitFileZeroCopy:·gc.time             avgt   15           5.000                 ms

Below is the benchmark code used for the results above; the input file it generates is a few hundred MB in size.

// JMH annotations come from org.openjdk.jmh.annotations.
// inputFile, outputDir, splitter and zeroCopySplitter are fields of the benchmark class.
int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

@Setup
public void setup() throws Exception {
    inputFile = Paths.get("/tmp/large_input.txt");
    outputDir = Paths.get("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

@Benchmark
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

@Benchmark
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}

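The zero-copy version shows a considerable speedup: it took only 77 ms, whereas the conventional approach needed 1179 ms for this particular case. When processing large amounts of data or many files, this performance advantage can be crucial.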
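
To put the reported numbers in proportion, the ratios can be computed directly from the benchmark table. The class below is only a small arithmetic sketch with a hypothetical name (BenchmarkRatios); the input values are copied from the results above.

```java
// Hypothetical helper for illustration; the inputs are the scores from the benchmark table.
final class BenchmarkRatios {

    /** Returns how many times larger the baseline value is than the improved value. */
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // Average time per operation, in ms/op.
        System.out.printf("time ratio: %.1f%n", ratio(1179.429, 77.352));
        // Normalized allocation, in bytes per operation.
        System.out.printf("allocation ratio: %.1f%n", ratio(1694927403.481, 2555608.877));
        // Garbage collection count over the measured runs.
        System.out.printf("gc count ratio: %.1f%n", ratio(718.0, 10.0));
    }
}
```

In this run the zero-copy splitter is roughly 15 times faster and allocates several hundred times less memory per operation. The figures apply to this benchmark setup only.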