Node.js Streams for Processing Large Files
Process large files efficiently in Node.js using readable, writable, and transform streams to avoid memory issues and handle data chunk by chunk.
TL;DR
Use pipeline() from stream/promises to chain readable, transform, and writable streams together for processing large files. Memory usage stays constant regardless of file size, and error handling is built in.
The Problem
You need to process a 5 GB CSV file — parse rows, transform data, write output. The naive approach crashes:
import { readFileSync } from 'fs';
// This loads 5 GB into memory and crashes with heap overflow
const data = readFileSync('huge-file.csv', 'utf-8');
const rows = data.split('\n').map(parseRow);
Even fs.readFile (the async version) has the same problem: it buffers the entire file in memory before returning. For files larger than available RAM, this is a dead end.
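The promise-based API hits the same wall; a minimal sketch (same hypothetical huge-file.csv) makes the point:
import { readFile } from 'fs/promises';
// Still buffers the entire file before the promise resolves,
// so a 5 GB input exhausts the heap just the same.
const data = await readFile('huge-file.csv', 'utf-8');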
The Solution
Streams process data in chunks. The file is read piece by piece, each chunk is transformed, and results are written incrementally. Peak memory usage stays at a few megabytes regardless of file size.
Use pipeline() for clean stream chaining:
import { createReadStream, createWriteStream } from 'fs';
import { pipeline } from 'stream/promises';
import { Transform } from 'stream';
// Custom transform stream that converts CSV rows to JSON lines.
// Chunks can end in the middle of a row, so keep the trailing partial line
// and prepend it to the next chunk.
let leftover = '';

const csvTransform = new Transform({
  objectMode: true,
  transform(chunk: Buffer, encoding, callback) {
    const lines = (leftover + chunk.toString()).split('\n');
    leftover = lines.pop() ?? '';
    for (const line of lines) {
      if (line.trim()) {
        const [name, email, role] = line.split(',');
        this.push(JSON.stringify({ name, email, role: role?.trim() }) + '\n');
      }
    }
    callback();
  },
  flush(callback) {
    // Emit the final partial line once the source is exhausted
    if (leftover.trim()) {
      const [name, email, role] = leftover.split(',');
      this.push(JSON.stringify({ name, email, role: role?.trim() }) + '\n');
    }
    callback();
  },
});
// Pipeline handles errors and cleanup automatically
await pipeline(
createReadStream('users.csv'),
csvTransform,
createWriteStream('users.jsonl')
);
console.log('Processing complete');
Backpressure handling is a key part of what pipeline() manages for you. If the writable stream is slower than the readable stream (e.g., writing to a slow disk while reading from a fast SSD), pipeline() automatically pauses the readable stream until the writable stream catches up. Without backpressure handling, memory would grow unbounded as unwritten chunks accumulate in buffers.
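To see what that means concretely, here is a rough manual sketch of the loop pipeline() spares you from writing (file names are illustrative): write() returns false when the destination's internal buffer is full, and reading must pause until 'drain' fires.
import { createReadStream, createWriteStream } from 'fs';
import { once } from 'events';

const source = createReadStream('users.csv');
const destination = createWriteStream('users-copy.csv');

for await (const chunk of source) {
  // A false return value means the destination buffer is full
  if (!destination.write(chunk)) {
    await once(destination, 'drain');
  }
}
destination.end();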
Composing multiple transforms:
import { createGzip } from 'zlib';
// Read CSV -> transform to JSON -> compress -> write
await pipeline(
createReadStream('data.csv'),
csvTransform,
createGzip(),
createWriteStream('data.jsonl.gz')
);
Each stream in the pipeline processes its chunk and passes the result to the next stream. The data flows through the chain without any single step holding the entire file in memory.
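The same composition works in reverse; as a quick sketch (reusing the imports from the earlier snippets), the compressed output can be streamed back through createGunzip() without ever holding it in memory:
import { createGunzip } from 'zlib';
// Decompress the JSON Lines output, still chunk by chunk
await pipeline(
  createReadStream('data.jsonl.gz'),
  createGunzip(),
  createWriteStream('data.jsonl')
);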
Why This Works
Streams leverage Node.js's event loop to process data incrementally. pipeline() connects streams and manages the lifecycle: it propagates errors from any stream in the chain, destroys all streams on failure (preventing resource leaks), and handles backpressure automatically. Compared to manual .pipe() calls, pipeline() is both safer and more concise — .pipe() does not propagate errors or clean up on failure, which leads to subtle resource leaks in production.
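Because pipeline() rejects as soon as any stream in the chain fails, a single try/catch covers the whole run; a brief sketch reusing the streams from above:
try {
  await pipeline(
    createReadStream('users.csv'),
    csvTransform,
    createWriteStream('users.jsonl')
  );
} catch (err) {
  // A missing input file, a transform error, or a full disk all land here,
  // and pipeline() has already destroyed every stream in the chain.
  console.error('Processing failed:', err);
}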
FAQ
Why should I use streams instead of readFile?
readFile loads the entire file into memory, which crashes on large files. Streams process data in small chunks, keeping memory usage constant regardless of file size.
What are the four types of Node.js streams?
Readable (data source), Writable (data destination), Transform (modify data in transit), and Duplex (both readable and writable, like network sockets).
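A quick illustration of each type (stdin, stdout, and a TCP socket are just stand-ins for whatever sources and sinks you actually have):
import { Transform } from 'stream';
import { connect } from 'net';

const readable = process.stdin;             // Readable: data source
const writable = process.stdout;            // Writable: data destination
const upper = new Transform({               // Transform: modifies data in transit
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  },
});
const socket = connect(443, 'example.com'); // Duplex: readable and writable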
How do I chain multiple stream operations?
Use the pipeline function from stream/promises to chain streams together with automatic error handling and cleanup when any stream in the chain fails.