“I’m handed Word CVs, I have zero API access, zero root privileges, yet I still need the raw content — yesterday.”
Sound familiar? Below is the exact route I took, using nothing but n8n’s built-in nodes and three tiny JavaScript snippets.
A .docx is just a renamed ZIP archive. The text lives in word/document.xml.
Download → Rename_to_zip → Decompress → Pick_document_xml → Extract XML → Scrape Text
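A quick way to convince yourself of the ZIP claim is to look at the first four bytes of the file. The sketch below assumes you already have the file contents as a Node.js Buffer; `looksLikeZip` is a hypothetical helper name, not something n8n ships.

```javascript
// Hypothetical helper: true when the buffer starts with the ZIP local-file-header
// signature "PK\x03\x04", which a .docx saved by Word normally begins with.
function looksLikeZip(buf) {
  return (
    buf.length >= 4 &&
    buf[0] === 0x50 && // 'P'
    buf[1] === 0x4b && // 'K'
    buf[2] === 0x03 &&
    buf[3] === 0x04
  );
}

// Usage with made-up bytes (not a real CV)
const fakeDocx = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);
const plainText = Buffer.from('Dear hiring manager', 'utf8');
console.log(looksLikeZip(fakeDocx));
console.log(looksLikeZip(plainText));
```

If the check fails on a file that claims to be a .docx, it is probably an old binary .doc or something else entirely, and the rest of the pipeline will not help.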
Plain SFTP Download or HTTP GET node; you get a binary under binary.data.
Relabel the .docx as a .zip
n8n’s Compression node only opens what it recognises as a ZIP. Change the metadata first.
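// Rename_to_zip (Run Once for All Items, because the code reads the items array)
// Relabel every incoming binary so the Compression node accepts it as a ZIP.
for (const item of items) {
  const bin = item.binary.data;
  bin.fileName = (bin.fileName || 'document.docx').replace(/\.docx$/i, '.zip');
  bin.fileExtension = 'zip';
  bin.mimeType = 'application/zip';
}
return items;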
Compression → Decompress, with the input binary field set to data.
Keep document.xml and toss the rest
// Pick_document_xml (Run Once for All Items)
const result = [];
for (const file of Object.values(items[0].binary)) {
  if (file.fileName === 'document.xml') {
    result.push({ binary: { data: file } });
    break; // found it, stop looping
  }
}
return result; // 1-item array (or [] if not found)
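If document.xml is missing, the snippet above returns an empty array and the workflow simply goes quiet. A stricter variant, sketched here as a plain function so it can be tried outside n8n, throws instead. `pickDocumentXml` is a hypothetical name, and the input shape (an object of binary entries that each carry a fileName) is an assumption based on how the snippet above reads the Decompress output.

```javascript
// Hypothetical helper: return the binary entry named document.xml,
// or throw an error that lists what was actually in the archive.
function pickDocumentXml(binary) {
  const files = Object.values(binary || {});
  const hit = files.find((file) => file && file.fileName === 'document.xml');
  if (!hit) {
    const seen = files.map((file) => file && file.fileName).join(', ') || 'nothing';
    throw new Error(`document.xml not found in archive (saw: ${seen})`);
  }
  return hit;
}

// Usage with a made-up Decompress result
const sampleBinary = {
  file_0: { fileName: '[Content_Types].xml' },
  file_1: { fileName: 'document.xml' },
  file_2: { fileName: 'styles.xml' },
};
console.log(pickDocumentXml(sampleBinary).fileName);
```

Inside the Code node this would likely become `return [{ binary: { data: pickDocumentXml(items[0].binary) } }];`, so a malformed upload stops the run with a readable message rather than producing zero items.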
Extract From File → XML, with the input binary field set to data.
// Scrape_Text (Run Once for All Items, because the code reads the items array)
const xml = items[0].json.data.toString('utf8');
// 1. Grab every <w:p> block (paragraph).
//    The optional (?:\s[^>]*)? group stops look-alikes such as <w:pPr> from matching.
const paraRegex = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
// 2. Inside that, grab every <w:t> (text run) without also catching <w:tab/>, <w:tbl> or <w:tc>.
const wTRegex = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
const paragraphs = [];
let pMatch;
while ((pMatch = paraRegex.exec(xml))) {
  const inner = pMatch[1];
  const parts = [];
  let tMatch;
  while ((tMatch = wTRegex.exec(inner))) {
    parts.push(tMatch[1]);
  }
  // Join the runs, collapse whitespace, and skip empty paragraphs
  const txt = parts.join('').replace(/\s+/g, ' ').trim();
  if (txt) paragraphs.push(txt);
}
return [{
  json: {
    paragraphs, // array you can loop over
    text: paragraphs.join('\n\n') // single block for GPT, DB, whatever
  }
}];
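One thing the regex approach leaves untouched is XML escaping: an ampersand in a CV arrives as `&amp;` and a less-than sign as `&lt;`. Below is a minimal sketch of a decoder you could run over each paragraph before returning it. `decodeXmlEntities` is a hypothetical helper, and it only covers the five predefined XML entities plus numeric character references.

```javascript
// The five entities that XML predefines
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Hypothetical helper: decode XML entities in a single pass,
// so "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeXmlEntities(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as they are rather than throwing
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are passed through unchanged
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Usage with a made-up paragraph
console.log(decodeXmlEntities('R&amp;D lead &lt;2019&gt;'));
```

In the scraping snippet, the natural place to apply it would be on the joined paragraph text, just before the whitespace is collapsed.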
{
"paragraphs": [
"Project management / Modelling / Package approach (UTT)",
"Project management (V-cycle)",
"…"
],
"text": "Project management / Modelling / Package approach (UTT)\n\nProject management (V-cycle)\n\n…"
}
“Can you really strip a Word file down to raw text with no extra libraries?”
Absolutely. Three micro-scripts, six vanilla nodes, done. No more “Could you just pull the text real quick?” nightmares — mission accomplished.