Extracting Clean Text from a .docx in n8n (Without Any API)

Date: 12 June 2025

“I’m handed Word CVs, I have zero API access, zero root privileges, yet I still need the raw content — yesterday.”

Sound familiar? Below is the exact route I took, using nothing but n8n’s built-in nodes and three tiny JavaScript snippets.

Why even bother?

A .docx is just a renamed ZIP archive.
The human-readable part hides in word/document.xml.
My n8n instance is locked down; I can’t install anything or run shell commands.
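
A quick way to convince yourself of the first point is to look at the first four bytes of the file: a ZIP archive normally opens with the local-file-header signature PK\x03\x04. The sketch below is plain Node.js rather than an n8n node; looksLikeZip is a hypothetical helper name, and both sample buffers are made up for illustration.

```javascript
// Hypothetical helper: true when the buffer starts with the ZIP
// local-file-header signature "PK\x03\x04" (bytes 50 4B 03 04).
function looksLikeZip(buffer) {
  const signature = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(signature);
}

// Made-up sample data: a fake archive header and a plain-text file.
const fakeDocx = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);
const plainTxt = Buffer.from('Hello, world', 'utf8');

console.log(looksLikeZip(fakeDocx)); // true
console.log(looksLikeZip(plainTxt)); // false
```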

The workflow at a glance

Download → Rename_to_zip → Decompress → Pick_document_xml → Extract XML → Scrape_Text

1 - Grab the file

Use a plain SFTP download or an HTTP GET node; either way the file arrives as binary data under binary.data.

2 - Masquerade the `.docx` as a `.zip`

n8n’s Compression node only opens what it recognises as a ZIP. Change the metadata first.

// Rename_to_zip  (Run Once for All Items)
// Relabel every incoming .docx so the Compression node treats it as a ZIP.
for (const item of items) {
  const bin = item.binary.data;

  bin.fileName      = bin.fileName.replace(/\.docx$/i, '.zip');
  bin.fileExtension = 'zip';
  bin.mimeType      = 'application/zip';
}

return items;

3 - Unzip the archive

Compression → Decompress

Input Binary Field(s): data

4 - Pick `document.xml` and toss the rest

// Pick_document_xml  (Run Once for All Items)
const result = [];

for (const item of items) {
  // Each extracted file sits under its own binary key; match on the name
  const file = Object.values(item.binary ?? {})
    .find(f => f.fileName === 'document.xml');
  if (file) result.push({ json: {}, binary: { data: file } });
}

return result;        // one item per .docx (or [] if none matched)
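
To try the selection logic outside n8n, it can help to mock what the Decompress step hands over. The sketch below assumes each extracted file lands under its own binary key (file_0, file_1, ... is an assumption here; check your own output) and that fileName holds the bare name without the word/ folder. pickByName is a hypothetical helper, and the mock entries are invented.

```javascript
// Hypothetical helper: return the first binary entry whose fileName matches,
// or null when the archive did not contain such a file.
function pickByName(binary, wantedName) {
  const hit = Object.values(binary ?? {}).find(f => f.fileName === wantedName);
  return hit ?? null;
}

// Invented mock of a decompressed .docx (key names are an assumption).
const mockBinary = {
  file_0: { fileName: '[Content_Types].xml', mimeType: 'application/xml' },
  file_1: { fileName: 'document.xml',        mimeType: 'application/xml' },
  file_2: { fileName: 'styles.xml',          mimeType: 'application/xml' },
};

console.log(pickByName(mockBinary, 'document.xml').mimeType); // application/xml
console.log(pickByName(mockBinary, 'missing.xml'));           // null
```

Returning null instead of silently skipping makes it easy to route "not a real .docx" cases to an error branch.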

5 - Turn XML into JSON

Extract From File

File Format: XML
Binary Property: data

6 - Harvest the paragraphs, clean them up

// Scrape_Text  (Run Once for All Items)

// 1. Grab every <w:p> block (paragraph). Requiring whitespace or ">" right
//    after the name keeps <w:pPr>, <w:pStyle>, ... from matching.
const paraRegex = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
// 2. Inside that, grab every <w:t> (text run), but not <w:tab/> or <w:tbl>
const wTRegex   = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;

return items.map(item => {
  const xml = String(item.json.data);   // XML text from the previous node
  const paragraphs = [];

  for (const pMatch of xml.matchAll(paraRegex)) {
    // Join the runs of one paragraph, then collapse whitespace
    const parts = [...pMatch[1].matchAll(wTRegex)].map(m => m[1]);
    const txt   = parts.join('').replace(/\s+/g, ' ').trim();
    if (txt) paragraphs.push(txt);
  }

  return {
    json: {
      paragraphs,                    // array you can loop over
      text: paragraphs.join('\n\n')  // single block for GPT, DB, whatever
    }
  };
});

Typical output

{
  "paragraphs": [
    "Project management / Modelling / Package approach (UTT)",
    "Project management (V-cycle)",
    "…"
  ],
  "text": "Project management / Modelling / Package approach (UTT)\n\nProject management (V-cycle)\n\n…"
}
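
One thing the regex approach leaves untouched is XML escaping: because document.xml is XML, an ampersand in the CV is stored as &amp; and shows up that way in paragraphs. Below is a minimal sketch of a decoder for the five predefined XML entities plus numeric character references. decodeXmlEntities is a hypothetical helper, not something n8n provides; the idea would be to call it on each txt before pushing it.

```javascript
// Hypothetical helper: decode the five predefined XML entities and
// numeric character references (&#233; or &#xE9;) in a single pass.
function decodeXmlEntities(text) {
  const named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (whole, ref) => {
    if (ref.startsWith('#x')) return String.fromCodePoint(parseInt(ref.slice(2), 16));
    if (ref.startsWith('#'))  return String.fromCodePoint(parseInt(ref.slice(1), 10));
    // Unknown entity names are left exactly as they were
    return Object.prototype.hasOwnProperty.call(named, ref) ? named[ref] : whole;
  });
}

console.log(decodeXmlEntities('R&amp;D / C&#233;cile &lt;lead&gt;'));
// R&D / Cécile <lead>
```

Doing everything in one replace pass avoids double decoding: &amp;lt; correctly becomes the literal text &lt; rather than <.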