Extracting Clean Text from a .docx in n8n (Without any API)

Date: 12 June 2025

“I’m handed Word CVs, I have zero API access, zero root privileges, yet I still need the raw content — yesterday.”

Sounds familiar? Below is the exact route I took with nothing but n8n’s built-in nodes and three tiny JavaScript snippets.

Why even bother?

  • A .docx is just a renamed ZIP archive.
  • The human-readable part hides in word/document.xml.
  • My n8n instance is locked down; I can’t install anything or run shell commands.
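
The first point is easy to check yourself: a .docx, like any ordinary ZIP archive, begins with the bytes 0x50 0x4B 0x03 0x04 ("PK" followed by 3 and 4). Here is a minimal sketch, assuming the file's bytes are available as a Node.js Buffer. The helper name looksLikeZip and the sample buffers are made up for illustration and are not part of n8n.

```javascript
// Hypothetical helper: returns true when the buffer starts with the
// ZIP local-file-header signature "PK\x03\x04".
function looksLikeZip(buf) {
  const signature = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
  return buf.length >= 4 && buf.subarray(0, 4).equals(signature);
}

// Stand-in bytes for illustration; a real .docx would come from the download step.
const fakeDocx = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);
const plainText = Buffer.from('Hello, world', 'utf8');

console.log(looksLikeZip(fakeDocx));  // true
console.log(looksLikeZip(plainText)); // false
```

If the check fails, the file is probably not a .docx at all (an old binary .doc, for example), and the rest of the workflow will not help.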

The workflow at a glance

Download → Rename_to_zip → Decompress → Pick_document_xml → Extract XML → Scrape_Text

1 - Grab the file

Use a plain SFTP download or an HTTP GET node; either way, the file arrives as binary data under binary.data.

2 - Masquerade the .docx as a .zip

n8n’s Compression node only opens what it recognises as a ZIP. Change the metadata first.

// Rename_to_zip  (Run Once for All Items)
// Relabel every incoming .docx so the Compression node accepts it as a ZIP.
for (const item of items) {
  const bin = item.binary.data;

  bin.fileName      = bin.fileName.replace(/\.docx$/i, '.zip');
  bin.fileExtension = 'zip';
  bin.mimeType      = 'application/zip';
}

return items;
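
If the workflow might also receive files that are not Word documents, it can help to check the name before relabelling anything. The sketch below does this as a plain function so it can be tried outside n8n. The name asZipMeta and the sample object are hypothetical, and the object only imitates the metadata fields used above.

```javascript
// Hypothetical helper: returns a copy of an n8n-style binary descriptor
// relabelled as a ZIP, or null when the file name does not end in .docx.
function asZipMeta(bin) {
  if (!bin || typeof bin.fileName !== 'string' || !/\.docx$/i.test(bin.fileName)) {
    return null;
  }
  return {
    ...bin,
    fileName: bin.fileName.replace(/\.docx$/i, '.zip'),
    fileExtension: 'zip',
    mimeType: 'application/zip',
  };
}

// Illustrative input only; the data field is a placeholder.
const sample = {
  fileName: 'cv_jane.DOCX',
  fileExtension: 'docx',
  mimeType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  data: '<base64 payload>',
};

console.log(asZipMeta(sample).fileName);           // cv_jane.zip
console.log(asZipMeta({ fileName: 'notes.pdf' })); // null
```

Returning a copy instead of mutating the input makes it easy to skip or log the items that come back as null.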

3 - Unzip the archive

Compression → Decompress

  • Input Binary Field(s): data

4 - Pick document.xml and toss the rest

// Pick_document_xml  (Run Once for All Items)
const result = [];

for (const file of Object.values(items[0].binary)) {
  if (file.fileName === 'document.xml') {
    result.push({ binary: { data: file } });
    break;            // found it, stop looping
  }
}

return result;        // 1-item array (or [] if not found)
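
To try the selection logic outside n8n, the same lookup can be written as a small function and run against a mock object. This is only a sketch: the key names (file_0, file_1, ...) and the mock contents are assumptions for illustration, and findDocumentXml is a hypothetical helper. It also accepts the full archive path in case the file name keeps its folder.

```javascript
// Hypothetical helper: finds the main document part in an n8n-style
// binary object and returns [key, file], or null when it is missing.
function findDocumentXml(binary) {
  for (const [key, file] of Object.entries(binary)) {
    const name = file.fileName || '';
    // Accept either the bare name or the full path inside the archive.
    if (name === 'document.xml' || name === 'word/document.xml') {
      return [key, file];
    }
  }
  return null;
}

// Mock of what a decompressed .docx might look like (illustrative only).
const mockBinary = {
  file_0: { fileName: '[Content_Types].xml' },
  file_1: { fileName: 'styles.xml' },
  file_2: { fileName: 'document.xml' },
};

const hit = findDocumentXml(mockBinary);
console.log(hit ? hit[0] : 'not found');                              // file_2
console.log(findDocumentXml({ file_0: { fileName: 'styles.xml' } })); // null
```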

5 - Turn XML into JSON

Extract From File

  • File Format: XML
  • Binary Property: data

The next step reads the XML from json.data and treats it as plain text.

6 - Harvest the paragraphs and clean them up

// Scrape_Text  (Run Once for All Items)
const xml = items[0].json.data.toString('utf8');

// 1. Grab every <w:p> block (paragraph). The (?:\s[^>]*)? part allows
//    attributes but keeps look-alike tags such as <w:pPr> from matching.
const paraRegex = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
// 2. Inside that, grab every <w:t> (text run), but not <w:tab/> or <w:tbl>
const wTRegex   = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;

const paragraphs = [];
let pMatch;
while ((pMatch = paraRegex.exec(xml))) {
  const inner  = pMatch[1];
  const parts  = [];
  let tMatch;
  while ((tMatch = wTRegex.exec(inner))) {
    parts.push(tMatch[1]);
  }
  // Join the runs, collapse whitespace, and skip empty paragraphs
  const txt = parts.join('').replace(/\s+/g, ' ').trim();
  if (txt) paragraphs.push(txt);
}

return [{
  json: {
    paragraphs,                     // array you can loop over
    text: paragraphs.join('\n\n')   // single block for GPT, DB, whatever
  }
}];
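
One thing the regexes do not handle: the text comes back exactly as stored in the XML, so an ampersand in a CV still reads &amp; and accented characters may appear as numeric references. A small decoder can be applied to each paragraph before it is pushed. This is a sketch, not part of the original workflow, and decodeXmlEntities is a hypothetical helper name.

```javascript
// Hypothetical helper: decodes the five predefined XML entities plus
// numeric character references (decimal and hexadecimal) in one pass.
function decodeXmlEntities(text) {
  const named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] !== '#') {
      // Unknown named entities are left untouched.
      return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : match;
    }
    const codePoint = body[1] === 'x'
      ? parseInt(body.slice(2), 16)
      : parseInt(body.slice(1), 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeXmlEntities('R&amp;D &lt;b&gt; caf&#233; &#x41;'));
// R&D <b> café A
```

Because everything is replaced in a single pass, a literal "&amp;lt;" in the source decodes to "&lt;" and is not decoded a second time.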

Typical output

{
  "paragraphs": [
    "Project management / Modelling / Package approach (UTT)",
    "Project management (V-cycle)",
    "…"
  ],
  "text": "Project management / Modelling / Package approach (UTT)\n\nProject management (V-cycle)\n\n…"
}

Why this works and keeps working

  • Zero external dependencies — perfect when Ops locks everything down.
  • Copy-paste friendly — drop the three code nodes into any workflow.
  • Clean, compact text ready for GPT, Elasticsearch, SQL, you name it.

Final thoughts

“Can you really strip a Word file down to raw text with no extra libraries?”

Absolutely. Three micro-scripts, six vanilla nodes, done. No more “Could you just pull the text real quick?” nightmares — mission accomplished.
