Doxia

Documentation that earns a second life.

In development · Commercial

A five-stage pipeline that takes the documents you already have, cleans and classifies them, and ships them where they need to go. Run it forward and the knowledge base gets sane. Run it in reverse and the same clean output is a corpus primed for retrieval, ingestion, or training. Proprietary, and built to run on-prem where the documents cannot leave the building.

01 / The pipeline

How does sprawl become a clean corpus?

Five stages

01 Ingest
Pulls from local and network shares, cloud storage, APIs, Git, and databases. OCR brings scanned PDFs and images into the flow. Three gates run before anything else: a malware scan, prompt-injection detection, and content moderation.
02 Analyze
Extracts text and metadata, detects structure and language, expands archives within safety caps, normalizes everything to UTF-8, and deduplicates by content hash so the same document never gets processed twice.
03 Classify
Rule-based tagging plus a sensitivity scanner that flags PII, PHI, PCI, credentials, and network details. Optional AI classification (local Ollama or a cloud model) adds context-aware tags and document-type detection.
04 Convert
Pandoc and its toolchain turn a dozen input formats into clean Markdown, searchable PDF/A, DOCX, HTML, and more. The redaction module applies one of five strategies to anything the sensitivity scanner flagged.
05 Distribute
Delivers to Git (pull-request mode with human review), Wiki.js, Confluence, SharePoint, Paperless-ngx, file systems, or downstream automation like n8n and webhooks.
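
As an illustration of the Analyze stage's deduplication by content hash, here is a minimal Python sketch. It is not Doxia's implementation: the function names, the choice of SHA-256, and the normalization steps (Unicode NFC, unified line endings, trimmed whitespace) are all assumptions made for the example.

```python
import hashlib
import unicodedata


def content_hash(raw: bytes, encoding: str = "utf-8") -> str:
    """Hash normalized text so trivially different copies share one digest.

    Assumption: normalization means NFC, unified line endings, and
    trimmed leading/trailing whitespace. A real pipeline may do more.
    """
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(docs: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each kept path to the paths that duplicate its content."""
    first_seen: dict[str, str] = {}   # digest -> first path with that digest
    kept: dict[str, list[str]] = {}   # kept path -> duplicate paths
    for path in sorted(docs):         # sorted for a deterministic winner
        digest = content_hash(docs[path])
        if digest in first_seen:
            kept[first_seen[digest]].append(path)
        else:
            first_seen[digest] = path
            kept[path] = []
    return kept


if __name__ == "__main__":
    sample = {
        "a.txt": b"hello\r\nworld",
        "b.txt": b"hello\nworld\n",
        "c.txt": b"other",
    }
    print(deduplicate(sample))
```

Hashing after normalization, rather than hashing raw bytes, is a design choice in this sketch: it treats two copies that differ only in line endings or trailing whitespace as the same document.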

02 / Two directions

The same clean output, used two ways.

Forward flow

A knowledge base the team trusts.

Point it at the sprawl you actually have. It comes out clean, structured, deduplicated, and readable. The wiki stops being the place documentation goes to die.

Reverse flow

A corpus that is ready for AI.

That same clean output is primed for retrieval, ingestion, or fine-tuning. The months of data prep that usually kill private-LLM and on-prem AI projects are already done, because you did them to get a wiki you could read.

03 / Security and sensitivity

Nothing dangerous gets in. Nothing sensitive leaks out.

Built in, not bolted on

Bulk document processing is a soft target. The files come from everywhere, and the moment you feed them to a model you have inherited whatever was hiding in them. Doxia screens at the door: a malware scan, prompt-injection detection on the text, and content moderation, all before a document reaches the pipeline.

On the way out, the sensitivity scanner flags PII, PHI, PCI, credentials, and network details. The redaction module then applies one of five strategies (block, redact, mask, warn, or encrypt) and keeps both versions so the original is never lost. What crosses the wire is a decision, not an accident. See how we think about AI security.
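
To make the five strategies concrete, here is a minimal Python sketch of how a scanner and a strategy dispatcher could fit together. Everything in it is an assumption for illustration: the two regex patterns are far simpler than a real sensitivity scanner, the semantics given to each strategy are one plausible reading of the names, and the `encrypt` argument is a hypothetical caller-supplied function, since the sketch does not pick a cipher.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical patterns; a production scanner would cover far more.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}
# One combined regex; the named group tells us which kind matched.
SCANNER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in PATTERNS.items()))

STRATEGIES = {"block", "redact", "mask", "warn", "encrypt"}


@dataclass
class Result:
    original: str            # always retained alongside the output
    output: Optional[str]    # None means the document was blocked
    findings: list           # (kind, matched text) pairs


def apply_strategy(
    text: str,
    strategy: str,
    encrypt: Optional[Callable[[str], str]] = None,
) -> Result:
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "encrypt" and encrypt is None:
        raise ValueError("encrypt strategy needs an encrypt function")

    findings = [(m.lastgroup, m.group(0)) for m in SCANNER.finditer(text)]

    if strategy == "warn" or not findings:
        return Result(text, text, findings)   # pass through, report only
    if strategy == "block":
        return Result(text, None, findings)   # withhold the whole document

    def substitute(m: re.Match) -> str:
        value = m.group(0)
        if strategy == "redact":
            return f"[REDACTED:{m.lastgroup}]"
        if strategy == "mask":
            # Keep the last four characters, star out the rest.
            return "*" * max(len(value) - 4, 0) + value[-4:]
        return encrypt(value)                 # strategy == "encrypt"

    return Result(text, SCANNER.sub(substitute, text), findings)


if __name__ == "__main__":
    doc = "Contact ops@example.com from 10.0.0.12."
    for name in ("warn", "redact", "mask", "block"):
        print(name, "->", apply_strategy(doc, name).output)
```

The `Result` object carries the original text next to the transformed output, which mirrors the "keeps both versions" behavior described above; how and where the original is actually stored is outside the scope of the sketch.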

04 / On-prem by default

Runs inside the walls.

The whole point of cleaning your documents is undone if cleaning them means uploading them to someone else first. Doxia ships as a Docker Compose stack on Linux. Classification can run against a local model so no text leaves the network. The deployment that keeps your data on your hardware is the default, not an enterprise upsell.

This is why Doxia exists in the lineup. The data prep for private and on-prem AI is the part that stalls those projects for months. Doxia does that work where the data already lives.

05 / Coverage

What it reads, writes, and reaches.

Coverage tracks the formats and systems real document estates actually run on. If a team has it, Doxia is built to take it in and put it somewhere useful.

Sources
- Local and network shares (SMB/NFS)
- Cloud storage (S3/MinIO, Drive, OneDrive)
- SharePoint, Confluence, GitHub, GitLab
- Databases, plus scanned PDFs and images via OCR
Formats in
- doc, docx, odt, rtf
- pdf, html, epub
- txt, md, csv, json, xml
Formats out
- Markdown (GitHub Flavored)
- Searchable PDF/A
- docx, html, rtf, odt, csv
Targets
- Git (pull-request mode)
- Wiki.js, Confluence, SharePoint
- Paperless-ngx, file systems
- Webhooks, n8n, Slack

06 / Start

Point it at the knowledge base you already have.

Doxia is in active development and runs against real client knowledge bases today. If you have a document estate to clean, or a private-AI project blocked on data prep, tell us what it looks like.

Open a conversation

Hexaxia AI · v2 · 2026 · Doxia / Document transformation, on-prem · Built by operators