Clean messy unicode, zero-width chars, smart quotes, and emoji from text before LLM input via @mukundakatta/textsanity-mcp (npx)

Question

Agents that ingest text from the wild (web scrapes, PDF extracts, copy-paste from docs) routinely hit Unicode junk, much of it invisible: zero-width joiners, non-breaking spaces, smart curly quotes, stray control characters, emoji. These waste tokens, break string comparisons, and can smuggle invisible content past filters.

`@mukundakatta/textsanity-mcp` is a credential-free npx MCP server that cleans all of this in one tool call. Two tools: `sanitize` (configurable cleanup with presets) and `normalize_newlines` (CRLF/CR → LF). The `strict` preset enables everything: NFKC normalization, zero-width strip, control char strip, whitespace collapse, trim, smart-punctuation→ASCII, emoji strip, ASCII-only output.

Accepted Answer

## Recipe: sanitize messy text via @mukundakatta/textsanity-mcp

### Server
- **Package:** `@mukundakatta/textsanity-mcp` v0.1.1
- **Transport:** stdio
- **Launch:** `npx -y @mukundakatta/textsanity-mcp`
- **Auth:** none
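
Many MCP clients register a stdio server through a JSON config that names a command and its arguments. The sketch below builds such an entry in Python from the launch line above. The `mcpServers` key follows a convention used by several clients, and the entry name `textsanity` is a hypothetical label, so check your client's documentation for the exact schema.

```python
import json

# Launch command taken from the recipe above.
LAUNCH_COMMAND = "npx"
LAUNCH_ARGS = ["-y", "@mukundakatta/textsanity-mcp"]


def build_client_config(server_name: str = "textsanity") -> dict:
    """Return a config entry for a stdio MCP server.

    Assumption: the client reads an "mcpServers" mapping. That key and the
    "textsanity" label are illustrative, not defined by the package.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": LAUNCH_COMMAND,
                "args": list(LAUNCH_ARGS),
            }
        }
    }


config_json = json.dumps(build_client_config(), indent=2)
print(config_json)
```

No `env` block is included because the server needs no credentials.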

### Tools (2)

| Tool | Description |
|------|-------------|
| `sanitize` | Clean unicode/whitespace: NFKC, zero-width strip, control strip, collapse whitespace, trim. Optional: smart-punctuation→ASCII, emoji strip, ASCII-only. Presets: `default` (LLM-prep) or `strict` (everything on). Individual boolean flags override preset. |
| `normalize_newlines` | Collapse CRLF and CR to LF. Idempotent. |
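
Either tool is invoked with a JSON-RPC 2.0 `tools/call` request. The traces in this recipe abbreviate the envelope, so the sketch below adds the `jsonrpc` and `id` fields a full message carries and shows one way to unwrap the reply. It assumes the result's first content item is text holding a JSON object with a `clean` key, as the traces suggest. `build_tool_call`, `extract_clean`, and `sample_result` are hypothetical names, and `sample_result` is hand-written to mirror the trace shape rather than captured from the server. An MCP client library would normally handle the `initialize` handshake and framing for you.

```python
import json
from typing import Any


def build_tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request as one line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,  # illustrative id; any value unique per request works
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message, ensure_ascii=False)


def extract_clean(result: dict[str, Any]) -> str:
    """Pull the cleaned string out of a tool result.

    Assumption: content[0] is a text item whose text is JSON with a "clean" key.
    """
    payload = json.loads(result["content"][0]["text"])
    return payload["clean"]


request_line = build_tool_call(
    "sanitize", {"text": "Hello\u200b World", "preset": "strict"}
)

# Hand-built stand-in for a server reply, shaped like the traces in this recipe.
sample_result = {
    "content": [{"type": "text", "text": json.dumps({"clean": "Hello World"})}]
}
print(extract_clean(sample_result))
```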

### Verified trace

**sanitize** — dirty web-scraped text with zero-width chars, non-breaking spaces, smart quotes, and emoji:
```json
→ {"method":"tools/call","params":{"name":"sanitize","arguments":{"text":"Hello​ World    ‘quotes’ and “double” 😀  plus​some​zero​width","preset":"strict"}}}
← {"content":[{"type":"text","text":"{\"clean\": \"Hello World 'quotes' and \\\"double\\\" plussomezerowidth\"}"}]}
```

Result: zero-width chars stripped, non-breaking spaces → regular space, multiple spaces collapsed, smart quotes → ASCII, emoji removed.
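
To make those steps concrete, here is a rough local approximation of a strict-style cleanup using only the Python standard library. It is an illustrative sketch, not the package's implementation: the step order, the exact zero-width set, the punctuation map, and the choice to keep newlines and tabs are all assumptions. Emoji are removed as a side effect of the ASCII-only step, which drops every non-ASCII character.

```python
import re
import unicodedata

# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

# Assumed smart-punctuation map: curly quotes and en/em dashes to ASCII.
SMART_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def strict_clean(text: str) -> str:
    """Approximate a strict cleanup pass (hypothetical helper)."""
    # NFKC folds compatibility characters, e.g. non-breaking space to space.
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH.sub("", text)
    text = text.translate(SMART_PUNCT)
    # Drop control characters (category Cc) but keep newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # ASCII-only output: silently discards emoji and any other non-ASCII char.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces and tabs, then trim both ends.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


dirty = (
    "Hello\u200b World\u00a0\u00a0 \u2018quotes\u2019 and "
    "\u201cdouble\u201d \U0001F600  plus\u200bsome\u200bzero\u200bwidth"
)
print(strict_clean(dirty))
```

A sketch like this is handy for unit tests or offline checks; for anything beyond that, the server's own presets are the source of truth.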

**normalize_newlines** — mixed line endings:
```json
→ {"method":"tools/call","params":{"name":"normalize_newlines","arguments":{"text":"line one\r\nline two\rline three\nline four"}}}
← {"content":[{"type":"text","text":"{\"clean\": \"line one\\nline two\\nline three\\nline four\"}"}]}
```
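
The same transformation is easy to reproduce locally, which helps when checking the tool's output in tests. The function below is a minimal sketch of CRLF/CR to LF normalization, not the package's code, and the `mixed` sample string is made up for illustration.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so its CR is not turned into a second LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "line one\r\nline two\rline three\nline four"
normalized = normalize_newlines(mixed)
print(normalized.split("\n"))
```

Running the function on already-normalized text returns it unchanged, which is the idempotence the tool description refers to.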

### When to use
Run it before feeding web scrapes, PDF extracts, or user-pasted text into an LLM. It strips zero-width characters that can smuggle hidden content past filters, prevents string-comparison failures caused by smart quotes, and normalizes whitespace for more consistent tokenization.
