Clean messy unicode, zero-width chars, smart quotes, and emoji from text before LLM input via @mukundakatta/textsanity-mcp (npx)
Agents that ingest text from the wild (web scrapes, PDF extracts, copy-paste from docs) routinely hit invisible unicode junk: zero-width joiners, non-breaking spaces, smart curly quotes, stray control characters, emoji. These waste tokens, break string comparisons, and smuggle invisible content past filters.

@mukundakatta/textsanity-mcp is a credential-free npx MCP server that cleans all of this in one tool call. Two tools: sanitize (configurable cleanup with presets) and normalize_newlines (CRLF/CR → LF). The strict preset enables everything: NFKC normalization, zero-width strip, control char strip, whitespace collapse, trim, smart-punctuation→ASCII, emoji strip, ASCII-only output.
Recipe: sanitize messy text via @mukundakatta/textsanity-mcp
Server
- Package: @mukundakatta/textsanity-mcp v0.1.1
- Transport: stdio
- Launch: npx -y @mukundakatta/textsanity-mcp
- Auth: none
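To make the launch details concrete, here is a minimal Python sketch of a stdio client. It assumes the server follows the standard MCP stdio transport (newline-delimited JSON-RPC 2.0 with an initialize handshake before any tool call). The helper names `rpc`, `read_response`, and `call_tool`, the client name, and the protocol version string are illustrative assumptions, not part of the package.

```python
import json
import subprocess

# Launch command taken from the server details above.
LAUNCH = ["npx", "-y", "@mukundakatta/textsanity-mcp"]


def rpc(method, params=None, msg_id=None):
    """Serialize one JSON-RPC 2.0 message as a single line.

    A message without an id is a notification and gets no response.
    """
    msg = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:
        msg["id"] = msg_id
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"


def read_response(stream, msg_id):
    """Read lines until the response with the matching id arrives."""
    for line in stream:
        if not line.strip():
            continue
        msg = json.loads(line)
        if msg.get("id") == msg_id:
            if "error" in msg:
                raise RuntimeError(msg["error"])
            return msg["result"]
    raise RuntimeError("server closed stdout before responding")


def call_tool(name, arguments):
    """Spawn the server, perform the handshake, and call one tool."""
    proc = subprocess.Popen(
        LAUNCH, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        text=True, encoding="utf-8",
    )
    try:
        init = {
            "protocolVersion": "2024-11-05",  # assumed; use the version your client supports
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.0.0"},  # hypothetical
        }
        proc.stdin.write(rpc("initialize", init, msg_id=1))
        proc.stdin.flush()
        read_response(proc.stdout, 1)
        # The client confirms initialization before issuing requests.
        proc.stdin.write(rpc("notifications/initialized"))
        proc.stdin.write(rpc("tools/call", {"name": name, "arguments": arguments}, msg_id=2))
        proc.stdin.flush()
        return read_response(proc.stdout, 2)
    finally:
        proc.stdin.close()
        proc.terminate()
```

With Node.js and npx installed, `call_tool("sanitize", {"text": "...", "preset": "strict"})` should return a result object with a `content` list. On Windows the executable may need to be `npx.cmd`. An MCP client library would normally handle this plumbing, so treat the sketch as a way to see what goes over the wire.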
Tools (2)
| Tool | Description |
|---|---|
| sanitize | Clean unicode/whitespace: NFKC, zero-width strip, control strip, collapse whitespace, trim. Optional: smart-punctuation→ASCII, emoji strip, ASCII-only. Presets: default (LLM-prep) or strict (everything on). Individual boolean flags override preset. |
| normalize_newlines | Collapse CRLF and CR to LF. Idempotent. |
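The cleanup steps named in the table can be approximated locally with the Python standard library. This is a rough sketch for intuition only, not the server's implementation: the function names, the exact character sets, the order of the steps, and the choice to leave newlines alone when collapsing whitespace are all assumptions.

```python
import re
import unicodedata

# Assumed character sets; the server's actual lists may differ.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
SMART_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def sanitize_sketch(text, strict=False):
    """Approximate the default preset; strict=True adds the optional steps."""
    # NFKC folds compatibility forms, e.g. U+00A0 (no-break space) becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ZERO_WIDTH)
    # Drop control characters (category Cc) but keep newline, carriage return, and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\r\t" or unicodedata.category(ch) != "Cc"
    )
    if strict:
        text = text.translate(SMART_PUNCT)
        # ASCII-only output: this also removes emoji and any other non-ASCII character.
        text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces and tabs last, so gaps left by removed characters close up.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def normalize_newlines_sketch(text):
    """Convert CRLF first, then any remaining lone CR, to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Replacing CRLF before lone CR matters: doing it the other way round would turn each CRLF into two line feeds. Running `normalize_newlines_sketch` on its own output changes nothing, which is the idempotence the table mentions.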
Verified trace
sanitize — dirty web-scraped text with zero-width chars, non-breaking spaces, smart quotes, and emoji:
→ {"method":"tools/call","params":{"name":"sanitize","arguments":{"text":"Hello World ‘quotes’ and “double” 😀 plussomezerowidth","preset":"strict"}}}
← {"content":[{"type":"text","text":"{\"clean\": \"Hello World 'quotes' and \\\"double\\\" plussomezerowidth\"}"}]}

Result: zero-width chars stripped, non-breaking spaces → regular spaces, runs of spaces collapsed, smart quotes → ASCII, emoji removed.
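Note that the response in the trace above is JSON inside JSON: the `text` field of the content item is itself a JSON string holding a `clean` key. A small sketch of unwrapping it, where `extract_clean` is a hypothetical helper name and the response shape is assumed to match the trace:

```python
import json


def extract_clean(result):
    """Return the cleaned string from a tool result shaped like the trace."""
    for item in result.get("content", []):
        if item.get("type") == "text":
            # The text payload is a JSON document of its own, so decode it a second time.
            return json.loads(item["text"])["clean"]
    raise ValueError("no text content in tool result")


# The response from the trace, copied verbatim as a raw string.
raw = r'''{"content":[{"type":"text","text":"{\"clean\": \"Hello World 'quotes' and \\\"double\\\" plussomezerowidth\"}"}]}'''
print(extract_clean(json.loads(raw)))
# prints: Hello World 'quotes' and "double" plussomezerowidth
```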
normalize_newlines — mixed line endings:
→ {"method":"tools/call","params":{"name":"normalize_newlines","arguments":{"text":"line one\r\nline two\rline three\nline four"}}}
← {"content":[{"type":"text","text":"{\"clean\": \"line one\\nline two\\nline three\\nline four\"}"}]}

When to use
Use it before feeding web scrapes, PDF extracts, or user-pasted text into an LLM. It catches token-smuggling via zero-width chars, prevents string comparison breakage from smart quotes, and normalizes whitespace for consistent tokenization.
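To see why this matters, the following sketch shows how invisible characters and smart quotes break plain string comparison, and how to list what is hiding in a string. It uses only the Python standard library and does not call the server. `find_hidden` is a hypothetical helper, and limiting the check to the Unicode format (Cf) and control (Cc) categories is an assumption.

```python
import unicodedata

KEEP = {"\n", "\r", "\t"}  # ordinary whitespace controls that are usually intentional


def find_hidden(text):
    """List (index, code point, name) for format and control characters in text."""
    hits = []
    for i, ch in enumerate(text):
        if ch not in KEEP and unicodedata.category(ch) in ("Cf", "Cc"):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


scraped = "pass\u200bword"            # renders like "password"
print(scraped == "password")          # False
print(find_hidden(scraped))           # [(4, 'U+200B', 'ZERO WIDTH SPACE')]
print("\u201cyes\u201d" == '"yes"')   # False: curly quotes are different code points
```

A non-empty result from a check like this on incoming text is a reasonable signal that the text should go through sanitize before it is compared, filtered, or tokenized.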
{ "server": "@mukundakatta/textsanity-mcp", "version": "0.1.1", "transport": "stdio", "launch": "npx -y @mukundakatta/textsanity-mcp", "tools": ["sanitize", "normalize_newlines"], "trace": [ { "tool": "sanitize", "input": { "text": "Hello World ‘quotes’ and “double” 😀 plussomezerowidth", "preset": "strict" }, "output": { "clean": "Hello World 'quotes' and \"double\" plussomezerowidth" } }, { "tool": "normalize_newlines", "input": { "text": "line one\r\nline two\rline three\nline four" }, "output": { "clean": "line one\nline two\nline three\nline four" } } ] }