GLM-OCR Skill

When to Use

Use this skill when the task involves:

OCR for PDF / PPT / PPTX / images
Converting textbooks, lecture slides, or screenshots into Markdown
Verifying whether OCR output is actually complete
Investigating legacy mixed output, failed placeholders, or silent OCR corruption

Directory Layout

input/: source files waiting for OCR
output/: Markdown output, extracted images, and any _failed_segments/*.failed.json
_cache/ppt_pdf/: cached PDFs converted from PPT/PPTX
ocr.py: main OCR pipeline
verify_ocr.py: acceptance check
audit_ocr_integrity.py: deep integrity audit

Workflow

1. Put source files into input/.
2. Run python ocr.py.
3. Run both python verify_ocr.py and python audit_ocr_integrity.py.
4. Treat the batch as complete only when both checks are clean and the corresponding output directory contains no _failed_segments/*.failed.json.
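
The completion rule above can be sketched as a small wrapper. This is only an illustration, not part of the pipeline: the function names are hypothetical, and it assumes that verify_ocr.py and audit_ocr_integrity.py signal a clean result with exit code 0, which this document does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def find_failed_reports(output_dir: Path) -> list[Path]:
    """Return every *.failed.json that sits directly inside a _failed_segments/ folder."""
    return sorted(
        p for p in output_dir.rglob("*.failed.json")
        if p.parent.name == "_failed_segments"
    )


def batch_is_complete(verify_rc: int, audit_rc: int, failed: list[Path]) -> bool:
    """Complete only if both checks exited 0 (assumed to mean clean) and no reports remain."""
    return verify_rc == 0 and audit_rc == 0 and not failed


def run_batch(output_dir: Path = Path("output")) -> bool:
    """Run the three scripts in order, then apply the completion rule."""
    subprocess.run([sys.executable, "ocr.py"], check=False)
    verify = subprocess.run([sys.executable, "verify_ocr.py"], check=False)
    audit = subprocess.run([sys.executable, "audit_ocr_integrity.py"], check=False)
    return batch_is_complete(
        verify.returncode, audit.returncode, find_failed_reports(output_dir)
    )
```

Calling run_batch() from the project root returns True only when all three conditions hold; a single leftover failure report is enough to make it return False.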

Fallback Chain and Failures

The fallback chain is segment PDF upload -> per-page image OCR -> native PDF text fallback.
If a segment still fails after all fallbacks, the pipeline writes _failed_segments/*.failed.json.
Failed placeholders and failed-segment reports mean the OCR result is incomplete. Do not accept such output silently or move it downstream.
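
To make the ordering concrete, here is a minimal sketch of a fallback chain that writes a failure report when every route fails. It is not the code in ocr.py: the Strategy signature, the report fields, and the choice to treat an empty result as a failure are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Sequence

# Hypothetical signature: a strategy takes a segment name and returns Markdown text.
Strategy = Callable[[str], str]


def ocr_with_fallbacks(
    segment: str,
    strategies: Sequence[tuple[str, Strategy]],
    output_dir: Path,
) -> Optional[str]:
    """Try each strategy in order; write a failure report if none succeeds."""
    attempts = []
    for name, strategy in strategies:
        try:
            text = strategy(segment)
        except Exception as exc:  # any error moves on to the next fallback
            attempts.append({"strategy": name, "error": str(exc)})
            continue
        if text.strip():
            return text
        # Assumption: an empty result counts as a failed attempt.
        attempts.append({"strategy": name, "error": "empty result"})

    # All fallbacks exhausted: record the failure instead of passing silently.
    report_dir = output_dir / "_failed_segments"
    report_dir.mkdir(parents=True, exist_ok=True)
    report = {"segment": segment, "attempts": attempts}  # illustrative schema only
    (report_dir / f"{segment}.failed.json").write_text(
        json.dumps(report, indent=2), encoding="utf-8"
    )
    return None
```

The strategies would be passed in the order given above: segment PDF upload, then per-page image OCR, then native PDF text. Returning None rather than a placeholder string forces the caller to handle the incomplete case explicitly.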

Known Issues

1301 contentFilter error: split the segment first and rerun the blocked page on its own. Only if that page is still blocked, use a secondary OCR / vision path for that page alone.
Legacy segment_*.md files: do not delete them until their coverage and content have been compared against the ranged .md files.
Garbled file names from ZIP/RAR extraction: fix the source names first, then keep the names in input/, output/, and any downstream library in sync.
Empty native PDF text fallback: common on scanned books, which often have no text layer; be ready to switch to image OCR or another vision path.
If a page is filled through a non-GLM route, mark it as AI visual supplementation (non-GLM-OCR output) or equivalent; do not present it as output of the main OCR pipeline.
For the exact marker format and audit rules, follow OCR_AUDIT_POLICY.md.
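
The "split the segment first" step for a 1301 contentFilter block can be pictured as the page-range split below. This is a sketch under stated assumptions: the helper name is hypothetical, page ranges are taken to be inclusive, and the blocked page number is assumed to be already known (in practice it may have to be found by rerunning smaller pieces).

```python
def split_around_blocked(start: int, end: int, blocked: int) -> list[tuple[int, int]]:
    """Split the inclusive page range [start, end] so that `blocked` stands alone."""
    if not start <= blocked <= end:
        raise ValueError("blocked page is outside the segment")
    parts = []
    if blocked > start:
        parts.append((start, blocked - 1))  # pages before the blocked one
    parts.append((blocked, blocked))        # the blocked page by itself
    if blocked < end:
        parts.append((blocked + 1, end))    # pages after the blocked one
    return parts


print(split_around_blocked(11, 20, 14))  # [(11, 13), (14, 14), (15, 20)]
```

Only the single-page range would then go to a secondary OCR / vision path if it is still blocked; the surrounding ranges stay on the main pipeline.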

Notes

PPT/PPTX temporary PDFs belong in _cache/ppt_pdf/, not in output/.
verify_ocr.py is the minimum acceptance gate.
Run audit_ocr_integrity.py as well whenever you need confidence that there is no legacy mixed output or silent corruption.