last updated: 2026-03-13 08:11:43
GLM-OCR Skill
When to Use
Use this skill when the task involves:
- OCR for PDF / PPT / PPTX / images
- Converting textbooks, lecture slides, or screenshots into Markdown
- Verifying whether OCR output is actually complete
- Investigating legacy mixed output, failed placeholders, or silent OCR corruption
Project Layout
- input/: source files waiting for OCR
- output/: Markdown output, extracted images, and any _failed_segments/*.failed.json
- _cache/ppt_pdf/: cached PDFs converted from PPT/PPTX
- ocr.py: main OCR pipeline
- verify_ocr.py: acceptance check
- audit_ocr_integrity.py: deep integrity audit
Required Workflow
- Put source files into input/.
- Run python ocr.py.
- Run both python verify_ocr.py and python audit_ocr_integrity.py.
- Only treat the batch as complete when both checks are clean and the corresponding output directory has no _failed_segments/*.failed.json.
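The completion rule above can be sketched as a small gate. This is a minimal sketch, not part of the project: the helper names find_failed_reports, batch_is_complete, and run_checks are hypothetical, and it assumes the two check scripts signal a clean result with exit code 0, which this document does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def find_failed_reports(output_dir):
    """Return every _failed_segments/*.failed.json under output_dir, sorted."""
    root = Path(output_dir)
    return sorted(root.glob("**/_failed_segments/*.failed.json"))


def batch_is_complete(output_dir, check_exit_codes):
    """Apply the completion rule: all checks clean and no failed-segment reports.

    check_exit_codes holds the exit codes of the check scripts. Treating 0 as
    "clean" is an assumption, not documented behaviour of the scripts.
    """
    checks_clean = all(code == 0 for code in check_exit_codes)
    return checks_clean and not find_failed_reports(output_dir)


def run_checks(scripts=("verify_ocr.py", "audit_ocr_integrity.py")):
    """Run each check script with the current interpreter and collect exit codes."""
    return [subprocess.run([sys.executable, name]).returncode for name in scripts]
```

A typical call would be batch_is_complete("output", run_checks()), executed from the project root after python ocr.py has finished. Both conditions are checked together so that a clean check run cannot mask a leftover failure report.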
Failure Semantics
- The fallback chain is segment PDF upload -> per-page image OCR -> native PDF text fallback.
- If a segment still fails after all fallbacks, the pipeline writes _failed_segments/*.failed.json.
- Failed placeholders and failed segment reports mean the OCR result is incomplete. Do not silently pass or move such output downstream.
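The fallback behaviour described above can be illustrated with a generic "try each strategy in order" loop. This is a sketch under stated assumptions, not the code in ocr.py: the function name ocr_with_fallbacks, the strategy callables, and the JSON fields of the report are all hypothetical. Only the ordering of the chain and the _failed_segments/*.failed.json location come from this document.

```python
import json
from pathlib import Path


def ocr_with_fallbacks(segment_id, strategies, failed_dir):
    """Try each (name, callable) strategy in order; return the first non-empty text.

    If every strategy raises or returns empty text, write a report to
    failed_dir/<segment_id>.failed.json and return None, so callers cannot
    mistake the segment for a success.
    """
    attempts = []
    for name, strategy in strategies:
        try:
            text = strategy(segment_id)
        except Exception as exc:  # record the failure, then try the next fallback
            attempts.append({"strategy": name, "error": str(exc)})
            continue
        if text and text.strip():
            return text
        attempts.append({"strategy": name, "error": "empty result"})

    # All fallbacks exhausted: leave an explicit failure report (hypothetical schema).
    failed_dir = Path(failed_dir)
    failed_dir.mkdir(parents=True, exist_ok=True)
    report = failed_dir / f"{segment_id}.failed.json"
    payload = {"segment": segment_id, "attempts": attempts}
    report.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return None
```

Here strategies would be an ordered list such as segment PDF upload, per-page image OCR, and native PDF text extraction. Returning None rather than placeholder text keeps an incomplete segment from passing silently downstream.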
Common Problems and Recommended Handling
- 1301 contentFilter: split the segment first, rerun the blocked page separately, then use a secondary OCR / vision path only for the blocked page if needed.
- Legacy segment_*.md: do not delete until ranged .md coverage and content have been compared.
- Garbled file names from ZIP/RAR extraction: fix source names first, then keep input/, output/, and downstream library names in sync.
- Empty native PDF text fallback: common on scanned books; be ready to switch to image OCR or another vision path.
- If a page is filled through a non-GLM route, mark it as AI visual supplementation (non-GLM-OCR output) or equivalent instead of pretending it came from the main OCR pipeline.
- For the exact marker format and audit rules, follow OCR_AUDIT_POLICY.md.
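For the legacy segment_*.md case, a read-only inventory can help decide what still needs comparing before anything is deleted. The helper below is a hypothetical sketch, not one of the project scripts: it only lists matching files per directory and never removes or modifies them.

```python
from pathlib import Path


def find_legacy_segments(output_dir):
    """Map each directory under output_dir to the segment_*.md files it contains."""
    legacy = {}
    for path in sorted(Path(output_dir).rglob("segment_*.md")):
        legacy.setdefault(path.parent, []).append(path.name)
    return legacy
```

Calling find_legacy_segments("output") returns a dictionary keyed by directory, which can then be reviewed against the ranged .md files by hand or by audit_ocr_integrity.py.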
Notes
- PPT/PPTX temporary PDFs belong in _cache/ppt_pdf/, not in output/.
- verify_ocr.py is the minimum acceptance gate.
- audit_ocr_integrity.py should be used whenever you need confidence that there is no legacy mixed output or silent corruption.
