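AI Agent in Practice: Advanced

Image-Split + Image-PPT: Turning Image 2 Generated Images into Editable PPT

Turn Image 2 generated page images, slide images, and PDF page images into an editable PPT: first split out the visual assets and text regions, then rebuild editable text boxes, movable objects, and a preview you can accept against.

26 minutes · Published 2026-05-25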

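What you will learn

By the end of this lesson you will know why converting an Image 2 generated image to PPT cannot mean pasting the whole image onto a slide: you should first use image-split to break the flat image into movable visual assets, then use image-ppt to rebuild the text as editable text boxes. You will also see the two complete MD source files: one handles layer splitting, the other handles rebuilding the PPT. The point is not to memorize commands but to understand why this technical path is more reliable: visuals are handled as visuals, text as text, the sample is accepted first, and the full deck is batched afterwards.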

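What each of the two skills does

image-split takes a flat image apart, and is especially suited to slide images generated by Image 2, AI-generated page images, single-page images exported from PDF, and ordinary screenshots. It identifies the visual objects on the page (cards, icons, lines, photos, charts, badges, footers) and separately marks semantic text such as titles, body text, labels, and page numbers, which later steps rebuild. image-ppt puts the split visual objects and the corrected text back into a PPTX: visual objects become selectable, movable images or shapes, and text becomes text boxes that can be edited directly in PowerPoint.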

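Why not just paste the whole image

Dropping an Image 2 generated image or page image straight into a PPT is the fastest option, but what you get is one big picture: the title cannot be changed, icons cannot be moved, cards cannot be deleted, and font sizes and typos cannot be fixed. Worse, when you later want to switch brand colors across the deck, remove a module, or rewrite the copy on one page, your only option is to regenerate the image. The value of split-and-rebuild is turning "an image that looks like a PPT" into "a PPT you can actually edit". It will not automatically convert every complex chart into vectors, but it does bring the main titles, labels, body text, page numbers, cards, and icons into a maintainable state.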

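Who this is for

This is for people who have Image 2 generated pages, AI-generated slide images, PDF page images, competitor page images, old courseware images, or ordinary screenshots and want to turn them into an editable PPT. You might work in operations, training, sales, or research support, or you might simply want an Agent to redo a set of image-based courseware as a file whose text can be changed. If your goal is "it can still be edited later", this is the route to take.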

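Who this is not for

This is not for people who expect a one-click, 100% pixel-perfect reproduction while also requiring every complex chart to be editable like an Excel chart. Photos, complex illustrations, microscopy images, and dense charts are usually kept as movable image objects; what is truly editable is titles, body text, labels, page numbers, card text, and regular shapes. Using un-redacted contracts, medical records, student information, or internal company material as a sample is also not recommended.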

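Technical path overview

1. Pick one sample page first: do not start with the whole deck of dozens of pages. Use the single page that best represents the style to validate split and rebuild quality.
2. Split the visual layer: use image-split to distinguish background, cards, badges, icons, charts, photos, lines, and footers, and output transparent visual assets, position records, and preview images.
3. Prepare the text layer: use OCR only as draft evidence, then correct titles, labels, body text, and page numbers manually or with an Agent, so misrecognized characters are not written straight into the PPT.
4. Rebuild the PPT: use image-ppt to place graphics at the visual asset positions, then overlay transparent text boxes so the text can be selected and modified.
5. Run QA: open the preview images and check for residual old text, white patches, clipped icons, text overflow, and duplicated page numbers.
6. Batch only after the sample passes: running the full deck on a failed sample just copies the same problem across dozens of pages.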
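Step 3 treats OCR as draft evidence rather than final text. A minimal sketch of that idea, assuming each OCR engine returns a plain string for the same text region: a reading is accepted only when enough engines agree, and anything else is flagged `needs-human-review` instead of silently picking one engine. The function name, the engine keys, the `agreed` status, and the `min_agreement` parameter are hypothetical and are not part of image-split.

```python
from collections import Counter


def merge_ocr_candidates(candidates, min_agreement=2):
    """Merge per-engine OCR readings for one text region.

    `candidates` maps a (hypothetical) engine name to the string it read.
    """
    # Collapse whitespace so spacing differences do not count as conflicts.
    normalised = {engine: " ".join(text.split()) for engine, text in candidates.items()}
    counts = Counter(text for text in normalised.values() if text)
    if counts:
        best_text, votes = counts.most_common(1)[0]
        # A tie between different readings is a conflict, not a winner.
        tied = [text for text, n in counts.items() if n == votes]
        if votes >= min_agreement and len(tied) == 1:
            return {"text": best_text, "status": "agreed", "candidates": normalised}
    # Keep every candidate so a reviewer can compare them against the source.
    return {"text": None, "status": "needs-human-review", "candidates": normalised}


if __name__ == "__main__":
    print(merge_ocr_candidates({"engine_a": "Q3 revenue", "engine_b": "Q3 revenue"}))
    print(merge_ocr_candidates({"engine_a": "12 mg", "engine_b": "72 mg"}))
```

Even an `agreed` reading is still only a draft: titles, numbers, units, and page numbers should be checked against the source before they go into the PPT.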

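Drawn as a diagram, the route starts from a flat Image 2 generated image, passes through Image-Split to produce three kinds of intermediate output (movable visual assets, a region structure file, and OCR evidence), hands these to Image-PPT-King to rebuild the editable text layer and the PPTX, and finishes with preview, QA, and diff checks for acceptance.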

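Technical roadmap

[Figure: Flat slide image → Image Split → Atomic visual assets, region-schema.json, OCR racing evidence → Image-PPT-King → Editable text layer, Editable PPTX → Preview / QA / diff gates]

Starting from an Image 2 generated image, first split it into visual assets, a region structure, and OCR evidence, then hand these to Image-PPT to rebuild the editable text layer and the PPTX, and finally accept the result with preview and diff checks.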
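The diff gate at the end of the roadmap can be sketched in a few lines. The example below assumes pixels are already loaded as flat lists of `(r, g, b)` tuples and that a boolean text mask marks the zones where text was intentionally removed; in practice you would load these with an imaging library. The 25-level threshold and the 8% / 12% bounds follow the guide values quoted in the skill files, while reading "changed" as the maximum per-channel difference is an assumption, and both function names are hypothetical.

```python
def non_text_changed_ratio(source, render, text_mask, level_threshold=25):
    """Share of pixels outside text zones that differ by more than the threshold.

    source, render: equal-length sequences of (r, g, b) tuples.
    text_mask: sequence of bools, True where text was intentionally removed.
    """
    if not (len(source) == len(render) == len(text_mask)):
        raise ValueError("source, render, and text_mask must have the same length")
    considered = 0
    changed = 0
    for src, out, in_text_zone in zip(source, render, text_mask):
        if in_text_zone:
            continue  # intended text removal is not a visual defect
        considered += 1
        # Assumption: a pixel counts as changed if any channel moves past the threshold.
        if max(abs(a - b) for a, b in zip(src, out)) > level_threshold:
            changed += 1
    return changed / considered if considered else 0.0


def diff_gate(ratio, warn=0.08, fail=0.12):
    """Map a changed-pixel ratio onto pass / warn / fail."""
    if ratio > fail:
        return "fail"
    if ratio > warn:
        return "warn"
    return "pass"


if __name__ == "__main__":
    white = [(255, 255, 255)] * 10
    rendered = list(white)
    rendered[0] = (0, 0, 0)          # a real visual difference
    rendered[1] = (0, 0, 0)          # inside a text zone, so ignored
    mask = [False] * 10
    mask[1] = True
    ratio = non_text_changed_ratio(white, rendered, mask)
    print(round(ratio, 3), diff_gate(ratio))
```

The number is a guide, not a substitute for looking at the preview: a page can pass the ratio and still show a white patch or a clipped icon.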

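What to prepare before you start

Input images: Image 2 generated images, AI-generated slides, single-page PNGs exported from PDF, or screenshots, in the highest resolution available.
Goal statement: state whether you want an editable PPT or only an approximate visual preview.
Sample scope: specify 1 or 2 pages first; do not go straight to the full deck.
Brand requirements: whether fonts, colors, footer, page numbers, and logo need to be preserved.
Acceptance criteria: which text must be editable and which charts may remain image objects.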

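MD source card 1: Image-Split

The image-split MD source is reproduced below without deletions or changes. Read the explanation above first to understand the approach, then click "Copy" in the top-right corner to hand the whole source to your own Agent.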

---
name: image-split
description: Split generated slide/page images into editable-ready transparent visual element layers. Use when Codex needs to extract PPT/Image2/AI-generated slide visuals into PNG assets, separate text from visual components, preserve original slide layout, build a "visual elements" folder, create CopySlides-like region schemas, run OCR-racing evidence, rebuild slide visuals before adding editable text, or avoid OCR/OpenCV fragmentary split failures.
---

# Image Split

## Core Rule

For editable PPT reconstruction, split by atomic visual assets, not broad skeleton layers and not connected-component fragments.

- Use OCR only to locate text masks. Do not let OCR boxes define visual element boundaries.
- For difficult pages, run OCR racing as evidence for text/content/layout only; the production boundary is the region schema and visual anchors, not the first OCR result.
- Rebuild simple geometry such as cards, tabs, badges, arrows, rules, circles, and dashed boxes as clean drawn layers.
- Extract complex visuals such as icons, diagrams, photos, charts, logos, decorative line art, and illustrations from the source image.
- For circular icon UI, split the object into separate assets: deterministic circle/ring shape, inner glyph, and optional separator/accent. Do not crop the entire circular icon as one bitmap unless it is a photo/logo.
- Prefer one reusable visual object per asset: each badge, pill, icon, card outline, arrow, connector, divider, logo, footer, and illustration should be its own named asset when it is meant to be editable/movable/replaced.
- Output cropped transparent PNG assets plus placement metadata when building an editable deck; full-canvas PNG layers are acceptable only for preview compatibility or true background-wide elements.
- Remove semantic text from visual layers unless the user explicitly wants logo text, chart labels, or decorative background text preserved.
- Do not blur alpha for geometric/UI assets. Use antialiasing only from clean vector drawing or high-resolution downsampling.

## Routes

- `atomic-assets route`: production route for editable PPT. Outputs many named, cropped transparent assets with `position` metadata and optional full-canvas preview layers.
- `copyslides-like region route`: production route for difficult Image2 slide pages. First creates a semantic `region-schema.json`, optionally supported by OCR-racing evidence, then rebuilds regular UI geometry as clean drawn assets and extracts complex illustrations/photos as separate image objects. This route must not use a full-page textless/inpaint image as the production visual base.
- `visual-skeleton route`: quick preview route. Uses a few broad full-canvas layers to test layout rhythm, but is not accepted as final editable素材.

## Workflow

1. Inspect the source slide image before scripting. Identify:
   - slide chrome: header lines, footer bars, page number zones
   - geometric components: cards, labels, badges, arrows, dividers
   - extracted image components: icons, illustrations, chart/photo regions, background line art
   - text zones to rebuild later as editable text
2. Choose the route explicitly. Use `atomic-assets route` unless the user only asks for a rough visual skeleton.
3. For dense or previously failed pages, create a CopySlides-like region schema before extracting assets:
   - classify `title`, `body`, `flow_label`, `card`, `tab`, `badge`, `chart`, `microscopy`, `photo`, `image`, `illustration`, `caption`, `footer`, and `page_number` regions
   - record region boxes, anchors, z-order, colors, editability, native-shape requirement, OCR sources, confidence, and intended reconstruction method
   - use the schema as the contract between `image-split` and `image-ppt`
4. For high-risk text-heavy pages, run OCR racing and merge evidence:
   - capture Apple Vision / PaddleOCR / PP-Structure / MinerU results when locally available and approved for the document
   - write `ocr-candidates.json`, `ocr-merged.json`, and `ocr-review-report.md`
   - use agreement between engines for text content; use layout regions plus OCR boxes for rough masks only
   - mark low-confidence or conflicting text as `needs-human-review` instead of silently choosing one engine
5. Create an atomic component map:
   - separate every repeated UI unit that a user may want to move, replace, recolor, or delete
   - keep repeated groups consistently named, such as `card_tl_outline`, `card_tl_tab_bg`, `card_tl_icon`, `connector_tl_line`
   - split text away from visual shapes; label text belongs to `image-ppt`
6. Use full-canvas preview only as QA. Do not confuse a full-canvas preview layer with a production asset.
7. Use OCR text masks only as subtraction masks for extracted image regions. Tighten or override OCR masks when they hit icons, logos, or line art.
8. Draw simple PPT-like shapes instead of trying to crop them from the bitmap.
9. Extract complex regions with foreground, color, or line-art masks; apply morphology lightly so thin strokes and antialiasing survive.
10. For icon glyphs:
   - draw the circle/ring/pill base as vector-like geometry
   - crop only the internal glyph pixels with a tight box
   - use light masks for white glyphs on blue and color-distance masks for blue glyphs on pale backgrounds
11. Generate QA artifacts: `manifest.json`, `assets_contact_sheet.png`, `composite_no_text_preview.png`, `text_mask_for_reference.png`, and `region-schema.json` when using the CopySlides-like route.
12. Run textless-layer OCR or manual text-region review for high-risk pages. If title/body/card text remains in visual layers, fix the split before `image-ppt`.
13. Review the composite preview against the original and the contact sheet for asset quality. Fix missing geometry with drawn assets; fix missing source art with wider regions or a line-art mask.

For the stricter v2 workflow used on difficult editable-PPT pages, read `references/v2-atomic-workflow.md` before authoring the elements file.
For the CopySlides-like route validated on Image2 deck page 16, read `references/copyslides-like-region-workflow.md` before authoring the region schema or visual layers.
For OCR racing, region-schema fields, and merge rules, read `references/region-schema-ocr-racing.md` before accepting text-heavy pages.
For final or high-risk samples, read `references/acceptance-rubric.md` and report pass/warn/fail gates before handing visual layers to `image-ppt`.

## Bundled Script

For production editable-PPT samples, prefer an atomic elements file and cropped positioned assets:

```bash
python scripts/atomic_asset_split.py \
  --image /path/to/slide.png \
  --elements /path/to/elements.json \
  --out /path/to/output-folder
```

The script writes cropped transparent PNG assets, `manifest.json`, `assets_contact_sheet.png`, and `composite_no_text_preview.png`. The manifest is consumable by `image-ppt` because each cropped asset has `position` and `canvas` metadata. For cropped raster assets, use `mask: "nonwhite"` for art on pale backgrounds, `mask: "light"` for pale line art/icons on deep color backgrounds, and `mask: "color"` for single-color glyphs on pale backgrounds.

Use `scripts/component_layer_split.py` for repeatable component-layer extraction from a single source image:

```bash
python scripts/component_layer_split.py \
  --image /path/to/slide.png \
  --out /path/to/output-folder \
  --recipe /path/to/recipe.json \
  --ocr-json /path/to/ocr.json \
  --slide 3
```

The script reads a recipe, writes full-canvas transparent PNG layers, and generates contact-sheet/composite QA images. Read `references/recipe-schema.md` before authoring or patching a recipe.

## Recipe Pattern

Use drawn assets for stable geometry:

```json
{
  "name": "cards_tabs_badges_arrows",
  "type": "draw",
  "items": [
    {"shape": "rounded_rect", "box": [40, 376, 258, 706], "outline": "#0052AA", "width": 3, "radius": 15},
    {"shape": "rounded_rect", "box": [56, 337, 241, 388], "fill": "#0052AA", "radius": 10},
    {"shape": "circle", "center": [151, 300], "radius": 31, "fill": "#0052AA"},
    {"shape": "arrow", "from": [273, 489], "to": [313, 489], "fill": "#0052AA"}
  ]
}
```

Use extracted assets for complex source art:

```json
{
  "name": "card_icons_no_text",
  "type": "extract",
  "mask": "foreground",
  "regions": [[65, 405, 230, 535], [360, 420, 500, 535]],
  "subtract_text": true,
  "close": 3
}
```

## Tool Choices

- Use OpenCV/Pillow for deterministic masks, drawing, contact sheets, and previews.
- Use Apple Vision, PaddleOCR/PP-Structure, or MinerU for OCR/layout evidence. Do not let any one OCR engine define final visual assets by itself.
- Consider SAM/SAM2 for difficult object masks if region extraction is not enough.
- Use inpainting or LaMa only for small background repairs; do not rely on it to reconstruct covered visual components.
- For blue UI geometry, cards, rules, dashed connectors, and badges, prefer vector/native shape generation or clean drawn assets over bitmap extraction.
- Do not use a full-page inpainted/textless base as the production layer for pages with editable labels/cards/title text. It tends to preserve residual glyphs and creates false QA passes.

## Quality Bar

Accept a split only when:

- All layer PNGs have the exact original canvas size and transparency.
- In `atomic-assets route`, all production assets are atomic and named; broad layers are limited to background/footer/line-art/chart/photo regions.
- In `copyslides-like region route`, a `region-schema.json` exists and the production visual layers are traceable to semantic regions.
- The composite preview restores the visual layout without obvious missing cards, icons, lines, or badges.
- Semantic text is absent from visual layers, except intentional logo/decorative/chart text.
- Extracted components are grouped by design meaning, not by arbitrary pixel connectivity.
- UI asset edges are crisp: no Gaussian-blurred alpha, fuzzy rounded corners, rectangular source-crop halos, or background-colored residue.
- Any remaining limitations are documented in `manifest.json` or the final response.

## Acceptance Standards

Run these checks before declaring a split usable.

Use `references/acceptance-rubric.md` for the detailed gate model. The short standard below is only the minimum.

### Required Artifacts

Each processed slide must have:

- `manifest.json` with layer names, source/drawn status, and known limitations.
- For `atomic-assets route`: cropped transparent PNG assets with source-canvas `position` metadata, plus optional full-canvas preview layers.
- For `visual-skeleton route`: full-canvas transparent PNG layers, all matching the source image size.
- For `copyslides-like region route`: `region-schema.json`; for high-risk text-heavy slides, `ocr-candidates.json`, `ocr-merged.json`, and `ocr-review-report.md`.
- `composite_no_text_preview.png` showing all visual layers reassembled at `(0,0)`.
- `assets_contact_sheet.png` for quick layer review.
- `text_mask_for_reference.png` or an equivalent OCR/hand text-zone mask.
- For important slides, a source / composite / diff review image.

### Structural Checks

- Every layer is RGBA or otherwise has a real alpha channel.
- Atomic cropped assets must include precise `position` metadata: source x/y/w/h, source canvas size, and intended slide size if known.
- Full-canvas preview layers must use the original slide canvas size.
- A typical editable page may have 20-80 atomic assets. Do not collapse independent UI objects into one broad layer just to reduce layer count.
- Do not accept dozens of arbitrary connected-component fragments; fragments are only valid if each corresponds to an intentional visual object.
- Layer names must describe design meaning, such as `header_chrome`, `cards_tabs`, `diagram_icons`, `footer_lineart`; names like `component_42` are not sufficient for production.
- Blocking structural failure: an asset named like a group (`blue_filled_ui`, `source_visuals`, `all_icons`, `all_cards`) contains multiple unrelated movable objects in production output.

### Visual Checks

- Compare `composite_no_text_preview.png` with the source page while ignoring intended text-removal zones.
- Outside text zones, large missing cards, blue labels, number badges, icons, lines, footers, campus line art, and diagram/photo regions are blocking failures.
- Any obvious white patch, solid-color repair block, unexpected deep-blue block, broken circle, cropped icon, or missing footer/page chrome is a blocking failure.
- Any fuzzy corner, blurred alpha edge, background halo, rectangular crop boundary, or merged multi-object patch on UI assets is a blocking failure for `atomic-assets route`.
- Circular icons must pass a component split check: the circle/ring base is a drawn asset, the glyph is a tight transparent crop, and the contact sheet does not show a full circular bitmap with background residue.
- Review `assets_contact_sheet.png`: each tile should show one complete, sharp visual object or one documented background-wide element.
- For pixel checks, use thresholds as a guide, not a substitute for review: outside text zones, changed pixels above 25 RGB levels should usually stay below 8-12% for clean geometric pages. Complex chart/photo pages may exceed this but must be manually justified.

### Text Removal Checks

- OCR the composite preview when practical. Main semantic text should be gone from visual layers.
- For high-risk pages, `textless layer OCR` is a blocking gate: if the composite still reads as title/body/card text, the split fails even when non-text diff is good.
- Preserved text must be intentional and documented: logo text, chart/axis text, microscopy labels, or decorative background text.
- If OCR masks damage icons, logos, dashed rules, or fine line art, override them manually rather than accepting a damaged layer.

### Image2 Textless Skeleton Route

When the visual skeleton comes from Image2 or another generative edit instead of deterministic splitting:

- Treat it as a separate route, not as a normal split.
- Compare the skeleton against the source for layout anchors: title baseline, header line, footer band, card/frame boxes, icon centers, and chart/photo bounding boxes.
- If Image2 moved or invented components, do not reuse old text coordinates without remapping.
- A skeleton that is visually polished but changes layout must be marked `layout-remap-required` in `manifest.json` or the final response.

### Stop Conditions

Stop before full-deck work if a sample slide has:

- missing or broken primary visual components,
- visible repair patches,
- unacceptably changed layout anchors,
- semantic text still baked into major visual layers,
- or no reliable way to distinguish intended text removal from damaged visuals.

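How to tell whether a split passes

Every asset in the contact sheet should look like a meaningful object, not a random fragment.
In the text-removed preview, the main layout is still there, but semantic text such as titles, body text, and labels has been taken out.
Circular icons should not be cropped as one block with their background; ideally split them into a circle base and the inner glyph.
Regular shapes such as cards, buttons, dividers, and arrows should have clean edges, with no fringing or white patches.
Complex graphics, photos, and logos may remain image objects, but each must be movable and replaceable on its own.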
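Part of this checklist can be pre-screened before anyone opens the contact sheet. The sketch below assumes a manifest shaped like `{"assets": [{"name": ..., "position": {"x", "y", "w", "h"}}]}`; that layout is a guess for illustration, since the real `manifest.json` schema is not reproduced here. It flags names that carry no design meaning (such as `component_42`), the group-style names the skill file lists as blocking, and assets without position metadata.

```python
import re

# Names like component_42 say nothing about what the object is.
GENERIC_NAME = re.compile(r"^(component|layer|asset|fragment)_?\d+$")
# Group-style names called out as blocking failures in the skill file.
GROUP_NAMES = {"blue_filled_ui", "source_visuals", "all_icons", "all_cards"}
# Hypothetical position keys; adjust to the real manifest schema.
REQUIRED_POSITION_KEYS = {"x", "y", "w", "h"}


def check_manifest(manifest):
    """Return a list of human-readable problems found in an asset manifest."""
    problems = []
    for asset in manifest.get("assets", []):
        name = asset.get("name", "")
        label = name or "<unnamed>"
        if not name or GENERIC_NAME.match(name):
            problems.append(f"{label}: name does not describe design meaning")
        if name in GROUP_NAMES:
            problems.append(f"{label}: group-style name, split into atomic assets")
        missing = REQUIRED_POSITION_KEYS - set(asset.get("position") or {})
        if missing:
            problems.append(f"{label}: position is missing {sorted(missing)}")
    return problems


if __name__ == "__main__":
    sample = {
        "assets": [
            {"name": "card_tl_icon", "position": {"x": 65, "y": 405, "w": 165, "h": 130}},
            {"name": "component_42"},
            {"name": "all_icons", "position": {"x": 0, "y": 0, "w": 960, "h": 540}},
        ]
    }
    for problem in check_manifest(sample):
        print(problem)
```

A clean result only means the bookkeeping is in order; edge quality, halos, and missing objects still have to be judged by eye.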

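MD source card 2: Image-PPT

The image-ppt MD source is reproduced below without deletions or changes. It takes over the split results from Image-Split and constrains the Agent to rebuild the visual assets and text layer into an editable PPTX.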

---
name: image-ppt
description: Convert slide/page images into editable PowerPoint PPTX decks. Use when Codex is given PNG/JPG/Image2/AI-generated slide images and asked to make an editable PPT, reconstruct image-based PPT pages, keep text editable, preserve layout, use visual-element layers or CopySlides-like region schemas, consume OCR-racing evidence, or turn flat slide screenshots into PPTX with selectable text and movable image objects.
---

# Image PPT

## Contract

Build a PPTX where visual components are image/shape layers and semantic text is native editable PowerPoint text.

- Do not put final body text inside a flattened slide screenshot.
- Use `image-split` first when the input is only flat slide images and no clean visual layers exist.
- Preserve slide size, page order, visual rhythm, and page numbering.
- Treat charts, photos, microscopy, logos, and complex diagrams as image objects unless the user explicitly asks to redraw them.
- Rebuild main titles, labels, card text, bullets, captions, and page numbers as editable text boxes.
- Editable text boxes are text-only by default: no fill, no outline, no patch background.
- Export PPTX through Presentations artifact-tool. Do not hand-edit final OOXML as the production path.

## Workflow

1. Inspect the input images and decide scope:
   - one/few-slide sample first when style or OCR quality is uncertain
   - full deck only after the sample is accepted
2. Split visual layers:
   - invoke `image-split` for component-layer transparent PNGs
   - use cropped positioned assets for atomic objects; use full-canvas PNG layers only for true background-wide elements or preview compatibility
   - if `image-split` produced a CopySlides-like `region-schema.json`, treat it as the layout contract for text placement and QA
3. Build or correct text layer data:
   - use OCR only as a coordinate/content draft
   - when `ocr-merged.json` exists, use it as content evidence and conflict report, not as final placement
   - manually correct obvious OCR errors, symbols, Greek letters, dates, names, and headings
   - anchor labels to visual objects when available, such as pill center, badge center, card inner padding, chart frame, or footer page-number slot
   - for CopySlides-like reconstruction, place text from region anchors and a style table rather than raw OCR boxes
   - keep each text object independently editable; do not merge unrelated card labels, page numbers, or bullets
   - keep text boxes transparent; put missing blue labels, badges, cards, and tabs into the visual layer instead of textbox fill
4. Run `scripts/build_ppt_from_layers.mjs` with the layer root and text JSON.
5. Render preview PNGs and layout JSON.
6. Validate:
   - PPTX opens as a zip and has the expected slide count
   - each slide has text in `ppt/slides/slide*.xml`
   - no major text overflow in rendered previews
   - textless-layer OCR and final-render OCR/manual review do not show residual text, duplicate text, or gibberish patches on high-risk pages
   - image layers are selectable/movable objects, not a single baked screenshot unless intentionally used as a background

For high-fidelity editable reconstruction, use the stricter visual-anchor workflow in `references/visual-anchor-text.md` before finalizing `text-layer.json`.
For CopySlides-like region reconstruction validated on difficult Image2 pages, read `references/copyslides-like-reconstruction.md` before building or accepting the PPTX.
When OCR racing was used by `image-split`, carry `ocr-merged.json` and `ocr-review-report.md` into content checks before final text placement.
For final or high-risk samples, read `references/acceptance-rubric.md` and report pass/warn/fail gates. This is mandatory for v3-style position optimization or full-deck rollout.

## Builder Script

Use the bundled script after visual layers and text JSON exist:

```bash
node scripts/build_ppt_from_layers.mjs \
  --layers-root /path/to/visual-layers \
  --text-json /path/to/text-layer.json \
  --out /path/to/editable.pptx \
  --workspace /path/to/workspace \
  --preview-dir /path/to/preview \
  --layout-dir /path/to/layout \
  --slide-size 960x540
```

The layer root should contain page folders with `manifest.json`; each manifest lists cropped positioned transparent PNG assets and/or documented full-canvas background-wide assets. The text JSON schema is documented in `references/text-layer-schema.md`.

By default the builder strips `fill` and `line` from text objects and reports the affected objects in `build-manifest.json`. Use `--fail-on-text-fill` for strict QA. Use `--preserve-text-fill` only for legacy debugging, not for production samples.

For representative slides, run `scripts/visual_text_qa.py` after rendering to produce a masked source/render diff and text-style metrics.

```bash
python scripts/visual_text_qa.py \
  --source /path/to/source-slide.png \
  --render /path/to/preview/slide-01.png \
  --text-json /path/to/text-layer.json \
  --pptx /path/to/output.pptx \
  --build-manifest /path/to/build-manifest.json \
  --asset-manifest /path/to/visual-layers/第05页/manifest.json \
  --out-dir /path/to/qa \
  --label-names s05-ocr-004,s05-ocr-009,s05-ocr-014 \
  --required-text 组方思路,君药,醒窍汤 \
  --warn-non-text-changed-ratio 0.08 \
  --max-non-text-changed-ratio 0.12 \
  --fail-on-gate-fail
```

## Text Layer Rules

- Use corrected content, not raw OCR, for final text.
- Use OCR-racing merge output as evidence; resolve conflicts against the source document, approved slide script, or user review before delivery.
- Use `PingFang SC` or `Microsoft YaHei` for Chinese unless the user specifies otherwise.
- Use separate text boxes for:
  - page title
  - blue tab labels
  - numbered badges
  - each card body
  - each bullet line/group
  - page number
- Use tight boxes for labels and larger wrapped boxes for paragraphs.
- For repeated components, prefer a style table over raw OCR font sizes: same-level labels, card headings, body lines, captions, and page numbers should share font size, weight, color, alignment, and vertical anchoring.
- Repeated labels should be centered from the corresponding visual layer bounds, not from the OCR text box, unless the visual layer is unavailable.
- When exact font style conflicts with editability, prioritize clean editable text and document the visual difference.
- Text boxes must be transparent in all normal routes. Do not use text-box `fill` or `line` to cover an existing skeleton shape, label, badge, card, or tab.
- If a visual shape is missing, fix it in `image-split` as a visual layer or add a separately documented shape repair. Do not combine the repair with the editable text box.

## Acceptance Standards

Run these checks before declaring a PPTX usable.

Use `references/acceptance-rubric.md` for the detailed gate model. The short standard below is only the minimum.

### Route Declaration

State which route was used:

- `component-layer route`: visual layers from `image-split`, with shapes/images rebuilt and text added separately.
- `CopySlides-like region reconstruction route`: semantic region schema from `image-split`, clean rebuilt geometry/atomic image assets, and transparent editable text anchored to regions.
- `Image2 textless skeleton route`: a full-slide textless skeleton image is used as the visual base and editable text is placed on top.
- `hybrid route`: textless skeleton for background plus selected component repairs or native shapes.

Do not mix route assumptions silently. Text coordinates calibrated for one route are not automatically valid for another.

### Structural PPTX Checks

- PPTX opens as a zip without errors.
- Slide count and slide size match the request.
- Page order and page numbers are correct.
- Every slide has semantic text in `ppt/slides/slide*.xml`.
- No empty placeholder text boxes remain.
- No missing media relationships are present.
- The final PPTX is exported through Presentations artifact-tool, unless the user explicitly requests a different production path.

### Editability Checks

- Main titles, subtitles, bullets, card text, tab labels, number badges, captions, and page numbers are native editable text.
- Charts, microscopy, photos, logos, complex medical diagrams, and dense figures may remain image objects, but must be selectable/movable/replacable.
- Do not bake final body text into a flattened slide screenshot.
- Use one text box per semantic unit. Do not merge unrelated bullets, page numbers, labels, and captions into a single box.

### Text Overlay Checks

- Text boxes must fit their target visual region without overflow, clipping, or covering icons/borders.
- Text style must be consistent by hierarchy: same-level labels share size, weight, color, alignment, and typeface.
- Same-level label font size delta should normally be <= 1pt unless the source intentionally differs; report any exception.
- Label text center should normally be within 3-5 px of the target pill/badge/card anchor in 960x540 slide coordinates.
- Text boxes are transparent by default. A text box with `fill` or `line` is a failure if it overlays an already-existing background label, pill, badge, or card.
- Filled text boxes are not accepted in production. Rebuild missing shapes separately, then overlay transparent text.
- QA must report the count of source text objects that had `fill` or `line`; the production build should strip them or fail with `--fail-on-text-fill`.
- When using an Image2 textless skeleton, first verify that the skeleton layout still matches the source. If it moved cards, labels, charts, or frames, regenerate or remap the text layer instead of reusing old coordinates.

### Visual QA Checks

- Render every slide to PNG.
- For sample and high-risk slides, create a source / render / diff comparison.
- For representative slides, mask out editable text boxes before measuring visual diff so the metric focuses on visual-layer fidelity.
- Also review the unmasked final render. Masked diff is not enough because it can hide residual source text, gibberish patches, and editable text overlap inside text regions.
- For presentation-grade acceptance, inspect source/render side-by-side and zoom crops for text zones, not only non-text zones.
- Blocking visual failures include white patches, blue patch blocks, broken/cropped badges, missing icons, double page numbers, text outside boxes, and inconsistent same-level font sizes.
- For full decks, produce a contact sheet and inspect all pages before delivery.

### Content Checks

- Text content must be corrected against the source document or approved slide script, not raw OCR.
- Common OCR substitutions, wrong Greek letters, wrong units, wrong page numbers, and old-topic residue are blocking failures.
- OCR-racing conflicts on scientific values, units, genes/proteins, drug concentrations, dates, or page numbers must be resolved before accepting the slide.
- Page numbers must be editable text and continuous unless the user asks otherwise.

### WPS / PowerPoint Checks

When WPS or PowerPoint is available and the user cares about final presentation fidelity:

- Open the generated PPTX and inspect representative pages.
- Check font fallback, line breaks, text overflow, media display, page numbers, and object selectability.
- If WPS rendering differs from artifact-tool preview, tune the PPTX for WPS and document the tradeoff.

## Output Layout

For substantial jobs, keep artifacts organized:

- `visual-layers/`: page folders from `image-split`
- `text-layer.json`: editable text objects
- `preview/`: rendered slide PNGs
- `layout/`: artifact-tool layout JSON
- final `.pptx`: only the deliverable in the user-facing output folder

## Stop Conditions

Stop and report before full-deck work if:

- the sample slide cannot render close enough to the source image
- OCR text is too inaccurate to correct without source text
- the preview has visible residual text, gibberish patches, or text overlap even when structure checks pass
- OCR-racing or final-render OCR flags unresolved scientific text conflicts
- the user expects every chart/photo pixel to become editable vectors
- final output would require cloud upload or external services not approved by the user

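How to tell whether the PPT passes

After opening the PPT, titles, body text, labels, and page numbers can each be selected and edited individually.
Clicking an icon, photo, chart, or logo selects a separate object, not one full-page screenshot.
Compared with the original image, the preview has essentially the same layout rhythm, margins, card positions, and footer position.
Text does not spill out of its box, cover icons, or break in odd places, and there is no duplicate layer of old text.
If the font cannot be matched exactly, prioritize readability and editability, and record the difference.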
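The first item on this list has a cheap structural pre-check: a PPTX is a zip archive, slides live under `ppt/slides/`, and editable text appears as `<a:t>` runs in the slide XML. The following is a rough sketch using only the standard library; the function name is hypothetical, and the regex scan is a shortcut that a stricter check would replace with a real XML parser.

```python
import io
import re
import zipfile

SLIDE_PART = re.compile(r"^ppt/slides/slide\d+\.xml$")
TEXT_RUN = re.compile(r"<a:t>([^<]+)</a:t>")


def check_pptx(pptx, expected_slides):
    """Return structural problems for a PPTX given as a path or file-like object."""
    try:
        archive = zipfile.ZipFile(pptx)
    except zipfile.BadZipFile:
        return ["file does not open as a zip archive"]
    problems = []
    with archive:
        corrupt = archive.testzip()  # name of the first corrupt member, or None
        if corrupt is not None:
            problems.append(f"{corrupt}: corrupt zip member")
        slides = sorted(n for n in archive.namelist() if SLIDE_PART.match(n))
        if len(slides) != expected_slides:
            problems.append(f"expected {expected_slides} slides, found {len(slides)}")
        for name in slides:
            xml = archive.read(name).decode("utf-8")
            # A slide with no non-empty text run is probably a baked screenshot.
            if not any(run.strip() for run in TEXT_RUN.findall(xml)):
                problems.append(f"{name}: no editable text runs found")
    return problems


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("ppt/slides/slide1.xml", "<p:sld><a:t>Title</a:t></p:sld>")
        demo.writestr("ppt/slides/slide2.xml", "<p:sld><p:pic/></p:sld>")
    buffer.seek(0)
    print(check_pptx(buffer, expected_slides=2))
```

Passing this check does not prove the deck is good. It only rules out the worst case, where a slide contains nothing but pictures; overflow, overlap, and residual text still need the rendered preview.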

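Common pitfalls

Pitfall 1: having the Agent do the whole deck from the start. Page style, OCR quality, font substitution, and split boundaries have not been validated, so a full-deck output is hard to rework. Do a sample first, then batch once the sample passes.

Pitfall 2: treating OCR as the final answer. OCR only helps you locate text and produce a draft reading; titles, proper nouns, numbers, units, and page numbers all need review. Output that looks like valid Chinese is not necessarily correct content.

Pitfall 3: covering old text with white text boxes. This may look clean in preview, but it shows as soon as you change the background, move an object, or project the slide. The correct approach is to remove the old semantic text at the split stage and rebuild it with transparent text boxes at the PPT stage.

Pitfall 4: not defining what must be editable. Say up front: titles, body text, and page numbers must be editable; photos and complex charts may remain images; icons and cards should ideally be movable individually. The clearer the goal, the less likely the Agent is to drift.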
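Pitfall 3 is one of the few that can be caught mechanically before the PPT is built. As a sketch, assume the text layer is a list of dicts that may carry `fill` or `line` keys; this shape is a guess, since the real text-layer schema is not reproduced here. The helper removes those keys, reports which objects were affected, and can optionally fail, which mirrors the strip-or-fail behavior the builder is described as having. It is not the builder's actual code, and the names are hypothetical.

```python
import copy


def strip_text_fill(text_objects, fail_on_fill=False):
    """Return (cleaned_objects, affected_names) without mutating the input.

    text_objects: list of dicts; `fill` and `line` are the keys treated as patches.
    """
    cleaned = copy.deepcopy(text_objects)
    affected = []
    for obj in cleaned:
        # pop() removes the key if present; a None value is treated as "not set".
        removed = [key for key in ("fill", "line") if obj.pop(key, None) is not None]
        if removed:
            affected.append(obj.get("name", "<unnamed>"))
    if affected and fail_on_fill:
        raise ValueError(f"text objects with fill/line: {', '.join(affected)}")
    return cleaned, affected


if __name__ == "__main__":
    layer = [
        {"name": "title", "text": "Quarterly review"},
        {"name": "tab_label_1", "text": "Overview", "fill": "#FFFFFF"},
    ]
    cleaned, affected = strip_text_fill(layer)
    print(affected)
    print(cleaned)
```

If stripping the fill exposes old text or a missing label background, the fix belongs in the visual layer, not in the text box.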

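Lesson summary

The route fits in one sentence: first use image-split to break the Image 2 generated image into clean visual assets and text regions, then use image-ppt to rebuild them into an editable PPT. When handing the job to an Agent, do not just send "convert this to PPT"; send the sample page, the Image-Split MD source, the Image-PPT MD source, and the acceptance criteria together. What you get is not an image pasted onto a slide, but a presentation that can still be edited and reused later.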