From datasets to analysis reports

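The goal of data analysis is not the analysis itself but an artifact that helps someone act: a chart for leadership, an experiment readout for a product team, a model evaluation for researchers, or a dashboard that guides daily operations.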

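Codex is well suited to turning this process into a reviewable, rerunnable workflow: cleaning data, joining multiple sources, exploring hypotheses, modeling, visualizing, and packaging the results into a report.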

Official page: https://developers.openai.com/codex/use-cases/datasets-and-reports

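When to use this

Scenario	What Codex should do
Messy files need to become a chart, memo, dashboard, or report	Start from an inventory, then clean, merge, analyze, and produce output through a reproducible process
You need cleanup, joins, EDA, and reproducible scripts	Save scripts and artifacts first; do not rely on one-off notebook state
The team needs reviewable artifacts	Generate Markdown, notebooks, `.docx`, PDF, a dashboard, or a static report site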

Capabilities used

Capability	Usage	Link
`$spreadsheet`	Inspect CSV, TSV, and Excel files; handle formulas and exports; run quick tabular checks	https://developers.openai.com/codex/skills
`$jupyter-notebook`	Create or refactor notebooks for exploratory analysis, experiments, and reusable walkthroughs	https://github.com/openai/skills/tree/main/skills/.curated/jupyter-notebook
`$doc`	Generate stakeholder-ready `.docx` reports when layout, tables, or comments matter	https://github.com/openai/skills/tree/main/skills/.curated/doc
`$pdf`	Render PDF output and check the final analysis artifact before sharing	https://github.com/openai/skills/tree/main/skills/.curated/pdf

Starter prompt

I'm working on a data analysis project in this workspace.

Goal:
- Determine whether houses near a highway have lower property valuations.

Start with these steps:
- Read `AGENTS.md` and explain the recommended Python environment
- Load the dataset(s) under [dataset path]
- Describe what each file contains, likely join keys, and obvious data quality issues
- Propose a reproducible workflow covering import, tidy, visualization, modeling, and report output

Constraints:
- Prefer scripts and saved artifacts over one-off notebook state
- Do not invent missing values or merge keys
- If particular skills or worktree splits would make the workflow more reproducible, recommend them explicitly

Output:
- setup plan
- data inventory
- analysis plan
- first commands or files to create

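This prompt asks Codex to explain the environment, inventory the data, and design the workflow before plotting anything. In data analysis, skipping the inventory and the join strategy is often the root cause of untrustworthy results later.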

Need	Recommended default	Why
Analysis stack	pandas + matplotlib or seaborn	Good fit for import, profiling, joins, cleaning, and first-pass charts
Modeling	statsmodels or scikit-learn	Start with interpretable baselines, then decide whether more complex predictive models are warranted

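Define the question

Start with one concrete question. The more specific the question, the easier it is for Codex to judge which data, methods, and output format fit.

The official example question is:

How much lower are property valuations for houses near a highway?

Suppose one dataset contains property values or sale prices, and another contains location, parcel, or highway-proximity information. The real work is not just running a model; it is making the inputs trustworthy, documenting the joins, stress-testing the result, and delivering an artifact that others can use.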

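Set up the project environment

When starting a new data project, define the environment and the rules first:

Environment: Codex should know the canonical Python environment, package manager, folders, and output conventions.
Skills: Repeated workflows such as notebook cleanup, spreadsheet exports, and final report packaging should be captured as reusable skills.
Worktrees: Put each hypothesis, merge strategy, or visualization branch in its own worktree so they do not contaminate one another.

A small AGENTS.md removes a lot of confusion:

## Data analysis defaults

- Use `uv run` or the project's existing Python environment.
- Keep source data in `data/raw/`; write cleaned data to `data/processed/`.
- Keep exploratory notebooks in `analysis/` and final artifacts in `output/`.
- Never overwrite raw files.
- Prefer scripts or committed notebooks over unnamed scratch cells.
- Before merging datasets, report candidate keys, null rates, and join coverage.

If the repo does not yet define a Python environment, have Codex create a reproducible setup and explain how to run it before anything else. For data analysis, this step matters more than jumping straight to charts.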
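The "never overwrite raw files" rule can also be enforced in code rather than left to convention. The following is a minimal sketch under that assumption; the helper name `safe_output_path` is hypothetical and not part of Codex or any library, and the folder names simply mirror the AGENTS.md example.

```python
from pathlib import Path

# Folder layout assumed from the AGENTS.md example above.
RAW_DIR = Path("data/raw")


def safe_output_path(path, raw_dir=RAW_DIR):
    """Return `path` resolved to an absolute Path, refusing anything inside the raw-data folder."""
    target = Path(path).resolve()
    raw_root = Path(raw_dir).resolve()
    # Reject the raw folder itself and any file or folder nested under it.
    if target == raw_root or raw_root in target.parents:
        raise ValueError(f"refusing to write inside raw data folder: {target}")
    return target


# Writing to the processed folder is allowed; writing under data/raw raises.
out = safe_output_path("data/processed/merged.csv")
print(out.name)
```

Scripts that write cleaned data could route every output path through a guard like this, so an accidental overwrite fails loudly instead of silently changing the source data.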

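Start with a data inventory

The fastest starting point is to give Codex the file paths and let it inspect the data. In the first round, ask only for an inventory, not for conclusions.

Have Codex answer:

Which file formats are present?
What does each dataset appear to represent?
Which columns look like targets, identifiers, dates, locations, or measures?
Where are the obvious data quality issues?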
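An inventory script that answers the first and last of these questions might look like the sketch below. It assumes pandas is installed and that the raw files live in `data/raw/`; the function names `inventory_csv` and `inventory_folder` are illustrative, not part of any skill. Deciding which columns are targets or identifiers still takes a human or Codex reading the summary.

```python
from pathlib import Path

import pandas as pd


def inventory_csv(path, sample_rows=1000):
    """Summarize one CSV from its first rows: dtype, null rate, and distinct count per column."""
    df = pd.read_csv(path, nrows=sample_rows)
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_rate": df.isna().mean().round(3),
        "n_unique": df.nunique(),
    })


def inventory_folder(folder="data/raw"):
    """Map each file name to a CSV profile, or to its suffix when it is not a CSV."""
    report = {}
    for path in sorted(Path(folder).glob("*")):
        if path.is_file():
            suffix = path.suffix.lower()
            report[path.name] = inventory_csv(path) if suffix == ".csv" else suffix
    return report


if __name__ == "__main__":
    for name, summary in inventory_folder().items():
        print(f"== {name}")
        print(summary)
```

Columns with a high `n_unique` and a zero `null_rate` are natural candidates for identifiers, which feeds directly into the join checks later on.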

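Tidy and merge

Merging is where real-world analysis most often goes wrong. When the primary key is unclear, a naive merge can silently drop rows or create duplicates.

Before the real merge, have Codex profile the data:

Check the uniqueness of candidate keys.
Measure null rates and formatting differences.
Normalize clear-cut issues such as casing, whitespace, and address formatting.
Run trial joins and report match rates.
Write down the safest merge strategy, then generate the final merged file.

If a derived key is needed, such as a normalized address, parcel identifier, or location join, have Codex explain the tradeoffs and edge cases first.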
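The profiling steps above can be sketched with pandas as follows. The column name `parcel_id`, the helper names, and the handful of rows are all made up for illustration; a real key may need more careful normalization than trimming and upper-casing.

```python
import pandas as pd


def profile_key(df, key):
    """Report row count, null rate, and uniqueness for one candidate join key."""
    col = df[key]
    return {
        "rows": len(df),
        "null_rate": float(col.isna().mean()),
        "is_unique": bool(col.dropna().is_unique),
    }


def normalize_key(series):
    """Trim whitespace and upper-case a string key; nulls stay null."""
    return series.astype("string").str.strip().str.upper()


def trial_join(left, right, key):
    """Outer-join on `key` and count matched and unmatched rows, without saving anything."""
    merged = left.merge(right, on=key, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {name: int(counts.get(name, 0)) for name in ("both", "left_only", "right_only")}


# Made-up rows, purely for illustration.
values = pd.DataFrame({"parcel_id": [" a1", "A2", "a3 "], "price": [300, 250, 410]})
parcels = pd.DataFrame({"parcel_id": ["A1", "A2", "A4"], "near_highway": [1, 0, 1]})

values["parcel_id"] = normalize_key(values["parcel_id"])
parcels["parcel_id"] = normalize_key(parcels["parcel_id"])

print(profile_key(values, "parcel_id"))
print(trial_join(values, parcels, "parcel_id"))
```

Once the trial join looks acceptable, passing `validate="one_to_one"` (or `"many_to_one"`) to `DataFrame.merge` for the final merge makes pandas raise if the key turns out not to have the expected uniqueness.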

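Explore with worktrees

Exploratory data analysis benefits from isolation. One worktree can try address cleanup or feature engineering while another works on charts or an alternate model direction. Each diff is then easier to review, and one long thread does not end up mixing mutually exclusive ideas.

The Codex app has built-in worktree support. In a terminal, you can also use Git worktrees directly: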

git worktree add ../analysis-highway-eda -b analysis/highway-eda
git worktree add ../analysis-model-comparison -b analysis/highway-modeling

In the highway example, this step compares homes near the highway with homes farther away, checks outliers and missing-value patterns, and assesses whether the observed effect is real or is driven by neighborhood composition, home size, or other factors.
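A first pass at that comparison could look like the sketch below. The column names `near_highway` and `valuation`, the helper names, and the six rows are assumptions made for illustration; the outlier check uses Tukey's 1.5 × IQR rule, which is one common choice rather than the method the source prescribes.

```python
import pandas as pd


def compare_groups(df, group_col, value_col):
    """Per-group row count, median, mean, and missing-value rate for one measure."""
    grouped = df.groupby(group_col)[value_col]
    return pd.DataFrame({
        "n": grouped.size(),
        "median": grouped.median(),
        "mean": grouped.mean(),
        "missing_rate": grouped.apply(lambda s: s.isna().mean()),
    })


def iqr_outliers(series, k=1.5):
    """Boolean mask of values more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return (series < q1 - k * spread) | (series > q3 + k * spread)


# Made-up rows, purely for illustration.
homes = pd.DataFrame({
    "near_highway": [1, 1, 1, 0, 0, 0],
    "valuation": [200, 220, None, 300, 320, 900],
})

print(compare_groups(homes, "near_highway", "valuation"))
print(homes[iqr_outliers(homes["valuation"])])
```

Grouping by a second column as well, for example a size band, is a quick way to see whether the gap survives within comparable homes before any model is fitted.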

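Model the question

Not every analysis needs a complex model. Start with an interpretable baseline.

For the highway question, for example, start with a regression or another transparent model that estimates the relationship between highway proximity and property value while controlling for factors such as size, age, and location.

Ask Codex to state explicitly:

The target variable and feature definitions.
Which controls were chosen and why.
Leakage risks and exclusions.
The choice of split, evaluation, or uncertainty estimate.
What the results mean, in plain language.

Even a weak first model is valuable. It helps you tell whether the problem lies in the model, the features, the join quality, or an ill-defined question.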
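To make the role of controls concrete, here is a small sketch of an ordinary least squares baseline. It uses plain NumPy as a stand-in for statsmodels or scikit-learn, which would also provide standard errors and diagnostics. The data is synthetic with a built-in highway effect of -20,000, and every variable name and number is an assumption chosen for illustration, not a result about real housing data.

```python
import numpy as np


def fit_ols(X, y, names):
    """Ordinary least squares with an intercept; returns {term: coefficient}."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return dict(zip(["intercept", *names], coef))


# Synthetic data with a known effect, so the sketch can be checked end to end.
rng = np.random.default_rng(0)
n = 500
sqft = rng.normal(1800, 400, n)
age = rng.uniform(0, 60, n)
# Smaller homes are more likely to sit near the highway, so size confounds the raw gap.
near_highway = (rng.random(n) < np.where(sqft < 1800, 0.6, 0.2)).astype(float)
price = (50_000 + 150 * sqft - 800 * age
         - 20_000 * near_highway + rng.normal(0, 10_000, n))

naive = fit_ols(near_highway.reshape(-1, 1), price, ["near_highway"])
controlled = fit_ols(np.column_stack([near_highway, sqft, age]), price,
                     ["near_highway", "sqft", "age"])

print("naive gap:     ", round(naive["near_highway"]))
print("controlled gap:", round(controlled["near_highway"]))
```

Because highway homes are smaller on average in this synthetic setup, the naive gap overstates the effect, while the controlled estimate lands near the value that was built in. That difference is exactly what the list above asks Codex to explain when it justifies its controls.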

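Deliver the results

An analysis only has value once someone else consumes it. Have Codex generate the artifact that fits the audience:

A Markdown memo for technical collaborators.
A spreadsheet or CSV for downstream operations.
A .docx brief generated with $doc when formatting and tables matter.
An appendix or deliverable rendered with $pdf when you need to share a final version.
A lightweight dashboard or static report site deployed with $vercel-deploy when you want to share by URL.

Every deliverable must include caveats. If join quality is imperfect, the sample is biased, or the model assumptions are fragile, Codex should say so explicitly.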
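One way to keep caveats from being dropped is to make the report generator refuse to run without them. The sketch below assumes a plain Markdown memo; `render_memo` is a hypothetical helper, and the bracketed strings are placeholders rather than real findings.

```python
def render_memo(title, findings, caveats):
    """Build a Markdown memo as a string; refuse to render one without caveats."""
    if not caveats:
        raise ValueError("a memo needs at least one caveat")
    lines = [f"# {title}", "", "## Findings", ""]
    lines += [f"- {item}" for item in findings]
    lines += ["", "## Caveats", ""]
    lines += [f"- {item}" for item in caveats]
    return "\n".join(lines) + "\n"


# Placeholder content only; a real memo would carry the actual estimates.
memo = render_memo(
    "Highway proximity and property valuation",
    findings=["[estimated gap and uncertainty interval]"],
    caveats=["[share of rows that failed to join]", "[model assumptions to revisit]"],
)
print(memo)
```

The same check carries over to other formats: whatever produces the `.docx`, PDF, or dashboard can take the caveat list as a required input.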

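Skills worth capturing

Once the workflow is stable, turn the repeated steps into repo-local skills, for example:

refresh-data
merge-and-qa
publish-weekly-report

In the long run, this is more reliable than pasting the same procedural prompt into a thread every time.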