# Machine Learning Summary

### Q & A

{% hint style="info" %}
**Q:** How should the training and test sets be chosen during training?
{% endhint %}

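**A:**

* Random split, for example 60% training / 20% validation / 20% test
* K-fold cross-validation: split the data into K equal parts; in each round, train on K-1 parts and validate on the remaining part, repeating K times so that each part serves as the validation set exactly once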
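
Both splitting schemes above can be sketched with index lists alone. This is a minimal illustration using only the standard library; the function names `split_60_20_20` and `k_fold_indices` and the `seed` parameter are assumptions made for this example, not part of any particular library.

```python
import random


def split_60_20_20(n_samples, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them 60/20/20."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 6 // 10
    n_val = n_samples * 2 // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


train, val, test = split_60_20_20(100)
print(len(train), len(val), len(test))

for train_idx, val_idx in k_fold_indices(10, k=5):
    print(sorted(val_idx), "held out for validation")
```

In practice a library routine (for instance scikit-learn's splitters) would usually be preferred; the sketch only makes the bookkeeping explicit.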

{% hint style="info" %}
**Q:** How is the KL divergence defined?
{% endhint %}

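**A:**

* $$P(x)$$: the true distribution
* $$Q(x)$$: the approximating distribution
* Discrete case: $$D_{KL}(P \| Q) = \sum_{x} P(x) \log\frac{P(x)}{Q(x)}$$
* Continuous case: $$D_{KL}(P \| Q) = \int P(x) \log\frac{P(x)}{Q(x)} \, dx$$
* Also called **relative entropy**, it is one way to measure the difference between two probability distributions. It answers the question: if we use the distribution $$Q$$ to approximate the distribution $$P$$, what is the expected "extra cost" per sample?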
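
A small numeric check can make the discrete formula concrete. The sketch below assumes both distributions are given as equal-length lists of probabilities over the same outcomes; the function name `kl_divergence` and the example values are chosen for illustration only. It uses the natural logarithm, so the result is in nats.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: a term with P(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total


p = [0.5, 0.5]  # illustrative "true" distribution
q = [0.9, 0.1]  # illustrative approximation

print(kl_divergence(p, q))  # cost of approximating P with Q
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
print(kl_divergence(p, p))  # 0.0 when the two distributions are identical
```

Swapping the arguments changes the result, which is why the KL divergence is not a distance metric in the strict sense.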

{% hint style="info" %}
**Q:** Given a dataset with more features than samples, how would you select features?
{% endhint %}

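**A:**

* Correlation analysis: check how strongly each feature is related to the target, and drop features that are weakly related or largely redundant with one another
* Principal component analysis (PCA); strictly speaking this reduces dimensionality by building new combined features rather than selecting a subset of the original ones
* Model-based feature selection: decision trees, random forests
* Forward selection: start from an empty feature set and add one feature at a time, each time choosing the feature that yields the largest performance gain, until no further improvement is obtained
* Backward elimination: start from all features and remove one at a time, each time dropping the feature whose removal degrades model performance the least, until removing any further feature would cause a significant drop in performance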
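
The forward selection loop described above can be sketched as a greedy search. Everything named here is hypothetical: `score_fn` stands in for whatever evaluation is used (typically a cross-validated model score), `min_gain` is an assumed stopping threshold, and `toy_score` is a made-up scoring function that exists only so the example runs on its own.

```python
def forward_selection(features, score_fn, min_gain=1e-9):
    """Greedily add the feature with the largest score gain until none helps."""
    selected = []
    remaining = list(features)
    best_score = score_fn(selected)
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored, key=lambda pair: pair[0])
        if cand_score - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


def toy_score(subset):
    """Made-up score: fixed gain per useful feature minus a size penalty."""
    gains = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.0}
    return sum(gains[f] for f in subset) - 0.01 * len(subset)


selected, score = forward_selection(["a", "b", "c", "d"], toy_score)
print(selected, round(score, 2))
```

Backward elimination follows the same pattern in reverse: start with all features, score each subset obtained by removing one, and stop once every removal costs more than an accepted tolerance.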

{% hint style="info" %}
**Q:** How would you describe Simpson's paradox?
{% endhint %}

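**A:** After splitting the data into groups, we find a trend that holds within every group, yet once all groups are combined, the overall trend is reversed.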
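
A tiny numeric example can show how the reversal arises. The counts below are hypothetical, chosen only to exhibit the effect: option A has the higher success rate in each group, but because A is mostly applied in the harder group and B in the easier one, B looks better once the groups are pooled.

```python
# Hypothetical (successes, trials) counts per group and option.
data = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate for one (successes, trials) pair."""
    return successes / trials


def pooled_rate(option):
    """Success rate for an option after merging all groups."""
    successes = sum(data[g][option][0] for g in data)
    trials = sum(data[g][option][1] for g in data)
    return successes / trials


for group, row in data.items():
    print(group, {opt: round(rate(*row[opt]), 3) for opt in row})
print("pooled", {opt: round(pooled_rate(opt), 3) for opt in ("A", "B")})
```

The unequal group sizes are what drive the reversal: the pooled rate is a weighted average, and the weights differ between the two options.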

