A Conversation Between Ben Shapiro and the Lobster Professor

On September 1, Ben Shapiro interviewed the famous "Lobster Professor," Jordan Peterson, on his podcast The Ben Shapiro Show.

The theme of the episode was "How to Find Beauty in Self-Sacrifice."

On the September 1, 2024 episode of The Ben Shapiro Show, Jordan Peterson sat down with Ben Shapiro for an in-depth conversation on the beauty of self-sacrifice. Peterson is a world-renowned clinical psychologist, professor emeritus at the University of Toronto, and online educator, known for his influential lectures on religious belief, narrative mythology, and personality. In the conversation, Peterson shared behind-the-scenes anecdotes from his latest series, Foundations of the West, in which he visits Jerusalem, Athens, and Rome to uncover the ancient roots of Western civilization. He also discussed his personal religious journey and his observations on the themes of risk and sacrifice in the Bible.

In the conversation, Peterson brought up the story of Abraham and Isaac as an example of the theme of sacrifice. He explained that the biblical narrative challenges the notion of child sacrifice, presenting it instead as a metaphor for the sacrifice parents must make for their children: the willingness to let a child face the world, despite its inherent dangers, is a form of self-sacrifice that ultimately leads to growth and maturity. This idea echoes the broader theme of the sacrifices that must be made for the community and for future generations.

Peterson also discussed the role of community and responsibility, arguing that community is inherently built on sacrifice. He contends that human experience is characterized by relationships, and that these relationships demand that individuals sacrifice for the greater good. This notion is deeply rooted in the biblical texts, which consistently stress the importance of selflessness and the willingness to confront chaos for the sake of order.

Peterson then turned to what this understanding implies for contemporary society. He voiced concern about the growing trend toward individualism and the neglect of communal responsibility, arguing that a society that places individual interest above collective well-being will ultimately undermine its own stability and moral fabric. The biblical narratives, he suggested, offer timeless wisdom about the importance of sacrifice and the interconnectedness of human life.

Finally, the conversation culminated in a discussion of the challenge to Peterson's clinical license in Canada, where the authorities have ordered him to undergo social-media training. He expressed his determination to fight it, viewing that fight as a moral obligation to "resist the infringement of free speech and professional integrity."

These discussions offer deep insight into self-sacrifice, communal responsibility, and personal growth, and show how thoroughly Peterson has thought these themes through.

Large Models vs. Sexual Symbols

What particularly interested me: Peterson argued that the way large models understand the real world (compressing world knowledge out of text) and the associative method of psychoanalysis (e.g. a sexual symbol standing in for a deeper truth) are, at some level, the same thing (in the cognitive domain).

Peterson said:

"Through reading Freud, Jung, Erich Neumann... if academia, if Western institutions of higher education had been able to adopt Neumann's... Jung's method of literary interpretation, we wouldn't be in a culture war right now.
They were right. I think we'll eventually reveal this through large language models. It has come to my attention that what psychoanalysts do through their associative work, through free association in dream interpretation and narrative interpretation, is the same thing, from a cognitive standpoint, that large language models do when they map the statistical regularities between concepts.
So the best way to think about it: Freud believed a symbol, such as a sexual symbol, was a substitute for a deeper truth, a result of repression, right? So the repressed content surfaces in symbolic form. In many ways that's almost true; many of the things Freud said are almost true."

This passage is very interesting. I can't help wondering: could it be that the Lobster Professor, after having his clinical license revoked, plans to break into the LLM business?

I agree with Peterson's view. In August this year I climbed Xishan in Kunming with colleagues and dropped by the art museum at the foot of the hill, which was exhibiting oil paintings and calligraphy. Standing close to a painting called "Grassland in Sunshine," all I could see were mottled patches of pigment; there was no way to "see" any grassland, sunshine, green hills, or water. That is essentially no different from parsing a face image into a string of vector numbers. But as you slowly step back, the brilliant sunshine gradually emerges, as if the scene before the painter's eyes at that moment were reappearing.

Oil painting is the large model of an earlier era.

*The podcast is here:
From The Ben Shapiro Show: How To Find Beauty In Self-Sacrifice | Jordan Peterson, Sep 1, 2024

力学概论 (Introduction to Mechanics)


Published in 1986, 《力学概论》 (Introduction to Mechanics) proposes that the starting point in the search for unification should be not speculation but the phenomena of motion.

I suspect the reason Professor Fang put it this way is that he considered armchair theorizing empty and useless: only hands-on investigation of things ("躬身格物") is the one correct road to understanding the world.

With the advent of LLMs, I have a growing intuition that they may really be a key to understanding this world. An LLM is a statistical abstraction of language, and language is itself a representation of all the tools with which we understand the world. Mathematics, by abstracting over that representation (one might even call it an abstraction of abstractions), lets us analyze the world's real problems in a higher dimension, and perhaps the solutions discovered in that higher dimension can be mapped back onto the real problems as well.


Neo4j's CTO Paints a Bright Future for Knowledge Graphs – A GraphRAG Manifesto

Against the backdrop of the LLM craze, how does Neo4j's CTO describe the value of knowledge graphs?

Why use a knowledge graph

Because neither fine-tuning nor RAG can provide a correct answer with any certainty (Vector-based RAG – in the same way as fine-tuning – increases the probability of a correct answer for many kinds of questions. However neither technique provides the certainty of a correct answer.)

Bringing real knowledge (things, not strings) into the mix of statistically based text techniques is what breaks through that ceiling (bring knowledge about things into the mix of statistically-based text techniques).
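The manifesto stays at the level of argument, but as a rough sketch of what the combination looks like in practice (my own illustration, not from the manifesto; the schema, connection details, and property names are hypothetical), a GraphRAG pipeline first retrieves candidate chunks by vector similarity, then walks the graph around them to pull in explicit facts:

from neo4j import GraphDatabase

# Hypothetical schema: (:Chunk)-[:MENTIONS]->(:Entity)-[:RELATED_TO]->(:Entity)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

EXPAND = """
MATCH (c:Chunk)-[:MENTIONS]->(e:Entity)-[r:RELATED_TO]->(n:Entity)
WHERE c.id IN $chunk_ids
RETURN e.name AS subject, type(r) AS relation, n.name AS object
"""

def graph_context(chunk_ids):
    """Augment vector-retrieved chunks with the graph facts around them."""
    with driver.session() as session:
        return [record.data() for record in session.run(EXPAND, chunk_ids=chunk_ids)]

# The returned (subject, relation, object) triples -- "things, not strings" --
# are appended to the LLM prompt alongside the raw text chunks.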

Benefits of using a knowledge graph

  1. There’s a robust body of research proving that it gives you better answers to most if not ALL questions you might ask an LLM using normal vector-only RAG.
  2. That alone will be a huge driver of GraphRAG adoption. In addition to that, you get easier development thanks to data being visible when building your app.
  3. A third major advantage is that graphs can be readily understood and reasoned upon by humans as well as machines.

Original link
https://neo4j.com/blog/graphrag-manifesto/

Huawei Ascend NPU Adaptation

Ascend Extension for PyTorch Plugin

Introduction

This project develops the Ascend Extension for PyTorch plugin, named torch_npu, which adapts Ascend NPUs to the PyTorch framework and gives developers working in PyTorch access to the compute power of Ascend AI processors.

Ascend provides full-stack AI computing infrastructure for industry applications and services built on Huawei Ascend processors and software. You can visit the Ascend community to learn more about Ascend.

Installation

Install from binaries

We provide wheel packages for a quick installation of torch_npu. Before installing torch_npu, install the CANN software first; the Ascend auxiliary software section below has more information on CANN versions, and the CANN installation guide explains how to obtain the CANN packages.

  1. Install PyTorch

Install PyTorch via pip.

aarch64:

pip3 install torch==2.1.0

x86:

pip3 install torch==2.1.0+cpu  --index-url https://download.pytorch.org/whl/cpu

If installation via pip fails, use the download links below or visit the official PyTorch website to find and download the matching version.

Architecture  Python version  Download link
x86           Python 3.8      download link
x86           Python 3.9      download link
x86           Python 3.10     download link
aarch64       Python 3.8      download link
aarch64       Python 3.9      download link
aarch64       Python 3.10     download link
  2. Install the torch_npu dependencies

Run the following commands to install the dependencies.

pip3 install pyyaml
pip3 install setuptools
  3. Install torch_npu
pip3 install torch-npu==2.1.0.post3

To save an installation log, append --log <PATH> to the pip3 install command, and control the permissions of the directory <PATH> you specify.

Install from source

In some special scenarios, users may need to build torch_npu themselves. Choose a suitable branch according to the PyTorch-and-Python compatibility table in the Ascend auxiliary software section. We recommend building torch_npu inside a Docker image, which can be obtained through the following steps (mount only your working path, and avoid system paths, to reduce security risks); the generated .whl file will be placed under ./dist/:

  1. Clone the torch_npu repository

    git clone https://gitee.com/ascend/pytorch.git -b v2.1.0-6.0.rc1 --depth 1
    
  2. Build the image

    cd pytorch/ci/docker/{arch} # {arch} for X86 or ARM
    docker build -t manylinux-builder:v1 .
    
  3. Enter the Docker container

    docker run -it -v /{code_path}/pytorch:/home/pytorch manylinux-builder:v1 bash
    # {code_path} is the torch_npu source code path
    
  4. Build torch_npu

    Take Python 3.8 as an example.

    cd /home/pytorch
    bash ci/build.sh --python=3.8
    

Uninstall

To uninstall the PyTorch training environment, refer to the official Ascend documentation.

To uninstall torch_npu itself, just run:

pip3 uninstall torch_npu

To save an uninstall log, append --log <PATH> to the pip3 uninstall command, and control the permissions of the directory <PATH> you specify.

Getting Started

Prerequisites

Run the following command to initialize the CANN environment variables.

# Default path, change it if needed.
source /usr/local/Ascend/ascend-toolkit/set_env.sh

Quick verification

You can quickly try out the Ascend NPU with the following sample:

import torch
import torch_npu

x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)

print(z)

PyTorch and Python Version Compatibility

PyTorch version  Python version
PyTorch 1.11.0   Python 3.7.x (>= 3.7.5), Python 3.8.x, Python 3.9.x, Python 3.10.x
PyTorch 2.1.0    Python 3.8.x, Python 3.9.x, Python 3.10.x
PyTorch 2.2.0    Python 3.8.x, Python 3.9.x, Python 3.10.x

Ascend Auxiliary Software

PyTorch Extension version numbers follow the naming rule {PyTorch version}-{Ascend version}: the former is the PyTorch version that the extension matches, and the latter matches the CANN version. The detailed mapping is as follows:

CANN version    Supported PyTorch  Supported Extension  Gitee branch       AscendHub image version/name (link)
CANN 8.0.RC1    2.2.0              2.2.0                v2.2.0-6.0.rc1
                2.1.0              2.1.0.post3          v2.1.0-6.0.rc1
                1.11.0             1.11.0.post11        v1.11.0-6.0.rc1
CANN 7.0.0      2.1.0              2.1.0                v2.1.0-5.0.0
                2.0.1              2.0.1.post1          v2.0.1-5.0.0
                1.11.0             1.11.0.post8         v1.11.0-5.0.0
CANN 7.0.RC1    2.1.0              2.1.0.rc1            v2.1.0-5.0.rc3
                2.0.1              2.0.1                v2.0.1-5.0.rc3
                1.11.0             1.11.0.post4         v1.11.0-5.0.rc3
CANN 6.3.RC3.1  1.11.0             1.11.0.post3         v1.11.0-5.0.rc2.2
CANN 6.3.RC3    1.11.0             1.11.0.post2         v1.11.0-5.0.rc2.1
CANN 6.3.RC2    2.0.1              2.0.1.rc1            v2.0.1-5.0.rc2
                1.11.0             1.11.0.post1         v1.11.0-5.0.rc2    23.0.RC1-1.11.0
                1.8.1              1.8.1.post2          v1.8.1-5.0.rc2     23.0.RC1-1.8.1
CANN 6.3.RC1    1.11.0             1.11.0               v1.11.0-5.0.rc1
                1.8.1              1.8.1.post1          v1.8.1-5.0.rc1
CANN 6.0.1      1.5.0              1.5.0.post8          v1.5.0-3.0.0       22.0.0
                1.8.1              1.8.1                v1.8.1-3.0.0       22.0.0-1.8.1
                1.11.0             1.11.0.rc2 (beta)    v1.11.0-3.0.0
CANN 6.0.RC1    1.5.0              1.5.0.post7          v1.5.0-3.0.rc3     22.0.RC3
                1.8.1              1.8.1.rc3            v1.8.1-3.0.rc3     22.0.RC3-1.8.1
                1.11.0             1.11.0.rc1 (beta)    v1.11.0-3.0.rc3
CANN 5.1.RC2    1.5.0              1.5.0.post6          v1.5.0-3.0.rc2     22.0.RC2
                1.8.1              1.8.1.rc2            v1.8.1-3.0.rc2     22.0.RC2-1.8.1
CANN 5.1.RC1    1.5.0              1.5.0.post5          v1.5.0-3.0.rc1     22.0.RC1
                1.8.1              1.8.1.rc1            v1.8.1-3.0.rc1
CANN 5.0.4      1.5.0              1.5.0.post4          2.0.4.tr5          21.0.4
CANN 5.0.3      1.5.0              1.5.0.post3          2.0.3.tr5          21.0.3
CANN 5.0.2      1.5.0              1.5.0.post2          2.0.2.tr5          21.0.2

Suggestions and Communication

Everyone is welcome to contribute to the community. For any questions or suggestions, please submit a Gitee issue, and we will respond as soon as possible. Thank you for your support.

Branch Maintenance Policy

The maintenance phases of an AscendPyTorch version branch are as follows:

Status             Duration     Description
Planning           1-3 months   Plan features
Development        3 months     Develop features
Maintenance        6-12 months  Merge all resolved issues and publish releases; maintenance periods differ by PyTorch version: 6 months for regular releases and 12 months for long-term-support releases
Unmaintained       0-3 months   Merge all resolved issues; no dedicated maintainers and no releases
End of Life (EOL)  N/A          The branch no longer accepts any changes

PyTorch Version Maintenance Policy

PyTorch version  Policy             Current status  Released    Next status                            EOL date
2.2.0            Regular            Maintained      2024/04/01  Expected unmaintained from 2024/10/15  -
2.1.0            Long-term support  Maintained      2023/10/15  Expected unmaintained from 2024/10/15  -
2.0.1            Regular            EOL             2023/7/19   -                                      2024/3/14
1.11.0           Long-term support  Maintained      2023/4/19   Expected unmaintained from 2024/4/19   -
1.8.1            Long-term support  EOL             2022/4/10   -                                      2023/4/10
1.5.0            Long-term support  EOL             2021/7/29   -                                      2022/7/29

Security Statement

See the security statement for the Ascend Extension for PyTorch plugin.

Reference Documentation

For more details on installation, model migration, training/inference tutorials, the API list, and more, see Ascend Extension for PyTorch in the Ascend community.

Document                                    Link
Installation guide                          reference link
Model migration and training                reference link
Operator adaptation                         reference link
API list (native PyTorch and custom APIs)   reference link

License

The Ascend Extension for PyTorch plugin is licensed under the BSD license. See the LICENSE file for details.

Running InternLM in Colab

The 1.8B model (3.35 GB in fp16) runs without problems.

The image below shows the 20B model (37 GB in fp16) bringing the environment down; the disk is only 78 GB, so I want to try whether mounting a Google Drive works.
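For what it's worth, mounting Google Drive in Colab is a one-liner with the standard google.colab helper (whether Drive I/O is fast enough for 37 GB of weights is another matter):

from google.colab import drive

# Mounts Google Drive at /content/drive after an authorization prompt.
drive.mount('/content/drive')

# Weights could then go under /content/drive/MyDrive/... instead of the
# ~78 GB local disk.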

Calculation

To work out the size of the InternLM 20B model in fp16 (half-precision floating point), we can proceed as follows:

Number of parameters:

The InternLM 20B model has 20B (20,000,000,000) parameters.

Size of each parameter:

In fp16 mode, each floating-point number occupies 2 bytes (16 bits).

Computing the model size:

Model size (bytes) = number of parameters × size per parameter (bytes)
Model size (bytes) = 20,000,000,000 × 2
Model size (bytes) = 40,000,000,000 bytes

Converting to common units:

1 GB = 1,073,741,824 bytes
Model size (GB) = model size (bytes) / 1,073,741,824
Model size (GB) ≈ 40,000,000,000 / 1,073,741,824
Model size (GB) ≈ 37.25 GB

Result

In fp16 mode, the InternLM 20B model is about 37.25 GB.

This calculation assumes each parameter occupies 2 bytes and ignores additional overhead in the model (such as optimizer state and other metadata), so actual memory usage may be slightly higher.
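The same arithmetic in a few lines of Python:

params = 20_000_000_000        # 20B parameters
bytes_per_param = 2            # fp16 = 16 bits = 2 bytes

size_bytes = params * bytes_per_param
size_gib = size_bytes / 2**30  # 1 GiB = 1,073,741,824 bytes
print(f"{size_gib:.2f} GiB")   # -> 37.25 GiB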

Text Vectorization, Word Frequency, and Sparse vs. Dense Matrices

A text-vectorization and word-frequency example

The code below converts a small set of documents into a sparse matrix, and then, for easier handling (and because some algorithms don't support sparse matrices), converts it into a dense matrix.

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love programming in Python",
    "Python is a great language",
    "I love coding"
]

# Create an instance of CountVectorizer
vect = CountVectorizer()

# Fit and transform the data
X = vect.fit_transform(documents)

# Convert to dense array
X_dense = X.toarray()

# Get feature names (tokens)
feature_names = vect.get_feature_names_out()

# Print feature names and the dense array for verification
print("Feature names:", feature_names)
print("Dense array:\n", X_dense)

# Sum the counts of each token across all documents
token_counts = X_dense.sum(axis=0)

# Create a dictionary of tokens and their counts
token_count_dict = dict(zip(feature_names, token_counts))

# Print the token counts
for token, count in token_count_dict.items():
    print(f"{token}: {count}")

The output is as follows.

This example will print the feature names, the dense array, and the token counts:

Feature names: ['coding' 'great' 'in' 'is' 'language' 'love' 'programming' 'python']
Dense array:
 [[0 0 1 0 0 1 1 1]
  [0 1 0 1 1 0 0 1]
  [1 0 0 0 0 1 0 0]]
coding: 1
great: 1
in: 1
is: 1
language: 1
love: 2
programming: 1
python: 2

Explanation of the Output

Feature Names:

The feature names are printed in the same order as they appear in the dense array’s columns: [‘coding’ ‘great’ ‘in’ ‘is’ ‘language’ ‘love’ ‘programming’ ‘python’].
Dense Array:

The dense array shows the token counts for each document, with each column corresponding to the respective feature name.
For example, the first column corresponds to ‘coding’, the second column to ‘great’, and so on.
Token Counts:

The token counts dictionary shows the total count of each token across all documents, matching the counts in the dense array.

Verification

To verify the correspondence, look at the dense array and the feature names:

The first column in X_dense corresponds to ‘coding’. In the dense array, the first column has counts [0, 0, 1], meaning ‘coding’ appears once in the third document.
The second column corresponds to ‘great’. The counts are [0, 1, 0], meaning ‘great’ appears once in the second document.
This pattern continues for all feature names and their corresponding columns.

Conclusion

The sequence of the feature names is the same as the columns of the dense array. Each column in the dense array represents the count of a specific token, and the order of these tokens is given by feature_names.

Sparse Matrix vs. Dense Array

When using CountVectorizer to transform text data into a matrix of token counts, the result is a sparse matrix by default. Let’s explore the differences between sparse matrices and dense arrays, and why one might be preferred over the other in certain contexts.

Sparse Matrix

A sparse matrix is a matrix in which most of the elements are zero. Instead of storing every element, sparse matrices store only the non-zero elements and their positions. This can lead to significant memory savings when dealing with large datasets where the number of zeros vastly outnumbers the number of non-zero elements.

Advantages:
Memory Efficiency: Sparse matrices save memory by only storing non-zero elements. This is crucial for large datasets with many features (e.g., in text processing where there are many words but each document only contains a small subset).
Performance: Certain operations can be faster on sparse matrices due to the reduced amount of data.
Disadvantages:
Complexity: Sparse matrices are more complex to manipulate and understand because they don’t store data in a straightforward row-by-row manner.

Dense Array

A dense array, on the other hand, stores all elements explicitly, including the zero elements. This means it takes up more memory but is simpler to understand and manipulate.

Advantages:
Simplicity: Dense arrays are easier to work with because they store data in a straightforward manner, where each element corresponds directly to a position in the matrix.
Compatibility: Some algorithms and libraries work only with dense arrays, not sparse matrices.
Disadvantages:
Memory Usage: Dense arrays can consume a lot of memory if the dataset is large and contains many zero elements.

Example explanation

Interpretation

Sparse Matrix:

Efficiently stores data when most elements are zero.
Example representation (only showing non-zero values and their positions):

(0, 2) 1
(0, 5) 1
(0, 6) 1
(0, 7) 1
(1, 1) 1
(1, 3) 1
(1, 4) 1
(1, 7) 1
(2, 0) 1
(2, 5) 1

Dense Array:

Simpler to understand as it stores all elements explicitly.
Example representation:

[[0 0 1 0 0 1 1 1]
 [0 1 0 1 1 0 0 1]
 [1 0 0 0 0 1 0 0]]

Each row corresponds to a document, and each column corresponds to a token. The values represent the count of each token in the respective document.

In summary, sparse matrices are memory-efficient and suitable for large datasets with many zero elements, while dense arrays are straightforward and easier to work with for smaller datasets or when simplicity is desired.
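To make the trade-off concrete with the CountVectorizer example above (a quick sketch; note that for a matrix this tiny, the CSR bookkeeping arrays can actually outweigh the dense array, and the savings only show up at scale):

# X is the sparse CSR matrix from fit_transform, X_dense its toarray() version.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X_dense.nbytes

print("sparse (CSR):", sparse_bytes, "bytes")
print("dense:", dense_bytes, "bytes")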

Kimi's explanation (translated from the Chinese)

In computer science and mathematics, sparse matrices and dense matrices are two different kinds of matrix that are stored and processed differently, depending mainly on how many non-zero elements the matrix contains.

Dense matrix

A dense matrix is one in which most elements are non-zero; the number of non-zero elements is close to the total number of elements. Dense matrices are usually represented as a complete two-dimensional array, with storage allocated for every element.

Characteristics

  • Larger storage requirements, because every element needs its own storage.
  • Matrix operations (addition, multiplication, and so on) usually require more computing resources.
  • Common in fields such as data analysis and image processing when datasets contain many non-zero elements.

Sparse matrix

A sparse matrix is one in which most elements are zero; the number of non-zero elements is far smaller than the total number of elements. To save storage and improve computational efficiency, sparse matrices are usually stored not as a complete two-dimensional array but in special data structures that record only the non-zero elements and their positions.

Characteristics

  • Smaller storage requirements, because only the non-zero elements and their positions need to be stored.
  • Matrix operations can be more efficient, because the large number of zero elements can be skipped.
  • Very common in many applications, such as text processing (term-frequency matrices), social network analysis, and large-scale numerical simulation.

Sparse matrix storage formats

  • Triplet list: a list of all non-zero elements together with their row and column indices.
  • Compressed formats: usually one of two kinds, Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).
    • CSR: compressed by row, well suited to row operations.
    • CSC: compressed by column, well suited to column operations.
  • Coordinate list (COO): similar to the triplet list, but usually not sorted by row or column.

In practice, the choice between sparse and dense matrices depends on the problem at hand and the characteristics of the data. Sparse matrices have a clear advantage when handling large-scale sparse data, significantly reducing storage needs and improving computational efficiency, while dense matrices suit situations where most elements take part in the computation.
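A minimal sketch of these storage formats with scipy.sparse: a COO matrix built from triplets, then converted to CSR/CSC.

import numpy as np
from scipy.sparse import coo_matrix

# Triplets: (row, col, value) for the non-zero entries only.
rows = np.array([0, 0, 1, 2])
cols = np.array([2, 5, 1, 0])
vals = np.array([1, 1, 1, 1])

A_coo = coo_matrix((vals, (rows, cols)), shape=(3, 8))
A_csr = A_coo.tocsr()  # row-compressed: fast row slicing and row operations
A_csc = A_coo.tocsc()  # column-compressed: fast column operations

print(A_csr[1].toarray())  # efficient row access on CSR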

English Is the Key That Opens the World; Comrades Must Keep Striving

While running a Chinese word-segmentation model test in Colab, the programming environment automatically showed a hint for the train_test_split function. The hint wasn't clear to me, so I asked ChatGPT to explain it.

ChatGPT's first explanation

The image you uploaded shows the function signature and description for a function used to split arrays or matrices into random train and test subsets. This function is likely train_test_split from the sklearn.model_selection module in the scikit-learn library. Here’s a breakdown of the parameters and how to use them:

Function Signature

(*arrays: Any, test_size: Any | None = None, train_size: Any | None = None, random_state: Any | None = None, shuffle: bool = True, stratify: Any | None = None) -> list

Description
Purpose: The function splits arrays or matrices into random train and test subsets.
Usage: It wraps input validation and application to input data into a single call for splitting (and optionally subsampling) data into a one-liner.
Parameters
*arrays: Sequence of indexables with the same length/shape[0]. This is where you pass the data you want to split. It can be features (X) and labels (y).
test_size: The proportion of the dataset to include in the test split. Can be a float (e.g., 0.25 for 25% test data), an integer (number of samples), or None (in which case the value is set to the complement of the train size).
train_size: The proportion of the dataset to include in the train split. Can be a float, an integer, or None (in which case the value is automatically set to the complement of the test size).
random_state: Controls the shuffling applied to the data before splitting. Pass an integer for reproducible output across multiple function calls.
shuffle: Whether or not to shuffle the data before splitting. Default is True.
stratify: If not None, data is split in a stratified fashion, using this as the class labels.
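The demonstrations in the second explanation below only exercise random_state, so here is a quick sketch of my own for the remaining parameter, stratify, which keeps class proportions the same in both splits:

from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2  # imbalanced: 80% class 0, 20% class 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
print(y_te)  # always exactly one class-1 sample among the five test samples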

ChatGPT's second explanation

I understand that the concept can be a bit confusing, so let me clarify it further.

The random_state parameter in functions like train_test_split controls the randomness of the operation. It ensures that the random processes involved (such as shuffling the data) are repeatable. By setting random_state to a specific value, you ensure that the split will be the same each time you run the code, as long as the data and the code remain unchanged.

Here’s a step-by-step explanation and demonstration:

  1. What does random_state do?

    • random_state sets the seed for the random number generator. Different seeds will produce different sequences of random numbers, and therefore different shuffles and splits of the data.
  2. Why does it matter?

    • Using the same random_state value allows you to reproduce your results exactly. This is useful for debugging, sharing your results with others, and ensuring consistent results when rerunning experiments.
  3. Different values for random_state:

    • Different random_state values will initialize the random number generator differently, leading to different shuffles of the data and, consequently, different splits.

Demonstration

Let’s demonstrate this with a simple example using train_test_split with different random_state values.

Using random_state=1

from sklearn.model_selection import train_test_split

# Sample data
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]

# Split the data with random_state=1
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.25, random_state=1)

print("random_state=1")
print("X_train:", X_train_1)
print("X_test:", X_test_1)
print("y_train:", y_train_1)
print("y_test:", y_test_1)

Using random_state=42

# Split the data with random_state=42
X_train_42, X_test_42, y_train_42, y_test_42 = train_test_split(X, y, test_size=0.25, random_state=42)

print("\nrandom_state=42")
print("X_train:", X_train_42)
print("X_test:", X_test_42)
print("y_train:", y_train_42)
print("y_test:", y_test_42)

Expected Output

For random_state=1

random_state=1
X_train: [[5, 6], [1, 2], [7, 8]]
X_test: [[3, 4]]
y_train: [0, 0, 1]
y_test: [1]

For random_state=42

random_state=42
X_train: [[7, 8], [3, 4], [1, 2]]
X_test: [[5, 6]]
y_train: [1, 1, 0]
y_test: [0]

As you can see, using random_state=1 and random_state=42 results in different splits of the data. This is because the random number generator is initialized differently with each value, resulting in different shuffling of the data before it is split.

Consistency Across Runs

If you run the above code multiple times with the same random_state value, you will get the same split every time. However, changing the random_state value will change the split.

Running Multiple Times for the Same random_state

To verify this, you can run the split multiple times with the same random_state:

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    print(f"Run {i+1}:")
    print("X_train:", X_train)
    print("X_test:", X_test)
    print("y_train:", y_train)
    print("y_test:", y_test)
    print("-" * 30)

Expected Output for Each Run with random_state=42

Run 1:
X_train: [[7, 8], [3, 4], [1, 2]]
X_test: [[5, 6]]
y_train: [1, 1, 0]
y_test: [0]
------------------------------
Run 2:
X_train: [[7, 8], [3, 4], [1, 2]]
X_test: [[5, 6]]
y_train: [1, 1, 0]
y_test: [0]
------------------------------
Run 3:
X_train: [[7, 8], [3, 4], [1, 2]]
X_test: [[5, 6]]
y_train: [1, 1, 0]
y_test: [0]
------------------------------

This demonstrates that the split is consistent across multiple runs with the same random_state. Changing the random_state value will produce a different but consistent split for that specific value.

A Good Description of the State of the Software Industry

The original is here; below it is quoted in full.

We have used too many levels of abstractions and now the future looks bleak

Published on 2023-10-21. Modified on 2023-11-06.

A big percentage of so-called experts today only know how to configure tools, but they understand nothing about how things work at the deeper level. This is a real challenge and a big problem for the future.

A steering wheel is an abstraction that makes it easier for me to drive my car. Power steering is yet another level of abstraction that further improves the driving experience. Abstractions are nice, they generally improve the quality of life. However, in Denmark we have a proverb that says:

Too little and too much spoils everything.

What good does an abstraction do when it breaks and nobody any longer understand how the technology works under the hood?

Everything in the tech industry is driven with a very hardcore eye for profit and very little interest in anything else. So you need to be able to push out new products or new services as fast as possible. This means more abstraction and more automation, less and less people, and less deeper understanding.

Today programmers and system administrators no longer exist, instead we have DevOps and even DevSecOps, in which the industry is trying very hard to stuff every single task into the job description of a single individual. The tech guys needs to do development (Dev), security (Sec) and operations (Ops), i.e. system administration, but since no single individual can truly master all that, we need to automate as much as possible in order to save money and avoid the complexities of human social interaction between different tech departments. As a result, the modern tech person is only taught about how to use specific tools, he or she then knows very little about the technology under the hood.

It doesn’t help that technology has become increasingly difficult to understand, but more and more of modern life depend heavily upon the tech we’re using. So what is going to happen when the level of understanding in the tech industry reaches such a low point in which the majority of people don’t even know how to fix the tools they are using?

"Manual scene" from the WALL-E movie.

People have become accustomed to the state of abstraction and they think it’s the correct approach and they happily contribute to the mess by adding even more abstraction.

Yes, let’s all go back to coding in assembly!

― Sarcastic comment by arrogant developer

We need abstractions, no doubt about it, but every level of abstraction comes with a heavy price which, ironically enough, eventually can cause a massive loss in profit.

Modern programming scares me in many respects, where they will just build layer after layer after layer that does nothing except translate.

― Ken Thompson

Already now a majority of "security people" know very little about security and only about how to use some kind of pre-made penetration testing tool. The penetration testing tool shows a bunch of green lights in its web GUI board and all is assumed well. Yet, a real security expert with evil intentions has broken the system long ago and keeps selling valuable data on the darknet. Nothing is leaked and nothing is discovered. This can go on for years without anyone finding out because, well, the GUI board says that all is OK.

Some students today apparently don’t even know what files and folders are.

Advice to people studying technology

Never just follow hype or trends.
Be curious. Don’t just learn tools, try to understand how the underlying technology works.
If possible, try at least once to manually do what e.g. a configuration tool does for you.
If possible, try to look at the code for the tool. Even a basic understanding of the code can be very valuable.
Stay curious. Keep learning. Experiment. Dive deeper into the technology that interests you. If possible, set up a homelab and use it as a playground for learning and breaking things.
Question everything. Especially things that don’t make any sense to you. Don’t just assume that someone else knows better – that’s how you quickly turn into a blind follower. Sometimes someone else truly knows better, but don’t just assume that to be the case by default. And be brave! Stand by the truth and your convictions even if that makes you feel like you stand alone.

People blindly following each other

The point I am making with this post is not that everything needs to be understood from first principles by everyone, or that you shouldn’t use any tools. As I said, we need abstractions. Furthermore, we have people who specialize in different fields such that e.g. the mechanic fixes the truck while the driver drives the truck.

Rather, what I am addressing is the important value of an engineer’s attitude toward technology by people working with technology.

In e.g. software development, too many specialists have been abstracted away and replaced by tools and automation and fewer and fewer people understand anything even one layer directly beneath the layer they are working on.

This is a major problem because we will eventually reach a point in which very few people can fix anything in the layers below. And the fact of the matter is that we have already partly reached that point!

About half a year ago I stumbled upon some front-end web developers who did not know that you can create a website without a deployment tool and that you don’t need any JavaScript at all, even when the website takes payment. I asked a friend of mine about it, who at the time was teaching a programming class in Python, and he said:

Don’t be surprised by this. This is the level today. The industry wants us to mass produce people who know how to "push buttons", not people who understand anything at a deeper level.

I know that there will always be some people who take an interest in the deeper levels, that is not the point. The point is that in software development specifically, we have long reached the point where we have added too many layers of abstraction and too few people understand what they are doing. The industry is shooting itself in the foot.

If e.g. I am a web developer, whether front-end or back-end, or are doing so-called "integration work", and I am producing websites without much coding or any knowledge of TCP/IP, DNS, HTTP, TLS, security, etc., using only pre-made tools or frameworks, then that will make me about as useful as a monkey with a torque wrench when something goes wrong.

If you have any comments or corrections please feel free to email them to me. Also, if you found any of the content on this website useful consider supporting me on Patreon 😉

Using Kimi and ChatGPT to Explain Decision Trees

Below is Kimi's analysis of the image above:
The file you provided appears to be a text representation of a decision tree produced by a decision-tree algorithm. The tree is used for a classification task: each node is a decision point, and each leaf node represents a classification outcome. Here is an explanation of the contents:

  1. Root node: the tree starts from the root, whose decision condition is sub_grade <= 7.5, meaning the sub-grade (probably a loan risk grade) is at most 7.5. The node's Gini impurity index is 0.5, meaning the data at this node is completely mixed: positive and negative samples each make up half.

  2. Left subtree: if sub_grade <= 7.5 is true, we descend into the left subtree. Its first decision point is total_rec_late_fee <= 9.0, meaning the total recorded late fees are at most 9.0. The node's Gini index is 0.436, slightly purer than the root.

  3. Right subtree: if sub_grade <= 7.5 is false, we descend into the right subtree. Its first decision point is sub_grade <= 15.5. The node's Gini index is 0.49, so the data on this branch is slightly less pure than on the left subtree.

  4. Further splits: each child node may split further until a stopping condition is met. For example, one branch of the left subtree splits on purpose <= 1.5, while a branch of the right subtree may split on total_rec_late_fee <= 0.5.

  5. Leaf nodes: eventually every branch reaches a leaf, which yields a classification such as class = safe or class = not safe. Leaves usually have low Gini indices, meaning the data there is classified very cleanly.

  6. Sample counts and class values: every node records its sample count (samples) and class value (value), a two-element array whose first number is the positive class (e.g. "safe") and whose second is the negative class (e.g. "not safe").

  7. Gini index: the Gini index measures how pure a dataset is; the lower the value, the purer the data. In a decision tree, we want each split to maximize information gain and thereby reduce the Gini index.

This decision tree is probably used for credit scoring or risk assessment: sub_grade likely denotes the loan's sub-grade, total_rec_late_fee the total recorded late fees, and purpose the loan purpose. The final class indicates whether the loan is judged safe or not safe.
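For reference, the Gini impurity that both assistants keep citing is 1 - Σ p_i² computed over the class proportions p_i. A quick check against the node numbers quoted in the ChatGPT explanation below:

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([17348, 17533]))  # root node below -> ~0.500 (maximally mixed)
print(gini([1055, 3566]))    # a purer node -> ~0.352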

Below is ChatGPT's explanation.

The image you’ve uploaded depicts a decision tree, which is used to model decisions and their possible consequences, in this case categorizing something as "safe" or "not safe" based on various features. Let’s break down this specific tree step-by-step:

Root Node

  • Feature: total_grade <= 7.5
  • Gini: 0.5
  • Samples: 34881
  • Class Value: [17348, 17533]
  • Decision: This is the initial decision point. If the total_grade is less than or equal to 7.5, we follow the left branch. Otherwise, we follow the right branch. The class distribution nearly equally split between safe and not safe, indicating initial uncertainty (Gini = 0.5 shows maximum impurity).

First Level

Left Branch

  • Feature: grade <= 0.5
  • Gini: 0.436
  • Samples: 10295
  • Class Value: [3303, 6992]
  • Decision: In this branch, we evaluate another feature, grade. If grade is less than or equal to 0.5, we move further left; otherwise, further right. The Gini coefficient has decreased, showing a decrease in impurity from the root.

Right Branch

  • Feature: sub_grade <= 15.5
  • Gini: 0.49
  • Samples: 24586
  • Class Value: [14045, 10541]
  • Decision: On this branch, the decision is based on sub_grade. A lower Gini than the root node, but still high, showing substantial impurity.

Further Branching

Let’s detail one more depth level:

Further Left on the First Left Branch (following grade <= 0.5)

  • Left Sub-branch:
    • Feature: None (Leaf Node)

    • Gini: 0.352

    • Samples: 4621

    • Class Value: [1055, 3566]

    • Class: safe

  • Right Sub-branch:

    • Feature: purpose <= 1.5

    • Gini: 0.371

    • Samples: 346

    • Class Value: [261, 85]

    • Class: not safe

These nodes represent further decisions and classifications based on additional splits.

Observations

  1. Path to Decision: Each path from the root to a leaf node represents a set of conditions that lead to a final classification (either "safe" or "not safe").
  2. Gini Impurity: A Gini index of 0 indicates perfect purity (all samples in a node fall into a single category). Nodes are created by selecting splits that decrease the Gini impurity among child nodes compared to their parent node.
  3. Class Labels: These show the decision at each node, based on the majority of samples’ classification if no further split can achieve better purity or other stopping criteria have been met (like maximum depth or minimum sample size at a node).

This decision tree helps in making predictions by tracing through the tree with the values of the features in the dataset, following branches according to these values until a leaf node is reached, which gives the prediction.
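A tree like the one both assistants describe can be reproduced and printed with scikit-learn. This is a generic sketch on synthetic data; the feature names are borrowed from the image purely for illustration:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Text rendering of the fitted tree, similar to what Kimi and ChatGPT read:
print(export_text(clf, feature_names=["sub_grade", "total_rec_late_fee",
                                      "purpose", "grade"]))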

Data Processing in a Decision Tree Project

Because the decision tree can't handle non-numeric values, some columns of the dataframe had to be converted to numbers, using the following code:

from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
d = defaultdict(LabelEncoder)

X_trans = X.apply(lambda x: d[x.name].fit_transform(x))
X_trans.head()

ChatGPT-4's explanation is as follows.

The expression lambda x: d[x.name].fit_transform(x) is a compact way of applying a function to each column of a pandas DataFrame. Let’s dissect this expression further to understand how it works in the context of transforming categorical data into numerical format using LabelEncoder in a Python environment.

The lambda Function

A lambda function in Python is a small anonymous function defined with the keyword lambda. The syntax is:

lambda arguments: expression

In this case, the lambda function has one argument x (representing a column of the DataFrame) and the expression involves applying the fit_transform method from a LabelEncoder instance to x.

How the lambda Function is Applied

  • x: Within the context of the apply() method, x represents each column in the DataFrame one at a time. For instance, if the DataFrame has columns "Color" and "Size", x will be the "Color" column data in one iteration and the "Size" column data in the next.

Usage of d[x.name]

  • d: This is a defaultdict containing instances of LabelEncoder. The use of defaultdict ensures that if a LabelEncoder for a particular column name does not exist, it will automatically create a new LabelEncoder instance without raising an error.

  • x.name: This attribute fetches the name of the column currently being processed (like "Color" or "Size").

  • d[x.name]: This retrieves the LabelEncoder instance for the column named x.name. If it’s the first time this column is being accessed, defaultdict initializes a new LabelEncoder for this key.

The fit_transform Method

  • fit_transform(): This method is part of LabelEncoder and serves a dual purpose:
    • fit: It learns the mapping from the labels in the column to integers. For example, for a column with labels ["Red", "Blue", "Green"], it might learn the mapping Red -> 0, Blue -> 1, Green -> 2.
    • transform: It then transforms the labels into these integers based on the mapping it has learned.

Putting It All Together

When you call X.apply(lambda x: d[x.name].fit_transform(x)), it performs the following steps for each column in DataFrame X:

  • Accesses each column as x.
  • Fetches the LabelEncoder for that column by x.name.
  • Applies fit_transform to the column, converting text labels into integers.
  • Returns a new DataFrame where each text label has been replaced by an integer.

This method is particularly useful when preparing data for machine learning models, which generally require numeric input features rather than text labels. The transformation is done in-place and efficiently for each column, utilizing the flexibility of defaultdict and the functionality of LabelEncoder.
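One practical follow-up (my note, not ChatGPT's): because every fitted LabelEncoder is kept in d, the encoding is reversible, which is handy for reporting predictions in the original labels. The column name "purpose" below is only a hypothetical example:

# Reverse the encoding column by column using the stored encoders.
X_restored = X_trans.apply(lambda x: d[x.name].inverse_transform(x))

# Or decode a single hypothetical column:
# d["purpose"].inverse_transform(X_trans["purpose"])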