Text Vectorization and Stop Words

While preparing for text vectorization, I noticed that a few words were not extracted as features, for example "I".

ChatGPT suggested that the vectorizer applies a default set of English stop words (e.g. "i", "the"), but that is not quite accurate: CountVectorizer's stop_words parameter defaults to None, so no stop words are removed unless you opt in. The word "I" is actually missing because the default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, so single-letter words such as "I" and "a" are silently dropped. To keep them, pass a token_pattern that also accepts single characters:

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love programming in Python",
    "Python is a great language",
    "I love coding"
]

# Override the default token_pattern (r"(?u)\b\w\w+\b"), which only matches
# tokens of two or more word characters and therefore drops "I" and "a"
vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

# Fit and transform the data
X = vect.fit_transform(documents)

# Convert to dense array
X_dense = X.toarray()

# Get feature names (tokens)
feature_names = vect.get_feature_names_out()

# Print feature names and the dense array for verification
print("Feature names:", feature_names)
print("Dense array:\n", X_dense)

# Sum the counts of each token across all documents
token_counts = X_dense.sum(axis=0)

# Create a dictionary of tokens and their counts
token_count_dict = dict(zip(feature_names, token_counts))

# Print the token counts
for token, count in token_count_dict.items():
    print(f"{token}: {count}")

Below is the new output; note that the single-letter token "a" now appears as well:

Feature names: ['a' 'coding' 'great' 'i' 'in' 'is' 'language' 'love' 'programming' 'python']
Dense array:
 [[0 0 0 1 1 0 0 1 1 1]
  [1 0 1 0 0 1 1 0 0 1]
  [0 1 0 1 0 0 0 1 0 0]]
a: 1
coding: 1
great: 1
i: 2
in: 1
is: 1
language: 1
love: 2
programming: 1
python: 2

Text Vectorization, Term Frequency, and Sparse vs. Dense Matrices

Example: Text Vectorization and Term Frequency Counting

The code below converts a set of documents into a sparse matrix, and then, for easier handling (and because some algorithms do not support sparse matrices), converts it into a dense matrix.

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love programming in Python",
    "Python is a great language",
    "I love coding"
]

# Create an instance of CountVectorizer
vect = CountVectorizer()

# Fit and transform the data
X = vect.fit_transform(documents)

# Convert to dense array
X_dense = X.toarray()

# Get feature names (tokens)
feature_names = vect.get_feature_names_out()

# Print feature names and the dense array for verification
print("Feature names:", feature_names)
print("Dense array:\n", X_dense)

# Sum the counts of each token across all documents
token_counts = X_dense.sum(axis=0)

# Create a dictionary of tokens and their counts
token_count_dict = dict(zip(feature_names, token_counts))

# Print the token counts
for token, count in token_count_dict.items():
    print(f"{token}: {count}")

The example prints the feature names, the dense array, and the token counts:

Feature names: ['coding' 'great' 'in' 'is' 'language' 'love' 'programming' 'python']
Dense array:
 [[0 0 1 0 0 1 1 1]
  [0 1 0 1 1 0 0 1]
  [1 0 0 0 0 1 0 0]]
coding: 1
great: 1
in: 1
is: 1
language: 1
love: 2
programming: 1
python: 2

Explanation of the Output

Feature Names:

  • The feature names are printed in the same order as the columns of the dense array: ['coding' 'great' 'in' 'is' 'language' 'love' 'programming' 'python'].

Dense Array:

  • The dense array shows the token counts for each document, with each column corresponding to the respective feature name.
  • For example, the first column corresponds to 'coding', the second column to 'great', and so on.

Token Counts:

  • The token counts dictionary shows the total count of each token across all documents, matching the column sums of the dense array.

Verification

To verify the correspondence, compare the dense array with the feature names:

  • The first column of X_dense corresponds to 'coding'. Its counts are [0, 0, 1], meaning 'coding' appears once in the third document.
  • The second column corresponds to 'great'. Its counts are [0, 1, 0], meaning 'great' appears once in the second document.

This pattern continues for all feature names and their corresponding columns.

Conclusion

The sequence of the feature names matches the columns of the dense array: each column holds the count of one specific token, and the order of these tokens is given by feature_names.

Sparse Matrix vs. Dense Array

When using CountVectorizer to transform text data into a matrix of token counts, the result is a sparse matrix by default. Let’s explore the differences between sparse matrices and dense arrays, and why one might be preferred over the other in certain contexts.

Sparse Matrix

A sparse matrix is a matrix in which most of the elements are zero. Instead of storing every element, sparse matrices store only the non-zero elements and their positions. This can lead to significant memory savings when dealing with large datasets where the number of zeros vastly outnumbers the number of non-zero elements.

Advantages:
  • Memory efficiency: sparse matrices save memory by storing only the non-zero elements. This is crucial for large datasets with many features (e.g., in text processing, where the vocabulary is large but each document contains only a small subset of it).
  • Performance: certain operations can be faster on sparse matrices due to the reduced amount of data.

Disadvantages:
  • Complexity: sparse matrices are more complex to manipulate and inspect because they do not store data in a straightforward row-by-row manner.

Dense Array

A dense array, on the other hand, stores all elements explicitly, including the zero elements. This means it takes up more memory but is simpler to understand and manipulate.

Advantages:
  • Simplicity: dense arrays are easier to work with because each element corresponds directly to a position in the matrix.
  • Compatibility: some algorithms and libraries work only with dense arrays, not sparse matrices (see the sketch below).

Disadvantages:
  • Memory usage: dense arrays can consume a lot of memory when the dataset is large and contains many zero elements.

Example Interpretation

Sparse Matrix:

Efficiently stores data when most elements are zero.
Example representation for the default example above (printing the sparse matrix shows only the non-zero values and their (row, column) positions):

(0, 2) 1
(0, 5) 1
(0, 6) 1
(0, 7) 1
(1, 1) 1
(1, 3) 1
(1, 4) 1
(1, 7) 1
(2, 0) 1
(2, 5) 1

Dense Array:

Simpler to understand as it stores all elements explicitly.
Example representation (matching the dense array printed above):

[[0 0 1 0 0 1 1 1]
 [0 1 0 1 1 0 0 1]
 [1 0 0 0 0 1 0 0]]

Each row corresponds to a document, and each column corresponds to a token. The values represent the count of each token in the respective document.

In summary, sparse matrices are memory-efficient and suitable for large datasets with many zero elements, while dense arrays are straightforward and easier to work with for smaller datasets or when simplicity is desired.

Explanation from Kimi

In computer science and mathematics, sparse matrices and dense matrices are two different kinds of matrices that differ in how they are stored and processed, depending mainly on the number of non-zero elements in the matrix.

Dense Matrix

A dense matrix is a matrix in which most of the elements are non-zero; the number of non-zero elements is close to the total number of elements in the matrix. A dense matrix is usually represented as a full two-dimensional array, with storage allocated for every element.

Characteristics

  • Larger storage requirements, since every element needs its own storage space.
  • Matrix operations (addition, multiplication, etc.) typically require more computational resources.
  • Common in fields such as data analysis and image processing, when the dataset contains many non-zero elements.

Sparse Matrix

A sparse matrix is a matrix in which most of the elements are zero; the number of non-zero elements is far smaller than the total number of elements. To save storage space and improve computational efficiency, a sparse matrix is usually not stored as a full two-dimensional array; instead, special data structures record only the non-zero elements and their position information.

Characteristics

  • Smaller storage requirements, since only the non-zero elements and their positions need to be stored.
  • Matrix operations can be more efficient, because the large number of zero elements can be skipped.
  • Very common in many applications, such as text processing (term-frequency matrices), social network analysis, and large-scale numerical simulation.

Sparse Matrix Storage Formats

  • Triplet list: a list of all non-zero elements together with their row and column indices.
  • Compressed formats: usually one of two kinds, Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).
    • CSR: compressed by row, well suited to row operations.
    • CSC: compressed by column, well suited to column operations.
  • Coordinate list (COO): similar to the triplet list, but usually not sorted by row or column (see the sketch below).

In practice, the choice between a sparse and a dense matrix depends on the requirements of the specific problem and the characteristics of the data. Sparse matrices have a clear advantage for large-scale sparse data, significantly reducing storage requirements and improving computational efficiency, while dense matrices suit cases where most elements take part in the computation.