Text Vectorization and Stop Words

While preparing text for vectorization, I noticed that a few words were not extracted as features, for example "I".

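ChatGPT's answer was that the vectorizer applies a default list of English stop words (i, the, and so on), and that an extra parameter turns this off:

The word "I" is missing from the feature names because CountVectorizer by default removes English stop words, which are common words like "I", "the", "is", etc., that are often filtered out because they do not contain significant meaning in the context of text analysis.

That explanation is not accurate. In scikit-learn, CountVectorizer's stop_words parameter already defaults to None, so no stop-word list is applied unless you pass one (for example stop_words="english"). "I" disappears because of the default token_pattern, r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters, so one-letter words such as "I" and "a" are dropped during tokenization. To keep them, pass a pattern that accepts single characters, such as token_pattern=r"(?u)\b\w+\b".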
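One thing worth checking independently is the tokenization step itself. As far as I can tell from the scikit-learn documentation, CountVectorizer's default token_pattern is r"(?u)\b\w\w+\b", which only matches tokens of two or more word characters. The sketch below uses only Python's re module to approximate that step (it is not scikit-learn's actual code path, and the relaxed pattern is just one possible choice):

```python
import re

# Default token_pattern documented for CountVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern (an illustrative choice): one or more word characters.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

# CountVectorizer lowercases text by default, so do the same here.
text = "I love programming in Python".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)  # "i" is missing: it is only one character long
print(relaxed_tokens)  # "i" is kept
```

The one-letter token is lost before any stop-word list could come into play, which is why changing stop_words alone does not bring it back.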

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love programming in Python",
    "Python is a great language",
    "I love coding"
]

# Create a CountVectorizer that keeps single-character tokens.
# Note: stop_words=None is already the default, so it is not what brings "i" back.
# The token_pattern below accepts one-character words, which the default pattern drops.
vect = CountVectorizer(stop_words=None, token_pattern=r"(?u)\b\w+\b")

# Fit and transform the data
X = vect.fit_transform(documents)

# Convert to dense array
X_dense = X.toarray()

# Get feature names (tokens)
feature_names = vect.get_feature_names_out()

# Print feature names and the dense array for verification
print("Feature names:", feature_names)
print("Dense array:\n", X_dense)

# Sum the counts of each token across all documents
token_counts = X_dense.sum(axis=0)

# Create a dictionary of tokens and their counts
token_count_dict = dict(zip(feature_names, token_counts))

# Print the token counts
for token, count in token_count_dict.items():
    print(f"{token}: {count}")

Here is the new output. Both "i" and "a" now appear as features:

Feature names: ['a' 'coding' 'great' 'i' 'in' 'is' 'language' 'love' 'programming' 'python']
Dense array:
 [[0 0 0 1 1 0 0 1 1 1]
 [1 0 1 0 0 1 1 0 0 1]
 [0 1 0 1 0 0 0 1 0 0]]
a: 1
coding: 1
great: 1
i: 2
in: 1
is: 1
language: 1
love: 2
programming: 1
python: 2
