{"id":777,"date":"2024-05-26T11:43:10","date_gmt":"2024-05-26T03:43:10","guid":{"rendered":"http:\/\/madapapa.com\/wordpress\/?p=777"},"modified":"2024-05-26T22:33:36","modified_gmt":"2024-05-26T14:33:36","slug":"wen-ben-xiang-liang-hua-he-ci-pin-yi-ji-xi-shu-ju","status":"publish","type":"post","link":"http:\/\/madapapa.com\/wordpress\/?p=777","title":{"rendered":"Text Vectorization and Word Frequency, plus Sparse and Dense Matrices"},"content":{"rendered":"<h2><a id=\"%E6%96%87%E6%9C%AC%E5%90%91%E9%87%8F%E5%8C%96%E5%92%8C%E8%AF%8D%E9%A2%91%E7%BB%9F%E8%AE%A1%E7%A4%BA%E4%BE%8B\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Text Vectorization and Word Frequency Counting Example<\/h2>\n<p>The code below converts a small set of documents into a sparse matrix and then, to make the result easier to inspect (and because some algorithms do not support sparse matrices), converts it into a dense matrix.<\/p>\n<pre><code class=\"language-plain_text\">from sklearn.feature_extraction.text import CountVectorizer\n\n# Sample text data\ndocuments = [\n    &quot;I love programming in Python&quot;,\n    &quot;Python is a great language&quot;,\n    &quot;I love coding&quot;\n]\n\n# Create an instance of CountVectorizer\nvect = CountVectorizer()\n\n# Fit and transform the data\nX = vect.fit_transform(documents)\n\n# Convert to dense array\nX_dense = X.toarray()\n\n# Get feature names (tokens)\nfeature_names = vect.get_feature_names_out()\n\n# Print feature names and the dense array for verification\nprint(&quot;Feature names:&quot;, feature_names)\nprint(&quot;Dense array:\\n&quot;, X_dense)\n\n# Sum the counts of each token across all documents\ntoken_counts = X_dense.sum(axis=0)\n\n# Create a dictionary of tokens and their 
counts\ntoken_count_dict = dict(zip(feature_names, token_counts))\n\n# Print the token counts\nfor token, count in token_count_dict.items():\n    print(f&quot;{token}: {count}&quot;)\n\n<\/code><\/pre>\n<h2><a id=\"%E8%BE%93%E5%87%BA%E7%9A%84%E7%BB%93%E6%9E%9C%E5%A6%82%E4%B8%8B\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Output<\/h2>\n<p>This example prints the feature names, the dense array, and the token counts:<\/p>\n<pre><code class=\"language-plain_text\">Feature names: ['coding' 'great' 'in' 'is' 'language' 'love' 'programming' 'python']\nDense array:\n [[0 0 1 0 0 1 1 1]\n [0 1 0 1 1 0 0 1]\n [1 0 0 0 0 1 0 0]]\ncoding: 1\ngreat: 1\nin: 1\nis: 1\nlanguage: 1\nlove: 2\nprogramming: 1\npython: 2\n\n<\/code><\/pre>\n<h2><a id=\"explanation-of-the-output\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Explanation of the Output<\/h2>\n<p>Feature Names:<\/p>\n<p>The feature names are printed in the same order as the columns of the dense array: [&#8216;coding&#8217; &#8216;great&#8217; &#8216;in&#8217; &#8216;is&#8217; &#8216;language&#8217; &#8216;love&#8217; &#8216;programming&#8217; &#8216;python&#8217;]. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more characters, which is why &#8216;I&#8217; and &#8216;a&#8217; are not among the features.<br \/>\nDense Array:<\/p>\n<p>The dense array shows the token counts for each document: each row is a document, and each column corresponds to the feature name at the same position.<br \/>\nFor example, the first column corresponds to &#8216;coding&#8217;, the second column to &#8216;great&#8217;, and so on.<br \/>\nToken Counts:<\/p>\n<p>The token counts dictionary shows the total count of each token across all documents, which equals the sum of the corresponding column of the dense array.<br \/>\nVerification<br \/>\nTo verify the correspondence, compare the dense array with the feature names:<\/p>\n<p>The first column in X_dense corresponds to &#8216;coding&#8217;. 
In the dense array, the first column has counts [0, 0, 1], meaning &#8216;coding&#8217; appears once in the third document.<br \/>\nThe second column corresponds to &#8216;great&#8217;. The counts are [0, 1, 0], meaning &#8216;great&#8217; appears once in the second document.<br \/>\nThis pattern continues for all feature names and their corresponding columns.<br \/>\nConclusion<br \/>\nThe sequence of the feature names is the same as the columns of the dense array. Each column in the dense array represents the count of a specific token, and the order of these tokens is given by feature_names.<\/p>\n<h2><a id=\"sparse-matrix-vs-dense-array\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Sparse Matrix vs. Dense Array<\/h2>\n<p>When using CountVectorizer to transform text data into a matrix of token counts, the result is a sparse matrix by default. Let&#8217;s explore the differences between sparse matrices and dense arrays, and why one might be preferred over the other in certain contexts.<\/p>\n<h3><a id=\"sparse-matrix\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Sparse Matrix<\/h3>\n<p>A sparse matrix is a matrix in which most of the elements are zero. Instead of storing every element, sparse matrices store only the non-zero elements and their positions. This can lead to significant memory savings when dealing with large datasets where the number of zeros vastly outnumbers the number of non-zero elements.<\/p>\n<p>Advantages:<br \/>\nMemory Efficiency: Sparse matrices save memory by only storing non-zero elements. 
This is crucial for large datasets with many features (e.g., in text processing, where the vocabulary is large but each document contains only a small subset of it).<br \/>\nPerformance: Certain operations can be faster on sparse matrices because they skip the zero entries.<br \/>\nDisadvantages:<br \/>\nComplexity: Sparse matrices are more complex to manipulate and understand because they don&#8217;t store data in a straightforward row-by-row manner.<\/p>\n<h3><a id=\"dense-array\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Dense Array<\/h3>\n<p>A dense array, on the other hand, stores all elements explicitly, including the zero elements. This means it takes up more memory but is simpler to understand and manipulate.<\/p>\n<p>Advantages:<br \/>\nSimplicity: Dense arrays are easier to work with because they store data in a straightforward manner, where each element corresponds directly to a position in the matrix.<br \/>\nCompatibility: Some algorithms and libraries work only with dense arrays, not sparse matrices.<br \/>\nDisadvantages:<br \/>\nMemory Usage: Dense arrays can consume a lot of memory if the dataset is large and contains many zero elements.<\/p>\n<h3><a id=\"%E7%A4%BA%E4%BE%8B%E8%A7%A3%E9%87%8A\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Example Explanation<\/h3>\n<p>Interpretation<br \/>\nSparse Matrix:<\/p>\n<p>Efficiently stores data when most elements are zero.<br \/>\nExample representation for the three documents above (only the non-zero values and their (row, column) positions, listed here in row and column order):<\/p>\n<pre><code class=\"language-plain_text\">(0, 2) 1\n(0, 5) 1\n(0, 6) 1\n(0, 7) 1\n(1, 1) 1\n(1, 3) 1\n(1, 4) 1\n(1, 7) 1\n(2, 0) 1\n(2, 5) 1\n<\/code><\/pre>\n<p>Dense Array:<\/p>\n<p>Simpler to understand as it stores all elements explicitly.<br \/>\nExample representation of the same matrix:<\/p>\n<pre><code class=\"language-plain_text\">[[0 0 1 0 0 1 1 1]\n [0 1 0 1 1 0 0 1]\n [1 0 0 0 0 1 0 0]]\n<\/code><\/pre>\n<p>Each row corresponds to a document, and each column corresponds to a token. The values represent the count of each token in the respective document.<\/p>\n<p>In summary, sparse matrices are memory-efficient and suitable for large datasets with many zero elements, while dense arrays are straightforward and easier to work with for smaller datasets or when simplicity is desired.<\/p>\n<h2><a id=\"%E6%9D%A5%E8%87%AAkimi%E7%9A%84%E4%B8%AD%E6%96%87%E8%A7%A3%E9%87%8A\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Explanation from Kimi (Translated from Chinese)<\/h2>\n<p>In computer science and mathematics, sparse matrices and dense matrices are two kinds of matrix that differ in how they are stored and processed. The distinction depends mainly on the number of non-zero elements in the matrix.<\/p>\n<h3><a id=\"%E7%A8%A0%E5%AF%86%E7%9F%A9%E9%98%B5%EF%BC%88dense-matrix%EF%BC%89\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Dense Matrix<\/h3>\n<p>A dense matrix is a matrix in which most elements are non-zero: the number of non-zero elements is close to the total number of elements. Dense matrices are usually represented as a full two-dimensional array, with storage allocated for every element.<\/p>\n<p><strong>Characteristics<\/strong>:<\/p>\n<ul>\n<li>Storage requirements are higher, because every element needs storage.<\/li>\n<li>Matrix operations (such as addition and multiplication) usually require more computing resources.<\/li>\n<li>Dense matrices are common in fields such as data analysis and image processing, where the dataset contains many non-zero elements.<\/li>\n<\/ul>\n<h3><a id=\"%E7%A8%80%E7%96%8F%E7%9F%A9%E9%98%B5%EF%BC%88sparse-matrix%EF%BC%89\" class=\"anchor\" aria-hidden=\"true\"><span class=\"octicon octicon-link\"><\/span><\/a>Sparse Matrix<\/h3>\n<p>A sparse matrix is a matrix in which most elements are zero: the number of non-zero elements is far smaller than the total number of elements. To save storage and improve computational efficiency, sparse matrices are usually not stored as a full two-dimensional array. Instead, special data structures store the non-zero elements together with their positions.<\/p>\n<p><strong>Characteristics<\/strong>:<\/p>\n<ul>\n<li>Storage requirements are lower, because only the non-zero elements and their positions need to be stored.<\/li>\n<li>Matrix operations can be more efficient, because the many zero elements can be skipped.<\/li>\n<li>Sparse matrices are very common in many applications, such as text processing (term-frequency matrices), social network analysis, and large-scale numerical simulation.<\/li>\n<\/ul>\n<p><strong>Storage formats for sparse matrices<\/strong>:<\/p>\n<ul>\n<li><strong>Triplet List<\/strong>: a list that stores every non-zero element together with its row index and column index.<\/li>\n<li><strong>Compressed storage formats<\/strong>: usually one of two kinds, Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).\n<ul>\n<li><strong>CSR<\/strong>: compressed by row; well suited to row operations.<\/li>\n<li><strong>CSC<\/strong>: compressed by column; well suited to column operations.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Coordinate List<\/strong>: similar to the triplet list, but usually not sorted by row or column.<\/li>\n<\/ul>\n<p>In practice, the choice between a sparse and a dense matrix depends on the requirements of the specific problem and the characteristics of the data. Sparse matrices have a clear advantage for large-scale sparse data, where they can significantly reduce storage requirements and improve computational efficiency. Dense matrices are suitable when most elements need to take part in the computation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Text Vectorization and Word Frequency Counting Example The code below converts a small set of documents into a sparse matrix and then, to make the result easier to inspect (and because some algorithms do not support sparse matrices), converts it into a dense matrix. from sklearn.feature_extraction.text import CountVectorizer # Sample text data documents = [ &quot;I love programming in Python&quot;, &quot;Python is a great language&quot;, &quot;I love coding&quot; ] # Create an instance of CountVectorizer vect = CountVectorizer() # Fit and transform the data X = vect.fit_transform(documents) # Convert to dense array X_dense = X.toarray() # Get &hellip; <a 
href=\"http:\/\/madapapa.com\/wordpress\/?p=777\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Text Vectorization and Word Frequency, plus Sparse and Dense Matrices<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[48,47],"tags":[],"class_list":["post-777","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-datascience"],"_links":{"self":[{"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/777","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=777"}],"version-history":[{"count":2,"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/777\/revisions"}],"predecessor-version":[{"id":781,"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/777\/revisions\/781"}],"wp:attachment":[{"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=777"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/madapapa.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=777"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\
/\/madapapa.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=777"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}