Latent Semantic Analysis in Practice with sklearn.decomposition.TruncatedSVD

Contents

1. sklearn.decomposition.TruncatedSVD
2. sklearn.feature_extraction.text.TfidfVectorizer
3. Code Practice
4. References

Notes on Latent Semantic Analysis (LSA) from the book 《统计学习方法》 (Statistical Learning Methods)

1. sklearn.decomposition.TruncatedSVD

See the official documentation for sklearn.decomposition.TruncatedSVD.

class sklearn.decomposition.TruncatedSVD(n_components=2,
algorithm='randomized', n_iter=5, random_state=None, tol=0.0)

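Main parameters:

n_components: default = 2, the number of topics (the dimensionality of the output)
algorithm: default = "randomized", the SVD solver to use ("randomized" or "arpack")
n_iter: optional (default 5), the number of iterations
Number of iterations for randomized SVD solver. Not used by ARPACK.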
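A minimal sketch of how these parameters are passed. The small random matrix below is made up for illustration and only stands in for a document-term matrix; random_state is fixed here just so the run is repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical 20 x 50 sparse matrix standing in for a document-term matrix.
rng = np.random.default_rng(0)
dense = rng.random((20, 50))
dense[dense < 0.8] = 0.0  # zero out most entries so the matrix is sparse
X = csr_matrix(dense)

# Default randomized solver: n_iter sets the number of iterations it runs.
svd_rand = TruncatedSVD(n_components=3, algorithm="randomized", n_iter=5, random_state=0)
X_rand = svd_rand.fit_transform(X)

# ARPACK solver: n_iter is not used by this solver.
svd_arpack = TruncatedSVD(n_components=3, algorithm="arpack")
X_arpack = svd_arpack.fit_transform(X)

print(X_rand.shape, X_arpack.shape)  # both (20, 3): 20 rows reduced to 3 components
print(svd_rand.singular_values_)
print(svd_arpack.singular_values_)
```

Both solvers return the same shape of result; on a matrix this small the choice is mostly a matter of preference.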

Attributes:

components_, shape (n_components, n_features)
explained_variance_, shape (n_components,)
The variance of the training samples transformed by a projection to each component.
explained_variance_ratio_, shape (n_components,)
Percentage of variance explained by each of the selected components.
singular_values_, shape (n_components,)
The singular values corresponding to each of the selected components.
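The following sketch checks how these attributes relate to the transformed data. The input matrix is invented for illustration, and the ARPACK solver is chosen here only because it computes the decomposition to high precision, which keeps the comparisons tight.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical 6 x 10 matrix (6 samples, 10 features).
rng = np.random.default_rng(0)
dense = rng.random((6, 10))
dense[dense < 0.5] = 0.0
X = csr_matrix(dense)

svd = TruncatedSVD(n_components=3, algorithm="arpack")
X_topics = svd.fit_transform(X)

print(svd.components_.shape)                 # (3, 10)
print(svd.explained_variance_.shape)         # (3,)
print(svd.explained_variance_ratio_.shape)   # (3,)
print(svd.singular_values_.shape)            # (3,)

# The transformed data is the projection of X onto the rows of components_.
projected = X @ svd.components_.T
print(np.allclose(projected, X_topics, atol=1e-6))

# Each column of the transformed data has a norm equal to its singular value.
col_norms = np.linalg.norm(X_topics, axis=0)
print(np.allclose(col_norms, svd.singular_values_, atol=1e-6))

# explained_variance_ is the per-column variance of the transformed data.
print(np.allclose(svd.explained_variance_, X_topics.var(axis=0)))
```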

2. sklearn.feature_extraction.text.TfidfVectorizer

See the official documentation for sklearn.feature_extraction.text.TfidfVectorizer.
It converts a collection of raw documents into a TF-IDF matrix.

class sklearn.feature_extraction.text.TfidfVectorizer(input='content',
encoding='utf-8', decode_error='strict', strip_accents=None,
lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', 
stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), 
max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, 
dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, 
sublinear_tf=False)

The parameters are explained clearly in a separate blog post.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)
print(X)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 8)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 1)	0.6876235979836938
  (1, 5)	0.5386476208856763
  (2, 8)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 6)	0.267103787642168
  (2, 0)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 4)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
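To see where these weights come from, the sketch below recomputes them by hand under the default settings (use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=False): raw counts are multiplied by idf = ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit L2 norm. The variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts, one row per document and one column per word.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then normalize each row to unit L2 norm (norm='l2').
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf, reference))  # True
print(round(tfidf[0, 2], 4))          # 0.5803, the weight of 'first' in document 0
```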

3. Code Practice

# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/1 10:27
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 17.LSA.py
# @Reference: https://cloud.tencent.com/developer/article/1530432
import numpy as np
from sklearn.decomposition import TruncatedSVD  # LSA, latent semantic analysis
from sklearn.feature_extraction.text import TfidfVectorizer  # converts a text collection into a weight matrix

# 5 documents
docs = ["Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
        "It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.",
        "Love does not delight in evil but rejoices with the truth.",
        "It always protects, always trusts, always hopes, always perseveres.",
        "Love never fails. But where there are prophecies, they will cease; where there are tongues, \
        they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)"]
vectorizer = TfidfVectorizer()
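X = vectorizer.fit_transform(docs)  # convert to a TF-IDF weight matrix
print("--------TF-IDF weights---------")
print(X)
print("--------Features (words)---------")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
print(words)
print(len(words), "features (words)")  # 52 words

topics = 4
lsa = TruncatedSVD(n_components=topics)  # latent semantic analysis with 4 topics
X1 = lsa.fit_transform(X)  # fit the model and transform the data
print("--------LSA singular values---------")
print(lsa.singular_values_)
print("--------The 5 documents in the 4-topic vector space---------")
print(X1)  # the 5 documents represented in the 4-topic vector space

pick_docs = 2  # pick the 2 most representative documents for each topic
topic_docid = [X1[:, t].argsort()[:-(pick_docs + 1):-1] for t in range(topics)]
# argsort returns the indices that sort the values in ascending order;
# the reversed slice keeps the indices of the pick_docs largest values
print("--------The 2 most representative documents per topic---------")
print(topic_docid)

# print("--------lsa.components_---------")
# print(lsa.components_)  # 4 topics x 52 words, the topic vector space
pick_keywords = 3  # pick 3 keywords for each topic
topic_keywdid = [lsa.components_[t].argsort()[:-(pick_keywords + 1):-1] for t in range(topics)]
print("--------The 3 keywords per topic---------")
print(topic_keywdid)

print("--------LSA analysis results---------")
for t in range(topics):
    print("Topic {}".format(t))
    print("\t Keywords: {}".format(", ".join(words[topic_keywdid[t][j]] for j in range(pick_keywords))))
    for i in range(pick_docs):
        # i is the rank within the topic (0 = most representative), not the document id
        print("\t\t Document {}".format(i))
        print("\t\t", docs[topic_docid[t][i]])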
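The slicing idiom argsort()[:-(k + 1):-1] used above can be hard to read at first. The toy example below, with made-up scores, shows what it returns and an equivalent way to write it.

```python
import numpy as np

scores = np.array([0.2, 0.9, -0.1, 0.5])  # made-up values for illustration
k = 2

# argsort gives the indices that sort the values in ascending order.
order = scores.argsort()
print(order)  # [2 0 3 1]

# Walking backwards from the end and stopping after k items
# yields the indices of the k largest values, largest first.
top_k = order[:-(k + 1):-1]
print(top_k)  # [1 3]

# An equivalent formulation: sort the negated scores and take the first k.
same = np.argsort(-scores)[:k]
print(same)  # [1 3]
```

Note that this ranks by signed value, so a large negative loading is never selected.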

Output of running 17.LSA.py

--------TF-IDF weights---------
  (0, 24)	0.3031801002944161
  (0, 19)	0.4547701504416241
  (0, 32)	0.2263512201359201
  (0, 22)	0.2263512201359201
  (0, 20)	0.3825669873635752
  (0, 12)	0.3031801002944161
  (0, 28)	0.4547701504416241
  (0, 14)	0.2263512201359201
  (0, 6)	0.2263512201359201
  (0, 36)	0.2263512201359201
  (1, 19)	0.28327311337182914
  (1, 20)	0.4765965465346523
  (1, 12)	0.14163655668591457
  (1, 28)	0.42490967005774366
  (1, 11)	0.21148886348790247
  (1, 30)	0.21148886348790247
  (1, 40)	0.21148886348790247
  (1, 39)	0.21148886348790247
  (1, 13)	0.21148886348790247
  (1, 2)	0.21148886348790247
  (1, 21)	0.21148886348790247
  (1, 27)	0.21148886348790247
  (1, 37)	0.21148886348790247
  (1, 29)	0.21148886348790247
  (1, 51)	0.21148886348790247
  :	:
  (3, 46)	0.22185332169737518
  (3, 17)	0.22185332169737518
  (3, 33)	0.22185332169737518
  (4, 24)	0.09483932399667956
  (4, 19)	0.09483932399667956
  (4, 20)	0.0797818291938777
  (4, 7)	0.1142518110942895
  (4, 25)	0.14161217495916
  (4, 16)	0.14161217495916
  (4, 48)	0.42483652487747997
  (4, 43)	0.42483652487747997
  (4, 3)	0.28322434991832
  (4, 34)	0.14161217495916
  (4, 44)	0.28322434991832
  (4, 49)	0.42483652487747997
  (4, 8)	0.14161217495916
  (4, 45)	0.14161217495916
  (4, 5)	0.14161217495916
  (4, 41)	0.14161217495916
  (4, 23)	0.14161217495916
  (4, 31)	0.14161217495916
  (4, 4)	0.14161217495916
  (4, 9)	0.14161217495916
  (4, 0)	0.14161217495916
  (4, 26)	0.14161217495916
--------Features (words)---------
['13', 'always', 'angered', 'are', 'away', 'be', 'boast', 'but', 'cease', 'corinthians', 'delight', 'dishonor', 'does', 'easily', 'envy', 'evil', 'fails', 'hopes', 'in', 'is', 'it', 'keeps', 'kind', 'knowledge', 'love', 'never', 'niv', 'no', 'not', 'of', 'others', 'pass', 'patient', 'perseveres', 'prophecies', 'protects', 'proud', 'record', 'rejoices', 'seeking', 'self', 'stilled', 'the', 'there', 'they', 'tongues', 'trusts', 'truth', 'where', 'will', 'with', 'wrongs']
52 features (words)
--------LSA singular values---------
[1.29695724 1.00165234 0.98752651 0.94862686]
--------The 5 documents in the 4-topic vector space---------
[[ 0.85667347 -0.00334881 -0.11274158 -0.14912237]
 [ 0.80868148  0.09220662 -0.16057627 -0.33804609]
 [ 0.46603522 -0.3005665  -0.06851382  0.82322097]
 [ 0.13423034  0.92315127  0.22573307  0.2806665 ]
 [ 0.24297388 -0.22857306  0.9386499  -0.08314939]]
--------The 2 most representative documents per topic---------
[array([0, 1], dtype=int64), array([3, 1], dtype=int64), array([4, 3], dtype=int64), array([2, 3], dtype=int64)]
--------The 3 keywords per topic---------
[array([28, 20, 19], dtype=int64), array([ 1, 46, 33], dtype=int64), array([49, 48, 43], dtype=int64), array([10, 42, 18], dtype=int64)]
--------LSA analysis results---------
Topic 0
	 Keywords: not, it, is
		 Document 0
		 Love is patient, love is kind. It does not envy, it does not boast, it is not proud.
		 Document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 1
	 Keywords: always, trusts, perseveres
		 Document 0
		 It always protects, always trusts, always hopes, always perseveres.
		 Document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 2
	 Keywords: will, where, there
		 Document 0
		 Love never fails. But where there are prophecies, they will cease; where there are tongues,         they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)
		 Document 1
		 It always protects, always trusts, always hopes, always perseveres.
Topic 3
	 Keywords: delight, the, in
		 Document 0
		 Love does not delight in evil but rejoices with the truth.
		 Document 1
		 It always protects, always trusts, always hopes, always perseveres.
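Once the model is fitted, a new text can be mapped into the same topic space and compared with the existing documents. The sketch below is one way to do that; the query sentence is invented for illustration, and cosine similarity is an assumption of this example rather than something the script above uses.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
    "It does not dishonor others, it is not self-seeking, it is not easily angered, "
    "it keeps no record of wrongs.",
    "Love does not delight in evil but rejoices with the truth.",
    "It always protects, always trusts, always hopes, always perseveres.",
    "Love never fails. But where there are prophecies, they will cease; where there are "
    "tongues, they will be stilled; where there is knowledge, it will pass away. "
    "(1 Corinthians 13:4-8 NIV)",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# random_state is fixed only to make this sketch repeatable.
lsa = TruncatedSVD(n_components=4, random_state=0)
doc_topics = lsa.fit_transform(X)

# Hypothetical query; words outside the fitted vocabulary are ignored.
query = ["Love always hopes and never fails."]
query_topics = lsa.transform(vectorizer.transform(query))

# Rank the original documents by cosine similarity in topic space.
sims = cosine_similarity(query_topics, doc_topics)[0]
ranking = np.argsort(sims)[::-1]
for idx in ranking:
    print(idx, round(float(sims[idx]), 3), docs[idx][:40])
```

The query must go through the already fitted vectorizer and model (transform, not fit_transform) so that it lands in the same vocabulary and topic space as the training documents.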

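4. References

This post draws mainly on the following article; thanks to its author!
sklearn: 利用TruncatedSVD做文本主题分析 (sklearn: Text Topic Analysis with TruncatedSVD), https://cloud.tencent.com/developer/article/1530432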
