[Python Machine Learning] The Math of Words in NLP: Topic Modeling

Posted by 论坛元老 | 2024-9-3 21:39:12

Contents
Zipf's Law
Relevance Ranking
Tools
Other Tools
Okapi BM25


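Word counts are useful in document vectors, but pure word counts, even when normalized by document length, do not tell us much about how important a word is in the current document relative to the other documents in the corpus. If we could work that out, we could begin to describe the documents in the corpus. Suppose we have a corpus that collects every document about kites. "Kite" would almost certainly occur many times in every document, but that provides no new information and does nothing to distinguish one document from another. A word like "construction" might not be widespread across the corpus, but for the documents in which it occurs frequently, it tells us more about what each of those documents is about. For this we need another tool.
Inverse document frequency (IDF) opens a new window for topic analysis, by way of Zipf's law. We start from the term frequency counter used earlier and extend it. Tokens can be counted and binned in two ways: per document, or across the entire corpus.
Below we count per document only: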
    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()

    # First document: an introduction to kites
    kite_txt_1 = """
    A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.
    The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).
    The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
    Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
    """

    # Second document: the history of kites
    kite_txt_2 = """
    Kites were invented in China, where materials ideal for kite building were readily available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, lightweight framework.
    The kite has been claimed as the invention of the 5th-century BC Chinese philosophers Mozi (also Mo Di) and Lu Ban (also Gongshu Ban). By 549 AD paper kites were certainly being flown, as it was recorded that in that year a paper kite was used as a message for a rescue mission. Ancient and medieval Chinese sources describe kites being used for measuring distances, testing the wind, lifting men, signaling, and communication for military operations. The earliest known Chinese kites were flat (not bowed) and often rectangular. Later, tailless kites incorporated a stabilizing bowline. Kites were decorated with mythological motifs and legendary figures; some were fitted with strings and whistles to make musical sounds while flying. From China, kites were introduced to Cambodia, Thailand, India, Japan, Korea and the western world.
    After its introduction into India, the kite further evolved into the fighter kite, known as the patang in India, where thousands are flown every year on festivals such as Makar Sankranti.
    Kites were known throughout Polynesia, as far as New Zealand, with the assumption being that the knowledge diffused from China along with the people. Anthropomorphic kites made from cloth and wood were used in religious ceremonies to send prayers to the gods. Polynesian kite traditions are used by anthropologists get an idea of early "primitive" Asian traditions that are believed to have at one time existed in Asia.
    """

    # Lowercase and tokenize each document, then record its length in tokens
    kite_intro = kite_txt_1.lower()
    intro_tokens = tokenizer.tokenize(kite_intro)
    kite_history = kite_txt_2.lower()
    history_tokens = tokenizer.tokenize(kite_history)
    intro_total = len(intro_tokens)
    print(intro_total)
    history_total = len(history_tokens)
    print(history_total)

Now that we have two tokenized kite documents, we compute the term frequency of "kite" in each one. The term frequencies are stored in two dictionaries, one per document.

    from collections import Counter

    intro_tf = {}
    history_tf = {}
    # Term frequency = occurrences of the term / total tokens in the document
    intro_counts = Counter(intro_tokens)
    intro_tf['kite'] = intro_counts['kite'] / intro_total
    history_counts = Counter(history_tokens)
    history_tf['kite'] = history_counts['kite'] / history_total
    print(intro_tf['kite'], history_tf['kite'])

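The output shows that the term frequencies of "kite" in the two documents are 0.0388 and 0.0202 respectively.
Digging a little deeper, let's look at the term frequency of another word:

    intro_tf['and'] = intro_counts['and'] / intro_total
    history_tf['and'] = history_counts['and'] / history_total
    print(intro_tf['and'], history_tf['and'])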

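By this measure, the two documents are about as relevant to "and" as they are to "kite". That does not seem useful.
A good way to think about a term's inverse document frequency is to ask: how unusual is it for this token to be in this document? If a term appears many times in one document but rarely in the rest of the corpus, we can assume it is very important to that document.
In its raw form, a term's IDF is simply the ratio of the total number of documents to the number of documents in which the term appears. In the current example, "and" and "kite" have the same IDF:


  • Total documents / documents containing "and" = 2/2 = 1
  • Total documents / documents containing "kite" = 2/2 = 1
  • Total documents / documents containing "China" = 2/1 = 2
"China" gives a different result. Let's use this rarity measure to weight the term frequencies: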
    num_docs_containing_and = 0
    for doc in [intro_tokens, history_tokens]:
        if 'and' in doc:
            num_docs_containing_and += 1
Get the term frequency of "China" in the two documents. The tokens were lowercased, so the key to look up is 'china':

    intro_tf['china'] = intro_counts['china'] / intro_total
    history_tf['china'] = history_counts['china'] / history_total

Finally, compute the IDF of the three terms. We store the IDFs in a dictionary per document, just as we did with the term frequencies:

    num_docs = 2
    # Document counts for 'kite' and 'china', obtained the same way as for 'and'
    num_docs_containing_kite = sum('kite' in doc for doc in [intro_tokens, history_tokens])
    num_docs_containing_china = sum('china' in doc for doc in [intro_tokens, history_tokens])
    intro_idf = {}
    history_idf = {}
    intro_idf['and'] = num_docs / num_docs_containing_and
    history_idf['and'] = num_docs / num_docs_containing_and
    intro_idf['kite'] = num_docs / num_docs_containing_kite
    history_idf['kite'] = num_docs / num_docs_containing_kite
    intro_idf['china'] = num_docs / num_docs_containing_china
    history_idf['china'] = num_docs / num_docs_containing_china

Then, for the intro document and the history document:

    intro_tfidf = {}
    intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
    intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
    intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']
    history_tfidf = {}
    history_tfidf['and'] = history_tf['and'] * history_idf['and']
    history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
    history_tfidf['china'] = history_tf['china'] * history_idf['china']
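Counting the documents that contain each term one term at a time gets tedious quickly. As a minimal sketch, the raw IDF ratio can be computed for the whole vocabulary in one pass. The helper names `document_frequencies` and `raw_idf_table`, and the two toy token lists, are assumptions made for illustration; they stand in for `intro_tokens` and `history_tokens`.

```python
from collections import Counter

def document_frequencies(corpus_tokens):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # set() so a term counts once per document
    return df

def raw_idf_table(corpus_tokens):
    """Raw (unlogged) IDF: total documents / documents containing the term."""
    df = document_frequencies(corpus_tokens)
    num_docs = len(corpus_tokens)
    return {term: num_docs / count for term, count in df.items()}

# Hypothetical toy documents, already lowercased and tokenized
toy_intro = "a kite is a tethered craft and a kite has wings".split()
toy_history = "kites were invented in china and the kite spread".split()

idf = raw_idf_table([toy_intro, toy_history])
for term in ("and", "kite", "china"):
    print(term, idf[term])
```

Terms that occur in both toy documents get a ratio of 1, and a term that occurs in only one gets 2, matching the hand calculation in the list above.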
Zipf's Law

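Suppose we have a corpus of 1 million documents and someone searches for the word "cat". Exactly one of those 1 million documents contains "cat". The raw IDF of this word is:
1,000,000 / 1 = 1,000,000
Suppose 10 documents contain "dog". The IDF of "dog" is then:
1,000,000 / 10 = 100,000
These two results differ enormously. Zipf would say the gap is too large, because gaps like this are likely to occur often. Zipf's law says that a word's frequency is roughly inversely proportional to its rank in the frequency table, so when we compare two words, even two that occur a similar number of times, the more frequent word can have a far higher frequency than the less frequent one. Zipf's law therefore suggests scaling word frequencies (and document frequencies) with the log() function. This ensures that words such as "cat" and "dog", which have similar counts, do not end up wildly different in the final scores. The scaling also makes the TF-IDF scores more uniformly distributed. So we should redefine IDF as the log of the inverse of the raw probability that the word occurs in a document. The term frequency can be log-scaled in the same way.
The base of the log function is not important, because we only want to make the frequency distribution more uniform, not scale it into a particular numeric range. With a base-10 log we get:
search: cat, idf = log10(1,000,000/1) = 6
search: dog, idf = log10(1,000,000/10) = 5
So each TF result is now weighted appropriately according to how often the word occurs in the language overall.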
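The effect of the log scaling is easy to check numerically. The short sketch below reuses the "cat" and "dog" document counts from the example above and compares the raw ratios with their base-10 logs; the variable names are chosen here for illustration only.

```python
import math

total_docs = 1_000_000
# Number of documents containing each word, taken from the example above
docs_containing = {"cat": 1, "dog": 10}

raw_idfs = {word: total_docs / df for word, df in docs_containing.items()}
log_idfs = {word: math.log10(idf) for word, idf in raw_idfs.items()}

for word in docs_containing:
    print(f"{word}: raw idf = {raw_idfs[word]:,.0f}, log10 idf = {log_idfs[word]:.1f}")

# How far apart the two words are before and after log scaling
print(raw_idfs["cat"] / raw_idfs["dog"], log_idfs["cat"] / log_idfs["dog"])
```

The raw scores differ by a factor of ten, while the log-scaled scores differ only by a factor of 6/5.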
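Finally, for a given term t in a given document d, within a corpus D:
tf(t, d) = (number of occurrences of t in d) / (length of d)
idf(t, D) = log10(number of documents / number of documents containing t)
tfidf(t, d, D) = tf(t, d) * idf(t, D)
So the more times a word appears in a document, the higher its TF (and hence its TF-IDF) in that document. At the same time, as the number of documents containing the word goes up, its IDF (and hence its TF-IDF) goes down. We now have a number that a computer can work with. It relates a specific word or token to a specific document in a specific corpus, and it assigns a numeric value to the importance of that word in the given document, given how the word is used across the entire corpus.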
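The three definitions translate almost directly into code. The following is a minimal sketch, not a reference implementation: the function names mirror the formulas, the three-document corpus is made up, whitespace splitting stands in for a real tokenizer, and returning 0 for a term that appears in no document is an assumption made to avoid dividing by zero.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Occurrences of term in the document divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """log10(number of documents / number of documents containing term)."""
    docs_with_term = sum(term in tokens for tokens in corpus_tokens)
    if docs_with_term == 0:
        return 0.0  # assumption: an unseen term gets zero weight
    return math.log10(len(corpus_tokens) / docs_with_term)

def tfidf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already tokenized
corpus = [
    "the kite flew over the hill".split(),
    "the dog chased the cat".split(),
    "the cat slept".split(),
]

print(tfidf("the", corpus[0], corpus))   # "the" occurs in every document
print(tfidf("kite", corpus[0], corpus))  # "kite" occurs in one document only
```

Note one consequence of the log: a term that occurs in every document has idf = log10(1) = 0, so its TF-IDF is 0 no matter how often it appears.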
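In some cases all the calculations can be done in log space, so that multiplications become additions and divisions become subtractions:

    from math import log
    # log(tf) = log(count) - log(length)
    log_tf = log(term_occurences_in_doc) - log(num_terms_in_doc)
    # log(idf), where idf = log(N) - log(df); undefined if the term is in every document
    log_idf = log(log(total_num_docs) - log(num_docs_containing_term))
    log_tf_idf = log_tf + log_idf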
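To see that the log-space form gives the same answer as the direct product, the sketch below plugs in some made-up counts (all four numbers are assumptions chosen for illustration) and compares the two results. It uses the natural log throughout.

```python
from math import exp, isclose, log

# Hypothetical counts for a single term and document
term_occurences_in_doc = 3
num_terms_in_doc = 100
total_num_docs = 1000
num_docs_containing_term = 10

# Direct computation: tf * idf
tf = term_occurences_in_doc / num_terms_in_doc
idf = log(total_num_docs / num_docs_containing_term)
tf_idf = tf * idf

# Log-space computation: log(tf) + log(idf)
log_tf = log(term_occurences_in_doc) - log(num_terms_in_doc)
log_idf = log(log(total_num_docs) - log(num_docs_containing_term))
log_tf_idf = log_tf + log_idf

print(tf_idf, exp(log_tf_idf))
print(isclose(tf_idf, exp(log_tf_idf)))
```

The log-space version breaks down when the term occurs in every document, because idf is then 0 and log(0) is undefined, so a real implementation would need to handle that case separately.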
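TF-IDF, this single number, is the humble foundation of a simple search engine. Linear algebra is not required for a full understanding of the tools used in natural language processing, but a general familiarity with how the formulas work can make their use more intuitive.
Relevance Ranking

We can easily compare two vectors to get their similarity, but we have seen that merely counting words is not as descriptive as using their TF-IDF. So in each document vector we replace each word's TF with its TF-IDF. The vectors will then reflect the meaning or topic of the document more thoroughly, as in the following example: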
    import copy
    from collections import Counter, OrderedDict
    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()

    docs = ["""
    The faster Harry got to the store, the faster and faster Harry would get home.
    """]
    docs.append("""
    Harry is hairy and faster than Jill
    """)
    docs.append("""
    Jill is not as hairy as Harry
    """)

    # Tokenize every document once (lowercased) and build the shared lexicon
    doc_tokens = []
    for doc in docs:
        doc_tokens.append(sorted(tokenizer.tokenize(doc.lower())))
    all_doc_tokens = sum(doc_tokens, [])
    lexicon = sorted(set(all_doc_tokens))
    zero_vector = OrderedDict((token, 0) for token in lexicon)

    document_tfidf_vectors = []
    for tokens in doc_tokens:
        vec = copy.copy(zero_vector)
        token_counts = Counter(tokens)
        for key, value in token_counts.items():
            # Document frequency: match whole tokens, not substrings of the raw text
            docs_containing_key = sum(key in _tokens for _tokens in doc_tokens)
            # Term frequency normalized by document length, as in tf(t, d)
            tf = value / len(tokens)
            # Raw (unlogged) IDF ratio; every key occurs in at least one document
            idf = len(docs) / docs_containing_key
            vec[key] = tf * idf
        document_tfidf_vectors.append(vec)
    print(document_tfidf_vectors)

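With this setup we have a K-dimensional vector representation of each document in the corpus, where K is the number of terms in the lexicon. In a given vector space, two vectors can be said to be similar if the angle between them is small.
Two vectors are considered similar if their cosine similarity is high. So similar vectors are the ones that maximize the cosine similarity:
cos(θ) = (A · B) / (|A| |B|)
We now have everything we need for a basic TF-IDF search. We can treat the search query itself as a document and get its TF-IDF vector representation. The last step is to find the documents whose vectors have the highest cosine similarity to the query and return them as the search results.
Here is a simple example: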
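    import math

    # Minimal cosine similarity helper so the example runs on its own
    def cosine_sim(vec1, vec2):
        """Cosine similarity of two dict vectors that share the same keys."""
        dot_prod = sum(vec1[key] * vec2[key] for key in vec1)
        mag_1 = math.sqrt(sum(x ** 2 for x in vec1.values()))
        mag_2 = math.sqrt(sum(x ** 2 for x in vec2.values()))
        if mag_1 == 0 or mag_2 == 0:
            return 0.0  # an all-zero vector is similar to nothing
        return dot_prod / (mag_1 * mag_2)

    query = "How long does it take to get to the store?"
    query_vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(query.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        # Count the documents that contain the query token as a whole token
        docs_containing_key = sum(key in _tokens for _tokens in doc_tokens)
        if docs_containing_key == 0:
            continue  # token is not in the lexicon, so it cannot match anything
        tf = value / len(tokens)
        idf = len(docs) / docs_containing_key
        query_vec[key] = tf * idf
    print(cosine_sim(query_vec, document_tfidf_vectors[0]))
    print(cosine_sim(query_vec, document_tfidf_vectors[1]))
    print(cosine_sim(query_vec, document_tfidf_vectors[2]))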

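As the output shows, document 0 is the most relevant to this query. In this way we can find relevant documents in any corpus, whether it consists of Wikipedia articles or tweets from Twitter.
For each query, however, this approach has to do an "index scan" over all the TF-IDF vectors, which is an O(N) algorithm. Most search engines can respond in constant time (O(1)) because they use an inverted index.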
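The idea behind an inverted index can be sketched in a few lines: instead of scanning every document vector, we keep a dictionary from each token to the set of documents that contain it, so that only documents sharing at least one token with the query need to be scored. The names `simple_tokenize`, `inverted_index`, and `candidate_docs` are assumptions for this illustration, and the punctuation-stripping tokenizer is a simplification rather than the Treebank tokenizer used above.

```python
from collections import defaultdict

docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill",
    "Jill is not as hairy as Harry",
]

def simple_tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    stripped = (tok.strip(".,?!") for tok in text.lower().split())
    return [tok for tok in stripped if tok]

# Map each token to the set of ids of the documents that contain it
inverted_index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for token in simple_tokenize(doc):
        inverted_index[token].add(doc_id)

def candidate_docs(query):
    """Ids of the documents that share at least one token with the query."""
    ids = set()
    for token in simple_tokenize(query):
        ids |= inverted_index.get(token, set())
    return ids

print(candidate_docs("How long does it take to get to the store?"))
print(candidate_docs("Is Jill hairy?"))
```

Each token lookup is a dictionary access, which takes constant time on average regardless of corpus size. The candidates returned here would still need to be ranked, for example by cosine similarity of their TF-IDF vectors.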
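Keyword search is just one tool in the NLP pipeline, and our goal is to build a chatbot. Most chatbots rely heavily on a search engine, and some rely on a search engine exclusively, as their only algorithm for generating replies. An extra step is needed to turn a simple search index (TF-IDF) into a chatbot: we store the training data as pairs of a question and its reply. We can then use TF-IDF to search for the question most similar to the text the user entered. Instead of returning the most similar statement in the database, we return the reply associated with that statement. Like any tough computer science problem, ours can be solved with one more layer of indirection.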
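The question-reply lookup described above can be sketched with nothing but the standard library. Everything here is an assumption for illustration: the three question-reply pairs are invented, `tokenize` is a crude whitespace tokenizer, and the fallback message for a query that matches nothing is an arbitrary choice.

```python
import math
from collections import Counter

# Hypothetical question-reply pairs standing in for real training data
faq = [
    ("how do i build a kite", "Start with a light frame and cover it with paper or fabric."),
    ("where were kites invented", "Kites were invented in China."),
    ("what is a kytoon", "A kytoon is a hybrid of a balloon and a kite."),
]

def tokenize(text):
    return text.lower().replace("?", "").split()

question_tokens = [tokenize(question) for question, _ in faq]
num_docs = len(question_tokens)
doc_freq = Counter(tok for tokens in question_tokens for tok in set(tokens))

def tfidf_vector(tokens):
    """Sparse TF-IDF vector (dict) over tokens seen in the stored questions."""
    counts = Counter(tokens)
    return {
        tok: (count / len(tokens)) * math.log10(num_docs / doc_freq[tok])
        for tok, count in counts.items()
        if tok in doc_freq  # ignore tokens never seen in a stored question
    }

def cosine(v1, v2):
    dot = sum(weight * v2.get(tok, 0.0) for tok, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

question_vectors = [tfidf_vector(tokens) for tokens in question_tokens]

def reply(user_text):
    query_vec = tfidf_vector(tokenize(user_text))
    scores = [cosine(query_vec, vec) for vec in question_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] == 0:
        return "Sorry, I don't know."  # nothing in the database matched
    return faq[best][1]  # the indirection: return the reply, not the question

print(reply("Where was the kite invented?"))
```

The search runs over the stored questions only; the replies are never indexed, which is the layer of indirection the paragraph above refers to.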
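Tools

Search was automated long ago, and plenty of implementation code is available. We can also take a shortcut with the scikit-learn package.
Below we use sklearn to build a TF-IDF matrix. The sklearn TF-IDF class is a model with .fit() and .transform() methods that follow the sklearn API shared by all of its machine learning models: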
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["""
    The faster Harry got to the store, the faster and faster Harry would get home.
    """]
    docs.append("""
    Harry is hairy and faster than Jill
    """)
    docs.append("""
    Jill is not as hairy as Harry
    """)
    corpus = docs
    vectorizer = TfidfVectorizer(min_df=1)
    # Learn the vocabulary and IDF weights, then return the TF-IDF matrix
    model = vectorizer.fit_transform(corpus)
    print(model.todense().round(2))

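With scikit-learn, the code above creates a matrix whose 3 rows represent the 3 documents and whose columns hold the TF-IDF of each term (token, or word) in the lexicon. Because the tokenization is different and punctuation is dropped, the lexicon contains only 16 terms. For large texts, this or some other pre-optimized TF-IDF model will save us a lot of work.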
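To check which terms ended up in the lexicon, and to reuse the fitted model for a query, something like the following sketch can be used. It assumes scikit-learn is installed and relies on the vectorizer's default L2 normalization, under which the dot product of two rows equals their cosine similarity; the variable names are chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill",
    "Jill is not as hairy as Harry",
]

vectorizer = TfidfVectorizer(min_df=1)
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the matrix
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(len(terms), terms)
print(matrix.shape)

# transform() reuses the fitted vocabulary and IDF weights for new text
query_vec = vectorizer.transform(["How long does it take to get to the store?"])
# Rows are L2-normalized by default, so the dot product is the cosine similarity
scores = (matrix @ query_vec.T).toarray().ravel()
print(scores.round(2))
```

Only the first document shares any terms with the query, so it is the only one with a nonzero score.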
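Other Tools

TF-IDF matrices (term-document matrices) have long been the mainstay of information retrieval (search). To improve the relevance of search results, here are some schemes for normalizing and smoothing term frequency weights:
Scheme | Definition
None
TF-IDF
TF-ICF
Okapi BM25
Search engines (information retrieval systems) match keywords (terms) between queries and the documents in a corpus.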
Okapi BM25

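Instead of simply computing the TF-IDF cosine similarity, the similarity can also be normalized and smoothed. Repeated occurrences of a term in the query document are ignored, which effectively clips the query vector's term frequencies to 1. In addition, the dot product for the cosine similarity is normalized not by the norms of the TF-IDF vectors (the number of terms in the document and in the query), but by a nonlinear function of the document length itself:
q_idf * dot(q_tf, d_tf) * 1.5 / (dot(q_tf, d_tf) + 0.25 + 0.75 * d_num_words / d_num_words.mean())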
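The expression above leaves some details open, such as whether it is evaluated per term or per document. The sketch below takes one possible reading, stated here as an assumption: the saturating ratio is applied to each distinct query term and the results are summed, with idf(t) = log10(N / df(t)). The constants 1.5, 0.25, and 0.75 are copied from the expression as given. The function name `bm25_like_scores` and the toy corpus are illustrative, and this should not be taken as the canonical Okapi BM25 formula.

```python
import math
from collections import Counter

def bm25_like_scores(query_tokens, corpus_tokens):
    """Score every document against the query with a length-normalized,
    saturating term weight (one reading of the expression above)."""
    num_docs = len(corpus_tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / num_docs
    doc_freq = Counter(tok for tokens in corpus_tokens for tok in set(tokens))
    scores = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        # Nonlinear function of document length relative to the average length
        length_norm = 0.25 + 0.75 * len(tokens) / avg_len
        score = 0.0
        for term in set(query_tokens):  # repeated query terms are ignored
            f = counts[term]
            if f == 0:
                continue
            idf = math.log10(num_docs / doc_freq[term])
            score += idf * f * 1.5 / (f + length_norm)
        scores.append(score)
    return scores

# Hypothetical pre-tokenized corpus and query
corpus = [
    "the faster harry got to the store the faster and faster harry would get home".split(),
    "harry is hairy and faster than jill".split(),
    "jill is not as hairy as harry".split(),
]
query = "how long does it take to get to the store".split()
print(bm25_like_scores(query, corpus))
```

Because the term count f appears in both the numerator and the denominator, each additional occurrence of a term adds less to the score than the previous one, and longer-than-average documents are penalized through `length_norm`.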
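By choosing the weighting scheme that gives users the most relevant results, we can optimize the pipeline. But if the corpus is not too large, consider pressing on toward more useful and more accurate representations of the meaning of words and documents. Semantic search does far better than anything TF-IDF weighting, stemming, and lemmatization could hope to achieve.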







