LDA and topic models

2023-09-11 05:59:15 Author: 小捧星

I have been studying LDA and topic models for several weeks, but because my mathematics is weak I cannot fully understand the underlying algorithms. I used the GibbsLDA implementation, fed it a large number of documents, and set the number of topics to 100. That gave me a file named "final.theta", which stores the proportion of each topic in each document. This result is useful: I can use the topic proportions for many other things. But when I tried Blei's C implementation of LDA, I only got a file named "final.gamma", and I don't know how to convert it into topic proportions. Can anyone help? I have also learned that LDA has many improved variants (such as CTM and HLDA). Is there a topic model similar to LDA that, given a large set of documents, directly outputs the topic proportions of each document? Thank you very much!

Recommended answer

I think the problem with the Blei implementation is that you're doing variational inference by running:

$ lda inf [args...]

when you actually want to be doing topic estimation, with:

$ lda est [args...]

Once this runs, there will be a file "final.beta" in either the current directory or the directory specified by the optional last argument. Then run the Python script "topics.py", which is included in the tarball. The readme at http://www.cs.princeton.edu/~blei/lda-c/readme.txt describes all of this, especially sections B and D.
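
To make the role of "final.beta" more concrete, here is a minimal sketch of the kind of ranking a script like topics.py performs. It is not the code shipped with lda-c. It assumes each line of the beta file is one topic, given as whitespace-separated log probabilities over the vocabulary, and that the vocabulary is a list of words in the same order. The function name `top_words` and the toy data are made up for illustration.

```python
def top_words(beta_lines, vocab, n=10):
    """Return the n highest-scoring vocabulary words for each topic.

    beta_lines: iterable of strings, one topic per line, each holding
                whitespace-separated log probabilities (assumed layout).
    vocab:      list of words, index-aligned with the columns of beta.
    """
    topics = []
    for line in beta_lines:
        log_probs = [float(x) for x in line.split()]
        if not log_probs:
            continue  # skip blank lines
        # Sort word indices from highest to lowest log probability.
        ranked = sorted(range(len(log_probs)),
                        key=lambda i: log_probs[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:n]])
    return topics


# Toy stand-in for a vocabulary file and a two-topic beta file.
vocab = ["gene", "dna", "ball", "team"]
beta = ["-0.5 -1.0 -5.0 -6.0", "-6.0 -5.0 -0.7 -0.9"]
result = top_words(beta, vocab, n=2)
print(result)
```

With real output you would pass an open file handle for "final.beta" as `beta_lines` and read the vocabulary from your own word list.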

(If this still doesn't make sense, let me know)
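
As for turning "final.gamma" into topic proportions: gamma holds the parameters of the variational Dirichlet posterior for each document, and the mean of a Dirichlet is its parameter vector divided by its sum. So one common approach is to normalize each row. The sketch below assumes the file has one document per line with one whitespace-separated value per topic. The function name `gamma_to_proportions` and the sample rows are hypothetical, and the result is an approximation of the kind of output you saw in "final.theta", not an identical quantity.

```python
def gamma_to_proportions(lines):
    """Normalize each row of Dirichlet parameters so it sums to 1.

    lines: iterable of strings, one document per line, each holding
           whitespace-separated positive numbers (assumed layout).
    """
    proportions = []
    for line in lines:
        values = [float(x) for x in line.split()]
        if not values:
            continue  # skip blank lines
        total = sum(values)
        # Posterior mean of a Dirichlet: gamma_k / sum_j gamma_j.
        proportions.append([v / total for v in values])
    return proportions


# Toy stand-in for final.gamma: 2 documents, 3 topics.
sample = ["1.0 2.0 1.0", "0.5 0.5 4.0"]
theta = gamma_to_proportions(sample)
for row in theta:
    print(" ".join(f"{p:.4f}" for p in row))
```

For a real run, open the file and pass the handle directly, for example `with open("final.gamma") as f: theta = gamma_to_proportions(f)`.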

As for improved models such as CTM: I don't know anything about HLDA, but I have used both LDA and CTM in the past, and neither is strictly better than the other; each works better on different data. CTM assumes that topics are correlated with one another (it replaces LDA's Dirichlet prior on topic proportions with a logistic normal), and it uses that assumption to improve the results when the assumption actually holds.

Hope this helps!