Applying Mallet in document classification as binary classifier
2020-11-21

I have implemented a document classification tool using Mallet that assigns each page of a document to one of several categories. I tried Weka too, but Mallet gave better results for this task. My approach is as follows:

  1. Train the classifier on document pages with known categories.
  2. Test a few sample documents to check whether Mallet identifies the pages of a given category. Here Mallet matches pages from the test set against the known categories.
  3. If the test results are satisfactory, run the classifier and the Mallet file on a large document repository.

This part is already implemented with a good success rate.

Text documents that I have not trained on, and that differ from the known categories, should be returned as NO Match. Instead, Mallet tries to find a match in the training set even for documents it knows nothing about.

For example, I have 4 pages in a document. Page 1 belongs to class A and page 3 belongs to class B. Pages 2 and 4 do not belong to any class. How can I mark pages 2 and 4 as 'NO Match' through Mallet?

Please help me achieve this. Let me know if I am doing anything wrong, or if any other tool can give me the desired output.

Solution 1

Two quick thoughts:

  1. You can set a threshold on the confidence value. For example, if Mallet says that Page 1 belongs to Class A with 90% confidence, accept it. If it says that Page 2 belongs to Class C with 60% confidence, and that is the best score, consider rejecting that suggestion. You can get the classification scores through the function getClassificationScores (documentation: http://mallet.cs.umass.edu/api/cc/mallet/classify/MaxEnt.html#getClassificationScores(cc.mallet.types.Instance, double[])).

  2. You can use scikit-learn in Python. I have heard that if it doesn't know which class your page belongs to, it will report NA.
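
The threshold idea from the first suggestion can be sketched in plain Java. This is only an illustration under stated assumptions, not Mallet code: the class ThresholdRejector, its methods, and the NO_MATCH label are hypothetical names, and the score arrays are assumed to have been filled in beforehand (for example by the getClassificationScores call mentioned above), one score per known class.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of confidence-threshold rejection; it does not depend on Mallet. */
public class ThresholdRejector {
    /** Hypothetical label for pages whose best score is too low to trust. */
    public static final String NO_MATCH = "NO_MATCH";

    /**
     * Returns the label with the highest score, or NO_MATCH when that
     * score is below the threshold.
     */
    public static String classify(String[] labels, double[] scores, double threshold) {
        if (labels.length == 0 || labels.length != scores.length) {
            throw new IllegalArgumentException(
                "labels and scores must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return scores[best] >= threshold ? labels[best] : NO_MATCH;
    }

    /** Applies classify to every page; pageScores holds one score array per page. */
    public static List<String> classifyPages(String[] labels, double[][] pageScores,
                                             double threshold) {
        List<String> result = new ArrayList<>();
        for (double[] scores : pageScores) {
            result.add(classify(labels, scores, threshold));
        }
        return result;
    }
}
```

With made-up scores for the four-page example (classes A, B, C and a 0.6 threshold), calling ThresholdRejector.classifyPages with the rows {0.90, 0.06, 0.04}, {0.35, 0.25, 0.40}, {0.10, 0.85, 0.05} and {0.30, 0.38, 0.32} labels pages 1 and 3 as A and B and pages 2 and 4 as NO_MATCH. Keep in mind that a maximum-entropy classifier spreads probability only over the classes it was trained on, so an unfamiliar page can still receive a high score for one of them. The threshold is a heuristic and is best tuned on held-out pages that include some pages from outside the known classes.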

Comment 1:
Thank you for your suggestion. I am already using the first approach you mentioned: I set a threshold of 60% and discard any result below 60% confidence. I still need to look into the scikit-learn tools and algorithms.
Reposted from: https://stackoverflow.com/questions/28362098/applying-mallet-in-document-classification-as-binary-classifier
