.NET C#: determining English POS tags from database files

Determine POS tagging in English based on database files
2021-09-14

I'm a little confused about how to determine part-of-speech (POS) tags in English. In this case, I assume that each English word has exactly one type; for example, the word "book" is recognized as a NOUN, not as a VERB. I want to classify English sentences by tense. For example, "I sent the book" is recognized as past tense.

Description:

I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. If an input word is found in one of these databases, I assume its type can be inferred. But how should I begin the lookup? For example, given "I sent the book", how do I search the databases for every word so that "I" is tagged as a noun, "sent" as a verb, "the" as an article, and "book" as a noun? Is there a better approach than searching for every word in every database? I doubt that the entries in each database are unique to it.

Here is my attempt so far.

// Requires: using System; using System.Collections.Generic; using System.Linq;

// Splits the raw input into trimmed, non-empty sentences on sentence-ending punctuation.
private List<string> ParseInput(string allInput)
{
    char[] delimiter = ".?!;".ToCharArray();

    return allInput
        .Split(delimiter, StringSplitOptions.RemoveEmptyEntries)
        .Select(s => s.Trim())
        .Where(s => s.Length > 0)
        .ToList();
}

// Click handler: tags every word in the input box by looking it up in the word lists.
// getDBList is not shown; it is assumed to return the lower-case words stored in the given file.
private void tenseReviewMenu_Click(object sender, EventArgs e)
{
    string allInput = rtbInput.Text;

    List<string> listWord = new List<string>();

    HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
    HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
    HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
    HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));

    // Word separators; the list was abbreviated in the original post ("etc...").
    char[] separator = new char[] { ' ', '\t', '\n', ',' /* etc... */ };

    List<string> listSentence = ParseInput(allInput);

    // Collect every non-blank word of every sentence.
    foreach (string sentence in listSentence)
    {
        foreach (string word in sentence.Split(separator))
            if (!string.IsNullOrWhiteSpace(word))
                listWord.Add(word.Trim());
    }

    string testPOS = "";

    foreach (string word in listWord)
    {
        string key = word.ToLowerInvariant();

        // The first list that contains the word wins, so a word such as "book"
        // that appears in both the noun and verb lists is always tagged "noun".
        // Words found in none of the lists produce no output.
        if (nounList.Contains(key))
            testPOS += "noun ";
        else if (verbList.Contains(key))
            testPOS += "verb ";
        else if (adjectiveList.Contains(key))
            testPOS += "adj ";
        else if (adverbList.Contains(key))
            testPOS += "adv ";
    }
    tbTest.Text = testPOS;
}

POS tagging is only a secondary part of my assignment, so I use a simple database-based approach to determine the tags. But if there is a simpler way to determine POS tags (easy to use, easy to understand, easy to express in pseudocode, easy to design), please let me know.
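One possible way to avoid searching every list for every word is to merge the lists into a single word-to-tags table, so that each input word needs one lookup and a word listed in several files keeps all of its tags. The following is only a minimal sketch under assumptions: the class PosLexicon and its methods are hypothetical names that do not appear in the original code, and each file is assumed to hold one word per line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original post.
public static class PosLexicon
{
    // Builds one word -> tags table from tag -> word-list pairs.
    // Lookups are case-insensitive, so callers need not lower-case input.
    public static Dictionary<string, HashSet<string>> Build(
        IDictionary<string, IEnumerable<string>> listsByTag)
    {
        var lexicon = new Dictionary<string, HashSet<string>>(
            StringComparer.OrdinalIgnoreCase);

        foreach (var list in listsByTag)
        {
            foreach (string raw in list.Value)
            {
                string word = raw.Trim();
                if (word.Length == 0) continue;

                if (!lexicon.TryGetValue(word, out HashSet<string> tags))
                {
                    tags = new HashSet<string>();
                    lexicon[word] = tags;
                }
                tags.Add(list.Key); // a word may collect several tags
            }
        }
        return lexicon;
    }

    // Convenience overload: tag -> file path, one word per line (assumed format).
    public static Dictionary<string, HashSet<string>> BuildFromFiles(
        IDictionary<string, string> pathsByTag)
    {
        var lists = new Dictionary<string, IEnumerable<string>>();
        foreach (var entry in pathsByTag)
            lists[entry.Key] = File.ReadLines(entry.Value);
        return Build(lists);
    }

    // Returns every tag recorded for the word, or an empty set if it is unknown.
    public static HashSet<string> TagsFor(
        Dictionary<string, HashSet<string>> lexicon, string word)
    {
        return lexicon.TryGetValue(word, out HashSet<string> tags)
            ? tags
            : new HashSet<string>();
    }
}
```

With such a table, a word like "book" that is present in both the noun and verb lists comes back with both tags, and choosing between them is left to a later step that looks at the surrounding words.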

Talk1:
I don't understand the question. Clearly, many words can have more than one POS (e.g. "book" can be a verb, as in "I'd like to book a hotel room"). Is the issue how to deal with this? And what does tense have to do with it? What do you mean by "recognize a sentence based on tense"? Or are you just looking for an introduction to POS tagging (in which case Stack Overflow would not be the place to go)?
Solutions1

I hope the pseudocode below proves helpful. If I find time, I'll also write some code for you.

This problem can be tackled by following the steps below:

  1. Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern, and sentences like "I sleep", "Dog barked" and "Ship will arrive" all match the S-V pattern. You can find a list of the most common English patterns here. Note that you may need to keep revising this dictionary for some time to improve the accuracy of your program.

  2. Try to fit the input sentence to one of the patterns in the dictionary you created above. For example, if the input sentence is "Snakes, unlike elephants, are venomous.", then your code must be able to find a match with the pattern "Subject, unlike AnotherSubject, Verb Object", or S-,unlike-S`-,-V-O. To perform this step successfully, you may need to write code that is good at spotting structure markers such as the word "unlike" in this example sentence.

  3. Once you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word "Snakes" would be tagged as a subject, as would the word "elephants"; the word "are" would be tagged as a verb; and finally the word "venomous" would be tagged as an object.

  4. Once you have assigned a unique tag to each word in your sentence, you can look up each word in the appropriate text file that you already have and determine whether or not your sentence is valid.

  5. If your sentence doesn't match any sentence pattern, then you have two options:

    a) Add the pattern of this unrecognized sentence to your pattern dictionary if it is a valid English sentence.

    b) Or, discard the input sentence as an invalid English sentence.
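Steps 1 to 3 could be sketched roughly as follows. This is an illustration under stated assumptions rather than a finished design: the class SentencePatterns, the pattern notation and the three sample patterns are all hypothetical, and every slot (S, V, O) consumes exactly one word, so multi-word parts such as "will arrive" are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of steps 1-3; names and pattern syntax are assumptions.
public static class SentencePatterns
{
    // Role slots. Any other pattern token is a literal structure marker
    // (a word or a comma) that must appear in the sentence as written.
    private static readonly HashSet<string> Slots =
        new HashSet<string> { "S", "V", "O" };

    // Step 1: the pattern dictionary, kept as an ordered list of token sequences.
    public static readonly List<string[]> Patterns = new List<string[]>
    {
        new[] { "S", "V" },
        new[] { "S", "V", "O" },
        new[] { "S", ",", "unlike", "S", ",", "V", "O" },
    };

    // Splits a sentence into words and commas; drops the final punctuation mark.
    public static List<string> Tokenize(string sentence)
    {
        string spaced = sentence.Trim().TrimEnd('.', '!', '?').Replace(",", " , ");
        return spaced
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }

    // Steps 2-3: returns (word, role) pairs for the first pattern that fits,
    // or null when no pattern matches.
    public static List<KeyValuePair<string, string>> Match(string sentence)
    {
        List<string> tokens = Tokenize(sentence);

        foreach (string[] pattern in Patterns)
        {
            if (pattern.Length != tokens.Count) continue;

            var roles = new List<KeyValuePair<string, string>>();
            bool fits = true;

            for (int i = 0; i < pattern.Length && fits; i++)
            {
                if (Slots.Contains(pattern[i]))
                {
                    // A slot accepts any single word, but never a comma.
                    if (tokens[i] == ",") fits = false;
                    else roles.Add(new KeyValuePair<string, string>(tokens[i], pattern[i]));
                }
                else
                {
                    fits = string.Equals(pattern[i], tokens[i],
                                         StringComparison.OrdinalIgnoreCase);
                }
            }

            if (fits) return roles;
        }
        return null;
    }
}
```

For the example sentence, SentencePatterns.Match("Snakes, unlike elephants, are venomous.") pairs "Snakes" and "elephants" with S, "are" with V and "venomous" with O. Step 4 would then check each pair against the matching word list, and a null result corresponds to step 5.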

Problems like this one are best solved using machine learning techniques, so that the system can learn new patterns. You may therefore want to include a trainer system that adds a new pattern to your pattern dictionary whenever it finds a valid English sentence that matches none of the existing patterns. I haven't thought much about how this could be done, but for now you can revise your sentence pattern dictionary manually.

I'd be glad to hear your opinion about this pseudocode, and I'm available to brainstorm it further.

Talk1:
That's amazing, Sir @Pankaj Sharma. A few days ago I decided to use OpenNLP to solve this problem, because solving it manually by brute force looks like the work of an amateur student. I don't know whether my lecturer will let me use OpenNLP or not, but I want to try. So far OpenNLP runs well, but I face a new problem in my assignment: sentence patterns. After determining the POS tags, I want to analyze the sentence pattern, such as the most common tenses (present, past, etc.).
Talk2:
Now, to check the sentence pattern, I analyze it using the CKY (Cocke-Younger-Kasami) algorithm. I have to design the sentence patterns in Chomsky Normal Form (CNF), and so far I find that difficult. Here are the most common sentence patterns in English, for example: S -> NP VP, NP -> Det N | NAME, PP -> PREP NP, VP -> V | V NP | V NP PP | V PP
Talk3:
I'd be glad to hear your opinion and to discuss this with you, since I'm a student. Any suggestions, Sir @Pankaj Sharma?
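The grammar quoted in the comments is not yet in CNF: VP -> V and NP -> NAME are unit rules, and VP -> V NP PP has three symbols on its right-hand side. One possible conversion, shown here only as a sketch, introduces a helper nonterminal (called VNP below, an invented name) for the three-symbol rule and folds the unit rules into the lexicon. The class CkyRecognizer and the toy lexicon are assumptions made for illustration; in practice the lexicon would come from the POS tags.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical CKY recognizer; the CNF conversion and lexicon are assumptions.
public static class CkyRecognizer
{
    // Binary CNF rules, written as { A, B, C } for A -> B C.
    private static readonly string[][] Rules =
    {
        new[] { "S",   "NP",   "VP" },
        new[] { "NP",  "Det",  "N"  },
        new[] { "PP",  "PREP", "NP" },
        new[] { "VP",  "V",    "NP" },
        new[] { "VP",  "V",    "PP" },
        new[] { "VNP", "V",    "NP" },  // helper for VP -> V NP PP
        new[] { "VP",  "VNP",  "PP" },
    };

    // Toy lexicon. Unit rules are folded in: a verb is also a VP (VP -> V)
    // and a name is also an NP (NP -> NAME).
    private static readonly Dictionary<string, string[]> Lexicon =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            ["John"] = new[] { "NAME", "NP" },
            ["Mary"] = new[] { "NAME", "NP" },
            ["sent"] = new[] { "V", "VP" },
            ["the"]  = new[] { "Det" },
            ["book"] = new[] { "N" },
            ["to"]   = new[] { "PREP" },
        };

    // Returns true when the words can be derived from the start symbol S.
    public static bool Recognize(string[] words)
    {
        int n = words.Length;
        if (n == 0) return false;

        // table[i, len] holds the nonterminals that span words[i .. i+len-1].
        var table = new HashSet<string>[n, n + 1];
        for (int i = 0; i < n; i++)
            for (int len = 1; len <= n - i; len++)
                table[i, len] = new HashSet<string>();

        // Spans of length 1 come straight from the lexicon.
        for (int i = 0; i < n; i++)
        {
            if (!Lexicon.TryGetValue(words[i], out string[] tags)) return false;
            table[i, 1].UnionWith(tags);
        }

        // Longer spans: try every split point and every binary rule.
        for (int len = 2; len <= n; len++)
            for (int i = 0; i + len <= n; i++)
                for (int split = 1; split < len; split++)
                    foreach (string[] rule in Rules)
                        if (table[i, split].Contains(rule[1]) &&
                            table[i + split, len - split].Contains(rule[2]))
                            table[i, len].Add(rule[0]);

        return table[0, n].Contains("S");
    }
}
```

Under this toy grammar, "John sent the book to Mary" is accepted through VP -> VNP PP, while "sent the book" is rejected because it yields only a VP and no S.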
Reposted from: https://stackoverflow.com/questions/15594626/determine-pos-tagging-in-english-based-on-database-files
