Finding phrases that are used multiple times in a string

2023-09-11 04:27:56 Author: 做最好的自己

It's easy to count occurrences of words in a file by using a Dictionary to identify which words are used the most frequently, but given a text file, how can I find commonly used phrases where a "phrase" is a set of two or more consecutive words?

For example, here is some sample text:

Except oral wills, every will shall be in writing, but may be handwritten or typewritten. The will shall contain the testator's signature or by some other person in the testator's conscious presence and at the testator's express direction . The will shall be attested and subscribed in the conscious presence of the testator, by two or more competent witnesses, who saw the testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the testator's senses, excluding the sense of sight or sound that is sensed by telephonic, electronic, or other distant communication.

How can I identify that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) have appeared more than once (other than by brute-force searching for every set of two or three words)?

I'll be writing this in C#, so C# code would be great, but I can't even identify a good algorithm, so I'll settle for any code at all, or even pseudocode, for how to solve this.

Recommended answer

Thought I'd have a quick go at this. I'm not sure whether this is the brute-force method you were trying to avoid, but:

using System;
using System.Collections.Generic;
using System.Linq;

static void Main(string[] args)
{
    string txt = @"Except oral wills, every will shall be in writing, 
but may be handwritten or typewritten. The will shall contain the testator's 
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    // Split the string on common separators - could add more or use a regex.
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');

    // Trim each string and get rid of any empty ones.
    words = words.Select(t => t.Trim()).Where(t => t != string.Empty).ToArray();

    const int MaxPhraseLength = 20;

    Dictionary<string, int> Counts = new Dictionary<string,int>();

    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        // Only start where a full phrase of phraseLen words still fits.
        for (int i = 0; i + phraseLen <= words.Length; i++)
        {
            //get the phrase to match based on phraselen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);

            Console.WriteLine("Phrase : {0}", sphrase);

            int index = FindPhraseIndex(words, i+phrase.Length, phrase);

            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);

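                // Each occurrence that has a later repeat adds 1; seeding with 1
                // accounts for the final occurrence, which has no later repeat.
                if (!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1);

                Counts[sphrase]++;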
            }
        }
    }

    foreach (var entry in Counts)
    {
        Console.WriteLine("[{0}] - {1}", entry.Key, entry.Value);
    }

    Console.ReadKey();
}

static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

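// Returns the index of the first occurrence of matchWords in words at or
// after startIndex, or -1 if there is none.
static int FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;

        // Compare word by word; stop at the first mismatch or the end of the text.
        for (j = 0; j < matchWords.Length && (i + j) < words.Length; j++)
            if (matchWords[j] != words[i + j])
                break;

        // Every word matched: report where the match starts.
        if (j == matchWords.Length)
            return i;
    }

    return -1;
}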
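
For comparison, here is a minimal sketch of a different approach that is not part of the answer above: instead of rescanning the rest of the text for every candidate phrase, slide a window of each length over the words once and count every phrase in a Dictionary, then keep the entries seen more than once. The names PhraseCounter and CountRepeatedPhrases, the default maxLen of 5, and the short sample string in Main are all assumptions made for illustration. The separators are the same ones the answer uses, so phrases can still span sentence boundaries, and matching is case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PhraseCounter
{
    // Hypothetical helper (not from the answer above): counts every run of
    // minLen..maxLen consecutive words and keeps those seen more than once.
    public static Dictionary<string, int> CountRepeatedPhrases(
        string text, int minLen = 2, int maxLen = 5)
    {
        // Same separators as the answer's Split call; empty entries are dropped.
        string[] words = text.Split(
            new[] { ',', '.', ';', ' ', '\n', '\r' },
            StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();

        for (int len = minLen; len <= maxLen; len++)
        {
            // Slide a window of `len` words across the text.
            for (int i = 0; i + len <= words.Length; i++)
            {
                string phrase = string.Join(" ", words, i, len);
                counts.TryGetValue(phrase, out int seen); // seen is 0 if absent
                counts[phrase] = seen + 1;
            }
        }

        // Keep only phrases that occur at least twice.
        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    static void Main()
    {
        // Assumed sample input, shortened from the question's text.
        string sample =
            "the conscious presence of the testator and the conscious presence again";

        foreach (var kv in CountRepeatedPhrases(sample).OrderByDescending(kv => kv.Value))
            Console.WriteLine("[{0}] - {1}", kv.Key, kv.Value);
    }
}
```

This does roughly one dictionary update per word per phrase length. Note that when a longer phrase repeats, its shorter sub-phrases are reported as well, so some post-filtering may be wanted depending on the use case.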