Python: counting the total number of words with 2 or more letters in a file

2023-09-11 05:29:42 Author: 一個很呆很呆的呆瓜

I have a small Python script that calculates the 10 most frequent words, the 10 least frequent words, and the total number of words in a .txt document. According to the assignment, a word is defined as 2 letters or more. The 10 most frequent and 10 least frequent words print fine; however, when I try to print the total number of words in the document, it prints the total of all words, including single-letter words (such as "a"). How can I get the total to count ONLY the words that have 2 letters or more?

Here is my script:

from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1


"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words

I am not an expert with Python, this is for a Python class I am currently taking. The neatness of my code and proper formatting count against me in this assignment, if possible can someone also tell me if the format of this code is considered "good practice"?

Recommended answer

The list comprehension method:

def countWords(s):
    words = s.split()
    return len([word for word in words if len(word)>=2])

The verbose method:

def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count
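
Both versions above decide by token length alone, so a token such as "a," or "42" would still be counted. A minimal sketch of a stricter count, assuming the total should follow the same definition of a word as the regular expression in the question (punctuation stripped, lowercased, two or more letters). It is written for Python 3, and the name count_valid_words is hypothetical, not part of the original script:

```python
import re
from string import punctuation

# Same definition of a word as in the question: two or more lowercase letters.
WORDS_ONLY = re.compile(r'^[a-z]{2,}$')


def count_valid_words(line):
    """Return how many whitespace-separated tokens in `line` qualify as words."""
    count = 0
    for token in line.split():
        # Drop surrounding punctuation and normalise case before matching.
        cleaned = token.strip(punctuation).lower()
        if WORDS_ONLY.match(cleaned):
            count += 1
    return count
```

In the question's loop, this would stand in for count_words(line), so the running total and the frequency counter apply the same filter.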

As an aside, kudos on using defaultdict, but I would go with collections.Counter:

import collections

# Count only words of 2+ letters; filepath is the path to the text file.
with open(filepath) as f:
    words = collections.Counter(
        w for line in f for w in line.split() if len(w) >= 2)
mostFrequent = [w for w, _ in words.most_common(10)]
leastFrequent = [w for w, _ in words.most_common()[-10:]]
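
To tie the Counter approach back to the total asked about in the question: once only qualifying words go into the counter, the total is simply the sum of its counts. The following is a sketch under stated assumptions rather than a definitive implementation. It targets Python 3, reuses the regular expression from the question, and the function name summarize_words and its parameters are hypothetical:

```python
import re
from collections import Counter
from string import punctuation

# Same definition of a word as in the question: two or more lowercase letters.
WORDS_ONLY = re.compile(r'^[a-z]{2,}$')


def summarize_words(lines, number=10):
    """Return (most frequent, least frequent, total) for an iterable of lines."""
    counter = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(punctuation).lower()
            if WORDS_ONLY.match(word):
                counter[word] += 1

    # Break ties alphabetically, as the sorting in the question does.
    most = sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:number]
    least = sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]

    # Every counted occurrence is a valid word, so the sum is the total.
    total = sum(counter.values())
    return most, least, total
```

Because an open file is itself an iterable of lines, it can be passed directly, for example inside `with open('charactermask.txt') as f:` followed by `most, least, total = summarize_words(f)`.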

Hope this helps