
Word Frequency Using Dictionary

My problem is that I can't figure out how to display the word counts using a dictionary whose keys are the word lengths. For example, consider the following piece of text: 'This is the s

Solution 1:

def freqCounter(infilepath):
    answer = {}
    with open(infilepath) as infile:
        for line in infile:  # iterate over the open file, not the path string
            for word in line.strip().split():
                l = len(word)
                if l not in answer:
                    answer[l] = 0
                answer[l] += 1
    return answer

Alternatively:

import collections
def freqCounter(infilepath):
    with open(infilepath) as infile:
        return collections.Counter(len(word) for line in infile for word in line.strip().split())
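To try this out, here is a minimal sketch that runs the Counter-based version against a small temporary file. The name freq_counter, the temporary file, and the sample sentence split over two lines are assumptions made for illustration only:

```python
import collections
import os
import tempfile

def freq_counter(infilepath):
    # Map word length -> number of words of that length.
    with open(infilepath) as infile:
        return collections.Counter(
            len(word) for line in infile for word in line.split()
        )

# Write a sample sentence to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("This is the sample text\nto get an idea\n")
    path = tmp.name

try:
    counts = freq_counter(path)
finally:
    os.remove(path)  # clean up the temporary file

# Display the counts ordered by word length.
for length in sorted(counts):
    print(length, counts[length])
```

Since the result is a dict subclass, counts[length] works exactly as it does with the plain dictionary from the first version.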

Solution 2:

Use collections.Counter

import collections

sentence = "This is the sample text to get an idea"

Count = collections.Counter([len(a) for a in sentence.split()])

print(Count)
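Printing the Counter shows its dictionary-style repr. If you want one line per word length instead, a possible sketch is to walk the keys in sorted order (the variable names here are my own, not from the question):

```python
import collections

sentence = "This is the sample text to get an idea"
count = collections.Counter(len(word) for word in sentence.split())

# Build one report line per length, ordered by length.
lines = ["%d word(s) of length %d" % (count[length], length)
         for length in sorted(count)]
print("\n".join(lines))
```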

Solution 3:

To count how many words in a text have each length (a length -> frequency distribution), you could use a regular expression to extract the words:

#!/usr/bin/env python3
import re
from collections import Counter

text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = Counter(map(len, words)).most_common() 
print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in frequencies]))

Output

3 word(s) of length 2
3 word(s) of length 4
2 word(s) of length 3
1 word(s) of length 6

Note: unlike the .split()-based solutions, this automatically ignores punctuation such as the !. after 'idea'.
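To make the difference concrete, here is a small comparison on the same sample text. It is only a sketch; the names split_lengths and regex_lengths are made up for the illustration:

```python
import re
from collections import Counter

text = "This is the sample text to get an idea!. "

# .split() keeps trailing punctuation attached, so 'idea!.' counts as 6 characters.
split_lengths = Counter(len(word) for word in text.split())

# \w+ matches only word characters, so 'idea' counts as 4 characters.
regex_lengths = Counter(len(word) for word in re.findall(r'\w+', text))

print(sorted(split_lengths.items()))
print(sorted(regex_lengths.items()))
```

With .split() the last token is counted as a 6-character word, while the regular expression counts it as a 4-character word.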

To read words from a file, you could read it line by line and extract the words from each line in the same way as is done for text in the first code example:

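import re
from collections import Counter
from itertools import chain

# filename is the path to your input text file
with open(filename) as file:
    words = chain.from_iterable(re.findall(r'\w+', line.casefold())
                                for line in file)
    # words is a lazy iterator, so consume it while the file is still open
    frequencies = Counter(map(len, words)).most_common()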

print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in frequencies]))

In practice, you could use a list to find the length frequency distribution if you ignore words that are longer than a threshold:

def count_lengths(words, maxlen=100):
    frequencies = [0] * (maxlen + 1)
    for length in map(len, words):
        if length <= maxlen:
            frequencies[length] += 1
    return frequencies

Example

import re

text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = count_lengths(words)
print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in enumerate(frequencies) if n > 0]))

Output

3 word(s) of length 2
2 word(s) of length 3
3 word(s) of length 4
1 word(s) of length 6
