
Processing Lines Of Text File Between Two Marker Lines

My code processes lines read from a text file (see 'Text Processing Details' at end). I need to amend my code so that it carries out the same task, but only with words in between

Solution 1:

What appears to be one task, "count the words between two marker lines", is actually several. Separate the different tasks and decisions into separate functions and generators, and it will be vastly easier.

Step 1: Separate the file I/O from the word counting. Why should the word-counting code care where the words came from?

Step 2: Separate selecting the lines to process from the file handling and the word counting. Why should the word-counting code be given words it's not supposed to count? This is still far too big a job for one function, so it will be broken down further. (This is the part you're asking about.)

Step 3: Process the text. You've already done that, more or less. (I'll assume your text-processing code ends up in a function called words).

1. Separate file I/O

Reading text from a file is really two steps: first, open and read the file, then strip the newline off each line. These are two jobs.

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt', encoding='utf8') as flines:
        for line in stripped_lines(flines):
            yield line

Not a hint of your text processing here. The lines_from_file generator just yields whatever strings were found in the file... after stripping their trailing newline. (Note that a plain strip() would also remove leading and trailing whitespace, which you have to preserve to identify marker lines.)
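As a quick check of the whitespace point, here is a minimal sketch that feeds stripped_lines a list of strings instead of a file. The sample strings are made up for illustration.

```python
def stripped_lines(lines):
    # Same generator as above: remove only the trailing newline.
    for line in lines:
        yield line.rstrip('\n')

# Made-up sample input; any iterable of strings works, not just a file.
sample = ['  indented text  \n', '*** marker ***\n', 'last line, no newline']
result = list(stripped_lines(sample))

# Leading and trailing spaces survive; only the newline is gone.
print(result)

# By contrast, a plain strip() would also eat the surrounding spaces.
fully_stripped = sample[0].strip()
print(repr(fully_stripped))
```

Because the generator accepts any iterable of strings, it can be tried out without touching the file system.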

2. Select only the lines between markers.

This is really more than one step. First, you have to know what is and isn't a marker line. That's just one function.

Then, you have to advance past the first marker (while throwing away any lines encountered), and finally advance to the second marker (while keeping any lines encountered). Anything after that second marker won't even be read, let alone processed.

Python's generators can almost solve the rest of Step 2 for you. The only sticking point is that closing marker... details below.

2a. What is and is not a marker line?

Identifying a marker line is a yes-or-no question, obviously the job of a Boolean function:

def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap.  (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

Note that a marker line need not (from my reading of your requirements) contain any text between the start and end markers --- six asterisks ('******') is a valid marker line.
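To make the edge cases concrete, here is a small sketch that runs is_marker_line over a few hand-picked strings. The test strings are my own examples, not lines from your file.

```python
def is_marker_line(line, start='***', end='***'):
    # Same check as above: the line must be long enough that the
    # start and end strings cannot overlap.
    if len(line) < len(start) + len(end):
        return False
    return line.startswith(start) and line.endswith(end)

# Hand-picked examples (assumptions, not taken from a real input file).
results = {
    '*** START ***': is_marker_line('*** START ***'),    # ordinary marker
    '******': is_marker_line('******'),                  # nothing in between
    '***': is_marker_line('***'),                        # too short
    '*****': is_marker_line('*****'),                    # still too short
    ' *** START ***': is_marker_line(' *** START ***'),  # leading space
    'plain text': is_marker_line('plain text'),
}

for line, verdict in results.items():
    print(repr(line), verdict)
```

Only the first two strings count as marker lines. The leading-space case is why the file-reading step must not strip whitespace.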

2b. Advance past the first marker line.

This step is now easy: just throw away every line until we find a marker line (and junk it, too). This function doesn't need to worry about the second marker line, or what if there are no marker lines, or anything else.

def advance_past_next_marker(lines):
    '''
    Advances the given iterator through the first encountered marker
    line, if any.
    '''
    for line in lines:
        if is_marker_line(line):
            break

2c. Advance past the second marker line, saving content lines.

A generator could easily yield every line after the "start" marker, but if it discovers there is no "end" marker, there's no way to go back and un-yield those lines. So, now that you've finally encountered lines you (might) actually care about, you'll have to save them all in a list until you know whether they're valid or not.

def lines_before_next_marker(lines):
    '''
    Yields all lines up to but not including the next marker line.  If
    no marker line is found, yields no lines.
    '''
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        # `for` loop did not break, meaning there was no marker line.
        valid_lines = []
    for content_line in valid_lines:
        yield content_line
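The for/else behaviour is easy to get wrong, so here is a minimal sketch that exercises both outcomes on made-up input: one iterator that contains a closing marker and one that does not.

```python
def is_marker_line(line, start='***', end='***'):
    return (len(line) >= len(start) + len(end)
            and line.startswith(start) and line.endswith(end))

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        # No break happened, so there was no marker line.
        valid_lines = []
    for content_line in valid_lines:
        yield content_line

# Closing marker present: the lines before it are yielded.
it = iter(['a', 'b', '*** END ***', 'c'])
with_end = list(lines_before_next_marker(it))
leftover = next(it)  # the line after the marker was never consumed

# Closing marker missing: nothing is yielded at all.
without_end = list(lines_before_next_marker(iter(['a', 'b', 'c'])))

print(with_end, leftover, without_end)
```

Note that the iterator is left positioned just after the marker, so anything that follows it is untouched.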

2d. Gluing Step 2 together.

Advance past the first marker, then yield everything until the second marker.

def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    # Must use the iterator --- if it's merely an iterable (like a list
    # of strings), the call to lines_before_next_marker will restart
    # from the beginning.
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

Testing functions like these with a bunch of input files is annoying. Testing them with lists of strings is easy, but lists are not generators or iterators, they're iterables. The one extra it = iter(...) line was worth it.
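Here is a sketch of such a list-based test. It repeats the Step 2 functions so it runs on its own, and adds a deliberately broken variant (lines_between_markers_without_iter, a name I made up) to show what the iter() call prevents.

```python
def is_marker_line(line, start='***', end='***'):
    return (len(line) >= len(start) + len(end)
            and line.startswith(start) and line.endswith(end))

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        valid_lines = []
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

def lines_between_markers_without_iter(lines):
    # Hypothetical broken variant: hands the iterable itself to both
    # helpers, so a list is scanned from the beginning twice.
    advance_past_next_marker(lines)
    for line in lines_before_next_marker(lines):
        yield line

# Made-up sample data standing in for the contents of a file.
sample = ['preamble', '*** START ***', 'one', 'two', '*** END ***', 'postscript']

between = list(lines_between_markers(sample))
restarted = list(lines_between_markers_without_iter(sample))
one_marker = list(lines_between_markers(['preamble', '*** START ***', 'one']))

print(between)     # the lines between the two markers
print(restarted)   # the preamble instead, because the list was rescanned
print(one_marker)  # empty, because there is no closing marker
```

With a real file object the broken variant would happen to work, since a file is its own iterator. That is exactly why the bug only shows up when testing with lists.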

3. Process the selected lines.

Again, I'm assuming your text processing code is safely wrapped up in a function called words. The only change is that, instead of opening a file and reading it to produce a list of lines, you're given the lines:

def words(lines):
    text = '\n'.join(lines).lower().split()
    # Same as before...

...except that words should probably be a generator, too.
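A generator version might look like the sketch below. It assumes that lower-casing and splitting on whitespace are the only processing steps. Whatever else your original code does to each word would go inside the inner loop.

```python
def words(lines):
    # Assumption: lower-casing and whitespace splitting are the whole job.
    # Any further clean-up from the original program belongs in this loop.
    for line in lines:
        for word in line.lower().split():
            yield word

# Made-up input, including an empty line and stray spaces.
word_list = list(words(['The quick', '', '  Brown FOX  ']))
print(word_list)
```

Splitting line by line gives the same words as joining everything with newlines first, because split() with no arguments treats any run of whitespace as a separator.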

Now, calling words is easy:

def words_from_file(fname):
    for word in words(lines_between_markers(lines_from_file(fname))):
        yield word

To get the words_from_file(fname), you yield the words found in the lines_between_markers, selected from the lines_from_file... Not quite English, but close.

4. Call words_from_file from your program.

Wherever you already have filename defined --- presumably inside main somewhere --- call words_from_file to get one word at a time:

filename = ...  # However you defined it before.
for word in words_from_file(filename):
    print(word)

Or, if you really need those words in a list:

filename = ...
word_list = list(words_from_file(filename))

Conclusion

Note that this would have been much harder if you had tried to squeeze it all into one or two functions. It wasn't just one task or decision, but many. The key was breaking it into tiny jobs, each of which was easy to understand and test.

The generators got rid of a lot of boilerplate code. Without generators, almost every function would have required a for loop just to some_list.append(next_item), like in lines_before_next_marker.

If you have Python 3.3+, the yield from ... construct erases even more boilerplate. Every generator containing a loop like this:

for line in stripped_lines(flines):
    yield line

Could be re-written as:

yield from stripped_lines(flines)

I counted four of them.
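A quick way to convince yourself that the two spellings are interchangeable for this purpose is the sketch below. The function names are placeholders of mine.

```python
def relay_with_loop(items):
    # The boilerplate form used throughout Solution 1.
    for item in items:
        yield item

def relay_with_yield_from(items):
    # The Python 3.3+ form: delegate to the iterable directly.
    yield from items

data = ['one', 'two', 'three']
loop_result = list(relay_with_loop(data))
yield_from_result = list(relay_with_yield_from(data))

print(loop_result == yield_from_result)
```

Both generators produce the same items in the same order. The difference only matters for advanced uses such as send() and return values, which this code does not rely on.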

For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's "Loop Like a Native", available as a 30-minute video from PyCon US 2013.

Solution 2:

I recommend using regular expressions.

from re import compile, findall

exp = compile(r'\*{5}([^\*]+)\*{3}|"([^"]+)"')

# filename is whatever path your program already uses.
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read().lower()  # Notice, no .split()
text_exclusive = ' '.join([''.join(block) for block in findall(exp, text)])

# use text_exclusive from this point forward with your code

Solution 3:

You can get only the text between your asterisks with regex:

import re
betweenAstericks = re.search(r"\*{5}.+?\*{3}(.+?)\*{3}", text, re.DOTALL).group(1)
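One caveat: re.search returns None when nothing matches, so calling .group(1) directly raises an AttributeError on a file without markers. The sketch below wraps the same pattern in a helper (text_between_asterisks, a name I made up) and tries it on a made-up string whose opening marker starts with five asterisks and whose markers end with three, which is the shape the pattern appears to expect.

```python
import re

# Same pattern as above, unchanged.
PATTERN = r"\*{5}.+?\*{3}(.+?)\*{3}"

def text_between_asterisks(text):
    # Hypothetical helper: returns None instead of raising when the
    # markers are missing.
    match = re.search(PATTERN, text, re.DOTALL)
    if match is None:
        return None
    return match.group(1)

# Made-up sample; the marker format here is an assumption.
sample = ("header\n"
          "***** START OF BOOK ***\n"
          "first line\nsecond line\n"
          "*** END OF BOOK ***\n"
          "footer")

found = text_between_asterisks(sample)
missing = text_between_asterisks("no markers here")

print(repr(found))
print(missing)
```

The captured text keeps the newlines next to the markers, so you may want to strip() it before further processing.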
