Processing Lines Of Text File Between Two Marker Lines
Solution 1:
What appears to be one task, "count the words between two marker lines", is actually several. Separate the different tasks and decisions into separate functions and generators, and it will be vastly easier.
Step 1: Separate the file I/O from the word counting. Why should the word-counting code care where the words came from?
Step 2: Separate selecting the lines to process from the file handling and the word counting. Why should the word-counting code be given words it's not supposed to count? This is still far too big a job for one function, so it will be broken down further. (This is the part you're asking about.)
Step 3: Process the text. You've already done that, more or less. (I'll assume your text-processing code ends up in a function called words.)
1. Separate file I/O
Reading text from a file is really two steps: first, open and read the file, then strip the newline off each line. These are two jobs.
def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt', encoding='utf8') as flines:
        for line in stripped_lines(flines):
            yield line
Not a hint of your text processing here. The lines_from_file generator just yields whatever strings were found in the file... after stripping their trailing newline. (Note that a plain strip() would also remove leading and trailing whitespace, which you have to preserve to identify marker lines.)
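To see why that distinction matters, here is a small sketch comparing the two calls. The sample string is made up for illustration and is not taken from your file.

```python
# Hypothetical sample line: leading spaces, trailing spaces, then a newline.
sample = '  *** not flush left ***  \n'

only_newline_removed = sample.rstrip('\n')
all_whitespace_removed = sample.strip()

# rstrip('\n') keeps the surrounding spaces, so the line still does
# not start with '***'.
print(repr(only_newline_removed))

# strip() removes them, which would make this line look like a marker.
print(repr(all_whitespace_removed))
```

Whether an indented line like this should count as a marker is your call. The point is only that strip() would make that decision for you.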
2. Select only the lines between markers.
This is really more than one step. First, you have to know what is and isn't a marker line. That's just one function.
Then, you have to advance past the first marker (while throwing away any lines encountered), and finally advance to the second marker (while keeping any lines encountered). Anything after that second marker won't even be read, let alone processed.
Python's generators can almost solve the rest of Step 2 for you. The only sticking point is that closing marker... details below.
2a. What is and is not a marker line?
Identifying a marker line is a yes-or-no question, obviously the job of a Boolean function:
def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap.  (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)
Note that a marker line need not (from my reading of your requirements) contain any text between the start and end markers: six asterisks ('******') is a valid marker line.
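A quick way to convince yourself of those edge cases is to run the function over a few strings. This is only a sketch: the sample lines are invented, and the function is repeated (without its docstring) so the snippet runs on its own.

```python
def is_marker_line(line, start='***', end='***'):
    # Same logic as the function above, repeated for a standalone snippet.
    if len(line) < len(start) + len(end):
        return False
    return line.startswith(start) and line.endswith(end)

# Made-up sample lines, for illustration only.
samples = ['***', '******', '*** START ***', '  *** indented ***', 'plain text']

results = {s: is_marker_line(s) for s in samples}
for s, verdict in results.items():
    print(repr(s), verdict)
```

Only '******' and '*** START ***' come back True. The lone '***' is too short, and the indented line fails the startswith check.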
2b. Advance past the first marker line.
This step is now easy: just throw away every line until we find a marker line (and junk it, too). This function doesn't need to worry about the second marker line, or what if there are no marker lines, or anything else.
def advance_past_next_marker(lines):
    '''
    Advances the given iterator through the first encountered marker
    line, if any.
    '''
    for line in lines:
        if is_marker_line(line):
            break
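The function returns nothing. Its whole effect is on the iterator you hand it. A minimal sketch of that side effect, using an invented list of lines and repeating the two functions so it runs standalone:

```python
def is_marker_line(line, start='***', end='***'):
    if len(line) < len(start) + len(end):
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

# Made-up input. Note the explicit iter(): a plain list has no
# "current position" to advance.
it = iter(['junk', '*** BEGIN ***', 'keep me', 'and me'])
advance_past_next_marker(it)

# Whatever is left in the iterator comes after the first marker.
remaining = list(it)
print(remaining)
```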
2c. Advance past the second marker line, saving content lines.
A generator could easily yield every line after the "start" marker, but if it discovers there is no "end" marker, there's no way to go back and un-yield those lines. So, now that you've finally encountered lines you (might) actually care about, you'll have to save them all in a list until you know whether they're valid or not.
def lines_before_next_marker(lines):
    '''
    Yields all lines up to but not including the next marker line.  If
    no marker line is found, yields no lines.
    '''
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        # `for` loop did not break, meaning there was no marker line.
        valid_lines = []
    for content_line in valid_lines:
        yield content_line
2d. Gluing Step 2 together.
Advance past the first marker, then yield everything until the second marker.
def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    # Must use the iterator --- if it's merely an iterable (like a list
    # of strings), the call to lines_before_next_marker will restart
    # from the beginning.
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line
Testing functions like these with a bunch of input files is annoying. Testing them with lists of strings is easy, but lists are not generators or iterators, they're iterables. The one extra it = iter(...) line was worth it.
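For instance, a list-based check of Step 2 might look like the sketch below. The three input lists are invented. The functions are the ones from Steps 2a to 2d, repeated without docstrings so the snippet is self-contained.

```python
def is_marker_line(line, start='***', end='***'):
    if len(line) < len(start) + len(end):
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        valid_lines = []  # No closing marker: discard everything.
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

# Made-up inputs for illustration.
complete = ['header', '*** START ***', 'one', 'two', '*** END ***', 'footer']
unclosed = ['header', '*** START ***', 'one', 'two']
no_markers = ['just', 'text']

print(list(lines_between_markers(complete)))    # ['one', 'two']
print(list(lines_between_markers(unclosed)))    # []
print(list(lines_between_markers(no_markers)))  # []
```

The second and third cases are the ones that are painful to set up with real files, and they are exactly where the for/else in lines_before_next_marker earns its keep.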
3. Process the selected lines.
Again, I'm assuming your text processing code is safely wrapped up in a function called words. The only change is that, instead of opening a file and reading it to produce a list of lines, you're given the lines:
def words(lines):
    text = '\n'.join(lines).lower().split()
    # Same as before...
...except that words should probably be a generator, too.
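Your real processing code isn't shown here, so the following is only a guess at its shape. It assumes the processing is nothing more than lowercasing and splitting on whitespace, and that anything else you do can be slotted into the inner loop.

```python
def words(lines):
    # Assumption: lowercasing and whitespace splitting is all the
    # processing there is. Replace the inner loop with your own logic.
    for line in lines:
        for word in line.lower().split():
            yield word

# Made-up input lines for illustration.
sample_words = list(words(['The Quick  brown', 'FOX']))
print(sample_words)
```

Because split() with no arguments splits on any run of whitespace, going line by line gives the same words as joining everything with '\n' first, without building one big string.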
Now, calling words is easy:
def words_from_file(fname):
    for word in words(lines_between_markers(lines_from_file(fname))):
        yield word
To get the words_from_file fname, you yield the words found in the lines_between_markers, selected from the lines_from_file... Not quite English, but close.
4. Call words_from_file from your program.
Wherever you already have filename defined (presumably inside main somewhere), call words_from_file to get one word at a time:
filename = ...  # However you defined it before.
for word in words_from_file(filename):
    print(word)
Or, if you really need those words in a list:
filename = ...
word_list = list(words_from_file(filename))
Conclusion
This would have been much harder if you had tried to squeeze it all into one or two functions. It wasn't just one task or decision, but many. The key was breaking it into tiny jobs, each of which was easy to understand and test.
The generators got rid of a lot of boilerplate code. Without generators, almost every function would have required a for loop just to call some_list.append(next_item), like the one in lines_before_next_marker.
If you have Python 3.3+, the yield from ... construct erases even more boilerplate. Every generator containing a loop like this:
for line in stripped_lines(flines):
    yield line
Could be re-written as:
yield from stripped_lines(flines)
I counted four of them.
For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's "Loop Like a Native", available as a 30-minute video from PyCon US 2013.
Solution 2:
I recommend using regular expressions.
from re import compile, findall

exp = compile(r'\*{5}([^\*]+)\*{3}|"([^"]+)"')
# filename is whatever path you already defined in your program.
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read().lower()  # Notice, no .split()
# Each match is a tuple of two groups, only one of which is non-empty.
text_exclusive = ' '.join([''.join(block) for block in findall(exp, text)])
# use text_exclusive from this point forward with your code
Solution 3:
You can get only the text between your asterisks with regex:
import re
betweenAstericks = re.search(r"\*{5}.+?\*{3}(.+?)\*{3}", text, re.DOTALL).group(1)
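One caveat: re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on a file without markers. A slightly more defensive sketch follows. The helper name text_between_asterisks and the sample text are made up for illustration, and the pattern is copied unchanged from the line above.

```python
import re

# Pattern copied from the answer above. DOTALL lets '.' match newlines.
PATTERN = re.compile(r"\*{5}.+?\*{3}(.+?)\*{3}", re.DOTALL)

def text_between_asterisks(text):
    """Return the captured text, or None when the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1)

# Made-up sample: an opening marker of five asterisks, a label, three
# asterisks, then the body, then a closing run of three asterisks.
sample = 'preamble\n***** START ***\nbody text\n*** END\n'

found = text_between_asterisks(sample)
missing = text_between_asterisks('no markers at all')
print(repr(found))
print(missing)
```

Whether this pattern fits your file depends on exactly how many asterisks your markers use, so check it against a real marker line before relying on it.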