
Pandas Read Csv Ignore Newline

I have a dataset (for the compbio people out there, it's a FASTA file) that is littered with newlines that don't act as delimiters of the data. Is there a way for pandas to ignore newlines when importing?

Solution 1:

You need another character that tells pandas when you actually want to start a new row.

Here, for example, I create a file where the row break is encoded by a pipe (|):

csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line

de,4rd_col_first_line|
"""

with open("test.csv", "w") as f:
    f.writelines(csv)

Then read it with the C engine, specifying the pipe as the lineterminator:

import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")

which gives the expected DataFrame (the original answer showed a screenshot of the output here).

Solution 2:

There is no clean way to do this with pandas alone. BioPython by itself seems sufficient; failing that, a hybrid solution works: iterate through a BioPython object and insert the records into a DataFrame.
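The hybrid approach could be sketched roughly as follows, assuming Biopython is installed (the sample sequences and IDs below are made up for illustration):

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO

# In practice this would be open("your_file.fasta"); StringIO keeps the sketch self-contained
fasta = StringIO(">seq1\nTGTAATATT\nGCCTGTAGC\n>seq2\nTATCAAGAT\n")

# SeqIO.parse joins the wrapped sequence lines for us;
# each record exposes an id and a sequence
records = ((rec.id, str(rec.seq)) for rec in SeqIO.parse(fasta, "fasta"))
df = pd.DataFrame(records, columns=["id", "sequence"])
print(df)
```

Note that the embedded newlines in `seq1` are gone: Biopython concatenates the wrapped lines before the data ever reaches pandas.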

Solution 3:

Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?

Yes — see the documentation for pd.read_table().

You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):

#---- EXAMPLE DATA ----
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174 
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------

#---- EXAMPLE CODE ----
import pandas as pd
df = pd.read_table(
    example_file,           # Your file goes here
    engine = 'c',           # C parser must be used to allow custom lineterminator, see doc
    lineterminator = '>',   # New lines begin with ">"
    skiprows = 1,           # File begins with line terminator ">", so output skips first line
    names = ['raw'],        # A single column which we will split into two
    comment = ';'           # Comment character in FASTA format
)

# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))

# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))

print(df[['col0','col1']])

# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])

Returns:

                 col0                                               col1
0  ERR899297.10000174  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1  ERR123456.12345678  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...

Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT

Solution 4:

After pd.read_csv(), you can split the offending column with the .str.split() accessor (a DataFrame itself has no split() method):

import pandas as pd

data = pd.read_csv("test.csv")
# "col" is a placeholder for whichever column contains the embedded newlines
data["col"] = data["col"].str.split()

Solution 5:

This should work simply by setting skip_blank_lines=True.

skip_blank_lines : bool, default True

If True, skip over blank lines rather than interpreting as NaN values.

However, I found that I had to set this to False to make it work with my data, which has newlines in it. Very strange, unless I'm misunderstanding something.

Docs
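For what it's worth, a minimal sketch of the documented behavior (the inline CSV string here is made up for illustration):

```python
from io import StringIO

import pandas as pd

data = "a,b\n1,2\n\n3,4\n"

# Default skip_blank_lines=True: the empty line simply disappears
df_skip = pd.read_csv(StringIO(data))

# skip_blank_lines=False: the empty line becomes a row of NaN
df_keep = pd.read_csv(StringIO(data), skip_blank_lines=False)

print(len(df_skip), len(df_keep))
```

Either way, this option only handles fully blank lines between rows; it does nothing for newlines embedded inside a field, which is why the lineterminator approaches above are needed for FASTA data.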
