Pandas Read Csv Ignore Newline
Solution 1:
You need a different character that tells pandas where a record actually ends.
Here, for example, I create a file in which the end of each record is marked by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
# Write the sample text to disk so read_csv can load it
with open("test.csv", "w") as f:
    f.write(csv)
Then you read it with the C engine and specify the pipe as the lineterminator:
import pandas as pd
pd.read_csv("test.csv", lineterminator="|", engine="c")
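With a custom lineterminator, a "\n" inside a record is kept as an ordinary character in the cell value, so it usually still has to be removed after loading. A minimal in-memory sketch of that follow-up step, assuming a made-up two-record sample (the data and the choice of a space as the replacement are illustrative, not part of the original answer):

```python
from io import StringIO

import pandas as pd

# Records end with "|"; the "\n" inside the first record is ordinary data.
raw = "col1,col2|first\nline,2|second,3"

df = pd.read_csv(StringIO(raw), lineterminator="|", engine="c")

# The embedded newline survives in the cell value, so replace it explicitly.
df["col1"] = df["col1"].str.replace("\n", " ", regex=False)

print(df)
```

The same replacement can be applied to the column names if the header record also contains line breaks.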
Solution 2:
There is no good way to do this with pandas. BioPython alone seems to be sufficient, and simpler than a hybrid solution that iterates through a BioPython object and inserts the records into a dataframe.
Solution 3:
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):
#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,          # Your file goes here
    engine='c',            # C parser must be used to allow custom lineterminator, see doc
    lineterminator='>',    # New lines begin with ">"
    skiprows=1,            # File begins with line terminator ">", so output skips first line
    names=['raw'],         # A single column which we will split into two
    comment=';',           # Comment character in FASTA format
)
# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))
# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))
print(df[['col0','col1']])
# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
col0 col1
0 ERR899297.10000174 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1 ERR123456.12345678 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
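The two apply calls above can also be written with the vectorized .str accessor. A minimal sketch, assuming a raw column shaped like the one read_table produces here (identifier, line break, wrapped sequence); the two short records are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df['raw']: identifier, newline, wrapped sequence.
df = pd.DataFrame({"raw": ["ERR1\nACGT\nTTGA\n", "ERR2\nGGCC\nAATT\n"]})

# Split once on the first run of whitespace (the first line break).
parts = df["raw"].str.split(n=1, expand=True)

df["col0"] = parts[0]
# Remove the remaining line breaks inside the sequence.
df["col1"] = parts[1].str.replace("\n", "", regex=False)

print(df[["col0", "col1"]])
```

This avoids the Python-level lambdas, which may matter for large files, but the result should be the same as the apply-based version.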
Solution 4:
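After pd.read_csv(), you can clean up the newlines with the string methods of the affected column. Note that a DataFrame itself has no split() method; use the .str accessor (for example .str.split() or .str.replace()) instead.
import pandas as pd
data = pd.read_csv("test.csv")
# "col1" is a placeholder: use whichever text column contains the newlines
data["col1"] = data["col1"].str.replace("\n", " ", regex=False)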
Solution 5:
This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.
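To see what the option does in isolation, here is a small sketch with made-up data. As far as I can tell, skip_blank_lines only decides whether completely empty lines become NaN rows; a newline inside a quoted field is kept as part of the value either way:

```python
from io import StringIO

import pandas as pd

# One completely blank line between the two data rows.
blank_line_csv = "a,b\n1,2\n\n3,4\n"

skipped = pd.read_csv(StringIO(blank_line_csv), skip_blank_lines=True)
kept = pd.read_csv(StringIO(blank_line_csv), skip_blank_lines=False)

print(len(skipped))  # number of rows when the blank line is dropped
print(len(kept))     # number of rows when the blank line is kept as NaN

# A newline inside a quoted field is part of the value, not a blank line.
quoted_csv = 'a,b\n"line one\nline two",2\n'
quoted = pd.read_csv(StringIO(quoted_csv))

print(repr(quoted.loc[0, "a"]))
```

If the data behaves differently with False, the line breaks are probably not inside quoted fields, in which case a custom lineterminator as in Solution 1 may be the better fit.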