
How Can I Make a Unique List of Cells?

I have a txt file which looks like the example below, with 4 rows; the strings in each row are separated by a ,:

'India1,India2,myIndia '
'Where,Here,Here '
'Here,Where,India,uyete'
'AFD,TTT'

Solution 1:

Option 1: .csv, .txt Files

Native Python is unable to read .xls files. If you convert your file(s) to .csv or .txt, you can use the csv module within the Standard Library:

# `csv` module, Standard Library
import csv

filepath = "./test.csv"

with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)                                  # skip header 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']

Option 2: .xls, .xlsx Files

If you want to retain the original .xls format, you have to install a third-party module to handle Excel files.

Install xlrd from the command prompt:

pip install xlrd

In Python:

# `xlrd` module, third-party
import itertools
import xlrd

filepath = "./test.xls"

with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)                 # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
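
Note that xlrd 2.0 and later reads only the legacy .xls format. For .xlsx workbooks, one alternative is the third-party openpyxl package (pip install openpyxl). A minimal sketch of the same approach, assuming a file named ./test.xlsx with a single header row:

# `openpyxl` module, third-party
from openpyxl import load_workbook

filepath = "./test.xlsx"                                   # assumed filename

workbook = load_workbook(filepath, read_only=True)
worksheet = workbook.active                                # assumes first sheet
rows = worksheet.iter_rows(min_row=2, values_only=True)    # skip header row
items = list({str(val).replace(" ", "") for row in rows for val in row if val})
workbook.close()

print(items)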

Option 3: DataFrames

You can handle csv and text files with pandas DataFrames. See documentation for other formats.

import pandas as pd
import numpy as np

# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"# Determines columns from the first line, so add commas in text file, else may throw an error
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)   # newer pandas: on_bad_lines="skip"
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df

DataFrame Output

0     India1
1     India2
2    myIndia
3      Where
4       Here
5      India
6      uyete
7        AFD
8        TTT
dtype: object

Save as Files

# Save as .txt or .csv without index, optional

# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)

Note: Results from Options 1 & 2 can also be converted to unordered pandas columnar objects (a Series) with pd.Series(list(items)).
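
For example, picking up the items collection built in Option 1 or 2 (a minimal sketch):

import pandas as pd

# `items` is the unique collection produced by Option 1 or 2; order is not guaranteed
series = pd.Series(list(items))
print(series)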

Finally: As a Script

Save any of the three options above in a function (stack) within a file (named restack.py). Save this script to a directory.

# restack.py
import pandas as pd
import numpy as np

def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
    # Determines columns from the first line, so add commas in text file, else may throw an error
    df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)   # newer pandas: on_bad_lines="skip"
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
    stack = df.stack()
    clean_df = pd.Series(stack.unique())

    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))

    return clean_df

if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)

    if result.startswith("y"):
        save = True
    else:
        save = False

    print(stack(fp, save=save))

From its working directory, run the script via the command line and answer the prompts:

> python restack.py 

Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'

Your results should print to your console and optionally save to a file output.txt. Adjust any parameters to suit your interests.
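
The stack function can also be imported and reused from another script or an interactive session, skipping the prompts entirely. A short sketch, assuming restack.py is importable and ./test2.txt exists:

# Reuse the function directly, without the command-line prompts
from restack import stack

clean_df = stack("./test2.txt")                            # return the unique Series
stack("./test2.txt", save=True, target="./output.txt")     # also write the results to a file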

Solution 2:

If your stack.txt file looks like this (i.e. it's saved as a .txt file):

"India1,India2,myIndia     ""Where,Here,Here   ""Here,Where,India,uyete""AFD,TTT"

The solution:

from collections import OrderedDict

withopen("stack.txt", "r") as f:
    # read your data in and strip off any new-line characters
    data = [eval(line).strip() for line in f.readlines()]
    # get individual words into a list
    individual_elements = [word for row in data for word in row.split(",")]
    # remove duplicates and preserve order
    uniques = OrderedDict.fromkeys(individual_elements)   
    # convert from OrderedDict object to plain list
    final = [word for word in uniques]

To get your desired columnar output:

print("\n".join(final))

Which yields:

India1
India2
myIndia     
Where
Here   
India
uyete
AFD
TTT

Solution 3:

I won't give you the whole code, but I'll give you some ideas.

First, you need to read all the lines of the file:

lines = open("file.txt").readlines()

Then, extract the data from each line:

lines = [line.split(",") for line in lines]

You can generate combinations with itertools.combinations. For each line, print the combinations of the line's elements.
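
A minimal sketch of that idea, assuming pairwise combinations and the lines list built above:

import itertools

for line in lines:
    cells = [cell.strip() for cell in line]
    # print every 2-element combination of the cells in this line
    for combo in itertools.combinations(cells, 2):
        print(combo)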

You can get the unique elements with set if you don't care about the order of the elements. Before using set, you should flatten the list lines first, maybe using itertools.chain.from_iterable.
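
Putting those two steps together, a sketch that also strips stray whitespace and newlines:

import itertools

flat = itertools.chain.from_iterable(lines)                # flatten the list of lists
unique_cells = {cell.strip() for cell in flat}             # set removes duplicates
print(unique_cells)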

Solution 4:

Your code for reading the text file line by line is fine. So you still need to

  1. Split each line into "cells"
  2. Remove duplicates

You can split each line into cells using split

line.split(',')

And you want to remove white space, so I would strip each cell:

[value.strip() for value in line.split(',')]

And you can remove duplicates with set

set(cells)

Lastly, I think it's better to use with (a context manager) when reading files. Putting it all together:

with open('df.txt', 'r') as f:
    cells = []
    for line in f:
        cells += [value.strip() for value in line.split(',')]

cells = list(set(cells))

If you want to be more compact, you can do it in a single list comprehension:

with open('df.txt', 'r') as f:
    cells = list(set([value.strip() for line in f for value in line.split(',')]))
