
Python: Failed In Retrieving The Highest Amount From A Repeated Data With Different Amount In A Certain Year

The CSV file I have contains several repeated supplier_name values, each with a different amt, for the years 2015-2017. Here is my code: df = pd.read_csv('government-procurement-via-gebiz.cs

Solution 1:

The problem is that when to_dict builds the dictionary, it first stores "SANTARLI" as a key with the amount you want. As it continues through the rows, it reaches the second "SANTARLI" row and uses the same key again, overwriting the value stored for the first one.
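To see the overwriting in isolation, here is a minimal sketch. The two-row frame and the shortened supplier name are made up for illustration; only the to_dict expression comes from the question.

```python
import pandas as pd

# Made-up two-row frame: the same supplier appears twice with different amounts.
df = pd.DataFrame({
    "supplier_name": ["SANTARLI", "SANTARLI"],
    "awarded_amt": [1030000000, 148800000],
})

# Same expression as in the question: supplier names become dictionary keys.
d1 = df.set_index("supplier_name").to_dict()["awarded_amt"]

# Only one entry survives: the later row has overwritten the earlier one.
print(d1)
```

The larger amount is lost simply because it appears earlier in the frame, not because of its value.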

Dictionary keys must be unique. You need to clean your data of redundant instances first. See below...

import pandas as pd

# df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'], infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)

# I created the df from the data supplied in the question
df = pd.DataFrame(data, columns=['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
print(df)

# Select by date (your original code)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)

# Sort by column 'awarded_amt', highest value first.
# The amounts in this data are strings, so convert them to numbers first;
# otherwise they are compared as text rather than by value.
# This leaves the duplicates like 'SANTARLI' in place, but puts the row
# with the highest 'awarded_amt' first.
df['awarded_amt'] = pd.to_numeric(df['awarded_amt'])
df = df.sort_values('awarded_amt', ascending=False)

# Drop the duplicates. The "keep" parameter defaults to "first",
# so this keeps the first instance of 'SANTARLI',
# which is also the one with the greatest 'awarded_amt'.
df = df.drop_duplicates(subset=['supplier_name'])

# Now create your dict
d1 = df.set_index('supplier_name').to_dict()['awarded_amt']
print(d1)

OUTPUT:

  award_date                                      supplier_name awarded_amt
0 2015-01-07                    SANTARLI CONSTRUCTION PTE. LTD.  1030000000
1 2014-08-04         HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD   601726000
2 2014-02-03                       KAJIMA OVERSEAS ASIA PTE LTD   595800000
3 2015-11-20                            SAMSUNG C&T CORPORATION   555322063
4 2015-11-23                             THE GO-AHEAD GROUP PLC   497738104
5 2015-06-19                GS Engineering & Construction Corp.   428301000
6 2015-09-07                   Master Contract Services Pte Ltd   163000000
7 2015-03-05         Yongnam Engineering & Construction Pte Ltd   159000000
8 2015-12-30  NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SI...   152600000
9 2015-05-19                    SANTARLI CONSTRUCTION PTE. LTD.   148800000

{'SANTARLI CONSTRUCTION PTE. LTD.': 1030000000, 'SAMSUNG C&T CORPORATION': 555322063, 'THE GO-AHEAD GROUP PLC': 497738104, 'GS Engineering & Construction Corp.': 428301000, 'Master Contract Services Pte Ltd': 163000000, 'Yongnam Engineering & Construction Pte Ltd': 159000000, 'NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SINGAPORE BRANCH': 152600000}
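As an alternative to sorting and dropping duplicates, the same "highest amount per supplier" result could be computed with a groupby. This is a different technique from the one in the answer, shown here only as a sketch. The rows and shortened names are made up, and the amounts are deliberately given as strings to show the conversion step.

```python
import pandas as pd

# Made-up rows for illustration; amounts are strings on purpose.
df = pd.DataFrame({
    "supplier_name": ["SANTARLI", "SAMSUNG", "SANTARLI"],
    "awarded_amt": ["1030000000", "555322063", "148800000"],
})

# Convert to numbers so that max() compares values, not text.
df["awarded_amt"] = pd.to_numeric(df["awarded_amt"])

# One entry per supplier, holding that supplier's largest amount.
highest = df.groupby("supplier_name")["awarded_amt"].max()

# Order by amount, largest first, then turn the Series into a dict.
d1 = highest.sort_values(ascending=False).to_dict()
print(d1)
```

Because each supplier appears only once after the groupby, the supplier names are safe to use as dictionary keys.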

EDIT: If you just want the top 5 rows by "awarded_amt" for each year (i.e. the 5 largest amounts, whether they belong to 5 different companies or to the same company several times), then don't drop duplicates at all.

Just sort the DataFrame for each year by "awarded_amt" and take the top 5 rows (for example with df.head(5)), but DON'T call to_dict() with the company names as keys, since a dictionary cannot hold the same company name twice.
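If a plain Python structure is still wanted at that point, one option is a list with one dict per row, which keeps repeated company names. A minimal sketch with made-up rows, using the top 3 instead of the top 5 to keep it short:

```python
import pandas as pd

# Made-up rows: one supplier holds two of the three largest awards.
df = pd.DataFrame({
    "supplier_name": ["SANTARLI", "HYUNDAI", "SANTARLI", "KAJIMA"],
    "awarded_amt": [1030000000, 601726000, 649800000, 595800000],
})

top3 = df.sort_values("awarded_amt", ascending=False).head(3)

# orient="records" gives one dict per row, so repeated names cannot collide.
records = top3.to_dict(orient="records")
print(records)
```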

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

data = [
    ["1/7/2015",   "SANTARLI CONSTRUCTION PTE. LTD.",            1030000000],
    ["8/4/2015",   "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000],
    ["2/3/2015",   "KAJIMA OVERSEAS ASIA PTE LTD",               595800000],
    ["11/20/2015", "SAMSUNG C&T CORPORATION",                    555322063],
    ["11/23/2015", "THE GO-AHEAD GROUP PLC",                     497738104],
    ["6/19/2015",  "GS Engineering & Construction Corp.",        428301000],
    ["6/25/2015",  "TIONG SENG CONTRACTORS (PRIVATE) LIMITED",   277265946],
    ["5/19/2015",  "SANTARLI CONSTRUCTION PTE. LTD.",            649800000],
    ["5/19/2016",  "SANTARLI CONSTRUCTION PTE. LTD.",            650800000],
    ["5/19/2016",  "SANTARLI CONSTRUCTION PTE. LTD.",            651800000],
    ["11/20/2016", "SAMSUNG C&T CORPORATION",                    555322063],
    ["11/23/2016", "THE GO-AHEAD GROUP PLC",                     497738104],
    ["6/19/2016",  "GS Engineering & Construction Corp.",        428301000],
]

df = pd.DataFrame(data, columns = ['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
# Separate df by years and collect the top 5 rows of each year
top_frames = []
years = [2015, 2016]
for year in years:
    temp_df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == year)].reset_index(drop=True)
    # Sort by column 'awarded_amt', highest value first.
    # This leaves the duplicates like 'SANTARLI' in place.
    temp_df = temp_df.sort_values('awarded_amt', ascending=False)
    top_frames.append(temp_df.iloc[:5])
finaldf = pd.concat(top_frames)
print(finaldf)

OUTPUT:

  award_date                               supplier_name  awarded_amt
0 2015-01-07             SANTARLI CONSTRUCTION PTE. LTD.   1030000000
7 2015-05-19             SANTARLI CONSTRUCTION PTE. LTD.    649800000
1 2015-08-04  HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD    601726000
2 2015-02-03                KAJIMA OVERSEAS ASIA PTE LTD    595800000
3 2015-11-20                     SAMSUNG C&T CORPORATION    555322063
1 2016-05-19             SANTARLI CONSTRUCTION PTE. LTD.    651800000
0 2016-05-19             SANTARLI CONSTRUCTION PTE. LTD.    650800000
2 2016-11-20                     SAMSUNG C&T CORPORATION    555322063
3 2016-11-23                      THE GO-AHEAD GROUP PLC    497738104
4 2016-06-19         GS Engineering & Construction Corp.    428301000
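The per-year loop could also be written without listing the years by hand, by sorting once and letting groupby keep the first rows of each year. This is a sketch of that alternative, not the method used in the answer. The rows, the shortened names, the helper column year and the cut-off n are all made up for illustration.

```python
import pandas as pd

# Made-up rows spanning two years.
df = pd.DataFrame({
    "award_date": pd.to_datetime(
        ["2015-01-07", "2015-05-19", "2015-08-04", "2016-05-19", "2016-11-20"]
    ),
    "supplier_name": ["SANTARLI", "SANTARLI", "HYUNDAI", "SANTARLI", "SAMSUNG"],
    "awarded_amt": [1030000000, 649800000, 601726000, 651800000, 555322063],
})

n = 2  # rows to keep per year; use 5 for a top-5 list

top_per_year = (
    df.assign(year=df["award_date"].dt.year)                    # helper column
      .sort_values(["year", "awarded_amt"], ascending=[True, False])
      .groupby("year")
      .head(n)                                                  # first n rows of each year
)
print(top_per_year)
```

Since the frame is sorted by amount within each year before head(n) is applied, the rows kept for each year are that year's largest awards, duplicates included.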

EDIT:

To turn finaldf into a dictionary, I would recommend the following. It creates a nested dictionary, similar to JSON, keyed first by supplier name and then by award date. You could also use Python's json module for this.

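final_dict = {}
for _, row in finaldf.iterrows():
    # Access the values by column name rather than by position
    award_date    = row['award_date']
    supplier_name = row['supplier_name']
    awarded_amt   = row['awarded_amt']
    # Create the inner dict the first time a supplier is seen
    if supplier_name not in final_dict:
        final_dict[supplier_name] = {}
    final_dict[supplier_name][award_date] = awarded_amt

print(final_dict)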

OUTPUT:

{
  'SANTARLI CONSTRUCTION PTE. LTD.': {
    Timestamp('2015-01-07 00:00:00'): 1030000000, 
    Timestamp('2015-05-19 00:00:00'): 649800000, 
    Timestamp('2016-05-19 00:00:00'): 650800000
  }, 
  'HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD': {
    Timestamp('2015-08-04 00:00:00'): 601726000
  }, 
  'KAJIMA OVERSEAS ASIA PTE LTD': {
    Timestamp('2015-02-03 00:00:00'): 595800000
  }, 
  'SAMSUNG C&T CORPORATION': {
    Timestamp('2015-11-20 00:00:00'): 555322063, 
    Timestamp('2016-11-20 00:00:00'): 555322063
  }, 
  'THE GO-AHEAD GROUP PLC': {
    Timestamp('2016-11-23 00:00:00'): 497738104
  }, 
  'GS Engineering & Construction Corp.': {
    Timestamp('2016-06-19 00:00:00'): 428301000
  }
}
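Note that the inner dictionary is keyed by award date, so two awards to the same supplier on the same date still overwrite each other: finaldf holds two 2016-05-19 rows for SANTARLI, but only one of them appears in the output above. One way around this, sketched below with made-up rows, is to store a list of records per supplier. The dates are written as strings so that the result can be passed to the json module; the date format chosen here is an assumption.

```python
import json
import pandas as pd

# Made-up rows: two awards to the same supplier on the same date.
finaldf = pd.DataFrame({
    "award_date": pd.to_datetime(["2016-05-19", "2016-05-19", "2016-11-20"]),
    "supplier_name": ["SANTARLI", "SANTARLI", "SAMSUNG"],
    "awarded_amt": [651800000, 650800000, 555322063],
})

final_dict = {}
for row in finaldf.itertuples(index=False):
    # Append to a list so that same-day awards are both kept.
    final_dict.setdefault(row.supplier_name, []).append({
        "award_date": row.award_date.strftime("%Y-%m-%d"),  # JSON has no date type
        "awarded_amt": int(row.awarded_amt),
    })

print(json.dumps(final_dict, indent=2))
```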
