i have 2 dataframes read in via csv. dataframe 1 consists of phone number , additional data. second dataframe contains country codes , country names.
i want take phone number first dataset , compare country codes of second. country codes can between 1 4 digits long. go longest country code shortest. if there match, want assign country name phone number.
input longlist:
phonenumber, add_info 34123425209, info1 92654321762, info2 12018883637, info3 6323450001, info4 496789521134, info5
input country_list:
country;country_code;order_info spain;34;1 pakistan;92;4 usa;1;2 philippines;63;3 germany;49;4 poland;48;1 norway;47;2
output should be:
phonenumber, add_info, country, order_info 34123425209, info1, spain, 1 92654321762, info2, pakistan, 4 12018883637, info3, usa, 2 6323450001, info4, philippines, 3 496789521134, info5, germany, 4
i have solved once this:
#! /usr/bin/python import csv import pandas open ('longlist.csv','r') lookuplist: open ('country_list.csv','r') inputlist: open('outputfile.csv', 'w') outputlist: reader = csv.reader(lookuplist, delimiter=',') reader2 = csv.reader(inputlist, delimiter=';') writer = csv.writer(outputlist, dialect='excel') in reader2: xl in reader: if xl[0].startswith(i[1]): zeile = [xl[0], xl[1], i[0], i[1], i[2]] writer.writerow(zeile) lookuplist.seek(0)
but solve problem, using pandas. got work: - read in csv files - remove duplicates "longlist" - sort list of countries / country code
this is, have working already:
import pandas pd, numpy np longlist = pd.read_csv('path/to/longlist.csv', usecols=[2,3], names=['phonenumber','add_info']) country_list = pd.read_csv('path/to/country_list.csv', sep=';', names=['country','country_code','order_info'], skiprows=[0]) # remove duplicates , make phone number index longlist = longlist.drop_duplicates('phonenumber') longlist = longlist.set_index('phonenumber') # sort country list, high low value , make country code index country_list=country_list.sort_values(by='country_code', ascending=0) country_list=country_list.set_index('country_code') (...) longlist.to_csv('path/to/output.csv')
but way trying same datasets not work. cannot apply startswith (cannot iterate through objects , cannot apply on objects). appreciate help.
i way:
cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str}) ll = pd.read_csv('phones.csv', skipinitialspace=true, dtype={'phonenumber':str}) lookup = cl['country_code'] lookup.index = cl['country_code'] ll['country_code'] = ( ll['phonenumber'] .apply(lambda x: pd.series([lookup.get(x[:4]), lookup.get(x[:3]), lookup.get(x[:2]), lookup.get(x[:1])])) .apply(lambda x: x.get(x.first_valid_index()), axis=1) ) # remove `how='left'` parameter if don't need "unmatched" phone-numbers result = ll.merge(cl, on='country_code', how='left')
output:
in [195]: result out[195]: phonenumber add_info country_code country order_info 0 34123425209 info1 34 spain 1.0 1 92654321762 info2 92 pakistan 4.0 2 12018883637 info3 1 usa 2.0 3 12428883637 info31 1242 bahamas 3.0 4 6323450001 info4 63 philippines 3.0 5 496789521134 info5 49 germany 4.0 6 00000000000 bad none nan nan
explanation:
in [216]: (ll['phonenumber'] .....: .apply(lambda x: pd.series([lookup.get(x[:4]), lookup.get(x[:3]), .....: lookup.get(x[:2]), lookup.get(x[:1])])) .....: ) out[216]: 0 1 2 3 0 none none 34 none 1 none none 92 none 2 none none none 1 3 1242 none none 1 4 none none 63 none 5 none none 49 none 6 none none none none
phones.csv: - i've intentionally added 1 bahamas number (1242...
) , 1 invalid number (00000000000
)
phonenumber, add_info 34123425209, info1 92654321762, info2 12018883637, info3 12428883637, info31 6323450001, info4 496789521134, info5 00000000000, bad