I have the following code, running as normal Python code:
def remove_missing_rows(app_list):
    print("########### missing row removal ###########")
    missing_rows = []
    ''' Remove any row that has missing data in the name, id, or description column '''
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Continue loop to next row; no need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Continue loop to next row; no need to check more columns
        if not row[4]:
            missing_rows.append(row)

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
Now, after writing this on a smaller sample, I wish to run it on a large data set, and I thought it would be useful to utilise multiple cores of my computer.

I'm struggling to implement this using the multiprocessing module, though. E.g. the idea would be to have core 1 work through the first half of the data set while core 2 works through the last half, and so on, in parallel. Is this possible?
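Something like the sketch below is what I have in mind (untested, and filter_chunk / remove_missing_rows_parallel are just names I made up for illustration):

import multiprocessing

def filter_chunk(chunk):
    # Per-chunk worker: keep rows where name (1), description (4) and id (5) are all present
    return [row for row in chunk if row[1] and row[4] and row[5]]

def remove_missing_rows_parallel(app_list, workers=2):
    # Split the data into one chunk per worker (e.g. first half / last half for 2 cores)
    chunk_size = max(1, (len(app_list) + workers - 1) // workers)
    chunks = [app_list[i:i + chunk_size] for i in range(0, len(app_list), chunk_size)]
    # Note: on platforms that spawn processes (e.g. Windows), call this from
    # under an if __name__ == "__main__" guard
    with multiprocessing.Pool(workers) as pool:
        filtered = pool.map(filter_chunk, chunks)  # each chunk is filtered in its own process
    # Stitch the filtered chunks back together in their original order
    return [row for chunk in filtered for row in chunk]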
This is not CPU bound, so try the code below instead. I've used a set for fast (hash-based) contains checks (which is what you use when you invoke if row not in missing_rows, and which is slow on a long list).
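For a quick sense of the difference, compare membership tests on a list versus a set (illustrative only; absolute numbers depend on your machine):

import timeit

haystack_list = list(range(100000))
haystack_set = set(haystack_list)

# Each lookup scans the whole list: O(n)
print(timeit.timeit(lambda: 99999 in haystack_list, number=1000))
# Each lookup is a hash probe: O(1) on average
print(timeit.timeit(lambda: 99999 in haystack_set, number=1000))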
If you're using the csv module you're holding tuples, which are hashable, so not many changes are needed:
def remove_missing_rows(app_list):
    print("########### missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove missing_rows from the original data
    # Note: this should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
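For example, assuming rows are tuples laid out as in your question (index 1 = name, 4 = description, 5 = id; the sample data here is made up), usage would look like this:

rows = [
    ("0", "App A", "GAME", "4.5", "Fun game", "id-1"),
    ("1", "", "GAME", "4.0", "No name here", "id-2"),  # missing name, removed
    ("2", "App C", "TOOLS", "3.5", "", "id-3"),        # missing description, removed
]
cleaned = remove_missing_rows(rows)
print(len(cleaned))  # 1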