We have a web application with multiple modules, and we need to understand module usage by user and by role.
We extracted the logs in CSV format, and as traffic increases, our old Python script takes more than 30 minutes to process 6GB of log data. We considered several solutions, such as importing the data into a database and querying it with SQL, but that would add the overhead of running a SQL server and importing the data. Fortunately, we came across this Python library: Pandas.
Pandas is the most popular Python library for data analysis; it lets us work with data as a Series (1 dimension) or a DataFrame (2 dimensions).
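As a quick illustration of the two structures (the values here are arbitrary):

import pandas as pd

s = pd.Series([1, 2, 3])                        # a Series: one dimension, like a single column
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # a DataFrame: two dimensions, like a small table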
Our data is more than 6GB and a naive script will easily run out of memory. With Pandas, we can easily read the file in chunks.
#!/usr/bin/python
import pandas as pd

CHUNK_SIZE = 100000
role_name = 'admin'  # the user role we want to analyze (example value)

# csv columns: "timestamp","clientIp","sessionId","url","userRole","userUuid"
# clientIp (column 1) is not needed, so usecols skips it
chunks = pd.read_csv('data.csv', usecols=[0, 2, 3, 4, 5], chunksize=CHUNK_SIZE)
result_df = pd.DataFrame(columns=['sessionId', 'userRole', 'module'])
for df in chunks:
    # getModuleFromUrl() maps a request URL to a module name (defined elsewhere)
    df['module'] = df.apply(lambda row: getModuleFromUrl(row['url']), axis=1)
    result_df = pd.concat([result_df, df[['sessionId', 'userRole', 'module']]], axis=0).drop_duplicates()

# keep only the rows belonging to the role we are interested in
role_df = result_df.loc[result_df['userRole'] == role_name]
# count distinct sessions per module
agg_actions = role_df.groupby(by=['module'])['sessionId'].count().reset_index(name='count')
unique_users = role_df['sessionId'].unique()
print("total unique users (%s): %s" % (role_name, len(unique_users)))
# express each module count as a ratio of the unique user count
func_fmt = lambda x: "{0:.2f}".format(float(x) / len(unique_users))
agg_actions['percentage'] = list(map(func_fmt, agg_actions['count'].values))
print(agg_actions)
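The helper getModuleFromUrl is application-specific and not shown in this post. A minimal sketch, assuming (purely for illustration) that the module name is simply the first path segment of the URL, might look like this:

def getModuleFromUrl(url):
    # e.g. '/customers/42/accounts' -> 'customers' (hypothetical mapping)
    path = url.split('?', 1)[0]                     # drop any query string
    segments = [s for s in path.split('/') if s]
    return segments[0] if segments else 'unknown'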
This implementation still takes more than 30 minutes to process the 6GB log file. It is a big improvement over our old script, but 30+ minutes is still quite slow if we need multiple runs. Can we improve it further?
By profiling the script, we found a few areas that can be improved.
python -m cProfile --sort cumulative log_pandas.py &> out.log
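cProfile reports the cumulative time spent in each function. For per-line numbers like the ones quoted below, one simple approach is to wrap the suspect line with time.perf_counter (a minimal sketch; the exact figures will of course depend on the data and hardware):

import time

start = time.perf_counter()
df['module'] = df.apply(lambda row: getModuleFromUrl(row['url']), axis=1)
print("apply took %.2f seconds" % (time.perf_counter() - start))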
Vectorization is faster than DataFrame.apply
Vectorization applies a function to an entire array at once. Remember that a Pandas Series is a 1-dimensional array, which you can think of as a single column in an Excel sheet.
# this line takes 1.78 seconds
df['module'] = df.apply(lambda row: getModuleFromUrl(row['url']), axis=1)

# applying on the Series instead takes 0.68 seconds
df['module'] = getModuleSeries(df['url'])

def getModuleSeries(series):
    return series.apply(lambda x: getModuleFromUrl(x))
NumPy arrays can be even faster
The DataFrame.values attribute returns a NumPy representation of the given DataFrame, and Series.values does the same for a single column.
# by using the underlying NumPy array, the execution time is further cut to 0.31 seconds
df['module'] = getModuleNumpy(df['url'].values)

def getModuleNumpy(numpy_arr):
    # iterate over the raw array, avoiding the per-row overhead of DataFrame/Series apply
    return list(map(getModuleFromUrl, numpy_arr))
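As an aside, numpy.vectorize gives a more NumPy-flavoured way to write the same thing. Note that it is only a convenience wrapper around a Python-level loop, so it should perform roughly like the map version above rather than delivering true vectorized speed:

import numpy as np

getModuleVectorized = np.vectorize(getModuleFromUrl)
df['module'] = getModuleVectorized(df['url'].values)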
The performance is much better now; next we move to the second most time-consuming line.
result_df = pd.concat([result_df, df[['sessionId', 'userRole', 'module']]], axis=0).drop_duplicates()
This is well explained in this Stack Overflow answer: https://stackoverflow.com/questions/36489576/why-does-concatenation-of-dataframes-get-exponentially-slower
"Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame."
The fix is easy: append each chunk's DataFrame to a list inside the for-loop and call concat once after the loop (pandas.concat accepts a list of DataFrames, so the copying happens only once).
result_list = []
for df in chunks:
    # ...
    result_list.append(df[['sessionId', 'userRole', 'module']])
    # ...

# concatenate once, after the loop
result_df = pd.concat(result_list, axis=0).drop_duplicates()
When we re-run the script, the overall time taken is cut to about 147 seconds.
If you are satisfied with the resulting DataFrame and want to reuse it in the future, you can also save it to an HDF5 (Hierarchical Data Format) file and skip the re-processing entirely.
KEY_DF = 'result_df'  # key under which the DataFrame is stored (any string key works)
# HDFStore requires the PyTables package (pip install tables)
store = pd.HDFStore('access_log.h5')
store.put(KEY_DF, result_df)

# on a later run, check whether the data is already in the store
if KEY_DF not in store:
    print("DF not exist, regenerating ...")
Conclusion
- Load the data in chunks to avoid memory exhaustion
- Performance: DataFrame.apply < Series.apply < applying the function over the underlying NumPy array
- Store the DataFrame in an HDFStore to save future processing time
Other common idioms:
- Avoid using regex to match strings; in practice this means the URL patterns should avoid conventions that force regex matching, such as /customers/{customerId}/accounts/{accountId}
- A list comprehension is faster than a traditional for-loop (see the small example below): https://nyu-cds.github.io/python-performance-tips/08-loops/
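For instance, building the module list with a list comprehension avoids the repeated append calls of an explicit loop (urls here stands for any iterable of URL strings):

# explicit for-loop: repeated append calls
modules = []
for url in urls:
    modules.append(getModuleFromUrl(url))

# list comprehension: same result, typically faster
modules = [getModuleFromUrl(url) for url in urls]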