Archive for the ‘Knowledge Base’ Category

jcmd requires the presence of /tmp/hsperfdata_xxxx, but on Amazon Linux 2, /tmp/hsperfdata_elasticsearch doesn't exist, so jcmd and other JVM analysis tools, such as jps and jvmtop, cannot see the Elasticsearch process.

The reason is that in Amazon Linux 2 (and RHEL 7), the init daemon has been replaced by systemd, and a systemd service has a PrivateTmp setting.

elasticsearch.service sets PrivateTmp to true, which means a systemd-private-xxxx directory is created under /tmp instead, and hsperfdata_elasticsearch lives inside it. jcmd doesn't have access to this directory.

To solve this, set PrivateTmp=false in /usr/lib/systemd/system/elasticsearch.service.
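
For example (a sketch; the unit file path can differ between distributions, and a drop-in override created with systemctl edit achieves the same):

[Service]
# keep the existing settings, only change this line
PrivateTmp=false

Then reload systemd and restart the service so the change takes effect:

sudo systemctl daemon-reload
sudo systemctl restart elasticsearch

After the restart, /tmp/hsperfdata_elasticsearch should appear again and jps/jcmd can find the Elasticsearch JVM.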

We have a web application with multiple modules, and we need to understand module usage by user and by role.

We extracted the log in CSV format, and as traffic increases, our old Python script takes more than 30 minutes to process 6GB of log data. We considered multiple solutions, such as importing into a database and querying with SQL, but that would add the overhead of a SQL server and the data import. Fortunately, we came across the Python library Pandas.

Pandas is the most popular Python library for data analysis; it lets us analyze data as a Series (1 dimension) or a DataFrame (2 dimensions).

Our data is more than 6GB, and a naive script would easily run out of memory. With Pandas, we can easily read the file in chunks.

#!/usr/bin/python
import pandas as pd

CHUNK_SIZE = 100000
# csv columns: "timestamp","clientIp","sessionId","url","userRole","userUuid"
# clientIp (column 1) is not needed, so it is left out of usecols
chunks = pd.read_csv('data.csv', usecols=[0, 2, 3, 4, 5], chunksize=CHUNK_SIZE)

result_df = pd.DataFrame(columns=['sessionId', 'userRole', 'module'])
for df in chunks:
  # getModuleFromUrl is our own helper (not shown) that maps a request url to a module name
  df['module'] = df.apply(lambda row: getModuleFromUrl(row['url']), axis=1)

  result_df = pd.concat([result_df, df[['sessionId', 'userRole', 'module']]], axis=0).drop_duplicates()

# role_name is the user role being analysed (defined elsewhere, e.g. passed in as a script argument)
role_df = result_df.loc[result_df['userRole'] == role_name]

# number of sessions per module for this role
agg_actions = role_df.groupby(by=['module'])['sessionId'].count().reset_index(name='count')
unique_users = role_df['sessionId'].unique()

print("total unique users (%s): %s" % (role_name, len(unique_users)))

# share of sessions per module, formatted with two decimal places
func_fmt = lambda x: "{0:.2f}".format(float(x) / len(unique_users))
agg_actions['percentage'] = list(map(func_fmt, agg_actions['count'].values))

print(agg_actions)


This implementation takes 30+ minutes to process the 6GB log file. It is much improved compared to our old script, but 30+ minutes is still quite slow if we need multiple runs. Can we improve it further?

By profiling the script, we found a few areas that can be improved.

python -m cProfile --sort cumulative log_pandas.py &> out.log

Vectorization is faster than DataFrame.apply

Vectorization applies the function to entire arrays. Remember that a Pandas Series is a one-dimensional array, which you can think of as a column in an Excel sheet.

# this line takes 1.78 seconds
df['module'] = df.apply(lambda row: getModuleFromUrl(row['url']), axis=1)

# applying on the Series takes 0.68 seconds
df['module'] = getModuleSeries(df['url'])

def getModuleSeries(series):
    return series.apply(lambda x: getModuleFromUrl(x))

A NumPy array can be even faster

The DataFrame.values attribute returns a NumPy representation of the given DataFrame (Series.values does the same for a single column).

# by using a NumPy array, the execution time is further cut to 0.31 seconds
df['module'] = getModuleNumpy(df['url'].values)

def getModuleNumpy(numpy_arr):
    # iterate over the raw NumPy array and return a plain list
    return [getModuleFromUrl(x) for x in numpy_arr]

The performance is further improved now, and next we move on to the second most time-consuming line.

result_df = pd.concat([result_df, df[['sessionId', 'userRole', 'module']]], axis=0).drop_duplicates()

This is well explained in this stackoverflow answer: https://stackoverflow.com/questions/36489576/why-does-concatenation-of-dataframes-get-exponentially-slower

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame.

The fix is easy: append each chunk's DataFrame to a list inside the for-loop and concat once after the loop (pandas.concat can take a list as its parameter, so the copying happens only once).

result_list = []
for df in chunks:
  ... ...
  result_list.append(df[['sessionId', 'userRole', 'module']])
  ... ...
# a single concat after the loop copies the data only once
result_df = pd.concat(result_list, axis=0).drop_duplicates()

When we re-run the script, the overall time taken is cut to 147.02790904 seconds.

If you are satisfied with the DataFrame and want to reuse it in the future, you can also save it to an HDF5 (Hierarchical Data Format) file to avoid re-running the whole pipeline.

KEY_DF = 'access_log_df'  # key under which the DataFrame is stored (the name is our own choice)
store = pd.HDFStore('access_log.h5')

# check whether the DataFrame is already in the store
if KEY_DF not in store:
    print("DF does not exist, regenerating ...")
    # ... re-run the chunked processing above, then persist the result
    store.put(KEY_DF, result_df)
else:
    result_df = store[KEY_DF]

Conclusion

  1. Load the data in chunks to avoid memory exhaustion.
  2. Performance: DataFrame.apply < Series.apply < applying a lambda function on a NumPy array.
  3. Store the DataFrame in an HDFStore to save future processing time.

Other common idioms:

  1. Avoid using regex to match strings; in our case this means the URL scheme should avoid patterns with embedded identifiers, such as /customers/{customerId}/accounts/{accountId}, which can only be matched with a regex.
  2. A list comprehension is faster than a traditional for-loop (see the sketch below). https://nyu-cds.github.io/python-performance-tips/08-loops/
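
For example, a minimal sketch (urls is a hypothetical list of request URLs, and getModuleFromUrl is the same helper used above):

# traditional for-loop
modules = []
for url in urls:
    modules.append(getModuleFromUrl(url))

# list comprehension: same result, usually faster
modules = [getModuleFromUrl(url) for url in urls]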

This error happened when I configured RoundCube behind AWS ALB.

A lot of answers suggest that RoundCube behind a proxy may have this issue. The fix is to make sure $config['ip_check'] = false in your config.inc.php.
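
For reference, this is how the setting looks in the config file (RoundCube's configuration is a PHP file; the path below assumes a default installation layout):

// config/config.inc.php
// disable the session IP check, since requests behind the ALB may arrive from varying addresses
$config['ip_check'] = false;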

This exception was thrown when I tried to launch my Android app with a DrawerLayout. The app follows the official guide from Google: https://developer.android.com/training/implementing-navigation/nav-drawer. But it didn't work.

After a lot of research, I could not find any workable solution.

In the end, I noticed another DrawerLayout from a different namespace, i.e., android.support.v4.widget.DrawerLayout. My app has been using AndroidX, so I shouldn't use the old support library namespace any more.

After following the mapping in https://developer.android.com/jetpack/androidx/migrate, I started to see different exceptions, which were for other support libraries.

AndroidX maps the original support library API packages into the androidx namespace; the two should not be mixed together.
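
For example, the DrawerLayout import change looks like this (a sketch of just the relevant lines; the new class name follows the AndroidX mapping table):

// old support library import -- do not mix this with AndroidX:
// import android.support.v4.widget.DrawerLayout;

// AndroidX equivalent after migration:
import androidx.drawerlayout.widget.DrawerLayout;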

Ubuntu cannot start Terminal

Posted: September 27, 2018 in Linux

When I try to start gnome-terminal through xterm, I see this error: Error calling StartServiceByName for org.gnome.Terminal

To fix this issue: $ localectl set-locale LANG="en_US.UTF-8"

Reboot

Socket connections use file descriptors, and the number of file descriptors is limited.

On Ubuntu, you can run ulimit -a to see the current limits. (ulimit is a shell built-in, not a binary, so no sudo is needed.)

To increase the open-file limit, up to a maximum of 4096, you can run ulimit -n 4096.

To go beyond 4096, you must edit the system file /etc/security/limits.conf, for example as shown below.
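
An illustrative /etc/security/limits.conf entry that raises the open-file limit for all users (the value 65535 is only an example):

# <domain>  <type>  <item>   <value>
*           soft    nofile   65535
*           hard    nofile   65535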

In the current terminal, you will need a new login session (for example via su) for the settings in limits.conf to take effect.

After restarting tomcat, the log will return to normal.

To keep it short, in my case it was because we had two WARs deployed to the same Tomcat, and both were writing to the same log file.

This error has been in Prestashop for quite a while and, unfortunately, it still exists when installing the latest 1.7.3.

To fix this error, you need to unzip the language file manually.

SSH to the server, go to app/Resources/translations, and you will find that there is only a zip file for the language whose translation you want to update.

Just unzip the zip file, making sure all the files end up directly in that directory (no subdirectory), for example as sketched below.
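
A sketch of the manual step (the path and the archive name are placeholders; use the ones from your own installation and locale):

cd /path/to/prestashop/app/Resources/translations/<locale>
# -o overwrites existing files, -j drops internal folders so everything lands in this directory
unzip -o -j <translation-pack>.zip -d .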

A MySQL cursor is a convenient way to loop through data, but it may sometimes exit prematurely.

Below is one example

drop procedure if exists a_stored_procedure;

DELIMITER $$

CREATE PROCEDURE a_stored_procedure()
BEGIN

  DECLARE done INT DEFAULT FALSE;
  DECLARE user_id, group_id BIGINT;
  DECLARE a_cursor CURSOR FOR select userId from User;

  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
  
  OPEN a_cursor;

  read_loop: LOOP
    FETCH a_cursor INTO user_id;
    IF done THEN
      LEAVE read_loop;
    END IF;

    select groupId into @group_id from UserGroup where userId = user_id limit 1;
    
    -- other statements

    
  END LOOP;

  CLOSE a_cursor;
 
 
END $$

DELIMITER ;

In the above stored procedure, the cursor loop is meant to terminate when there are no more users, but it will actually also terminate when select groupId into @group_id returns no result, because that statement also triggers the NOT FOUND handler. One way to work around this is sketched below.
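
A common workaround (a sketch, not the only option) is to reset the flag right after any statement inside the loop that may trigger the NOT FOUND handler, so that only the FETCH can end the loop:

  read_loop: LOOP
    FETCH a_cursor INTO user_id;
    IF done THEN
      LEAVE read_loop;
    END IF;

    select groupId into @group_id from UserGroup where userId = user_id limit 1;
    -- undo the handler's effect if the SELECT found no row
    SET done = FALSE;

    -- other statements

  END LOOP;

Alternatively, the inner SELECT can be wrapped in its own BEGIN ... END block with its own NOT FOUND handler so it never touches the loop's flag.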

Reference: https://dev.mysql.com/doc/refman/5.6/en/declare-handler.html

To resolve this issue, first list all running jobs

salt-run jobs.active

https://docs.saltstack.com/en/latest/topics/jobs/

Kill the job

# kill all jobs
salt '*' saltutil.kill_all_jobs

# kill the job with id
salt '*' saltutil.kill_job <job id>

https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.saltutil.html#module-salt.modules.saltutil