[Cluster-devel] [RFC dlm/next 00/15] fs: dlm: performance

Wed Jun 23 15:14:39 UTC 2021

Hi,

this patch series to some more generic handling for some protocols but
mainly it should improve the performance. Here is why:

Current state:
 - Two global ordered workqueues for sending and receiving
 - Each connection has a "sock" mutex
 - The receiving functionality also processes dlm messages

Problem with this behaviour:
 - Each workqueue can only handle one connection in an ordered queuing
   which ends that we can never handle a parallel receive/send or
   processing for "all connections" (not per connection).
 - The sock mutex will block other send/recv handling if the same
   connection is queued in both workqueues
 - The sock mutex is also hold in processing dlm messages to block sending
   dlm messages

New behaviour:
 - Having io_workqueue for send/recv and process workqueue which is not
   ordered globally (but will be ordered per connection)
 - Allow parallel receive and send per connection, mutexes for receive
   and send workers will take care that we have ordered queuing _per_
   connection (same for process work)
 - The process workqueue will process dlm messages ordered in the
   background of io (send/recv) handling. While processing receive
   can still fill the "processqueue" with new possible arrival messages.

Notes:

I would like to get rid of the process workqueue but I didn't find a way
that we can still ordered receive while doing a ordered processing. Also
I am not sure how the unordered workqueue really works, I hope that the
workqueue handling is so clever that it stops queuing workers when it sees
that one worker is blocked at a mutex. Remember that we have these workers
per connection, so if one blocks others are still allowed to execute which
should make some performance improvements that other connections can also
be served. The main thing is I think that we switch from an ordered
workqueue to an unordered workqueue which should parallel the serving of
each connection instead doing each one after one.

Maybe the process workqueue can be changed to a tasklet, but this
requires a lot of more changes in dlm (although ACK messages can be
handled in non-sleepable contextes).

WARNING:

First time ever we let DLM application layer process parallel dlm messages,
BUT processing is per node/connection in an ordered way (which is
required). I tested it and I did saw no problems and think that global/per
lockspace multiple access per nodeid is correct protected for mutual
access.

- Alex

Alexander Aring (15):
  fs: dlm: clear CF_APP_LIMITED on close
  fs: dlm: introduce con_next_wq helper
  fs: dlm: move to static proto ops
  fs: dlm: introduce generic listen
  fs: dlm: auto load sctp module
  fs: dlm: generic connect func
  fs: dlm: fix multiple empty writequeue alloc
  fs: dlm: move receive loop into receive handler
  fs: dlm: introduce io_workqueue
  fs: dlm: introduce reconnect work
  fs: dlm: introduce process workqueue
  fs: dlm: remove send starve
  fs: dlm: move writequeue init to sendcon only
  fs: dlm: flush listen con
  fs: dlm: move srcu into loop call

 fs/dlm/lowcomms.c | 1548 +++++++++++++++++++++++----------------------
 fs/dlm/midcomms.c |   38 +-
 fs/dlm/midcomms.h |    3 +-
 3 files changed, 837 insertions(+), 752 deletions(-)

-- 
2.26.3