[PATCH 0/2] scop GFP_NOFS api

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 0/2] scop GFP_NOFS api

Michal Hocko
Hi,
we have discussed this topic at LSF/MM this year. There was a general
interest in the scope GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction) please
have a look at patch 1 which adds memalloc_nofs_{save,restore} api
which basically copies what we have for the scope GFP_NOIO allocation
context. I haven't converted any of the FS myself because that is way
beyond my area of expertise but I would be happy to help with further
changes on the MM front as well as in some more generic code paths.

Dave had an idea on how to further improve the reclaim context to be
less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
and FS specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow doing some provably save actions
that would be skipped due to GFP_NOFS normally.  I like this idea and
I believe we can go that direction regardless of the approach taken here.
Many filesystems simply need to cleanup their NOFS usage first before
diving into a more complex changes.

The patch 2 is a debugging aid which warns about explicit allocation
requests from the scope context. This is should help to reduce the
direct usage of the NOFS flags to bare minimum in favor of the scope
API. It is not aimed to be merged upstream. I would hope Andrew took it
into mmotm tree to give it linux-next exposure and allow developers to
do further cleanups.  There is a new kernel command line parameter which
has to be used for the debugging to be enabled.

I think the GFP_NOIO should be seeing the same clean up.

Any feedback is highly appreciated.


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 1/2] mm: add PF_MEMALLOC_NOFS

Michal Hocko
From: Michal Hocko <[hidden email]>

GFP_NOFS context is used for the following 4 reasons currently
        - to prevent from deadlocks when the lock held by the allocation
          context would be needed during the memory reclaim
        - to prevent from stack overflows during the reclaim because
          the allocation is performed from a deep context already
        - to prevent lockups when the allocation context depends on
          other reclaimers to make a forward progress indirectly
        - just in case because this would be safe from the fs POV

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in the
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
API to control the scope. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO.

Xfs has already had a similar functionality as PF_FSTRANS so let's just
give it a more generic name and make it usable for others as well and
move the GFP_NOFS context tracking to the page allocator. Xfs has its
own accessor functions but let's keep them for now to reduce this patch
as minimum.

This patch shouldn't introduce any functional changes. Xfs code paths
preserve their semantic. kmem_flags_convert() doesn't need to evaluate
the flag anymore because it is the page allocator to care about the
flag. memalloc_noio_flags is renamed to current_gfp_context because it
now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much and possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <[hidden email]>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       |  8 ++++++++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 mm/page_alloc.c           |  8 +++++---
 mm/vmscan.c               |  4 ++--
 9 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
  * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
  * the filesystem here and potentially deadlocking.
  */
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
  noio_flag = memalloc_noio_save();
 
  lflags = kmem_flags_convert(flags);
  ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
  memalloc_noio_restore(noio_flag);
 
  return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
  lflags = GFP_ATOMIC | __GFP_NOWARN;
  } else {
  lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if (flags & KM_NOFS)
  lflags &= ~__GFP_FS;
  }
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
  struct xfs_btree_split_args *args = container_of(work,
  struct xfs_btree_split_args, work);
  unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
+ unsigned long new_pflags = PF_MEMALLOC_NOFS;
 
  /*
  * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
  * We hand off the transaction to the completion thread now, so
  * clear the flag here.
  */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
  * thus we need to mark ourselves as being in a transaction manually.
  * Similarly for freeze protection.
  */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
  /* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
  * Given that we do not allow direct reclaim to call us, we should
  * never be called while in a filesystem transaction.
  */
- if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
  goto redirty;
 
  /* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
  bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
  /* Mark this thread as being in a transaction */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  /*
  * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
  if (blocks > 0) {
  error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
  if (error != 0) {
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  return -ENOSPC;
  }
  tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
  tp->t_blk_res = 0;
  }
 
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
  xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  xfs_trans_free(tp);
 
  /*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
  if (commit_lsn == -1 && !error)
  error = -EIO;
  }
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
  xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
  xfs_log_done(mp, tp->t_ticket, NULL, false);
 
  /* mark this thread as no longer being in a transaction */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
  xfs_trans_free(tp);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */
 #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
 #define PF_FROZEN 0x00010000 /* frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
 #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
 #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -2140,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+ /*
+ * NOIO implies both NOIO and NOFS and it is a weaker context
+ * so always make sure it makes precendence
+ */
  if (unlikely(current->flags & PF_MEMALLOC_NOIO))
  flags &= ~(__GFP_IO | __GFP_FS);
+ else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+ flags &= ~__GFP_FS;
  return flags;
 }
 
@@ -2162,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
  current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+ current->flags |= PF_MEMALLOC_NOFS;
+ return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
  goto out;
 
  /*
- * Runtime PM, block IO and its error handling path can deadlock
- * because I/O on the device might not complete.
+ * Apply scoped allocation constrains. This is mainly about
+ * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+ * allocation requests from a particular context which has
+ * been marked by memalloc_no{fs,io}_{save,restore}
  */
- alloc_mask = memalloc_noio_flags(gfp_mask);
+ alloc_mask = current_gfp_context(gfp_mask);
  ac.spread_dirty_pages = false;
 
  page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
  unsigned long nr_reclaimed;
  struct scan_control sc = {
  .nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
  .order = order,
  .nodemask = nodemask,
  .priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
  struct reclaim_state reclaim_state;
  struct scan_control sc = {
  .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
  .order = order,
  .priority = ZONE_RECLAIM_PRIORITY,
  .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
--
2.8.0.rc3


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context

Michal Hocko
In reply to this post by Michal Hocko
From: Michal Hocko <[hidden email]>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly by debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/page_alloc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86bb5d6ddd7d..085d00280496 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3750,6 +3750,61 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+ debug_scope_gfp = true;
+ return 0;
+}
+
+/*
+ * spit the stack trace if the given gfp_mask clears flags which are context
+ * wide cleared. Such a caller can remove special flags clearing and rely on
+ * the context wide mask.
+ */
+static inline void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+ gfp_t restrict_mask;
+
+ if (likely(!debug_scope_gfp))
+ return;
+
+ /* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+ return;
+
+ if (current->flags & PF_MEMALLOC_NOIO)
+ restrict_mask = __GFP_IO;
+ else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+ restrict_mask = __GFP_FS;
+ else
+ return;
+
+ if ((gfp_mask & restrict_mask) != restrict_mask) {
+ /*
+ * If you see this this warning then the code does:
+ * memalloc_no{fs,io}_save()
+ * ...
+ *    foo()
+ *      alloc_page(GFP_NO{FS,IO})
+ * ...
+ * memalloc_no{fs,io}_restore()
+ *
+ * allocation which is unnecessary because the scope gfp
+ * context will do that for all allocation requests already.
+ * If foo() is called from multiple contexts then make sure other
+ * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+ * scope protection into particular paths or change the gfp mask
+ * to GFP_KERNEL.
+ */
+ pr_info("Unnecesarily specific gfp mask:%#x(%pGg) for the %s task wide context\n", gfp_mask, &gfp_mask,
+ (current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS");
+ dump_stack();
+ }
+}
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3796,6 +3851,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
  ac.nodemask);
 
  /* First allocation attempt */
+ debug_scope_gfp_context(gfp_mask);
  page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
  if (likely(page))
  goto out;
--
2.8.0.rc3


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context

Dave Chinner
On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:

> From: Michal Hocko <[hidden email]>
>
> THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
>
> It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
>
> Let's help this process and add a debugging tool to catch when an
> explicit allocation request for GFP_NO{FS,IO} is done from the scope
> context. The printed stacktrace should help to identify the caller
> and evaluate whether it can be changed to use a wider context or whether
> it is called from another potentially dangerous context which needs
> a scope protection as well.

You're going to get a large number of these from XFS. There are call
paths in XFs that get called both inside and outside transaction
context, and many of them are marked with GFP_NOFS to prevent issues
that have cropped up in the past.

Often these are to silence lockdep warnings (e.g. commit b17cb36
("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
lockdep gets very unhappy about the same functions being called with
different reclaim contexts. e.g.  directory block mapping might
occur from readdir (no transaction context) or within transactions
(create/unlink). hence paths like this are tagged with GFP_NOFS to
stop lockdep emitting false positive warnings....

Removing the GFP_NOFS flags in situations like this is simply going
to restart the flood of false positive lockdep warnings we've
silenced over the years, so perhaps lockdep needs to be made smarter
as well...

Cheers,

Dave.

--
Dave Chinner
[hidden email]

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

Dave Chinner
In reply to this post by Michal Hocko
On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:

> From: Michal Hocko <[hidden email]>
>
> GFP_NOFS context is used for the following 4 reasons currently
> - to prevent from deadlocks when the lock held by the allocation
>  context would be needed during the memory reclaim
> - to prevent from stack overflows during the reclaim because
>  the allocation is performed from a deep context already
> - to prevent lockups when the allocation context depends on
>  other reclaimers to make a forward progress indirectly
> - just in case because this would be safe from the fs POV

- silencing lockdep false positives

> Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
> API to control the scope. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO.
>
> Xfs has already had a similar functionality as PF_FSTRANS so let's just
> give it a more generic name and make it usable for others as well and
> move the GFP_NOFS context tracking to the page allocator. Xfs has its
> own accessor functions but let's keep them for now to reduce this patch
> as minimum.

Can you split this into two patches? The first simply does this:

#define PF_MEMALLOC_NOFS PF_FSTRANS

and changes only the XFS code to use PF_MEMALLOC_NOFS.

The second patch can then do the rest of the mm API changes that we
don't actually care about in XFS at all.  That way I can carry all
the XFS changes in the XFS tree and not have to worry about when
this stuff gets merged or conflicts with the rest of the work that
is being done to the mm/ code and whatever tree that eventually
lands in...

Cheers,

Dave.
--
Dave Chinner
[hidden email]

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

Michal Hocko
On Wed 27-04-16 09:07:02, Dave Chinner wrote:

> On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[hidden email]>
> >
> > GFP_NOFS context is used for the following 4 reasons currently
> > - to prevent from deadlocks when the lock held by the allocation
> >  context would be needed during the memory reclaim
> > - to prevent from stack overflows during the reclaim because
> >  the allocation is performed from a deep context already
> > - to prevent lockups when the allocation context depends on
> >  other reclaimers to make a forward progress indirectly
> > - just in case because this would be safe from the fs POV
>
> - silencing lockdep false positives
>
> > Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
> > API to control the scope. This is basically copying
> > memalloc_noio_{save,restore} API we have for other restricted allocation
> > context GFP_NOIO.
> >
> > Xfs has already had a similar functionality as PF_FSTRANS so let's just
> > give it a more generic name and make it usable for others as well and
> > move the GFP_NOFS context tracking to the page allocator. Xfs has its
> > own accessor functions but let's keep them for now to reduce this patch
> > as minimum.
>
> Can you split this into two patches? The first simply does this:
>
> #define PF_MEMALLOC_NOFS PF_FSTRANS
>
> and changes only the XFS code to use PF_MEMALLOC_NOFS.
>
> The second patch can then do the rest of the mm API changes that we
> don't actually care about in XFS at all.  That way I can carry all
> the XFS changes in the XFS tree and not have to worry about when
> this stuff gets merged or conflicts with the rest of the work that
> is being done to the mm/ code and whatever tree that eventually
> lands in...

Sure I will do that

--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context

Michal Hocko
In reply to this post by Dave Chinner
On Wed 27-04-16 08:58:45, Dave Chinner wrote:

> On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[hidden email]>
> >
> > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> >
> > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> >
> > Let's help this process and add a debugging tool to catch when an
> > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > context. The printed stacktrace should help to identify the caller
> > and evaluate whether it can be changed to use a wider context or whether
> > it is called from another potentially dangerous context which needs
> > a scope protection as well.
>
> You're going to get a large number of these from XFS. There are call
> paths in XFs that get called both inside and outside transaction
> context, and many of them are marked with GFP_NOFS to prevent issues
> that have cropped up in the past.
>
> Often these are to silence lockdep warnings (e.g. commit b17cb36
> ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> lockdep gets very unhappy about the same functions being called with
> different reclaim contexts. e.g.  directory block mapping might
> occur from readdir (no transaction context) or within transactions
> (create/unlink). hence paths like this are tagged with GFP_NOFS to
> stop lockdep emitting false positive warnings....

I would much rather see lockdep being fixed than abusing GFP_NOFS to
workaround its limitations. GFP_NOFS has a real consequences to the
memory reclaim. I will go and check the commit you mentioned and try
to understand why that is a problem. From what you described above
I would like to get rid of exactly this kind of usage which is not
really needed for the recursion protection.

Thanks!
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

Michal Hocko
In reply to this post by Michal Hocko
From: Michal Hocko <[hidden email]>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more genric name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <[hidden email]>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
addes a new PF flag which is just an alias to the existing PF_FSTRANS
and does all the renaming and the second one to introduce the generic
API which only changes the bare minimum in the xfs proper.

 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
  * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
  * the filesystem here and potentially deadlocking.
  */
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
  noio_flag = memalloc_noio_save();
 
  lflags = kmem_flags_convert(flags);
  ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
  memalloc_noio_restore(noio_flag);
 
  return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
  lflags = GFP_ATOMIC | __GFP_NOWARN;
  } else {
  lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
  lflags &= ~__GFP_FS;
  }
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
  struct xfs_btree_split_args *args = container_of(work,
  struct xfs_btree_split_args, work);
  unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
+ unsigned long new_pflags = PF_MEMALLOC_NOFS;
 
  /*
  * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
  * We hand off the transaction to the completion thread now, so
  * clear the flag here.
  */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
  * thus we need to mark ourselves as being in a transaction manually.
  * Similarly for freeze protection.
  */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
  /* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
  * Given that we do not allow direct reclaim to call us, we should
  * never be called while in a filesystem transaction.
  */
- if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
  goto redirty;
 
  /* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
  bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
  /* Mark this thread as being in a transaction */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  /*
  * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
  if (blocks > 0) {
  error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
  if (error != 0) {
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  return -ENOSPC;
  }
  tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
  tp->t_blk_res = 0;
  }
 
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
  xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  xfs_trans_free(tp);
 
  /*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
  if (commit_lsn == -1 && !error)
  error = -EIO;
  }
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
  xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
  xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
  xfs_log_done(mp, tp->t_ticket, NULL, false);
 
  /* mark this thread as no longer being in a transaction */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
  xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
  xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
--
2.8.0.rc3


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API

Michal Hocko
From: Michal Hocko <[hidden email]>

GFP_NOFS context is used for the following 4 reasons currently
        - to prevent from deadlocks when the lock held by the allocation
          context would be needed during the memory reclaim
        - to prevent from stack overflows during the reclaim because
          the allocation is performed from a deep context already
        - to prevent lockups when the allocation context depends on
          other reclaimers to make a forward progress indirectly
        - just in case because this would be safe from the fs POV

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in the
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much and possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <[hidden email]>
---
 fs/xfs/kmem.h         |  2 +-
 include/linux/gfp.h   |  8 ++++++++
 include/linux/sched.h | 34 ++++++++++++++++++++++++++--------
 mm/page_alloc.c       |  8 +++++---
 mm/vmscan.c           |  4 ++--
 5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 0d83f332e5c2..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
  lflags = GFP_ATOMIC | __GFP_NOWARN;
  } else {
  lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+ if (flags & KM_NOFS)
  lflags &= ~__GFP_FS;
  }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 820db8f98bfc..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */
 #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
 #define PF_FROZEN 0x00010000 /* frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
 #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
 #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -2115,8 +2115,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2142,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+ /*
+ * NOIO implies both NOIO and NOFS and it is a weaker context
+ * so always make sure it makes precendence
+ */
  if (unlikely(current->flags & PF_MEMALLOC_NOIO))
  flags &= ~(__GFP_IO | __GFP_FS);
+ else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+ flags &= ~__GFP_FS;
  return flags;
 }
 
@@ -2164,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
  current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+ current->flags |= PF_MEMALLOC_NOFS;
+ return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
  goto out;
 
  /*
- * Runtime PM, block IO and its error handling path can deadlock
- * because I/O on the device might not complete.
+ * Apply scoped allocation constrains. This is mainly about
+ * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+ * allocation requests from a particular context which has
+ * been marked by memalloc_no{fs,io}_{save,restore}
  */
- alloc_mask = memalloc_noio_flags(gfp_mask);
+ alloc_mask = current_gfp_context(gfp_mask);
  ac.spread_dirty_pages = false;
 
  page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
  unsigned long nr_reclaimed;
  struct scan_control sc = {
  .nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
  .order = order,
  .nodemask = nodemask,
  .priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
  struct reclaim_state reclaim_state;
  struct scan_control sc = {
  .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
  .order = order,
  .priority = ZONE_RECLAIM_PRIORITY,
  .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
--
2.8.0.rc3


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API

Michal Hocko
On Wed 27-04-16 13:54:35, Michal Hocko wrote:
> From: Michal Hocko <[hidden email]>
>

Ups missed Dave's note about:

> GFP_NOFS context is used for the following 4 reasons currently
> - to prevent from deadlocks when the lock held by the allocation
>  context would be needed during the memory reclaim
> - to prevent from stack overflows during the reclaim because
>  the allocation is performed from a deep context already
> - to prevent lockups when the allocation context depends on
>  other reclaimers to make a forward progress indirectly
> - just in case because this would be safe from the fs POV

        - silence lockdep false positives
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

Andreas Dilger
In reply to this post by Michal Hocko
On Apr 27, 2016, at 5:54 AM, Michal Hocko <[hidden email]> wrote:

>
> From: Michal Hocko <[hidden email]>
>
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more genric name PF_MEMALLOC_NOFS which is in line with an exiting
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
>
> This patch doesn't introduce any functional change.
>
> Signed-off-by: Michal Hocko <[hidden email]>
> ---
> Hi,
> as suggested by Dave, I have split up [1] into two parts. The first one
> addes a new PF flag which is just an alias to the existing PF_FSTRANS
> and does all the renaming and the second one to introduce the generic
> API which only changes the bare minimum in the xfs proper.
>
> fs/xfs/kmem.c             |  4 ++--
> fs/xfs/kmem.h             |  2 +-
> fs/xfs/libxfs/xfs_btree.c |  2 +-
> fs/xfs/xfs_aops.c         |  6 +++---
> fs/xfs/xfs_trans.c        | 12 ++++++------
> include/linux/sched.h     |  2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 686ba6fb20dd..73f6ab59c664 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> * the filesystem here and potentially deadlocking.
> */
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> noio_flag = memalloc_noio_save();
>
> lflags = kmem_flags_convert(flags);
> ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> memalloc_noio_restore(noio_flag);
Not really the fault of this patch, but it brings this nasty bit of code into
the light.  Is all of this machinery still needed given that __vmalloc() can
accept GFP flags?  If yes, wouldn't it be better to fix __vmalloc() to honor
the GFP flags instead of working around it in the filesystem code?

Cheers, Andreas

> return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index d1c66e465ca5..0d83f332e5c2 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> } else {
> lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> lflags &= ~__GFP_FS;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index a0eb18ce3ad3..326566f4a131 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
> struct xfs_btree_split_args *args = container_of(work,
> struct xfs_btree_split_args, work);
> unsigned long pflags;
> - unsigned long new_pflags = PF_FSTRANS;
> + unsigned long new_pflags = PF_MEMALLOC_NOFS;
>
> /*
> * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d12dfcfd0cc8..6d816ff0b763 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
> * We hand off the transaction to the completion thread now, so
> * clear the flag here.
> */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return 0;
> }
>
> @@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
> * thus we need to mark ourselves as being in a transaction manually.
> * Similarly for freeze protection.
> */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>
> /* we abort the update if there was an IO error */
> @@ -979,7 +979,7 @@ xfs_vm_writepage(
> * Given that we do not allow direct reclaim to call us, we should
> * never be called while in a filesystem transaction.
> */
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> goto redirty;
>
> /* Is this page beyond the end of the file? */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 748b16aff45a..1d247366c733 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -176,7 +176,7 @@ xfs_trans_reserve(
> bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>
> /* Mark this thread as being in a transaction */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> /*
> * Attempt to reserve the needed disk blocks by decrementing
> @@ -186,7 +186,7 @@ xfs_trans_reserve(
> if (blocks > 0) {
> error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> if (error != 0) {
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return -ENOSPC;
> }
> tp->t_blk_res += blocks;
> @@ -263,7 +263,7 @@ xfs_trans_reserve(
> tp->t_blk_res = 0;
> }
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> return error;
> }
> @@ -921,7 +921,7 @@ __xfs_trans_commit(
>
> xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free(tp);
>
> /*
> @@ -951,7 +951,7 @@ __xfs_trans_commit(
> if (commit_lsn == -1 && !error)
> error = -EIO;
> }
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
> xfs_trans_free(tp);
>
> @@ -1005,7 +1005,7 @@ xfs_trans_cancel(
> xfs_log_done(mp, tp->t_ticket, NULL, false);
>
> /* mark this thread as no longer being in a transaction */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
> xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index acfc32b30704..820db8f98bfc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
>
> +#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
> +
> /*
>  * Only the _current_ task can read/write to tsk->flags, but other
>  * tasks can access tsk->flags in readonly mode for example
> --
> 2.8.0.rc3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cheers, Andreas






------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev

signature.asc (850 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

Michal Hocko
On Wed 27-04-16 11:41:51, Andreas Dilger wrote:
> On Apr 27, 2016, at 5:54 AM, Michal Hocko <[hidden email]> wrote:
[...]

> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> > * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> > * the filesystem here and potentially deadlocking.
> > */
> > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > noio_flag = memalloc_noio_save();
> >
> > lflags = kmem_flags_convert(flags);
> > ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
> >
> > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > memalloc_noio_restore(noio_flag);
>
> Not really the fault of this patch, but it brings this nasty bit of code into
> the light.  Is all of this machinery still needed given that __vmalloc() can
> accept GFP flags?  If yes, wouldn't it be better to fix __vmalloc() to honor
> the GFP flags instead of working around it in the filesystem code?

This is not that easy. __vmalloc can accept gfp flags but it doesn't
honor __GFP_IO 100%. IIRC some paths like page table allocations are
hardcoded GFP_KERNEL. Besides that I would like to have GFP_NOIO used
via memalloc_noio_{save,restore} API as well for the similar reasons as
GFP_NOFS - it is just easier to explain scope than particular code paths
which might be shared.
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API

Michal Hocko
In reply to this post by Michal Hocko
Hi Dave,

On Wed 27-04-16 13:54:35, Michal Hocko wrote:
[...]

> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 0d83f332e5c2..b35688a54c9a 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
>   } else {
>   lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> + if (flags & KM_NOFS)
>   lflags &= ~__GFP_FS;
>   }
>  

I was trying to reproduce the false positives you mentioned in other
email by reverting b17cb364dbbb ("xfs: fix missing KM_NOFS tags to keep
lockdep happy"). I could really hit the one below but then it turned out
that it is this hunk above which causes the lockdep warning. Now I am
trying to understand how is this possible as ~__GFP_FS happens at the
page allocation level so we should never leak the __GFP_FS context when
the PF flag is set.

One possible explanation would be that some code path hands over to
a kworker which then loses the PF flag but it would get a gfp_mask which
would have the FS flag cleared with the original code.
xfs_btree_split_worker is using PF_MEMALLOC_NOFS directly so it should
be OK and xfs_reclaim_worker resp. xfs_eofblocks_worker use struct
xfs_mount as a context which doesn't contain gfp_mask. The stack trace
also doesn't indicate any kworker involvement. Do you have an idea what
else might get wrong or what am I missing?

---
[   53.990821] =================================
[   53.991672] [ INFO: inconsistent lock state ]
[   53.992419] 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902 Not tainted
[   53.993458] ---------------------------------
[   53.993480] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[   53.993480] kswapd0/467 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   53.993480]  (&xfs_nondir_ilock_class){+++++-}, at: [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[   53.993480] {RECLAIM_FS-ON-W} state was registered at:
[   53.993480]   [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
[   53.993480]   [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
[   53.993480]   [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0
[   53.993480]   [<ffffffffa0073422>] kmem_zone_alloc+0x65/0xc1 [xfs]
[   54.003925]   [<ffffffffa007a701>] xfs_buf_item_init+0x40/0x147 [xfs]
[   54.003925]   [<ffffffffa0084451>] _xfs_trans_bjoin+0x23/0x53 [xfs]
[   54.003925]   [<ffffffffa0084ea5>] xfs_trans_read_buf_map+0x2e9/0x5b3 [xfs]
[   54.003925]   [<ffffffffa001427c>] xfs_read_agf+0x141/0x1d4 [xfs]
[   54.003925]   [<ffffffffa0014413>] xfs_alloc_read_agf+0x104/0x223 [xfs]
[   54.003925]   [<ffffffffa001482f>] xfs_alloc_pagf_init+0x1a/0x3a [xfs]
[   54.003925]   [<ffffffffa001e1f2>] xfs_bmap_longest_free_extent+0x4c/0x9c [xfs]
[   54.003925]   [<ffffffffa0026225>] xfs_bmap_btalloc_nullfb+0x7a/0xc6 [xfs]
[   54.003925]   [<ffffffffa00289fc>] xfs_bmap_btalloc+0x21a/0x59d [xfs]
[   54.003925]   [<ffffffffa0028d8d>] xfs_bmap_alloc+0xe/0x10 [xfs]
[   54.003925]   [<ffffffffa0029612>] xfs_bmapi_write+0x401/0x80f [xfs]
[   54.003925]   [<ffffffffa0064209>] xfs_iomap_write_allocate+0x1bf/0x2b3 [xfs]
[   54.003925]   [<ffffffffa004be66>] xfs_map_blocks+0x141/0x3c9 [xfs]
[   54.003925]   [<ffffffffa004d8f1>] xfs_vm_writepage+0x3f6/0x612 [xfs]
[   54.003925]   [<ffffffff8112e6b4>] __writepage+0x16/0x34
[   54.003925]   [<ffffffff8112ecdb>] write_cache_pages+0x35d/0x4b4
[   54.003925]   [<ffffffff8112ee82>] generic_writepages+0x50/0x6f
[   54.003925]   [<ffffffffa004bbc2>] xfs_vm_writepages+0x44/0x4c [xfs]
[   54.003925]   [<ffffffff81130b7c>] do_writepages+0x23/0x2c
[   54.003925]   [<ffffffff81124395>] __filemap_fdatawrite_range+0x84/0x8b
[   54.003925]   [<ffffffff8112445f>] filemap_write_and_wait_range+0x2d/0x5b
[   54.003925]   [<ffffffffa0059512>] xfs_file_fsync+0x113/0x29a [xfs]
[   54.003925]   [<ffffffff811bd269>] vfs_fsync_range+0x8c/0x9e
[   54.003925]   [<ffffffff811bd297>] vfs_fsync+0x1c/0x1e
[   54.003925]   [<ffffffff811bd2ca>] do_fsync+0x31/0x4a
[   54.003925]   [<ffffffff811bd50c>] SyS_fsync+0x10/0x14
[   54.003925]   [<ffffffff81618197>] entry_SYSCALL_64_fastpath+0x12/0x6b
[   54.003925] irq event stamp: 2998549
[   54.003925] hardirqs last  enabled at (2998549): [<ffffffff81617997>] _raw_spin_unlock_irq+0x2c/0x4a
[   54.003925] hardirqs last disabled at (2998548): [<ffffffff81617805>] _raw_spin_lock_irq+0x13/0x47
[   54.003925] softirqs last  enabled at (2994888): [<ffffffff8161afbf>] __do_softirq+0x38f/0x4d5
[   54.003925] softirqs last disabled at (2994867): [<ffffffff810542ea>] irq_exit+0x6f/0xd1
[   54.003925]
[   54.003925] other info that might help us debug this:
[   54.003925]  Possible unsafe locking scenario:
[   54.003925]
[   54.003925]        CPU0
[   54.003925]        ----
[   54.003925]   lock(&xfs_nondir_ilock_class);
[   54.003925]   <Interrupt>
[   54.003925]     lock(&xfs_nondir_ilock_class);
[   54.003925]
[   54.003925]  *** DEADLOCK ***
[   54.003925]
[   54.003925] 2 locks held by kswapd0/467:
[   54.003925]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81136350>] shrink_slab+0x7a/0x518
[   54.003925]  #1:  (&type->s_umount_key#25){.+.+..}, at: [<ffffffff81192b74>] trylock_super+0x1b/0x4b
[   54.003925]
[   54.003925] stack backtrace:
[   54.003925] CPU: 0 PID: 467 Comm: kswapd0 Not tainted 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902
[   54.003925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   54.003925]  0000000000000000 ffff8800065678b8 ffffffff81308ded ffffffff825caa70
[   54.003925]  ffff880007328000 ffff8800065678f0 ffffffff81121489 0000000000000009
[   54.003925]  ffff8800073288a0 ffff880007328000 ffffffff810936a7 0000000000000009
[   54.003925] Call Trace:
[   54.003925]  [<ffffffff81308ded>] dump_stack+0x67/0x90
[   54.003925]  [<ffffffff81121489>] print_usage_bug.part.24+0x259/0x268
[   54.003925]  [<ffffffff810936a7>] ? print_shortest_lock_dependencies+0x180/0x180
[   54.003925]  [<ffffffff8109439f>] mark_lock+0x381/0x567
[   54.003925]  [<ffffffff810953ff>] __lock_acquire+0x9f7/0x190c
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffff81093f2c>] ? check_irq_usage+0x99/0xaa
[   54.003925]  [<ffffffff81092c77>] ? add_lock_to_list.isra.8.constprop.25+0x82/0x8d
[   54.003925]  [<ffffffff810961b1>] ? __lock_acquire+0x17a9/0x190c
[   54.003925]  [<ffffffff81096ae2>] lock_acquire+0x139/0x1e1
[   54.003925]  [<ffffffff81096ae2>] ? lock_acquire+0x139/0x1e1
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0050df2>] ? xfs_free_eofblocks+0x84/0x1ce [xfs]
[   54.003925]  [<ffffffff81090e34>] down_read_nested+0x29/0x3e
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0050df2>] xfs_free_eofblocks+0x84/0x1ce [xfs]
[   54.003925]  [<ffffffff81094765>] ? trace_hardirqs_on_caller+0x16c/0x188
[   54.003925]  [<ffffffffa0069ef7>] xfs_inactive+0x55/0xc6 [xfs]
[   54.003925]  [<ffffffffa006ef1b>] xfs_fs_evict_inode+0x14b/0x1bd [xfs]
[   54.003925]  [<ffffffff811a7db9>] evict+0xb0/0x165
[   54.003925]  [<ffffffff811a7eaa>] dispose_list+0x3c/0x4a
[   54.003925]  [<ffffffff811a9241>] prune_icache_sb+0x4a/0x55
[   54.003925]  [<ffffffff81192cd3>] super_cache_scan+0x12f/0x179
[   54.003925]  [<ffffffff811365a1>] shrink_slab+0x2cb/0x518
[   54.003925]  [<ffffffff81139b97>] shrink_zone+0x175/0x263
[   54.003925]  [<ffffffff8113b175>] kswapd+0x7dc/0x948
[   54.003925]  [<ffffffff8113a999>] ? mem_cgroup_shrink_node_zone+0x305/0x305
[   54.003925]  [<ffffffff8106c911>] kthread+0xed/0xf5
[   54.003925]  [<ffffffff81617997>] ? _raw_spin_unlock_irq+0x2c/0x4a
[   54.003925]  [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
[   54.003925]  [<ffffffff816184ef>] ret_from_fork+0x3f/0x70
[   54.003925]  [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API

Michal Hocko
On Wed 27-04-16 22:09:27, Michal Hocko wrote:
[...]
> [   53.993480]   [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
> [   53.993480]   [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
> [   53.993480]   [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0

Scratch that. I got it. It is the lockdep annotation which I got wrong
with my patch. I thought this was done much later in the slow path.
My head is burnt so I will get back to it tomorrow. The patch 1.1 should
be OK to go for XFS though because it doesn't really introduce anything
new.

Sorry about the noise!
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API

Michal Hocko
In reply to this post by Michal Hocko
OK, so the lockdep splats I was seeing [1] were much easier to fix than
I originally thought. So the following should be folded into the
original patch. I will send the full patch later on.

[1] http://lkml.kernel.org/r/20160427200927.GC22544@...
---
>From 1968c0a8ace4090a9deca8f4c1a206ee546e595a Mon Sep 17 00:00:00 2001
From: Michal Hocko <[hidden email]>
Date: Wed, 27 Apr 2016 22:32:57 +0200
Subject: [PATCH] fold me "mm: introduce memalloc_nofs_{save,restore} API"

Lockdep infrastructure is hooked into early hot paths of the allocator
so __lockdep_trace_alloc has to check for PF_MEMALLOC_NOFS directly and
do not rely on current_gfp_context

Signed-off-by: Michal Hocko <[hidden email]>

---
 kernel/locking/lockdep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 716547fdb873..f60124d0871c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2750,7 +2750,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
  return;
 
  /* We're only interested __GFP_FS allocations for now */
- if (!(gfp_mask & __GFP_FS))
+ if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
  return;
 
  /*
--
2.8.0.rc3

--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context

Dave Chinner
In reply to this post by Michal Hocko
On Wed, Apr 27, 2016 at 10:03:11AM +0200, Michal Hocko wrote:

> On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <[hidden email]>
> > >
> > > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> > >
> > > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> > >
> > > Let's help this process and add a debugging tool to catch when an
> > > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > > context. The printed stacktrace should help to identify the caller
> > > and evaluate whether it can be changed to use a wider context or whether
> > > it is called from another potentially dangerous context which needs
> > > a scope protection as well.
> >
> > You're going to get a large number of these from XFS. There are call
> > paths in XFs that get called both inside and outside transaction
> > context, and many of them are marked with GFP_NOFS to prevent issues
> > that have cropped up in the past.
> >
> > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > lockdep gets very unhappy about the same functions being called with
> > different reclaim contexts. e.g.  directory block mapping might
> > occur from readdir (no transaction context) or within transactions
> > (create/unlink). hence paths like this are tagged with GFP_NOFS to
> > stop lockdep emitting false positive warnings....
>
> I would much rather see lockdep being fixed than abusing GFP_NOFS to
> workaround its limitations. GFP_NOFS has a real consequences to the
> memory reclaim. I will go and check the commit you mentioned and try
> to understand why that is a problem. From what you described above
> I would like to get rid of exactly this kind of usage which is not
> really needed for the recursion protection.

The problem is that every time we come across this, the answer is
"use lockdep annotations". Our lockdep annotations are freakin'
complex because of this, and more often than not lockdep false
positives occur due to bugs in the annotations. e.g. see
fs/xfs/xfs_inode.h for all the inode locking annotations we have to
use and the hoops we have to jump through because we are limited to
8 subclasses and we have to be able to annotate nested inode locks
5 deep in places (RENAME_WHITEOUT, thanks).

At one point, we had to reset lockdep classes for inodes in reclaim
so that they didn't throw lockdep false positives the moment an
inode was locked in a memory reclaim context. We had to change
locking to remove that problem (commit 4f59af7 ("xfs: remove iolock
lock classes"). Then there were all the problems with reclaim
triggering lockdep warnings on directory inodes - we had to add a
separate directory inode class for them, and even then we still need
GFP_NOFS in places to minimise reclaim noise (as per the above
commit).

Put simply: we've had to resort to designing locking and allocation
strategies around the limitations of lockdep annotations, as opposed
to what is actually possible or even optimal. i.e. when the choice
is a 2 minute fix to add GFP_NOFS in cases like this, versus another
week long effort to rewrite the inode annotations (again) like this
one last year:

commit 0952c8183c1575a78dc416b5e168987ff98728bb
Author: Dave Chinner <[hidden email]>
Date:   Wed Aug 19 10:32:49 2015 +1000

    xfs: clean up inode lockdep annotations
   
    Lockdep annotations are a maintenance nightmare. Locking has to be
    modified to suit the limitations of the annotations, and we're
    always having to fix the annotations because they are unable to
    express the complexity of locking heirarchies correctly.
.....

It's a no-brainer to see why GFP_NOFS will be added to the
allocation in question. I've been saying for years that I consider
lockdep harmful - if you want to get rid of GFP_NOFS, then you're
going to need to sort out the lockdep reclaim annotation mess at the
same time...

Cheers,

Dave.
--
Dave Chinner
[hidden email]

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 0/2] scop GFP_NOFS api

NeilBrown
In reply to this post by Michal Hocko
On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably save actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into a more complex changes.>
This strikes me as over-engineering to work around an unnecessarily
burdensome interface.... but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This is should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev

signature.asc (834 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

Steven Whitehouse
Hi,

On 29/04/16 06:35, NeilBrown wrote:

> On Tue, Apr 26 2016, Michal Hocko wrote:
>
>> Hi,
>> we have discussed this topic at LSF/MM this year. There was a general
>> interest in the scope GFP_NOFS allocation context among some FS
>> developers. For those who are not aware of the discussion or the issue
>> I am trying to sort out (or at least start in that direction) please
>> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> which basically copies what we have for the scope GFP_NOIO allocation
>> context. I haven't converted any of the FS myself because that is way
>> beyond my area of expertise but I would be happy to help with further
>> changes on the MM front as well as in some more generic code paths.
>>
>> Dave had an idea on how to further improve the reclaim context to be
>> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> and FS specific cookie set in the FS allocation context and consumed
>> by the FS reclaim context to allow doing some provably save actions
>> that would be skipped due to GFP_NOFS normally.  I like this idea and
>> I believe we can go that direction regardless of the approach taken here.
>> Many filesystems simply need to cleanup their NOFS usage first before
>> diving into a more complex changes.>
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
>
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
evict() is an issue, but moving it into kswapd would be a potential
problem for GFS2. We already have a memory allocation issue when
recovery is taking place and memory is short. The code path is as follows:

  1. Inode is scheduled for eviction (which requires deallocation)
  2. The glock is required in order to perform the deallocation, which
implies getting a DLM lock
  3. Another node in the cluster fails, so needs recovery
  4. When the DLM lock is requested, it gets blocked until recovery is
complete (for the failed node)
  5. Recovery is performed using a userland fencing utility
  6. Fencing requires memory and then blocks on the eviction
  7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on
DLM lock, DLM lock waiting on fencing)

It doesn't happen often, but we've been looking at the best place to
break that cycle, and one of the things we've been wondering is whether
we could avoid deallocation evictions from memory related contexts, or
at least make it async somehow.

The issue is that it is not possible to know in advance whether an
eviction will result in mearly writing things back to disk (because the
inode is being dropped from cache, but still resides on disk) which is
easy to do, or whether it requires a full deallocation (n_link==0) which
may require significant resources and time,

Steve.


------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 0/2] scop GFP_NOFS api

Michal Hocko
In reply to this post by NeilBrown
On Fri 29-04-16 15:35:42, NeilBrown wrote:

> On Tue, Apr 26 2016, Michal Hocko wrote:
>
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into a more complex changes.>
>
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?

Let me quote Dave Chinner from one of the emails discussed at LSFMM
mailing list:
: IMO, making GFP_NOFS "better" cannot be done with context-less flags
: being passed through reclaim. If we want to prevent the recursive
: self-deadlock case in an optimal manner, then we need to be able to
: pass state down to reclaim so that page writeback and the shrinkers
: can determine if they are likely to deadlock.
:
: IOWs, I think we should stop thinking of GFP_NOFS as a *global*
: directive to avoid recursion under any circumstance and instead
: start thinking about it as a mechanism to avoid recursion in
: specific reclaim contexts.
:
: Something as simple as adding an opaque cookie (e.g. can hold a
: superblock or inode pointer) to check against in writeback and
: subsystem shrinkers would result in the vast majority of GFP_NOFS
: contexts being able to reclaim from everything but the one context
: that we *might* deadlock against.
:
: e.g, if we then also check the PF_FSTRANS flag in XFS, we'll
: still be able to reclaim clean inodes, buffers and write back
: dirty pages that don't require transactions to complete under "don't
: recurse" situations because we know it's transactions that we could
: deadlock on in the direct reclaim context.
:
: Note that this information could be added to the writeback_control
: for page writeback, and it could be passed directly to shrinkers
: in the shrink_control structures. The allocation paths might be a
: little harder, but I suspect using the task struct for passing this
: information into direct reclaim might be the easiest approach...

> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
>
> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>    in ->releasepage().  They used to block waiting for some IO, but that
>    caused deadlocks and wasn't really needed.  I left the timeout because
>    it seemed likely that some throttling would help.  I suspect that a
>    careful analysis will show that there is sufficient throttling
>    elsewhere.
>
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>    for IO so it can free some quotainfo things.  If it could be changed
>    to just schedule the IO without waiting for it then I think this
>    would be safe to be called in any FS allocation context.  It already
>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>    if the lock is held.
>
> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.

One think I have learned is that shrinkers can be really complex and
getting rid of GFP_NOFS will be really hard so I would really like to
start the easiest way possible and remove the direct usage and replace
it by scope one which would at least _explain_ why it is needed. I think
this is a reasonable _first_ step and a large step ahead because we have
a good chance to get rid of a large number of those which were used
"just because I wasn't sure and this should be safe, right?". I wouldn't
be surprised if we end up with a very small number of both scope and
direct usage in the end.

I would also like to revisit generic inode/dentry shrinker and see
whether it could be more __GFP_FS friendly. As you say many FS might
even not depend on some FS internal locks so pushing GFP_FS check down
the layers might make a lot of sense and allow to clean some [id]cache
even for __GFP_FS context.

> If we get rid of enough of them the remainder could just use __GFP_IO.
>
> > The patch 2 is a debugging aid which warns about explicit allocation
> > requests from the scope context. This is should help to reduce the
> > direct usage of the NOFS flags to bare minimum in favor of the scope
> > API. It is not aimed to be merged upstream. I would hope Andrew took it
> > into mmotm tree to give it linux-next exposure and allow developers to
> > do further cleanups.  There is a new kernel command line parameter which
> > has to be used for the debugging to be enabled.
> >
> > I think the GFP_NOIO should be seeing the same clean up.
>
> I think you are suggesting that use of GFP_NOIO should (largely) be
> deprecated in favour of memalloc_noio_save().  I think I agree.

Yes that was the idea.

> Could we go a step further and deprecate GFP_ATOMIC in favour of some
> in_atomic() test?  Maybe that is going too far.

I am not really sure we need that and some GFP_NOWAIT usage is deliberate
to perform an optimistic allocation with another fallback (e.g. higher order
for performance reasons with single page fallback). So I think that nowait
is a slightly different thing.

Thanks!
--
Michal Hocko
SUSE Labs

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 0/2] scop GFP_NOFS api

Dave Chinner
In reply to this post by NeilBrown
On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:

> On Tue, Apr 26 2016, Michal Hocko wrote:
>
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into a more complex changes.>
>
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
>
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.

No, we didn't move dirty page writeout to kswapd - we moved it back
to the background writeback threads where it can be done
efficiently.  kswapd should almost never do single page writeback
because of how inefficient it is from an IO perspective, even though
it can. i.e. if we are doing any significant amount of dirty page
writeback from memory reclaim (direct, kswapd or otherwise) then
we've screwed something up.

> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.

When lots of GFP_NOFS allocation is being done, this already
happens. The shrinkers that can't run due to context accumulate the
work on the shrinker structure, and when the shrinker can next run
(e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
reclaim contexts.

IOWs, we already move shrinker work from direct reclaim to kswapd
when appropriate.

> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>    in ->releasepage().  They used to block waiting for some IO, but that
>    caused deadlocks and wasn't really needed.  I left the timeout because
>    it seemed likely that some throttling would help.  I suspect that a
>    careful analysis will show that there is sufficient throttling
>    elsewhere.
>
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>    for IO so it can free some quotainfo things.

No it's not. evict() can block on IO - waiting for data or inode
writeback to complete, or even for filesystems to run transactions
on the inode. Hence the superblock shrinker can and does block in
inode cache reclaim.

Indeed, blocking the superblock shrinker in reclaim is a key part of
balancing inode cache pressure in XFS. If the shrinker starts
hitting dirty inodes, it blocks on cleaning them, thereby slowing
the rate of allocation to that which inodes can be cleaned and
reclaimed. There are also background threads that walk ahead freeing
clean inodes, but we have to throttle direct reclaim in this manner
otherwise the allocation pressure vastly outweighs the ability to
reclaim inodes. if we don't balance this, inode allocation triggers
the OOM killer because reclaim keeps reporting "no progress being
made" because dirty inodes are skipped. BY blocking on such inodes,
the shrinker makes progress (slowly) and reclaim sees that memory is
being freed and so continues without invoking the OOM killer...

>    If it could be changed
>    to just schedule the IO without waiting for it then I think this
>    would be safe to be called in any FS allocation context.  It already
>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>    if the lock is held.

We could, but then we have the same problem as the inode cache -
there's no indication of progress going back to the memory reclaim
subsystem, nor is reclaim able to throttle memory allocation back to
the rate at which reclaim is making progress.

There's feedback loops all throughout the XFS reclaim code - it's
designed specifically that way - I made changes to the shrinker
infrastructure years ago to enable this. It's no different to the
dirty page throttling that was done at roughly the same time -
that's also one big feedback loop controlled by the rate at which
pages can be cleaned. Indeed, it was designed was based on the same
premise as all the XFS shrinker code: in steady state conditions
we can't allocate a resource faster than we can reclaim it, so we
need to make reclaim as efficient at possible...

> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.  If we get rid of enough of
> them the remainder could just use __GFP_IO.

The problem with this is that a single kswapd thread can't keep up
with all of the allocation pressure that occurs. e.g. a 20-core
intel CPU with local memory will be seen as a single node and so
will have a single kswapd thread to do reclaim. There's a massive
imbalance between maximum reclaim rate and maximum allocation rate
in situations like this. If we want memory reclaim to run faster,
we to be able to do more work *now*, not defer it to a context with
limited execution resources.

i.e. IMO deferring more work to a single reclaim thread per node is
going to limit memory reclaim scalability and performance, not
improve it.

Cheers,

Dave.
--
Dave Chinner
[hidden email]

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Linux-NTFS-Dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/linux-ntfs-dev
12
Loading...