Return-Path: <owner-linux-mm@kvack.org>
From: Wei Yang <richard.weiyang@gmail.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org,
	Wei Yang <richard.weiyang@gmail.com>,
	Mike Rapoport <rppt@kernel.org>,
	David Hildenbrand <david@redhat.com>
Subject: [PATCH] mm: use zonelist_zone() to get zone
Date: Sat,  6 Jul 2024 01:50:44 +0000
Message-Id: <20240706015044.27789-1-richard.weiyang@gmail.com>
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202795
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Instead of accessing zoneref->zone directly, use zonelist_zone() like
other places for consistency.

No functional change.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport (IBM) <rppt@kernel.org>
CC: David Hildenbrand <david@redhat.com>
---
 include/linux/mmzone.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cb7f265c2b96..51bce636373f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1690,7 +1690,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 			zone = zonelist_zone(z))
 
 #define for_next_zone_zonelist_nodemask(zone, z, highidx, nodemask) \
-	for (zone = z->zone;	\
+	for (zone = zonelist_zone(z);	\
 		zone;							\
 		z = next_zones_zonelist(++z, highidx, nodemask),	\
 			zone = zonelist_zone(z))
@@ -1726,7 +1726,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
 	nid = first_node(*nodes);
 	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
 	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
-	return (!z->zone) ? true : false;
+	return (!zonelist_zone(z)) ? true : false;
 }
 
 
-- 
2.34.1


.

From: Takero Funaki <flintglass@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	Nhat Pham <nphamcs@gmail.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Jonathan Corbet <corbet@lwn.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Takero Funaki <flintglass@gmail.com>,
	linux-mm@kvack.org,
	linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
Date: Sat,  6 Jul 2024 02:25:16 +0000
Message-ID: <20240706022523.1104080-1-flintglass@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1269596 org.kvack.linux-mm:202796
Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.linux-doc,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail


This series addresses some issues and introduces a minor improvement in
the zswap global shrinker.

This version addresses issues discovered during the review of v1:
https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
and includes three additional patches to fix another issue uncovered by
applying v1.

Changes in v2:
- Added 3 patches to reduce resource contention.
mm: zswap: fix global shrinker memcg iteration:
- Change the loop style (Yosry, Nhat, Shakeel)
- To not skip online memcg, save zswap_last_shrink to detect cursor change by cleaner.  (Yosry, Nhat, Shakeel)
mm: zswap: fix global shrinker error handling logic:
- Change error code for no-writeback memcg. (Yosry)
- Use nr_scanned to check if lru is empty. (Yosry)

Changes in v1:
mm: zswap: fix global shrinker memcg iteration:
- Drop and reacquire spinlock before skipping a memcg.
- Add some comment to clarify the locking mechanism.
mm: zswap: proactive shrinking before pool size limit is hit:
- Remove unneeded check before scheduling work.
- Change shrink start threshold to accept_thr_percent + 1%.


Patches
===============================

1. Fix the memcg iteration logic that abort iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking at accept threshold + 1%.
4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag to not block
   pageout.
5. Store incompressible pages as-is to accept all pages.
6. Interrupt the shrinker to avoid blocking page-in/out.

Patches 1 to 3 should be applied in this order to avoid potential loops
caused by the first issue. Patch 3 and later can be applied
independently, but the first two issues must be resolved to ensure the
shrinker can evict pages. Patches 4 to 6 address resource contention
issues in the current shrinker uncovered by applying patch 3.

With this series applied, the shrinker will continue to evict pages
until the accept_threshold_percent proactively, as documented in patch
3. As a side effect of changes in the hysteresis logic, zswap will no
longer reject pages under the max pool limit.

The series is rebased on mainline 6.10-rc6.

Proactive shrinking (Patch 1 to 3)
===============================

Visible issue to resolve
-------------------------------
The visible issue with the current global shrinker is that pageout/in
operations from active processes are slow when zswap is near its max
pool size. This is particularly significant on small memory systems
where total swap usage exceeds what zswap can store. This results in old
pages occupying most of the zswap pool space, with recent pages using
the swap disk directly.

Root cause of the issue
-------------------------------
This issue is caused by zswap maintaining the pool size near 100%. Since
the shrinker fails to shrink the pool to accept_threshold_percent, zswap
rejects incoming pages more frequently than it should. The rejected
pages are directly written to disk while zswap protects old pages from
eviction, leading to slow pageout/in performance for recent pages.

Changes to resolve the issue
-------------------------------
If the pool size were shrunk proactively, rejection by pool limit hits
would be less likely. New incoming pages could be accepted as the pool
gains some space in advance, while older pages are written back in the
background. zswap would then be filled with recent pages, as expected
in the LRU logic.

Patches 1 and 2 ensure the shrinker reduces the pool to the
accept_threshold_percent. Without patch 2, shrink_worker() stops
evicting pages on the 16th encounter with a memcg that has no stored
pages (e.g., tiny services managed under systemd).
Patch 3 makes zswap_store() trigger the shrinker before reaching the max
pool size.

With this series, zswap will prepare some space to reduce the
probability of problematic pool_limit_hit situations, thus reducing slow
reclaim and the page priority inversion against LRU.

Visible changes
-------------------------------
Once proactive shrinking reduces the pool size, pageouts complete
instantly as long as the space prepared by proactive shrinking can store
the reclaimed pages. If an admin sees a large pool_limit_hit, lowering
the accept_threshold_percent will improve active process performance.
The optimal point depends on the tradeoff between active process
pageout/in performance and background services' pagein performance.

Benchmark
-------------------------------
The benchmark below observes active process performance under memory
pressure, coexisting with background services occupying the zswap pool.

The tests were run on a vm with 2 vCPU, 1GB RAM + 4GB swap.  20% max
pool and 50% accept threshould (about 100MB proactive shrink),
zsmalloc/lz4.  This test prefer active process than background process
to demonstrate my intended workload.

It runs `apt -qq update` 10 times under a 60MB memcg memory.max
constraint to emulate an active process running under pressure. The
zswap pool is filled prior to the test by allocating a large memory
(~1.5GB) to emulate background inactive services.

The counters below are from vmstat and zswap debugfs stats. The times
are average seconds using /usr/bin/time -v.

pre-patch, 6.10-rc4
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 845         | 845         | 0          |
| pool_total_size    | 201539584   | 201539584   | 0          |
| stored_pages       | 63138       | 63138       | 0          |
| written_back_pages | 12          | 12          | 0          |
| pswpin             | 387         | 32412       | 32025      |
| pswpout            | 153078      | 197138      | 44060      |
| zswpin             | 0           | 0           | 0          |
| zswpout            | 63150       | 63150       | 0          |
| zswpwb             | 12          | 12          | 0          |

| Time              |              |
|-------------------|--------------|
| elapsed           | 8.473        |
| user              | 1.881        |
| system            | 0.814        |

post-patch, 6.10-rc4 with patch 1 to 5
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 81861       | 81861       | 0          |
| pool_total_size    | 75001856    | 87117824    | 12115968   |
| reject_reclaim_fail| 0           | 32          | 32         |
| same_filled_pages  | 135         | 135         | 0          |
| stored_pages       | 23919       | 27549       | 3630       |
| written_back_pages | 97665       | 106994      | 10329      |
| pswpin             | 4981        | 8223        | 3242       |
| pswpout            | 179554      | 188883      | 9329       |
| zswpin             | 293863      | 590014      | 296151     |
| zswpout            | 455338      | 764882      | 309544     |
| zswpwb             | 97665       | 106994      | 10329      |

| Time              |              |
|-------------------|--------------|
| elapsed           | 4.525        |
| user              | 1.839        |
| system            | 1.467        |

Although the pool_limit_hit is not increased in both cases,
zswap_store() rejected pages before this patch. Note that, before this
series, zswap_store() did not increment pool_limit_hit on rejection by
limit hit hysteresis (only the first few hits were counted).

From the pre-patch result, the existing zswap global shrinker cannot
write back effectively and locks the old pages in the pool. The
pswpin/out indicates the active process uses the swap device directly.

From the post-patch result, zswpin/out/wb are increased as expected,
indicating the active process uses zswap and the old pages of the
background services are evicted from the pool. pswpin/out are
significantly reduced from pre-patch results.


System responsiveness issue (patch 4 to 6)
===============================
After applying patches 1 to 3, I encountered severe responsiveness
degradation while zswap global shrinker is running under heavy memory
pressure.

Visible issue to resolve
-------------------------------
The visible issue happens with patches 1 to 3 applied when a large
amount of memory allocation happens and zswap cannot store the incoming
pages.
While global shrinker is writing back pages, system stops responding as
if under heavy memory thrashing.

This issue is less likely to happen without patches 1 to 3 or zswap is
disabled. I believe this is because the global shrinker could not write
back a meaningful amount of pages, as described in patch 2.

Root cause and changes to resolve the issue
-------------------------------
It seems that zswap shrinker blocking IO for memory reclaim and faults
is the root cause of this responsiveness issue. I introduced three
patches to reduce possible blocking in the following problematic
situations:

1. Contention on workqueue thread pools by shrink_worker() using
WQ_MEM_RECLAIM unnecessarily.

Although the shrinker runs simultaneously with memory reclaim, shrinking
is not required to reclaim memory since zswap_store() can reject pages
without interfering with memory reclaim progress. shrink_worker() should
not use WQ_MEM_RECLAIM and should be delayed when another work in
WQ_MEM_RECLAIM is reclaiming memory. The existing code requires
allocating memory inside shrink_worker(), potentially blocking other
latency-sensitive reclaim work.

2. Contention on swap IO.

Since zswap_writeback_entry() performs write IO in 4KB pages, it
consumes a lot of IOPS, increasing the IO latency of swapout/in. We
should not perform IO for background shrinking while zswap_store() is
rejecting pages or zswap_load() is failing to find stored pages. This
series implements two mitigation logics to reduce the IO contention:

2-a. Do not reject pages in zswap_store().
This is mostly achieved by patch 3. With patch 3, zswap can prepare
space proactively and accept pages while the global shrinker is running.

To avoid rejection further, patch 5 (store incompressible pages) is
added. This reduces rejection by storing incompressible pages. When
zsmalloc is used, we can accept incompressible pages with small memory
overhead. It is a minor optimization, but I think it is worth
implementing. This does not improve performance on current zbud but does
not incur a performance penalty.

2-b. Interrupt writeback while pagein/out.
Once zswap runs out of prepared space, we cannot accept incoming pages,
incurring direct writes to the swap disk. At this moment, the shrinker
is proactively evicting pages, leading to IO contention with memory
reclaim.

Performing low-priority IO is straightforward but requires
reimplementing a low-priority version of __swap_writepage(). Instead, in
patch 6, I implemented a heuristic, delaying the next zswap writeback
based on the elapsed time since zswap_store() rejected a page.

When zswap_store() hits the max pool size and rejects pages,
swap_writepage() immediately performs the writeback to disk. The time
jiffies is saved to tell shrink_worker() to sleep up to
ZSWAP_GLOBAL_SHRINK_DELAY msec.

The same logic applied to zswap_load(). When zswap cannot find a page in
the stored pool, pagein requires read IO from the swap device. The
global shrinker should be interrupted here.

This patch proposes a constant delay of 500 milliseconds, aligning with
the mq-deadline target latency.

Visible change
-------------------------------
With patches 4 to 6, the global shrinker pauses the writeback while
pagein/out operations are using the swap device. This change reduces
resource contention and makes memory reclaim/faults complete faster,
thereby reducing system responsiveness degradation. 

Intended scenario for memory reclaim:
1. zswap pool < accept_threshold as the initial state. This is achieved
   by patch 3, proactive shrinking.
2. Active processes start allocating pages. Pageout is buffered by zswap
   without IO.
3. zswap reaches shrink_start_threshold. zswap continues to buffer
   incoming pages and starts writeback immediately in the background.
4. zswap reaches max pool size. zswap interrupts the global shrinker and
   starts rejecting pages. Write IO for the rejected page will consume
   all IO resources.
5. Active processes stop allocating pages. After the delay, the shrinker
   resumes writeback until the accept threshold.

Benchmark
-------------------------------
To demonstrate that the shrinker writeback is not interfering with
pagein/out operations, I measured the elapsed time of allocating 2GB of
3/4 compressible data by a Python script, averaged over 10 runs:

|                      | elapsed| user  | sys   |
|----------------------|--------|-------|-------|
| With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
| With all patches     | 11.17  | 0.116 | 1.490 |
| zswap off (baseline) | 11.81  | 0.149 | 1.381 |

Although this test cannot distinguish responsiveness issues caused by
zswap writeback from normal memory thrashing between plain pagein/out,
the difference from the baseline indicates that the patches reduced
performance degradation on pageout caused by zswap writeback.

The tests were run on kernel 6.10-rc5 on a VM with 1GB RAM (idling Azure
VM with persistent block swap device), 2 vCPUs, zsmalloc/lz4, 25% max
pool, and 50% accept threshold.

---


Takero Funaki (6):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit
  mm: zswap: make writeback run in the background
  mm: zswap: store incompressible page as-is
  mm: zswap: interrupt shrinker writeback while pagein/out IO

 Documentation/admin-guide/mm/zswap.rst |  17 +-
 mm/zswap.c                             | 264 ++++++++++++++++++++-----
 2 files changed, 219 insertions(+), 62 deletions(-)

-- 
2.43.0

.

Return-Path: <owner-linux-mm@kvack.org>
Date: Sat, 06 Jul 2024 12:34:28 +0800
From: kernel test robot <lkp@intel.com>
To: Mike Rapoport <rppt@kernel.org>
Cc: linux-mm@kvack.org
Subject: [rppt-memblock:for-next] BUILD SUCCESS
 9364a7e40d54e6858479f0a96e1a04aa1204be16
Message-ID: <202407061226.CBlJPbm7-lkp@intel.com>
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202803
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

tree/branch: git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git for-next
branch HEAD: 9364a7e40d54e6858479f0a96e1a04aa1204be16  memblock tests: fix implicit declaration of function 'numa_valid_node'

elapsed time: 1190m

configs tested: 220
configs skipped: 5

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig   gcc-13.2.0
alpha                            allyesconfig   gcc-13.2.0
alpha                               defconfig   gcc-13.2.0
arc                              allmodconfig   gcc-13.2.0
arc                               allnoconfig   gcc-13.2.0
arc                              allyesconfig   gcc-13.2.0
arc                                 defconfig   gcc-13.2.0
arc                   randconfig-001-20240706   gcc-13.2.0
arc                   randconfig-002-20240706   gcc-13.2.0
arm                              allmodconfig   gcc-13.2.0
arm                               allnoconfig   clang-19
arm                               allnoconfig   gcc-13.2.0
arm                              allyesconfig   gcc-13.2.0
arm                                 defconfig   gcc-13.2.0
arm                   randconfig-001-20240706   gcc-13.2.0
arm                   randconfig-002-20240706   gcc-13.2.0
arm                   randconfig-003-20240706   gcc-13.2.0
arm                   randconfig-004-20240706   gcc-13.2.0
arm64                            allmodconfig   clang-19
arm64                            allmodconfig   gcc-13.2.0
arm64                             allnoconfig   gcc-13.2.0
arm64                               defconfig   gcc-13.2.0
arm64                 randconfig-001-20240706   gcc-13.2.0
arm64                 randconfig-002-20240706   gcc-13.2.0
arm64                 randconfig-003-20240706   clang-16
arm64                 randconfig-003-20240706   gcc-13.2.0
arm64                 randconfig-004-20240706   clang-19
arm64                 randconfig-004-20240706   gcc-13.2.0
csky                              allnoconfig   gcc-13.2.0
csky                                defconfig   gcc-13.2.0
csky                  randconfig-001-20240706   gcc-13.2.0
csky                  randconfig-002-20240706   gcc-13.2.0
hexagon                          allmodconfig   clang-19
hexagon                           allnoconfig   clang-19
hexagon                          allyesconfig   clang-19
hexagon               randconfig-001-20240706   clang-19
hexagon               randconfig-002-20240706   clang-15
i386                             allmodconfig   clang-18
i386                             allmodconfig   gcc-13
i386                              allnoconfig   clang-18
i386                              allnoconfig   gcc-13
i386                             allyesconfig   clang-18
i386                             allyesconfig   gcc-13
i386         buildonly-randconfig-001-20240705   gcc-13
i386         buildonly-randconfig-001-20240706   clang-18
i386         buildonly-randconfig-002-20240705   gcc-9
i386         buildonly-randconfig-002-20240706   clang-18
i386         buildonly-randconfig-003-20240705   gcc-11
i386         buildonly-randconfig-003-20240706   clang-18
i386         buildonly-randconfig-004-20240705   clang-18
i386         buildonly-randconfig-004-20240706   clang-18
i386         buildonly-randconfig-005-20240705   clang-18
i386         buildonly-randconfig-005-20240706   clang-18
i386         buildonly-randconfig-006-20240705   clang-18
i386         buildonly-randconfig-006-20240706   clang-18
i386                                defconfig   clang-18
i386                  randconfig-001-20240705   gcc-13
i386                  randconfig-001-20240706   clang-18
i386                  randconfig-002-20240705   clang-18
i386                  randconfig-002-20240706   clang-18
i386                  randconfig-003-20240705   gcc-11
i386                  randconfig-003-20240706   clang-18
i386                  randconfig-004-20240705   gcc-13
i386                  randconfig-004-20240706   clang-18
i386                  randconfig-005-20240705   clang-18
i386                  randconfig-005-20240706   clang-18
i386                  randconfig-006-20240705   clang-18
i386                  randconfig-006-20240706   clang-18
i386                  randconfig-011-20240705   gcc-13
i386                  randconfig-011-20240706   clang-18
i386                  randconfig-012-20240705   gcc-13
i386                  randconfig-012-20240706   clang-18
i386                  randconfig-013-20240705   clang-18
i386                  randconfig-013-20240706   clang-18
i386                  randconfig-014-20240705   gcc-8
i386                  randconfig-014-20240706   clang-18
i386                  randconfig-015-20240705   gcc-10
i386                  randconfig-015-20240706   clang-18
i386                  randconfig-016-20240705   clang-18
i386                  randconfig-016-20240706   clang-18
loongarch                        allmodconfig   gcc-13.2.0
loongarch                         allnoconfig   gcc-13.2.0
loongarch                           defconfig   gcc-13.2.0
loongarch             randconfig-001-20240706   gcc-13.2.0
loongarch             randconfig-002-20240706   gcc-13.2.0
m68k                             allmodconfig   gcc-13.2.0
m68k                              allnoconfig   gcc-13.2.0
m68k                             allyesconfig   gcc-13.2.0
m68k                          amiga_defconfig   gcc-13.2.0
m68k                         apollo_defconfig   gcc-13.2.0
m68k                                defconfig   gcc-13.2.0
m68k                       m5475evb_defconfig   gcc-13.2.0
microblaze                       allmodconfig   gcc-13.2.0
microblaze                        allnoconfig   gcc-13.2.0
microblaze                       allyesconfig   gcc-13.2.0
microblaze                          defconfig   gcc-13.2.0
mips                              allnoconfig   gcc-13.2.0
mips                           ci20_defconfig   gcc-13.2.0
mips                     loongson1b_defconfig   gcc-13.2.0
mips                          malta_defconfig   gcc-13.2.0
mips                       rbtx49xx_defconfig   gcc-13.2.0
nios2                             allnoconfig   gcc-13.2.0
nios2                               defconfig   gcc-13.2.0
nios2                 randconfig-001-20240706   gcc-13.2.0
nios2                 randconfig-002-20240706   gcc-13.2.0
openrisc                          allnoconfig   gcc-13.2.0
openrisc                         allyesconfig   gcc-13.2.0
openrisc                            defconfig   gcc-13.2.0
parisc                           allmodconfig   gcc-13.2.0
parisc                            allnoconfig   gcc-13.2.0
parisc                           allyesconfig   gcc-13.2.0
parisc                              defconfig   gcc-13.2.0
parisc                randconfig-001-20240706   gcc-13.2.0
parisc                randconfig-002-20240706   gcc-13.2.0
parisc64                            defconfig   gcc-13.2.0
powerpc                     akebono_defconfig   gcc-13.2.0
powerpc                          allmodconfig   gcc-13.2.0
powerpc                           allnoconfig   gcc-13.2.0
powerpc                          allyesconfig   clang-19
powerpc                          allyesconfig   gcc-13.2.0
powerpc                  mpc866_ads_defconfig   gcc-13.2.0
powerpc                  mpc885_ads_defconfig   gcc-13.2.0
powerpc               randconfig-001-20240706   gcc-13.2.0
powerpc               randconfig-002-20240706   gcc-13.2.0
powerpc               randconfig-003-20240706   clang-19
powerpc                     sequoia_defconfig   gcc-13.2.0
powerpc64             randconfig-001-20240706   gcc-13.2.0
powerpc64             randconfig-002-20240706   gcc-13.2.0
powerpc64             randconfig-003-20240706   clang-19
powerpc64             randconfig-003-20240706   gcc-13.2.0
riscv                            allmodconfig   clang-19
riscv                            allmodconfig   gcc-13.2.0
riscv                             allnoconfig   gcc-13.2.0
riscv                            allyesconfig   clang-19
riscv                            allyesconfig   gcc-13.2.0
riscv                               defconfig   gcc-13.2.0
riscv                 randconfig-001-20240706   gcc-13.2.0
riscv                 randconfig-002-20240706   gcc-13.2.0
s390                             alldefconfig   gcc-13.2.0
s390                             allmodconfig   clang-19
s390                              allnoconfig   clang-19
s390                              allnoconfig   gcc-13.2.0
s390                             allyesconfig   clang-19
s390                             allyesconfig   gcc-13.2.0
s390                                defconfig   gcc-13.2.0
s390                  randconfig-001-20240706   gcc-13.2.0
s390                  randconfig-002-20240706   clang-19
s390                  randconfig-002-20240706   gcc-13.2.0
sh                               allmodconfig   gcc-13.2.0
sh                                allnoconfig   gcc-13.2.0
sh                               allyesconfig   gcc-13.2.0
sh                                  defconfig   gcc-13.2.0
sh                ecovec24-romimage_defconfig   gcc-13.2.0
sh                          r7785rp_defconfig   gcc-13.2.0
sh                    randconfig-001-20240706   gcc-13.2.0
sh                    randconfig-002-20240706   gcc-13.2.0
sh                           se7722_defconfig   gcc-13.2.0
sparc                            allmodconfig   gcc-13.2.0
sparc64                             defconfig   gcc-13.2.0
sparc64               randconfig-001-20240706   gcc-13.2.0
sparc64               randconfig-002-20240706   gcc-13.2.0
um                               allmodconfig   clang-19
um                               allmodconfig   gcc-13.2.0
um                                allnoconfig   clang-17
um                                allnoconfig   gcc-13.2.0
um                               allyesconfig   gcc-13
um                               allyesconfig   gcc-13.2.0
um                                  defconfig   gcc-13.2.0
um                             i386_defconfig   gcc-13.2.0
um                    randconfig-001-20240706   clang-19
um                    randconfig-001-20240706   gcc-13.2.0
um                    randconfig-002-20240706   gcc-13
um                    randconfig-002-20240706   gcc-13.2.0
um                           x86_64_defconfig   gcc-13.2.0
x86_64                            allnoconfig   clang-18
x86_64                           allyesconfig   clang-18
x86_64       buildonly-randconfig-001-20240706   clang-18
x86_64       buildonly-randconfig-002-20240706   clang-18
x86_64       buildonly-randconfig-003-20240706   clang-18
x86_64       buildonly-randconfig-004-20240706   clang-18
x86_64       buildonly-randconfig-005-20240706   clang-18
x86_64       buildonly-randconfig-005-20240706   gcc-13
x86_64       buildonly-randconfig-006-20240706   clang-18
x86_64                              defconfig   clang-18
x86_64                              defconfig   gcc-13
x86_64                randconfig-001-20240706   clang-18
x86_64                randconfig-001-20240706   gcc-9
x86_64                randconfig-002-20240706   clang-18
x86_64                randconfig-003-20240706   clang-18
x86_64                randconfig-004-20240706   clang-18
x86_64                randconfig-005-20240706   clang-18
x86_64                randconfig-006-20240706   clang-18
x86_64                randconfig-011-20240706   clang-18
x86_64                randconfig-011-20240706   gcc-12
x86_64                randconfig-012-20240706   clang-18
x86_64                randconfig-012-20240706   gcc-12
x86_64                randconfig-013-20240706   clang-18
x86_64                randconfig-014-20240706   clang-18
x86_64                randconfig-014-20240706   gcc-13
x86_64                randconfig-015-20240706   clang-18
x86_64                randconfig-015-20240706   gcc-13
x86_64                randconfig-016-20240706   clang-18
x86_64                randconfig-016-20240706   gcc-13
x86_64                randconfig-071-20240706   clang-18
x86_64                randconfig-071-20240706   gcc-12
x86_64                randconfig-072-20240706   clang-18
x86_64                randconfig-072-20240706   gcc-13
x86_64                randconfig-073-20240706   clang-18
x86_64                randconfig-073-20240706   gcc-12
x86_64                randconfig-074-20240706   clang-18
x86_64                randconfig-074-20240706   gcc-13
x86_64                randconfig-075-20240706   clang-18
x86_64                randconfig-076-20240706   clang-18
x86_64                randconfig-076-20240706   gcc-13
x86_64                          rhel-8.3-rust   clang-18
xtensa                            allnoconfig   gcc-13.2.0
xtensa                       common_defconfig   gcc-13.2.0
xtensa                generic_kc705_defconfig   gcc-13.2.0
xtensa                randconfig-001-20240706   gcc-13.2.0
xtensa                randconfig-002-20240706   gcc-13.2.0

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

.

Return-Path: <owner-linux-mm@kvack.org>
From: Wei Yang <richard.weiyang@gmail.com>
To: david@redhat.com,
	osalvador@suse.de,
	akpm@linux-foundation.org
Cc: linux-mm@kvack.org,
	Wei Yang <richard.weiyang@gmail.com>
Subject: [PATCH] mm/page_alloc: put __free_pages_core() in __meminit section
Date: Sat,  6 Jul 2024 06:16:15 +0000
Message-Id: <20240706061615.30322-1-richard.weiyang@gmail.com>
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202805
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Function __free_pages_core() is only used in bootmem init and hot-add
memory init path. Let's put it in __meminit section.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
 mm/memory_hotplug.c | 3 ++-
 mm/page_alloc.c     | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fae1da27f0d7..66267c26ca1b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -628,7 +628,8 @@ int restore_online_page_callback(online_page_callback_t callback)
 }
 EXPORT_SYMBOL_GPL(restore_online_page_callback);
 
-void generic_online_page(struct page *page, unsigned int order)
+/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
+void __ref generic_online_page(struct page *page, unsigned int order)
 {
 	__free_pages_core(page, order, MEMINIT_HOTPLUG);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f3d83def9be..116ee33fd1ce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1219,7 +1219,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	__count_vm_events(PGFREE, 1 << order);
 }
 
-void __free_pages_core(struct page *page, unsigned int order,
+void __meminit __free_pages_core(struct page *page, unsigned int order,
 		enum meminit_context context)
 {
 	unsigned int nr_pages = 1 << order;
-- 
2.34.1


.

Return-Path: <owner-linux-mm@kvack.org>
Message-ID: <603269e5-f3d6-4c42-a2af-6287f5e0ceca@gmx.com>
Date: Sat, 6 Jul 2024 17:40:55 +0930
MIME-Version: 1.0
Content-Language: en-US
To: Linux Memory Management List <linux-mm@kvack.org>,
 "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Memory cgroup and special hidden inodes
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202806
Newsgroups: org.kvack.linux-mm,org.kernel.vger.linux-btrfs
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Hi,

I'm wondering how memory cgroup handles the special inodes, which are
not exposed to end users and are only utilized by the filesystem itself.

Btrfs has at least 3 usages of such hidden special inodes:

- Btree (metadata) inode
   All metadata are mapped in the page cache of btree inode.
   This is the most critical usage, as btrfs has tons of metadata and
   if mem cgroup is limiting the memory usage of btree inode page cache,
   a lot of thing can go wrong pretty easily.

   And a lot of operation on this inode has no obvious task bound to.
   E.g. a lot of delayed operations happens in workqueue context, I'm not
   sure which cgroup those memory would be charged on.

- Data reloc inode
   This is only utilized by relocation, and is only transient.

- Free space cache inode(s)
   They are already being phased out, and the new free space cache tree
   is fully rely on the btree inode.

So my question is, do these special inodes need to be managed by cgroup
in the first place?

Thanks,
Qu

.

Return-Path: <owner-linux-mm@kvack.org>
Date: Sun, 7 Jul 2024 01:16:15 +0800
From: kernel test robot <lkp@intel.com>
To: Jagadeesh Kona <quic_jkona@quicinc.com>
Cc: oe-kbuild-all@lists.linux.dev,
	Linux Memory Management List <linux-mm@kvack.org>,
	Bjorn Andersson <andersson@kernel.org>,
	Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>,
	Konrad Dybcio <konrad.dybcio@linaro.org>,
	Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Subject: [linux-next:master 7769/10451]
 arch/arm64/boot/dts/qcom/sm8650-hdk.dtb: clock-controller@aaf0000:
 'required-opps' is a required property
Message-ID: <202407070147.C9c3oTqS-lkp@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202811
Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   0b58e108042b0ed28a71cd7edf5175999955b233
commit: 0bdb730e63f6628b0f8deb3f11991b1d10f9bca5 [7769/10451] arm64: dts: qcom: sm8650: Add video and camera clock controllers
config: arm64-randconfig-051-20240627 (https://download.01.org/0day-ci/archive/20240707/202407070147.C9c3oTqS-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project ad79a14c9e5ec4a369eed4adf567c22cc029863f)
dtschema version: 2024.6.dev3+g650bf2d
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240707/202407070147.C9c3oTqS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407070147.C9c3oTqS-lkp@intel.com/

dtcheck warnings: (new ones prefixed by >>)
   arch/arm64/boot/dts/qcom/sm8650.dtsi:3459.27-3534.6: Warning (avoid_unnecessary_addr_size): /soc@0/display-subsystem@ae00000/dsi@ae94000: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
   arch/arm64/boot/dts/qcom/sm8650.dtsi:3556.27-3612.6: Warning (avoid_unnecessary_addr_size): /soc@0/display-subsystem@ae00000/dsi@ae96000: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
   arch/arm64/boot/dts/qcom/sm8650-hdk.dts:841.9-852.5: Warning (graph_child_address): /soc@0/geniqup@ac0000/i2c@a8c000/typec-mux@e/ports: graph node has single child node 'port@0', #address-cells/#size-cells are not necessary
   arch/arm64/boot/dts/qcom/sm8650-hdk.dtb: phy@1c0e000: clock-output-names: ['pcie1_pipe_clk'] is too short
   	from schema $id: http://devicetree.org/schemas/phy/qcom,sc8280xp-qmp-pcie-phy.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-hdk.dtb: clock-controller@aaf0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-videocc.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-hdk.dtb: clock-controller@ade0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-camcc.yaml#
   arch/arm64/boot/dts/qcom/sm8650-hdk.dtb: /wcn7850-pmu: failed to match any schema with compatible: ['qcom,wcn7850-pmu']
--
   arch/arm64/boot/dts/qcom/sm8650.dtsi:3556.27-3612.6: Warning (avoid_unnecessary_addr_size): /soc@0/display-subsystem@ae00000/dsi@ae96000: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
   arch/arm64/boot/dts/qcom/sm8650-mtp.dtb: phy@1c0e000: clock-output-names: ['pcie1_pipe_clk'] is too short
   	from schema $id: http://devicetree.org/schemas/phy/qcom,sc8280xp-qmp-pcie-phy.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-mtp.dtb: clock-controller@aaf0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-videocc.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-mtp.dtb: clock-controller@ade0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-camcc.yaml#
--
   arch/arm64/boot/dts/qcom/sm8650.dtsi:3556.27-3612.6: Warning (avoid_unnecessary_addr_size): /soc@0/display-subsystem@ae00000/dsi@ae96000: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
   arch/arm64/boot/dts/qcom/sm8650-qrd.dtb: phy@1c0e000: clock-output-names: ['pcie1_pipe_clk'] is too short
   	from schema $id: http://devicetree.org/schemas/phy/qcom,sc8280xp-qmp-pcie-phy.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-qrd.dtb: clock-controller@aaf0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-videocc.yaml#
>> arch/arm64/boot/dts/qcom/sm8650-qrd.dtb: clock-controller@ade0000: 'required-opps' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/qcom,sm8450-camcc.yaml#
   arch/arm64/boot/dts/qcom/sm8650-qrd.dtb: /wcn7850-pmu: failed to match any schema with compatible: ['qcom,wcn7850-pmu']

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

.

Return-Path: <owner-linux-mm@kvack.org>
Date: Sat, 6 Jul 2024 13:55:11 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Michal Hocko <mhocko@kernel.org>, 
    Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@suse.de>, 
    Balbir Singh <bsingharora@gmail.com>, 
    Peter Zijlstra <peterz@infradead.org>
cc: linux-mm@kvack.org
Subject: Tools for explaining memory mappings/usage/pressure
Message-ID: <29c27dab-a590-5df2-c840-279bf9dff090@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:202812
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Hi all,

I'm trying to crowdsource information on open source tools that can be 
used directly by customers to explain memory mappings, usage, pressure, 
etc.

We encounter both internal and external users that are looking for this 
insight and it often requires significant engineering time to collect data 
to make any conclusions.

A recent example is an external customer that recently upgraded their 
userspace and started to run into memcg constrained memory pressure that 
wasn't previously observed.  After handing off a hacky script to run in 
the background, it was immediately obvious that the source of the direct 
reclaim was all of the MADV_FREE memory that was sitting around.  
Converting to MADV_DONTNEED solved their issue.

A month ago, a different external customer was concerned about increased 
memory access latency in their guest on some instances although there 
were no issues observable on the host.  After handing off a hacky script 
to run in the background, it was immediately obvious that memory 
fragmentation was resulting in a large disparity in the number of 
hugepages that were available on some instances.

Rather than hacky scripts that collect things like vmstat, memory.stat, 
buddyinfo, etc, at regular intervals, it would be preferable to hand off 
something more complete.  Idea is an open source tool that can be run in 
the background to collect metrics for the system, NUMA nodes, and memcg 
hierarchies, as well as potentially from subsystems in the kernel like 
delay accounting.  IOW, I want to be able to say "install ${tool} and send 
over the log file."

Are thre any open source tools that do a good job of this today that I can 
latch onto?  If not, sounds like I'll be writing one from scratch.  Let me 
know if there's interest in this as well.

Thanks!

.

