Return-Path: <owner-linux-mm@kvack.org>
X-Delivered-To: int-list-linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
X-FDA: 82264113624.27.82C6AE1
From: Wei Yang <richard.weiyang@gmail.com>
To: rppt@kernel.org
Cc: linux-mm@kvack.org,
	Wei Yang <richard.weiyang@gmail.com>
Subject: [PATCH] memblock tests: fix implicit declaration of function 'numa_valid_node'
Date: Mon, 24 Jun 2024 01:54:32 +0000
Message-Id: <20240624015432.31134-1-richard.weiyang@gmail.com>
X-Stat-Signature: tf3d8c519dcjipqzs9rm5mr8ok4o5dg7
X-Rspam-User: 
X-HE-Tag: 1719194090-127607
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201044
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Commit 8043832e2a12 ("memblock: use numa_valid_node() helper to check
for invalid node ID") introduced a new helper, numa_valid_node(), which
is not defined in the memblock tests.

Add it to the corresponding tools header file.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport (IBM) <rppt@kernel.org>
---
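For reference, a minimal userspace sketch of which node IDs the helper
accepts and rejects. The MAX_NUMNODES value of 4 is an assumption made
only for this illustration (the tools header derives its own value), and
describe_node() is a hypothetical name that does not exist in the tree.

```c
#include <stdbool.h>

/* Assumption: illustrative value only, not the value used by the tools header. */
#define MAX_NUMNODES	4
#define NUMA_NO_NODE	(-1)

/* Same body as the helper added by this patch. */
static inline bool numa_valid_node(int nid)
{
	return nid >= 0 && nid < MAX_NUMNODES;
}

/* Hypothetical helper for this sketch: name the class a node ID falls in. */
const char *describe_node(int nid)
{
	if (nid == NUMA_NO_NODE)
		return "no node";
	return numa_valid_node(nid) ? "valid" : "out of range";
}
```

Under these assumptions IDs 0 through 3 are valid, while NUMA_NO_NODE and
any ID at or above MAX_NUMNODES are rejected.
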
 tools/include/linux/numa.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/include/linux/numa.h b/tools/include/linux/numa.h
index 110b0e5d0fb0..c8b9369335e0 100644
--- a/tools/include/linux/numa.h
+++ b/tools/include/linux/numa.h
@@ -13,4 +13,9 @@
 
 #define	NUMA_NO_NODE	(-1)
 
+static inline bool numa_valid_node(int nid)
+{
+	return nid >= 0 && nid < MAX_NUMNODES;
+}
+
 #endif /* _LINUX_NUMA_H */
-- 
2.34.1


.

X-CID-INFO: VERSION:1.1.38,REQID:53c491e2-b8da-4ec8-862d-4f50d5d39a69,IP:15,UR
	L:0,TC:0,Content:0,EDM:25,RT:0,SF:5,FILE:0,BULK:0,RULE:Release_Ham,ACTION:
	release,TS:45
X-User: lizhenneng@kylinos.cn
From: Zhenneng Li <lizhenneng@kylinos.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH] migrate_pages: modify max number of pages to migrate in batch
Date: Mon, 24 Jun 2024 12:41:40 +0800
Message-Id: <20240624044140.117196-1-lizhenneng@kylinos.cn>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256368 org.kvack.linux-mm:201050
Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

We intend to restrict the number of pages migrated in one batch to no
more than HPAGE_PMD_NR or NR_MAX_BATCHED_MIGRATION, but in fact the
number of pages in a batch may reach 2*HPAGE_PMD_NR-1 or
2*NR_MAX_BATCHED_MIGRATION-1, which is inconsistent with the intended
limit.

Please refer to commit 42012e0436d4 ("migrate_pages: restrict number
of pages to migrate in batch").

Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
---
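The arithmetic behind the 2*N-1 figure can be sketched in userspace as
follows. This is a simplified model, not the code in mm/migrate.c:
batch_pages() and fill_worst_case() are hypothetical names, the folio
list is modelled as an array of page counts, and the limit of 512 is an
illustrative value.

```c
#include <stddef.h>

/* Assumption: illustrative limit for this sketch. */
#define NR_MAX_BATCHED_MIGRATION 512UL

/*
 * Hypothetical model of the batching loop: sizes[i] is the number of
 * pages in the i-th folio on the list. Pages are accumulated until the
 * running total reaches the limit. With cut_before_crossing == 0 the
 * folio that crossed the limit stays in the batch (cut before the next
 * folio); with 1 it is left for the next batch (cut before that folio).
 * Returns the number of pages in the resulting batch.
 */
unsigned long batch_pages(const unsigned long *sizes, size_t n,
			  int cut_before_crossing)
{
	unsigned long nr_pages = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		nr_pages += sizes[i];
		if (nr_pages >= NR_MAX_BATCHED_MIGRATION) {
			if (cut_before_crossing)
				nr_pages -= sizes[i];
			break;
		}
	}
	return nr_pages;
}

/*
 * Hypothetical worst case: N-1 single-page folios followed by one folio
 * of N pages. The array must hold NR_MAX_BATCHED_MIGRATION entries.
 */
void fill_worst_case(unsigned long *sizes)
{
	size_t i;

	for (i = 0; i < NR_MAX_BATCHED_MIGRATION - 1; i++)
		sizes[i] = 1;
	sizes[NR_MAX_BATCHED_MIGRATION - 1] = NR_MAX_BATCHED_MIGRATION;
}
```

In this model the worst-case list yields a batch of 2*N-1 pages when the
crossing folio is kept and N-1 pages when the cut is made before it.
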
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 781979567f64..7a4b37aac9e8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1961,7 +1961,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 			break;
 	}
 	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
-		list_cut_before(&folios, from, &folio2->lru);
+		list_cut_before(&folios, from, &folio->lru);
 	else
 		list_splice_init(from, &folios);
 	if (mode == MIGRATE_ASYNC)
-- 
2.25.1

.

X-CID-INFO: VERSION:1.1.38,REQID:513e5f83-4bc0-4c29-9adf-5881a04f0a2b,IP:15,UR
	L:0,TC:0,Content:0,EDM:-25,RT:0,SF:-1,FILE:0,BULK:0,RULE:EDM_GE969F26,ACTI
	ON:release,TS:-11
X-User: lizhenneng@kylinos.cn
From: Zhenneng Li <lizhenneng@kylinos.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Zhenneng Li <lizhenneng@kylinos.cn>
Subject: [PATCH] migrate_pages: modify max number of pages to migrate in batch
Date: Mon, 24 Jun 2024 12:51:20 +0800
Message-Id: <20240624045120.121261-1-lizhenneng@kylinos.cn>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256381 org.kvack.linux-mm:201051
Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

We intend to restrict the number of pages migrated in one batch to no
more than HPAGE_PMD_NR or NR_MAX_BATCHED_MIGRATION, but in fact the
number of pages in a batch may reach 2*HPAGE_PMD_NR-1 or
2*NR_MAX_BATCHED_MIGRATION-1, which is inconsistent with the intended
limit.

Please refer to commit 42012e0436d4 ("migrate_pages: restrict number
of pages to migrate in batch").

Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 781979567f64..7a4b37aac9e8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1961,7 +1961,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 			break;
 	}
 	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
-		list_cut_before(&folios, from, &folio2->lru);
+		list_cut_before(&folios, from, &folio->lru);
 	else
 		list_splice_init(from, &folios);
 	if (mode == MIGRATE_ASYNC)
-- 
2.25.1

.

Return-Path: <owner-linux-mm@kvack.org>
X-Delivered-To: int-list-linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
X-FDA: 82264896294.11.0FA7217
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
To: dri-devel@lists.freedesktop.org,
	linux-mm@kvack.org
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>,
	Daniel Vetter <daniel.vetter@ffwll.ch>,
	Hugh Dickins <hughd@google.com>,
	Peter Xu <peterx@redhat.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Dongwon Kim <dongwon.kim@intel.com>,
	Junxiao Chang <junxiao.chang@intel.com>
Subject: [PATCH v16 0/9] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
Date: Sun, 23 Jun 2024 23:36:08 -0700
Message-ID: <20240624063952.1572359-1-vivek.kasireddy@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Rspam-User: 
X-Stat-Signature: kcexm893sc9ptytga7e88o486iqd7da5
X-HE-Tag: 1719212725-120362
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201054
Newsgroups: org.kvack.linux-mm,org.freedesktop.lists.dri-devel
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Currently, some drivers (e.g., Udmabuf) that want to longterm-pin
the pages/folios associated with a memfd do so by simply taking a
reference on them. This is not desirable because the pages/folios
may reside in the Movable zone or a CMA block.

Therefore, having drivers use the memfd_pin_folios() API ensures that
the folios are appropriately pinned via FOLL_PIN for longterm DMA.

This patchset also introduces a few helpers and converts the Udmabuf
driver to use folios and the memfd_pin_folios() API to longterm-pin
the folios for DMA. Two new Udmabuf selftests are also included to
test the driver and the new API.

---

Patchset overview:

Patch 1-2:    GUP helpers to migrate and unpin one or more folios
Patch 3:      Introduce memfd_pin_folios() API
Patch 4-6:    Udmabuf driver bug fixes for Qemu + hugetlb=on, blob=true case
Patch 7-9:    Convert Udmabuf to use memfd_pin_folios() and add selftests

This series is tested using the following methods:
- Run the subtests added in the last patch
- Run Qemu (master) with the following options and a few additional
  patches to Spice:
  qemu-system-x86_64 -m 4096m....
  -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
  -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
  -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
  -machine memory-backend=mem1
- Run source ./run_vmtests.sh -t gup_test -a to check GUP regressions

Changelog:

v15 -> v16:
- Instead of passing GFP_USER while allocating a hugetlb folio, use
  htlb_alloc_mask(h) & ~(__GFP_HIGHMEM | __GFP_MOVABLE) as gfp mask
  to discourage new users from passing GFP_xxx flags. Also add comments
  to explain this situation (Oscar)
- Replace NUMA_NO_NODE with numa_node_id() while allocating the htlb
  folio to discourage new users from passing NUMA_NO_NODE

v14 -> v15:
- Add an error check start < 0 in memfd_pin_folios()
- Return an error in udmabuf driver if memfd_pin_folios() returns 0
  These two checks fix the following issue identified by syzbot:
  https://syzkaller.appspot.com/bug?extid=40c7dad27267f61839d4
- Set memfd = NULL before dmabuf export to ensure that memfd is
  not closed twice. This fixes the following syzbot issue:
  https://syzkaller.appspot.com/bug?extid=b2cfdac9ae5278d4b621

v13 -> v14:
- Drop the redundant comments before check_and_migrate_movable_pages()
  and refer to check_and_migrate_movable_folios() comments (David)
- Use appropriate ksft_* functions for printing and KSFT_* codes for
  exit() in udmabuf selftest (Shuah)
- Add Mike Kravetz's suggested-by tag in udmabuf selftest patch (Shuah)
- Collect Ack and Rb tags from David

v12 -> v13: (suggestions from David)
- Drop the sanity checks in unpin_folio()/unpin_folios() due to
  unavailability of per folio anon-exclusive flag
- Export unpin_folio()/unpin_folios() using EXPORT_SYMBOL_GPL
  instead of EXPORT_SYMBOL
- Have check_and_migrate_movable_pages() just call
  check_and_migrate_movable_folios() instead of calling other helpers
- Slightly improve the comments and commit messages

v11 -> v12:
- Rebased and tested on mm-unstable

v10 -> v11:
- Remove the version string from the patch subject (Andrew)
- Move the changelog from the patches into the cover letter
- Rearrange the patchset to have GUP patches at the beginning

v9 -> v10:
- Introduce and use unpin_folio(), unpin_folios() and
  check_and_migrate_movable_folios() helpers
- Use a list to track the folios that need to be unpinned in udmabuf

v8 -> v9: (suggestions from Matthew)
- Drop the extern while declaring memfd_alloc_folio()
- Fix memfd_alloc_folio() declaration to have it return struct folio *
  instead of struct page * when CONFIG_MEMFD_CREATE is not defined
- Use folio_pfn() on the folio instead of page_to_pfn() on head page
  in udmabuf
- Don't split the arguments to shmem_read_folio() on multiple lines
  in udmabuf

v7 -> v8: (suggestions from David)
- Have caller pass [start, end], max_folios instead of start, nr_pages
- Replace offsets array with just offset into the first page
- Add comments explaining the need for next_idx
- Pin (and return) the folio (via FOLL_PIN) only once

v6 -> v7:
- Rename this API to memfd_pin_folios() and make it return folios
  and offsets instead of pages (David)
- Don't continue processing the folios in the batch returned by
  filemap_get_folios_contig() if they do not have correct next_idx
- Add the R-b tag from Christoph

v5 -> v6: (suggestions from Christoph)
- Rename this API to memfd_pin_user_pages() to make it clear that it
  is intended for memfds
- Move the memfd page allocation helper from gup.c to memfd.c
- Fix indentation errors in memfd_pin_user_pages()
- For contiguous ranges of folios, use a helper such as
  filemap_get_folios_contig() to lookup the page cache in batches
- Split the processing of hugetlb or shmem pages into helpers to
  simplify the code in udmabuf_create()

v4 -> v5: (suggestions from David)
- For hugetlb case, ensure that we only obtain head pages from the
  mapping by using __filemap_get_folio() instead of find_get_page_flags()
- Handle -EEXIST when two or more potential users try to simultaneously
  add a huge page to the mapping by forcing them to retry on failure

v3 -> v4:
- Remove the local variable "page" and instead use 3 return statements
  in alloc_file_page() (David)
- Add the R-b tag from David

v2 -> v3: (suggestions from David)
- Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
  (Build error reported by kernel test robot <lkp@intel.com>)
- Don't forget memalloc_pin_restore() on non-migration related errors
- Improve the readability of the cleanup code associated with
  non-migration related errors
- Augment the comments by describing FOLL_LONGTERM like behavior
- Include the R-b tag from Jason

v1 -> v2:
- Drop gup_flags and improve comments and commit message (David)
- Allocate a page if we cannot find in page cache for the hugetlbfs
  case as well (David)
- Don't unpin pages if there is a migration related failure (David)
- Drop the unnecessary nr_pages <= 0 check (Jason)
- Have the caller of the API pass in file * instead of fd (Jason)

Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>

Arnd Bergmann (1):
  udmabuf: add CONFIG_MMU dependency

Vivek Kasireddy (8):
  mm/gup: Introduce unpin_folio/unpin_folios helpers
  mm/gup: Introduce check_and_migrate_movable_folios()
  mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
  udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
  udmabuf: Add back support for mapping hugetlb pages
  udmabuf: Convert udmabuf driver to use folios
  udmabuf: Pin the pages using memfd_pin_folios() API
  selftests/udmabuf: Add tests to verify data after page migration

 drivers/dma-buf/Kconfig                       |   1 +
 drivers/dma-buf/udmabuf.c                     | 232 +++++++++----
 include/linux/memfd.h                         |   5 +
 include/linux/mm.h                            |   5 +
 mm/gup.c                                      | 308 +++++++++++++++---
 mm/memfd.c                                    |  45 +++
 .../selftests/drivers/dma-buf/udmabuf.c       | 214 ++++++++++--
 7 files changed, 673 insertions(+), 137 deletions(-)

-- 
2.45.1


.

From: Mateusz Guzik <mjguzik@gmail.com>
To: cfijalkovich@google.com
Cc: brauner@kernel.org,
	viro@zeniv.linux.org.uk,
	jack@suse.cz,
	linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org,
	Mateusz Guzik <mjguzik@gmail.com>
Subject: [RFC PATCH] vfs: wrap CONFIG_READ_ONLY_THP_FOR_FS-related code with an ifdef
Date: Mon, 24 Jun 2024 09:41:34 +0200
Message-ID: <20240624074135.486845-1-mjguzik@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256504 org.kvack.linux-mm:201066
Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.linux-fsdevel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

On kernels compiled without this option (which is currently the default
state) filemap_nr_thps expands to 0.

do_dentry_open has a big chunk dependent on it, most of which gets
optimized away, except for a branch and a full fence:

if (f->f_mode & FMODE_WRITE) {
[snip]
        smp_mb();
        if (filemap_nr_thps(inode->i_mapping)) {
[snip]
	}
}

While the branch is pretty minor the fence really does not need to be
there.

This is a bare-minimum patch which takes care of it until someone(tm)
cleans this up. Notably it does not conditionally compile other spots
which issue the matching fence.

I did not bother benchmarking it; not issuing a spurious full fence in
the fast path does not need justification from a perf standpoint.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

I am not particularly familiar with any of this, the smp_mb in the open
for write path was sticking out like a sore thumb on code read so I
figured there may be One Weird Trick to whack it.

If the stock code is correct as is, then the ifdef as above is fine.

The ifdefed chunk is big enough that it should probably be its own
routine. I don't want to bikeshed so I did not go for it.

For a moment I considered adding filemap_nr_thps_mb which would expand
to 0 or issue the fence + do the read, but then I figured a routine
claiming to post a fence and only conditionally do it is misleading at
best.

As per the commit message, fences in collapse_file remain compiled in.
It is unclear to me if the code following them is doing anything useful
on kernels !CONFIG_READ_ONLY_THP_FOR_FS.

All that said, if there are cosmetic touch-ups you want done here, I can
do them.

However, a nice full patch would take care of all of the above and I
have neither the information needed to do it nor the interest to get it,
so should someone insist on a full version I'm going to suggest they
write it themselves. I repeat, this is merely damage control until
someone sorts things out.
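
For illustration, a minimal userspace sketch of the effect described in
the commit message. Every name here is a hypothetical stand-in:
sketch_smp_mb() models smp_mb() with a compiler builtin plus a counter,
sketch_nr_thps() models filemap_nr_thps(), and
SKETCH_READ_ONLY_THP_FOR_FS models the config option and is left
undefined.

```c
#include <stdio.h>

/* Counts fences issued, so the effect of the ifdef is observable. */
int fences_issued;

/* Userspace stand-in for smp_mb(): a full barrier via a compiler builtin. */
void sketch_smp_mb(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	fences_issued++;
}

/* Assumption: hypothetical stand-in for the config option, undefined here. */
#ifdef SKETCH_READ_ONLY_THP_FOR_FS
static inline int sketch_nr_thps(void) { return 1; }
#else
/* Without the option the helper is a constant 0. */
static inline int sketch_nr_thps(void) { return 0; }
#endif

/* Before: the branch folds away, but the fence is still executed. */
void open_for_write_before(void)
{
	sketch_smp_mb();
	if (sketch_nr_thps())
		puts("would drop the page cache here");
}

/* After: the whole chunk, fence included, is compiled out. */
void open_for_write_after(void)
{
#ifdef SKETCH_READ_ONLY_THP_FOR_FS
	sketch_smp_mb();
	if (sketch_nr_thps())
		puts("would drop the page cache here");
#endif
}
```

With the stand-in option undefined, open_for_write_before() issues one
fence per call and open_for_write_after() issues none.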

 fs/open.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index 28f2fcbebb1b..654c300b3c33 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -980,6 +980,7 @@ static int do_dentry_open(struct file *f,
 	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
 		return -EINVAL;
 
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
 	/*
 	 * XXX: Huge page cache doesn't support writing yet. Drop all page
 	 * cache for this file before processing writes.
@@ -1007,6 +1008,7 @@ static int do_dentry_open(struct file *f,
 			filemap_invalidate_unlock(inode->i_mapping);
 		}
 	}
+#endif
 
 	return 0;
 
-- 
2.43.0

.

Return-Path: <owner-linux-mm@kvack.org>
Date: Mon, 24 Jun 2024 16:44:01 +0800
From: kernel test robot <oliver.sang@intel.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
CC: <oe-lkp@lists.linux.dev>, <lkp@intel.com>, Linux Memory Management List
	<linux-mm@kvack.org>, Ankur Arora <ankur.a.arora@oracle.com>, Peter Zijlstra
	<peterz@infradead.org>, <rcu@vger.kernel.org>, <oliver.sang@intel.com>
Subject: [linux-next:master] [rcutorture]  7a1fcbb52e:
 WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_stats_print[rcutorture]
Message-ID: <202406241652.e44865a0-lkp@intel.com>
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201075
Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-lkp,org.kernel.vger.rcu
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail



Hello,

The config for this test is a randconfig. With this commit, the config
differs from the parent's config as shown in the diff below.

--- /pkg/linux/i386-randconfig-r022-20221003/gcc-13/d65635ebba92ee02e8284acfabbaf6b59ec0b5b6/.config    2024-06-23 23:24:45.203220229 +0800
+++ /pkg/linux/i386-randconfig-r022-20221003/gcc-13/7a1fcbb52e611c49331bd66dd2da1efa4c0afef7/.config    2024-06-23 23:59:34.976722738 +0800
@@ -133,7 +133,7 @@ CONFIG_IRQ_TIME_ACCOUNTING=y
 CONFIG_TREE_RCU=y
 CONFIG_PREEMPT_RCU=y
 CONFIG_RCU_EXPERT=y
-CONFIG_TREE_SRCU=y
+CONFIG_TINY_SRCU=y
 CONFIG_TASKS_RCU_GENERIC=y
 CONFIG_FORCE_TASKS_RCU=y
 CONFIG_NEED_TASKS_RCU=y


We don't have enough knowledge of how this diff impacts rcutorture test
results; this is just an FYI on what we observed in our tests.


kernel test robot noticed "WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_stats_print[rcutorture]" on:

commit: 7a1fcbb52e611c49331bd66dd2da1efa4c0afef7 ("rcutorture: Add SRCU-V scenario for preemptible Tiny SRCU")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master f76698bd9a8ca01d3581236082d786e9a6b72bb7]

in testcase: rcutorture
version: 
with following parameters:

	runtime: 300s
	test: default
	torture_type: busted_srcud



compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-------------------------------------------------------------------------+------------+------------+
|                                                                         | d65635ebba | 7a1fcbb52e |
+-------------------------------------------------------------------------+------------+------------+
| WARNING:suspicious_RCU_usage                                            | 6          | 6          |
| kernel/rcu/rcutorture.c:#suspicious_rcu_dereference_check()usage        | 6          | 6          |
| WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_stats_print[rcutorture] | 0          | 6          |
| EIP:rcu_torture_stats_print                                             | 0          | 6          |
+-------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406241652.e44865a0-lkp@intel.com



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240624/202406241652.e44865a0-lkp@intel.com


[   96.952168][  T436] busted_srcud-torture: !!! 
[   96.952253][  T436] ------------[ cut here ]------------
[   96.954867][  T436] WARNING: CPU: 0 PID: 436 at kernel/rcu/rcutorture.c:2268 rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.956537][  T436] Modules linked in: rcutorture torture ppdev i6300esb rapl ipmi_devintf parport_pc bochs drm_vram_helper parport drm_ttm_helper ttm drm_kms_helper i2c_piix4 tiny_power_button ata_generic button fuse drm drm_panel_orientation_quirks i2c_core configfs
[   96.959483][  T436] CPU: 0 PID: 436 Comm: rcu_torture_sta Not tainted 6.10.0-rc1-00067-g7a1fcbb52e61 #1
[   96.960380][  T436] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   96.961355][  T436] EIP: rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.962139][  T436] Code: 85 c0 74 02 0f 0b 83 3d e4 17 59 ef 00 74 02 0f 0b 83 3d e0 17 59 ef 00 74 02 0f 0b 83 3d dc 17 59 ef 00 74 02 0f 0b 4b 7e 02 <0f> 0b 68 31 76 59 ef 31 db e8 40 11 54 de 5f 83 fb 0b 72 0c 89 da
[   96.963988][  T436] EAX: 00000000 EBX: 00000003 ECX: ef597627 EDX: ef59762c
[   96.964651][  T436] ESI: ef591994 EDI: 0000453c EBP: c6567f6c ESP: c6567eec
[   96.965187][  T436] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010206
[   96.965660][  T436] CR0: 80050033 CR2: b6b5f000 CR3: 04f71000 CR4: 00040690
[   96.966172][  T436] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   96.966624][  T436] DR6: fffe0ff0 DR7: 00000400
[   96.966924][  T436] Call Trace:
[   96.967138][  T436]  ? show_regs+0x45/0x4b
[   96.967415][  T436]  ? rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.967849][  T436]  ? __warn+0x7c/0x113
[   96.968117][  T436]  ? report_bug+0xb3/0x111
[   96.968406][  T436]  ? rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.968880][  T436]  ? exc_overflow+0x37/0x37
[   96.969174][  T436]  ? handle_bug+0x2b/0x47
[   96.969452][  T436]  ? exc_invalid_op+0x17/0x53
[   96.969755][  T436]  ? handle_exception+0x100/0x100
[   96.970141][  T436]  ? exc_overflow+0x37/0x37
[   96.970434][  T436]  ? rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.970871][  T436]  ? exc_overflow+0x37/0x37
[   96.971162][  T436]  ? rcu_torture_stats_print+0x253/0x496 [rcutorture]
[   96.971599][  T436]  ? __timer_delete_sync+0x82/0x92
[   96.971938][  T436]  rcu_torture_stats+0x3d/0x5e [rcutorture]
[   96.972322][  T436]  kthread+0xdc/0xe1
[   96.972575][  T436]  ? rcu_torture_stats_print+0x496/0x496 [rcutorture]
[   96.973010][  T436]  ? kthread_park+0x62/0x62
[   96.973302][  T436]  ret_from_fork+0x1c/0x2f
[   96.973589][  T436]  ? kthread_park+0x62/0x62
[   96.973879][  T436]  ret_from_fork_asm+0x12/0x18
[   96.974230][  T436]  entry_INT80_32+0xef/0xef
[   96.974519][  T436] irq event stamp: 535
[   96.974777][  T436] hardirqs last  enabled at (543): [<cda8b4e9>] console_unlock+0x80/0xd4
[   96.975183][  T436] hardirqs last disabled at (550): [<cda8b4cc>] console_unlock+0x63/0xd4
[   96.975588][  T436] softirqs last  enabled at (104): [<cda4d6de>] handle_softirqs+0x2cf/0x2f9
[   96.976007][  T436] softirqs last disabled at (97): [<ce313fce>] __do_softirq+0xa/0xc
[   96.976392][  T436] ---[ end trace 0000000000000000 ]---
[   96.999461][  T436] Reader Pipe:  1738294 6779 1992 57 1 0 0 0 0 0 0
[   96.999803][  T436] busted_srcud-torture: Reader Batch:  1747123 0 0 0 0 0 0 0 0 0 0



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


.

Date: Mon, 24 Jun 2024 16:49:04 +0800
From: kernel test robot <oliver.sang@intel.com>
To: Usama Arif <usamaarif642@gmail.com>
CC: <oe-lkp@lists.linux.dev>, <lkp@intel.com>, Linux Memory Management List
	<linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>, "Chengming
 Zhou" <chengming.zhou@linux.dev>, Yosry Ahmed <yosryahmed@google.com>, "Nhat
 Pham" <nphamcs@gmail.com>, David Hildenbrand <david@redhat.com>, "Huang,
 Ying" <ying.huang@intel.com>, Hugh Dickins <hughd@google.com>, Johannes
 Weiner <hannes@cmpxchg.org>, Matthew Wilcox <willy@infradead.org>, Shakeel
 Butt <shakeel.butt@linux.dev>, Andi Kleen <ak@linux.intel.com>,
	<linux-kernel@vger.kernel.org>, <oliver.sang@intel.com>
Subject: [linux-next:master] [mm]  0fa2857d23:
 WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
Message-ID: <202406241651.963e3e78-oliver.sang@intel.com>
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256593 org.kvack.linux-mm:201076
Newsgroups: org.kernel.vger.linux-kernel,dev.linux.lists.oe-lkp,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail



Hello,

kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:

commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master f76698bd9a8ca01d3581236082d786e9a6b72bb7]

in testcase: vm-scalability
version: vm-scalability-x86_64-6f4ef16-0_20240303
with following parameters:

	runtime: 300
	thp_enabled: always
	thp_defrag: always
	nr_task: 32
	nr_ssd: 1
	priority: 1
	test: swap-w-rand-mt
	cpufreq_governor: performance



compiler: gcc-13
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406241651.963e3e78-oliver.sang@intel.com


[   34.776816][ T2413] ------------[ cut here ]------------
[   34.782497][ T2413] WARNING: CPU: 11 PID: 2413 at mm/page_alloc.c:4685 __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[   34.792245][ T2413] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c sd_mod t10_pi intel_rapl_msr intel_rapl_common crc64_rocksoft_generic crc64_rocksoft x86_pkg_temp_thermal crc64 intel_powerclamp sg coretemp binfmt_misc kvm_intel ipmi_ssif kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ahci ast libahci rapl drm_shmem_helper intel_cstate mei_me intel_th_gth ioatdma acpi_power_meter i2c_i801 intel_th_pci libata intel_uncore drm_kms_helper ipmi_si acpi_ipmi dax_hmem mei i2c_smbus intel_th intel_pch_thermal dca wmi ipmi_devintf ipmi_msghandler acpi_pad joydev drm fuse loop dm_mod ip_tables
[   34.849370][ T2413] CPU: 11 PID: 2413 Comm: swapon Not tainted 6.10.0-rc4-00263-g0fa2857d23aa #1
[   34.858458][ T2413] RIP: 0010:__alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[   34.864602][ T2413] Code: 00 00 00 48 89 54 24 08 e9 83 fe ff ff 83 fd 0a 0f 86 f6 fd ff ff 80 3d 8a f4 d6 01 00 0f 85 7f fe ff ff c6 05 7d f4 d6 01 01 <0f> 0b e9 71 fe ff ff f7 c1 00 00 80 00 75 61 f7 c1 00 00 08 00 74
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 48 89             	add    %cl,-0x77(%rax)
   5:	54                   	push   %rsp
   6:	24 08                	and    $0x8,%al
   8:	e9 83 fe ff ff       	jmpq   0xfffffffffffffe90
   d:	83 fd 0a             	cmp    $0xa,%ebp
  10:	0f 86 f6 fd ff ff    	jbe    0xfffffffffffffe0c
  16:	80 3d 8a f4 d6 01 00 	cmpb   $0x0,0x1d6f48a(%rip)        # 0x1d6f4a7
  1d:	0f 85 7f fe ff ff    	jne    0xfffffffffffffea2
  23:	c6 05 7d f4 d6 01 01 	movb   $0x1,0x1d6f47d(%rip)        # 0x1d6f4a7
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 71 fe ff ff       	jmpq   0xfffffffffffffea2
  31:	f7 c1 00 00 80 00    	test   $0x800000,%ecx
  37:	75 61                	jne    0x9a
  39:	f7 c1 00 00 08 00    	test   $0x80000,%ecx
  3f:	74                   	.byte 0x74

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 71 fe ff ff       	jmpq   0xfffffffffffffe78
   7:	f7 c1 00 00 80 00    	test   $0x800000,%ecx
   d:	75 61                	jne    0x70
   f:	f7 c1 00 00 08 00    	test   $0x80000,%ecx
  15:	74                   	.byte 0x74
[   34.884371][ T2413] RSP: 0018:ffa000000ce8fda8 EFLAGS: 00010246
[   34.890619][ T2413] RAX: 0000000000000000 RBX: 0000000000040dc0 RCX: 0000000000000000
[   34.898766][ T2413] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000040dc0
[   34.906910][ T2413] RBP: 000000000000000b R08: ffa000000ce8fd44 R09: ff11000104e13bc0
[   34.915074][ T2413] R10: ffa000000ce8feb0 R11: ffa0000023201000 R12: 0000000000000000
[   34.923264][ T2413] R13: 0000000000000001 R14: 0000000000000dc0 R15: 0000000003200000
[   34.931414][ T2413] FS:  00007f8ac1a03840(0000) GS:ff1100103e780000(0000) knlGS:0000000000000000
[   34.940527][ T2413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.947348][ T2413] CR2: 000056306078b000 CR3: 00000001307f4001 CR4: 0000000000771ef0
[   34.955505][ T2413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   34.963661][ T2413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   34.971815][ T2413] PKRU: 55555554
[   34.975551][ T2413] Call Trace:
[   34.979030][ T2413]  <TASK>
[ 34.982179][ T2413] ? __warn (kernel/panic.c:693) 
[ 34.986427][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[ 34.991965][ T2413] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 34.996643][ T2413] ? handle_bug (arch/x86/kernel/traps.c:239) 
[ 35.001163][ T2413] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[ 35.006011][ T2413] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 35.011233][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[ 35.016765][ T2413] __kmalloc_large_node (mm/slub.c:4069) 
[ 35.022043][ T2413] __kmalloc_noprof (arch/x86/include/asm/bitops.h:417 include/asm-generic/getorder.h:46 mm/slub.c:4113 mm/slub.c:4136) 
[ 35.027066][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.032196][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.037290][ T2413] ? __do_sys_swapon (mm/swapfile.c:3167) 
[ 35.042379][ T2413] __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.047300][ T2413] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) 
[ 35.051955][ T2413] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[   35.058002][ T2413] RIP: 0033:0x7f8ac1bcef97
[ 35.062571][ T2413] Code: 73 01 c3 48 8b 0d 69 2e 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a7 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 2e 0d 00 f7 d8 64 89 01 48
All code
========
   0:	73 01                	jae    0x3
   2:	c3                   	retq   
   3:	48 8b 0d 69 2e 0d 00 	mov    0xd2e69(%rip),%rcx        # 0xd2e73
   a:	f7 d8                	neg    %eax
   c:	64 89 01             	mov    %eax,%fs:(%rcx)
   f:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
  13:	c3                   	retq   
  14:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1b:	00 00 00 
  1e:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  23:	b8 a7 00 00 00       	mov    $0xa7,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq   
  33:	48 8b 0d 39 2e 0d 00 	mov    0xd2e39(%rip),%rcx        # 0xd2e73
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq   
   9:	48 8b 0d 39 2e 0d 00 	mov    0xd2e39(%rip),%rcx        # 0xd2e49
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W
[   35.063745][ T1492] is_virt=false
[   35.082007][ T2413] RSP: 002b:00007fffa761ac08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a7
[   35.082010][ T2413] RAX: ffffffffffffffda RBX: 000056306077c190 RCX: 00007f8ac1bcef97
[   35.082010][ T2413] RDX: 0000000000008001 RSI: 0000000000008001 RDI: 000056306077c190
[   35.082011][ T2413] RBP: 0000000000008001 R08: 0000000000000ff6 R09: 0000000000001000
[   35.082012][ T2413] R10: 4e45505355533253 R11: 0000000000000246 R12: 00007fffa761ae3c
[   35.082012][ T2413] R13: 0000000000000001 R14: 0000003200000000 R15: 000056306077cfe0
[   35.082014][ T2413]  </TASK>
[   35.082015][ T2413] ---[ end trace 0000000000000000 ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240624/202406241651.963e3e78-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

.

From: Mateusz Guzik <mjguzik@gmail.com>
To: akpm@linux-foundation.org
Cc: brauner@kernel.org,
	viro@zeniv.linux.org.uk,
	jack@suse.cz,
	linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org,
	Mateusz Guzik <mjguzik@gmail.com>
Subject: [PATCH v2] vfs: remove redundant smp_mb for thp handling in do_dentry_open
Date: Mon, 24 Jun 2024 10:54:02 +0200
Message-ID: <20240624085402.493630-1-mjguzik@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256598 org.kvack.linux-mm:201077
Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.linux-fsdevel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

opening for write performs:

if (f->f_mode & FMODE_WRITE) {
[snip]
        smp_mb();
        if (filemap_nr_thps(inode->i_mapping)) {
[snip]
        }
}

filemap_nr_thps on kernels built without CONFIG_READ_ONLY_THP_FOR_FS
expands to 0, allowing the compiler to eliminate the entire thing, with
the exception of the fence (and the branch leading there).

As it happens, the required synchronisation between i_writecount and
nr_thps changes is already provided by the full fence coming from
get_write_access -> atomic_inc_unless_negative, thus the smp_mb instance
above can be removed regardless of CONFIG_READ_ONLY_THP_FOR_FS.
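
The pairing argued above can be sketched in userspace C11. This is only
an illustration under stated assumptions: writecount and nr_thps are
hypothetical stand-ins for inode->i_writecount and the mapping's THP
counter, and sequentially consistent C11 atomics stand in for the
kernel's fully ordered atomic_inc_unless_negative() and smp_mb(). It is
not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for inode->i_writecount and mapping->nr_thps. */
static atomic_int writecount;
static atomic_int nr_thps;

/*
 * Opener side: a seq_cst read-modify-write orders the increment before
 * the following load, so no separate fence is needed between the two.
 */
static bool opener_sees_thps(void)
{
	int old = atomic_fetch_add(&writecount, 1);

	assert(old >= 0);	/* this sketch never denies write access */
	return atomic_load(&nr_thps) != 0;
}

/* Collapser side: publish nr_thps, full fence, then look for writers. */
static bool collapser_sees_writer(void)
{
	atomic_fetch_add(&nr_thps, 1);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	return atomic_load(&writecount) > 0;
}
```

When the two sides race, at least one of the two functions returns
true: either the opener observes the new nr_thps or the collapser
observes the writer. That is the property the removed smp_mb() was
meant to provide.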

While I updated commentary in places claiming to match the now-removed
fence, I did not try to patch them to act on the compile option.

I did not bother benchmarking it: not issuing a spurious full fence in
the fast path needs no justification from a perf standpoint.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

v2:
- just whack the fence instead of ifdefing
- change To recipient, the person who committed the original change is
  no longer active

 fs/open.c       |  9 ++++-----
 mm/khugepaged.c | 10 +++++-----
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 28f2fcbebb1b..64976b6dc75f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -986,12 +986,11 @@ static int do_dentry_open(struct file *f,
 	 */
 	if (f->f_mode & FMODE_WRITE) {
 		/*
-		 * Paired with smp_mb() in collapse_file() to ensure nr_thps
-		 * is up to date and the update to i_writecount by
-		 * get_write_access() is visible. Ensures subsequent insertion
-		 * of THPs into the page cache will fail.
+		 * Depends on full fence from get_write_access() to synchronize
+		 * against collapse_file() regarding i_writecount and nr_thps
+		 * updates. Ensures subsequent insertion of THPs into the page
+		 * cache will fail.
 		 */
-		smp_mb();
 		if (filemap_nr_thps(inode->i_mapping)) {
 			struct address_space *mapping = inode->i_mapping;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 409f67a817f1..2e017585f813 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1997,9 +1997,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	if (!is_shmem) {
 		filemap_nr_thps_inc(mapping);
 		/*
-		 * Paired with smp_mb() in do_dentry_open() to ensure
-		 * i_writecount is up to date and the update to nr_thps is
-		 * visible. Ensures the page cache will be truncated if the
+		 * Paired with the fence in do_dentry_open() -> get_write_access()
+		 * to ensure i_writecount is up to date and the update to nr_thps
+		 * is visible. Ensures the page cache will be truncated if the
 		 * file is opened writable.
 		 */
 		smp_mb();
@@ -2187,8 +2187,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	if (!is_shmem && result == SCAN_COPY_MC) {
 		filemap_nr_thps_dec(mapping);
 		/*
-		 * Paired with smp_mb() in do_dentry_open() to
-		 * ensure the update to nr_thps is visible.
+		 * Paired with the fence in do_dentry_open() -> get_write_access()
+		 * to ensure the update to nr_thps is visible.
 		 */
 		smp_mb();
 	}
-- 
2.43.0

.

Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley
	Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United
	Kingdom.
	Registered in England and Wales under Company Registration No. 3798903
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
    Matthew Wilcox <willy@infradead.org>
cc: dhowells@redhat.com, Jeff Layton <jlayton@kernel.org>,
    netfs@lists.linux.dev, v9fs@lists.linux.dev,
    linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH] netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid
X-Mailing-List: linux-cifs@vger.kernel.org
List-Id: <linux-cifs.vger.kernel.org>
List-Subscribe: <mailto:linux-cifs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-cifs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 24 Jun 2024 12:23:01 +0100
Message-ID: <614257.1719228181@warthog.procyon.org.uk>
Xref: photonic.trudheim.com org.kernel.vger.linux-cifs:43202 org.kernel.vger.linux-kernel:1256749 org.kvack.linux-mm:201089
Newsgroups: org.kernel.vger.linux-cifs,dev.linux.lists.netfs,dev.linux.lists.v9fs,org.kernel.vger.linux-fsdevel,org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has
taken the folio lock (as filemap_page_mkwrite() does).  Without this,
generic/247 occasionally oopses with something like the following:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page

    RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0
    ...
    Call Trace:
     <TASK>
     ? __die_body+0x1a/0x60
     ? page_fault_oops+0x6e/0xa0
     ? exc_page_fault+0xc2/0xe0
     ? asm_exc_page_fault+0x22/0x30
     ? trace_event_raw_event_netfs_folio+0x61/0xc0
     trace_netfs_folio+0x39/0x40
     netfs_page_mkwrite+0x14c/0x1d0
     do_page_mkwrite+0x50/0x90
     do_pte_missing+0x184/0x200
     __handle_mm_fault+0x42d/0x500
     handle_mm_fault+0x121/0x1f0
     do_user_addr_fault+0x23e/0x3c0
     exc_page_fault+0xc2/0xe0
     asm_exc_page_fault+0x22/0x30

This is due to the invalidate_inode_pages2_range() issued at the end of the
DIO write interfering with the mmap'd writes.

Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/buffered_write.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index c36643c97cb5..6a6387b3aaff 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -497,6 +497,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
 	struct netfs_group *group;
 	struct folio *folio = page_folio(vmf->page);
 	struct file *file = vmf->vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = file_inode(file);
 	struct netfs_inode *ictx = netfs_inode(inode);
 	vm_fault_t ret = VM_FAULT_RETRY;
@@ -508,6 +509,10 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
 
 	if (folio_lock_killable(folio) < 0)
 		goto out;
+	if (folio->mapping != mapping) {
+		ret = VM_FAULT_NOPAGE | VM_FAULT_LOCKED;
+		goto out;
+	}
 
 	if (folio_wait_writeback_killable(folio)) {
 		ret = VM_FAULT_LOCKED;
@@ -523,7 +528,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
 	group = netfs_folio_group(folio);
 	if (group != netfs_group && group != NETFS_FOLIO_COPY_TO_CACHE) {
 		folio_unlock(folio);
-		err = filemap_fdatawait_range(inode->i_mapping,
+		err = filemap_fdatawait_range(mapping,
 					      folio_pos(folio),
 					      folio_pos(folio) + folio_size(folio));
 		switch (err) {

.

Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley
	Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United
	Kingdom.
	Registered in England and Wales under Company Registration No. 3798903
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
    Matthew Wilcox <willy@infradead.org>
cc: dhowells@redhat.com, Jeff Layton <jlayton@kernel.org>,
    netfs@lists.linux.dev, v9fs@lists.linux.dev,
    linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH] netfs: Fix netfs_page_mkwrite() to flush conflicting data, not wait
X-Mailing-List: linux-cifs@vger.kernel.org
List-Id: <linux-cifs.vger.kernel.org>
List-Subscribe: <mailto:linux-cifs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-cifs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 24 Jun 2024 12:24:03 +0100
Message-ID: <614300.1719228243@warthog.procyon.org.uk>
Xref: photonic.trudheim.com org.kernel.vger.linux-cifs:43203 org.kernel.vger.linux-kernel:1256750 org.kvack.linux-mm:201090
Newsgroups: org.kernel.vger.linux-cifs,dev.linux.lists.netfs,dev.linux.lists.v9fs,org.kernel.vger.linux-fsdevel,org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not
filemap_fdatawait_range() to flush conflicting data.

Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/buffered_write.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 05745bcc54c6..9cbbeeee6170 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -554,9 +554,9 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
 	group = netfs_folio_group(folio);
 	if (group != netfs_group && group != NETFS_FOLIO_COPY_TO_CACHE) {
 		folio_unlock(folio);
-		err = filemap_fdatawait_range(mapping,
-					      folio_pos(folio),
-					      folio_pos(folio) + folio_size(folio));
+		err = filemap_fdatawrite_range(mapping,
+					       folio_pos(folio),
+					       folio_pos(folio) + folio_size(folio));
 		switch (err) {
 		case 0:
 			ret = VM_FAULT_RETRY;

.

Subject: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd
 across NUMA nodes
From: Jesper Dangaard Brouer <hawk@kernel.org>
To: tj@kernel.org, cgroups@vger.kernel.org, yosryahmed@google.com,
 shakeel.butt@linux.dev
Cc: Jesper Dangaard Brouer <hawk@kernel.org>, hannes@cmpxchg.org,
 lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Mon, 24 Jun 2024 13:55:32 +0200
Message-ID: <171923011608.1500238.3591002573732683639.stgit@firesoul>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256807 org.kvack.linux-mm:201095
Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.cgroups,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Avoid lock contention on the global cgroup rstat lock caused by kswapd
starting on all NUMA nodes simultaneously. At Cloudflare, we observed
massive issues due to kswapd and the specific mem_cgroup_flush_stats()
call inlined in shrink_node, which takes the rstat lock.

On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
we noted severe lock contention on the rstat lock. This contention
causes 12 CPUs to waste cycles spinning every time kswapd runs.
Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
due to spinning on the rstat lock.

To help reviewers follow the code: When the Per-CPU-Pages (PCP) freelist
is empty, __alloc_pages_slowpath calls wake_all_kswapds(), causing all
kswapdN threads to wake up simultaneously. The kswapd thread invokes
shrink_node (via balance_pgdat), triggering the cgroup rstat flush
operation as part of its work. This results in kernel self-induced rstat
lock contention by waking up all kswapd threads simultaneously.
Leveraging this detail: balance_pgdat() has a NULL value in
target_mem_cgroup, which causes mem_cgroup_flush_stats() to flush with
root_mem_cgroup.

To resolve the kswapd issue, we generalized the "stats_flush_ongoing"
concept to apply to all users of cgroup rstat, not just memcg. This
concept was originally reverted in commit 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing"). If there is an ongoing rstat flush,
limited to the root cgroup, the flush is skipped. This is effective as
kswapd operates on the root tree, sufficiently mitigating the thundering
herd problem.
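
The skip logic described above can be sketched in userspace C11. This is
a minimal illustration, assuming a single flag and the hypothetical
helpers flush_begin()/flush_end(); the real locking and the flush itself
are left out, and C11 atomics stand in for the kernel's atomic_t
operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* 0 = idle, 1 = a flush of the whole tree (root) is in progress. */
static atomic_int root_flush_ongoing;

/* Returns true if the caller should go on and flush, false to skip. */
static bool flush_begin(bool is_root)
{
	/* Someone is already flushing the entire tree: nothing to add. */
	if (atomic_load(&root_flush_ongoing))
		return false;

	/* Root flushers race for the flag; only the winner proceeds. */
	if (is_root && atomic_exchange(&root_flush_ongoing, 1))
		return false;

	return true;
}

/* Called by the flusher once its flush has completed. */
static void flush_end(bool is_root)
{
	if (is_root) {
		assert(atomic_load(&root_flush_ongoing) == 1);
		atomic_store(&root_flush_ongoing, 0);
	}
}
```

With twelve kswapd threads calling flush_begin(true) at once, one wins
the exchange and the other eleven return immediately instead of
spinning on the lock.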

This lowers contention on the global rstat lock, although limited to the
root cgroup. Flushing cgroup subtrees can still lead to lock contention.

Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/

 include/linux/cgroup.h |    5 +++++
 kernel/cgroup/rstat.c  |   25 +++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2150ca60394b..ad41cca5c3b6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -499,6 +499,11 @@ static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
 	return NULL;
 }
 
+static inline bool cgroup_is_root(struct cgroup *cgrp)
+{
+	return cgroup_parent(cgrp) == NULL;
+}
+
 /**
  * cgroup_is_descendant - test ancestry
  * @cgrp: the cgroup to be tested
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index fb8b49437573..2591840b6dc1 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static atomic_t root_rstat_flush_ongoing = ATOMIC_INIT(0);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -350,8 +351,25 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	might_sleep();
 
+	/*
+	 * This avoids thundering herd problem on global rstat lock. When an
+	 * ongoing flush of the entire tree is in progress, then skip flush.
+	 */
+	if (atomic_read(&root_rstat_flush_ongoing))
+		return;
+
+	/* Grab right to be ongoing flusher, return if losing race */
+	if (cgroup_is_root(cgrp) &&
+	    atomic_xchg(&root_rstat_flush_ongoing, 1))
+		return;
+
 	__cgroup_rstat_lock(cgrp, -1);
+
 	cgroup_rstat_flush_locked(cgrp);
+
+	if (cgroup_is_root(cgrp))
+		atomic_set(&root_rstat_flush_ongoing, 0);
+
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
@@ -362,13 +380,20 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  * Flush stats in @cgrp's subtree and prevent further flushes.  Must be
  * paired with cgroup_rstat_flush_release().
  *
+ * Current invariant, not called with root cgrp.
+ *
  * This function may block.
  */
 void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
+
 	__cgroup_rstat_lock(cgrp, -1);
+
+	if (atomic_read(&root_rstat_flush_ongoing))
+		return;
+
 	cgroup_rstat_flush_locked(cgrp);
 }
 


.

From: Usama Arif <usamaarif642@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org,
	shakeel.butt@linux.dev,
	david@redhat.com,
	ying.huang@intel.com,
	hughd@google.com,
	willy@infradead.org,
	yosryahmed@google.com,
	nphamcs@gmail.com,
	chengming.zhou@linux.dev,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	kernel-team@meta.com,
	Usama Arif <usamaarif642@gmail.com>
Subject: [PATCH v6 0/2] mm: store zero pages to be swapped out in a bitmap
Date: Mon, 24 Jun 2024 15:01:27 +0100
Message-ID: <20240624140427.1334871-1-usamaarif642@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1256951 org.kvack.linux-mm:201106
Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

As shown in the patchseries that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet.
By using VM counters in swap_writepage (not included in this
patchseries) it was found that less than 1% of the same-filled
pages to be swapped out are non-zero pages.

For a conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.
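
The idea can be sketched in plain C. This is a minimal illustration, not
the code from this patchseries: PAGE_SZ, the helper names and the
non-atomic bit operations are assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ		4096	/* assumed page size */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Check the last byte first, then scan the rest of the page. */
static bool page_is_zero(const unsigned char *page)
{
	assert(page != NULL);
	if (page[PAGE_SZ - 1])
		return false;
	for (size_t i = 0; i < PAGE_SZ - 1; i++)
		if (page[i])
			return false;
	return true;
}

/* Write side: returns true if the slot was marked and no I/O is needed. */
static bool swap_out_mark_zero(unsigned long *zeromap, size_t slot,
			       const unsigned char *page)
{
	unsigned long mask = 1UL << (slot % BITS_PER_WORD);

	if (page_is_zero(page)) {
		zeromap[slot / BITS_PER_WORD] |= mask;
		return true;
	}
	zeromap[slot / BITS_PER_WORD] &= ~mask;
	return false;
}

/* Read side: returns true if the page was filled from the bitmap. */
static bool swap_in_fill_zero(const unsigned long *zeromap, size_t slot,
			      unsigned char *page)
{
	unsigned long mask = 1UL << (slot % BITS_PER_WORD);

	if (!(zeromap[slot / BITS_PER_WORD] & mask))
		return false;
	memset(page, 0, PAGE_SZ);
	return true;
}
```

A caller would fall back to the normal write or read path whenever
either helper returns false.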

When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero filled pages resulting in memory savings
which would offset the memory used for the bitmap.

A similar attempt was made earlier in [2] where zswap would only track
zero-filled pages instead of same-filled.
This patchseries adds zero-filled pages optimization to swap
(hence it can be used even if zswap is disabled) and removes the
same-filled code from zswap (as only 1% of the same-filled pages are
non-zero), simplifying code.

This patchseries is based on mm-unstable.

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/

---
v5 -> v6 (kernel test robot <oliver.sang@intel.com>):
- change bitmap_zalloc/free to kvzalloc/free as a very large swap
  file will cause the allocation order to exceed MAX_PAGE_ORDER,
  resulting in bitmap_zalloc failing.

v4 -> v5 (Yosry):
- Correct comment about using clear_bit instead of bitmap_clear.
- Remove clearing the zeromap from swap_cluster_schedule_discard
  and swap_do_scheduled_discard.

v3 -> v4:
- remove folio_start/end_writeback when folio is zero filled at
  swap_writepage (Matthew)
- check if a large folio is partially in zeromap and return without
  folio_mark_uptodate so that an IO error is emitted, rather than
  checking zswap/disk (Yosry)
- clear zeromap in swap_free_cluster (Nhat)

v2 -> v3:
- Going back to the v1 version of the implementation (David and Shakeel)
- convert unatomic bitmap_set/clear to atomic set/clear_bit (Johannes)
- use clear_highpage instead of folio_page_zero_fill (Yosry)

v1 -> v2:
- instead of using a bitmap in swap, clear pte for zero pages and let
  do_pte_missing handle this page at page fault. (Yosry and Matthew)
- Check end of page first when checking if folio is zero filled as
  it could lead to better performance. (Yosry)

Usama Arif (2):
  mm: store zero pages to be swapped out in a bitmap
  mm: remove code to handle same filled pages

 include/linux/swap.h |   1 +
 mm/page_io.c         | 113 ++++++++++++++++++++++++++++++++++++++++++-
 mm/swapfile.c        |  15 ++++++
 mm/zswap.c           |  86 +++-----------------------------
 4 files changed, 136 insertions(+), 79 deletions(-)

-- 
2.43.0

.

Return-Path: <owner-linux-mm@kvack.org>
From: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Andrew Morton <akpm@linux-foundation.org>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Peter Xu <peterx@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>,
	linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,
	linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v6 00/23] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
Date: Mon, 24 Jun 2024 16:45:26 +0200
Message-ID: <cover.1719240269.git.christophe.leroy@csgroup.eu>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201133
Newsgroups: org.kvack.linux-mm,org.ozlabs.lists.linuxppc-dev
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

This series should have reached maturity for linux-next.

Also see https://github.com/linuxppc/issues/issues/483

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes on 8xx are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
must point to a single entry level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times on each of the two pagetables
referred to by the two associated PMD entries for that 8M page.
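
The numbers above can be checked with a little arithmetic. The sketch
below only restates the figures from the text (4M per first-level entry,
1024 replicated entries per page table); the macro and helper names are
hypothetical and not taken from the series.

```c
#include <assert.h>

#define SZ_4M		(4UL << 20)	/* span of one PGD/PMD entry on 8xx */
#define SZ_8M		(8UL << 20)
#define PTES_PER_TABLE	1024UL		/* replication count quoted above */

/* Number of first-level (PGD/PMD) entries spanned by one huge page. */
static unsigned long pmd_entries_for(unsigned long huge_size)
{
	assert(huge_size % SZ_4M == 0);
	return huge_size / SZ_4M;
}

/* Address range described by one level-2 entry. */
static unsigned long bytes_per_pte(void)
{
	return SZ_4M / PTES_PER_TABLE;
}

/* Total level-2 entries written for one huge page. */
static unsigned long replicated_ptes(unsigned long huge_size)
{
	return pmd_entries_for(huge_size) * PTES_PER_TABLE;
}
```

For an 8M page this gives two PMD entries, 4k described by each level-2
entry, and 2048 replicated entries in total.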

For e500 and book3s/64 there are fewer constraints because it is not
tied to the HW assisted tablewalk like on 8xx, so it is easier to use
leaf PMDs (and PUDs).

On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at
PMD level on e500/32 (mpc85xx) and mix of PMD and PUD for e500/64. We
encode page size with 4 available bits in PTE entries. On e500/32 PGD
entries size is increased to 64 bits in order to allow leaf-PMD entries
because PTEs are 64 bits on e500.

On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-64k
and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
entries. The hash processing makes things a bit more complex. To ease
things, __hash_page_huge() is modified to bail out when DIRTY or ACCESSED
bits are missing, leaving it to mm core to fix it.

Global changes in v6:
- Unsquashed preliminary series from Michael so that everything gets merged together through mm
- In patch 3, removed the modification of pte-40x.h, because 40x is going away completely in another series. This has no impact.
- Added a WARN_ON_ONCE() in patch 21 as commented by Oscar.

Global changes in v5:
- Now use PAGE SIZE field in e500's PTE to store TSIZE instead of using U0-U3
- On e500/64, use highest bit to discriminate leaf entries because PUD entries are not guaranteed to be 4k aligned so PAGE SIZE field is not guaranteed to be 0 on a non-leaf entry.

Global changes in v4:
- Fixed a few issues reported privately by robots
- Rebased on top of v6.10-rc1

Global changes in v3:
- Removed patches 1 and 2
- Squashed patch 11 into patch 5
- Replaced patches 12 and 13 with a series from Michael
- Reordered patches a bit to have more general patches up front

For more details on changes, see in each patch.

Christophe Leroy (17):
  mm: Define __pte_leaf_size() to also take a PMD entry
  mm: Provide mm_struct and address to huge_ptep_get()
  powerpc/mm: Remove _PAGE_PSIZE
  powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
  powerpc/mm: Allow hugepages without hugepd
  powerpc/8xx: Fix size given to set_huge_pte_at()
  powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
  powerpc/8xx: Simplify struct mmu_psize_def
  powerpc/e500: Remove enc and ind fields from struct mmu_psize_def
  powerpc/e500: Switch to 64 bits PGD on 85xx (32 bits)
  powerpc/e500: Encode hugepage size in PTE bits
  powerpc/e500: Don't pre-check write access on data TLB error
  powerpc/e500: Free r10 for FIND_PTE
  powerpc/e500: Use contiguous PMD instead of hugepd
  powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  powerpc/mm: Remove hugepd leftovers
  mm: Remove CONFIG_ARCH_HAS_HUGEPD

Michael Ellerman (6):
  powerpc/64e: Remove unused IBM HTW code
  powerpc/64e: Split out nohash Book3E 64-bit code
  powerpc/64e: Drop E500 ifdefs in 64-bit code
  powerpc/64e: Drop MMU_FTR_TYPE_FSL_E checks in 64-bit code
  powerpc/64e: Consolidate TLB miss handler patching
  powerpc/64e: Drop unused TLB miss handlers

 arch/arm/include/asm/hugetlb-3level.h         |   4 +-
 arch/arm64/include/asm/hugetlb.h              |   2 +-
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/Kconfig                          |   1 -
 arch/powerpc/include/asm/book3s/32/pgalloc.h  |   2 -
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  15 -
 arch/powerpc/include/asm/book3s/64/hash.h     |  40 +-
 arch/powerpc/include/asm/book3s/64/hugetlb.h  |  38 --
 .../include/asm/book3s/64/pgtable-4k.h        |  47 --
 .../include/asm/book3s/64/pgtable-64k.h       |  20 -
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  22 +-
 arch/powerpc/include/asm/hugetlb.h            |  15 +-
 .../include/asm/nohash/32/hugetlb-8xx.h       |  38 +-
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h  |   9 +-
 arch/powerpc/include/asm/nohash/32/pte-44x.h  |   3 -
 arch/powerpc/include/asm/nohash/32/pte-85xx.h |   3 -
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  |  58 ++-
 .../powerpc/include/asm/nohash/hugetlb-e500.h |  39 +-
 arch/powerpc/include/asm/nohash/mmu-e500.h    |   6 +-
 arch/powerpc/include/asm/nohash/pgalloc.h     |   2 -
 arch/powerpc/include/asm/nohash/pgtable.h     |  46 +-
 arch/powerpc/include/asm/nohash/pte-e500.h    |  63 ++-
 arch/powerpc/include/asm/page.h               |  32 --
 arch/powerpc/include/asm/pgtable-be-types.h   |  10 -
 arch/powerpc/include/asm/pgtable-types.h      |  13 +-
 arch/powerpc/include/asm/pgtable.h            |   3 +
 arch/powerpc/kernel/exceptions-64e.S          |   4 +-
 arch/powerpc/kernel/head_85xx.S               |  70 +--
 arch/powerpc/kernel/head_8xx.S                |  10 +-
 arch/powerpc/kernel/setup_64.c                |   6 +-
 arch/powerpc/mm/book3s64/hash_utils.c         |  11 +-
 arch/powerpc/mm/book3s64/hugetlbpage.c        |  10 +
 arch/powerpc/mm/book3s64/pgtable.c            |  12 -
 arch/powerpc/mm/hugetlbpage.c                 | 455 +-----------------
 arch/powerpc/mm/init-common.c                 |   8 +-
 arch/powerpc/mm/kasan/8xx.c                   |  21 +-
 arch/powerpc/mm/nohash/8xx.c                  |  43 +-
 arch/powerpc/mm/nohash/Makefile               |   2 +-
 arch/powerpc/mm/nohash/book3e_pgtable.c       |   4 +-
 arch/powerpc/mm/nohash/tlb.c                  | 407 +---------------
 arch/powerpc/mm/nohash/tlb_64e.c              | 314 ++++++++++++
 arch/powerpc/mm/nohash/tlb_low_64e.S          | 428 +---------------
 arch/powerpc/mm/pgtable.c                     |  94 ++--
 arch/powerpc/mm/pgtable_32.c                  |   2 +-
 arch/riscv/include/asm/hugetlb.h              |   2 +-
 arch/riscv/mm/hugetlbpage.c                   |   2 +-
 arch/s390/include/asm/hugetlb.h               |   4 +-
 arch/s390/mm/hugetlbpage.c                    |   4 +-
 fs/hugetlbfs/inode.c                          |   2 +-
 fs/proc/task_mmu.c                            |  10 +-
 fs/userfaultfd.c                              |   2 +-
 include/asm-generic/hugetlb.h                 |   2 +-
 include/linux/hugetlb.h                       |   6 -
 include/linux/pgtable.h                       |   3 +
 include/linux/swapops.h                       |   4 +-
 kernel/events/core.c                          |   2 +-
 mm/Kconfig                                    |  10 -
 mm/damon/vaddr.c                              |   6 +-
 mm/gup.c                                      | 183 +------
 mm/hmm.c                                      |   2 +-
 mm/hugetlb.c                                  |  44 +-
 mm/memory-failure.c                           |   2 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |   4 +-
 mm/mincore.c                                  |   2 +-
 mm/pagewalk.c                                 |  57 +--
 mm/userfaultfd.c                              |   2 +-
 67 files changed, 751 insertions(+), 2040 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/book3s/64/pgtable-4k.h
 create mode 100644 arch/powerpc/mm/nohash/tlb_64e.c

-- 
2.44.0


.

Return-Path: <owner-linux-mm@kvack.org>
Date: Mon, 24 Jun 2024 16:33:44 +0000
Mime-Version: 1.0
Message-ID: <20240624163348.1751454-1-jiaqiyan@google.com>
Subject: [PATCH v5 0/4] Userspace controls soft-offline pages
From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, jane.chu@oracle.com, 
	ioworker0@gmail.com
Cc: muchun.song@linux.dev, akpm@linux-foundation.org, shuah@kernel.org, 
	corbet@lwn.net, osalvador@suse.de, rientjes@google.com, duenwen@google.com, 
	fvdl@google.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, 
	linux-doc@vger.kernel.org, Jiaqi Yan <jiaqiyan@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201145
Newsgroups: org.kvack.linux-mm,org.kernel.vger.linux-doc,org.kernel.vger.linux-kselftest
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Correctable memory errors are very common on servers with large
amounts of memory. They are corrected by ECC, but come with two
pain points for users:
1. Correction usually happens on the fly and adds latency overhead.
2. A not-fully-proven theory states that excessive correctable memory
   errors can develop into uncorrectable memory errors.

Soft offline is the kernel's additional solution for memory pages
having (excessive) corrected memory errors. An impacted page is migrated
to a healthy page if it is in use, then the original page is discarded
and excluded from any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft offline dissolves the HugeTLB page, whether in use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
If userspace has not acknowledged such behavior, it may be surprised
when a later mmap of hugepages returns MAP_FAILED due to lack of
hugepages. In the case of a transparent hugepage, it will be split into
4K pages as well; userspace will lose the performance benefit of the
transparent hugepage.

In addition, discarding the entire 1G HugeTLB page only because of
corrected memory errors sounds very costly, and the kernel had better
not do it under the hood. But today there are at least 2 such cases:
1. GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. RAS Correctable Errors Collector counts correctable errors per
   PFN and the counter for a PFN reaches the threshold.
In both cases, userspace has no control over the soft offline performed
by the kernel's memory failure recovery.

This patch series gives userspace control over soft-offlining any page:
the kernel only soft offlines a raw page / transparent hugepage / HugeTLB
hugepage if userspace has agreed to. The interface to userspace is a
new sysctl called enable_soft_offline under /proc/sys/vm. By default
enable_soft_offline is 1 to preserve existing behavior in the kernel.
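
For illustration only (not part of this series), a minimal userspace
sketch of how a management agent might set the knob and read it back.
It assumes the sysctl path described above. The helper names
write_soft_offline_knob() and read_soft_offline_knob() are hypothetical,
and the path is a parameter so the helpers can be tried on an ordinary
file.

```c
#include <assert.h>
#include <stdio.h>

/* Path proposed by this series; it only exists on kernels that carry it. */
#define ENABLE_SOFT_OFFLINE_PATH "/proc/sys/vm/enable_soft_offline"

/* Write 0 or 1 to the knob at @path. Returns 0 on success, -1 on failure. */
static int write_soft_offline_knob(const char *path, int enable)
{
	FILE *f;
	int ok;

	assert(enable == 0 || enable == 1);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", enable) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the knob at @path. Returns 0 or 1, or -1 if unreadable or unexpected. */
static int read_soft_offline_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return (val == 0 || val == 1) ? val : -1;
}
```

On a kernel with this series applied, a privileged caller would pass
ENABLE_SOFT_OFFLINE_PATH, e.g.
write_soft_offline_knob(ENABLE_SOFT_OFFLINE_PATH, 0) to opt out.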

Changelog

v4 => v5:
* incorporate feedback from Muhammad Usama Anjum
  <usama.anjum@collabora.com>.
* refactor selftest to use what is available in kselftest.h.
* update a comment in soft_offline_page.

v3 => v4:
* incorporate feedback from Miaohe Lin <linmiaohe@huawei.com>,
  Andrew Morton <akpm@linux-foundation.org>, and
  Oscar Salvador <osalvador@suse.de>.
* insert a refactor commit to unify soft offline's logs to follow
  "Soft offline: 0x${pfn}: ${message}" format.
* some rewording in the documentation: fail => will not perform.
* v4 is still based on commit 83a7eefedc9b ("Linux 6.10-rc3"),
  akpm/mm-stable.

v2 => v3:
* incorporate feedback from Miaohe Lin <linmiaohe@huawei.com>,
  Lance Yang <ioworker0@gmail.com>, Oscar Salvador <osalvador@suse.de>,
  and David Rientjes <rientjes@google.com>.
* release potential refcount if enable_soft_offline is 0.
* soft_offline_page() returns EOPNOTSUPP if enable_soft_offline is 0.
* refactor hugetlb-soft-offline.c, for example, introduce
  test_soft_offline_common to reduce repeated code.
* rewrite enable_soft_offline's documentation, add more details about
  the cost of soft-offline for transparent and hugetlb hugepages, and
  components that are impacted when enable_soft_offline becomes 0.
* fix typos in commit messages.
* v3 is still based on commit 83a7eefedc9b ("Linux 6.10-rc3").

v1 => v2:
* incorporate feedback from both Miaohe Lin <linmiaohe@huawei.com> and
  Jane Chu <jane.chu@oracle.com>.
* make the switch to control all pages, instead of HugeTLB specific.
* change the API from
  /sys/kernel/mm/hugepages/hugepages-${size}kB/softoffline_corrected_errors
  to /proc/sys/vm/enable_soft_offline.
* minor update to test code.
* update documentation of the user control API.
* v2 is based on commit 83a7eefedc9b ("Linux 6.10-rc3").

Jiaqi Yan (4):
  mm/memory-failure: refactor log format in soft offline code
  mm/memory-failure: userspace controls soft-offlining pages
  selftest/mm: test enable_soft_offline behaviors
  docs: mm: add enable_soft_offline sysctl

 Documentation/admin-guide/sysctl/vm.rst       |  32 +++
 mm/memory-failure.c                           |  38 ++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/hugetlb-soft-offline.c       | 227 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 6 files changed, 295 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/mm/hugetlb-soft-offline.c

-- 
2.45.2.741.gdbec12cfda-goog


.

Message-ID: <6c3fbc2d-85d9-4502-b43c-0950ccdd6f7e@redhat.com>
Date: Mon, 24 Jun 2024 13:32:34 -0400
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Language: en-US
To: Michal Hocko <mhocko@kernel.org>,
 Roman Gushchin <roman.gushchin@linux.dev>,
 Shakeel Butt <shakeel.butt@linux.dev>, Muchun Song <muchun.song@linux.dev>,
 Andrew Morton <akpm@linux-foundation.org>,
 Johannes Weiner <hannes@cmpxchg.org>, Chris Down <chris@chrisdown.name>,
 Yu Zhao <yuzhao@google.com>, Axel Rasmussen <axelrasmussen@google.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
 Linux Memory Management List <linux-mm@kvack.org>,
 Rafael Aquini <aquini@redhat.com>,
 "cgroups@vger.kernel.org" <cgroups@vger.kernel.org>
From: Waiman Long <longman@redhat.com>
Subject: MGLRU OOM problem
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1257200 org.kvack.linux-mm:201157
Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.cgroups,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Hi,

We are hitting an OOM issue with our OpenShift middleware which is
based on Kubernetes. Currently, it only sets memory.max when setting
a memory limit.  OOM kills are rather frequently encountered when we
try to write a large data file that exceeds memory.max to an NFS-mounted
filesystem. I have bisected the problem down to commit 14aa8b2d5c2e
("mm/mglru: don't sync disk for each aging cycle").

The following command can be used to cause an OOM kill when running in a
memory cgroup with a memory.max limit of 600M on an NFS-mounted filesystem.

  # dd if=/dev/urandom of=/disk/2G.bin bs=32K count=65536 status=progress iflag=fullblock

In my case, I could trigger an OOM when running the reproducer a second
time on a test system.

In the first successful run, the reported data rate was:

   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 57.5474 s, 37.3 MB/s

After reverting commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each
aging cycle"), OOM can no longer be reproduced and the new data rate was:

   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 25.694 s, 83.6 MB/s

If I disabled MGLRU (echo 0 > /sys/kernel/mm/lru_gen/enabled), the data
rate was:

   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 21.184 s, 101 MB/s

I know that the purpose of commit 14aa8b2d5c2e is to prevent premature
aging of SSDs. However, I would like to find a way to wake up the flusher
whenever the cgroup is under memory pressure and has a lot of dirty
pages, but I don't have a solid clue yet.

I am aware that there was a previous discussion about this commit in
[1], so I would like to engage the same community to see if there can
be a proper solution to this problem.

[1] https://lore.kernel.org/lkml/ZcWOh9u3uqZjNFMa@chrisdown.name/

Cheers,
Longman

.

From: SeongJae Park <sj@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: SeongJae Park <sj@kernel.org>,
	damon@lists.linux.dev,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	stable@vger.kernel.org
Subject: [PATCH] mm/damon/core: merge regions aggressively when max_nr_regions is unmet
Date: Mon, 24 Jun 2024 10:58:14 -0700
Message-Id: <20240624175814.89611-1-sj@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1257247 org.kvack.linux-mm:201160
Newsgroups: org.kernel.vger.linux-kernel,dev.linux.lists.damon,org.kernel.vger.stable,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

DAMON keeps the number of regions under max_nr_regions by skipping
region split operations when doing so can make the number higher than
the limit.  It works well for preventing violation of the limit.  But,
if somehow the violation happens, it cannot recover well depending on
the situation.  In detail, if the real number of regions having
different access patterns is higher than the limit, the mechanism cannot
reduce the number below the limit.  In such a case, the system could
suffer from high monitoring overhead of DAMON.

The violation can actually happen.  For example, the user could
reduce max_nr_regions while DAMON is running, to be lower than the
current number of regions.  Fix the problem by repeating the merge
operations with increasing aggressiveness in kdamond_merge_regions() for
the case, until the limit is met.

Fixes: b9a6ac4e4ede ("mm/damon: adaptively adjust regions")
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: SeongJae Park <sj@kernel.org>
---
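Note (illustration only, not kernel code): the retry loop described
above can be modelled in plain userspace C as below.  Everything named
toy_* is hypothetical, regions are assumed non-empty and sorted, and
the "nothing left to relax" exit is an addition of this sketch that the
patch itself does not have.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a monitoring region; not struct damon_region. */
struct toy_region {
	unsigned long start, end;	/* [start, end), end > start */
	unsigned int nr_accesses;
};

static unsigned long toy_abs_diff(unsigned long a, unsigned long b)
{
	return a > b ? a - b : b - a;
}

/* Double @v without wrapping around; the result is at least 1. */
static unsigned long toy_sat_double(unsigned long v)
{
	const unsigned long top = (unsigned long)-1;

	if (v == 0)
		return 1;
	return v > top / 2 ? top : v * 2;
}

/*
 * One pass: fold a region into its left neighbour when the two are
 * adjacent, their access counts differ by at most @thres, and the merged
 * size is at most @sz_limit.  Returns the new number of regions.
 */
static size_t toy_merge_pass(struct toy_region *r, size_t n,
			     unsigned long thres, unsigned long sz_limit)
{
	size_t out = 0, i;

	for (i = 1; i < n; i++) {
		struct toy_region *prev = &r[out];
		unsigned long sz_prev = prev->end - prev->start;
		unsigned long sz_cur = r[i].end - r[i].start;

		if (prev->end == r[i].start &&
		    toy_abs_diff(prev->nr_accesses, r[i].nr_accesses) <= thres &&
		    sz_prev + sz_cur <= sz_limit) {
			/* Size-weighted average of the two access counts. */
			prev->nr_accesses = (prev->nr_accesses * sz_prev +
					     r[i].nr_accesses * sz_cur) /
					    (sz_prev + sz_cur);
			prev->end = r[i].end;
		} else {
			r[++out] = r[i];
		}
	}
	return n ? out + 1 : 0;
}

/* Repeat the pass with doubled @thres and @sz_limit until @max_nr is met. */
static size_t toy_merge_until_limit(struct toy_region *r, size_t n,
				    unsigned long thres, unsigned long sz_limit,
				    size_t max_nr)
{
	const unsigned long top = (unsigned long)-1;

	assert(max_nr >= 1);
	for (;;) {
		n = toy_merge_pass(r, n, thres, sz_limit);
		if (n <= max_nr)
			break;
		/* Sketch-only guard: both knobs are saturated, so give up. */
		if (thres == top && sz_limit == top)
			break;
		thres = toy_sat_double(thres);
		sz_limit = toy_sat_double(sz_limit);
	}
	return n;
}
```

In this model, regions separated by a gap never merge, which is why the
sketch needs its own exit condition.
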
 mm/damon/core.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index f69250b68bcc..e6598c44b53c 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1694,14 +1694,30 @@ static void damon_merge_regions_of(struct damon_target *t, unsigned int thres,
  * access frequencies are similar.  This is for minimizing the monitoring
  * overhead under the dynamically changeable access pattern.  If a merge was
  * unnecessarily made, later 'kdamond_split_regions()' will revert it.
+ *
+ * The total number of regions could be temporarily higher than the
+ * user-defined limit, max_nr_regions, for some cases.  For example, the user
+ * updates max_nr_regions to a number that is lower than the current number of
+ * regions while DAMON is running.  Depending on the access pattern, it could
+ * take indefinite time to reduce the number below the limit.  For such a
+ * case, repeat merging until the limit is met while increasing @threshold and
+ * @sz_limit.
  */
 static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold,
 				  unsigned long sz_limit)
 {
 	struct damon_target *t;
+	unsigned int nr_regions;
 
-	damon_for_each_target(t, c)
-		damon_merge_regions_of(t, threshold, sz_limit);
+	do {
+		nr_regions = 0;
+		damon_for_each_target(t, c) {
+			damon_merge_regions_of(t, threshold, sz_limit);
+			nr_regions += damon_nr_regions(t);
+		}
+		threshold = max(1, threshold * 2);
+		sz_limit = max(1, sz_limit * 2);
+	} while (nr_regions > c->attrs.max_nr_regions);
 }
 
 /*
-- 
2.39.2

.

Return-Path: <owner-linux-mm@kvack.org>
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Waiman Long <longman@redhat.com>,
	Shakeel Butt <shakeelb@google.com>,
	Nhat Pham <nphamcs@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Muchun Song <muchun.song@linux.dev>,
	Chris Li <chrisl@kernel.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	Kairui Song <kasong@tencent.com>
Subject: [PATCH 0/7] Split list_lru lock into per-cgroup scope
Date: Tue, 25 Jun 2024 01:53:06 +0800
Message-ID: <20240624175313.47329-1-ryncsn@gmail.com>
Reply-To: Kairui Song <kasong@tencent.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201161
Newsgroups: org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

From: Kairui Song <kasong@tencent.com>

Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. Contention on this lock is
heavy when multiple cgroups modify the same list_lru.

This can be alleviated by splitting the lock into per-cgroup scope.

To achieve this, this series reworks and optimizes the reparenting
process step by step, making it possible to have a stable list_lru_one,
and making it possible to pin the list_lru_one. It then splits the lock
into per-cgroup scope.

The result is reduced LOC and better performance: I see a ~25%
improvement for multi-cgroup SWAP over ZRAM and a ~10% improvement for
multi-cgroup inode / dentry workload, as tested in PATCH 6/7:

memhog SWAP test (shadow nodes):
Before:
real    0m20.328s user    0m4.315s sys     10m23.639s
real    0m20.440s user    0m4.142s sys     10m34.756s
real    0m20.381s user    0m4.164s sys     10m29.035s

After:
real    0m15.156s user    0m4.590s sys     7m34.361s
real    0m15.161s user    0m4.776s sys     7m35.086s
real    0m15.429s user    0m4.734s sys     7m42.919s

File read test (inode / dentry):
Before:
real    0m26.939s user    0m36.322s sys     6m30.248s
real    0m15.111s user    0m33.749s sys     5m4.991s
real    0m16.796s user    0m33.438s sys     5m22.865s
real    0m15.256s user    0m34.060s sys     4m56.870s
real    0m14.826s user    0m33.531s sys     4m55.907s
real    0m15.664s user    0m35.619s sys     6m3.638s
real    0m15.746s user    0m34.066s sys     4m56.519s

After:
real    0m22.166s user    0m35.155s sys     6m21.045s
real    0m13.753s user    0m34.554s sys     4m40.982s
real    0m13.815s user    0m34.693s sys     4m39.605s
real    0m13.495s user    0m34.372s sys     4m40.776s
real    0m13.895s user    0m34.005s sys     4m39.061s
real    0m13.629s user    0m33.476s sys     4m43.626s
real    0m14.001s user    0m33.463s sys     4m41.261s
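
For readers who want to re-derive the percentages, a small sketch of
the arithmetic is below. It is an illustration, not part of the series:
it takes the "real" times listed above, averages each set, and computes
the relative reduction. The helper names are hypothetical. Averaging all
listed runs gives roughly 25% for the SWAP test; the file read figure
depends on how the slower first run of each set is treated.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of @n wall-clock samples, in seconds. */
static double mean_seconds(const double *s, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += s[i];
	return sum / (double)n;
}

/* Relative reduction in percent: 100 * (1 - after / before). */
static double reduction_pct(double before, double after)
{
	return 100.0 * (1.0 - after / before);
}

/* "real" times from the memhog SWAP test above, in seconds. */
static const double swap_before[] = { 20.328, 20.440, 20.381 };
static const double swap_after[]  = { 15.156, 15.161, 15.429 };

/* "real" times from the file read test above, in seconds. */
static const double file_before[] = {
	26.939, 15.111, 16.796, 15.256, 14.826, 15.664, 15.746,
};
static const double file_after[] = {
	22.166, 13.753, 13.815, 13.495, 13.895, 13.629, 14.001,
};
```

Calling reduction_pct(mean_seconds(swap_before, 3),
mean_seconds(swap_after, 3)) gives the SWAP figure; the file read
arrays can be passed the same way, optionally skipping element 0.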

PATCH 1/7: Fixes a long-existing bug, so shadow nodes will be accounted
    to the right cgroup and put into the right list_lru.
PATCH 2/7 - 4/7: Clean up
PATCH 5/7: Reworks and optimizes reparenting process, avoids touching
    kmemcg_id on reparenting as first step.
PATCH 6/7: Makes it possible to pin the list_lru_one and prevent racing
    with reparenting, and splits the lock.

Kairui Song (7):
  mm/swap, workingset: make anon workingset nodes memcg aware
  mm/list_lru: don't pass unnecessary key parameters
  mm/list_lru: don't export list_lru_add
  mm/list_lru: code clean up for reparenting
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: Simplify the list_lru walk callback function

 drivers/android/binder_alloc.c |   6 +-
 drivers/android/binder_alloc.h |   2 +-
 fs/dcache.c                    |   4 +-
 fs/gfs2/quota.c                |   2 +-
 fs/inode.c                     |   5 +-
 fs/nfs/nfs42xattr.c            |   4 +-
 fs/nfsd/filecache.c            |   5 +-
 fs/xfs/xfs_buf.c               |   2 -
 fs/xfs/xfs_qm.c                |   6 +-
 include/linux/list_lru.h       |  26 ++-
 mm/list_lru.c                  | 387 +++++++++++++++++----------------
 mm/memcontrol.c                |  10 +-
 mm/swap_state.c                |   3 +-
 mm/workingset.c                |  20 +-
 mm/zswap.c                     |  12 +-
 15 files changed, 246 insertions(+), 248 deletions(-)

-- 
2.45.2


.

Return-Path: <owner-linux-mm@kvack.org>
Date: Tue, 25 Jun 2024 04:54:56 +0800
From: kernel test robot <lkp@intel.com>
To: Christian Hewitt <christianshewitt@gmail.com>
Cc: oe-kbuild-all@lists.linux.dev,
	Linux Memory Management List <linux-mm@kvack.org>,
	Neil Armstrong <neil.armstrong@linaro.org>
Subject: [linux-next:master 7720/8016]
 arch/arm64/boot/dts/amlogic/meson-gxl-s905x-vero4k.dtb: sound: Unevaluated
 properties are not allowed ('assigned-clock-parents',
 'assigned-clock-rates', 'assigned-clocks' were unexpected)
Message-ID: <202406250406.PfkX0bDF-lkp@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>
Xref: photonic.trudheim.com org.kvack.linux-mm:201188
Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   62c97045b8f720c2eac807a5f38e26c9ed512371
commit: 5feff053b08ce5d2167b9f44bcea3b466b5a81a0 [7720/8016] arm64: dts: meson: add support for OSMC Vero 4K
config: arm64-randconfig-051-20240625 (https://download.01.org/0day-ci/archive/20240625/202406250406.PfkX0bDF-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project ad79a14c9e5ec4a369eed4adf567c22cc029863f)
dtschema version: 2024.6.dev1+g833054f
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240625/202406250406.PfkX0bDF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406250406.PfkX0bDF-lkp@intel.com/

dtcheck warnings: (new ones prefixed by >>)
   arch/arm64/boot/dts/amlogic/meson-gxl.dtsi:146.28-300.4: Warning (unit_address_vs_reg): /soc/bus@c8100000/pinctrl@14: node has a unit name, but no reg or ranges property
   arch/arm64/boot/dts/amlogic/meson-gxl.dtsi:363.31-774.4: Warning (unit_address_vs_reg): /soc/bus@c8834000/pinctrl@4b0: node has a unit name, but no reg or ranges property
   arch/arm64/boot/dts/amlogic/meson-gxl.dtsi:146.28-300.4: Warning (simple_bus_reg): /soc/bus@c8100000/pinctrl@14: missing or empty reg/ranges property
   arch/arm64/boot/dts/amlogic/meson-gxl.dtsi:363.31-774.4: Warning (simple_bus_reg): /soc/bus@c8834000/pinctrl@4b0: missing or empty reg/ranges property
>> arch/arm64/boot/dts/amlogic/meson-gxl-s905x-vero4k.dtb: sound: Unevaluated properties are not allowed ('assigned-clock-parents', 'assigned-clock-rates', 'assigned-clocks' were unexpected)
   	from schema $id: http://devicetree.org/schemas/sound/amlogic,gx-sound-card.yaml#
>> arch/arm64/boot/dts/amlogic/meson-gxl-s905x-vero4k.dtb: sound: 'anyOf' conditional failed, one must be fixed:
   	'clocks' is a required property
   	'#clock-cells' is a required property
   	from schema $id: http://devicetree.org/schemas/clock/clock.yaml#

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

.

