From: Yang Shi
To: peterx@redhat.com, yangge1116@126.com, david@redhat.com, akpm@linux-foundation.org
Cc: yang@os.amperecomputing.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [v2 PATCH] mm: gup: do not call try_grab_folio() in slow path
Date: Thu, 27 Jun 2024 15:14:13 -0700
Message-ID: <20240627221413.671680-1-yang@os.amperecomputing.com>
X-Mailing-List: linux-kernel@vger.kernel.org

try_grab_folio() is supposed to be used in the fast path, where it
elevates the folio refcount by using add-ref-unless-zero. In the slow
path we are guaranteed to hold at least one stable reference, so a
simple atomic add can be used instead. The performance difference should
be trivial, but the misuse may be confusing and misleading.

In another thread [1], a kernel warning was reported when pinning a
folio in CMA memory while launching an SEV virtual machine. The splat
looks like:

[  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[  464.325515] Call Trace:
[  464.325520]
[  464.325523]  ? __get_user_pages+0x423/0x520
[  464.325528]  ? __warn+0x81/0x130
[  464.325536]  ? __get_user_pages+0x423/0x520
[  464.325541]  ? report_bug+0x171/0x1a0
[  464.325549]  ? handle_bug+0x3c/0x70
[  464.325554]  ? exc_invalid_op+0x17/0x70
[  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
[  464.325567]  ? __get_user_pages+0x423/0x520
[  464.325575]  __gup_longterm_locked+0x212/0x7a0
[  464.325583]  internal_get_user_pages_fast+0xfb/0x190
[  464.325590]  pin_user_pages_fast+0x47/0x60
[  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
[  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

Per the analysis done by yangge, starting the SEV virtual machine calls
pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the
page is in the CMA area, so fast GUP fails and falls back to the slow
path because of the longterm pinnable check in try_grab_folio().
The slow path tries to pin the pages and then migrate them out of the
CMA area. But the slow path also uses try_grab_folio() to pin the page,
so it fails on the same check and the above warning is triggered.

[1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/

Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Cc: [6.6+]
Reported-by: yangge
Signed-off-by: Yang Shi
---
 mm/gup.c         | 285 +++++++++++++++++++++++++----------------------
 mm/huge_memory.c |   2 +-
 mm/internal.h    |   3 +-
 3 files changed, 152 insertions(+), 138 deletions(-)

v2: 1. Fixed the build warning
    2. Reworked the commit log to include the bug report and analysis
       (reworded by me) from yangge
    3. Rebased onto the latest mm-unstable

diff --git a/mm/gup.c b/mm/gup.c
index 8bea9ad80984..7439359d0b71 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -99,95 +99,6 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }
 
-/**
- * try_grab_folio() - Attempt to get or pin a folio.
- * @page: pointer to page to be grabbed
- * @refs: the value to (effectively) add to the folio's refcount
- * @flags: gup flags: these are the FOLL_* flag values.
- *
- * "grab" names in this file mean, "look at flags to decide whether to use
- * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
- *
- * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
- * same time. (That's true throughout the get_user_pages*() and
- * pin_user_pages*() APIs.) Cases:
- *
- * FOLL_GET: folio's refcount will be incremented by @refs.
- *
- * FOLL_PIN on large folios: folio's refcount will be incremented by
- * @refs, and its pincount will be incremented by @refs.
- *
- * FOLL_PIN on single-page folios: folio's refcount will be incremented by
- * @refs * GUP_PIN_COUNTING_BIAS.
- *
- * Return: The folio containing @page (with refcount appropriately
- * incremented) for success, or NULL upon failure. If neither FOLL_GET
- * nor FOLL_PIN was set, that's considered failure, and furthermore,
- * a likely bug in the caller, so a warning is also emitted.
- */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
-{
-	struct folio *folio;
-
-	if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0))
-		return NULL;
-
-	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
-		return NULL;
-
-	if (flags & FOLL_GET)
-		return try_get_folio(page, refs);
-
-	/* FOLL_PIN is set */
-
-	/*
-	 * Don't take a pin on the zero page - it's not going anywhere
-	 * and it is used in a *lot* of places.
-	 */
-	if (is_zero_page(page))
-		return page_folio(page);
-
-	folio = try_get_folio(page, refs);
-	if (!folio)
-		return NULL;
-
-	/*
-	 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
-	 * right zone, so fail and let the caller fall back to the slow
-	 * path.
-	 */
-	if (unlikely((flags & FOLL_LONGTERM) &&
-		     !folio_is_longterm_pinnable(folio))) {
-		if (!put_devmap_managed_folio_refs(folio, refs))
-			folio_put_refs(folio, refs);
-		return NULL;
-	}
-
-	/*
-	 * When pinning a large folio, use an exact count to track it.
-	 *
-	 * However, be sure to *also* increment the normal folio
-	 * refcount field at least once, so that the folio really
-	 * is pinned. That's why the refcount from the earlier
-	 * try_get_folio() is left intact.
-	 */
-	if (folio_test_large(folio))
-		atomic_add(refs, &folio->_pincount);
-	else
-		folio_ref_add(folio,
-				refs * (GUP_PIN_COUNTING_BIAS - 1));
-	/*
-	 * Adjust the pincount before re-checking the PTE for changes.
-	 * This is essentially a smp_mb() and is paired with a memory
-	 * barrier in folio_try_share_anon_rmap_*().
-	 */
-	smp_mb__after_atomic();
-
-	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
-
-	return folio;
-}
-
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
@@ -205,28 +116,31 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 }
 
 /**
- * try_grab_page() - elevate a page's refcount by a flag-dependent amount
- * @page: pointer to page to be grabbed
- * @flags: gup flags: these are the FOLL_* flag values.
+ * try_grab_folio() - add a folio's refcount by a flag-dependent amount
+ * @folio: pointer to folio to be grabbed
+ * @refs: the value to (effectively) add to the folio's refcount
+ * @flags: gup flags: these are the FOLL_* flag values
  *
  * This might not do anything at all, depending on the flags argument.
  *
  * "grab" names in this file mean, "look at flags to decide whether to use
- * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
  *
  * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
- * time. Cases: please see the try_grab_folio() documentation, with
- * "refs=1".
+ * time.
  *
  * Return: 0 for success, or if no action was required (if neither FOLL_PIN
  * nor FOLL_GET was set, nothing is done). A negative error code for failure:
  *
- * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not
+ * -ENOMEM FOLL_GET or FOLL_PIN was set, but the folio could not
  * be grabbed.
+ *
+ * It is called when we have a stable reference for the folio, typically in
+ * GUP slow path.
  */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags)
 {
-	struct folio *folio = page_folio(page);
+	struct page *page = &folio->page;
 
 	if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
 		return -ENOMEM;
@@ -235,7 +149,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 		return -EREMOTEIO;
 
 	if (flags & FOLL_GET)
-		folio_ref_inc(folio);
+		folio_ref_add(folio, refs);
 	else if (flags & FOLL_PIN) {
 		/*
 		 * Don't take a pin on the zero page - it's not going anywhere
@@ -245,18 +159,18 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 			return 0;
 
 		/*
-		 * Similar to try_grab_folio(): be sure to *also*
-		 * increment the normal page refcount field at least once,
+		 * Increment the normal page refcount field at least once,
 		 * so that the page really is pinned.
 		 */
 		if (folio_test_large(folio)) {
-			folio_ref_add(folio, 1);
-			atomic_add(1, &folio->_pincount);
+			folio_ref_add(folio, refs);
+			atomic_add(refs, &folio->_pincount);
 		} else {
-			folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+			folio_ref_add(folio,
+				      refs * GUP_PIN_COUNTING_BIAS);
 		}
 
-		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 	}
 
 	return 0;
@@ -584,7 +498,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  */
 static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
 		       unsigned long addr, unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
+		       struct page **pages, int *nr, bool fast)
 {
 	unsigned long pte_end;
 	struct page *page;
@@ -607,9 +521,15 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 	page = pte_page(pte);
 	refs = record_subpages(page, sz, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
+	if (fast) {
+		folio = try_grab_folio_fast(page, refs, flags);
+		if (!folio)
+			return 0;
+	} else {
+		folio = page_folio(page);
+		if (try_grab_folio(folio, refs, flags))
+			return 0;
+	}
 
 	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
 		gup_put_folio(folio, refs, flags);
@@ -637,7 +557,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 		      unsigned long addr, unsigned int pdshift,
 		      unsigned long end, unsigned int flags,
-		      struct page **pages, int *nr)
+		      struct page **pages, int *nr, bool fast)
 {
 	pte_t *ptep;
 	unsigned long sz = 1UL << hugepd_shift(hugepd);
@@ -647,7 +567,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	do {
 		next = hugepte_addr_end(addr, end, sz);
-		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast);
 		if (ret != 1)
 			return ret;
 	} while (ptep++, addr = next, addr != end);
@@ -674,7 +594,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
 	ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
-			 flags, &page, &nr);
+			 flags, &page, &nr, false);
 	spin_unlock(ptl);
 
 	if (ret == 1) {
@@ -691,7 +611,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 			     unsigned long addr, unsigned int pdshift,
 			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+			     struct page **pages, int *nr, bool fast)
 {
 	return 0;
 }
@@ -778,7 +698,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 	    gup_must_unshare(vma, flags, page))
 		return ERR_PTR(-EMLINK);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 	else
@@ -855,7 +775,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			!PageAnonExclusive(page), page);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -1017,8 +937,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 		       !PageAnonExclusive(page), page);
 
-	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
-	ret = try_grab_page(page, flags);
+	/* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (unlikely(ret)) {
 		page = ERR_PTR(ret);
 		goto out;
@@ -1282,7 +1202,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(entry);
 	}
-	ret = try_grab_page(*page, gup_flags);
+	ret = try_grab_folio(page_folio(*page), 1, gup_flags);
 	if (unlikely(ret))
 		goto unmap;
 out:
@@ -1685,20 +1605,19 @@ static long __get_user_pages(struct mm_struct *mm,
 			 * pages.
 			 */
 			if (page_increm > 1) {
-				struct folio *folio;
+				struct folio *folio = page_folio(page);
 
 				/*
 				 * Since we already hold refcount on the
 				 * large folio, this should never fail.
 				 */
-				folio = try_grab_folio(page, page_increm - 1,
-						       foll_flags);
-				if (WARN_ON_ONCE(!folio)) {
+				if (try_grab_folio(folio, page_increm - 1,
+						       foll_flags)) {
 					/*
 					 * Release the 1st page ref if the
 					 * folio is problematic, fail hard.
 					 */
-					gup_put_folio(page_folio(page), 1,
+					gup_put_folio(folio, 1,
 						      foll_flags);
 					ret = -EFAULT;
 					goto out;
@@ -2876,6 +2795,101 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_GUP_FAST
+/**
+ * try_grab_folio_fast() - Attempt to get or pin a folio in fast path.
+ * @page: pointer to page to be grabbed
+ * @refs: the value to (effectively) add to the folio's refcount
+ * @flags: gup flags: these are the FOLL_* flag values.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ * FOLL_GET: folio's refcount will be incremented by @refs.
+ *
+ * FOLL_PIN on large folios: folio's refcount will be incremented by
+ * @refs, and its pincount will be incremented by @refs.
+ *
+ * FOLL_PIN on single-page folios: folio's refcount will be incremented by
+ * @refs * GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: The folio containing @page (with refcount appropriately
+ * incremented) for success, or NULL upon failure. If neither FOLL_GET
+ * nor FOLL_PIN was set, that's considered failure, and furthermore,
+ * a likely bug in the caller, so a warning is also emitted.
+ *
+ * It uses add ref unless zero to elevate the folio refcount and must be called
+ * in fast path only.
+ */
+static struct folio *try_grab_folio_fast(struct page *page, int refs,
+					 unsigned int flags)
+{
+	struct folio *folio;
+
+	/* Raise warn if it is not called in fast GUP */
+	VM_WARN_ON_ONCE(!irqs_disabled());
+
+	if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0))
+		return NULL;
+
+	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+		return NULL;
+
+	if (flags & FOLL_GET)
+		return try_get_folio(page, refs);
+
+	/* FOLL_PIN is set */
+
+	/*
+	 * Don't take a pin on the zero page - it's not going anywhere
+	 * and it is used in a *lot* of places.
+	 */
+	if (is_zero_page(page))
+		return page_folio(page);
+
+	folio = try_get_folio(page, refs);
+	if (!folio)
+		return NULL;
+
+	/*
+	 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+	 * right zone, so fail and let the caller fall back to the slow
+	 * path.
+	 */
+	if (unlikely((flags & FOLL_LONGTERM) &&
+		     !folio_is_longterm_pinnable(folio))) {
+		if (!put_devmap_managed_folio_refs(folio, refs))
+			folio_put_refs(folio, refs);
+		return NULL;
+	}
+
+	/*
+	 * When pinning a large folio, use an exact count to track it.
+	 *
+	 * However, be sure to *also* increment the normal folio
+	 * refcount field at least once, so that the folio really
+	 * is pinned. That's why the refcount from the earlier
+	 * try_get_folio() is left intact.
+	 */
+	if (folio_test_large(folio))
+		atomic_add(refs, &folio->_pincount);
+	else
+		folio_ref_add(folio,
+				refs * (GUP_PIN_COUNTING_BIAS - 1));
+	/*
+	 * Adjust the pincount before re-checking the PTE for changes.
+	 * This is essentially a smp_mb() and is paired with a memory
+	 * barrier in folio_try_share_anon_rmap_*().
+	 */
+	smp_mb__after_atomic();
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+
+	return folio;
+}
 
 /*
  * Used in the GUP-fast path to determine whether GUP is permitted to work on
@@ -3041,7 +3055,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio)
 			goto pte_unmap;
 
@@ -3128,7 +3142,7 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
 			break;
 		}
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio) {
 			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
 			break;
@@ -3217,7 +3231,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = pmd_page(orig);
 	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3261,7 +3275,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = pud_page(orig);
 	refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3301,7 +3315,7 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig);
 	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3355,7 +3369,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 			 * pmd format and THP pmd format
 			 */
 			if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr,
-				       PMD_SHIFT, next, flags, pages, nr) != 1)
+				       PMD_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags,
 					       pages, nr))
@@ -3385,7 +3399,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
 				return 0;
 		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
 			if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr,
-				       PUD_SHIFT, next, flags, pages, nr) != 1)
+				       PUD_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags,
 					       pages, nr))
@@ -3412,7 +3426,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
 		BUILD_BUG_ON(p4d_leaf(p4d));
 		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
 			if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr,
-				       P4D_SHIFT, next, flags, pages, nr) != 1)
+				       P4D_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
 					       pages, nr))
@@ -3441,7 +3455,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
 				return;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr,
-				       PGDIR_SHIFT, next, flags, pages, nr) != 1)
+				       PGDIR_SHIFT, next, flags, pages, nr, true) != 1)
 				return;
 		} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
 					       pages, nr))
@@ -3842,14 +3856,15 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 				    next_idx != folio_index(fbatch.folios[i]))
 					continue;
 
-				folio = try_grab_folio(&fbatch.folios[i]->page,
-						       1, FOLL_PIN);
-				if (!folio) {
+				if (try_grab_folio(fbatch.folios[i],
+						       1, FOLL_PIN)) {
 					folio_batch_release(&fbatch);
 					ret = -EINVAL;
 					goto err;
 				}
 
+				folio = fbatch.folios[i];
+
 				if (nr_folios == 0)
 					*offset = offset_in_folio(folio, start);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c7ce28f6b7f3..954c63575917 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1333,7 +1333,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 
diff --git a/mm/internal.h b/mm/internal.h
index 2ea9a88dcb95..b264a7dabefe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1226,8 +1226,7 @@ int migrate_device_coherent_page(struct page *page);
 /*
  * mm/gup.c
  */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
-int __must_check try_grab_page(struct page *page, unsigned int flags);
+int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags);
 
 /*
  * mm/huge_memory.c
-- 
2.41.0
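
The distinction this patch draws between the two grab helpers can be
sketched in userspace C11. This is only an illustration under stated
assumptions, not kernel code: struct obj, grab_fast() and grab_slow()
are hypothetical names, the refcount is a plain atomic_int, and the
pincount, zero-page and P2PDMA handling of the real helpers is left out.
The fast variant may race with the last reference being dropped, so it
adds only if the count is non-zero and applies the longterm-pinnable
check; the slow variant assumes the caller already holds a reference and
uses a plain add.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a refcount plus a "pinnable" flag. */
struct obj {
	atomic_int refcount;
	bool longterm_pinnable;
};

/*
 * Fast-path style: no reference is held yet, so the count may reach
 * zero at any time. Add only while the count is observed non-zero.
 */
static bool grab_fast(struct obj *o, int refs, bool longterm)
{
	int old = atomic_load(&o->refcount);

	assert(refs > 0);
	do {
		if (old == 0)
			return false;	/* object is on its way out */
	} while (!atomic_compare_exchange_weak(&o->refcount, &old, old + refs));

	if (longterm && !o->longterm_pinnable) {
		/* Undo; the caller is expected to fall back to the slow path. */
		atomic_fetch_sub(&o->refcount, refs);
		return false;
	}
	return true;
}

/*
 * Slow-path style: the caller already holds a stable reference, so a
 * plain atomic add is sufficient. Returns 0 on success, -1 on misuse.
 */
static int grab_slow(struct obj *o, int refs)
{
	assert(refs > 0);
	if (atomic_load(&o->refcount) <= 0)
		return -1;
	atomic_fetch_add(&o->refcount, refs);
	return 0;
}
```

With one reference held by the caller and longterm_pinnable == false,
grab_fast(&o, 2, true) refuses and leaves the count untouched, while
grab_slow(&o, 2) succeeds. That mirrors why the slow path should not go
through the fast-path helper.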

Date: Thu, 27 Jun 2024 16:27:05 -0600
Message-ID: <20240627222705.2974207-1-yuzhao@google.com>
Subject: [PATCH mm-unstable v2] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
From: Yu Zhao
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Frank van der Linden, "Matthew Wilcox (Oracle)", Peter Xu, Yang Shi, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yu Zhao

While investigating HVO for THPs [1], it turns out that speculative PFN
walkers like compaction can race with vmemmap modifications, e.g.,

  CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
  -------------------------------  ------------------------------
  Allocates an LRU folio page1
                                   Sees page1
  Frees page1

  Allocates a hugeTLB folio page2
  (page1 being a tail of page2)

  Updates vmemmap mapping page1
                                   get_page_unless_zero(page1)

Even though page1->_refcount is zero after HVO, get_page_unless_zero()
can still try to modify this read-only field, resulting in a crash.

An independent report [2] confirmed this race.

There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
   triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
   page->_refcount through the old vmemmap or non-zero page->_refcount
   through the new one.

The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
   HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
   similar races in other places, e.g., arch_remove_memory() on x86
   [3], which frees vmemmap mapping offlined struct page[].

While adding synchronize_rcu(), the goal is to be surgical, rather than
optimized. Specifically, calls to synchronize_rcu() on the error
handling paths can be coalesced, but it is not done for the sake of
simplicity: noticeably, this fix removes ~50% more lines than it adds.

According to the hugetlb_optimize_vmemmap section in
Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating
or freeing hugeTLB pages "~2x slower than before". Having
synchronize_rcu() on top makes those operations even slower, and this
also affects the user interface /proc/sys/vm/nr_overcommit_hugepages.

[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
[2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
[3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/

Signed-off-by: Yu Zhao
Acked-by: Muchun Song
---
 include/linux/page_ref.h |  8 +++++-
 mm/hugetlb.c             | 53 ++++++----------------------------------
 mm/hugetlb_vmemmap.c     | 16 ++++++++++++
 3 files changed, 30 insertions(+), 47 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 490d0ad6e56d..8c236c651d1d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
 
 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-	bool ret = atomic_add_unless(&page->_refcount, nr, u);
+	bool ret = false;
+
+	rcu_read_lock();
+	/* avoid writing to the vmemmap area being remapped */
+	if (!page_is_fake_head(page) && page_ref_count(page) != u)
+		ret = atomic_add_unless(&page->_refcount, nr, u);
+	rcu_read_unlock();
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9691624fcb79..0a69e194b517 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1629,13 +1629,10 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
  * folio appears as just a compound page. Otherwise, wait until after
  * allocating vmemmap to clear the flag.
  *
- * A reference is held on the folio, except in the case of demote.
- *
  * Must be called with hugetlb lock held.
  */
-static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-							bool adjust_surplus,
-							bool demote)
+static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
+							bool adjust_surplus)
 {
 	int nid = folio_nid(folio);
 
@@ -1649,6 +1646,7 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
 	list_del(&folio->lru);
 
 	if (folio_test_hugetlb_freed(folio)) {
+		folio_clear_hugetlb_freed(folio);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
 	}
@@ -1665,33 +1663,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		__folio_clear_hugetlb(folio);
 
-	/*
-	 * In the case of demote we do not ref count the page as it will soon
-	 * be turned into a page of smaller size.
-	 */
-	if (!demote)
-		folio_ref_unfreeze(folio, 1);
-
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[nid]--;
 }
 
-static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-							bool adjust_surplus)
-{
-	__remove_hugetlb_folio(h, folio, adjust_surplus, false);
-}
-
-static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
-							bool adjust_surplus)
-{
-	__remove_hugetlb_folio(h, folio, adjust_surplus, true);
-}
-
 static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 			     bool adjust_surplus)
 {
-	int zeroed;
 	int nid = folio_nid(folio);
 
 	VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
@@ -1715,21 +1693,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 	 */
 	folio_set_hugetlb_vmemmap_optimized(folio);
 
-	/*
-	 * This folio is about to be managed by the hugetlb allocator and
-	 * should have no users. Drop our reference, and check for others
-	 * just in case.
-	 */
-	zeroed = folio_put_testzero(folio);
-	if (unlikely(!zeroed))
-		/*
-		 * It is VERY unlikely soneone else has taken a ref
-		 * on the folio. In this case, we simply return as
-		 * free_huge_folio() will be called when this other ref
-		 * is dropped.
-		 */
-		return;
-
 	arch_clear_hugetlb_flags(folio);
 	enqueue_hugetlb_folio(h, folio);
 }
@@ -1783,6 +1746,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 		spin_unlock_irq(&hugetlb_lock);
 	}
 
+	folio_ref_unfreeze(folio, 1);
+
 	/*
 	 * Non-gigantic pages demoted from CMA allocated gigantic pages
 	 * need to be given back to CMA in free_gigantic_folio.
@@ -3106,11 +3071,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
-	if (new_folio) {
-		/* Folio has a zero ref count, but needs a ref to be freed */
-		folio_ref_unfreeze(new_folio, 1);
+	if (new_folio)
 		update_and_free_hugetlb_folio(h, new_folio, false);
-	}
 
 	return ret;
 }
@@ -3965,7 +3927,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 
 	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
 
-	remove_hugetlb_folio_for_demote(h, folio, false);
+	remove_hugetlb_folio(h, folio, false);
 	spin_unlock_irq(&hugetlb_lock);
 
 	/*
@@ -3979,7 +3941,6 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 	if (rc) {
 		/* Allocation of vmemmmap failed, we can not demote folio */
 		spin_lock_irq(&hugetlb_lock);
-		folio_ref_unfreeze(folio, 1);
 		add_hugetlb_folio(h, folio, false);
 		return rc;
 	}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index fa00d61b6c5a..829112b0a914 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -455,6 +455,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	unsigned long vmemmap_reuse;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		return 0;
 
@@ -490,6 +492,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
+	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+	synchronize_rcu();
+
 	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
@@ -514,6 +519,9 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	long restored = 0;
 	long ret = 0;
 
+	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+	synchronize_rcu();
+
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
 			ret = __hugetlb_vmemmap_restore_folio(h, folio,
@@ -559,6 +567,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	unsigned long vmemmap_reuse;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return ret;
 
@@ -610,6 +620,9 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
 	LIST_HEAD(vmemmap_pages);
 
+	/* avoid writes from page_ref_add_unless() while folding vmemmap */
+	synchronize_rcu();
+
 	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
@@ -653,6 +666,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
 	flush_tlb_all();
 
+	/* avoid writes from page_ref_add_unless() while folding vmemmap */
+	synchronize_rcu();
+
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
 
-- 
2.45.2.803.g4e1b14247a-goog
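
The reader-side guard that the page_ref.h hunk adds can be sketched in
userspace C11. This is a minimal sketch under stated assumptions, not
the kernel implementation: struct page_like and ref_add_unless() are
hypothetical names, the fake_head flag stands in for
page_is_fake_head(), and the RCU read-side critical section that the
patch places around the check is only noted in a comment because it has
no direct userspace equivalent here. The point illustrated is that the
function returns before attempting any write when the page is a fake
head or when the count already equals the excluded value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct page_like {
	atomic_int refcount;
	bool fake_head;		/* models a struct page seen via remapped vmemmap */
};

/*
 * Add nr to the refcount unless it currently equals u.
 * In the patch this body runs between rcu_read_lock() and
 * rcu_read_unlock(); that part is not modelled here.
 */
static bool ref_add_unless(struct page_like *p, int nr, int u)
{
	int old;

	/* Bail out before any write attempt on a fake head. */
	if (p->fake_head)
		return false;

	old = atomic_load(&p->refcount);
	while (old != u) {
		/* On failure, old is reloaded and re-checked against u. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + nr))
			return true;
	}
	return false;
}
```

A call such as ref_add_unless(&p, 1, 0) corresponds to the
get_page_unless_zero() case discussed in the commit message: it fails
without storing anything when the count is zero or the page is flagged
as a fake head.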
Return-Path: Date: Fri, 28 Jun 2024 06:38:24 +0800 From: kernel test robot To: Benjamin Tissoires Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List Subject: [linux-next:master 8744/9027] drivers/hid/hidraw.c:143:63: warning: cast from pointer to integer of different size Message-ID: <202406280633.OPB5uIFj-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201713 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 67eccf151d76a9939ad8a50c6db5cb486b01df24 [8744/9027] HID: add source argument to HID low level functions config: sh-buildonly-randconfig-r004-20220419 (https://download.01.org/0day-ci/archive/20240628/202406280633.OPB5uIFj-lkp@intel.com/config) compiler: sh4-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280633.OPB5uIFj-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406280633.OPB5uIFj-lkp@intel.com/ All warnings (new ones prefixed by >>): drivers/hid/hidraw.c: In function 'hidraw_send_report': >> drivers/hid/hidraw.c:143:63: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 143 | ret = __hid_hw_output_report(dev, buf, count, (__u64)file); | ^ drivers/hid/hidraw.c:154:56: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 154 | HID_REQ_SET_REPORT, (__u64)file); | ^ drivers/hid/hidraw.c: In function 'hidraw_get_report': drivers/hid/hidraw.c:231:56: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 231 | HID_REQ_GET_REPORT, (__u64)file); | ^ vim +143 drivers/hid/hidraw.c 101 102 /* 103 * The first byte of the report buffer is expected to be a report number. 104 */ 105 static ssize_t hidraw_send_report(struct file *file, const char __user *buffer, size_t count, unsigned char report_type) 106 { 107 unsigned int minor = iminor(file_inode(file)); 108 struct hid_device *dev; 109 __u8 *buf; 110 int ret = 0; 111 112 lockdep_assert_held(&minors_rwsem); 113 114 if (!hidraw_table[minor] || !hidraw_table[minor]->exist) { 115 ret = -ENODEV; 116 goto out; 117 } 118 119 dev = hidraw_table[minor]->hid; 120 121 if (count > HID_MAX_BUFFER_SIZE) { 122 hid_warn(dev, "pid %d passed too large report\n", 123 task_pid_nr(current)); 124 ret = -EINVAL; 125 goto out; 126 } 127 128 if (count < 2) { 129 hid_warn(dev, "pid %d passed too short report\n", 130 task_pid_nr(current)); 131 ret = -EINVAL; 132 goto out; 133 } 134 135 buf = memdup_user(buffer, count); 136 if (IS_ERR(buf)) { 137 ret = PTR_ERR(buf); 138 goto out; 139 } 140 141 if ((report_type == HID_OUTPUT_REPORT) && 142 !(dev->quirks & HID_QUIRK_NO_OUTPUT_REPORTS_ON_INTR_EP)) { > 143 ret = __hid_hw_output_report(dev, buf, count, (__u64)file); 144 /* 145 * 
compatibility with old implementation of USB-HID and I2C-HID: 146 * if the device does not support receiving output reports, 147 * on an interrupt endpoint, fallback to SET_REPORT HID command. 148 */ 149 if (ret != -ENOSYS) 150 goto out_free; 151 } 152 153 ret = __hid_hw_raw_request(dev, buf[0], buf, count, report_type, 154 HID_REQ_SET_REPORT, (__u64)file); 155 156 out_free: 157 kfree(buf); 158 out: 159 return ret; 160 } 161 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . From: Yang Shi To: peterx@redhat.com, yangge1116@126.com, david@redhat.com, akpm@linux-foundation.org Cc: yang@os.amperecomputing.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: [v2 linus-tree PATCH] mm: gup: do not call try_grab_folio() in slow path Date: Thu, 27 Jun 2024 16:16:01 -0700 Message-ID: <20240627231601.1713119-1-yang@os.amperecomputing.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1261526 org.kvack.linux-mm:201718 Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.stable,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail The try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. In another thread [1] a kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. 
The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in the CMA area, so fast GUP will fail and then fall back to the slow path due to the longterm pinnable check in try_grab_folio(). The slow path will try to pin the pages and then migrate them out of the CMA area. But the slow path also uses try_grab_folio() to pin the page, so it will also fail due to the same check, and then the above warning is triggered. [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Cc: [6.6+] Reported-by: yangge Signed-off-by: Yang Shi --- mm/gup.c | 278 +++++++++++++++++++++++++---------------------- mm/huge_memory.c | 2 +- mm/internal.h | 3 +- 3 files changed, 148 insertions(+), 135 deletions(-) v2: 1. Fixed the build warning 2. Reworked the commit log to include the bug report and analysis (reworded by me) from yangge 3.
Rebased onto the latest Linus's tree diff --git a/mm/gup.c b/mm/gup.c index ca0f5cedce9b..6be165224c1e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -97,95 +97,6 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; } -/** - * try_grab_folio() - Attempt to get or pin a folio. - * @page: pointer to page to be grabbed - * @refs: the value to (effectively) add to the folio's refcount - * @flags: gup flags: these are the FOLL_* flag values. - * - * "grab" names in this file mean, "look at flags to decide whether to use - * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. - * - * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the - * same time. (That's true throughout the get_user_pages*() and - * pin_user_pages*() APIs.) Cases: - * - * FOLL_GET: folio's refcount will be incremented by @refs. - * - * FOLL_PIN on large folios: folio's refcount will be incremented by - * @refs, and its pincount will be incremented by @refs. - * - * FOLL_PIN on single-page folios: folio's refcount will be incremented by - * @refs * GUP_PIN_COUNTING_BIAS. - * - * Return: The folio containing @page (with refcount appropriately - * incremented) for success, or NULL upon failure. If neither FOLL_GET - * nor FOLL_PIN was set, that's considered failure, and furthermore, - * a likely bug in the caller, so a warning is also emitted. - */ -struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) -{ - struct folio *folio; - - if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0)) - return NULL; - - if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) - return NULL; - - if (flags & FOLL_GET) - return try_get_folio(page, refs); - - /* FOLL_PIN is set */ - - /* - * Don't take a pin on the zero page - it's not going anywhere - * and it is used in a *lot* of places. 
- */ - if (is_zero_page(page)) - return page_folio(page); - - folio = try_get_folio(page, refs); - if (!folio) - return NULL; - - /* - * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a - * right zone, so fail and let the caller fall back to the slow - * path. - */ - if (unlikely((flags & FOLL_LONGTERM) && - !folio_is_longterm_pinnable(folio))) { - if (!put_devmap_managed_folio_refs(folio, refs)) - folio_put_refs(folio, refs); - return NULL; - } - - /* - * When pinning a large folio, use an exact count to track it. - * - * However, be sure to *also* increment the normal folio - * refcount field at least once, so that the folio really - * is pinned. That's why the refcount from the earlier - * try_get_folio() is left intact. - */ - if (folio_test_large(folio)) - atomic_add(refs, &folio->_pincount); - else - folio_ref_add(folio, - refs * (GUP_PIN_COUNTING_BIAS - 1)); - /* - * Adjust the pincount before re-checking the PTE for changes. - * This is essentially a smp_mb() and is paired with a memory - * barrier in folio_try_share_anon_rmap_*(). - */ - smp_mb__after_atomic(); - - node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); - - return folio; -} - static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) { if (flags & FOLL_PIN) { @@ -203,28 +114,31 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) } /** - * try_grab_page() - elevate a page's refcount by a flag-dependent amount - * @page: pointer to page to be grabbed - * @flags: gup flags: these are the FOLL_* flag values. + * try_grab_folio() - add a folio's refcount by a flag-dependent amount + * @folio: pointer to folio to be grabbed + * @refs: the value to (effectively) add to the folio's refcount + * @flags: gup flags: these are the FOLL_* flag values * * This might not do anything at all, depending on the flags argument. 
* * "grab" names in this file mean, "look at flags to decide whether to use - * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount. + * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. * * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same - * time. Cases: please see the try_grab_folio() documentation, with - * "refs=1". + * time. * * Return: 0 for success, or if no action was required (if neither FOLL_PIN * nor FOLL_GET was set, nothing is done). A negative error code for failure: * - * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not + * -ENOMEM FOLL_GET or FOLL_PIN was set, but the folio could not * be grabbed. + * + * It is called when we have a stable reference for the folio, typically in + * GUP slow path. */ -int __must_check try_grab_page(struct page *page, unsigned int flags) +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags) { - struct folio *folio = page_folio(page); + struct page *page = &folio->page; if (WARN_ON_ONCE(folio_ref_count(folio) <= 0)) return -ENOMEM; @@ -233,7 +147,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return -EREMOTEIO; if (flags & FOLL_GET) - folio_ref_inc(folio); + folio_ref_add(folio, refs); else if (flags & FOLL_PIN) { /* * Don't take a pin on the zero page - it's not going anywhere @@ -243,18 +157,18 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return 0; /* - * Similar to try_grab_folio(): be sure to *also* - * increment the normal page refcount field at least once, + * Increment the normal page refcount field at least once, * so that the page really is pinned. 
*/ if (folio_test_large(folio)) { - folio_ref_add(folio, 1); - atomic_add(1, &folio->_pincount); + folio_ref_add(folio, refs); + atomic_add(refs, &folio->_pincount); } else { - folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); + folio_ref_add(folio, + refs * GUP_PIN_COUNTING_BIAS); } - node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1); + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); } return 0; @@ -535,7 +449,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, */ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { unsigned long pte_end; struct page *page; @@ -558,9 +472,15 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz page = pte_page(pte); refs = record_subpages(page, sz, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); - if (!folio) - return 0; + if (fast) { + folio = try_grab_folio_fast(page, refs, flags); + if (!folio) + return 0; + } else { + folio = page_folio(page); + if (try_grab_folio(folio, refs, flags)) + return 0; + } if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { gup_put_folio(folio, refs, flags); @@ -588,7 +508,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { pte_t *ptep; unsigned long sz = 1UL << hugepd_shift(hugepd); @@ -598,7 +518,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); do { next = hugepte_addr_end(addr, end, sz); - ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr); + ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast); 
if (ret != 1) return ret; } while (ptep++, addr = next, addr != end); @@ -625,7 +545,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); ptl = huge_pte_lock(h, vma->vm_mm, ptep); ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE, - flags, &page, &nr); + flags, &page, &nr, false); spin_unlock(ptl); if (ret == 1) { @@ -642,7 +562,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { return 0; } @@ -729,7 +649,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma, gup_must_unshare(vma, flags, page)) return ERR_PTR(-EMLINK); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) page = ERR_PTR(ret); else @@ -806,7 +726,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma, VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) return ERR_PTR(ret); @@ -968,8 +888,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); - /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ - ret = try_grab_page(page, flags); + /* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. 
*/ + ret = try_grab_folio(page_folio(page), 1, flags); if (unlikely(ret)) { page = ERR_PTR(ret); goto out; @@ -1233,7 +1153,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, goto unmap; *page = pte_page(entry); } - ret = try_grab_page(*page, gup_flags); + ret = try_grab_folio(page_folio(*page), 1, gup_flags); if (unlikely(ret)) goto unmap; out: @@ -1636,20 +1556,19 @@ static long __get_user_pages(struct mm_struct *mm, * pages. */ if (page_increm > 1) { - struct folio *folio; + struct folio *folio = page_folio(page); /* * Since we already hold refcount on the * large folio, this should never fail. */ - folio = try_grab_folio(page, page_increm - 1, - foll_flags); - if (WARN_ON_ONCE(!folio)) { + if (try_grab_folio(folio, page_increm - 1, + foll_flags)) { /* * Release the 1st page ref if the * folio is problematic, fail hard. */ - gup_put_folio(page_folio(page), 1, + gup_put_folio(folio, 1, foll_flags); ret = -EFAULT; goto out; @@ -2797,6 +2716,101 @@ EXPORT_SYMBOL(get_user_pages_unlocked); * This code is based heavily on the PowerPC implementation by Nick Piggin. */ #ifdef CONFIG_HAVE_GUP_FAST +/** + * try_grab_folio_fast() - Attempt to get or pin a folio in fast path. + * @page: pointer to page to be grabbed + * @refs: the value to (effectively) add to the folio's refcount + * @flags: gup flags: these are the FOLL_* flag values. + * + * "grab" names in this file mean, "look at flags to decide whether to use + * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. + * + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the + * same time. (That's true throughout the get_user_pages*() and + * pin_user_pages*() APIs.) Cases: + * + * FOLL_GET: folio's refcount will be incremented by @refs. + * + * FOLL_PIN on large folios: folio's refcount will be incremented by + * @refs, and its pincount will be incremented by @refs. 
+ * + * FOLL_PIN on single-page folios: folio's refcount will be incremented by + * @refs * GUP_PIN_COUNTING_BIAS. + * + * Return: The folio containing @page (with refcount appropriately + * incremented) for success, or NULL upon failure. If neither FOLL_GET + * nor FOLL_PIN was set, that's considered failure, and furthermore, + * a likely bug in the caller, so a warning is also emitted. + * + * It uses add ref unless zero to elevate the folio refcount and must be called + * in fast path only. + */ +static struct folio *try_grab_folio_fast(struct page *page, int refs, + unsigned int flags) +{ + struct folio *folio; + + /* Raise warn if it is not called in fast GUP */ + VM_WARN_ON_ONCE(!irqs_disabled()); + + if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0)) + return NULL; + + if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) + return NULL; + + if (flags & FOLL_GET) + return try_get_folio(page, refs); + + /* FOLL_PIN is set */ + + /* + * Don't take a pin on the zero page - it's not going anywhere + * and it is used in a *lot* of places. + */ + if (is_zero_page(page)) + return page_folio(page); + + folio = try_get_folio(page, refs); + if (!folio) + return NULL; + + /* + * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a + * right zone, so fail and let the caller fall back to the slow + * path. + */ + if (unlikely((flags & FOLL_LONGTERM) && + !folio_is_longterm_pinnable(folio))) { + if (!put_devmap_managed_folio_refs(folio, refs)) + folio_put_refs(folio, refs); + return NULL; + } + + /* + * When pinning a large folio, use an exact count to track it. + * + * However, be sure to *also* increment the normal folio + * refcount field at least once, so that the folio really + * is pinned. That's why the refcount from the earlier + * try_get_folio() is left intact. 
+ */ + if (folio_test_large(folio)) + atomic_add(refs, &folio->_pincount); + else + folio_ref_add(folio, + refs * (GUP_PIN_COUNTING_BIAS - 1)); + /* + * Adjust the pincount before re-checking the PTE for changes. + * This is essentially a smp_mb() and is paired with a memory + * barrier in folio_try_share_anon_rmap_*(). + */ + smp_mb__after_atomic(); + + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); + + return folio; +} /* * Used in the GUP-fast path to determine whether GUP is permitted to work on @@ -2962,7 +2976,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); - folio = try_grab_folio(page, 1, flags); + folio = try_grab_folio_fast(page, 1, flags); if (!folio) goto pte_unmap; @@ -3049,7 +3063,7 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr, break; } - folio = try_grab_folio(page, 1, flags); + folio = try_grab_folio_fast(page, 1, flags); if (!folio) { gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); break; @@ -3138,7 +3152,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, page = pmd_page(orig); refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3182,7 +3196,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, page = pud_page(orig); refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3222,7 +3236,7 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr, page = pgd_page(orig); refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3276,7 +3290,7 @@ static int 
gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, * pmd format and THP pmd format */ if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, - PMD_SHIFT, next, flags, pages, nr) != 1) + PMD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) @@ -3306,7 +3320,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, return 0; } else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) { if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr, - PUD_SHIFT, next, flags, pages, nr) != 1) + PUD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags, pages, nr)) @@ -3333,7 +3347,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, BUILD_BUG_ON(p4d_leaf(p4d)); if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) { if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr, - P4D_SHIFT, next, flags, pages, nr) != 1) + P4D_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) @@ -3362,7 +3376,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end, return; } else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) { if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr, - PGDIR_SHIFT, next, flags, pages, nr) != 1) + PGDIR_SHIFT, next, flags, pages, nr, true) != 1) return; } else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index db7946a0a28c..2120f7478e55 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1331,7 +1331,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) page = ERR_PTR(ret); diff --git a/mm/internal.h b/mm/internal.h index 6902b7dd8509..52db9219b2db 
100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1182,8 +1182,7 @@ int migrate_device_coherent_page(struct page *page); /* * mm/gup.c */ -struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); -int __must_check try_grab_page(struct page *page, unsigned int flags); +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags); /* * mm/huge_memory.c -- 2.41.0 . Return-Path: Date: Fri, 28 Jun 2024 07:16:09 +0800 From: kernel test robot To: Niklas Cassel Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List , Heiko Stuebner , Manivannan Sadhasivam Subject: [linux-next:master 8329/9027] arch/arm64/boot/dts/rockchip/rk3588-armsom-sige7.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] Message-ID: <202406280710.UM3PNqsz-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201720 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 2fe9fe4e54f5763b8b681478dda9ac61fd42ecaf [8329/9027] arm64: dts: rockchip: Add PCIe endpoint mode support config: arm64-randconfig-051-20240628 (https://download.01.org/0day-ci/archive/20240628/202406280710.UM3PNqsz-lkp@intel.com/config) compiler: aarch64-linux-gcc (GCC) 13.2.0 dtschema version: 2024.6.dev2+g3b69bad reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280710.UM3PNqsz-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406280710.UM3PNqsz-lkp@intel.com/ dtcheck warnings: (new ones prefixed by >>) >> arch/arm64/boot/dts/rockchip/rk3588-armsom-sige7.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-coolpi-cm5-evb.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-edgeble-neu6a-io.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-edgeble-neu6b-io.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- arch/arm64/boot/dts/rockchip/rk3588-evb1-v10.dts:1244.7-1252.4: Warning (graph_child_address): /usb@fc000000/port: graph node has single child node 'endpoint@0', #address-cells/#size-cells are not necessary >> arch/arm64/boot/dts/rockchip/rk3588-evb1-v10.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-friendlyelec-cm3588-nas.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-jaguar.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: regulator@42: Unevaluated properties are not allowed ('rockchip,suspend-voltage-selector' was unexpected) from schema $id: http://devicetree.org/schemas/regulator/fcs,fan53555.yaml# arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: pmic@0: regulators:dcdc-reg4: Unevaluated properties are not allowed ('regulator-init-microvolt' was unexpected) from schema $id: http://devicetree.org/schemas/mfd/rockchip,rk806.yaml# 
arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: pmic@0: Unevaluated properties are not allowed ('regulators' was unexpected) from schema $id: http://devicetree.org/schemas/mfd/rockchip,rk806.yaml# arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: codec@1b: Unevaluated properties are not allowed ('assigned-clock-rates', 'assigned-clocks', 'clock-names', 'clocks', 'port' were unexpected) from schema $id: http://devicetree.org/schemas/sound/realtek,rt5616.yaml# >> arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: sound: 'simple-audio-card,hp-pin-name' does not match any of the regexes: '^simple-audio-card,codec(@[0-9a-f]+)?$', '^simple-audio-card,cpu(@[0-9a-f]+)?$', '^simple-audio-card,dai-link(@[0-9a-f]+)?$', '^simple-audio-card,plat(@[0-9a-f]+)?$', 'pinctrl-[0-9]+' from schema $id: http://devicetree.org/schemas/sound/simple-card.yaml# arch/arm64/boot/dts/rockchip/rk3588-nanopc-t6.dtb: vcc3v3-sd-s0-regulator: Unevaluated properties are not allowed ('enable-active-low' was unexpected) from schema $id: http://devicetree.org/schemas/regulator/fixed-regulator.yaml# -- >> arch/arm64/boot/dts/rockchip/rk3588-ok3588-c.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- arch/arm64/boot/dts/rockchip/rk3588-orangepi-5-plus.dtb: audio-codec@11: 'clock-names' does not match any of the regexes: 'pinctrl-[0-9]+' from schema $id: http://devicetree.org/schemas/sound/everest,es8328.yaml# >> arch/arm64/boot/dts/rockchip/rk3588-orangepi-5-plus.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- arch/arm64/boot/dts/rockchip/rk3588-quartzpro64.dtb: audio-codec@11: 'clock-names' does not match any of the regexes: 'pinctrl-[0-9]+' from schema $id: http://devicetree.org/schemas/sound/everest,es8328.yaml# >> 
arch/arm64/boot/dts/rockchip/rk3588-quartzpro64.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-rock-5b.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-tiger-haikou.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] arch/arm64/boot/dts/rockchip/rk3588-tiger-haikou.dtb: /extcon-usb3: failed to match any schema with compatible: ['linux,extcon-usb-gpio'] -- arch/arm64/boot/dts/rockchip/rk3588-toybrick-x0.dtb: pmic@0: regulators:dcdc-reg4: Unevaluated properties are not allowed ('regulator-init-microvolt' was unexpected) from schema $id: http://devicetree.org/schemas/mfd/rockchip,rk806.yaml# arch/arm64/boot/dts/rockchip/rk3588-toybrick-x0.dtb: pmic@0: Unevaluated properties are not allowed ('regulators' was unexpected) from schema $id: http://devicetree.org/schemas/mfd/rockchip,rk806.yaml# >> arch/arm64/boot/dts/rockchip/rk3588-toybrick-x0.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- >> arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtb: /pcie-ep@fe150000: failed to match any schema with compatible: ['rockchip,rk3588-pcie-ep'] -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . 
Return-Path: Date: Fri, 28 Jun 2024 08:55:26 +0800 From: kernel test robot To: Alex Bee Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List , Heiko Stuebner Subject: [linux-next:master 7406/9027] arch/arm64/boot/dts/rockchip/rk3368-lba3368.dtb: /i2c@ff660000/codec@1c: failed to match any schema with compatible: ['realtek,rt5640'] Message-ID: <202406280804.Kc8B4xzn-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201733 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 7b4a8097e58b608638d416bd57469f8a9ab70e7b [7406/9027] arm64: dts: rockchip: Add Neardi LBA3368 board config: arm64-randconfig-051-20240628 (https://download.01.org/0day-ci/archive/20240628/202406280804.Kc8B4xzn-lkp@intel.com/config) compiler: aarch64-linux-gcc (GCC) 13.2.0 dtschema version: 2024.6.dev2+g3b69bad reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280804.Kc8B4xzn-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406280804.Kc8B4xzn-lkp@intel.com/ dtcheck warnings: (new ones prefixed by >>) >> arch/arm64/boot/dts/rockchip/rk3368-lba3368.dtb: /i2c@ff660000/codec@1c: failed to match any schema with compatible: ['realtek,rt5640'] >> arch/arm64/boot/dts/rockchip/rk3368-lba3368.dtb: /mbox@ff6b0000: failed to match any schema with compatible: ['rockchip,rk3368-mailbox'] arch/arm64/boot/dts/rockchip/rk3368-lba3368.dtb: i2s-8ch@ff898000: '#sound-dai-cells' is a required property from schema $id: http://devicetree.org/schemas/sound/rockchip-i2s.yaml# -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . Return-Path: Date: Fri, 28 Jun 2024 09:06:53 +0800 From: kernel test robot To: "Rob Herring (Arm)" Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List , Mark Brown Subject: [linux-next:master 8957/9027] arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: pinctrl@f4000: 'gpmc0-pins-default' does not match any of the regexes: '-pins(-[0-9]+)?$|-pin$', 'pinctrl-[0-9]+' Message-ID: <202406280912.CCQfYTkq-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201736 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 8d2bf5a527a48ac9f6855467768c91eeece46b62 [8957/9027] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git config: arm64-randconfig-051-20240628 (https://download.01.org/0day-ci/archive/20240628/202406280912.CCQfYTkq-lkp@intel.com/config) compiler: 
aarch64-linux-gcc (GCC) 13.2.0 dtschema version: 2024.6.dev2+g3b69bad reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280912.CCQfYTkq-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406280912.CCQfYTkq-lkp@intel.com/ dtcheck warnings: (new ones prefixed by >>) >> arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: pinctrl@f4000: 'gpmc0-pins-default' does not match any of the regexes: '-pins(-[0-9]+)?$|-pin$', 'pinctrl-[0-9]+' from schema $id: http://devicetree.org/schemas/pinctrl/pinctrl-single.yaml# >> arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: gpio@600000: 'gpio0-36' does not match any of the regexes: '^(.+-hog(-[0-9]+)?)$', 'pinctrl-[0-9]+' from schema $id: http://devicetree.org/schemas/gpio/gpio-davinci.yaml# >> arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: gpio0-36: $nodename:0: 'gpio0-36' does not match '^(hog-[0-9]+|.+-hog(-[0-9]+)?)$' from schema $id: http://devicetree.org/schemas/gpio/gpio-hog.yaml# >> arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: adc: 'ti,adc-channels' is a required property from schema $id: http://devicetree.org/schemas/iio/adc/ti,am3359-adc.yaml# >> arch/arm64/boot/dts/ti/k3-am642-evm-nand.dtb: nand@0,0: Unevaluated properties are not allowed ('partitions' was unexpected) from schema $id: http://devicetree.org/schemas/mtd/ti,gpmc-nand.yaml# -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . 
From: alexs@kernel.org To: Vitaly Wool , Miaohe Lin , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, minchan@kernel.org, willy@infradead.org, senozhatsky@chromium.org, david@redhat.com, 42.hyeyoo@gmail.com Cc: Alex Shi Subject: [PATCH 00/20] mm/zsmalloc: add zpdesc memory descriptor for zswap.zpool Date: Fri, 28 Jun 2024 11:11:15 +0800 Message-ID: <20240628031138.429622-1-alexs@kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1261642 org.kvack.linux-mm:201742 Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail From: Alex Shi According to Matthew's plan, the page descriptor will eventually be replaced by an 8-byte mem_desc. https://lore.kernel.org/lkml/YvV1KTyzZ+Jrtj9x@casper.infradead.org/ This series is an implementation for zsmalloc that replaces the page descriptor with 'zpdesc', which for now still overlays struct page, but it is a step toward that goal. The struct is named zpdesc rather than zsdesc because there are still 3 zpools under zswap for now: zbud, z3fold and zsmalloc (z3fold may be removed soon), and it could easily be extended to the other zswap.zpool users if needed. All zswap.zpools use single pages, since they are mostly used under memory pressure, so converting via the folio series of helpers is better than using the page helpers because it saves the compound_head check. For now, all zpools use some struct page members, such as page.flags for PG_private/PG_locked, and list_head lru and page.mapping for page migration. This patchset could save 123 Kbytes of zsmalloc.o size.
Thanks Alex Alex Shi (8): mm/zsmalloc: add zpdesc memory descriptor for zswap.zpool mm/zsmalloc: use zpdesc in trylock_zspage/lock_zspage mm/zsmalloc: convert create_page_chain() and its users to use zpdesc mm/zsmalloc: rename reset_page to reset_zpdesc and use zpdesc in it mm/zsmalloc: convert SetZsPageMovable and remove unused funcs mm/zsmalloc: introduce __zpdesc_clear_movable mm/zsmalloc: introduce __zpdesc_clear_zsmalloc mm/zsmalloc: introduce __zpdesc_set_zsmalloc() Hyeonggon Yoo (12): mm/zsmalloc: convert __zs_map_object/__zs_unmap_object to use zpdesc mm/zsmalloc: add and use pfn/zpdesc seeking funcs mm/zsmalloc: convert obj_malloc() to use zpdesc mm/zsmalloc: convert obj_allocated() and related helpers to use zpdesc mm/zsmalloc: convert init_zspage() to use zpdesc mm/zsmalloc: convert obj_to_page() and zs_free() to use zpdesc mm/zsmalloc: add zpdesc_is_isolated/zpdesc_zone helper for zs_page_migrate mm/zsmalloc: convert __free_zspage() to use zdsesc mm/zsmalloc: convert location_to_obj() to take zpdesc mm/zsmalloc: convert migrate_zspage() to use zpdesc mm/zsmalloc: convert get_zspage() to take zpdesc mm/zsmalloc: convert get/set_first_obj_offset() to take zpdesc mm/zpdesc.h | 134 +++++++++++++++ mm/zsmalloc.c | 454 +++++++++++++++++++++++++++----------------------- 2 files changed, 384 insertions(+), 204 deletions(-) create mode 100644 mm/zpdesc.h -- 2.43.0 . 
From: "Bang Li" To: hughd@google.com, akpm@linux-foundation.org Cc: , , , , , "Bang Li" Subject: [PATCH] mm/shmem: Fix input and output inconsistencies Date: Fri, 28 Jun 2024 11:23:27 +0800 Message-Id: <20240628032327.16987-1-libang.li@antgroup.com> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1261676 org.kvack.linux-mm:201765 Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail Following commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), mTHP support was added for anonymous shmem, so we can configure different policies for anonymous shmem through the multi-size THP sysfs interface. But when we want to set the "advise" policy in /sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled, writing "advise" is not accepted and we have to write "madvise" instead, which is unreasonable. We should keep the output and input values consistent, which is more convenient for users. Fixes: 61a57f1b1da9 ("mm: shmem: add multi-size THP sysfs interface for anonymous shmem") Signed-off-by: Bang Li --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 13d139abe69a..d495c0701a83 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -4983,7 +4983,7 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, clear_bit(order, &huge_shmem_orders_madvise); set_bit(order, &huge_shmem_orders_within_size); spin_unlock(&huge_shmem_orders_lock); - } else if (sysfs_streq(buf, "madvise")) { + } else if (sysfs_streq(buf, "advise")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_inherit); -- 2.19.1.6.gb485710b .
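The input/output consistency argued for in the mm/shmem patch above can be illustrated with a small userspace sketch. It is not the kernel code: model_sysfs_streq() is a hypothetical re-implementation of the behaviour of sysfs_streq() (strings compare equal if they differ only by one trailing newline), and the keyword table, enum and helper names are assumptions made for illustration. The point it shows is that a single table used for both parsing and printing keeps the accepted keyword and the displayed keyword identical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of sysfs_streq(): two strings are treated as equal
 * if they match exactly or differ only by one trailing newline.
 */
static bool model_sysfs_streq(const char *s1, const char *s2)
{
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;
	return false;
}

/* Hypothetical policy identifiers; the list is an assumption. */
enum shmem_policy {
	POLICY_INVALID = -1,
	POLICY_ALWAYS,
	POLICY_INHERIT,
	POLICY_WITHIN_SIZE,
	POLICY_ADVISE,
	POLICY_NEVER,
};

/* One table drives both "store" (parse) and "show" (print). */
static const char *const policy_names[] = {
	"always", "inherit", "within_size", "advise", "never",
};

#define POLICY_COUNT (sizeof(policy_names) / sizeof(policy_names[0]))

/* Parse what a user wrote to the file; a trailing newline is tolerated. */
static enum shmem_policy parse_policy(const char *buf)
{
	for (size_t i = 0; i < POLICY_COUNT; i++)
		if (model_sysfs_streq(buf, policy_names[i]))
			return (enum shmem_policy)i;
	return POLICY_INVALID;
}

/* Return the keyword that would be shown for a policy, or NULL. */
static const char *policy_name(enum shmem_policy p)
{
	if (p < 0 || (size_t)p >= POLICY_COUNT)
		return NULL;
	return policy_names[p];
}
```

With this layout, whatever policy_name() prints is by construction accepted by parse_policy(), so a mismatch like "advise" on output versus "madvise" on input cannot arise.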
From: Ruidong Tian To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, akpm@linux-foundation.org, nao.horiguchi@gmail.com, linmiaohe@huawei.com, tianruidong@alibaba.linux.com, xueshuai@linux.alibaba.com, Ruidong Tian Subject: [PATCH] mm/hwpoison: avoid speculation access after soft/hard offline Date: Fri, 28 Jun 2024 11:35:09 +0800 Message-Id: <20240628033509.27612-1-tianruidong@linux.alibaba.com> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1261681 org.kvack.linux-mm:201769 Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail A page that has been offlined can still report correctable/uncorrectable error (CE/UE) events due to speculative access. Remove the page from the kernel 1:1 linear mapping after a successful soft/hard offline to avoid this. Signed-off-by: Ruidong Tian --- drivers/base/memory.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 67858eeb92ed..502ee1107ac6 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -600,6 +601,8 @@ static ssize_t soft_offline_page_store(struct device *dev, return -EINVAL; pfn >>= PAGE_SHIFT; ret = soft_offline_page(pfn, 0); + if (!ret) + set_mce_nospec(pfn); return ret == 0 ? count : ret; } @@ -616,6 +619,8 @@ static ssize_t hard_offline_page_store(struct device *dev, return -EINVAL; pfn >>= PAGE_SHIFT; ret = memory_failure(pfn, MF_SW_SIMULATED); + if (!ret) + set_mce_nospec(pfn); if (ret == -EOPNOTSUPP) ret = 0; return ret ? ret : count; -- 2.39.3 .
From: "Ho-Ren (Jack) Chuang" To: "Jonathan Cameron" , "Huang, Ying" , "Gregory Price" , aneesh.kumar@linux.ibm.com, mhocko@suse.com, tj@kernel.org, john@jagalactic.com, "Eishan Mirakhur" , "Vinicius Tavares Petrucci" , "Ravis OpenSrc" , "Alistair Popple" , "Srinivasulu Thanneeru" , "SeongJae Park" , "Rafael J. Wysocki" , Len Brown , Andrew Morton , Dave Jiang , Dan Williams , Jonathan Cameron , "Ho-Ren (Jack) Chuang" , linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: "Ho-Ren (Jack) Chuang" , "Ho-Ren (Jack) Chuang" , "Ho-Ren (Jack) Chuang" , linux-cxl@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v2 0/1] memory tier: consolidate the initialization of memory tiers Date: Fri, 28 Jun 2024 06:09:22 +0000 Message-Id: <20240628060925.303309-1-horen.chuang@linux.dev> X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-cxl:29169 org.kernel.vger.linux-kernel:1261750 org.kvack.linux-mm:201772 Newsgroups: org.kernel.vger.linux-cxl,org.kernel.vger.linux-acpi,org.kernel.vger.linux-kernel,org.kvack.linux-mm,org.nongnu.qemu-devel Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail The current memory tier initialization process is distributed across two different functions, memory_tier_init() and memory_tier_late_init(). This design is hard to maintain. Thus, this patch is proposed to reduce the possible code paths by consolidating different initialization patches into one. The earlier discussion with Jonathan and Ying is listed here: https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/ If we want to put these two initializations together, they must be placed together in the later function. Because only at that time, the HMAT information will be ready, adist between nodes can be calculated, and memory tiering can be established based on the adist. 
So we move the initialization done in memory_tier_init() into memory_tier_late_init(). Moreover, it's natural to keep memory_tier initialization in drivers at device_initcall() level. - v2: Thanks to Huang, Ying's and Andrew's comments * Add cover letter * Add Suggested-by: Jonathan Cameron * Add get/put_online_mems() protection in memory_tier_late_init() * If memtype is set, skip initializing its node * Remove redundant code/comments or rewrite code in a cleaner manner - v1: * https://lore.kernel.org/all/20240621044833.3953055-1-horen.chuang@linux.dev/ This patchset is based on commits cf93be18fa1b and a72a30af550c: [0/2] https://lkml.kernel.org/r/20240405000707.2670063-1-horenchuang@bytedance.com [1/2] https://lkml.kernel.org/r/20240405000707.2670063-2-horenchuang@bytedance.com [2/2] https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com Ho-Ren (Jack) Chuang (1): memory tier: consolidate the initialization of memory tiers drivers/acpi/numa/hmat.c | 5 +-- include/linux/memory-tiers.h | 2 ++ mm/memory-tiers.c | 59 +++++++++++++++--------------------- 3 files changed, 28 insertions(+), 38 deletions(-) -- Ho-Ren (Jack) Chuang .
Return-Path: From: yangge1116@126.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, 21cnbao@gmail.com, peterx@redhat.com, yang@os.amperecomputing.com, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn, yangge Subject: [PATCH V2] mm/gup: Fix longterm pin on slow gup regression Date: Fri, 28 Jun 2024 14:01:58 +0800 Message-Id: <1719554518-11006-1-git-send-email-yangge1116@126.com> Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201776 Newsgroups: org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail From: yangge If a large amount of CMA memory is configured in the system (for example, the CMA memory accounts for 50% of the system memory), starting a SEV virtual machine will fail. While starting, the SEV virtual machine calls pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin memory. Normally, if a page is present and in the CMA area, pin_user_pages_fast() will first call __get_user_pages_locked() to pin the page in the CMA area, and then call check_and_migrate_movable_pages() to migrate the page from the CMA area to a non-CMA area. But in the current code the call to __get_user_pages_locked() fails, because it calls try_grab_folio() to pin the page in the gup slow path. The commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") uses try_grab_folio() in the gup slow path, which seems to be problematic because try_grab_folio() checks whether the page can be longterm pinned. This check may fail and cause __get_user_pages_locked() to fail. However, these checks are not required in the gup slow path, so it seems we can use try_grab_page() instead of try_grab_folio(). In addition, in the current code, try_grab_page() can only add 1 to the page's refcount. We extend this function so that the page's refcount can be increased according to the parameters passed in.
The following log reveals it: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] In another thread [1], hugepd also has a similar problem, so include relevant handling codes. [1] https://lore.kernel.org/all/20240604234858.948986-2-yang@os.amperecomputing.com/ Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Cc: Signed-off-by: yangge --- mm/gup.c | 55 +++++++++++++++++++++++++++++-------------------------- mm/huge_memory.c | 2 +- mm/internal.h | 2 +- 3 files changed, 31 insertions(+), 28 deletions(-) V2: 1, Using unlikely instead of WARN_ON_ONCE 2, Reworked the code and commit log to include hugepd path handling from Yang diff --git a/mm/gup.c b/mm/gup.c index 6ff9f95..070cf58 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -222,7 +222,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not * be grabbed. 
*/ -int __must_check try_grab_page(struct page *page, unsigned int flags) +int __must_check try_grab_page(struct page *page, int refs, unsigned int flags) { struct folio *folio = page_folio(page); @@ -233,7 +233,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return -EREMOTEIO; if (flags & FOLL_GET) - folio_ref_inc(folio); + folio_ref_add(folio, refs); else if (flags & FOLL_PIN) { /* * Don't take a pin on the zero page - it's not going anywhere @@ -248,13 +248,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) * so that the page really is pinned. */ if (folio_test_large(folio)) { - folio_ref_add(folio, 1); - atomic_add(1, &folio->_pincount); + folio_ref_add(folio, refs); + atomic_add(refs, &folio->_pincount); } else { - folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); + folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS); } - node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1); + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); } return 0; @@ -535,7 +535,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, */ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { unsigned long pte_end; struct page *page; @@ -558,9 +558,14 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz page = pte_page(pte); refs = record_subpages(page, sz, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); - if (!folio) - return 0; + if (fast) { + if (try_grab_page(page, refs, flags)) + return 0; + else { + folio = try_grab_folio(page, refs, flags); + if (!folio) + return 0; + } if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { gup_put_folio(folio, refs, flags); @@ -588,7 +593,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz static int gup_hugepd(struct 
vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { pte_t *ptep; unsigned long sz = 1UL << hugepd_shift(hugepd); @@ -598,7 +603,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); do { next = hugepte_addr_end(addr, end, sz); - ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr); + ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast); if (ret != 1) return ret; } while (ptep++, addr = next, addr != end); @@ -625,7 +630,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); ptl = huge_pte_lock(h, vma->vm_mm, ptep); ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE, - flags, &page, &nr); + flags, &page, &nr, false); spin_unlock(ptl); if (ret == 1) { @@ -642,7 +647,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { return 0; } @@ -729,7 +734,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma, gup_must_unshare(vma, flags, page)) return ERR_PTR(-EMLINK); - ret = try_grab_page(page, flags); + ret = try_grab_page(page, 1, flags); if (ret) page = ERR_PTR(ret); else @@ -806,7 +811,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma, VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); - ret = try_grab_page(page, flags); + ret = try_grab_page(page, 1, flags); if (ret) return ERR_PTR(ret); @@ -969,7 +974,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, !PageAnonExclusive(page), page); /* try_grab_page() does nothing 
unless FOLL_GET or FOLL_PIN is set. */ - ret = try_grab_page(page, flags); + ret = try_grab_page(page, 1, flags); if (unlikely(ret)) { page = ERR_PTR(ret); goto out; @@ -1233,7 +1238,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, goto unmap; *page = pte_page(entry); } - ret = try_grab_page(*page, gup_flags); + ret = try_grab_page(*page, 1, gup_flags); if (unlikely(ret)) goto unmap; out: @@ -1636,22 +1641,20 @@ static long __get_user_pages(struct mm_struct *mm, * pages. */ if (page_increm > 1) { - struct folio *folio; /* * Since we already hold refcount on the * large folio, this should never fail. */ - folio = try_grab_folio(page, page_increm - 1, + ret = try_grab_page(page, page_increm - 1, foll_flags); - if (WARN_ON_ONCE(!folio)) { + if (unlikely(ret)) { /* * Release the 1st page ref if the * folio is problematic, fail hard. */ gup_put_folio(page_folio(page), 1, foll_flags); - ret = -EFAULT; goto out; } } @@ -3276,7 +3279,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, * pmd format and THP pmd format */ if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, - PMD_SHIFT, next, flags, pages, nr) != 1) + PMD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) @@ -3306,7 +3309,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, return 0; } else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) { if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr, - PUD_SHIFT, next, flags, pages, nr) != 1) + PUD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags, pages, nr)) @@ -3333,7 +3336,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, BUILD_BUG_ON(p4d_leaf(p4d)); if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) { if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr, - P4D_SHIFT, next, flags, pages, nr) != 1) + P4D_SHIFT, next, flags, pages, 
nr, true) != 1) return 0; } else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) @@ -3362,7 +3365,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end, return; } else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) { if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr, - PGDIR_SHIFT, next, flags, pages, nr) != 1) + PGDIR_SHIFT, next, flags, pages, nr, true) != 1) return; } else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 425374a..18604e4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1332,7 +1332,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - ret = try_grab_page(page, flags); + ret = try_grab_page(page, 1, flags); if (ret) page = ERR_PTR(ret); diff --git a/mm/internal.h b/mm/internal.h index 2ea9a88..5305bbf 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1227,7 +1227,7 @@ int migrate_device_coherent_page(struct page *page); * mm/gup.c */ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); -int __must_check try_grab_page(struct page *page, unsigned int flags); +int __must_check try_grab_page(struct page *page, int refs, unsigned int flags); /* * mm/huge_memory.c -- 2.7.4 . 
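The refcount arithmetic of the extended try_grab_page(page, refs, flags) in the patch above can be modeled in userspace as a rough sketch. Everything here is a simplified stand-in: struct model_folio, the MODEL_* flags and model_try_grab() are hypothetical names, the bias value mirrors the kernel's GUP_PIN_COUNTING_BIAS of 1U << 10, and the zero-page and PCI P2PDMA special cases of the real function are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mirrors GUP_PIN_COUNTING_BIAS (1U << 10) in the kernel. */
#define MODEL_GUP_PIN_COUNTING_BIAS 1024

/* Stand-ins for FOLL_GET / FOLL_PIN; the values are arbitrary. */
#define MODEL_FOLL_GET 0x1u
#define MODEL_FOLL_PIN 0x2u

/* Minimal model of the folio fields that the grab path touches. */
struct model_folio {
	int refcount;	/* models the folio reference count */
	int pincount;	/* models _pincount, used for large folios only */
	bool large;	/* models folio_test_large() */
};

/*
 * Slow-path grab: the caller already holds a stable reference, so a
 * plain add is enough and no "add unless zero" is needed.
 */
static int model_try_grab(struct model_folio *folio, int refs,
			  unsigned int flags)
{
	/* A folio without a reference here indicates a caller bug. */
	if (folio->refcount <= 0)
		return -ENOMEM;

	if (flags & MODEL_FOLL_GET) {
		folio->refcount += refs;
	} else if (flags & MODEL_FOLL_PIN) {
		if (folio->large) {
			/* Large folios track pins in a separate counter. */
			folio->refcount += refs;
			folio->pincount += refs;
		} else {
			/* Small folios encode each pin as a biased ref. */
			folio->refcount += refs * MODEL_GUP_PIN_COUNTING_BIAS;
		}
	}
	return 0;
}
```

For example, pinning a small folio with refs = 3 from a starting refcount of 1 yields 1 + 3 * 1024, while the same call on a large folio yields a refcount of 4 and a pincount of 3.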
From: Xiu Jianfeng To: , , , , , CC: , , Subject: [PATCH -next] mm: memcg: adjust the warning when seq_buf overflows Date: Fri, 28 Jun 2024 07:23:33 +0000 Message-ID: <20240628072333.2496527-1-xiujianfeng@huawei.com> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1261843 org.kvack.linux-mm:201782 Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.cgroups,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail Currently it uses WARN_ON_ONCE() if seq_buf overflows when user reads memory.stat, the only advantage of WARN_ON_ONCE is that the splat is so verbose that it gets noticed. And also it panics the system if panic_on_warn is enabled. It seems like the warning is just an over reaction and a simple pr_warn should just achieve the similar effect. Suggested-by: Michal Hocko Signed-off-by: Xiu Jianfeng --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c251bbe35f4b..8e5590ac43d7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1484,7 +1484,8 @@ static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) memcg_stat_format(memcg, s); else memcg1_stat_format(memcg, s); - WARN_ON_ONCE(seq_buf_has_overflowed(s)); + if (seq_buf_has_overflowed(s)) + pr_warn("%s: Warning, stat buffer overflow, please report\n", __func__); } /** -- 2.34.1 . 
Subject: [PATCH V5] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes From: Jesper Dangaard Brouer To: tj@kernel.org, cgroups@vger.kernel.org, yosryahmed@google.com, shakeel.butt@linux.dev Cc: Jesper Dangaard Brouer , hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Date: Fri, 28 Jun 2024 12:12:57 +0200 Message-ID: <171956951930.1897969.8709279863947931285.stgit@firesoul> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1262043 org.kvack.linux-mm:201798 Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.cgroups,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail Avoid lock contention on the global cgroup rstat lock caused by kswapd starting on all NUMA nodes simultaneously. At Cloudflare, we observed massive issues due to kswapd and the specific mem_cgroup_flush_stats() call inlined in shrink_node, which takes the rstat lock. On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, we noted severe lock contention on the rstat lock. This contention causes 12 CPUs to waste cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily due to spinning on the rstat lock. Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds causing all kswapdN threads to wake up simultaneously. The kswapd thread invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush operation as part of its work. This results in kernel self-induced rstat lock contention by waking up all kswapd threads simultaneously. 
Leveraging this detail: balance_pgdat() has a NULL value in target_mem_cgroup, which causes mem_cgroup_flush_stats() to flush with root_mem_cgroup. To avoid this kind of thundering herd problem, the kernel previously had a "stats_flush_ongoing" concept, but this was removed as part of commit 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch reintroduces and generalizes the concept to apply to all users of cgroup rstat, not just memcg. If there is an ongoing rstat flush, and the current cgroup is a descendant, then it is unnecessary to do the flush. For callers to still see updated stats, wait for the ongoing flusher to complete before returning, but add a timeout, as stats are already inaccurate given that updaters keep running. Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing") Signed-off-by: Jesper Dangaard Brouer --- V5: Dropped trylock helper V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/ V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/ V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/ V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/ RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/ include/linux/cgroup-defs.h | 2 ++ kernel/cgroup/rstat.c | 57 +++++++++++++++++++++++++++++++++++++++---- 2 files changed, 54 insertions(+), 5 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index b36690ca0d3f..a33b37514c29 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -548,6 +548,8 @@ struct cgroup { #ifdef CONFIG_BPF_SYSCALL struct bpf_local_storage __rcu *bpf_cgrp_storage; #endif + /* completion queue for cgrp_rstat_ongoing_flusher */ + struct completion flush_done; /* All ancestors including self */ struct cgroup *ancestors[]; diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index
fb8b49437573..e9d3e2aff698 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -2,6 +2,7 @@ #include "cgroup-internal.h" #include +#include #include #include @@ -11,6 +12,7 @@ static DEFINE_SPINLOCK(cgroup_rstat_lock); static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL; static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); @@ -299,6 +301,44 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) spin_unlock_irq(&cgroup_rstat_lock); } +#define MAX_WAIT msecs_to_jiffies(100) +/* Trylock helper that also checks for on ongoing flusher */ +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) +{ + struct cgroup *cgrp_ongoing; + + /* Check if ongoing flusher is already taking care of this, if + * we are a descendant skip work, but wait for ongoing flusher + * to complete work. + */ + cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); + if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) { + wait_for_completion_interruptible_timeout( + &cgrp_ongoing->flush_done, MAX_WAIT); + /* TODO: Add tracepoint here */ + return false; + } + + __cgroup_rstat_lock(cgrp, -1); + /* Obtained lock, record this cgrp as the ongoing flusher */ + if (!READ_ONCE(cgrp_rstat_ongoing_flusher)) { + reinit_completion(&cgrp->flush_done); + WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp); + } + + return true; /* locked */ +} + +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp) +{ + /* Detect if we are the ongoing flusher */ + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) { + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL); + complete_all(&cgrp->flush_done); + } + __cgroup_rstat_unlock(cgrp, -1); +} + /* see cgroup_rstat_flush() */ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) @@ -350,9 +390,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) { might_sleep(); - 
__cgroup_rstat_lock(cgrp, -1); + if (!cgroup_rstat_trylock_flusher(cgrp)) + return; + cgroup_rstat_flush_locked(cgrp); - __cgroup_rstat_unlock(cgrp, -1); + cgroup_rstat_unlock_flusher(cgrp); } /** @@ -368,8 +410,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) __acquires(&cgroup_rstat_lock) { might_sleep(); - __cgroup_rstat_lock(cgrp, -1); - cgroup_rstat_flush_locked(cgrp); + + if (cgroup_rstat_trylock_flusher(cgrp)) + cgroup_rstat_flush_locked(cgrp); + else + __cgroup_rstat_lock(cgrp, -1); } /** @@ -379,7 +424,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) void cgroup_rstat_flush_release(struct cgroup *cgrp) __releases(&cgroup_rstat_lock) { - __cgroup_rstat_unlock(cgrp, -1); + cgroup_rstat_unlock_flusher(cgrp); } int cgroup_rstat_init(struct cgroup *cgrp) @@ -401,6 +446,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) u64_stats_init(&rstatc->bsync); } + init_completion(&cgrp->flush_done); + return 0; } . From: "Bang Li" To: hughd@google.com, akpm@linux-foundation.org Cc: , , , , , , , "Bang Li" Subject: [PATCH] support "THPeligible" semantics for mTHP with anonymous shmem Date: Fri, 28 Jun 2024 18:49:26 +0800 Message-Id: <20240628104926.34209-1-libang.li@antgroup.com> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1262065 org.kvack.linux-mm:201799 Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail After the commit 7fb1b252afb5 ("mm: shmem: add mTHP support for anonymous shmem"), we can configure different policies through the multi-size THP sysfs interface for anonymous shmem. 
But currently "THPeligible" indicates only whether the mapping is eligible for allocating THP-pages as well as the THP is PMD mappable or not for anonymous shmem, we need to support semantics for mTHP with anonymous shmem similar to those for mTHP with anonymous memory. Signed-off-by: Bang Li --- fs/proc/task_mmu.c | 10 +++++++--- include/linux/huge_mm.h | 11 +++++++++++ mm/shmem.c | 9 +-------- 3 files changed, 19 insertions(+), 11 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 93fb2c61b154..09b5db356886 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -870,6 +870,7 @@ static int show_smap(struct seq_file *m, void *v) { struct vm_area_struct *vma = v; struct mem_size_stats mss = {}; + bool thp_eligible; smap_gather_stats(vma, &mss, 0); @@ -882,9 +883,12 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); - seq_printf(m, "THPeligible: %8u\n", - !!thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL)); + thp_eligible = !!thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL); + if (vma_is_anon_shmem(vma)) + thp_eligible = !!shmem_allowable_huge_orders(file_inode(vma->vm_file), + vma, vma->vm_pgoff, thp_eligible); + seq_printf(m, "THPeligible: %8u\n", thp_eligible); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 212cca384d7e..f87136f38aa1 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -267,6 +267,10 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); } +unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge); + struct thpsize { struct kobject kobj; struct list_head node; @@ -460,6 +464,13 @@ static inline unsigned long 
thp_vma_allowable_orders(struct vm_area_struct *vma, return 0; } +static inline unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + return 0; +} + #define transparent_hugepage_flags 0UL #define thp_get_unmapped_area NULL diff --git a/mm/shmem.c b/mm/shmem.c index d495c0701a83..aa85df9c662a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1622,7 +1622,7 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static unsigned long shmem_allowable_huge_orders(struct inode *inode, +unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, bool global_huge) { @@ -1707,13 +1707,6 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault return orders; } #else -static unsigned long shmem_allowable_huge_orders(struct inode *inode, - struct vm_area_struct *vma, pgoff_t index, - bool global_huge) -{ - return 0; -} - static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, struct address_space *mapping, pgoff_t index, unsigned long orders) -- 2.19.1.6.gb485710b . 
From: Lance Yang
To: akpm@linux-foundation.org
Cc: dj456119@gmail.com, 21cnbao@gmail.com, ryan.roberts@arm.com, david@redhat.com, shy828301@gmail.com, ziy@nvidia.com, libang.li@antgroup.com, baolin.wang@linux.alibaba.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Lance Yang
Subject: [PATCH v2 0/2] mm: introduce per-order mTHP split counters
Date: Fri, 28 Jun 2024 21:07:48 +0800
Message-ID: <20240628130750.73097-1-ioworker0@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1262172 org.kvack.linux-mm:201803
Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Hi all,

Currently, the split counters in THP statistics no longer include PTE-mapped mTHP. Therefore, we propose introducing per-order mTHP split counters to monitor the frequency of mTHP splits. This will help developers better analyze and optimize system performance.

/sys/kernel/mm/transparent_hugepage/hugepages-/stats
        split
        split_failed
        split_deferred

---

Changes since v1 [1]
====================
- mm: add per-order mTHP split counters
  - Update the changelog
  - Drop '_page' from mTHP split counter names (per David and Ryan)
  - Store the order of the folio in a variable and reuse it later (per Bang)
- mm: add docs for per-order mTHP split counters
  - Improve the doc suggested by Ryan

[1] https://lore.kernel.org/linux-mm/20240424135148.30422-1-ioworker0@gmail.com

Lance Yang (2):
  mm: add per-order mTHP split counters
  mm: add docs for per-order mTHP split counters

 Documentation/admin-guide/mm/transhuge.rst | 16 ++++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           | 19 ++++++++++++++-----
 3 files changed, 33 insertions(+), 5 deletions(-)

--
2.45.2
.
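For readers who want to monitor the per-order split counters proposed in the cover letter above, the following is a minimal userland sketch, not part of the series. It assumes each counter (split, split_failed, split_deferred) is a file holding a single decimal number inside a per-size stats directory, and that the caller supplies that directory. The helper names parse_counter() and read_split_counter() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one counter value such as "42\n".
 * Returns 0 on success and -1 if the text is not a plain decimal number.
 */
int parse_counter(const char *text, unsigned long long *out)
{
	char *end;
	unsigned long long v;

	assert(text != NULL && out != NULL);

	/* Reject signs, whitespace and empty input before strtoull() sees them. */
	if (text[0] < '0' || text[0] > '9')
		return -1;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	/* Allow only an optional trailing newline after the digits. */
	if (*end != '\0' && *end != '\n')
		return -1;

	*out = v;
	return 0;
}

/*
 * Hypothetical helper: read <stats_dir>/<name>, where stats_dir is the
 * per-size stats directory (an assumption supplied by the caller) and
 * name is one of "split", "split_failed" or "split_deferred".
 * Returns 0 on success and -1 on any error.
 */
int read_split_counter(const char *stats_dir, const char *name,
		       unsigned long long *out)
{
	char path[512];
	char buf[64];
	char *line;
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", stats_dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	line = fgets(buf, sizeof(buf), f);
	fclose(f);
	if (!line)
		return -1;

	return parse_counter(buf, out);
}
```

Sampling the same counter twice and subtracting gives the number of splits over an interval, which is the "frequency of mTHP splits" the cover letter refers to.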
From: Lorenzo Stoakes To: Andrew Morton Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "Liam R . Howlett" , Vlastimil Babka , Matthew Wilcox , Alexander Viro , Christian Brauner , Jan Kara , Eric Biederman , Kees Cook , Suren Baghdasaryan , Lorenzo Stoakes Subject: [RFC PATCH v2 0/7] Make core VMA operations internal and testable Date: Fri, 28 Jun 2024 15:35:21 +0100 Message-ID: X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1262259 org.kvack.linux-mm:201807 Newsgroups: org.kernel.vger.linux-kernel,org.kernel.vger.linux-fsdevel,org.kvack.linux-mm Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there. More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself. This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h. Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h. This is useful, because we can then re-implement vma_internal.h in userland, stubbing out and adding shims for kernel mechanisms as required, and then can directly and very easily unit test internal VMA functionality. This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing. 
Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland.

A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly.

v2:
* Reword commit messages.
* Replace vma_expand() / vma_shrink() wrappers with relocate_vma().
* Make move_page_tables() internal too.
* Have internal.h import vma.h.
* Use header guards to more cleanly implement userland testing code.
* Rename main.c to vma.c.
* Update mm/vma_internal.h to have fewer superfluous comments.
* Rework testing logic so we count test failures, and output test results.
* Correct some SPDX license prefixes.
* Make VM_xxx_ON() debug asserts forward to xxx_ON() macros.
* Update VMA tests to correctly free memory, and re-enable ASAN leak detection.
v1: https://lore.kernel.org/all/cover.1719481836.git.lstoakes@gmail.com/ Lorenzo Stoakes (7): userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c mm: move vma_modify() and helpers to internal header mm: move vma_shrink(), vma_expand() to internal header mm: move internal core VMA manipulation functions to own file MAINTAINERS: Add entry for new VMA files tools: separate out shared radix-tree components tools: add skeleton code for userland testing of VMA logic MAINTAINERS | 14 + fs/exec.c | 68 +- fs/userfaultfd.c | 160 +- include/linux/atomic.h | 2 +- include/linux/mm.h | 112 +- include/linux/mmzone.h | 3 +- include/linux/userfaultfd_k.h | 19 + mm/Makefile | 2 +- mm/internal.h | 167 +- mm/mmap.c | 2070 ++--------------- mm/mmu_notifier.c | 2 + mm/userfaultfd.c | 168 ++ mm/vma.c | 1766 ++++++++++++++ mm/vma.h | 362 +++ mm/vma_internal.h | 52 + tools/testing/radix-tree/Makefile | 68 +- tools/testing/radix-tree/maple.c | 14 +- tools/testing/radix-tree/xarray.c | 9 +- tools/testing/shared/autoconf.h | 2 + tools/testing/{radix-tree => shared}/bitmap.c | 0 tools/testing/{radix-tree => shared}/linux.c | 0 .../{radix-tree => shared}/linux/bug.h | 0 .../{radix-tree => shared}/linux/cpu.h | 0 .../{radix-tree => shared}/linux/idr.h | 0 .../{radix-tree => shared}/linux/init.h | 0 .../{radix-tree => shared}/linux/kconfig.h | 0 .../{radix-tree => shared}/linux/kernel.h | 0 .../{radix-tree => shared}/linux/kmemleak.h | 0 .../{radix-tree => shared}/linux/local_lock.h | 0 .../{radix-tree => shared}/linux/lockdep.h | 0 .../{radix-tree => shared}/linux/maple_tree.h | 0 .../{radix-tree => shared}/linux/percpu.h | 0 .../{radix-tree => shared}/linux/preempt.h | 0 .../{radix-tree => shared}/linux/radix-tree.h | 0 .../{radix-tree => shared}/linux/rcupdate.h | 0 .../{radix-tree => shared}/linux/xarray.h | 0 tools/testing/shared/maple-shared.h | 9 + tools/testing/shared/maple-shim.c | 7 + tools/testing/shared/shared.h | 34 + tools/testing/shared/shared.mk | 68 + 
.../testing/shared/trace/events/maple_tree.h | 5 + tools/testing/shared/xarray-shared.c | 5 + tools/testing/shared/xarray-shared.h | 4 + tools/testing/vma/.gitignore | 6 + tools/testing/vma/Makefile | 15 + tools/testing/vma/errors.txt | 0 tools/testing/vma/generated/autoconf.h | 2 + tools/testing/vma/linux/atomic.h | 12 + tools/testing/vma/linux/mmzone.h | 38 + tools/testing/vma/vma.c | 207 ++ tools/testing/vma/vma_internal.h | 882 +++++++ 51 files changed, 3910 insertions(+), 2444 deletions(-) create mode 100644 mm/vma.c create mode 100644 mm/vma.h create mode 100644 mm/vma_internal.h create mode 100644 tools/testing/shared/autoconf.h rename tools/testing/{radix-tree => shared}/bitmap.c (100%) rename tools/testing/{radix-tree => shared}/linux.c (100%) rename tools/testing/{radix-tree => shared}/linux/bug.h (100%) rename tools/testing/{radix-tree => shared}/linux/cpu.h (100%) rename tools/testing/{radix-tree => shared}/linux/idr.h (100%) rename tools/testing/{radix-tree => shared}/linux/init.h (100%) rename tools/testing/{radix-tree => shared}/linux/kconfig.h (100%) rename tools/testing/{radix-tree => shared}/linux/kernel.h (100%) rename tools/testing/{radix-tree => shared}/linux/kmemleak.h (100%) rename tools/testing/{radix-tree => shared}/linux/local_lock.h (100%) rename tools/testing/{radix-tree => shared}/linux/lockdep.h (100%) rename tools/testing/{radix-tree => shared}/linux/maple_tree.h (100%) rename tools/testing/{radix-tree => shared}/linux/percpu.h (100%) rename tools/testing/{radix-tree => shared}/linux/preempt.h (100%) rename tools/testing/{radix-tree => shared}/linux/radix-tree.h (100%) rename tools/testing/{radix-tree => shared}/linux/rcupdate.h (100%) rename tools/testing/{radix-tree => shared}/linux/xarray.h (100%) create mode 100644 tools/testing/shared/maple-shared.h create mode 100644 tools/testing/shared/maple-shim.c create mode 100644 tools/testing/shared/shared.h create mode 100644 tools/testing/shared/shared.mk create mode 100644 
tools/testing/shared/trace/events/maple_tree.h create mode 100644 tools/testing/shared/xarray-shared.c create mode 100644 tools/testing/shared/xarray-shared.h create mode 100644 tools/testing/vma/.gitignore create mode 100644 tools/testing/vma/Makefile create mode 100644 tools/testing/vma/errors.txt create mode 100644 tools/testing/vma/generated/autoconf.h create mode 100644 tools/testing/vma/linux/atomic.h create mode 100644 tools/testing/vma/linux/mmzone.h create mode 100644 tools/testing/vma/vma.c create mode 100644 tools/testing/vma/vma_internal.h -- 2.45.1 . Return-Path: Date: Fri, 28 Jun 2024 23:06:32 +0800 From: kernel test robot To: Benjamin Tissoires Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List Subject: [linux-next:master 8744/9027] drivers/hid/bpf/hid_bpf_dispatch.c:363:47: warning: cast from pointer to integer of different size Message-ID: <202406282304.UydSVncq-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201817 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 67eccf151d76a9939ad8a50c6db5cb486b01df24 [8744/9027] HID: add source argument to HID low level functions config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20240628/202406282304.UydSVncq-lkp@intel.com/config) compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406282304.UydSVncq-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406282304.UydSVncq-lkp@intel.com/ All warnings (new ones prefixed by >>): drivers/hid/bpf/hid_bpf_dispatch.c: In function 'hid_bpf_hw_request': >> drivers/hid/bpf/hid_bpf_dispatch.c:363:47: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 363 | (__u64)ctx); | ^ drivers/hid/bpf/hid_bpf_dispatch.c: In function 'hid_bpf_hw_output_report': drivers/hid/bpf/hid_bpf_dispatch.c:403:49: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 403 | (__u64)ctx); | ^ drivers/hid/bpf/hid_bpf_dispatch.c: In function 'hid_bpf_input_report': drivers/hid/bpf/hid_bpf_dispatch.c:434:68: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 434 | return hid_ops->hid_input_report(hdev, type, buf, size, 0, (__u64)ctx); | ^ vim +363 drivers/hid/bpf/hid_bpf_dispatch.c 313 314 /** 315 * hid_bpf_hw_request - Communicate with a HID device 316 * 317 * @ctx: the HID-BPF context previously allocated in hid_bpf_allocate_context() 318 * @buf: a %PTR_TO_MEM buffer 319 * @buf__sz: the size of the data to transfer 320 * @rtype: the type of the report (%HID_INPUT_REPORT, %HID_FEATURE_REPORT, %HID_OUTPUT_REPORT) 321 * @reqtype: the type of the request (%HID_REQ_GET_REPORT, %HID_REQ_SET_REPORT, ...) 322 * 323 * @returns %0 on success, a negative error code otherwise. 
324 */ 325 __bpf_kfunc int 326 hid_bpf_hw_request(struct hid_bpf_ctx *ctx, __u8 *buf, size_t buf__sz, 327 enum hid_report_type rtype, enum hid_class_request reqtype) 328 { 329 struct hid_device *hdev; 330 size_t size = buf__sz; 331 u8 *dma_data; 332 int ret; 333 334 /* check arguments */ 335 ret = __hid_bpf_hw_check_params(ctx, buf, &size, rtype); 336 if (ret) 337 return ret; 338 339 switch (reqtype) { 340 case HID_REQ_GET_REPORT: 341 case HID_REQ_GET_IDLE: 342 case HID_REQ_GET_PROTOCOL: 343 case HID_REQ_SET_REPORT: 344 case HID_REQ_SET_IDLE: 345 case HID_REQ_SET_PROTOCOL: 346 break; 347 default: 348 return -EINVAL; 349 } 350 351 hdev = (struct hid_device *)ctx->hid; /* discard const */ 352 353 dma_data = kmemdup(buf, size, GFP_KERNEL); 354 if (!dma_data) 355 return -ENOMEM; 356 357 ret = hid_ops->hid_hw_raw_request(hdev, 358 dma_data[0], 359 dma_data, 360 size, 361 rtype, 362 reqtype, > 363 (__u64)ctx); 364 365 if (ret > 0) 366 memcpy(buf, dma_data, ret); 367 368 kfree(dma_data); 369 return ret; 370 } 371 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . 
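The warnings in the report above come from casting a pointer directly to a 64-bit integer on i386, where pointers are 32 bits wide. A common way to avoid the size-changing cast is to convert through a pointer-sized integer first and only then widen. The sketch below shows the idea in plain userland C. The helper names ptr_to_u64() and u64_to_ptr() are hypothetical, and this is not necessarily the fix applied to the HID code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a pointer into a 64-bit cookie.
 * uintptr_t has the same width as a pointer, so the first cast is
 * size-preserving; the widening to 64 bits is then integer to integer,
 * which does not trigger -Wpointer-to-int-cast.
 */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/*
 * Hypothetical helper: recover the pointer from a cookie produced by
 * ptr_to_u64() on the same machine.
 */
static inline void *u64_to_ptr(uint64_t v)
{
	/* The cookie must fit in a pointer-sized integer without truncation. */
	assert((uint64_t)(uintptr_t)v == v);
	return (void *)(uintptr_t)v;
}
```

The round trip is lossless on both 32-bit and 64-bit targets, because the value never holds more bits than a pointer on the machine that created it.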
Return-Path: Date: Fri, 28 Jun 2024 23:06:31 +0800 From: kernel test robot To: Benjamin Tissoires Cc: oe-kbuild-all@lists.linux.dev, Linux Memory Management List Subject: [linux-next:master 8744/9027] drivers/hid/hidraw.c:143:70: sparse: sparse: non size-preserving pointer to integer cast Message-ID: <202406282242.Fk738zzy-lkp@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201818 Newsgroups: org.kvack.linux-mm,dev.linux.lists.oe-kbuild-all Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master head: 642a16ca7994a50d7de85715996a8ce171a5bdfb commit: 67eccf151d76a9939ad8a50c6db5cb486b01df24 [8744/9027] HID: add source argument to HID low level functions config: i386-randconfig-r132-20240628 (https://download.01.org/0day-ci/archive/20240628/202406282242.Fk738zzy-lkp@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406282242.Fk738zzy-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406282242.Fk738zzy-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> drivers/hid/hidraw.c:143:70: sparse: sparse: non size-preserving pointer to integer cast drivers/hid/hidraw.c:154:63: sparse: sparse: non size-preserving pointer to integer cast drivers/hid/hidraw.c:231:63: sparse: sparse: non size-preserving pointer to integer cast vim +143 drivers/hid/hidraw.c 101 102 /* 103 * The first byte of the report buffer is expected to be a report number. 104 */ 105 static ssize_t hidraw_send_report(struct file *file, const char __user *buffer, size_t count, unsigned char report_type) 106 { 107 unsigned int minor = iminor(file_inode(file)); 108 struct hid_device *dev; 109 __u8 *buf; 110 int ret = 0; 111 112 lockdep_assert_held(&minors_rwsem); 113 114 if (!hidraw_table[minor] || !hidraw_table[minor]->exist) { 115 ret = -ENODEV; 116 goto out; 117 } 118 119 dev = hidraw_table[minor]->hid; 120 121 if (count > HID_MAX_BUFFER_SIZE) { 122 hid_warn(dev, "pid %d passed too large report\n", 123 task_pid_nr(current)); 124 ret = -EINVAL; 125 goto out; 126 } 127 128 if (count < 2) { 129 hid_warn(dev, "pid %d passed too short report\n", 130 task_pid_nr(current)); 131 ret = -EINVAL; 132 goto out; 133 } 134 135 buf = memdup_user(buffer, count); 136 if (IS_ERR(buf)) { 137 ret = PTR_ERR(buf); 138 goto out; 139 } 140 141 if ((report_type == HID_OUTPUT_REPORT) && 142 !(dev->quirks & HID_QUIRK_NO_OUTPUT_REPORTS_ON_INTR_EP)) { > 143 ret = __hid_hw_output_report(dev, buf, count, (__u64)file); 144 /* 145 * compatibility with old implementation of USB-HID and I2C-HID: 146 * if the device does not support receiving output reports, 147 * on an interrupt endpoint, fallback to SET_REPORT HID command. 
148 */ 149 if (ret != -ENOSYS) 150 goto out_free; 151 } 152 153 ret = __hid_hw_raw_request(dev, buf[0], buf, count, report_type, 154 HID_REQ_SET_REPORT, (__u64)file); 155 156 out_free: 157 kfree(buf); 158 out: 159 return ret; 160 } 161 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki . Return-Path: MIME-Version: 1.0 From: Yu Zhao Date: Fri, 28 Jun 2024 11:57:18 -0600 Message-ID: Subject: MCEs on MIPS: multiple matching TLB entries To: linux-mips@vger.kernel.org Cc: Linux-MM Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Xref: photonic.trudheim.com org.kvack.linux-mm:201824 Newsgroups: org.kvack.linux-mm,org.kernel.vger.linux-mips Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail Hi, OpenWrt folks ran into MCEs caused by multiple matching TLB entries [1], after they updated their kernel from v6.1 to v6.6. I reported similar crashes previously [2], on v6.4. So they asked me whether I'm aware of a fix from the mainline, which I am not. I took a quick look from the MM's POV and found nothing obviously wrong. I'm hoping they have better luck with the MIPS experts. Thanks! [1] https://github.com/openwrt/openwrt/pull/15635 [2] https://lore.kernel.org/linux-mm/CAOUHufbAjZd4Mxkio9OGct-TZ=L0QRG+_6Xa7atQVFN_4ez86w@mail.gmail.com/ Copying and pasting one of the crashes from OpenWrt: CFE for WNR3500L version: v1.0.36 Build Date: Tue Aug 11 15:09:14 CST 2009 Init Arena Init Devs. Boot partition size = 262144(0x40000) Found a 8MB ST compatible serial flash et0: Broadcom BCM47XX 10/100/1000 Mbps Ethernet Controller 5.10.56.28 CPU type 0x19740: 453MHz Tot mem: 65536 KBytes Device eth0: hwaddr 00-FF-FF-FF-FF-FF, ipaddr 192.168.1.1, mask 255.255.255.0 gateway not set, nameserver not set too long file. LZMA boot failed Loader:raw Filesys:raw Dev:flash0.os File: Options:(null) Loading: .. 3808 bytes read Entry at 0x80001000 Closing network. 
Starting program at 0x80001000 [ 0.000000] Linux version 6.6.35 (user@connors) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 13.3.0 r25518+987-f7a68458b4) 13.3.0, GNU ld (GNU Binutils) 2.42) #0 Sun Jun 23 09:14:12 2024 [ 0.000000] CPU0 revision is: 00019740 (MIPS 74Kc) [ 0.000000] bcm47xx: Using bcma bus [ 0.000000] (NULL device *): bus0: Found chip with id 0x4716, rev 0x01 and package 0x0A [ 0.000000] Initrd not found or empty - disabling initrd [ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes. [ 0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes [ 0.000000] This processor doesn't support highmem. -65536k highmem ignored [ 0.000000] Zone ranges: [ 0.000000] Normal [mem 0x0000000000000000-0x0000000003ffffff] [ 0.000000] HighMem empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000000000000-0x0000000003ffffff] [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000003ffffff] [ 0.000000] Kernel command line: noinitrd console=ttyS0,115200 [ 0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes, linear) [ 0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes, linear) [ 0.000000] Writing ErrCtl register=00000000 [ 0.000000] Readback ErrCtl register=00000000 [ 0.000000] Built 1 zonelists, mobility grouping on. 
Total pages: 16240 [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off [ 0.000000] Memory: 56672K/65536K available (5819K kernel code, 596K rwdata, 1244K rodata, 204K init, 297K bss, 8864K reserved, 0K cma-reserved, 0K highmem) [ 0.000000] NR_IRQS: 256 [ 0.000000] bcm47xx_soc: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x1F, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x11, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 2 found: GBit MAC (manuf 0x4BF, id 0x82D, rev 0x00, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 3 found: MIPS 74K (manuf 0x4A7, id 0x82C, rev 0x01, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 4 found: USB 2.0 Host (manuf 0x4BF, id 0x819, rev 0x04, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 5 found: PCIe (manuf 0x4BF, id 0x820, rev 0x0E, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 6 found: DDR1/DDR2 Memory Controller (manuf 0x4BF, id 0x82E, rev 0x01, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 7 found: Internal Memory (manuf 0x4BF, id 0x80E, rev 0x07, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Core 8 found: I2S (manuf 0x4BF, id 0x834, rev 0x00, class 0x0) [ 0.000000] bcm47xx_soc: bus0: Found M25P64 serial flash (size: 8192KiB, blocksize: 0x10000, blocks: 128) [ 0.000000] bcm47xx_soc: bus0: Early bus registered [ 0.000000] MIPS: machine is Netgear WNR3500L [ 0.000000] bcm47xx: Setting up vectored interrupts [ 0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 8438235966 ns [ 0.000003] sched_clock: 32 bits at 227MHz, resolution 4ns, wraps every 9481163773ns [ 0.009630] Calibrating delay loop... 226.09 BogoMIPS (lpj=1130496) [ 0.080067] pid_max: default: 32768 minimum: 301 [ 0.098070] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.098182] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.115630] RCU Tasks Trace: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1. 
[ 0.119327] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns [ 0.119449] futex hash table entries: 256 (order: -1, 3072 bytes, linear) [ 0.127277] NET: Registered PF_NETLINK/PF_ROUTE protocol family [ 0.147294] clocksource: Switched to clocksource MIPS [ 0.162062] NET: Registered PF_INET protocol family [ 0.162678] IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear) [ 0.164718] tcp_listen_portaddr_hash hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.167040] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear) [ 0.167138] TCP established hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.167258] TCP bind hash table entries: 1024 (order: 1, 8192 bytes, linear) [ 0.167386] TCP: Hash tables configured (established 1024 bind 1024) [ 0.168124] UDP hash table entries: 256 (order: 0, 4096 bytes, linear) [ 0.168379] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear) [ 0.169663] NET: Registered PF_UNIX/PF_LOCAL protocol family [ 0.169984] PCI: CLS 0 bytes, default 32 [ 0.201695] bcm47xx_soc: bus0: PCIEcore in host mode found [ 0.201712] bcm47xx_soc: bus0: This PCIE core is disabled and not working [ 0.203394] gpio gpiochip0: Static allocation of GPIO base is deprecated, use dynamic allocation. [ 0.204527] bcm47xx_soc: bus0: Bus registered [ 0.230753] workingset: timestamp_bits=14 max_order=14 bucket_order=0 [ 0.233286] squashfs: version 4.0 (2009/01/31) Phillip Lougher [ 0.233331] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc. 
[ 0.244147] Serial: 8250/16550 driver, 2 ports, IRQ sharing enabled [ 0.246592] printk: console [ttyS0] disabled [ 0.267632] serial8250.0: ttyS0 at MMIO 0xb8000300 (irq = 2, base_baud = 1250000) is a U6_16550A [ 0.267762] printk: console [ttyS0] enabled [ 0.793749] 3 bcm47xxpart partitions found on MTD device bcm47xxsflash [ 0.800497] Creating 3 MTD partitions on "bcm47xxsflash": [ 0.806143] 0x000000000000-0x000000040000 : "boot" [ 0.844873] 0x000000040000-0x0000007f0000 : "firmware" [ 0.853949] failed to parse "brcm,trx-magic" DT attribute, using default: -89 [ 0.861402] 3 trx partitions found on MTD device firmware [ 0.866918] Creating 3 MTD partitions on "firmware": [ 0.872045] 0x00000000001c-0x000000000928 : "loader" [ 0.877127] mtd: partition "loader" doesn't start on an erase/write block boundary -- force read-only [ 0.892292] 0x000000000928-0x00000024f800 : "linux" [ 0.897409] mtd: partition "linux" doesn't start on an erase/write block boundary -- force read-only [ 0.911927] 0x00000024f800-0x0000007b0000 : "rootfs" [ 0.917042] mtd: partition "rootfs" doesn't start on an erase/write block boundary -- force read-only [ 0.930529] mtd: setting mtd4 (rootfs) as root device [ 0.935780] 1 squashfs-split partitions found on MTD device rootfs [ 0.942210] 0x000000560000-0x0000007b0000 : "rootfs_data" [ 0.953347] 0x0000007f0000-0x000000800000 : "nvram" [ 0.976291] bgmac_bcma bcma0:2: Found PHY addr: 30 (NOREGS) [ 1.089641] b53_common: found switch: BCM53115, rev 8 [ 1.095225] bgmac_bcma bcma0:2: Support for Roboswitch not implemented [ 1.104192] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded [ 1.110909] bcm47xx-wdt bcm47xx-wdt.0: BCM47xx Watchdog Timer enabled (30 seconds) [ 1.121569] NET: Registered PF_INET6 protocol family [ 1.146114] Segment Routing with IPv6 [ 1.150302] In-situ OAM (IOAM) with IPv6 [ 1.154877] NET: Registered PF_PACKET protocol family [ 1.160253] 8021q: 802.1Q VLAN Support v1.8 [ 1.273304] VFS: Mounted root (squashfs filesystem) readonly on 
device 31:4. [ 1.282239] Freeing unused kernel image (initmem) memory: 204K [ 1.288306] This architecture does not have kernel memory protection. [ 1.294893] Run /sbin/init as init process [ 2.348011] init: Console is alive [ 2.352416] init: - watchdog - [ 3.465026] kmodloader: loading kernel modules from /etc/modules-boot.d/* [ 3.645823] usbcore: registered new interface driver usbfs [ 3.651824] usbcore: registered new interface driver hub [ 3.657587] usbcore: registered new device driver usb [ 3.678605] gpio_button_hotplug: loading out-of-tree module taints kernel. [ 3.722626] ehci-platform ehci-platform.0: EHCI Host Controller [ 3.728888] ehci-platform ehci-platform.0: new USB bus registered, assigned bus number 1 [ 3.737524] ehci-platform ehci-platform.0: irq 5, io mem 0x18004000 [ 3.767369] ehci-platform ehci-platform.0: USB 2.0 started, EHCI 1.00 [ 3.776631] hub 1-0:1.0: USB hub found [ 3.782776] hub 1-0:1.0: 2 ports detected [ 3.813642] ohci-platform ohci-platform.0: Generic Platform OHCI controller [ 3.820986] ohci-platform ohci-platform.0: new USB bus registered, assigned bus number 2 [ 3.829703] ohci-platform ohci-platform.0: irq 5, io mem 0x18009000 [ 3.903777] hub 2-0:1.0: USB hub found [ 3.909946] hub 2-0:1.0: 2 ports detected [ 3.940493] kmodloader: done loading kernel modules from /etc/modules-boot.d/* [ 3.959102] init: - preinit - [ 3.965715] Got mcheck at 800104ec [ 3.969240] CPU: 0 PID: 245 Comm: init Tainted: G O 6.6.35 #0 [ 3.976522] $ 0 : 00000000 00000001 fffd5000 00000001 [ 3.981900] $ 4 : 00000004 00000003 00026edf 7fc8c008 [ 3.987273] $ 8 : 00000000 00000001 0000001b 00000068 [ 3.992635] $12 : 7fc8d600 81c2376c 81c2370c ffffff00 [ 3.998007] $16 : 7fc8d000 8192dad0 00000000 806ed280 [ 4.003379] $20 : 8192dad0 00000001 807b1000 7fc8d7c0 [ 4.008751] $24 : 77d75000 ffffffff [ 4.014122] $28 : 81ad6000 81ad7dc8 00000000 80017678 [ 4.019486] Hi : 00000071 [ 4.022428] Lo : 0ceb0000 [ 4.025369] epc : 800104ec __kmap_pgprot+0xdc/0x108 [ 4.030547] ra 
: 80017678 r4k_flush_cache_page+0x24c/0x29c
[ 4.036430] Status: 1120a402 KERNEL EXL
[ 4.040455] Cause : 00800060 (ExcCode 18)
[ 4.044541] PrId : 00019740 (MIPS 74Kc)
[ 4.048543] CPU: 0 PID: 245 Comm: init Tainted: G O 6.6.35 #0
[ 4.055816] Stack : 00000000 00000001 807b1000 800625ac 00000000 00000004 00000000 00000000
[ 4.064399] 81ad7c8c 807b0000 80780000 8064d57c 81ac1528 00000001 81ad7c30 860f9307
[ 4.072985] 00000000 00000000 8064d57c 81ad7b70 ffffefff 00000000 00000000 ffffffea
[ 4.081562] 00000000 81ad7b7c 000000a9 806f5a88 00000000 8064d57c 00000000 806ed280
[ 4.090140] 8192dad0 00000001 807b1000 7fc8d7c0 00000018 80324b6c 89052010 00000060
[ 4.098718] ...
[ 4.101229] Call Trace:
[ 4.103730] [<80006fd0>] show_stack+0x28/0xf0
[ 4.108219] [<8057ec8c>] dump_stack_lvl+0x38/0x60
[ 4.113077] [<80008108>] do_mcheck+0x2c/0xa0
[ 4.117462] [<80003d34>] handle_mcheck_int+0x3c/0x48
[ 4.122547]
[ 4.124072] Index : 3
[ 4.126661] PageMask : 0
[ 4.129241] EntryHi : fffd4000
[ 4.132447] EntryLo0 : 00026edf
[ 4.135652] EntryLo1 : 00026edf
[ 4.138848] Wired : 4
[ 4.141430]
[ 4.142954] Index: 0 pgmask=4kb va=fffd4000 asid=00
[ 4.142954] [pa=007af000 c=3 d=1 v=1 g=1] [pa=007af000 c=3 d=1 v=1 g=1]
[ 4.154868] Index: 1 pgmask=4kb va=fffd2000 asid=00
[ 4.154868] [pa=0105b000 c=3 d=1 v=1 g=1] [pa=0105b000 c=3 d=1 v=1 g=1]
[ 4.166775] Index: 2 pgmask=4kb va=fffd0000 asid=00
[ 4.166775] [pa=01018000 c=3 d=1 v=1 g=1] [pa=01018000 c=3 d=1 v=1 g=1]
[ 4.178682] Index: 3 pgmask=4kb va=fffd4000 asid=00
[ 4.178682] [pa=009bb000 c=3 d=1 v=1 g=1] [pa=009bb000 c=3 d=1 v=1 g=1]
[ 4.190590] Index: 19 pgmask=4kb va=80026000 asid=00
[ 4.190590] [pa=00000000 c=0 d=0 v=0 g=0] [pa=00000000 c=0 d=0 v=0 g=0]
[ 4.202496] Index: 28 pgmask=4kb va=80038000 asid=00
[ 4.202496] [pa=00000000 c=0 d=0 v=0 g=0] [pa=00000000 c=0 d=0 v=0 g=0]
[ 4.214400] Index: 29 pgmask=4kb va=8003a000 asid=00
[ 4.214400] [pa=00000000 c=0 d=0 v=0 g=0] [pa=00000000 c=0 d=0 v=0 g=0]
[ 4.226312]
[ 4.227840] Code: 40843000 40850000 000000c0 <42000002> 000000c0 40875000 10600002 41606000 41606020
[ 4.237866] Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
[ 4.248920] Rebooting in 1 seconds..
[ 5.250906] bcm47xx: Please stand by while rebooting the system...
Decompressing..........done

From: Yang Shi
To: peterx@redhat.com, yangge1116@126.com, david@redhat.com, hch@infradead.org, akpm@linux-foundation.org
Cc: yang@os.amperecomputing.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [v3 linus-tree PATCH] mm: gup: stop abusing try_grab_folio
Date: Fri, 28 Jun 2024 12:14:58 -0700
Message-ID: <20240628191458.2605553-1-yang@os.amperecomputing.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
Xref: photonic.trudheim.com org.kvack.linux-mm:201827
Newsgroups: org.kvack.linux-mm,org.kernel.vger.stable
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

A kernel warning was reported when pinning a folio in CMA memory while
launching an SEV virtual machine. The splat looks like:

[ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[ 464.325515] Call Trace:
[ 464.325520]
[ 464.325523] ? __get_user_pages+0x423/0x520
[ 464.325528] ? __warn+0x81/0x130
[ 464.325536] ? __get_user_pages+0x423/0x520
[ 464.325541] ? report_bug+0x171/0x1a0
[ 464.325549] ? handle_bug+0x3c/0x70
[ 464.325554] ? exc_invalid_op+0x17/0x70
[ 464.325558] ? asm_exc_invalid_op+0x1a/0x20
[ 464.325567] ? __get_user_pages+0x423/0x520
[ 464.325575] __gup_longterm_locked+0x212/0x7a0
[ 464.325583] internal_get_user_pages_fast+0xfb/0x190
[ 464.325590] pin_user_pages_fast+0x47/0x60
[ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd]
[ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

Per the analysis done by yangge, starting the SEV virtual machine calls
pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the
page is in the CMA area, so fast GUP fails and falls back to the slow
path because of the longterm pinnable check in try_grab_folio().
The slow path tries to pin the pages and then migrate them out of the
CMA area. But the slow path also uses try_grab_folio() to pin the page,
so it fails on the same check and the above warning is triggered.

In addition, try_grab_folio() is supposed to be used in the fast path
only, where it elevates the folio refcount with add-ref-unless-zero. In
the slow path we are guaranteed to hold at least one stable reference,
so a simple atomic add can be used. The performance difference should
be trivial, but the misuse may be confusing and misleading.

Redefine try_grab_folio() as try_grab_folio_fast(), and try_grab_page()
as try_grab_folio(), and use each in the proper path. This fixes both
the abuse and the kernel warning.

The proper naming makes their use cases clearer and should prevent such
abuse in the future.

[1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/

Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Cc: [6.6+]
Reported-by: yangge
Signed-off-by: Yang Shi
---
 mm/gup.c         | 287 +++++++++++++++++++++++++----------------------
 mm/huge_memory.c |   2 +-
 mm/internal.h    |   4 +-
 3 files changed, 155 insertions(+), 138 deletions(-)

v3:
   1. Renamed the patch subject to make it clearer, per Peter
   2. Rephrased the commit log and elaborated on the function renaming,
      per Peter
   3. Fixed the comment from Christoph Hellwig
v2:
   1. Fixed the build warning
   2. Reworked the commit log to include the bug report and analysis
      (reworded by me) from yangge
   3. Rebased onto Linus's latest tree

diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..e65773ce4622 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -97,95 +97,6 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }
 
-/**
- * try_grab_folio() - Attempt to get or pin a folio.
- * @page: pointer to page to be grabbed
- * @refs: the value to (effectively) add to the folio's refcount
- * @flags: gup flags: these are the FOLL_* flag values.
- *
- * "grab" names in this file mean, "look at flags to decide whether to use
- * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
- *
- * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
- * same time. (That's true throughout the get_user_pages*() and
- * pin_user_pages*() APIs.) Cases:
- *
- * FOLL_GET: folio's refcount will be incremented by @refs.
- *
- * FOLL_PIN on large folios: folio's refcount will be incremented by
- * @refs, and its pincount will be incremented by @refs.
- *
- * FOLL_PIN on single-page folios: folio's refcount will be incremented by
- * @refs * GUP_PIN_COUNTING_BIAS.
- *
- * Return: The folio containing @page (with refcount appropriately
- * incremented) for success, or NULL upon failure. If neither FOLL_GET
- * nor FOLL_PIN was set, that's considered failure, and furthermore,
- * a likely bug in the caller, so a warning is also emitted.
- */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
-{
-	struct folio *folio;
-
-	if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0))
-		return NULL;
-
-	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
-		return NULL;
-
-	if (flags & FOLL_GET)
-		return try_get_folio(page, refs);
-
-	/* FOLL_PIN is set */
-
-	/*
-	 * Don't take a pin on the zero page - it's not going anywhere
-	 * and it is used in a *lot* of places.
-	 */
-	if (is_zero_page(page))
-		return page_folio(page);
-
-	folio = try_get_folio(page, refs);
-	if (!folio)
-		return NULL;
-
-	/*
-	 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
-	 * right zone, so fail and let the caller fall back to the slow
-	 * path.
-	 */
-	if (unlikely((flags & FOLL_LONGTERM) &&
-		     !folio_is_longterm_pinnable(folio))) {
-		if (!put_devmap_managed_folio_refs(folio, refs))
-			folio_put_refs(folio, refs);
-		return NULL;
-	}
-
-	/*
-	 * When pinning a large folio, use an exact count to track it.
-	 *
-	 * However, be sure to *also* increment the normal folio
-	 * refcount field at least once, so that the folio really
-	 * is pinned. That's why the refcount from the earlier
-	 * try_get_folio() is left intact.
-	 */
-	if (folio_test_large(folio))
-		atomic_add(refs, &folio->_pincount);
-	else
-		folio_ref_add(folio,
-				refs * (GUP_PIN_COUNTING_BIAS - 1));
-	/*
-	 * Adjust the pincount before re-checking the PTE for changes.
-	 * This is essentially a smp_mb() and is paired with a memory
-	 * barrier in folio_try_share_anon_rmap_*().
-	 */
-	smp_mb__after_atomic();
-
-	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
-
-	return folio;
-}
-
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
@@ -203,58 +114,59 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 }
 
 /**
- * try_grab_page() - elevate a page's refcount by a flag-dependent amount
- * @page: pointer to page to be grabbed
- * @flags: gup flags: these are the FOLL_* flag values.
+ * try_grab_folio() - add a folio's refcount by a flag-dependent amount
+ * @folio: pointer to folio to be grabbed
+ * @refs: the value to (effectively) add to the folio's refcount
+ * @flags: gup flags: these are the FOLL_* flag values
  *
  * This might not do anything at all, depending on the flags argument.
  *
  * "grab" names in this file mean, "look at flags to decide whether to use
- * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
  *
  * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
- * time. Cases: please see the try_grab_folio() documentation, with
- * "refs=1".
+ * time.
  *
  * Return: 0 for success, or if no action was required (if neither FOLL_PIN
  * nor FOLL_GET was set, nothing is done). A negative error code for failure:
  *
- *   -ENOMEM		FOLL_GET or FOLL_PIN was set, but the page could not
+ *   -ENOMEM		FOLL_GET or FOLL_PIN was set, but the folio could not
  *			be grabbed.
+ *
+ * It is called when we have a stable reference for the folio, typically in
+ * GUP slow path.
  */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_folio(struct folio *folio, int refs,
+				unsigned int flags)
 {
-	struct folio *folio = page_folio(page);
-
 	if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
 		return -ENOMEM;
 
-	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(&folio->page)))
 		return -EREMOTEIO;
 
 	if (flags & FOLL_GET)
-		folio_ref_inc(folio);
+		folio_ref_add(folio, refs);
 	else if (flags & FOLL_PIN) {
 		/*
 		 * Don't take a pin on the zero page - it's not going anywhere
 		 * and it is used in a *lot* of places.
 		 */
-		if (is_zero_page(page))
+		if (is_zero_folio(folio))
 			return 0;
 
 		/*
-		 * Similar to try_grab_folio(): be sure to *also*
-		 * increment the normal page refcount field at least once,
+		 * Increment the normal page refcount field at least once,
 		 * so that the page really is pinned.
 		 */
 		if (folio_test_large(folio)) {
-			folio_ref_add(folio, 1);
-			atomic_add(1, &folio->_pincount);
+			folio_ref_add(folio, refs);
+			atomic_add(refs, &folio->_pincount);
 		} else {
-			folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+			folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
 		}
 
-		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 	}
 
 	return 0;
@@ -535,7 +447,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  */
 static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
 		       unsigned long addr, unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
+		       struct page **pages, int *nr, bool fast)
 {
 	unsigned long pte_end;
 	struct page *page;
@@ -558,9 +470,15 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 	page = pte_page(pte);
 	refs = record_subpages(page, sz, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
+	if (fast) {
+		folio = try_grab_folio_fast(page, refs, flags);
+		if (!folio)
+			return 0;
+	} else {
+		folio = page_folio(page);
+		if (try_grab_folio(folio, refs, flags))
+			return 0;
+	}
 
 	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
 		gup_put_folio(folio, refs, flags);
@@ -588,7 +506,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 		      unsigned long addr, unsigned int pdshift,
 		      unsigned long end, unsigned int flags,
-		      struct page **pages, int *nr)
+		      struct page **pages, int *nr, bool fast)
 {
 	pte_t *ptep;
 	unsigned long sz = 1UL << hugepd_shift(hugepd);
@@ -598,7 +516,8 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	do {
 		next = hugepte_addr_end(addr, end, sz);
-		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr,
+				  fast);
 		if (ret != 1)
 			return ret;
 	} while (ptep++, addr = next, addr != end);
@@ -625,7 +544,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
 	ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
-			 flags, &page, &nr);
+			 flags, &page, &nr, false);
 	spin_unlock(ptl);
 
 	if (ret == 1) {
@@ -642,7 +561,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 			     unsigned long addr, unsigned int pdshift,
 			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+			     struct page **pages, int *nr, bool fast)
 {
 	return 0;
 }
@@ -729,7 +648,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 	    gup_must_unshare(vma, flags, page))
 		return ERR_PTR(-EMLINK);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 	else
@@ -806,7 +725,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			!PageAnonExclusive(page), page);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -968,8 +887,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 		       !PageAnonExclusive(page), page);
 
-	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
-	ret = try_grab_page(page, flags);
+	/* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (unlikely(ret)) {
 		page = ERR_PTR(ret);
 		goto out;
@@ -1233,7 +1152,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(entry);
 	}
-	ret = try_grab_page(*page, gup_flags);
+	ret = try_grab_folio(page_folio(*page), 1, gup_flags);
 	if (unlikely(ret))
 		goto unmap;
 out:
@@ -1636,20 +1555,19 @@ static long __get_user_pages(struct mm_struct *mm,
 			 * pages.
 			 */
 			if (page_increm > 1) {
-				struct folio *folio;
+				struct folio *folio = page_folio(page);
 
 				/*
 				 * Since we already hold refcount on the
 				 * large folio, this should never fail.
 				 */
-				folio = try_grab_folio(page, page_increm - 1,
-						       foll_flags);
-				if (WARN_ON_ONCE(!folio)) {
+				if (try_grab_folio(folio, page_increm - 1,
+						   foll_flags)) {
 					/*
 					 * Release the 1st page ref if the
 					 * folio is problematic, fail hard.
 					 */
-					gup_put_folio(page_folio(page), 1,
+					gup_put_folio(folio, 1,
 						      foll_flags);
 					ret = -EFAULT;
 					goto out;
@@ -2797,6 +2715,101 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_GUP_FAST
+/**
+ * try_grab_folio_fast() - Attempt to get or pin a folio in fast path.
+ * @page: pointer to page to be grabbed
+ * @refs: the value to (effectively) add to the folio's refcount
+ * @flags: gup flags: these are the FOLL_* flag values.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ * FOLL_GET: folio's refcount will be incremented by @refs.
+ *
+ * FOLL_PIN on large folios: folio's refcount will be incremented by
+ * @refs, and its pincount will be incremented by @refs.
+ *
+ * FOLL_PIN on single-page folios: folio's refcount will be incremented by
+ * @refs * GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: The folio containing @page (with refcount appropriately
+ * incremented) for success, or NULL upon failure. If neither FOLL_GET
+ * nor FOLL_PIN was set, that's considered failure, and furthermore,
+ * a likely bug in the caller, so a warning is also emitted.
+ *
+ * It uses add ref unless zero to elevate the folio refcount and must be called
+ * in fast path only.
+ */
+static struct folio *try_grab_folio_fast(struct page *page, int refs,
+					 unsigned int flags)
+{
+	struct folio *folio;
+
+	/* Raise warn if it is not called in fast GUP */
+	VM_WARN_ON_ONCE(!irqs_disabled());
+
+	if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0))
+		return NULL;
+
+	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+		return NULL;
+
+	if (flags & FOLL_GET)
+		return try_get_folio(page, refs);
+
+	/* FOLL_PIN is set */
+
+	/*
+	 * Don't take a pin on the zero page - it's not going anywhere
+	 * and it is used in a *lot* of places.
+	 */
+	if (is_zero_page(page))
+		return page_folio(page);
+
+	folio = try_get_folio(page, refs);
+	if (!folio)
+		return NULL;
+
+	/*
+	 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+	 * right zone, so fail and let the caller fall back to the slow
+	 * path.
+	 */
+	if (unlikely((flags & FOLL_LONGTERM) &&
+		     !folio_is_longterm_pinnable(folio))) {
+		if (!put_devmap_managed_folio_refs(folio, refs))
+			folio_put_refs(folio, refs);
+		return NULL;
+	}
+
+	/*
+	 * When pinning a large folio, use an exact count to track it.
+	 *
+	 * However, be sure to *also* increment the normal folio
+	 * refcount field at least once, so that the folio really
+	 * is pinned. That's why the refcount from the earlier
+	 * try_get_folio() is left intact.
+	 */
+	if (folio_test_large(folio))
+		atomic_add(refs, &folio->_pincount);
+	else
+		folio_ref_add(folio,
+				refs * (GUP_PIN_COUNTING_BIAS - 1));
+	/*
+	 * Adjust the pincount before re-checking the PTE for changes.
+	 * This is essentially a smp_mb() and is paired with a memory
+	 * barrier in folio_try_share_anon_rmap_*().
+	 */
+	smp_mb__after_atomic();
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+
+	return folio;
+}
 
 /*
  * Used in the GUP-fast path to determine whether GUP is permitted to work on
@@ -2962,7 +2975,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio)
 			goto pte_unmap;
 
@@ -3049,7 +3062,7 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
 			break;
 		}
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio) {
 			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
 			break;
@@ -3138,7 +3151,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = pmd_page(orig);
 	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3182,7 +3195,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = pud_page(orig);
 	refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3222,7 +3235,7 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig);
 	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3276,7 +3289,8 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 			 * pmd format and THP pmd format
 			 */
 			if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr,
-				       PMD_SHIFT, next, flags, pages, nr) != 1)
+				       PMD_SHIFT, next, flags, pages, nr,
+				       true) != 1)
 				return 0;
 		} else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags,
 					       pages, nr))
@@ -3306,7 +3320,8 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
 			return 0;
 		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
 			if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr,
-				       PUD_SHIFT, next, flags, pages, nr) != 1)
+				       PUD_SHIFT, next, flags, pages, nr,
+				       true) != 1)
 				return 0;
 		} else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags,
 					       pages, nr))
@@ -3333,7 +3348,8 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
 		BUILD_BUG_ON(p4d_leaf(p4d));
 		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
 			if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr,
-				       P4D_SHIFT, next, flags, pages, nr) != 1)
+				       P4D_SHIFT, next, flags, pages, nr,
+				       true) != 1)
 				return 0;
 		} else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
 					       pages, nr))
@@ -3362,7 +3378,8 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
 			return;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr,
-				       PGDIR_SHIFT, next, flags, pages, nr) != 1)
+				       PGDIR_SHIFT, next, flags, pages, nr,
+				       true) != 1)
 				return;
 		} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
 					       pages, nr))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db7946a0a28c..2120f7478e55 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1331,7 +1331,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 
diff --git a/mm/internal.h b/mm/internal.h
index 6902b7dd8509..cc2c5e07fad3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1182,8 +1182,8 @@ int migrate_device_coherent_page(struct page *page);
 /*
  * mm/gup.c
  */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
-int __must_check try_grab_page(struct page *page, unsigned int flags);
+int __must_check try_grab_folio(struct folio *folio, int refs,
+				unsigned int flags);
 
 /*
  * mm/huge_memory.c
-- 
2.41.0

Date: Fri, 28 Jun 2024 20:59:54 +0000
Mime-Version: 1.0
Message-ID: <20240628205958.2845610-1-jiaqiyan@google.com>
Subject: [PATCH v7 0/4] Userspace controls soft-offline pages
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: jane.chu@oracle.com, ioworker0@gmail.com, muchun.song@linux.dev, akpm@linux-foundation.org, shuah@kernel.org, rdunlap@infradead.org, corbet@lwn.net, osalvador@suse.de, rientjes@google.com, duenwen@google.com, fvdl@google.com, ak@linux.intel.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, Jiaqi Yan
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
X-Loop: owner-majordomo@kvack.org
Xref: photonic.trudheim.com org.kvack.linux-mm:201833
Newsgroups: org.kvack.linux-mm,org.kernel.vger.linux-doc,org.kernel.vger.linux-kselftest
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

Correctable memory errors are very common on servers with a large
amount of memory. They are corrected by ECC, but leave users with two
pain points:
1. Correction usually happens on the fly and adds latency overhead.
2. A not-fully-proven theory states that excessive correctable memory
   errors can develop into uncorrectable memory errors.

Soft offline is the kernel's additional solution for memory pages with
(excessive) corrected memory errors. The impacted page is migrated to a
healthy page if it is in use, and the original page is then discarded
from any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft offline dissolves the HugeTLB page, whether in use or free, into
chunks of 4K pages, reducing the HugeTLB pool capacity by 1 hugepage.
If userspace has not acknowledged such behavior, it may be surprised
when a later mmap() of hugepages returns MAP_FAILED due to a lack of
hugepages. A transparent hugepage is split into 4K pages as well, and
userspace stops enjoying the transparent performance benefit.

In addition, discarding an entire 1G HugeTLB page only because of
corrected memory errors is very costly, and the kernel had better not
do it under the hood. But today there are at least 2 such cases:
1. The GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. The RAS Correctable Errors Collector counts correctable errors per
   PFN, and the counter for a PFN reaches its threshold.
In both cases, userspace has no control over the soft offline performed
by the kernel's memory failure recovery.

This patch series gives userspace control over soft-offlining any page:
the kernel only soft offlines a raw page / transparent hugepage /
HugeTLB hugepage if userspace has agreed to it. The interface to
userspace is a new sysctl called enable_soft_offline under
/proc/sys/vm. By default enable_soft_offline is 1, to preserve the
existing kernel behavior.

Changelog

v6 => v7
* incorporate feedback from Miaohe Lin and David Rientjes
* remove PFN value from enable_soft_offline log
* save/restore enable_soft_offline in run_vmtests.sh
* v7 is based on commit 7c89bdbd3778 ("khugepaged: simplify the
  allocation of slab caches")

v5 => v6:
* incorporate feedback from Miaohe Lin
* add a ':' in soft offline log.
* close hugetlbfs file descriptor in selftest.
* no need to "return" after ksft_exit_fail_msg.

v4 => v5:
* incorporate feedback from Muhammad Usama Anjum
* refactor selftest to use what is available in kselftest.h

v3 => v4:
* incorporate feedback from Miaohe Lin, Andrew Morton, and Oscar
  Salvador.
* insert a refactor commit to unify soft offline's logs to follow the
  "Soft offline: 0x${pfn}: ${message}" format.
* reword the documentation: fail => will not perform.
* v4 is still based on commit 83a7eefedc9b ("Linux 6.10-rc3"),
  akpm/mm-stable.

v2 => v3:
* incorporate feedback from Miaohe Lin, Lance Yang, Oscar Salvador,
  and David Rientjes.
* release potential refcount if enable_soft_offline is 0.
* soft_offline_page() returns EOPNOTSUPP if enable_soft_offline is 0.
* refactor hugetlb-soft-offline.c, for example, introduce
  test_soft_offline_common to reduce repeated code.
* rewrite enable_soft_offline's documentation, add more details about
  the cost of soft-offline for transparent and hugetlb hugepages, and
  the components that are impacted when enable_soft_offline becomes 0.
* fix typos in commit messages.
* v3 is still based on commit 83a7eefedc9b ("Linux 6.10-rc3").

v1 => v2:
* incorporate feedback from both Miaohe Lin and Jane Chu.
* make the switch control all pages, instead of being HugeTLB specific.
* change the API from
  /sys/kernel/mm/hugepages/hugepages-${size}kB/softoffline_corrected_errors
  to /proc/sys/vm/enable_soft_offline.
* minor update to test code.
* update documentation of the user control API.
* v2 is based on commit 83a7eefedc9b ("Linux 6.10-rc3").
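
As an illustration of the interface described in the cover letter, a
userspace agent could check the sysctl before relying on soft offline.
The following is a minimal sketch under stated assumptions, not code
from this series: the helper names parse_enable_soft_offline() and
read_enable_soft_offline() are hypothetical, and only the path
/proc/sys/vm/enable_soft_offline and its 0/1 meaning come from the
cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Path named in the cover letter. */
#define SOFT_OFFLINE_SYSCTL "/proc/sys/vm/enable_soft_offline"

/*
 * Hypothetical helper: interpret the text read from the sysctl file.
 * Returns 1 (soft offline allowed), 0 (disabled by userspace), or -1
 * if the text is not exactly "0" or "1" plus optional trailing
 * whitespace.
 */
static int parse_enable_soft_offline(const char *text)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (errno || end == text)
		return -1;
	while (*end == '\n' || *end == ' ')
		end++;
	if (*end != '\0' || (val != 0 && val != 1))
		return -1;
	return (int)val;
}

/*
 * Hypothetical helper: read the current setting from @path.
 * Returns -1 when the file is missing (for example on a kernel
 * without this series) or cannot be parsed.
 */
static int read_enable_soft_offline(const char *path)
{
	char buf[16];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_enable_soft_offline(buf);
	fclose(f);
	return ret;
}
```

A caller would pass SOFT_OFFLINE_SYSCTL to read_enable_soft_offline()
and treat -1 as "unknown", which keeps the tool usable on kernels that
do not expose the sysctl.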

Jiaqi Yan (4):
  mm/memory-failure: refactor log format in soft offline code
  mm/memory-failure: userspace controls soft-offlining pages
  selftest/mm: test enable_soft_offline behaviors
  docs: mm: add enable_soft_offline sysctl

 Documentation/admin-guide/sysctl/vm.rst       |  32 +++
 mm/memory-failure.c                           |  37 ++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/hugetlb-soft-offline.c       | 228 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   6 +
 6 files changed, 297 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/mm/hugetlb-soft-offline.c

-- 
2.45.2.803.g4e1b14247a-goog

From: Roman Gushchin
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Roman Gushchin
Subject: [PATCH v1 0/9] mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1
Date: Fri, 28 Jun 2024 21:03:08 +0000
Message-ID: <20240628210317.272856-1-roman.gushchin@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Xref: photonic.trudheim.com org.kernel.vger.linux-kernel:1262693 org.kvack.linux-mm:201838
Newsgroups: org.kernel.vger.linux-kernel,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

This patchset puts all cgroup v1-specific members of struct mem_cgroup,
struct mem_cgroup_per_node and struct task_struct under the
CONFIG_MEMCG_V1 config option. If cgroup v1 support is not required
(which is true for many cgroup users these days), this saves a bit of
memory and compiles out some code, some of which is on relatively hot
paths. It also structures the code a bit better by grouping cgroup
v1-specific stuff in one place.
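
The guarding pattern the patchset describes can be sketched in plain C.
This is an illustrative model only, not code from the series: struct
demo_memcg and the demo_memcg1_*() helpers are hypothetical stand-ins
for the real kernel structures and functions, and CONFIG_MEMCG_V1 is
treated here as an ordinary preprocessor macro.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a structure with v1-only members. */
struct demo_memcg {
	unsigned long nr_pages;		/* needed by both v1 and v2 */
#ifdef CONFIG_MEMCG_V1
	unsigned long soft_limit;	/* v1-only: compiled out otherwise */
#endif
};

#ifdef CONFIG_MEMCG_V1
/* v1 build: helpers touch the v1-only member. */
static inline void demo_memcg1_init(struct demo_memcg *memcg)
{
	memcg->soft_limit = ULONG_MAX;
}

static inline unsigned long demo_memcg1_soft_limit(const struct demo_memcg *memcg)
{
	return memcg->soft_limit;
}
#else
/* Non-v1 build: stubs, so callers need no #ifdef of their own. */
static inline void demo_memcg1_init(struct demo_memcg *memcg)
{
	(void)memcg;
}

static inline unsigned long demo_memcg1_soft_limit(const struct demo_memcg *memcg)
{
	(void)memcg;
	return ULONG_MAX;	/* "no limit" when v1 is compiled out */
}
#endif

/* Common init path: v1 fields are initialized in one gathered helper. */
static inline void demo_memcg_init(struct demo_memcg *memcg)
{
	memcg->nr_pages = 0;
	demo_memcg1_init(memcg);
}
```

Building with and without -DCONFIG_MEMCG_V1 changes the size of the
structure, while the common code path in demo_memcg_init() stays free
of conditionals because the stubs absorb the difference.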

Roman Gushchin (9):
  mm: memcg: move memcg_account_kmem() to memcontrol-v1.c
  mm: memcg: factor out legacy socket memory accounting code
  mm: memcg: guard cgroup v1-specific code in mem_cgroup_print_oom_meminfo()
  mm: memcg: gather memcg1-specific fields initialization in memcg1_memcg_init()
  mm: memcg: guard memcg1-specific fields accesses in mm/memcontrol.c
  mm: memcg: put memcg1-specific struct mem_cgroup's members under CONFIG_MEMCG_V1
  mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node
  mm: memcg: put struct task_struct::memcg_in_oom under CONFIG_MEMCG_V1
  mm: memcg: put struct task_struct::in_user_fault under CONFIG_MEMCG_V1

 include/linux/memcontrol.h | 147 +++++++++++++++++++------------------
 include/linux/sched.h      |   6 +-
 mm/memcontrol-v1.c         |  38 ++++++++++
 mm/memcontrol-v1.h         |  20 +++++
 mm/memcontrol.c            |  70 +++++++-----------
 5 files changed, 164 insertions(+), 117 deletions(-)

-- 
2.45.2.803.g4e1b14247a-goog

From: Luis Chamberlain
To: p.raghav@samsung.com, hare@suse.de, kbusch@kernel.org, david@fromorbit.com, neilb@suse.de
Cc: mcgrof@kernel.org, gost.dev@samsung.com, linux-block@vger.kernel.org, linux-mm@kvack.org, patches@lists.linux.dev
Subject: [RFC] bdev: use bdev_io_min() for statx DIO min IO
Date: Fri, 28 Jun 2024 14:23:50 -0700
Message-ID: <20240628212350.3577766-1-mcgrof@kernel.org>
X-Mailing-List: linux-block@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: Luis Chamberlain
Xref: photonic.trudheim.com org.kernel.vger.linux-block:93658 org.kvack.linux-mm:201848
Newsgroups: org.kernel.vger.linux-block,dev.linux.lists.patches,org.kvack.linux-mm
Path: photonic.trudheim.com!nntp.lore.kernel.org!not-for-mail

We currently rely on the block device logical block size for the
offset alignment. While this *works*, it doesn't take performance into
account. That's exactly what the minimum_io_size attribute is for.

This would, for example, enhance DIO performance on 4k IU drives that
have an LBA format of 512 bytes, for both HDDs and NVMe.
Another use case is to ensure that DIO is done with 16k IOs on the
16k IU drives already on the market that have an LBA format of 4k or
512 bytes.

Signed-off-by: Luis Chamberlain
---
 block/bdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 1b4af2cc3b1e..5d0874aa8661 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1282,7 +1282,7 @@ void bdev_statx(struct inode *backing_inode, struct kstat *stat,
 
 	if (request_mask & STATX_DIOALIGN) {
 		stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
-		stat->dio_offset_align = bdev_logical_block_size(bdev);
+		stat->dio_offset_align = (unsigned int) bdev_io_min(bdev);
 		stat->result_mask |= STATX_DIOALIGN;
 	}
 
-- 
2.43.0
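
To make the effect on userspace concrete, here is a small sketch of how
an application might apply the offset alignment it gets back from
statx(). It is an illustration under stated assumptions, not part of
the patch: the helpers dio_round_up() and dio_request_is_aligned() are
hypothetical, and the values 512 and 4096 stand for a 512-byte logical
block size and a 4k minimum IO size as in the example from the commit
log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round @value up to the next multiple of @align.
 * Uses a modulo rather than a bit mask, so @align does not have to be
 * a power of two. An @align of 0 leaves @value unchanged.
 */
static uint64_t dio_round_up(uint64_t value, uint32_t align)
{
	uint64_t rem;

	if (align == 0)
		return value;
	rem = value % align;
	return rem ? value + (align - rem) : value;
}

/*
 * Hypothetical helper: check that both the file offset and the length
 * of a direct I/O request are multiples of @align. An @align of 0 is
 * treated here as "direct I/O not supported".
 */
static int dio_request_is_aligned(uint64_t offset, uint64_t length,
				  uint32_t align)
{
	if (align == 0)
		return 0;
	return offset % align == 0 && length % align == 0;
}
```

With an alignment of 512, a 512-byte request at offset 512 passes the
check. With an alignment of 4096, the same request does not, and an
application that honors the reported value would round it up to a full
4096-byte request instead.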