mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked() Write protect anon page faults require an accurate mapcount to decide if to break the COW or not. This is implemented in the THP path with reuse_swap_page() -> page_trans_huge_map_swapcount()/page_trans_huge_mapcount(). If the COW triggers while the other processes sharing the page are under a huge pmd split, to do an accurate reading, we must ensure the mapcount isn't computed while it's being transferred from the head page to the tail pages. reuse_swap_cache() already runs serialized by the page lock, so it's enough to add the page lock around __split_huge_pmd_locked too, in order to add the missing serialization. Note: the commit in "Fixes" is just to facilitate the backporting, because the code before such commit didn't try to do an accurate THP mapcount calculation and it instead used the page_count() to decide if to COW or not. Both the page_count and the pin_count are THP-wide refcounts, so they're inaccurate if used in reuse_swap_page(). Reverting such commit (besides the unrelated fix to the local anon_vma assignment) would have also opened the window for memory corruption side effects to certain workloads as documented in such commit header. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Jann Horn <jannh@google.com> Reported-by: Jann Horn <jannh@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit: c444eb564fb16645c172d550359cb3d75fe8a040 [log] [tgz]
author: Andrea Arcangeli <aarcange@redhat.com> Wed May 27 19:06:24 2020 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> Wed Jun 03 20:06:45 2020 -0700
tree: b3e752a8f1d385d0ab4802ea37bd6fd9458464db
parent: cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2 [diff]
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 11fe0b4..dddc863 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c

@@ -2385,6 +2385,8 @@
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	bool was_locked = false;
+	pmd_t _pmd;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
@@ -2397,11 +2399,32 @@
 	 * pmd against. Otherwise we can end up replacing wrong page.
 	 */
 	VM_BUG_ON(freeze && !page);
-	if (page && page != pmd_page(*pmd))
-	        goto out;
+	if (page) {
+		VM_WARN_ON_ONCE(!PageLocked(page));
+		was_locked = true;
+		if (page != pmd_page(*pmd))
+			goto out;
+	}
 
+repeat:
 	if (pmd_trans_huge(*pmd)) {
-		page = pmd_page(*pmd);
+		if (!page) {
+			page = pmd_page(*pmd);
+			if (unlikely(!trylock_page(page))) {
+				get_page(page);
+				_pmd = *pmd;
+				spin_unlock(ptl);
+				lock_page(page);
+				spin_lock(ptl);
+				if (unlikely(!pmd_same(*pmd, _pmd))) {
+					unlock_page(page);
+					put_page(page);
+					page = NULL;
+					goto repeat;
+				}
+				put_page(page);
+			}
+		}
 		if (PageMlocked(page))
 			clear_page_mlock(page);
 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
@@ -2409,6 +2432,8 @@
 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
 out:
 	spin_unlock(ptl);
+	if (!was_locked && page)
+		unlock_page(page);
 	/*
 	 * No need to double call mmu_notifier->invalidate_range() callback.
 	 * They are 3 cases to consider inside __split_huge_pmd_locked():
commit	c444eb564fb16645c172d550359cb3d75fe8a040	[log] [tgz]
author	Andrea Arcangeli <aarcange@redhat.com>	Wed May 27 19:06:24 2020 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	Wed Jun 03 20:06:45 2020 -0700
tree	b3e752a8f1d385d0ab4802ea37bd6fd9458464db
parent	cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2 [diff]