// SPDX-License-Identifier: GPL-2.0-only
/*
 * Suspend support specific for i386/x86-64.
 *
 * Copyright (c) 2007 Rafael J. Wysocki <rjw@sisk.pl>
 * Copyright (c) 2002 Pavel Machek <pavel@ucw.cz>
 * Copyright (c) 2001 Patrick Mochel <mochel@osdl.org>
 */

#include <linux/suspend.h>
#include <linux/export.h>
#include <linux/smp.h>
#include <linux/perf_event.h>
x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.
However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again. Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.
First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid. Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.
A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.
To prevent it from happening, temporarily change the smp_ops.play_dead
pointer during resume from hibernation so that it points to a special
"play dead" routine which uses hlt_play_dead() and avoids the
inadvertent "revivals" of "dead" CPUs this way.
A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases. It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
Original-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
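The pointer swap described in the commit message above can be pictured with a small user-space sketch. Everything in it is a stand-in: `struct cpu_ops`, the two `*_stub()` routines and `resume_with_safe_play_dead()` are hypothetical names, and the stubs only record which routine ran. It is an illustration of the save/replace/restore pattern, not the kernel's hibernation code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's smp_ops; only one hook is modelled. */
struct cpu_ops {
	void (*play_dead)(void);
};

static const char *last_play_dead;	/* records which routine ran last */

static void mwait_play_dead_stub(void) { last_play_dead = "mwait"; }
static void hlt_play_dead_stub(void)   { last_play_dead = "hlt"; }

/* The default hook is the MWAIT-based routine, as on Intel hardware. */
static struct cpu_ops ops = { .play_dead = mwait_play_dead_stub };

/*
 * Run the offlining step with the HLT-based routine installed, then put
 * the original pointer back, whether or not the step succeeded.
 */
static int resume_with_safe_play_dead(int (*offline_cpus)(void))
{
	void (*saved)(void) = ops.play_dead;
	int ret;

	ops.play_dead = hlt_play_dead_stub;
	ret = offline_cpus();
	ops.play_dead = saved;

	return ret;
}

/* Hypothetical offlining step: each "dead" CPU calls the current hook. */
static int offline_nonboot_cpus_stub(void)
{
	ops.play_dead();
	return 0;
}
```

Calling `resume_with_safe_play_dead(offline_nonboot_cpus_stub)` runs the HLT stub while the override is active and leaves `ops.play_dead` pointing at the MWAIT stub afterwards.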
#include <linux/tboot.h>

[PATCH] x86_64: Set up safe page tables during resume
The following patch makes swsusp avoid the possible temporary corruption
of page translation tables during resume on x86-64. This is achieved by
creating a copy of the relevant page tables that will not be modified by
swsusp and can be safely used by it on resume.
The problem is that during resume on x86-64 swsusp may temporarily
corrupt the page tables used for the direct mapping of RAM. If that
happens, a page fault occurs and cannot be handled properly, which leads
to the solid hang of the affected system. This leads to the loss of the
system's state from before suspend and may result in the loss of data or
the corruption of filesystems, so it is a serious issue. Also, it
appears to happen quite often (for me, as often as 50% of the time).
The problem is related to the fact that (at least) one of the PMD
entries used in the direct memory mapping (starting at PAGE_OFFSET)
points to a page table the physical address of which is much greater
than the physical address of the PMD entry itself. Moreover,
unfortunately, the physical address of the page table before suspend
(i.e. the one stored in the suspend image) happens to be different to
the physical address of the corresponding page table used during resume
(i.e. the one that is valid right before swsusp_arch_resume() in
arch/x86_64/kernel/suspend_asm.S is executed). Thus while the image is
restored, the "offending" PMD entry gets overwritten, so it does not
point to the right physical address any more (i.e. there's no page
table at the address pointed to by it, because it points to the address
the page table has been at during suspend). Consequently, if the PMD
entry is used later on, and it _is_ used in the process of copying the
image pages, a page fault occurs, but it cannot be handled in the normal
way and the system hangs.
In principle we can call create_resume_mapping() from
swsusp_arch_resume() (ie. from suspend_asm.S), but then the memory
allocations in create_resume_mapping(), resume_pud_mapping(), and
resume_pmd_mapping() must be made carefully so that we use _only_
NosaveFree pages in them (the other pages are overwritten by the loop in
swsusp_arch_resume()). Additionally, we are in atomic context at that
time, so we cannot use GFP_KERNEL. Moreover, if one of the allocations
fails, we should free all of the allocated pages, so we need to trace
them somehow.
All of this is done in the appended patch, except that the functions
populating the page tables are located in arch/x86_64/kernel/suspend.c
rather than in init.c. It may be done in a more elegant way in the
future, with the help of some swsusp patches that are in the works now.
[AK: move some externs into headers, renamed a function]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
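The commit message notes that every page allocated for the temporary mapping has to be tracked so that all of them can be freed if a later allocation fails. A minimal user-space sketch of that bookkeeping follows. The names `struct page_tracker`, `tracked_alloc()`, `tracked_free_all()` and `build_mapping()` are hypothetical, `calloc()` stands in for the kernel allocator, and the NosaveFree and atomic-context constraints from the message are not modelled.

```c
#include <stdlib.h>

#define MAX_TRACKED 16

/* Hypothetical tracker: remembers every page handed out so far. */
struct page_tracker {
	void *pages[MAX_TRACKED];
	size_t count;
};

/* Allocate one zeroed 4 KiB "page" and record it; NULL on failure. */
static void *tracked_alloc(struct page_tracker *t)
{
	void *page;

	if (t->count == MAX_TRACKED)
		return NULL;
	page = calloc(1, 4096);
	if (page)
		t->pages[t->count++] = page;
	return page;
}

/* Undo every allocation recorded so far. */
static void tracked_free_all(struct page_tracker *t)
{
	while (t->count > 0)
		free(t->pages[--t->count]);
}

/*
 * Build a mapping that needs `needed` pages; on any failure release
 * everything allocated up to that point and report the error.
 */
static int build_mapping(struct page_tracker *t, size_t needed)
{
	size_t i;

	for (i = 0; i < needed; i++) {
		if (!tracked_alloc(t)) {
			tracked_free_all(t);
			return -1;
		}
	}
	return 0;
}
```

A request that exceeds `MAX_TRACKED` fails part-way through and leaves the tracker empty, which is the property the commit message asks for.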
#include <asm/pgtable.h>
#include <asm/proto.h>
#include <asm/mtrr.h>
#include <asm/page.h>
#include <asm/mce.h>
#include <asm/suspend.h>
#include <asm/fpu/internal.h>
#include <asm/debugreg.h>
#include <asm/cpu.h>
#include <asm/mmu_context.h>
x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume
A bug was reported that on certain Broadwell platforms, after
resuming from S3, the CPU is running at an anomalously low
speed.
It turns out that the BIOS has modified the value of the
THERM_CONTROL register during S3, and changed it from 0 to 0x10,
thus enabled clock modulation(bit4), but with undefined CPU Duty
Cycle(bit1:3) - which causes the problem.
Here is a simple scenario to reproduce the issue:
1. Boot up the system
2. Get MSR 0x19a, it should be 0
3. Put the system into sleep, then wake it up
4. Get MSR 0x19a, it shows 0x10, while it should be 0
Although some BIOSen want to change the CPU Duty Cycle during
S3, in our case we don't want the BIOS to do any modification.
Fix this issue by introducing a more generic x86 framework to
save/restore specified MSR registers(THERM_CONTROL in this case)
for suspend/resume. This allows us to fix similar bugs in a much
simpler way in the future.
When the kernel wants to protect certain MSRs during suspending,
we simply add a quirk entry in msr_save_dmi_table, and customize
the MSR registers inside the quirk callback, for example:
u32 msr_id_need_to_save[] = {MSR_ID0, MSR_ID1, MSR_ID2...};
and the quirk mechanism ensures that, once resumed from suspend,
the MSRs indicated by these IDs will be restored to their
original, pre-suspend values.
Since both 64-bit and 32-bit kernels are affected, this patch
covers the common 64/32-bit suspend/resume code path. And
because the MSRs specified by the user might not be available or
readable in any situation, we use rdmsrl_safe() to safely save
these MSRs.
Reported-and-tested-by: Marcin Kaszewski <marcin.kaszewski@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: len.brown@intel.com
Cc: linux@horizon.com
Cc: luto@kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/c9abdcbc173dd2f57e8990e304376f19287e92ba.1448382971.git.yu.c.chen@intel.com
[ More edits to the naming of data structures. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
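The save/restore scheme from the commit message above (read each listed MSR with a fault-tolerant accessor, remember whether the read worked, and write back only the entries that were read successfully) can be tried out in user space. In this sketch the MSR file is simulated: `struct fake_msr`, `fake_rdmsr_safe()`, `fake_wrmsr()`, `save_all()` and `restore_all()` are hypothetical, and the second register ID is an arbitrary placeholder that pretends to be unreadable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated MSR: ID, current value, and whether reads succeed. */
struct fake_msr {
	uint32_t id;
	uint64_t value;
	bool readable;
};

static struct fake_msr fake_msrs[] = {
	{ 0x19a, 0x0,  true  },	/* the register named in the commit message */
	{ 0x123, 0x55, false },	/* placeholder ID; pretend reads fault */
};

#define NUM_FAKE (sizeof(fake_msrs) / sizeof(fake_msrs[0]))

static struct fake_msr *find_msr(uint32_t id)
{
	size_t i;

	for (i = 0; i < NUM_FAKE; i++)
		if (fake_msrs[i].id == id)
			return &fake_msrs[i];
	return NULL;
}

/* Same convention as rdmsrl_safe(): 0 on success, nonzero on failure. */
static int fake_rdmsr_safe(uint32_t id, uint64_t *val)
{
	struct fake_msr *m = find_msr(id);

	if (!m || !m->readable)
		return -1;
	*val = m->value;
	return 0;
}

static void fake_wrmsr(uint32_t id, uint64_t val)
{
	struct fake_msr *m = find_msr(id);

	if (m)
		m->value = val;
}

/* One saved slot per MSR, with a flag saying whether the save worked. */
struct saved_entry {
	uint32_t id;
	uint64_t value;
	bool valid;
};

static void save_all(struct saved_entry *e, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		e[i].valid = !fake_rdmsr_safe(e[i].id, &e[i].value);
}

/* Only entries whose save succeeded are written back. */
static void restore_all(const struct saved_entry *e, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (e[i].valid)
			fake_wrmsr(e[i].id, e[i].value);
}
```

If something changes register 0x19a to 0x10 between `save_all()` and `restore_all()`, the restore puts it back to 0, while the unreadable register is never written.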
#include <linux/dmi.h>

#ifdef CONFIG_X86_32
__visible unsigned long saved_context_ebx;
__visible unsigned long saved_context_esp, saved_context_ebp;
__visible unsigned long saved_context_esi, saved_context_edi;
__visible unsigned long saved_context_eflags;
#endif
struct saved_context saved_context;

static void msr_save_context(struct saved_context *ctxt)
{
	struct saved_msr *msr = ctxt->saved_msrs.array;
	struct saved_msr *end = msr + ctxt->saved_msrs.num;

	while (msr < end) {
		msr->valid = !rdmsrl_safe(msr->info.msr_no, &msr->info.reg.q);
		msr++;
	}
}

static void msr_restore_context(struct saved_context *ctxt)
{
	struct saved_msr *msr = ctxt->saved_msrs.array;
	struct saved_msr *end = msr + ctxt->saved_msrs.num;

	while (msr < end) {
		if (msr->valid)
			wrmsrl(msr->info.msr_no, msr->info.reg.q);
		msr++;
	}
}

/**
 * __save_processor_state - save CPU registers before creating a
 *	hibernation image and before restoring the memory state from it
 * @ctxt - structure to store the registers contents in
 *
 * NOTE: If there is a CPU register the modification of which by the
 * boot kernel (ie. the kernel used for loading the hibernation image)
 * might affect the operations of the restored target kernel (ie. the one
 * saved in the hibernation image), then its contents must be saved by this
 * function. In other words, if kernel A is hibernated and different
 * kernel B is used for loading the hibernation image into memory, the
 * kernel A's __save_processor_state() function must save all registers
 * needed by kernel A, so that it can operate correctly after the resume
 * regardless of what kernel B does in the meantime.
 */
static void __save_processor_state(struct saved_context *ctxt)
{
#ifdef CONFIG_X86_32
	mtrr_save_fixed_ranges(NULL);
#endif
	kernel_fpu_begin();

	/*
	 * descriptor tables
	 */
	store_idt(&ctxt->idt);

	/*
	 * We save it here, but restore it only in the hibernate case.
	 * For ACPI S3 resume, this is loaded via 'early_gdt_desc' in 64-bit
	 * mode in "secondary_startup_64". In 32-bit mode it is done via
	 * 'pmode_gdt' in wakeup_start.
	 */
	ctxt->gdt_desc.size = GDT_SIZE - 1;
	ctxt->gdt_desc.address = (unsigned long)get_cpu_gdt_rw(smp_processor_id());

	store_tr(ctxt->tr);

	/* XMM0..XMM15 should be handled by kernel_fpu_begin(). */
	/*
	 * segment registers
	 */
#ifdef CONFIG_X86_32_LAZY_GS
	savesegment(gs, ctxt->gs);
#endif
#ifdef CONFIG_X86_64
	savesegment(gs, ctxt->gs);
	savesegment(fs, ctxt->fs);
	savesegment(ds, ctxt->ds);
	savesegment(es, ctxt->es);

	rdmsrl(MSR_FS_BASE, ctxt->fs_base);
	rdmsrl(MSR_GS_BASE, ctxt->kernelmode_gs_base);
	rdmsrl(MSR_KERNEL_GS_BASE, ctxt->usermode_gs_base);
	mtrr_save_fixed_ranges(NULL);

	rdmsrl(MSR_EFER, ctxt->efer);
#endif

	/*
	 * control registers
	 */
	ctxt->cr0 = read_cr0();
	ctxt->cr2 = read_cr2();
	ctxt->cr3 = __read_cr3();
	ctxt->cr4 = __read_cr4();
#ifdef CONFIG_X86_64
	ctxt->cr8 = read_cr8();
#endif
	ctxt->misc_enable_saved = !rdmsrl_safe(MSR_IA32_MISC_ENABLE,
					       &ctxt->misc_enable);
	msr_save_context(ctxt);
}

/* Needed by apm.c */
void save_processor_state(void)
{
	__save_processor_state(&saved_context);
	x86_platform.save_sched_clock_state();
}
#ifdef CONFIG_X86_32
EXPORT_SYMBOL(save_processor_state);
#endif

static void do_fpu_end(void)
{
	/*
	 * Restore FPU regs if necessary.
	 */
	kernel_fpu_end();
}

static void fix_processor_context(void)
{
	int cpu = smp_processor_id();
#ifdef CONFIG_X86_64
	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
	tss_desc tss;
#endif

	/*
	 * We need to reload TR, which requires that we change the
	 * GDT entry to indicate "available" first.
	 *
	 * XXX: This could probably all be replaced by a call to
	 * force_reload_TR().
	 */
	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);

#ifdef CONFIG_X86_64
	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
	tss.type = 0x9; /* The available 64-bit TSS (see AMD vol 2, pg 91) */
	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);

	syscall_init();				/* This sets MSR_*STAR and related */
#else
	if (boot_cpu_has(X86_FEATURE_SEP))
		enable_sep_cpu();
#endif
	load_TR_desc();				/* This does ltr */
	load_mm_ldt(current->active_mm);	/* This does lldt */
	initialize_tlbstate_and_flush();

	fpu__resume_cpu();

	/* The processor is back on the direct GDT, load back the fixmap */
	load_fixmap_gdt(cpu);
}
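
The comment in fix_processor_context() says the GDT entry has to be marked "available" before TR can be reloaded. As a rough illustration of that rule, the sketch below models only the 4-bit descriptor type, assuming the usual x86 encoding in which 0x9 is an available TSS and 0xB a busy one. `struct fake_tss_desc`, `fake_ltr()` and `reload_tr()` are hypothetical names; a real `ltr` raises a fault rather than returning an error code.

```c
#include <stdint.h>

#define TSS_TYPE_AVAILABLE 0x9
#define TSS_TYPE_BUSY      0xB

/* Hypothetical, reduced descriptor: only the type field is modelled. */
struct fake_tss_desc {
	uint8_t type;
};

/*
 * Models the LTR rule: loading succeeds only for an "available" TSS and
 * marks it busy as a side effect. Returns 0 on success, -1 where real
 * hardware would fault.
 */
static int fake_ltr(struct fake_tss_desc *d)
{
	if (d->type != TSS_TYPE_AVAILABLE)
		return -1;
	d->type = TSS_TYPE_BUSY;
	return 0;
}

/* Reset the type to "available" first, then reload, as the code above does. */
static int reload_tr(struct fake_tss_desc *d)
{
	d->type = TSS_TYPE_AVAILABLE;
	return fake_ltr(d);
}
```

A second `fake_ltr()` on the same descriptor fails because the first one left it busy, while `reload_tr()` succeeds because it resets the type beforehand.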

/**
 * __restore_processor_state - restore the contents of CPU registers saved
 *	by __save_processor_state()
 * @ctxt - structure to load the registers contents from
 *
 * The asm code that gets us here will have restored a usable GDT, although
 * it will be pointing to the wrong alias.
 */
static void notrace __restore_processor_state(struct saved_context *ctxt)
{
	if (ctxt->misc_enable_saved)
		wrmsrl(MSR_IA32_MISC_ENABLE, ctxt->misc_enable);
	/*
	 * control registers
	 */
	/* cr4 was introduced in the Pentium CPU */
#ifdef CONFIG_X86_32
	if (ctxt->cr4)
		__write_cr4(ctxt->cr4);
#else
/* CONFIG_X86_64 */
	wrmsrl(MSR_EFER, ctxt->efer);
	write_cr8(ctxt->cr8);
	__write_cr4(ctxt->cr4);
#endif
	write_cr3(ctxt->cr3);
	write_cr2(ctxt->cr2);
	write_cr0(ctxt->cr0);

	/* Restore the IDT. */
	load_idt(&ctxt->idt);

	/*
	 * Just in case the asm code got us here with the SS, DS, or ES
	 * out of sync with the GDT, update them.
	 */
	loadsegment(ss, __KERNEL_DS);
	loadsegment(ds, __USER_DS);
	loadsegment(es, __USER_DS);

	/*
	 * Restore percpu access. Percpu access can happen in exception
	 * handlers or in complicated helpers like load_gs_index().
	 */
#ifdef CONFIG_X86_64
	wrmsrl(MSR_GS_BASE, ctxt->kernelmode_gs_base);
#else
	loadsegment(fs, __KERNEL_PERCPU);
	loadsegment(gs, __KERNEL_STACK_CANARY);
#endif

	/* Restore the TSS, RO GDT, LDT, and usermode-relevant MSRs. */
x86/power: Fix some ordering bugs in __restore_processor_context()
__restore_processor_context() had a couple of ordering bugs. It
restored GSBASE after calling load_gs_index(), and the latter can
call into tracing code. It also tried to restore segment registers
before restoring the LDT, which is straight-up wrong.
Reorder the code so that we restore GSBASE, then the descriptor
tables, then the segments.
This fixes two bugs. First, it fixes a regression that broke resume
under certain configurations due to irqflag tracing in
native_load_gs_index(). Second, it fixes resume when the userspace
process that initiated suspend had funny segments. The latter can be
reproduced by compiling this:
// SPDX-License-Identifier: GPL-2.0
/*
* ldt_echo.c - Echo argv[1] while using an LDT segment
*/
int main(int argc, char **argv)
{
int ret;
size_t len;
char *buf;
const struct user_desc desc = {
.entry_number = 0,
.base_addr = 0,
.limit = 0xfffff,
.seg_32bit = 1,
.contents = 0, /* Data, grow-up */
.read_exec_only = 0,
.limit_in_pages = 1,
.seg_not_present = 0,
.useable = 0
};
if (argc != 2)
errx(1, "Usage: %s STRING", argv[0]);
len = asprintf(&buf, "%s\n", argv[1]);
if (len < 0)
errx(1, "Out of memory");
ret = syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
if (ret < -1)
errno = -ret;
if (ret)
err(1, "modify_ldt");
asm volatile ("movw %0, %%es" :: "rm" ((unsigned short)7));
write(1, buf, len);
return 0;
}
and running ldt_echo >/sys/power/mem
Without the fix, the latter causes a triple fault on resume.
Fixes: ca37e57bbe0c ("x86/entry/64: Add missing irqflags tracing to native_load_gs_index()")
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/6b31721ea92f51ea839e79bd97ade4a75b1eeea2.1512057304.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
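The reproducer in the commit message loads selector 7 into ES. Under the standard x86 selector layout (bits 0-1 requested privilege level, bit 2 table indicator, bits 3-15 index) that value refers to entry 0 of the LDT, which is why the LDT has to be back in place before the segment registers are reloaded. The sketch below decodes a selector and models that ordering constraint; `decode_selector()` and `fake_load_segment()` are hypothetical helpers, and the failure is reported as a return value where real hardware would fault.

```c
#include <stdbool.h>
#include <stdint.h>

struct selector_fields {
	unsigned int rpl;	/* bits 0-1: requested privilege level */
	bool uses_ldt;		/* bit 2: table indicator, 1 = LDT, 0 = GDT */
	unsigned int index;	/* bits 3-15: descriptor index */
};

static struct selector_fields decode_selector(uint16_t sel)
{
	struct selector_fields f = {
		.rpl = sel & 0x3,
		.uses_ldt = (sel >> 2) & 0x1,
		.index = sel >> 3,
	};

	return f;
}

/*
 * Hypothetical model of the resume-time constraint: a selector that
 * points into the LDT can only be loaded once the LDT has been restored.
 * Returns 0 on success, -1 where real hardware would fault.
 */
static int fake_load_segment(uint16_t sel, bool ldt_restored)
{
	if (decode_selector(sel).uses_ldt && !ldt_restored)
		return -1;
	return 0;
}
```

With `ldt_restored` false, loading selector 7 fails, while a GDT selector such as 0x2b loads either way. Restoring the descriptor tables first and the segments afterwards removes that failure case.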
	fix_processor_context();

	/*
	 * Now that we have descriptor tables fully restored and working
	 * exception handling, restore the usermode segments.
	 */
#ifdef CONFIG_X86_64
	loadsegment(ds, ctxt->ds);
	loadsegment(es, ctxt->es);
	loadsegment(fs, ctxt->fs);
	load_gs_index(ctxt->gs);

2017-11-30 23:57:57 +08:00
|
|
|
/*
|
2017-12-15 05:19:07 +08:00
|
|
|
* Restore FSBASE and GSBASE after restoring the selectors, since
|
|
|
|
* restoring the selectors clobbers the bases. Keep in mind
|
|
|
|
* that MSR_KERNEL_GS_BASE is horribly misnamed.
|
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
wrmsrl(MSR_FS_BASE, ctxt->fs_base);
|
2017-12-15 05:19:07 +08:00
|
|
|
wrmsrl(MSR_KERNEL_GS_BASE, ctxt->usermode_gs_base);
|
|
|
|
#elif defined(CONFIG_X86_32_LAZY_GS)
|
|
|
|
loadsegment(gs, ctxt->gs);
|
2009-04-28 06:27:05 +08:00
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
do_fpu_end();
|
2016-12-13 21:14:17 +08:00
|
|
|
tsc_verify_tsc_adjust(true);
|
2012-04-02 00:53:36 +08:00
|
|
|
x86_platform.restore_sched_clock_state();
|
2009-08-20 09:05:36 +08:00
|
|
|
mtrr_bp_restore();
|
2013-03-15 21:26:07 +08:00
|
|
|
perf_restore_debug_store();
|
x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume
A bug was reported that on certain Broadwell platforms, after
resuming from S3, the CPU is running at an anomalously low
speed.
It turns out that the BIOS has modified the value of the
THERM_CONTROL register during S3, changing it from 0 to 0x10.
This enables clock modulation (bit 4) but leaves the CPU duty
cycle (bits 1:3) undefined, which causes the problem.
Here is a simple scenario to reproduce the issue:
1. Boot up the system
2. Get MSR 0x19a, it should be 0
3. Put the system into sleep, then wake it up
4. Get MSR 0x19a, it shows 0x10, while it should be 0
Although some BIOSen want to change the CPU Duty Cycle during
S3, in our case we don't want the BIOS to do any modification.
Fix this issue by introducing a more generic x86 framework to
save/restore specified MSR registers(THERM_CONTROL in this case)
for suspend/resume. This allows us to fix similar bugs in a much
simpler way in the future.
When the kernel wants to protect certain MSRs during suspending,
we simply add a quirk entry in msr_save_dmi_table, and customize
the MSR registers inside the quirk callback, for example:
u32 msr_id_need_to_save[] = {MSR_ID0, MSR_ID1, MSR_ID2...};
and the quirk mechanism ensures that, once resumed from suspend,
the MSRs indicated by these IDs will be restored to their
original, pre-suspend values.
Since both 64-bit and 32-bit kernels are affected, this patch
covers the common 64/32-bit suspend/resume code path. And
because the MSRs specified by the user might not be available or
readable in every situation, we use rdmsrl_safe() to safely save
these MSRs.
Reported-and-tested-by: Marcin Kaszewski <marcin.kaszewski@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: len.brown@intel.com
Cc: linux@horizon.com
Cc: luto@kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/c9abdcbc173dd2f57e8990e304376f19287e92ba.1448382971.git.yu.c.chen@intel.com
[ More edits to the naming of data structures. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-25 01:03:41 +08:00
|
|
|
msr_restore_context(ctxt);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-04-28 06:27:05 +08:00
|
|
|
/* Needed by apm.c */
|
2014-06-25 08:58:26 +08:00
|
|
|
void notrace restore_processor_state(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
__restore_processor_state(&saved_context);
|
|
|
|
}
|
2009-04-28 06:27:05 +08:00
|
|
|
#ifdef CONFIG_X86_32
|
|
|
|
EXPORT_SYMBOL(restore_processor_state);
|
|
|
|
#endif
|
2012-11-14 03:32:42 +08:00
|
|
|
|
|
|
|
#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
|
|
|
|
static void resume_play_dead(void)
|
|
|
|
{
|
|
|
|
play_dead_common();
|
|
|
|
tboot_shutdown(TB_SHUTDOWN_WFS);
|
|
|
|
hlt_play_dead();
|
|
|
|
}
|
|
|
|
|
|
|
|
int hibernate_resume_nonboot_cpu_disable(void)
|
|
|
|
{
|
|
|
|
void (*play_dead)(void) = smp_ops.play_dead;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
|
|
|
|
* during hibernate image restoration, because it is likely that the
|
|
|
|
* monitored address will be actually written to at that time and then
|
|
|
|
* the "dead" CPU will attempt to execute instructions again, but the
|
|
|
|
* address in its instruction pointer may not be possible to resolve
|
|
|
|
* any more at that point (the page tables used by it previously may
|
|
|
|
* have been overwritten by hibernate image data).
|
x86/power: Fix 'nosmt' vs hibernation triple fault during resume
As explained in
0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid the first MCE bringing the system to its knees.
That means that whenever 'nosmt' is supplied on the kernel command line,
all the HT siblings are as a result sitting in mwait or cpuidle after
going through the online-offline cycle at least once.
This causes a serious issue though when a kernel, which saw 'nosmt' on its
command line, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on an address
which is no longer valid.
That results in triple fault shortly after cr3 is switched, and machine
reboots.
Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.
Symmetrically, the resumed kernel has to push the SMT siblings to mwait
again in case it has SMT disabled; this means it has to online all
the siblings when resuming (so that they come out of hlt) and offline
them again to let them reach mwait.
Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-30 06:09:39 +08:00
|
|
|
*
|
|
|
|
* First, make sure that we wake up all the potentially disabled SMT
|
|
|
|
* threads which have been initially brought up and then put into
|
|
|
|
* mwait/cpuidle sleep.
|
|
|
|
* Those will be put to proper (not interfering with hibernation
|
|
|
|
* resume) sleep afterwards, and the resumed kernel will decide itself
|
|
|
|
* what to do with them.
|
|
|
|
*/
|
|
|
|
ret = cpuhp_smt_enable();
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
smp_ops.play_dead = resume_play_dead;
|
|
|
|
ret = disable_nonboot_cpus();
|
|
|
|
smp_ops.play_dead = play_dead;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
#endif
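The temporary override of smp_ops.play_dead above can be illustrated in user space. This is a minimal sketch, not kernel code: sketch_ops, default_play_dead(), resume_play_dead_sketch() and take_cpus_offline() are hypothetical stand-ins for smp_ops, the default "play dead" routine, resume_play_dead() and disable_nonboot_cpus(). It only shows the save, swap, call, restore pattern.

```c
/* Hypothetical stand-in for the kernel's smp_ops structure. */
struct smp_ops_sketch {
	void (*play_dead)(void);
};

/* Records which routine ran last: 1 = default, 2 = resume variant. */
static int last_called;

static void default_play_dead(void)
{
	last_called = 1;	/* would use MONITOR/MWAIT in the real kernel */
}

static void resume_play_dead_sketch(void)
{
	last_called = 2;	/* would end up in hlt_play_dead() in the real kernel */
}

static struct smp_ops_sketch sketch_ops = { .play_dead = default_play_dead };

/* Stand-in for disable_nonboot_cpus(): every non-boot "CPU" plays dead. */
static int take_cpus_offline(int ncpus)
{
	for (int cpu = 1; cpu < ncpus; cpu++)
		sketch_ops.play_dead();
	return 0;
}

/* Save the current hook, install the override, then always restore it. */
static int offline_with_override(int ncpus)
{
	void (*saved)(void) = sketch_ops.play_dead;
	int ret;

	sketch_ops.play_dead = resume_play_dead_sketch;
	ret = take_cpus_offline(ncpus);
	sketch_ops.play_dead = saved;

	return ret;
}
```

Restoring the saved pointer unconditionally, even when the offlining step fails, keeps the override confined to the one call that needs it.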
|
|
|
|
|
2012-11-14 03:32:42 +08:00
|
|
|
/*
|
|
|
|
* When bsp_check() is called in hibernate and suspend, cpu hotplug
|
|
|
|
* is disabled already. So it's unnecessary to handle race condition between
|
|
|
|
* cpumask query and cpu hotplug.
|
|
|
|
*/
|
|
|
|
static int bsp_check(void)
|
|
|
|
{
|
|
|
|
if (cpumask_first(cpu_online_mask) != 0) {
|
|
|
|
pr_warn("CPU0 is offline.\n");
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bsp_pm_callback(struct notifier_block *nb, unsigned long action,
|
|
|
|
void *ptr)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
switch (action) {
|
|
|
|
case PM_SUSPEND_PREPARE:
|
|
|
|
case PM_HIBERNATION_PREPARE:
|
|
|
|
ret = bsp_check();
|
|
|
|
break;
|
2012-11-14 03:32:51 +08:00
|
|
|
#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
|
|
|
|
case PM_RESTORE_PREPARE:
|
|
|
|
/*
|
|
|
|
* When system resumes from hibernation, online CPU0 because
|
|
|
|
* 1. it's required for resume and
|
|
|
|
* 2. the CPU was online before hibernation
|
|
|
|
*/
|
|
|
|
if (!cpu_online(0))
|
|
|
|
_debug_hotplug_cpu(0, 1);
|
|
|
|
break;
|
|
|
|
case PM_POST_RESTORE:
|
|
|
|
/*
|
|
|
|
* When a resume really happens, this code won't be called.
|
|
|
|
*
|
|
|
|
* This code is called only when user space hibernation software
|
|
|
|
* prepares for snapshot device during boot time. So we just
|
|
|
|
* call _debug_hotplug_cpu() to restore to CPU0's state prior to
|
|
|
|
* preparing the snapshot device.
|
|
|
|
*
|
|
|
|
* This works for normal boot case in our CPU0 hotplug debug
|
|
|
|
* mode, i.e. CPU0 is offline and user mode hibernation
|
|
|
|
* software initializes during boot time.
|
|
|
|
*
|
|
|
|
* If CPU0 is online and user application accesses snapshot
|
|
|
|
* device after boot time, this will offline CPU0 and user may
|
|
|
|
* see different CPU0 state before and after accessing
|
|
|
|
* the snapshot device. But hopefully this is not the case when a
|
|
|
|
* user is debugging CPU0 hotplug. Even if users hit this case,
|
|
|
|
* they can easily online CPU0 back.
|
|
|
|
*
|
|
|
|
* To simplify this debug code, we only consider normal boot
|
|
|
|
* case. Otherwise we need to remember CPU0's state and restore
|
|
|
|
* to that state and resolve racy conditions etc.
|
|
|
|
*/
|
|
|
|
_debug_hotplug_cpu(0, 0);
|
|
|
|
break;
|
|
|
|
#endif
|
2012-11-14 03:32:42 +08:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return notifier_from_errno(ret);
|
|
|
|
}
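bsp_pm_callback() returns its errno through notifier_from_errno(). The sketch below re-creates that encoding in user space to show how a negative errno survives the round trip through a notifier return value. The sketch_ names and constants are local to this example and are assumed to mirror the helpers in include/linux/notifier.h. They are not the kernel definitions themselves.

```c
#include <errno.h>

/* Local copies of the notifier return codes (assumed values). */
#define SKETCH_NOTIFY_OK	0x0001
#define SKETCH_NOTIFY_STOP_MASK	0x8000

/* Encode 0 or a negative errno into a notifier return value. */
static int sketch_notifier_from_errno(int err)
{
	if (err)
		return SKETCH_NOTIFY_STOP_MASK | (SKETCH_NOTIFY_OK - err);

	return SKETCH_NOTIFY_OK;
}

/* Decode a notifier return value back into 0 or a negative errno. */
static int sketch_notifier_to_errno(int ret)
{
	ret &= ~SKETCH_NOTIFY_STOP_MASK;
	return ret > SKETCH_NOTIFY_OK ? SKETCH_NOTIFY_OK - ret : 0;
}
```

With this encoding, a failing bsp_check() (-ENODEV) yields a value with the stop bit set, so the chain walk ends and the caller can recover the original error.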
|
|
|
|
|
|
|
|
static int __init bsp_pm_check_init(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Set this bsp_pm_callback as lower priority than
|
|
|
|
* cpu_hotplug_pm_callback. So cpu_hotplug_pm_callback will be called
|
|
|
|
* earlier to disable cpu hotplug before bsp online check.
|
|
|
|
*/
|
|
|
|
pm_notifier(bsp_pm_callback, -INT_MAX);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
core_initcall(bsp_pm_check_init);
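The effect of registering bsp_pm_callback with priority -INT_MAX can be sketched with a small priority-ordered list. This is a user-space illustration under stated assumptions: struct sketch_notifier and both helpers are hypothetical, and the priority 0 used for the hotplug callback in the usage note is an assumption rather than something taken from this file.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical notifier entry: higher priority entries are called first. */
struct sketch_notifier {
	int id;
	int priority;
	struct sketch_notifier *next;
};

/* Insert n before the first entry whose priority is lower than n's. */
static void sketch_register(struct sketch_notifier **head,
			    struct sketch_notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;

	n->next = *head;
	*head = n;
}

/* Write the ids in call order into ids[] and return how many were written. */
static size_t sketch_call_order(const struct sketch_notifier *head,
				int *ids, size_t max)
{
	size_t count = 0;

	for (; head && count < max; head = head->next)
		ids[count++] = head->id;

	return count;
}
```

Registering an entry with priority -INT_MAX and another with priority 0 places the priority 0 entry first regardless of registration order, which is the ordering the comment in bsp_pm_check_init() relies on.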
|
|
|
|
|
|
|
|
static int msr_init_context(const u32 *msr_id, const int total_num)
|
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
struct saved_msr *msr_array;
|
|
|
|
|
|
|
|
if (saved_context.saved_msrs.array || saved_context.saved_msrs.num > 0) {
|
|
|
|
pr_err("x86/pm: MSR quirk already applied, please check your DMI match table.\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
msr_array = kmalloc_array(total_num, sizeof(struct saved_msr), GFP_KERNEL);
|
|
|
|
if (!msr_array) {
|
|
|
|
pr_err("x86/pm: Can not allocate memory to save/restore MSRs during suspend.\n");
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < total_num; i++) {
|
|
|
|
msr_array[i].info.msr_no = msr_id[i];
|
|
|
|
msr_array[i].valid = false;
|
|
|
|
msr_array[i].info.reg.q = 0;
|
|
|
|
}
|
|
|
|
saved_context.saved_msrs.num = total_num;
|
|
|
|
saved_context.saved_msrs.array = msr_array;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
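The save/restore cycle that msr_init_context() prepares for can be sketched in user space as follows. This is only an illustration: fake_msr[], fake_rdmsr_safe() and fake_wrmsr() are hypothetical stand-ins for the hardware MSR space and the kernel accessors, and struct saved_msr_sketch is a simplified stand-in for the kernel's struct saved_msr. The point is that an MSR is restored only if the fallible read succeeded at save time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSR space: only ids 0..3 exist. */
#define FAKE_MSR_COUNT 4
static uint64_t fake_msr[FAKE_MSR_COUNT];

/* Fallible read: returns 0 on success, -1 if the MSR does not exist. */
static int fake_rdmsr_safe(uint32_t id, uint64_t *val)
{
	if (id >= FAKE_MSR_COUNT)
		return -1;

	*val = fake_msr[id];
	return 0;
}

static void fake_wrmsr(uint32_t id, uint64_t val)
{
	if (id < FAKE_MSR_COUNT)
		fake_msr[id] = val;
}

/* Simplified stand-in for the kernel's struct saved_msr. */
struct saved_msr_sketch {
	bool valid;
	uint32_t id;
	uint64_t value;
};

/* Before suspend: remember each MSR and whether the read worked. */
static void msr_save_sketch(struct saved_msr_sketch *msrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		msrs[i].valid = fake_rdmsr_safe(msrs[i].id, &msrs[i].value) == 0;
}

/* After resume: write back only the values that were saved successfully. */
static void msr_restore_sketch(const struct saved_msr_sketch *msrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (msrs[i].valid)
			fake_wrmsr(msrs[i].id, msrs[i].value);
}
```

In the Broadwell case described by the commit message, the saved value of THERM_CONTROL would be 0, so restoring it undoes the 0x10 written by the BIOS during S3.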
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The following section is a quirk framework for problematic BIOSen:
|
|
|
|
* Sometimes MSRs are modified by the BIOS after suspend to
|
|
|
|
* RAM, which might cause unexpected behavior after wakeup.
|
|
|
|
* Thus we save/restore these specified MSRs across suspend/resume
|
|
|
|
* in order to work around it.
|
|
|
|
*
|
|
|
|
* For any further problematic BIOSen/platforms,
|
|
|
|
* please add your own function similar to msr_initialize_bdw.
|
|
|
|
*/
|
|
|
|
static int msr_initialize_bdw(const struct dmi_system_id *d)
|
|
|
|
{
|
|
|
|
/* Add any extra MSR ids into this array. */
|
|
|
|
u32 bdw_msr_id[] = { MSR_IA32_THERM_CONTROL };
|
|
|
|
|
|
|
|
pr_info("x86/pm: %s detected, MSR saving is needed during suspending.\n", d->ident);
|
|
|
|
return msr_init_context(bdw_msr_id, ARRAY_SIZE(bdw_msr_id));
|
|
|
|
}
|
|
|
|
|
2017-09-14 17:59:30 +08:00
|
|
|
static const struct dmi_system_id msr_save_dmi_table[] = {
|
|
|
|
{
|
|
|
|
.callback = msr_initialize_bdw,
|
|
|
|
.ident = "BROADWELL BDX_EP",
|
|
|
|
.matches = {
|
|
|
|
DMI_MATCH(DMI_PRODUCT_NAME, "GRANTLEY"),
|
|
|
|
DMI_MATCH(DMI_PRODUCT_VERSION, "E63448-400"),
|
|
|
|
},
|
|
|
|
},
|
|
|
|
{}
|
|
|
|
};
|
|
|
|
|
|
|
|
static int pm_check_save_msr(void)
|
|
|
|
{
|
|
|
|
dmi_check_system(msr_save_dmi_table);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
device_initcall(pm_check_save_msr);
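The table walk that pm_check_save_msr() triggers can be approximated in user space. The sketch below is hypothetical throughout: struct sketch_system_id, sketch_check_system() and sketch_quirk() are made-up names, and the matching is reduced to two string fields compared by substring. It assumes that the real DMI matching also uses substring comparison and stops when a callback returns nonzero. It shows a table terminated by an empty entry and a callback invoked per matching row.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified quirk table entry. */
struct sketch_system_id {
	int (*callback)(const struct sketch_system_id *id);
	const char *ident;
	const char *product_name;	/* NULL marks the terminating entry */
	const char *product_version;
};

/* Counts how many times the quirk callback has run. */
static int sketch_hits;

static int sketch_quirk(const struct sketch_system_id *id)
{
	(void)id;
	sketch_hits++;
	return 0;	/* 0: keep scanning the rest of the table */
}

static const struct sketch_system_id sketch_table[] = {
	{ sketch_quirk, "BROADWELL BDX_EP", "GRANTLEY", "E63448-400" },
	{ NULL, NULL, NULL, NULL }
};

/* Run the callback of every matching entry and return the match count. */
static int sketch_check_system(const struct sketch_system_id *table,
			       const char *name, const char *version)
{
	int matches = 0;

	for (; table->product_name; table++) {
		if (!strstr(name, table->product_name) ||
		    !strstr(version, table->product_version))
			continue;

		matches++;
		if (table->callback && table->callback(table))
			break;
	}

	return matches;
}
```

Adding support for another platform then amounts to one more table row plus a callback, which is the extension path the quirk framework comment describes.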
|