Commit Graph

358 Commits

Author SHA1 Message Date
Thiago Jung Bauermann c8424e776b MODSIGN: Export module signature definitions
IMA will use the module_signature format for append signatures, so export
the relevant definitions and factor out the code which verifies that the
appended signature trailer is valid.

Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
and be able to use mod_check_sig() without having to depend on either
CONFIG_MODULE_SIG or CONFIG_MODULES.

s390 duplicated the definition of struct module_signature so now they can
use the new <linux/module_signature.h> header instead.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-08-05 18:39:56 -04:00
Christoph Hellwig 14c5cebad5 memremap: move from kernel/ to mm/
memremap.c implements MM functionality for ZONE_DEVICE, so it really
should be in the mm/ directory, not the kernel/ one.

Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:01 -07:00
Joel Fernandes (Google) f7b101d330 kheaders: Move from proc to sysfs
The kheaders archive consisting of the kernel headers used for compiling
bpf programs is in /proc. However there is concern that moving it here
will make it permanent. Let us move it to /sys/kernel as discussed [1].

[1] https://lore.kernel.org/patchwork/patch/1067310/#1265969

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 20:16:01 +02:00
Andrew Morton acb2ec3dd0 kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable
If the user downloads and applies patch-5.1.gz using patch(1), the x bit
on kernel/gen_ikh_data.sh is not set.

  /bin/sh: 1: ./kernel/gen_ikh_data.sh: Permission denied

Fix this by using CONFIG_SHELL.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Linus Torvalds cf482a49af Driver core/kobject patches for 5.2-rc1
Here is the "big" set of driver core patches for 5.2-rc1
 
 There are a number of ACPI patches in here as well, as Rafael said they
 should go through this tree due to the driver core changes they
 required.  They have all been acked by the ACPI developers.
 
 There are also a number of small subsystem-specific changes in here, due
 to some changes to the kobject core code.  Those too have all been acked
 by the various subsystem maintainers.
 
 As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXNHDbw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynDAgCfbb4LBR6I50wFXb8JM/R6cAS7qrsAn1unshKV
 8XCYcif2RxjtdJWXbjdm
 =/rLh
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core/kobject updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.2-rc1

  There are a number of ACPI patches in here as well, as Rafael said
  they should go through this tree due to the driver core changes they
  required. They have all been acked by the ACPI developers.

  There are also a number of small subsystem-specific changes in here,
  due to some changes to the kobject core code. Those too have all been
  acked by the various subsystem maintainers.

  As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
  kobject: clean up the kobject add documentation a bit more
  kobject: Fix kernel-doc comment first line
  kobject: Remove docstring reference to kset
  firmware_loader: Fix a typo ("syfs" -> "sysfs")
  kobject: fix dereference before null check on kobj
  Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
  init/config: Do not select BUILD_BIN2C for IKCONFIG
  Provide in-kernel headers to make extending kernel easier
  kobject: Improve doc clarity kobject_init_and_add()
  kobject: Improve docs for kobject_add/del
  driver core: platform: Fix the usage of platform device name(pdev->name)
  livepatch: Replace klp_ktype_patch's default_attrs with groups
  cpufreq: schedutil: Replace default_attrs field with groups
  padata: Replace padata_attr_type default_attrs field with groups
  irqdesc: Replace irq_kobj_type's default_attrs field with groups
  net-sysfs: Replace ktype default_attrs field with groups
  block: Replace all ktype default_attrs with groups
  samples/kobject: Replace foo_ktype's default_attrs field with groups
  kobject: Add support for default attribute groups to kobj_type
  driver core: Postpone DMA tear-down until after devres release for probe failure
  ...
2019-05-07 13:01:40 -07:00
Joel Fernandes (Google) 43d8ce9d65 Provide in-kernel headers to make extending kernel easier
Introduce in-kernel headers which are made available as an archive
through proc (/proc/kheaders.tar.xz file). This archive makes it
possible to run eBPF and other tracing programs that need to extend the
kernel for tracing purposes without any dependency on the file system
having headers.

A github PR is sent for the corresponding BCC patch at:
https://github.com/iovisor/bcc/pull/2312

On Android and embedded systems, it is common to switch kernels but not
have kernel headers available on the file system. Further once a
different kernel is booted, any headers stored on the file system will
no longer be useful. This is an issue even well known to distros.
By storing the headers as a compressed archive within the kernel, we can
avoid these issues that have been a hindrance for a long time.

The best way to use this feature is by building it in. Several users
have a need for this, when they switch debug kernels, they do not want to
update the filesystem or worry about it where to store the headers on
it. However, the feature is also buildable as a module in case the user
desires it not being part of the kernel image. This makes it possible to
load and unload the headers from memory on demand. A tracing program can
load the module, do its operations, and then unload the module to save
kernel memory. The total memory needed is 3.3MB.

By having the archive available at a fixed location independent of
filesystem dependencies and conventions, all debugging tools can
directly refer to the fixed location for the archive, without concerning
with where the headers on a typical filesystem which significantly
simplifies tooling that needs kernel headers.

The code to read the headers is based on /proc/config.gz code and uses
the same technique to embed the headers.

Other approaches were discussed such as having an in-memory mountable
filesystem, but that has drawbacks such as requiring an in-kernel xz
decompressor which we don't have today, and requiring usage of 42 MB of
kernel memory to host the decompressed headers at anytime. Also this
approach is simpler than such approaches.

Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-29 16:48:03 +02:00
Peter Zijlstra 40ea97290b x86/uaccess, kcov: Disable stack protector
New tooling noticed this mishap:

  kernel/kcov.o: warning: objtool: write_comp_data()+0x138: call to __stack_chk_fail() with UACCESS enabled
  kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc()+0xd9: call to __stack_chk_fail() with UACCESS enabled

All the other instrumentation (KASAN,UBSAN) also have stack protector
disabled.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 11:02:24 +02:00
Masahiro Yamada 13610aa908 kernel/configs: use .incbin directive to embed config_data.gz
This slightly optimizes the kernel/configs.c build.

bin2c is not very efficient because it converts a data file into a huge
array to embed it into a *.c file.

Instead, we can use the .incbin directive.

Also, this simplifies the code; Makefile is cleaner, and the way to get
the offset/size of the config_data.gz is more straightforward.

I used the "asm" statement in *.c instead of splitting it into *.S
because MODULE_* tags are not supported in *.S files.

I also cleaned up kernel/.gitignore; "config_data.gz" is unneeded
because the top-level .gitignore takes care of the "*.gz" pattern.

[yamada.masahiro@socionext.com: v2]
  Link: http://lkml.kernel.org/r/1550108893-21226-1-git-send-email-yamada.masahiro@socionext.com
Link: http://lkml.kernel.org/r/1549941160-8084-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07 18:32:02 -08:00
Masahiro Yamada ad77408635 kbuild: change filechk to surround the given command with { }
filechk_* rules often consist of multiple 'echo' lines. They must be
surrounded with { } or ( ) to work correctly. Otherwise, only the
string from the last 'echo' would be written into the target.

Let's take care of that in the 'filechk' in scripts/Kbuild.include
to clean up filechk_* rules.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-01-06 09:46:51 +09:00
Linus Torvalds b12a9124ee y2038: more syscalls and cleanups
This concludes the main part of the system call rework for 64-bit time_t,
 which has spread over most of year 2018, the last six system calls being
 
  - ppoll
  - pselect6
  - io_pgetevents
  - recvmmsg
  - futex
  - rt_sigtimedwait
 
 As before, nothing changes for 64-bit architectures, while 32-bit
 architectures gain another entry point that differs only in the layout
 of the timespec structure. Hopefully in the next release we can wire up
 all 22 of those system calls on all 32-bit architectures, which gives
 us a baseline version for glibc to start using them.
 
 This does not include the clock_adjtime, getrusage/waitid, and
 getitimer/setitimer system calls. I still plan to have new versions
 of those as well, but they are not required for correct operation of
 the C library since they can be emulated using the old 32-bit time_t
 based system calls.
 
 Aside from the system calls, there are also a few cleanups here,
 removing old kernel internal interfaces that have become unused after
 all references got removed. The arch/sh cleanups are part of this,
 there were posted several times over the past year without a reaction
 from the maintainers, while the corresponding changes made it into all
 other architectures.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJcHCCRAAoJEGCrR//JCVInkqsP/3TuLgSyQwolFRXcoBOjR1Ar
 JoX33GuDlAxHSqPadButVfflmRIWvL3aNMFFwcQM4uYgQ593FoHbmnusCdFgHcQ7
 Q13pGo7szbfEFxydhnDMVust/hxd5C9Y5zNSJ+eMLGLLJXosEyjd9YjRoHDROWal
 oDLqpPCArlLN1B1XFhjH8J847+JgS+hUrAfk3AOU0B2TuuFkBnRImlCGCR5JcgPh
 XIpHRBOgEMP4kZ3LjztPfS3v/XJeGrguRcbD3FsPKdPeYO9QRUiw0vahEQRr7qXL
 9hOgDq1YHPUQeUFhy3hJPCZdsDFzWoIE7ziNkZCZvGBw+qSw9i8KChGUt6PcSNlJ
 nqKJY5Wneb4svu+kOdK7d8ONbTdlVYvWf5bj/sKoNUA4BVeIjNcDXplvr3cXiDzI
 e40CcSQ3oLEvrIxMcoyNPPG63b+FYG8nMaCOx4dB4pZN7sSvZUO9a1DbDBtzxMON
 xy5Kfk1n5gIHcfBJAya5CnMQ1Jm4FCCu/LHVanYvb/nXA/2jEegSm24Md17icE/Q
 VA5jJqIdICExor4VHMsG0lLQxBJsv/QqYfT2OCO6Oykh28mjFqf+X+9Ctz1w6KVG
 VUkY1u97x8jB0M4qolGO7ZGn6P1h0TpNVFD1zDNcDt2xI63cmuhgKWiV2pv5b7No
 ty6insmmbJWt3tOOPyfb
 =yIAT
 -----END PGP SIGNATURE-----

Merge tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 updates from Arnd Bergmann:
 "More syscalls and cleanups

  This concludes the main part of the system call rework for 64-bit
  time_t, which has spread over most of year 2018, the last six system
  calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

  As before, nothing changes for 64-bit architectures, while 32-bit
  architectures gain another entry point that differs only in the layout
  of the timespec structure. Hopefully in the next release we can wire
  up all 22 of those system calls on all 32-bit architectures, which
  gives us a baseline version for glibc to start using them.

  This does not include the clock_adjtime, getrusage/waitid, and
  getitimer/setitimer system calls. I still plan to have new versions of
  those as well, but they are not required for correct operation of the
  C library since they can be emulated using the old 32-bit time_t based
  system calls.

  Aside from the system calls, there are also a few cleanups here,
  removing old kernel internal interfaces that have become unused after
  all references got removed. The arch/sh cleanups are part of this,
  there were posted several times over the past year without a reaction
  from the maintainers, while the corresponding changes made it into all
  other architectures"

* tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
  timekeeping: remove obsolete time accessors
  vfs: replace current_kernel_time64 with ktime equivalent
  timekeeping: remove timespec_add/timespec_del
  timekeeping: remove unused {read,update}_persistent_clock
  sh: remove board_time_init() callback
  sh: remove unused rtc_sh_get/set_time infrastructure
  sh: sh03: rtc: push down rtc class ops into driver
  sh: dreamcast: rtc: push down rtc class ops into driver
  y2038: signal: Add compat_sys_rt_sigtimedwait_time64
  y2038: signal: Add sys_rt_sigtimedwait_time32
  y2038: socket: Add compat_sys_recvmmsg_time64
  y2038: futex: Add support for __kernel_timespec
  y2038: futex: Move compat implementation into futex.c
  io_pgetevents: use __kernel_timespec
  pselect6: use __kernel_timespec
  ppoll: use __kernel_timespec
  signal: Add restore_user_sigmask()
  signal: Add set_user_sigmask()
2018-12-28 12:45:04 -08:00
Arnd Bergmann 04e7712f44 y2038: futex: Move compat implementation into futex.c
We are going to share the compat_sys_futex() handler between 64-bit
architectures and 32-bit architectures that need to deal with both 32-bit
and 64-bit time_t, and this is easier if both entry points are in the
same file.

In fact, most other system call handlers do the same thing these days, so
let's follow the trend here and merge all of futex_compat.c into futex.c.

In the process, a few minor changes have to be done to make sure everything
still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become
local symbol, and the compat version of the fetch_robust_entry() function
gets renamed to compat_fetch_robust_entry() to avoid a symbol clash.

This is intended as a purely cosmetic patch, no behavior should
change.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-12-07 22:19:07 +01:00
Richard Guy Briggs c8fc5d49c3 audit: remove WATCH and TREE config options
Remove the CONFIG_AUDIT_WATCH and CONFIG_AUDIT_TREE config options since
they are both dependent on CONFIG_AUDITSYSCALL and force
CONFIG_FSNOTIFY.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-19 16:29:50 -05:00
Alexander Popov afaef01c00 x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
The STACKLEAK feature (initially developed by PaX Team) has the following
benefits:

1. Reduces the information that can be revealed through kernel stack leak
   bugs. The idea of erasing the thread stack at the end of syscalls is
   similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
   crypto, which all comply with FDP_RIP.2 (Full Residual Information
   Protection) of the Common Criteria standard.

2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
   CVE-2010-2963). That kind of bugs should be killed by improving C
   compilers in future, which might take a long time.

This commit introduces the code filling the used part of the kernel
stack with a poison value before returning to userspace. Full
STACKLEAK feature also contains the gcc plugin which comes in a
separate commit.

The STACKLEAK feature is ported from grsecurity/PaX. More information at:
  https://grsecurity.net/
  https://pax.grsecurity.net/

This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.

Performance impact:

Hardware: Intel Core i7-4770, 16 GB RAM

Test #1: building the Linux kernel on a single core
        0.91% slowdown

Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
        4.2% slowdown

So the STACKLEAK description in Kconfig includes: "The tradeoff is the
performance impact: on a single CPU system kernel compilation sees a 1%
slowdown, other systems and workloads may vary and you are advised to
test this feature on your expected workload before deploying it".

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04 10:35:47 -07:00
Masahiro Yamada c417fbce98 kbuild: move bin2c back to scripts/ from scripts/basic/
Commit 8370edea81 ("bin2c: move bin2c in scripts/basic") moved bin2c
to the scripts/basic/ directory, incorrectly stating "Kexec wants to
use bin2c and it wants to use it really early in the build process.
See arch/x86/purgatory/ code in later patches."

Commit bdab125c93 ("Revert "kexec/purgatory: Add clean-up for
purgatory directory"") and commit d6605b6bbe ("x86/build: Remove
unnecessary preparation for purgatory") removed the redundant
purgatory build magic entirely.

That means that the move of bin2c was unnecessary in the first place.

fixdep is the only host program that deserves to sit in the
scripts/basic/ directory.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-07-18 01:18:05 +09:00
Christoph Hellwig cf65a0f6f6 dma-mapping: move all DMA mapping code to kernel/dma
Currently the code is split over various files with dma- prefixes in the
lib/ and drives/base directories, and the number of files keeps growing.
Move them into a single directory to keep the code together and remove
the file name prefixes.  To match the irq infrastructure this directory
is placed under the kernel/ directory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-14 08:50:37 +02:00
Linus Torvalds d82991a868 Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull restartable sequence support from Thomas Gleixner:
 "The restartable sequences syscall (finally):

  After a lot of back and forth discussion and massive delays caused by
  the speculative distraction of maintainers, the core set of
  restartable sequences has finally reached a consensus.

  It comes with the basic non disputed core implementation along with
  support for arm, powerpc and x86 and a full set of selftests

  It was exposed to linux-next earlier this week, so it does not fully
  comply with the merge window requirements, but there is really no
  point to drag it out for yet another cycle"

* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq/selftests: Provide Makefile, scripts, gitignore
  rseq/selftests: Provide parametrized tests
  rseq/selftests: Provide basic percpu ops test
  rseq/selftests: Provide basic test
  rseq/selftests: Provide rseq library
  selftests/lib.mk: Introduce OVERRIDE_TARGETS
  powerpc: Wire up restartable sequences system call
  powerpc: Add syscall detection for restartable sequences
  powerpc: Add support for restartable sequences
  x86: Wire up restartable sequence system call
  x86: Add support for restartable sequences
  arm: Wire up restartable sequences system call
  arm: Add syscall detection for restartable sequences
  arm: Add restartable sequences support
  rseq: Introduce restartable sequences system call
  uapi/headers: Provide types_32_64.h
2018-06-10 10:17:09 -07:00
Mathieu Desnoyers d7822b1e24 rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0                 2250.0         1.1
x86-64:               117.4                   98.0         1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                 128.5          5.8
x86-64:                53.4                  28.6          1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-06 11:58:31 +02:00
Dan Williams 5981690ddb memremap: split devm_memremap_pages() and memremap() infrastructure
Currently, kernel/memremap.c contains generic code for supporting
memremap() (CONFIG_HAS_IOMEM) and devm_memremap_pages()
(CONFIG_ZONE_DEVICE). This causes ongoing build maintenance problems as
additions to memremap.c, especially for the ZONE_DEVICE case, need to be
careful about being placed in ifdef guards. Remove the need for these
ifdef guards by moving the ZONE_DEVICE support functions to their own
compilation unit.

Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-15 23:08:33 -07:00
Masami Hiramatsu 4b1a29a7f5 error-injection: Support fault injection framework
Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.

Here is the result of test script described in
Documentation/fault-injection/fault-injection.txt

  ===========
  # ./test_fail_function.sh
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
  btrfs-progs v4.4
  See http://btrfs.wiki.kernel.org for more information.

  Label:              (null)
  UUID:               bfa96010-12e9-4360-aed0-42eec7af5798
  Node size:          16384
  Sector size:        4096
  Filesystem size:    1001.00MiB
  Block group profiles:
    Data:             single            8.00MiB
    Metadata:         DUP              58.00MiB
    System:           DUP              12.00MiB
  SSD detected:       no
  Incompat features:  extref, skinny-metadata
  Number of devices:  1
  Devices:
     ID        SIZE  PATH
      1  1001.00MiB  /dev/loop2

  mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
  SUCCESS!
  ===========

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Luis R. Rodriguez 0ce2c20293 kmod: move #ifdef CONFIG_MODULES wrapper to Makefile
The entire file is now conditionally compiled only when CONFIG_MODULES is
enabled, and this this is a bool.  Just move this conditional to the
Makefile as its easier to read this way.

Link: http://lkml.kernel.org/r/20170810180618.22457-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:51 -07:00
Luis R. Rodriguez 235586939d kmod: split out umh code into its own file
Patch series "kmod: few code cleanups to split out umh code"

The usermode helper has a provenance from the old usb code which first
required a usermode helper.  Eventually this was shoved into kmod.c and
the kernel's modprobe calls was converted over eventually to share the
same code.  Over time the list of usermode helpers in the kernel has grown
-- so kmod is just but one user of the API.

This series is a simple logical cleanup which acknowledges the code
evolution of the usermode helper and shoves the UMH API into its own
dedicated file.  This way users of the API can later just include umh.h
instead of kmod.h.

Note despite the diff state the first patch really is just a code shove,
no functional changes are done there.  I did use git format-patch -M to
generate the patch, but in the end the split was not enough for git to
consider it a rename hence the large diffstat.

I've put this through 0-day and it gives me their machine compilation
blessings with all tests as OK.

This patch (of 4):

There's a slew of usermode helper users and kmod is just one of them.
Split out the usermode helper code into its own file to keep the logic and
focus split up.

This change provides no functional changes.

Link: http://lkml.kernel.org/r/20170810180618.22457-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Mathieu Desnoyers 22e4ebb975 membarrier: Provide expedited private command
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as the
thread calling sys_membarrier. It executes faster than the non-expedited
variant (no blocking). It also works on NOHZ_FULL configurations.

Scheduler-wise, it requires a memory barrier before and after context
switching between processes (which have different mm). The memory
barrier before context switch is already present. For the barrier after
context switch:

* Our TSO archs can do RELEASE without being a full barrier. Look at
  x86 spin_unlock() being a regular STORE for example.  But for those
  archs, all atomics imply smp_mb and all of them have atomic ops in
  switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
  barrier.

* From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
  the rest does indeed do smp_mb(), so there the spin_unlock() is a full
  barrier and we're good.

* ARM64 has a very heavy barrier in switch_to(), which suffices.

* PPC just removed its barrier from switch_to(), but appears to be
  talking about adding something to switch_mm(). So add a
  smp_mb__after_unlock_lock() for now, until this is settled on the PPC
  side.

Changes since v3:
- Properly document the memory barriers provided by each architecture.

Changes since v2:
- Address comments from Peter Zijlstra,
- Add smp_mb__after_unlock_lock() after finish_lock_switch() in
  finish_task_switch() to add the memory barrier we need after storing
  to rq->curr. This is much simpler than the previous approach relying
  on atomic_dec_and_test() in mmdrop(), which actually added a memory
  barrier in the common case of switching between userspace processes.
- Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
  kernel, rather than having the whole membarrier system call returning
  -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
  Adapt the CMD_QUERY mask accordingly.

Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
  scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
  where we switch from a user-space thread is already handled by
  the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
  memory barrier.

CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dave Watson <davejwatson@fb.com>
2017-08-17 07:28:05 -07:00
Nicholas Piggin 05a4a95279 kernel/watchdog: split up config options
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

LOCKUP_DETECTOR implies the general boot, sysctl, and programming
interfaces for the lockup detectors.

An architecture that wants to use a hard lockup detector must define
HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
does not implement the LOCKUP_DETECTOR interfaces.

sparc is unusual in that it has started to implement some of the
interfaces, but not fully yet.  It should probably be converted to a full
HAVE_HARDLOCKUP_DETECTOR_ARCH.

[npiggin@gmail.com: fix]
  Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:02 -07:00
Hari Bathini 692f66f26a crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE
Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
reuse crashkernel parameter for fadump", v4.

Traditionally, kdump is used to save vmcore in case of a crash.  Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism.  Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.

This patchset removes dependency with CONFIG_KEXEC for crashkernel
parameter and vmcoreinfo related code as it can be reused without kexec
support.  Also, crashkernel parameter is reused instead of
fadump_reserve_mem to reserve memory for fadump.

The first patch moves crashkernel parameter parsing and vmcoreinfo
related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE.  The
second patch reuses the definitions of append_elf_note() & final_note()
functions under CONFIG_CRASH_CORE in IA64 arch code.  The third patch
removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
in powerpc.  The next patch reuses crashkernel parameter for reserving
memory for fadump, instead of the fadump_reserve_mem parameter.  This
has the advantage of using all syntaxes crashkernel parameter supports,
for fadump as well.  The last patch updates fadump kernel documentation
about use of crashkernel parameter.

This patch (of 5):

Traditionally, kdump is used to save vmcore in case of a crash.  Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism.  Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.

But currently, code related to vmcoreinfo and parsing of crashkernel
parameter is built under CONFIG_KEXEC_CORE.  This patch introduces
CONFIG_CRASH_CORE and moves the above mentioned code under this config,
allowing code reuse without dependency on CONFIG_KEXEC.  There is no
functional change with this patch.

Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:11 -07:00
Tejun Heo 201af4c0fa cgroup: move cgroup files under kernel/cgroup/
They're growing to be too many and planned to get split further.  Move
them under their own directory.

 kernel/cgroup.c		-> kernel/cgroup/cgroup.c
 kernel/cgroup_freezer.c	-> kernel/cgroup/freezer.c
 kernel/cgroup_pids.c		-> kernel/cgroup/pids.c
 kernel/cpuset.c		-> kernel/cgroup/cpuset.c

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:05 -05:00
Linus Torvalds a57cb1c1d7 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a few misc things

 - kexec updates

 - DMA-mapping updates to better support networking DMA operations

 - IPC updates

 - various MM changes to improve DAX fault handling

 - lots of radix-tree changes, mainly to the test suite. All leading up
   to reimplementing the IDA/IDR code to be a wrapper layer over the
   radix-tree. However the final trigger-pulling patch is held off for
   4.11.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  radix tree test suite: delete unused rcupdate.c
  radix tree test suite: add new tag check
  radix-tree: ensure counts are initialised
  radix tree test suite: cache recently freed objects
  radix tree test suite: add some more functionality
  idr: reduce the number of bits per level from 8 to 6
  rxrpc: abstract away knowledge of IDR internals
  tpm: use idr_find(), not idr_find_slowpath()
  idr: add ida_is_empty
  radix tree test suite: check multiorder iteration
  radix-tree: fix replacement for multiorder entries
  radix-tree: add radix_tree_split_preload()
  radix-tree: add radix_tree_split
  radix-tree: add radix_tree_join
  radix-tree: delete radix_tree_range_tag_if_tagged()
  radix-tree: delete radix_tree_locate_item()
  radix-tree: improve multiorder iterators
  btrfs: fix race in btrfs_free_dummy_fs_info()
  radix-tree: improve dump output
  radix-tree: make radix_tree_find_next_bit more useful
  ...
2016-12-14 17:25:18 -08:00
Babu Moger 73ce0511c4 kernel/watchdog.c: move hardlockup detector to separate file
Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
It is mostly straight forward.  Remove everything inside
CONFIG_HARDLOCKUP_DETECTORS.  This code will go to file watchdog_hld.c.
Also update the makefile accordigly.

Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Josh Hunt <johunt@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Paul Bolle d06505b2a9 Remove last traces of ikconfig.h
The build system stopped generating ikconfig.h in v2.6.8. Remove an entry
for it in dontdiff. There's also a reference to it in a small comment.
Remove that comment too, as it is of little help in any case.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-12-14 10:54:28 +01:00
Eric W. Biederman dbec28460a userns: Add per user namespace sysctls.
Limit per userns sysctls to only be opened for write by a holder
of CAP_SYS_RESOURCE.

Add all of the necessary boilerplate for having per user namespace
sysctls.

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-08-08 13:18:58 -05:00
Ralf Baechle f43edca7ed ELF/MIPS build fix
CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
following linker errors:

  arch/mips/built-in.o: In function `elf_core_dump':
  binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
  binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
  binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
  binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'

CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
linker errors:

  arch/mips/built-in.o: In function `elf_core_dump':
  binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
  binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
  binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
  binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'

This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
elfcore but for these configurations elfcore will not be built.

Fixed by making elfcore selectable by a separate config symbol which
unlike the current mechanism can also be used from other directories
than kernel/, then having each flavor of ELF that relies on elfcore.o,
select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
which fixes this issue.

Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Dmitry Vyukov 5c9a8750a6 kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on kernel side.  The complimentary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen of basic blocks (e.g.  an invalid
input).  In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M).  Cost of
kcov depends only on number of executed basic blocks/edges.  On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Li Bin e11b956e9e kernel/Makefile: remove the useless CFLAGS_REMOVE_cgroup-debug.o
The file cgroup-debug.c had been removed from commit fe6934354f
(cgroups: move the cgroup debug subsys into cgroup.c to access internal state).
Remain the CFLAGS_REMOVE_cgroup-debug.o = $(CC_FLAGS_FTRACE)
useless in kernel/Makefile.

Signed-off-by: Li Bin <huawei.libin@huawei.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-01-31 05:11:21 -05:00
Mathieu Desnoyers 5b25b13ab0 sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system.  It is
implemented by calling synchronize_sched().  It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier.  For synchronization primitives
that distinguish between read-side and write-side (e.g.  userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking.  Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux.  They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query  the  set  of  supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on  the  system.
              Upon  return from system call, the caller thread is ensured that
              all running threads have passed through a state where all memory
              accesses  to  user-space  addresses  match program order between
              entry to and return from the system  call  (non-running  threads
              are de facto in such a state). This covers threads from all pro=E2=80=90
              cesses running on the system.  This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed  in  program  order  from  each  targeted
       thread is guaranteed to be ordered with respect to sys_membarrier(). If
       we use the semantic "barrier()" to represent a compiler barrier forcing
       memory  accesses  to  be performed in program order across the barrier,
       and smp_mb() to represent explicit memory barriers forcing full  memory
       ordering  across  the barrier, we have the following ordering table for
       each pair of barrier(), sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, these system calls return zero.  On error, -1 is  returned,
       and errno is set appropriately. For a given command, with flags
       argument set to 0, this system call is guaranteed to always return the
       same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                             2015-04-15                     MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 15:21:34 -07:00
Dave Young 2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young a43cac0d9d kexec: split kexec_file syscall code to kexec_file.c
Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.

[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Linus Torvalds 12f03ee606 libnvdimm for 4.3:
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.  This facility is used by the pmem driver to
    enable pfn_to_page() operations on the page frames returned by DAX
    ('direct_access' in 'struct block_device_operations'). For now, the
    'memmap' allocation for these "device" pages comes from "System
    RAM".  Support for allocating the memmap from device memory will
    arrive in a later kernel.
 
 2/ Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt().  memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects.  The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.  Completion of
    the conversion is targeted for v4.4.
 
 3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.
 
 4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.
 
 5/ Miscellaneous updates and fixes to libnvdimm including support
    for issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
 JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
 OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
 nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
 NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
 zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
 1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
 sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
 bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
 o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
 dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
 slsw6DkrWT60CRE42nbK
 =o57/
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This update has successfully completed a 0day-kbuild run and has
  appeared in a linux-next release.  The changes outside of the typical
  drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
  removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
  the introduction of ZONE_DEVICE + devm_memremap_pages().

  Summary:

   - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
     mechanism for adding device-driver-discovered memory regions to the
     kernel's direct map.

     This facility is used by the pmem driver to enable pfn_to_page()
     operations on the page frames returned by DAX ('direct_access' in
     'struct block_device_operations').

     For now, the 'memmap' allocation for these "device" pages comes
     from "System RAM".  Support for allocating the memmap from device
     memory will arrive in a later kernel.

   - Introduce memremap() to replace usages of ioremap_cache() and
     ioremap_wt().  memremap() drops the __iomem annotation for these
     mappings to memory that do not have i/o side effects.  The
     replacement of ioremap_cache() with memremap() is limited to the
     pmem driver to ease merging the api change in v4.3.

     Completion of the conversion is targeted for v4.4.

   - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
     driver, update the VFS DAX implementation and PMEM api to provide
     persistence guarantees for kernel operations on a DAX mapping.

   - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
     cacheable to improve performance.

   - Miscellaneous updates and fixes to libnvdimm including support for
     issuing "address range scrub" commands, clarifying the optimal
     'sector size' of pmem devices, a clarification of the usage of the
     ACPI '_STA' (status) property for DIMM devices, and other minor
     fixes"

* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
  libnvdimm, pmem: direct map legacy pmem by default
  libnvdimm, pmem: 'struct page' for pmem
  libnvdimm, pfn: 'struct page' provider infrastructure
  x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
  add devm_memremap_pages
  mm: ZONE_DEVICE for "device memory"
  mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
  dax: drop size parameter to ->direct_access()
  nd_blk: change aperture mapping from WC to WB
  nvdimm: change to use generic kvfree()
  pmem, dax: have direct_access use __pmem annotation
  dax: update I/O path to do proper PMEM flushing
  pmem: add copy_from_iter_pmem() and clear_pmem()
  pmem, x86: clean up conditional pmem includes
  pmem: remove layer when calling arch_has_wmb_pmem()
  pmem, x86: move x86 PMEM API to new pmem.h header
  libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  pmem: switch to devm_ allocations
  devres: add devm_memremap
  libnvdimm, btt: write and validate parent_uuid
  ...
2015-09-08 14:35:59 -07:00
Linus Torvalds 425afcff13 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit update from Paul Moore:
 "This is one of the larger audit patchsets in recent history,
  consisting of eight patches and almost 400 lines of changes.

  The bulk of the patchset is the new "audit by executable"
  functionality which allows admins to set an audit watch based on the
  executable on disk.  Prior to this, admins could only track an
  application by PID, which has some obvious limitations.

  Beyond the new functionality we also have some refcnt fixes and a few
  minor cleanups"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  fixup: audit: implement audit by executable
  audit: implement audit by executable
  audit: clean simple fsnotify implementation
  audit: use macros for unset inode and device values
  audit: make audit_del_rule() more robust
  audit: fix uninitialized variable in audit_add_rule()
  audit: eliminate unnecessary extra layer of watch parent references
  audit: eliminate unnecessary extra layer of watch references
2015-09-08 13:34:59 -07:00
Linus Torvalds b793c005ce Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

   - PKCS#7 support added to support signed kexec, also utilized for
     module signing.  See comments in 3f1e1bea.

     ** NOTE: this requires linking against the OpenSSL library, which
        must be installed, e.g.  the openssl-devel on Fedora **

   - Smack
      - add IPv6 host labeling; ignore labels on kernel threads
      - support smack labeling mounts which use binary mount data

   - SELinux:
      - add ioctl whitelisting (see
        http://kernsec.org/files/lss2015/vanderstoep.pdf)
      - fix mprotect PROT_EXEC regression caused by mm change

   - Seccomp:
      - add ptrace options for suspend/resume"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
  PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
  Documentation/Changes: Now need OpenSSL devel packages for module signing
  scripts: add extract-cert and sign-file to .gitignore
  modsign: Handle signing key in source tree
  modsign: Use if_changed rule for extracting cert from module signing key
  Move certificate handling to its own directory
  sign-file: Fix warning about BIO_reset() return value
  PKCS#7: Add MODULE_LICENSE() to test module
  Smack - Fix build error with bringup unconfigured
  sign-file: Document dependency on OpenSSL devel libraries
  PKCS#7: Appropriately restrict authenticated attributes and content type
  KEYS: Add a name for PKEY_ID_PKCS7
  PKCS#7: Improve and export the X.509 ASN.1 time object decoder
  modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
  extract-cert: Cope with multiple X.509 certificates in a single file
  sign-file: Generate CMS message as signature instead of PKCS#7
  PKCS#7: Support CMS messages also [RFC5652]
  X.509: Change recorded SKID & AKID to not include Subject or Issuer
  PKCS#7: Check content type and versions
  MAINTAINERS: The keyrings mailing list has moved
  ...
2015-09-08 12:41:25 -07:00
Dan Williams 92281dee82 arch: introduce memremap()
Existing users of ioremap_cache() are mapping memory that is known in
advance to not have i/o side effects.  These users are forced to cast
away the __iomem annotation, or otherwise neglect to fix the sparse
errors thrown when dereferencing pointers to this memory.  Provide
memremap() as a non __iomem annotated ioremap_*() in the case when
ioremap is otherwise a pointer to cacheable memory. Empirically,
ioremap_<cacheable-type>() call sites are seeking memory-like semantics
(e.g.  speculative reads, and prefetching permitted).

memremap() is a break from the ioremap implementation pattern of adding
a new memremap_<type>() for each mapping type and having silent
compatibility fall backs.  Instead, the implementation defines flags
that are passed to the central memremap() and if a mapping type is not
supported by an arch memremap returns NULL.

We introduce a memremap prototype as a trivial wrapper of
ioremap_cache() and ioremap_wt().  Later, once all ioremap_cache() and
ioremap_wt() usage has been removed from drivers we teach archs to
implement arch_memremap() with the ability to strictly enforce the
mapping type.

Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:23:28 -04:00
David Howells cfc411e7ff Move certificate handling to its own directory
Move certificate handling out of the kernel/ directory and into a certs/
directory to get all the weird stuff in one place and move the generated
signing keys into this directory.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-14 16:06:13 +01:00
David Woodhouse 770f2b9876 modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
Fix up the dependencies somewhat too, while we're at it.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-12 17:01:01 +01:00
David Woodhouse 99d27b1b52 modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS option
Let the user explicitly provide a file containing trusted keys, instead of
just automatically finding files matching *.x509 in the build tree and
trusting whatever we find. This really ought to be an *explicit*
configuration, and the build rules for dealing with the files were
fairly painful too.

Fix applied from James Morris that removes an '=' from a macro definition
in kernel/Makefile as this is a feature that only exists from GNU make 3.82
onwards.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse fb11794991 modsign: Use single PEM file for autogenerated key
The current rule for generating signing_key.priv and signing_key.x509 is
a classic example of a bad rule which has a tendency to break parallel
make. When invoked to create *either* target, it generates the other
target as a side-effect that make didn't predict.

So let's switch to using a single file signing_key.pem which contains
both key and certificate. That matches what we do in the case of an
external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
slightly cleaner.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse 1329e8cc69 modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if needed
Where an external PEM file or PKCS#11 URI is given, we can get the cert
from it for ourselves instead of making the user drop signing_key.x509
in place for us.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse 19e91b69d7 modsign: Allow external signing key to be specified
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
Richard Guy Briggs 7f49294282 audit: clean simple fsnotify implementation
This is to be used to audit by executable path rules, but audit watches should
be able to share this code eventually.

At the moment the audit watch code is a lot more complex.  That code only
creates one fsnotify watch per parent directory.  That 'audit_parent' in
turn has a list of 'audit_watches' which contain the name, ino, dev of
the specific object we care about.  This just creates one fsnotify watch
per object we care about.  So if you watch 100 inodes in /etc this code
will create 100 fsnotify watches on /etc.  The audit_watch code will
instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
individual watches chained from that fsnotify mark.

We should be able to convert the audit_watch code to do one fsnotify
mark per watch and simplify things/remove a whole lot of code.  After
that conversion we should be able to convert the audit_fsnotify code to
support that hierarchy if the optimization is necessary.

Move the access to the entry for audit_match_signal() to the beginning of
the audit_del_rule() function in case the entry found is the same one passed
in.  This will enable it to be used by audit_autoremove_mark_rule(),
kill_rules() and audit_remove_parent_watches().

This is a heavily modified and merged version of two patches originally
submitted by Eric Paris.

Cc: Peter Moody <peter@hda3.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: added a space after a declaration to keep ./scripts/checkpatch happy]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-06 16:14:53 -04:00
Aleksa Sarai 49b786ea14 cgroup: implement the PIDs subsystem
Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup -- even if migrating
mid-fork it must be able to fork in the parent first).

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-07-14 17:29:23 -04:00
Linus Torvalds 7df9ab845c make certificate list change message more useful
It's a bug in our Makefile rules, make it show what the changing
certificate list was, and make it a warning so that people actually see
it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-02 16:42:13 -07:00
David Howells 9c4249c8e0 modsign: change default key details
Change default key details to be more obviously unspecified.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-30 09:35:41 -07:00
Iulia Manda 2813893f8b kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Linus Torvalds 8cc748aa76 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "Highlights:

   - Smack adds secmark support for Netfilter
   - /proc/keys is now mandatory if CONFIG_KEYS=y
   - TPM gets its own device class
   - Added TPM 2.0 support
   - Smack file hook rework (all Smack users should review this!)"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
  cipso: don't use IPCB() to locate the CIPSO IP option
  SELinux: fix error code in policydb_init()
  selinux: add security in-core xattr support for pstore and debugfs
  selinux: quiet the filesystem labeling behavior message
  selinux: Remove unused function avc_sidcmp()
  ima: /proc/keys is now mandatory
  Smack: Repair netfilter dependency
  X.509: silence asn1 compiler debug output
  X.509: shut up about included cert for silent build
  KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
  MAINTAINERS: email update
  tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
  smack: fix possible use after frees in task_security() callers
  smack: Add missing logging in bidirectional UDS connect check
  Smack: secmark support for netfilter
  Smack: Rework file hooks
  tpm: fix format string error in tpm-chip.c
  char/tpm/tpm_crb: fix build error
  smack: Fix a bidirectional UDS connect check typo
  smack: introduce a special case for tmpfs in smack_d_instantiate()
  ...
2015-02-11 20:25:11 -08:00
Linus Torvalds b3d6524ff7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:

 - The remaining patches for the z13 machine support: kernel build
   option for z13, the cache synonym avoidance, SMT support,
   compare-and-delay for spinloops and the CES5S crypto adapater.

 - The ftrace support for function tracing with the gcc hotpatch option.
   This touches common code Makefiles, Steven is ok with the changes.

 - The hypfs file system gets an extension to access diagnose 0x0c data
   in user space for performance analysis for Linux running under z/VM.

 - The iucv hvc console gets wildcard spport for the user id filtering.

 - The cacheinfo code is converted to use the generic infrastructure.

 - Cleanup and bug fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
  s390/process: free vx save area when releasing tasks
  s390/hypfs: Eliminate hypfs interval
  s390/hypfs: Add diagnose 0c support
  s390/cacheinfo: don't use smp_processor_id() in preemptible context
  s390/zcrypt: fixed domain scanning problem (again)
  s390/smp: increase maximum value of NR_CPUS to 512
  s390/jump label: use different nop instruction
  s390/jump label: add sanity checks
  s390/mm: correct missing space when reporting user process faults
  s390/dasd: cleanup profiling
  s390/dasd: add locking for global_profile access
  s390/ftrace: hotpatch support for function tracing
  ftrace: let notrace function attribute disable hotpatching if necessary
  ftrace: allow architectures to specify ftrace compile options
  s390: reintroduce diag 44 calls for cpu_relax()
  s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
  s390/zcrypt: Number of supported ap domains is not retrievable.
  s390/spinlock: add compare-and-delay to lock wait loops
  s390/tape: remove redundant if statement
  s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
  ...
2015-02-11 17:42:32 -08:00
Heiko Carstens c0a80c0c27 ftrace: allow architectures to specify ftrace compile options
If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.

This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-01-29 09:19:19 +01:00
Arnd Bergmann 89f703f093 X.509: shut up about included cert for silent build
Every kernel build that includes X.509 support prints out
a message like

 - Including cert signing_key.x509

This may be useful for some cases, but when doing automated
build tests, it just means noise.

To hide the message, this uses '$(kecho)' for printing the
message, which means we still see it when building with V=1,
but not at the normal level or when building with 'make -s'.

Signed-off-by: Arnd Bergmann <arnd@arnd.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-01-23 12:10:39 +00:00
Seth Jennings b700e7f03d livepatch: kernel: add support for live patching
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.

[ jkosina@suse.cz: due to the number of contributions that got folded into
  this original patch from Seth Jennings, add SUSE's copyright as well, as
  discussed via e-mail ]

Signed-off-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:49 +01:00
Johannes Weiner 5b1efc027c kernel: res_counter: remove the unused API
All memory accounting and limiting has been switched over to the
lockless page counters.  Bye, res_counter!

[akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
[mhocko@suse.cz: ditch the last remainings of res_counter]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Alexei Starovoitov f89b7755f5 bpf: split eBPF out of NET
introduce two configs:
- hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
  depend on
- visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

that solves several problems:
- tracing and others that wish to use eBPF don't need to depend on NET.
  They can use BPF_SYSCALL to allow loading from userspace or select BPF
  to use it directly from kernel in NET-less configs.
- in 3.18 programs cannot be attached to events yet, so don't force it on
- when the rest of eBPF infra is there in 3.19+, it's still useful to
  switch it off to minimize kernel size

bloat-o-meter on x64 shows:
add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

tested with many different config combinations. Hopefully didn't miss anything.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-27 19:09:59 -04:00
Vivek Goyal 8370edea81 bin2c: move bin2c in scripts/basic
This patch series does not do kernel signature verification yet.  I plan
to post another patch series for that.  Now distributions are already
signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
those signatures.

Primary goal of this patchset is to prepare groundwork so that kernel
image can be signed and signatures be verified during kexec load.  This
should help with two things.

- It should allow kexec/kdump on secureboot enabled machines.

- In general it can help even without secureboot. By being able to verify
  kernel image signature in kexec, it should help with avoiding module
  signing restrictions. Matthew Garret showed how to boot into a custom
  kernel, modify first kernel's memory and then jump back to old kernel and
  bypass any policy one wants to.

This patch (of 15):

Kexec wants to use bin2c and it wants to use it really early in the build
process. See arch/x86/purgatory/ code in later patches.

So move bin2c in scripts/basic so that it can be built very early and
be usable by arch/x86/purgatory/

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:32 -07:00
Linus Torvalds ae045e2455 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Steady transitioning of the BPF instructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with intial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
2014-08-06 09:38:14 -07:00
Alexei Starovoitov f5bffecda9 net: filter: split filter.c into two files
BPF is used in several kernel components. This split creates logical boundary
between generic eBPF core and the rest

kernel/bpf/core.c: eBPF interpreter

net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters

This patch only moves functions.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-23 21:06:22 -07:00
Thomas Gleixner 5cee964597 time/timers: Move all time(r) related files into kernel/time
Except for Kconfig.HZ. That needs a separate treatment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-06-23 11:22:35 +02:00
Linus Torvalds 176ab02d49 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 LTO changes from Peter Anvin:
 "More infrastructure work in preparation for link-time optimization
  (LTO).  Most of these changes is to make sure symbols accessed from
  assembly code are properly marked as visible so the linker doesn't
  remove them.

  My understanding is that the changes to support LTO are still not
  upstream in binutils, but are on the way there.  This patchset should
  conclude the x86-specific changes, and remaining patches to actually
  enable LTO will be fed through the Kbuild tree (other than keeping up
  with changes to the x86 code base, of course), although not
  necessarily in this merge window"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  Kbuild, lto: Handle basic LTO in modpost
  Kbuild, lto: Disable LTO for asm-offsets.c
  Kbuild, lto: Add a gcc-ld script to let run gcc as ld
  Kbuild, lto: add ld-version and ld-ifversion macros
  Kbuild, lto: Drop .number postfixes in modpost
  Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
  lto: Disable LTO for sys_ni
  lto: Handle LTO common symbols in module loader
  lto, workaround: Add workaround for initcall reordering
  lto: Make asmlinkage __visible
  x86, lto: Disable LTO for the x86 VDSO
  initconst, x86: Fix initconst mistake in ts5500 code
  initconst: Fix initconst mistake in dcdbas
  asmlinkage: Make trace_hardirqs_on/off_caller visible
  asmlinkage, x86: Fix 32bit memcpy for LTO
  asmlinkage Make __stack_chk_failed and memcmp visible
  asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
  asmlinkage: Make main_extable_sort_needed visible
  asmlinkage, mutex: Mark __visible
  asmlinkage: Make trace_hardirq visible
  ...
2014-03-31 14:13:25 -07:00
Linus Torvalds 971eae7c99 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Bigger changes:

   - sched/idle restructuring: they are WIP preparation for deeper
     integration between the scheduler and idle state selection, by
     Nicolas Pitre.

   - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

   - optimize cgroup context switches, by Peter Zijlstra.

   - RT scheduling enhancements, by Thomas Gleixner.

  The rest is smaller changes, non-urgnt fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
  sched: Clean up the task_hot() function
  sched: Remove double calculation in fix_small_imbalance()
  sched: Fix broken setscheduler()
  sparc64, sched: Remove unused sparc64_multi_core
  sched: Remove unused mc_capable() and smt_capable()
  sched/numa: Move task_numa_free() to __put_task_struct()
  sched/fair: Fix endless loop in idle_balance()
  sched/core: Fix endless loop in pick_next_task()
  sched/fair: Push down check for high priority class task into idle_balance()
  sched/rt: Fix picking RT and DL tasks from empty queue
  trace: Replace hardcoding of 19 with MAX_NICE
  sched: Guarantee task priority in pick_next_task()
  sched/idle: Remove stale old file
  sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
  cpuidle/arm64: Remove redundant cpuidle_idle_call()
  cpuidle/powernv: Remove redundant cpuidle_idle_call()
  sched, nohz: Exclude isolated cores from load balancing
  sched: Fix select_task_rq_fair() description comments
  workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
  sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
  ...
2014-03-31 11:21:19 -07:00
Paul E. McKenney 51b1130eb5 rcutorture: Abstract rcu_torture_random()
Because rcu_torture_random() will be used by the locking equivalent to
rcutorture, pull it out into its own module.  This new module cannot
be separately configured, instead, use the Kconfig "select" statement
from the Kconfig options of tests depending on it.

Suggested-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:00:58 -08:00
Andi Kleen 58edae3aac lto: Disable LTO for sys_ni
The assembler alias code in cond_syscall does not work
when compiled for LTO. Just disable LTO for that file.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391846481-31491-6-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 20:24:53 -08:00
Nicolas Pitre cf37b6b484 sched/idle: Move cpu/idle.c to sched/idle.c
Integration of cpuidle with the scheduler requires that the idle loop be
closely integrated with the scheduler proper. Moving cpu/idle.c into the
sched directory will allow for a smoother integration, and eliminate a
subdirectory which contained only one source file.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11 09:58:30 +01:00
Kirill Tkhai f46a3cbbeb KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13 15:59:11 +00:00
David Howells d7ec435fdd X.509: Fix certificate gathering
Fix the gathering of certificates from both the source tree and the build tree
to correctly calculate the pathnames of all the certificates.

The problem was that if the default generated cert, signing_key.x509, didn't
exist then it would not have a path attached and if it did, it would have a
path attached.

This means that the contents of kernel/.x509.list would change between the
first compilation in a directory and the second.  After the second it would
remain stable because the signing_key.x509 file exists.

The consequence was that the kernel would get relinked unconditionally on the
second recompilation.  The second recompilation would also show something like
this:

   X.509 certificate list changed
     CERTS   kernel/x509_certificate_list
     - Including cert /home/torvalds/v2.6/linux/signing_key.x509
     AS      kernel/system_certificates.o
     LD      kernel/built-in.o

which is why the relink would happen.


Unfortunately, it isn't a simple matter of just sticking a path on the front
of the filename of the certificate in the build directory as make can't then
work out how to build it.

So the path has to be prepended to the name for sorting and duplicate
elimination and then removed for the make rule if it is in the build tree.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13 15:28:14 +00:00
Linus Torvalds 78dc53c422 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this patchset, we finally get an SELinux update, with Paul Moore
  taking over as maintainer of that code.

  Also a significant update for the Keys subsystem, as well as
  maintenance updates to Smack, IMA, TPM, and Apparmor"

and since I wanted to know more about the updates to key handling,
here's the explanation from David Howells on that:

 "Okay.  There are a number of separate bits.  I'll go over the big bits
  and the odd important other bit, most of the smaller bits are just
  fixes and cleanups.  If you want the small bits accounting for, I can
  do that too.

   (1) Keyring capacity expansion.

        KEYS: Consolidate the concept of an 'index key' for key access
        KEYS: Introduce a search context structure
        KEYS: Search for auth-key by name rather than target key ID
        Add a generic associative array implementation.
        KEYS: Expand the capacity of a keyring

     Several of the patches are providing an expansion of the capacity of a
     keyring.  Currently, the maximum size of a keyring payload is one page.
     Subtract a small header and then divide up into pointers, that only gives
     you ~500 pointers on an x86_64 box.  However, since the NFS idmapper uses
     a keyring to store ID mapping data, that has proven to be insufficient to
     the cause.

     Whatever data structure I use to handle the keyring payload, it can only
     store pointers to keys, not the keys themselves because several keyrings
     may point to a single key.  This precludes inserting, say, and rb_node
     struct into the key struct for this purpose.

     I could make an rbtree of records such that each record has an rb_node
     and a key pointer, but that would use four words of space per key stored
     in the keyring.  It would, however, be able to use much existing code.

     I selected instead a non-rebalancing radix-tree type approach as that
     could have a better space-used/key-pointer ratio.  I could have used the
     radix tree implementation that we already have and insert keys into it by
     their serial numbers, but that means any sort of search must iterate over
     the whole radix tree.  Further, its nodes are a bit on the capacious side
     for what I want - especially given that key serial numbers are randomly
     allocated, thus leaving a lot of empty space in the tree.

     So what I have is an associative array that internally is a radix-tree
     with 16 pointers per node where the index key is constructed from the key
     type pointer and the key description.  This means that an exact lookup by
     type+description is very fast as this tells us how to navigate directly to
     the target key.

     I made the data structure general in lib/assoc_array.c as far as it is
     concerned, its index key is just a sequence of bits that leads to a
     pointer.  It's possible that someone else will be able to make use of it
     also.  FS-Cache might, for example.

   (2) Mark keys as 'trusted' and keyrings as 'trusted only'.

        KEYS: verify a certificate is signed by a 'trusted' key
        KEYS: Make the system 'trusted' keyring viewable by userspace
        KEYS: Add a 'trusted' flag and a 'trusted only' flag
        KEYS: Separate the kernel signature checking keyring from module signing

     These patches allow keys carrying asymmetric public keys to be marked as
     being 'trusted' and allow keyrings to be marked as only permitting the
     addition or linkage of trusted keys.

     Keys loaded from hardware during kernel boot or compiled into the kernel
     during build are marked as being trusted automatically.  New keys can be
     loaded at runtime with add_key().  They are checked against the system
     keyring contents and if their signatures can be validated with keys that
     are already marked trusted, then they are marked trusted also and can
     thus be added into the master keyring.

     Patches from Mimi Zohar make this usable with the IMA keyrings also.

   (3) Remove the date checks on the key used to validate a module signature.

        X.509: Remove certificate date checks

     It's not reasonable to reject a signature just because the key that it was
     generated with is no longer valid datewise - especially if the kernel
     hasn't yet managed to set the system clock when the first module is
     loaded - so just remove those checks.

   (4) Make it simpler to deal with additional X.509 being loaded into the kernel.

        KEYS: Load *.x509 files into kernel keyring
        KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate

     The builder of the kernel now just places files with the extension ".x509"
     into the kernel source or build trees and they're concatenated by the
     kernel build and stuffed into the appropriate section.

   (5) Add support for userspace kerberos to use keyrings.

        KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
        KEYS: Implement a big key type that can save to tmpfs

     Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
     We looked at storing it in keyrings instead as that confers certain
     advantages such as tickets being automatically deleted after a certain
     amount of time and the ability for the kernel to get at these tokens more
     easily.

     To make this work, two things were needed:

     (a) A way for the tickets to persist beyond the lifetime of all a user's
         sessions so that cron-driven processes can still use them.

         The problem is that a user's session keyrings are deleted when the
         session that spawned them logs out and the user's user keyring is
         deleted when the UID is deleted (typically when the last log out
         happens), so neither of these places is suitable.

         I've added a system keyring into which a 'persistent' keyring is
         created for each UID on request.  Each time a user requests their
         persistent keyring, the expiry time on it is set anew.  If the user
         doesn't ask for it for, say, three days, the keyring is automatically
         expired and garbage collected using the existing gc.  All the kerberos
         tokens it held are then also gc'd.

     (b) A key type that can hold really big tickets (up to 1MB in size).

         The problem is that Active Directory can return huge tickets with lots
         of auxiliary data attached.  We don't, however, want to eat up huge
         tracts of unswappable kernel space for this, so if the ticket is
         greater than a certain size, we create a swappable shmem file and dump
         the contents in there and just live with the fact we then have an
         inode and a dentry overhead.  If the ticket is smaller than that, we
         slap it in a kmalloc()'d buffer"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
  KEYS: Fix keyring content gc scanner
  KEYS: Fix error handling in big_key instantiation
  KEYS: Fix UID check in keyctl_get_persistent()
  KEYS: The RSA public key algorithm needs to select MPILIB
  ima: define '_ima' as a builtin 'trusted' keyring
  ima: extend the measurement list to include the file signature
  kernel/system_certificate.S: use real contents instead of macro GLOBAL()
  KEYS: fix error return code in big_key_instantiate()
  KEYS: Fix keyring quota misaccounting on key replacement and unlink
  KEYS: Fix a race between negating a key and reading the error set
  KEYS: Make BIG_KEYS boolean
  apparmor: remove the "task" arg from may_change_ptraced_domain()
  apparmor: remove parent task info from audit logging
  apparmor: remove tsk field from the apparmor_audit_struct
  apparmor: fix capability to not use the current task, during reporting
  Smack: Ptrace access check mode
  ima: provide hash algo info in the xattr
  ima: enable support for larger default filedata hash algorithms
  ima: define kernel parameter 'ima_template=' to change configured default
  ima: add Kconfig default measurement list template
  ...
2013-11-21 19:46:00 -08:00
Peter Zijlstra cd4d241d57 locking: Move the lglocks code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-amd6pg1mif6tikbyktfvby3y@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 09:24:20 +01:00
Peter Zijlstra ed428bfc3c locking: Move the rwsem code to kernel/locking/
Notably: changed lib/rwsem* targets from lib- to obj-, no idea about
the ramifications of that.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-g0kynfh5feriwc6p3h6kpbw6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 09:24:18 +01:00
Peter Zijlstra 1696a8bee3 locking: Move the rtmutex code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-p9ijt8div0hwldexwfm4nlhj@git.kernel.org
[ Fixed build failure in kernel/rcu/tree_plugin.h. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 09:23:59 +01:00
Peter Zijlstra e25a64c401 locking: Move the semaphore core to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-vmw5sf6vzmua1z6nx1cg69h2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:55:22 +01:00
Peter Zijlstra 60fc28746a locking: Move the spinlock code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-b81ol0z3mon45m51o131yc9j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:55:21 +01:00
Peter Zijlstra 8eddac3f10 locking: Move the lockdep code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-wl7s3tta5isufzfguc23et06@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:55:08 +01:00
Peter Zijlstra 01768b42dc locking: Move the mutex code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-1ditvncg30dgbpvrz2bxfmke@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:55:07 +01:00
Ingo Molnar c90423d1de Merge branch 'sched/core' into core/locking, to prepare the kernel/locking/ file move
Conflicts:
	kernel/Makefile

There are conflicts in kernel/Makefile due to file moving in the
scheduler tree - resolve them.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:50:37 +01:00
Peter Zijlstra 7a6354e241 sched: Move wait.c into kernel/sched/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-5q5yqvdaen0rmapwloeaotx3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:49:16 +01:00
Paul E. McKenney 4102adab91 rcu: Move RCU-related source code to kernel/rcu directory
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2013-10-15 12:53:31 -07:00
David Howells b56e5a17b6 KEYS: Separate the kernel signature checking keyring from module signing
Separate the kernel signature checking keyring from module signing so that it
can be used by code other than the module-signing code.

Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25 17:17:01 +01:00
David Howells 0fbd39cf7f KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate
Have make canonicalise the paths of the X.509 certificates before we sort them
as this allows $(sort) to better remove duplicates.

Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25 17:17:01 +01:00
David Howells f0e6d220a7 KEYS: Load *.x509 files into kernel keyring
Load all the files matching the pattern "*.x509" that are to be found in kernel
base source dir and base build dir into the module signing keyring.

The "extra_certificates" file is then redundant.

Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25 17:17:01 +01:00
Martin Schwidefsky 0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Joe Perches b9ee979e9d printk: move to separate directory for easier modification
Make it easier to break up printk into bite-sized chunks.

Remove printk path/filename from comment.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31 14:41:03 -07:00
Robin Holt 15d94b8256 reboot: move shutdown/reboot related functions to kernel/reboot.c
This patch is preparatory.  It moves reboot related syscall, etc
functions from kernel/sys.c to kernel/reboot.c.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09 10:33:29 -07:00
Linus Torvalds f8ce1faf55 We get rid of the general module prefix confusion with a binary config option,
fix a remove/insert race which Never Happens, and (my favorite) handle the
 case when we have too many modules for a single commandline.  Seriously,
 the kernel is full, please go away!
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRgbGEAAoJENkgDmzRrbjx2+QP/jXs93K/sXw3rL0vBklwCFv6
 IPZmqYZiGjrzqlB4coWkgYRwW1oOsREfAjF5MmfPdykS3fO5kXfdxN4FBdfKp+IZ
 RdsycdGDuSxWomgYsivrrxLBDxDAX1VuBOjr6mu5Uuk/pCjFa61cfJDiErsu0jKz
 2EMTc98A+E71XamJdvbtal5MUIu9yeluJWG2ux2+VbCul4MSpMc//0n2nrws/RCB
 AoC96AT/Xf4U10a8zT8RfCJ29M5Vvx/KfTIcFiZvtCQxEaHNNmj831gDNiw/3jFI
 ndRph+VLHBsMoBMxfzNRrM+evqkq8+AGEGRj3ycQy5Pa6DunPyzMafWOVGBGnmaS
 tl9hATGx1438048i5tUn8ieAYG1YL1HM83hQovpCThfUKQMiq186iDt1SYYmlq3g
 0thj3znQqZDYhboPtgWzOMUdqOG/iBIKjhGQjjHZs+MInFgxL2hmax0gBNkvEtQb
 oLyfGbF6UjS7I/Md/HohnUQ4xr9kYa3MQeqPjKbRwgHRkdXhzTEZtI+MYDJBxOnW
 QGVQ97aJ2WA7vC7sz/1VhTcZqmU5zfrSc8lF+Ea+H8dQGHHbz8HxKQacEvKcMrXl
 OJyEkRUWDA0MTjeIHzn2fff9Q6/qqA1QejRiFofGJrpxopcJS84/7yA0repxvuMG
 yaMPsLq53UW37/AXYsho
 =MPiD
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull mudule updates from Rusty Russell:
 "We get rid of the general module prefix confusion with a binary config
  option, fix a remove/insert race which Never Happens, and (my
  favorite) handle the case when we have too many modules for a single
  commandline.  Seriously, the kernel is full, please go away!"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
  X.509: Support parse long form of length octets in Authority Key Identifier
  module: don't unlink the module until we've removed all exposure.
  kernel: kallsyms: memory override issue, need check destination buffer length
  MODSIGN: do not send garbage to stderr when enabling modules signature
  modpost: handle huge numbers of modules.
  modpost: add -T option to read module names from file/stdin.
  modpost: minor cleanup.
  genksyms: pass symbol-prefix instead of arch
  module: fix symbol versioning with symbol prefixes
  CONFIG_SYMBOL_PREFIX: cleanup.
2013-05-05 10:58:06 -07:00
David Cohen 07c449bbc6 MODSIGN: do not send garbage to stderr when enabling modules signature
When compiling kernel with -jN (N > 1), all warning/error messages
printed while openssl is generating key pair may get mixed dots and
other symbols openssl sends to stderr. This patch makes sure openssl
logs go to default stdout.

Example of the garbage on stderr:

crypto/anubis.c:581: warning: ‘inter’ is used uninitialized in this function
Generating a 4096 bit RSA private key
.........
drivers/gpu/drm/i915/i915_gem_gtt.c: In function ‘gen6_ggtt_insert_entries’:
drivers/gpu/drm/i915/i915_gem_gtt.c:440: warning: ‘addr’ may be used uninitialized in this function
.net/mac80211/tx.c: In function ‘ieee80211_subif_start_xmit’:
net/mac80211/tx.c:1780: warning: ‘chanctx_conf’ may be used uninitialized in this function
..drivers/isdn/hardware/mISDN/hfcpci.c: In function ‘hfcpci_softirq’:
.....drivers/isdn/hardware/mISDN/hfcpci.c:2298: warning: ignoring return value of ‘driver_for_each_device’, declared with attribute warn_unused_result

Signed-off-by: David Cohen <david.a.cohen@intel.com>
Reviewed-by: mark gross <mark.gross@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-11 13:22:34 +09:30
Thomas Gleixner a1a04ec3c7 idle: Provide a generic entry point for the idle code
For now this calls cpu_idle(), but in the long run we want to move the
cpu bringup code to the core and therefor we add a state argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-08 17:39:23 +02:00
Linus Torvalds 2a7d2b96d5 Merge branch 'akpm' (final batch from Andrew)
Merge third patch-bumb from Andrew Morton:
 "This wraps me up for -rc1.
   - Lots of misc stuff and things which were deferred/missed from
     patchbombings 1 & 2.
   - ocfs2 things
   - lib/scatterlist
   - hfsplus
   - fatfs
   - documentation
   - signals
   - procfs
   - lockdep
   - coredump
   - seqfile core
   - kexec
   - Tejun's large IDR tree reworkings
   - ipmi
   - partitions
   - nbd
   - random() things
   - kfifo
   - tools/testing/selftests updates
   - Sasha's large and pointless hlist cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (163 commits)
  hlist: drop the node parameter from iterators
  kcmp: make it depend on CHECKPOINT_RESTORE
  selftests: add a simple doc
  tools/testing/selftests/Makefile: rearrange targets
  selftests/efivarfs: add create-read test
  selftests/efivarfs: add empty file creation test
  selftests: add tests for efivarfs
  kfifo: fix kfifo_alloc() and kfifo_init()
  kfifo: move kfifo.c from kernel/ to lib/
  arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS
  w1: add support for DS2413 Dual Channel Addressable Switch
  memstick: move the dereference below the NULL test
  drivers/pps/clients/pps-gpio.c: use devm_kzalloc
  Documentation/DMA-API-HOWTO.txt: fix typo
  include/linux/eventfd.h: fix incorrect filename is a comment
  mtd: mtd_stresstest: use prandom_bytes()
  mtd: mtd_subpagetest: convert to use prandom library
  mtd: mtd_speedtest: use prandom_bytes
  mtd: mtd_pagetest: convert to use prandom library
  mtd: mtd_oobtest: convert to use prandom library
  ...
2013-02-27 20:58:09 -08:00
Cyrill Gorcunov 1e142b29e2 kcmp: make it depend on CHECKPOINT_RESTORE
Since kcmp syscall has been implemented (initially on x86 architecture) a
number of other archs wire it up as well: xtensa, sparc, sh, s390, mips,
microblaze, m68k (not taking into account those who uses
<asm-generic/unistd.h> for syscall numbers definitions).

But the Makefile, which turns kcmp.o generation on still depends on former
config-x86.  Thus get rid of this limitation and make kcmp.o depend on
CHECKPOINT_RESTORE option.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Stefani Seibold c759b35e64 kfifo: move kfifo.c from kernel/ to lib/
Move kfifo.c from kernel/ to lib/

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:23 -08:00
Linus Torvalds 0ca7ffb356 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild changes from Michal Marek:

 - Alias generation in modpost is cross-compile safe.

 - kernel/timeconst.h is now generated using a bc script instead of
   perl.

 - scripts/link-vmlinux.sh now works with an alternative
   $KCONFIG_CONFIG.

 - destination-y for exported headers is supported in Kbuild files
   again.

 - depmod is called with -P $CONFIG_SYMBOL_PREFIX on architectures that
   need it.

 - CONFIG_DEBUG_INFO_REDUCED disables var-tracking

 - scripts/setlocalversion works with too much translated locales ;)

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kbuild: Fix reading of .config in link-vmlinux.sh
  kbuild: Unset language specific variables in setlocalversion script
  Kbuild: Disable var tracking with CONFIG_DEBUG_INFO_REDUCED
  depmod: pass -P $CONFIG_SYMBOL_PREFIX
  kbuild: Fix destination-y for installed headers
  scripts/link-vmlinux.sh: source variables from KCONFIG_CONFIG
  kernel: Replace timeconst.pl with a bc script
  mod/file2alias: make modalias generation safe for cross compiling
2013-02-27 12:25:47 -08:00
H. Peter Anvin 70730bca13 kernel: Replace timeconst.pl with a bc script
bc is the standard tool for multi-precision arithmetic.  We switched
to Perl because akpm reported a hard-to-reproduce build hang, which
was very odd because affected and unaffected machines were all running
the same version of GNU bc.

Unfortunately switching to Perl required a really ugly "canning"
mechanism to support Perl < 5.8 installations lacking the Math::BigInt
module.

It was recently pointed out to me that some very old versions of GNU
make had problems with pipes in subshells, which was indeed the
construct used in the Makefile rules in that version of the patch;
Perl didn't need it so switching to Perl fixed the problem for
unrelated reasons.  With the problem (hopefully) root-caused, we can
switch back to bc and do the arbitrary-precision arithmetic naturally.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2013-02-16 23:17:25 +01:00
Michal Marek 227536740e MODSIGN: Simplify Makefile with a Kconfig helper
Signed-off-by: Michal Marek <mmarek@suse.cz>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-01-25 16:55:35 +10:30
Linus Torvalds 7a684c452e Nothing all that exciting; a new module-from-fd syscall for those who want
to verify the source of the module (ChromeOS) and/or use standard IMA on it
 or other security hooks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ0VKlAAoJENkgDmzRrbjxjuEQALVHpD1cSmryOzVwkNn7rVGP
 PV3KVbUs+qzUCm2c3AafIIlSBm2LOUl+cR3uNC7di8aHarRF3VHkK2OQ4Fx97ECd
 KKBqAyY3R0q1mAKujb/MWwiK0YgosEDIOzGGn2yQhNFsxKqnMB02P4j82IO7+g+w
 Cc3XuDyWHoH2I+ySgz0Q8NHAqufD/DMZUKud7jw2Lsv6PuICJ1Oqgl/Gd/muxort
 4a5tV3tjhRGywHS/8b2fbDUXkybC5NKK0FN+gyoaROmJ/THeHEQDGXZT9bc2vmVx
 HvRy/5k8dzQ6LAJ2mLnPvy0pmv0u7NYMvjxTxxUlUkFMkYuVticikQfwSYDbDPt4
 mbsLxchpgi8z4x8HltEERffCX5tldo/5hz1uemqhqIsMRIrRFnlHkSIgkGjVHf2u
 LXQBLT8uTm6C0VyNQPrI/hUZzIax7WtKbPSoK9lmExNbKqloEFh/mVXvfQxei2kp
 wnUZcnmPIqSvw7b4CWu7HibMYu2VvGBgm3YIfJRi4AQme1mzFYLpZoxF5Pj+Ykbt
 T//Hb1EsNQTTFCg7MZhnJSAw/EVUvNDUoullORClyqw6+xxjVKqWpPJgYDRfWOlJ
 Xa+s7DNrL+Oo1WWR8l5ruoQszbR8szIyeyPKKxRUcQj2zsqghoWuzKAx2saSEw3W
 pNkoJU+dGC7kG/yVAS8N
 =uoJj
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module update from Rusty Russell:
 "Nothing all that exciting; a new module-from-fd syscall for those who
  want to verify the source of the module (ChromeOS) and/or use standard
  IMA on it or other security hooks."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  MODSIGN: Fix kbuild output when using default extra_certificates
  MODSIGN: Avoid using .incbin in C source
  modules: don't hand 0 to vmalloc.
  module: Remove a extra null character at the top of module->strtab.
  ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants
  ASN.1: Define indefinite length marker constant
  moduleparam: use __UNIQUE_ID()
  __UNIQUE_ID()
  MODSIGN: Add modules_sign make target
  powerpc: add finit_module syscall.
  ima: support new kernel module syscall
  add finit_module syscall to asm-generic
  ARM: add finit_module syscall to ARM
  security: introduce kernel_module_from_file hook
  module: add flags arg to sys_finit_module()
  module: add syscall to load module from fd
2012-12-19 07:55:08 -08:00
Michal Marek e10e1774ef MODSIGN: Fix kbuild output when using default extra_certificates
Reported-by: Peter Foley <pefoley2@verizon.net>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Acked-by: Peter Foley <pefoley2@verizon.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-12-14 13:06:45 +10:30
Takashi Iwai 919aa45e43 MODSIGN: Avoid using .incbin in C source
Using the asm .incbin statement in C sources breaks any gcc wrapper which
assumes that preprocessed C source is self-contained. Use a separate .S
file to include the siging key and certificate.

[ This means we no longer need SYMBOL_PREFIX which is defined in kernel.h
  from cbdbf2abb7, so I removed it -- RR ]

Tested-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: James Hogan <james.hogan@imgtec.com>
2012-12-14 13:06:44 +10:30
Ingo Molnar 630e1e0bcd Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Conflicts:
	arch/x86/kernel/ptrace.c

Pull the latest RCU tree from Paul E. McKenney:

"       The major features of this series are:

  1.	A first version of no-callbacks CPUs.  This version prohibits
  	offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y.
  	Relaxing this constraint is in progress, but not yet ready
  	for prime time.  These commits were posted to LKML at
  	https://lkml.org/lkml/2012/10/30/724, and are at branch rcu/nocb.

  2.	Changes to SRCU that allows statically initialized srcu_struct
  	structures.  These commits were posted to LKML at
  	https://lkml.org/lkml/2012/10/30/296, and are at branch rcu/srcu.

  3.	Restructuring of RCU's debugfs output.  These commits were posted
  	to LKML at https://lkml.org/lkml/2012/10/30/341, and are at
  	branch rcu/tracing.

  4.	Additional CPU-hotplug/RCU improvements, posted to LKML at
  	https://lkml.org/lkml/2012/10/30/327, and are at branch rcu/hotplug.
  	Note that the commit eliminating __stop_machine() was judged to
  	be too-high of risk, so is deferred to 3.9.

  5.	Changes to RCU's idle interface, most notably a new module
  	parameter that redirects normal grace-period operations to
  	their expedited equivalents.  These were posted to LKML at
  	https://lkml.org/lkml/2012/10/30/739, and are at branch rcu/idle.

  6.	Additional diagnostics for RCU's CPU stall warning facility,
  	posted to LKML at https://lkml.org/lkml/2012/10/30/315, and
  	are at branch rcu/stall.  The most notable change reduces the
  	default RCU CPU stall-warning time from 60 seconds to 21 seconds,
  	so that it once again happens sooner than the softlockup timeout.

  7.	Documentation updates, which were posted to LKML at
  	https://lkml.org/lkml/2012/10/30/280, and are at branch rcu/doc.
  	A couple of late-breaking changes were posted at
  	https://lkml.org/lkml/2012/11/16/634 and
  	https://lkml.org/lkml/2012/11/16/547.

  8.	Miscellaneous fixes, which were posted to LKML at
  	https://lkml.org/lkml/2012/10/30/309, along with a late-breaking
  	change posted at Fri, 16 Nov 2012 11:26:25 -0800 with message-ID
  	<20121116192625.GA447@linux.vnet.ibm.com>, but which lkml.org
  	seems to have missed.  These are at branch rcu/fixes.

  9.	Finally, a fix for an lockdep-RCU splat was posted to LKML
  	at https://lkml.org/lkml/2012/11/7/486.  This is at rcu/next. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-03 06:27:05 +01:00
Frederic Weisbecker 91d1aa43d3 context_tracking: New context tracking susbsystem
Create a new subsystem that probes on kernel boundaries
to keep track of the transitions between level contexts
with two basic initial contexts: user or kernel.

This is an abstraction of some RCU code that use such tracking
to implement its userspace extended quiescent state.

We need to pull this up from RCU into this new level of indirection
because this tracking is also going to be used to implement an "on
demand" generic virtual cputime accounting. A necessary step to
shutdown the tick while still accounting the cputime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: fix whitespace error and email address. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-11-30 11:40:07 -08:00