The ThunderX NIC has two Ethernet Interfaces (BGX) each of them could has
up to four Logical MACs configured. Each of BGX has 32 filters to be
configured for filtering ingress packets. The number of filters available
to particular LMAC is from 8 (if we have four LMACs configured per BGX)
up to 32 (in case of only one LMAC is configured per BGX).
At the same time the NIC could present up to 128 VFs to OS as network
interfaces, each of them kernel will configure with set of MAC addresses
for filtering. So to prevent dupes in BGX filter registers from different
network interfaces it is required to cache and track all filter
configuration requests prior to applying them onto BGX filter registers.
This commit is to update LMAC structures with control fields to
allocate/releasing filters tracking list along with implementing
dmac array allocate/release per LMAC.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ThunderX NIC has set of registers which allows to configure
filter policy for ingress packets. There are three possible regimes
of filtering multicasts, broadcasts and unicasts: accept all, reject all
and accept filter allowed only.
Current implementation has enum with all of them and two generic macro
for enabling filtering et all (CAM_ACCEPT) and enabling/disabling
broadcast packets, which also should be corrected in order to represent
register bits properly. All these values are private for driver and
there is no need to ‘publish’ them via header file.
This commit is to move filtering register manipulation values from
header file into source with explicit assignment of exact register
values to them to be used while register configuring.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl says:
====================
Meson8m2 support for dwmac-meson8b
The Meson8m2 SoC is an updated version of the Meson8 SoC. Some of the
peripherals are shared with Meson8b (for example the watchdog registers
and the internal temperature sensor calibration procedure).
Meson8m2 also seems to include the same Gigabit MAC register layout as
Meson8b.
The registers in the Amlogic dwmac "glue" seem identical between Meson8b
and Meson8m2. Manual testing seems to confirm this.
To be extra-safe a new compatible string is added because there's no
(public) documentation on the Meson8m2 SoC. This will allow us to
implement any SoC-specific variations later on (if needed).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Meson8m2 SoC uses a similar (potentially even identical) register
layout as the Meson8b and GXBB SoCs for the dwmac glue.
Add a new compatible string and update the module description to
indicate support for these SoCs.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Meson8m2 SoC uses a similar (potentially even identical) register
layout for the dwmac glue as Meson8b and GXBB. Unfortunately there is no
documentation available.
Testing shows that both, RMII and RGMII PHYs are working if they are
configured as on Meson8b. Add a new compatible string to the
documentation so differences (if there are any) between Meson8m2 and the
other SoCs can be taken care of within the driver.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf fixes from Ingo Molnar:
"Two fixlets"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwbp: Simplify the perf-hwbp code, fix documentation
perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
Pull x86 PTI fixes from Ingo Molnar:
"Two fixes: a relatively simple objtool fix that makes Clang built
kernels work with ORC debug info, plus an alternatives macro fix"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fixup alternative_call_2
objtool: Add Clang support
This issue happens on new ASUS laptop UX331UA which has modern
standby mode (suspend-to-idle). Pressing keys on the PS2 keyboard
can't wake up the system from suspend-to-idle which is not expected.
However, pressing power button can wake up without problem.
Per the engineers of ASUS, the keypress event is routed to Embedded
Controller (EC) in standby mode. EC then signals the SCI event to
BIOS so BIOS would Notify() power button to wake up the system. It's
from BIOS perspective. What we observe here is that kernel receives
the SCI event from SCI interrupt handler which informs that the GPE
status bit belongs to EC needs to be handled and then queries the EC
to find out what event is pending. Then execute the following ACPI
_QDF method which defined in ACPI DSDT for EC to notify power button.
Method (_QDF, 0, NotSerialized) // _Qxx: EC Query
{
Notify (PWRB, 0x80) // Status Change
}
With more debug messages added to analyze this problem, we find that
the keypress does wake up the system from suspend-to-idle but it's back
to suspend again almost immediately. As we see in the following messages,
the acpi_button_notify() is invoked but acpi_pm_wakeup_event() can not
really wake up the system here because acpi_s2idle_wakeup() is false.
The acpi_s2idle_wakeup() returnd false because the acpi_s2idle_sync() has
alrealdy exited.
[ 52.987048] s2idle_loop going s2idle
[ 59.713392] acpi_s2idle_wake enter
[ 59.713394] acpi_s2idle_wake exit
[ 59.760888] acpi_ev_gpe_detect enter
[ 59.760893] acpi_s2idle_sync enter
[ 59.760893] acpi_ec_query_flushed ec pending queries 0
[ 59.760953] Read registers for GPE 50-57: Status=01, Enable=01, RunEnable=01, WakeEnable=00
[ 59.760955] ACPI: EC: ===== IRQ (1) =====
[ 59.760972] ACPI: EC: EC_SC(R) = 0x28 SCI_EVT=1 BURST=0 CMD=1 IBF=0 OBF=0
[ 59.760979] ACPI: EC: +++++ Polling enabled +++++
[ 59.760979] ACPI: EC: ##### Command(QR_EC) submitted/blocked #####
[ 59.761003] acpi_s2idle_sync exit
[ 59.769587] ACPI: EC: ##### Query(0xdf) started #####
[ 59.769611] ACPI: EC: ##### Query(0xdf) stopped #####
[ 59.774154] acpi_button_notify button type 1
[ 59.813175] s2idle_loop going s2idle
acpi_s2idle_sync() already makes an effort to flush the EC event
queue, but in this case, the EC event has yet to be generated when
the call to acpi_ec_flush_work() is made. The event is generated
shortly after, through the ongoing handling of the SCI interrupt
which is happening on another CPU, and we must synchronize that
to make sure that it has run and completed. Adding another call to
acpi_os_wait_events_complete() solves this issue, since that
function synchronizes with SCI interrupt completion.
Signed-off-by: Chris Chiu <chiu@endlessm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Trivial fix to spelling mistake in the pr_err_once() error message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20180313154709.1015-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some .<target>.cmd files under arch/x86 are showing two instances of
-D__KERNEL__, like arch/x86/boot/ and arch/x86/realmode/rm/.
__KERNEL__ is already defined in KBUILD_CPPFLAGS in the top Makefile,
so it can be dropped safely.
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20180316084944.3997-1-caoj.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two config options in the lock debugging menu that are probably the most
frequently used, as far as I am concerned, is the PROVE_LOCKING and
LOCK_STAT. From a UI perspective, they should be front and center. So
these two options are now moved to the top of the lock debugging menu.
The DEBUG_WW_MUTEX_SLOWPATH option is also added to the PROVE_LOCKING
umbrella.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-4-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a couples of lock debugging Kconfig options that depends on
the following support options:
- TRACE_IRQFLAGS_SUPPORT
- STACKTRACE_SUPPORT
- LOCKDEP_SUPPORT
That makes those lock debugging options harder to read and understand.
So a new LOCK_DEBUGGING_SUPPORT option is added that is equivalent to
the above three options together. That makes the Kconfig.debug file
more readable.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For a rwsem, locking can either be exclusive or shared. The corresponding
exclusive or shared unlock must be used. Otherwise, the protected data
structures may get corrupted or the lock may be in an inconsistent state.
In order to detect such anomaly, a new configuration option DEBUG_RWSEMS
is added which can be enabled to look for such mismatches and print
warnings that that happens.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- fix missed rebuild of TRIM_UNUSED_KSYMS
- fix rpm-pkg for GNU tar >= 1.29
- include scripts/dtc/include-prefixes/* to kernel header deb-pkg
- add -no-integrated-as option ealier to fix building with Clang
- fix netfilter Makefile for parallel building
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJavwJpAAoJED2LAQed4NsGQuIQAK/UmPVczOxT7RefB4BrAsZG
Zlai7HnfpzWk5EZE6fbTHTmbFu6HZ1TuYhOW5UlJcxd3P+nJfL5WwDo0H52LVfLT
UkSubLCtZBl+DqtbuOg4Xrmh8k3WneGqYT7H9D19LRXTeeoh82g81+mWYL3F9UOA
OWGzKf9+3CQhP7OjeVlfdQ8qv2UR+snyIK0jNRImTuhtys8iy2Q4EP/nQYtF7oAA
KcYY62rS3qVKfTrdk5NY7kxvpp6/1m6141UPR75Xve7h+Emx/u0RthiMUW08e2bv
PX5IlyI8XFz54wD2tojawMEo235cYPJAKQHZAry5tiLXvOF5vEZvoPGc8oUZnMGe
bMNONRfXrKWi10/pcTqEfl6gEAE+bvOrqIKj/DECT4hF1av2uEeou/SzuEX+wbqK
GxU4L5mnUwDsJNLPiUeVjyl4GD48X16lBdCs9laamRzYat5lKzJFBmgNf0dyHdI+
l/myEtk17nSeohPWRgUeTBcP8O+E27rER7U/+KC0c4spwKrEfLFIzzNauLLJdugN
o1VNYacseg3cLQnjSpmC26jxZw29jMFaLM5mBuiI7/F9mUlK6zaG6gyoDzV3A5lN
jgPw48apNj4SLnUMrOi+1RYWXWkguF09f8GecjJKXvR5wGqzY7E3ZDi/zgXBf72q
5r5dDuIExh0KXcO9Risp
=2WPN
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- fix missed rebuild of TRIM_UNUSED_KSYMS
- fix rpm-pkg for GNU tar >= 1.29
- include scripts/dtc/include-prefixes/* to kernel header deb-pkg
- add -no-integrated-as option ealier to fix building with Clang
- fix netfilter Makefile for parallel building
* tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
kbuild: rpm-pkg: Support GNU tar >= 1.29
builddeb: Fix header package regarding dtc source links
kbuild: set no-integrated-as before incl. arch Makefile
kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
Pull networking fixes from David Miller:
1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo.
2) Fix return value assignments and thus error checking in
iwl_mvm_start_ap_ibss(), from Johannes Berg.
3) Don't count header length twice in vti4, from Stefano Brivio.
4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet.
5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash
Abhinov.
6) Check nladdr size in netlink_connect(), from Alexander Potapenko.
7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from
Or Gerlitz.
8) Out of bounds read in skb_network_protocol(), from Eric Dumazet.
9) r8169 driver sets driver data pointer after register_netdev() which
is too late. Fix from Heiner Kallweit.
10) Fix memory leak in mlx4 driver, from Moshe Shemesh.
11) The multi-VLAN decap fix added a regression when dealing with device
that lack a MAC header, such as tun. Fix from Toshiaki Makita.
12) Fix integer overflow in dynamic interrupt coalescing code. From Tal
Gilboa.
13) Use after free in vrf code, from David Ahern.
14) IPV6 route leak between VRFs fix, also from David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: mvneta: fix enable of all initialized RXQs
net/ipv6: Fix route leaking between VRFs
vrf: Fix use after free and double free in vrf_finish_output
ipv6: sr: fix seg6 encap performances with TSO enabled
net/dim: Fix int overflow
vlan: Fix vlan insertion for packets without ethernet header
net: Fix untag for vlan packets without ethernet header
atm: iphase: fix spelling mistake: "Receiverd" -> "Received"
vhost: validate log when IOTLB is enabled
qede: Do not drop rx-checksum invalidated packets.
hv_netvsc: enable multicast if necessary
ip_tunnel: Resolve ipsec merge conflict properly.
lan78xx: Crash in lan78xx_writ_reg (Workqueue: events lan78xx_deferred_multicast_write)
qede: Fix barrier usage after tx doorbell write.
vhost: correctly remove wait queue during poll failure
net/mlx4_core: Fix memory leak while delete slave's resources
net/mlx4_en: Fix mixed PFC and Global pause user control requests
net/smc: use announced length in sock_recvmsg()
llc: properly handle dev_queue_xmit() return value
strparser: Fix sign of err codes
...
Since commit 28128c61e0 ("kconfig.h: Include compiler types to avoid
missed struct attributes"), <linux/kconfig.h> pulls in kernel-space
headers to unrelated places.
Commit 0f9da844d8 ("MIPS: boot: Define __ASSEMBLY__ for its.S build")
suppress the build error by defining __ASSEMBLY__, but ITS (i.e. DTS)
is not assembly, and should not include <linux/compiler_types.h> in the
first place.
Looking at arch/s390/tools/Makefile, host programs gen_facilities and
gen_opcode_table now pull in <linux/compiler_types.h> as well.
The motivation for that commit was to define necessary attributes
before any struct is defined. Obviously, this happens only in C.
It is enough to include <linux/compiler_types.h> only when compiling
C files, and only when compiling kernel space. Move the include to
c_flags.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Andrey Ignatov says:
====================
v2->v3:
- rebase due to conflicts
- fix ipv6=m build
v1->v2:
- support expected_attach_type at prog load time so that prog (incl.
context accesses and calls to helpers) can be validated with regard to
specific attach point it is supposed to be attached to.
Later, at attach time, attach type is checked so that it must be same as
at load time if it was provided
- reworked hooks to rely on expected_attach_type, and reduced number of new
prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
- add selftests for post-sys_bind hook
For our container management we've been using complicated and fragile setup
consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications. Unfortunately it doesn't work for apps that
don't use glibc and changing all applications that run in the datacenter
is not possible due to 3rd party code and libraries (despite being
open source code) and sheer amount of legacy code that has to be rewritten
(we're rewriting what we can in parallel)
These applications are written without containers in mind and have
builtin assumptions about network services. Like an application X
expects to connect localhost:special_port and find service Y in there.
To move application X and service Y into two different containers
LD_PRELOAD approach is used to help one service connect to another
without rewriting them.
Moving these two applications into different L2 (netns) or L3 (vrf)
network isolation scopes doesn't help to solve the problem, since
applications need to see each other like they were running on
the host without containers.
So if app X and app Y would run in different netns something
would need to punch a connectivity hole in those namespaces.
That would be real layering violation (with corresponding
network debugging pains), since clean l2, l3 abstraction would
suddenly support something that breaks through the layers.
Instead we used LD_PRELOAD (and now bpf programs) at bind/connect
time to help applications discover and connect to each other.
All applications are running in init_nens and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do and the whole system is
clean from network point of view and can be debugged with
standard tools.
We also considered resurrecting Hannes's afnetns work,
but all hierarchical namespace abstraction don't work due
to these builtin networking assumptions inside the apps.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature.
BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
to let bpf prog look and modify 'struct sockaddr' provided
by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated
The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.
Different use case for this feature is to do low overhead firewall
that doesn't need to inspect all packets and works at bind/connect time.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and
`BPF_CGROUP_INET6_POST_BIND`.
The main things tested are:
* prog load behaves as expected (valid/invalid accesses in prog);
* prog attach behaves as expected (load- vs attach-time attach types);
* `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible
way;
* post-hooks return expected result and errno.
Example:
# ./test_sock
Test case: bind4 load with invalid access: src_ip6 .. [PASS]
Test case: bind4 load with invalid access: mark .. [PASS]
Test case: bind6 load with invalid access: src_ip4 .. [PASS]
Test case: sock_create load with invalid access: src_port .. [PASS]
Test case: sock_create load w/o expected_attach_type (compat mode) ..
[PASS]
Test case: sock_create load w/ expected_attach_type .. [PASS]
Test case: attach type mismatch bind4 vs bind6 .. [PASS]
Test case: attach type mismatch bind6 vs bind4 .. [PASS]
Test case: attach type mismatch default vs bind4 .. [PASS]
Test case: attach type mismatch bind6 vs sock_create .. [PASS]
Test case: bind4 reject all .. [PASS]
Test case: bind6 reject all .. [PASS]
Test case: bind6 deny specific IP & port .. [PASS]
Test case: bind4 allow specific IP & port .. [PASS]
Test case: bind4 allow all .. [PASS]
Test case: bind6 allow all .. [PASS]
Summary: 16 PASSED, 0 FAILED
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.
Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.
Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.
== Implementation details ==
Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
See description of the problem in the initial patch of this patch set.
== The solution ==
The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.
It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.
Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
significantly;
* there is no use-case for port.
As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.
Support is added for IPv4 and IPv6, for TCP and UDP.
IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.
== Implementation notes ==
The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.
bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.
bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.
New functions have two additional arguments.
`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.
`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add selftest to work with bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs.
Try to bind(2) on IP:port and apply:
* loads to make sure context can be read correctly, including narrow
loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by verifier.
All combination from IPv4/IPv6 and TCP/UDP are tested.
Both scenarios are tested:
* valid programs can be loaded and attached;
* invalid programs can be neither loaded nor attached.
Test passes when expected data can be read from context in the
BPF-program, and after the call to bind(2) socket is bound to IP:port
pair that was written by BPF-program to the context.
Example:
# ./test_sock_addr
Attached bind4 program.
Test case #1 (IPv4/TCP):
Requested: bind(192.168.1.254, 4040) ..
Actual: bind(127.0.0.1, 4444)
Test case #2 (IPv4/UDP):
Requested: bind(192.168.1.254, 4040) ..
Actual: bind(127.0.0.1, 4444)
Attached bind6 program.
Test case #3 (IPv6/TCP):
Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
Actual: bind(::1, 6666)
Test case #4 (IPv6/UDP):
Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
Actual: bind(::1, 6666)
### SUCCESS
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).
It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.
The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.
== Implementation notes ==
[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.
[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".
There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Support setting `expected_attach_type` at prog load time in both
`bpf/bpf.h` and `bpf/libbpf.h`.
Since both headers already have API to load programs, new functions are
added not to break backward compatibility for existing ones:
* `bpf_load_program_xattr()` is added to `bpf/bpf.h`;
* `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`.
Both new functions accept structures, `struct bpf_load_program_attr` and
`struct bpf_prog_load_attr` correspondingly, where new fields can be
added in the future w/o changing the API.
Standard `_xattr` suffix is used to name the new API functions.
Since `bpf_load_program_name()` is not used as heavily as
`bpf_load_program()`, it was removed in favor of more generic
`bpf_load_program_xattr()`.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.
E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.
Same applies to BPF-helpers: it may make sense to call some helper from
some attach point, but not from other for same prog type.
== The solution ==
Introduce `expected_attach_type` field in in `struct bpf_attr` for
`BPF_PROG_LOAD` command. If scenario described in "The problem" section
is the case for some prog type, the field will be checked twice:
1) At load time prog type is checked to see if attach type for it must
be known to validate program permissions correctly. Prog will be
rejected with EINVAL if it's the case and `expected_attach_type` is
not specified or has invalid value.
2) At attach time `attach_type` is compared with `expected_attach_type`,
if prog type requires to have one, and, if they differ, attach will
be rejected with EINVAL.
The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()` () and can be used to check context
accesses and calls to helpers correspondingly.
Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add explicit checks in ext4_xattr_block_get() just in case the
e_value_offs and e_value_size fields in the the xattr block are
corrupted in memory after the buffer_verified bit is set on the xattr
block.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.
Signed-off-by: David Sterba <dsterba@suse.com>
When mount fails to read trees like fs tree, checksum tree, extent
tree, etc, there is not enough information about where went wrong.
With this, messages like
"BTRFS warning (device sdf): failed to read root (objectid=7): -5"
would help us a bit.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All users pass a local unsigned int and not the __uXX types that are
supposed to be used for userspace interfaces.
Signed-off-by: David Sterba <dsterba@suse.com>
The current calls are unclear in what way btrfs_dev_replace_lock takes
the locks, so drop the argument, split the helpers and use similar
naming as for read and write locks.
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_mutex has been killed in 2008, a213501153 ("Btrfs: Replace
the big fs_mutex with a collection of other locks"), still remembered in
some comments.
We don't have any extra needs for locking in the ACL handlers.
Signed-off-by: David Sterba <dsterba@suse.com>
The show_devname callback is used to print device name in
/proc/self/mounts, we need to traverse the device list consistently and
read the name that's copied to a seq buffer so we don't need further
locking.
If the first device is being deleted at the same time, the RCU will
allow us to read the device name, though it will become stale right
after the RCU protection ends. This is unavoidable and the user can
expect that the device will disappear from the filesystem's list at some
point.
The device_list_mutex was pretty heavy as it is used eg. for writing
superblock and a few other IO related contexts. This can stall any
application that reads the proc file for no reason.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once there was a simple int force_cow that was used with the plain
barriers, and then converted to a bit, so we should use the appropriate
barrier helper.
Other variables in the complex if condition do not depend on a barrier,
so we should be fine in case the atomic barrier becomes a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
We have several reports about node pointer points to incorrect child
tree blocks, which could have even wrong owner and level but still with
valid generation and checksum.
Although btrfs check could handle it and print error message like:
leaf parent key incorrect 60670574592
Kernel doesn't have enough check on this type of corruption correctly.
At least add such check to read_tree_block() and btrfs_read_buffer(),
where we need two new parameters @level and @first_key to verify the
child tree block.
The new @level check is mandatory and all call sites are already
modified to extract expected level from its call chain.
While @first_key is optional, the following call sites are skipping such
check:
1) Root node/leaf
As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
Only parent bytenr and level is known and we need to resolve the key
all by ourselves, skip @first_key check.
Another note of this verification is, it needs extra info from nodeptr
or ROOT_ITEM, so it can't fit into current tree-checker framework, which
is limited to node/leaf boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have the no-holes mode enabled and fsync a file after punching a
hole in it, we can end up not logging the whole hole range in the log tree.
This happens if the file has extent items that span more than one leaf and
we punch a hole that covers a range that starts in a leaf but does not go
beyond the offset of the first extent in the next leaf.
Example:
$ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
$ mount /dev/sdb /mnt
$ for ((i = 0; i <= 831; i++)); do
offset=$((i * 2 * 256 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
/mnt/foobar >/dev/null
done
$ sync
# We now have 2 leafs in our filesystem fs tree, the first leaf has an
# item corresponding the extent at file offset 216530944 and the second
# leaf has a first item corresponding to the extent at offset 217055232.
# Now we punch a hole that partially covers the range of the extent at
# offset 216530944 but does go beyond the offset 217055232.
$ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power fail>
# mount to replay the log
$ mount /dev/sdb /mnt
# Before this patch, only the subrange [216658016, 216662016[ (length of
# 4000 bytes) was logged, leaving an incorrect file layout after log
# replay.
Fix this by checking if there is a hole between the last extent item that
we processed and the first extent item in the next leaf, and if there is
one, log an explicit hole extent item.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a nice helper to do proper casting of a qgroup to a ulist aux
value. And several places that could make use of it.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 48a89bc4f2.
The idea to commit transaction and free some space after hitting qgroup
limit is good, although the problem is it can easily cause deadlocks.
One deadlock example is caused by trying to flush data while still
holding it:
Call Trace:
__schedule+0x49d/0x10f0
schedule+0xc6/0x290
schedule_timeout+0x187/0x1c0
wait_for_completion+0x204/0x3a0
btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
qgroup_reserve+0x913/0xa10 [btrfs]
btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
btrfs_check_data_free_space+0x96/0xd0 [btrfs]
__btrfs_buffered_write+0x3ac/0xd40 [btrfs]
btrfs_file_write_iter+0x62a/0xba0 [btrfs]
__vfs_write+0x320/0x430
vfs_write+0x107/0x270
SyS_write+0xbf/0x150
do_syscall_64+0x1b0/0x3d0
entry_SYSCALL64_slow_path+0x25/0x25
Another can be caused by trying to commit one transaction while nesting
with trans handle held by ourselves:
btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
|- qgroup_reserve()
|- btrfs_join_transaction()
|- btrfs_commit_transaction()
The retry is causing more problems than exppected when limit is enabled.
At least a graceful EDQUOT is way better than deadlock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now trace_qgroup_meta_reserve() will have extra type parameter.
And introduce two new trace events:
1) trace_qgroup_meta_free_all_pertrans()
For btrfs_qgroup_free_meta_all_pertrans()
2) trace_qgroup_meta_convert()
For btrfs_qgroup_convert_reserved_meta()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For quota disabled->enable case, it's possible that at reservation time
quota was not enabled so no bytes were really reserved, while at release
time, quota was enabled so we will try to release some bytes we didn't
really own.
Such situation can cause metadata reserveation underflow, for both types,
also less possible for per-trans type since quota enable will commit
transaction.
To address this, record qgroup meta reserved bytes into
root::qgroup_meta_rsv_pertrans and ::prealloc.
So at releasing time we won't free any bytes we didn't reserve.
For DATA, it's already handled by io_tree, so nothing needs to be done
there.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Quite similar for delalloc, some modification to delayed-inode and
delayed-item reservation. Also needs extra parameter for release case
to distinguish normal release and error release.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add some paranoia checks to make sure we don't stray beyond the end of
the valid memory region containing ext4 xattr entries while we are
scanning for a match.
Also rename the function to xattr_find_entry() since it is static and
thus only used in fs/ext4/xattr.c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org