Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.
In general policy and network stack state changes are allowed
while resource control is left unchanged.
Allow ethtool ioctls.
Allow binding to network devices.
Allow setting the socket mark.
Allow setting the socket priority.
Allow setting the network device alias via sysfs.
Allow setting the mtu via sysfs.
Allow changing the network device flags via sysfs.
Allow setting the network device group via sysfs.
Allow the following network device ioctls.
SIOCGMIIPHY
SIOCGMIIREG
SIOCSIFNAME
SIOCSIFFLAGS
SIOCSIFMETRIC
SIOCSIFMTU
SIOCSIFHWADDR
SIOCSIFSLAVE
SIOCADDMULTI
SIOCDELMULTI
SIOCSIFHWBROADCAST
SIOCSMIIREG
SIOCBONDENSLAVE
SIOCBONDRELEASE
SIOCBONDSETHWADDR
SIOCBONDCHANGEACTIVE
SIOCBRADDIF
SIOCBRDELIF
SIOCSHWTSTAMP
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user calling sendmsg has the appropriate privieleges
in their user namespace allow them to set the uid, gid, and
pid in the SCM_CREDENTIALS control message to any valid value.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.
- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.
Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.
This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user namespace which creates a new network namespace owns that
namespace and all resources created in it. This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives. Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.
This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The copy of copy_net_ns used when the network stack is not
built is broken as it does not return -EINVAL when attempting
to create a new network namespace. We don't even have
a previous network namespace.
Since we need a copy of copy_net_ns in net/net_namespace.h that is
available when the networking stack is not built at all move the
correct version of copy_net_ns from net_namespace.c into net_namespace.h
Leaving us with just 2 versions of copy_net_ns. One version for when
we compile in network namespace suport and another stub for all other
occasions.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit c445477d74 which adds aRFS to the kernel, the CPU
selected for RFS is not set correctly when CPU is changing.
This is causing OOO packets and probably other issues.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
offload_base is protected by offload_lock, not ptype_lock
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check (ha->addr == dev->dev_addr) is always true because dev_addr_init()
sets this. Correct the check to behave properly on addr removal.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the offload callbacks into its own structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new data structure to contain the GRO/GSO callbacks and add
a new registration mechanism.
Singed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to a NULL dereference, the following patch is causing oops
in normal trafic condition:
commit c0de08d042
Author: Eric Leblond <eric@regit.org>
Date: Thu Aug 16 22:02:58 2012 +0000
af_packet: don't emit packet on orig fanout group
This buggy patch was a feature fix and has reached most stable
branches.
When skb->sk is NULL and when packet fanout is used, there is a
crash in match_fanout_group where skb->sk is accessed.
This patch fixes the issue by returning false as soon as the
socket is NULL: this correspond to the wanted behavior because
the kernel as to resend the skb to all the listening socket in
this case.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge notify hook rtnl_bridge_notify() was not handling the
case where the master flags was set or with both flags set. First
flags are not being passed correctly and second the logic to parse
them is broken.
This patch passes the original flags value and fixes the
logic.
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the dflt fdb dump handler to use RTM_NEWNEIGH to
be compatible with bridge dump routines.
The dump reply from the network driver handlers should
match the reply from bridge handler. The fact they were
not in the ixgbe case was effectively a bug. This patch
resolves it.
Applications that were not checking the nlmsg type will
continue to work. And now applications that do check
the type will work as expected.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some years ago, the ktime_t helper functions ktime_now() and ktime_lt()
have been introduced. Instead of defining them inside pktgen.c, they
should either use ktime_t library functions or, if not available, they
should be defined in ktime.h, so that also others can benefit from them.
ktime_compare() is introduced with a similar notion as in timespec_compare().
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e5a55a8987 ('net: create generic
bridge ops') broke the handling of a non-zero starting index in
rtnl_bridge_getlink() (based on the old br_dump_ifinfo()).
When the starting index is non-zero, we need to increment the current
index for each entry that we are skipping. Also, we need to check the
index before both cases, since we may previously have stopped
iteration between getting information about a device from its master
and from itself.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Tested-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orphaning frags for zero copy skbs needs to allocate data in atomic
context so is has a chance to fail. If it does we currently discard
the skb which is safe, but we don't report anything to the caller,
so it can not recover by e.g. disabling zero copy.
Add an API to free skb reporting such errors: this is used
by tun in case orphaning frags fails.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_ATTACH_FILTER option is set only. I propose to add the get
ability by using SO_ATTACH_FILTER in getsockopt. To be less
irritating to eyes the SO_GET_FILTER alias to it is declared. This
ability is required by checkpoint-restore project to be able to
save full state of a socket.
There are two issues with getting filter back.
First, kernel modifies the sock_filter->code on filter load, thus in
order to return the filter element back to user we have to decode it
into user-visible constants. Fortunately the modification in question
is interconvertible.
Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
speed up the run-time division by doing kernel_k = reciprocal(user_k).
Bad news is that different user_k may result in same kernel_k, so we
can't get the original user_k back. Good news is that we don't have
to do it. What we need to is calculate a user2_k so, that
reciprocal(user2_k) == reciprocal(user_k) == kernel_k
i.e. if it's re-loaded back the compiled again value will be exactly
the same as it was. That said, the user2_k can be calculated like this
user2_k = reciprocal(kernel_k)
with an exception, that if kernel_k == 0, then user2_k == 1.
The optlen argument is treated like this -- when zero, kernel returns
the amount of sock_fprog elements in filter, otherwise it should be
large enough for the sock_fprog array.
changes since v1:
* Declared SO_GET_FILTER in all arch headers
* Added decode of vlan-tag codes
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF filters lack ability to access skb->vlan_tci
This patch adds two new ancillary accessors :
SKF_AD_VLAN_TAG (44) mapped to vlan_tx_tag_get(skb)
SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb)
This allows libpcap/tcpdump to use a kernel filter instead of
having to fallback to accept all packets, then filter them in
user space.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Ani Sinha <ani@aristanetworks.com>
Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the net device ops to manage the embedded
hardware bridge on ixgbe devices. With this patch the bridge
mode can be toggled between VEB and VEPA to support stacking
macvlan devices or using the embedded switch without any SW
component in 802.1Qbg/br environments.
Additionally, this adds source address pruning to the ixgbevf
driver to prune any frames sent back from a reflective relay on
the switch. This is required because the existing hardware does
not support this. Without it frames get pushed into the stack
with its own src mac which is invalid per 802.1Qbg VEPA
definition.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware switches may support enabling and disabling the
loopback switch which puts the device in a VEPA mode defined
in the IEEE 802.1Qbg specification. In this mode frames are
not switched in the hardware but sent directly to the switch.
SR-IOV capable NICs will likely support this mode I am
aware of at least two such devices. Also I am told (but don't
have any of this hardware available) that there are devices
that only support VEPA modes. In these cases it is important
at a minimum to be able to query these attributes.
This patch adds an additional IFLA_BRIDGE_MODE attribute that can be
set and dumped via the PF_BRIDGE:{SET|GET}LINK operations. Also
anticipating bridge attributes that may be common for both embedded
bridges and software bridges this adds a flags attribute
IFLA_BRIDGE_FLAGS currently used to determine if the command or event
is being generated to/from an embedded bridge or software bridge.
Finally, the event generation is pulled out of the bridge module and
into rtnetlink proper.
For example using the macvlan driver in VEPA mode on top of
an embedded switch requires putting the embedded switch into
a VEPA mode to get the expected results.
-------- --------
| VEPA | | VEPA | <-- macvlan vepa edge relays
-------- --------
| |
| |
------------------
| VEPA | <-- embedded switch in NIC
------------------
|
|
-------------------
| external switch | <-- shiny new physical
------------------- switch with VEPA support
A packet sent from the macvlan VEPA at the top could be
loopbacked on the embedded switch and never seen by the
external switch. So in order for this to work the embedded
switch needs to be set in the VEPA state via the above
described commands.
By making these attributes nested in IFLA_AF_SPEC we allow
future extensions to be made as needed.
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are
currently embedded in the ./net/bridge module. This prohibits
them from being used by other bridging devices. One example
of this being hardware that has embedded bridging components.
In order to use these nlmsg types more generically this patch
adds two net_device_ops hooks. One to set link bridge attributes
and another to dump the current bride attributes.
ndo_bridge_setlink()
ndo_bridge_getlink()
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_update_classid() assumes that the update operation always are
applied on the current task. sock_update_classid() needs to know on
which tasks to work on in order to be able to migrate task between
cgroups using the struct cgroup_subsys attach() callback.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Joe Perches <joe@perches.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric pointed out:
"Hey task_cls_classid() has its own rcu protection since commit
3fb5a99191 (cls_cgroup: Fix rcu lockdep warning)
So we can safely revert Paul commit (1144182a87)
(We no longer need rcu_read_lock/unlock here)"
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_prio_attach() is only access via cgroup_subsys callbacks,
therefore we can reduce the visibility of this function.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's no needed to check the return value of tab since the NULL situation
has been handled already, and the rtnl_msg_handlers[PF_UNSPEC] has been
initialized as non-NULL during the rtnetlink_init().
Signed-off-by: Hans Zhang <zhanghonghui@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Kazantsev found 3.5 kernels and beyond were leaking memory,
and tracked the faulty commit to a1c7fff7e1 ("net:
netdev_alloc_skb() use build_skb()")
While this commit seems fine, it uncovered a bug introduced
in commit bad43ca832 ("net: introduce skb_try_coalesce()), in function
kfree_skb_partial()"):
If head is stolen, we free the sk_buff,
without removing references on secpath (skb->sp).
So IPsec + IP defrag/reassembly (using skb coalescing), or
TCP coalescing could leak secpath objects.
Fix this bug by calling skb_release_head_state(skb) to properly
release all possible references to linked objects.
Reported-by: Mike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Mike Kazantsev <mk.fraggod@gmail.com>
Tested-by: Mike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes double assignment of err to -EINVAL in dev_change_net_namespace().
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set, but
cannot be get via sockopt API. The only way we can find the device id a
socket is bound to is via sock-diag interface. But the diag works only on
hashed sockets, while the opt in question can be set for yet unhashed one.
That said, in order to know what device a socket is bound to (we do want
to know this in checkpoint-restore project) I propose to make this option
getsockopt-able and report the respective device index.
Another solution to the problem might be to teach the sock-diag reporting
info on unhashed sockets. Should I go this way instead?
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in4_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in6_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is weird to display IPv4 address in %x format, what's more,
IPv6 address is disaplayed in human-readable format too. So,
make it human-readable.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ETH_ZLEN is too small for IPv6, so this default value is not
suitable.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For IPv6, sizeof(struct ipv6hdr) = 40, thus the following
expression will result negative:
datalen = pkt_dev->cur_pkt_size - 14 -
sizeof(struct ipv6hdr) - sizeof(struct udphdr) -
pkt_dev->pkt_overhead;
And, the check "if (datalen < sizeof(struct pktgen_hdr))" will be
passed as "datalen" is promoted to unsigned, therefore will cause
a crash later.
This is a quick fix by checking if "datalen" is negative. The following
patch will increase the default value of 'min_pkt_size' for IPv6.
This bug should exist for a long time, so Cc -stable too.
Cc: <stable@vger.kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6a32e4f9dd made the vlan code skip marking
vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
there was an rx_handler, as the rx_handler could cause the frame to be received
on a different (virtual) vlan-capable interface where that vlan might be
configured.
As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
frames for unknown vlans to be delivered to the protocol stack as if they had
been received untagged.
For example, if an ipv6 router advertisement that's tagged for a locally not
configured vlan is received on an interface with macvlan interfaces attached,
macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
macvlan interfaces, which caused it to be passed to the protocol stack, leading
to ipv6 addresses for the announced prefix being configured even though those
are completely unusable on the underlying interface.
The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
rx_handler, if there is one, sees the frame unchanged, but afterwards,
before the frame is delivered to the protocol stack, it gets marked whether
there is an rx_handler or not.
Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current GRO can hold packets in gro_list for almost unlimited
time, in case napi->poll() handler consumes its budget over and over.
In this case, napi_complete()/napi_gro_flush() are not called.
Another problem is that gro_list is flushed in non friendly way :
We scan the list and complete packets in the reverse order.
(youngest packets first, oldest packets last)
This defeats priorities that sender could have cooked.
Since GRO currently only store TCP packets, we dont really notice the
bug because of retransmits, but this behavior can add unexpected
latencies, particularly on mice flows clamped by elephant flows.
This patch makes sure no packet can stay more than 1 ms in queue, and
only in stress situations.
It also complete packets in the right order to minimize latencies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before accessing skb first fragment, better make sure there
is one.
This is probably not needed for old kernels, since an ethernet frame
cannot contain only an ethernet header, but the recent GRO addition
to tunnels makes this patch needed.
Also skb_gro_reset_offset() can be static, it actually allows
compiler to inline it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The retry loop in neigh_resolve_output() and neigh_connected_output()
call dev_hard_header() with out reseting the skb to network_header.
This causes the retry to fail with skb_under_panic. The fix is to
reset the network_header within the retry loop.
Signed-off-by: Ramesh Nagappa <ramesh.nagappa@ericsson.com>
Reviewed-by: Shawn Lu <shawn.lu@ericsson.com>
Reviewed-by: Robert Coulson <robert.coulson@ericsson.com>
Reviewed-by: Billie Alsup <billie.alsup@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over time, skb recycling infrastructure got litle interest and
many bugs. Generic rx path skb allocation is now using page
fragments for efficient GRO / TCP coalescing, and recyling
a tx skb for rx path is not worth the pain.
Last identified bug is that fat skbs can be recycled
and it can endup using high order pages after few iterations.
With help from Maxime Bizon, who pointed out that commit
87151b8689 (net: allow pskb_expand_head() to get maximum tailroom)
introduced this regression for recycled skbs.
Instead of fixing this bug, lets remove skb recycling.
Drivers wanting really hot skbs should use build_skb() anyway,
to allocate/populate sk_buff right before netif_receive_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs update from Al Viro:
- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c
(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).
A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.
- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).
- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.
- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.
- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."
Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increate the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.
People at least seem to agree that the broken hierarhcy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.
This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."
* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them