Commit Graph

74747 Commits

Author SHA1 Message Date
David S. Miller 572152adfb Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contain Netfilter fixes for your net tree, they are:

1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
   This problem is due to wrong order in the per-net registration and netlink
   socket events. Patch from Francesco Ruggeri.

2) Make sure that counters that userspace pass us are higher than 0 in all the
   x_tables frontends. Discovered via Trinity, patch from Dave Jones.

3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
   breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22 14:25:45 -04:00
Eric Dumazet d654976cbf tcp: fix a potential deadlock in tcp_get_info()
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.

We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.

[  523.722504] ======================================================
[  523.728706] [ INFO: possible circular locking dependency detected ]
[  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[  523.739202] -------------------------------------------------------
[  523.745474] ss/18032 is trying to acquire lock:
[  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[  523.758129]
[  523.758129] but task is already holding lock:
[  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[  523.774661]
[  523.774661] which lock already depends on the new lock.
[  523.774661]
[  523.782850]
[  523.782850] the existing dependency chain (in reverse order) is:
[  523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420

Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.

Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22 13:46:06 -04:00
Florian Westphal bd5850d39f net: sched: pkt_cls: remove unused macros from uapi
Jamal points out that this header also contains kernel internal magic that
cannot be used from userspace for anything meaningful.

Lets remove what the kernel doesn't use anymore and wrap remainder with
__KERNEL__.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 23:26:51 -04:00
Marcelo Ricardo Leitner 2efd055c53 tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info
This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.

RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut

These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.

Join work with Eric Dumazet.

Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 23:25:21 -04:00
Linus Torvalds 865d872280 xen: bug fixes for 4.1-rc4
- Fix ARM build regression.
 - Fix VIRQ_CONSOLE related oops.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJVXGwGAAoJEFxbo/MsZsTR63EH/32TrMPyQXa/hkRbwhxAb63z
 6FyPLfoJ0I9eVS0VE+EICDQkhCX/6553dgTu3vJqLzVrB8Q3ULGqOywUA6yIJAed
 s/BPz6IcYl5sVaZt5S5KHdB3eCeF2SlFhWpyD52e3Ejs+Ykkd8F+YbS0XN2aIyDI
 VaSFQ9gW1gtBIOe/ie0nC7vhsZUTTysYctBspic4QrTJcVRv2rlvMbSb/XpZhb87
 S8DQ1PlPIvOFz/l3NVx0MH/rz281sS7ooaDc0QHTrlz5A+weIAc7YJtrUt0Hm9Vd
 HbPyhFNEwuPv6A1FlVO6XR6DeHqUW6Kbt83tgtnboly6Ye4UQfRBiGOD8rB7HuU=
 =Xsha
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull two xen bugfixes from David Vrabel:

 - fix ARM build regression.

 - fix VIRQ_CONSOLE related oops.

* tag 'for-linus-4.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: don't bind non-percpu VIRQs with percpu chip
  xen/arm: Define xen_arch_suspend()
2015-05-21 20:19:38 -07:00
Linus Torvalds 6efdb114b4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
 "Bugfixes for HID subsystem that should go in 4.1.  Important
  highlights:

   - the patch that extended support for HID++ protocol for TK820
     touchpad turns out to be causing regressions due to firmware
     issues; patch reverting back to basic support from Benjamin
     Tissoires

   - Wacom driver can oops for devices that report non-touch data on
     touch interfaces.  Fix from Ping Cheng

   - gpiolib is not mandatory for i2c-hid, so the driver shouldn't fail
     if gpiolib is not enabled.  Fix from Mika Westerberg"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
  HID: wacom: fix an Oops caused by wacom_wac_finger_count_touches
  HID: usbhid: Add HID_QUIRK_NOGET for Aten DVI KVM switch
  HID: hid-sensor-hub: Fix debug lock warning
  Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
  HID: i2c-hid: Do not fail probing if gpiolib is not enabled
2015-05-21 17:23:11 -07:00
Linus Torvalds 3d854120e9 The first set of clk fixes for 4.1 are all driver bugs, with the
exception of a single locking fix in the core code. All driver fixes are
 for code that was merged recently. The Samsung stuff is mostly fixes
 around suspend/resume, the Qualcomm fixes are for invalid hardware
 configuration data and the Silicon Labs patches are fixes following
 their move away from platform_data to Device Tree.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVW/1LAAoJEKI6nJvDJaTUuTEP/i0mykL1zAXOoUDAVOsULkLq
 55MUnknjfoh4CHeLRru+XeBkGNDjLKPNg4y47NadrMvRsav3sqY9WR6HZasB3roa
 uGanJZRr1nVqaD3l2ki8twc4FgNfeEaFfY018Z3dDLSYbnsq990hWQChgTGLxuDN
 iz9fh8DdOALa9dp2qo604oUjVN9QU+rqRClA1d8JhtiEfeCkTjzUBO8pkJ5ZO8ve
 sRgQ6TkY0u69rVTYspot1cwkPLL8gCiX+nTazakhKNxEg6W1q7PrrHtZsWms47P5
 SpemNaSnwwRBy1wYnbZAxgsLOmg/g7seICN/uz/OyR9ZgzRJz66HeI9yPceaCbnz
 9eNxGtsDWvE5uejoqIBqivFULTtuUdo5ZOe5Xwz/b35Hy0l79T5Ag7NmPNwkAqOd
 WOSMuv581tnRpgHz3bG+PFknsjqzCKCoMfW3bjHIH56lZA9AMlJhKuV6g85XY3WA
 WAF+QtBWqD76TMzcf3DMg9egGn2321jvs5/I8jRMa9xCgV0Ycam0DwzDZ6G8KAhK
 nWWoUyQSJU0tElphfylqSFR+00anovbB/BORGwMhZ9gohDR4UCdtCK2FwOxRY7kg
 SWU1ruDpFSfLMstpwU9Rb+9F0TtSkxWWPtk23bto83erYXzx0ezwhLIA/pEPi+mO
 chOqdtgPvtfS3NeBwB/5
 =yjsw
 -----END PGP SIGNATURE-----

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Michael Turquette:
 "The first set of clk fixes for 4.1 are all driver bugs, with the
  exception of a single locking fix in the core code.

  All driver fixes are for code that was merged recently.  The Samsung
  stuff is mostly fixes around suspend/resume, the Qualcomm fixes are
  for invalid hardware configuration data and the Silicon Labs patches
  are fixes following their move away from platform_data to Device Tree"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: si5351: Do not pass struct clk in platform_data
  clk: si5351: Mention clock-names in the binding documentation
  clk: add missing lock when call clk_core_enable in clk_set_parent
  clk: exynos5420: Restore GATE_BUS_TOP on suspend
  clk: qcom: Fix MSM8916 gfx3d_clk_src configuration
  clk: qcom: Fix MSM8916 venus divider value
  clk: exynos5433: Fix wrong PMS value of exynos5433_pll_rates
  clk: exynos5433: Fix wrong parent clock of sclk_apollo clock
  clk: exynos5433: Fix CLK_PCLK_MONOTONIC_CNT clk register assignment
  clk: exynos5433: Fix wrong offset of PCLK_MSCL_SECURE_SMMU_JPEG
  clk: Use CONFIG_ARCH_EXYNOS instead of CONFIG_ARCH_EXYNOS5433
2015-05-21 16:57:50 -07:00
Eric Dumazet f5af1f57a2 inet_hashinfo: remove bsocket counter
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1 ("tcp: bind() fix autoselection to share ports")

This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:55:32 -04:00
Alexei Starovoitov 04fd61ab36 bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.

New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine
  For tracing and future seccomp the program may be triggered on all system
  calls, but processing of syscall arguments will be different. It's more
  efficient to implement them as:
  int syscall_entry(struct seccomp_data *ctx)
  {
     bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
     ... default: process unknown syscall ...
  }
  int sys_write_event(struct seccomp_data *ctx) {...}
  int sys_read_event(struct seccomp_data *ctx) {...}
  syscall_jmp_table[__NR_write] = sys_write_event;
  syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on
  packet format, like:
  int packet_parser(struct __sk_buff *skb)
  {
     ... parse L2, L3 here ...
     __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
     bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
     ... default: process unknown protocol ...
  }
  int parse_tcp(struct __sk_buff *skb) {...}
  int parse_udp(struct __sk_buff *skb) {...}
  ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
  ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for TC use case, bpf_tail_call() allows to implement reclassify-like logic

- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
  are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay performance penalty for this feature and
    tail call itself would be slower, since mandatory stack unwind, return,
    stack allocate would be done for every tailcall.
  . tailcall would be limited to programs running preempt_disabled, since
    generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
    need to be either global per_cpu variable accessed by helper and by wrapper
    or global variable protected by locks.

  In this implementation x64 JIT bypasses stack unwind and jumps into the
  callee program after prologue.

- bpf_prog_array_compatible() ensures that prog_type of callee and caller
  are the same and JITed/non-JITed flag is the same, since calling JITed
  program from non-JITed is invalid, since stack frames are different.
  Similarly calling kprobe type program from socket type program is invalid.

- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
  abstraction, its user space API and all of verifier logic.
  It's in the existing arraymap.c file, since several functions are
  shared with regular array map.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
Eric Dumazet eb9344781a tcp: add a force_schedule argument to sk_stream_alloc_skb()
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:56:40 -04:00
Johannes Berg f8bdbb5847 mac80211: add missing drv_priv description for TXQ struct
The kernel-doc description for the drv_priv member of
struct ieee80211_txq was missing, leading to errors.
Add a suitable description to fix that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:04:53 +02:00
Florian Westphal faecbb45eb Revert "netfilter: bridge: query conntrack about skb dnat"
This reverts commit c055d5b03b.

There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.

But thats only one part of the story.  This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.

Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.

However, I forgot about SNPT/DNPT which is stateless.

So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.

Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-20 13:51:25 +02:00
Andy Zhou 06b2c61c92 ip: remove unused function prototype
ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:54:36 -04:00
Daniel Borkmann 492135557d tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:53:37 -04:00
Parav Pandit c9a70d4346 net-next: ethtool: Added port speed macros.
Signed-off-by: Parav Pandit <parav.pandit@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:32:18 -04:00
David Vrabel 77bb3dfdc0 xen/events: don't bind non-percpu VIRQs with percpu chip
A non-percpu VIRQ (e.g., VIRQ_CONSOLE) may be freed on a different
VCPU than it is bound to.  This can result in a race between
handle_percpu_irq() and removing the action in __free_irq() because
handle_percpu_irq() does not take desc->lock.  The interrupt handler
sees a NULL action and oopses.

Only use the percpu chip/handler for per-CPU VIRQs (like VIRQ_TIMER).

  # cat /proc/interrupts | grep virq
   40:      87246          0  xen-percpu-virq      timer0
   44:          0          0  xen-percpu-virq      debug0
   47:          0      20995  xen-percpu-virq      timer1
   51:          0          0  xen-percpu-virq      debug1
   69:          0          0   xen-dyn-virq      xen-pcpu
   74:          0          0   xen-dyn-virq      mce
   75:         29          0   xen-dyn-virq      hvc_console

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: <stable@vger.kernel.org>
2015-05-19 19:55:36 +01:00
Eric Dumazet b5d721d761 inet: properly align icsk_ca_priv
tcp_illinois and upcoming tcp_cdg require 64bit alignment of
icsk_ca_priv

x86 does not care, but other architectures might.

Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Fan Du <fan.du@intel.com>
Acked-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 11:08:00 -04:00
Alexander Aring 0e66545701 nl802154: add support for dump phy capabilities
This patch add support to nl802154 to dump all phy capabilities which is
inside the wpan_phy_supported struct. Also we introduce a new method to
dumping supported channels. The new method will offer a easier interface
and has lesser netlink traffic.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:44 +02:00
Alexander Aring 65318680c9 ieee802154: add iftypes capability
This patch adds capability flags for supported interface types.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring edea8f7c75 cfg802154: introduce wpan phy flags
This patch introduce a flag property for the wpan phy structure.
The current flag settings in ieee802154_hw are accessable in mac802154
layer only which is okay for flags which indicates MAC handling which
are done by phy. For real PHY layer settings like cca mode, transmit
power, cca energy detection level.

The difference between these flags are that the MAC handling flags are
only handled in mac802154/HardMac layer e.g. on an interface up. The phy
settings are direct netlink calls from nl802154 into the driver layer
and the nl802154 need to have a chance to check if the driver supports
this handling before sending to the next layer.

We also check now on PHY flags while dumping and setting pib attributes.
In comparing with MIB attributes the 802.15.4 gives us an default value
which we assume when a transceiver implement less functionality. In case
of MIB settings the nl802154 layer doesn't need to check on the
ieee802154_hw flags then.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring 791021bf13 mac802154: check for really changes
This patch adds check if the value is really changed inside pib/mib.
If a transceiver do support only one value for e.g. max_be then this
will also handle that the driver layer doesn't need to care about
handling to set one value only.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring fea3318d20 ieee802154: add several phy supported handling
This patch adds support for phy supported handling for all other already
existing handling 802.15.4 functionality. We assume now a fully 802.15.4
complaint transceiver at phy allocation. If a transceiver can support
802.15.4 default values only, then the values should be overwirtten by
values the transceiver supports. If the transceiver doesn't set the
according hardware flags, we assume the 802.15.4 defaults now which
cannot be changed.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring 72f655e44d ieee802154: introduce wpan_phy_supported
This patch introduce the wpan_phy_supported struct for wpan_phy. There
is currently no way to check if a transceiver can handle IEEE 802.15.4
complaint values. With this struct we can check before if the
transceiver supports these values before sending to driver layer.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Acked-by: Varka Bhadram <varkabhadram@gmail.com>
Cc: Alan Ott <alan@signal11.us>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring 32b23550ad ieee802154: change cca ed level to mbm
This patch change the handling of cca energy detection level from dbm to
mbm. This prepares to handle floating point cca energy detection levels
values. The old netlink 802.15.4 will convert the dbm value to mbm for
handling backward compatibility.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring e2eb173aaa ieee802154: change transmit power to mbm
This patch change the handling of transmit power level from dbm to mbm.
This prepares to handle floating point transmit power levels values. The
old netlink 802.15.4 will convert the dbm value to mbm for handling
backward compatibility.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Alexander Aring 1a19cb680b ieee802154: change transmit power to s32
This patch change the transmit power from s8 to s32. This prepares to store a
mbm value instead dbm inside the transmit power variable. The old
interface keep the a s8 dbm value, which should be backward compatibility
when assign s8 to s32.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Andy Zhou 49d16b23cd bridge_netfilter: No ICMP packet on IPv4 fragmentation error
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.

However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.

This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:39 -04:00
Andy Zhou 5cf4228082 ipv4: introduce frag_expire_skip_icmp()
Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:26 -04:00
David S. Miller 0bc4c07046 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:

1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
   IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.

2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.

3) Use skb->network_header to calculate the transport offset in
   ip_set_get_ip{4,6}_port(). From Alexander Drozdov.

4) Reduce memory consumption per element due to size miscalculation,
   this patch and follow up patches from Sergey Popovich.

5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
   mtype_data_reset_flags(), also from Sergey.

6) Small clean for ipset macro trickery.

7) Fix error reporting when both ip_set_get_hostipaddr4() and
   ip_set_get_extensions() from per-set uadt functions.

8) Simplify IPSET_ATTR_PORT netlink attribute validation.

9) Introduce HOST_MASK instead of hardcoded 32 in ipset.

10) Return true/false instead of 0/1 in functions that return boolean
    in the ipset code.

11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.

12) Allow to dereference from ext_*() ipset macros.

13) Get rid of incorrect definitions of HKEY_DATALEN.

14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.

15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.

16) Release nf_bridge_info after POSTROUTING since this is only needed
    from the physdev match, also from Florian.

17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
    from Denys Vlasenko.

18) Oneliner to add ARP support to the x_tables mark match/target, from
    Zhang Chunyu.

19) Add context to know if the x_tables extension runs from nft_compat,
    to address minor problems with three existing extensions.

20) Correct return value in several seqfile *_show() functions in the
    netfilter tree, from Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 14:47:36 -04:00
WANG Cong de133464c9 netns: make nsid_lock per net
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:41:11 -04:00
Samudrala, Sridhar 45d4122ca7 switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops

v3: updated to sync with named union changes.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:49:09 -04:00
Eric Dumazet b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet a6c5ea4ccf tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet 1a24e04e4b net: fix sk_mem_reclaim_partial()
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.

SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet c91d460652 net: fix two sparse errors
First one in __skb_checksum_validate_complete() fixes the following
(and other callers)

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
  CHECK   net/ipv4/tcp_ipv4.c
include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3052:24:    expected restricted __sum16
include/linux/skbuff.h:3052:24:    got int

Second is fixing gso_make_checksum() :

  CHECK   net/ipv4/gre_offload.c
include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
include/linux/skbuff.h:3360:14:    expected unsigned short [unsigned] [usertype] csum
include/linux/skbuff.h:3360:14:    got restricted __sum16
include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3365:16:    expected restricted __sum16
include/linux/skbuff.h:3365:16:    got unsigned short [unsigned] [usertype] csum

Fixes: 5a21232983 ("net: Support for csum_bad in skbuff")
Fixes: 7e2b10c1e5 ("net: Support for multiple checksums with gso")
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Eric Dumazet d53a2aa3a1 net: fix sparse error in csum_replace4()
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/nf_nat_l3proto_ipv4.o
  CHECK   net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to

Fixes: 4565af0d40 ("net: optimise csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Linus Torvalds dd8edd7e97 TTY/Serial fixes for 4.1-rc4
Here's some TTY and serial driver fixes for reported issues.  All of
 these have been in linux-next successfully.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlVX4JMACgkQMUfUDdst+ymy/QCfSx/npI/WfRNlKBMHI20xwOaE
 szQAoJKRxeF0d+2GCJ56OVbmqqjnG4IN
 =FUtJ
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here's some TTY and serial driver fixes for reported issues.

  All of these have been in linux-next successfully"

* tag 'tty-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  pty: Fix input race when closing
  tty/n_gsm.c: fix a memory leak when gsmtty is removed
  Revert "serial/amba-pl011: Leave the TX IRQ alone when the UART is not open"
  serial: omap: Fix error handling in probe
  earlycon: Revert log warnings
2015-05-16 21:10:05 -07:00
Herbert Xu 07ee0722bf rhashtable: Add cap on number of elements in hash table
We currently have no limit on the number of elements in a hash table.
This is a problem because some users (tipc) set a ceiling on the
maximum table size and when that is reached the hash table may
degenerate.  Others may encounter OOM when growing and if we allow
insertions when that happens the hash table perofrmance may also
suffer.

This patch adds a new paramater insecure_max_entries which becomes
the cap on the table.  If unset it defaults to max_size * 2.  If
it is also zero it means that there is no cap on the number of
elements in the table.  However, the table will grow whenever the
utilisation hits 100% and if that growth fails, you will get ENOMEM
on insertion.

As allowing oversubscription is potentially dangerous, the name
contains the word insecure.

Note that the cap is not a hard limit.  This is done for performance
reasons as enforcing a hard limit will result in use of atomic ops
that are heavier than the ones we currently use.

The reasoning is that we're only guarding against a gross over-
subscription of the table, rather than a small breach of the limit.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-16 18:08:26 -04:00
David S. Miller 1d6057019e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter fixes for your net tree, they are:

1) Fix a leak in IPVS, the sysctl table is not released accordingly when
   destroying a netns, patch from Tommi Rantala.

2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is
   compiled as module, from Florian Westphal.

3) Fix TCP tracket wrt. RFC5961 challenge ACK when in LAST_ACK state, patch
   from Jesper Dangaard Brouer.

4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores
   a map, from Mirek Kratochvil.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-16 16:40:22 -04:00
Linus Torvalds 14db1e8dc0 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a suspend/resume related regression fix, and an RT priority
  boosting fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix regression in cpuset_cpu_inactive() for suspend
  sched: Handle priority boosted tasks proper in setscheduler()
2015-05-15 12:42:33 -07:00
Jesper Dangaard Brouer b3cad287d1 conntrack: RFC5961 challenge ACK confuse conntrack LAST-ACK transition
In compliance with RFC5961, the network stack send challenge ACK in
response to spurious SYN packets, since commit 0c228e833c ("tcp:
Restore RFC5961-compliant behavior for SYN packets").

This pose a problem for netfilter conntrack in state LAST_ACK, because
this challenge ACK is (falsely) seen as ACKing last FIN, causing a
false state transition (into TIME_WAIT).

The challenge ACK is hard to distinguish from real last ACK.  Thus,
solution introduce a flag that tracks the potential for seeing a
challenge ACK, in case a SYN packet is let through and current state
is LAST_ACK.

When conntrack transition LAST_ACK to TIME_WAIT happens, this flag is
used for determining if we are expecting a challenge ACK.

Scapy based reproducer script avail here:
 https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py

Fixes: 0c228e833c ("tcp: Restore RFC5961-compliant behavior for SYN packets")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15 20:50:56 +02:00
Linus Torvalds c4d0bcc228 Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "Radeon:
     one oops fix, one bug fix, one pci id addition patch

  i915:
     one suspend/resume regression fix.

  All seems quiet enough."

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: don't do mst probing if MST isn't enabled.
  drm/radeon: add new bonaire pci id
  drm/radeon: fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling
  drm/i915: Avoid GPU hang when coming out of s3 or s4
2015-05-15 11:44:30 -07:00
Pablo Neira Ayuso 55917a21d0 netfilter: x_tables: add context to know if extension runs from nft_compat
Currently, we have four xtables extensions that cannot be used from the
xt over nft compat layer. The problem is that they need real access to
the full blown xt_entry to validate that the rule comes with the right
dependencies. This check was introduced to overcome the lack of
sufficient userspace dependency validation in iptables.

To resolve this problem, this patch introduces a new field to the
xt_tgchk_param structure that tell us if the extension is run from
nft_compat context.

The three affected extensions are:

1) CLUSTERIP, this target has been superseded by xt_cluster. So just
   bail out by returning -EINVAL.

2) TCPMSS. Relax the checking when used from nft_compat. If used with
   the wrong configuration, it will corrupt !syn packets by adding TCP
   MSS option.

3) ebt_stp. Relax the check to make sure it uses the reserved
   destination MAC address for STP.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
2015-05-15 20:14:07 +02:00
Dave Airlie e52f649e5b Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
radeon minor fixes, and pci id addition.
* 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux:
  drm/radeon: don't do mst probing if MST isn't enabled.
  drm/radeon: add new bonaire pci id
  drm/radeon: fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling
2015-05-15 15:20:45 +10:00
Roopa Prabhu eea39946a1 rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD
RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output.

This patch renames the flag to be consistent with what the user sees.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:45:39 -04:00
Bert Vermeulen ef7f3a5c71 mdio-gpio: Propagate mii_bus.phy_ignore_ta_mask
This also changes mii_bus.phy_mask to u32 for consistency.

Signed-off-by: Bert Vermeulen <bert@biot.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:35:13 -04:00
Daniel Borkmann a4afd37b26 test_bpf: add tests related to BPF_MAXINSNS
Couple of torture test cases related to the bug fixed in 0b59d8806a
("ARM: net: delegate filter to kernel interpreter when imm_offset()
return value can't fit into 12bits.").

I've added a helper to allocate and fill the insn space. Output on
x86_64 from my laptop:

test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS

test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:34:10 -04:00
Eric Dumazet 264ea103a7 tcp: syncookies: extend validity range
Now we allow storing more request socks per listener, we might
hit syncookie mode less often and hit following bug in our stack :

When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.

This is a way too strong requirement and conflicts with rest of
syncookie code which allows ACK to be aged up to 2 minutes.

Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.

So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:32:17 -04:00
Josh Triplett 929aa5b250 uidgid: make uid_valid and gid_valid work with !CONFIG_MULTIUSER
{u,g}id_valid call {u,g}id_eq, which calls __k{u,g}id_val on both
arguments and compares.  With !CONFIG_MULTIUSER, __k{u,g}id_val return a
constant 0, which makes {u,g}id_valid always return false.  Change
{u,g}id_valid to compare their argument against -1 instead.  That produces
identical results in the normal CONFIG_MULTIUSER=y case, but with
!CONFIG_MULTIUSER will make {u,g}id_valid constant-fold into "return
true;" rather than "return false;".

This fixes uses of devpts without CONFIG_MULTIUSER.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>,
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-14 17:55:51 -07:00
Vladimir Davydov 8f4fc071b1 gfp: add __GFP_NOACCOUNT
Not all kmem allocations should be accounted to memcg.  The following
patch gives an example when accounting of a certain type of allocations to
memcg can effectively result in a memory leak.  This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
allocation to go through the root cgroup.  It will be used by the next
patch.

Note, since in case of kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.

Alternatively, we could introduce a per kmem cache flag disabling
accounting for all allocations of a particular kind, but (a) we would not
be able to bypass accounting for kmalloc then and (b) a kmem cache with
this flag set could not be merged with a kmem cache without this flag,
which would increase the number of global caches and therefore
fragmentation even if the memory cgroup controller is not used.

Despite its generic name, currently __GFP_NOACCOUNT disables accounting
only for kmem allocations while user page allocations are always charged.
To catch abusing of this flag, a warning is issued on an attempt of
passing it to mem_cgroup_try_charge.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>	[4.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-14 17:55:51 -07:00
Florian Fainelli 922f2dd1b6 net: phy: Add phy_ignore_ta_mask to account for broken turn-around
Some PHY devices/switches will not release the turn-around line as they
should do at the end of a MDIO transaction. To help with such
situations, allow MDIO bus drivers to be made aware of such
restrictions.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 13:40:55 -04:00
Denys Vlasenko a3b1c1eb50 netfilter: ipset: deinline ip_set_put_extensions()
On x86 allyesconfig build:
The function compiles to 489 bytes of machine code.
It has 25 callsites.

    text    data       bss       dec     hex filename
82441375 22255384 20627456 125324215 7784bb7 vmlinux.before
82434909 22255384 20627456 125317749 7783275 vmlinux

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: David S. Miller <davem@davemloft.net>
CC: Jan Engelhardt <jengelh@medozas.de>
CC: Jiri Pirko <jpirko@redhat.com>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: netfilter-devel@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-14 12:51:19 +02:00
Florian Westphal 7fb48c5bc3 netfilter: bridge: neigh_head and physoutdev can't be used at same time
The neigh_header is only needed when we detect DNAT after prerouting
and neigh cache didn't have a mac address for us.

The output port has not been chosen yet so we can re-use the storage
area, bringing struct size down to 32 bytes on x86_64.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-14 12:43:48 +02:00
Pablo Neira e687ad60af netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.

Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().

* Without this patch:

Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
  16086246pps 7721Mb/sec (7721398080bps) errors: 100000000

    42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
    25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
     7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
     5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
     2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
     2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
     1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb

* With this patch:

Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
  16090536pps 7723Mb/sec (7723457280bps) errors: 100000000

    41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
    26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
     7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
     5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
     2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
     2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
     1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb

* Without this patch + tc ingress:

        tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                u32 match ip dst 4.3.2.1/32

Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
  10788648pps 5178Mb/sec (5178551040bps) errors: 100000000

    40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
    17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
    11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
     5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
     5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
     3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
     2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
     1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
     1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
     0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb

* With this patch + tc ingress:

        tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                u32 match ip dst 4.3.2.1/32

Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
  10743194pps 5156Mb/sec (5156733120bps) errors: 100000000

    42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
    17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
    11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
     5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
     5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
     2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
     2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
     1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
     1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk

Note that the results are very similar before and after.

I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accomodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.

Using gcc version 4.8.4 on:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
[...]
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Pablo Neira 1cf51900f8 net: add CONFIG_NET_INGRESS to enable ingress filtering
This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.

Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Pablo Neira b8d0aad0c7 netfilter: add nf_hook_list_active()
In preparation to have netfilter ingress per-device hook list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Pablo Neira f719148346 netfilter: add hook list to nf_hook_state
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Pablo Neira 87d5c18ce1 netfilter: cleanup struct nf_hook_ops indentation
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
John W. Linville 2d07dc79fe geneve: add initial netdev driver for GENEVE tunnels
This is an initial implementation of a netdev driver for GENEVE
tunnels.  This implementation uses a fixed UDP port, and only supports
point-to-point links with specific partner endpoints.  Only IPv4
links are supported at this time.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:59:13 -04:00
John W. Linville 35d32e8fe4 geneve: move definition of geneve_hdr() to geneve.h
This is a static inline with identical definitions in multiple places...

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:59:13 -04:00
Pablo Neira f0b5e8a42f net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset
This fixes 4577139b2d ("net: use jump label patching for ingress qdisc in
__netif_receive_skb_core").

The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
there is no way these definition can be of any help.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:44:28 -04:00
Willem de Bruijn a9b6391814 packet: rollover statistics
Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.

If no socket with sufficient room is found, rollover fails. Also count
these events.

Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.

Tested:
  Read counters in bench_rollover on all other tests in the patchset

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:43:00 -04:00
Jiri Pirko 77b9900ef5 tc: introduce Flower classifier
This patch introduces a flow-based filter. So far, the very essential
packet fields are supported.

This patch is only the first step. There is a lot of potential performance
improvements possible to implement. Also a lot of features are missing
now. They will be addressed in follow-up patches.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:48 -04:00
Jiri Pirko 59346afe7a flow_dissector: change port array into src, dst tuple
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko 67a900cc04 flow_dissector: introduce support for Ethernet addresses
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko b924933cbb flow_dissector: introduce support for ipv6 addressses
So far, only hashes made out of ipv6 addresses could be dissected. This
patch introduces support for dissection of full ipv6 addresses.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko c3f8eaeb6e flow_dissector: add missing header includes
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko 06635a35d1 flow_dissect: use programable dissector in skb_flow_dissect and friends
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko fbff949e3b flow_dissector: introduce programable flow_dissector
Introduce dissector infrastructure which allows user to specify which
parts of skb he wants to dissect.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko 5605c76240 net: move __skb_tx_hash to dev.c
__skb_tx_hash function has no relation to flow_dissect so just move it
to dev.c

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:46 -04:00
Jiri Pirko 9c684b5083 net: move __skb_get_hash function declaration to flow_dissector.h
Since the definition of the function is in flow_dissector.c, it makes
sense to have the declaration in flow_dissector.h

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:46 -04:00
Jiri Pirko 10b89ee43e net: move *skb_get_poff declarations into correct header
Since these functions are defined in flow_dissector.c, move header
declarations from skbuff.h into flow_dissector.h

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Jiri Pirko b0a31431b4 flow_dissector: remove unused function flow_get_hlen declaration
commit 56193d1bce ("net: Add function for parsing the header length out
of linear ethernet frames") added this function declaration but it is
defined nowhere.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Jiri Pirko 1bd758eb1c net: change name of flow_dissector header to match the .c file name
add couple of empty lines on the way.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Florian Westphal e578d9c025 net: sched: use counter to break reclassify loops
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.

skb_act_clone is now identical to skb_clone, so just use that.

Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
 protocol ip u32 match u32 0 0 police rate 10Kbit burst \
 64000 mtu 1500 action reclassify

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:08:14 -04:00
David S. Miller b04096ff33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Four minor merge conflicts:

1) qca_spi.c renamed the local variable used for the SPI device
   from spi_device to spi, meanwhile the spi_set_drvdata() call
   got moved further up in the probe function.

2) Two changes were both adding new members to codel params
   structure, and thus we had overlapping changes to the
   initializer function.

3) 'net' was making a fix to sk_release_kernel() which is
   completely removed in 'net-next'.

4) In net_namespace.c, the rtnl_net_fill() call for GET operations
   had the command value fixed, meanwhile 'net-next' adjusted the
   argument signature a bit.

This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:31:43 -04:00
Scott Feldman 42275bd8fc switchdev: don't use anonymous union on switchdev attr/obj structs
Older gcc versions (e.g.  gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs.  Fix this by using named union on structs.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:20:59 -04:00
Scott Feldman 5eb764edee switchdev: align comment with other comments in block
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 12:26:27 -04:00
Sergey Popovich 275e2bc0f2 netfilter: ipset: Fix ext_*() macros
So pointers returned by these macros could be
referenced with -> directly.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-13 18:21:07 +02:00
Linus Torvalds 110bc76729 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
    Avri Altman.

 2) Use the correct FW API for scan completions in iwlwifi, from Avraham
    Stern.

 3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad
    Kaufman.

 4) rhashtable conversion of mac80211 station table was buggy, the
    virtual interface was not taken into account.  Fix from Johannes
    Berg.

 5) Fix deadlock in rtlwifi by not using a zero timeout for
    usb_control_msg(), from Larry Finger.

 6) Update reordering state before calculating loss detection, from
    Yuchung Cheng.

 7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter.

 8) Fix extended frame handling in xiling_can driver, from Jeppe
    Ledet-Pedersen.

 9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
    from Eric Dumazet.

10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.

11) macvlan needs to propagate promisc settings down the the lower
    device, from Vlad Yasevich.

12) igb driver can oops when changing number of rings, from Toshiaki
    Makita.

13) Source specific default routes not handled properly in ipv6, from
    Markus Stenberg.

14) Use after free in tc_ctl_tfilter(), from WANG Cong.

15) Use softirq spinlocking in netxen driver, from Tony Camuso.

16) Two ARM bpf JIT fixes from Nicolas Schichan.

17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
    Mathias Kretschmer.

18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
    Starovoitov.

19) ll_temac driver DMA maps TX packet header with incorrect length, fix
    from Michal Simek.

20) We removed pm_qos bits from netdevice.h, but some indirect
    references remained.  Kill them.  From David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  net: Remove remaining remnants of pm_qos from netdevice.h
  e1000e: Add pm_qos header
  net: phy: micrel: Fix regression in kszphy_probe
  net: ll_temac: Fix DMA map size bug
  x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
  netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
  Update be2net maintainers' email addresses
  net_sched: gred: use correct backlog value in WRED mode
  pppoe: drop pppoe device in pppoe_unbind_sock_work
  net: qca_spi: Fix possible race during probe
  net: mdio-gpio: Allow for unspecified bus id
  af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
  bnx2x: limit fw delay in kdump to 5s after boot
  ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
  ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
  mpls: Change reserved label names to be consistent with netbsd
  usbnet: avoid integer overflow in start_xmit
  netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
  net: xgene_enet: Set hardware dependency
  net: amd-xgbe: Add hardware dependency
  ...
2015-05-12 21:10:38 -07:00
David Ahern 01d460dd70 net: Remove remaining remnants of pm_qos from netdevice.h
Commit e2c6544829 removed pm_qos from struct net_device but left the
comment and header file. Remove those.

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:22:03 -04:00
Ying Xue 9449c3cd90 net: make skb_dst_pop routine static
As xfrm_output_one() is the only caller of skb_dst_pop(), we should
make skb_dst_pop() localized.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:19:49 -04:00
Michael Holzheu cffc642d93 test_bpf: add 173 new testcases for eBPF
add an exhaustive set of eBPF tests bringing total to:
test_bpf: Summary: 233 PASSED, 0 FAILED, [0/226 JIT'ed]

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:15:25 -04:00
Denys Vlasenko a2029240e5 net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue()
These functions compile to 60 bytes of machine code each.
With this .config: http://busybox.net/~vda/kernel_config
there are 617 calls of netif_tx_stop_queue()
and 49 calls of netif_tx_stop_all_queues() in vmlinux.

To fix this, remove WARN_ON in netif_tx_stop_queue()
as suggested by davem, and deinline netif_tx_stop_all_queues().

Change in code size is about 20k:

   text      data      bss       dec     hex filename
82426986 22255416 20627456 125309858 77813a2 vmlinux.before
82406248 22255416 20627456 125289120 777c2a0 vmlinux

gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue
sometimes:

$ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l
190

ffffffff81b558a8 <netif_tx_stop_queue>:
ffffffff81b558a8:       55                      push   %rbp
ffffffff81b558a9:       48 89 e5                mov    %rsp,%rbp
ffffffff81b558ac:       f0 80 8f e0 01 00 00    lock orb $0x1,0x1e0(%rdi)
ffffffff81b558b3:       01
ffffffff81b558b4:       5d                      pop    %rbp
ffffffff81b558b5:       c3                      retq

This needs additional fixing.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Joe Perches <joe@perches.com>
CC: David S. Miller <davem@davemloft.net>
CC: Jiri Pirko <jpirko@redhat.com>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: netfilter-devel@vger.kernel.org
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:05:35 -04:00
Scott Feldman 7889cbee83 switchdev: remove NETIF_F_HW_SWITCH_OFFLOAD feature flag
Roopa said remove the feature flag for this series and she'll work on
bringing it back if needed at a later date.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman 58c2cb16b1 switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/del
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with
only four switchdev ops: port get/set and port add/del.  Other objs will
follow, such as FDB.  So go ahead and convert IPv4 FIB over to switchdev
obj for consistency, anticipating more objs to come.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman 8793d0a664 switchdev: add new switchdev_port_bridge_getlink
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and
call into port driver to get port attrs.  For now, only BR_LEARNING and
BR_LEARNING_SYNC are returned.  To add more, we'll probably want to break
away from ndo_dflt_bridge_getlink() and build the netlink skb directly in
the switchdev code.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman 87a5dae59e switchdev: remove unused switchdev_port_bridge_dellink
Now we can remove old wrappers for dellink.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman 5c34e02214 switchdev: add new switchdev_port_bridge_dellink
Same change as setlink.  Provide the wrapper op for SELF ndo_bridge_dellink
and call into the switchdev driver to delete afspec VLANs.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman e71f220b34 switchdev: remove old switchdev_port_bridge_setlink
New attr-based bridge_setlink can recurse lower devs and recover on err, so
remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink).

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:54 -04:00
Scott Feldman 6004c86718 switchdev: add bridge port flags attr
rocker: use switchdev get/set attr for bridge port flags

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:54 -04:00
Scott Feldman 6fc3016da7 switchdev: add port vlan obj
VLAN obj has flags (PVID and untagged) as well as start and end vid ranges.
The switchdev driver can optimize programing the device using the ranges.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Scott Feldman 491d0f1533 switchdev: introduce switchdev add/del obj ops
Like switchdev attr get/set, add new switchdev obj add/del.  switchdev objs
will be things like VLANs or FIB entries, so add/del fits better for
objects than get/set used for attributes.

Use same two-phase prepare-commit transaction model as in attr set.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Scott Feldman 3563606258 switchdev: convert STP update to switchdev attr set
STP update is just a settable port attribute, so convert
switchdev_port_stp_update to an attr set.

For DSA, the prepare phase is skipped and STP updates are only done in the
commit phase.  This is because currently the DSA drivers don't need to
allocate any memory for STP updates and the STP update will not fail to HW
(unless something horrible goes wrong on the MDIO bus, in which case the
prepare phase wouldn't have been able to predict anyway).

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Scott Feldman f8e20a9f87 switchdev: convert parent_id_get to switchdev attr get
Switch ID is just a gettable port attribute.  Convert switchdev op
switchdev_parent_id_get to a switchdev attr.

Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is
called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only
port netdevs.  So when a port is stacked under bond/bridge, the user can
only query switch id via the switch ports, but not via the upper devices

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Scott Feldman 3094333d90 switchdev: introduce get/set attrs ops
Add two new swdev ops for get/set switch port attributes.  Most swdev
interactions on a port are gets or sets on port attributes, so rather than
adding ops for each attribute, let's define clean get/set ops for all
attributes, and then we can have clear, consistent rules on how attributes
propagate on stacked devs.

Add the basic algorithms for get/set attr ops.  Use the same recusive algo
to walk lower devs we've used for STP updates, for example.  For get,
compare attr value for each lower dev and only return success if attr
values match across all lower devs.  For sets, set the same attr value for
all lower devs.  We'll use a two-phase prepare-commit transaction model for
sets.  In the first phase, the driver(s) are asked if attr set is OK.  If
all OK, the commit attr set in second phase.  A driver would NACK the
prepare phase if it can't set the attr due to lack of resources or support,
within it's control.  RTNL lock must be held across both phases because
we'll recurse all lower devs first in prepare phase, and then recurse all
lower devs again in commit phase.  If any lower dev fails the prepare
phase, we need to abort the transaction for all lower devs.

If lower dev recusion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to
indicate get/set only work on port (lowest) device.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Jiri Pirko 9d47c0a2d9 switchdev: s/swdev_/switchdev_/
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00
Jiri Pirko ebb9a03a59 switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:52 -04:00
David Ward a3eb95f891 net_sched: gred: add TCA_GRED_LIMIT attribute
In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
parameters configured, then packets for the default VQ are not subjected
to RED and are only dropped if the queue is larger than the net_device's
tx_queue_len. This behavior is useful for WRED mode, since these packets
will still influence the calculated average queue length and (therefore)
the drop probability for all of the other VQs. However, for some drivers
tx_queue_len is zero. In other cases the user may wish to make the limit
the same for all VQs (including the default VQ with no drop parameters).

This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
in bytes, during qdisc setup. (This limit is in bytes to be consistent
with the drop parameters.) The default limit is the same as for a bfifo
queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are
configured with a smaller limit than the GRED queue limit, that VQ will
still observe the smaller limit instead.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:22:49 -04:00
Mike Snitzer 336b7e1f23 block: remove export for blk_queue_bio
With commit ff36ab345 ("dm: remove request-based logic from
make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
so remove its export.  Doing so required a forward declaration in
blk-core.c.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-12 17:21:22 -04:00