Commit Graph

77928 Commits

Author SHA1 Message Date
Ken-ichirou MATSUZAWA a29a9a585b netfilter: nfnetlink_log: allow to attach conntrack
This patch enables to include the conntrack information together
with the packet that is sent to user-space via NFLOG, then a
user-space program can acquire NATed information by this NFULA_CT
attribute.

Including the conntrack information is optional, you can set it
via NFULNL_CFG_F_CONNTRACK flag with the NFULA_CFG_FLAGS attribute
like NFQUEUE.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:14 +02:00
Ken-ichirou MATSUZAWA 224a05975e netfilter: ctnetlink: add const qualifier to nfnl_hook.get_ct
get_ct as is and will not update its skb argument, and users of
nfnl_ct_hook is currently only nfqueue, we can add const qualifier.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
2015-10-05 17:32:13 +02:00
Ken-ichirou MATSUZAWA a4b4766c3c netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info
The idea of this series of patch is to attach conntrack information to
nflog like nfqueue has already done. nfqueue conntrack info attaching
basis is generic, rename those names to generic one, glue.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:11 +02:00
Flavio Leitner 0647e70834 netfilter: remove dead code
Remove __nf_conntrack_find() from headers.

Fixes: dcd93ed4cd ("netfilter: nf_conntrack: remove dead code")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:01 +02:00
Pablo Neira Ayuso b7bd1809e0 netfilter: nfnetlink_queue: get rid of nfnetlink_queue_ct.c
The original intention was to avoid dependencies between nfnetlink_queue and
conntrack without ifdef pollution. However, we can achieve this by moving the
conntrack dependent code into ctnetlink and keep some glue code to access the
nfq_ct indirection from nfqueue.

After this patch, the nfq_ct indirection is always compiled in the netfilter
core to avoid polluting nfqueue with ifdefs. Thus, if nf_conntrack is not
compiled this results in only 8-bytes of memory waste in x86_64.

This patch also adds ctnetlink_nfqueue_seqadj() to avoid that the nf_conn
structure layout if exposed to nf_queue, which creates another dependency with
nf_conntrack at compilation time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-04 21:45:44 +02:00
Eric Dumazet e96f78ab27 tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets
Before letting request sockets being put in TCP/DCCP regular
ehash table, we need to add either :

- SLAB_DESTROY_BY_RCU flag to their kmem_cache
- add RCU grace period before freeing them.

Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
like ESTABLISH and TIMEWAIT sockets, use it here.

req_prot_init() being only used by TCP and DCCP, I did not add
a new slab_flags into their rsk_prot, but reuse prot->slab_flags

Since all reqsk_alloc() users are correctly dealing with a failure,
add the __GFP_NOWARN flag to avoid traces under pressure.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 13:25:20 -07:00
Simon Horman f3a6bd393c phylib: Add phy_set_max_speed helper
Add a helper to allow ethernet drivers to limit the speed of a phy
(that they are attached to).

This mainly involves factoring out the business-end of
of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several different
variants.

It is is envisaged that this will be used in situations where setting the
"max-speed" property in DT is not appropriate, e.g. because the maximum
speed is not a property of the phy hardware.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:05:18 -07:00
Daniel Borkmann c46646d048 sched, bpf: add helper for retrieving routing realms
Using routing realms as part of the classifier is quite useful, it
can be viewed as a tag for one or multiple routing entries (think of
an analogy to net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32bit realm for the dst.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:41 -07:00
Daniel Borkmann a91263d520 ebpf: migrate bpf_prog's flags to bitfield
As we need to add further flags to the bpf_prog structure, lets migrate
both bools to a bitfield representation. The size of the base structure
(excluding insns) remains unchanged at 40 bytes.

Add also tags for the kmemchecker, so that it doesn't throw false
positives. Even in case gcc would generate suboptimal code, it's not
being accessed in performance critical paths.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:39 -07:00
Jiri Pirko 9e8f4a548a switchdev: push object ID back to object structure
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:40 -07:00
Jiri Pirko 648b4a995a switchdev: bring back switchdev_obj and use it as a generic object param
Replace "void *obj" with a generic structure. Introduce couple of
helpers along that.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:39 -07:00
Jiri Pirko 52ba57cfdc switchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb
Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:39 -07:00
Jiri Pirko 8f24f3095d switchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan
Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:38 -07:00
Jiri Pirko 1f86839874 switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*
To be aligned with obj.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:37 -07:00
Jiri Pirko 57d80838da switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:36 -07:00
Eric Dumazet ef547f2ac1 tcp: remove max_qlen_log
This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.

Also rounding to powers of two was not very friendly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:44 -07:00
Eric Dumazet 10cbc8f179 tcp/dccp: remove struct listen_sock
It is enough to check listener sk_state, no need for an extra
condition.

max_qlen_log can be moved into struct request_sock_queue

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet 1b33bc3e9e ipv6: remove obsolete inet6 functions
inet6_csk_search_req() and inet6_csk_reqsk_queue_hash_add()
no longer exist.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:42 -07:00
Eric Dumazet 81b496b31a tcp/dccp: shrink struct listen_sock
We no longer use hash_rnd, nr_table_entries and syn_table[]

For a listener with a backlog of 10 millions sockets, this
saves 80 MBytes of vmalloced memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:42 -07:00
Eric Dumazet 079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet 2feda34192 tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument
This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet aa3a0c8ce6 tcp: get_openreq[46]() changes
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet b267cdd107 tcp/dccp: init sk_prot and call sk_node_init() in reqsk_alloc()
We plan to use generic functions to insert request sockets
into ehash table.

sk_prot needs to be set (to retrieve sk_prot->h.hashinfo)
sk_node needs to be cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:38 -07:00
Eric Dumazet 8d2675f1e4 tcp: move synflood_warned into struct request_sock_queue
long term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:37 -07:00
Eric Dumazet aac065c50a tcp: move qlen/young out of struct listen_sock
qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for upcoming lockless listener.

Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
Eric Dumazet fff1f3001c tcp: add a spinlock to protect struct request_sock_queue
struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock)

We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry with backlog notion
that the listener 'lock' carries.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
David S. Miller f6d3125fa3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/dsa/slave.c

net/dsa/slave.c simply had overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-02 07:21:25 -07:00
Linus Torvalds 8c25ab8b5a Merge git://git.infradead.org/intel-iommu
Pull IOVA fixes from David Woodhouse:
 "The main fix here is the first one, fixing the over-allocation of
   size-aligned requests.  The other patches simply make the existing
  IOVA code available to users other than the Intel VT-d driver, with no
  functional change.

  I concede the latter really *should* have been submitted during the
  merge window, but since it's basically risk-free and people are
  waiting to build on top of it and it's my fault I didn't get it in, I
  (and they) would be grateful if you'd take it"

* git://git.infradead.org/intel-iommu:
  iommu: Make the iova library a module
  iommu: iova: Export symbols
  iommu: iova: Move iova cache management to the iova library
  iommu/iova: Avoid over-allocating when size-aligned
2015-10-02 07:59:29 -04:00
Linus Torvalds bde17b90dd Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "12 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dmapool: fix overflow condition in pool_find_page()
  thermal: avoid division by zero in power allocator
  memcg: remove pcp_counter_lock
  kprobes: use _do_fork() in samples to make them work again
  drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
  memcg: make mem_cgroup_read_stat() unsigned
  memcg: fix dirty page migration
  dax: fix NULL pointer in __dax_pmd_fault()
  mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
  userfaultfd: remove kernel header include from uapi header
  arch/x86/include/asm/efi.h: fix build failure
2015-10-01 22:20:11 -04:00
Linus Torvalds 1bca1000fa Power management and ACPI material for v4.3-rc4
- intel_idle driver fixup for the recently added Skylake chips
    support (Len Brown).
 
  - Operating Performance Points (OPP) library fix related to the
    recently added support for new DT bindings and a fix for a typo
    in a comment (Viresh Kumar, Stephen Boyd).
 
  - ACPI EC driver fix for a recently introduced memory leak in an
    error code path (Lv Zheng).
 
  - ACPI PCI IRQ management fix for the issue where an ISA IRQ is
    shared with a PCI device which requires it to be configured in a
    different way and may cause an interrupt storm to happen as a
    result with an extra ACPI SCI IRQ handling simplification on top
    of it (Jiang Liu).
 
  - Update of the PCI power management documentation that became
    outdated and started to actively confuse the readers to make
    it actually reflect the code (Rafael J Wysocki).
 
  - turbostat fixes including an IVB Xeon regression fix (related to
    the --debug command line option), Skylake adjustment for the TSC
    running at a frequency that doesn't match the base one exactly,
    and a Knights Landing quirk to account for the fact that it only
    updates APERF and MPERF every 1024 clock cycles plus bumping up
    the turbostat version number (Len Brown, Hubert Chrzaniuk).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJWDag4AAoJEILEb/54YlRxy/UQAJa39EC2IQd+PrMlgMx3cp2N
 ssotwuQiQ0jL2V/qc36wfzgu3A5k0ldHHQGbgX0f/z9LjD+zLsZiPtHj27LrNtG5
 J9DgViLh9vut4XEsLlzj8W2z1OcTyAmZyTIiVeFlj/zM517oeXKVYMX2RuhHQk0r
 lwDI/hc1rtpUkdN7gkT9DqyO32r1LgNkDt6+ubRr/qrYVhYPXSrp4k9wxnr9j1Bx
 0G9bvCz8ETTclRPcfToGU9P86snk5FS3veSm231ioABdry7BxhTZHjQKSZyjuvx4
 l8YedxBc0ks7yyeN9lvWPbNSpHLjhYen+d9q1koQsHJYb+gWJ/KbSGu3kfg0bPDj
 Rzh1u76ak7MOYpkn+95MRhzIiFxG3IhUoqYhIGGyCNFGAJgPfFos2IJTISAxSmTE
 ebCyFEX07AdhjHac4RyRCnMVavZthgLyXHwXiNqG9gdW9aOEzN65svH2LLMBiKcH
 IGRCsjom1uCUT0y1gy3R7q1nTCi112IcXwvAziX7QKCNOxLIH8HJNiraVcyl2vY5
 BbDyTOQ7VboviWWSQ09+bQFq4CAhe4b9+nR4XhvHO9F0ffxBujBoCwjjFQY+yJIH
 9nYaYyUynpi1m0Y1AwlrI8wgVLDfNEE6UU63clHQ2PoOFfDDE+/5I/l3yuWubo0I
 cUtW1RVEgDaa61ehyFuS
 =ELup
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki:
 "These are fixes mostly, for a few changes made in this cycle (the
  intel_idle driver, the OPP library, the ACPI EC driver, turbostat) and
  for some issues that have just been discovered (ACPI PCI IRQ
  management, PCI power management documentation, turbostat), with a
  couple of cleanups on top of them.

  Specifics:

   - intel_idle driver fixup for the recently added Skylake chips
     support (Len Brown).

   - Operating Performance Points (OPP) library fix related to the
     recently added support for new DT bindings and a fix for a typo in
     a comment (Viresh Kumar, Stephen Boyd).

   - ACPI EC driver fix for a recently introduced memory leak in an
     error code path (Lv Zheng).

   - ACPI PCI IRQ management fix for the issue where an ISA IRQ is
     shared with a PCI device which requires it to be configured in a
     different way and may cause an interrupt storm to happen as a
     result with an extra ACPI SCI IRQ handling simplification on top of
     it (Jiang Liu).

   - Update of the PCI power management documentation that became
     outdated and started to actively confuse the readers to make it
     actually reflect the code (Rafael J Wysocki).

   - turbostat fixes including an IVB Xeon regression fix (related to
     the --debug command line option), Skylake adjustment for the TSC
     running at a frequency that doesn't match the base one exactly, and
     a Knights Landing quirk to account for the fact that it only
     updates APERF and MPERF every 1024 clock cycles plus bumping up the
     turbostat version number (Len Brown, Hubert Chrzaniuk)"

* tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tools/power turbosat: update version number
  tools/power turbostat: SKL: Adjust for TSC difference from base frequency
  tools/power turbostat: KNL workaround for %Busy and Avg_MHz
  tools/power turbostat: IVB Xeon: fix --debug regression
  ACPI / PCI: Remove duplicated penalty on SCI IRQ
  ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ
  ACPI / EC: Fix a memory leak issue in acpi_ec_query()
  PM / OPP: Fix typo modifcation -> modification
  PCI / PM: Update runtime PM documentation for PCI devices
  PM / OPP: of_property_count_u32_elems() can return errors
  intel_idle: Skylake Client Support - updated
2015-10-01 22:06:40 -04:00
Linus Torvalds 3deaa4f531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

1) Fix regression in SKB partial checksum handling, from Pravin B
   Shalar.

2) Fix VLAN inside of VXLAN handling in i40e driver, from Jesse
   Brandeburg.

3) Cure softlockups during accept() in SCTP, from Karl Heiss.

4) MSG_PEEK should return multiple SKBs worth of data in AF_UNIX, from
   Aaron Conole.

5) IPV6 erroneously ignores output interface specifier in lookup key for
   route lookups, fix from David Ahern.

6) In Marvell DSA driver, forward unknown frames to CPU port, from
   Andrew Lunn.

7) Mission flow flag initializations in some code paths, from David
   Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: Initialize flow flags in input path
  net: dsa: fix preparation of a port STP update
  testptp: Silence compiler warnings on ppc64
  net/mlx4: Handle return codes in mlx4_qp_attach_common
  dsa: mv88e6xxx: Enable forwarding for unknown to the CPU port
  skbuff: Fix skb checksum partial check.
  net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set
  net sysfs: Print link speed as signed integer
  bna: fix error handling
  af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
  af_unix: Convert the unix_sk macro to an inline function for type safety
  net: sctp: Don't use 64 kilobyte lookup table for four elements
  l2tp: protect tunnel->del_work by ref_count
  net/ibm/emac: bump version numbers for correct work with ethtool
  sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
  sctp: Whitespace fix
  i40e/i40evf: check for stopped admin queue
  i40e: fix VLAN inside VXLAN
  r8169: fix handling rtl_readphy result
  net: hisilicon: fix handling platform_get_irq result
2015-10-01 21:55:35 -04:00
Greg Thelen ef510194ce memcg: remove pcp_counter_lock
Commit 733a572e66 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01 21:42:35 -04:00
Greg Thelen 0610c25daa memcg: fix dirty page migration
The problem starts with a file backed dirty page which is charged to a
memcg.  Then page migration is used to move oldpage to newpage.

Migration:
 - copies the oldpage's data to newpage
 - clears oldpage.PG_dirty
 - sets newpage.PG_dirty
 - uncharges oldpage from memcg
 - charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.

However, because newpage is not yet charged, setting newpage.PG_dirty
does not increment the memcg's dirty page count.  After migration
completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned().  At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration.  This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.

This issue:
 - can harm performance of non root memcg buffered writes.
 - can report too small (even negative) values in
   memory.stat[(total_)dirty] counters of all memcg, including the root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.

Test:
    0) setup and enter limited memcg
    mkdir /sys/fs/cgroup/test
    echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    1) buffered writes baseline
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    2) buffered writes with compaction antagonist to induce migration
    yes 1 > /proc/sys/vm/compact_memory &
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    kill %
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    3) buffered writes without antagonist, should match baseline
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

                       (speed, dirty residue)
             unpatched                       patched
    1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
    2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
    3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages

    Notice that unpatched baseline performance (1) fell after
    migration (3): 841 -> 114 MB/s.  In the patched kernel, post
    migration performance matches baseline.

Fixes: c4843a7593 ("memcg: add per cgroup dirty page accounting")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01 21:42:35 -04:00
Andre Przywara 9ff42d10c3 userfaultfd: remove kernel header include from uapi header
As include/uapi/linux/userfaultfd.h is a user visible header file, it
should not include kernel-exclusive header files.

So trying to build the userfaultfd test program from the selftests
directory fails, since it contains a reference to linux/compiler.h.  As
it turns out, that header is not really needed there, so we can simply
remove it to fix that issue.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01 21:42:35 -04:00
Linus Torvalds 46c8217c4a Changes for 4.3-rc4
- Fixes for mlx5 related issues
 - Fixes for ipoib multicast handling
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWCfALAAoJELgmozMOVy/dc+MQAKoD6echYpTkWE0otMuHQcYf
 zMaVVots+JdRKpA6OqHYQHgKGA80z21BpnjGYwcwB5zB1zPrJwz4vxwGlOBHt01T
 xLBReFgSKyJlgOWLXKfPx4bXUdivOBKm203wY0dh+/dC/VROGYoiXYTmSDsfsuKa
 8OXT1kWgzRVLtqwqj5GSkgWvtFZ28CjKh6d9egjqcj9tpbh2UupQDZzMyOtZ52X6
 Nz/Vo3u4T7qjzlhHOlCwHCDw+97x0yvmvLY1mWweGPfKOnxtXjkzQmTQEpyzU5Mo
 EwcqJucrBnmjbLAIBMrbR1mzTUQeD4dHz1jx+EzWE0lVnRL3twe1UaY40176sNlm
 aCBA4bIOQ242r3IJ++ss15ol1k5hu7PYKRn9Q8d2sSbQGcSnCHe/YOutQQ+FTEFG
 yE9xiLL+pgT8koauROnxg66E3HDM78NGTpjP3EuG4r2Qwa1iFANPfDB6kikuv8bO
 rG3qUJcloEPvfatZY+h5QC4UCoB0/W1DAhlfzE3tPBYPmhSEgQDfEOzXTKDakeF0
 VB903bYrOL3CVOun4I7fLrDc1leVeiAUKqO2orZs3qIpRWvAKyV/VjolAusMv2+F
 /4xPyh95AEMTFfmZogOCofQFk3eOnkWpLdrVTYCKy3i6NVBoy2wHldrl+LuCAN/m
 r/DNRBmazShashbeU6wg
 =8+cX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 - Fixes for mlx5 related issues
 - Fixes for ipoib multicast handling

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/ipoib: increase the max mcast backlog queue
  IB/ipoib: Make sendonly multicast joins create the mcast group
  IB/ipoib: Expire sendonly multicast joins
  IB/mlx5: Remove pa_lkey usages
  IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY
  IB/iser: Add module parameter for always register memory
  xprtrdma: Replace global lkey with lkey local to PD
2015-10-01 16:38:52 -04:00
Rafael J. Wysocki dd953d318d Merge branches 'pm-pci' and 'acpi-pci'
* pm-pci:
  PCI / PM: Update runtime PM documentation for PCI devices

* acpi-pci:
  ACPI / PCI: Remove duplicated penalty on SCI IRQ
  ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ
2015-10-01 22:30:12 +02:00
David S. Miller 4bf1b54f9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following pull request contains Netfilter/IPVS updates for net-next
containing 90 patches from Eric Biederman.

The main goal of this batch is to avoid recurrent lookups for the netns
pointer, that happens over and over again in our Netfilter/IPVS code. The idea
consists of passing netns pointer from the hook state to the relevant functions
and objects where this may be needed.

You can find more information on the IPVS updates from Simon Horman's commit
merge message:

c3456026ad ("Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next").

Exceptionally, this time, I'm not posting the patches again on netdev, Eric
already Cc'ed this mailing list in the original submission. If you need me to
make, just let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:46:21 -07:00
David Ahern 21fdd092ac net: Add support for filtering neigh dump by master device
Add support for filtering neighbor dumps by master device by adding
the NDA_MASTER attribute to the dump request. A new netlink flag,
NLM_F_DUMP_FILTERED, is added to indicate the kernel supports the
request and output is filtered as requested.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:33:54 -07:00
Vivien Didelot 44bbcf5c4a net: switchdev: extract struct switchdev_obj_*
Now that switchdev and its drivers directly use specific switchdev_obj_*
structures, move them out of the switchdev_obj union and get rif of this
outer structure.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:32:00 -07:00
Vivien Didelot ab06900230 net: switchdev: abstract object in add/del ops
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev add and del operations to:

    int switchdev_port_obj_add/del(struct net_device *dev,
                                   enum switchdev_obj_id id, void *obj);

This allows the caller to pass a specific switchdev_obj_* structure
instead of the generic switchdev_obj one.

Drivers implementation of these operations and switchdev have been
changed accordingly.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:31:59 -07:00
Vivien Didelot 25f07adc47 net: switchdev: pass callback to dump operation
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev dump operation to:

    int switchdev_port_obj_dump(struct net_device *dev,
                                enum switchdev_obj_id id, void *obj,
                                int (*cb)(void *obj));

This allows the caller to pass and expect back a specific
switchdev_obj_* structure instead of the generic switchdev_obj one.

Drivers implementation of dump operation can now expect this specific
structure and call the callback with it. Drivers have been changed
accordingly.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:31:59 -07:00
Vivien Didelot 03d5fb1862 net: switchdev: remove dev from switchdev_obj cb
The net_device associated to a dump operation does not have to be passed
to the callback. switchdev stores it in a superset struct, if needed.

Also some drivers (such as DSA drivers) may not have easy access to it.

This will simplify pushing the callback function down to the drivers.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:31:59 -07:00
David Ahern 9478d12d33 net: Move netif_index_is_l3_master to l3mdev.h
Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:34 -07:00
David Ahern ec539514e5 net: Remove vrf header file
Move remaining structs to VRF driver and delete the vrf header file.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 93a7e7e837 net: Remove the now unused vrf_ptr
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 8e1ed7058b net: Replace calls to vrf_dev_get_rth
Replace calls to vrf_dev_get_rth with l3mdev_get_rtable.
The check on the flow flags is handled in the l3mdev operation.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 3236b0042b net: Replace vrf_dev_table and friends
Replace calls to vrf_dev_table and friends with l3mdev_fib_table
and kin.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 385add906b net: Replace vrf_master_ifindex{, _rcu} with l3mdev equivalents
Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.

The pattern:
    oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
    oif = l3mdev_fib_oif(dev);

And remove the now unused vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 1b69c6d0ae net: Introduce L3 Master device abstraction
L3 master devices allow users of the abstraction to influence FIB lookups
for enslaved devices. Current API provides a means for the master device
to return a specific FIB table for an enslaved device, to return an
rtable/custom dst and influence the OIF used for fib lookups.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:32 -07:00