Linux IPv6 forwards unicast packets, which are link layer multicasts...
The hole was present since day one. I was 100% this check is there, but it is not.
The problem shows itself, f.e. when Microsoft Network Load Balancer runs on a network.
This software resolves IPv6 unicast addresses to multicast MAC addresses.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
ceph messenger code does a rather complex dancing around multithread
workqueue to make sure the same work item isn't executed concurrently
on different CPUs. This restriction can be provided by workqueue with
WQ_NON_REENTRANT.
Make ceph_msgr_wq non-reentrant workqueue with the default concurrency
level and remove the QUEUED/BUSY logic.
* This removes backoff handling in con_work() but it couldn't reliably
block execution of con_work() to begin with - queue_con() can be
called after the work started but before BUSY is set. It seems that
it was an optimization for a rather cold path and can be safely
removed.
* The number of concurrent work items is bound by the number of
connections and connetions are independent from each other. With
the default concurrency level, different connections will be
executed independently.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.
Signed-off-by: Sage Weil <sage@newdream.net>
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.
This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.
Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit fe10ae5338 adds a memset() to clear
the structure being sent back to userspace, but accidentally used the
wrong size.
Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (67 commits)
cxgb4vf: recover from failure in cxgb4vf_open()
netfilter: ebtables: make broute table work again
netfilter: fix race in conntrack between dump_table and destroy
ah: reload pointers to skb data after calling skb_cow_data()
ah: update maximum truncated ICV length
xfrm: check trunc_len in XFRMA_ALG_AUTH_TRUNC
ehea: Increase the skb array usage
net/fec: remove config FEC2 as it's used nowhere
pcnet_cs: add new_id
tcp: disallow bind() to reuse addr/port
net/r8169: Update the function of parsing firmware
net: ppp: use {get,put}_unaligned_be{16,32}
CAIF: Fix IPv6 support in receive path for GPRS/3G
arp: allow to invalidate specific ARP entries
net_sched: factorize qdisc stats handling
mlx4: Call alloc_etherdev to allocate RX and TX queues
net: Add alloc_netdev_mqs function
caif: don't set connection request param size before copying data
cxgb4vf: fix mailbox data/control coherency domain race
qlcnic: change module parameter permissions
...
* 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (89 commits)
NFS fix the setting of exchange id flag
NFS: Don't use vm_map_ram() in readdir
NFSv4: Ensure continued open and lockowner name uniqueness
NFS: Move cl_delegations to the nfs_server struct
NFS: Introduce nfs_detach_delegations()
NFS: Move cl_state_owners and related fields to the nfs_server struct
NFS: Allow walking nfs_client.cl_superblocks list outside client.c
pnfs: layout roc code
pnfs: update nfs4_callback_recallany to handle layouts
pnfs: add CB_LAYOUTRECALL handling
pnfs: CB_LAYOUTRECALL xdr code
pnfs: change lo refcounting to atomic_t
pnfs: check that partial LAYOUTGET return is ignored
pnfs: add layout to client list before sending rpc
pnfs: serialize LAYOUTGET(openstateid)
pnfs: layoutget rpc code cleanup
pnfs: change how lsegs are removed from layout list
pnfs: change layout state seqlock to a spinlock
pnfs: add prefix to struct pnfs_layout_hdr fields
pnfs: add prefix to struct pnfs_layout_segment fields
...
The netlink interface to dump the connection tracking table has a race
when entries are deleted at the same time. A customer reported a crash
and the backtrace showed thatctnetlink_dump_table was running while a
conntrack entry was being destroyed.
(see https://bugzilla.vyatta.com/show_bug.cgi?id=6402).
According to RCU documentation, when using hlist_nulls the reader
must handle the case of seeing a deleted entry and not proceed
further down the linked list. The old code would continue
which caused the scan to walk into the free list.
This patch uses locking (rather than RCU) for this operation which
is guaranteed safe, and no longer requires getting reference while
doing dump operation.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb_cow_data() may allocate a new data buffer, so pointers on
skb should be set after this function.
Bug was introduced by commit dff3bb06 ("ah4: convert to ahash")
and 8631e9bd ("ah6: convert to ahash").
Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com>
Acked-by: Krzysztof Witek <krzysztof.witek@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maximum trunc length is defined by MAX_AH_AUTH_LEN (in bytes)
and need to be checked when this value is set (in bits) by
the user. In ah4.c and ah6.c a BUG_ON() checks this condiftion.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_bind_conflict() logic currently disallows a bind() if
it finds a friend socket (a socket bound on same address/port)
satisfying a set of conditions :
1) Current (to be bound) socket doesnt have sk_reuse set
OR
2) other socket doesnt have sk_reuse set
OR
3) other socket is in LISTEN state
We should add the CLOSE state in the 3) condition, in order to avoid two
REUSEADDR sockets in CLOSE state with same local address/port, since
this can deny further operations.
Note : a prior patch tried to address the problem in a different (and
buggy) way. (commit fda48a0d7a tcp: bind() fix when many ports
are bound).
Reported-by: Gaspar Chilingarov <gasparch@gmail.com>
Reported-by: Daniel Baluta <daniel.baluta@gmail.com>
Tested-by: Daniel Baluta <daniel.baluta@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to reuse the xprt associated with a server connection if
one has already been set up.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Multiple backchannels can share the same tcp connection; from rfc 5661 section
2.10.3.1:
A connection's association with a session is not exclusive. A
connection associated with the channel(s) of one session may be
simultaneously associated with the channel(s) of other sessions
including sessions associated with other client IDs.
However, multiple backchannels share a connection, they must all share
the same xid stream (hence the same rpc_xprt); the only way we have to
match replies with calls at the rpc layer is using the xid.
So, keep the rpc_xprt around as long as the connection lasts, in case
we're asked to use the connection as a backchannel again.
Requests to create new backchannel clients over a given server
connection should results in creating new clients that reuse the
existing rpc_xprt.
But to start, just reject attempts to associate multiple rpc_xprt's with
the same underlying bc_xprt.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This seems obviously transport-level information even if it's currently
used only by the server socket code.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use proper data types for storing the count of the binary blob and
length of a string. Without this patch length calculation of string will
always result in -1 because of comparision between signed and unsigned
integer.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Checks version field of IP in the receive path for GPRS/3G data
and appropriately sets the value of skb->protocol.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 over firewire needs to be able to remove ARP entries
from the ARP cache that belong to nodes that are removed, because
IPv4 over firewire uses ARP packets for private information
about nodes.
This information becomes invalid as soon as node drops
off the bus and when it reconnects, its only possible
to start talking to it after it responded to an ARP packet.
But ARP cache prevents such packets from being sent.
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.
They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets
Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.
Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added alloc_netdev_mqs function which allows the number of transmit and
receive queues to be specified independenty. alloc_netdev_mq was
changed to a macro to call the new function. Also added
alloc_etherdev_mqs with same purpose.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The size field should not be set until after the data is successfully
copied in.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Rosenberg pointed out that there were some signed comparison bugs
in the phonet protocol.
http://marc.info/?l=full-disclosure&m=129424528425330&w=2
The problem is that we check for array overflows but "protocol" is
signed and we don't check for array underflows. If you have already
have CAP_SYS_ADMIN then you could use the bugs to get root, or someone
could cause an oops by mistake.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vm_map_ram() is not available on NOMMU platforms, and causes trouble
on incoherrent architectures such as ARM when we access the page data
through both the direct and the virtual mapping.
The alternative is to use the direct mapping to access page data
for the case when we are not crossing a page boundary, but to copy
the data into a linear scratch buffer when we are accessing data
that spans page boundaries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: stable@kernel.org [2.6.37]
Using "iptables -L" with a lot of rules have a too big BH latency.
Jesper mentioned ~6 ms and worried of frame drops.
Switch to a per_cpu seqlock scheme, so that taking a snapshot of
counters doesnt need to block BH (for this cpu, but also other cpus).
This adds two increments on seqlock sequence per ipt_do_table() call,
its a reasonable cost for allowing "iptables -L" not block BH
processing.
Reported-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In order to compute the features for other offloads (primarily
scatter/gather), we need to first check the ability of the NIC to
offload the checksum for the packet. Since we have already computed
this, we can directly use the result instead of figuring it out
again.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This switches skb_need_linearize() to use the features that have
been centrally computed. In doing so, this fixes a problem where
scatter/gather should not be used because the card does not support
checksum offloading on that type of packet. On device registration
we only check that some form of checksum offloading is available if
scatter/gatther is enabled but we must also check at transmission
time. Examples of this include IPv6 or vlan packets on a NIC that
only supports IPv4 offloading.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This switches dev_gso_segment() to use the device features computed
by the centralized routine. In doing so, it fixes a problem where
it would always use dev->features, instead of those appropriate
to the number of vlan tags if any are present.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is a single function that can compute the device
features relevant to a packet, we don't want to run it for each
offload. This converts netif_needs_gso() to take the features
of the device, rather than computing them itself.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_get_vlan_features() is currently only used by netif_needs_gso(),
so it only concerns itself with GSO features. However, several other
places also should take into account the contents of the packet when
deciding whether to offload to hardware. This generalizes the function
to return features about all of the various forms of offloading. Since
offloads tend to be linked together, this avoids duplicating the logic
in each location (i.e. the scatter/gather code also needs the checksum
logic).
Suggested-by: Michał Mirosław <mirqus@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently only have software fallback for one type of checksum: the
TCP/UDP one's complement. This means that a protocol that uses hardware
offloading for a different type of checksum (FCoE, SCTP) must directly
check the device's features and do the right thing ahead of time. By
the time we get to dev_can_checksum(), we're only deciding whether to
apply the one algorithm in software or hardware. NETIF_F_HW_CSUM has the
same capabilities as the software version, so we should always use it if
present. The primary advantage of this is multiply tagged vlans can use
hardware checksumming.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix new kernel-doc notation warning in net/core/filter.c:
Warning(net/core/filter.c:172): No description found for parameter 'fentry'
Warning(net/core/filter.c:172): Excess function parameter 'filter' description in 'sk_run_filter'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits
being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL,
non-dump requests with NLM_F_EXCL set are mistaken as dump requests.
Substitute the condition to test for _all_ bits being set.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...
Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
The 'seq_window' sysctl sets the initial value for the DCCP Sequence Window,
which may range from 32..2^46-1 (RFC 4340, 7.5.2). The patch sets the upper
bound consistently to 2^32-1 on both 32 and 64 bit systems, which should be
sufficient - with a RTT of 1sec and 1-byte packets, a seq_window of 2^32-1
corresponds to a link speed of 34 Gbps.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Currently dccp_check_seqno allows any valid packet to update the Greatest
Sequence Number Received, even if that packet's sequence number is less than
the current GSR. This patch adds a check to make sure that the new packet's
sequence number is greater than GSR.
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Currently dccp_check_seqno returns 0 (indicating a valid packet) if the
acknowledgment number is out of bounds and the sync that RFC 4340 mandates at
this point is currently being rate-limited. This function should return -1,
indicating an invalid packet.
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Differentiate from server backchannel
Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sessions based callback service is started prior to the CREATE_SESSION call
so that it can handle CB_NULL requests which can be sent before the
CREATE_SESSION call returns and the session ID is known.
Set the callback sessionid after a sucessful CREATE_SESSION.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the current sock create and destroy routines into the new transport ops.
Back channel socket will be destroyed by the svc_closs_all call in svc_destroy.
Added check: only TCP supported on shared back channel.
Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4.1 shared back channel does not need to call svc_drop because the
callback service never outlives the single connection it services, and it
reuses it's buffers and keeps the trasport.
Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since nf_bridge_maybe_copy_header() may change the length of skb,
we should check the length of skb after it to handle the ppoe skbs.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 1ae4de0cdf, the secctx was exported
via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
instead of the secmark.
That patch introduced the use of security_secid_to_secctx() which may
return a non-zero value on error.
In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
security modules. Thus, security_secid_to_secctx() returns a negative
value that results in the breakage of the /proc and `conntrack -L'
outputs. To fix this, we skip the inclusion of secctx if the
aforementioned function fails.
This patch also fixes the dynamic netlink message size calculation
if security_secid_to_secctx() returns an error, since its logic is
also wrong.
This problem exists in Linux kernel >= 2.6.37.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
locked, nf_ct_expect_hash_rnd should be initialized in the atomic way.
In this patch, we use nf_conntrack_hash_rnd instead of
nf_ct_expect_hash_rnd.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC3168 (The Addition of Explicit Congestion Notification to IP)
states :
5.3. Fragmentation
ECN-capable packets MAY have the DF (Don't Fragment) bit set.
Reassembly of a fragmented packet MUST NOT lose indications of
congestion. In other words, if any fragment of an IP packet to be
reassembled has the CE codepoint set, then one of two actions MUST be
taken:
* Set the CE codepoint on the reassembled packet. However, this
MUST NOT occur if any of the other fragments contributing to
this reassembly carries the Not-ECT codepoint.
* The packet is dropped, instead of being reassembled, for any
other reason.
This patch implements this requirement for IPv4, choosing the first
action :
If one fragment had NO-ECT codepoint
reassembled frame has NO-ECT
ElIf one fragment had CE codepoint
reassembled frame has CE
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original code has a use after free bug because it's not using the
_safe() version of the list_for_each_entry() macro.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a "goto nla_put_failure" hidden inside the NLA_PUT() macro, but
we're holding the dcb_lock so we need to unlock first.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Leonardo Chiquitto found poll() could block forever on tcp sockets and
Urgent data was received, if the event flag only contains POLLPRI.
He did a bisection and found commit 4938d7e023 (poll: avoid extra
wakeups in select/poll) was the source of the problem.
Problem is TCP sockets use standard sock_def_readable() function for
their sk_data_ready() handler, and sock_def_readable() doesnt signal
POLLPRI.
Only TCP is affected by the problem. Adding POLLPRI to the list of flags
might trigger unnecessary schedules, but URGENT handling is such a
seldom used feature this seems a good compromise.
Thanks a lot to Leonardo for providing the bisection result and a test
program as well.
Reference : http://www.spinics.net/lists/netdev/msg151793.html
Reported-and-bisected-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
unix_release() can asynchornously set socket->sk to NULL, and
it does so without holding the unix_state_lock() on "other"
during stream connects.
However, the reverse mapping, sk->sk_socket, is only transitioned
to NULL under the unix_state_lock().
Therefore make the security hooks follow the reverse mapping instead
of the forward mapping.
Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 57dbb2d83d (sched: add head drop fifo queue)
introduced pfifo_head_drop, and broke the invariant that
sch->bstats.bytes and sch->bstats.packets are COUNTER (increasing
counters only)
This can break estimators because est_timer() handles unsigned deltas
only. A decreasing counter can then give a huge unsigned delta.
My mid term suggestion would be to change things so that
sch->bstats.bytes and sch->bstats.packets are incremented in dequeue()
only, not at enqueue() time. We also could add drop_bytes/drop_packets
and provide estimations of drop rates.
It would be more sensible anyway for very low speeds, and big bursts.
Right now, if we drop packets, they still are accounted in byte/packets
abolute counters and rate estimators.
Before this mid term change, this patch makes pfifo_head_drop behavior
similar to other qdiscs in case of drops :
Dont decrement sch->bstats.bytes and sch->bstats.packets
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somehow this snuck into my earlier patch, and
only now did I see a compiler warning:
net/mac80211/led.c:218:13: warning: function '__ieee80211_create_tpt_led_trigger' with external linkage has definition
Remove the stray extern.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When the driver has remain-on-channel offload,
implement off-channel transmission using that
primitive.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This allows drivers to support remain-on-channel
offload if they implement smarter timing or need
to use a device implementation like iwlwifi.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Up to now /proc/interrupts only has statistics for external and i/o
interrupts but doesn't split up them any further.
This patch adds a line for every single interrupt source so that it
is possible to easier tell what the machine is/was doing.
Part of the output now looks like this;
CPU0 CPU2 CPU4
EXT: 3898 4232 2305
I/O: 782 315 245
CLK: 1029 1964 727 [EXT] Clock Comparator
IPI: 2868 2267 1577 [EXT] Signal Processor
TMR: 0 0 0 [EXT] CPU Timer
TAL: 0 0 0 [EXT] Timing Alert
PFL: 0 0 0 [EXT] Pseudo Page Fault
[...]
NMI: 0 1 1 [NMI] Machine Checks
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Supposes cache_check runs simultaneously with an update on a different
CPU:
cache_check task doing update
^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^
1. test for CACHE_VALID 1'. set entry->data
& !CACHE_NEGATIVE
2. use entry->data 2'. set CACHE_VALID
If the two memory writes performed in step 1' and 2' appear misordered
with respect to the reads in step 1 and 2, then the caller could get
stale data at step 2 even though it saw CACHE_VALID set on the cache
entry.
Add memory barriers to prevent this.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We attempt to turn a cache entry negative in place. But that entry may
already have been filled in by some other task since we last checked
whether it was valid, so we could be modifying an already-valid entry.
If nothing else there's a likely leak in such a case when the entry is
eventually put() and contents are not freed because it has
CACHE_NEGATIVE set.
So, take the cache_lock just as sunrpc_cache_update() does.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we use -EAGAIN returns to determine when to drop a deferred
request. On its own, that is error-prone, as it makes us treat -EAGAIN
returns from other functions specially to prevent inadvertent dropping.
So, use a flag on the request instead.
Returning an error on request deferral is still required, to prevent
further processing, but we no longer need worry that an error return on
its own could result in a drop.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit d29068c431 "sunrpc: Simplify cache_defer_req and related
functions." asserted that cache_check() could determine success or
failure of cache_defer_req() by checking the CACHE_PENDING bit.
This isn't quite right.
We need to know whether cache_defer_req() created a deferred request,
in which case sending an rpc reply has become the responsibility of the
deferred request, and it is important that we not send our own reply,
resulting in two different replies to the same request.
And the CACHE_PENDING bit doesn't tell us that; we could have
succesfully created a deferred request at the same time as another
thread cleared the CACHE_PENDING bit.
So, partially revert that commit, to ensure that cache_check() returns
-EAGAIN if and only if a deferred request has been created.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
[bfields@redhat.com: moved svcauth_unix_purge outside ifdef's.]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once a sunrpc cache entry is VALID, we should be replacing it (and
allowing any concurrent users to destroy it on last put) instead of
trying to update it in place.
Otherwise someone referencing the ip_map we're modifying here could try
to use the m_client just as we're putting the last reference.
The bug should only be seen by users of the legacy nfsd interfaces.
(Thanks to Neil for suggestion to use sunrpc_invalidate.)
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The key documentation is slightly out of date, fix
that. Also, the list entry in the key struct is no
longer used that way, so list_del_init() isn't
necessary any more there.
Finally, ieee80211_key_link() is no longer invoked
under RCU read lock, but rather with an appropriate
station lock held.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This reverts enables the reorder release timer once again.
The issues laid out in:
<http://www.spinics.net/lists/linux-wireless/msg57214.html>
Have been addressed by:
mac80211: serialize rx path workers
mac80211: ignore PSM bit of reordered frames
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch addresses the issue of serialization between
the main rx path and various reorder release timers.
<http://www.spinics.net/lists/linux-wireless/msg57214.html>
It converts the previously local "frames" queue into
a global rx queue [rx_skb_queue]. This way, everyone
(be it the main rx-path or some reorder release timeout)
can add frames to it.
Only one active rx handler worker [ieee80211_rx_handlers]
is needed. All other threads which have lost the race of
"runnning_rx_handler" can now simply "return", knowing that
the thread who had the "edge" will also take care of their
workload.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fixes the misplaced article in the following:
"cfg80211: Updating information on frequency 5785 MHz for
20 a MHz width channel with regulatory rule:"
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fixed a bug where if a mesh interface has a different MAC address from its bridge
interface, then it would not be able to send data traffic to any other mesh node.
This also adds support for communication between mesh nodes and external bridged
nodes by using a 6 address format if the source is a node within the mesh and the
destination is an external node proxied by a mesh portal.
Signed-off-by: Joel A Fernandes <agnel.joel@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch tackles one of the problems of my
reorder release timer patch from August.
<http://www.spinics.net/lists/linux-wireless/msg57214.html>
=>
What if the reorder release triggers and ap_sta_ps_end
(called by ieee80211_rx_h_sta_process) accidentally clears
the WLAN_STA_PS_STA flag, because 100ms ago - when the STA
was still active - frames were put into the reorder buffer.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The preferred source address is currently ignored for local routes,
which results in all local connections having a src address that is the
same as the local dst address. Fix this by respecting the preferred source
address when it is provided for local routes.
This bug can be demonstrated as follows:
# ifconfig dummy0 192.168.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 192.168.0.1
# ip route change table local local 192.168.0.1 dev dummy0 \
proto kernel scope host src 127.0.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 127.0.0.1
We now establish a local connection and verify the source IP
address selection:
# nc -l 192.168.0.1 3128 &
# nc 192.168.0.1 3128 &
# netstat -ant | grep 192.168.0.1:3128.*EST
tcp 0 0 192.168.0.1:3128 192.168.0.1:33228 ESTABLISHED
tcp 0 0 192.168.0.1:33228 192.168.0.1:3128 ESTABLISHED
Signed-off-by: Joel Sing <jsing@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ad0e2b5a00
Author: Johannes Berg <johannes.berg@intel.com>
Date: Tue Jun 1 10:19:19 2010 +0200
mac80211: simplify key locking
removed the synchronization against RCU and thus
opened a race window where we can use a key for
TX while it is already freed. Put a synchronisation
into the right place to close that window.
Reported-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Cc: stable@kernel.org [2.6.36+]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Commit b51aff057c said:
Under memory pressure, the mac80211 mesh code
may helpfully print a message that it failed
to clone a mesh frame and then will proceed
to crash trying to use it anyway. Fix that.
Avoid the reference whenever the frame copy is unsuccessful
regardless of the debug message being suppressed or printed.
Cc: stable@kernel.org [2.6.27+]
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
net/bridge//br_stp_if.c:148:66: warning: conversion of
net/bridge//br_stp_if.c:148:66: int to
net/bridge//br_stp_if.c:148:66: int enum umh_wait
net/bridge//netfilter/ebtables.c:1150:30: warning: Using plain integer as NULL pointer
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide child qdisc backlog (byte count) information so that "tc -s
qdisc" can report it to user.
packet count is already correctly provided.
qdisc red 11: parent 1:11 limit 60Kb min 15Kb max 45Kb ecn
Sent 3116427684 bytes 1415782 pkt (dropped 8, overlimits 7866 requeues 0)
rate 242385Kbit 13630pps backlog 13560b 8p requeues 0
marked 7865 early 1 pdrop 7 other 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More accurate return values for the following (new) dcbnl routines:
dcbnl_getdcbx()
dcbnl_setdcbx()
dcbnl_getfeatcfg()
dcbnl_setfeatcfg()
Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bf9ae5386b
(llc: use dev_hard_header) removed the
skb_reset_mac_header call from llc_mac_hdr_init.
This seems fine itself, but br_send_bpdu() invokes ebtables LOCAL_OUT.
We oops in ebt_basic_match() because it assumes eth_hdr(skb) returns
a meaningful result.
Cc: acme@ghostprotocols.net
References: https://bugzilla.kernel.org/show_bug.cgi?id=24532
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tipc/dbg.h file was recently renamed to tipc/log.h,
but the re-include define was not updated accordingly.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleans up TIPC's source code to eliminate the presence of unnecessary
use of {} around single statements.
These changes are purely cosmetic and do not alter the operation of TIPC
in any way.
Signed-off-by: Allan Stephens <Allan.Stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleans up TIPC's source code to eliminate the needless initialization
of static variables to zero.
These changes are purely cosmetic and do not alter the operation of TIPC
in any way.
Signed-off-by: Allan Stephens <Allan.Stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleans up TIPC's source code to eliminate assigning values to variables
within conditional expressions, improving code readability and reducing
warnings from various code checker tools.
These changes are purely cosmetic and do not alter the operation of TIPC
in any way.
Signed-off-by: Allan Stephens <Allan.Stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleans up TIPC's source code to eliminate deviations from generally
accepted coding conventions relating to leading/trailing white space
and white space around commas, braces, cases, and sizeof.
These changes are purely cosmetic and do not alter the operation of TIPC
in any way.
Signed-off-by: Allan Stephens <Allan.Stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing code for the copy to user and error handling at the
end of getsockopt isn't easy to follow, due to the excessive use
of if/else. By simply using return where appropriate, it can be
made smaller and easier to follow at the same time.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>