This drops the cmd completion list spin lock and makes the cmd
completion queue lock-less.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Pull networking fixes from David Miller:
1) Revert iwlwifi reclaimed packet tracking, it causes problems for a
bunch of folks. From Emmanuel Grumbach.
2) Work limiting code in brcmsmac wifi driver can clear tx status
without processing the event. From Arend van Spriel.
3) rtlwifi USB driver processes wrong SKB, fix from Larry Finger.
4) l2tp tunnel delete can race with close, fix from Tom Parkin.
5) pktgen_add_device() failures are not checked at all, fix from Cong
Wang.
6) Fix unintentional removal of carrier off from tun_detach(),
otherwise we confuse userspace, from Michael S. Tsirkin.
7) Don't leak socket reference counts and ubufs in vhost-net driver,
from Jason Wang.
8) vmxnet3 driver gets it's initial carrier state wrong, fix from Neil
Horman.
9) Protect against USB networking devices which spam the host with 0
length frames, from Bjørn Mork.
10) Prevent neighbour overflows in ipv6 for locally destined routes,
from Marcelo Ricardo. This is the best short-term fix for this, a
longer term fix has been implemented in net-next.
11) L2TP uses ipv4 datagram routines in it's ipv6 code, whoops. This
mistake is largely because the ipv6 functions don't even have some
kind of prefix in their names to suggest they are ipv6 specific.
From Tom Parkin.
12) Check SYN packet drops properly in tcp_rcv_fastopen_synack(), from
Yuchung Cheng.
13) Fix races and TX skb freeing bugs in via-rhine's NAPI support, from
Francois Romieu and your's truly.
14) Fix infinite loops and divides by zero in TCP congestion window
handling, from Eric Dumazet, Neal Cardwell, and Ilpo Järvinen.
15) AF_PACKET tx ring handling can leak kernel memory to userspace, fix
from Phil Sutter.
16) Fix error handling in ipv6 GRE tunnel transmit, from Tommi Rantala.
17) Protect XEN netback driver against hostile frontend putting garbage
into the rings, don't leak pages in TX GOP checking, and add proper
resource releasing in error path of xen_netbk_get_requests(). From
Ian Campbell.
18) SCTP authentication keys should be cleared out and released with
kzfree(), from Daniel Borkmann.
19) L2TP is a bit too clever trying to maintain skb->truesize, and ends
up corrupting socket memory accounting to the point where packet
sending is halted indefinitely. Just remove the adjustments
entirely, they aren't really needed. From Eric Dumazet.
20) ATM Iphase driver uses a data type with the same name as the S390
headers, rename to fix the build. From Heiko Carstens.
21) Fix a typo in copying the inner network header offset from one SKB
to another, from Pravin B Shelar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
net: sctp: sctp_endpoint_free: zero out secret key data
net: sctp: sctp_setsockopt_auth_key: use kzfree instead of kfree
atm/iphase: rename fregt_t -> ffreg_t
net: usb: fix regression from FLAG_NOARP code
l2tp: dont play with skb->truesize
net: sctp: sctp_auth_key_put: use kzfree instead of kfree
netback: correct netbk_tx_err to handle wrap around.
xen/netback: free already allocated memory on failure in xen_netbk_get_requests
xen/netback: don't leak pages on failure in xen_netbk_tx_check_gop.
xen/netback: shutdown the ring if it contains garbage.
net: qmi_wwan: add more Huawei devices, including E320
net: cdc_ncm: add another Huawei vendor specific device
ipv6/ip6_gre: fix error case handling in ip6gre_tunnel_xmit()
tcp: fix for zero packets_in_flight was too broad
brcmsmac: rework of mac80211 .flush() callback operation
ssb: unregister gpios before unloading ssb
bcma: unregister gpios before unloading bcma
rtlwifi: Fix scheduling while atomic bug
net: usbnet: fix tx_dropped statistics
tcp: ipv6: Update MIB counters for drops
...
It's OK to get kick before backend is set or after
it is cleared, we can just ignore it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Currently, the polling errors were ignored, which can lead following issues:
- vhost remove itself unconditionally from waitqueue when stopping the poll,
this may crash the kernel since the previous attempt of starting may fail to
add itself to the waitqueue
- userspace may think the backend were successfully set even when the polling
failed.
Solve this by:
- check poll->wqh before trying to remove from waitqueue
- report polling errors in vhost_poll_start(), tx_poll_start(), the return value
will be checked and returned when userspace want to set the backend
After this fix, there still could be a polling failure after backend is set, it
will addressed by the next patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when vhost_init_used() fails the sock refcnt and ubufs were
leaked. Correct this by calling vhost_init_used() before assign ubufs and
restore the oldsock when it fails.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull target updates from Nicholas Bellinger:
"It has been a very busy development cycle this time around in target
land, with the highlights including:
- Kill struct se_subsystem_dev, in favor of direct se_device usage
(hch)
- Simplify reservations code by combining SPC-3 + SCSI-2 support for
virtual backends only (hch)
- Simplify ALUA code for virtual only backends, and remove left over
abstractions (hch)
- Pass sense_reason_t as return value for I/O submission path (hch)
- Refactor MODE_SENSE emulation to allow for easier addition of new
mode pages. (roland)
- Add emulation of MODE_SELECT (roland)
- Fix bug in handling of ExpStatSN wrap-around (steve)
- Fix bug in TMR ABORT_TASK lookup in qla2xxx target (steve)
- Add WRITE_SAME w/ UNMAP=0 support for IBLOCK backends (nab)
- Convert ib_srpt to use modern target_submit_cmd caller + drop
legacy ioctx->kref usage (nab)
- Convert ib_srpt to use modern target_submit_tmr caller (nab)
- Add link_magic for fabric allow_link destination target_items for
symlinks within target_core_fabric_configfs.c code (nab)
- Allocate pointers in instead of full structs for
config_group->default_groups (sebastian)
- Fix 32-bit highmem breakage for FILEIO (sebastian)
All told, hch was able to shave off another ~1K LOC by killing the
se_subsystem_dev abstraction, along with a number of PR + ALUA
simplifications. Also, a nice patch by Roland is the refactoring of
MODE_SENSE handling, along with the addition of initial MODE_SELECT
emulation support for virtual backends.
Sebastian found a long-standing issue wrt to allocation of full
config_group instead of pointers for config_group->default_group[]
setup in a number of areas, which ends up saving memory with big
configurations. He also managed to fix another long-standing BUG wrt
to broken 32-bit highmem support within the FILEIO backend driver.
Thank you again to everyone who contributed this round!"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits)
target/iscsi_target: Add NodeACL tags for initiator group support
target/tcm_fc: fix the lockdep warning due to inconsistent lock state
sbp-target: fix error path in sbp_make_tpg()
sbp-target: use simple assignment in tgt_agent_rw_agent_state()
iscsi-target: use kstrdup() for iscsi_param
target/file: merge fd_do_readv() and fd_do_writev()
target/file: Fix 32-bit highmem breakage for SGL -> iovec mapping
target: Add link_magic for fabric allow_link destination target_items
ib_srpt: Convert TMR path to target_submit_tmr
ib_srpt: Convert I/O path to target_submit_cmd + drop legacy ioctx->kref
target: Make spc_get_write_same_sectors return sector_t
target/configfs: use kmalloc() instead of kzalloc() for default groups
target/configfs: allocate only 6 slots for dev_cg->default_groups
target/configfs: allocate pointers instead of full struct for default_groups
target: update error handling for sbc_setup_write_same()
iscsit: use GFP_ATOMIC under spin lock
iscsi_target: Remove redundant null check before kfree
target/iblock: Forward declare bio helpers
target: Clean up flow in transport_check_aborted_status()
target: Clean up logic in transport_put_cmd()
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
The variable se_sess is initialized but never used
otherwise, so remove the unused variable.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zero copy TX has been around for a while now.
We seem to be down to eliminating theoretical bugs
and performance tuning at this point:
it's probably time to enable it by default so that
most users get the benefit.
Keep the flag around meanwhile so users can experiment
with disabling this if they experience regressions.
I expect that we will remove it in the future.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
For short packets zerocopy mode adds overhead
of managing heads which isn't necessary: we
could simly update used ring directly
same as with zerocopy disabled.
Things seem to run a bit faster if we detect
and bypass head management when zcopy isn't used.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When memory map changes, we need to flush outstanding
DMAs as they might in theory reference old memory addresses.
To do this simply stop initiating new DMAs
and wait for ubufs ref count to drop to 0.
Afterwards reset the count back to 1.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vring changes already do a flush internally where appropriate, so we do
not need a second flush.
It's currently not very expensive but a follow-up patch makes flush more
heavy-weight, so remove the extra flush here to avoid regressing
performance if call or kick fds are changed on data path.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
These packet counters are used to drive the zercopy
selection heuristic so nothing too bad happens if they are off a bit -
and they are also reset once in a while.
But it's cleaner to clear them when backend is set so that
we start in a known state.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a single descriptor crosses a region, the
second chunk length should be decremented
by size translated so far, instead it includes
the full descriptor length.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
linux/vhost.h was included twice.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the sense reason as an explicit return value from the I/O submission
path instead of storing it in struct se_cmd and using negative return
values. This cleans up a lot of the code pathes, and with the sparse
annotations for the new sense_reason_t type allows for much better
error checking.
(nab: Convert spc_emulate_modesense + spc_emulate_modeselect to use
sense_reason_t with Roland's MODE SELECT changes)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
It seems that to avoid deadlocks it is enough to poll vq before
we are going to use the last buffer. This is faster than
c70aa540c7.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even when vhost-net is in zero-copy transmit mode,
net core might still decide to copy the skb later
which is somewhat slower than a copy in user
context: data copy overhead is added to the cost of
page pin/unpin. The result is that enabling tx zero copy
option leads to higher CPU utilization for guest to guest
and guest to host traffic.
To fix this, suppress zero copy tx after a given number of
packets triggered late data copy. Re-enable periodically
to detect workload changes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used to disable zerocopy when error rate
is high.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We copy head count to a 16 bit field, this works by chance on LE but on
BE guest gets 0. Fix it up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Alexander Graf <agraf@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scsi target updates from Nicholas Bellinger:
"Things have been calm for the most part with no new fabric drivers in
flight for v3.7 (we're up to eight now !), so this update is primarily
focused on addressing a few long-standing items within target-core and
iscsi-target fabric code.
The highlights include:
- target: Simplify fabric sense data length handling (roland)
- qla2xxx: Fix endianness of task management response code (roland)
- target: fix truncation of mode data, support zero allocation length
(paolo)
- target: Properly support zero-length commands in normal processing
path (paolo)
- iscsi-target: Correctly set 0xffffffff field within ISCSI_OP_REJECT
PDU (ronnie + nab)
- iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG
demo-mode (ronnie + nab)
- target/file: Re-enable optional fd_buffered_io=1 operation (nab +
hch)
- iscsi-target: Add MaxXmitDataSegmenthLength forr target ->
initiator MDRSL declaration (nab)
- target: Add target_submit_cmd_map_sgls for SGL fabric memory
passthrough (nab + hch)
- tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls (hch +
nab)
- tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls (nab
+ hch)
The last series for adding a new target_submit_cmd_map_sgls() fabric
caller (as requested by hch) that accepts pre-allocated SGL memory
(using existing logic), along with converting tcm_loop + tcm_vhost has
only been in -next for the last days, but has gotten enough review
+testing and is clear enough a mechanical change that I think it's
reasonable to merge for -rc1 code.
Thanks again to everyone who contributed this round! Extra special
thanks to Roland (PureStorage) for tracking down the qla2xxx target
TMR response code endian issue, and to Paolo (Redhat) for resolving
the long standing zero-length CDB issues within target-core between
virtual and pSCSI backends."
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (44 commits)
iscsi-target: Bump defaults for nopin_timeout + nopin_response_timeout values
iscsit: proper endianess conversions
iscsit: use the itt_t abstract type
iscsit: add missing endianess conversion in iscsit_check_inaddr_any
iscsit: remove incorrect unlock in iscsit_build_sendtargets_resp
iscsit: mark various functions static
target/iscsi: precedence bug in iscsit_set_dataout_sequence_values()
target/usb-gadget: strlen() doesn't count the terminator
target/usb-gadget: remove duplicate initialization
tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls
target: Add control CDB READ payload zero work-around
tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls
target: Add target_submit_cmd_map_sgls for SGL fabric memory passthrough
iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG demo-mode
iscsi-target: Change iscsi_target_seq_pdu_list.c to honor MaxXmitDataSegmentLength
iscsi-target: Add MaxXmitDataSegmentLength connection recovery check
iscsi-target: Convert incoming PDU payload checks to MaxXmitDataSegmentLength
iscsi-target: Enable MaxXmitDataSegmentLength operation in login path
iscsi-target: Add base MaxXmitDataSegmentLength code
target/file: Re-enable optional fd_buffered_io=1 operation
...
This patch converts tcm_vhost to use target_submit_cmd_map_sgls() for
I/O submission and mapping of pre-allocated SGL memory from incoming
virtio-scsi SGL memory -> se_cmd descriptors.
This includes removing the original open-coded fabric uses of target
core callers to support transport_generic_map_mem_to_cmd() between
target_setup_cmd_from_cdb() and transport_handle_cdb_direct() logic.
It also includes adding a handful of new tcm_vhost_cmnd member +
assignments in vhost_scsi_allocate_cmd() used from cmwq process
context I/O submission within tcm_vhost_submission_work()
(v2: Use renamed target_submit_cmd_map_sgls)
Reported-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Every fabric driver has to supply a se_tfo->set_fabric_sense_len()
method, just so iSCSI can return an offset of 2. However, every fabric
driver is already allocating a sense buffer and passing it into the
target core, either via transport_init_se_cmd() or target_submit_cmd().
So instead of having iSCSI pass the start of its sense buffer into the
core and then later tell the core to skip the first 2 bytes, it seems
easier for iSCSI just to do the offset of 2 when it passes the sense
buffer into the core. Then we can drop the se_tfo->set_fabric_sense_len()
everywhere, and just add a couple of lines of code to iSCSI to set the
sense data length to the beginning of the buffer right before it sends
it over the network.
(nab: Remove .set_fabric_sense_len usage from tcm_qla2xxx_npiv_ops +
change transport_get_sense_buffer to follow v3.6-rc6 code w/o
->set_fabric_sense_len usage)
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
There are no callers of se_tfo->get_fabric_sense_len(), so we should
stop having every fabric driver implement it.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Here TRANSPORT_IQN_LEN is 224, which is a multiple of 4.
Since vhost_tpgt is 2 bytes and abi_version is 4, the total size would
be 230. But gcc needs struct size be aligned to first field size, which
is 4 bytes, so it pads the structure by extra 2 bytes to the total of
232.
This padding is very undesirable in an ABI:
- it can not be initialized easily
- it can not be checked easily
- it can leak information between kernel and userspace
Simplest solution is probably just to make the padding
explicit.
(v2: Add check for zero'ed backend->reserved field for VHOST_SCSI_SET_ENDPOINT
and VHOST_SCSI_CLEAR_ENDPOINT ops as requested by MST)
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch changes the vhost_scsi_target->vhost_wwpn[] type used
by VHOST_SCSI_* ioctls to 'char *' as requested by Blue Swirl in
order to match the latest QEMU vhost-scsi RFC-v3 userspace code.
Queuing this up into target-pending/master for a -rc3 PULL.
Reported-by: Blue Swirl <blauwirbel@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch contains the post RFC-v5 (post-merge) changes, this includes:
- Add locking comment
- Move vhost_scsi_complete_cmd ahead of TFO callbacks in order to
drop forward declarations
- Drop extra '!= NULL' usage in vhost_scsi_complete_cmd_work()
- Change vhost_scsi_*_handle_kick() to use pr_debug
- Fix possible race in vhost_scsi_set_endpoint() for vs->vs_tpg checking
+ assignment.
- Convert tv_tpg->tpg_vhost_count + ->tv_tpg_port_count from atomic_t ->
int, and make sure reference is protected by ->tv_tpg_mutex.
- Drop unnecessary vhost_scsi->vhost_ref_cnt
- Add 'err:' label for exception path in vhost_scsi_clear_endpoint()
- Add enum for VQ numbers, add usage in vhost_scsi_open()
- Add vhost_scsi_flush() + vhost_scsi_flush_vq() following
drivers/vhost/net.c
- Add smp_wmb() + vhost_scsi_flush() call during vhost_scsi_set_features()
- Drop unnecessary copy_from_user() usage with GET_ABI_VERSION ioctl
- Add missing vhost_scsi_compat_ioctl() caller for vhost_scsi_fops
- Fix function parameter definition first line to follow existing
vhost code style
- Change 'vHost' usage -> 'vhost' in handful of locations
- Change -EPERM -> -EBUSY usage for two failures in tcm_vhost_drop_nexus()
- Add comment for tcm_vhost_workqueue in tcm_vhost_init()
- Make GET_ABI_VERSION return 'int' + add comment in tcm_vhost.h
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch adds the initial code for tcm_vhost, a Vhost level TCM
fabric driver for virtio SCSI initiators into KVM guest.
This code is currently up and running on v3.5-rc2 host+guest
from target-pending/for-next-merge.
Using tcm_vhost requires Zhi's -> Stefan -> nab's qemu vhost-scsi tree here:
http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/vhost-scsi
--
Changelog v4 -> v5:
Expose ABI version via VHOST_SCSI_GET_ABI_VERSION + use Rev 0 as
starting point for v3.6-rc code (Stefan + ALiguori + nab)
Convert vhost_scsi_handle_vq() to vq_err() (nab + MST)
Minor style fixes from checkpatch (nab)
Changelog v3 -> v4:
Rename vhost_vring_target -> vhost_scsi_target (mst + nab)
Use TRANSPORT_IQN_LEN in vhost_scsi_target->vhost_wwpn[] def (nab)
Move back to drivers/vhost/, and just use drivers/vhost/Kconfig.tcm (mst)
Move TCM_VHOST related ioctl defines from include/linux/vhost.h ->
drivers/vhost/tcm_vhost.h as requested by MST (nab)
Move Kbuild.tcm include from drivers/staging -> drivers/vhost/, and
just use 'if STAGING' around 'source drivers/vhost/Kbuild.tcm'
Changelog v2 -> v3:
Unlock on error in tcm_vhost_drop_nexus() (DanC)
Fix strlen() doesn't count the terminator (DanC)
Call kfree() on an error path (DanC)
Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
Add option under drivers/staging/Kconfig, and move to drivers/vhost/tcm/
as requested by MST (nab)
Changelog v1 -> v2:
Fix tv_cmd completion -> release SGL memory leak (nab)
Fix sparse warnings for static variable usage ((Fengguang Wu)
Fix sparse warnings for min() typing + printk format specs (Fengguang Wu)
Convert to cmwq submission for I/O dispatch (nab + hch)
Changelog v0 -> v1:
Merge into single source + header file, and move to drivers/vhost/
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The vhost work queue allows processing to be done in vhost worker thread
context, which uses the owner process mm. Access to the vring and guest
memory is typically only possible from vhost worker context so it is
useful to allow work to be queued directly by users.
Currently vhost_net only uses the poll wrappers which do not expose the
work queue functions. However, for tcm_vhost (vhost_scsi) it will be
necessary to queue custom work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In order for other vhost devices to use the VHOST_FEATURES bits the
vhost-net specific bits need to be moved to their own VHOST_NET_FEATURES
constant.
(Asias: Update drivers/vhost/test.c to use VHOST_NET_FEATURES)
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Asias He <asias@redhat.com>
Signed-off-by: Nicholas A. Bellinger <nab@risingtidesystems.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On some architectures address spaces are set up in a way that this is
not necessary to work properly but on some others (like s390) it is.
Make sure we operate on the user address space to allow copy_xxx_user()
from the vhost_worker() thread by setting it explicitly before calling
use_mm() and revert it after unuse_mm().
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take vlan header length into account, when vlan id is stored as
vlan_tci. Otherwise tagged packets coming from macvtap will be
truncated.
Signed-off-by: Basil Gor <basil.gor@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We add used and signal guest in worker thread but did not poll the virtqueue
during the zero copy callback. This may lead the missing of adding and
signalling during zerocopy. Solve this by polling the virtqueue and let it
wakeup the worker during callback.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.
The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.
As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull networking fixes from David Miller:
1) Fix namespace init and cleanup in phonet to fix some oopses, from
Eric W. Biederman.
2) Missing kfree_skb() in AF_KEY, from Julia Lawall.
3) Refcount leak and source address handling fix in l2tp from James
Chapman.
4) Memory leak fix in CAIF from Tomasz Gregorek.
5) When routes are cloned from ipv6 addrconf routes, we don't process
expirations properly. Fix from Gao Feng.
6) Fix panic on DMA errors in atl1 driver, from Tony Zelenoff.
7) Only enable interrupts in 8139cp driver after we've registered the
IRQ handler. From Jason Wang.
8) Fix too many reads of KS_CIDER register in ks8851 during probe,
fixing crashes on spurious interrupts. From Matt Renzelmann.
9) Missing include in ath5k driver and missing iounmap on probe
failure, from Jonathan Bither.
10) Fix RX packet handling in smsc911x driver, from Will Deacon.
11) Fix ixgbe WoL on fiber by leaving the laser on during shutdown.
12) ks8851 needs MAX_RECV_FRAMES increased otherwise the internal MAC
buffers are easily overflown. Fix from Davide Cimingahi.
13) Fix memory leaks in peak_usb CAN driver, from Jesper Juhl.
14) gred packet scheduler can dump in WRED more when doing a netlink
dump. Fix from David Ward.
15) Fix MTU in USB smsc75xx driver, from Stephane Fillod.
16) Dummy device needs ->ndo_uninit handler to properly handle
->ndo_init failures. From Hiroaki SHIMODA.
17) Fix TX fragmentation in ath9k driver, from Sujith Manoharan.
18) Missing RTNL lock in ixgbe PM resume, from Benjamin Poirier.
19) Missing iounmap in farsync WAN driver, from Julia Lawall.
20) With LRO/GRO, tcp_grow_window() is easily tricked into not growing
the receive window properly, and this hurts performance. Fix from
Eric Dumazet.
21) Network namespace init failure can leak net_generic data, fix from
Julian Anastasov.
22) Fix skb_over_panic due to mis-accounting in TCP for partially ACK'd
SKBs. From Eric Dumazet.
23) New IDs for qmi_wwan driver, from Bjørn Mork.
24) Fix races in ax25_exit(), from Eric W. Biederman.
25) IPV6 TCP doesn't handle TCP_MAXSEG socket option properly, copy over
logic from the IPV4 side. From Neal Cardwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: fix TCP_MAXSEG for established IPv6 passive sockets
drivers/net: Do not free an IRQ if its request failed
drop_monitor: allow more events per second
ks8851: Fix request_irq/free_irq mismatch
net/hyperv: Adding cancellation to ensure rndis filter is closed
ks8851: Fix mutex deadlock in ks8851_net_stop()
net ax25: Reorder ax25_exit to remove races.
icplus: fix interrupt for IC+ 101A/G and 1001LF
net: qmi_wwan: support Sierra Wireless MC77xx devices in QMI mode
bnx2x: off by one in bnx2x_ets_e3b0_sp_pri_to_cos_set()
ksz884x: don't copy too much in netdev_set_mac_address()
tcp: fix retransmit of partially acked frames
netns: do not leak net_generic data on failed init
net/sock.h: fix sk_peek_off kernel-doc warning
tcp: fix tcp_grow_window() for large incoming frames
drivers/net/wan/farsync.c: add missing iounmap
davinci_mdio: Fix MDIO timeout check
ipv6: clean up rt6_clean_expires
ipv6: fix rt6_update_expires
arcnet: rimi: Fix device name in debug output
...
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ea5d404655
broke build for the vhost test module used
by tools/virtio. Fix it up.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We shouldn't hold any locks on release path. Pass a flag to
vhost_dev_cleanup to use the lockdep info correctly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
This is a tiny, but important, patch to vhost.
Vhost's worker thread only called schedule() when it had no work to do, and
it wanted to go to sleep. But if there's always work to do, e.g., the guest
is running a network-intensive program like netperf with small message sizes,
schedule() was *never* called. This had several negative implications (on
non-preemptive kernels):
1. Passing time was not properly accounted to the "vhost" process (ps and
top would wrongly show it using zero CPU time).
2. Sometimes error messages about RCU timeouts would be printed, if the
core running the vhost thread didn't schedule() for a very long time.
3. Worst of all, a vhost thread would "hog" the core. If several vhost
threads need to share the same core, typically one would get most of the
CPU time (and its associated guest most of the performance), while the
others hardly get any work done.
The trivial solution is to add
if (need_resched())
schedule();
After doing every piece of work. This will not do the heavy schedule() all
the time, just when the timer interrupt decided a reschedule is warranted
(so need_resched returns true).
Thanks to Abel Gordon for this patch.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Also:
- use C99 style initialization.
- add missing entry in documentation for loop-control
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The meth for calculating the # of outstanding buffers gives
incorrect results when vq->upend_idx wraps around zero.
Fix that.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On backend change, we flushed out outstanding skbs
but forgot to update the used ring, so that
done entries were left in the ubuf_info ring.
As a result we lose heads or complete incorrect ones,
crashing the guest or leaking memory.
Fix by updating the used ring.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
As we now only update used ring after enabling
the backend, we can write flags with __put_user:
as that's done on data path, it matters.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix get/put refcount imbalance with zero copy,
which caused qemu to hang forever on guest driver unload.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We need to log writes when updating used flags and avail event
fields. Otherwise the guest may see a stale value after migration and
miss notifying the host.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Move the used ring initialization after backend was set. This
makes it possible to disable the backend and tweak the used ring,
then restart. This will also make it possible to log the used ring
write correctly.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>From: Shirley Ma <mashirle@us.ibm.com>
This adds experimental zero copy support in vhost-net,
disabled by default. To enable, set
experimental_zcopytx module option to 1.
This patch maintains the outstanding userspace buffers in the
sequence it is delivered to vhost. The outstanding userspace buffers
will be marked as done once the lower device buffers DMA has finished.
This is monitored through last reference of kfree_skb callback. Two
buffer indices are used for this purpose.
The vhost-net device passes the userspace buffers info to lower device
skb through message control. DMA done status check and guest
notification are handled by handle_tx: in the worst case is all buffers
in the vq are in pending/done status, so we need to notify guest to
release DMA done buffers first before we get any new buffers from the
vq.
One known problem is that if the guest stops submitting
buffers, buffers might never get used until some
further action, e.g. device reset. This does not
seem to affect linux guests.
Signed-off-by: Shirley <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support the new event index feature. When acked,
utilize it to reduce the # of interrupts sent to the guest.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
- Documentation/kvm/ to Documentation/virtual/kvm
- Documentation/uml/ to Documentation/virtual/uml
- Documentation/lguest/ to Documentation/virtual/lguest
throughout the kernel source tree.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Use of skb_queue_empty(&sock->sk->sk_receive_queue)
without taking the sk_receive_queue.lock is unsafe
or useless. Take it out.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost takes a sock lock to try and prevent
the skb from being pulled from the receive queue
after skb_peek. However this is not the right lock to use for that,
sk_receive_queue.lock is. Fix that up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Codes duplication were found between the handling of mergeable and big
buffers, so this patch tries to unify them. This could be easily done
by adding a quota to the get_rx_bufs() which is used to limit the
number of buffers it returns (for mergeable buffer, the quota is
simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
previous handle_rx_mergeable() could be resued also for big buffers.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
No need to check the support of mergeable buffer inside the recevie
loop as the whole handle_rx()_xx is in the read critical region. So
this patch move it ahead of the receiving loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
copy_from_user is pretty high on perf top profile,
replacing it with __copy_from_user helps.
It's also safe because we do access_ok checks during setup.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Minor cleanup of vhost.c and net.c to match coding style.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.
Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To detect that a sequence number is done, we are doing math on unsigned
integers so the result is unsigned too. Not what was intended for the <=
comparison. The result is user stuck forever in flush call.
Convert to int to fix this.
Further, get rid of ({}) to make code clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This adds a test module for vhost infrastructure.
Intentionally not tied to kbuild to prevent people
from installing and loading it accidentally.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We really store a page offset in write_address,
so rename it write_page to avoid confusion.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix two bugs in dirty page logging:
When counting pages we should increase address by 1 instead of
VHOST_PAGE_SIZE. Make log_write() correctly process requests
that cross pages with write_address not starting at page boundary.
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Incorrect rcu check was used as rcu isn't done
under mutex here. Force check to 1 for now,
to stop it from complaining.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We do access_ok checks on all ring values on an ioctl,
so we don't need to redo them on each access.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Delete successive assignments to the same location.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression i;
@@
*i = ...;
i = ...;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
access_ok() returns 1 if it's OK otherwise it should return 0.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Qemu supports up to UIO_MAXIOV s/g so we have to match that because guest
drivers may rely on this.
Allocate indirect and log arrays dynamically to avoid using too much contigious
memory and make the length of hdr array to match the header length since each
iovec entry has a least one byte.
Test with copying large files w/ and w/o migration in both linux and windows
guests.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The log eventfd signalling got put in dead code.
We didn't notice because qemu currently does polling
instead of eventfd select.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In mergeable buffer case, we use headcount, log_num
and seg as indexes in same-size arrays, and
we know that headcount <= seg and
log_num equals either 0 or seg.
Therefore, the right thing to do is range-check seg,
not headcount as we do now: these will be different
if guest chains s/g descriptors (this does not
happen now, but we can not trust the guest).
Long term, we should add BUG_ON checks to verify
two other indexes are what we think they should be.
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost should set worker to NULL on cgroups attach failure,
so that we won't try to destroy the worker again on close.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Since 2.6.36-rc1, non-root users of vhost-net fail to attach
if they are in any cgroups.
The reason is that when qemu uses vhost, vhost wants to attach
its thread to all cgroups that qemu has. But we got the API backwards,
so a non-priveledged process (Qemu) tried to control
the priveledged one (vhost), which fails.
Fix this by switching to the new cgroup_attach_task_all,
and running it from the vhost thread.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Its currently illegal to call kthread_stop(NULL)
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also add rcu_dereference_protected() for code paths where locks are held.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
This adds support for mergeable buffers in vhost-net: this is needed
for older guests without indirect buffer support, as well
as for zero copy with some devices.
Includes changes by Michael S. Tsirkin to make the
patch as low risk as possible (i.e., close to no changes
when feature is disabled).
Signed-off-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>