Commit graph

211154 commits

Author SHA1 Message Date
Randy Dunlap
d187abb9a8 USB: gadget: fix composite kernel-doc warnings
Warning(include/linux/usb/composite.h:284): No description found for parameter 'disconnect'
Warning(drivers/usb/gadget/composite.c:744): No description found for parameter 'c'
Warning(drivers/usb/gadget/composite.c:744): Excess function parameter 'cdev' description in 'usb_string_ids_n'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:16 -07:00
Bill Pemberton
6b8f1ca558 USB: ssu100: set tty_flags in ssu100_process_packet
flag was never set in ssu100_process_packet.  Add logic to set it
before calling tty_insert_flip_*

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:16 -07:00
Bill Pemberton
85dee135b8 USB: ssu100: add disconnect function for ssu100
Add a disconnect function to the functions of this device.  The
disconnect is a call to usb_serial_generic_disconnect() so it requires
that symbol to be exported from generic.c.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:16 -07:00
Bill Pemberton
5c7efeb76e USB: serial: export symbol usb_serial_generic_disconnect
This is needed by the ssu100 driver to use this function.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:16 -07:00
Bill Pemberton
f81c83db56 USB: ssu100: rework logic for TIOCMIWAIT
Rework the logic for TIOCMIWAIT to use wait_event_interruptible.

This also adds support for TIOCGICOUNT.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:16 -07:00
Bill Pemberton
556f1a0e9c USB: ssu100: add register parameter to ssu100_setregister
The function ssu100_setregister was hard coded to only set the MCR
register.  Add a register parameter so that other registers can be
set.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Bill Pemberton
79f203a26a USB: ssu100: remove duplicate #defines in ssu100
The ssu100 uses a TI16C550C UART so the SERIAL_ defines in this code
are duplicates of those found in serial_reg.h.  Remove the defines in
ssu100.c and use the ones in the header file.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Bill Pemberton
9b2cef31f2 USB: ssu100: refine process_packet in ssu100
The status information does not appear at the start of each incoming
packet so the check for len < 4 at the start of ssu100_process_packet
is wrong.  Remove it.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Bill Pemberton
175230587b USB: ssu100: add locking for port private data in ssu100
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Axel Lin
96f2a34d2c USB: r8a66597-udc: return -ENOMEM if kzalloc() fails
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Greg Kroah-Hartman
0827a9ff2b USB: io_ti: check firmware version before updating
If we can't read the firmware for a device from the disk, and yet the
device already has a valid firmware image in it, we don't want to
replace the firmware with something invalid.  So check the version
number to be less than the current one to verify this is the correct
thing to do.


Reported-by: Chris Beauchamp <chris@chillibean.tv>
Tested-by: Chris Beauchamp <chris@chillibean.tv>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Michael Wileczka
d1ab903d25 USB: ftdi_sio: fix endianess of max packet size
The USB max packet size (always little-endian) was not being byte
swapped on big-endian systems.

Applicable since [USB: ftdi_sio: fix hi-speed device packet size calculation] approx 2.6.31

Signed-off-by: Michael Wileczka <mikewileczka@yahoo.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Craig Shelley
72916791cb USB: CP210x Fix Break On/Off
The definitions for BREAK_ON and BREAK_OFF are inverted, causing break
requests to fail. This patch sets BREAK_ON and BREAK_OFF to the correct
values.

Signed-off-by: Craig Shelley <craig@microtron.org.uk>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Jef Driesen
f36ecd5de9 USB: pl2303: New vendor and product id
Add support for the Zeagle N2iTiON3 dive computer interface. Since
Zeagle devices are actually manufactured by Seiko, this patch will
support other Seiko based models as well.

Signed-off-by: Jef Driesen <jefdriesen@telenet.be>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Ming Lei
d92a3ca689 USB: serial: fix leak of usb serial module refrence count
The patch with title below makes reference count of usb serial module
always more than one after driver is bound.

	USB-BKL: Remove BKL use for usb serial driver probing

In fact, the patch above only replaces lock_kernel() with try_module_get()
, and does not use module_put() to do what unlock_kernel() did, so casue leak
of reference count of usb serial module and the module can not be unloaded
after serial driver is bound with device.

This patch fixes the issue, also simplifies such things:
	-only call try_module_get() once in the entry of usb_serial_probe()
	-only call module_put() once in the exit of usb_serial_probe

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Cc: Johan Hovold <jhovold@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Ross Burton
0eee6a2b2a USB: add device IDs for igotu to navman
I recently bought a i-gotU USB GPS, and whilst hunting around for linux
support discovered this post by you back in 2009:

5148644

>Try the navman driver instead.  You can either add the device id to the
> driver and rebuild it, or do this before you plug the device in:
> 	modprobe navman
> 	echo -n "0x0df7 0x0900" > /sys/bus/usb-serial/drivers/navman/new_id
>
> and then plug your device in and see if that works.

I can confirm that the navman driver works with the right device IDs on
my i-gotU GT-600, which has the same device IDs.  Attached is a patch
adding the IDs.

From: Ross Burton <ross@linux.intel.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Michael Hennerich
ebb8a4e487 USB: isp1760: use a write barrier to ensure proper ndelay timing
The ISP1760 has some timing requirements where it has to delay a short
period after a write to a register has started.  However, this delay is
from the time the write hits the USB chip (the ISP1760), not from the
time where the processor started processing the write.  So on a quick
enough processor, it is sometimes possible for the write to not hit the
device before we start delaying, and we then violate the part's timing
requirements, so things stop working.

To avoid all this, insert a write barrier after the register write and
before the timing delay/register read so we can guarantee we only start
counting time after the write has hit the device.

Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:15 -07:00
Michael Tokarev
76078dc4fc USB: option: add Celot CT-650
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:14 -07:00
Dan Carpenter
9a887162be USB: uvc_v4l2: cleanup test for end of loop
We're trying to test for the the end of the loop here.  "format" is
never NULL.  We don't know what "format->fcc" is because we're past the
end of the loop and I think "fmt->fmt.pix.pixelformat" comes from the
user so we don't know what that is either.  It works, but it's cleaner
to just test to see if (i == ARRAY_SIZE(uvc_formats).

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-23 20:50:14 -07:00
Simon Horman
1726442e11 net: increase the size of priv_flags and add IFF_OVS_DATAPATH
IFF_OVS_DATAPATH is a place-holder for the Open vSwitch datapath
which I am preparing to submit for merging.

As all 16 bits of priv_flags are already assigned flags, also increase
the size of priv_flags to 32 bits.

Unfortunately, by my calculations this increases the size of
struct net_device by 4 bytes on 32bit architectures and
8 bytes on 64 bit architectures. I couldn't see an obvious
way to avoid that.

Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:17 -07:00
stephen hemminger
0fdc100bdc ethtool: allow non-netadmin to query settings
The SNMP daemon uses ethtool to determine the speed of
network interfaces. This fails on Debian (and probably elsewhere)
because for security SNMP daemon runs as non-root user (snmp).

Note: A similar patch was rejected previously because of a concern about
the possibility that on some hardware querying the ethtool settings
requires access to the PHY and could slow the machine down.  But the
security risk of requiring SNMP daemon (and related services)
to run as root far out weighs the risk of denial-of-service.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:16 -07:00
Eric Dumazet
afdcba371f net: copy_rtnl_link_stats64() simplification
No need to use a temporary struct rtnl_link_stats64 variable,
just copy the source to skb buffer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:16 -07:00
Changli Gao
0eec32ff35 net_sched: act_csum: coding style cleanup
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:15 -07:00
David S. Miller
7abac68602 pkt_sched: Make act_csum depend upon INET.
It uses ip_send_check() and stuff like that.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:42:11 -07:00
Dimitris Michailidis
ccea790ef0 cxgb4: update PCI ids
Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:38:15 -07:00
Dimitris Michailidis
1707aec9ac cxgb4: fix setting of the function number in transmit descriptors
Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:38:14 -07:00
Dimitris Michailidis
1478b3ee93 cxgb4: support eeprom read/write on functions other than 0
Extend the address translation for eeprom read/write (code used by
ethtool -[eE]) to functions other than 0.

Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:38:14 -07:00
Dimitris Michailidis
e46dab4d4b cxgb4: handle Rx/Tx queue ranges not starting at 0
Currently the driver assumes that queue IDs start at 0 but that's true
only for function 0.  To support operation on other functions get the
start of the queue ranges from FW and offset accordingly.

Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:38:13 -07:00
David S. Miller
f04b4dd2b1 bna: Delete get_flags and set_flags ethtool methods.
This driver doesn't support LRO, NTUPLE, or the RXHASH
features.  So it should not set these ethtool operations.

This also fixes the warning:

drivers/net/bna/bnad_ethtool.c:1272: warning: initialization from incompatible pointer type

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:34:51 -07:00
Yinglin Luan
bf82791ed6 qlcnic: fix poll implementation
Function qlcnic_intr has pointer to qlcnic_host_sds_ring
as second parameter not pointer to qlcnic_adapter.

Signed-off-by: Yinglin Luan <synmyth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:28:55 -07:00
Yinglin Luan
7b589a35a1 netxen: fix poll implementation
Function netxen_intr has pointer to nx_host_sds_ring
as second parameter not pointer to netxen_adapter.

Signed-off-by: Yinglin Luan <synmyth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:28:55 -07:00
Rasesh Mody
8b230ed8ec bna: Brocade 10Gb Ethernet device driver
This is patch 1/6 which contains linux driver source for
Brocade's BR1010/BR1020 10Gb CEE capable ethernet adapter.

Signed-off-by: Debashis Dutt <ddutt@brocade.com>
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:24:12 -07:00
Changli Gao
4c3a76abd3 bridge: netfilter: fix a memory leak
nf_bridge_alloc() always reset the skb->nf_bridge, so we should always
put the old one.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:14:36 -07:00
Gerrit Renker
231cc2aaf1 dccp ccid-2: Replace broken RTT estimator with better algorithm
The current CCID-2 RTT estimator code is in parts broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.

That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code.

Further details:
----------------
 1. The minimum RTO of previously one second has been replaced with TCP's, since
    RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
    is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
    concept of a default RTT (RFC 4340, 3.4).
 2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
    RFC2988, (2.5).
 3. De-inlined the function ccid2_new_ack().
 4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
    give the wrong estimate. It should be replaced with one sample per Ack.
    However, at the moment this can not be resolved easily, since
    - it depends on TX history code (which also needs some work),
    - the cleanest solution is not to use the `sent' time at all (saves 4 bytes
      per entry) and use DCCP timestamps / elapsed time to estimated the RTT,
      which however is non-trivial to get right (but needs to be done).

Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/

In addition, this estimator fared well in a recent empirical evaluation:

    Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
    A Performance Study of Loss Detection/Recovery in Real-world TCP
    Implementations. Proceedings of 15th IEEE International
    Conference on Network Protocols (ICNP-07), 2007.

Thus there is significant benefit in reusing the existing TCP code.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:31 -07:00
Gerrit Renker
c38c92a84a dccp ccid-2: Simplify dec_pipe and rearming of RTO timer
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.

Details and justification for removal:
--------------------------------------
 1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
    history entries between tail and head, for which it had previously been
    incremented in tx_packet_sent; and it is not decremented twice for the same
    entry, since it is
    - either decremented when a corresponding Ack Vector cell in state 0 or 1
      was received (and then ccid2s_acked==1),
    - or it is decremented when ccid2s_acked==0, as part of the loss detection
      in tx_packet_recv (and hence it can not have been decremented earlier).

 2) Restarting the RTO timer happens for every single entry in each Ack Vector
    parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
    16192 times per Ack Vector).

 3) The RTO timer should not be restarted when all outstanding data has been
    acknowledged. This is currently done similar to (2), in dec_pipe, when
    pipe has reached 0.

The patch onsolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:31 -07:00
Gerrit Renker
30564e3555 dccp ccid-2: Remove redundant sanity tests
This removes the ccid2_hc_tx_check_sanity function: it is redundant.

Details:

The tx_check_sanity function performs three tests:
 1) it checks that the circular TX list is sorted
    - in ascending order of sequence number (ccid2s_seq)
    - and time (ccid2s_sent),
    - in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
 2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
 3) it ensures that pipe equals the number of packets that were not
    marked `acked' (ccid2s_acked) between `tail' and `head'.

The following argues that each of these tests is redundant, this can be verified
by going through the code.

(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.

In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
 * at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
 * subsequent calls to tx_alloc_seq take place whenever head->next == tail in
   tx_packet_sent; then a new chunk of size 1024 is inserted between head and
   tail, and seqbufc is incremented by one.

To show that (3) is redundant requires looking at two cases.

The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv.  When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).

The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.

The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".

The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
 - pipe has been decremented earlier if the packet was marked as state 0 or 1;
 - pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:30 -07:00
Gerrit Renker
51c22bb510 dccp ccid-3: No more CCID control blocks in LISTEN state
The CCIDs are activated as last of the features, at the end of the handshake,
were the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.

This allows to remove tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID-3), but which
now have become redundant.

As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
 * the CCID is loaded, so it is not necessary to test if it is NULL,
 * if it is possible to load a CCID and leave the private area NULL, then this
    is a bug, which should crash loudly - and earlier,
 * the test for state==OPEN || state==PARTOPEN now reduces only to the closing
   phase (e.g. when the node has received an unexpected Reset).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:30 -07:00
Gerrit Renker
67b67e365f ccid: ccid-2/3 code cosmetics
This patch collects cosmetics-only changes to separate these from
code changes:
 * update with regard to CodingStyle and whitespace changes,
 * documentation:
   - adding/revising comments,
   - remove CCID-3 RX socket documentation which is either
     duplicate or refers to fields that no longer exist,
 * expand embedded tfrc_tx_info struct inline for consistency,
   removing indirections via #define.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:29 -07:00
Christoph Hellwig
b5420f2359 xfs: do not discard page cache data on EAGAIN
If xfs_map_blocks returns EAGAIN because of lock contention we must redirty the
page and not disard the pagecache content and return an error from writepage.
We used to do this correctly, but the logic got lost during the recent
reshuffle of the writepage code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Mike Gao <ygao.linux@gmail.com>
Tested-by: Mike Gao <ygao.linux@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-08-24 11:47:51 +10:00
Dave Chinner
3b93c7aaef xfs: don't do memory allocation under the CIL context lock
Formatting items requires memory allocation when using delayed
logging. Currently that memory allocation is done while holding the
CIL context lock in read mode. This means that if memory allocation
takes some time (e.g. enters reclaim), we cannot push on the CIL
until the allocation(s) required by formatting complete. This can
stall CIL pushes for some time, and once a push is stalled so are
all new transaction commits.

Fix this splitting the item formatting into two steps. The first
step which does the allocation and memcpy() into the allocated
buffer is now done outside the CIL context lock, and only the CIL
insert is done inside the CIL context lock. This avoids the stall
issue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:45:53 +10:00
Dave Chinner
a44f13edf0 xfs: Reduce log force overhead for delayed logging
Delayed logging adds some serialisation to the log force process to
ensure that it does not deference a bad commit context structure
when determining if a CIL push is necessary or not. It does this by
grabing the CIL context lock exclusively, then dropping it before
pushing the CIL if necessary. This causes serialisation of all log
forces and pushes regardless of whether a force is necessary or not.
As a result fsync heavy workloads (like dbench) can be significantly
slower with delayed logging than without.

To avoid this penalty, copy the current sequence from the context to
the CIL structure when they are swapped. This allows us to do
unlocked checks on the current sequence without having to worry
about dereferencing context structures that may have already been
freed. Hence we can remove the CIL context locking in the forcing
code and only call into the push code if the current context matches
the sequence we need to force.

By passing the sequence into the push code, we can check the
sequence again once we have the CIL lock held exclusive and abort if
the sequence has already been pushed. This avoids a lock round-trip
and unnecessary CIL pushes when we have racing push calls.

The result is that the regression in dbench performance goes away -
this change improves dbench performance on a ramdisk from ~2100MB/s
to ~2500MB/s. This compares favourably to not using delayed logging
which retuns ~2500MB/s for the same workload.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:40:03 +10:00
Dave Chinner
1a387d3be2 xfs: dummy transactions should not dirty VFS state
When we  need to cover the log, we issue dummy transactions to ensure
the current log tail is on disk. Unfortunately we currently use the
root inode in the dummy transaction, and the act of committing the
transaction dirties the inode at the VFS level.

As a result, the VFS writeback of the dirty inode will prevent the
filesystem from idling long enough for the log covering state
machine to complete. The state machine gets stuck in a loop issuing
new dummy transactions to cover the log and never makes progress.

To avoid this problem, the dummy transactions should not cause
externally visible state changes. To ensure this occurs, make sure
that dummy transactions log an unchanging field in the superblock as
it's state is never propagated outside the filesystem. This allows
the log covering state machine to complete successfully and the
filesystem now correctly enters a fully idle state about 90s after
the last modification was made.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:46:31 +10:00
Stuart Brodsky
2fe33661fc xfs: ensure f_ffree returned by statfs() is non-negative
Because of delayed updates to sb_icount field in the super block, it
is possible to allocate over maxicount number of inodes.  This
causes the arithmetic to calculate a negative number of free inodes
in user commands like df or stat -f.

Since maxicount is a somewhat arbitrary number, a slight over
allocation is not critical but user commands should be displayed as
0 or greater and never go negative.  To do this the value in the
stats buffer f_ffree is capped to never go negative.

[ Modified to use max_t as per Christoph's comment. ]

Signed-off-by: Stu Brodsky <sbrodsky@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-08-24 11:46:05 +10:00
Dave Chinner
efceab1d56 xfs: handle negative wbc->nr_to_write during sync writeback
During data integrity (WB_SYNC_ALL) writeback, wbc->nr_to_write will
go negative on inodes with more than 1024 dirty pages due to
implementation details of write_cache_pages(). Currently XFS will
abort page clustering in writeback once nr_to_write drops below
zero, and so for data integrity writeback we will do very
inefficient page at a time allocation and IO submission for inodes
with large numbers of dirty pages.

Fix this by only aborting the page clustering code when
wbc->nr_to_write is negative and the sync mode is WB_SYNC_NONE.

Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:44:56 +10:00
Dave Chinner
546a192422 writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
been. Enabling writeback tracing showed:

    flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0

Writeback is not terminating in background writeback if ->writepage is
returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
writeback on XFS.

Fix the write_cache_pages loop to terminate correctly when this situation
occurs and so prevent this sub-optimal background writeback pattern. This
improves sustained sequential buffered write performance from around
250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.

Cc:<stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:44:34 +10:00
Dave Chinner
4536f2ad8b xfs: fix untrusted inode number lookup
Commit 7124fe0a5b ("xfs: validate untrusted inode
numbers during lookup") changes the inode lookup code to do btree lookups for
untrusted inode numbers. This change made an invalid assumption about the
alignment of inodes and hence incorrectly calculated the first inode in the
cluster. As a result, some inode numbers were being incorrectly considered
invalid when they were actually valid.

The issue was not picked up by the xfstests suite because it always runs fsr
and dump (the two utilities that utilise the bulkstat interface) on cache hot
inodes and hence the lookup code in the cold cache path was not sufficiently
exercised to uncover this intermittent problem.

Fix the issue by relaxing the btree lookup criteria and then checking if the
record returned contains the inode number we are lookup for. If it we get an
incorrect record, then the inode number is invalid.

Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:42:30 +10:00
Dave Chinner
5b3eed756c xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
Under heavy load parallel metadata loads (e.g. dbench), we can fail
to mark all the inodes in a cluster being freed as XFS_ISTALE as we
skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
When this happens and the inode cluster buffer has already been
marked stale and freed, inode reclaim can try to write the inode out
as it is dirty and not marked stale. This can result in writing th
metadata to an freed extent, or in the case it has already
been overwritten trigger a magic number check failure and return an
EUCLEAN error such as:

Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117

Fix this by ensuring that we hoover up all in memory inodes in the
cluster and mark them XFS_ISTALE when freeing the cluster.

Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:42:41 +10:00
Dave Chinner
d17c701ce6 xfs: unlock items before allowing the CIL to commit
When we commit a transaction using delayed logging, we need to
unlock the items in the transaciton before we unlock the CIL context
and allow it to be checkpointed. If we unlock them after we release
the CIl context lock, the CIL can checkpoint and complete before
we free the log items. This breaks stale buffer item unlock and
unpin processing as there is an implicit assumption that the unlock
will occur before the unpin.

Also, some log items need to store the LSN of the transaction commit
in the item (inodes and EFIs) and so can race with other transaction
completions if we don't prevent the CIL from checkpointing before
the unlock occurs.

Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:42:52 +10:00
Linus Torvalds
d1b113bb02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  netfilter: fix CONFIG_COMPAT support
  isdn/avm: fix build when PCMCIA is not enabled
  header: fix broken headers for user space
  e1000e: don't check for alternate MAC addr on parts that don't support it
  e1000e: disable ASPM L1 on 82573
  ll_temac: Fix poll implementation
  netxen: fix a race in netxen_nic_get_stats()
  qlnic: fix a race in qlcnic_get_stats()
  irda: fix a race in irlan_eth_xmit()
  net: sh_eth: remove unused variable
  netxen: update version 4.0.74
  netxen: fix inconsistent lock state
  vlan: Match underlying dev carrier on vlan add
  ibmveth: Fix opps during MTU change on an active device
  ehea: Fix synchronization between HW and SW send queue
  bnx2x: Update bnx2x version to 1.52.53-4
  bnx2x: Fix PHY locking problem
  rds: fix a leak of kernel memory
  netlink: fix compat recvmsg
  netfilter: fix userspace header warning
  ...
2010-08-23 18:30:30 -07:00
Linus Torvalds
9c5ea3675d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86:
  hp-wmi: Fix query interface
  ACPI_TOSHIBA needs LEDS support
2010-08-23 18:29:34 -07:00