Currently when we notice a HW ECC error we request that the user reload
the driver to force a reset of the part. This was done due to the mistaken
belief that a normal reset would not be sufficient. It turns out that a
normal reset is sufficient, so now we just schedule a reset upon seeing the
ECC error.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This option has the same semantics as IP_PMTUDISC_OMIT for IPv4, which
was recently introduced. It doesn't honor the path MTU discovered by the
host but, in contrast to IPV6_PMTUDISC_INTERFACE, allows the generation
of fragments if the packet size exceeds the MTU of the outgoing
interface.
Fixes: 93b36cf342 ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
generation of fragments if the interface mtu is exceeded, it is very
hard to make use of this option in already deployed name server software
for which I introduced this option.
This patch adds yet another new IP_MTU_DISCOVER option which does not
honor any path MTU information and does not accept new ICMP notifications
destined for the socket this option is enabled on. However, we do allow
outgoing fragmentation in case the packet size exceeds the outgoing
interface MTU.
As such this new option can be used as a drop-in replacement for
IP_PMTUDISC_DONT, which is currently in use by most name server software
making the adoption of this option very smooth and easy.
The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
ignoring incoming path MTU updates and not honoring discovered path MTUs
in the output path.
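As a rough illustration (not part of this patch), a name server currently
using IP_PMTUDISC_DONT could adopt the new mode with a one-line change;
userspace built against older headers would have to define the constant
itself, as assumed in this sketch:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IP_PMTUDISC_OMIT
#define IP_PMTUDISC_OMIT 5    /* assumed value of the new uapi constant */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int val = IP_PMTUDISC_OMIT;    /* previously IP_PMTUDISC_DONT */

    if (fd < 0 || setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                             &val, sizeof(val)) < 0)
        perror("IP_MTU_DISCOVER");
    return 0;
}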
Fixes: 482fc6094a ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
is attached to the skb (in case of forwarding) or determines the mtu like
we do in ip_finish_output, which actually checks if we should branch to
ip_fragment. Thus use the same function to determine the mtu here, too.
This is important for the introduction of IP_PMTUDISC_OMIT, where we
want packets to be cut into pieces of the size of the outgoing interface
MTU. IPv6 already does this correctly.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new F: line for the intel subdirectories.
This allows get_maintainers to avoid using git log
and cc'ing people that have submitted clean-up style
patches for all first level directories under
drivers/net/ethernet/intel/
This does not make e100.c maintained.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 arpd seems to expect this as there's code and comments
to handle netlink probes with NUD_PROBE set. It is used to flush
the arpd cached mappings.
opennhrp instead turns off unicast probes (so it can handle all
neighbour discovery). Without this change it will not see NUD_PROBE
probes and cannot reconfirm the mapping. Thus currently the neigh entry
will just fail and a few packets can be dropped until broadcast
discovery is restarted.
Earlier discussion on the subject:
http://marc.info/?t=139305877100001&r=1&w=2
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8fad346f36 ("ieee802154: add basic support for RF212 to
at86rf230 driver") introduced the new function is_rf212() with some
minor issues in its declaration:
1) Fix the function type by changing it to bool as the function
definition returns a boolean value. Additionally, both callers of
is_rf212() expect a boolean return value.
2) Fix the function specifier by deleting the inline keyword as the
compiler takes care of that.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
These info messages are rather pointless without any means to identify
the source of the bogus packets. Logging the src and dst addresses and
ports may help a bit.
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai says:
====================
net, net/mlx4: Add sysfs file for port number
Modern distros use biosdevname to rename interfaces to names based on
slot/port number.
biosdevname can't get the port number of devices that have multiple ports
sharing the same PCI function.
This patch adds a sysfs file under /sys/devices/.../net/<interface>/dev_port
that contains the port number (0-based) - to be used by biosdevname.
Also, dev_id was wrongly used in the mlx4_en driver - a patch that fixes it
is included.
This patch was tested and applied over commit 51adfcc "net: bcmgenet: remove
unused bh_lock member"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_id should be set for multiple netdevs sharing the same MAC, which
is not the case here.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize dev_port with the port number (0-based) so it can be accessed
through sysfs from user space.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysfs file to enable user space to query the device
port number used by a netdevice instance. This is needed for
devices that have multiple ports on the same PCI function.
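A minimal userspace sketch (not from this patch) of how a naming tool such
as biosdevname could consume the new attribute; the interface name is only
an example:

#include <stdio.h>

int main(void)
{
    unsigned int port;
    FILE *f = fopen("/sys/class/net/eth0/dev_port", "r");

    if (!f) {
        perror("dev_port");
        return 1;
    }
    if (fscanf(f, "%u", &port) != 1) {
        fclose(f);
        fprintf(stderr, "could not parse dev_port\n");
        return 1;
    }
    fclose(f);
    printf("eth0 is port %u (0-based) of its PCI function\n", port);
    return 0;
}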
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Schmidt says:
====================
bnx2x: minimize RAM usage in kdump
kdump kernels usually have only a small amount of memory reserved.
bnx2x can be memory-hungry. Let's minimize its memory usage when
running in kdump.
I detect kdump by looking at the "reset_devices" flag. A couple of
storage drivers (cciss, hpsa) use it for the same purpose. I am not sure
this is the best way to solve the problem, but it works.
Should it be made more generic by, say, looking at the total amount
of lowmem instead? Not using TPA by default when lowmem is small and/or
defaulting to fewer queues would help 32bit systems where a driver for
a multi-function multi-queue NIC can consume a significant amount
of available memory. Or do we want no such heuristics?
Is this something to consider doing for other network drivers too?
====================
Acked-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running in a kdump kernel, disable TPA. This saves memory, which
tends to be scarce in kdump.
TPA, being a receive acceleration, is unlikely to be useful for kdump,
whose purpose is to send the memory image out.
This saves an additional 5 MB in the kdump environment.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running in a kdump kernel, make sure to use only a single ethernet
queue even if a num_queues option in /etc/modprobe.d/*.conf would specify
otherwise. This saves memory, which tends to be scarce in kdump.
This saves about 40 MB in the kdump environment on a setup with
num_queues=8 in the config file.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the clamp() macro to make the calculation of the number of queues
slightly easier to understand. It also avoids a crash when someone
accidentally passes a negative value in the num_queues= module parameter.
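A standalone illustration (not the bnx2x code; names are made up) of why
clamp-style bounding is safer than open-coded checks -- a negative request
collapses to 1 and an oversized one is capped:

#include <stdio.h>

/* same semantics as the kernel's clamp(): bound val to [lo, hi] */
#define clamp_int(val, lo, hi) \
    ((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

int main(void)
{
    int requested[] = { -3, 0, 4, 1000 };
    int max_queues = 16;
    int i;

    for (i = 0; i < 4; i++)
        printf("num_queues=%d -> using %d queue(s)\n",
               requested[i], clamp_int(requested[i], 1, max_queues));
    return 0;
}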
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three counters are added:
- one to track when we went from non-zero to zero window
- one to track the reverse
- one counter incremented when we want to announce a zero window,
but can't because doing so would shrink the current window.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we receive a PTP event from the NIC when we haven't set up PTP state
in the driver, we attempt to read through a NULL pointer efx->ptp_data,
triggering a panic.
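A hedged sketch of the kind of guard this implies -- not the exact sfc
diff; the log message and severity are illustrative:

    /* in the PTP event handler: bail out if PTP state was never set up */
    if (!efx->ptp_data) {
        netif_err(efx, drv, efx->net_dev,
                  "PTP event received but PTP is not set up\n");
        return;
    }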
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
in tcp_transmit_skb() from softirq (incoming message or timer
activation), it is better to use NET_INC_STATS() instead of
NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
context.
This will avoid copy/paste confusion when/if we want to add
other SNMP counters in tcp_transmit_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
All ethertypes other than ETH_P_MPLS_UC, ETH_P_MPLS_MC and
ETH_P_ATMMPOA were already ordered numerically. This commit moves
those three ETH_P_... values into correct numerical order too.
Signed-off-by: Neil Jerram <Neil.Jerram@metaswitch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Nomadik debugfs screws up multiplatform boots if debugfs
is enabled on the multiplatform image, since it's a simple
initcall that is unconditionally executed and reads from certain
memory locations.
Fix this by checking that the driver has been properly
initialized, so a base offset to the Nomadik SRC controller
exists, before proceeding to register debugfs files.
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
Read-only large sptes can be created due to read-only faults as
follows:
- QEMU pagetable entry that maps guest memory is read-only
due to COW.
- Guest read faults such memory, COW is not broken, because
it is a read-only fault.
- Enable dirty logging, large spte not nuked because it is read-only.
- Write-fault on such memory causes guest to loop endlessly
(which must go down to level 1 because dirty logging is enabled).
Fix by dropping large spte when necessary.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a memory leak in the lp3943_pwm_request_map() error handling path.
Make sure already allocated pwm map memory is freed correctly.
Detected by Coverity: CID 1162829.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Acked-by: Milo Kim <milo.kim@ti.com>
Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not. So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately. This fix resolves numerous instances of:
udevd[]: worker [] unexpectedly returned with status 0x0100
that have been seen during testing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The driver reads from the DC offset control registers during calibration,
but since the registers are not marked as volatile and there is a register
cache, the values will not be read from the hardware after the first read,
rendering the calibration ineffective.
It appears that the driver was originally written for the ASoC level
register I/O code but converted to regmap prior to merge and this issue
was missed during the conversion as the framework level volatile register
functionality was not being used.
Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Adam Thomson <Adam.Thomson.Opensource@diasemi.com>
Cc: stable@vger.kernel.org
When a policy is unlinked from the lists in thread context,
the xfrm timer can fire before we can mark this policy as dead.
So reinitialize the bydst hlist; then hlist_unhashed() will
notice that this policy is not linked and will avoid a
double unlink of that policy.
Reported-by: Xianpeng Zhao <673321875@qq.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Include kASLR offset in VMCOREINFO ELF notes to assist in debugging.
[ hpa: pushing this for v3.14 to avoid having a kernel version with
kASLR where we can't debug output. ]
Signed-off-by: Eugene Surovegin <surovegin@google.com>
Link: http://lkml.kernel.org/r/20140123173120.GA25474@www.outflux.net
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
During restore, the pm_notifier chain is called with
PM_RESTORE_PREPARE. The firmware_class notifier
fw_pm_notify does not have a handler for this event. As a result,
it keeps a reader on the kmod.c umhelper_sem. During
freeze_processes, the call to __usermodehelper_disable tries to
take a write lock on this semaphore and hangs waiting.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Acked-by: Ming Lei <ming.lei@canonical.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit fcb6a15c2e (intel_pstate: Take core C0 time into account for
core busy calculation) introduced a regression on some processor SKUs
supported by intel_pstate. This was due to the truncation caused by
using integer math to calculate core busy and C0 percentages.
On an i7-4770K processor operating at 800 MHz and going to 100% utilization,
the percent busy of the CPU using integer math is 22%, but it actually
is 22.85%. This value scaled to the current frequency returned 97,
which the PID interpreted as no error and did not adjust the P state.
Tested on i7-4770K, i7-2600, i5-3230M.
Fixes: fcb6a15c2e (intel_pstate: Take core C0 time into account for core busy calculation)
References: https://lkml.org/lkml/2014/2/19/626
References: https://bugzilla.kernel.org/show_bug.cgi?id=70941
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Masking the link partner's capabilities with local capabilities can be
misleading in autonegotiation scenarios such as PAUSE frame
autonegotiation.
This patch calculates the join between the local capabilities and the
link partner capabilities when it determines the speed and duplex
settings, but does not mask any of the link partner capabilities when
it calculates PAUSE frame settings.
Signed-off-by: Cristian Bercaru <cristian.bercaru@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge misc fixes from Andrew Morton.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: change mailing list address for Altera UART drivers
Makefile: fix build with make 3.80 again
MAINTAINERS: update L: misuses
Makefile: fix extra parenthesis typo when CC_STACKPROTECTOR_REGULAR is enabled
ipc,mqueue: remove limits for the amount of system-wide queues
memcg: change oom_info_lock to mutex
mm, thp: fix infinite loop on memcg OOM
drivers/fmc/fmc-write-eeprom.c: fix decimal permissions
drivers/iommu/omap-iommu-debug.c: fix decimal permissions
mm, hwpoison: release page on PageHWPoison() in __do_fault()
netlink_sendmsg() was changed to prevent non-root processes from sending
messages with dst_pid != 0.
netlink_connect() however still only checks if nladdr->nl_groups is set.
This patch modifies netlink_connect() to check for the same condition.
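A hedged sketch of the added condition, mirroring the sendmsg-side test;
the helper and flag names below are taken from af_netlink.c and are assumed
here, not quoted from the patch:

    /* in netlink_connect(): unprivileged sockets may not target another pid */
    if ((nladdr->nl_groups || nladdr->nl_pid) &&
        !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
        return -EPERM;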
Signed-off-by: Mike Pecovnik <mike.pecovnik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without a shutdown handler, T4 cards behave very badly after a kexec.
Some firmware calls return errors indicating allocation failures, for
example. This is probably because those resources were not released by
a BYE message to the firmware.
Using the remove handler guarantees we will use a well tested path.
With this patch applied, I managed to use kexec multiple times, and
probe and iSCSI login worked every time.
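A hedged sketch of the wiring this describes (handler names are
illustrative, not quoted from the driver):

    static struct pci_driver cxgb4_driver = {
        .name     = KBUILD_MODNAME,
        .probe    = init_one,
        .remove   = remove_one,
        .shutdown = remove_one,  /* reuse the well tested remove path */
    };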
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shahed Shaikh says:
====================
qlcnic: Bug fixes
This patch series includes the following bug fixes:
* Fix the return value handling of qlcnic_enable_msi_legacy().
* Fix the usage of module parameters for interrupt mode. The driver
should use flags when checking the driver's interrupt mode instead of
module parameters.
* Revert commit 1414abea04 (qlcnic: Restrict VF from configuring any VLAN mode),
in order to save some multicast filters.
* Fix a bug where the driver was not re-setting the sds ring count to 1 when
it falls back from MSI-X mode to legacy interrupt mode.
Please apply to net.
Change in v2 -
Dropped patch "qlcnic: reset firmware API lock during driver load" for further rework.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
o Driver was not re-setting the sds ring count to 1 after failing
to allocate MSI-X interrupts.
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o This patch reverts commit 1414abea04
(qlcnic: Restrict VF from configuring any VLAN mode.)
This will allow the same multicast address to be used with any VLAN
instead of programming separate (MAC, VLAN) tuples in the adapter.
This will help save some multicast filters.
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once interrupts are enabled, instead of using module parameters,
use the flags (QLCNIC_MSI_ENABLED and QLCNIC_MSIX_ENABLED) set by the
driver to check the interrupt mode.
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver was treating a negative return value as success when
qlcnic_enable_msi_legacy() failed.
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To guarantee that a qdio ccw_device no longer touches the
qdio memory shared with Linux, the qdio ccw_device should
be offline when freeing the qdio memory. Thus this patch
postpones freeing of qdio memory.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the UFO fragmentation process does not correctly handle inner
UDP frames.
(The following tcpdumps were captured on the parent interface with UFO
disabled while the tunnel has UFO enabled; 2000 bytes payload, MTU 1280,
both sit devices):
IPv6:
16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
We can see that the fragmentation header offset is not correctly updated.
(fragmentation id handling is corrected by 916e4cf46d ("ipv6: reuse
ip6_frag_id from ip6_ufo_append_data")).
IPv4:
16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
In this case the fragmentation id is incremented and the offset is not updated.
First, I aligned inet_gso_segment and ipv6_gso_segment:
* align naming of flags
* ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
always ensure that the state of this flag is left untouched when
returning from the upper gso segmentation function
* ipv6_gso_segment: move skb_reset_inner_headers below updating the
fragmentation header data; we don't care about updating fragmentation
header data
* remove the currently unneeded comment indicating skb->encapsulation might
get changed by the upper gso_segment callback (gre and udp-tunnel reset
encapsulation after segmentation on each fragment)
If we encounter an IPIP or SIT gso skb we now check for the protocol ==
IPPROTO_UDP and that we at least have already traversed another ip(6)
protocol header.
The reason why we have to special-case GSO_IPIP and GSO_SIT is that
we reset skb->encapsulation to 0 while skb_mac_gso_segment segments the
inner protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
Reported-by: Wolfgang Walter <linux@stwm.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nios2-dev list has been moved to the RocketBoards infrastructure, so
adjust the address accordingly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to Documentation/Changes, make 3.80 is still being supported
for building the kernel, hence makefiles must not make (unconditional)
use of features introduced only in newer versions. Commit 8779657d29
("stackprotector: Introduce CONFIG_CC_STACKPROTECTOR_STRONG") however
introduced an "else ifdef" construct which make 3.80 doesn't understand.
Also correct a warning message still referencing the old config option
name.
Apart from that I question the use of "ifdef" here (but it was used that
way already prior to said commit): ifeq (,y) would seem more to the
point.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
L: lines are for the email addresses of traditional mailing lists.
W: lines are for URLs.
Convert two L: misuses to W: links.
Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An extra parenthesis typo introduced in 19952a9203 ("stackprotector:
Unify the HAVE_CC_STACKPROTECTOR logic between architectures") is
causing the following error when CONFIG_CC_STACKPROTECTOR_REGULAR is
enabled:
Makefile:608: Cannot use CONFIG_CC_STACKPROTECTOR: -fstack-protector not supported by compiler
Makefile:608: *** missing separator. Stop.
Signed-off-by: Fathi Boudra <fathi.boudra@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 93e6f119c0 ("ipc/mqueue: cleanup definition names and
locations") added global hardcoded limits to the number of message
queues that can be created. While these limits are per-namespace, the
reality is that they end up breaking userspace applications.
Historically users have, at least in theory, been able to create up to
INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
for some workloads and use cases. For instance, Madars reports:
"This update imposes bad limits on our multi-process application. As
our app uses approaches that each process opens its own set of queues
(usually something about 3-5 queues per process). In some scenarios
we might run up to 3000 processes or more (which of-course for linux
is not a problem). Thus we might need up to 9000 queues or more. All
processes run under one user."
Other affected users can be found in launchpad bug #1155695:
https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695
Instead of increasing this limit, revert it entirely and fall back to the
original way of dealing with queue limits -- where once a user's resource
limit is reached, and all memory is used, new queues cannot be created.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Madars Vitolins <m@silodev.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill has reported the following:
Task in /test killed as a result of limit of /test
memory: usage 10240kB, limit 10240kB, failcnt 51
memory+swap: usage 10240kB, limit 10240kB, failcnt 0
kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
Memory cgroup stats for /test:
BUG: sleeping function called from invalid context at kernel/cpu.c:68
in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
2 locks held by memcg_test/66:
#0: (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
#1: (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
__might_sleep+0x16a/0x210
get_online_cpus+0x1c/0x60
mem_cgroup_read_stat+0x27/0xb0
mem_cgroup_print_oom_info+0x260/0x390
dump_header+0x88/0x251
? trace_hardirqs_on+0xd/0x10
oom_kill_process+0x258/0x3d0
mem_cgroup_oom_synchronize+0x656/0x6c0
? mem_cgroup_charge_common+0xd0/0xd0
pagefault_out_of_memory+0x14/0x90
mm_fault_error+0x91/0x189
__do_page_fault+0x48e/0x580
do_page_fault+0xe/0x10
page_fault+0x22/0x30
which complains that mem_cgroup_read_stat cannot be called from an atomic
context but mem_cgroup_print_oom_info takes a spinlock. Change
oom_info_lock to a mutex.
This was introduced by 947b3dd1a8 ("memcg, oom: lock
mem_cgroup_print_oom_info").
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masayoshi Mizuma reported a bug with the hang of an application under
the memcg limit. It happens on a write-protection fault to the huge zero
page. If we successfully allocate a huge page to replace the zero page but
hit the memcg limit, we need to split the zero page with
split_huge_page_pmd() and fall back to small pages.
The other part of the problem is that VM_FAULT_OOM has special meaning
in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the page
to be split if it sees VM_FAULT_OOM and it will retry page fault
handling. This causes an infinite loop if the page was not split.
do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
to allocate one small page, so fallback to small pages will not help.
The solution for this part is to replace VM_FAULT_OOM with
VM_FAULT_FALLBACK if fallback is required.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>