* remotes/origin/tmp-2f0de51:
  Linux 4.4.38
  esp6: Fix integrity verification when ESN are used
  esp4: Fix integrity verification when ESN are used
  ipv4: Set skb->protocol properly for local output
  ipv6: Set skb->protocol properly for local output
  Don't feed anything but regular iovec's to blk_rq_map_user_iov
  constify iov_iter_count() and iter_is_iovec()
  sparc64: fix compile warning section mismatch in find_node()
  sparc64: Fix find_node warning if numa node cannot be found
  sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
  net: ping: check minimum size on ICMP header length
  net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  geneve: avoid use-after-free of skb->data
  sh_eth: remove unchecked interrupts for RZ/A1
  net: bcmgenet: Utilize correct struct device for all DMA operations
  packet: fix race condition in packet_set_ring
  net/dccp: fix use-after-free in dccp_invalid_packet
  netlink: Do not schedule work from sk_destruct
  netlink: Call cb->done from a worker thread
  net/sched: pedit: make sure that offset is valid
  net, sched: respect rcu grace period on cls destruction
  net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link change
  l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  rtnetlink: fix FDB size computation
  af_unix: conditionally use freezable blocking calls in read
  net: sky2: Fix shutdown crash
  ip6_tunnel: disable caching when the traffic class is inherited
  net: check dead netns for peernet2id_alloc()
  virtio-net: add a missing synchronize_net()
  Linux 4.4.37
  arm64: suspend: Reconfigure PSTATE after resume from idle
  arm64: mm: Set PSTATE.PAN from the cpu_enable_pan() call
  arm64: cpufeature: Schedule enable() calls instead of calling them via IPI
  pwm: Fix device reference leak
  mwifiex: printk() overflow with 32-byte SSIDs
  PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)
  PCI: Export pcie_find_root_port
  rcu: Fix soft lockup for rcu_nocb_kthread
  ALSA: pcm : Call kill_fasync() in stream lock
  x86/traps: Ignore high word of regs->cs in early_fixup_exception()
  kasan: update kasan_global for gcc 7
  zram: fix unbalanced idr management at hot removal
  ARC: Don't use "+l" inline asm constraint
  Linux 4.4.36
  scsi: mpt3sas: Unblock device after controller reset
  flow_dissect: call init_default_flow_dissectors() earlier
  mei: fix return value on disconnection
  mei: me: fix place for kaby point device ids.
  mei: me: disable driver on SPT SPS firmware
  drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on
  mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
  parisc: Also flush data TLB in flush_icache_page_asm
  parisc: Fix race in pci-dma.c
  parisc: Fix races in parisc_setup_cache_timing()
  NFSv4.x: hide array-bounds warning
  apparmor: fix change_hat not finding hat after policy replacement
  cfg80211: limit scan results cache size
  tile: avoid using clocksource_cyc2ns with absolute cycle count
  scsi: mpt3sas: Fix secure erase premature termination
  Fix USB CB/CBI storage devices with CONFIG_VMAP_STACK=y
  USB: serial: ftdi_sio: add support for TI CC3200 LaunchPad
  USB: serial: cp210x: add ID for the Zone DPMX
  usb: chipidea: move the lock initialization to core file
  KVM: x86: check for pic and ioapic presence before use
  KVM: x86: drop error recovery in em_jmp_far and em_ret_far
  iommu/vt-d: Fix IOMMU lookup for SR-IOV Virtual Functions
  iommu/vt-d: Fix PASID table allocation
  sched: tune: Fix lacking spinlock initialization
  UPSTREAM: trace: Update documentation for mono, mono_raw and boot clock
  UPSTREAM: trace: Add an option for boot clock as trace clock
  UPSTREAM: timekeeping: Add a fast and NMI safe boot clock
  ANDROID: goldfish_pipe: fix allmodconfig build
  ANDROID: goldfish: goldfish_pipe: fix locking errors
  ANDROID: video: goldfishfb: fix platform_no_drv_owner.cocci warnings
  ANDROID: goldfish_pipe: fix call_kern.cocci warnings
  arm64: rename ranchu defconfig to ranchu64
  ANDROID: arch: x86: disable pic for Android toolchain
  ANDROID: goldfish_pipe: An implementation of more parallel pipe
  ANDROID: goldfish_pipe: bugfixes and performance improvements.
  ANDROID: goldfish: Add goldfish sync driver
  ANDROID: goldfish: add ranchu defconfigs
  ANDROID: goldfish_audio: Clear audio read buffer status after each read
  ANDROID: goldfish_events: no extra EV_SYN; register goldfish
  ANDROID: goldfish_fb: Set pixclock = 0
  ANDROID: goldfish: Enable ACPI-based enumeration for goldfish audio
  ANDROID: goldfish: Enable ACPI-based enumeration for goldfish framebuffer
  ANDROID: video: goldfishfb: add devicetree bindings
  BACKPORT: staging: goldfish: audio: fix compiliation on arm
  BACKPORT: Input: goldfish_events - enable ACPI-based enumeration for goldfish events
  BACKPORT: goldfish: Enable ACPI-based enumeration for goldfish battery
  BACKPORT: drivers: tty: goldfish: Add device tree bindings
  BACKPORT: tty: goldfish: support platform_device with id -1
  BACKPORT: Input: goldfish_events - add devicetree bindings
  BACKPORT: power: goldfish_battery: add devicetree bindings
  BACKPORT: staging: goldfish: audio: add devicetree bindings
  ANDROID: usb: gadget: function: cleanup: Add blank line after declaration
  cpufreq: sched: Fix kernel crash on accessing sysfs file
  usb: gadget: f_mtp: simplify ptp NULL pointer check
  cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
  cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
  cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
  writeback: initialize inode members that track writeback history
  mm: page_alloc: generalize the dirty balance reserve
  block: fix module reference leak on put_disk() call for cgroups throttle
  Linux 4.4.35
  netfilter: nft_dynset: fix element timeout for HZ != 1000
  IB/cm: Mark stale CM id's whenever the mad agent was unregistered
  IB/uverbs: Fix leak of XRC target QPs
  IB/core: Avoid unsigned int overflow in sg_alloc_table
  IB/mlx5: Fix fatal error dispatching
  IB/mlx5: Use cache line size to select CQE stride
  IB/mlx4: Fix create CQ error flow
  IB/mlx4: Check gid_index return value
  PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
  PM / sleep: fix device reference leak in test_suspend
  uwb: fix device reference leaks
  mfd: core: Fix device reference leak in mfd_clone_cell
  iwlwifi: pcie: fix SPLC structure parsing
  rtc: omap: Fix selecting external osc
  clk: mmp: mmp2: fix return value check in mmp2_clk_init()
  clk: mmp: pxa168: fix return value check in pxa168_clk_init()
  clk: mmp: pxa910: fix return value check in pxa910_clk_init()
  drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
  crypto: caam - do not register AES-XTS mode on LP units
  ext4: sanity check the block and cluster size at mount time
  kbuild: Steal gcc's pie from the very beginning
  x86/kexec: add -fno-PIE
  scripts/has-stack-protector: add -fno-PIE
  kbuild: add -fno-PIE
  i2c: mux: fix up dependencies
  can: bcm: fix warning in bcm_connect/proc_register
  mfd: intel-lpss: Do not put device in reset state on suspend
  fuse: fix fuse_write_end() if zero bytes were copied
  KVM: Disable irq while unregistering user notifier
  KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
  x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
  Linux 4.4.34
  sparc64: Delete now unused user copy fixup functions.
  sparc64: Delete now unused user copy assembler helpers.
  sparc64: Convert U3copy_{from,to}_user to accurate exception reporting.
  sparc64: Convert NG2copy_{from,to}_user to accurate exception reporting.
  sparc64: Convert NGcopy_{from,to}_user to accurate exception reporting.
  sparc64: Convert NG4copy_{from,to}_user to accurate exception reporting.
  sparc64: Convert U1copy_{from,to}_user to accurate exception reporting.
  sparc64: Convert GENcopy_{from,to}_user to accurate exception reporting.
  sparc64: Convert copy_in_user to accurate exception reporting.
  sparc64: Prepare to move to more saner user copy exception handling.
  sparc64: Delete __ret_efault.
  sparc64: Handle extremely large kernel TLB range flushes more gracefully.
  sparc64: Fix illegal relative branches in hypervisor patched TLB cross-call code.
  sparc64: Fix instruction count in comment for __hypervisor_flush_tlb_pending.
  sparc64: Fix illegal relative branches in hypervisor patched TLB code.
  sparc64: Handle extremely large kernel TSB range flushes sanely.
  sparc: Handle negative offsets in arch_jump_label_transform
  sparc64 mm: Fix base TSB sizing when hugetlb pages are used
  sparc: serial: sunhv: fix a double lock bug
  sparc: Don't leak context bits into thread->fault_address
  tty: Prevent ldisc drivers from re-using stale tty fields
  tcp: take care of truncations done by sk_filter()
  ipv4: use new_gw for redirect neigh lookup
  net: __skb_flow_dissect() must cap its return value
  sock: fix sendmmsg for partial sendmsg
  fib_trie: Correct /proc/net/route off by one error
  sctp: assign assoc_id earlier in __sctp_connect
  ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
  ipv6: dccp: fix out of bound access in dccp_v6_err()
  dccp: fix out of bound access in dccp_v4_err()
  dccp: do not send reset to already closed sockets
  tcp: fix potential memory corruption
  ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()
  bgmac: stop clearing DMA receive control register right after it is set
  net: mangle zero checksum in skb_checksum_help()
  net: clear sk_err_soft in sk_clone_lock()
  dctcp: avoid bogus doubling of cwnd after loss
  ARM: 8485/1: cpuidle: remove cpu parameter from the cpuidle_ops suspend hook
  Linux 4.4.33
  netfilter: fix namespace handling in nf_log_proc_dostring
  btrfs: qgroup: Prevent qgroup->reserved from going subzero
  mmc: mxs: Initialize the spinlock prior to using it
  ASoC: sun4i-codec: return error code instead of NULL when create_card fails
  ACPI / APEI: Fix incorrect return value of ghes_proc()
  i40e: fix call of ndo_dflt_bridge_getlink()
  hwrng: core - Don't use a stack buffer in add_early_randomness()
  lib/genalloc.c: start search from start of chunk
  mei: bus: fix received data size check in NFC fixup
  iommu/vt-d: Fix dead-locks in disable_dmar_iommu() path
  iommu/amd: Free domain id when free a domain of struct dma_ops_domain
  tty/serial: at91: fix hardware handshake on Atmel platforms
  dmaengine: at_xdmac: fix spurious flag status for mem2mem transfers
  drm/i915: Respect alternate_ddc_pin for all DDI ports
  KVM: MIPS: Precalculate MMIO load resume PC
  scsi: mpt3sas: Fix for block device of raid exists even after deleting raid disk
  scsi: qla2xxx: Fix scsi scan hang triggered if adapter fails during init
  iio: orientation: hid-sensor-rotation: Add PM function (fix non working driver)
  iio: hid-sensors: Increase the precision of scale to fix wrong reading interpretation.
  clk: qoriq: Don't allow CPU clocks higher than starting value
  toshiba-wmi: Fix loading the driver on non Toshiba laptops
  drbd: Fix kernel_sendmsg() usage - potential NULL deref
  usb: gadget: u_ether: remove interrupt throttling
  USB: cdc-acm: fix TIOCMIWAIT
  staging: nvec: remove managed resource from PS2 driver
  Revert "staging: nvec: ps2: change serio type to passthrough"
  drivers: staging: nvec: remove bogus reset command for PS/2 interface
  staging: iio: ad5933: avoid uninitialized variable in error case
  pinctrl: cherryview: Prevent possible interrupt storm on resume
  pinctrl: cherryview: Serialize register access in suspend/resume
  ARC: timer: rtc: implement read loop in "C" vs. inline asm
  s390/hypfs: Use get_free_page() instead of kmalloc to ensure page alignment
  coredump: fix unfreezable coredumping task
  swapfile: fix memory corruption via malformed swapfile
  dib0700: fix nec repeat handling
  ASoC: cs4270: fix DAPM stream name mismatch
  ALSA: info: Limit the proc text input size
  ALSA: info: Return error for invalid read/write
  arm64: Enable KPROBES/HIBERNATION/CORESIGHT in defconfig
  arm64: kvm: allows kvm cpu hotplug
  arm64: KVM: Register CPU notifiers when the kernel runs at HYP
  arm64: KVM: Skip HYP setup when already running in HYP
  arm64: hyp/kvm: Make hyp-stub reject kvm_call_hyp()
  arm64: hyp/kvm: Make hyp-stub extensible
  arm64: kvm: Move lr save/restore from do_el2_call into EL1
  arm64: kvm: deal with kernel symbols outside of linear mapping
  arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region
  ANDROID: video: adf: Avoid directly referencing user pointers
  ANDROID: usb: gadget: audio_source: fix comparison of distinct pointer types
  android: binder: support for file-descriptor arrays.
  android: binder: support for scatter-gather.
  android: binder: add extra size to allocator.
  android: binder: refactor binder_transact()
  android: binder: support multiple /dev instances.
  android: binder: deal with contexts in debugfs.
  android: binder: support multiple context managers.
  android: binder: split flat_binder_object.
  disable aio support in recommended configuration
  Linux 4.4.32
  scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression
  drm/radeon: fix DP mode validation
  drm/radeon/dp: add back special handling for NUTMEG
  drm/amdgpu: fix DP mode validation
  drm/amdgpu/dp: add back special handling for NUTMEG
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  Revert KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  of: silence warnings due to max() usage
  packet: on direct_xmit, limit tso and csum to supported devices
  sctp: validate chunk len before actually using it
  net sched filters: fix notification of filter delete with proper handle
  udp: fix IP_CHECKSUM handling
  net: sctp, forbid negative length
  ipv4: use the right lock for ping_group_range
  ipv4: disable BH in set_ping_group_range()
  net: add recursion limit to GRO
  rtnetlink: Add rtnexthop offload flag to compare mask
  bridge: multicast: restore perm router ports on multicast enable
  net: pktgen: remove rcu locking in pktgen_change_name()
  ipv6: correctly add local routes when lo goes up
  ip6_tunnel: fix ip6_tnl_lookup
  ipv6: tcp: restore IP6CB for pktoptions skbs
  netlink: do not enter direct reclaim from netlink_dump()
  packet: call fanout_release, while UNREGISTERING a netdev
  net: Add netdev all_adj_list refcnt propagation to fix panic
  net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions
  net: pktgen: fix pkt_size
  net: fec: set mac address unconditionally
  tg3: Avoid NULL pointer dereference in tg3_io_error_detected()
  ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
  ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
  tcp: fix a compile error in DBGUNDO()
  tcp: fix wrong checksum calculation on MTU probing
  net: avoid sk_forward_alloc overflows
  tcp: fix overflow in __tcp_retransmit_skb()
  arm64/kvm: fix build issue on kvm debug
  arm64: ptdump: Indicate whether memory should be faulting
  arm64: Add support for ARCH_SUPPORTS_DEBUG_PAGEALLOC
  arm64: Drop alloc function from create_mapping
  arm64: allow vmalloc regions to be set with set_memory_*
  arm64: kernel: implement ACPI parking protocol
  arm64: mm: create new fine-grained mappings at boot
  arm64: ensure _stext and _etext are page-aligned
  arm64: mm: allow passing a pgdir to alloc_init_*
  arm64: mm: allocate pagetables anywhere
  arm64: mm: use fixmap when creating page tables
  arm64: mm: add functions to walk tables in fixmap
  arm64: mm: add __{pud,pgd}_populate
  arm64: mm: avoid redundant __pa(__va(x))
  Linux 4.4.31
  HID: usbhid: add ATEN CS962 to list of quirky devices
  ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap()
  kvm: x86: Check memopp before dereference (CVE-2016-8630)
  tty: vt, fix bogus division in csi_J
  usb: dwc3: Fix size used in dma_free_coherent()
  pwm: Unexport children before chip removal
  UBI: fastmap: scrub PEB when bitflips are detected in a free PEB EC header
  Disable "frame-address" warning
  smc91x: avoid self-comparison warning
  cgroup: avoid false positive gcc-6 warning
  drm/exynos: fix error handling in exynos_drm_subdrv_open
  mm/cma: silence warnings due to max() usage
  ARM: 8584/1: floppy: avoid gcc-6 warning
  powerpc/ptrace: Fix out of bounds array access warning
  x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
  perf build: Fix traceevent plugins build race
  drm/dp/mst: Check peer device type before attempting EDID read
  drm/radeon: drop register readback in cayman_cp_int_cntl_setup
  drm/radeon/si_dpm: workaround for SI kickers
  drm/radeon/si_dpm: Limit clocks on HD86xx part
  Revert "drm/radeon: fix DP link training issue with second 4K monitor"
  mmc: dw_mmc-pltfm: fix the potential NULL pointer dereference
  scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware
  scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded
  scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices
  mac80211: discard multicast and 4-addr A-MSDUs
  firewire: net: fix fragmented datagram_size off-by-one
  firewire: net: guard against rx buffer overflows
  Input: i8042 - add XMG C504 to keyboard reset table
  dm mirror: fix read error on recovery after default leg failure
  virtio: console: Unlock vqs while freeing buffers
  virtio_ring: Make interrupt suppression spec compliant
  parisc: Ensure consistent state when switching to kernel stack at syscall entry
  ovl: fsync after copy-up
  KVM: MIPS: Make ERET handle ERL before EXL
  KVM: x86: fix wbinvd_dirty_mask use-after-free
  dm: free io_barrier after blk_cleanup_queue call
  USB: serial: cp210x: fix tiocmget error handling
  tty: limit terminal size to 4M chars
  xhci: add restart quirk for Intel Wildcatpoint PCH
  hv: do not lose pending heartbeat vmbus packets
  vt: clear selection before resizing
  Fix potential infoleak in older kernels
  GenWQE: Fix bad page access during abort of resource allocation
  usb: increase ohci watchdog delay to 275 msec
  xhci: use default USB_RESUME_TIMEOUT when resuming ports.
  USB: serial: ftdi_sio: add support for Infineon TriBoard TC2X7
  USB: serial: fix potential NULL-dereference at probe
  usb: gadget: function: u_ether: don't starve tx request queue
  mei: txe: don't clean an unprocessed interrupt cause.
  ubifs: Fix regression in ubifs_readdir()
  ubifs: Abort readdir upon error
  btrfs: fix races on root_log_ctx lists
  ANDROID: binder: Clear binder and cookie when setting handle in flat binder struct
  ANDROID: binder: Add strong ref checks
  ALSA: hda - Fix headset mic detection problem for two Dell laptops
  ALSA: hda - Adding a new group of pin cfg into ALC295 pin quirk table
  ALSA: hda - allow 40 bit DMA mask for NVidia devices
  ALSA: hda - Raise AZX_DCAPS_RIRB_DELAY handling into top drivers
  ALSA: hda - Merge RIRB_PRE_DELAY into CTX_WORKAROUND caps
  ALSA: usb-audio: Add quirk for Syntek STK1160
  KEYS: Fix short sprintf buffer in /proc/keys show function
  mm: memcontrol: do not recurse in direct reclaim
  mm/list_lru.c: avoid error-path NULL pointer deref
  libxfs: clean up _calc_dquots_per_chunk
  h8300: fix syscall restarting
  drm/dp/mst: Clear port->pdt when tearing down the i2c adapter
  i2c: core: fix NULL pointer dereference under race condition
  i2c: xgene: Avoid dma_buffer overrun
  arm64:cpufeature ARM64_NCAPS is the indicator of last feature
  arm64: hibernate: Refuse to hibernate if the boot cpu is offline
  PM / sleep: Add support for read-only sysfs attributes
  arm64: kernel: Add support for hibernate/suspend-to-disk
  arm64: mm: add functions to walk page tables by PA
  arm64: mm: move pte_* macros
  PM / Hibernate: Call flush_icache_range() on pages restored in-place
  arm64: Add new asm macro copy_page
  arm64: Promote KERNEL_START/KERNEL_END definitions to a header file
  arm64: kernel: Include _AC definition in page.h
  arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va
  arm64: kernel: Rework finisher callback out of __cpu_suspend_enter()
  arm64: Cleanup SCTLR flags
  arm64: Fold proc-macros.S into assembler.h
  arm/arm64: KVM: Add hook for C-based stage2 init
  arm/arm64: KVM: Detect vGIC presence at runtime
  arm64: KVM: Add support for 16-bit VMID
  arm: KVM: Make kvm_arm.h friendly to assembly code
  arm/arm64: KVM: Remove unreferenced S2_PGD_ORDER
  arm64: KVM: debug: Remove spurious inline attributes
  ARM: KVM: Cleanup exception injection
  arm64: KVM: Remove weak attributes
  arm64: KVM: Cleanup asm-offset.c
  arm64: KVM: Turn system register numbers to an enum
  arm64: KVM: VHE: Patch out use of HVC
  arm64: Add ARM64_HAS_VIRT_HOST_EXTN feature
  arm/arm64: Add new is_kernel_in_hyp_mode predicate
  arm64: KVM: Move away from the assembly version of the world switch
  arm64: KVM: Map the kernel RO section into HYP
  arm64: KVM: Add compatibility aliases
  arm64: KVM: Implement vgic-v3 save/restore
  arm64: KVM: Add panic handling
  arm64: KVM: HYP mode entry points
  arm64: KVM: Implement TLB handling
  arm64: KVM: Implement fpsimd save/restore
  arm64: KVM: Implement the core world switch
  arm64: KVM: Add patchable function selector
  arm64: KVM: Implement guest entry
  arm64: KVM: Implement debug save/restore
  arm64: KVM: Implement 32bit system register save/restore
  arm64: KVM: Implement system register save/restore
  arm64: KVM: Implement timer save/restore
  arm64: KVM: Implement vgic-v2 save/restore
  arm64: KVM: Add a HYP-specific header file
  KVM: arm/arm64: vgic-v3: Make the LR indexing macro public
  arm64: Add macros to read/write system registers
  Linux 4.4.30
  Revert "fix minor infoleak in get_user_ex()"
  Revert "x86/mm: Expand the exception table logic to allow new handling options"
  Linux 4.4.29
  ARM: pxa: pxa_cplds: fix interrupt handling
  powerpc/nvram: Fix an incorrect partition merge
  mpt3sas: Don't spam logs if logging level is 0
  perf symbols: Fixup symbol sizes before picking best ones
  perf symbols: Check symbol_conf.allow_aliases for kallsyms loading too
  perf hists browser: Fix event group display
  clk: divider: Fix clk_divider_round_rate() to use clk_readl()
  clk: qoriq: fix a register offset error
  s390/con3270: fix insufficient space padding
  s390/con3270: fix use of uninitialised data
  s390/cio: fix accidental interrupt enabling during resume
  x86/mm: Expand the exception table logic to allow new handling options
  dmaengine: ipu: remove bogus NO_IRQ reference
  power: bq24257: Fix use of uninitialized pointer bq->charger
  staging: r8188eu: Fix scheduling while atomic splat
  ASoC: dapm: Fix kcontrol creation for output driver widget
  ASoC: dapm: Fix value setting for _ENUM_DOUBLE MUX's second channel
  ASoC: dapm: Fix possible uninitialized variable in snd_soc_dapm_get_volsw()
  ASoC: topology: Fix error return code in soc_tplg_dapm_widget_create()
  hwrng: omap - Only fail if pm_runtime_get_sync returns < 0
  crypto: arm/ghash-ce - add missing async import/export
  crypto: gcm - Fix IV buffer size in crypto_gcm_setkey
  mwifiex: correct aid value during tdls setup
  spi: spi-fsl-dspi: Drop extra spi_master_put in device remove function
  ARM: clk-imx35: fix name for ckil clk
  uio: fix dmem_region_start computation
  genirq/generic_chip: Add irq_unmap callback
  perf stat: Fix interval output values
  powerpc/eeh: Null check uses of eeh_pe_bus_get
  tunnels: Remove encapsulation offloads on decap.
  tunnels: Don't apply GRO to multiple layers of encapsulation.
  ipip: Properly mark ipip GRO packets as encapsulated.
  posix_acl: Clear SGID bit when setting file permissions
  brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap()
  mm/hugetlb: fix memory offline with hugepage size > memory block size
  drm/i915: Unalias obj->phys_handle and obj->userptr
  drm/i915: Account for TSEG size when determining 865G stolen base
  Revert "drm/i915: Check live status before reading edid"
  drm/i915/gen9: fix the WaWmMemoryReadLatency implementation
  xenbus: don't look up transaction IDs for ordinary writes
  drm/vmwgfx: Limit the user-space command buffer size
  drm/radeon: change vblank_time's calculation method to reduce computational error.
  drm/radeon/si/dpm: fix phase shedding setup
  drm/radeon: narrow asic_init for virtualization
  drm/amdgpu: change vblank_time's calculation method to reduce computational error.
  drm/amdgpu/dce11: add missing drm_mode_config_cleanup call
  drm/amdgpu/dce11: disable hpd on local panels
  drm/amdgpu/dce8: disable hpd on local panels
  drm/amdgpu/dce10: disable hpd on local panels
  drm/amdgpu: fix IB alignment for UVD
  drm/prime: Pass the right module owner through to dma_buf_export()
  Linux 4.4.28
  target: Don't override EXTENDED_COPY xcopy_pt_cmd SCSI status code
  target: Make EXTENDED_COPY 0xe4 failure return COPY TARGET DEVICE NOT REACHABLE
  target: Re-add missing SCF_ACK_KREF assignment in v4.1.y
  ubifs: Fix xattr_names length in exit paths
  jbd2: fix incorrect unlock on j_list_lock
  ext4: do not advertise encryption support when disabled
  mmc: rtsx_usb_sdmmc: Handle runtime PM while changing the led
  mmc: rtsx_usb_sdmmc: Avoid keeping the device runtime resumed when unused
  mmc: core: Annotate cmd_hdr as __le32
  powerpc/mm: Prevent unlikely crash in copro_calculate_slb()
  ceph: fix error handling in ceph_read_iter
  arm64: kernel: Init MDCR_EL2 even in the absence of a PMU
  arm64: percpu: rewrite ll/sc loops in assembly
  memstick: rtsx_usb_ms: Manage runtime PM when accessing the device
  memstick: rtsx_usb_ms: Runtime resume the device when polling for cards
  isofs: Do not return EACCES for unknown filesystems
  irqchip/gic-v3-its: Fix entry size mask for GITS_BASER
  s390/mm: fix gmap tlb flush issues
  Using BUG_ON() as an assert() is _never_ acceptable
  mm: filemap: fix mapping->nrpages double accounting in fuse
  mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()
  acpi, nfit: check for the correct event code in notifications
  net/mlx4_core: Allow resetting VF admin mac to zero
  bnx2x: Prevent false warning for lack of FC NPIV
  PKCS#7: Don't require SpcSpOpusInfo in Authenticode pkcs7 signatures
  hpsa: correct skipping masked peripherals
  sd: Fix rw_max for devices that report an optimal xfer size
  irqchip/gicv3: Handle loop timeout proper
  kvm: x86: memset whole irq_eoi
  x86/e820: Don't merge consecutive E820_PRAM ranges
  blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
  Fix regression which breaks DFS mounting
  Cleanup missing frees on some ioctls
  Do not send SMB3 SET_INFO request if nothing is changing
  SMB3: GUIDs should be constructed as random but valid uuids
  Set previous session id correctly on SMB3 reconnect
  Display number of credits available
  Clarify locking of cifs file and tcon structures and make more granular
  fs/cifs: keep guid when assigning fid to fileinfo
  cifs: Limit the overall credit acquired
  fs/super.c: fix race between freeze_super() and thaw_super()
  arc: don't leak bits of kernel stack into coredump
  lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM
  ipc/sem.c: fix complex_count vs. simple op race
  mm: filemap: don't plant shadow entries without radix tree node
  metag: Only define atomic_dec_if_positive conditionally
  scsi: Fix use-after-free
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
  NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
  sunrpc: fix write space race causing stalls
  Input: elantech - add Fujitsu Lifebook E556 to force crc_enabled
  Input: elantech - force needed quirks on Fujitsu H760
  Input: i8042 - skip selftest on ASUS laptops
  lib: add "on"/"off" support to kstrtobool
  lib: update single-char callers of strtobool()
  lib: move strtobool() to kstrtobool()
  MIPS: ptrace: Fix regs_return_value for kernel context
  MIPS: Fix -mabi=64 build of vdso.lds
  ALSA: hda - Fix a failure of micmute led when having multi adcs
  cx231xx: fix GPIOs for Pixelview SBTVD hybrid
  cx231xx: don't return error on success
  mb86a20s: fix demod settings
  mb86a20s: fix the locking logic
  ovl: copy_up_xattr(): use strnlen
  ovl: Fix info leak in ovl_lookup_temp()
  fbdev/efifb: Fix 16 color palette entry calculation
  scsi: zfcp: spin_lock_irqsave() is not nestable
  zfcp: trace full payload of all SAN records (req,resp,iels)
  zfcp: fix payload trace length for SAN request&response
  zfcp: fix D_ID field with actual value on tracing SAN responses
  zfcp: restore tracing of handle for port and LUN with HBA records
  zfcp: trace on request for open and close of WKA port
  zfcp: restore: Dont use 0 to indicate invalid LUN in rec trace
  zfcp: retain trace level for SCSI and HBA FSF response records
  zfcp: close window with unblocked rport during rport gone
  zfcp: fix ELS/GS request&response length for hardware data router
  zfcp: fix fc_host port_type with NPIV
  ubi: Deal with interrupted erasures in WL
  powerpc/pseries: Fix stack corruption in htpe code
  powerpc/64: Fix incorrect return value from __copy_tofrom_user
  powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data()
  powerpc/powernv: Use CPU-endian hub diag-data type in pnv_eeh_get_and_dump_hub_diag()
  powerpc/powernv: Pass CPU-endian PE number to opal_pci_eeh_freeze_clear()
  powerpc/vdso64: Use double word compare on pointers
  dm crypt: fix crash on exit
  dm mpath: check if path's request_queue is dying in activate_path()
  dm: return correct error code in dm_resume()'s retry loop
  dm: mark request_queue dead before destroying the DM device
  perf intel-pt: Fix MTC timestamp calculation for large MTC periods
  perf intel-pt: Fix estimated timestamps for cycle-accurate mode
  perf intel-pt: Fix snapshot overlap detection decoder errors
  pstore/ram: Use memcpy_fromio() to save old buffer
  pstore/ram: Use memcpy_toio instead of memcpy
  pstore/core: drop cmpxchg based updates
  pstore/ramoops: fixup driver removal
  parisc: Increase initial kernel mapping size
  parisc: Fix kernel memory layout regarding position of __gp
  parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels
  cpufreq: intel_pstate: Fix unsafe HWP MSR access
  platform: don't return 0 from platform_get_irq[_byname]() on error
  PCI: Mark Atheros AR9580 to avoid bus reset
  mmc: sdhci: cast unsigned int to unsigned long long to avoid unexpeted error
  mmc: block: don't use CMD23 with very old MMC cards
  rtlwifi: Fix missing country code for Great Britain
  PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
  clk: imx6: initialize GPU clocks
  regulator: tps65910: Work around silicon erratum SWCZ010
  mei: me: add kaby point device ids
  gpio: mpc8xxx: Correct irq handler function
  cgroup: Change from CAP_SYS_NICE to CAP_SYS_RESOURCE for cgroup migration permissions
  UPSTREAM: cpu/hotplug: Handle unbalanced hotplug enable/disable
  UPSTREAM: arm64: kaslr: fix breakage with CONFIG_MODVERSIONS=y
  UPSTREAM: arm64: kaslr: keep modules close to the kernel when DYNAMIC_FTRACE=y
  cgroup: Remove leftover instances of allow_attach
  BACKPORT: lib: harden strncpy_from_user
  CHROMIUM: cgroups: relax permissions on moving tasks between cgroups
  CHROMIUM: remove Android's cgroup generic permissions checks
  Linux 4.4.27
  cfq: fix starvation of asynchronous writes
  vfs: move permission checking into notify_change() for utimes(NULL)
  dlm: free workqueues after the connections
  crypto: vmx - Fix memory corruption caused by p8_ghash
  crypto: ghash-generic - move common definitions to a new header file
  ext4: release bh in make_indexed_dir
  ext4: allow DAX writeback for hole punch
  ext4: fix memory leak in ext4_insert_range()
  ext4: reinforce check of i_dtime when clearing high fields of uid and gid
  ext4: enforce online defrag restriction for encrypted files
  scsi: ibmvfc: Fix I/O hang when port is not mapped
  scsi: arcmsr: Simplify user_len checking
  scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer()
  async_pq_val: fix DMA memory leak
  reiserfs: switch to generic_{get,set,remove}xattr()
  reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()
  ASoC: Intel: Atom: add a missing star in a memcpy call
  brcmfmac: fix memory leak in brcmf_fill_bss_param
  i40e: avoid NULL pointer dereference and recursive errors on early PCI error
  fuse: fix killing s[ug]id in setattr
  fuse: invalidate dir dentry after chmod
  fuse: listxattr: verify xattr list
  drivers: base: dma-mapping: page align the size when unmap_kernel_range
  btrfs: assign error values to the correct bio structs
  serial: 8250_dw: Check the data->pclk when get apb_pclk
  arm64: Use PoU cache instr for I/D coherency
  arm64: mm: add code to safely replace TTBR1_EL1
  arm64: mm: place __cpu_setup in .text
  arm64: add function to install the idmap
  arm64: unmap idmap earlier
  arm64: unify idmap removal
  arm64: mm: place empty_zero_page in bss
  arm64: head.S: use memset to clear BSS
  arm64: mm: specialise pagetable allocators
  arm64: mm: remove pointless PAGE_MASKing
  asm-generic: Fix local variable shadow in __set_fixmap_offset
  arm64: mm: fold alternatives into .init
  ARM: 8511/1: ARM64: kernel: PSCI: move PSCI idle management code to drivers/firmware
  ARM: 8481/2: drivers: psci: replace psci firmware calls
  ARM: 8480/2: arm64: add implementation for arm-smccc
  ARM: 8479/2: add implementation for arm-smccc
  ARM: 8478/2: arm/arm64: add arm-smccc
  ARM: 8510/1: rework ARM_CPU_SUSPEND dependencies
  ARM: 8458/1: bL_switcher: add GIC dependency
  Linux 4.4.26
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
  x86/build: Build compressed x86 kernels as PIE
  arm64: Remove stack duplicating code from jprobes
  arm64: kprobes: Add KASAN instrumentation around stack accesses
  arm64: kprobes: Cleanup jprobe_return
  arm64: kprobes: Fix overflow when saving stack
  arm64: kprobes: WARN if attempting to step with PSTATE.D=1
  kprobes: Add arm64 case in kprobe example module
  arm64: Add kernel return probes support (kretprobes)
  arm64: Add trampoline code for kretprobes
  arm64: kprobes instruction simulation support
  arm64: Treat all entry code as non-kprobe-able
  arm64: Blacklist non-kprobe-able symbol
  arm64: Kprobes with single stepping support
  arm64: add conditional instruction simulation support
  arm64: Add more test functions to insn.c
  arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature
  Linux 4.4.25
  tpm_crb: fix crb_req_canceled behavior
  tpm: fix a race condition in tpm2_unseal_trusted()
  ima: use file_dentry()
  ARM: cpuidle: Fix error return code
  ARM: dts: MSM8064 remove flags from SPMI/MPP IRQs
  ARM: dts: mvebu: armada-390: add missing compatibility string and bracket
  x86/dumpstack: Fix x86_32 kernel_stack_pointer() previous stack access
  x86/irq: Prevent force migration of irqs which are not in the vector domain
  x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation
  KVM: PPC: BookE: Fix a sanity check
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  mfd: wm8350-i2c: Make sure the i2c regmap functions are compiled
  mfd: 88pm80x: Double shifting bug in suspend/resume
  mfd: atmel-hlcdc: Do not sleep in atomic context
  mfd: rtsx_usb: Avoid setting ucr->current_sg.status
  ALSA: usb-line6: use the same declaration as definition in header for MIDI manufacturer ID
  ALSA: usb-audio: Extend DragonFly dB scale quirk to cover other variants
  ALSA: ali5451: Fix out-of-bound position reporting
  timekeeping: Fix __ktime_get_fast_ns() regression
  time: Add cycles to nanoseconds translation
  mm: Fix build for hardened usercopy
  ANDROID: binder: Clear binder and cookie when setting handle in flat binder struct
  ANDROID: binder: Add strong ref checks
  UPSTREAM: staging/android/ion : fix a race condition in the ion driver
  ANDROID: android-base: CONFIG_HARDENED_USERCOPY=y
  UPSTREAM: fs/proc/kcore.c: Add bounce buffer for ktext data
  UPSTREAM: fs/proc/kcore.c: Make bounce buffer global for read
  BACKPORT: arm64: Correctly bounds check virt_addr_valid
  Fix a build breakage in IO latency hist code.
UPSTREAM: efi: include asm/early_ioremap.h not asm/efi.h to get early_memremap UPSTREAM: ia64: split off early_ioremap() declarations into asm/early_ioremap.h FROMLIST: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN FROMLIST: arm64: xen: Enable user access before a privcmd hvc call FROMLIST: arm64: Handle faults caused by inadvertent user access with PAN enabled FROMLIST: arm64: Disable TTBR0_EL1 during normal kernel execution FROMLIST: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1 FROMLIST: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro FROMLIST: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros UPSTREAM: arm64: Handle el1 synchronous instruction aborts cleanly UPSTREAM: arm64: include alternative handling in dcache_by_line_op UPSTREAM: arm64: fix "dc cvau" cache operation on errata-affected core UPSTREAM: Revert "arm64: alternatives: add enable parameter to conditional asm macros" UPSTREAM: arm64: Add new asm macro copy_page UPSTREAM: arm64: kill ESR_LNX_EXEC UPSTREAM: arm64: add macro to extract ESR_ELx.EC UPSTREAM: arm64: mm: mark fault_info table const UPSTREAM: arm64: fix dump_instr when PAN and UAO are in use BACKPORT: arm64: Fold proc-macros.S into assembler.h UPSTREAM: arm64: choose memstart_addr based on minimum sparsemem section alignment UPSTREAM: arm64/mm: ensure memstart_addr remains sufficiently aligned UPSTREAM: arm64/kernel: fix incorrect EL0 check in inv_entry macro UPSTREAM: arm64: Add macros to read/write system registers UPSTREAM: arm64/efi: refactor EFI init and runtime code for reuse by 32-bit ARM UPSTREAM: arm64/efi: split off EFI init and runtime code for reuse by 32-bit ARM UPSTREAM: arm64/efi: mark UEFI reserved regions as MEMBLOCK_NOMAP BACKPORT: arm64: only consider memblocks with NOMAP cleared for linear mapping UPSTREAM: mm/memblock: add MEMBLOCK_NOMAP attribute to memblock memory table ANDROID: dm: android-verity: Remove fec_header location constraint BACKPORT: 
audit: consistently record PIDs with task_tgid_nr() android-base.cfg: Enable kernel ASLR UPSTREAM: vmlinux.lds.h: allow arch specific handling of ro_after_init data section UPSTREAM: arm64: spinlock: fix spin_unlock_wait for LSE atomics UPSTREAM: arm64: avoid TLB conflict with CONFIG_RANDOMIZE_BASE UPSTREAM: arm64: Only select ARM64_MODULE_PLTS if MODULES=y sched: Add Kconfig option DEFAULT_USE_ENERGY_AWARE to set ENERGY_AWARE feature flag sched/fair: remove printk while schedule is in progress ANDROID: fs: FS tracepoints to track IO. sched/walt: Drop arch-specific timer access ANDROID: fiq_debugger: Pass task parameter to unwind_frame() eas/sched/fair: Fixing comments in find_best_target. input: keyreset: switch to orderly_reboot UPSTREAM: tun: fix transmit timestamp support UPSTREAM: arch/arm/include/asm/pgtable-3level.h: add pmd_mkclean for THP net: inet: diag: expose the socket mark to privileged processes. net: diag: make udp_diag_destroy work for mapped addresses. net: diag: support SOCK_DESTROY for UDP sockets net: diag: allow socket bytecode filters to match socket marks net: diag: slightly refactor the inet_diag_bc_audit error checks. 
net: diag: Add support to filter on device index UPSTREAM: brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap() Linux 4.4.24 ALSA: hda - Add the top speaker pin config for HP Spectre x360 ALSA: hda - Fix headset mic detection problem for several Dell laptops ACPICA: acpi_get_sleep_type_data: Reduce warnings ALSA: hda - Adding one more ALC255 pin definition for headset problem Revert "usbtmc: convert to devm_kzalloc" USB: serial: cp210x: Add ID for a Juniper console Staging: fbtft: Fix bug in fbtft-core usb: misc: legousbtower: Fix NULL pointer deference USB: serial: cp210x: fix hardware flow-control disable dm log writes: fix bug with too large bios clk: xgene: Add missing parenthesis when clearing divider value aio: mark AIO pseudo-fs noexec batman-adv: remove unused callback from batadv_algo_ops struct IB/mlx4: Use correct subnet-prefix in QP1 mads under SR-IOV IB/mlx4: Fix code indentation in QP1 MAD flow IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV IB/ipoib: Don't allow MC joins during light MC flush IB/core: Fix use after free in send_leave function IB/ipoib: Fix memory corruption in ipoib cm mode connect flow KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write dmaengine: at_xdmac: fix to pass correct device identity to free_irq() kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd ASoC: omap-mcpdm: Fix irq resource handling sysctl: handle error writing UINT_MAX to u32 fields powerpc/prom: Fix sub-processor option passed to ibm, client-architecture-support brcmsmac: Initialize power in brcms_c_stf_ss_algo_channel_get() brcmsmac: Free packet if dma_mapping_error() fails in dma_rxfill brcmfmac: Fix glob_skb leak in brcmf_sdiod_recv_chain ASoC: Intel: Skylake: Fix error return code in skl_probe() pNFS/flexfiles: Fix layoutcommit after a commit to DS pNFS/files: Fix layoutcommit after a commit to DS NFS: Don't drop CB requests with invalid principals svc: Avoid garbage replies when pc_func() returns rpc_drop_reply 
dmaengine: at_xdmac: fix debug string fnic: pci_dma_mapping_error() doesn't return an error code avr32: off by one in at32_init_pio() ath9k: Fix programming of minCCA power threshold gspca: avoid unused variable warnings em28xx-i2c: rt_mutex_trylock() returns zero on failure NFC: fdp: Detect errors from fdp_nci_create_conn() iwlmvm: mvm: set correct state in smart-fifo configuration tile: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO pstore: drop file opened reference count blk-mq: actually hook up defer list when running requests hwrng: omap - Fix assumption that runtime_get_sync will always succeed ARM: sa1111: fix pcmcia suspend/resume ARM: shmobile: fix regulator quirk for Gen2 ARM: sa1100: clear reset status prior to reboot ARM: sa1100: fix 3.6864MHz clock ARM: sa1100: register clocks early ARM: sun5i: Fix typo in trip point temperature regulator: qcom_smd: Fix voltage ranges for pm8x41 regulator: qcom_spmi: Update mvs1/mvs2 switches on pm8941 regulator: qcom_spmi: Add support for get_mode/set_mode on switches regulator: qcom_spmi: Add support for S4 supply on pm8941 tpm: fix byte-order for the value read by tpm2_get_tpm_pt printk: fix parsing of "brl=" option MIPS: uprobes: fix use of uninitialised variable MIPS: Malta: Fix IOCU disable switch read for MIPS64 MIPS: fix uretprobe implementation MIPS: uprobes: remove incorrect set_orig_insn arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7 irqchip/gicv3: Silence noisy DEBUG_PER_CPU_MAPS warning gpio: sa1100: fix irq probing for ucb1x00 usb: gadget: fsl_qe_udc: signedness bug in qe_get_frame() ceph: fix race during filling readdir cache iwlwifi: mvm: don't use ret when not initialised iwlwifi: pcie: fix access to scratch buffer spi: sh-msiof: Avoid invalid clock generator parameters hwmon: (adt7411) set bit 3 in CFG1 register nvmem: Declare nvmem_cell_read() consistently ipvs: fix bind to link-local mcast IPv6 address in 
backup tools/vm/slabinfo: fix an unintentional printf mmc: pxamci: fix potential oops drivers/perf: arm_pmu: Fix leak in error path pinctrl: Flag strict is a field in struct pinmux_ops pinctrl: uniphier: fix .pin_dbg_show() callback i40e: avoid null pointer dereference perf/core: Fix pmu::filter_match for SW-led groups iwlwifi: mvm: fix a few firmware capability checks usb: musb: fix DMA for host mode usb: musb: Fix DMA desired mode for Mentor DMA engine ARM: 8617/1: dma: fix dma_max_pfn() ARM: 8616/1: dt: Respect property size when parsing CPUs drm/radeon/si/dpm: add workaround for for Jet parts drm/nouveau/fifo/nv04: avoid ramht race against cookie insertion x86/boot: Initialize FPU and X86_FEATURE_ALWAYS even if we don't have CPUID x86/init: Fix cr4_init_shadow() on CR4-less machines can: dev: fix deadlock reported after bus-off mm,ksm: fix endless looping in allocating memory when ksm enable mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl cpuset: handle race between CPU hotplug and cpuset_hotplug_work usercopy: fold builtin_const check into inline function Linux 4.4.23 hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common() qxl: check for kmap failures power: supply: max17042_battery: fix model download bug. power_supply: tps65217-charger: fix missing platform_set_drvdata() PM / hibernate: Fix rtree_next_node() to avoid walking off list ends PM / hibernate: Restore processor state before using per-CPU variables MIPS: paravirt: Fix undefined reference to smp_bootstrap MIPS: Add a missing ".set pop" in an early commit MIPS: Avoid a BUG warning during prctl(PR_SET_FP_MODE, ...) 
MIPS: Remove compact branch policy Kconfig entries MIPS: vDSO: Fix Malta EVA mapping to vDSO page structs MIPS: SMP: Fix possibility of deadlock when bringing CPUs online MIPS: Fix pre-r6 emulation FPU initialisation i2c: qup: skip qup_i2c_suspend if the device is already runtime suspended i2c-eg20t: fix race between i2c init and interrupt enable btrfs: ensure that file descriptor used with subvol ioctls is a dir nl80211: validate number of probe response CSA counters can: flexcan: fix resume function mm: delete unnecessary and unsafe init_tlb_ubc() tracing: Move mutex to protect against resetting of seq data fix memory leaks in tracing_buffers_splice_read() power: reset: hisi-reboot: Unmap region obtained by of_iomap mtd: pmcmsp-flash: Allocating too much in init_msp_flash() mtd: maps: sa1100-flash: potential NULL dereference fix fault_in_multipages_...() on architectures with no-op access_ok() fanotify: fix list corruption in fanotify_get_response() fsnotify: add a way to stop queueing events on group shutdown xfs: prevent dropping ioend completions during buftarg wait autofs: use dentry flags to block walks during expire autofs races pwm: Mark all devices as "might sleep" bridge: re-introduce 'fix parsing of MLDv2 reports' net: smc91x: fix SMC accesses Revert "phy: IRQ cannot be shared" net: dsa: bcm_sf2: Fix race condition while unmasking interrupts net/mlx5: Added missing check of msg length in verifying its signature tipc: fix NULL pointer dereference in shutdown() net/irda: handle iriap_register_lsap() allocation failure vti: flush x-netns xfrm cache when vti interface is removed af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock' Revert "af_unix: Fix splice-bind deadlock" bonding: Fix bonding crash megaraid: fix null pointer check in megasas_detach_one(). 
nouveau: fix nv40_perfctr_next() cleanup regression Staging: iio: adc: fix indent on break statement iwlegacy: avoid warning about missing braces ath9k: fix misleading indentation am437x-vfpe: fix typo in vpfe_get_app_input_index Add braces to avoid "ambiguous ‘else’" compiler warnings net: caif: fix misleading indentation Makefile: Mute warning for __builtin_return_address(>0) for tracing only Disable "frame-address" warning Disable "maybe-uninitialized" warning globally gcov: disable -Wmaybe-uninitialized warning Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES kbuild: forbid kernel directory to contain spaces and colons tools: Support relative directory path for 'O=' Makefile: revert "Makefile: Document ability to make file.lst and file.S" partially kbuild: Do not run modules_install and install in paralel ocfs2: fix start offset to ocfs2_zero_range_for_truncate() ocfs2/dlm: fix race between convert and migration crypto: echainiv - Replace chaining with multiplication crypto: skcipher - Fix blkcipher walk OOM crash crypto: arm/aes-ctr - fix NULL dereference in tail processing crypto: arm64/aes-ctr - fix NULL dereference in tail processing tcp: properly scale window in tcp_v[46]_reqsk_send_ack() tcp: fix use after free in tcp_xmit_retransmit_queue() tcp: cwnd does not increase in TCP YeAH ipv6: release dst in ping_v6_sendmsg ipv4: panic in leaf_walk_rcu due to stale node pointer reiserfs: fix "new_insert_key may be used uninitialized ..." Fix build warning in kernel/cpuset.c include/linux/kernel.h: change abs() macro so it uses consistent return type Linux 4.4.22 openrisc: fix the fix of copy_from_user() avr32: fix 'undefined reference to `___copy_from_user' ia64: copy_from_user() should zero the destination on access_ok() failure genirq/msi: Fix broken debug output ppc32: fix copy_from_user() sparc32: fix copy_from_user() mn10300: copy_from_user() should zero on access_ok() failure... 
nios2: copy_from_user() should zero the tail of destination openrisc: fix copy_from_user() parisc: fix copy_from_user() metag: copy_from_user() should zero the destination on access_ok() failure alpha: fix copy_from_user() asm-generic: make copy_from_user() zero the destination properly mips: copy_from_user() must zero the destination on access_ok() failure hexagon: fix strncpy_from_user() error return sh: fix copy_from_user() score: fix copy_from_user() and friends blackfin: fix copy_from_user() cris: buggered copy_from_user/copy_to_user/clear_user frv: fix clear_user() asm-generic: make get_user() clear the destination on errors ARC: uaccess: get_user to zero out dest in cause of fault s390: get_user() should zero on failure score: fix __get_user/get_user nios2: fix __get_user() sh64: failing __get_user() should zero m32r: fix __get_user() mn10300: failing __get_user() and get_user() should zero fix minor infoleak in get_user_ex() microblaze: fix copy_from_user() avr32: fix copy_from_user() microblaze: fix __get_user() fix iov_iter_fault_in_readable() irqchip/atmel-aic: Fix potential deadlock in ->xlate() genirq: Provide irq_gc_{lock_irqsave,unlock_irqrestore}() helpers drm: Only use compat ioctl for addfb2 on X86/IA64 drm: atmel-hlcdc: Fix vertical scaling net: simplify napi_synchronize() to avoid warnings kconfig: tinyconfig: provide whole choice blocks to avoid warnings soc: qcom/spm: shut up uninitialized variable warning pinctrl: at91-pio4: use %pr format string for resource mmc: dw_mmc: use resource_size_t to store physical address drm/i915: Avoid pointer arithmetic in calculating plane surface offset mpssd: fix buffer overflow warning gma500: remove annoying deprecation warning ipv6: addrconf: fix dev refcont leak when DAD failed sched/core: Fix a race between try_to_wake_up() and a woken up task Revert "wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel" ath9k: fix using sta->drv_priv before initializing it md-cluster: make md-cluster also can 
work when compiled into kernel xhci: fix null pointer dereference in stop command timeout function fuse: direct-io: don't dirty ITER_BVEC pages Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns crypto: cryptd - initialize child shash_desc on import arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb() pinctrl: sunxi: fix uart1 CTS/RTS pins at PG on A23/A33 pinctrl: pistachio: fix mfio pll_lock pinmux dm crypt: fix error with too large bios dm log writes: move IO accounting earlier to fix error path dm log writes: fix check of kthread_run() return value bus: arm-ccn: Fix XP watchpoint settings bitmask bus: arm-ccn: Do not attempt to configure XPs for cycle counter bus: arm-ccn: Fix PMU handling of MN ARM: dts: STiH407-family: Provide interconnect clock for consumption in ST SDHCI ARM: dts: overo: fix gpmc nand on boards with ethernet ARM: dts: overo: fix gpmc nand cs0 range ARM: dts: imx6qdl: Fix SPDIF regression ARM: OMAP3: hwmod data: Add sysc information for DSI ARM: kirkwood: ib62x0: fix size of u-boot environment partition ARM: imx6: add missing BM_CLPCR_BYPASS_PMIC_READY setting for imx6sx ARM: imx6: add missing BM_CLPCR_BYP_MMDC_CH0_LPM_HS setting for imx6ul ARM: AM43XX: hwmod: Fix RSTST register offset for pruss cpuset: make sure new tasks conform to the current config of the cpuset net: thunderx: Fix OOPs with ethtool --register-dump USB: change bInterval default to 10 ms ARM: dts: STiH410: Handle interconnect clock required by EHCI/OHCI (USB) usb: chipidea: udc: fix NULL ptr dereference in isr_setup_status_phase usb: renesas_usbhs: fix clearing the {BRDY,BEMP}STS condition USB: serial: simple: add support for another Infineon flashloader serial: 8250: added acces i/o products quad and octal serial cards serial: 8250_mid: fix divide error bug if baud rate is 0 iio: ensure ret is initialized to zero before entering do loop iio:core: fix IIO_VAL_FRACTIONAL sign handling iio: accel: kxsd9: Fix scaling bug iio: fix pressure data 
output unit in hid-sensor-attributes iio: accel: bmc150: reset chip at init time iio: adc: at91: unbreak channel adc channel 3 iio: ad799x: Fix buffered capture for ad7991/ad7995/ad7999 iio: adc: ti_am335x_adc: Increase timeout value waiting for ADC sample iio: adc: ti_am335x_adc: Protect FIFO1 from concurrent access iio: adc: rockchip_saradc: reset saradc controller before programming it iio: proximity: as3935: set up buffer timestamps for non-zero values iio: accel: kxsd9: Fix raw read return kvm-arm: Unmap shadow pagetables properly x86/AMD: Apply erratum 665 on machines without a BIOS fix x86/paravirt: Do not trace _paravirt_ident_*() functions ARC: mm: fix build breakage with STRICT_MM_TYPECHECKS IB/uverbs: Fix race between uverbs_close and remove_one dm flakey: fix reads to be issued if drop_writes configured audit: fix exe_file access in audit_exe_compare mm: introduce get_task_exe_file kexec: fix double-free when failing to relocate the purgatory NFSv4.1: Fix the CREATE_SESSION slot number accounting pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock NFSv4.x: Fix a refcount leak in nfs_callback_up_net pNFS: The client must not do I/O to the DS if it's lease has expired kernfs: don't depend on d_find_any_alias() when generating notifications powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET powerpc/powernv : Drop reference added by kset_find_obj() powerpc/tm: do not use r13 for tabort_syscall tipc: move linearization of buffers to generic code lightnvm: put bio before return fscrypto: require write access to mount to set encryption policy Revert "KVM: x86: fix missed hardware breakpoints" MIPS: KVM: Check for pfn noslot case clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function fscrypto: add authorization check for setting encryption policy ext4: use __GFP_NOFAIL in ext4_free_blocks() Conflicts: arch/arm/kernel/devtree.c 
arch/arm64/Kconfig
arch/arm64/kernel/arm64ksyms.c
arch/arm64/kernel/psci.c
arch/arm64/mm/fault.c
drivers/android/binder.c
drivers/usb/host/xhci-hub.c
fs/ext4/readpage.c
include/linux/mmc/core.h
include/linux/mmzone.h
mm/memcontrol.c
net/core/filter.c
net/netlink/af_netlink.c
net/netlink/af_netlink.h

Change-Id: I99fe7a0914e83e284b11b33185b71448a8999d1f
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
3021 lines
78 KiB
C
/*
 * linux/mm/swapfile.c
 *
 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
 * Swap reorganised 29.12.95, Stephen Tweedie
 */

#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/mman.h>
#include <linux/slab.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
#include <linux/vmalloc.h>
#include <linux/pagemap.h>
#include <linux/namei.h>
#include <linux/shmem_fs.h>
#include <linux/blkdev.h>
#include <linux/random.h>
#include <linux/writeback.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/init.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/security.h>
#include <linux/backing-dev.h>
#include <linux/mutex.h>
#include <linux/capability.h>
#include <linux/syscalls.h>
#include <linux/memcontrol.h>
#include <linux/poll.h>
#include <linux/oom.h>
#include <linux/frontswap.h>
#include <linux/swapfile.h>
#include <linux/export.h>

#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>

static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
				 unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
static sector_t map_swap_entry(swp_entry_t, struct block_device**);

DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
atomic_long_t nr_swap_pages;
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
static int least_priority;

static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
static const char Bad_offset[] = "Bad swap offset entry ";
static const char Unused_offset[] = "Unused swap offset entry ";

/*
 * all active swap_info_structs
 * protected with swap_lock, and ordered by priority.
 */
PLIST_HEAD(swap_active_head);

/*
 * all available (active, not full) swap_info_structs
 * protected with swap_avail_lock, ordered by priority.
 * This is used by get_swap_page() instead of swap_active_head
 * because swap_active_head includes all swap_info_structs,
 * but get_swap_page() doesn't need to look at full ones.
 * This uses its own lock instead of swap_lock because when a
 * swap_info_struct changes between not-full/full, it needs to
 * add/remove itself to/from this list, but the swap_info_struct->lock
 * is held and the locking order requires swap_lock to be taken
 * before any swap_info_struct->lock.
 */
PLIST_HEAD(swap_avail_head);
DEFINE_SPINLOCK(swap_avail_lock);

struct swap_info_struct *swap_info[MAX_SWAPFILES];

static DEFINE_MUTEX(swapon_mutex);

static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
/* Activity counter to indicate that a swapon or swapoff has occurred */
static atomic_t proc_poll_event = ATOMIC_INIT(0);

static inline unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
}

bool is_swap_fast(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned long type;

	if (non_swap_entry(entry))
		return false;

	type = swp_type(entry);
	if (type >= nr_swapfiles)
		return false;

	p = swap_info[type];

	if (p->flags & SWP_FAST)
		return true;

	return false;
}

/* returns 1 if swap entry is freed */
static int
__try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset)
{
	swp_entry_t entry = swp_entry(si->type, offset);
	struct page *page;
	int ret = 0;

	page = find_get_page(swap_address_space(entry), entry.val);
	if (!page)
		return 0;
	/*
	 * This function is called from scan_swap_map() and it's called
	 * by vmscan.c at reclaiming pages. So, we hold a lock on a page, here.
	 * We have to use trylock for avoiding deadlock. This is a special
	 * case and you should use try_to_free_swap() with explicit lock_page()
	 * in usual operations.
	 */
	if (trylock_page(page)) {
		ret = try_to_free_swap(page);
		unlock_page(page);
	}
	page_cache_release(page);
	return ret;
}

/*
 * swapon tell device that all the old swap contents can be discarded,
 * to allow the swap device to optimize its wear-levelling.
 */
static int discard_swap(struct swap_info_struct *si)
{
	struct swap_extent *se;
	sector_t start_block;
	sector_t nr_blocks;
	int err = 0;

	/* Do not discard the swap header page! */
	se = &si->first_swap_extent;
	start_block = (se->start_block + 1) << (PAGE_SHIFT - 9);
	nr_blocks = ((sector_t)se->nr_pages - 1) << (PAGE_SHIFT - 9);
	if (nr_blocks) {
		err = blkdev_issue_discard(si->bdev, start_block,
				nr_blocks, GFP_KERNEL, 0);
		if (err)
			return err;
		cond_resched();
	}

	list_for_each_entry(se, &si->first_swap_extent.list, list) {
		start_block = se->start_block << (PAGE_SHIFT - 9);
		nr_blocks = (sector_t)se->nr_pages << (PAGE_SHIFT - 9);

		err = blkdev_issue_discard(si->bdev, start_block,
				nr_blocks, GFP_KERNEL, 0);
		if (err)
			break;

		cond_resched();
	}
	return err;		/* That will often be -EOPNOTSUPP */
}

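/*
 * Illustrative aside, not part of swapfile.c: a minimal userspace sketch of
 * the unit conversion discard_swap() relies on. Shifting by (PAGE_SHIFT - 9)
 * turns page counts into 512-byte sectors, and the first extent skips one
 * page so the swap header is never discarded. Every sketch_* name below is
 * hypothetical, and page_shift is passed in because PAGE_SHIFT is
 * architecture-dependent.
 */

```c
#include <stdint.h>

/* Hypothetical stand-in: 2^9 = 512-byte sectors. */
#define SKETCH_SECTOR_SHIFT 9u

struct sketch_range {
	uint64_t start_sector;
	uint64_t nr_sectors;
};

/* Convert a page index or page count into 512-byte sectors. */
static uint64_t sketch_pages_to_sectors(uint64_t pages, unsigned int page_shift)
{
	return pages << (page_shift - SKETCH_SECTOR_SHIFT);
}

/*
 * Discard range for the first extent: start one page later and cover one
 * page fewer, leaving the header page alone. The caller must guarantee
 * nr_pages >= 1; an nr_sectors of 0 means there is nothing to discard.
 */
static struct sketch_range sketch_first_extent_range(uint64_t start_block,
						     uint64_t nr_pages,
						     unsigned int page_shift)
{
	struct sketch_range r;

	r.start_sector = sketch_pages_to_sectors(start_block + 1, page_shift);
	r.nr_sectors = sketch_pages_to_sectors(nr_pages - 1, page_shift);
	return r;
}
```

/*
 * With 4 KiB pages (page_shift = 12) each page covers 8 sectors, so a
 * four-page first extent starting at block 0 yields a discard range that
 * starts at sector 8 and spans 24 sectors.
 */
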
/*
 * swap allocation tell device that a cluster of swap can now be discarded,
 * to allow the swap device to optimize its wear-levelling.
 */
static void discard_swap_cluster(struct swap_info_struct *si,
				 pgoff_t start_page, pgoff_t nr_pages)
{
	struct swap_extent *se = si->curr_swap_extent;
	int found_extent = 0;

	while (nr_pages) {
		struct list_head *lh;

		if (se->start_page <= start_page &&
		    start_page < se->start_page + se->nr_pages) {
			pgoff_t offset = start_page - se->start_page;
			sector_t start_block = se->start_block + offset;
			sector_t nr_blocks = se->nr_pages - offset;

			if (nr_blocks > nr_pages)
				nr_blocks = nr_pages;
			start_page += nr_blocks;
			nr_pages -= nr_blocks;

			if (!found_extent++)
				si->curr_swap_extent = se;

			start_block <<= PAGE_SHIFT - 9;
			nr_blocks <<= PAGE_SHIFT - 9;
			if (blkdev_issue_discard(si->bdev, start_block,
				    nr_blocks, GFP_NOIO, 0))
				break;
		}

		lh = se->list.next;
		se = list_entry(lh, struct swap_extent, list);
	}
}

#define LATENCY_LIMIT		256

static inline void cluster_set_flag(struct swap_cluster_info *info,
	unsigned int flag)
{
	info->flags = flag;
}

static inline unsigned int cluster_count(struct swap_cluster_info *info)
{
	return info->data;
}

static inline void cluster_set_count(struct swap_cluster_info *info,
				     unsigned int c)
{
	info->data = c;
}

static inline void cluster_set_count_flag(struct swap_cluster_info *info,
					 unsigned int c, unsigned int f)
{
	info->flags = f;
	info->data = c;
}

static inline unsigned int cluster_next(struct swap_cluster_info *info)
{
	return info->data;
}

static inline void cluster_set_next(struct swap_cluster_info *info,
				    unsigned int n)
{
	info->data = n;
}

static inline void cluster_set_next_flag(struct swap_cluster_info *info,
					 unsigned int n, unsigned int f)
{
	info->flags = f;
	info->data = n;
}

static inline bool cluster_is_free(struct swap_cluster_info *info)
{
	return info->flags & CLUSTER_FLAG_FREE;
}

static inline bool cluster_is_null(struct swap_cluster_info *info)
{
	return info->flags & CLUSTER_FLAG_NEXT_NULL;
}

static inline void cluster_set_null(struct swap_cluster_info *info)
{
	info->flags = CLUSTER_FLAG_NEXT_NULL;
	info->data = 0;
}

/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
		unsigned int idx)
{
	/*
	 * If scan_swap_map() can't find a free cluster, it will check
	 * si->swap_map directly. To make sure the discarding cluster isn't
	 * taken by scan_swap_map(), mark the swap entries bad (occupied). It
	 * will be cleared after discard
	 */
	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
			SWAP_MAP_BAD, SWAPFILE_CLUSTER);

	if (cluster_is_null(&si->discard_cluster_head)) {
		cluster_set_next_flag(&si->discard_cluster_head,
						idx, 0);
		cluster_set_next_flag(&si->discard_cluster_tail,
						idx, 0);
	} else {
		unsigned int tail = cluster_next(&si->discard_cluster_tail);
		cluster_set_next(&si->cluster_info[tail], idx);
		cluster_set_next_flag(&si->discard_cluster_tail,
						idx, 0);
	}

	schedule_work(&si->discard_work);
}

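/*
 * Illustrative aside, not part of swapfile.c: a minimal userspace sketch of
 * the index-linked list that the discard and free cluster lists appear to
 * use. The head and tail are themselves cluster records whose data field
 * holds a cluster index, and each queued cluster stores the index of its
 * successor in its own data field. All sketch_* names and the flag value
 * are hypothetical stand-ins for the kernel's definitions.
 */

```c
#include <stdbool.h>

/* Hypothetical flag meaning "this head/tail points at nothing". */
#define SKETCH_FLAG_NEXT_NULL 1u

struct sketch_cluster {
	unsigned int flags;
	unsigned int data;	/* index of the next cluster in the list */
};

struct sketch_list {
	struct sketch_cluster head;
	struct sketch_cluster tail;
};

/* Reset the list to the empty state. */
static void sketch_list_init(struct sketch_list *l)
{
	l->head.flags = SKETCH_FLAG_NEXT_NULL;
	l->head.data = 0;
	l->tail.flags = SKETCH_FLAG_NEXT_NULL;
	l->tail.data = 0;
}

static bool sketch_list_empty(const struct sketch_list *l)
{
	return l->head.flags & SKETCH_FLAG_NEXT_NULL;
}

/* Append cluster idx: link it behind the old tail, then move the tail. */
static void sketch_list_add_tail(struct sketch_list *l,
				 struct sketch_cluster *info, unsigned int idx)
{
	if (sketch_list_empty(l)) {
		l->head.flags = 0;
		l->head.data = idx;
	} else {
		info[l->tail.data].data = idx;
	}
	l->tail.flags = 0;
	l->tail.data = idx;
}

/* Remove and return the first cluster index; the list must not be empty. */
static unsigned int sketch_list_del_head(struct sketch_list *l,
					 const struct sketch_cluster *info)
{
	unsigned int idx = l->head.data;

	if (l->tail.data == idx)
		sketch_list_init(l);		/* last element removed */
	else
		l->head.data = info[idx].data;	/* advance to successor */
	return idx;
}
```

/*
 * Because the links are array indices rather than pointers, a list built
 * this way needs no storage beyond the per-cluster records, at the cost of
 * only supporting cheap removal at the head.
 */
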
/*
 * Doing discard actually. After a cluster discard is finished, the cluster
 * will be added to free cluster list. caller should hold si->lock.
 */
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
	struct swap_cluster_info *info;
	unsigned int idx;

	info = si->cluster_info;

	while (!cluster_is_null(&si->discard_cluster_head)) {
		idx = cluster_next(&si->discard_cluster_head);

		cluster_set_next_flag(&si->discard_cluster_head,
						cluster_next(&info[idx]), 0);
		if (cluster_next(&si->discard_cluster_tail) == idx) {
			cluster_set_null(&si->discard_cluster_head);
			cluster_set_null(&si->discard_cluster_tail);
		}
		spin_unlock(&si->lock);

		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
				SWAPFILE_CLUSTER);

		spin_lock(&si->lock);
		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
		if (cluster_is_null(&si->free_cluster_head)) {
			cluster_set_next_flag(&si->free_cluster_head,
						idx, 0);
			cluster_set_next_flag(&si->free_cluster_tail,
						idx, 0);
		} else {
			unsigned int tail;

			tail = cluster_next(&si->free_cluster_tail);
			cluster_set_next(&info[tail], idx);
			cluster_set_next_flag(&si->free_cluster_tail,
						idx, 0);
		}
		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
				0, SWAPFILE_CLUSTER);
	}
}

static void swap_discard_work(struct work_struct *work)
{
	struct swap_info_struct *si;

	si = container_of(work, struct swap_info_struct, discard_work);

	spin_lock(&si->lock);
	swap_do_scheduled_discard(si);
	spin_unlock(&si->lock);
}

/*
 * The cluster corresponding to page_nr will be used. The cluster will be
 * removed from free cluster list and its usage counter will be increased.
 */
static void inc_cluster_info_page(struct swap_info_struct *p,
	struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
	unsigned long idx = page_nr / SWAPFILE_CLUSTER;

	if (!cluster_info)
		return;
	if (cluster_is_free(&cluster_info[idx])) {
		VM_BUG_ON(cluster_next(&p->free_cluster_head) != idx);
		cluster_set_next_flag(&p->free_cluster_head,
			cluster_next(&cluster_info[idx]), 0);
		if (cluster_next(&p->free_cluster_tail) == idx) {
			cluster_set_null(&p->free_cluster_tail);
			cluster_set_null(&p->free_cluster_head);
		}
		cluster_set_count_flag(&cluster_info[idx], 0, 0);
	}

	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
	cluster_set_count(&cluster_info[idx],
		cluster_count(&cluster_info[idx]) + 1);
}

/*
 * The cluster corresponding to page_nr loses one usage. If the usage
 * counter becomes 0, which means no page in the cluster is in use, we can
 * optionally discard the cluster and add it to the free cluster list.
 */
static void dec_cluster_info_page(struct swap_info_struct *p,
	struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
	unsigned long idx = page_nr / SWAPFILE_CLUSTER;

	if (!cluster_info)
		return;

	VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
	cluster_set_count(&cluster_info[idx],
		cluster_count(&cluster_info[idx]) - 1);

	if (cluster_count(&cluster_info[idx]) == 0) {
		/*
		 * If the swap is discardable, schedule the cluster for
		 * discard instead of freeing it immediately. The cluster
		 * will be freed after the discard.
		 */
		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
			swap_cluster_schedule_discard(p, idx);
			return;
		}

		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
		if (cluster_is_null(&p->free_cluster_head)) {
			cluster_set_next_flag(&p->free_cluster_head, idx, 0);
			cluster_set_next_flag(&p->free_cluster_tail, idx, 0);
		} else {
			unsigned int tail = cluster_next(&p->free_cluster_tail);
			cluster_set_next(&cluster_info[tail], idx);
			cluster_set_next_flag(&p->free_cluster_tail, idx, 0);
		}
	}
}

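The free cluster list handled by inc_cluster_info_page() and dec_cluster_info_page() is a FIFO threaded through an array by index: clusters are appended at the tail and taken from the head. The following is a minimal userspace sketch of that idea only. The names `struct cluster`, `struct free_list`, `NIL_IDX`, `free_list_push()` and `free_list_pop()` are hypothetical and are not kernel APIs; the kernel packs the index and flags into `struct swap_cluster_info` instead.

```c
#include <assert.h>
#include <stdbool.h>

#define NIL_IDX ((unsigned int)-1)	/* hypothetical "null" index marker */

/* Hypothetical stand-ins for swap_cluster_info and the list head/tail. */
struct cluster {
	unsigned int next;	/* index of the next free cluster */
	bool free;
};

struct free_list {
	unsigned int head;
	unsigned int tail;
};

static void free_list_init(struct free_list *fl)
{
	fl->head = fl->tail = NIL_IDX;
}

/* Append cluster idx at the tail, as the freeing path does. */
static void free_list_push(struct free_list *fl, struct cluster *info,
			   unsigned int idx)
{
	assert(idx != NIL_IDX);
	info[idx].free = true;
	info[idx].next = NIL_IDX;
	if (fl->head == NIL_IDX)
		fl->head = idx;
	else
		info[fl->tail].next = idx;
	fl->tail = idx;
}

/* Remove and return the head cluster, or NIL_IDX if the list is empty. */
static unsigned int free_list_pop(struct free_list *fl, struct cluster *info)
{
	unsigned int idx = fl->head;

	if (idx == NIL_IDX)
		return NIL_IDX;
	fl->head = info[idx].next;
	if (fl->tail == idx)
		fl->head = fl->tail = NIL_IDX;
	info[idx].free = false;
	return idx;
}
```

Only the head may be consumed in this scheme, which is why the kernel code asserts that a free cluster being used is the list head and why a cluster taken from the middle of the list counts as a conflict.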
/*
 * It's possible for scan_swap_map() to pick a free cluster in the middle of
 * the free cluster list. Avoid such use to prevent list corruption.
 */
static bool
scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
	unsigned long offset)
{
	struct percpu_cluster *percpu_cluster;
	bool conflict;

	offset /= SWAPFILE_CLUSTER;
	conflict = !cluster_is_null(&si->free_cluster_head) &&
		offset != cluster_next(&si->free_cluster_head) &&
		cluster_is_free(&si->cluster_info[offset]);

	if (!conflict)
		return false;

	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
	cluster_set_null(&percpu_cluster->index);
	return true;
}

/*
 * Try to get a swap entry from the current CPU's swap entry pool (a cluster).
 * This might involve allocating a new cluster for the current CPU too.
 */
static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
	unsigned long *offset, unsigned long *scan_base)
{
	struct percpu_cluster *cluster;
	bool found_free;
	unsigned long tmp;

new_cluster:
	cluster = this_cpu_ptr(si->percpu_cluster);
	if (cluster_is_null(&cluster->index)) {
		if (!cluster_is_null(&si->free_cluster_head)) {
			cluster->index = si->free_cluster_head;
			cluster->next = cluster_next(&cluster->index) *
					SWAPFILE_CLUSTER;
		} else if (!cluster_is_null(&si->discard_cluster_head)) {
			/*
			 * We have no free cluster, but some clusters are
			 * pending discard; discard them now and reclaim them.
			 */
			swap_do_scheduled_discard(si);
			*scan_base = *offset = si->cluster_next;
			goto new_cluster;
		} else
			return;
	}

	found_free = false;

	/*
	 * Other CPUs can use our cluster if they can't find a free cluster;
	 * check whether there is still a free entry in the cluster.
	 */
	tmp = cluster->next;
	while (tmp < si->max && tmp < (cluster_next(&cluster->index) + 1) *
	       SWAPFILE_CLUSTER) {
		if (!si->swap_map[tmp]) {
			found_free = true;
			break;
		}
		tmp++;
	}
	if (!found_free) {
		cluster_set_null(&cluster->index);
		goto new_cluster;
	}
	cluster->next = tmp + 1;
	*offset = tmp;
	*scan_base = tmp;
}

static unsigned long scan_swap_map(struct swap_info_struct *si,
				   unsigned char usage)
{
	unsigned long offset;
	unsigned long scan_base;
	unsigned long last_in_cluster = 0;
	int latency_ration = LATENCY_LIMIT;

	/*
	 * We try to cluster swap pages by allocating them sequentially
	 * in swap. Once we've allocated SWAPFILE_CLUSTER pages this
	 * way, however, we resort to first-free allocation, starting
	 * a new cluster. This prevents us from scattering swap pages
	 * all over the entire swap partition, so that we reduce
	 * overall disk seek times between swap pages. -- sct
	 * But we do now try to find an empty cluster. -Andrea
	 * And we let swap pages go all over an SSD partition. Hugh
	 */

	si->flags += SWP_SCANNING;
	scan_base = offset = si->cluster_next;

	/* SSD algorithm */
	if (si->cluster_info) {
		scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
		goto checks;
	}

	if (unlikely(!si->cluster_nr--)) {
		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
			si->cluster_nr = SWAPFILE_CLUSTER - 1;
			goto checks;
		}

		spin_unlock(&si->lock);

		/*
		 * If seek is expensive, start searching for new cluster from
		 * start of partition, to minimize the span of allocated swap.
		 * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info
		 * case, just handled by scan_swap_map_try_ssd_cluster() above.
		 */
		scan_base = offset = si->lowest_bit;
		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;

		/* Locate the first empty (unaligned) cluster */
		for (; last_in_cluster <= si->highest_bit; offset++) {
			if (si->swap_map[offset])
				last_in_cluster = offset + SWAPFILE_CLUSTER;
			else if (offset == last_in_cluster) {
				spin_lock(&si->lock);
				offset -= SWAPFILE_CLUSTER - 1;
				si->cluster_next = offset;
				si->cluster_nr = SWAPFILE_CLUSTER - 1;
				goto checks;
			}
			if (unlikely(--latency_ration < 0)) {
				cond_resched();
				latency_ration = LATENCY_LIMIT;
			}
		}

		offset = scan_base;
		spin_lock(&si->lock);
		si->cluster_nr = SWAPFILE_CLUSTER - 1;
	}

checks:
	if (si->cluster_info) {
		while (scan_swap_map_ssd_cluster_conflict(si, offset))
			scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
	}
	if (!(si->flags & SWP_WRITEOK))
		goto no_page;
	if (!si->highest_bit)
		goto no_page;
	if (offset > si->highest_bit)
		scan_base = offset = si->lowest_bit;

	/* reuse swap entry of cache-only swap if not busy. */
	if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
		int swap_was_freed;
		spin_unlock(&si->lock);
		swap_was_freed = __try_to_reclaim_swap(si, offset);
		spin_lock(&si->lock);
		/* entry was freed successfully, try to use this again */
		if (swap_was_freed)
			goto checks;
		goto scan; /* check next one */
	}

	if (si->swap_map[offset])
		goto scan;

	if (offset == si->lowest_bit)
		si->lowest_bit++;
	if (offset == si->highest_bit)
		si->highest_bit--;
	si->inuse_pages++;
	if (si->inuse_pages == si->pages) {
		si->lowest_bit = si->max;
		si->highest_bit = 0;
		spin_lock(&swap_avail_lock);
		plist_del(&si->avail_list, &swap_avail_head);
		spin_unlock(&swap_avail_lock);
	}
	si->swap_map[offset] = usage;
	inc_cluster_info_page(si, si->cluster_info, offset);
	si->cluster_next = offset + 1;
	si->flags -= SWP_SCANNING;

	return offset;

scan:
	spin_unlock(&si->lock);
	while (++offset <= si->highest_bit) {
		if (!si->swap_map[offset]) {
			spin_lock(&si->lock);
			goto checks;
		}
		if (vm_swap_full(si) &&
			si->swap_map[offset] == SWAP_HAS_CACHE) {
			spin_lock(&si->lock);
			goto checks;
		}
		if (unlikely(--latency_ration < 0)) {
			cond_resched();
			latency_ration = LATENCY_LIMIT;
		}
	}
	offset = si->lowest_bit;
	while (offset < scan_base) {
		if (!si->swap_map[offset]) {
			spin_lock(&si->lock);
			goto checks;
		}
		if (vm_swap_full(si) &&
			si->swap_map[offset] == SWAP_HAS_CACHE) {
			spin_lock(&si->lock);
			goto checks;
		}
		if (unlikely(--latency_ration < 0)) {
			cond_resched();
			latency_ration = LATENCY_LIMIT;
		}
		offset++;
	}
	spin_lock(&si->lock);

no_page:
	si->flags -= SWP_SCANNING;
	return 0;
}

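The "Locate the first empty (unaligned) cluster" loop in scan_swap_map() finds the first run of SWAPFILE_CLUSTER consecutive free slots in a single pass by pushing the candidate end position past every used slot it meets. A minimal userspace sketch of the same scan follows, under the assumption of a plain byte array and inclusive bounds; `find_empty_run()` is a hypothetical name, and the locking and latency handling of the kernel loop are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the start of the first run of `run` zero bytes within
 * map[lo..hi] (inclusive bounds), or (size_t)-1 if no such run exists.
 */
static size_t find_empty_run(const unsigned char *map, size_t lo, size_t hi,
			     size_t run)
{
	size_t offset = lo;
	size_t last = lo + run - 1;	/* last slot of the candidate run */

	assert(run > 0);
	for (; last <= hi; offset++) {
		if (map[offset])
			last = offset + run;	/* restart after the used slot */
		else if (offset == last)
			return offset - (run - 1);
	}
	return (size_t)-1;
}
```

Because `offset` never passes `last` and `last` is checked against `hi` before each iteration, the sketch never reads outside the given range.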
swp_entry_t get_swap_page(void)
{
	struct swap_info_struct *si, *next;
	pgoff_t offset;
	int swap_ratio_off = 0;

	if (atomic_long_read(&nr_swap_pages) <= 0)
		goto noswap;
	atomic_long_dec(&nr_swap_pages);

lock_and_start:
	spin_lock(&swap_avail_lock);

start_over:
	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {

		if (sysctl_swap_ratio && !swap_ratio_off) {
			int ret;

			spin_unlock(&swap_avail_lock);
			ret = swap_ratio(&si);
			if (0 > ret) {
				/*
				 * Error. Start again with swap
				 * ratio disabled.
				 */
				swap_ratio_off = 1;
				goto lock_and_start;
			} else {
				goto start;
			}
		}

		/* requeue si to after same-priority siblings */
		plist_requeue(&si->avail_list, &swap_avail_head);
		spin_unlock(&swap_avail_lock);
start:
		spin_lock(&si->lock);
		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
			spin_lock(&swap_avail_lock);
			if (plist_node_empty(&si->avail_list)) {
				spin_unlock(&si->lock);
				goto nextsi;
			}
			WARN(!si->highest_bit,
			     "swap_info %d in list but !highest_bit\n",
			     si->type);
			WARN(!(si->flags & SWP_WRITEOK),
			     "swap_info %d in list but !SWP_WRITEOK\n",
			     si->type);
			plist_del(&si->avail_list, &swap_avail_head);
			spin_unlock(&si->lock);
			goto nextsi;
		}

		/* This is called for allocating swap entry for cache */
		offset = scan_swap_map(si, SWAP_HAS_CACHE);
		spin_unlock(&si->lock);
		if (offset)
			return swp_entry(si->type, offset);
		pr_debug("scan_swap_map of si %d failed to find offset\n",
		       si->type);
		spin_lock(&swap_avail_lock);
nextsi:
		/*
		 * if we got here, it's likely that si was almost full before,
		 * and since scan_swap_map() can drop the si->lock, multiple
		 * callers probably all tried to get a page from the same si
		 * and it filled up before we could get one; or, the si filled
		 * up between us dropping swap_avail_lock and taking si->lock.
		 * Since we dropped the swap_avail_lock, the swap_avail_head
		 * list may have been modified; so if next is still in the
		 * swap_avail_head list then try it, otherwise start over.
		 */
		if (plist_node_empty(&next->avail_list))
			goto start_over;
	}

	spin_unlock(&swap_avail_lock);

	atomic_long_inc(&nr_swap_pages);
noswap:
	return (swp_entry_t) {0};
}

/* The only caller of this function is now the suspend routine */
swp_entry_t get_swap_page_of_type(int type)
{
	struct swap_info_struct *si;
	pgoff_t offset;

	si = swap_info[type];
	spin_lock(&si->lock);
	if (si && (si->flags & SWP_WRITEOK)) {
		atomic_long_dec(&nr_swap_pages);
		/* This is called for allocating swap entry, not cache */
		offset = scan_swap_map(si, 1);
		if (offset) {
			spin_unlock(&si->lock);
			return swp_entry(type, offset);
		}
		atomic_long_inc(&nr_swap_pages);
	}
	spin_unlock(&si->lock);
	return (swp_entry_t) {0};
}

static struct swap_info_struct *swap_info_get(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned long offset, type;

	if (!entry.val)
		goto out;
	type = swp_type(entry);
	if (type >= nr_swapfiles)
		goto bad_nofile;
	p = swap_info[type];
	if (!(p->flags & SWP_USED))
		goto bad_device;
	offset = swp_offset(entry);
	if (offset >= p->max)
		goto bad_offset;
	if (!p->swap_map[offset])
		goto bad_free;
	spin_lock(&p->lock);
	return p;

bad_free:
	pr_err("swap_free: %s%08lx\n", Unused_offset, entry.val);
	goto out;
bad_offset:
	pr_err("swap_free: %s%08lx\n", Bad_offset, entry.val);
	goto out;
bad_device:
	pr_err("swap_free: %s%08lx\n", Unused_file, entry.val);
	goto out;
bad_nofile:
	pr_err("swap_free: %s%08lx\n", Bad_file, entry.val);
out:
	return NULL;
}

static unsigned char swap_entry_free(struct swap_info_struct *p,
				     swp_entry_t entry, unsigned char usage)
{
	unsigned long offset = swp_offset(entry);
	unsigned char count;
	unsigned char has_cache;

	count = p->swap_map[offset];
	has_cache = count & SWAP_HAS_CACHE;
	count &= ~SWAP_HAS_CACHE;

	if (usage == SWAP_HAS_CACHE) {
		VM_BUG_ON(!has_cache);
		has_cache = 0;
	} else if (count == SWAP_MAP_SHMEM) {
		/*
		 * Or we could insist on shmem.c using a special
		 * swap_shmem_free() and free_shmem_swap_and_cache()...
		 */
		count = 0;
	} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
		if (count == COUNT_CONTINUED) {
			if (swap_count_continued(p, offset, count))
				count = SWAP_MAP_MAX | COUNT_CONTINUED;
			else
				count = SWAP_MAP_MAX;
		} else
			count--;
	}

	if (!count)
		mem_cgroup_uncharge_swap(entry);

	usage = count | has_cache;
	p->swap_map[offset] = usage;

	/* free if no reference */
	if (!usage) {
		dec_cluster_info_page(p, p->cluster_info, offset);
		if (offset < p->lowest_bit)
			p->lowest_bit = offset;
		if (offset > p->highest_bit) {
			bool was_full = !p->highest_bit;
			p->highest_bit = offset;
			if (was_full && (p->flags & SWP_WRITEOK)) {
				spin_lock(&swap_avail_lock);
				WARN_ON(!plist_node_empty(&p->avail_list));
				if (plist_node_empty(&p->avail_list))
					plist_add(&p->avail_list,
						  &swap_avail_head);
				spin_unlock(&swap_avail_lock);
			}
		}
		atomic_long_inc(&nr_swap_pages);
		p->inuse_pages--;
		frontswap_invalidate_page(p->type, offset);
		if (p->flags & SWP_BLKDEV) {
			struct gendisk *disk = p->bdev->bd_disk;
			if (disk->fops->swap_slot_free_notify)
				disk->fops->swap_slot_free_notify(p->bdev,
								  offset);
		}
	}

	return usage;
}

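swap_entry_free() treats each swap_map byte as a small reference count plus a SWAP_HAS_CACHE flag, and frees the slot only when both are gone. The sketch below illustrates that split in userspace. The constant values are assumptions taken to mirror include/linux/swap.h for this kernel series, `drop_ref()` is a hypothetical name, and the COUNT_CONTINUED path handled by swap_count_continued() is deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror include/linux/swap.h; verify against the real header. */
#define SWAP_HAS_CACHE	0x40	/* slot is referenced by the swap cache */
#define SWAP_MAP_MAX	0x3e	/* largest count held in the first byte */
#define SWAP_MAP_SHMEM	0xbf	/* slot owned by shmem/tmpfs */

/*
 * Drop one reference from a swap_map byte and return the new byte.
 * drop_cache selects the swap cache reference instead of a map reference.
 * A result of 0 means the slot is free.
 */
static unsigned char drop_ref(unsigned char entry, bool drop_cache)
{
	unsigned char has_cache = entry & SWAP_HAS_CACHE;
	unsigned char count = entry & ~SWAP_HAS_CACHE;

	if (drop_cache) {
		assert(has_cache);
		has_cache = 0;
	} else if (count == SWAP_MAP_SHMEM) {
		count = 0;
	} else if (count >= 1 && count <= SWAP_MAP_MAX) {
		count--;
	}
	return count | has_cache;
}
```

For example, a byte of 0x41 (one map reference plus the cache flag) stays in use after either single drop and becomes 0 only after both.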
/*
 * Caller has made sure that the swap device corresponding to entry
 * is still around or has not been recycled.
 */
void swap_free(swp_entry_t entry)
{
	struct swap_info_struct *p;

	p = swap_info_get(entry);
	if (p) {
		swap_entry_free(p, entry, 1);
		spin_unlock(&p->lock);
	}
}

/*
 * Called after dropping swapcache to decrease refcnt to swap entries.
 */
void swapcache_free(swp_entry_t entry)
{
	struct swap_info_struct *p;

	p = swap_info_get(entry);
	if (p) {
		swap_entry_free(p, entry, SWAP_HAS_CACHE);
		spin_unlock(&p->lock);
	}
}

/*
 * How many references to page are currently swapped out?
 * This does not give an exact answer when swap count is continued,
 * but does include the high COUNT_CONTINUED flag to allow for that.
 */
int page_swapcount(struct page *page)
{
	int count = 0;
	struct swap_info_struct *p;
	swp_entry_t entry;

	entry.val = page_private(page);
	p = swap_info_get(entry);
	if (p) {
		count = swap_count(p->swap_map[swp_offset(entry)]);
		spin_unlock(&p->lock);
	}
	return count;
}

/*
 * How many references to @entry are currently swapped out?
 * This considers COUNT_CONTINUED, so it returns an exact answer.
 */
int swp_swapcount(swp_entry_t entry)
{
	int count, tmp_count, n;
	struct swap_info_struct *p;
	struct page *page;
	pgoff_t offset;
	unsigned char *map;

	p = swap_info_get(entry);
	if (!p)
		return 0;

	count = swap_count(p->swap_map[swp_offset(entry)]);
	if (!(count & COUNT_CONTINUED))
		goto out;

	count &= ~COUNT_CONTINUED;
	n = SWAP_MAP_MAX + 1;

	offset = swp_offset(entry);
	page = vmalloc_to_page(p->swap_map + offset);
	offset &= ~PAGE_MASK;
	VM_BUG_ON(page_private(page) != SWP_CONTINUED);

	do {
		page = list_entry(page->lru.next, struct page, lru);
		map = kmap_atomic(page);
		tmp_count = map[offset];
		kunmap_atomic(map);

		count += (tmp_count & ~COUNT_CONTINUED) * n;
		n *= (SWAP_CONT_MAX + 1);
	} while (tmp_count & COUNT_CONTINUED);
out:
	spin_unlock(&p->lock);
	return count;
}

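swp_swapcount() reads a continued count as a mixed-radix number: the first byte carries a digit in base SWAP_MAP_MAX + 1, and each continuation byte carries a digit in base SWAP_CONT_MAX + 1, with COUNT_CONTINUED marking that another digit follows. A minimal userspace sketch of that arithmetic is shown here. The constant values are assumptions taken to mirror include/linux/swap.h, `exact_swap_count()` is a hypothetical name, and the continuation bytes are passed as a flat array rather than walked through the page list as the kernel does.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror include/linux/swap.h; verify against the real header. */
#define SWAP_HAS_CACHE	0x40
#define COUNT_CONTINUED	0x80
#define SWAP_MAP_MAX	0x3e
#define SWAP_CONT_MAX	0x7f

/*
 * map_byte is the swap_map entry; cont[] holds the continuation bytes in
 * list order (the flat array is an assumption made for this sketch).
 */
static unsigned long exact_swap_count(unsigned char map_byte,
				      const unsigned char *cont, size_t ncont)
{
	unsigned long count = map_byte & ~SWAP_HAS_CACHE;
	unsigned long n = SWAP_MAP_MAX + 1;	/* weight of the next digit */
	size_t i = 0;
	unsigned char tmp;

	if (!(count & COUNT_CONTINUED))
		return count;
	count &= ~COUNT_CONTINUED;
	do {
		assert(i < ncont);
		tmp = cont[i++];
		count += (unsigned long)(tmp & ~COUNT_CONTINUED) * n;
		n *= SWAP_CONT_MAX + 1;
	} while (tmp & COUNT_CONTINUED);
	return count;
}
```

With these values, a first byte of 0x83 followed by one continuation byte of 0x02 decodes to 3 + 2 * 63 = 129 references.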
/*
 * We can write to an anon page without COW if there are no other references
 * to it. And as a side-effect, free up its swap: because the old content
 * on disk will never be read, and seeking back there to write new content
 * later would only waste time away from clustering.
 */
int reuse_swap_page(struct page *page)
{
	int count;

	VM_BUG_ON_PAGE(!PageLocked(page), page);
	if (unlikely(PageKsm(page)))
		return 0;
	count = page_mapcount(page);
	if (count <= 1 && PageSwapCache(page)) {
		count += page_swapcount(page);
		if (count == 1 && !PageWriteback(page)) {
			delete_from_swap_cache(page);
			SetPageDirty(page);
		}
	}
	return count <= 1;
}

/*
 * If swap is getting full, or if there are no more mappings of this page,
 * then try_to_free_swap is called to free its swap space.
 */
int try_to_free_swap(struct page *page)
{
	VM_BUG_ON_PAGE(!PageLocked(page), page);

	if (!PageSwapCache(page))
		return 0;
	if (PageWriteback(page))
		return 0;
	if (page_swapcount(page))
		return 0;

	/*
	 * Once hibernation has begun to create its image of memory,
	 * there's a danger that one of the calls to try_to_free_swap()
	 * - most probably a call from __try_to_reclaim_swap() while
	 * hibernation is allocating its own swap pages for the image,
	 * but conceivably even a call from memory reclaim - will free
	 * the swap from a page which has already been recorded in the
	 * image as a clean swapcache page, and then reuse its swap for
	 * another page of the image. On waking from hibernation, the
	 * original page might be freed under memory pressure, then
	 * later read back in from swap, now with the wrong data.
	 *
	 * Hibernation suspends storage while it is writing the image
	 * to disk so check that here.
	 */
	if (pm_suspended_storage())
		return 0;

	delete_from_swap_cache(page);
	SetPageDirty(page);
	return 1;
}

/*
 * Free the swap entry like above, but also try to
 * free the page cache entry if it is the last user.
 */
int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	struct page *page = NULL;

	if (non_swap_entry(entry))
		return 1;

	p = swap_info_get(entry);
	if (p) {
		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
			page = find_get_page(swap_address_space(entry),
						entry.val);
			if (page && !trylock_page(page)) {
				page_cache_release(page);
				page = NULL;
			}
		}
		spin_unlock(&p->lock);
	}
	if (page) {
		/*
		 * Not mapped elsewhere, or swap space full? Free it!
		 * Also recheck PageSwapCache now page is locked (above).
		 */
		if (PageSwapCache(page) && !PageWriteback(page) &&
				(!page_mapped(page) ||
				vm_swap_full(page_swap_info(page)))) {
			delete_from_swap_cache(page);
			SetPageDirty(page);
		}
		unlock_page(page);
		page_cache_release(page);
	}
	return p != NULL;
}

#ifdef CONFIG_HIBERNATION
/*
 * Find the swap type that corresponds to given device (if any).
 *
 * @offset - number of the PAGE_SIZE-sized block of the device, starting
 * from 0, in which the swap header is expected to be located.
 *
 * This is needed for the suspend to disk (aka swsusp).
 */
int swap_type_of(dev_t device, sector_t offset, struct block_device **bdev_p)
{
	struct block_device *bdev = NULL;
	int type;

	if (device)
		bdev = bdget(device);

	spin_lock(&swap_lock);
	for (type = 0; type < nr_swapfiles; type++) {
		struct swap_info_struct *sis = swap_info[type];

		if (!(sis->flags & SWP_WRITEOK))
			continue;

		if (!bdev) {
			if (bdev_p)
				*bdev_p = bdgrab(sis->bdev);

			spin_unlock(&swap_lock);
			return type;
		}
		if (bdev == sis->bdev) {
			struct swap_extent *se = &sis->first_swap_extent;

			if (se->start_block == offset) {
				if (bdev_p)
					*bdev_p = bdgrab(sis->bdev);

				spin_unlock(&swap_lock);
				bdput(bdev);
				return type;
			}
		}
	}
	spin_unlock(&swap_lock);
	if (bdev)
		bdput(bdev);

	return -ENODEV;
}

/*
 * Get the (PAGE_SIZE) block corresponding to given offset on the swapdev
 * corresponding to given index in swap_info (swap type).
 */
sector_t swapdev_block(int type, pgoff_t offset)
{
	struct block_device *bdev;

	if ((unsigned int)type >= nr_swapfiles)
		return 0;
	if (!(swap_info[type]->flags & SWP_WRITEOK))
		return 0;
	return map_swap_entry(swp_entry(type, offset), &bdev);
}

/*
 * Return either the total number of swap pages of given type, or the number
 * of free pages of that type (depending on @free)
 *
 * This is needed for software suspend
 */
unsigned int count_swap_pages(int type, int free)
{
	unsigned int n = 0;

	spin_lock(&swap_lock);
	if ((unsigned int)type < nr_swapfiles) {
		struct swap_info_struct *sis = swap_info[type];

		spin_lock(&sis->lock);
		if (sis->flags & SWP_WRITEOK) {
			n = sis->pages;
			if (free)
				n -= sis->inuse_pages;
		}
		spin_unlock(&sis->lock);
	}
	spin_unlock(&swap_lock);
	return n;
}
#endif /* CONFIG_HIBERNATION */

static inline int maybe_same_pte(pte_t pte, pte_t swp_pte)
{
#ifdef CONFIG_MEM_SOFT_DIRTY
	/*
	 * When the pte keeps the soft dirty bit, the pte generated
	 * from the swap entry does not have it; still, it is the same
	 * pte from a logical point of view.
	 */
	pte_t swp_pte_dirty = pte_swp_mksoft_dirty(swp_pte);
	return pte_same(pte, swp_pte) || pte_same(pte, swp_pte_dirty);
#else
	return pte_same(pte, swp_pte);
#endif
}

/*
 * No need to decide whether this PTE shares the swap entry with others,
 * just let do_wp_page work it out if a write is requested later - to
 * force COW, vm_page_prot omits write permission from any private vma.
 */
static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long addr, swp_entry_t entry, struct page *page)
{
	struct page *swapcache;
	struct mem_cgroup *memcg;
	spinlock_t *ptl;
	pte_t *pte;
	int ret = 1;

	swapcache = page;
	page = ksm_might_need_to_copy(page, vma, addr);
	if (unlikely(!page))
		return -ENOMEM;

	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
		ret = -ENOMEM;
		goto out_nolock;
	}

	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
		mem_cgroup_cancel_charge(page, memcg);
		ret = 0;
		goto out;
	}

	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
	get_page(page);
	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
	if (page == swapcache) {
		page_add_anon_rmap(page, vma, addr);
		mem_cgroup_commit_charge(page, memcg, true);
	} else { /* ksm created a completely new copy */
		page_add_new_anon_rmap(page, vma, addr);
		mem_cgroup_commit_charge(page, memcg, false);
		lru_cache_add_active_or_unevictable(page, vma);
	}
	swap_free(entry);
	/*
	 * Move the page to the active list so it is not
	 * immediately swapped out again after swapon.
	 */
	activate_page(page);
out:
	pte_unmap_unlock(pte, ptl);
out_nolock:
	if (page != swapcache) {
		unlock_page(page);
		put_page(page);
	}
	return ret;
}

static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				swp_entry_t entry, struct page *page)
{
	pte_t swp_pte = swp_entry_to_pte(entry);
	pte_t *pte;
	int ret = 0;

	/*
	 * We don't actually need pte lock while scanning for swp_pte: since
	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
	 * page table while we're scanning; though it could get zapped, and on
	 * some architectures (e.g. x86_32 with PAE) we might catch a glimpse
	 * of unmatched parts which look like swp_pte, so unuse_pte must
	 * recheck under pte lock. Scanning without pte lock lets it be
	 * preemptable whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
	 */
	pte = pte_offset_map(pmd, addr);
	do {
		/*
		 * swapoff spends a _lot_ of time in this loop!
		 * Test inline before going to call unuse_pte.
		 */
		if (unlikely(maybe_same_pte(*pte, swp_pte))) {
			pte_unmap(pte);
			ret = unuse_pte(vma, pmd, addr, entry, page);
			if (ret)
				goto out;
			pte = pte_offset_map(pmd, addr);
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap(pte - 1);
out:
	return ret;
}

static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
				unsigned long addr, unsigned long end,
				swp_entry_t entry, struct page *page)
{
	pmd_t *pmd;
	unsigned long next;
	int ret;

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			continue;
		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
		if (ret)
			return ret;
	} while (pmd++, addr = next, addr != end);
	return 0;
}

static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
				unsigned long addr, unsigned long end,
				swp_entry_t entry, struct page *page)
{
	pud_t *pud;
	unsigned long next;
	int ret;

	pud = pud_offset(pgd, addr);
	do {
		next = pud_addr_end(addr, end);
		if (pud_none_or_clear_bad(pud))
			continue;
		ret = unuse_pmd_range(vma, pud, addr, next, entry, page);
		if (ret)
			return ret;
	} while (pud++, addr = next, addr != end);
	return 0;
}

static int unuse_vma(struct vm_area_struct *vma,
				swp_entry_t entry, struct page *page)
{
	pgd_t *pgd;
	unsigned long addr, end, next;
	int ret;

	if (page_anon_vma(page)) {
		addr = page_address_in_vma(page, vma);
		if (addr == -EFAULT)
			return 0;
		else
			end = addr + PAGE_SIZE;
	} else {
		addr = vma->vm_start;
		end = vma->vm_end;
	}

	pgd = pgd_offset(vma->vm_mm, addr);
	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
		if (ret)
			return ret;
	} while (pgd++, addr = next, addr != end);
	return 0;
}

static int unuse_mm(struct mm_struct *mm,
				swp_entry_t entry, struct page *page)
{
	struct vm_area_struct *vma;
	int ret = 0;

	if (!down_read_trylock(&mm->mmap_sem)) {
		/*
		 * Activate page so shrink_inactive_list is unlikely to unmap
		 * its ptes while lock is dropped, so swapoff can make progress.
		 */
		activate_page(page);
		unlock_page(page);
		down_read(&mm->mmap_sem);
		lock_page(page);
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->anon_vma && (ret = unuse_vma(vma, entry, page)))
			break;
	}
	up_read(&mm->mmap_sem);
	return (ret < 0)? ret: 0;
}

/*
 * Scan swap_map (or frontswap_map if frontswap parameter is true)
 * from current position to next entry still in use.
 * Recycle to start on reaching the end, returning 0 when empty.
 */
static unsigned int find_next_to_unuse(struct swap_info_struct *si,
					unsigned int prev, bool frontswap)
{
	unsigned int max = si->max;
	unsigned int i = prev;
	unsigned char count;

	/*
	 * No need for swap_lock here: we're just looking
	 * for whether an entry is in use, not modifying it; false
	 * hits are okay, and sys_swapoff() has already prevented new
	 * allocations from this area (while holding swap_lock).
	 */
	for (;;) {
		if (++i >= max) {
			if (!prev) {
				i = 0;
				break;
			}
			/*
			 * No entries in use at top of swap_map,
			 * loop back to start and recheck there.
			 */
			max = prev + 1;
			prev = 0;
			i = 1;
		}
		if (frontswap) {
			if (frontswap_test(si, i))
				break;
			else
				continue;
		}
		count = READ_ONCE(si->swap_map[i]);
		if (count && swap_count(count) != SWAP_MAP_BAD)
			break;
	}
	return i;
}

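find_next_to_unuse() scans forward from the previous position, wraps around once to cover slots 1..prev, and returns 0 when nothing is in use (slot 0 is never a valid hit). A minimal userspace sketch of the same wrap-around scan follows. `next_in_use()` is a hypothetical name, any non-zero byte counts as "in use", and the frontswap and SWAP_MAP_BAD cases of the kernel function are left out.

```c
#include <assert.h>

/*
 * Return the next index after prev (wrapping around once) whose map byte
 * is non-zero, or 0 if no slot in 1..max-1 is in use.
 */
static unsigned int next_in_use(const unsigned char *map, unsigned int max,
				unsigned int prev)
{
	unsigned int i = prev;

	assert(prev < max);
	for (;;) {
		if (++i >= max) {
			if (!prev)
				return 0;	/* full pass found nothing */
			/* Wrap around: rescan slots 1..prev only. */
			max = prev + 1;
			prev = 0;
			i = 1;
		}
		if (map[i])
			return i;
	}
}
```

Shrinking `max` to `prev + 1` after the wrap bounds the second pass, so every slot is visited at most once per call.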
/*
 * We completely avoid races by reading each swap page in advance,
 * and then search for the process using it. All the necessary
 * page table adjustments can then be made atomically.
 *
 * if the boolean frontswap is true, only unuse pages_to_unuse pages;
 * pages_to_unuse==0 means all pages; ignored if frontswap is false
 */
int try_to_unuse(unsigned int type, bool frontswap,
		 unsigned long pages_to_unuse)
{
	struct swap_info_struct *si = swap_info[type];
	struct mm_struct *start_mm;
	volatile unsigned char *swap_map; /* swap_map is accessed without
					   * locking. Mark it as volatile
					   * to prevent compiler doing
					   * something odd.
					   */
	unsigned char swcount;
	struct page *page;
	swp_entry_t entry;
	unsigned int i = 0;
	int retval = 0;

	/*
	 * When searching mms for an entry, a good strategy is to
	 * start at the first mm we freed the previous entry from
	 * (though actually we don't notice whether we or coincidence
	 * freed the entry). Initialize this start_mm with a hold.
	 *
	 * A simpler strategy would be to start at the last mm we
	 * freed the previous entry from; but that would take less
	 * advantage of mmlist ordering, which clusters forked mms
	 * together, child after parent. If we race with dup_mmap(), we
	 * prefer to resolve parent before child, lest we miss entries
	 * duplicated after we scanned child: using last mm would invert
	 * that.
	 */
	start_mm = &init_mm;
	atomic_inc(&init_mm.mm_users);

	/*
	 * Keep on scanning until all entries have gone. Usually,
	 * one pass through swap_map is enough, but not necessarily:
	 * there are races when an instance of an entry might be missed.
	 */
	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
		if (signal_pending(current)) {
			retval = -EINTR;
			break;
		}

		/*
		 * Get a page for the entry, using the existing swap
		 * cache page if there is one. Otherwise, get a clean
		 * page and read the swap into it.
		 */
		swap_map = &si->swap_map[i];
		entry = swp_entry(type, i);
		page = read_swap_cache_async(entry,
					GFP_HIGHUSER_MOVABLE, NULL, 0);
		if (!page) {
			/*
			 * Either swap_duplicate() failed because entry
			 * has been freed independently, and will not be
			 * reused since sys_swapoff() already disabled
			 * allocation from here, or alloc_page() failed.
			 */
			swcount = *swap_map;
			/*
			 * We don't hold the lock here, so the swap entry
			 * could be SWAP_MAP_BAD (when the cluster is being
			 * discarded). Instead of failing, we can just skip
			 * the swap entry because swapoff will wait for the
			 * discard to finish anyway.
			 */
			if (!swcount || swcount == SWAP_MAP_BAD)
				continue;
			retval = -ENOMEM;
			break;
		}

		/*
		 * Don't hold on to start_mm if it looks like exiting.
		 */
		if (atomic_read(&start_mm->mm_users) == 1) {
			mmput(start_mm);
			start_mm = &init_mm;
			atomic_inc(&init_mm.mm_users);
		}

		/*
		 * Wait for and lock page. When do_swap_page races with
		 * try_to_unuse, do_swap_page can handle the fault much
		 * faster than try_to_unuse can locate the entry. This
		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
		 * defer to do_swap_page in such a case - in some tests,
		 * do_swap_page and try_to_unuse repeatedly compete.
		 */
		wait_on_page_locked(page);
		wait_on_page_writeback(page);
		lock_page(page);
		wait_on_page_writeback(page);

		/*
		 * Remove all references to entry.
		 */
		swcount = *swap_map;
		if (swap_count(swcount) == SWAP_MAP_SHMEM) {
			retval = shmem_unuse(entry, page);
			/* page has already been unlocked and released */
			if (retval < 0)
				break;
			continue;
		}
		if (swap_count(swcount) && start_mm != &init_mm)
			retval = unuse_mm(start_mm, entry, page);

		if (swap_count(*swap_map)) {
			int set_start_mm = (*swap_map >= swcount);
			struct list_head *p = &start_mm->mmlist;
			struct mm_struct *new_start_mm = start_mm;
			struct mm_struct *prev_mm = start_mm;
			struct mm_struct *mm;

			atomic_inc(&new_start_mm->mm_users);
			atomic_inc(&prev_mm->mm_users);
			spin_lock(&mmlist_lock);
			while (swap_count(*swap_map) && !retval &&
					(p = p->next) != &start_mm->mmlist) {
				mm = list_entry(p, struct mm_struct, mmlist);
				if (!atomic_inc_not_zero(&mm->mm_users))
					continue;
				spin_unlock(&mmlist_lock);
				mmput(prev_mm);
				prev_mm = mm;

				cond_resched();

				swcount = *swap_map;
				if (!swap_count(swcount)) /* any usage ? */
					;
				else if (mm == &init_mm)
					set_start_mm = 1;
				else
					retval = unuse_mm(mm, entry, page);

				if (set_start_mm && *swap_map < swcount) {
					mmput(new_start_mm);
					atomic_inc(&mm->mm_users);
					new_start_mm = mm;
					set_start_mm = 0;
				}
				spin_lock(&mmlist_lock);
			}
			spin_unlock(&mmlist_lock);
			mmput(prev_mm);
			mmput(start_mm);
			start_mm = new_start_mm;
		}
		if (retval) {
			unlock_page(page);
			page_cache_release(page);
			break;
		}

		/*
		 * If a reference remains (rare), we would like to leave
		 * the page in the swap cache; but try_to_unmap could
		 * then re-duplicate the entry once we drop page lock,
		 * so we might loop indefinitely; also, that page could
		 * not be swapped out to other storage meanwhile. So:
		 * delete from cache even if there's another reference,
		 * after ensuring that the data has been saved to disk -
		 * since if the reference remains (rarer), it will be
		 * read from disk into another page. Splitting into two
		 * pages would be incorrect if swap supported "shared
		 * private" pages, but they are handled by tmpfs files.
		 *
		 * Given how unuse_vma() targets one particular offset
		 * in an anon_vma, once the anon_vma has been determined,
		 * this splitting happens to be just what is needed to
		 * handle where KSM pages have been swapped out: re-reading
		 * is unnecessarily slow, but we can fix that later on.
		 */
		if (swap_count(*swap_map) &&
		    PageDirty(page) && PageSwapCache(page)) {
			struct writeback_control wbc = {
				.sync_mode = WB_SYNC_NONE,
			};

			swap_writepage(page, &wbc);
			lock_page(page);
			wait_on_page_writeback(page);
		}

		/*
		 * It is conceivable that a racing task removed this page from
		 * swap cache just before we acquired the page lock at the top,
		 * or while we dropped it in unuse_mm(). The page might even
		 * be back in swap cache on another swap area: that we must not
		 * delete, since it may not have been written out to swap yet.
		 */
		if (PageSwapCache(page) &&
		    likely(page_private(page) == entry.val))
			delete_from_swap_cache(page);

		/*
		 * So we could skip searching mms once swap count went
		 * to 1, we did not mark any present ptes as dirty: must
		 * mark page dirty so shrink_page_list will preserve it.
		 */
		SetPageDirty(page);
		unlock_page(page);
		page_cache_release(page);

		/*
		 * Make sure that we aren't completely killing
		 * interactive performance.
		 */
		cond_resched();
		if (frontswap && pages_to_unuse > 0) {
			if (!--pages_to_unuse)
				break;
		}
	}

	mmput(start_mm);
	return retval;
}

/*
 * After a successful try_to_unuse, if no swap is now in use, we know
 * we can empty the mmlist. swap_lock must be held on entry and exit.
 * Note that mmlist_lock nests inside swap_lock, and an mm must be
 * added to the mmlist just after page_duplicate - before would be racy.
 */
static void drain_mmlist(void)
{
	struct list_head *p, *next;
	unsigned int type;

	for (type = 0; type < nr_swapfiles; type++)
		if (swap_info[type]->inuse_pages)
			return;
	spin_lock(&mmlist_lock);
	list_for_each_safe(p, next, &init_mm.mmlist)
		list_del_init(p);
	spin_unlock(&mmlist_lock);
}

/*
 * Use this swapdev's extent info to locate the (PAGE_SIZE) block which
 * corresponds to page offset for the specified swap entry.
 * Note that the type of this function is sector_t, but it returns page offset
 * into the bdev, not sector offset.
 */
static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
{
	struct swap_info_struct *sis;
	struct swap_extent *start_se;
	struct swap_extent *se;
	pgoff_t offset;

	sis = swap_info[swp_type(entry)];
	*bdev = sis->bdev;

	offset = swp_offset(entry);
	start_se = sis->curr_swap_extent;
	se = start_se;

	for ( ; ; ) {
		struct list_head *lh;

		if (se->start_page <= offset &&
				offset < (se->start_page + se->nr_pages)) {
			return se->start_block + (offset - se->start_page);
		}
		lh = se->list.next;
		se = list_entry(lh, struct swap_extent, list);
		sis->curr_swap_extent = se;
		BUG_ON(se == start_se);		/* It *must* be present */
	}
}

/*
 * Returns the page offset into bdev for the specified page's swap entry.
 */
sector_t map_swap_page(struct page *page, struct block_device **bdev)
{
	swp_entry_t entry;
	entry.val = page_private(page);
	return map_swap_entry(entry, bdev);
}

/*
 * Free all of a swapdev's extent information
 */
static void destroy_swap_extents(struct swap_info_struct *sis)
{
	while (!list_empty(&sis->first_swap_extent.list)) {
		struct swap_extent *se;

		se = list_entry(sis->first_swap_extent.list.next,
				struct swap_extent, list);
		list_del(&se->list);
		kfree(se);
	}

	if (sis->flags & SWP_FILE) {
		struct file *swap_file = sis->swap_file;
		struct address_space *mapping = swap_file->f_mapping;

		sis->flags &= ~SWP_FILE;
		mapping->a_ops->swap_deactivate(swap_file);
	}
}

/*
 * Add a block range (and the corresponding page range) into this swapdev's
 * extent list. The extent list is kept sorted in page order.
 *
 * This function rather assumes that it is called in ascending page order.
 */
int
add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
		unsigned long nr_pages, sector_t start_block)
{
	struct swap_extent *se;
	struct swap_extent *new_se;
	struct list_head *lh;

	if (start_page == 0) {
		se = &sis->first_swap_extent;
		sis->curr_swap_extent = se;
		se->start_page = 0;
		se->nr_pages = nr_pages;
		se->start_block = start_block;
		return 1;
	} else {
		lh = sis->first_swap_extent.list.prev;	/* Highest extent */
		se = list_entry(lh, struct swap_extent, list);
		BUG_ON(se->start_page + se->nr_pages != start_page);
		if (se->start_block + se->nr_pages == start_block) {
			/* Merge it */
			se->nr_pages += nr_pages;
			return 0;
		}
	}

	/*
	 * No merge. Insert a new extent, preserving ordering.
	 */
	new_se = kmalloc(sizeof(*se), GFP_KERNEL);
	if (new_se == NULL)
		return -ENOMEM;
	new_se->start_page = start_page;
	new_se->nr_pages = nr_pages;
	new_se->start_block = start_block;

	list_add_tail(&new_se->list, &sis->first_swap_extent.list);
	return 1;
}

/*
 * A `swap extent' is a simple thing which maps a contiguous range of pages
 * onto a contiguous range of disk blocks. An ordered list of swap extents
 * is built at swapon time and is then used at swap_writepage/swap_readpage
 * time for locating where on disk a page belongs.
 *
 * If the swapfile is an S_ISBLK block device, a single extent is installed.
 * This is done so that the main operating code can treat S_ISBLK and S_ISREG
 * swap files identically.
 *
 * Whether the swapdev is an S_ISREG file or an S_ISBLK blockdev, the swap
 * extent list operates in PAGE_SIZE disk blocks. Both S_ISREG and S_ISBLK
 * swapfiles are handled *identically* after swapon time.
 *
 * For S_ISREG swapfiles, setup_swap_extents() will walk all the file's blocks
 * and will parse them into an ordered extent list, in PAGE_SIZE chunks. If
 * some stray blocks are found which do not fall within the PAGE_SIZE alignment
 * requirements, they are simply tossed out - we will never use those blocks
 * for swapping.
 *
 * For S_ISREG swapfiles we set S_SWAPFILE across the life of the swapon. This
 * prevents root from shooting her foot off by ftruncating an in-use swapfile,
 * which will scribble on the fs.
 *
 * The amount of disk space which a single swap extent represents varies.
 * Typically it is in the 1-4 megabyte range. So we can have hundreds of
 * extents in the list. To avoid much list walking, we cache the previous
 * search location in `curr_swap_extent', and start new searches from there.
 * This is extremely effective. The average number of iterations in
 * map_swap_page() has been measured at about 0.3 per page. - akpm.
 */
static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
{
	struct file *swap_file = sis->swap_file;
	struct address_space *mapping = swap_file->f_mapping;
	struct inode *inode = mapping->host;
	int ret;

	if (S_ISBLK(inode->i_mode)) {
		ret = add_swap_extent(sis, 0, sis->max, 0);
		*span = sis->pages;
		return ret;
	}

	if (mapping->a_ops->swap_activate) {
		ret = mapping->a_ops->swap_activate(sis, swap_file, span);
		if (!ret) {
			sis->flags |= SWP_FILE;
			ret = add_swap_extent(sis, 0, sis->max, 0);
			*span = sis->pages;
		}
		return ret;
	}

	return generic_swapfile_activate(sis, swap_file, span);
}

static void _enable_swap_info(struct swap_info_struct *p, int prio,
				unsigned char *swap_map,
				struct swap_cluster_info *cluster_info)
{
	if (prio >= 0)
		p->prio = prio;
	else
		p->prio = --least_priority;
	/*
	 * the plist prio is negated because plist ordering is
	 * low-to-high, while swap ordering is high-to-low
	 */
	p->list.prio = -p->prio;
	p->avail_list.prio = -p->prio;
	p->swap_map = swap_map;
	p->cluster_info = cluster_info;
	p->flags |= SWP_WRITEOK;
	atomic_long_add(p->pages, &nr_swap_pages);
	total_swap_pages += p->pages;

	assert_spin_locked(&swap_lock);
	/*
	 * both lists are plists, and thus priority ordered.
	 * swap_active_head needs to be priority ordered for swapoff(),
	 * which on removal of any swap_info_struct with an auto-assigned
	 * (i.e. negative) priority increments the auto-assigned priority
	 * of any lower-priority swap_info_structs.
	 * swap_avail_head needs to be priority ordered for get_swap_page(),
	 * which allocates swap pages from the highest available priority
	 * swap_info_struct.
	 */
	plist_add(&p->list, &swap_active_head);
	spin_lock(&swap_avail_lock);
	plist_add(&p->avail_list, &swap_avail_head);
	spin_unlock(&swap_avail_lock);
}

static void enable_swap_info(struct swap_info_struct *p, int prio,
				unsigned char *swap_map,
				struct swap_cluster_info *cluster_info,
				unsigned long *frontswap_map)
{
	frontswap_init(p->type, frontswap_map);
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
	_enable_swap_info(p, prio, swap_map, cluster_info);
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
}

static void reinsert_swap_info(struct swap_info_struct *p)
{
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
}

SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
	struct swap_info_struct *p = NULL;
	unsigned char *swap_map;
	struct swap_cluster_info *cluster_info;
	unsigned long *frontswap_map;
	struct file *swap_file, *victim;
	struct address_space *mapping;
	struct inode *inode;
	struct filename *pathname;
	int err, found = 0;
	unsigned int old_block_size;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	BUG_ON(!current->mm);

	pathname = getname(specialfile);
	if (IS_ERR(pathname))
		return PTR_ERR(pathname);

	victim = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0);
	err = PTR_ERR(victim);
	if (IS_ERR(victim))
		goto out;

	mapping = victim->f_mapping;
	spin_lock(&swap_lock);
	plist_for_each_entry(p, &swap_active_head, list) {
		if (p->flags & SWP_WRITEOK) {
			if (p->swap_file->f_mapping == mapping) {
				found = 1;
				break;
			}
		}
	}
	if (!found) {
		err = -EINVAL;
		spin_unlock(&swap_lock);
		goto out_dput;
	}
	if (!security_vm_enough_memory_mm(current->mm, p->pages))
		vm_unacct_memory(p->pages);
	else {
		err = -ENOMEM;
		spin_unlock(&swap_lock);
		goto out_dput;
	}
	spin_lock(&swap_avail_lock);
	plist_del(&p->avail_list, &swap_avail_head);
	spin_unlock(&swap_avail_lock);
	spin_lock(&p->lock);
	if (p->prio < 0) {
		struct swap_info_struct *si = p;

		plist_for_each_entry_continue(si, &swap_active_head, list) {
			si->prio++;
			si->list.prio--;
			si->avail_list.prio--;
		}
		least_priority++;
	}
	plist_del(&p->list, &swap_active_head);
	atomic_long_sub(p->pages, &nr_swap_pages);
	total_swap_pages -= p->pages;
	p->flags &= ~SWP_WRITEOK;
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);

	set_current_oom_origin();
	err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
	clear_current_oom_origin();

	if (err) {
		/* re-insert swap space back into swap_list */
		reinsert_swap_info(p);
		goto out_dput;
	}

	flush_work(&p->discard_work);

	destroy_swap_extents(p);
	if (p->flags & SWP_CONTINUED)
		free_swap_count_continuations(p);

	mutex_lock(&swapon_mutex);
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
	drain_mmlist();

	/* wait for anyone still in scan_swap_map */
	p->highest_bit = 0;		/* cuts scans short */
	while (p->flags >= SWP_SCANNING) {
		spin_unlock(&p->lock);
		spin_unlock(&swap_lock);
		schedule_timeout_uninterruptible(1);
		spin_lock(&swap_lock);
		spin_lock(&p->lock);
	}

	swap_file = p->swap_file;
	old_block_size = p->old_block_size;
	p->swap_file = NULL;
	p->max = 0;
	swap_map = p->swap_map;
	p->swap_map = NULL;
	cluster_info = p->cluster_info;
	p->cluster_info = NULL;
	frontswap_map = frontswap_map_get(p);
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
	frontswap_invalidate_area(p->type);
	frontswap_map_set(p, NULL);
	mutex_unlock(&swapon_mutex);
	free_percpu(p->percpu_cluster);
	p->percpu_cluster = NULL;
	vfree(swap_map);
	vfree(cluster_info);
	vfree(frontswap_map);
	/* Destroy swap account information */
	swap_cgroup_swapoff(p->type);

	inode = mapping->host;
	if (S_ISBLK(inode->i_mode)) {
		struct block_device *bdev = I_BDEV(inode);
		set_blocksize(bdev, old_block_size);
		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
	} else {
		mutex_lock(&inode->i_mutex);
		inode->i_flags &= ~S_SWAPFILE;
		mutex_unlock(&inode->i_mutex);
	}
	filp_close(swap_file, NULL);

	/*
	 * Clear the SWP_USED flag after all resources are freed so that swapon
	 * can reuse this swap_info in alloc_swap_info() safely. It is ok to
	 * not hold p->lock after we cleared its SWP_WRITEOK.
	 */
	spin_lock(&swap_lock);
	p->flags = 0;
	spin_unlock(&swap_lock);

	err = 0;
	atomic_inc(&proc_poll_event);
	wake_up_interruptible(&proc_poll_wait);

out_dput:
	filp_close(victim, NULL);
out:
	putname(pathname);
	return err;
}

#ifdef CONFIG_PROC_FS
static unsigned swaps_poll(struct file *file, poll_table *wait)
{
	struct seq_file *seq = file->private_data;

	poll_wait(file, &proc_poll_wait, wait);

	if (seq->poll_event != atomic_read(&proc_poll_event)) {
		seq->poll_event = atomic_read(&proc_poll_event);
		return POLLIN | POLLRDNORM | POLLERR | POLLPRI;
	}

	return POLLIN | POLLRDNORM;
}

/* iterator */
static void *swap_start(struct seq_file *swap, loff_t *pos)
{
	struct swap_info_struct *si;
	int type;
	loff_t l = *pos;

	mutex_lock(&swapon_mutex);

	if (!l)
		return SEQ_START_TOKEN;

	for (type = 0; type < nr_swapfiles; type++) {
		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
		si = swap_info[type];
		if (!(si->flags & SWP_USED) || !si->swap_map)
			continue;
		if (!--l)
			return si;
	}

	return NULL;
}

static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
{
	struct swap_info_struct *si = v;
	int type;

	if (v == SEQ_START_TOKEN)
		type = 0;
	else
		type = si->type + 1;

	for (; type < nr_swapfiles; type++) {
		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
		si = swap_info[type];
		if (!(si->flags & SWP_USED) || !si->swap_map)
			continue;
		++*pos;
		return si;
	}

	return NULL;
}

static void swap_stop(struct seq_file *swap, void *v)
{
	mutex_unlock(&swapon_mutex);
}

static int swap_show(struct seq_file *swap, void *v)
{
	struct swap_info_struct *si = v;
	struct file *file;
	int len;

	if (si == SEQ_START_TOKEN) {
		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n");
		return 0;
	}

	file = si->swap_file;
	len = seq_file_path(swap, file, " \t\n\\");
	seq_printf(swap, "%*s%s\t%u\t%u\t%d\n",
			len < 40 ? 40 - len : 1, " ",
			S_ISBLK(file_inode(file)->i_mode) ?
				"partition" : "file\t",
			si->pages << (PAGE_SHIFT - 10),
			si->inuse_pages << (PAGE_SHIFT - 10),
			si->prio);
	return 0;
}

static const struct seq_operations swaps_op = {
	.start =	swap_start,
	.next =		swap_next,
	.stop =		swap_stop,
	.show =		swap_show
};

static int swaps_open(struct inode *inode, struct file *file)
{
	struct seq_file *seq;
	int ret;

	ret = seq_open(file, &swaps_op);
	if (ret)
		return ret;

	seq = file->private_data;
	seq->poll_event = atomic_read(&proc_poll_event);
	return 0;
}

static const struct file_operations proc_swaps_operations = {
	.open		= swaps_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release,
	.poll		= swaps_poll,
};

static int __init procswaps_init(void)
{
	proc_create("swaps", 0, NULL, &proc_swaps_operations);
	return 0;
}
__initcall(procswaps_init);
#endif /* CONFIG_PROC_FS */

#ifdef MAX_SWAPFILES_CHECK
static int __init max_swapfiles_check(void)
{
	MAX_SWAPFILES_CHECK();
	return 0;
}
late_initcall(max_swapfiles_check);
#endif

static struct swap_info_struct *alloc_swap_info(void)
{
	struct swap_info_struct *p;
	unsigned int type;

	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return ERR_PTR(-ENOMEM);

	spin_lock(&swap_lock);
	for (type = 0; type < nr_swapfiles; type++) {
		if (!(swap_info[type]->flags & SWP_USED))
			break;
	}
	if (type >= MAX_SWAPFILES) {
		spin_unlock(&swap_lock);
		kfree(p);
		return ERR_PTR(-EPERM);
	}
	if (type >= nr_swapfiles) {
		p->type = type;
		swap_info[type] = p;
		/*
		 * Write swap_info[type] before nr_swapfiles, in case a
		 * racing procfs swap_start() or swap_next() is reading them.
		 * (We never shrink nr_swapfiles, we never free this entry.)
		 */
		smp_wmb();
		nr_swapfiles++;
	} else {
		kfree(p);
		p = swap_info[type];
		/*
		 * Do not memset this entry: a racing procfs swap_next()
		 * would be relying on p->type to remain valid.
		 */
	}
	INIT_LIST_HEAD(&p->first_swap_extent.list);
	plist_node_init(&p->list, 0);
	plist_node_init(&p->avail_list, 0);
	p->flags = SWP_USED;
	spin_unlock(&swap_lock);
	spin_lock_init(&p->lock);

	return p;
}

static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
{
	int error;

	if (S_ISBLK(inode->i_mode)) {
		p->bdev = bdgrab(I_BDEV(inode));
		error = blkdev_get(p->bdev,
				   FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);
		if (error < 0) {
			p->bdev = NULL;
			return error;
		}
		p->old_block_size = block_size(p->bdev);
		error = set_blocksize(p->bdev, PAGE_SIZE);
		if (error < 0)
			return error;
		p->flags |= SWP_BLKDEV;
	} else if (S_ISREG(inode->i_mode)) {
		p->bdev = inode->i_sb->s_bdev;
		mutex_lock(&inode->i_mutex);
		if (IS_SWAPFILE(inode))
			return -EBUSY;
	} else
		return -EINVAL;

	return 0;
}

static unsigned long read_swap_header(struct swap_info_struct *p,
					union swap_header *swap_header,
					struct inode *inode)
{
	int i;
	unsigned long maxpages;
	unsigned long swapfilepages;
	unsigned long last_page;

	if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
		pr_err("Unable to find swap-space signature\n");
		return 0;
	}

	/* swap partition endianness hack... */
	if (swab32(swap_header->info.version) == 1) {
		swab32s(&swap_header->info.version);
		swab32s(&swap_header->info.last_page);
		swab32s(&swap_header->info.nr_badpages);
		if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
			return 0;
		for (i = 0; i < swap_header->info.nr_badpages; i++)
			swab32s(&swap_header->info.badpages[i]);
	}
	/* Check the swap header's sub-version */
	if (swap_header->info.version != 1) {
		pr_warn("Unable to handle swap header version %d\n",
			swap_header->info.version);
		return 0;
	}

	p->lowest_bit  = 1;
	p->cluster_next = 1;
	p->cluster_nr = 0;

	/*
	 * Find out how many pages are allowed for a single swap
	 * device. There are two limiting factors: 1) the number
	 * of bits for the swap offset in the swp_entry_t type, and
	 * 2) the number of bits in the swap pte as defined by the
	 * different architectures. In order to find the
	 * largest possible bit mask, a swap entry with swap type 0
	 * and swap offset ~0UL is created, encoded to a swap pte,
	 * decoded to a swp_entry_t again, and finally the swap
	 * offset is extracted. This will mask all the bits from
	 * the initial ~0UL mask that can't be encoded in either
	 * the swp_entry_t or the architecture definition of a
	 * swap pte.
	 */
	maxpages = swp_offset(pte_to_swp_entry(
			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
	last_page = swap_header->info.last_page;
	if (last_page > maxpages) {
		pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
			maxpages << (PAGE_SHIFT - 10),
			last_page << (PAGE_SHIFT - 10));
	}
	if (maxpages > last_page) {
		maxpages = last_page + 1;
		/* p->max is an unsigned int: don't overflow it */
		if ((unsigned int)maxpages == 0)
			maxpages = UINT_MAX;
	}
	p->highest_bit = maxpages - 1;

	if (!maxpages)
		return 0;
	swapfilepages = i_size_read(inode) >> PAGE_SHIFT;
	if (swapfilepages && maxpages > swapfilepages) {
		pr_warn("Swap area shorter than signature indicates\n");
		return 0;
	}
	if (swap_header->info.nr_badpages && S_ISREG(inode->i_mode))
		return 0;
	if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
		return 0;

	return maxpages;
}

static int setup_swap_map_and_extents(struct swap_info_struct *p,
					union swap_header *swap_header,
					unsigned char *swap_map,
					struct swap_cluster_info *cluster_info,
					unsigned long maxpages,
					sector_t *span)
{
	int i;
	unsigned int nr_good_pages;
	int nr_extents;
	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;

	nr_good_pages = maxpages - 1;	/* omit header page */

	cluster_set_null(&p->free_cluster_head);
	cluster_set_null(&p->free_cluster_tail);
	cluster_set_null(&p->discard_cluster_head);
	cluster_set_null(&p->discard_cluster_tail);

	for (i = 0; i < swap_header->info.nr_badpages; i++) {
		unsigned int page_nr = swap_header->info.badpages[i];
		if (page_nr == 0 || page_nr > swap_header->info.last_page)
			return -EINVAL;
		if (page_nr < maxpages) {
			swap_map[page_nr] = SWAP_MAP_BAD;
			nr_good_pages--;
			/*
			 * Haven't marked the cluster free yet, no list
			 * operation involved
			 */
			inc_cluster_info_page(p, cluster_info, page_nr);
		}
	}

	/* Haven't marked the cluster free yet, no list operation involved */
	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
		inc_cluster_info_page(p, cluster_info, i);

	if (nr_good_pages) {
		swap_map[0] = SWAP_MAP_BAD;
		/*
		 * Haven't marked the cluster free yet, no list
		 * operation involved
		 */
		inc_cluster_info_page(p, cluster_info, 0);
		p->max = maxpages;
		p->pages = nr_good_pages;
		nr_extents = setup_swap_extents(p, span);
		if (nr_extents < 0)
			return nr_extents;
		nr_good_pages = p->pages;
	}
	if (!nr_good_pages) {
		pr_warn("Empty swap-file\n");
		return -EINVAL;
	}

	if (!cluster_info)
		return nr_extents;

	for (i = 0; i < nr_clusters; i++) {
		if (!cluster_count(&cluster_info[idx])) {
			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
			if (cluster_is_null(&p->free_cluster_head)) {
				cluster_set_next_flag(&p->free_cluster_head,
								idx, 0);
				cluster_set_next_flag(&p->free_cluster_tail,
								idx, 0);
			} else {
				unsigned int tail;

				tail = cluster_next(&p->free_cluster_tail);
				cluster_set_next(&cluster_info[tail], idx);
				cluster_set_next_flag(&p->free_cluster_tail,
								idx, 0);
			}
		}
		idx++;
		if (idx == nr_clusters)
			idx = 0;
	}
	return nr_extents;
}

/*
 * Helper to sys_swapon determining if a given swap
 * backing device queue supports DISCARD operations.
 */
static bool swap_discardable(struct swap_info_struct *si)
{
	struct request_queue *q = bdev_get_queue(si->bdev);

	if (!q || !blk_queue_discard(q))
		return false;

	return true;
}

SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
{
	struct swap_info_struct *p;
	struct filename *name;
	struct file *swap_file = NULL;
	struct address_space *mapping;
	int prio;
	int error;
	union swap_header *swap_header;
	int nr_extents;
	sector_t span;
	unsigned long maxpages;
	unsigned char *swap_map = NULL;
	struct swap_cluster_info *cluster_info = NULL;
	unsigned long *frontswap_map = NULL;
	struct page *page = NULL;
	struct inode *inode = NULL;

	if (swap_flags & ~SWAP_FLAGS_VALID)
		return -EINVAL;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	p = alloc_swap_info();
	if (IS_ERR(p))
		return PTR_ERR(p);

	INIT_WORK(&p->discard_work, swap_discard_work);

	name = getname(specialfile);
	if (IS_ERR(name)) {
		error = PTR_ERR(name);
		name = NULL;
		goto bad_swap;
	}
	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
	if (IS_ERR(swap_file)) {
		error = PTR_ERR(swap_file);
		swap_file = NULL;
		goto bad_swap;
	}

	p->swap_file = swap_file;
	mapping = swap_file->f_mapping;
	inode = mapping->host;

	/* If S_ISREG(inode->i_mode) will do mutex_lock(&inode->i_mutex); */
	error = claim_swapfile(p, inode);
	if (unlikely(error))
		goto bad_swap;

	/*
	 * Read the swap header.
	 */
	if (!mapping->a_ops->readpage) {
		error = -EINVAL;
		goto bad_swap;
	}
	page = read_mapping_page(mapping, 0, swap_file);
	if (IS_ERR(page)) {
		error = PTR_ERR(page);
		goto bad_swap;
	}
	swap_header = kmap(page);

	maxpages = read_swap_header(p, swap_header, inode);
	if (unlikely(!maxpages)) {
		error = -EINVAL;
		goto bad_swap;
	}

	/* OK, set up the swap map and apply the bad block list */
	swap_map = vzalloc(maxpages);
	if (!swap_map) {
		error = -ENOMEM;
		goto bad_swap;
	}
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
		int cpu;

		p->flags |= SWP_SOLIDSTATE;
		/*
		 * select a random position to start with to help wear leveling
		 * SSD
		 */
		p->cluster_next = 1 + (prandom_u32() % p->highest_bit);

		cluster_info = vzalloc(DIV_ROUND_UP(maxpages,
			SWAPFILE_CLUSTER) * sizeof(*cluster_info));
		if (!cluster_info) {
			error = -ENOMEM;
			goto bad_swap;
		}
		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
		if (!p->percpu_cluster) {
			error = -ENOMEM;
			goto bad_swap;
		}
		for_each_possible_cpu(cpu) {
			struct percpu_cluster *cluster;
			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
			cluster_set_null(&cluster->index);
		}
	}

	error = swap_cgroup_swapon(p->type, maxpages);
	if (error)
		goto bad_swap;

	nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map,
		cluster_info, maxpages, &span);
	if (unlikely(nr_extents < 0)) {
		error = nr_extents;
		goto bad_swap;
	}
	/* frontswap enabled? set up bit-per-page map for frontswap */
	if (frontswap_enabled)
		frontswap_map = vzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));

	if (p->bdev && (swap_flags & SWAP_FLAG_DISCARD) && swap_discardable(p)) {
		/*
		 * When discard is enabled for swap with no particular
		 * policy flagged, we set all swap discard flags here in
		 * order to sustain backward compatibility with older
		 * swapon(8) releases.
		 */
		p->flags |= (SWP_DISCARDABLE | SWP_AREA_DISCARD |
			     SWP_PAGE_DISCARD);

		/*
		 * By flagging sys_swapon, a sysadmin can tell us to
		 * either do single-time area discards only, or to just
		 * perform discards for released swap page-clusters.
		 * Now it's time to adjust the p->flags accordingly.
		 */
		if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
			p->flags &= ~SWP_PAGE_DISCARD;
		else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
			p->flags &= ~SWP_AREA_DISCARD;

		/* issue a swapon-time discard if it's still required */
		if (p->flags & SWP_AREA_DISCARD) {
			int err = discard_swap(p);
			if (unlikely(err))
				pr_err("swapon: discard_swap(%p): %d\n",
					p, err);
		}
	}

	if (p->bdev && blk_queue_fast(bdev_get_queue(p->bdev)))
		p->flags |= SWP_FAST;

	mutex_lock(&swapon_mutex);
	prio = -1;
	if (swap_flags & SWAP_FLAG_PREFER) {
		prio =
		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
		setup_swap_ratio(p, prio);
	}
	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);

	pr_info("Adding %uk swap on %s. "
			"Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
		(p->flags & SWP_DISCARDABLE) ? "D" : "",
		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
		(frontswap_map) ? "FS" : "");

	mutex_unlock(&swapon_mutex);
	atomic_inc(&proc_poll_event);
	wake_up_interruptible(&proc_poll_wait);

	if (S_ISREG(inode->i_mode))
		inode->i_flags |= S_SWAPFILE;
	error = 0;
	goto out;
bad_swap:
	free_percpu(p->percpu_cluster);
	p->percpu_cluster = NULL;
	if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
		set_blocksize(p->bdev, p->old_block_size);
		blkdev_put(p->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
	}
	destroy_swap_extents(p);
	swap_cgroup_swapoff(p->type);
	spin_lock(&swap_lock);
	p->swap_file = NULL;
	p->flags = 0;
	spin_unlock(&swap_lock);
	vfree(swap_map);
	vfree(cluster_info);
	if (swap_file) {
		if (inode && S_ISREG(inode->i_mode)) {
			mutex_unlock(&inode->i_mutex);
			inode = NULL;
		}
		filp_close(swap_file, NULL);
	}
out:
	if (page && !IS_ERR(page)) {
		kunmap(page);
		page_cache_release(page);
	}
	if (name)
		putname(name);
	if (inode && S_ISREG(inode->i_mode))
		mutex_unlock(&inode->i_mutex);
	return error;
}

void si_swapinfo(struct sysinfo *val)
{
	unsigned int type;
	unsigned long nr_to_be_unused = 0;

	spin_lock(&swap_lock);
	for (type = 0; type < nr_swapfiles; type++) {
		struct swap_info_struct *si = swap_info[type];

		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
			nr_to_be_unused += si->inuse_pages;
	}
	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
	val->totalswap = total_swap_pages + nr_to_be_unused;
	spin_unlock(&swap_lock);
}

/*
 * Verify that a swap entry is valid and increment its swap map count.
 *
 * Returns 0 on success, or one of the following error codes:
 * - swp_entry is invalid -> -EINVAL
 * - swp_entry is a migration entry -> -EINVAL
 * - swap-cache reference is requested but there is already one -> -EEXIST
 * - swap-cache reference is requested but the entry is not used -> -ENOENT
 * - swap-mapped reference is requested but a continued swap count is
 *   needed -> -ENOMEM
 */
static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
{
	struct swap_info_struct *p;
	unsigned long offset, type;
	unsigned char count;
	unsigned char has_cache;
	int err = -EINVAL;

	if (non_swap_entry(entry))
		goto out;

	type = swp_type(entry);
	if (type >= nr_swapfiles)
		goto bad_file;
	p = swap_info[type];
	offset = swp_offset(entry);

	spin_lock(&p->lock);
	if (unlikely(offset >= p->max))
		goto unlock_out;

	count = p->swap_map[offset];

	/*
	 * swapin_readahead() doesn't check if a swap entry is valid, so the
	 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
	 */
	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
		err = -ENOENT;
		goto unlock_out;
	}

	has_cache = count & SWAP_HAS_CACHE;
	count &= ~SWAP_HAS_CACHE;
	err = 0;

	if (usage == SWAP_HAS_CACHE) {

		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
		if (!has_cache && count)
			has_cache = SWAP_HAS_CACHE;
		else if (has_cache)		/* someone else added cache */
			err = -EEXIST;
		else				/* no users remaining */
			err = -ENOENT;

	} else if (count || has_cache) {

		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
			count += usage;
		else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
			err = -EINVAL;
		else if (swap_count_continued(p, offset, count))
			count = COUNT_CONTINUED;
		else
			err = -ENOMEM;
	} else
		err = -ENOENT;			/* unused swap entry */

	p->swap_map[offset] = count | has_cache;

unlock_out:
	spin_unlock(&p->lock);
out:
	return err;

bad_file:
	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
	goto out;
}

/*
 * Help swapoff by noting that swap entry belongs to shmem/tmpfs
 * (in which case its reference count is never incremented).
 */
void swap_shmem_alloc(swp_entry_t entry)
{
	__swap_duplicate(entry, SWAP_MAP_SHMEM);
}

/*
 * Increase reference count of swap entry by 1.
 * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
 * but could not be atomically allocated.  Returns 0, just as if it succeeded,
 * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
 * might occur if a page table entry has got corrupted.
 */
int swap_duplicate(swp_entry_t entry)
{
	int err = 0;

	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
		err = add_swap_count_continuation(entry, GFP_ATOMIC);
	return err;
}

/*
 * @entry: swap entry for which we allocate swap cache.
 *
 * Called when allocating swap cache for an existing swap entry.
 * This can return error codes. Returns 0 on success.
 * -EEXIST means there is already a swap cache.
 * Note: return code is different from swap_duplicate().
 */
int swapcache_prepare(swp_entry_t entry)
{
	return __swap_duplicate(entry, SWAP_HAS_CACHE);
}

struct swap_info_struct *page_swap_info(struct page *page)
{
	swp_entry_t swap = { .val = page_private(page) };
	BUG_ON(!PageSwapCache(page));
	return swap_info[swp_type(swap)];
}

/*
 * out-of-line __page_file_ methods to avoid include hell.
 */
struct address_space *__page_file_mapping(struct page *page)
{
	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
	return page_swap_info(page)->swap_file->f_mapping;
}
EXPORT_SYMBOL_GPL(__page_file_mapping);

pgoff_t __page_file_index(struct page *page)
{
	swp_entry_t swap = { .val = page_private(page) };
	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
	return swp_offset(swap);
}
EXPORT_SYMBOL_GPL(__page_file_index);

/*
 * add_swap_count_continuation - called when a swap count is duplicated
 * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
 * page of the original vmalloc'ed swap_map, to hold the continuation count
 * (for that entry and for its neighbouring PAGE_SIZE swap entries).  Called
 * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
 *
 * These continuation pages are seldom referenced: the common paths all work
 * on the original swap_map, only referring to a continuation page when the
 * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
 *
 * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
 * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
 * can be called after dropping locks.
 */
int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
{
	struct swap_info_struct *si;
	struct page *head;
	struct page *page;
	struct page *list_page;
	pgoff_t offset;
	unsigned char count;

	/*
	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
	 */
	page = alloc_page(gfp_mask | __GFP_HIGHMEM);

	si = swap_info_get(entry);
	if (!si) {
		/*
		 * An acceptable race has occurred since the failing
		 * __swap_duplicate(): the swap entry has been freed,
		 * perhaps even the whole swap_map cleared for swapoff.
		 */
		goto outer;
	}

	offset = swp_offset(entry);
	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;

	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
		/*
		 * The higher the swap count, the more likely it is that tasks
		 * will race to add swap count continuation: we need to avoid
		 * over-provisioning.
		 */
		goto out;
	}

	if (!page) {
		spin_unlock(&si->lock);
		return -ENOMEM;
	}

	/*
	 * We are fortunate that although vmalloc_to_page uses pte_offset_map,
	 * no architecture is using highmem pages for kernel page tables: so it
	 * will not corrupt the GFP_ATOMIC caller's atomic page table kmaps.
	 */
	head = vmalloc_to_page(si->swap_map + offset);
	offset &= ~PAGE_MASK;

	/*
	 * Page allocation does not initialize the page's lru field,
	 * but it does always reset its private field.
	 */
	if (!page_private(head)) {
		BUG_ON(count & COUNT_CONTINUED);
		INIT_LIST_HEAD(&head->lru);
		set_page_private(head, SWP_CONTINUED);
		si->flags |= SWP_CONTINUED;
	}

	list_for_each_entry(list_page, &head->lru, lru) {
		unsigned char *map;

		/*
		 * If the previous map said no continuation, but we've found
		 * a continuation page, free our allocation and use this one.
		 */
		if (!(count & COUNT_CONTINUED))
			goto out;

		map = kmap_atomic(list_page) + offset;
		count = *map;
		kunmap_atomic(map);

		/*
		 * If this continuation count now has some space in it,
		 * free our allocation and use this one.
		 */
		if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
			goto out;
	}

	list_add_tail(&page->lru, &head->lru);
	page = NULL;			/* now it's attached, don't free it */
out:
	spin_unlock(&si->lock);
outer:
	if (page)
		__free_page(page);
	return 0;
}

/*
 * swap_count_continued - when the original swap_map count is incremented
 * from SWAP_MAP_MAX, check if there is already a continuation page to carry
 * into, carry if so, or else fail until a new continuation page is allocated;
 * when the original swap_map count is decremented from 0 with continuation,
 * borrow from the continuation and report whether it still holds more.
 * Called while __swap_duplicate() or swap_entry_free() holds si->lock.
 */
static bool swap_count_continued(struct swap_info_struct *si,
				 pgoff_t offset, unsigned char count)
{
	struct page *head;
	struct page *page;
	unsigned char *map;

	head = vmalloc_to_page(si->swap_map + offset);
	if (page_private(head) != SWP_CONTINUED) {
		BUG_ON(count & COUNT_CONTINUED);
		return false;		/* need to add count continuation */
	}

	offset &= ~PAGE_MASK;
	page = list_entry(head->lru.next, struct page, lru);
	map = kmap_atomic(page) + offset;

	if (count == SWAP_MAP_MAX)	/* initial increment from swap_map */
		goto init_map;		/* jump over SWAP_CONT_MAX checks */

	if (count == (SWAP_MAP_MAX | COUNT_CONTINUED)) { /* incrementing */
		/*
		 * Think of how you add 1 to 999
		 */
		while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) {
			kunmap_atomic(map);
			page = list_entry(page->lru.next, struct page, lru);
			BUG_ON(page == head);
			map = kmap_atomic(page) + offset;
		}
		if (*map == SWAP_CONT_MAX) {
			kunmap_atomic(map);
			page = list_entry(page->lru.next, struct page, lru);
			if (page == head)
				return false;	/* add count continuation */
			map = kmap_atomic(page) + offset;
init_map:		*map = 0;		/* we didn't zero the page */
		}
		*map += 1;
		kunmap_atomic(map);
		page = list_entry(page->lru.prev, struct page, lru);
		while (page != head) {
			map = kmap_atomic(page) + offset;
			*map = COUNT_CONTINUED;
			kunmap_atomic(map);
			page = list_entry(page->lru.prev, struct page, lru);
		}
		return true;			/* incremented */

	} else {				/* decrementing */
		/*
		 * Think of how you subtract 1 from 1000
		 */
		BUG_ON(count != COUNT_CONTINUED);
		while (*map == COUNT_CONTINUED) {
			kunmap_atomic(map);
			page = list_entry(page->lru.next, struct page, lru);
			BUG_ON(page == head);
			map = kmap_atomic(page) + offset;
		}
		BUG_ON(*map == 0);
		*map -= 1;
		if (*map == 0)
			count = 0;
		kunmap_atomic(map);
		page = list_entry(page->lru.prev, struct page, lru);
		while (page != head) {
			map = kmap_atomic(page) + offset;
			*map = SWAP_CONT_MAX | count;
			count = COUNT_CONTINUED;
			kunmap_atomic(map);
			page = list_entry(page->lru.prev, struct page, lru);
		}
		return count == COUNT_CONTINUED;
	}
}

/*
 * free_swap_count_continuations - swapoff free all the continuation pages
 * appended to the swap_map, after swap_map is quiesced, before vfree'ing it.
 */
static void free_swap_count_continuations(struct swap_info_struct *si)
{
	pgoff_t offset;

	for (offset = 0; offset < si->max; offset += PAGE_SIZE) {
		struct page *head;
		head = vmalloc_to_page(si->swap_map + offset);
		if (page_private(head)) {
			struct list_head *this, *next;
			list_for_each_safe(this, next, &head->lru) {
				struct page *page;
				page = list_entry(this, struct page, lru);
				list_del(this);
				__free_page(page);
			}
		}
	}
}