In case the binfmt_misc binary handler is registered *before* the e.g.
script one (when for example being compiled as a module) the following
situation may occur:
1. user launches a script, whose interpreter is a misc binary;
2. the load_misc_binary sets the misc_bang and returns -ENOEVEC,
since the binary is a script;
3. the load_script_binary loads one and calls for search_binary_hander
to run the interpreter;
4. the load_misc_binary is called again, but refuses to load the
binary due to misc_bang bit set.
The fix is to move the misc_bang setting lower - prior to the actual
call to the search_binary_handler.
Caused by the commit 3a2e7f47 (binfmt_misc.c: avoid potential kernel
stack overflow)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: <stable@kernel.org> [2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was another FAT BKL conversion deadlock reported by Bart
Trojanowski due to the BKL being used as a recursive lock by FAT, which
was missed because it only triggers with 'sync' (or 'dirsync') mounts.
The recursion worked for the BKL, but after the conversion to lock_super
(which uses a mutex), it just deadlocks.
Thanks to Bart for debugging this and testing the fix. The lock
debugging information from the original report:
=============================================
[ INFO: possible recursive locking detected ]
2.6.27-rc3-bisect-00448-ga7f5aaf #16
---------------------------------------------
mv/4020 is trying to acquire lock:
(&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
but task is already holding lock:
(&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
other info that might help us debug this:
3 locks held by mv/4020:
#0: (&sb->s_type->i_mutex_key#9/1){--..}, at: [<c01b2336>] do_unlinkat+0x66/0x140
#1: (&sb->s_type->i_mutex_key#9){--..}, at: [<c01b0954>] vfs_unlink+0x84/0x110
#2: (&type->s_lock_key#9){--..}, at: [<c01a90fe>] lock_super+0x1e/0x20
stack backtrace:
Pid: 4020, comm: mv Not tainted 2.6.27-rc3-bisect-00448-ga7f5aaf #16
[<c014e694>] validate_chain+0x984/0xea0
[<c0108d70>] ? native_sched_clock+0x0/0xf0
[<c014ee9c>] __lock_acquire+0x2ec/0x9b0
[<c014f5cf>] lock_acquire+0x6f/0x90
[<c01a90fe>] ? lock_super+0x1e/0x20
[<c044e5fd>] mutex_lock_nested+0xad/0x300
[<c01a90fe>] ? lock_super+0x1e/0x20
[<c01a90fe>] ? lock_super+0x1e/0x20
[<c01a90fe>] lock_super+0x1e/0x20
[<f8b3a700>] fat_write_inode+0x60/0x2b0 [fat]
[<c0450878>] ? _spin_unlock_irqrestore+0x48/0x80
[<f8b3a953>] ? fat_sync_inode+0x3/0x20 [fat]
[<f8b3a962>] fat_sync_inode+0x12/0x20 [fat]
[<f8b37c7e>] fat_remove_entries+0xbe/0x120 [fat]
[<f8b422ef>] vfat_unlink+0x5f/0x90 [vfat]
[<f8b42290>] ? vfat_unlink+0x0/0x90 [vfat]
[<c01b0968>] vfs_unlink+0x98/0x110
[<c01b2400>] do_unlinkat+0x130/0x140
[<c016a8f5>] ? audit_syscall_entry+0x105/0x150
[<c01b253b>] sys_unlinkat+0x3b/0x40
[<c01040d3>] sysenter_do_call+0x12/0x3f
=======================
where the deadlock is due to the nesting of lock_super from vfat_unlink
to fat_write_inode:
- do_unlinkat
- vfs_unlink
- vfat_unlink
* lock_super
- fat_remove_entries
- fat_sync_inode
- fat_write_inode
* lock_super
and the fix is to simply remove the use of lock_super() in fat_write_inode.
The lock_super() there had been just an automatic conversion of the
kernel lock to the superblock lock, but no locking was actually needed
there, since the code in fat_write_inode already protected all relevant
accesses with a spinlock (sbi->inode_hash_lock to be exact). The only
code inside the BKL (and thus the superblock lock) was accesses tp local
variables or calls to functions that have long been SMP-safe (i.e.
sb_bread, mark_buffe_dirty and brlese).
Bart reports:
"Looks good. I ran 10 parallel processes creating 1M files truncating
them, writing to them again and then deleting them. This patch fixes
the issue I ran into.
Signed-off-by: Bart Trojanowski <bart@jukie.net>"
Reported-and-tested-by: Bart Trojanowski <bart@jukie.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are a bit agressive in invalidating all the pages. But
it is ok because we really don't know why the block allocation
failed and it is better to come of the writeback path
so that user can look for more info.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
percpu_counter_sum_and_set() and percpu_counter_sum() is the same except
the former updates the global counter after accounting. Since we are
taking the fbc->lock to calculate the precise value of the counter in
percpu_counter_sum() anyway, it should simply set fbc->count too, as the
percpu_counter_sum_and_set() does.
This patch merges these two interfaces into one.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Properly handle MSKRB5 by passing sec=mskrb5 to the upcall so that the
spengo blob can be generated appropriately. Also, make
decode_negTokenInit prefer whichever mechanism is first in the list.
Needed for some NetApp servers, and possibly some older
versions of Windows which treat the two KRB5 mechanisms differently.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_setup_session references pSesInfo->server several times. That
pointer shouldn't change during the life of the function so grab it
once and store it in a local var. This makes the code look a little
cleaner too.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
I case we failed to allocate memory for inode when creating it, we did not
properly free block already allocated for this inode. Move memory allocation
before the block allocation which fixes this issue (thanks for the idea go to
Ingo Oeser <ioe-lkml@rameria.de>). Also remove a few superfluous
initializations already done in udf_alloc_inode().
Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de>
Signed-off-by: Jan Kara <jack@suse.cz>
A memory allocation inside alloc_mutex must not recurse back into the
filesystem itself because that leads to lock inversion between iprune_mutex and
alloc_mutex (and thus to deadlocks - see traces below). alloc_mutex is actually
needed only to update allocation statistics in the superblock so we can drop it
before we start allocating memory for the inode.
tar D ffff81015b9c8c90 0 6614 6612
ffff8100d5a21a20 0000000000000086 0000000000000000 00000000ffff0000
ffff81015b9c8c90 ffff81015b8f0cd0 ffff81015b9c8ee0 0000000000000000
0000000000000003 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b
[<ffffffff803c1bef>] mutex_lock+0xa/0xb
[<ffffffff8027f8c2>] shrink_icache_memory+0x38/0x200
[<ffffffff80257742>] shrink_slab+0xe3/0x15b
[<ffffffff802579db>] try_to_free_pages+0x221/0x30d
[<ffffffff8025657e>] isolate_pages_global+0x0/0x31
[<ffffffff8025324b>] __alloc_pages_internal+0x252/0x3ab
[<ffffffff8026b08b>] cache_alloc_refill+0x22e/0x47b
[<ffffffff8026ae37>] kmem_cache_alloc+0x3b/0x61
[<ffffffff8026b15b>] cache_alloc_refill+0x2fe/0x47b
[<ffffffff8026b34e>] __kmalloc+0x76/0x9c
[<ffffffffa00751f2>] :udf:udf_new_inode+0x202/0x2e2
[<ffffffffa007ae5e>] :udf:udf_create+0x2f/0x16d
[<ffffffffa0078f27>] :udf:udf_lookup+0xa6/0xad
...
kswapd0 D ffff81015b9d9270 0 125 2
ffff81015b903c28 0000000000000046 ffffffff8028cbb0 00000000fffffffb
ffff81015b9d9270 ffff81015b8f0cd0 ffff81015b9d94c0 000000000271b490
ffffe2000271b458 ffffe2000271b420 ffffe20002728dc8 ffffe20002728d90
Call Trace:
[<ffffffff8028cbb0>] __set_page_dirty+0xeb/0xf5
[<ffffffff8025403a>] get_dirty_limits+0x1d/0x22f
[<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b
[<ffffffff803c1bef>] mutex_lock+0xa/0xb
[<ffffffffa0073f58>] :udf:udf_bitmap_free_blocks+0x47/0x1eb
[<ffffffffa007df31>] :udf:udf_discard_prealloc+0xc6/0x172
[<ffffffffa007875a>] :udf:udf_clear_inode+0x1e/0x48
[<ffffffff8027f121>] clear_inode+0x6d/0xc4
[<ffffffff8027f7f2>] dispose_list+0x56/0xee
[<ffffffff8027fa5a>] shrink_icache_memory+0x1d0/0x200
[<ffffffff80257742>] shrink_slab+0xe3/0x15b
[<ffffffff80257e93>] kswapd+0x346/0x447
...
Reported-by: Tibor Tajti <tibor.tajti@gmail.com>
Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] mount of IPC$ breaks with iget patch
[CIFS] remove trailing whitespace
[CIFS] if get root inode fails during mount, cleanup tree connection
* 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6: (29 commits)
UBIFS: xattr bugfixes
UBIFS: remove unneeded check
UBIFS: few commentary fixes
UBIFS: fix budgeting request alignment in xattr code
UBIFS: improve arguments checking in debugging messages
UBIFS: always set i_generation to 0
UBIFS: correct spelling of "thrice".
UBIFS: support splice_write
UBIFS: minor tweaks in commit
UBIFS: reserve more space for index
UBIFS: print pid in dump function
UBIFS: align inode data to eight
UBIFS: improve budgeting checks
UBIFS: correct orphan deletion order
UBIFS: fix typos in comments
UBIFS: do not union creat_sqnum and del_cmtno
UBIFS: optimize deletions
UBIFS: increment commit number earlier
UBIFS: remove another unneeded function parameter
UBIFS: remove unneeded function parameter
...
A fuzzed fileystem image failed with OMFS when the extent count was
used in a loop without being checked against the max number of extents.
It also provoked a signed division for an array index that was checked
as if unsigned, leading to index by -1.
omfsck will be updated to fix these cases, in the meantime bail out
gracefully.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
write_cache_pages() uses i_mapping->writeback_index to pick up where it
left off the last time a given inode was found by pdflush or
balance_dirty_pages (or anyone else who sets wbc->range_cyclic)
alloc_inode() should set it to a sane value so that writeback doesn't
start in the middle of a file. It is somewhat difficult to notice the bug
since write_cache_pages will loop around to the start of the file and the
elevator helps hide the resulting seeks.
For whatever reason, Btrfs hits this often. Unpatched, untarring 30
copies of the linux kernel in series runs at 47MB/s on a single sata
drive. With this fix, it jumps to 62MB/s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xattr code has not been tested for a while and there were
serveral bugs. One of them is using wrong inode in
'ubifs_jnl_change_xattr()'. The other is a deadlock in
'ubifs_setxattr()': the i_mutex is locked in
'cap_inode_need_killpriv()' path, so deadlock happens when
'ubifs_setxattr()' tries to lock it again.
Thanks to Zoltan Sogor for finding these bugs.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In looking at network named pipe support on cifs, I noticed that
Dave Howell's iget patch:
iget: stop CIFS from using iget() and read_inode()
broke mounts to IPC$ (the interprocess communication share), and don't
handle the error case (when getting info on the root inode fails).
Thanks to Gunter who noted a typo in a debug line in the original
version of this patch.
CC: David Howells <dhowells@redhat.com>
CC: Gunter Kukkukk <linux@kukkukk.com>
CC: Stable Kernel <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The patches that are intended to introduce copy-on-write credentials for 2.6.28
require abstraction of access to some fields of the task structure,
particularly for the case of one task accessing another's credentials where RCU
will have to be observed.
Introduced here are trivial no-op versions of the desired accessors for current
and other tasks so that other subsystems can start to be converted over more
easily.
Wrappers are introduced into a new header (linux/cred.h) for UID/GID,
EUID/EGID, SUID/SGID, FSUID/FSGID, cap_effective and current's subscribed
user_struct. These wrappers are macros because the ordering between header
files mitigates against making them inline functions.
linux/cred.h is #included from linux/sched.h.
Further, XFS is modified such that it no longer defines and uses parameterised
versions of current_fs[ug]id(), thus getting rid of the namespace collision
otherwise incurred.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
* git://oss.sgi.com:8090/xfs/linux-2.6: (45 commits)
[XFS] Fix use after free in xfs_log_done().
[XFS] Make xfs_bmap_*_count_leaves void.
[XFS] Use KM_NOFS for debug trace buffers
[XFS] use KM_MAYFAIL in xfs_mountfs
[XFS] refactor xfs_mount_free
[XFS] don't call xfs_freesb from xfs_unmountfs
[XFS] xfs_unmountfs should return void
[XFS] cleanup xfs_mountfs
[XFS] move root inode IRELE into xfs_unmountfs
[XFS] stop using file_update_time
[XFS] optimize xfs_ichgtime
[XFS] update timestamp in xfs_ialloc manually
[XFS] remove the sema_t from XFS.
[XFS] replace dquot flush semaphore with a completion
[XFS] replace inode flush semaphore with a completion
[XFS] extend completions to provide XFS object flush requirements
[XFS] replace the XFS buf iodone semaphore with a completion
[XFS] clean up stale references to semaphores
[XFS] use get_unaligned_* helpers
[XFS] Fix compile failure in xfs_buf_trace()
...
Add a dlm_ prefix to the struct names in config.c. This resolves a
conflict with struct node in particular, when include/linux/node.h
happens to be included.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
A couple of unlikely error conditions were missing a kfree on the error
exit path.
Reported-by: Juha Leppanen <juha_motorsportcom@luukku.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Commit d70b67c8bc fixed VFS and
it never calls FS lookup function in deleted directories now.
We may remove corresponding UBIFS check.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch fixes a problem whereby simultaneous unlink, rmdir,
rename and link operations (e.g. rm -fR *) from multiple nodes
on the same GFS2 file system can cause kernel panics, hangs,
and/or memory corruption. It also gets rid of all the non-rgrp
calls to gfs2_glock_nq_m.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch is intended to fix the issues reported in bz #457798. Instead
of having the metafs as a separate filesystem, it becomes a second root
of gfs2. As a result it will appear as type gfs2 in /proc/mounts, but it
is still possible (for backwards compatibility purposes) to mount it as
type gfs2meta. A new mount flag "meta" is introduced so that its possible
to tell the two cases apart in /proc/mounts.
As a result it becomes possible to mount type gfs2 with -o meta and
get the same result as mounting type gfs2meta. So it is possible to
mount just the metafs on its own. Currently if you do this, its then
impossible to mount the "normal" root of the gfs2 filesystem without
first unmounting the metafs root. I'm not sure if thats a feature or
a bug :-)
Either way, this is a great improvement on the previous scheme and I've
verified that it works ok with bind mounts on both the "normal" root
and the metafs root in various combinations.
There were also a bunch of functions in super.c which didn't belong there,
so this moves them into ops_fstype.c where they can be static. Hopefully
the mount/umount sequence is now more obvious as a result.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Due to an incorrect iterator, some glocks were being missed from the
glock dumps obtained via debugfs. This patch fixes the problem and
ensures that we don't miss any glocks in future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Data length has to be aligned in the budgeting request. Code
in xattr.c did not do this.
Signed-off-by: Zoltan Sogor <weth@inf.u-szeged.hu>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Use "if (0) printk()" construct in debugging print macros to
make the debugging messages be checked even if debugging is
off.
This patch also removes some unneeded spaces and blank lines.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS does not presently re-use inode numbers, so leaving
i_generation zero is most appropriate for now.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
At the moment UBIFS reserves twice old index size space for the
index. But this is not enough in some cases, because if the indexing
node are very fragmented and there are many small gaps, while the
dirty index has big znodes - in-the-gaps method would fail.
Thus, reserve trise as more, in which case we are guaranteed that
we can commit in any case.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS aligns node lengths to 8, so budgeting has to do the
same. Well, direntry, inode, and page budgets are already
aligned, but not inode data budget (e.g., data in special
devices or symlinks). Do this for inode data as well.
Also, add corresponding debugging checks.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Budgeting is a crucial UBIFS subsystem - add more assertions
to improve requests checking. This is not compiled in when
UBIFS debugging is disabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The debug function that checks orphans, does so using the
TNC mutex. That means it will not see a correct picture
if the inode is removed from the orphan tree before it is
removed from TNC.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
The values in these two fields need to be preserved independently
and so a union cannot be used.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Every time anything is deleted, UBIFS writes the deletion inode
node twice - once in 'ubifs_jnl_update()' and the second time in
'ubifs_jnl_write_inode()'. However, the second write is not needed
if no commit happened after 'ubifs_jnl_update()'. This patch
checks that condition and avoids writing the deletion inode for
the second time.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Increment the commit number at the beginnig of the commit, instead
of doing this after the commit. This is needed for further
optimizations.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The 'last_reference' parameter of 'pack_inode()' is not really
needed because 'inode->i_nlink' may be tested instead. Zap it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Simplify 'ubifs_jnl_write_inode()' by removing the 'deletion'
parameter which is not really needed because we may test
inode->i_nlink and check whether this is a deletion or not.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Orphan inodes are deleted inodes which will disappear after FS
re-mount. There is not need to write orphan inodes back, because
they are not needed on the flash media.
So optimize orphans a little by not writing them back. Just mark
them as clean, free the budget, and report success to VFS.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We use ubifs_ro_mode() quite a lot, and not in fast-path, so
there is no reason to blow the code up by having it inlined.
Also, we usually want R/O mode change to be seen to other
CPUs as soon as possible, so when we make this a function
call, we will automatically have a memory barrier.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>