[clue-tech] NFS problems anyone?

Collins Richey crichey at gmail.com
Tue Dec 2 20:37:18 MST 2008


We've had a long day debugging an NFS problem, and we're nowhere near
figuring it out. Here's the setup.

We have a number of RHEL3/RHEL4 servers and workstations that share
ClearCase data stored on three separate servers and mounted via NFS. The
RHEL4 servers are at varying maintenance levels (U3, U4, U7), and
upgrading the problem servers to U7 does not help. The three problem
servers (A, B, C) each export a large directory (D1 on A, D2 on B, and
D3 on C). A mounts D2 and D3, B mounts D1 and D3, and C mounts D1 and
D2. All workstations and several other servers (RHEL3/4/5) mount D1,
D2, and D3.
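
To make the layout concrete, here is a minimal sketch of the cross-mounts
among the three problem servers. The letters are the placeholders used in
this post, not real hostnames or paths, and the function name is made up
for illustration.

```python
# Hypothetical model of the setup: each problem server exports one
# directory and mounts the directories exported by the other two.
EXPORTS = {"A": "D1", "B": "D2", "C": "D3"}


def cross_mounts(exports):
    """Return {server: sorted list of directories it mounts from the others}."""
    return {
        host: sorted(d for other, d in exports.items() if other != host)
        for host in exports
    }


if __name__ == "__main__":
    for host, dirs in cross_mounts(EXPORTS).items():
        print(f"{host} exports {EXPORTS[host]}, mounts {', '.join(dirs)}")
```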

The problem appeared this morning when I needed to restart NFS on a
related server to change the mount status of its directories (let's
call this server D and its directory D4). This server does not share D1,
D2, or D3, but all of the servers that mount D1, D2, D3 also mount D4.
As soon as I restarted NFS on D and attempted to restart netfs on
another server (E), E refused to mount D1, D2, D3, claiming that the
server refused the request due to permissions. All other servers and
workstations that already had D1, D2, D3, D4 mounted continue to use
them with no errors. Any system that is booted or has netfs restarted
fails to mount D1, D2, D3, but D4 mounts OK. In every case the exporting
server produces the same message whether the mount succeeded or failed.
If new exports are created on A, B, or C and NFS is restarted, these
exported directories also fail to mount anywhere.

No changes have been made to the exports or software on any of these
servers in the past few months. NFS on other servers and workstations
continues to work without a hitch. Each of the problem servers has been
rebooted, but that had no effect. Firewalls have been stopped, so that
is not the answer.

Does anyone have any clues or helpful hints about interpreting strace
data for the problem? From the client side (mount request), after a
lot of setup, the client opens a TCP socket and passes the requesting
IP address / name to the server, gets a response, opens a UDP socket,
passes the address/name again, and gets a response. The last response
differs between the successful and unsuccessful cases, but I haven't
been able to find a way of interpreting this. If successful, the
actual mount command is issued and mtab is updated. If unsuccessful,
no mount is issued and the error message is generated.
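
One way to narrow this down might be to reduce each client-side trace to
its socket-related calls and find the first point where a good and a bad
trace diverge. The following is only a sketch: it assumes plain strace
output of the usual "name(args) = result" shape, with an optional leading
PID column and no timestamps, and the function names are invented for
illustration.

```python
import re

# General shape of an strace line: optional PID prefix, syscall name,
# parenthesised arguments, then "= result".
SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\((.*)\)\s+=\s+(.*)$"
)

# Calls of interest when following the mount request over the network.
NETWORK_CALLS = {
    "socket", "bind", "connect", "sendto", "recvfrom", "sendmsg", "recvmsg",
}


def network_calls(lines):
    """Yield (syscall, args, result) for socket-related lines only."""
    for line in lines:
        m = SYSCALL_RE.match(line.strip())
        if m and m.group(1) in NETWORK_CALLS:
            yield m.group(1), m.group(2), m.group(3)


def first_difference(good_lines, bad_lines):
    """Return (index, good_call, bad_call) where the traces first diverge.

    Only the syscall name and result are compared, since the arguments
    carry binary payloads. Returns None if the common prefix is identical.
    """
    good = [(name, res) for name, _, res in network_calls(good_lines)]
    bad = [(name, res) for name, _, res in network_calls(bad_lines)]
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    return None
```

A differing recvfrom result (reply length) at the reported index would
point at the response worth decoding by hand.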

I've also straced rpc.mountd on the exporting server, but I know even
less about this.

Another quirk: in both cases strace logs an attempt to locate a
program, /sbin/mount.nfs, which does not exist.
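
As far as I understand it, mount probes for an external helper named
/sbin/mount.<fstype> before handling the filesystem type itself, so a
failed lookup may simply be normal on systems where NFS support is built
in to mount. Treat that as an assumption rather than a diagnosis. A small
sketch of the same lookup (the function name and the sbin parameter are
made up for illustration):

```python
import os


def mount_helper(fstype, sbin="/sbin"):
    """Return the path of an executable mount helper for fstype, or None."""
    path = os.path.join(sbin, f"mount.{fstype}")
    return path if os.access(path, os.X_OK) else None


if __name__ == "__main__":
    helper = mount_helper("nfs")
    print(helper if helper else "no external helper for nfs")
```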

A mystery within a riddle wrapped in an enigma.

The only relevant Google entry we found suggested that a reboot cured the problem.

-- 
Collins Richey
     If you fill your heart with regrets of yesterday and the worries
     of tomorrow, you have no today to be thankful for.

