

Linux needs better network file systems

By Mark Stone on December 02, 2004 (8:00:00 AM)


In a previous article we looked at local file systems in Linux. In this article we'll examine the range of choices available for Linux network file systems. While the choices are many, we'll see that Linux still faces significant innovation challenges; yesterday's network paradigm isn't necessarily the best approach to the network of tomorrow.

The Traditional Paradigm

Our current model of the network file system is defined by the paradigm of the enterprise workstation. In this model, a large enterprise has a number of knowledge workers based at a single campus, all using individual workstations that are tied together on a single local area network (LAN).

In this model, it makes sense to centralize certain services and files so that those services and files reside on only one (or a few) servers rather than replicating them on every single workstation. The resulting efficiencies fall into three categories:
  • Administration. The fewer machines the IT staff has to touch, the more efficiently they can operate. File backup and restore is a simple example. Having a backup/recovery plan for a central file server for critical files is much easier than having a backup/recovery plan for every workstation on the LAN.
  • Resources. Not all resources need to be used all the time. Making infrequently used resources available to all on a central server is more efficient. Printing is a simple example. The cost, maintenance, and management overhead of attaching a printer to every workstation would be prohibitive, and indeed most printers would sit idle most of the time. A central, shared print server makes much more sense.
  • Collaboration. Groups working on a common project need to share and exchange files regularly. Dispersing group data to individual workstations makes it more difficult to share files, and also leads to confusion over which copy of a file is the master copy. Better to have a central file server for the work group to which each group member has access.
Not all knowledge workers fit the traditional paradigm. Companies have multiple campuses. Some workers work remotely. But for the era in which standard network file systems were developed, the single campus-single LAN model was fine.

Traditional Solutions: NFS and Samba

By their very nature, network file systems are superimposed on top of the local file system; without a local file system already in place, there is nothing the network file system can identify to mount over the network. Linux really doesn't have a native network file system, no network equivalent of ext2/ext3. In the LAN environment, Linux's file system capabilities have been born of the necessity to get along with other operating systems.

NFS, then, is the main network file system used by Linux in Unix environments. Samba is the main network file system used by Linux in Windows environments that depend on Microsoft's SMB protocol for network file sharing. Born of different operating system environments, NFS and Samba also use somewhat different metaphors.

NFS borrows its terminology from that of local file systems. Accessing a directory on another computer over the network looks like mounting a partition on a local file system. Once network-mounted, the directory is accessible as if it were another directory on the local machine.

Samba's metaphor is based on the notion of services. "Share," as in sharing a file or directory, is one possible service. Once sharing is authorized, Samba's behavior toward the end user looks similar to NFS. Samba understands other services, however, such as "print," which lets you access another machine's printer but not its files.

Both NFS and Samba were created in a world where the dominant network paradigm was the LAN on a single campus. While both file systems have adapted to changing network conditions, that adaptation has at times been awkward.

The New Paradigm of Occasional Connectivity

Two innovations have dramatically changed the requirements for network file systems subsequent to the initial development of NFS and the SMB protocol:

  • The first, most obvious change is the widespread proliferation of Internet connectivity in the mid to late 90s, transforming corporate LANs from isolated to interconnected networks. This changed security demands dramatically; suddenly outside intrusion over the network was a serious concern. The Internet also changed use profiles; suddenly knowledge workers expected corporate network access from home or from on the road.

  • The second, more subtle change has been the proliferation of wireless network technology and portable computing devices that use wireless technology. The result is a paradoxical notion in which connectivity is both pervasive and sporadic: pervasive, in that we are now accustomed to thinking of network access as never more than a hotspot or cell phone call away; sporadic, in that users at the end of a wireless tether are still at best occasionally connected.
To understand how these changes impact file systems, consider a simple model: the original Palm handheld. Sitting in its cradle, it was one computing device networked (in a limited sense) to another. Removed from the cradle, it became a roaming device only occasionally connected. It shared files with a desktop computer, and those files had to be synchronized. An address book or calendar entry could be changed on the Palm, on the desktop, or independently and differently on both. All of these changes had to be kept in proper synchronization.

Palm's simple approach to synchronization was to update files from whichever device had a change since last synchronization, and, when in doubt, to duplicate entries. That taught users to treat the Palm as much as possible as a read-only device and do their data entry on the desktop. The complexities that arose from this simple network structure foreshadowed many of the challenges of network file systems today.
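
The policy just described can be illustrated with a minimal Python sketch. It is not Palm's actual HotSync code; the Entry record, the integer timestamps, and the sync function are hypothetical names chosen for illustration. An entry changed on one side since the last synchronization wins, and an entry changed differently on both sides is kept twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    key: str        # e.g. a contact name (hypothetical record layout)
    value: str      # e.g. a phone number
    modified: int   # time of last change, in arbitrary ticks


def sync(handheld, desktop, last_sync):
    """Merge two {key: Entry} stores and return the merged list of entries."""
    merged = []
    for key in sorted(set(handheld) | set(desktop)):
        h, d = handheld.get(key), desktop.get(key)
        if h is None or d is None:
            merged.append(h or d)        # entry exists on one side only
            continue
        h_changed = h.modified > last_sync
        d_changed = d.modified > last_sync
        if h_changed and d_changed and h.value != d.value:
            merged.extend([h, d])        # in doubt: keep both copies
        elif h_changed:
            merged.append(h)             # handheld copy is the newer one
        else:
            merged.append(d)             # desktop changed, or nothing changed
    return merged
```

The duplicate branch is what pushed users toward treating the handheld as read-only: every two-sided edit leaves a pair of entries that someone has to reconcile by hand.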

Once your address book and calendar could go with you everywhere, knowledge workers expected to be able to access and update them everywhere. Pre-Palm, you accepted that calendar and address book updates that arose away from the office would have to wait until you returned to the office. Now Palm has spoiled us all; we expect such changes and updates to be available on demand, any time, from anywhere.

Add to that the notebook computer, at most a novelty device when NFS and SMB were born. Now not just address books and calendars are on the road, but all of a knowledge worker's digital work. To that mix we now add cell phones that act like PDAs, and a current generation of PDAs that include much of a notebook's functionality. Finally, none of these devices now need to depend on any kind of cable or wire to access a network. Fixed-point access is becoming a thing of the past.

What's emerging is a network of computing devices where any device could be connected from anywhere at any time, but where connectivity can also be lost at any time. This kind of network environment introduces three main challenges:
  • Authentication
  • Data Transport
  • Synchronization
Traditional network file systems often prove ill-adapted for these challenges. In the original design of NFS, authentication was done for hosts, not users. Thus anyone who could gain access to a given machine could also gain access to all of the machines for which that one was a valid NFS host. The addition of access control lists and privilege limiting has mitigated this problem, but these are ad hoc fixes for a system not designed for the current network environment.

Further, both NFS and the SMB protocol send data in clear text over the network. At a time when LANs were mostly isolated rather than interconnected this wasn't a problem. Today it's a major security risk.

Of course, not all problems necessarily need to be solved at the file system level. NFS can run over an ssh tunnel, allowing ssh to provide encrypted data transport and an extra level of authentication. Similarly, in a Windows environment Microsoft's VPN provides an encrypted tunnel.

What none of these approaches handle very well is synchronization. Think of someone copying a file onto a laptop, working on it on the plane, then reconnecting to a home or corporate server later. Now suppose that in the interim someone else in the group has been making different changes to the same file.
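
To make the laptop scenario concrete, here is a small Python sketch of the decision a synchronizing file system would face on reconnection. It assumes the client remembers the version it originally copied (the "base"); the function names and the returned labels are hypothetical and do not come from NFS, Samba, or any other system discussed here.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to tell whether a copy has changed."""
    return hashlib.sha256(data).hexdigest()


def classify(base: bytes, laptop: bytes, server: bytes) -> str:
    """Compare both current copies with the version that was originally copied."""
    laptop_changed = digest(laptop) != digest(base)
    server_changed = digest(server) != digest(base)
    if laptop_changed and server_changed:
        # Identical edits on both sides are harmless; anything else
        # needs a merge or a human decision.
        return "in-sync" if digest(laptop) == digest(server) else "conflict"
    if laptop_changed:
        return "push-laptop-copy"
    if server_changed:
        return "pull-server-copy"
    return "in-sync"
```

Only the "conflict" case is hard, and it is exactly the case described above: two people changing the same file while one of them is disconnected.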

Some of these issues can be dealt with at the application level rather than the file system level. Rsync, a powerful program that came out of the Samba project, provides remote file synchronization over the network. Tackling integration problems at the application level, however, leaves either the user or IT staff responsible for setting up, managing, and tracking synchronization. To accomplish all of this seamlessly at the file system level, we aren't talking about just a network file system. We're talking about a distributed file system.

New Tricks from an Old Approach: Coda

Much of the theoretical work done on modern file systems stems from research at Carnegie Mellon University (CMU). An alternative to NFS, for example, is AFS, derived from the Andrew File System research project at CMU.

Perhaps the most ambitious file system project at CMU is Coda. Coda is a distributed file system derived originally from AFS2. It is the brainchild of Professor Satyanarayanan. Coda is designed for mobile computing in an occasionally connected environment, is designed to work in an environment of partial network failure, and is designed to respond gracefully to total network failure. Encryption is built in for data transport, with additional security provided by authentication and access control.

The basic ideas behind Coda are:
  • The master copy of a file is kept on a Coda server
  • Coda clients maintain a persistent cache of copies of files from the server
  • Coda checks the network both for the availability of connections between client and server, and for the approximate bandwidth of the connection
  • The client cache is updated intelligently based on available bandwidth; the less bandwidth, the smaller the update increments, all the way down to a worst case of zero bandwidth, i.e. no connection
  • Updates from the client to the master must be complete; no partial file changes are ever written in the master copy
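
Two of the ideas in the list above, the client-side cache and all-or-nothing updates to the master copy, can be sketched in a few lines of Python. This is an illustration only, not Coda's implementation: CachingClient and store_atomically are hypothetical names, the "server" is just a local directory, the cache lives in memory rather than on disk, and bandwidth estimation and conflict handling are left out.

```python
import os
import tempfile


def store_atomically(master_path: str, data: bytes) -> None:
    """Replace the master copy in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(master_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, master_path)   # atomic rename on POSIX file systems
    except BaseException:
        os.unlink(tmp_path)                 # discard the incomplete temp file
        raise


class CachingClient:
    """Keeps local copies and queues writes while disconnected."""

    def __init__(self, server_dir: str):
        self.server_dir = server_dir
        self.cache = {}      # name -> bytes; a real client would persist this
        self.pending = []    # names written while offline, oldest first
        self.connected = True

    def read(self, name: str) -> bytes:
        if self.connected:
            with open(os.path.join(self.server_dir, name), "rb") as f:
                self.cache[name] = f.read()
        return self.cache[name]             # served from the cache when offline

    def write(self, name: str, data: bytes) -> None:
        self.cache[name] = data
        if self.connected:
            store_atomically(os.path.join(self.server_dir, name), data)
        elif name not in self.pending:
            self.pending.append(name)

    def reconnect(self) -> None:
        """Replay queued writes; each one reaches the master whole or not at all."""
        self.connected = True
        while self.pending:
            name = self.pending.pop(0)
            store_atomically(os.path.join(self.server_dir, name), self.cache[name])
```

Writing to a temporary file and renaming it over the master is one common way to get the "no partial file changes" property; whether Coda uses this particular mechanism is not something the list above states.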
All of this sounds like a big step forward in solving the problems of a distributed file system. The technical challenges are not small, however, and Coda is still very much a work in progress. Work on Coda began in 1987, and the FAQ for the project reports, "a small userbase (20-30 users) and a few servers are pretty workable. Such a setup has been running here at CMU for the past couple of years without significant disasters. Don't expect to easily handle terabytes of data or a large group of non-technical oriented users."

Coda's Descendants: Intermezzo

Keep in mind that Coda is a research project. It aims to solve the distributed file system problems in a fundamental and comprehensive way. In the real world, an 80% solution will often do. Towards that end, a lighter weight descendant of Coda has been designed for Linux: Intermezzo.

Intermezzo has been developed by kernel hacker, file system guru, and former Coda project member Peter Braam.

Intermezzo follows a similar architectural philosophy to Coda. There is a server element to the file system, and a client element, with the client side relying on a persistent cache to keep files in synch. Communication between client and server is handled by a separate program, InterSync.

Intermezzo has been included as a file system option for the Linux kernel since kernel version 2.4.15. Like Coda, it is far from a finished project, but still represents an important future direction for Linux file systems.

A Word About Clusters

The haphazard network world of the Internet and mobile users may seem the very opposite of the tightly structured network of Linux clusters. Surprisingly, the file system challenges are quite similar.

Think of the Internet as a cluster in slow motion. In a cluster environment of fibre channel interconnects, the lag time associated with disk access can look like a server failure or network outage does on the Internet. What might look like continuous availability in another context looks like intermittent connectivity in the high demand cluster context.

Thought of in this way, it should come as no surprise that the most direct application of Intermezzo is not for mobile users, but for clusters. In fact, Peter Braam and his team are working on a commercial version of their file system architecture, called Lustre, that is available through Braam's company, ClusterFS. Lustre has been used at Lawrence Livermore National Labs, the National Center for Supercomputing Applications (NCSA), and other centers for supercomputing.

The Future of Network File Systems

In today's network paradigm, the network file system challenge has become the distributed file system challenge, as we have moved from self-contained LAN environments to a world of occasionally connected computing. To be competitive in this environment, an operating system must have a file system that handles distribution and synchronization problems smoothly and securely.

Apple understands this. Apple's relentless focus on the "digital lifestyle" has led them to work hard at getting a wide array of devices, from cell phones to iPods to video cameras, to connect and communicate. MacOS X gets high marks for its capabilities in this area.

Microsoft certainly understands the challenge as well. While Windows-based networks today are still mostly locked into a complex of VPNs and SMB, the plans for Longhorn are quite different. The whole .Net infrastructure, and the way Avalon aims to leverage it, should address many distributed file system issues in a way that is transparent to the user.

Will Linux compete? The potential is there, and projects like Intermezzo show that many of the right building blocks are in place. What remains is for a high profile company or project to step forward and make distributed file problems a priority. So far, that hasn't happened.



Comments on Linux needs better network file systems



Posted by: Anonymous Coward on December 02, 2004 10:28 PM
Perhaps people should take a _really_ good look
at the plan9 way of distributing things..



Posted by: Anonymous Coward on December 03, 2004 06:18 AM
What we need is a complete implementation of WebDAV, with permissions and ACL management outside httpd.conf. SSL makes it secure. It's easy to use. It's cross-platform, with existing clients. You also gain versioning. You lose in performance, though.


Caching NFS in a netboot environment

Posted by: Anonymous Coward on December 03, 2004 06:19 AM
I remember working on a system that used a caching version of NFS in a Netboot environment (on Solaris, late '90s). For enterprise this kind of setup is amazing!

Netboot means no state on the client machines, the things that the users sit in front of (apart from the cache in this case, but effectively none). There's no need to administer them at all. They can be swapped out when broken and users can roam and log in to any one of them.

The caching file system meant that the huge CAD application didn't have to shunt across the network every time it was used. In fact, only updates were shunted when the application was upgraded on the central server. That was the only time there was a network rush, when users logged in and started running the new version.

CODA seemed to promise that kind of setup and more on Linux, but I expect so many of us are stuck with old habits like SMB and NFS. Microsoft don't make it easy to plug in file systems to Windows, making competition with SMB difficult where Windows clients are involved.

I'd pretty well forgotten about CODA. I'll have to give it another try on my home network.



Posted by: Anonymous Coward on December 03, 2004 06:30 AM
For those curious about Lustre, it's at [link not preserved in the archive].


Introducing OFS, Linux Labs International Inc.

Posted by: Anonymous Coward on December 03, 2004 06:37 AM
OFS: OPUS File System

(Open Parallel Unified System)

The state of filesystems: In spite of advances in clustered and distributed computing, filesystems remain in the dark ages. The primary option for sharing filesystems remains NFS. Development exists in the form of GFS and CODA, but neither really meets requirements.

Filesystems remain intimately bound to their block device based backing store. Even with LVM and other techniques to improve the block storage itself, filesystems are not readily resizable. Most shared filesystems continue to rely on a centralized resource for storage. Many do not support fully coherent operation. The few that do not suffer these problems rely on fiber channel or other expensive (and somewhat uncommon) hardware. They still are not readily resizable or device independent.

As a result, most HPC clusters still mount NFS storage and rely on rsync and other utilities to occasionally update local or backup storage. Few express elation with this solution, but use it because it exists.

While CODA and PVFS are fine filesystem developments, both are designed for specialized needs and are not a suitable replacement for NFS in most cases. CODA focuses on disconnected operation and cannot possibly be coherent or support atomic operations. PVFS is targeted primarily as a data staging area for parallel jobs, with no support for redundancy or resilient operation. GFS comes close, but emphasizes more or less exotic specialized storage devices, and makes no effort to extend filesystem semantics or even to fully support mmap semantics.

OFS is a collection of technologies intended to facilitate R&D efforts at producing a versatile distributed collection of filesystems to address the needs of HPC and failover clusters (including true single system image) as well as general computation environments.

Fuse: Fuse is a GPL project that exports filesystem functionality from kernel to userspace (a microkernel approach). This eases development in several respects, including availability of a wider range of development tools, minimizing the consequences of a crash, allowing languages other than C to be used in development, and simplifying the robust use of networking.

Linux Labs International, Inc. has adapted the Fuse project to support a simple inode based API for filesystem research purposes along with a Python binding for libfuse. These changes have been sent to the maintainer and are expected to be released in the upcoming 2.0 version.

The hope is to use the obvious advantages of a microkernel filesystem approach without introducing the difficulties of switching away from the most commonly used operating system in clusters.

Python: At first glance, Python may seem to be an odd choice for developing an OS resource. However, it offers the advantage of being a fully introspective object-oriented language with good debugging support that lends itself well to rapid prototyping in situations where design may change frequently (as is the case for many research projects). Python also lends itself well to selectively translating performance-critical portions of the program to C through its easy-to-use C interfaces.

The latter advantage opens the possibility of having a usable system earlier in the process while development continues in an unstable branch.

Backing store: Initially, a BackingStore module has been written. Its primary objectives are functional correctness and ease of debugging. It consists of an Inode class for regular files, a dnode class derived from Inode for directories, and a dentry class to handle directory entries. This is wrapped in a filesystem class which manages caching operations and translation of thrown exceptions into Unix-style errno values, as well as superblock-related functionality.

Currently, the BackingStore operates on top of any reasonable filesystem such as ext2/3. Inodes are stored in two separate files, one for metadata (such as owner, permissions, and timestamps) and one for data. This is done so that work may focus on filesystem semantics and not presently on block allocations and such. This also makes inspection of the low-level inode a simple matter using regular command-line utilities. Even in that simple form, it makes the filesystem easily resizable and allows nothing more than cp to be used to evacuate a volume (directory!) that is to be unmounted.

As inodes are assigned without regard to block numbers, kernel-imposed block device limitations do not apply to the filesystem, only filesize limits.
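
For readers trying to picture the two-file layout, the following Python sketch shows one way it could look. It is not the OFS code: the file naming (`<number>.meta` and `<number>.data`), the JSON metadata format, and the to_errno helper are assumptions made for illustration, and the dnode, dentry, caching, and superblock pieces are omitted.

```python
import errno
import json
import os
import time


class Inode:
    """A regular file stored as two host files: <number>.meta and <number>.data."""

    def __init__(self, root: str, number: int):
        self.meta_path = os.path.join(root, f"{number}.meta")
        self.data_path = os.path.join(root, f"{number}.data")

    def create(self, owner: int, mode: int) -> None:
        """Write the metadata file and an empty data file."""
        meta = {"owner": owner, "mode": mode, "mtime": time.time()}
        with open(self.meta_path, "w") as f:
            json.dump(meta, f)
        open(self.data_path, "wb").close()

    def stat(self) -> dict:
        """Return the stored metadata plus the current size of the data file."""
        with open(self.meta_path) as f:
            meta = json.load(f)
        meta["size"] = os.path.getsize(self.data_path)
        return meta

    def write(self, offset: int, data: bytes) -> int:
        with open(self.data_path, "r+b") as f:
            f.seek(offset)
            f.write(data)
        return len(data)

    def read(self, offset: int, length: int) -> bytes:
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def to_errno(exc: Exception) -> int:
    """Translate a Python exception into a Unix-style errno value."""
    if isinstance(exc, FileNotFoundError):
        return errno.ENOENT
    if isinstance(exc, PermissionError):
        return errno.EACCES
    return errno.EIO
```

Because both files are ordinary files on the host filesystem, the metadata can be inspected with cat and a whole volume can be copied elsewhere with cp, which is the property described above.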

Glue: Finally, a glue class (which derives from Fuse, the Fuse Python binding) has been developed which translates kernel requests into OFS requests. Because of the design of the storage API, the glue layer is quite thin and could degenerate into a simple mix-in in the near future.

Effectively, this is a VFS implementation in userspace with an interface to the Linux kernel's VFS.

Future: With correctness and debugging under control through testing of the BackingStore module, the natural next step is to place a DistributedStore module between the glue and the BackingStore. The DistributedStore module would be responsible for maintaining coherency in the distributed filesystem, local caching of data, and transparently locating Inode objects wherever they might exist in the system.

The DistributedStore module will use a Communicator class for all inter-node communication. Work has already begun on this module. To maximise its versatility, its design criteria include simple datagram-based messages and minimal dependence on the underlying network to support reliability, fragmenting, and other semantics. By developing it in this way, it should be relatively painless to adapt it to Dolphin interconnect, Myrinet, raw Ethernet, or Infiniband as an underlying fabric.

The Communicator includes its own discovery mechanism along with machine number to address translation. By presenting an abstract machine number to its client, the DistributedStore will not need to understand the addressing scheme of any particular communications fabric. The Communicator supports an authentication scheme based on SHA1 hashes to prevent unauthorized systems from accessing or altering the filesystem. However, the data is not encrypted at this time. While some applications may require encryption in the future, that will be a subject of later development, and will remain optional since it could introduce a significant performance penalty and will be unnecessary for many HPC applications.
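
As a rough illustration of authenticated but unencrypted datagrams and of machine-number addressing, consider the Python sketch below. The post only says the scheme is "based on SHA1 hashes"; using HMAC-SHA1 with a shared secret is an assumption here, as are the message layout and the names pack_message, unpack_message, and AddressBook. None of this is the actual Communicator code.

```python
import hashlib
import hmac
import struct

HEADER = struct.Struct("!HH")            # sender machine number, payload length
TAG_SIZE = hashlib.sha1().digest_size    # 20 bytes for SHA1


def pack_message(secret: bytes, sender: int, payload: bytes) -> bytes:
    """Build a datagram: header + payload in the clear, followed by an HMAC tag."""
    body = HEADER.pack(sender, len(payload)) + payload
    tag = hmac.new(secret, body, hashlib.sha1).digest()
    return body + tag


def unpack_message(secret: bytes, datagram: bytes):
    """Verify the tag, then return (sender machine number, payload)."""
    body, tag = datagram[:-TAG_SIZE], datagram[-TAG_SIZE:]
    expected = hmac.new(secret, body, hashlib.sha1).digest()
    if not hmac.compare_digest(tag, expected):
        raise PermissionError("datagram failed authentication")
    sender, length = HEADER.unpack_from(body)
    return sender, body[HEADER.size:HEADER.size + length]


class AddressBook:
    """Maps abstract machine numbers to fabric-specific addresses."""

    def __init__(self):
        self._addresses = {}

    def register(self, machine: int, address) -> None:
        self._addresses[machine] = address

    def resolve(self, machine: int):
        return self._addresses[machine]
```

The payload travels unencrypted, matching the description above: a node without the shared secret cannot forge or alter a message undetected, but anyone on the wire can still read it. The address stored per machine number is opaque to callers, so it could be an IP and port pair on one fabric and something else entirely on another.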

Development Objectives: OFS is to be a fully distributed filesystem supporting fully coherent operation (implying distributed locking and atomicity), data redundancy, graceful failover, and graceful degradation. While the initial work has focused on correct implementation of POSIX file semantics, future work is likely to include inode versioning and Plan 9-like device and pseudo-device access semantics.

As VM and FS are intimately tied together, future work will naturally include interfacing with distributed shared memory through other LinuxLabs work such as l4mmu.

Given a coherent distributed VM and Plan 9 pseudo-file access to system resources, a natural conclusion will be a strong single system image.

Another Linux Labs International GPL Project... Seeking Sponsors and Contributors. INFO@LINUXLABS.COM,


Re:Introducing OFS, Linux Labs International Inc.

Posted by: Anonymous Coward on December 03, 2004 04:39 PM
Sponsors and contributors found available at


stateless linux

Posted by: Anonymous Coward on December 06, 2004 05:35 AM
I applaud the author for this article; however, I was surprised to see no mention of the stateless linux project. It aims to deal with many, if not all, of the issues mentioned, and appears to be progressing rather quickly.


In addition.

Posted by: Administrator on December 04, 2004 04:19 AM
With the very real security concerns mentioned here, I'd like to point out two possibilities.

1. SHFS is a remote file system that basically feels like NFS from an administrative viewpoint, but since it operates over SSH (OpenSSH or SSH) it has the advantages of a secure connection and the ability to work with key pairs to limit liability as far as password security problems go. I've also had a lot of luck with it in that it can survive intermittent drops and improper disconnects very well (just like ssh does).

Reads and writes are seamless, and it allows for inclusion in fstab for auto mount/umount at boot/shutdown, or for simplified mount commands if you choose to use it on demand.

One other nice part is that it is now part of the 2.6 kernel and modules are also available for inclusion in 2.4. It can also be built without requiring a full kernel rebuild. In fact since the product includes the ability to do a make (rpm deb tgz) it's very simple to create and distribute it out to any number of systems on your network.

Personally, the longest "distance" I've used it over is between California and Michigan in the US. At that distance (and given that both ends were broadband, one an ISP and the other a cable modem) I found the speed to be reasonable for reads and writes. Using it on a LAN and without invoking any kind of compression, I've found it to be about 90% of local disk write speed. It can be used to run executables, so I have been able to use it as a home dir, install OOo to that dir, and have a user launch it without a problem.

Missing attributes: no local cache. If you aren't connected, you lose access to an application running from an shfs-mounted directory. It cannot handle conflicts, as in a situation where two people "check out" a file at the same time and then write back. It's a case of he who writes back last wins. Great for single-user directories of data files. All work is synchronous, just as it would be on a local drive. If you lose connectivity you won't be able to easily "save" a file for later write-back.

The second is a correction/introduction of sorts. While Intermezzo did start as a result of the incompleteness of and problems with Coda, its real roots go back to the Andrew File System. CMU and IBM have now turned it open source, and the developers have done a tremendous amount of work on it. It has also been included in the 2.6 kernel.

This FS has excellent security, in that it uses Kerberos for encryption. It offers near-local speed once connected (I've seen it used by Stanford Linear Accelerator personnel from Italy connected back to the US in California). It is also able to use any of a number of authentication protocols to ensure that ONLY those you want get connected. Since a file or application is cached locally once it is "launched," you are able to continue your work even though the connectivity is gone, and once you reconnect it is able to handle conflicts, mergers, etc.

AFS has an excellent ability to handle file locking and merger: when two people check a file out, the second one to write back does so merged. This is similar in nature to the way CVS handles mergers (though not exactly the same; I'm trying to paint a mental picture, not create a technical paper).

The biggest "downside" to AFS is initial setup. However, once it's up and running it's nearly trivial for someone to sit down at a box, set up their "environment," and get to work.

It also has the ability to do disconnects, meaning that you can be working on a project at location A, disconnect, then move to location B and reconnect without missing a beat.

Both of these projects are actively being developed. I know OpenAFS is being used by a number of very large corporations as well as major universities (Stanford for one). As such, improvements and growth are very strong with both, and both in their present form are more than sufficiently reliable for deployment and use in even the most delicate situations. If the Linear Accelerator people can rely on it to work with experiments that take years to set up and run just once, then it's got to be solid enough for anyone.


Future File Systems

Posted by: Administrator on December 03, 2004 01:38 AM
Great piece and glad someone is calling attention to the issue. I, too, find Samba and NFS lacking and end up with a kludge approach on our network. We actually support Appletalk, IPX, IP, and related filesystems to tie everyone together. Yuck! I love Samba because it let me kick Windows out of the server room. I hate Samba because it is a copy of SMB and can be no better than the awful product it imitates.

I think a cross-platform, extensible solution is possible that could accommodate the evolving technologies needed to meet the requirements mentioned in the article. As a paradigm, though, I think each workstation should be able to act as a master and a client. I wouldn't propose changing the Coda or InterMezzo approach, just adding to it the ability for a user who "owns" certain files to publish access to those files with fine-grained ACLs. SSH and LDAP could be a part of such a system, allowing a user to look up a valid account and issue a certificate of access to a person or group.

One of the problems with SMB is the chatty method used to keep peer systems "aware" of what is available. I think this could be solved easily through a new "publishing" protocol. A user wanting to make files available could publish their availability in a way that lets other users build a map of published shares within a certain domain. Several methods could be used to find available shares or a direct method could be used to import a share. For browsing and building up a local listing a user could send either broadcast or directed IP packets that first check to see if a host has something published, then it starts a conversation that checks for allowed authentication, access rules, directories and files available, etc. After that the person wanting to access a published share can get a certificate for each published share in their listing. Only when accessing a share would the client application request update information.

The problem or challenge of synchronizing could happen at a higher application layer. Versioning or check-in or out functions would also occur at this level.

Such a publishing protocol should be capable of passing multiple hops but also be restricted to local networks based on the user's wishes. If someone wants to make certain files publicly available over the internet they don't need to use FTP or HTTP, but can instead publish a share that allows discovery and access by the internet at large. For security, though, a WAN administrator could block the protocol in both directions to protect the unwitting.

And Windows users could replace their SMB clients with this new one and solve many headaches...!


ravindra mudumby--better network files

Posted by: Administrator on October 22, 2005 02:31 AM
I felt like the author was overreacting. As my eyes carefully glided over the heartfelt words on my screen, I started to see his point.

Mark Stone’s carefully crafted words really spoke to me. Rather than start spouting out opinions and other tired content, he really did a nice job at presenting his case in a clear, concise manner.


This story has been archived. Comments can no longer be posted.
