AuriStor File System News
v0.168 is a critical bug fix release addressing a fileserver denial of service vulnerability [CVE-2018-7444] and a client side bug in fs setacl -negative which generates more permissive access control lists than intended [CVE-2018-7168]. The v0.168 fileserver provides cell administrators the ability to prevent clients incorporating the bug from storing ACLs. This release also adds support for Red Hat Enterprise Linux 7.5 kernels and includes optimizations to reduce the impact of Meltdown and Spectre mitigations. v0.168 also includes major improvements to the volume transaction lifecycle. Interrupted or failed transactions no longer require cell administrators to manually clean up temporary volumes. The v0.168 release includes nearly 400 changes compared to v0.167.
v0.167 is a critical bug fix release addressing a denial of service vulnerability [CVE-2017-17432] in all services and clients. This release also adds support for the forthcoming Linux 4.15 kernel and two new vos subcommands, movesite and copysite.
v0.164 is a bug fix and performance release. This release includes a major rewrite of core cache manager I/O pathways on Linux supporting direct I/O, cache bypass, and read-ahead. This release includes additional improvements and bug fixes to UBIK beyond those shipped in v0.163 to support mixed OpenAFS and AuriStorFS deployments.
v0.163 contains major updates to the UBIK database replication protocol implementation that increase resiliency to peer communication failures and permit sites to mix IBM/OpenAFS and AuriStorFS servers without introducing single points of failure. These changes combined with those included in v0.162 simplify the migration from OpenAFS to AuriStorFS. v0.163 introduces Linux 4.14 kernel support. File server detection of and protection against unresponsive cache manager callback service implementations.
v0.162 contains AFS3-compatibility changes for the UBIK database replication protocol that permit AuriStorFS servers to be deployed in IBM AFS and OpenAFS cells without configuring them as clones. First release with Fedora 27 support. Bug fixes and on-going improvements.
v0.160 introduces macOS High Sierra and Apple File System support. On Linux, exporting the /afs file namespace via Linux nfsd using NFS2, NFS3, and NFS4 is now supported. Reduced memory utilization by the RX networking stack. File server workaround for deadlocks that are known to occur within IBM AFS and OpenAFS Unix cache managers. Bug fixes and general improvements.
v0.159 introduces the "vos eachfs" command. Linux 4.13 kernel support. Continued performance improvements and bug fixes.
Linux 4.12 kernel support. Fedora 26 support. Fileserver support for XFS and BTRFS reflinks for improved vice partition copy-on-write performance. Volserver and "vos" support for quotas larger than 2TB. Linux cache manager performance enhancements to address parallel workflows. macOS fix for Orpheus' Lyre. On-going bug fixing and improvements.
Nico Williams, Viktor Dukhovni and Jeffrey Altman announced the discovery of the "Orpheus' Lyre puts Kerberos to Sleep" bug:
As the name suggests, this implementation flaw can result in a failure of Kerberos mutual authentication. Kerberos is supposed to provide a secure method of network authentication impervious to man-in-the-middle attacks. Fortunately, the protocol is secure, but a mistake made by many implementations permits an attacker to successfully perform service impersonation and, in conjunction with credential delegation (ticket forwarding), client impersonation. The attack is silent and cannot be detected.
This is a client-side vulnerability, so it must be fixed by patching the client systems. Systems that have more than one Kerberos implementation installed must obtain patches for all of the implementations to be secure.
The MIT Kerberos implementation was never vulnerable. As patches for other implementations become available the https://www.orpheus-lyre.info/ site will be updated to indicate that.
Yesterday Microsoft issued patches and those should in my opinion be treated as critical with minimal delays before deployment. Heimdal also issued a patch which is included in version 7.4.
AuriStorFS bundles Heimdal when the local operating system's Kerberos and GSS-API cannot satisfy its requirements. The affected platforms include:
- Apple MacOS (all versions)
- Solaris (all versions)
- Microsoft Windows (all versions)
- Apple iOS (all versions)
This 1.6.21 release of OpenAFS includes a fix to Rx which improves the performance of Rx connections between OpenAFS and AuriStorFS when the OpenAFS peer is writing bulk data to the AuriStorFS peer. Examples include:
- OpenAFS cache manager issuing RXAFS_StoreData calls to AuriStorFS file servers
- OpenAFS volserver forwarding volume data to an AuriStorFS volserver
- OpenAFS volserver dumping volume data to an AuriStorFS "vos dump"
- OpenAFS "vos restore" restoring volume data to an AuriStorFS volserver
There are of course other scenarios involving backups, bulk vlserver queries, etc.
The fix avoids the introduction of 100ms delays as the AuriStor Rx peer attempts to re-open a call window which had been closed due to a lack of receive buffers while waiting for the incoming data to be consumed.
New features include support for IBM TSM in the AuriStorFS Backup Tape Controllers. "vos eachvol" enhancements. Faster "pts examine" performance. Automated salvager repair of corrupted volume vnode index file entries. New "vos status" command provides more informative volserver transaction status output including bytes sent and received for each call when both "vos" and "volserver" are v0.150 or above. Linux 4.11 kernel support. "fs" commands now support the -nofollow switch. Many bug fixes and reliability improvements.
This Wednesday AuriStor, Inc. is sponsoring a Hackathon and BOF in support of kAFS and AF_RXRPC development at the Linux Foundation's annual Vault conference.
What are kAFS and AF_RXRPC?
AF_RXRPC is an implementation of the Rx RPC protocol implemented in the Linux mainline network stack as a socket family accessible both to userland and in-kernel processes.
kAFS is an implementation of the AFS and AuriStorFS file system client in the Linux mainline kernel.
Why are AF_RXRPC and kAFS important?
The AFS file system namespace has been available on Linux as a third party add-on since the IBM days. The IBM AFS derived implementations suffer from:
- performance limitations due to the existence of a global lock to protect internal data structures
- license incompatibility with GPL_ONLY licensed kernel functionality that further restricts performance and functional capabilities
In addition, out of tree file system modules:
- are not a standard component of most Linux distributions thereby preventing ubiquitous access to the /afs file system namespace
- are not kept in sync with core filesystem and network layer changes in the Linux kernel by the developers responsible for those changes
Collectively, these issues increase the hurdles to use of the /afs file system namespace.
- Organizations must be careful not to deploy new Linux kernel versions until such time as updated AFS or AuriStorFS kernel modules are developed and distributed.
- Organizations cannot obtain the full benefit of the latest hardware whether that be hardware support for advanced cryptographic operations, patented processes such as rcu, and other techniques that can scale file system access to tens or hundreds of cpu cores.
- Lack of common distribution complicates the use of the /afs file namespace in support of Linux container based deployments, as most organizations are unwilling or unable to deploy custom kernels across their internal and cloud (aws, azure, ...) infrastructures.
What is the History of kAFS and AF_RXRPC?
David Howells began work on an in-tree AFS client for Linux circa 2001. Unlike the IBM AFS derived cache manager, David's implementation is not a monolithic file system and proprietary network stack designed for portability across operating systems. Instead, David's AFS client is designed as separate modular components that are integrated into Linux to maximize their usefulness not only for AFS but for a broader class of applications:
- Instead of implementing Rx as a proprietary component of the AFS file system, David added Rx as a native socket family integrated with the Linux kernel networking stack at the same layer as UDP and TCP processing. This produces noticeable reductions in packet processing overhead. At the same time, the Rx RPC protocol becomes readily available as a lightweight secure RPC for userland applications. As a demonstration of how easy it is to use, David implemented much of the AFS administration command suite (bos, pts, vos) in Python by combining a Python XDR class with AF_RXRPC network sockets.
- Instead of AFS Process Authentication Groups (PAGs), David designed the Linux Keyrings which are now a core Linux component used in support of many file systems and network identity solutions.
- David developed the FS-Cache file system caching layer which is used in support of NFS* and CIFS file systems.
- David's kAFS is the AFS and AuriStorFS specific file system functionality including the callback services.
Unfortunately, David Howells has received minimal support from the AFS user community. As a result, neither AF_RXRPC nor kAFS has been included in any major Linux distribution.
Why is AuriStor, Inc. contributing?
AuriStor, Inc. has invested substantial resources into its AuriStorFS Linux client in support of Red Hat Enterprise Linux, Fedora, CentOS, Debian and Ubuntu and will continue to do so. Yet, AuriStor, Inc. recognizes that widespread adoption of AuriStorFS servers for large scale Enterprise and Research deployments requires higher performance, greater scale and easier maintenance for Linux systems.
AuriStor, Inc. also recognizes that many of the workflows that have relied upon the /afs file namespace for software and configuration distribution are migrating to containers. That transition is not without its own challenges related to the management of Container identity and authentication to persistent network based resources. AuriStor, Inc. believes that the global /afs file namespace combined with the AuriStor Security Model (combined identity authentication and multi-factor constrained elevation authorization) are best suited to addressing the outstanding Container deployment issues.
AuriStor, Inc. believes that only through a native in-tree client can these issues be addressed.
What is AuriStor, Inc. contributing?
AuriStor, Inc. has leveraged its expertise and extensive quality assurance infrastructure to identify flaws in the AF_RXRPC and kAFS implementations. Over the last year hundreds of corrections and enhancements have been merged into the Linux mainline tree. Missing functionality has been identified and is being implemented one piece at a time.
As kAFS approaches production readiness AuriStor, Inc. will contribute native AuriStorFS client support including an implementation of the "yfs-rxgk" security class to the AF_RXRPC socket family.
It is our hope that by the end of 2017 kAFS and AF_RXRPC will be ready for inclusion in major Linux distributions side-by-side with NFS* and CIFS.
AuriStor, Inc. is also working with major players in the Container eco-system and the Linux Foundation to address the identity management problem. When successful, it will be possible to launch Containers with network credentials such as Kerberos tickets and AFS/AuriStorFS tokens managed by the host. AuriStor, Inc. believes that this functionality combined with the /afs file namespace will allow true portability of Containerized processes across private and public cloud infrastructures.
v0.145 is the latest in the on-going efforts to improve the fileserver's ability to perform volume operations when the volume is under heavy load and to recover when the unexpected happens.
One of the strengths of the /afs model is the ability to move, release, backup, and dump volumes while they are being accessed by clients under production loads. For example, the following scenario should in theory be handled without a hiccup:
- Take two file servers each with at least one vice partition.
- Create a volume V on fs1/a
- Add RO sites on fs1/a and fs2/a
- On at least two clients execute "iozone -Rac" in separate directories of volume V using the RW path
- On at least one client start a loop that lists, with stat info, one of the directories that the iozone test is writing to, as seen from V.backup.
- On at least one client start a loop that lists, with stat info, one of the directories that the iozone test is writing to, as seen from V.readonly.
- Repeat the following process in a tight loop
- "vos backup V"
- "vos release V"
- "vos move V fs2 a"
- "vos backup V"
- "vos release V"
- "vos move V fs1 a"
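The tight loop in the last step can be sketched as a small driver script. This is a minimal sketch, not the harness AuriStor used: the function names and the `dry_run` default are assumptions made for illustration, and the `vos` command strings are copied verbatim from the steps above, so a real cell may need additional cell or authentication options.

```python
import shlex
import subprocess

# One full round trip of the volume, using the command strings from the text.
ROUND_TRIP = [
    "vos backup V",
    "vos release V",
    "vos move V fs2 a",
    "vos backup V",
    "vos release V",
    "vos move V fs1 a",
]


def stress_commands(round_trips):
    """Return the flat list of commands for the given number of round trips."""
    return [cmd for _ in range(round_trips) for cmd in ROUND_TRIP]


def run_stress(round_trips, dry_run=True):
    """Print each command (dry_run=True) or execute it in order.

    When executing, stop at the first failing command so the failing
    step is easy to identify.
    """
    for cmd in stress_commands(round_trips):
        if dry_run:
            print(cmd)
            continue
        subprocess.run(shlex.split(cmd), check=True)


if __name__ == "__main__":
    # Dry run of two round trips; nothing is executed against a cell.
    run_stress(round_trips=2)
```

Running with `dry_run=False` would only make sense against a test cell, alongside the iozone writers and directory listing loops described in the earlier steps.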
Organizations do not actively operate their cells in this fashion but the expectation is that "if they did, it should work." Of course, the answer is "it didn't before v0.145". Why not? and where did it fail?
For those of you that are unfamiliar with the fileserver architecture, here is the process list as reported by the bosserver for a fileserver:
[C:\]bos status great-lakes.auristor.com -long
Instance dafs, (type is dafs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sat Dec 24 12:19:03 2016 (3 proc starts)
    Command 1 is '/usr/libexec/yfs/fileserver'
    Command 2 is '/usr/libexec/yfs/volserver'
    Command 3 is '/usr/libexec/yfs/salvageserver'
    Command 4 is '/usr/libexec/yfs/salvager'
The first three commands execute a set of dependent processes. The fileserver process:
- Registers the fileserver with the VL service and communicates with the PT service
- Processes all requests from AFS clients (aka cache managers) via the RXAFS and RXYFS Rx network services.
- Maintains a cache of all volume headers
- Is the exclusive owner of all volumes. The volserver and salvageserver processes request readonly or exclusive access to volumes from the fileserver.
- Issues requests to the salvageserver to perform consistency checks and repair volumes when a problem is detected with the volume headers, the on-disk data, or other.
The goal is to ensure that a volume is available to the fileserver process as long as there are active requests.
Each "vos" command is implemented by one or more VL and VOL RPCs. The VOL RPCs are processed by the volserver. The volserver can:
- Submit a query to the fileserver to obtain the necessary data to satisfy the request
- Request readonly access to a RW volume which can be used to produce a new clone (for .readonly or .backup or .roclone)
- Request exclusive access to a RW, RO, BK or an entire volume group
- Request a volume be salvaged by the salvageserver
When the salvageserver is asked to salvage a volume it requests exclusive access to the volume group from the fileserver. There are a lot of moving parts.
For each of the "vos backup", "vos release" and "vos move" commands the volserver will exercise different combinations of query, readonly and exclusive access to volumes. The various modes of client requests to the fileserver:
- reading from the .backup
- reading from the .readonly
- writing to the RW
Together, these produce contention between the volservers and the fileservers for control of the volume.
The expected behavior is that a volume will be offline for the shortest amount of time possible, that the client will retry the request in a timely manner, and, most importantly, that the client will not treat a temporary outage as a fatal error.
What could go wrong?
For starters, the client retry algorithm is quite poor. Whenever a volume is busy or offline the client will sleep for 15 seconds. In other words, the client will block for an eternity, and when it finally does retry it's likely to find the volserver with exclusive ownership of the volume and sleep again for another 15 seconds. The v0.145 release does not fix the client behavior. That will come in a future release, but for testing purposes we removed all of the sleeps from the clients and forced immediate retries in order to maximize the contention.
Once sufficient contention was generated, we observed that after two round trips of the volume moving from fs1/a to fs2/a to fs1/a to fs2/a and back to fs1/a, the volume would go offline and not come back. We noticed that the fileserver would be asked to put the volume into an online state but would immediately request a salvage. The salvageserver would verify the state, ask the fileserver to put the volume into service, but the fileserver would immediately request another salvage. This would repeat a dozen times before the volume would be taken offline permanently. Attempts to manually salvage the volume would succeed but the volume would still not return to an online state. The failure was 100% reproducible.
For those of you that remember IBM AFS and OpenAFS 1.4.x and earlier, there was no on-demand attachment or salvaging of volumes. The fileserver was much simpler. The fileserver forced a salvage of all volumes at startup and then attached them. The volserver always obtained exclusive access to volumes. The fileserver never cached any volume metadata. If something went wrong with a volume, it was taken offline until an administrator intervened.
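The cost of a fixed 15 second sleep can be illustrated with a toy timeline. This is a sketch under stated assumptions, not the cache manager's actual logic: the function name, the busy windows, and the half-second alternative delay are all hypothetical and chosen only to show how a long fixed sleep can land in one busy window after another.

```python
def time_to_success(busy_windows, retry_delay, start):
    """Return the time at which a client's request first succeeds.

    busy_windows: (begin, end) intervals, in seconds, during which the
                  volume is busy or offline (hypothetical values).
    retry_delay:  seconds the client sleeps after each failed attempt.
    start:        time of the client's first attempt.
    """
    now = start
    # Keep sleeping for as long as each attempt lands inside a busy window.
    while any(begin <= now < end for begin, end in busy_windows):
        now += retry_delay
    return now


# Hypothetical workload: the volume is unavailable for 1 second out of
# every 15, for example while a volserver holds exclusive access.
windows = [(0, 1), (15, 16), (30, 31)]

fixed_sleep = time_to_success(windows, retry_delay=15, start=0.5)
quick_retry = time_to_success(windows, retry_delay=0.5, start=0.5)
print(fixed_sleep, quick_retry)
```

In this contrived schedule the client that sleeps 15 seconds hits three consecutive busy windows and succeeds at 45.5 seconds, while the client that retries after half a second succeeds at 1.0 seconds, even though the volume was unavailable for only one second at a time.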
The struggles of the last few months have all been directly attributable to the changes introduced as part of the demand attach functionality introduced in OpenAFS 1.6.x. We have encountered deadlocks, on-disk volume metadata corruption, copy-on-write data corruption, salvager failures, and now fileserver volume metadata cache corruption.
After two round-trips of volume movement while the RW, RO, and BK are all under active usage, the fileserver's volume group cache would end up out of sync with the volserver. The fileserver would believe the volume was still owned by the volserver when it wasn't. The fileserver would request the salvageserver to verify the volume state but nothing it could do would make a difference.
As of v0.145 the handoff from volserver to fileserver has been corrected. Now the volume can be backed up, released, and moved continuously while each of the volume types is actively accessed. No salvages. No VNOVOL errors. The iozone benchmark will continue to function even if the fileserver or volserver processes are periodically killed.
With a modified client to reduce the delays after receiving a VBUSY or VOFFLINE error, the iozone benchmark continues to function although at a lower rate of throughput. There are periodic pauses when large copy-on-write operations need to be performed. Which brings me to our New Year's resolution.
In 2017, AuriStor aims to remove the copy-on-write delays. The COW delays are due to the need to copy the entire backing file each time a file is modified after a volume clone is created. On Linux distributions that include "xfs reflinks" support for vice partitions, AuriStor fileservers will be able to complete COW operations in constant time without regard to the length of the file being modified. This change coupled with client-side improvements to the retry algorithms will significantly reduce the performance hit after volume clone operations.
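A rough cost model shows why the full-copy approach hurts for large files. This is a back-of-the-envelope sketch, not a description of the fileserver's implementation: the function names, the 4096 byte block size, and the assumption that a reflinked file only needs new storage for the blocks actually written are all illustrative.

```python
def full_copy_cow_bytes(file_size, write_size):
    """Bytes copied before the write when the whole backing file is duplicated."""
    return file_size


def reflink_cow_bytes(file_size, write_size, block_size=4096):
    """Bytes of new storage needed when only the written blocks are unshared.

    Assumes the write starts on a block boundary and fits within the file.
    """
    touched_blocks = -(-min(write_size, file_size) // block_size)  # ceiling division
    return touched_blocks * block_size


# Hypothetical case: a 4 KiB write into a 10 GiB file after a clone.
ten_gib = 10 * 2**30
print(full_copy_cow_bytes(ten_gib, 4096))  # grows with the file length
print(reflink_cow_bytes(ten_gib, 4096))    # depends only on the write size
```

Under these assumptions the full copy moves the entire 10 GiB for a single 4 KiB write, while the reflink case touches one block regardless of how large the file is.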
Have a Happy New Year! The AuriStor team is looking forward to an excellent year.
Today OpenAFS announced Security Advisory OPENAFS-SA-2016-003 and released 1.6.20 which is an urgent security release for all versions of OpenAFS. IBM AFS cache managers and all OpenAFS Windows clients are also affected. There is no update to the OpenAFS Windows client.
AuriStor File System clients and servers do not experience this information leakage. However, volumes migrated to AuriStorFS file servers from OpenAFS or IBM AFS file servers will retain the information leakage.
In addition to the impact described in the announcement, it is worth noting that all backups and any archived dump files will contain information leakage. Restoring a backup or dump file containing information leakage will restore that leaked information to the file servers where it will be delivered to cache managers.
Salvaging restored volumes with the -salvagedirs option is required to purge the information leakage.
It is worth emphasizing that IBM AFS and OpenAFS volserver operations including all backup operations occur in the clear. Therefore, all leaked information will be visible to passive viewers on the network segments across which volume backups and moves occur.