AuriStor File System News
v0.188 addresses three issues experienced by customers.
The first is UBIK coordinator term expiration of the location service after periodic load spikes that increased the size of the vlserver thread pool from 100 threads to more than 13,000 threads. Each load spike would last for under a minute, and the thread pool would scale back to 100 threads after 20 minutes. Forty minutes later another load spike would occur, repeating the pattern. The existing rx packet allocator behaved very poorly under this workload pattern, allocating additional rx packets with each load spike. After a week more than five million rx packets had been allocated on some vlservers.
As the allocated packet counts increased and the number of threads decreased, the packets per thread ratio increased as well. When the thread pool resized to 100 threads the number of packets assigned to each thread grew to the point where packet transfers began to interfere with rx data transfer and event processing. The UBIK coordinator election algorithm is time sensitive, and a failure to deliver votes or time out RPCs in a timely manner can result in election failure.
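The packets per thread ratio can be made concrete with the figures quoted above. This is a minimal sketch that assumes packets are spread evenly across threads; how the old allocator actually distributed them is not described here, and the function name is illustrative.

```python
def packets_per_thread(allocated_packets: int, threads: int) -> float:
    """Average number of rx packets per thread, assuming an even spread."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return allocated_packets / threads


# Figures quoted in the text: about five million packets after a week,
# with the pool at roughly 13,000 threads during a spike and 100 when idle.
during_spike = packets_per_thread(5_000_000, 13_000)
when_idle = packets_per_thread(5_000_000, 100)
print(f"during spike: {during_spike:.0f} packets/thread")
print(f"when idle:    {when_idle:.0f} packets/thread")
```

With these inputs the idle pool carries about 130 times as many packets per thread as the pool does during a spike.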
v0.188 addresses the root causes by replacing:
- the rx multi call implementation used to conduct UBIK elections with a new variation that manages its own timeouts and does not rely upon timeouts set upon each individual rx rpc.
- the condvar timed wait implementation with a version that has finer grained clock resolution: 1ns instead of 1s.
- the rx packet allocator with a new implementation that is better suited for use with dynamic thread pools and larger window sizes. The new allocator also significantly reduces lock contention when obtaining and releasing packets.
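Clock resolution matters for a timed wait because a deadline cannot be honored more precisely than the clock it is measured against. The sketch below is a simplified model, not the AuriStorFS implementation: it assumes a wait wakes at the first clock tick at or after the requested deadline, and all names are illustrative.

```python
NS_PER_S = 1_000_000_000


def wake_time_ns(now_ns: int, timeout_ns: int, resolution_ns: int) -> int:
    """First clock tick at or after the requested deadline."""
    deadline = now_ns + timeout_ns
    # Integer ceiling division, then scale back to nanoseconds.
    return -(-deadline // resolution_ns) * resolution_ns


def overshoot_ns(now_ns: int, timeout_ns: int, resolution_ns: int) -> int:
    """How long past the requested deadline the waiter actually wakes."""
    return wake_time_ns(now_ns, timeout_ns, resolution_ns) - (now_ns + timeout_ns)


# A 250 ms timeout requested 200 ms after a second boundary.
now = 10 * NS_PER_S + 200_000_000
timeout = 250_000_000
print("1s resolution overshoot (ns): ", overshoot_ns(now, timeout, NS_PER_S))
print("1ns resolution overshoot (ns):", overshoot_ns(now, timeout, 1))
```

Under this model the 1s clock delays the wake-up by more than twice the requested timeout, which is the kind of error a time sensitive election cannot absorb.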
The second problem is loss of volume access after the fileserver
writes to the FileLog:
CopyOnWrite corruption prevention: detected zero nlink for volume N inode vnode:V unique:U tag:T (dest), forcing volume offline
No data corruption occurs but after each occurrence the volume is salvaged. After the 16th automatic salvage the volume is taken offline until there is manual intervention by an administrator.
This bug, a file descriptor leak, was introduced in v0.184 as a side effect of one of the fixes for the libyfs_vol reference counting errors.
The final issue is the ongoing problem that some customers
have experienced with Linux clients, with either the shell reporting
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
or "mount --bind" failing with
mount: mount(2) failed: No such file or directory
$ cd /afs/example.com/
$ ls -al /proc/self/cwd
/proc/self/cwd -> /afs/example.com (deleted)
The symptom occurs when a Linux dentry (directory entry) object ends up in an unhashed state although it is referenced by an inode.
Since v0.180 AuriStor has revised code paths to improve error code reporting and avoid race conditions that can generate this behavior. Apparently, there are still additional conditions that have yet to be identified. v0.188 includes a band-aid whereby an unhashed dentry will be rehashed when needed. However, AuriStor is still trying to find and address the root cause. Therefore the AuriStorFS kernel module will log a warning when a dentry is rehashed.
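The "(deleted)" marker shown in the /proc/self/cwd listing above can also be checked from a script. The following is a minimal sketch for Linux; the function names are illustrative, and a directory that really has been removed produces the same marker, so a positive result is a hint rather than proof of the unhashed dentry condition.

```python
import os

DELETED_SUFFIX = " (deleted)"


def looks_unhashed(link_target: str) -> bool:
    """True when a /proc/<pid>/cwd target carries the ' (deleted)' marker."""
    return link_target.endswith(DELETED_SUFFIX)


def current_directory_is_stale() -> bool:
    """Check this process's own working directory (Linux only)."""
    try:
        return looks_unhashed(os.readlink("/proc/self/cwd"))
    except OSError:
        # /proc is unavailable (non-Linux system or restricted environment).
        return False


if __name__ == "__main__":
    print("stale working directory:", current_directory_is_stale())
```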
Changes since v0.184 are primarily focused on the UNIX/Linux cache manager and fixing operational issues reported in the volserver and fileserver. The v0.184 release implemented major changes to the UNIX/Linux cache manager. This release fixes bugs introduced in v0.184 and missed edge cases. It also continues the refactoring of internal interfaces to propagate error codes and signals to userland applications. VolserLog messages related to the volserver transaction lifecycle have been thoroughly revamped. Additional reliability improvements within RX are included.
- New Platforms:
- Red Hat Enterprise 8
- xfs reflinks (requires new filesystem) changes AuriStorFS StoreData RPC copy-on-write performance from O(filelength) to O(write length).
- significant improvements in udp performance compared to rhel7.
- extended Berkeley Packet Filters (eBPF) provide for fairer distribution of rx call processing across multiple rx listener threads.
- Fedora 30
- Unix CM:
- v0.184 moved the /etc/yfs/cmstate.dat file to /var/yfs. With this change afsd would fail to start if /etc/yfs/cmstate.dat exists but contains invalid state information. This is fixed.
- v0.184 introduced a potential deadlock during directory processing. This is fixed.
Many sites have noticed that clients with v0.184 installed might log Lost contact with xxxx server ... referencing a strange negative error code and that fileservers might log FetchData Write failure ... errors from any Linux client version.
These errors might correlate to corruption of pages in the Linux page cache. The corruption is that one or more contiguous pages might be inappropriately zero filled.
This release implements many code changes intended to prevent Linux page cache and AFS disk cache corruption.
- Better data version checks
- More invalidation of cache chunk data version when zapping
- Only zero fill pages past the server end of file
- Always advance RPC stream pointer when skipping over missing pages or when populating pages from the disk cache chunk.
- Never match a data version number equal to -1.
- Avoid truncation races between find_get_page() and page locking.
Some sites have experienced failures of Linux mount --bind of /afs paths or getcwd returning ENOENT. This release fixes a dentry race that can produce an unhashed directory entry.
Some uses of the directory will continue to work, as the first lookup following the race will associate a new dentry with the inode, as an additional alias. Directories are not supposed to have aliases on Linux, so the vfs code assumes that d_alias is at most a list of 1 element, and accesses the entry in a slightly different way in a few places. Some sites get the new hashed dentry, others get the original unhashed one.
- Propagate EINTR and ERESTARTSYS during location server queries to userland.
- Handle common error table errors obtained outside an afs_Analyze loop. Map VL errors to ENODEV and RX, RXKAD, RXGK errors to ETIMEDOUT.
- Log all server down and server up events. Transition events from server probes failed to log messages.
- Avoid leaking local errors to the fileserver if a failure occurs during Direct IO processing.
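The error mapping described in the list above (VL errors to ENODEV; RX, RXKAD, and RXGK errors to ETIMEDOUT) can be pictured as a small lookup. This is only an illustration: using the table name as the key and falling back to EIO for anything else are assumptions, not a description of the cache manager's code.

```python
import errno

# Table names come from the text; keying on the name is an assumption.
ERRNO_BY_TABLE = {
    "VL": errno.ENODEV,
    "RX": errno.ETIMEDOUT,
    "RXKAD": errno.ETIMEDOUT,
    "RXGK": errno.ETIMEDOUT,
}


def map_table_error(table: str, default: int = errno.EIO) -> int:
    """Translate an error table name into the errno reported to userland."""
    # The EIO fallback is hypothetical; the text does not say what happens
    # to errors from other tables.
    return ERRNO_BY_TABLE.get(table.upper(), default)


for name in ("VL", "RX", "RXKAD", "RXGK"):
    print(name, "->", errno.errorcode[map_table_error(name)])
```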
- RX RPC networking:
- If the RPC initiator successfully completes a call without
consuming all of the response data, fail the call by sending
an RX_PROTOCOL_ERROR ABORT to the acceptor and returning
a new error, RX_CALL_PREMATURE_END, to the initiator.
Prior to this change failure to consume all of the response data would be silently ignored by the initiator and the acceptor might resend the unconsumed data until any idle timeout expired. The default idle timeout is 60 seconds.
- Avoid event cancellation race with rx call termination during process shutdown. When lost, this race can prevent a process such as vos from terminating after successfully completing its work.
- Avoid transmitting ABORT, CHALLENGE, and RESPONSE packets with an uninitialized sequence number. The sequence number is ignored for these packets, but it is now set to zero.
- Frequent issuance of "vos listvol" commands can no longer interfere with volume transaction idle timeout processing.
Since IBM AFS 3.5 the volserver has logged transaction status every 30 seconds to the VolserLog. In v0.184 the volserver logs the following lifecycle messages at level 0:
- trans id on volume id is older than s seconds
- trans id on volume id has timed out
- trans id on volume id has been idle for more than s seconds
On a busy volserver these messages can flood the VolserLog.
This change raises the level of messages 1 and 3 to 125 and introduces a new "Created trans id on volume id" message logged at level 5.
With this change level 0 logs unexpected termination of each transaction. Level 125 will include the 30 second updates for sites that require them.
The partition, volume parentid and transaction iflags fields have been added to each log message.
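The resulting VolserLog behavior can be summarized as a table of message levels. The sketch below assumes a message is written when its level is less than or equal to the configured log level; the dictionary keys are illustrative labels, not strings the volserver emits.

```python
# Levels as described in the text. Message 2 ("has timed out") is the
# one left at level 0.
MESSAGE_LEVELS = {
    "created": 5,       # "Created trans id on volume id"
    "older_than": 125,  # message 1: "... is older than s seconds"
    "timed_out": 0,     # message 2: "... has timed out"
    "idle": 125,        # message 3: "... has been idle for more than s seconds"
}


def messages_logged(configured_level: int) -> list:
    """Labels of the lifecycle messages written at the given log level."""
    return sorted(
        name for name, level in MESSAGE_LEVELS.items() if level <= configured_level
    )


for level in (0, 5, 125):
    print(level, messages_logged(level))
```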
- RPCs issued by vos listvol will no longer block in the volserver if the requested volume requires salvaging. The volume attachment retries can block the salvageserver from acquiring an exclusive volume lock resulting in a salvage failure and a soft-deadlock. From now on the vos listvol command will fail immediately.
- If the vice partition's backing store is unmounted or otherwise becomes unavailable the fileserver could terminate unexpectedly due to a segmentation fault. Beginning with this release the fileserver will survive but all requests for objects stored on the missing vice partition will fail.
Introduce the ability to configure random error injection during FetchStatus, FetchData, and StoreData RPC processing.
- Add File IDs to "FetchData Write Failure" FileLog messages.
- Ubik services:
- This release introduces the ability to configure a separate debug log level for ubik than for the application service. By default, when the "ubik_debug" level is unspecified or set to zero, the application's log level determines which "ubik: " log entries are written to the log.
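The fallback rule for the ubik log level is simple enough to state as a function. A minimal sketch, in which only the "ubik_debug" name comes from the text and everything else is illustrative:

```python
from typing import Optional


def effective_ubik_level(app_level: int, ubik_debug: Optional[int] = None) -> int:
    """Log level applied to "ubik: " entries.

    When ubik_debug is unspecified or zero, the application's level is used.
    """
    if ubik_debug is None or ubik_debug == 0:
        return app_level
    return ubik_debug


print(effective_ubik_level(5))       # unspecified: follows the application
print(effective_ubik_level(5, 0))    # zero: follows the application
print(effective_ubik_level(5, 125))  # explicit: ubik uses its own level
```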
Security improvements include volserver validation of destination volserver security policies prior to transmitting marshaled volume data. Prior to v0.184 the volservers were trusted to reject volumes whose security policy could not be enforced. Linux cache managers can no longer be keyed with rxkad tokens. A PAM module capable of managing tokens for AuriStorFS and/or Linux kernel kAFS has been introduced.
The UNIX Cache Manager underwent major revisions to improve the end user experience by revealing more error codes, improving directory cache efficiency, and overall resiliency. The cache manager implementation was redesigned to be more compatible with operating systems such as Linux and macOS that support restartable system calls. With these changes errors such as "Operation not permitted", "No space left on device", "Quota exceeded", and "Interrupted system call" can be reliably reported to applications. Previously such errors might have been converted to "I/O error". These changes are expected to reduce the likelihood of "mount --bind" and getcwd failures on Linux with "No such file or directory" errors.
A potentially serious race condition and reference counting error in the vol package shared by the Fileserver and Volserver could prevent volumes from being detached which in turn could prevent the Fileserver and Volserver from shutting down. After 30 minutes the BOSServer would terminate both processes. The reference counting errors could also prevent a volserver from marshaling volume data for backups, releases, or migrations.
This release moves the location of the cache manager's cmstate.dat from /etc/yfs/ to /var/yfs/ or /var/lib/yfs depending upon the operating system. The cmstate.dat file stores the cache manager's persistent UUID which must be unique. The cmstate.dat file must not be replicated. If virtual machines are cloned the cmstate.dat must be removed. The cmstate.dat file must not be managed by a configuration management system.
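Sites that clone virtual machines may want to script the removal. The following is a rough sketch, assuming it is run against the cloned image before the client is started; the function name and the `root` parameter are illustrative, and only the two directories named above are checked.

```python
from pathlib import Path

# Directories named in the text; which one applies depends on the
# operating system.
CANDIDATE_DIRS = ("var/yfs", "var/lib/yfs")


def remove_cmstate(root: str = "/") -> list:
    """Delete cmstate.dat under root and return the paths that were removed."""
    removed = []
    for relative_dir in CANDIDATE_DIRS:
        path = Path(root) / relative_dir / "cmstate.dat"
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```

Passing the mount point of a cloned image as `root` keeps the host's own cmstate.dat untouched.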
The release includes two new vos command options:
* "vos addsite -force"
* "vos listvol -id"
Finally, this release includes a Linux PAM module as well as support for the Amazon Linux 2 distribution and many more quality and performance improvements.
AuriStor, Inc. is pleased to sponsor and invite AFS and Linux kernel developers to the second Linux kernel AFS (kAFS) Hackathon and Birds of a Feather meeting. The hackathon and BoF will be co-located with the USENIX Vault '19 - Linux Storage and Filesystems Conference.
This release improves RX call reliability across network paths with a high degree of packet loss and/or round trip times larger than 60ms. The corrected bugs have been present in all IBM derived RX implementations dating back to the mid 90s. The impact of these bugs is an increased risk of timeouts and performance degradation for long lived calls over high latency network paths that periodically experience packet loss. Volume operations such as moves, releases, backups and restores over WAN connections are particularly susceptible due to the amount of data transmitted in each RPC.
One feature change is experimental support for RX windows larger than 255 packets (360KB). This release extends the RX flow control state machine to support windows larger than the Selective Acknowledgment table. The new maximum of 65535 packets (90MB) could theoretically fill a 100 gbit/second pipe provided that the packet allocator and packet queue management strategies could keep up.
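The window sizes quoted above can be reproduced with a short calculation. The per-packet payload of 1444 bytes used here is an assumption inferred from the quoted figures (it makes 255 packets come out near 360KB and 65535 packets near 90MB); it is not stated in the text.

```python
PAYLOAD_BYTES = 1444  # assumption: inferred from the figures in the text


def window_bytes(packets: int) -> int:
    """Bytes in flight when the window is completely full."""
    return packets * PAYLOAD_BYTES


def rtt_filled_ms(packets: int, link_bits_per_s: float) -> float:
    """Round trip time at which one full window equals the link's
    bandwidth-delay product."""
    return window_bytes(packets) * 8 * 1000 / link_bits_per_s


print(f"255 packets:   {window_bytes(255) / 1024:.0f} KiB")
print(f"65535 packets: {window_bytes(65535) / 2**20:.0f} MiB")
print(f"RTT covered at 100 gbit/s: {rtt_filled_ms(65535, 100e9):.2f} ms")
```

Under that assumption the maximum window keeps a 100 gbit/second path full only up to a round trip time of roughly 7.6 ms.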
A change to volume use statistics tracking when volumes are moved, copied, and restored. The AFS volume dump stream format which is used for volume archives and volume transfers can store the daily and weekly vnode access counts but none of the other extended volume statistics maintained by the fileserver. When a volume is moved it makes sense for the use counts to be migrated with the volume to the new location. When a volume is copied it makes sense that the new location should start its counters from zero instead of the values collected at the location that was used as the source. Finally, when restoring a volume or releasing a new snapshot of a volume to readonly or backup sites, the use counts should remain unaltered. Beginning with this release, when AuriStorFS v0.178 or later "vos" is used in combination with an AuriStorFS v0.178 or later destination "volserver" the desired use count management will take place. At the moment the weekly access counts are only accessible when using the "vos examine -format" switch.
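The use count policy described above reduces to three cases. A minimal sketch with illustrative names; the real logic lives in "vos" and the destination "volserver" and is not shown in this note.

```python
def destination_use_count(operation: str, source_count: int, existing_count: int = 0) -> int:
    """Use count the destination should hold after a volume operation."""
    if operation == "move":
        return source_count    # counts migrate with the volume
    if operation == "copy":
        return 0               # a new location starts from zero
    if operation in ("restore", "release"):
        return existing_count  # destination counts remain unaltered
    raise ValueError(f"unknown operation: {operation}")


print(destination_use_count("move", source_count=1200))
print(destination_use_count("copy", source_count=1200))
print(destination_use_count("release", source_count=1200, existing_count=300))
```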
Many more quality and performance improvements
AuriStor's RX implementation has undergone a major upgrade of its flow control model. Prior implementations were based on TCP Reno Congestion Control as documented in RFC5681; and SACK behavior that was loosely modelled on RFC2018. The new RX state machine implements SACK based loss recovery as documented in RFC6675, with elements of New Reno from RFC6582 on top of TCP-style congestion control elements as documented in RFC5681. The new RX also implements RFC2861 style congestion window validation.
When sending data the RX peer implementing these changes will be more likely to sustain the maximum available throughput while at the same time improving fairness towards competing network data flows. The improved estimation of available pipe capacity permits an increase in the default maximum window size from 60 packets (84.6 KB) to 128 packets (180.5 KB). The larger window size increases the per call theoretical maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec and on a 30ms RTT link from 23.1 mbit/sec to 49.29 mbit/sec.
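The theoretical per call maximum is one full window delivered per round trip. The sketch below assumes 1444 payload bytes per packet, a value inferred from the quoted window sizes rather than stated in the text.

```python
def max_throughput_mbit(window_packets: int, rtt_ms: float, payload_bytes: int = 1444) -> float:
    """One full window per round trip, expressed in mbit/sec."""
    bits_per_window = window_packets * payload_bytes * 8
    return bits_per_window * 1000 / rtt_ms / 1e6


for rtt_ms in (1, 30):
    for window in (60, 128):
        rate = max_throughput_mbit(window, rtt_ms)
        print(f"window={window:3d} rtt={rtt_ms:2d}ms -> {rate:.2f} mbit/sec")
```

With that payload size the four results come out at roughly 693.1, 1478.7, 23.1 and 49.3 mbit/sec.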
Workarounds for an IBM AFS and OpenAFS RX header userStatus field information leakage bug. This bug inadvertently interferes with the RX service upgrade mechanism that permits AuriStorFS clients (including Linux kafs) and services to detect each other without undesirable timeouts or extra round trips.
When an affected IBM or OpenAFS cache manager (or fileserver) establishes a connection to an AuriStorFS server the bug can result in an unintentional RX service upgrade. For example, if a pre-v0.175 fileserver incorrectly upgraded an incoming RX connection from RXAFS to RXYFS, it would mistakenly believe the client offered the RXYFSCB callback service; which it doesn't. The failure to establish a successful connection to the RXYFSCB service would cause the fileserver to reject the client's RXAFS requests with a VBUSY error.
A fileserver is supposed to be able to serve data from a .readonly or .backup volume while the volserver is dumping or forwarding the volume contents. This functionality introduced in IBM AFS 3.3 was fatally broken in AuriStorFS v0.157 when the volume disk interface was overhauled to avoid data corruption. Then starting with v0.168 "vos release" failed to terminate the volume transaction used to clone the RW volume to the RO site on the same server. Attempts to read from volumes that were exclusively in-use by the volserver would return VOFFLINE (106) errors.
As of AuriStorFS v0.175 release .readonly and .backup volumes can once again be attached to fileservers while a "vos release" or "vos dump" command is in process. Since some of the fixed defects were in "vos" and others in the fileserver both "vos" and the fileserver must be updated to v0.175 to ensure correct behavior.
A major security model change to the Backup Tape Controller (butc), backup coordinator command, and the backup service to address OPENAFS-SA-2018-001.txt.
Starting with v0.175 butc supports:
- yfs-rxgk and rxkad authentication
- AES256-CTS-HMAC-SHA1-96 or 56-bit fcrypt wire encryption
- super user authorization
- auditing of all remote procedure call requests
The new security model is incompatible with the existing "backup" and "butc" processes. The new "butc" always executes using "localauth" credentials just as any other cell service does; it can no longer be executed using tokens obtained via aklog.
The butc service will by default require all incoming RPCs to be authenticated as a super user either via use of -localauth credentials or end user identities found in the UserListExt or ReaderList bosserver configuration.
As a side effect of these changes, both backup and butc gain IPv6 support.
As the new security model is incompatible with the existing deployed butc and backup processes, the 0.175 version includes configuration knobs to force the use of the old security model for backward compatibility. Use of these knobs restores the privilege escalation vulnerability. Please contact AuriStorFS support if your site requires use of this configuration.
New data input validation improvements within the vlserver and volserver. These changes ensure that the vlserver cannot store volume location records referencing invalid fileservers or volume site parameters; and that the volserver cannot forward volume data to volservers that are not registered with the cell's location service.
The same validation has been added to vos to ensure that it cannot be instructed to violate cell constraints.
- Many more quality and performance improvements
In conjunction with Apple's release of macOS Mojave (10.14) to the general public, AuriStor announces the release of AuriStorFS v0.174 for macOS Mojave. Both AuriStorFS clients and servers can be installed.
AuriStor announces the release of AuriStor File System v0.174. In addition to the usual mix of bug fixes and functionality improvements, the v0.174 release includes a very special gift: A new x86_64 assembly language implementation of the AES256-CTS-HMAC-SHA1-96 encryption algorithm for Linux and macOS. This implementation leverages the following Intel processor extensions (when available):
- Advanced Encryption Standard New Instructions (AES-NI)
- Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4)
- Advanced Vector Extensions (AVX, AVX2)
Originally intended for use by the Linux kernel module, the AuriStor implementation of AES256-CTS-HMAC-SHA1-96 is 2.4 times faster than OpenSSL and Apple's Common Crypto assembly language implementations. As a result, AuriStor has decided to leverage its implementation exclusively on Linux and macOS for servers, and administration tools.
The AuriStor assembly language implementation is ten times faster than the C language implementation used by previous releases of the AuriStorFS cache manager on x86_64 Linux.
On processors that implement AES-NI and AVX2 the performance cost of yfs-rxgk integrity protected and encrypted connections compared to rxnull unprotected connections is expected to be minimal. The Intel Core i5-4250U CPU @ 1.30GHz (Haswell 22nm), a low-end consumer processor, can compute (encrypt, sign, verify, decrypt) better than 217,000 yfs-rxgk packets per second (or 2.3 Gbit/second) per core.
A 20-core server class processor with 10 cores dedicated to Rx listener threads and 10 cores remaining for application service threads (where cryptographic operations are performed) can saturate dual-bonded 10gbit/second network interfaces with yfs-rxgk protected traffic.
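The saturation claim can be sanity checked with the per core figure quoted for the consumer processor. This is a back-of-the-envelope sketch: it assumes throughput scales linearly with core count and that a server class core is at least as fast as the quoted 2.3 Gbit/second, neither of which is measured here.

```python
def crypto_capacity_gbit(cores: int, per_core_gbit: float = 2.3) -> float:
    """Aggregate yfs-rxgk throughput, assuming linear scaling across cores."""
    return cores * per_core_gbit


def can_saturate(cores: int, links: int, link_gbit: float = 10.0, per_core_gbit: float = 2.3) -> bool:
    """True when the crypto capacity meets or exceeds the combined link rate."""
    return crypto_capacity_gbit(cores, per_core_gbit) >= links * link_gbit


# Ten cores performing cryptographic operations against two bonded 10 gbit links.
print("capacity (Gbit/s):", crypto_capacity_gbit(10))
print("saturates 2 x 10 gbit:", can_saturate(10, 2))
```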
One customer compared "vos release" of a small volume storing 10GB in 5000 files and directories between v0.167 and v0.173 on RHEL 6.9 x86_64. It observed:
- a 24% reduction in clock time to complete the operation
- a 100% increase in the peak number of packets sent per second
The reductions in processor time per packet result in reduced per-packet latency and an increased capacity to scale the number of simultaneous RPCs per file server, volume server, location server and protection server.
The incentive for sites to migrate from the 1980s rxkad to yfs-rxgk is greater than ever.
v0.170 is primarily a performance improvement release. AuriStor RX v0.170 is the world's first implementation capable of transferring more than 5.5TB per call. For the first time in AFS history, volumes larger than 5.5TB can be moved, replicated, backed up and restored. v0.170 includes Meltdown and Spectre optimizations for UBIK services reducing by more than 50% the number of syscalls required to process UBIK requests. The v0.170 release includes more than 400 changes compared to v0.168. v0.169 was not publicly released.
v0.168 is a critical bug fix release addressing a fileserver denial of service vulnerability [CVE-2018-7444] and a client side bug in fs setacl -negative which generates more permissive access control lists than intended [CVE-2018-7168]. The v0.168 fileserver provides cell administrators the ability to prevent clients incorporating the bug from storing ACLs. This release also adds support for Red Hat Enterprise Linux 7.5 kernels and includes optimizations to reduce the impact of Meltdown and Spectre mitigations. v0.168 also includes major improvements to the volume transaction lifecycle. Interrupted or failed transactions no longer require cell administrators to manually clean up temporary volumes. The v0.168 release includes nearly 400 changes compared to v0.167.
v0.167 is a critical bug fix release addressing a denial of service vulnerability [CVE-2017-17432] in all services and clients. This release also adds support for the forthcoming Linux 4.15 kernel and two new vos subcommands, movesite and copysite.
v0.164 is a bug fix and performance release. This release includes a major rewrite of core cache manager I/O pathways on Linux supporting direct I/O, cache bypass, and read-ahead. This release includes additional improvements and bug fixes to UBIK beyond those shipped in v0.163 to support mixed OpenAFS and AuriStorFS deployments.
v0.163 contains major updates to the UBIK database replication protocol implementation that increase resiliency to peer communication failures and permit sites to mix IBM/OpenAFS and AuriStorFS servers without introducing single points of failure. These changes combined with those included in v0.162 simplify the migration from OpenAFS to AuriStorFS. v0.163 introduces Linux 4.14 kernel support. File server detection of and protection against unresponsive cache manager callback service implementations.
v0.162 contains AFS3-compatibility changes for the UBIK database replication protocol that permit AuriStorFS servers to be deployed in IBM AFS and OpenAFS cells without configuring them as clones. First release with Fedora 27 support. Bug fixes and on-going improvements.
v0.160 introduces macOS High Sierra and Apple File System support. On Linux, exporting the /afs file namespace via Linux nfsd using NFS2, NFS3, and NFS4 is now supported. Reduced memory utilization by the RX networking stack. File server workaround for deadlocks that are known to occur within IBM AFS and OpenAFS Unix cache managers. Bug fixes and general improvements.
v0.159 introduces the "vos eachfs" command. Linux 4.13 kernel support. Continued performance improvements and bug fixes.
Linux 4.12 kernel support. Fedora 26 support. Fileserver support for XFS and BTRFS reflinks for improved vice partition copy-on-write performance. Volserver and "vos" support for quotas larger than 2TB. Linux cache manager performance enhancements to address parallel workflows. macOS fix for Orpheus' Lyre. On-going bug fixing and improvements.
Nico Williams, Viktor Dukhovni and Jeffrey Altman announced the discovery of the "Orpheus' Lyre puts Kerberos to Sleep" bug:
As the name suggests, this implementation flaw can result in a failure of Kerberos mutual authentication. Kerberos is supposed to provide a secure method of network authentication impervious to man-in-the-middle attacks. Fortunately, the protocol is secure but a mistake made by many implementations permits an attacker to successfully perform service impersonation and in conjunction with credential delegation (ticket forwarding) client impersonation. The attack is silent and cannot be detected.
This is a client-side vulnerability, so it must be fixed by patching the client systems. Systems that have more than one Kerberos implementation installed must obtain patches from all of the implementations to be secure.
The MIT Kerberos implementation was never vulnerable. As patches for other implementations become available the https://www.orpheus-lyre.info/ site will be updated to indicate that.
Yesterday Microsoft issued patches and those should in my opinion be treated as critical with minimal delays before deployment. Heimdal also issued a patch which is included in version 7.4.
AuriStorFS bundles Heimdal when the local operating system's Kerberos and GSS-API cannot satisfy its requirements. The affected platforms include:
- Apple MacOS (all versions)
- Solaris (all versions)
- Microsoft Windows (all versions)
- Apple iOS (all versions)
This 1.6.21 release of OpenAFS includes a fix to Rx which improves the performance of Rx connections between OpenAFS and AuriStorFS when the OpenAFS peer is writing bulk data to the AuriStorFS peer. Examples include:
- OpenAFS cache manager issuing RXAFS_StoreData calls to AuriStorFS file servers
- OpenAFS volserver forwarding volume data to an AuriStorFS volserver
- OpenAFS volserver dumping volume data to an AuriStorFS "vos dump"
- OpenAFS "vos restore" restoring volume data to an AuriStorFS volserver
There are of course other scenarios involving backups, bulk vlserver queries, etc.
The fix avoids the introduction of 100ms delays as the AuriStor Rx peer attempts to re-open a call window which had been closed due to a lack of receive buffers while waiting for the incoming data to be consumed.
New features include support for IBM TSM in the AuriStorFS Backup Tape Controllers. "vos eachvol" enhancements. Faster "pts examine" performance. Automated salvager repair of corrupted volume vnode index file entries. New "vos status" command provides more informative volserver transaction status output including bytes sent and received for each call when both "vos" and "volserver" are v0.150 or above. Linux 4.11 kernel support. "fs" commands now support the -nofollow switch. Many bug fixes and reliability improvements.
This Wednesday AuriStor, Inc. is sponsoring a Hackathon and BOF in support of kAFS and AF_RXRPC development at the Linux Foundation's annual Vault conference.
What are kAFS and AF_RXRPC?
AF_RXRPC is an implementation of the Rx RPC protocol in the Linux mainline network stack, exposed as a socket family accessible both to userland and in-kernel processes.
kAFS is an implementation of the AFS and AuriStorFS file system client in the Linux mainline kernel.
Why are AF_RXRPC and kAFS important?
The AFS file system namespace has been available on Linux as a third party add-on since the IBM days. The IBM AFS derived implementations suffer:
- performance limitations due to the existence of a global lock to protect internal data structures
- license incompatibility with GPL_ONLY licensed kernel functionality that further restricts performance and functional capabilities
In addition, out of tree file system modules:
- are not a standard component of most Linux distributions thereby preventing ubiquitous access to the /afs file system namespace
- are not kept in sync with core filesystem and network layer changes in the Linux kernel by the developers responsible for those changes
Collectively, these issues increase the hurdles to use of the /afs file system namespace.
- Organizations must be careful not to deploy new Linux kernel versions until such time as updated AFS or AuriStorFS kernel modules are developed and distributed.
- Organizations cannot obtain the full benefit of the latest hardware whether that be hardware support for advanced cryptographic operations, patented processes such as rcu, and other techniques that can scale file system access to tens or hundreds of cpu cores.
- Lack of common distribution complicates the use of the /afs file namespace in support of Linux container based deployments as most organizations are unwilling to or unable to deploy custom kernels across their internal and cloud (aws, azure, ...) infrastructures.
What is the History of kAFS and AF_RXRPC?
David Howells began work on an in-tree AFS client for Linux circa 2001. Unlike the IBM AFS derived cache manager, David's implementation is not a monolithic file system and proprietary network stack designed for portability across operating systems. Instead, David's AFS client is designed as separate modular components that are integrated into Linux to maximize their usefulness not only for AFS but for a broader class of applications:
- Instead of implementing Rx as a proprietary component of the AFS file system, David added Rx as a native socket family integrated with the Linux kernel networking stack at the same layer as UDP and TCP processing. This produces noticeable reductions in packet processing overhead. At the same time, the Rx RPC protocol becomes readily available as a lightweight secure RPC for userland applications. As a demonstration of how easy it is to use, David implemented much of the AFS administration command suite (bos, pts, vos) in Python by combining a Python XDR class with AF_RXRPC network sockets.
- Instead of AFS Process Authentication Groups (PAGs), David designed the Linux Keyrings which are now a core Linux component used in support of many file systems and network identity solutions.
- David developed the FS-Cache file system caching layer which is used in support of NFS* and CIFS file systems.
- David's kAFS is the AFS and AuriStorFS specific file system functionality including the callback services.
Unfortunately, David Howells has received minimal support from the AFS user community. As a result, neither AF_RXRPC nor kAFS has been included in any major Linux distribution.
Why is AuriStor, Inc. contributing?
AuriStor, Inc. has invested substantial resources into its AuriStorFS Linux client in support of Red Hat Enterprise Linux, Fedora, CentOS, Debian and Ubuntu and will continue to do so. Yet, AuriStor, Inc. recognizes that widespread adoption of AuriStorFS servers for large scale Enterprise and Research deployments requires higher performance, greater scale and easier maintenance for Linux systems.
AuriStor, Inc. also recognizes that many of the workflows that have relied upon the /afs file namespace for software and configuration distribution are migrating to containers. That transition is not without its own challenges related to the management of Container identity and authentication to persistent network based resources. AuriStor, Inc. believes that the global /afs file namespace combined with the AuriStor Security Model (combined identity authentication and multi-factor constrained elevation authorization) are best suited to addressing the outstanding Container deployment issues.
AuriStor, Inc. believes that only through a native in-tree client can these issues be addressed.
What is AuriStor, Inc. contributing?
AuriStor, Inc. has leveraged its expertise and extensive quality assurance infrastructure to identify flaws in the AF_RXRPC and kAFS implementations. Over the last year hundreds of corrections and enhancements have been merged into the Linux mainline tree. Missing functionality has been identified and is being implemented one piece at a time.
As kAFS approaches production readiness, AuriStor, Inc. will contribute native AuriStorFS client support, including an implementation of the "yfs-rxgk" security class, to the AF_RXRPC socket family.
It is our hope that by the end of 2017 kAFS and AF_RXRPC will be ready for inclusion in major Linux distributions side-by-side with NFS* and CIFS.
AuriStor, Inc. is also working with major players in the Container eco-system and the Linux Foundation to address the identity management problem. When successful, it will be possible to launch Containers with network credentials such as Kerberos tickets and AFS/AuriStorFS tokens managed by the host. AuriStor, Inc. believes that this functionality combined with the /afs file namespace will allow true portability of Containerized processes across private and public cloud infrastructures.
v0.145 is the latest in the on-going efforts to improve the fileserver's ability to perform volume operations when the volume is under heavy load and to recover when the unexpected happens.
One of the strengths of the /afs model is the ability to move, release, backup, and dump volumes while they are being accessed by clients under production loads. For example, the following scenario should in theory be handled without a hiccup:
- Take two file servers each with at least one vice partition.
- Create a volume V on fs1/a
- Add RO sites on fs1/a and fs2/a
- On at least two clients execute "iozone -Rac" in separate directories of volume V using the RW path
- On at least one client start a loop that lists one of the directories with stat info that the iozone test is writing to from V.backup.
- On at least one client start a loop that lists one of the directories with stat info that the iozone test is writing to from V.readonly.
- Repeat the following process in a tight loop:
  - "vos backup V"
  - "vos release V"
  - "vos move V fs2 a"
  - "vos backup V"
  - "vos release V"
  - "vos move V fs1 a"
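The tight loop in the last step can be sketched as a small command generator. This is a dry-run illustration only: the function name and defaults are hypothetical, and the command strings reproduce the shorthand used in the list above rather than complete vos invocations (a real "vos move" takes additional arguments, such as the source server and partition).

```python
def stress_cycle(volume="V", servers=("fs1", "fs2"), partition="a"):
    """Yield the commands for one round trip of the volume between two servers.

    The volume is assumed to start on servers[0], so it is moved to
    servers[1] first and then back again.
    """
    for destination in (servers[1], servers[0]):
        yield f"vos backup {volume}"
        yield f"vos release {volume}"
        yield f"vos move {volume} {destination} {partition}"


def plan(round_trips):
    """Return the flat list of commands for the requested number of round trips."""
    commands = []
    for _ in range(round_trips):
        commands.extend(stress_cycle())
    return commands


if __name__ == "__main__":
    # Print the plan instead of executing it; this sketch never runs vos.
    for line in plan(round_trips=2):
        print(line)
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; a real harness would also start the iozone writers and the directory listing loops from the earlier steps before entering the loop.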
Organizations do not actively operate their cells in this fashion but the expectation is that "if they did, it should work." Of course, the answer is "it didn't before v0.145". Why not, and where did it fail?
For those of you that are unfamiliar with the fileserver architecture, here is the process list as reported by the bosserver for a fileserver:
[C:\]bos status great-lakes.auristor.com -long
Instance dafs, (type is dafs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sat Dec 24 12:19:03 2016 (3 proc starts)
    Command 1 is '/usr/libexec/yfs/fileserver'
    Command 2 is '/usr/libexec/yfs/volserver'
    Command 3 is '/usr/libexec/yfs/salvageserver'
    Command 4 is '/usr/libexec/yfs/salvager'
The first three commands execute a set of dependent processes. The fileserver process:
- Registers the fileserver with the VL service and communicates with the PT service
- Processes all requests from AFS clients (aka cache managers) via the RXAFS and RXYFS Rx network services.
- Maintains a cache of all volume headers
- Is the exclusive owner of all volumes. The volserver and salvageserver processes request readonly or exclusive access to volumes from the fileserver.
- Issues requests to the salvageserver to perform consistency checks and repair volumes when a problem is detected with the volume headers, the on-disk data, or other.
The goal is to ensure that a volume is available to the fileserver process as long as there are active requests.
Each "vos" command is implemented by one or more VL and VOL RPCs. The VOL RPCs are processed by the volserver. The volserver can:
- Submit a query to the fileserver to obtain the necessary data to satisfy the request
- Request readonly access to a RW volume which can be used to produce a new clone (for .readonly or .backup or .roclone)
- Request exclusive access to a RW, RO, BK or an entire volume group
- Request a volume be salvaged by the salvageserver
When the salvageserver is asked to salvage a volume it requests exclusive access to the volume group from the fileserver. There are a lot of moving parts.
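To make the ownership model easier to follow, here is a toy sketch of a fileserver that owns every volume and lends out readonly or exclusive access on request. It is not the AuriStorFS implementation. The class and method names are hypothetical, and two rules are assumptions made for illustration: readonly access can be shared, and a volume is unavailable to clients while any process holds it exclusively.

```python
class VolumeOwner:
    """Toy stand-in for the fileserver's role as owner of all volumes."""

    MODES = ("readonly", "exclusive")

    def __init__(self):
        # volume name -> list of (requester, mode) currently holding access
        self._holders = {}

    def request(self, volume, requester, mode):
        """Grant access if it does not conflict with existing holders."""
        if mode not in self.MODES:
            raise ValueError(f"unknown access mode: {mode}")
        holders = self._holders.setdefault(volume, [])
        if mode == "exclusive" and holders:
            return False  # assumption: exclusive access cannot be shared
        if any(held == "exclusive" for _, held in holders):
            return False  # someone else already has the volume exclusively
        holders.append((requester, mode))
        return True

    def release(self, volume, requester):
        """Hand the volume back to the owner."""
        holders = self._holders.get(volume, [])
        self._holders[volume] = [(r, m) for r, m in holders if r != requester]

    def online(self, volume):
        """Assumption: clients can use the volume unless it is held exclusively."""
        return not any(m == "exclusive" for _, m in self._holders.get(volume, []))


owner = VolumeOwner()
cloned = owner.request("V", "volserver", "readonly")         # e.g. making a clone
salvage = owner.request("V", "salvageserver", "exclusive")   # must wait its turn
```

In this model every release is a handoff back to the owner, which is the step that has to stay consistent for the volume to return to service.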
For each of the "vos backup", "vos release" and "vos move" commands the volserver will exercise different combinations of query, readonly and exclusive access to volumes. The various modes of client requests to the fileserver:
- reading from the .backup
- reading from the .readonly
- writing to the RW
Together these produce contention between the volserver and the fileserver for control of the volume.
The expected behavior is that a volume will be offline for the shortest amount of time possible, that the client will retry the request in a timely manner, and, most importantly, that the client will not treat a temporary outage as a fatal error.
What could go wrong?
For starters, the client retry algorithm is quite poor. Whenever a volume is busy or offline the client will sleep for 15 seconds. In other words, the client will block for an eternity and when it finally does retry it's likely to find the volserver with exclusive ownership of the volume and sleep again for another 15 seconds. The v0.145 release does not fix the client behavior. That will come in a future release but for testing purposes we removed all of the sleeps from the clients and forced immediate retries in order to maximize the contention.
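The cost of a fixed 15 second sleep can be illustrated with a small simulation. This is a toy model rather than the client's actual code: the function name, the busy windows, and the 0.5 second alternative delay are all made up for the example. The volume is treated as unavailable during each window, and the client retries after a fixed delay until it finds the volume free.

```python
def time_to_success(busy_windows, retry_delay, max_attempts=1000):
    """Return the simulated time at which a request first finds the volume free.

    busy_windows is a list of (start, end) times, in seconds, during which
    the volume is busy or offline; end is exclusive.
    """
    now = 0.0
    for _ in range(max_attempts):
        if not any(start <= now < end for start, end in busy_windows):
            return now
        now += retry_delay  # sleep, then try again
    raise TimeoutError("volume never became available")


# Hypothetical schedule: short outages that happen to line up with the retries.
windows = [(0.0, 1.0), (14.5, 15.5), (29.8, 30.2)]

slow = time_to_success(windows, retry_delay=15.0)  # fixed 15 second sleep
fast = time_to_success(windows, retry_delay=0.5)   # much shorter delay
```

With this made-up schedule the outages total well under three seconds, yet the client that sleeps 15 seconds keeps waking up inside a later window, while the client with the shorter delay proceeds as soon as the first outage ends.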
Once sufficient contention was generated, we observed that after two round trips of the volume moving from fs1/a to fs2/a to fs1/a to fs2/a and back to fs1/a, the volume would go offline and not come back. We noticed that the fileserver would be asked to put the volume into an online state but would immediately request a salvage. The salvageserver would verify the state, ask the fileserver to put the volume into service, but the fileserver would immediately request another salvage. This would repeat a dozen times before the volume would be taken offline permanently. Attempts to manually salvage the volume would succeed but the volume would still not return to an online state. The failure was 100% reproducible.
For those of you that remember IBM AFS and OpenAFS 1.4.x and earlier, there was no on-demand attachment or salvaging of volumes. The fileserver was much simpler. The fileserver forced a salvage of all volumes at startup and then attached them. The volserver always obtained exclusive access to volumes. The fileserver never cached any volume metadata. If something went wrong with a volume, it was taken offline until an administrator intervened.
The struggles of the last few months have all been directly attributable to the changes made as part of the demand attach functionality introduced in OpenAFS 1.6.x. We have encountered deadlocks, on-disk volume metadata corruption, copy-on-write data corruption, salvager failures, and now fileserver volume metadata cache corruption.
After two round-trips of volume movement while the RW, RO, and BK are all under active usage, the fileserver's volume group cache would end up out of sync with the volserver. The fileserver would believe the volume was still owned by the volserver when it wasn't. The fileserver would request the salvageserver to verify the volume state but nothing the salvageserver could do would make a difference.
As of v0.145 the handoff from volserver to fileserver has been corrected. Now the volume can be backed up, released, and moved continuously while each of the volume types is actively accessed. No salvages. No VNOVOL errors. The iozone benchmark will continue to function even if the fileserver or volserver processes are periodically killed.
With a modified client to reduce the delays after receiving a VBUSY or VOFFLINE error, the iozone benchmark continues to function although at a lower rate of throughput. There are periodic pauses when large copy-on-write operations need to be performed. Which brings me to our New Year's resolution.
In 2017, AuriStor aims to remove the copy-on-write delays. The COW delays are due to the need to copy the entire backing file each time a file is modified after a volume clone is created. On Linux distributions that include "xfs reflinks" support for vice partitions, AuriStor fileservers will be able to complete COW operations in constant time without regard to the length of the file being modified. This change coupled with client-side improvements to the retry algorithms will significantly reduce the performance hit after volume clone operations.
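The difference between a full copy and a reflink can be sketched with the Linux FICLONE ioctl, the same mechanism used by "cp --reflink". This is only an illustration of the general technique under stated assumptions, not how the AuriStor fileserver implements copy-on-write: the function name is hypothetical, and the sketch falls back to an ordinary byte-for-byte copy when the file system does not support reflinks.

```python
import fcntl
import shutil

# Linux ioctl request number for FICLONE, as defined in <linux/fs.h>.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Make dst_path a copy of src_path, sharing blocks when possible.

    Returns "reflink" if the file system shared the data blocks (the cost
    does not grow with file length) or "copy" if every byte had to be copied.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # Reflinks unsupported here (e.g. ext4 or tmpfs): copy the data.
            shutil.copyfileobj(src, dst)
            return "copy"
```

On a file system without reflink support the fallback path does the same work as the copy described above, which is why the delay scales with the length of the backing file.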
Have a Happy New Year! The AuriStor team is looking forward to an excellent year.
Today OpenAFS announced Security Advisory OPENAFS-SA-2016-003 and released 1.6.20 which is an urgent security release for all versions of OpenAFS. IBM AFS cache managers and all OpenAFS Windows clients are also affected. There is no update to the OpenAFS Windows client.
AuriStor File System clients and servers do not experience this information leakage. However, volumes migrated to AuriStorFS file servers from OpenAFS or IBM AFS file servers will retain the information leakage.
In addition to the impact described in the announcement, it is worth noting that all backups and any archived dump files will contain information leakage. Restoring a backup or dump file containing information leakage will restore that leaked information to the file servers where it will be delivered to cache managers.
Salvaging restored volumes with the -salvagedirs option is required to purge the information leakage.
It is worth emphasizing that IBM AFS and OpenAFS volserver operations including all backup operations occur in the clear. Therefore, all leaked information will be visible to passive viewers on the network segments across which volume backups and moves occur.