AuriStor File System News
The AuriStorFS v2021.05-31 release includes support for macOS 14 Sonoma and Linux 6.6 kernels.
- macOS 14 Sonoma for Apple Silicon and Intel x64.
- Linux 6.6 kernels.
UNIX Cache Managers
If an AuriStorFS cache manager is unable to use the yfs-rxgk security class when communicating with an AuriStorFS fileserver, it must assume the fileserver is IBM AFS 3.6 or OpenAFS and upgrade the fileserver's type to AuriStorFS if an upgrade probe returns a positive result. Once a fileserver's type is identified as AuriStorFS, the type should never be reset, even if communication with the fileserver is lost or the fileserver restarts.
If an AuriStorFS fileserver is replaced by an OpenAFS fileserver on the same endpoint, then the UUID of the OpenAFS fileserver must be different. As a result, the OpenAFS fileserver will be observed as distinct from the AuriStorFS fileserver that previously shared the endpoint.
Prior to this release there were circumstances in which the cache manager discarded the fileserver type information and would fail to recognize the fileserver as an AuriStorFS fileserver when yfs-rxgk could not be used. This release prevents the cache manager from resetting the type information if the fileserver is marked down.
If a fileserver's location service entry is updated with a new uniquifier value (aka version number), this indicates that one of the following might have changed:
- the fileserver's capabilities
- the fileserver's security policy
- the fileserver's knowledge of the cell-wide yfs-rxgk key
- the fileserver's endpoints
Beginning with this release the cache manager will force the establishment of new Rx connections to the fileserver when the uniquifier changes. This ensures that the cache manager will attempt to fetch new per-fileserver yfs-rxgk tokens from the cell's RXGK service, enforce the latest security policy, and not end up in a situation where its existing tokens cannot be used to communicate with the fileserver.
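The two cache manager rules above (a learned fileserver type is never reset, and a uniquifier change forces new Rx connections) can be illustrated with a minimal sketch. Every name here (`FileserverRecord`, `note_upgrade_probe`, `update_from_location_service`) is hypothetical and chosen for illustration; this is not the cache manager's actual data structure.

```python
from dataclasses import dataclass, field
from enum import Enum


class ServerType(Enum):
    LEGACY = "IBM AFS 3.6 or OpenAFS"  # assumed until a probe proves otherwise
    AURISTORFS = "AuriStorFS"


@dataclass
class FileserverRecord:
    """Hypothetical per-fileserver state kept by a cache manager."""
    uuid: str
    uniquifier: int
    server_type: ServerType = ServerType.LEGACY
    is_up: bool = True
    connections: list = field(default_factory=list)

    def note_upgrade_probe(self, positive: bool) -> None:
        # The type is only ever upgraded; a negative probe changes nothing.
        if positive:
            self.server_type = ServerType.AURISTORFS

    def mark_down(self) -> None:
        # Losing contact must not discard the learned type.
        self.is_up = False

    def update_from_location_service(self, uniquifier: int) -> bool:
        """Return True if the existing Rx connections were discarded."""
        if uniquifier == self.uniquifier:
            return False
        # Capabilities, security policy, key knowledge or endpoints may have
        # changed, so force new connections (and new token acquisition).
        self.uniquifier = uniquifier
        self.connections.clear()
        return True
```

In this sketch, marking the server down leaves `server_type` untouched, and only a changed uniquifier clears the connection list.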
- Fix incorrect output when populating the server list for a service fails. The stashed extended error explaining the cause of the failure was not displayed.
- If a cell has neither _afs3-prserver._udp. DNS SRV records nor AFSDB records, the lookup of the cell's protection servers would fail if there were no local cell configuration details. The fallback to use _afs3-vlserver._udp. DNS SRV records did not work. This is corrected in this release.
Beginning with AuriStorFS v0.184, fileservers refused to return an EPERM error to an IBM or OpenAFS client or prior AuriStorFS clients. These clients do not always preserve and pass the EPERM error to the application. In some situations, the EPERM would be replaced with either EIO or ENOENT. An ENOENT returned when searching a directory can result in a false negative lookup being cached by the Linux VFS. If the cache manager does not assert, via a capability flag, that EPERM errors will be preserved, then EPERM is translated to EACCES by the fileserver.
It has been discovered that this translation breaks the proper operation of "cp -p" when setting a file's ownership, group or mode fails. Whereas an EPERM will be silently ignored, an EACCES is treated as a fatal error. This release restricts the set of operations for which EPERM will be translated to EACCES to Fetch operations. Store operations will once again return EPERM to all clients.
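The revised translation rule can be sketched as a small function. The operation labels ("fetch", "store") and the `client_preserves_eperm` parameter are assumptions made for illustration; they stand in for the fileserver's real RPC classification and capability flag, which are not shown in these notes.

```python
import errno


def translate_error(code: int, operation: str, client_preserves_eperm: bool) -> int:
    """Return the error code a fileserver would send under the rule above.

    operation: hypothetical label, either "fetch" or "store".
    client_preserves_eperm: True if the client asserted the capability.
    """
    if (
        code == errno.EPERM
        and operation == "fetch"
        and not client_preserves_eperm
    ):
        # Only Fetch operations to clients lacking the capability are translated.
        return errno.EACCES
    # Store operations return EPERM to all clients, so tools such as
    # "cp -p" can ignore it as they normally would.
    return code
```

With this rule a failed ownership change (a Store operation) reaches the application as EPERM regardless of the client type.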
If a RXYFSCB_TellMeAboutYourself RPC to a cache manager fails, the fileserver must not remove the failed endpoint from the host record before retrying with RXAFSCB_TellMeAboutYourself. Doing so prevents the proper logging of a failed RXAFSCB attempt.
BOS Overseer Service
bos exec commands could report failure even though the requested command successfully executed. This was due to the SIGCHLD signal handler consuming the exit status from the SBOZO_Exec child process before the SBOZO_Exec thread could read it. The signal handler monitors for process termination of bnode child processes and is executed any time a child process completes. This release prevents the signal handler from reading the exit status of SBOZO_Exec child processes.
The SBOZO_Exec RPC no longer returns the return code from system(). This value is not the value passed to exit() by the child process. On failure, system() can return -1, which bos exec interprets as RX_CALL_DEAD and reports as a remote network timeout. This release will return either the exit() status or the platform-specific signal number that resulted in the termination of the child process.
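The difference between the raw value returned by system() and the child's exit() status can be seen with the standard POSIX wait-status macros. The following is a minimal sketch for POSIX systems using Python's `os` module; the function name `decode_wait_status` is hypothetical and this is not the bosserver implementation.

```python
import os
from typing import Tuple


def decode_wait_status(status: int) -> Tuple[str, int]:
    """Decode a POSIX wait status such as the value returned by os.system()."""
    if status == -1:
        # system() could not create or wait for the child at all.
        return ("error", -1)
    if os.WIFEXITED(status):
        # Normal termination: recover the value passed to exit().
        return ("exit", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # Abnormal termination: recover the terminating signal number.
        return ("signal", os.WTERMSIG(status))
    return ("unknown", status)


raw = os.system("exit 3")
print(raw, decode_wait_status(raw))
```

The raw status and the decoded exit value generally differ, which is why returning the raw system() result to the caller is misleading.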
The AuriStorFS v2021.05-30 release includes important bug fixes for clients, servers and administrative tooling.
Linux 6.5 kernels.
Fix for rxkad_krb5 superuser tokens
OpenAFS security update OPENAFS-SA-2013-003 "Brute force DES attack permits compromise of AFS cell" introduced a bug that results in the expiration of rxkad_krb5 protected superuser connections that are intended to never expire. The security advisory was addressed by a feature change which extended the rxkad security class to support the use of Kerberos v5 service tickets with stronger encryption keys such as aes256-cts-hmac-sha1-96. The Kerberos v5 service ticket is exchanged and decrypted during the CHALLENGE | RESPONSE phase of establishing an Rx connection protected by rxkad. Once validated, any non-DES session key is used to derive a 56-bit key that can be used by rxkad for wire privacy. This Kerberos v5 extended version of rxkad is referred to as "rxkad_krb5" to distinguish it from the earlier versions that only supported DES encrypted service tickets.
rxkad was designed based on Kerberos v4, which used unsigned 32-bit integers to represent timestamps, with a valid range of Unix epoch (0) through 07 Feb 2106 06:28:15 GMT (0xffffffff). Kerberos v5 timestamps are represented as signed 32-bit integers with a valid range of 13 Dec 1901 20:45:52 GMT (-2147483648) through 19 Jan 2038 03:14:07 GMT (0x7fffffff). This mismatch in the valid time ranges resulted in a subtle bug when generating rxkad superuser tokens such as those created for server-to-server authentication and the -localauth switch for tools such as bos, pts, and vos.
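The two timestamp ranges can be checked with a few lines of arithmetic. This is a minimal sketch; the constant and function names are chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

KRB4_MAX = 0xFFFFFFFF   # largest unsigned 32-bit timestamp
KRB5_MIN = -(2 ** 31)   # smallest signed 32-bit timestamp
KRB5_MAX = 0x7FFFFFFF   # largest signed 32-bit timestamp


def as_utc(seconds: int) -> datetime:
    """Convert seconds relative to the Unix epoch into a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def clamp_to_krb5(seconds: int) -> int:
    """Clamp a requested time to what a signed 32-bit field can hold."""
    return max(KRB5_MIN, min(seconds, KRB5_MAX))


for label, value in [("krb4 max", KRB4_MAX), ("krb5 min", KRB5_MIN), ("krb5 max", KRB5_MAX)]:
    print(label, as_utc(value).isoformat())
```

Clamping the superuser end time 0xffffffff to the signed range yields 0x7fffffff, which is the value the acceptor then sees.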
An rxkad superuser token is defined as having a start time of Unix Epoch (0) and an end time of 0xffffffff. A superuser token can never expire even though the maximum lifetime of an rxkad token is 30 days (2,592,000 seconds). Since the maximum end time that can be represented by a Kerberos v5 service ticket is 0x7fffffff, that is what is set when the requested end time exceeds the maximum. When the Rx acceptor validates the rxkad_krb5 RESPONSE packet, it observes the lifetime range as not being equivalent to a never expiring ticket and therefore the maximum token lifetime applies.
This is not normally a problem because most Rx connections do not exist for longer than 30 days. However, the fileserver connections to the protection service do and the fileserver does not expect these connections to expire. Therefore, there was no logic in place to handle rxkad expiration errors.
There are two changes that have been introduced in v2021.05-30 related to rxkad_krb5 superuser token lifetimes. The first is that when validating an rxkad_krb5 RESPONSE packet, if the service ticket is Kerberos v5 and the start time is 0 and the end time is 0x7fffffff, then treat that as a Kerberos v4 ticket with a valid range of start 0 and end 0xffffffff. In other words, accept the Kerberos v5 ticket with the range 0..0x7fffffff as a never expiring token.
The second change is to the fileserver to handle rxkad expiration errors by generating a new token and establishing a new rxkad_krb5 connection to the protection service.
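The first change, the acceptance rule for never expiring tokens, can be sketched as a predicate. The function name and the `fixed` switch are hypothetical and exist only to contrast the old and new behavior; the real check lives in the rxkad RESPONSE validation code, which is not reproduced here.

```python
NEVER_EXPIRES_V4 = (0, 0xFFFFFFFF)       # superuser token range in rxkad terms
NEVER_EXPIRES_V5_CLAMPED = (0, 0x7FFFFFFF)  # same request after krb5 clamping


def is_never_expiring(start: int, end: int, is_krb5: bool, fixed: bool = True) -> bool:
    """Return True if the ticket times denote a never expiring token."""
    if (start, end) == NEVER_EXPIRES_V4:
        return True
    # New rule: a Kerberos v5 ticket cannot carry 0xffffffff, so the
    # clamped range 0..0x7fffffff is accepted as equivalent.
    if fixed and is_krb5 and (start, end) == NEVER_EXPIRES_V5_CLAMPED:
        return True
    return False
```

With `fixed=False` the clamped Kerberos v5 range is treated as an ordinary ticket, which is the behavior that led to unexpected expiration.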
Sites that have deployed yfs-rxgk keys to all AuriStorFS servers within their cells are unaffected by this bug. This is true even if the yfs-rxgk/_afs@ service principal has yet to be created or enabled.
Reliability of IP ACLs
Transarc introduced IP ACLs as part of the AFS 3.1 release. The implementation in use to this day assumed that the PR_HostCPS RPC issued by the fileserver to the protection service could only fail for two reasons. First, the protection service did not support the PR_HostCPS opcode. Second, the requested entry did not exist. All failures were therefore treated as if the Host's Current Protection Set (set of group memberships) is equivalent to the empty set which was then cached for the fileserver configured "hostcpsrefresh" time which defaults to two hours.
A PR_HostCPS call can fail for reasons that were not handled. For example, a protection server can return an out of memory error, all of the protection servers might be unreachable, or, as has been discovered, an rxkad expiration error might be returned. Processing the incoming call from a cache manager without the correct HostCPS set can result in a change in the rights granted to the client, which can result in an application failure.
This release of the fileserver reacts to an inability to obtain the HostCPS in the same way that it behaves if the end user's CPS cannot be obtained. The incoming RPC from the client is rejected and a VBUSY error is returned. The VBUSY error signals the client to retry the request, either to this fileserver or to another one (if the data is replicated). The fileserver will no longer cache the empty HostCPS set in case of failure.
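The new failure handling can be sketched as a small cache that refuses to store anything when the lookup fails. The class `HostCPSCache`, its `fetch` callback and the string "VBUSY" are stand-ins chosen for illustration; they are not the fileserver's real interfaces.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Tuple

HOSTCPS_REFRESH_SECONDS = 2 * 60 * 60  # default "hostcpsrefresh" per the text


class HostCPSCache:
    """Hypothetical cache of host protection sets keyed by host address."""

    def __init__(self, fetch: Callable[[str], Iterable[int]],
                 now: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # stand-in for the PR_HostCPS RPC
        self._now = now
        self._entries: Dict[str, Tuple[float, FrozenSet[int]]] = {}

    def lookup(self, host: str) -> Tuple[Optional[str], Optional[FrozenSet[int]]]:
        """Return (error, cps); error is "VBUSY" when the CPS is unavailable."""
        entry = self._entries.get(host)
        if entry is not None and entry[0] > self._now():
            return None, entry[1]
        try:
            cps = frozenset(self._fetch(host))
        except Exception:
            # Do not cache an empty set on failure; ask the client to retry.
            return "VBUSY", None
        self._entries[host] = (self._now() + HOSTCPS_REFRESH_SECONDS, cps)
        return None, cps
```

A failed lookup therefore affects only the request that triggered it, and the next request tries the protection service again instead of inheriting an empty set for two hours.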
The implemented changes will improve the consistency of the fileserver's handling of IP ACLs. However, it is worth reminding everyone that the network world when IP ACLs were created is very different from today's networks. IP ACLs can result in unpredictable behavior when clients are multi-homed (IPv4 + IPv6 or multiple IPv4 addresses), move between networks, or connect via VPN connections. The use of IP ACLs is therefore discouraged. As an alternative to IP ACLs, AuriStorFS permits rights to be granted or revoked from groups of authenticated keyed cache managers.
Reconnection of yfs-rxgk authenticated connections after rekeying
An under-specification of the rxgk security class protocol and a flaw in the yfs-rxgk implementation have been identified. It was assumed that an Rx acceptor would not receive an initial Rx connection that had already been rekeyed. As a result of that assumption, yfs-rxgk acceptors reject an rxgk authenticator encrypted with a key generation number other than zero. This assumption is incorrect. Even though it is true that all yfs-rxgk connections are created with an initial key generation number of zero, there are at least two common circumstances during which an Rx acceptor will receive a call on a previously unknown Rx connection. The first case is when an Rx connection has been idle for more than ten minutes and it has been garbage collected by the acceptor but is still in use by the initiator. The second case is when an initiator changes endpoints because it has changed networks, joined a VPN, or is communicating behind a NAT/PAT device with a short-lived port mapping. In these circumstances the acceptor might receive an rxgk RESPONSE packet for an Rx connection that has already been rekeyed at least once. yfs-rxgk connections are rekeyed after each 1.7TB of data is transferred.
Failure to accept the RESPONSE can result in a cache manager marking a fileserver down or in the user's rxgk tokens being discarded. A complete fix requires that both the initiator and the acceptor be updated to v2021.05-30 or beyond.
Performance of "vos convertROtoRW"
Many sites use read-only volumes as an online backup mechanism. If a vice partition hosting read-write volumes fails, then a read-only volume on another fileserver is converted to become a replacement read-write volume. This release includes a number of performance and behavioral enhancements to reduce the time necessary to perform the conversion and simplify the process. First, it is now assumed that if a read-write instance is unusable, then any co-located read-only or backup instances are also unusable. Second, if the fileserver hosting the read-only volume that will be converted is v2021.05-26 or later, then the read-only volume will not be converted to a read-write volume but instead will be cloned to a read-write volume. These two changes reduce the number of required location service write transactions by 50% and the number of volserver vnode operations by 50%.
Performance of "vos source", "vos interactive", "vos eachvol", "pts source", "pts interactive" and "pts eachuser/eachgroup"
When executing "vos convertROtoRW" on large numbers of read-only volumes, use of "vos source" can be significantly faster than executing individual "vos convertROtoRW" commands. In addition to saving the overhead of starting a new process, if the "vos source" command can reuse a single Rx connection pool for all of the commands then it is possible to eliminate the cost of probing for the best server, identifying the ubik coordinator, and performing any required CHALLENGE/RESPONSE exchange.
Prior to this release, inclusion of a -cell, -verbose, -localauth, or -encrypt switch on the command being executed by a "source", "interactive" or "eachxxx" command would force the allocation of a new Rx connection pool for each executed command. Beginning with this release new Rx connection pools are allocated only if the state required by the command differs from the state of the top level command. This change increases the likelihood that the Rx connection pools can be reused, thereby saving time. If 10ms is saved for each of 100,000 executed commands, that is a savings of greater than 16 minutes.
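The reuse rule and the savings estimate can be sketched as follows. `ConnState` and `CommandRunner` are hypothetical names, and the choice of which switches make up the connection state is an assumption made for illustration rather than a description of the vos and pts internals.

```python
from typing import NamedTuple


class ConnState(NamedTuple):
    """Assumed subset of switches that determine connection state."""
    cell: str
    localauth: bool
    encrypt: bool


class CommandRunner:
    """Hypothetical driver for "source", "interactive" or "eachxxx"."""

    def __init__(self, top_level: ConnState) -> None:
        self.top_level = top_level
        self.top_pool = object()  # stand-in for an Rx connection pool
        self.pools_allocated = 1

    def pool_for(self, requested: ConnState) -> object:
        # Reuse the top level pool unless the required state differs.
        if requested == self.top_level:
            return self.top_pool
        self.pools_allocated += 1
        return object()


# Savings estimate from the text: 10 ms saved on each of 100,000 commands.
saved_ms = 10 * 100_000
saved_minutes = saved_ms / 1000 / 60
print(f"{saved_minutes:.1f} minutes saved")
```

Repeating a switch with the same value as the top level command no longer costs a new pool in this model; only a genuinely different state does.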
Reliability of volume operations initiated from behind a NAT
Another change of note affects the use of "vos move[site]", "vos release", "vos copy[site]" when the "vos" process and one or both of the volservers are separated by a NAT/PAT/firewall. These vos operations involve a volume forward operation between two volservers. While the data is flowing between the volservers, the active call between the vos process and the source volserver has no DATA packets flowing in either direction. A PING ACK is sent by vos once every 100s. Unfortunately, many NATs have port mapping timeouts as short as 30s. If the volume forward takes longer to complete than the NAT mapping timeout, the volume forward completes successfully but the response is not received by vos. Instead, vos observes a network timeout and then attempts to clean up from what appears to be a failed volume forward. The v2021.05-29 release was intended to address this situation but the implemented changes were incomplete. v2021.05-30 ensures that the maximum time period between PING ACKs when an outgoing call is waiting for a response is 20s, which should be sufficient to maintain the port mapping on NAT/PAT devices that aggressively prune the port mapping tables.
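The timing argument above reduces to a comparison between the keepalive interval and the NAT mapping timeout. This is a simplified model that ignores packet loss and jitter; the function name is hypothetical.

```python
def mapping_survives(ping_interval_s: float, nat_timeout_s: float) -> bool:
    """True if PING ACKs arrive often enough to keep a NAT mapping alive."""
    return ping_interval_s < nat_timeout_s


# Values taken from the text: 100s (old) and 20s (new) against a 30s NAT timeout.
for interval in (100, 20):
    state = "kept" if mapping_survives(interval, 30) else "lost"
    print(f"PING ACK every {interval}s, NAT timeout 30s: mapping {state}")
```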
- AuriStorFS Clients and Servers version v2021.05 Patch 29 released.
- New platform: Linux 6.4 kernels.
- Execution of fs commands such as examine, whereis, listquota, fetchacl, cleanacl, storeacl, whoami, lsmount, bypassthreshold and getserverprefs could result in memory leaks by the yfs.ko kernel module.
- On Linux, prevents a kernel panic if the configured cache directory is located on a filesystem such as overlayfs which does not support the functionality required to be a cache.
- Improved operation of vos operations that involve a volume forward operation between two volservers when the vos process is separated from one or both of the volservers by a NAT/PAT/firewall which times out port mappings within 100 seconds.
- AuriStorFS Clients and Servers version v2021.05 Patch 28 released.
- Fixes for volserver bugs.
- New GPG signing key deployed for package compatibility with Fedora 38.
- AuriStorFS Clients and Servers version v2021.05 Patch 27 released.
- Fixes for bug introduced in v2021.05 Patch 26 within volserver and vos.
- AuriStorFS Clients and Servers version v2021.05 Patch 26 released.
- With this release AuriStorFS adds support for several new platforms:
- Linux 6.3 kernel
- SUSE SLES 15 (SUSE SolidDriver certified)
- CrayOS 2.3 / 2.4
- Red Hat EL8 Real Time kernels
- Red Hat EL9 Real Time kernels
- AuriStor has begun to publish repositories for Linux aarch64. The first three platforms for which aarch64 repositories have been published are:
- Red Hat EL8, Alma8, Rocky8
- Red Hat EL9, Alma9, Rocky9
- Fedora 38
- AuriStor has been a Red Hat Partner for many years and that partnership continues to get stronger and closer. AuriStor has now established partnership relationships with SUSE and HPE Cray in order to support their respective products. With SUSE AuriStor is now a member of the SolidDriver program which ensures that SUSE will not reject a support request from a SLES customer that has the AuriStorFS kernel modules installed. Instead, SUSE and AuriStor have an established procedure for mutually investigating support requests submitted to SUSE for SLES. HPE Cray's CrayOS is forked from SLES 15 and customized for each Cray installation. AuriStor's partnership with HPE Cray provides AuriStor with the ability to provide HPE Cray customers with custom builds of AuriStorFS for their Cray installations.
- AuriStor has also expanded our membership with TSANet to obtain mutual support for our customers with many industry leading hardware and software vendors: https://tsanet.org/members/.
- In addition to the new platform support, the v2021.05-26 release has major improvements in the following areas:
- Continued improvements in volume lifecycle management in the fileserver with substantial reductions in fileserver shutdown and restart times. Not all sites have more than a million volumes hosted on each fileserver, but if yours does, the changes included in v2021.05-26 reduce the fileserver shutdown time from tens of minutes to a small number of seconds. Fileserver shutdown and restart time is critical to reducing the impact of fileserver updates on client systems.
- The cache manager includes a number of bug fixes and one significant design change. The v2021.05-26 release alters the logic used to track the availability of volume content on each site. This state information is particularly important when volumes become temporarily unavailable due to volume management operations such as backup, move, release, salvage, etc., or due to a fileserver being in the process of shutting down or not yet up again. The changes in this release force the client to immediately fail over to an alternate site for any replicated volume when a fileserver indicates that it is restarting or that a volume is missing. Prior cache managers would stick with a fileserver when told that it was restarting either until the fileserver came back up or until the fileserver was shut down for a sufficient period of time. If the fileserver took tens of minutes to shut down, the client would not fail over until it did.
- This release includes a significant change to the RX RPC stack's lifecycle management of incoming calls. A call once received transitions through the following states: PRECALL -> QUEUED -> ACTIVE -> HOLD -> DALLY before it is garbage collected. The PRECALL state indicates that a call is not yet eligible to be attached or scheduled to a worker. The most common reasons are that authentication or reachability testing is pending or that the first DATA packet in a multi-packet call has yet to arrive. A QUEUED call is waiting for a worker to be assigned. An ACTIVE call is assigned to a worker. HOLD indicates that the call has completed but is waiting for the initiator to acknowledge the response DATA packets. DALLY indicates that the call is detached and ready for garbage collection. In prior releases if a call is in the PRECALL state when an ABORT from the initiator arrives, the call would be terminated immediately. However, a call in the QUEUED state required that it be assigned to a worker before it could be terminated. On extremely busy servers it has been observed that ABORTed calls can sit in the queue for up to 52 seconds before a worker is available to be assigned. Many initiated calls have connection dead times well under 52 seconds and these calls would time out if the prior call issued in the same call channel keeps the call channel busy for longer than the dead time period. This release makes it possible for calls in the QUEUED state to be terminated immediately. This change not only prevents timeouts of incoming calls but also reduces the workload on the worker pool. These changes address timeout problems observed with vos processes performing write transactions with extremely busy location service coordinators.
- The vos process has received some significant improvements.
- Verbose output is more detailed and complete. Site information (server and partition) is now included for each transaction and operation. Failures are more reliably logged.
- A new keepalive mechanism has been added to ensure that idle volserver transactions do not timeout until vos is ready to terminate them. As volume sizes and the number of vnodes within a volume have increased many sites have reported that vos operations are failing because of transaction timeouts. These timeouts occur when a vos operation involves multiple servers and a volume clone must be created or deleted on one server prior to performing the next operation on another server's transaction. If a volserver transaction is idle for five minutes or longer, it is subject to automatic garbage collection. In this release, the vos process will schedule a keep alive operation on each idle volserver transaction once a minute to ensure that the transaction is not viewed as idle.
- Many improvements have been made to "vos release" logic to address failures in various edge cases.
- When both volservers are v2021.05-26 or later, moving a volume to a site containing a pre-existing ROVOL is now more efficient as the data in the ROVOL can be used as the starting point for an incremental move instead of a full move. This is particularly beneficial for volumes with large numbers of vnodes that have not changed since the ROVOL was created.
- There are improvements in AuriStor's DKMS processing.
- Man page updates.
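The incoming call lifecycle described in the RX RPC item above can be modeled as a small state machine. This is a sketch under stated assumptions: "terminated immediately" is modeled as a jump straight to DALLY, the `queued_abort_supported` switch is a hypothetical way to contrast prior releases with v2021.05-26, and the handling of ABORT in the ACTIVE and HOLD states is not described in these notes, so it is left unchanged here.

```python
from enum import Enum


class CallState(Enum):
    PRECALL = 1  # not yet eligible for a worker
    QUEUED = 2   # waiting for a worker
    ACTIVE = 3   # assigned to a worker
    HOLD = 4     # completed, awaiting acknowledgment of the response
    DALLY = 5    # detached, ready for garbage collection


# Normal progression of an incoming call.
NORMAL_NEXT = {
    CallState.PRECALL: CallState.QUEUED,
    CallState.QUEUED: CallState.ACTIVE,
    CallState.ACTIVE: CallState.HOLD,
    CallState.HOLD: CallState.DALLY,
}


def after_abort(state: CallState, queued_abort_supported: bool = True) -> CallState:
    """Return the state of a call after an ABORT from the initiator arrives."""
    if state is CallState.PRECALL:
        return CallState.DALLY
    if state is CallState.QUEUED and queued_abort_supported:
        # New behavior: no worker is needed to terminate a queued call.
        return CallState.DALLY
    # Prior behavior for QUEUED (wait for a worker), and states whose
    # ABORT handling is not covered by the text.
    return state
```

Calling `after_abort(CallState.QUEUED, queued_abort_supported=False)` leaves the call in the queue, which corresponds to the situation where an aborted call could occupy its call channel until a worker became available.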
- AuriStorFS Clients and Servers version v2021.05 Patch 25 released.
- v2021.05-25 supports Linux kernels based upon the 6.1 or 6.2 mainline releases.
- The v2021.05-25 release is important for AuriStorFS database servers deployed in environments without reliable clock synchronization. The AuriStor UBIK database replication protocol requires synchronization of clocks within two seconds. The changes to the v2021.05-25 release will strictly enforce that votes cast for the coordinator be within the window ranging from two seconds before the start of the election to two seconds after the vote was received. Receipt of a vote with a timestamp outside of that window will result in the vote being discarded and the peer being marked down. Prior to this change it was possible for a vote to be cast with a timestamp up to 23 (ubik bigtime) seconds prior to the end of the election. As a coordinator is elected for a term of 21 seconds from the start of the election and an election is conducted every 10 seconds, it was possible for a coordinator to be elected to a term that either was about to expire or had already expired. This resulted in sporadic failures of write transactions.
- The v2021.05-25 release includes further changes to RXRPC to improve reliability. The changes in this release prevent improper packet size growth. Packet size growth should never occur when a call is attempting to recover from packet loss, and is unsafe when the network path's maximum transmission unit is unknown. Packet size growth will be re-enabled in a future AuriStorFS release that includes Path MTU detection and the Extended SACK functionality.
- For Debian/Ubuntu systems which use DKMS to install the AuriStorFS kernel module, the v2021.05 release includes new functionality to retry DKMS if the matching AuriStorFS kernel module is missing from the system. This change will permit recovery when the first reboot into a new kernel of a Debian/Ubuntu system occurs without network access or prior to the release of a matching AuriStorFS kernel module.
- With this release the Linux /proc/fs/yfs directory tree has been moved to /proc/fs/auristorfs. A symlink from /proc/fs/yfs to /proc/fs/auristorfs is provided to ensure backward compatibility.
- A new /proc/fs/auristorfs/rxstats file can be used to read the RX statistics counters. This set of statistics uses 64-bit counters unlike the output from "rxdebug -rxstat" which is limited to 32-bit counters.
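The stricter UBIK vote window described in the v2021.05-25 notes above can be expressed as a single comparison. The function and parameter names are hypothetical, and all three timestamps are assumed to be seconds on the receiving server's clock.

```python
MAX_CLOCK_SKEW_S = 2.0  # clock synchronization requirement stated in the text


def vote_accepted(vote_timestamp: float, election_start: float, received_at: float) -> bool:
    """True if a vote's timestamp falls inside the enforced window.

    The window runs from two seconds before the start of the election
    to two seconds after the vote was received.
    """
    earliest = election_start - MAX_CLOCK_SKEW_S
    latest = received_at + MAX_CLOCK_SKEW_S
    return earliest <= vote_timestamp <= latest
```

A vote that fails this check would be discarded and the peer marked down, per the text; that follow-up action is outside the scope of this sketch.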
- AuriStorFS Clients and Servers released for Apple macOS 13 (Ventura) on both Apple Silicon and Intel architectures.
- New Supported Platforms:
- Fedora 37
- Linux 6.0 kernels
- UNIX Cache Manager
- Locally configured cell aliases can now be used when evaluating magic mount paths /afs/.@mount/<cell-name-or-alias>/<volume-name>/.
- RXRPC Changes
- RX calls are now created with a fixed initial congestion window instead of using the final congestion window from the most recently completed call. The use of a stashed value is inconsistent with RFC5681 and can reduce the transfer rate of subsequent calls.
- The Sent ABORT packet statistics counter is now maintained and reported. It was missing since the initial AuriStorFS release.
- Location Service (aka vlserver):
- Write transactions that were begun but failed prior to completion are no longer committed. This avoids unnecessary propagation of a database relabeling without any data change.
- This release introduces new 64-bit YFSVL CreateEntry and ReplaceEntry RPCs to replace the broken but never called versions introduced in 2018 (v0.169).
- ReplaceEntry RPC audit log entries now include the volume type and release type parameters. With this change it is possible to determine from the audit event if an implicit ReleaseLock was performed.
- GetEndpoints and GetAddrsU RPC audit log entries now include the requested fileserver UUID and Uniquifier values.
- vos command:
- Location server instances are categorized by the RPCs that are known to be supported based upon which issued RPCs succeed or fail. The categorization algorithm was broken, resulting in AuriStorFS vos commands never issuing the latest 64-bit RPCs and potentially falling back to the set of RPCs supported by IBM AFS 3.3. This release revises the algorithm such that AuriStorFS location servers will be issued the latest 64-bit RPCs and fallback will proceed incrementally through older AuriStorFS server versions, OpenAFS, IBM AFS 3.6 and finally IBM AFS 3.3.
- vos dump -file <file> will now unlink the <file> if the stored dump stream is incomplete.
- Location Service
- Introduce new RPCs YFSVL_GetAuthoritativeEntryByName, YFSVL_GetAuthoritativeEntryByID, and YFSVL_AuthoritativeListAttributesU2. When available on all servers and the vos command these RPCs will be used to ensure that vos commands that modify the database will only fetch data from the elected Location Service coordinator.
- UBIK clients (vos, pts, afsbackup)
- Avoid unnecessary delays when issuing RPCs that are not supported on all of the servers in the UBIK quorum.
- Volume Server
- AuriStorFS Clients and Servers version v2021.05 Patch 21 released.
- Volume Group Object Store
- Volume clone removal time reduced by eliminating unnecessary backing store I/O operations.
- A data-version-vnode object creation race that might result in spurious creation failures was fixed.
- A data-version-vnode object reference count race that might result in improper deletion was fixed.
- vos command line tool
- vos backup and vos backupsys can now update a pre-existing backup volume while a vos release is in progress.
- RX RPC
- No longer send ACK packets in response to DATA packets if an ABORT packet has been sent.
- The Sent RX BUSY packets counter is once again reported to rxdebug server port -rxstats.
- Improve resiliency when the RX peer advertises unreasonable values for ACK packet trailer fields. This change permits continued communication with broken RX RPC implementations.
- Linux kernel module
- Linux 6.0 mainline kernels are now supported
- Fix a build error with Linux mainline 5.19 or later kernels when the architecture is aarch64.
- The kernel module now includes description, author and version information that can be displayed via modinfo.
- AuriStorFS Clients and Servers version v2021.05 Patch 20 released. This release:
- Volume Server changes
- The volserver validation checks introduced in v2021.05-19 break the restoration of incremental volume dumps. This release fixes the regression and adds tests to validate the behavior.
- AuriStorFS Clients and Servers version v2021.05 Patch 19 released. This release:
- FileServer Updates
- Reorganize how Volumes are initialized for use with the VLRU. This change avoids the possibility of an assertion failure if a volume is placed into an error state without it being detached.
- Prevent a race which could result in a core dump during startup of the VLRU Scanner Thread. This race was introduced in 2021.05-18.
- When creating a cross-directory hard link, delay the assignment of a per-file ACL until after the copy-on-write operation succeeds.
- Protection Service changes:
- Prevent crash if authenticated Kerberos v5 identity contains a dot in the first component and "allow-dotted-principals" is disabled.
- Volume Server changes
- When a Volume Dump RPC fails due to an RX Peer Unreachable error, log the ICMP error details (if available).
- Introduce additional consistency checks when receiving a dump stream.
- Salvage Service changes
- Do not assign a parent directory vnode to orphaned files with attached per-file ACLs.
- Backup Service changes
- Prior releases failed to re-open log files after receiving a signal from logrotate.
- Volume Package changes
- The volume group link table limit that prevented a vnode from being linked to more than seven volumes within a volume group has been removed. Exceeding this limit when creating a new volume clone (readonly, backup or other) could result in the entire volume group becoming unusable. The limit of seven was derived from the design of the IBM/OpenAFS volume group object store format. There is no design limit in the AuriStorFS object store format.
- UBIK Service changes
- Prevent threads attempting to perform a write transaction from jumping ahead of a queue of threads waiting for exclusive access.
- UBIK Client changes
- When it is known that a coordinator is required to complete an RPC, increase the RX connection dead time from 12s to 60s in case the coordinator is under heavy load.
- When it is known that a coordinator is required to complete an RPC, disable the RX hard dead timeout since it is not safe for a write transaction RPC to be retried.
- RX RPC
- Include the DATA packet serial number in the transmitted reachability check PING ACK. This permits the reachability test ACK to be used for RTT measurement.
- Do not terminate a call due to an idle dead timeout if there is data pending in the receive queue when the timeout period expires. Instead deliver the received data to the application. This change prevents idle dead timeouts on slow lossy network paths.
- Fix assignment of RX DATA, CHALLENGE, and RESPONSE packet serial numbers on macOS (KERNEL) and Linux (userspace). Due to a mistake in the implementation of atomic_add_and_read the wrong serial numbers were assigned to outgoing packets.
- vos subcommand changes
- Do not default the -clone switch to yes if the volume type is readonly or backup. There is no benefit to using a clone for these volume types as the volumes are not taken away from the fileserver during the volume operation. dump, shadow, and copy are affected.
- move now reports an error sooner if the volume type is not read-write.
- movesite now reports an error sooner if the volume type is not readonly.
- copysite now reports an error sooner if the volume type is not readonly.
- When dumping a volume using a clone the clone will be flagged as temporary to ensure that if the transaction is interrupted that the clone will be automatically garbage collected.
- FileServer Updates
- AuriStorFS Clients and Servers version v2021.05 Patch 18 released. This release:
- New Supported Platforms:
- Linux Kernel 5.19
- FileServer Updates
- Improved tracking of idle but in-use volumes to avoid unnecessary volume salvaging after an emergency fileserver restart or transition from active to passive server instance.
- Streamlined the fileserver shutdown process to reduce volume contention. These changes ensure that a fileserver with millions of attached volumes can perform a clean shutdown in just a few seconds.
- Prevent idle dead timeouts during StoreData calls exceeding 8MB of data over slow lossy networks.
- Protect against signed extension parameter overflow when processing RXAFS_FetchData calls.
- Volume Server changes
- Include temporary volumes (those flagged as 'destroyMe') in the output of AFSVolListVolumes and AFSVolListOneVolume calls.
- Cache Manager
- Prevent a kernel memory leak of less than 64 bytes for each bulkstat RPC issued to a fileserver. Bulkstat RPCs can be frequently issued and over time this small leak can consume a large amount of kernel memory. Leak introduced in AuriStorFS v0.196.
- The Perl::AFS module directly executes pioctls via the OpenAFS compatibility pioctl interface instead of the AuriStorFS pioctl interface. When Perl::AFS is used to store an access control list (ACL), the deprecated RXAFS_StoreACL RPC would be used in place of the newer RXAFS_StoreACL2 or RXYFS_StoreOpaqueACL2 RPCs. This release alters the behavior of the cache manager to use the newer RPCs if available on the fileserver and fallback to the deprecated RPC. The use of the deprecated RPC was restricted to use of the OpenAFS pioctl interface.
- RX RPC
- Handle a race during RX connection pool probes that could have resulted in the wrong RX Service ID being returned for a contacted service. Failure to identify the correct service ID can result in a degradation of service.
- The Path MTU detection logic sends padded PING ACK packets and requests a PING_RESPONSE ACK be sent if received. This permits the sender of the PING to probe the maximum transmission unit of the path. Under some circumstances attempts were made to send negative padding which resulted in a failure when sending the PING ACK. As a result, the Path MTU could not be measured. This release prevents the use of negative padding.
- Some shells append a slash to an expanded directory name in response to tab completion. These trailing slashes interfered with "fs lsmount", "fs flushmount" and "fs removeacl" processing. This release includes a change to prevent these commands from breaking when presented a trailing slash.
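The trailing slash handling described above can be illustrated with a minimal sketch. This is not the code used by the "fs" commands; the function name strip_trailing_slashes is hypothetical and only shows one way to normalize a tab-completed directory name before it is processed.

```python
def strip_trailing_slashes(path: str) -> str:
    """Drop trailing slashes from a path, keeping a lone "/" intact.

    Hypothetical helper for illustration only.
    """
    stripped = path.rstrip("/")
    if stripped:
        return stripped
    # The input was empty or consisted only of slashes.
    return "/" if path else ""


# A shell may expand "mnt<TAB>" to "mnt/"; both spellings normalize alike.
print(strip_trailing_slashes("/afs/example.com/mnt/"))
print(strip_trailing_slashes("/afs/example.com/mnt"))
```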
- New Supported Platforms:
- AuriStorFS Clients and Servers version v2021.05 Patch 17 released. This release:
- improves the reliability of the RX RPC protocol ACK packet processing and congestion avoidance algorithms.
- hints the creation of outgoing fileserver callback connections to improve the selection of a compatible network interface.
- improves the reliability of UBIK client processes such as vos that mix database reads with database writes. Such clients can write stale data to the database if the source data is read from a non-coordinator replica of the database that is temporarily out of sync with the coordinator.
- New Supported Platforms:
- Red Hat Enterprise Linux 9.0
- Red Hat Enterprise Linux 8.6
- Fedora 36
- Ubuntu 22.04
- Linux Kernel 5.18
- Debian "arm hard float"
- Cell Service Database Updates
- Update cern.ch, ics.muni.cz, ifh.de, cs.cmu.edu, qatar.cmu.edu, it.kth.se
- Remove uni-hohenheim.de, rz-uni-jena.de, mathematik.uni-stuttgart.de, stud.mathematik.uni-stuttgart.de, wam.umd.edu
- Add ee.cooper.edu
- Restore ams.cern.ch, md.kth.se, italia
- AuriStorFS Client installers beginning with the v2021.05-9 release are supported on macOS 12 Monterey on both Apple Silicon and Intel Macs. Note: please upgrade Big Sur, Catalina, and Mojave macOS systems to AuriStorFS v2021.05-9 before upgrading macOS to Monterey. macOS Monterey will deadlock during shutdown when previous versions of AuriStorFS built for Big Sur, Catalina and Mojave are installed.
- Leveraging AFS Storage Systems to Ease Global Software Deployment
Tracy J. Di Marco White, Goldman Sachs
Using AFS as both a file store and an object store, we provide software to hundreds of thousands of client systems within both public and private cloud. As we see a continual increase in the frequency of software deployments, in the number of different software packages, and in the number of versions of each software package, we have also adapted our software deployment systems. Both of our software deployment systems use AFS, but one is unaware of AFS, and one makes specific use of various AFS features. I'll cover how the infrastructure has grown from several private data centers, and how our use of AFS has eased migration to both private and public cloud. I'll discuss the changes we are making to both the AFS-unaware and AFS-aware deployment systems, as well as discuss bugs, bottlenecks, and patterns of software development and usage that we've discovered through the change process.
- Hands-Off Testing for Networked Filesystems
Daria Phoebe Brashear, AuriStor, Inc.
Cross-platform network filesystems require testing, but in-kernel interface testing is problematic under the best of circumstances. This talk will discuss the techniques used at AuriStor for automating hands-off testing using buildbot, TAP, docker, and kvm.
- This release of AuriStorFS is primarily bug fixes for the UNIX Cache Managers and UBIK services (Location Service, Protection Service, and Backup Database Service).
- The UBIK service changes include several important corrections for mistakes introduced as part of the v0.208 and v2021.04 performance and reliability improvements. AuriStor recommends that all UBIK servers be updated to v2021.05.
- The UNIX Cache Manager changes provide additional data integrity checks during directory lookup operations. Some sites have observed the sporadic creation of negative dentries on LINUX in directories that have not been modified in over a year. Negative dentries are created when a lookup request fails with an ENOENT error. A year ago there were nearly locations where an ENOENT error could be returned. There are now only six and each location is one where an ENOENT is the correct result. The additional directory integrity checks are intended to ensure that an ENOENT cannot be generated as a result of the lookup being performed on invalid data. If a data integrity error is detected, an EIO error will be returned and a warning will be logged including the directory FileID and the disk cache chunk index value.
- Other bug fixes include a correction for Direct I/O and StoreMini RPCs, which would overwrite fileserver returned error codes such as VBUSY, VOFFLINE, ENOSPC, EACCES, EPERM, etc. with a locally generated RXGEN_CC_UNMARSHAL (-451) error. These two code paths were missed when equivalent changes were applied elsewhere in the v0.205 release.
- The UNIX Cache Manager also received replacements for the subsystems responsible for managing the volume location servers for each cell, per-cell token management for each user/pag, and rx connection vectors. The new infrastructure does not rely upon the legacy global lock for thread safety and therefore permits increased parallelism. The volume location server management supports multi-homed servers. Previously, each endpoint of a multi-homed server was treated as a unique server which potentially increased the delays when failover due to server restart or network path connectivity issues occurred. With the new model server probes select the best available endpoint. Failover due to Ubik errors no longer issues an RPC to an alternative endpoint belonging to a previously queried server.
- This release permits the selection of the "auth" (integrity protection only) rx security level in addition to "clear" and "crypt" (privacy and integrity protection). Use "fs setcrypt auth". For the first time users can request yfs-rxgk tokens that prefer "clear" or "auth" levels instead of "crypt". Use the new "aklog -levels <levels-list>" where <levels-list> is a comma separated ordered list of preferred minimum levels. When generating a yfs-rxgk token for an AuriStorFS cell, the RXGK service will pick the first level that is acceptable to the cell's policy. By default, the order is "crypt,auth,clear". Use of "clear" and "auth" is discouraged but might be useful to obtain additional speed by eliminating cryptographic algorithm overhead. v2021.05 is the new recommended release.
- A variety of improvements for the fileserver and volserver are included in this release but nothing urgent.
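The "aklog -levels" selection rule described in the v2021.05 notes above (the RXGK service picks the first requested level that the cell's policy accepts) can be sketched as follows. This is an illustrative model only: the function names are hypothetical and the cell policy is assumed to be a simple set of acceptable levels, which the notes do not specify.

```python
from typing import Iterable, List, Optional

# Default preference order stated in the release notes.
DEFAULT_ORDER = ["crypt", "auth", "clear"]


def parse_levels(levels_list: str) -> List[str]:
    """Split a comma separated, ordered list such as "auth,clear"."""
    return [item.strip() for item in levels_list.split(",") if item.strip()]


def pick_level(requested: Iterable[str], cell_policy: Iterable[str]) -> Optional[str]:
    """Return the first requested level that the (assumed) policy accepts."""
    acceptable = set(cell_policy)
    for level in requested:
        if level in acceptable:
            return level
    return None  # no requested level is acceptable to the cell


# A cell that refuses "clear" still satisfies a request for "clear,auth".
print(pick_level(parse_levels("clear,auth"), {"crypt", "auth"}))
print(pick_level(DEFAULT_ORDER, {"crypt", "auth", "clear"}))
```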
- AuriStorFS v2021.04 contains another round of significant improvements to the UBIK replication infrastructure used by the location, protection and backup services. The changes remove a serialization point by providing further data isolation between read-transactions and write-transactions. This change further reduces the risk of a thundering herd of reader threads subsequent to write-transaction completion in a heavily loaded cell (six ubik peers, peaks of 7000 reads/second/peer, and average of 22 writes/second). This release also adds additional data consistency protections to the UBIK application services (vlserver, ptserver, and budbserver). Sites that take advantage of the protection service's groups of groups (aka "supergroups") capability will appreciate a noticeable reduction in CPU usage when responding to fileserver queries for a user's current protection set (CPS). The RXGK service co-located with the location service includes a change to prepare for issuance of tokens that do not mandate wire privacy and integrity protection. AuriStor recommends that all UBIK database service instances be updated to v2021.04.
- AuriStor recommends that fileservers be updated when convenient to do so. There is a risk of partition lock deadlock during fileserver startup introduced in v0.209 and a potential vnode metadata consistency issue when clients perform cross-directory renames of vnodes with a link count greater than one.
- The UNIX / Linux cache manager changes are primarily bug fixes for issues that have been present for years: a possible infinite kernel loop if a rare file write / truncate pattern occurs; a bug in silly rename handling that can prevent cache manager initiated garbage collection of vnodes; and, on Linux, an overwritten ERESTARTSYS error during fetch or store data RPCs that could result in transient failures. Upgrading to v2021.04 is recommended but not urgent.
- v0.209 introduces a new cache manager architecture on all Linux platforms. The new architecture
includes a redesign of:
- kernel extension load
- kernel extension unload
- /afs mount
- /afs unmount
- New platform support includes:
- Linux mainline 5.11 and 5.12 kernels
- gcc11 compilation
- Updates for Linux ppc64 and ppc64le architectures
- Hardware accelerated cryptographic routines for Linux __aarch64__.
- Ubuntu 18.04 and 20.04 -oem kernel modules.
- Higher throughput read and write transactions for UBIK based
location, protection and backup database services. The
combination of write transaction isolation and lookup caching
substantially increases the rate of read and write transactions
while reducing the risk that read transactions can block for
Compared to v0.200, these changes have reduced the minimum write transaction time by 75%, the mean by 50% and the maximum by 13%. The rate of write transaction completion increased by 35%.
- unix cache manager negative volume name lookup caching
- bos getfile - similar to bos getlog but can be used to transfer files with arbitrary binary contents.
- asetkey delete by key sub-type.
- CIDR based assignment of server endpoint priorities.
- Over 600 additional improvements.
- v0.201 introduces a new cache manager architecture on all macOS
versions except for High Sierra (10.12). The new architecture
includes a redesign of:
- kernel extension load
- kernel extension unload (not available on Big Sur)
- /afs mount
- /afs unmount
- userspace networking
- The conversion to userspace networking will have two user visible
impacts for end users:
- The Apple Firewall as configured by System Preferences -> Security & Privacy -> Firewall is now enforced. The "Automatically allow downloaded signed software to receive incoming connections" includes AuriStorFS.
- Observed network throughput is likely to vary compared to previous releases.
- On Catalina the "Legacy Kernel Extension" warnings that were displayed after boot with previous releases of AuriStorFS are no longer presented with v0.201.
- AuriStorFS /afs access is expected to continue to function when upgrading from Mojave or Catalina to Big Sur. However, as AuriStorFS is built specifically for each macOS release, it is recommended that end users install a Big Sur specific AuriStorFS package.
- AuriStorFS on Apple Silicon supports hardware accelerated aes256-cts-hmac-sha1-96 and aes128-cts-hmac-sha1-96 using AuriStor's proprietary implementation.
The v0.200 release is an important release targeted at AuriStorFS server deployments and Unix cache managers.
fileserver: remote denial of service
- Impacted versions: All releases from v0.116 through v0.198 are affected.
- CVE: None
- CVSS Base Score: 4.9
- Impact Subscore: 3.6
- Exploitability Subscore: 1.2
- CVSS Temporal Score: 4.8
- CVSS Environmental Score: 3.1
- Modified Impact Subscore: 1.8
- Overall CVSS Score: 3.1
- CVSS v3.1: AV:N/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H/E:F/RL:U/RC:C/CR:X/IR:X/AR:L/MAV:N/MAC:L/MPR:H/MUI:N/MS:U/MC:N/MI:N/MA:H
Details on this denial of service vulnerability will be disclosed thirty (30) days after release to give customer sites an opportunity to update their fileservers to v0.200.
Linux cache managers prior to v0.199 are susceptible to a general protection fault if a server unreachable network error occurs during a direct i/o operation.
The macOS backgrounder in prior releases can repeatedly segmentation fault and restart when there is no network connectivity.
This release includes major improvements in the handling of RPCs that are interrupted by NAT/PAT devices timing out their udp endpoint mappings.
The v0.198 release is a security release targeted at AuriStorFS server deployments.
vlserver: remote denial of service
- Impacted versions: All releases from v0.193 through v0.197 are affected.
- CVE: CVE-2020-26119
- CVSS Base Score: 8.6
- Impact Subscore: 4.0
- Exploitability Subscore: 3.9
- CVSS Temporal Score: 8.0
- CVSS Environmental Score: 9.3
- Modified Impact Subscore: 5.9
- Overall CVSS Score: 9.3
- CVSS v3.1: AV:N/AC:L/PR:N/UI:N/S:C/C:N/I:N/A:H/E:F/RL:O/RC:C/CR:X/IR:X/AR:H/MAV:N/MAC:L/MPR:N/MUI:N/MS:C/MC:N/MI:N/MA:H
Details on this denial of service vulnerability will be disclosed only after all customer sites have updated their location servers to v0.198.
The v0.197 release includes significant performance improvements for UNIX cache managers, especially for macOS and Red Hat Enterprise Linux (including derivatives). macOS users accessing AuriStorFS cells with yfs-rxgk will experience AuriStor's proprietary hardware accelerated implementations of aes256-cts-hmac-sha1-96 and aes256-cts-hmac-sha512-384, and optimistic caching of vnode status information for the first time. Linux users will benefit from more aggressive optimistic caching of status information as well as support for SELinux labels and the world's first path-ioctl implementation for FUSE. CentOS users will appreciate the dedicated repository for CentOS kernel modules. This release also introduces support for the Linux mainline 5.8 and 5.9 kernels.
Cell administrators will appreciate an improved vos eachvol and new pts eachuser and pts eachgroup commands. The new pts whoami -rxkad switch and improved logging of yfs-rxgk authentication failures ease debugging of authentication and authorization configuration errors.
All servers have been updated to improve reliability and performance. New audit events have been added to the fileserver when a rename operation unlinks an existing target. A rename that replaces a directory will no longer orphan the unlinked directory. Volserver transactions can no longer be stolen by another cell administrator. UBIK service coordinators protect themselves against peers that accept a transaction rpc but never complete it.
Last but not least, AuriStor has introduced fs ignorelist to replace fs blacklist. Likewise ignorelist-dns, ignorelist-volroot, ignorelist-afsmountdir, and ignorelist-volrootprefix replace the former [afsd] "blacklist" variants. In all cases "blacklist" is accepted as a hidden alias.
The v0.195 release is a CRITICAL update for all macOS and Linux cache managers. The changes in v0.195 correct bugs that can result in data corruption.
v0.195 also introduces support for Linux 5.7 kernels.
The v0.194 release is a CRITICAL update for all UBIK servers (vlserver, ptserver, buserver) and the macOS and Linux cache managers. The changes in v0.194 correct bugs that can result in data corruption in all of the above.
v0.194 also introduces support for Linux 5.6 kernels and Red Hat Enterprise Linux 7.8.
The 0.192 release is primarily a bug fix release focused on the Linux cache manager, the fileserver/salvageserver and "vos".
The fileserver fixes are critical for any server with a vice partition supporting Linux reflinks (xfs+reflinks, btrfs, ocfs2). AuriStor is unaware of any customers operating fileservers with these configurations. There is also a fix to permit salvaging volumes containing more than 76 million vnodes.
The Unix cache manager changes improve stability, efficiency, and scalability. Post-0.189 changes exposed race conditions and reference count errors which can lead to a system panic or deadlock. In addition to addressing these deficiencies this release removes bottlenecks that restricted the number of simultaneous vfs operations that could be processed by the auristorfs cache manager. The changes in this release have been successfully tested with greater than 400 simultaneous requests sustained for several days.
Changes to the "vos move" and "vos movesite" commands preserve the volume's last update timestamp ensuring proper calculation of the change set for incremental transfers. The bug can result in reversion to the contents of the BACK volume if a RW volume with an existing BACK volume is moved twice without first removing or updating the BACK volume.
The AuriStor File System v0.191 release addresses bugs that were identified after the release of v0.190, compatibility with RHEL7 kernels suffering a regression, and improvements in Linux cache manager startup/shutdown. Notable changes include:
- Fileserver bug fixes when Linux reflinks are supported by vice partition backing stores. xfs+reflinks, btrfs or ocfs2.
- Re-enabling SIMD processor extensions for non-RHEL userland Linux processes when the kernel module does not export __kernel_fpu_begin or __kernel_fpu_end.
- Improvements in Linux kernel module startup and shutdown that reduce the risk that a reboot will be required.
- Work-around for a RHEL 7.6 and 7.7 regression that impacts Linux systems configured to export /afs via nfsd.
The AuriStor File System v0.190 release addresses bugs that were identified shortly after the release of v0.189.
The v0.189 release includes a broad range of performance improvements, new features, and of course, bug fixes. The highlights include:
- UNIX Cache Manager performance improvements
- Faster "git status" operation on repositories stored in /afs.
- Faster and less CPU intensive writing of (>64GB) large files to /afs. Prior to this release writing files larger than 1TB might not complete. With this release store data throughput is consistent regardless of file size. (See UNIX Cache Manager large file performance improvements).
- New Fileserver support for Data Loss Prevention scanning and Backup solutions that walk the directory tree. (See New Feature: Fileserver Implicit ACLs).
- A major rewrite of the Ubik recovery engine to eliminate contention with the coordinator election procedures. These changes ensure that regardless of the database size or network latency, Ubik recovery can no longer alter the timing of election ballots. In prior releases a lengthy recovery could prevent an election from being conducted which could result in the expiration of the coordinator's term. Reduction in thread contention will also enhance the performance of the vlserver, ptserver and buserver.
- Reduced Ubik transaction times resulting from parallelization of quorum updates.
- Many improvements intended to reduce the amount of data included in an incremental volume dump and/or avoid the need for generating a full volume dump during a "vos release".
- New support for acceptor-only keys that permit key rotation without single points of failure or flag days. (See "New Feature: Key Rotation Using Acceptor Only Keys").
- Auditing improvements.
- Updated platform support including Fedora Core 31 and Oracle Linux
- and much more ...
AuriStorFS v0.188 installer for macOS Catalina (10.15) released in conjunction with Apple's release of macOS Catalina (10.15).
v0.188 addresses three issues experienced by customers.
The first is UBIK coordinator term expiration of the location service after periodic load spikes that increased the size of the vlserver thread pool from 100 threads to more than 13,000 threads. The load spike would last for under a minute, the thread pool would scale back to 100 threads after 20 minutes. Forty minutes later another load spike would occur repeating the pattern. The existing rx packet allocator behaved very poorly under this workload pattern resulting in allocation of additional rx packets with each load spike. After a week more than five million rx packets had been allocated on some vlservers.
As the allocated packet counts increased and the number of threads decreased, the packets per thread ratio increased as well. When the thread pool resized to 100 threads the number of packets assigned to each thread grew to the point where packet transfers began to interfere with rx data transfer and event processing. The UBIK coordinator election algorithm is time sensitive and a failure to deliver votes or timeout RPCs in a timely manner can result in election failure.
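To give a sense of scale for the ratio described above, the following sketch works the arithmetic using the round figures from the text (five million allocated packets, 13,000 threads at peak, 100 threads when idle). It assumes, purely for illustration, that packets are spread evenly across threads; the function name is hypothetical.

```python
def packets_per_thread(allocated_packets: int, threads: int) -> float:
    """Average packets per thread, assuming an even distribution."""
    return allocated_packets / threads


ALLOCATED = 5_000_000  # "more than five million rx packets"

at_peak = packets_per_thread(ALLOCATED, 13_000)  # during a load spike
at_idle = packets_per_thread(ALLOCATED, 100)     # after the pool shrinks

print(f"peak: about {at_peak:.0f} packets per thread")
print(f"idle: about {at_idle:.0f} packets per thread")
print(f"ratio grows by a factor of {at_idle / at_peak:.0f}")
```

Under these assumptions the per-thread share grows by two orders of magnitude once the pool shrinks back to 100 threads.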
v0.188 addresses the root causes by replacing:
- the rx multi call implementation used to conduct UBIK elections with a new variation that manages its own timeouts and does not rely upon timeouts set upon each individual rx rpc.
- the condvar timed wait implementation with a version that has finer grained clock resolution: 1ns instead of 1s.
- the rx packet allocator with a new implementation that is better suited for use with dynamic thread pools and larger window sizes. The new allocator also significantly reduces lock contention when obtaining and releasing packets.
The second problem is loss of volume access after the fileserver
writes to the FileLog:
CopyOnWrite corruption prevention: detected zero nlink for volume N inode vnode:V unique:U tag:T (dest), forcing volume offline
No data corruption occurs but after each occurrence the volume is salvaged. After the 16th automatic salvage the volume is taken offline until there is manual intervention by an administrator.
This bug, a file descriptor leak, was introduced in v0.184 as a side effect of one of the fixes for the libyfs_vol reference counting errors.
The final issue is the on-going problems that some customers
have experienced with Linux clients either with the shell
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
or "mount --bind" failing with
mount: mount(2) failed: No such file or directory
$ cd /afs/example.com/
$ ls -al /proc/self/cwd
/proc/self/cwd -> /afs/example.com (deleted)
The symptom occurs when a Linux dentry (directory entry) object ends up in an unhashed state although it is referenced by an inode.
Since v0.180 AuriStor has revised code paths to improve error code reporting and avoid race conditions that can generate this behavior. Apparently, there are still additional conditions that have yet to be identified. v0.188 includes a band-aid whereby an unhashed dentry will be rehashed when needed. However, AuriStor is still trying to find and address the root cause. Therefore the AuriStorFS kernel module will log a warning when a dentry is rehashed.
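The "(deleted)" marker shown in the /proc/self/cwd listing above can also be checked from a script. The sketch below is an illustration only, with hypothetical function names; it relies solely on the standard os.readlink call and the Linux /proc convention of appending " (deleted)" to the link target. A directory that really was removed shows the same marker, so a positive result is a hint rather than proof of an unhashed dentry.

```python
import os

DELETED_SUFFIX = " (deleted)"


def looks_unhashed(link_target: str) -> bool:
    """True if a /proc/<pid>/cwd target carries the "(deleted)" marker."""
    return link_target.endswith(DELETED_SUFFIX)


def cwd_looks_unhashed() -> bool:
    """Check the current process; returns False where /proc is unavailable."""
    try:
        return looks_unhashed(os.readlink("/proc/self/cwd"))
    except OSError:
        return False


print(looks_unhashed("/afs/example.com (deleted)"))
print(cwd_looks_unhashed())
```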
Changes since v0.184 are primarily focused on the UNIX/Linux cache manager and fixing operational issues reported in the volserver and fileserver. The v0.184 release implemented major changes to the UNIX/Linux cache manager. This release fixes bugs introduced in v0.184 and missed edge cases. It also continues the refactoring of internal interfaces to propagate error codes and signals to userland applications. VolserLog messages related to the volserver transaction lifecycle have been thoroughly revamped. Additional reliability improvements within RX are included.
- New Platforms:
- Red Hat Enterprise 8
- xfs reflinks (requires new filesystem) changes AuriStorFS StoreData RPC copy-on-write performance from O(filelength) to O(write length).
- significant improvements in udp performance compared to rhel7.
- extended berkeley packet filters provides for fairer distribution of rx call processing across multiple rx listener threads.
- Fedora 30
- Unix CM:
- v0.184 moved the /etc/yfs/cmstate.dat file to /var/yfs. With this change afsd would fail to start if /etc/yfs/cmstate.dat exists but contains invalid state information. This is fixed.
- v0.184 introduced a potential deadlock during directory processing. This is fixed.
Many sites have noticed that clients with v0.184 installed might log Lost contact with xxxx server ... referencing a strange negative error code and that fileservers might log FetchData Write failure ... errors from any Linux client version.
These errors might correlate to corruption of pages in the Linux page cache. The corruption is that one or more contiguous pages might be inappropriately zero filled.
This release implements many code changes intended to prevent Linux page cache and AFS disk cache corruption.
- Better data version checks
- More invalidation of cache chunk data version when zapping
- Only zero fill pages past the server end of file
- Always advance RPC stream pointer when skipping over missing pages or when populating pages from the disk cache chunk.
- Never match a data version number equal to -1.
- Avoid truncation races between find_get_page() and page locking.
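Two of the rules in the list above (never match a data version of -1, and only zero fill pages past the server end of file) can be expressed as small predicates. This is a simplified sketch with hypothetical names and a hypothetical model of cache chunks and pages; it is not the cache manager's implementation.

```python
INVALID_DATA_VERSION = -1  # sentinel that must never be treated as a match


def chunk_is_current(cached_dv: int, server_dv: int) -> bool:
    """A cached chunk is usable only if both versions are valid and equal."""
    if INVALID_DATA_VERSION in (cached_dv, server_dv):
        return False
    return cached_dv == server_dv


def may_zero_fill(page_offset: int, server_eof: int) -> bool:
    """Zero fill only a page that starts at or beyond the server's EOF."""
    return page_offset >= server_eof


# Two invalid versions compare equal numerically but must not match.
print(chunk_is_current(-1, -1))
# A page inside the file must be fetched, never zero filled.
print(may_zero_fill(page_offset=4096, server_eof=10_000))
```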
Some sites have experienced failures of Linux mount --bind of /afs paths or getcwd returning ENOENT. This release fixes a dentry race that can produce an unhashed directory entry.
Some uses of the directory will continue to work, as the first lookup following the race will associate a new dentry with the inode, as an additional alias. Directories are not supposed to have aliases on Linux, so the vfs code assumes that d_alias is at most a list of 1 element, and accesses the entry in a slightly different way in a few places. Some sites get the new hashed dentry, others get the original unhashed one.
- Propagate EINTR and ERESTARTSYS during location server queries to userland.
- Handle common error table errors obtained outside an afs_Analyze loop. Map VL errors to ENODEV and RX, RXKAD, RXGK errors to ETIMEDOUT
- Log all server down and server up events. Transition events from server probes failed to log messages.
- Avoid leaking local errors to the fileserver if a failure occurs during Direct IO processing.
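The error mapping mentioned above (VL errors to ENODEV; RX, RXKAD, and RXGK errors to ETIMEDOUT) can be sketched as a lookup table. The numeric ranges of the error tables are not given in these notes, so this illustration keys on the table name instead; the function and table representation are hypothetical.

```python
import errno
from typing import Optional

# Error table name -> errno reported to userland, per the note above.
TABLE_TO_ERRNO = {
    "VL": errno.ENODEV,
    "RX": errno.ETIMEDOUT,
    "RXKAD": errno.ETIMEDOUT,
    "RXGK": errno.ETIMEDOUT,
}


def map_table_error(table: str) -> Optional[int]:
    """Return the errno for a known error table, or None if unmapped."""
    return TABLE_TO_ERRNO.get(table.upper())


print(errno.errorcode[map_table_error("VL")])
print(errno.errorcode[map_table_error("rxgk")])
```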
- RX RPC networking:
- If the RPC initiator successfully completes a call without
consuming all of the response data fail the call by sending
an RX_PROTOCOL_ERROR ABORT to the acceptor and returning
a new error, RX_CALL_PREMATURE_END, to the initiator.
Prior to this change failure to consume all of the response data would be silently ignored by the initiator and the acceptor might resend the unconsumed data until any idle timeout expired. The default idle timeout is 60 seconds.
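The new behavior can be modelled with a short sketch. The error names come from the text above; the numeric values are deliberately not shown, and the EndResult type and end_call function are hypothetical stand-ins for the real RX call termination logic.

```python
from typing import NamedTuple, Optional


class EndResult(NamedTuple):
    abort_sent_to_acceptor: Optional[str]
    error_returned_to_initiator: Optional[str]


def end_call(response_len: int, consumed: int) -> EndResult:
    """Model ending a call after the initiator consumed `consumed` bytes."""
    if consumed < response_len:
        # Unread response data remains: fail the call on both sides rather
        # than silently ignoring it.
        return EndResult("RX_PROTOCOL_ERROR", "RX_CALL_PREMATURE_END")
    return EndResult(None, None)


print(end_call(response_len=4096, consumed=4096))
print(end_call(response_len=4096, consumed=1024))
```

Failing the call immediately also stops the acceptor from resending the unconsumed data until the idle timeout expires.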
- Avoid event cancellation race with rx call termination during process shutdown. This race when lost can prevent a process such as vos from terminating after successfully completing its work.
- Avoid transmitting ABORT, CHALLENGE, and RESPONSE packets with an uninitialized sequence number. The sequence number is ignored for these packets but set it to zero.
- Frequent issuance of "vos listvol" commands can no longer interfere with volume transaction idle timeout processing.
Since IBM AFS 3.5 the volserver has logged transaction status every 30 seconds to the VolserLog. In v0.184 the volserver logs the following lifecycle messages at level 0:
- trans id on volume id is older than s seconds
- trans id on volume id has timed out
- trans id on volume id has been idle for more than s seconds
On a busy volserver these messages can flood the VolserLog.
This change raises the level of messages 1 and 3 to 125 and introduces a new "Created trans id on volume id" message logged at level 5.
With this change level 0 logs unexpected termination of each transaction. Level 125 will include the 30 second updates for sites that require them.
The partition, volume parentid and transaction iflags fields have been added to each log message.
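The resulting log levels can be summarized in a small table. The sketch below assumes the common convention that a message is written when its level is less than or equal to the configured log level; the message keys and the is_logged function are hypothetical labels for the messages described above.

```python
# Level assigned to each transaction lifecycle message after this change.
MESSAGE_LEVELS = {
    "older_than": 125,  # "trans id on volume id is older than s seconds"
    "timed_out": 0,     # "trans id on volume id has timed out"
    "idle": 125,        # "... has been idle for more than s seconds"
    "created": 5,       # new "Created trans id on volume id" message
}


def is_logged(message: str, configured_level: int) -> bool:
    """Assumed rule: log when the message level <= configured level."""
    return MESSAGE_LEVELS[message] <= configured_level


for level in (0, 5, 125):
    shown = sorted(m for m in MESSAGE_LEVELS if is_logged(m, level))
    print(level, shown)
```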
- RPCs issued by vos listvol will no longer block in the volserver if the requested volume requires salvaging. The volume attachment retries can block the salvageserver from acquiring an exclusive volume lock resulting in a salvage failure and a soft-deadlock. From now on the vos listvol command will fail immediately.
- If the vice partition's backing store is unmounted or otherwise becomes unavailable the fileserver could terminate unexpectedly due to a segmentation fault. Beginning with this release the fileserver will survive but all requests for objects stored on the missing vice partition will fail.
Introduce the ability to configure random error injection during FetchStatus, FetchData, and StoreData RPC processing.
- Add File IDs to "FetchData Write Failure" FileLog messages.
- Ubik services:
- This release introduces the ability to configure a separate debug log level for ubik than for the application service. By default, when the "ubik_debug" level is unspecified or set to zero, the application's log level determines which "ubik: " log entries are written to the log.
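The fallback rule for the ubik log level can be sketched in a few lines. Only the "ubik_debug" name comes from the text above; the function and parameter names are hypothetical.

```python
from typing import Optional


def effective_ubik_level(app_level: int, ubik_debug: Optional[int] = None) -> int:
    """Level used to decide which "ubik: " entries are written.

    When ubik_debug is unspecified or zero, the application's own log
    level applies; otherwise ubik_debug is used on its own.
    """
    if not ubik_debug:
        return app_level
    return ubik_debug


print(effective_ubik_level(app_level=5))                  # falls back to 5
print(effective_ubik_level(app_level=5, ubik_debug=0))    # falls back to 5
print(effective_ubik_level(app_level=5, ubik_debug=125))  # uses 125
```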
Security improvements include volserver validation of destination volserver security policies prior to transmitting marshaled volume data. Prior to v0.184 the volservers were trusted to reject volumes whose security policy could not be enforced. Linux cache managers can no longer be keyed with rxkad tokens. Introduction of a pam module capable of managing tokens for both AuriStorFS and/or Linux Kernel kAFS.
The UNIX Cache Manager underwent major revisions to improve the end user experience by revealing more error codes, improving directory cache efficiency, and overall resiliency. The cache manager implementation was redesigned to be more compatible with operating systems such as Linux and macOS that support restartable system calls. With these changes errors such as "Operation not permitted", "No space left on device", "Quota exceeded", and "Interrupted system call" can be reliably reported to applications. Previously such errors might have been converted to "I/O error". These changes are expected to reduce the likelihood of "mount --bind" and getcwd failures on Linux with "No such file or directory" errors.
A potentially serious race condition and reference counting error in the vol package shared by the Fileserver and Volserver could prevent volumes from being detached which in turn could prevent the Fileserver and Volserver from shutting down. After 30 minutes the BOSServer would terminate both processes. The reference counting errors could also prevent a volserver from marshaling volume data for backups, releases, or migrations.
This release moves the location of the cache manager's cmstate.dat from /etc/yfs/ to /var/yfs/ or /var/lib/yfs/ depending upon the operating system. The cmstate.dat file stores the cache manager's persistent UUID which must be unique. The cmstate.dat file must not be replicated. If virtual machines are cloned the cmstate.dat must be removed. The cmstate.dat file must not be managed by a configuration management system.
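For sites that clone virtual machines, the removal step could be scripted along the following lines. This is a minimal sketch, not an AuriStor-provided tool: the function name is hypothetical, it simply checks both directories named above because the notes do not say which operating system uses which, and it assumes it is run with sufficient privileges on the clone before the cache manager starts.

```python
from pathlib import Path
from typing import Iterable, List

# Candidate locations named in the release notes; which one applies
# depends on the operating system.
DEFAULT_DIRS = (Path("/var/yfs"), Path("/var/lib/yfs"))


def remove_cmstate(dirs: Iterable[Path] = DEFAULT_DIRS) -> List[Path]:
    """Delete cmstate.dat from each directory where it exists.

    Returns the list of files removed so the caller can log them.
    """
    removed = []
    for directory in dirs:
        candidate = Path(directory) / "cmstate.dat"
        if candidate.is_file():
            candidate.unlink()
            removed.append(candidate)
    return removed
```

An empty return value means no cmstate.dat was found in either location.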
The release includes two new vos command options:
* "vos addsite -force"
* "vos listvol -id"
Finally, this release includes a Linux PAM module as well as support for the Amazon Linux 2 distribution and many more quality and performance improvements.
AuriStor, Inc. is pleased to sponsor and invite AFS and Linux kernel developers to the second Linux kernel AFS (kAFS) Hackathon and Birds of a Feather meeting. The hackathon and BoF will be co-located with the USENIX Vault '19 - Linux Storage and Filesystems Conference.
This release improves RX call reliability across network paths with a high degree of packet loss and/or round trip times larger than 60ms. The corrected bugs have been present in all IBM derived RX implementations dating back to the mid 90s. The impact of these bugs is an increased risk of timeouts and performance degradation for long lived calls over high latency network paths that periodically experience packet loss. Volume operations such as moves, releases, backups and restores over WAN connections are particularly susceptible due to the amount of data transmitted in each RPC.
One feature change is experimental support for RX windows larger than 255 packets (360KB). This release extends the RX flow control state machine to support windows larger than the Selective Acknowledgment table. The new maximum of 65535 packets (90MB) could theoretically fill a 100 gbit/second pipe provided that the packet allocator and packet queue management strategies could keep up.
A change to volume use statistics tracking when volumes are moved, copied, and restored. The AFS volume dump stream format which is used for volume archives and volume transfers can store the daily and weekly vnode access counts but none of the other extended volume statistics maintained by the fileserver. When a volume is moved it makes sense for the use counts to be migrated with the volume to the new location. When a volume is copied it makes sense that the new location should start its counters from zero instead of the values collected at the location that was used as the source. Finally, when restoring a volume or releasing a new snapshot of a volume to readonly or backup sites, the use counts should remain unaltered. Beginning with this release, when AuriStorFS v0.178 or later "vos" is used in combination with an AuriStorFS v0.178 or later destination "volserver" the desired use count management will take place. At the moment the weekly access counts are only accessible when using the "vos examine -format" switch.
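The intended handling of use counts can be summarized as a small decision function. This is a sketch of the policy as described above, not the volserver's code; the function name, the operation labels, and the dictionary representation of the counters are all hypothetical.

```python
from typing import Dict


def destination_use_counts(operation: str,
                           source: Dict[str, int],
                           destination: Dict[str, int]) -> Dict[str, int]:
    """Return the use counts the destination should hold after `operation`."""
    if operation == "move":
        # Counts migrate with the volume to its new location.
        return dict(source)
    if operation == "copy":
        # A copy starts its counters from zero.
        return {name: 0 for name in source}
    if operation in ("restore", "release"):
        # Existing counts at the destination remain unaltered.
        return dict(destination)
    raise ValueError(f"unknown operation: {operation}")
```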
Many more quality and performance improvements
AuriStor's RX implementation has undergone a major upgrade of its flow control model. Prior implementations were based on TCP Reno Congestion Control as documented in RFC5681; and SACK behavior that was loosely modelled on RFC2018. The new RX state machine implements SACK based loss recovery as documented in RFC6675, with elements of New Reno from RFC6582 on top of TCP-style congestion control elements as documented in RFC5681. The new RX also implements RFC2861 style congestion window validation.
When sending data the RX peer implementing these changes will be more likely to sustain the maximum available throughput while at the same time improving fairness towards competing network data flows. The improved estimation of available pipe capacity permits an increase in the default maximum window size from 60 packets (84.6 KB) to 128 packets (180.5 KB). The larger window size increases the per call theoretical maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec and on a 30ms RTT link from 23.1 mbit/sec to 49.29 mbit/sec.
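The theoretical maximums follow from sending at most one full window per round trip. The sketch below works through that arithmetic. The 1444-byte payload per packet is an assumption inferred from 84.6 KB / 60 packets (taking 1 KB as 1024 bytes), and the function name is hypothetical.

```python
# Assumed payload per packet, inferred from 84.6 KB / 60 packets (1 KB = 1024 bytes).
PAYLOAD_BYTES = 1444


def max_throughput_mbit(window_packets: int, rtt_ms: float,
                        payload_bytes: int = PAYLOAD_BYTES) -> float:
    """Upper bound on per-call throughput: one full window per round trip."""
    bits_per_window = window_packets * payload_bytes * 8
    return bits_per_window / (rtt_ms / 1000.0) / 1e6


for window in (60, 128):
    for rtt in (1, 30):
        print(f"{window:3d} packets, {rtt:2d} ms RTT: "
              f"{max_throughput_mbit(window, rtt):7.1f} mbit/sec")
```

Under these assumptions the results come out close to the figures quoted above; real calls will see less because of packet loss, acknowledgment delays, and processing overhead.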
Workarounds for an IBM AFS and OpenAFS RX header userStatus field information leakage bug. This bug inadvertently interferes with the RX service upgrade mechanism that permits AuriStorFS clients (including Linux kafs) and services to detect each other without undesirable timeouts or extra round trips.
When an affected IBM or OpenAFS cache manager (or fileserver) establishes a connection to an AuriStorFS server the bug can result in an unintentional RX service upgrade. For example, if a pre-v0.175 fileserver incorrectly upgraded an incoming RX connection from RXAFS to RXYFS, it would mistakenly believe the client offered the RXYFSCB callback service; which it doesn't. The failure to establish a successful connection to the RXYFSCB service would cause the fileserver to reject the client's RXAFS requests with a VBUSY error.
A fileserver is supposed to be able to serve data from a .readonly or .backup volume while the volserver is dumping or forwarding the volume contents. This functionality introduced in IBM AFS 3.3 was fatally broken in AuriStorFS v0.157 when the volume disk interface was overhauled to avoid data corruption. Then starting with v0.168 "vos release" failed to terminate the volume transaction used to clone the RW volume to the RO site on the same server. Attempts to read from volumes that were exclusively in-use by the volserver would return VOFFLINE (106) errors.
As of the AuriStorFS v0.175 release, .readonly and .backup volumes can once again be attached to fileservers while a "vos release" or "vos dump" command is in progress. Since some of the fixed defects were in "vos" and others in the fileserver, both "vos" and the fileserver must be updated to v0.175 to ensure correct behavior.
A major security model change to the Backup Tape Controller (butc), backup coordinator command, and the backup service to address OPENAFS-SA-2018-001.
Starting with v0.175 butc supports:
- yfs-rxgk and rxkad authentication
- AES256-CTS-HMAC-SHA1-96 or 56-bit fcrypt wire encryption
- super user authorization
- auditing of all remote procedure call requests
The new security model is incompatible with the existing "backup" and "butc" processes. The new "butc" always executes using "localauth" credentials just as any other cell service does; it can no longer be executed using tokens obtained via aklog.
The butc service will by default require all incoming RPCs to be authenticated as a super user either via use of -localauth credentials or end user identities found in the UserListExt or ReaderList bosserver configuration.
As a side effect of these changes, both backup and butc gain IPv6 support.
As the new security model is incompatible with the existing deployed butc and backup processes, the 0.175 version includes configuration knobs to force the use of the old security model for backward compatibility. Use of these knobs restores the privilege escalation vulnerability. Please contact AuriStorFS support if your site requires use of this configuration.
New data input validation improvements within the vlserver and volserver. These changes ensure that the vlserver cannot store volume location records referencing invalid fileservers or volume site parameters; and that the volserver cannot forward volume data to volservers that are not registered with the cell's location service.
The same validation has been added to vos to ensure that it cannot be instructed to violate cell constraints.
- Correct Linux disk cache management to support AppArmor sandboxes.
- Many more quality and performance improvements.
In conjunction with Apple's release of macOS Mojave (10.14) to the general public, AuriStor announces the release of AuriStorFS v0.174 for macOS Mojave. Both AuriStorFS clients and servers can be installed.
AuriStor announces the release of AuriStor File System v0.174. In addition to the usual mix of bug fixes and functionality improvements, the v0.174 release includes a very special gift: A new x86_64 assembly language implementation of the AES256-CTS-HMAC-SHA1-96 encryption algorithm for Linux and macOS. This implementation leverages the following Intel processor extensions (when available):
- Advanced Encryption Standard New Instructions (AES-NI)
- Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4)
- Advanced Vector Extensions (AVX, AVX2)
Originally intended for use by the Linux kernel module, the AuriStor implementation of AES256-CTS-HMAC-SHA1-96 is 2.4 times faster than OpenSSL and Apple's Common Crypto assembly language implementations. As a result, AuriStor has decided to leverage its implementation exclusively on Linux and macOS for servers and administration tools.
The AuriStor assembly language implementation is ten times faster than the C language implementation used by previous releases of the AuriStorFS cache manager on x86_64 Linux.
On processors that implement AES-NI and AVX2 the performance cost of yfs-rxgk integrity protected and encrypted connections compared to rxnull unprotected connections is expected to be minimal. The Intel Core i5-4250U CPU @ 1.30GHz (Haswell 22nm), a low-end consumer processor, can compute (encrypt, sign, verify, decrypt) better than 217,000 yfs-rxgk packets per second (or 2.3 Gbit/second) per core.
A 20-core server class processor with 10 cores dedicated to Rx listener threads and 10 cores remaining for application service threads (where cryptographic operations are performed) can saturate dual-bonded 10gbit/second network interfaces with yfs-rxgk protected traffic.
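A back-of-the-envelope check of the saturation claim can be sketched as follows. It rests on two assumptions that the text does not state: that each server core performs at least as well as the low-end Core i5 figure quoted above, and that cryptographic throughput scales linearly with the number of cores. The names are hypothetical.

```python
import math

PACKETS_PER_SEC_PER_CORE = 217_000   # from the text (Core i5-4250U)
GBIT_PER_SEC_PER_CORE = 2.3          # from the text

# Average packet size implied by the two figures above (about 1325 bytes).
implied_bytes_per_packet = (GBIT_PER_SEC_PER_CORE * 1e9 / 8
                            / PACKETS_PER_SEC_PER_CORE)


def cores_needed(link_gbit: float,
                 per_core_gbit: float = GBIT_PER_SEC_PER_CORE) -> int:
    """Cores required to keep up with link_gbit, assuming linear scaling."""
    return math.ceil(link_gbit / per_core_gbit)


# Dual-bonded 10gbit/second interfaces carry up to 20 Gbit/second.
print(cores_needed(20.0))
```

Under these assumptions the two bonded interfaces need fewer than the 10 cores left for application service threads, which is consistent with the claim.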
One customer compared "vos release" of a small volume storing 10GB in 5000 files and directories between v0.167 and v0.173 on RHEL 6.9 x86_64. They observed:
- a 24% reduction in clock time to complete the operation
- a 100% increase in the peak number of packets sent per second
The reductions in processor time per packet result in reduced per-packet latency and an increased capacity to scale the number of simultaneous RPCs per file server, volume server, location server and protection server.
The incentive for sites to migrate from the 1980s rxkad to yfs-rxgk is greater than ever.
v0.170 is primarily a performance improvement release. AuriStor RX v0.170 is the world's first implementation capable of transferring more than 5.5TB per call. For the first time in AFS history, volumes larger than 5.5TB can be moved, replicated, backed up and restored. v0.170 includes Meltdown and Spectre optimizations for UBIK services reducing by more than 50% the number of syscalls required to process UBIK requests. The v0.170 release includes more than 400 changes compared to v0.168. v0.169 was not publicly released.
v0.168 is a critical bug fix release addressing a fileserver denial of service vulnerability [CVE-2018-7444] and a client side bug in fs setacl -negative which generates more permissive access control lists than intended [CVE-2018-7168]. The v0.168 fileserver provides cell administrators the ability to prevent clients incorporating the bug from storing ACLs. This release also adds support for Red Hat Enterprise Linux 7.5 kernels and includes optimizations to reduce the impact of Meltdown and Spectre mitigations. v0.168 also includes major improvements to the volume transaction lifecycle. Interrupted or failed transactions no longer require cell administrators to manually clean up temporary volumes. The v0.168 release includes nearly 400 changes compared to v0.167.
v0.167 is a critical bug fix release addressing a denial of service vulnerability [CVE-2017-17432] in all services and clients. This release also adds support for the forthcoming Linux 4.15 kernel and two new vos subcommands, movesite and copysite.
v0.164 is a bug fix and performance release. This release includes a major rewrite of core cache manager I/O pathways on Linux supporting direct I/O, cache bypass, and read-ahead. This release includes additional improvements and bug fixes to UBIK beyond those shipped in v0.163 to support mixed OpenAFS and AuriStorFS deployments.
v0.163 contains major updates to the UBIK database replication protocol implementation that increase resiliency to peer communication failures and permit sites to mix IBM/OpenAFS and AuriStorFS servers without introducing single points of failure. These changes combined with those included in v0.162 simplify the migration from OpenAFS to AuriStorFS. v0.163 introduces Linux 4.14 kernel support. File server detection of and protection against unresponsive cache manager callback service implementations.
v0.162 contains AFS3-compatibility changes for the UBIK database replication protocol that permit AuriStorFS servers to be deployed in IBM AFS and OpenAFS cells without configuring them as clones. First release with Fedora 27 support. Bug fixes and on-going improvements.
v0.160 introduces macOS High Sierra and Apple File System support. On Linux, exporting the /afs file namespace via Linux nfsd using NFS2, NFS3, and NFS4 is now supported. Reduced memory utilization by the RX networking stack. File server workaround for deadlocks that are known to occur within IBM AFS and OpenAFS Unix cache managers. Bug fixes and general improvements.
v0.159 introduces the "vos eachfs" command. Linux 4.13 kernel support. Continued performance improvements and bug fixes.
Linux 4.12 kernel support. Fedora 26 support. Fileserver support for XFS and BTRFS reflinks for improved vice partition copy-on-write performance. Volserver and "vos" support for quotas larger than 2TB. Linux cache manager performance enhancements to address parallel workflows. macOS fix for Orpheus' Lyre. On-going bug fixing and improvements.
Nico Williams, Viktor Dukhovni and Jeffrey Altman announced the discovery of the "Orpheus' Lyre puts Kerberos to Sleep" bug.
As the name suggests, this implementation flaw can result in a failure of Kerberos mutual authentication. Kerberos is supposed to provide a secure method of network authentication impervious to man-in-the-middle attacks. Fortunately, the protocol is secure but a mistake made by many implementations permits an attacker to successfully perform service impersonation and, in conjunction with credential delegation (ticket forwarding), client impersonation. The attack is silent and cannot be detected.
This is a client-side vulnerability, so it must be fixed by patching the client systems. Systems that have more than one Kerberos implementation installed must obtain patches from all of the implementations to be secure.
The MIT Kerberos implementation was never vulnerable. As patches for other implementations become available the https://www.orpheus-lyre.info/ site will be updated to indicate that.
Yesterday Microsoft issued patches and those should in my opinion be treated as critical with minimal delays before deployment. Heimdal also issued a patch which is included in version 7.4.
AuriStorFS bundles Heimdal when the local operating system's Kerberos and GSS-API cannot satisfy its requirements. The affected platforms include:
- Apple macOS (all versions)
- Solaris (all versions)
- Microsoft Windows (all versions)
- Apple iOS (all versions)
This 1.6.21 release of OpenAFS includes a fix to Rx which improves the performance of Rx connections between OpenAFS and AuriStorFS when the OpenAFS peer is writing bulk data to the AuriStorFS peer. Examples include:
- OpenAFS cache manager issuing RXAFS_StoreData calls to AuriStorFS file servers
- OpenAFS volserver forwarding volume data to an AuriStorFS volserver
- OpenAFS volserver dumping volume data to an AuriStorFS "vos dump"
- OpenAFS "vos restore" restoring volume data to an AuriStorFS volserver
There are of course other scenarios involving backups, bulk vlserver queries, etc.
The fix avoids the introduction of 100ms delays as the AuriStor Rx peer attempts to re-open a call window which had been closed due to a lack of receive buffers while waiting for the incoming data to be consumed.
New features include support for IBM TSM in the AuriStorFS Backup Tape Controllers. "vos eachvol" enhancements. Faster "pts examine" performance. Automated salvager repair of corrupted volume vnode index file entries. New "vos status" command provides more informative volserver transaction status output including bytes sent and received for each call when both "vos" and "volserver" are v0.150 or above. Linux 4.11 kernel support. "fs" commands now support the -nofollow switch. Many bug fixes and reliability improvements.
This Wednesday AuriStor, Inc. is sponsoring a Hackathon and BOF in support of kAFS and AF_RXRPC development at the Linux Foundation's annual Vault conference.
What are kAFS and AF_RXRPC?
AF_RXRPC is an implementation of the Rx RPC protocol implemented in the Linux mainline network stack as a socket family accessible both to userland and in-kernel processes.
kAFS is an implementation of the AFS and AuriStorFS file system client in the Linux mainline kernel.
Why are AF_RXRPC and kAFS important?
The AFS file system namespace has been available on Linux as a third party add-on since the IBM days. The IBM AFS derived implementations suffer from:
- performance limitations due to the existence of a global lock to protect internal data structures
- license incompatibility with GPL_ONLY licensed kernel functionality that further restricts performance and functional capabilities
In addition, out of tree file system modules:
- are not a standard component of most Linux distributions thereby preventing ubiquitous access to the /afs file system namespace
- are not kept in sync with core filesystem and network layer changes in the Linux kernel by the developers responsible for those changes
Collectively, these issues increase the hurdles to use of the /afs file system namespace.
- Organizations must be careful not to deploy new Linux kernel versions until such time as updated AFS or AuriStorFS kernel modules are developed and distributed.
- Organizations cannot obtain the full benefit of the latest hardware whether that be hardware support for advanced cryptographic operations, patented processes such as rcu, and other techniques that can scale file system access to tens or hundreds of cpu cores.
- Lack of common distribution complicates the use of the /afs file namespace in support of Linux container based deployments as most organizations are unwilling to or unable to deploy custom kernels across their internal and cloud (AWS, Azure, ...) infrastructures.
What is the History of kAFS and AF_RXRPC?
David Howells began work on an in-tree AFS client for Linux circa 2001. Unlike the IBM AFS derived cache manager, David's implementation is not a monolithic file system and proprietary network stack designed for portability across operating systems. Instead, David's AFS client is designed as separate modular components that are integrated into Linux to maximize their usefulness not only for AFS but for a broader class of applications:
- Instead of implementing Rx as a proprietary component of the AFS file system, David added Rx as a native socket family integrated with the Linux kernel networking stack at the same layer as UDP and TCP processing. This produces noticeable reductions in packet processing overhead. At the same time, the Rx RPC protocol becomes readily available as a lightweight secure RPC for userland applications. As a demonstration of how easy it is to use, David implemented much of the AFS administration command suite (bos, pts, vos) in Python by combining a Python XDR class with AF_RXRPC network sockets.
- Instead of AFS Process Authentication Groups (PAGs), David designed the Linux Keyrings which are now a core Linux component used in support of many file systems and network identity solutions.
- David developed the FS-Cache file system caching layer which is used in support of NFS* and CIFS file systems.
- David's kAFS is the AFS and AuriStorFS specific file system functionality including the callback services.
Unfortunately, David Howells has received minimal support from the AFS user community. As a result, neither AF_RXRPC nor kAFS has been included in any major Linux distribution.
Why is AuriStor, Inc. contributing?
AuriStor, Inc. has invested substantial resources into its AuriStorFS Linux client in support of Red Hat Enterprise Linux, Fedora, CentOS, Debian and Ubuntu and will continue to do so. Yet, AuriStor, Inc. recognizes that widespread adoption of AuriStorFS servers for large scale Enterprise and Research deployments requires higher performance, greater scale and easier maintenance for Linux systems.
AuriStor, Inc. also recognizes that many of the workflows that have relied upon the /afs file namespace for software and configuration distribution are migrating to containers. That transition is not without its own challenges related to the management of Container identity and authentication to persistent network based resources. AuriStor, Inc. believes that the global /afs file namespace combined with the AuriStor Security Model (combined identity authentication and multi-factor constrained elevation authorization) is best suited to addressing the outstanding Container deployment issues.
AuriStor, Inc. believes that only through a native in-tree client can these issues be addressed.
What is AuriStor, Inc. contributing?
AuriStor, Inc. has leveraged its expertise and extensive quality assurance infrastructure to identify flaws in the AF_RXRPC and kAFS implementations. Over the last year hundreds of corrections and enhancements have been merged into the Linux mainline tree. Missing functionality has been identified and is being implemented one piece at a time.
As kAFS approaches production readiness AuriStor, Inc. will contribute native AuriStorFS client support including an implementation of the "yfs-rxgk" security class to the AF_RXRPC socket family.
It is our hope that by the end of 2017 kAFS and AF_RXRPC will be ready for inclusion in major Linux distributions side-by-side with NFS* and CIFS.
AuriStor, Inc. is also working with major players in the Container eco-system and the Linux Foundation to address the identity management problem. When successful, it will be possible to launch Containers with network credentials such as Kerberos tickets and AFS/AuriStorFS tokens managed by the host. AuriStor, Inc. believes that this functionality combined with the /afs file namespace will allow true portability of Containerized processes across private and public cloud infrastructures.
v0.145 is the latest in the on-going efforts to improve the fileserver's ability to perform volume operations when the volume is under heavy load and to recover when the unexpected happens.
One of the strengths of the /afs model is the ability to move, release, backup, and dump volumes while they are being accessed by clients under production loads. For example, the following scenario should in theory be handled without a hiccup:
- Take two file servers each with at least one vice partition.
- Create a volume V on fs1/a
- Add RO sites on fs1/a and fs2/a
- On at least two clients execute "iozone -Rac" in separate directories of volume V using the RW path
- On at least one client start a loop that lists one of the directories with stat info that the iozone test is writing to from V.backup.
- On at least one client start a loop that lists one of the directories with stat info that the iozone test is writing to from V.readonly.
- Repeat the following process in a tight loop
- "vos backup V"
- "vos release V"
- "vos move V fs2 a"
- "vos backup V"
- "vos release V"
- "vos move V fs1 a"
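The tight loop in the last step can be sketched as a generator that produces the command sequence. The strings are the article's own shorthand and are not necessarily complete "vos" command lines, so the sketch only prints them rather than executing anything. The function name and its parameters are hypothetical.

```python
from itertools import cycle, islice
from typing import Iterator, Tuple


def stress_commands(volume: str = "V",
                    servers: Tuple[str, str] = ("fs1", "fs2"),
                    partition: str = "a") -> Iterator[str]:
    """Yield the endless backup/release/move sequence from the scenario.

    The volume starts on servers[0]; each pass moves it to the other server.
    """
    # The first move targets servers[1], the next servers[0], and so on.
    for target in cycle(servers[1:] + servers[:1]):
        yield f"vos backup {volume}"
        yield f"vos release {volume}"
        yield f"vos move {volume} {target} {partition}"


# One full round trip of the volume (fs1 -> fs2 -> fs1).
for command in islice(stress_commands(), 6):
    print(command)
```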
Organizations do not actively operate their cells in this fashion but the expectation is that "if they did, it should work." Of course, the answer is "it didn't before v0.145". Why not, and where did it fail?
For those of you that are unfamiliar with the fileserver architecture, here is the process list as reported by the bosserver for a fileserver:
[C:\]bos status great-lakes.auristor.com -long
Instance dafs, (type is dafs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sat Dec 24 12:19:03 2016 (3 proc starts)
    Command 1 is '/usr/libexec/yfs/fileserver'
    Command 2 is '/usr/libexec/yfs/volserver'
    Command 3 is '/usr/libexec/yfs/salvageserver'
    Command 4 is '/usr/libexec/yfs/salvager'
The first three commands execute a set of dependent processes. The fileserver process:
- Registers the fileserver with the VL service and communicates with the PT service
- Processes all requests from AFS clients (aka cache managers) via the RXAFS and RXYFS Rx network services.
- Maintains a cache of all volume headers
- Is the exclusive owner of all volumes. The volserver and salvageserver processes request readonly or exclusive access to volumes from the fileserver
- Issues requests to the salvageserver to perform consistency checks and repair volumes when a problem is detected with the volume headers, the on-disk data, or other.
The goal is to ensure that a volume is available to the fileserver process as long as there are active requests.
Each "vos" command is implemented by one or more VL and VOL RPCs. The VOL RPCs are processed by the volserver. The volserver can:
- Submit a query to the fileserver to obtain the necessary data to satisfy the request
- Request readonly access to a RW volume which can be used to produce a new clone (for .readonly or .backup or .roclone)
- Request exclusive access to a RW, RO, BK or an entire volume group
- Request a volume be salvaged by the salvageserver
When the salvageserver is asked to salvage a volume it requests exclusive access to the volume group from the fileserver. There are a lot of moving parts.
For each of the "vos backup", "vos release" and "vos move" commands the volserver will exercise different combinations of query, readonly and exclusive access to volumes. The various modes of client requests to the fileserver:
- reading from the .backup
- reading from the .readonly
- writing to the RW
Together these produce contention between the volservers and the fileservers for control of the volume.
The expected behavior is that a volume will be offline for the shortest amount of time possible, that the client will retry the request in a timely manner, and most importantly, that the client will not treat a temporary outage as a fatal error.
What could go wrong?
For starters, the client retry algorithm is quite poor. Whenever a volume is busy or offline the client will sleep for 15 seconds. In other words, the client will block for an eternity and when it finally does retry it's likely to find the volserver with exclusive ownership of the volume and sleep again for another 15 seconds. The v0.145 release does not fix the client behavior. That will come in a future release but for testing purposes we removed all of the sleeps from the clients and forced immediate retries in order to maximize the contention.
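The cost of a fixed 15 second sleep can be illustrated with a toy model. This is not the cache manager's code: the function is hypothetical and the busy intervals are invented for illustration. It only shows how a long fixed delay can repeatedly land inside short periods of volserver ownership while a shorter delay gets through almost immediately.

```python
from typing import Sequence, Tuple


def first_success(busy: Sequence[Tuple[float, float]], delay: float,
                  start: float = 0.0, limit: float = 3600.0) -> float:
    """Time at which a client retrying every `delay` seconds first succeeds.

    `busy` lists half-open [begin, end) intervals, in seconds, during which
    the volume is busy or offline.
    """
    if delay <= 0:
        raise ValueError("delay must be positive")
    t = start
    while t < limit:
        if not any(begin <= t < end for begin, end in busy):
            return t
        t += delay
    raise TimeoutError("volume never observed online within limit")


# Invented example: three short outages, each only a few seconds long.
outages = [(0, 2), (14, 17), (29, 31)]
print(first_success(outages, delay=15.0))  # fixed 15 second sleep
print(first_success(outages, delay=1.0))   # short retry interval
```

In this invented schedule the volume is unavailable for only 7 of the first 45 seconds, yet the client sleeping 15 seconds at a time hits an outage on every retry in that span.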
Once sufficient contention was generated, we observed that after two round trips of the volume moving from fs1/a to fs2/a to fs1/a to fs2/a and back to fs1/a, the volume would go offline and not come back. We noticed that the fileserver would be asked to put the volume into an online state but would immediately request a salvage. The salvageserver would verify the state, ask the fileserver to put the volume into service, but the fileserver would immediately request another salvage. This would repeat a dozen times before the volume would be taken offline permanently. Attempts to manually salvage the volume would succeed but the volume would still not return to an online state. The failure was 100% reproducible.
For those of you that remember IBM AFS and OpenAFS 1.4.x and earlier, there was no on-demand attachment or salvaging of volumes. The fileserver was much simpler. The fileserver forced a salvage of all volumes at startup and then attached them. The volserver always obtained exclusive access to volumes. The fileserver never cached any volume metadata. If something went wrong with a volume, it was taken offline until an administrator intervened.
The struggles of the last few months have all been directly attributable to the changes introduced as part of the demand attach functionality introduced in OpenAFS 1.6.x. We have encountered deadlocks, on-disk volume metadata corruption, copy-on-write data corruption, salvager failures, and now fileserver volume metadata cache corruption.
After two round-trips of volume movement while the RW, RO, and BK are all under active usage, the fileserver's volume group cache would end up out of sync with the volserver. The fileserver would believe the volume was still owned by the volserver when it wasn't. The fileserver would request the salvageserver to verify the volume state but nothing it could do would make a difference.
As of v0.145 the handoff from volserver to fileserver has been corrected. Now the volume can be backed up, released, and moved continuously while each of the volume types are actively accessed. No salvages. No VNOVOL errors. The iozone benchmark will continue to function even if the fileserver or volserver processes are periodically killed.
With a modified client to reduce the delays after receiving a VBUSY or VOFFLINE error, the iozone benchmark continues to function although at a lower rate of throughput. There are periodic pauses when large copy-on-write operations need to be performed. Which brings me to our New Year's resolution.
In 2017, AuriStor aims to remove the copy-on-write delays. The COW delays are due to the need to copy the entire backing file each time a file is modified after a volume clone is created. On Linux distributions that include "xfs reflinks" support for vice partitions AuriStor fileservers will be able to complete COW operations in constant time without regard to the length of the file being modified. This change coupled with client-side improvements to the retry algorithms will significantly reduce the performance hit after volume clone operations.
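The difference between a full copy and a reflink can be sketched with the Linux FICLONE ioctl, which asks the file system to share extents between two files instead of copying bytes. This is an illustration of the general technique on Linux, not the AuriStor fileserver's implementation. The function name is hypothetical, and the sketch falls back to an ordinary copy when the file system does not support cloning.

```python
import fcntl
import shutil

# Linux ioctl request number for FICLONE (from <linux/fs.h>).
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Create dst as a copy of src; return True if a reflink clone was used."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Shares the source extents with dst; no file data is copied.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # File system (or platform) does not support cloning.
            pass
    # Fallback: a full copy whose cost grows with the length of the file.
    shutil.copyfile(src, dst)
    return False
```

On a file system without reflink support the function returns False, which corresponds to the full backing file copy described above.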
Have a Happy New Year! The AuriStor team is looking forward to an excellent year.
Today OpenAFS announced Security Advisory OPENAFS-SA-2016-003 and released 1.6.20 which is an urgent security release for all versions of OpenAFS. IBM AFS cache managers and all OpenAFS Windows clients are also affected. There is no update to the OpenAFS Windows client.
AuriStor File System clients and servers do not experience this information leakage. However, volumes migrated to AuriStorFS file servers from OpenAFS or IBM AFS file servers will retain the information leakage.
In addition to the impact described in the announcement it is worth noting that all backups and any archived dump files will contain information leakage. Restoring a backup or dump file containing information leakage will restore that leaked information to the file servers where it will be delivered to cache managers.
Salvaging restored volumes with the -salvagedirs option is required to purge the information leakage.
It is worth emphasizing that IBM AFS and OpenAFS volserver operations including all backup operations occur in the clear. Therefore, all leaked information will be visible to passive viewers on the network segments across which volume backups and moves occur.