[ NOTE: machine translation with the help of DeepL translator without additional proofreading and spell checking ]
To ensure trouble-free operation of an ActiveCluster, special attention must be paid to the services it depends on. The availability of DNS services, for example, must be guaranteed at all times. Such a dependency clearly exists in the following two scenarios:
Use of the cloud mediator or of further Pure1 features.
Use of an on-premises mediator that is addressed via its DNS host name.
What does the mediator actually do?
The mediator is a component of the ActiveCluster that resides in a failure domain independent of both peer FlashArrays. The requirements for the mediator are modest: it responds to heartbeats from the peers (the FlashArrays) and accepts or rejects their contact requests when they lose their connection to each other. The mediator thus determines which FlashArray keeps the pod online when connections fail.
The so-called "split brain" is a well-known problem in cluster systems. If two systems that share and coordinate information cannot communicate with each other, neither can determine what the other is doing, making coordination impossible. The problem is especially fatal with synchronous replication, whose purpose is to keep the volumes on both arrays identical. With ActiveCluster, updates to so-called pod volumes are synchronized, which requires the peer arrays to interact with each other on every write. If both FlashArrays receive host write commands for a pod volume but cannot communicate with each other, each must decide for itself whether to execute or reject those commands. To solve the split-brain problem, ActiveCluster uses the mediator.
The first array to reach the mediator stays online; the subsequent request from the other FlashArray is rejected. The other array thus "loses the race" (Pure Storage also calls this the mediator race), meaning it no longer serves I/O directed at pod volumes (local volumes and volumes in other pods are not affected). Unlike some competing solutions, the ActiveCluster mediator is passive: in regular operation - while the ActiveCluster arrays are connected - its only task is to answer the heartbeats that confirm the peers can reach it. When the ActiveCluster peers try to contact the mediator, it accepts the first request and rejects the second; beyond that, it does not communicate with the FlashArrays *.
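The accept/reject behavior described above can be sketched as a simple first-come-first-served decision. This is a simplified illustration under my own assumptions, not Pure's actual implementation; all names are made up:

```python
class Mediator:
    """Simplified sketch of the ActiveCluster mediator race:
    the first array to request an election for a pod wins,
    the later request from the peer is rejected."""

    def __init__(self):
        self.winners = {}  # pod name -> array that won the race

    def heartbeat(self, array: str) -> bool:
        # In regular operation the mediator only answers heartbeats.
        return True

    def request_election(self, pod: str, array: str) -> bool:
        # First request for a pod wins and keeps the pod online;
        # any later request from the peer is rejected.
        if pod not in self.winners:
            self.winners[pod] = array
            return True
        return self.winners[pod] == array

m = Mediator()
assert m.request_election("POD", "PURE-1") is True   # PURE-1 wins the race
assert m.request_election("POD", "PURE-2") is False  # the peer is rejected
```

The losing array stops serving I/O to the affected pod volumes until the cluster is healthy again; local volumes are untouched.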
There are many deployment scenarios, and you need to weigh which one suits you and your company best. Do you use the on-premises or the cloud mediator? Can you do without Pure1 features in the event of a failure? All of this should be evaluated and defined during setup.
Depending on the size of the company, you will find a wide variety of system landscapes that offer countless options, but often also static and inflexible infrastructures. I would like to point out what you should not configure under any circumstances. The ActiveCluster setup is famously done in a few minutes - but have you thought about all the dependencies?
Gateway/Router (Cloud Mediator)
On the subject of the gateway, just briefly: if you do not have a redundant internet breakout, this "could" put your entire HA construct at risk in the event of a failure. Pure has found a way around this through continuous product improvement - more on that later.
The Pure1 Cloud Mediator
I have created visualizations of two setups which, before Purity 5.3, led to a cluster shutdown in the event of a failure: one environment with a single gateway and one with redundant gateways. The cloud mediator is used in both cases. The following case studies assume a failure of the replication link in a uniform ActiveCluster.
The so-called "mediator race" now begins: both FlashArrays try to reach the cloud mediator. This will fail, however, if - due to VM moves (or already during setup) - the configured DNS servers reside on volumes inside a pod: those volumes are temporarily "frozen" and cannot answer any DNS queries.
The same applies to proxies (if configured in Purity) that reside on pod volumes. A small oversight that, in the event of a failure, can take an entire cluster "out of action" *.
Avoiding the name-resolution problem is quite simple: we have to make sure the DNS services reside on storage that is still redundant but independent of the pods.
To do this, we create non-pod (standalone) volumes on both FlashArrays; as described, these are always available and have no dependency on the ActiveCluster. The created volumes are mapped to the local servers (cross-data-center mapping to the remote servers is also possible). I do not consider this optional mapping strictly necessary, since DNS servers replicate at the application level, but it can be advantageous for data-center shutdowns (planned relocation/migration scenarios), for example.
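As a quick sanity check: pod volumes on a FlashArray carry a "pod::" prefix in their name (e.g. "POD::VOL1"), while standalone volumes do not. A small helper like the following (the VM-to-volume mapping is a made-up example) could flag DNS VMs that accidentally ended up on pod volumes:

```python
def is_pod_volume(volume_name: str) -> bool:
    # ActiveCluster pod volumes are named "<pod>::<volume>";
    # standalone (non-pod) volumes have no "::" in their name.
    return "::" in volume_name

# Hypothetical mapping of DNS VMs to their backing volumes:
dns_vm_volumes = {
    "dns-01": "DNS-LOCAL-1",  # standalone volume - safe
    "dns-02": "POD::VOL1",    # pod volume - would freeze with the pod!
}

misplaced = [vm for vm, vol in dns_vm_volumes.items() if is_pod_volume(vol)]
assert misplaced == ["dns-02"]
```

Any VM listed in `misplaced` would lose its storage exactly when the pod freezes - the scenario described above.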
If such a spanned mapping is set up, I use VMware's built-in means (VM/host groups and rules) to ensure that the DNS VMs always run on the local hosts.
This results in the following visualizations:
As written above, DNS queries can now be resolved successfully at any time because the virtual machines reside outside the temporarily frozen pod volumes of "POD".
HINT: we are talking about a few seconds in the "frozen" state, which applications usually do not even notice (depending on the latency to the mediator).
In the example, FlashArray "PURE-1" wins the "race", and access to "POD::VOL1" is activated via the available paths on the FlashArray in data center 1. Systems (physical servers, VMs) are served - without any intervention - from DC1 until regular operation has been restored.
The on-premises mediator
If an on-premises mediator (quorum) is used, the DNS dependencies are less critical, provided the mediator is not addressed via DNS. If it is, you must likewise place the DNS servers outside the pods. In general, I always advise the latter anyway, because then you can always use all Pure1 features (proactive support, etc.)!
Checking the settings
If you need to check or change the DNS settings, you can do so in the network settings (Settings > Network) of the FlashArrays. Changes should, however, be announced to support in advance; otherwise tickets may be created proactively (if a heartbeat is not possible during the change - the interval is every 5 minutes by default).
Up to three DNS servers can be configured (as of Purity 6.0.4). If you have a third data center, I advise specifying one DNS server per data center to achieve maximum DNS availability.
The mediator to be used can be defined per pod, where "purestorage" always stands for the cloud mediator. With the "Failover Preference" option, it is also possible to give one FlashArray a head start in the "mediator race" - as a rule, this option only makes sense with non-redundant gateways. In that case, the FlashArray closest to the gateway should be set as the preference.
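The same settings can also be handled on the CLI. The following is a sketch with example names and IPs - please verify the exact syntax against the on-board user guide of your Purity version before running anything:

```shell
# Show the currently configured DNS servers
puredns list

# Set one DNS server per data center (example IPs, up to three entries)
puredns setattr --nameservers 10.1.0.10,10.2.0.10,10.3.0.10

# List the pods on the array
purepod list

# Point the pod "POD" at the cloud mediator ("purestorage")
purepod setattr --mediator purestorage POD

# Give "PURE-1" a head start in the mediator race for pod "POD"
purepod setattr --failover-preference PURE-1 POD
```

As in the GUI, a change to the DNS servers should be announced to support beforehand to avoid proactively opened tickets.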
(*) Pre-Election Feature
This feature is available and active by default as of Purity 5.3 (no activation required - always on). It ensures that the pod volumes remain online even if both FlashArrays lose their connection to the mediator and the replication connection fails afterwards as well.
What does Pre-Election do?
After the FlashArrays detect that the mediator is unavailable, a pre-election (= pre-elect) takes place:
It ensures that the pre-elected array keeps the pod online if the replication network or the array that was NOT elected fails.
The elected array follows the pod's failover preference; if none is configured, the pre-election proceeds as described above.
Regular ActiveCluster operation resumes as soon as the FlashArrays have re-established their mediator connection.
The feature does not replace the use of a mediator in general! Pre-election does not help if the mediator and the replication connection fail at the same time: the pre-election must already have taken place within the heartbeat interval.
HINT: it is important to know that once the pre-election feature has taken effect, it is not cleared until the pre-elected FlashArray is connected to its peer again and the mediator is reachable.
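The pre-election steps above can be sketched roughly as follows. This is a simplified model under my own assumptions (array and pod names are invented), not Purity's implementation:

```python
def pre_elect(pods: dict, mediator_reachable: bool) -> dict:
    """For each pod, pick the array that stays online if the replication
    link fails while the mediator is unreachable. Honors the pod's
    failover preference; without one, falls back to a deterministic choice."""
    if mediator_reachable:
        return {}  # regular operation - no pre-election needed
    elected = {}
    for pod, info in pods.items():
        preference = info.get("failover_preference")
        # Preferred array wins; otherwise pick deterministically so that
        # both peers arrive at the same result independently.
        elected[pod] = preference or sorted(info["arrays"])[0]
    return elected

pods = {
    "POD":  {"arrays": ["PURE-1", "PURE-2"], "failover_preference": "PURE-1"},
    "POD2": {"arrays": ["PURE-1", "PURE-2"], "failover_preference": None},
}
assert pre_elect(pods, mediator_reachable=False) == {"POD": "PURE-1", "POD2": "PURE-1"}
assert pre_elect(pods, mediator_reachable=True) == {}
```

The key property is determinism: both arrays must reach the same conclusion without talking to each other, which is why the fallback cannot be random.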
Possible failure scenarios
There is a good overview of the various failure scenarios of the ActiveCluster components. I must say: with the additional features now available, it is extremely unlikely that volume availability is lost - as long as the ActiveCluster basics are observed at all times!
I also refer you to the official "ActiveCluster Planning and Design Guide" in the Pure Technical Services Portal. There you will also find the documentation for a successful Pure1 mediator connection: "FlashArray Port Assignments".
More info - Links
All officially published configuration options - in the GUI as well as the CLI - can be looked up in the "on-board" user guides of the Pure Storage systems.
Click on "Help" in the Purity main menu.
The user guide is structured like the main menu and can be expanded section by section. A search function is also integrated, so you can also search for keywords here.
WEB: Pure Storage (Pure1) support portal - ticket system and support *(requires registered FlashArrays)
PHONE: Pure Storage phone support: GER - (+49) (0)800 7239467; INTERNATIONAL - (+1) 650 7294088