Tag Archives: enhancements

How to Select a Placement Policy for Site-Aware Clusters

One of the more popular failover clustering enhancements in Windows Server 2016 and 2019 is the ability to define the different fault domains in your infrastructure. A fault domain lets you scope a single point of failure in hardware, whether this is a Hyper-V host (a cluster node), its enclosure (chassis), its server rack or an entire datacenter. To configure these fault domains, check out the Altaro blog post on configuring site-aware clusters and fault domains in Windows Server 2016 & 2019. After you have defined the hierarchy between your nodes, chassis, racks, and sites, the cluster’s placement policies, failover behavior, and health checks are optimized accordingly. This blog will explain the automatic placement policies and advanced settings you can use to maximize the availability of your virtual machines (VMs) with site-aware clusters.
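Since everything that follows assumes the fault domain hierarchy is already in place, here is a minimal sketch of how it might be created with the FailoverClusters PowerShell module. The site, rack, and node names ("Primary-Site", "Secondary-Site", "Rack01", "Node01") are placeholders for your own infrastructure.

# Define two sites and a rack as fault domains (placeholder names).
New-ClusterFaultDomain -Name "Primary-Site" -Type Site -Description "Main datacenter"
New-ClusterFaultDomain -Name "Secondary-Site" -Type Site -Description "DR datacenter"
New-ClusterFaultDomain -Name "Rack01" -Type Rack

# Build the hierarchy: the rack sits in the primary site, and an existing cluster node sits in the rack.
Set-ClusterFaultDomain -Name "Rack01" -Parent "Primary-Site"
Set-ClusterFaultDomain -Name "Node01" -Parent "Rack01"

# Review the resulting hierarchy.
Get-ClusterFaultDomain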

Site-Aware Placement Based on Storage Affinity

From reading the earlier Altaro blog about fault tolerance, you may recall that resiliency is created by distributing identical (mirrored) Storage Spaces Direct (S2D) disks across the different fault domains. Each node, chassis, rack or site may contain a copy of a VM’s virtual hard disks. However, you always want the VM to run in the same site as its disk for performance reasons, to avoid sending I/O across the distance between sites. If a VM is forced to start in a separate site from its disk, the cluster will automatically live migrate the VM back to the same site as its disk after about a minute. With site-awareness, the automatic enforcement of storage affinity between a VM and its disk is given the highest site placement priority.

Configuring Preferred Sites with Site-Aware Clusters

If you have configured multiple sites in your infrastructure, then you should consider which site is your “primary” site and which should be used as a backup. Many organizations will designate their primary site as the location closest to their customers or with the best hardware, and the secondary site as the failover location, which may have only enough hardware to support critical workloads. Some enterprises may deploy identical datacenters and distribute specific workloads to each location to balance their resources. If you are splitting your workloads across different sites, you can assign each clustered workload or VM (cluster group) a preferred site. Let’s say that you want your US-East VM to run in your primary datacenter and your US-West VM to run in your secondary datacenter; you could configure the following settings via PowerShell:
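A sketch of those settings, assuming the sites were created as fault domains named "Primary-Site" and "Secondary-Site" (substitute your own site and group names):

# Pin each VM's cluster group to the site where it should normally run.
(Get-ClusterGroup -Name "US-East VM").PreferredSite = "Primary-Site"
(Get-ClusterGroup -Name "US-West VM").PreferredSite = "Secondary-Site"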

Designating a preferred site for the entire cluster will ensure that after a failure the VMs will start in this location. After you have defined your sites with New-ClusterFaultDomain, you can use the cluster-wide property PreferredSite to set the default location where VMs launch. Below is the PowerShell cmdlet:
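(A minimal sketch; "Primary-Site" stands in for whichever site you defined.)

# Make the primary site the default placement location for the whole cluster.
(Get-Cluster).PreferredSite = "Primary-Site"
# Verify the setting.
(Get-Cluster).PreferredSite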

Be aware of your capacity if you usually distribute your workloads across two sites and they are forced to run in a single location, as performance will diminish with fewer hardware resources. Consider using the VM prioritization feature and disabling automatic VM restarts after a failure, as this will ensure that only the most important VMs will run. You can find more information in this Altaro blog on how to configure start order priority for clustered VMs, and a brief sketch follows below.
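As a rough illustration, priority is set on the cluster group: 3000 is High, 2000 Medium, 1000 Low, and 0 means No Auto Start. The group names here are hypothetical.

# Ensure the most important VM is restarted first after a failure.
(Get-ClusterGroup -Name "Critical-SQL-VM").Priority = 3000
# Prevent a low-priority test VM from restarting automatically at all.
(Get-ClusterGroup -Name "Test-VM").Priority = 0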

To summarize, placement priority is based on:

  • Storage affinity
  • Preferred site for a cluster group or VM
  • Preferred site for the entire cluster

Site-Aware Placement Based on Failover Affinity

When site-awareness has been configured for a cluster, there are several automatic failover policies that are enforced behind the scenes. First, a clustered VM or group will always fail over to a node, chassis or rack within the same site before it moves to a different site. Local failover is faster than cross-site failover because the VM can come online against its local disk and avoid the network latency between sites. Similarly, site-awareness is also honored by the cluster when a node is drained for maintenance: the VMs will automatically move to a local node rather than a cross-site node.
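You can see this behavior during planned maintenance; a quick sketch, with "Node01" as a placeholder node name:

# Drain roles off a node for maintenance; the cluster prefers destinations in the same site.
Suspend-ClusterNode -Name "Node01" -Drain
# After maintenance, return the node to service and fail the roles back.
Resume-ClusterNode -Name "Node01" -Failback Immediate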

Cluster Shared Volumes (CSV) disks are also site-aware. A single CSV disk can store multiple Hyper-V virtual hard disks while allowing their VMs to run simultaneously on different nodes. However, it is important that these VMs all run on nodes within the same site. This is because the CSV service coordinates disk write access from multiple nodes to a single disk. In the case of Storage Spaces Direct (S2D), the disks are mirrored, so there are identical copies in different locations (or sites). If VMs were writing to mirrored CSV disks in different locations and replicating their data without any coordination, it could lead to disk corruption. Microsoft ensures that this problem never occurs by requiring all VMs that share a CSV disk to run in the local site and write to a single instance of that disk. Furthermore, CSV distributes the VMs across different nodes within the same site, balancing the workloads and the write requests sent to the coordinator node.
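To see which node currently acts as the coordinator (owner) for each CSV, and therefore where its VMs should be running, you can inspect the volumes; a minimal example:

# List each Cluster Shared Volume with its current coordinator (owner) node.
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State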

Site-Aware Health Checks and Cluster Heartbeats

Advanced cluster administrators may be familiar with cluster heartbeats, which are health checks between cluster nodes. This is the primary way in which cluster nodes validate that their peers are healthy and functioning. The nodes will ping each other once per predefined interval, and if a node does not respond after several attempts it will be considered offline, failed or partitioned from the rest of the cluster. When this happens, the host is not considered an active node in the cluster and it does not provide a vote towards cluster quorum (membership).

If you have configured multiple sites in different physical locations, then you should configure the frequency of these pings (CrossSiteDelay) and the number of health checks that can be missed (CrossSiteThreshold) before a node is considered failed. The greater the distance between sites, the more network latency will exist, so these values should be tuned to minimize the chance of a false failover during periods of high network traffic. By default, the pings are sent every 1 second (1000 milliseconds), and when 20 are missed, a node is considered unavailable and any workloads it was hosting are redistributed. You should test your network latency and cross-site resiliency regularly to determine whether you should increase or reduce these default values. Below is an example that changes the testing frequency from every 1 second to 5 seconds and the number of missed responses from 20 to 30.
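A sketch of that change using the cluster common properties named above (CrossSiteDelay is in milliseconds, CrossSiteThreshold is a count):

# Send cross-site heartbeats every 5 seconds instead of every 1 second.
(Get-Cluster).CrossSiteDelay = 5000
# Tolerate 30 missed responses instead of 20 before declaring a node down.
(Get-Cluster).CrossSiteThreshold = 30
# Review the heartbeat-related settings.
Get-Cluster | Format-List *CrossSite*,*SameSubnet*,*CrossSubnet*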

Increasing these values means it takes longer for a failure to be confirmed and for failover to begin, resulting in greater downtime. The default is 1 second x 20 misses = 20 seconds, and this example extends it to 5 seconds x 30 misses = 150 seconds.

Site-Aware Quorum Considerations

Cluster quorum is an algorithm that clusters use to determine whether there are enough active nodes in the cluster to run its core operations. For additional information, check out this series of blogs from Altaro about multi-site cluster quorum configuration. In a multi-site cluster, quorum becomes complicated since there could be a different number of nodes in each site. With site-aware clusters, “dynamic quorum” will be used to automatically rebalance the number of nodes which have votes. This means that as cluster nodes drop out of membership, the number of voting nodes changes. If there are two sites with an equal number of voting nodes, then the nodes in the preferred site will stay online and run the workloads, while the lower-priority site loses its votes and does not host any VMs.
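You can watch dynamic quorum adjust the vote distribution with a quick check such as the following (illustrative only):

# Confirm dynamic quorum is enabled (1 = enabled, the default).
(Get-Cluster).DynamicQuorum
# See which nodes currently hold a vote; DynamicWeight reflects the rebalanced vote.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight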

Windows Server 2012 R2 introduced a setting known as the LowerQuorumPriorityNodeID, which allowed you to set a node in a site as the least important, but this was deprecated in Windows Server 2016 and should no longer be used. The idea behind this was to easily declare which location was the least important when there were two sites with the same number of voting nodes. The site with the lower priority node would stay offline while the other partition would run the clustered workloads. That caused some confusion since the setting was only applied to a single host, but you may still see this setting referenced in blogs such as Altaro’s https://www.altaro.com/hyper-v/quorum-microsoft-failover-clusters/.

The site-awareness features added to the latest versions of Windows Server will greatly enhance a cluster’s resilience through a combination of user-defined policies and automatic actions. By creating fault domains for clusters, it is easy to provide even greater VM availability by moving workloads between nodes, chassis, racks, and sites as efficiently as possible. Failover clustering further reduces the configuration overhead by automatically applying best practices to make failover faster and keep your workloads online longer.

Wrap-Up

Useful information yes? How many of you are using multi-site clusters in your organizations? Are you finding it easy to configure and manage? Having issues? If so, let us know in the comments section below! We’re always looking to see what challenges and successes people in the industry are running into!

Thanks for reading!


Author: Symon Perriman

Dell EMC HCI and storage cloud plans on display at VMworld

LAS VEGAS — Dell EMC launched cloud-related enhancements to its storage and hyper-converged infrastructure products today at the start of VMworld 2018.

The Dell EMC HCI and storage product launch includes a new VxRail hyper-converged appliance, which uses VMware vSAN software. The vendor also added a cloud version of the Unity midrange unified storage array and cloud enhancements to the Data Domain data deduplication platform.

Dell EMC HCI key for multi-cloud approach?

Dell EMC is also promising synchronized releases between the VxRail and the VMware vSAN software that turns the PowerEdge into an HCI system – although it could take 30 days for the “synchronization.” Still, that’s an improvement over the six months or so it now takes for the latest vSAN release to make it to VxRail.

Like other vendors, Dell EMC considers its HCI a key building block for private and hybrid clouds. The ability to offer private clouds with public cloud functionality is becoming an underpinning of the multi-cloud strategies at some organizations.

Sam Grocott, senior vice president of marketing for the Dell EMC infrastructure solutions group, said the strong multi-cloud flavor of the VMworld product launches reflects conversations the vendor has with its customers.

“As we talk to customers, the conversation quickly turns to what we are doing in the cloud,” Grocott said. “Customers talk about how they’re evaluating multiple cloud vendors. The reality is, they aren’t just picking one cloud, they’re picking two or even three clouds in a lot of cases. Not all your eggs will be in one basket.”

Dell EMC isn’t the only storage vendor making its storage more cloud-friendly. Its main storage rival, NetApp, also offers unified primary storage and backup options that run in the cloud, and many startups focus on cloud compatibility and multi-cloud management from the start.

Grocott said Dell’s overall multi-cloud strategy is to provide a consistent operating model experience on premises, as well as in private and public clouds. That strategy covers Dell EMC and VMware products. Dell EMC VxRail is among the products that tightly integrate VMware with the vendor’s storage.

“That’s what we think is going to differentiate us from any of the competition out there,” he said. “Whether you’re protecting data or storing data, the learning curve of your operating model — regardless of whether you’re on premises or off premises — should be zero.”

Stu Miniman, a principal analyst at IT research firm Wikibon, said Dell EMC is moving toward what Wikibon calls a True Private Cloud.

Wikibon’s 2018 True Private Cloud report predicts almost all enterprise IT will move to a hybrid cloud model dominated by SaaS and true private cloud. Wikibon defines true private cloud as completely integrating all aspects of a public cloud, including a single point of contact for purchase, support, maintenance and upgrades.

“The new version of the private cloud is, let’s start with the operating model I have in the public cloud, and that’s how I should be able to consume it, bill it and manage it,” Miniman said. “It’s about the software, it’s about the usability, it’s about the management layer. Step one is to modernize the platform; step two is to modernize the apps. It’s taken a couple of years to move along that spectrum.”

SteelCentral NPM upgrades to bolster SD-WAN portfolio

Riverbed Technology Inc. has introduced enhancements to its network performance monitor and integration between its SD-WAN product and cloud-based security provider Zscaler Inc.

This week’s announcements are related because SteelCentral NPM is used to monitor cloud-based business applications served by the SteelConnect SD-WAN. The latter routes traffic from a company’s branch to online software, the corporate data center or the internet.

A SteelCentral NPM module called Insights integrates with SteelConnect and provides information on usage and availability of the SD-WAN. “As time goes on, we’re going to have more and more functionality from the SteelCentral side into SteelConnect and other appliances in the Riverbed portfolio,” said Milind Bhise, the senior director of product marketing at Riverbed.

SteelCentral improvements

The SteelCentral platform is best suited for large networks requiring application performance analysis across WAN connections. New features in the latest version of the SteelCentral NPM software include integration between its Aternity module and ServiceNow’s online customer service management product.

Aternity monitors the performance of applications running on the web, virtual desktops and mobile devices. The integration makes it possible for ServiceNow to generate trouble tickets automatically when performance thresholds are not met.

Other enhancements include the ability to add SteelCentral monitoring of containers without changing them. A container is an OS-level virtualization method for deploying and running applications. The feature works alongside container orchestrators, including Kubernetes and Docker Swarm.

The latest SteelCentral iteration adds log messages from network devices to application performance data to assist developers and support troubleshooting. Starting with the log data, engineers can trace application activity to locate the source of the problem. The feature eliminates the need for a separate log analytics tool, Riverbed said.

Finally, Riverbed introduced a 40 Gbps network interface card for Riverbed appliances running SteelCentral and other products. The NIC doubles the traffic flow capacity.

[Image: New and improved SteelCentral NPM console]

SteelConnect with Zscaler

The SteelConnect-Zscaler integration makes it possible to use the former to direct internet traffic to the cloud-based security service. Zscaler products include a secure web gateway, firewall and data loss prevention tools.

SteelConnect customers would buy Zscaler separately, but access its portal through the SD-WAN product’s console.

SD-WAN vendors are adding services to their core products to make them all-in-one networking products for branch offices. Along with security, vendors are tacking on WAN optimization and edge routing.