Tag Archives: open

Instaclustr CTO on open source database as a service

In recent years, organizations of all sizes have increasingly come to rely on open source database technologies, including Apache Cassandra.

The complexity of deploying and managing Cassandra at scale has led to a rise in database-as-a-service (DBaaS) providers offering managed Cassandra services in the cloud. Among the vendors that provide managed Cassandra today are DataStax, Amazon and Instaclustr.

Instaclustr, based in Redwood City, Calif., got its start in 2013 and has grown over the past eight years to offer managed services for a number of open source data layer projects, including Apache Kafka for event streaming, Redis for database and data caching, and Elasticsearch for data query and visualization.

In this Q&A, Ben Bromhead, co-founder and CTO of Instaclustr, discusses the intersection of open source and enterprise software and why database as a service is a phenomenon that is here to stay.

How has Instaclustr changed over the last eight years?

Ben Bromhead: Our original vision was wildly different and, like all good startups, we had a pretty decent pivot. When the original team got together, we were working on a marketplace for high value data sets. We took a data warehouse approach for the different data sets we provided and the access model was pure SQL. It was kind of interesting from a computer science perspective, but we probably weren’t as savvy as we needed to be to take that kind of business to market.

But one of the things we learned along the way was there was a real need for Apache Cassandra database services. We had to spend a lot of time getting our Cassandra database ready and managing it. We quickly realized that there was a market for that, so we built a web interface for a service with credit card billing, wrote a few blog posts and within a few months we had our first production customers. That’s how we kind of pivoted and got into the Cassandra database-as-a-service space.

Originally, when we built Instaclustr, the idea was very much about democratizing Cassandra for smaller users and smaller use cases. Over the years, we very clearly started to move into medium and large enterprises because they tend to have bigger deployments. They also tend to have more money and are less likely to go out of business.

There are a few Cassandra DBaaS vendors now (including Amazon). How do you see the expansion of the market?

Bromhead: We’re very much of the view that having more players in the market validates the market. But sure, it does make our jobs a little bit harder.

Our take on it [managed Cassandra as a service] is also a little bit different from some of the other vendors in that we really take a multi-technology approach. So you know, not only are we engaging with our customers around their Cassandra cluster, but we’re also helping them with the Kafka cluster, Elasticsearch and Redis.

So what ends up happening is we end up becoming a trusted partner for a customer’s data layer and that’s our goal. We certainly got our start with Cassandra, that’s our bread and butter and what we’re known for, but in terms of the business vision, we want to be there as a data layer supporting different use cases.

You know, it’s great to see more Cassandra services come in. They’ve got a particular take on it and we’ve got a particular take on it. I’m very much a believer that a rising tide lifts all boats.

How does Instaclustr select and determine which open source data layer technologies you will support and turn into a managed service?

Bromhead: We’re kind of 100 percent driven by customers. So you know, when they asked us for something, they’re like, ‘Hey, you do a great job with our Elasticsearch cluster, can you look after our Redis or a Mongo?’ That’s probably the major signal that we pay most attention to. We also look at the market and certainly look at what other technologies are getting deployed side by side.

We very clearly look for and prefer technologies where the core IP or the majority of the IP is owned by an open source foundation. So whether that’s Apache or the Cloud Native Computing Foundation, whatever they may be. It’s one thing to have an open source license. It’s another thing to have strong governance and strong IP and copyright protection.

What are the challenges for Instaclustr in taking an open source project and turning it into an enterprise-grade DBaaS?

Bromhead: The open source versus enterprise grade production argument is starting to become a little bit of a false dichotomy to some degree. One thing we’ve been super focused on in the open source space around Cassandra is getting it to be more enterprise-grade and doing it in an open source way.

So a great example of that is: We have released a bunch of authentication improvements to Apache Cassandra that typically you only see in the enterprise distributions. We’ve also released backup and audit capabilities as well.

It’s one thing to have the features and to be able to tick the feature box as you go down the list. It’s another thing to run a technology in a production-grade way. We take a lot of the pain out of that, in an easily reproducible, repeatable manner so that our support team can make sure that we’re delivering on our core support promises. Some of the challenges of getting stuff set up in a production-grade manner are going to get a little bit easier, particularly with the rise of Kubernetes.

The core challenge, however, for a lot of companies is actually just the expertise of being skilled in particular technologies.

We don’t live in a world where everything just lives on an Oracle or a MySQL database. You know, more and more teams are dealing with two or three or four different databases.

What impact has the COVID-19 pandemic had on Instaclustr?

Bromhead: On the business side of things it has been a mixed bag. As a DBaaS, we’re exposed to many different industries. Some of the people we work with have travel booking websites or event-based business and those have either had to pack up shop or go into hibernation.

On the flip side, we work with a ton of digital entertainment companies, including video game platforms, and that traffic has gone through the roof. We’re also seeing some people turn to Instaclustr as a way to reduce costs, to get out of expensive, unnecessary licensing agreements that they have.

We’re still on a pretty good path for growth for this year, so I think that speaks volumes to the resilient nature of the business and the diversity that we have in the customer base.

Editor’s note: This interview has been edited for clarity and conciseness.

At long last, Microsoft Teams to get multiwindow support

Microsoft Teams will soon let users open chats, calls and video meetings in separate windows. The long-sought feature will help people multitask in the team collaboration app.

Microsoft plans to finish rolling out pop-out chats this month. Teams will get multiwindow support for calls and video conferences sometime in June.

Nearly 20,000 people have asked Microsoft to add multiwindow capabilities to Teams since the first request in 2016. It’s yet another example of an essential feature of Skype for Business that’s still missing in Teams.

“It’s like not being able to open multiple Word or Excel documents at the same time,” said Andrew Dawson, an IT professional based in the United Kingdom. “Archaic!”

Without the ability to open multiple windows, users can only do one thing at a time in Teams. The limitation forces some companies to use other communications apps in conjunction with Teams.

Jacques Detroyat, an IT manager for a company based in Switzerland, said one common workaround is for users to message on Skype for Business or WhatsApp during Teams meetings.

The setup is not ideal, Detroyat said. “It’s a bit like writing with a badly sharpened pencil or trying to have a conversation in a noisy environment: You can do it, but the experience won’t be great.”

Microsoft is rolling out multiwindow chat for Microsoft Teams in May.

Some users want the company to support multiwindow viewing in even more scenarios. For example, Microsoft could let users edit a document in Teams in one window while searching for information they need in another. But the company has not committed to doing so.

Users will be able to open multiple Teams windows only in the Windows and Mac desktop apps. Microsoft has not said whether users of the web app will eventually get the upgrade.

The launch of multiwindow support will not solve another problem that users face. People want to be able to open separate Teams windows for different accounts on desktop. Microsoft has committed to letting users sign in to multiple accounts at the same time. But it has not provided an update on the feature in months.

Teams has attracted millions of new users during the coronavirus pandemic. The app grew from 20 million daily users at the end of 2019 to 75 million daily users in April.

The increased usage of Teams has made its shortcomings more aggravating to users. Complaints include the app not having a large enough group video display or a robust calendar.

PowerShell ForEach-Object cmdlet picks up speed

Since its move to an open source project in 2016, PowerShell’s development has picked up significantly.

The PowerShell 7.0 release arrived in March with a slew of improvements and new features. One of the most intriguing updates occurred with the PowerShell ForEach-Object cmdlet, which gained a powerful new ability to perform loops in parallel.

Most system administrators have needed to execute some command or operation on multiple systems. Before the addition of the Parallel parameter, each iteration in a loop would run sequentially, one after another. While this works fine for loops with a limited number of items, loops in which each step takes substantially more time are perfect candidates for the Parallel parameter.

The PowerShell ForEach-Object Parallel parameter attempts to run multiple iterations of the loop at the same time, potentially saving on the overall runtime. With this newfound capability come several important caveats to understand before using the Parallel parameter in any production scripts.

Understanding PowerShell ForEach-Object -Parallel

PowerShell supports several different methods of parallelism. In the case of ForEach-Object, runspaces provide this functionality. Runspaces are separate threads in the same process, and they carry less overhead than PowerShell jobs or PowerShell remoting.

A few factors add to the overhead of the ForEach-Object Parallel parameter: each parallel runspace needs to import the modules it uses, and outside variables must be referenced with the $Using: syntax. In some situations, the Parallel parameter is not ideal because of the extra overhead it generates, but there is a way to shift that burden away from the source machine.
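
As a rough sketch of both points, the snippet below imports a module inside the script block and pulls an outer variable in with the $Using: scope modifier (the module shown is only an illustration):

$Prefix = "Item"

1..5 | ForEach-Object -Parallel {
    # Modules used inside the script block may need an explicit Import-Module,
    # and variables from the calling session are only visible through $Using:.
    Import-Module -Name Microsoft.PowerShell.Utility
    "{0} {1}" -f $Using:Prefix, $_
}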

One automation concern with this additional feature is flooding your infrastructure or servers with multiple operations at once. To control this behavior, the ThrottleLimit parameter restricts the number of concurrent threads. When one thread completes, any additional iterations will take that thread’s place, up to the defined limit.

The default ThrottleLimit is five threads, which generally keeps memory and CPU usage low. Without this setting, you can quickly overwhelm your local system or server by running too many threads in parallel.
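
A minimal sketch of raising the limit, assuming the workload and the target systems can comfortably handle 10 concurrent threads:

1..100 | ForEach-Object -Parallel {
    Start-Sleep -Milliseconds 250   # stand-in for real work
    $_
} -ThrottleLimit 10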

Finally, one other useful ability of the Parallel parameter is that, combined with the AsJob parameter, any parallel loop can run as a PowerShell job. This functionality lets the PowerShell ForEach-Object command return a job object, which you can retrieve at a later time.
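
A short sketch of the AsJob parameter, which hands back a job object you can collect later with the standard job cmdlets:

$Job = 1..10 | ForEach-Object -Parallel {
    Start-Sleep -Seconds 1
    $_
} -AsJob

# Carry on with other work, then gather the results when the job finishes.
$Job | Wait-Job | Receive-Job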

Performance between Windows PowerShell 5.1 and PowerShell 7

There have been many performance improvements since Windows PowerShell 5.1 and especially so with the latest release of PowerShell 7. Specifically, how have things improved with the development of the ForEach-Object command?

The code below runs a simple test to show the speed difference in the PowerShell ForEach-Object command between different versions of PowerShell. The first example shows results from Windows PowerShell 5.1:

$Collection = 1..100000

(Measure-Command {
    $Collection | ForEach-Object {
        $_
    }
}).TotalMilliseconds
# Result: 35112.3222

In that version, the script takes more than 35 seconds to finish. In PowerShell 7, the difference is dramatic: the same loop takes slightly more than 1 second to complete:

$Collection = 1..100000

(Measure-Command {
    $Collection | ForEach-Object {
        $_
    }
}).TotalMilliseconds
# Result: 1042.3588

How else can we demonstrate the power of the Parallel parameter? One common feature in PowerShell scripts used in production is to introduce a delay to allow some other action to complete first. The following script uses the Start-Sleep command to add this pause.

$Collection = 1..10

(Measure-Command {
    $Collection | ForEach-Object {
        Start-Sleep -Seconds 1
        $_
    }
}).TotalMilliseconds
# Result: 10096.1418

As expected, running sequentially, the script block takes just over 10 seconds. The following code demonstrates the same loop using the Parallel parameter.

$Collection = 1..10

(Measure-Command {
    $Collection | ForEach-Object -Parallel {
        Start-Sleep -Seconds 1
        $_
    }
}).TotalMilliseconds
# Result: 2357.487

This change shaved almost 8 seconds off the total runtime. Even with only five threads running at once, each queued iteration kicks off as soon as a running one completes, for a significant reduction in execution time.

Putting the Parallel parameter in action

How can these enhancements and abilities translate to real-world system administration actions? There are countless scenarios that would benefit from running operations in parallel, but two that are very common are retrieving information from multiple computers and running commands against multiple computers.

Collecting data from multiple computers

One common administrative task is to gather information from many different systems at once. How is this done with the new PowerShell ForEach-Object -Parallel command? The following example retrieves the count of files in a user’s profile across several remote systems.

$Computers = @(
    "Computer1"
    "Computer2"
    "Computer3"
    "Computer4"
    "Computer5"
)

(Measure-Command {
    $User = $Env:USERNAME

    $Computers | ForEach-Object -Parallel {
        # Capture the outer variable so the nested $Using: reference below
        # resolves inside the remote session.
        $User = $Using:User

        Invoke-Command -ComputerName $_ -ScriptBlock {
            Write-Host ("{0}: {1}" -f $env:COMPUTERNAME, (Get-ChildItem -Path "C:\Users\$($Using:User)" -Recurse).Count)
        }
    }
}).TotalMilliseconds

Computer1: 31716
Computer2: 30055
Computer4: 28542
Computer3: 33556
Computer5: 26052
13572.8172

On PowerShell 7, the script completes in just over 13 seconds. The same script running on Windows PowerShell 5.1 without the Parallel parameter executes in just over 50 seconds.

Running commands against multiple computers

Oftentimes, an administrator needs a command or series of commands to run against several target systems as quickly as possible. The following code uses the Parallel parameter and PowerShell remoting to make quick work of copying a deployment file to each system.

$Computers = @(
    "Computer1"
    "Computer2"
    "Computer3"
    "Computer4"
    "Computer5"
)

$RemoteFile = "\\Server1\SharedFiles\Deployment.zip"

(Measure-Command {
    $Computers | ForEach-Object -Parallel {
        # Capture the outer variable so the nested $Using: reference below
        # resolves inside the remote session.
        $RemoteFile = $Using:RemoteFile

        Invoke-Command -ComputerName $_ -ScriptBlock {
            Copy-Item -Path $Using:RemoteFile -Destination "C:\"
        }
    }
}).TotalMilliseconds

23572.8172

Shifting overhead with Invoke-Command

One useful technique when working with remote systems is to lower overhead by shifting compute-intensive commands to the target systems. In the previous examples, Invoke-Command runs the commands in a PowerShell session on each remote system, which spreads the processing load and avoids potential performance bottlenecks on the machine running the script.
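
As a rough illustration of the idea (the path is arbitrary), the count below is computed on the remote machine, so only a single number travels back over the network instead of thousands of file objects:

$Count = Invoke-Command -ComputerName "Computer1" -ScriptBlock {
    # The enumeration and the count both happen on the remote system;
    # only the final number is returned to the caller.
    (Get-ChildItem -Path "C:\Logs" -Recurse -File).Count
}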

Why move to PowerShell 7 from Windows PowerShell?

PowerShell’s evolution has taken it from a Windows-only tool to a cross-platform, open source project that runs on Mac and Linux systems with the release of PowerShell Core. Next on tap, Microsoft is unifying PowerShell Core and Windows PowerShell with the long-term supported release called PowerShell 7, due out sometime in February. What are the advantages and disadvantages of adopting the next generation of PowerShell in your environment?

New features spring from .NET Core

Nearly rebuilt from the ground up, PowerShell Core is a departure from Windows PowerShell. There are many new features, architectural changes and improvements that push the language forward even further.

Open source PowerShell runs on a foundation of .NET Core 2.x in PowerShell 6.x and .NET Core 3.1 in PowerShell 7. The .NET Core framework is also cross-platform, which enables PowerShell Core to run on most operating systems. The shift to the .NET Core framework brings several important changes, including:

  • increases in execution speed;
  • Windows desktop application support using Windows Presentation Foundation and Windows Forms;
  • TLS 1.3 support and other cryptographic enhancements; and
  • API improvements.

PowerShell Core delivers performance improvements

As noted in the .NET Core changes, execution speed has been much improved. With each new release of PowerShell Core, there are further improvements to how core language features and built-in cmdlets alike work.

A test of the Group-Object cmdlet shows less time is needed to execute the task as you move from Windows PowerShell to the newer PowerShell Core versions.

With a simple Group-Object test, you can see how much quicker each successive release of PowerShell Core has become. A nearly 73% speed improvement from Windows PowerShell 5.1 to PowerShell Core 6.1 means running complex code gets easier and completes faster.

Another speed test with the Sort-Object cmdlet shows a similar improvement with each successive release of PowerShell.

Similar to the Group-Object test, the Sort-Object test shows nearly a doubling of execution speed between Windows PowerShell 5.1 and PowerShell Core 6.1. With sorting used so often in so many applications, running PowerShell Core for your daily workload means you will be able to get that much more done in far less time.

Gaps in cmdlet compatibility addressed

The PowerShell team began shipping the Windows Compatibility Pack for .NET Core starting in PowerShell Core 6.1. With this added functionality, the biggest reason for holding back from greater adoption of PowerShell Core is no longer valid. The ability to run many cmdlets that previously were only available to Windows PowerShell means that most scripts and functions can now run seamlessly in either environment.

PowerShell 7 will further close the gap by incorporating the functionality of the current Windows Compatibility Module directly into the core engine.

New features arrive in PowerShell 7

There are almost too many new features to list in PowerShell 7, but some of the highlights include:

  • SSH-based PowerShell remoting;
  • an & at the end of a pipeline automatically creates a PowerShell job running in the background (a short sketch follows this list);
  • many improvements to web cmdlets such as link header pagination, SSLProtocol support, multipart support and new authentication methods;
  • PowerShell Core can use paths more than 260 characters long;
  • markdown cmdlets;
  • experimental feature flags;
  • SecureString support for non-Windows systems; and
  • many quality-of-life improvements to existing cmdlets with new features and fixes.
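
As a brief sketch of the background operator mentioned above, the pipeline below runs as a job and its output is collected afterward with the standard job cmdlets:

$Job = Get-Process | Sort-Object -Property CPU -Descending &

# Keep working in the console, then pick up the results.
$Job | Wait-Job | Receive-Job | Select-Object -First 5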

Side-by-side installation reduces risk

A great feature of PowerShell Core, and one that makes adopting the new shell that much easier, is the ability to install the application side-by-side with the current built-in Windows PowerShell. Installing PowerShell Core will not remove Windows PowerShell from your system.

Instead of invoking PowerShell with the powershell.exe command, you use pwsh.exe (or just pwsh on Linux). In this way, you can test your scripts and functions incrementally before moving everything over en masse.
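
For example, assuming a script named Get-Inventory.ps1 (a placeholder name), the same file can be exercised under both engines from the same console:

powershell.exe -File .\Get-Inventory.ps1   # runs under Windows PowerShell 5.1
pwsh.exe -File .\Get-Inventory.ps1         # runs under PowerShell 7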

This feature allows quicker updating to new versions rather than waiting for a Windows update. By decoupling from the Windows release cycle or patch updates, PowerShell Core can now be regularly released and updated easily.

Disadvantages of PowerShell Core

One of the biggest drawbacks to PowerShell Core is losing the ability to run every cmdlet that worked in Windows PowerShell. There is still some functionality that can’t be fully replicated in PowerShell Core, but the number of cmdlets that are unable to run is rapidly shrinking with each release. This may delay some organizations’ move to PowerShell Core, but in the end, there won’t be a compelling reason to stay on Windows PowerShell given the increasing cmdlet support coming to PowerShell 7 and beyond.

Getting started with the future of PowerShell

PowerShell Core is released for a wide variety of platforms, Linux and Windows alike. Windows offers easily installable MSI packages, while Linux packages are available for a variety of package managers and repositories.

Simply starting the shell using pwsh will let you run PowerShell Core without disrupting your current environment. Even better is the ability to install a preview version of the next iteration of PowerShell and run pwsh-preview to test it out before it becomes generally available.

MariaDB X4 brings smart transactions to open source database

MariaDB has come a long way from its MySQL database roots. The open source database vendor released its new MariaDB X4 platform, providing users with “smart transactions” technology that enables both analytical and transactional workloads in the same database.

MariaDB, based in Redwood City, Calif., was founded in 2009 by the original creator of MySQL, Monty Widenius, as a drop-in replacement for MySQL, after Widenius grew disillusioned with the direction that Oracle was taking the open source database.

Oracle acquired MySQL via its acquisition of Sun Microsystems in 2008. Now, in 2020, MariaDB still uses the core MySQL database protocol, but the MariaDB database has diverged significantly in other ways that are manifest in the X4 platform update.

The MariaDB X4 release, unveiled Jan. 14, puts the technology squarely in the cloud-native discussion, notably because MariaDB is allowing for specific workloads to be paired with specific storage types at the cloud level, said James Curtis, senior analyst of data, AI and analytics at 451 Research.

“There are a lot of changes that they implemented, including new and improved storage engines, but the thing that stands out are the architectural adjustments made that blend row and columnar storage at a much deeper level — a change likely to appeal to many customers,” Curtis said.

MariaDB X4 smart transactions converges database functions

The divergence with MySQL has ramped up over the past three years, said Shane Johnson, senior director of product marketing at MariaDB. In recent releases MariaDB has added Oracle database compatibility, which MySQL does not include, he noted.

In addition, MariaDB’s flagship platform provides a database firewall and dynamic data masking, both features designed to improve security and data privacy. The biggest difference today, though, between MariaDB and MySQL is how MariaDB supports pluggable storage engines, which gain new functionality in the X4 update.

Previously, when using pluggable storage engines, users would deploy one instance of MariaDB with the InnoDB storage engine for transactional use cases and another instance with the ColumnStore columnar storage engine for analytics, Johnson explained.

In earlier releases, a Change Data Capture process synchronized those two databases. In the MariaDB X4 update, transactional and analytical features have been converged in an approach that MariaDB calls smart transactions.

“So, when you install MariaDB, you get all the existing storage engines, as well as ColumnStore, allowing you to mix and match to use row and columnar data to do transactions and analytics, very simply, and very easily,” Johnson said.

MariaDB X4 aligns cloud storage

Another new capability in MariaDB X4 is the ability to more efficiently use cloud storage back ends.

“Each of the storage mediums is optimized for a different workload,” Johnson said.

For example, Johnson noted that Amazon Web Services’ S3 is a good fit for analytics because of its high availability and capacity. He added that for transactional applications with row-based storage, Amazon Elastic Block Store (EBS) is a better fit. The ability to mix and match EBS and S3 in the MariaDB X4 platform makes it easier for users to consolidate both analytical and transactional workloads in the database.

“The update for X4 is not so much that you can run MariaDB in the cloud, because you’ve always been able to do that, but rather that you can run it with smart transactions and have it optimized for cloud storage services,” Johnson said.

MariaDB database as a service (DBaaS) is coming

MariaDB said it plans to expand its portfolio further this year.

The core MariaDB open source community project is currently at version 10.4, with plans for version 10.5, which will include the smart transactions capabilities, to debut sometime in the coming weeks, according to MariaDB.

The new smart transaction capabilities have already landed in the MariaDB Enterprise 10.4 update. The MariaDB Enterprise Server has more configuration settings and hardening for enterprise use cases.

The full MariaDB X4 platform goes a step further with the MariaDB MaxScale database proxy, which provides automatic failover, transaction replay and a database firewall, as well as utilities that developers need to build database applications.

Johnson noted that traditionally new features tend to land in the community version first, but as it happened, during this cycle MariaDB developers were able to get the features into the enterprise release quicker.

MariaDB has plans to launch a new DBaaS product this year. Users can already deploy MariaDB to a cloud of choice on their own. MariaDB also has a managed service that provides full management for a MariaDB environment.

“With the managed service, we take care of everything for our customers, where we deploy MariaDB on their cloud of choice and we will manage it, administer it, operate and upgrade, it,” Johnson said. “We will have our own database as a service rolling out this year, which will provide an even better option.”

How should organizations approach API-based SIP services?

Many Session Initiation Protocol features are now available through open APIs for a variety of platforms. While voice over IP refers only to voice calls, SIP encompasses the setup and release of all calls, whether they are voice, video or a combination of the two.

Because SIP establishes and tears down call sessions, it brings multiple tools into play. SIP services enable the use of multimedia, VoIP and messaging, and can be incorporated into a website, program or mobile application in many ways.

The available APIs range from application-specific interfaces to libraries for programming languages, such as Java or Python, used in web-based applications. Some newer interfaces are operating system-specific for Android and iOS. SIP is an open protocol, which makes most features available natively regardless of the SIP vendor. However, the features and implementations of SIP service APIs are specific to the API vendor.

Some of the more promising features include the ability to create a call during the shopping experience or from the shopping cart at checkout. This enables customer service representatives and customers to view the same product and discuss and highlight features within a browser, creating an enhanced customer shopping experience.
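
How that looks in practice depends entirely on the vendor. As a rough, hypothetical sketch only, a vendor exposing a REST-style call-creation endpoint might be invoked from the storefront's back end roughly like this (the URL, field names and token variable are all placeholders, not a real API):

$Body = @{
    from    = "sip:agent@example.com"       # hypothetical agent endpoint
    to      = "sip:customer@example.com"    # hypothetical customer endpoint
    context = "shopping-cart-checkout"
} | ConvertTo-Json

Invoke-RestMethod -Method Post -Uri "https://api.sip-vendor.example/v1/calls" `
    -Headers @{ Authorization = "Bearer $env:SIP_API_TOKEN" } `
    -ContentType "application/json" `
    -Body $Body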

The type of API will vary based on which offerings you use. Before issuing a request for a quote, issue a request for information (RFI) to learn what kinds of SIP service APIs a vendor has to offer. While this step takes time, it will allow you to determine what is available and what you want to use. You will want to determine the platform or platforms you wish to support. Some APIs may be more compatible with specific platforms, which will require some programming to work with other platforms.

Make sure to address security in your RFI.  Some companies will program your APIs for you. If you don’t have the expertise, or aren’t sure what you’re looking for, then it’s advantageous to meet with some of those companies to learn what security features you need. 

Microsoft Open Data Project adopts new data use agreement for datasets

Last summer we announced Microsoft Research Open Data—an Azure-based repository-as-a-service for sharing datasets—to encourage the reproducibility of research and make research data assets readily available in the cloud. Among other things, the project started a conversation between the community and Microsoft’s legal team about dataset licensing. Inspired by these conversations, our legal team developed a set of brand new data use agreements and released them for public comment on GitHub earlier this year.

Today we’re excited to announce that Microsoft Research Open Data will be adopting these data use agreements for several datasets that we offer.

Diving a bit deeper on the new data use agreements

The Open Use of Data Agreement (O-UDA) is intended for use by an individual or organization that is able to distribute data for unrestricted uses, and for which there is no privacy or confidentiality concern. It is not appropriate for datasets that include any data that might include materials subject to privacy laws (such as the GDPR or HIPAA) or other unlicensed third-party materials. The O-UDA meets the open definition: it does not impose any restriction with respect to the use or modification of data other than ensuring that attribution and limitation of liability information is passed downstream. In the research context, this implies that users of the data need to cite the corresponding publication with which the data is associated. This aids in findability and reusability of data, an important tenet in the FAIR guiding principles for scientific data management and stewardship.

We also recognize that in certain cases, datasets useful for AI and research analysis may not be able to be fully “open” under the O-UDA. For example, they may contain third-party copyrighted materials, such as text snippets or images, from publicly available sources. The law permits their use for research, so following the principle that research data should be “as open as possible, as closed as necessary,” we developed the Computational Use of Data Agreement (C-UDA) to make data available for research while respecting other interests. We will prefer the O-UDA where possible, but we see the C-UDA as a useful tool for ensuring that researchers continue to have access to important and relevant datasets.

Datasets that reflect the goals of our project

The following examples reference datasets that have adopted the Open Use of Data Agreement (O-UDA).

Location data for geo-privacy research

Microsoft researcher John Krumm and collaborators collected GPS data from 21 people who carried a GPS receiver in the Seattle area. Users who provided their data agreed to it being shared as long as certain geographic regions were deleted. This work covers key research on privacy preservation of GPS data as evidenced in the corresponding paper, “Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location,” which was accepted at the Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010). The paper has been cited 147 times, including for research that builds upon this work to further the field of preservation of geo-privacy for location-based services providers.

Hand gestures data for computer vision

Another example dataset is that of labeled hand images and video clips collected by researchers Eyal Krupka, Kfir Karmon, and others. The research addresses an important computer vision and machine learning problem that deals with developing a hand-gesture-based interface language. The data was recorded using depth cameras and has labels that cover joints and fingertips. The two datasets included are FingersData, which contains 3,500 labeled depth frames of various hand poses, and GestureClips, which contains 140 gesture clips (100 of these contain labeled hand gestures and 40 contain non-gesture activity). The research associated with this dataset is available in the paper “Toward Realistic Hands Gesture Interface: Keeping it Simple for Developers and Machines,” which was published in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.

Question-Answer data for machine reading comprehension

Finally, the FigureQA dataset generated by researchers Samira Ebrahimi Kahou, Adam Atkinson, Adam Trischler, Yoshua Bengio and collaborators, introduces a visual reasoning task for research that is specific to graphical plots and figures. The dataset has 180,000 figures with 1.3 million question-answer pairs in the training set. More details about the dataset are available in the paper “FigureQA: An Annotated Figure Dataset for Visual Reasoning” and corresponding Microsoft Research Blog post. The dataset is pivotal to developing more powerful visual question answering and reasoning models, which potentially improve accuracy of AI systems that are involved in decision making based on charts and graphs.

The data agreements are a part of our larger goals

The Microsoft Research Open Data project was conceived from the start to reflect Microsoft Research’s commitment to fostering open science and research, and to achieve this without compromising the ethics of collecting and sharing data. Our goal is to make it easier for researchers to maintain provenance of data while having the ability to reference and build upon it.

The addition of the new data agreements to Microsoft Research Open Data’s feature set is an exciting step in furthering our mission.

Acknowledgements: This work would not have been possible without the substantial team effort by Dave Green, Justin Colannino, Gretchen Deo, Sarah Kim, Emily McReynolds, Mario Madden, Emily Schlesinger, Elaine Peterson, Leila Stevenson, Dave Baskin, and Sergio Loscialo.

Datrium opens cloud DR service to all VMware users

Datrium plans to open its new cloud disaster recovery as a service to any VMware vSphere users in 2020, even if they’re not customers of Datrium’s DVX infrastructure software.

Datrium released disaster recovery as a service with VMware Cloud on AWS in September for DVX customers as an alternative to potentially costly professional services or a secondary physical site. DRaaS enables DVX users to spin up protected virtual machines (VMs) on demand in VMware Cloud on AWS in the event of a disaster. Datrium takes care of all of the ordering, billing and support for the cloud DR.

In the first quarter, Datrium plans to add a new Datrium DRaaS Connect for VMware users who deploy vSphere infrastructure on premises and do not use Datrium storage. Datrium DRaaS Connect software would deduplicate, compress and encrypt vSphere snapshots and replicate them to Amazon S3 object storage for cloud DR. Users could set backup policies and categorize VMs into protection groups, setting different service-level agreements for each one, Datrium CTO Sazzala Reddy said.

A second Datrium DRaaS Connect offering will enable VMware Cloud users to automatically fail over workloads from one AWS Availability Zone (AZ) to another if an Amazon AZ goes down. Datrium stores deduplicated vSphere snapshots on Amazon S3, and the snapshots are replicated to three AZs by default, Datrium chief product officer Brian Biles said.

Speedy cloud DR

Datrium claims system recovery can happen on VMware Cloud within minutes from the snapshots stored in Amazon S3, because it requires no conversion from a different virtual machine or cloud format. Unlike some backup products, Datrium does not convert VMs from VMware’s format to Amazon’s format and can boot VMs directly from the Amazon data store.

“The challenge with a backup-only product is that it takes days if you want to rehydrate the data and copy the data into a primary storage system,” Reddy said.

Although the “instant RTO” that Datrium claims to provide may not be important to all VMware users, reducing recovery time is generally a high priority, especially to combat ransomware attacks. Datrium commissioned a third party to conduct a survey of 395 IT professionals, and about half said they experienced a DR event in the last 24 months. Ransomware was the leading cause, hitting 36% of those who reported a DR event, followed by power outages (26%).

The Orange County Transportation Authority (OCTA) information systems department spent a weekend recovering from a zero-day malware exploit that hit nearly three years ago on a Thursday afternoon. The malware came in through a contractor’s VPN connection and took out more than 85 servers, according to Michael Beerer, a senior section manager for online system and network administration of OCTA’s information systems department.

Beerer said the information systems team restored critical applications by Friday evening and the rest by Sunday afternoon. But OCTA now wants to recover more quickly if a disaster should happen again, he said.

OCTA is now building out a new data center with Datrium DVX storage for its VMware VMs and possibly Red Hat KVM in the future. Beerer said DVX provides an edge in performance and cost over alternatives he considered. Because DVX disaggregates storage and compute nodes, OCTA can increase storage capacity without having to also add compute resources, he said.

Datrium cloud DR advantages

Beerer said the addition of Datrium DRaaS would make sense because OCTA can manage it from the same DVX interface. Datrium’s deduplication, compression and transmission of only changed data blocks would also eliminate the need for a pricy “big, fat pipe” and reduce cloud storage requirements and costs over other options, he said. Plus, Datrium facilitates application consistency by grouping applications into one service and taking backups at similar times before moving data to the cloud, Beerer said.

Datrium’s “Instant RTO” is not critical for OCTA. Beerer said anything that can speed the recovery process is interesting, but users also need to weigh that benefit against any potential additional costs for storage and bandwidth.

“There are customers where a second or two of downtime can mean thousands of dollars. We’re not in that situation. We’re not a financial company,” Beerer said. He noted that OCTA would need to get critical servers up and running in less than 24 hours.

Reddy said Datrium offers two cost models: a low-cost option with a 60-minute window and a “slightly more expensive” option in which at least a few VMware servers are always on standby.

Pricing for Datrium DRaaS starts at $23,000 per year, with support for 100 hours of VMware Cloud on-demand hosts for testing, 5 TB of S3 capacity for deduplicated and encrypted snapshots, and up to 1 TB per year of cloud egress. Pricing was unavailable for the upcoming DRaaS Connect options.

Other cloud DR options

Jeff Kato, a senior storage analyst at Taneja Group, said the new Datrium options would open up to all VMware customers a low-cost DRaaS offering that requires no capital expense. He said most vendors that offer DR from their on-premises systems to the cloud force customers to buy their primary storage.

George Crump, president and founder of Storage Switzerland, said data protection vendors such as Commvault, Druva, Veeam, Veritas and Zerto also can do some form of recovery in the cloud, but it’s “not as seamless as you might want it to be.”

“Datrium has gone so far as to converge primary storage with data protection and backup software,” Crump said. “They have a very good automation engine that allows customers to essentially draw their disaster recovery plan. They use VMware Cloud on Amazon, so the customer doesn’t have to go through any conversion process. And they’ve solved the riddle of: ‘How do you store data in S3 but recover on high-performance storage?’ “

Scott Sinclair, a senior analyst at Enterprise Strategy Group, said using cloud resources for backup and DR often means either expensive, high-performance storage or lower cost S3 storage that requires a time-consuming migration to get data out of it.

“The Datrium architecture is really interesting because of how they’re able to essentially still let you use the lower cost tier but make the storage seem very high performance once you start populating it,” Sinclair said.
