An emerging class of data science platforms that provide collaboration and workflow management capabilities is gaining more attention from both users and vendors — most recently Oracle, which is buying its way into the market.
Oracle’s acquisition of startup DataScience.com puts more major-vendor muscle behind the workbench-style platforms, which give data science teams a collaborative environment for developing, deploying and documenting analytical models. IBM is already in the market with its Data Science Experience platform, informally known as DSX. Other vendors include Domino Data Lab and Cloudera, which last week detailed plans for a new release of its Cloudera Data Science Workbench (CDSW) software this summer.
These technologies are a subcategory of data science platforms overall. They aren’t analytics tools; they’re hubs that data scientists can use to build predictive and machine learning models in a shared and managed space — instead of doing so on their own laptops, without a central location to coordinate workflows and maintain models. Typically, they’re aimed at teams with 10 to 20 data scientists and up.
The workbenches began appearing in 2014, but it’s only over the past year or so that they have matured into products suitable for mainstream users. Even now, the market is still developing. Domino and Cloudera wouldn’t disclose the number of customers they have for their technologies; in a March interview, DataScience.com CEO Ian Swanson said only that the company’s namesake platform has “dozens” of users.
A new way to work with data science volunteers
Thorn, a nonprofit group that fights child sex trafficking and pornography, deployed Domino’s software in early 2017. The San Francisco-based organization has only one full-time data scientist, but it taps volunteers to do analytics work that helps law enforcement agencies identify and find trafficking victims. About 20 outside data scientists are often involved at a time, a number that swells to 100 or so during hackathons that Thorn holds, said Ruben van der Dussen, director of its Innovation Lab.
That makes this sort of data science platform a good fit for the group, he said. Before, the engineers on his team had to create separate computing instances on the Amazon Elastic Compute Cloud (EC2) for volunteers and set them up to log in from their own systems. With Domino, the engineers put Docker containers on Thorn’s EC2 environment, with embedded Jupyter Notebooks that the data scientists access via the web. That lets them start analyzing data faster and frees up time for the engineers to spend on more productive tasks, van der Dussen said.
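The setup van der Dussen describes can be pictured roughly as follows. This is a minimal sketch, not Thorn's or Domino's actual tooling: the `jupyter/base-notebook` image, the port numbering, the volunteer names and the helper `notebook_command` are all assumptions made for illustration. The code only builds a `docker run` command per volunteer and doesn't execute anything.

```python
import secrets

BASE_PORT = 8900  # assumed starting host port; not from the article


def notebook_command(volunteer: str, index: int, token: str) -> list[str]:
    """Build a `docker run` command for one volunteer's notebook container."""
    return [
        "docker", "run", "--detach",
        "--name", f"notebook-{volunteer}",
        # Map a unique host port to Jupyter's default port inside the container.
        "--publish", f"{BASE_PORT + index}:8888",
        # Jupyter can read its access token from this environment variable.
        "--env", f"JUPYTER_TOKEN={token}",
        "jupyter/base-notebook",  # assumed public image, for illustration only
    ]


volunteers = ["alice", "bob"]  # hypothetical volunteer names
commands = [
    notebook_command(name, i, secrets.token_hex(16))
    for i, name in enumerate(volunteers)
]
for cmd in commands:
    print(" ".join(cmd))
```

Each volunteer would then open the notebook in a browser at the assigned port, with no instance of their own to configure. A managed platform handles this provisioning, plus access control, behind the scenes.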
He added that data security and access privileges are also easier to manage now — an important consideration, given the sensitive nature of the images, ads and other online data that Thorn analyzes with a variety of machine learning and deep learning models, including ones based on natural language processing and computer vision algorithms.
Thorn develops and trains the analytical models within the Domino platform and uses it to maintain different versions of the Jupyter Notebooks, so the work done by data scientists is documented for other volunteers to pick up. In addition, multiple people working together on a project can collaborate through the platform. The group uses tools like Slack for direct communication, “but Domino makes it really easy to share a Notebook and for people to comment on it,” van der Dussen said.
Oracle puts its money down on data science
Oracle is betting that data science platforms like DataScience.com’s will become a popular technology for organizations that want to manage their advanced analytics processes more effectively. Oracle, which announced the acquisition this month, plans to combine DataScience.com’s platform with its own AI infrastructure and model training tools as part of a data science PaaS offering in the Oracle Cloud.
By buying DataScience.com, Oracle hopes to help users get more out of their analytics efforts — and better position itself as a machine learning vendor against rivals like Amazon Web Services, IBM, Google and Microsoft. Oracle said it will continue to invest in DataScience.com’s technology, with a goal of delivering “more functionality and capabilities at a quicker pace.” It didn’t disclose what it’s paying for the Culver City, Calif., startup.
The workbench platforms centralize work on analytics projects and management of the data science workflow. Data scientists can team up on projects and run various commercial and open source analytics tools to which the platforms connect, then deploy finished models for production applications. The platforms also support data security and governance, plus version control on analytical models.
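To make the idea of version control on models concrete, here is a minimal sketch of a registry that records each trained version and picks the best one by a metric. It doesn't reflect any vendor's implementation; the `ModelRegistry` and `ModelVersion` names, the parameters and the accuracy figures are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    params: dict
    accuracy: float  # hypothetical evaluation metric


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)

    def register(self, params: dict, accuracy: float) -> ModelVersion:
        """Record a newly trained model under the next version number."""
        entry = ModelVersion(len(self.versions) + 1, params, accuracy)
        self.versions.append(entry)
        return entry

    def best(self) -> ModelVersion:
        """Return the recorded version with the highest accuracy."""
        return max(self.versions, key=lambda v: v.accuracy)


registry = ModelRegistry()
registry.register({"max_depth": 3}, accuracy=0.81)  # made-up numbers
registry.register({"max_depth": 5}, accuracy=0.86)
registry.register({"max_depth": 8}, accuracy=0.84)

best = registry.best()
print(f"best: v{best.version} {best.params} accuracy={best.accuracy}")
```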
Cloudera said its upcoming CDSW 1.4 release adds features for tracking and comparing different versions of models during the development and training process, as well as the ability to deploy models as REST APIs embedded in containers for easier integration into dashboards and other applications. DataScience.com, Domino and IBM provide similar functionality in their data science platforms.
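As a rough illustration of what deploying a model as a REST API involves, the sketch below wraps a toy scoring function in an HTTP endpoint using only Python's standard library. It is not how CDSW or any of the other platforms implement the feature; the weights, the JSON layout, the port and the function names are assumptions chosen for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHTS = [0.4, 0.6]  # hypothetical coefficients of an already-trained model


def predict(features: list) -> float:
    """Score one observation with a toy linear model."""
    return sum(w * x for w, x in zip(WEIGHTS, features))


def handle_request(body: bytes) -> tuple:
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)["features"]
        if len(features) != len(WEIGHTS):
            raise ValueError("wrong number of features")
        status, result = 200, {"prediction": predict(features)}
    except (KeyError, TypeError, ValueError) as err:
        status, result = 400, {"error": str(err)}
    return status, json.dumps(result).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the endpoint; inside a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a dashboard or application could then POST a body such as `{"features": [1, 2]}` to it and read the prediction from the JSON response. Packaging the script in a container is what makes the same endpoint easy to move between environments.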
Choices on data science tools and platforms
Deutsche Telekom AG is offering both CDSW and IBM’s DSX to users of Telekom Data Intelligence Hub, a cloud-based big data analytics service that the telecommunications company is testing with a small number of customers in Europe ahead of a planned rollout during the second half of the year.
Users can also access Jupyter, RStudio and three other open source analytics tools, said Sven Löffler, a business development executive at the Bonn, Germany, company who’s leading the implementation of the analytics service. The project team sees benefits in enabling organizations to connect to those tools through the two data science platforms and get “all this sharing and capabilities to work collaboratively with others,” he said.
However, Löffler has heard from some customers that the cost of the platforms could be a barrier compared to working directly with the open source tools as part of the service, which runs in the Microsoft Azure cloud. The service is fed by data pipelines that Deutsche Telekom is building with a new Azure version of Cloudera’s Altus Data Engineering service.