
Data Center Scale Computing and Artificial Intelligence with Matei Zaharia, Inventor of Apache Spark

Matei Zaharia, Chief Technologist at Databricks & Assistant Professor of Computer Science at Stanford University, in conversation with Joseph Sirosh, Chief Technology Officer of Artificial Intelligence in Microsoft’s Worldwide Commercial Business


At Microsoft, we are privileged to work with individuals whose ideas are blazing a trail, transforming entire businesses through the power of the cloud, big data and artificial intelligence. Our new “Pioneers in AI” series features insights from such pathbreakers. Join us as we dive into these innovators’ ideas and the solutions they are bringing to market. See how your own organization and customers can benefit from their solutions and insights.

Our first guest in the series, Matei Zaharia, started the Apache Spark project during his PhD at the University of California, Berkeley, in 2009. His research was recognized with the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science. He is a co-founder of Databricks, which offers a Unified Analytics Platform powered by Apache Spark. Databricks’ mission is to accelerate innovation by unifying data science, engineering and business. Microsoft has partnered with Databricks to bring you Azure Databricks, a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Azure Databricks offers one-click setup, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts to generate great value from data faster.

So, let’s jump right in and see what Matei has to say about Spark, machine learning, and interesting AI applications that he’s encountered lately.

Video and podcast versions of this session are available at the links below. The podcast is also available from your Spotify app and via Stitcher. Alternatively, you can just continue reading the text version of their conversation in this blog post.

Joseph Sirosh: Matei, could you tell us a little bit about how you got started with Spark and this new revolution in analytics you are driving?

Matei Zaharia: Back in 2007, I started doing my PhD at UC Berkeley and I was very interested in data center scale computing. We saw at the time that there was an open source MapReduce implementation in Apache Hadoop, so I started early on by looking at that. Actually, the first project was profiling Hadoop workloads to identify some bottlenecks, and as part of that, we made some improvements to the Hadoop job scheduler that actually went into Hadoop, and I started working with some of the early users of it, especially Facebook and Yahoo. And what we saw across all of these was that this type of large data center scale computing was very powerful and there were a lot of interesting applications you could do with it, but the MapReduce programming model alone wasn’t really sufficient – especially for machine learning, which everyone wanted to do but which wasn’t a good fit, and also for interactive queries, streaming, and other workloads.

So, after seeing this for a while, the first project we built was the Apache Mesos cluster manager, to let you run other types of computations next to Hadoop. And then we said, you know, we should try to build our own computation engine which ended up becoming Apache Spark.

JS: What was unique about Spark?

MZ: I think there were a few interesting things about it. One of them was that it tried to be a general, or unified, programming model that could support many types of computations. Before the Spark project, people who wanted to do these different computations on large clusters were designing specialized engines to do particular things – graph processing, SQL, custom code, ETL (which would be MapReduce) – and they were all separate projects and engines. So in Spark we kind of stepped back, looked at these, and asked whether there was any way we could come up with a common abstraction that could handle all these workloads. We ended up with something that was a pretty small change to MapReduce – MapReduce plus fast data sharing, which is the in-memory RDDs in Spark – and just hooking these up into a graph of computations turned out to be enough to get really good performance for all the workloads, matching the specialized engines, and also much better performance when your workload combines a bunch of steps. So that’s one of the things.

I think the other thing that was important is that, having a unified engine, we could also have a very composable API where a lot of the things you want to use become libraries. So now there are hundreds, maybe thousands, of third-party packages that you can use with Apache Spark, which just plug into it and which you can combine into a workflow. None of the earlier engines had focused on establishing a platform and an ecosystem, and that’s what is really valuable to users and developers: being able to pick and choose libraries and combine them.
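
To make the “fast data sharing” idea concrete, here is a minimal PySpark sketch: a dataset is computed once, cached in memory, and reused by several downstream computations without re-reading the input (the file path and log format are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sharing-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("logs.txt")                          # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l).cache()    # shared in memory

# Two different computations reuse the same cached dataset,
# instead of re-running the filter over the raw input each time.
total = errors.count()
by_code = (errors.map(lambda l: (l.split()[1], 1))       # assumes a code in field 2
                 .reduceByKey(lambda a, b: a + b)
                 .collect())
```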

JS: Machine Learning is not just one single thing, it involves so many steps. Now Spark provides a simple way to compose all of these through libraries in a Spark pipeline and build an entire machine learning workflow and application. Is that why Spark is uniquely good at machine learning?

MZ: I think there are a couple of reasons. One reason is that much of machine learning is preparing and understanding the data – both the input data and also the predictions and behavior of the model – and Spark really excels at that ad hoc data processing using code. You can use SQL, you can use Python, you can use DataFrames, and it just makes those operations easy. And, of course, all the operations you do also scale to large datasets, which is important because you want to train machine learning models on lots of data.

Beyond that, it supports iterative in-memory computation, so many algorithms run pretty well inside it. And because of this support for composition, and this API where you can plug in libraries, there are also quite a few libraries you can plug in that call external compute engines optimized for different types of numerical computation.
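
To see what that composition looks like in practice, here is a brief sketch using Spark’s ML Pipeline API, where each step plugs into the same unified interface (the DataFrames and column names are hypothetical):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Each stage is a pluggable library component with a common interface.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, tf, lr])
model = pipeline.fit(training_df)        # training_df: a DataFrame with
predictions = model.transform(test_df)   # "text" and "label" columns
```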

JS: So why didn’t some of these newer deep learning toolsets get built on top of Spark? Why were they all separate?

MZ: That’s a good question. I think a lot of the reason is probably just that people started with a different programming language. A lot of these were started in C++, for example, and of course, they need to run on the GPU using CUDA, which is much easier to do from C++ than from Java. But one thing we’re seeing is really good connectors between Spark and these tools. So, for example, TensorFlow has a built-in Spark connector that can be used to get data from Spark and convert it to TFRecords. It also connects to HDFS and different sorts of big data file systems. At the same time, in the Spark community, there are packages like Deep Learning Pipelines from Databricks, and quite a few other packages as well, that let you set up a workflow of steps that includes these deep learning engines and Spark processing steps.
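
As a hedged illustration of that hand-off, assuming the open source spark-tensorflow-connector package is attached to the cluster (the format name and options follow that package’s documented usage, and the paths are hypothetical):

```python
# Export a Spark DataFrame as TFRecord files that TensorFlow can read.
df = spark.read.parquet("features.parquet")   # hypothetical feature table

(df.write
   .format("tfrecords")                # provided by spark-tensorflow-connector
   .option("recordType", "Example")    # serialize rows as tf.train.Example
   .save("/data/train.tfrecords"))
```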

“None of the earlier engines [prior to Apache Spark] had focused on establishing a platform and an ecosystem.”

JS: If you were rebuilding these deep learning tools and frameworks, would you recommend that people build them on top of Spark (i.e., instead of the current approach, where each tool does its own distributed computing across GPUs)?

MZ: It’s a good question. I think initially it was easier to write GPU code directly, to use CUDA and C++ and so on. And over time, the community has been adding features to Spark that will make it easier to do that in there. There have definitely been a lot of proposals and design work to make GPUs a first-class resource. There’s also this effort called Project Hydrogen, which is to change the scheduler to support these MPI-like batch jobs. So hopefully it will become a good platform to do that internally. I think one of the main benefits for users, again, is that they can program in one programming language, they can learn just one way to deploy and manage clusters, and it can do deep learning along with the data preprocessing and analytics after that.

JS: That’s great. So, Spark – and Databricks as commercialized Spark – seems to be capable of doing many things in one place. But what is it not good at? Can you share some areas where people should not be stretching Spark?

MZ: Definitely. One of the things it doesn’t do, by design, is transactional workloads where you have fine-grained updates. So, even though it might seem like you can store a lot of data in memory and then update and serve it, it is not really designed for that. It is designed for computations that have a large amount of data in each step. So, that could be streaming large continuous streams, or it could be batch, but it is not for these point queries.

And I would say the other thing it does not have is a built-in persistent storage system. It is designed to be just a compute engine that you can connect to different types of storage, and that actually makes a lot of sense, especially in the cloud, with separating compute and storage and scaling them independently. But it is different from, you know, something like a database where the storage and compute are co-designed to live together.

JS: That makes sense. What do you think of frameworks like Ray for machine learning?

MZ: There are a lot of new frameworks coming out for machine learning, and it’s exciting to see the innovation there, both in the programming models and interfaces and in how you work with them. Ray has been focused on reinforcement learning, where one of the main things you have to do is spawn a lot of little independent tasks, so it’s a bit different from a big data framework like Spark, where you’re doing one computation on lots of data – these are separate computations that take different amounts of time. As far as I know, users are starting to use it and getting good traction with it, so it will be interesting to see how these things come about.

I think the thing I’m most interested in, both for Databricks products and for Apache Spark, is enabling it to be a platform where you can combine the best algorithms, libraries and frameworks, because that’s what seems most valuable to end users: being able to orchestrate a workflow and program it as easily as writing a single-machine application where you just import a bunch of libraries.

JS: Now, stepping back, what do you see as the most exciting applications that are happening in AI today?

MZ: Yeah, it depends on how recent. I mean, in the past five years, deep learning is definitely the thing that has changed a lot of what we can do, and, in particular, it has made it much easier to work with unstructured data – so images, text, and so on. So that is pretty exciting.

I think, honestly, for wide consumption of AI, the cloud AI services make it significantly easier. When you’re doing machine learning and AI projects, it’s really important to be able to iterate quickly, because it’s all about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work. And I think the cloud makes that much easier.

JS: Cloud AI is super exciting, I completely agree. Now, at Stanford, being a professor, you must see a lot of really exciting pieces of work that are going on, both at Stanford and at startups nearby. What are some examples?

MZ: Yeah, there are a lot of different things. One of the things that is really useful for end users is all the work on transfer learning, and in general all the work that lets you get good results with AI using smaller training datasets. There are other approaches, like weak supervision, that do that as well. The reason that’s important is that for web-scale problems you have a lot of labeled data – so for something like web search you can solve it – but for many scientific or business problems you don’t have that. So, how can you learn from a large dataset that’s not quite in your domain, like the web, and then apply it to something like, say, medical images, where only a few hundred patients have a certain condition, so you can’t get a zillion images? That’s where I’ve seen a lot of exciting stuff.

But yeah, there’s everything from new hardware for machine learning, where you throw away the constraints that the computation has to be precise and deterministic, to new applications, to things like security of AI, adversarial examples, and verifiability – I think these are all pretty interesting areas.

JS: What are some of the most interesting applications you have seen of AI?

MZ: So many different applications to start with. First of all, we’ve seen consumer devices that bring AI into every home, or every phone, or every PC – these have taken off very quickly and it’s something that a large fraction of customers use, so that’s pretty cool to see.

In the business space, probably some of the more exciting things are actually about dealing with image data, where, using deep learning and transfer learning, you can start to reliably build classifiers for different types of domain data. So, whether it’s maps, understanding satellite images, or even something as simple as people uploading images of a car to a website and getting feedback that makes it easier to describe, a lot of these are starting to happen. It’s kind of a new class of data – visual data. We couldn’t do that much with it automatically before, and now you can build both little features and big products that use it.

JS: So what do you see as the future of Databricks itself? What are some of the innovations you are driving?

MZ: Databricks, for people not familiar with it, offers basically a Unified Analytics Platform, where you can work with big data, mostly through Apache Spark, and collaborate on it in an organization. So you can have different people developing, say, notebooks to perform computations, you can have people developing production jobs, you can connect these together into workflows, and so on.

So, we’re doing a lot of things to further expand on that vision. One of the things we announced recently is what we call the machine learning runtime, where we have preinstalled versions of popular machine learning libraries like XGBoost, TensorFlow and Horovod on your Databricks cluster, so you can set those up as easily as you could set up an Apache Spark cluster in the past. And then another product that we featured a lot at our Spark Summit conference this year is Databricks Delta, which is basically a transactional data management layer on top of cloud object stores that lets us do things like indexing and reliable exactly-once stream processing at very massive scale. That’s a problem all our users have, because all our users have to set up a reliable data ingest pipeline.
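
For readers curious what that looks like in code, here is a minimal sketch of the pattern Matei describes – batch and streaming jobs writing to the same transactional Delta table on object storage (the DataFrames and paths are hypothetical):

```python
# Batch append to a Delta table stored on a cloud object store.
(events_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/datalake/events"))

# A streaming job can append to the same table; the checkpoint
# gives the sink its exactly-once guarantee.
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/events")
    .start("/mnt/datalake/events"))
```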

JS: Who are some of the most exciting customers of Databricks and what are they doing?

MZ: There are a lot of really interesting customers doing pretty cool things. At our conference this year, for example, one of the really cool presentations we saw was from Apple. Apple’s internal information security group – the group that does network monitoring – gets hundreds of terabytes of network events per day to process, to detect intrusions and information security problems. They spoke about using Databricks Delta and streaming with Apache Spark to handle all of that, so it’s one of the largest applications people have talked about publicly. And it’s very cool because the whole goal there – it’s kind of an arms race between the security team and attackers – is to be able to design new rules, new measurements and add new data sources quickly. And so, the ease of programming and the ease of collaborating within this team of dozens of people was super important.

We also have some really exciting health and life sciences applications, so some of these are actually starting to discover new drugs that companies can actually productionize to tackle new diseases, and this is all based on large scale genomics and statistical studies.

And there are a lot of more fun applications as well. For example, the largest video game in the world, League of Legends, uses Databricks and Apache Spark to detect players that are misbehaving or to recommend items to players, things like that. These were all featured at the conference.

JS: If you had one piece of advice for developers and customers using Spark or Databricks, or guidance on what they should learn, what would it be?

MZ: It’s a good question. There are a lot of high-quality training materials online, so I would say definitely look at some of those for your use case and see what other people are doing in that area. The Spark Summit conference is also a good way to see videos and talks, and we make all of those available for free – the goal is to help grow the developer community. So, look for someone who is doing similar things and be inspired by that, and kind of see what the best practices are around it, because you might see a lot of different options for how to get started and it can be hard to tell what the right path is.

JS: One last question – in recent years there’s been a lot of fear, uncertainty and doubt about AI, and a lot of popular press coverage. How real are those fears, and what do you think people should be thinking about?

MZ: That’s a good question. My personal view is that this sort of evil artificial general intelligence stuff is very far away. And if you don’t believe that, I would say just try doing some machine learning tutorials and see how these models break down – you get a sense for how difficult it is.

But there are some real challenges that will come from AI, so I think one of them is the same challenge as with all technology which is, automation – how quickly does it happen. Ultimately, after automation, people usually end up being better off, but it can definitely affect some industries in a pretty bad way and if there is no time for people to transition out, that can be a problem.

I think the other interesting problem, which there is always a discussion about, is basically access to data, privacy, managing the data, algorithmic discrimination – I think we are still figuring out how to handle those. Companies are doing their best, but there are also many unknowns as to how these techniques will behave, which is why we’ll see better best practices, or regulations, and things like that.

JS: Well, thank you Matei, it’s simply amazing to see the innovations you have driven, and looking forward to more to come.

MZ: Thanks for having me.

“When you’re doing machine learning and AI projects, it’s really important to be able to iterate quickly because it’s all about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work. And I think the cloud makes it much easier.”

We hope you enjoyed this blog post. This being our first episode in the series, we are eager to hear your feedback, so please share your thoughts and ideas below.

The AI / ML Blog Team


The nine roles you need on your data science research team

It’s easy to focus too much on building a data science research team loaded with Ph.D.s to do machine learning at the expense of developing other data science skills needed to compete in today’s data-driven, digital economy. While high-end, specialty data science skills for machine learning are important, they can also get in the way of a more pragmatic and useful adoption of data science. That’s the view of Cassie Kozyrkov, chief decision scientist at Google and a proponent of the democratization of data-based organizational decision-making.

To start, CIOs need to expand their thinking about the types of roles involved in implementing data science programs, Kozyrkov said at the recent Rev Data Science Leaders Summit in San Francisco.

For example, it’s important to think about data science research as a specialty role developed to provide intelligence for important business decisions. “If an answer involves one or more important decisions, then you need to bring in the data scientists,” said Kozyrkov, who designed Google’s analytics program and trained more than 15,000 Google employees in statistics, decision-making and machine learning.

But other tasks related to data analytics, like making informational charts, testing out various algorithms and making better decisions, are best handled by other data science team members with entirely different skill sets.

Data science roles: The nine must-haves

There are a variety of data science research roles for an organization to consider and certain characteristics best suited for each. Most enterprises already have correctly filled several of these data science positions, but most will also have people with the wrong skills or motivations in certain data science roles. This mismatch can slow things down or demotivate others throughout the enterprise, so it’s important for CIOs to carefully consider who staffs these roles to get the most from their data science research.

Here is Kozyrkov’s rundown of the essential data science roles and the part each plays in helping organizations make more intelligent business decisions.

Data engineers are people who have the skills and ability to get data required for analysis at scale.

Basic analysts could be anyone in the organization with a willingness to explore data and plot relationships using various tools. Kozyrkov suggested it may be hard for data scientists to cede some responsibility for basic analysis to others. But, in the long run, the value of data scientists will grow as more people throughout the company take on basic analytics.

Expert analysts, on the other hand, should be able to search through data sets quickly. You don’t want to put a software engineer or very methodical person in this role, because they are too slow.

“The expert software engineer will do something beautiful, but won’t look at much of your data sets,” Kozyrkov said. You want someone who is sloppy and will run around your data. Caution is warranted in buffering expert analysts from software developers inclined to complain about sloppy — yet quickly produced — code.

Statisticians are the spoilsports who will explain how your latest theory does not hold up for 20 different reasons. These people can kill motivation and excitement. But they are also important for coming to conclusions safely for important decisions.

A machine learning engineer is not a researcher who builds algorithms. Instead, these AI-focused computer programmers excel at moving a lot of data sets through a variety of software packages to decide if the output looks promising. The best person for this job is not a perfectionist who would slow things down by looking for the best algorithm.

A good machine learning engineer, in Kozyrkov’s view, is someone who doesn’t know what they are doing and will try out everything quickly. “The perfectionist needs to have the perfection encouraged out of them,” she said.

Too many businesses are trying to staff the team with a bunch of Ph.D. researchers. These folks want to do research, not solve a business problem.
– Cassie Kozyrkov, chief decision scientist at Google

A data scientist is an expert who is well-trained in statistics and also good at machine learning. They tend to be expensive, so Kozyrkov recommended using them strategically.

A data science manager is a data scientist who wakes up one day and decides he or she wants to do something different to benefit the bottom line. These folks can connect the decision-making side of business with the data science of big data. “If you find one of these, grab them and never let them go,” Kozyrkov said.

A qualitative expert is a social scientist who can assess decision-making. This person is good at helping decision-makers set up a problem in a way that can be solved with data science. They tend to have better business management training than some of the other roles.

A data science researcher has the skills to craft customized data science and machine learning algorithms. Data science researchers should not be an early hire. “Too many businesses are trying to staff the team with a bunch of Ph.D. researchers. These folks want to do research, not solve a business problem,” Kozyrkov said. “This is a hire you only need in a few cases.”

Prioritize data science research projects

For CIOs looking to build their data science research team, develop a strategy for prioritizing and assigning data science projects. (See the aforementioned advice on hiring data science researchers.)

Decisions about what to prioritize should involve front-line business managers, who can decide what data science projects are worth pursuing.

In the long run, some of the most valuable skills lie in learning how to bridge the gap between business decision-makers and other roles. Doing this in a pragmatic way requires training in statistics, neuroscience, psychology, economic management, social sciences and machine learning, Kozyrkov said. 

A new partnership to support computer science teachers

Today we are excited to announce a new partnership with the Computer Science Teachers Association (CSTA). Microsoft will provide $2 million, over three years, to help CSTA launch new chapters and strengthen existing ones. It will help them attract new members and partners to build a stronger community to serve computer science teachers.

We’re thrilled that students of all ages are discovering the exciting – and critical – field of computer science. From the Hour of Code, to Minecraft Education, and even Advanced Placement Computer Science courses, participation rates are expanding. This surge of student interest, combined with the premium our economy places on technology skills of all kinds, requires us to do all we can to ensure every student has access to computer science courses. And it all starts with our teachers.

Nearly every teacher belongs to a professional membership organization, from social studies, to reading, to math and science. These organizations provide teachers with subject-specific professional development, up-to-date curriculum, and networking opportunities with peers and other professionals. CSTA was started in 2004 to fill this need for computer science teachers. But to meet today’s needs in this quickly changing and growing field of study, CSTA is expanding as well. We are proud to support them!

Our investment in CSTA continues Microsoft Philanthropies’ long-standing commitment to computer science education through our Technology Education and Literacy in Schools (TEALS) program, which pairs technology industry volunteers with classroom teachers to team-teach computer science in 350 U.S. high schools. It builds on our investments in nonprofits such as Code.org, Girls Who Code, and Boys & Girls Clubs of America, with whom we partnered to create a computer science learning pathway. And it builds on our work advocating at the state and federal level for policy change and investments in computer science education across the United States.

While technology can be a powerful learning tool, nothing can replace the expertise, guidance, and encouragement that teachers provide to students each day of the school year. I remember my own favorite teachers who helped me see a world beyond the rural town in which I grew up. I would guess that nearly everyone has a similar story. We thank our teachers and we hope that this investment in computer science teachers, through CSTA, empowers more educators to do what they do best: make a positive difference in the lives of students. To learn how you can help CSTA serve teachers, please visit https://www.csteachers.org/page/GetInvolved.

Building a data science pipeline: Benefits, cautions

Enterprises are adopting data science pipelines for artificial intelligence, machine learning and plain old statistics. A data science pipeline — a sequence of actions for processing data — will help companies be more competitive in a digital, fast-moving economy. 

Before CIOs take this approach, however, it’s important to consider some of the key differences between data science development workflows and traditional application development workflows.

Data science development pipelines used for building predictive models are inherently experimental and don’t always pan out the way projects run under other software development methodologies, such as Agile and DevOps, do. Because data science models break and lose accuracy in different ways than traditional IT apps do, a data science pipeline needs to be scrutinized to ensure the model reflects what the business is hoping to achieve.

At the recent Rev Data Science Leaders Summit in San Francisco, leading experts explored some of these important distinctions, and elaborated on ways that IT leaders can responsibly implement a data science pipeline. Most significantly, data science development pipelines need accountability, transparency and auditability. In addition, CIOs need to implement mechanisms for addressing the degradation of a model over time, or “model drift.” Having the right teams in place in the data science pipeline is also critical: Data science generalists work best in the early stages, while specialists add value to more mature data science processes.

Data science at Moody’s

Jacob Grotta, managing director, Moody's Analytics

CIOs might want to take a cue from Moody’s, the financial analytics giant, which was an early pioneer in using predictive modeling to assess the risks of bonds and investment portfolios. Jacob Grotta, managing director at Moody’s Analytics, said the company has streamlined the data science pipeline it uses to create models so it can adapt quickly to changing business and economic conditions.

“As soon as a new model is built, it is at its peak performance, and over time, they get worse,” Grotta said. Declining model performance can have significant impacts. For example, in the finance industry, a model that doesn’t accurately predict mortgage default rates puts a bank in jeopardy. 

Watch out for assumptions

Grotta said it is important to keep in mind that data science models are created by and represent the assumptions of the data scientists behind them. Before the 2008 financial crisis, a firm approached Grotta with a new model for predicting the value of mortgage-backed derivatives, he said. When he asked what would happen if the prices of houses went down, the firm responded that the model predicted the market would be fine. But it didn’t have any data to support this. Mistakes like these cost the economy almost $14 trillion by some estimates.

The expectation among companies often is that someone understands what the model does and its inherent risks. But these unverified assumptions can create blind spots for even the most accurate models. Grotta said it is a good practice to create lines of defense against these sorts of blind spots.

The first line of defense is to encourage the data modelers to be honest about what they do and don’t know and to be clear on the questions they are being asked to solve. “It is not an easy thing for people to do,” Grotta said.

A second line of defense is verification and validation. Model verification involves checking to see that someone implemented the model correctly, and whether mistakes were made while coding it. Model validation, in contrast, is an independent challenge process to help a person developing a model to identify what assumptions went into the data. Ultimately, Grotta said, the only way to know if the modeler’s assumptions are accurate or not is to wait for the future.

A third line of defense is an internal audit or governance process. This involves making the results of these models explainable to front-line business managers. Grotta said he was recently working with a bank that protested that its managers would not use a model if they didn’t understand what was driving its results – but the managers, he said, were right to insist on that. Having a governance process and ensuring information flows up and down the organization is extremely important, Grotta said.

Baking in accountability

Models degrade or “drift” over time, which is part of the reason organizations need to streamline their model development processes. It can take years to craft a new model. “By that time, you might have to go back and rebuild it,” Grotta said. Critical models must be revalidated every year.

To address this challenge, CIOs should think about creating a data science pipeline with an auditable, repeatable and transparent process. This promises to allow organizations to bring the same kind of iterative agility to model development that Agile and DevOps have brought to software development.

Transparent means that upstream and downstream people understand the model drivers. It is repeatable in that someone can repeat the process around creating it. It is auditable in the sense that there is a program in place to think about how to manage the process, take in new information, and get the model through the monitoring process. There are varying levels of this kind of agility today, but Grotta believes it is important for organizations to make it easy to update data science models in order to stay competitive.

How to keep up with model drift

Nick Elprin, CEO and co-founder of Domino Data Lab, a data science platform vendor, agreed that model drift is a problem that must be addressed head on when building a data science development pipeline. In some cases, the drift might be due to changes in the environment, like changing customer preferences or behavior. In other cases, drift could be caused by more adversarial factors. For example, criminals might adopt new strategies for defeating a new fraud detection model.

Nick Elprin, CEO and co-founder, Domino Data Lab

In order to keep up with this drift, CIOs need to include a process for monitoring the effectiveness of their data models over time and establishing thresholds for replacing these models when performance degrades.

With traditional software monitoring, IT service management teams track metrics related to CPU, network and memory usage. With data science, CIOs need to capture metrics related to the accuracy of model results. “Software for [data science] production models needs to look at the output they are getting from those models, and if drift has occurred, that should raise an alarm to retrain it,” Elprin said.
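
A minimal sketch of such a monitoring hook, assuming a scikit-learn-style model and a periodic batch of freshly labeled data (the threshold and function names are illustrative, not from any vendor’s API):

```python
import logging
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.90   # illustrative threshold; the business sets the real one

def check_for_drift(model, recent_X, recent_y):
    """Score the model on fresh labeled data; warn when accuracy degrades."""
    acc = accuracy_score(recent_y, model.predict(recent_X))
    if acc < ACCURACY_FLOOR:
        logging.warning("Model accuracy fell to %.1f%%; schedule retraining",
                        acc * 100)
    return acc
```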

Fashion-forward data science

At Stitch Fix, a personal shopping service, the company’s data science pipeline allows it to sell clothes online at full price. Using data science in various ways allows the company to find new ways to add value against deep-discount giants like Amazon, said Eric Colson, chief algorithms officer at Stitch Fix.

Eric Colson, chief algorithms officer, Stitch Fix

For example, the data science team has used natural language processing to improve its recommendation engines and to buy inventory. Stitch Fix also uses genetic algorithms – algorithms designed to mimic evolution by iteratively selecting the best results after a set of randomized changes. These are used to streamline the process of designing clothes, generating countless iterations; fashion designers then vet the designs.
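
As a toy illustration of how a genetic algorithm works (the vector encoding and fitness function below are stand-ins, not Stitch Fix’s actual system):

```python
import random

def fitness(design):
    # Stand-in objective: a real system would score designs against
    # client feedback and sales data.
    return -sum((g - 0.5) ** 2 for g in design)

def evolve(pop_size=50, genes=10, generations=100, mutation=0.1):
    # Designs are encoded as vectors of numbers ("genes").
    pop = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]                 # keep the best scorers
        children = [[g + random.gauss(0, mutation)       # randomized changes
                     for g in random.choice(survivors)]
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best_design = evolve()   # a human designer still vets the output
```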

This kind of digital innovation, however, was only possible, he said, because the company created an efficient data science pipeline. He added that it was also critical that the data science team is a top-level department at Stitch Fix and reports directly to the CEO.

Specialists or generalists?

One important consideration for CIOs in constructing the data science development pipeline is whether to recruit data science specialists or generalists. Specialists are good at optimizing one step in a complex data science pipeline. Generalists can execute all the different tasks in a data science pipeline. In the early stages of a data science initiative, generalists can adapt to changes in the workflow more easily, Colson said.

Some of these different tasks include feature engineering, model training, extract, transform and load (ETL) processing, API integration, and application development. It is tempting to staff each of these tasks with specialists to improve individual performance. “This may be true of assembly lines, but with data science, you don’t know what you are building, and you need to iterate,” Colson said. The process of iteration requires fluidity, and if the different roles are staffed with different people, there will be longer wait times whenever a change is made.

In the beginning, at least, companies will benefit more from generalists. But once data science processes are established after a few years, specialists may be more efficient.

Align data science with business

Today a lot of data science models are built in silos that are disconnected from normal business operations, Domino’s Elprin said. To make data science effective, it must be integrated into existing business processes. This comes from aligning data science projects with business initiatives. This might involve things like reducing the cost of fraudulent claims or improving customer engagement.

In less effective organizations, management tends to start with the data the company has collected and wonder what a data science team can do with it. In more effective organizations, data science is driven by business objectives.

“Getting to digital transformation requires top down buy-in to say this is important,” Elprin said. “The most successful organizations find ways to get quick wins to get political capital. Instead of twelve-month projects, quick wins will demonstrate value, and get more concrete engagement.”

New data science platforms aim to be workflow, collaboration hubs

An emerging class of data science platforms that provide collaboration and workflow management capabilities is gaining more attention from both users and vendors — most recently Oracle, which is buying its way into the market.

Oracle’s acquisition of startup DataScience.com puts more major-vendor muscle behind the workbench-style platforms, which give data science teams a collaborative environment for developing, deploying and documenting analytical models. IBM is already in the market with its Data Science Experience platform, informally known as DSX. Other vendors include Domino Data Lab and Cloudera, which last week detailed plans for a new release of its Cloudera Data Science Workbench (CDSW) software this summer.

These technologies are a subcategory of data science platforms overall. They aren’t analytics tools; they’re hubs that data scientists can use to build predictive and machine learning models in a shared and managed space — instead of doing so on their own laptops, without a central location to coordinate workflows and maintain models. Typically, they’re aimed at teams with 10 to 20 data scientists and up.

The workbenches began appearing in 2014, but it’s only over the past year or so that they matured into products suitable for mainstream users. Even now, the market is still developing. Domino and Cloudera wouldn’t disclose the number of customers they have for their technologies; in a March interview, DataScience.com CEO Ian Swanson said only that its namesake platform has “dozens” of users.

A new way to work with data science volunteers

Ruben van der Dussen, Thorn

Thorn, a nonprofit group that fights child sex trafficking and pornography, deployed Domino’s software in early 2017. The San Francisco-based organization only has one full-time data scientist, but it taps volunteers to do analytics work that helps law enforcement agencies identify and find trafficking victims. About 20 outside data scientists are often involved at a time — a number that swells to 100 or so during hackathons that Thorn holds, said Ruben van der Dussen, director of its Innovation Lab.

That makes this sort of data science platform a good fit for the group, he said. Before, the engineers on his team had to create separate computing instances on the Amazon Elastic Compute Cloud (EC2) for volunteers and set them up to log in from their own systems. With Domino, the engineers put Docker containers on Thorn’s EC2 environment, with embedded Jupyter Notebooks that the data scientists access via the web. That lets them start analyzing data faster and frees up time for the engineers to spend on more productive tasks, van der Dussen said.

He added that data security and access privileges are also easier to manage now — an important consideration, given the sensitive nature of the images, ads and other online data that Thorn analyzes with a variety of machine learning and deep learning models, including ones based on natural language processing and computer vision algorithms.

Thorn develops and trains the analytical models within the Domino platform and uses it to maintain different versions of the Jupyter Notebooks, so the work done by data scientists is documented for other volunteers to pick up on. In addition, multiple people working together on a project can collaborate through the platform. The group uses tools like Slack for direct communication, “but Domino makes it really easy to share a Notebook and for people to comment on it,” van der Dussen said.

Domino Data Lab’s data science platform lets users run different analytics tools in separate workspaces.

Oracle puts its money down on data science

Oracle is betting that data science platforms like DataScience.com’s will become a popular technology for organizations that want to manage their advanced analytics processes more effectively. Oracle, which announced the acquisition this month, plans to combine DataScience.com’s platform with its own AI infrastructure and model training tools as part of a data science PaaS offering in the Oracle Cloud.

By buying DataScience.com, Oracle hopes to help users get more out of their analytics efforts — and better position itself as a machine learning vendor against rivals like Amazon Web Services, IBM, Google and Microsoft. Oracle said it will continue to invest in DataScience.com’s technology, with a goal of delivering “more functionality and capabilities at a quicker pace.” It didn’t disclose what it’s paying for the Culver City, Calif., startup.

The workbench platforms centralize work on analytics projects and management of the data science workflow. Data scientists can team up on projects and run various commercial and open source analytics tools to which the platforms connect, then deploy finished models for production applications. The platforms also support data security and governance, plus version control on analytical models.

Cloudera said its upcoming CDSW 1.4 release adds features for tracking and comparing different versions of models during the development and training process, as well as the ability to deploy models as REST APIs embedded in containers for easier integration into dashboards and other applications. DataScience.com, Domino and IBM provide similar functionality in their data science platforms.
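
To illustrate the general pattern – this is a generic sketch, not CDSW’s or any vendor’s actual packaging, and the web framework and model file are assumptions – a model deployed as a REST API is typically a small containerized web service that loads the trained model and scores incoming JSON:

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")   # hypothetical serialized model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. a list of numbers
    return jsonify({"prediction": model.predict([features]).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)   # exposed to dashboards and apps
```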

Cloudera Data Science Workbench uses a sessions concept for running analytics applications.

Choices on data science tools and platforms

Deutsche Telekom AG is offering both CDSW and IBM’s DSX to users of Telekom Data Intelligence Hub, a cloud-based big data analytics service that the telecommunications company is testing with a small number of customers in Europe ahead of a planned rollout during the second half of the year.

Users can also access Jupyter, RStudio and three other open source analytics tools, said Sven Löffler, a business development executive at the Bonn, Germany, company who’s leading the implementation of the analytics service. The project team sees benefits in enabling organizations to connect to those tools through the two data science platforms and get “all this sharing and capabilities to work collaboratively with others,” he said.

However, Löffler has heard from some customers that the cost of the platforms could be a barrier compared to working directly with the open source tools as part of the service, which runs in the Microsoft Azure cloud. It’s fed by data pipelines that Deutsche Telekom is building with a new Azure version of Cloudera’s Altus Data Engineering service.

Virtual tools help explore computer science and robotics in the classroom

I am sure everyone enjoyed Computer Science Education Week and its amazing focus on enabling the students of today to create the world of tomorrow. We live in an amazing time of technological progress. Every aspect of our lives is being shaped by digital transformation. However, with transformation comes disruption. There’s growing concern over job growth, economic opportunity, and the world we are building for the next generation. So, the real question is: How can technology create more opportunity not for a few, but for all?

This week we would love to focus on how to bring applied computer science through robotics into the classroom. The skill of programming is fundamental for structured, logical thinking and enables students to bring technology to life and make it their own. Oftentimes this can be a lofty goal when resources are limited, but there is room for a grounded, everyday approach.

Code Builder for Minecraft: Education Edition is an extension that allows educators and students to explore, create, and play in an immersive Minecraft world – all by writing code. Because it connects to learn-to-code packages like ScratchX, Tynker, and Microsoft MakeCode, players can start with familiar tools, templates and tutorials. Minecraft: Education Edition is available free to qualified education institutions with any new Windows 10 device. Check out our Minecraft: Education Edition sign-up page to learn how you can receive a one-year, single-user subscription for Minecraft: Education Edition for each new Windows 10 device purchased for your K-12 school.

OhBot is an educational robotics system that has been designed to stretch pupils’ computational thinking and understanding of computer science, and explore human/robot interaction through a creative robotic head that students program to speak and interact with their environment.

Another key area that we are supporting is in simulation solutions for robotics, to enable lower-cost access and better design practices in the classroom. With these programs, educators can teach robotic coding without a physical robot.

Daniel Rosenstein, a volunteer robotics coach at the elementary, middle and high school levels, firmly believes that simulation illustrates the connection between computer science and best practices in engineering design. Simulation makes the design process uniquely personal, because students are encouraged to build digital versions of their physical robot and to try their programs in the simulator before investing in physical tools. The simulation environment, similar to a video game, creates a digital representation of the robot and its tasks, and allows for very quick learning cycles through design, programming, and trial and error.

The Virtual Robotics Toolkit (VRT) is a good example. It’s an advanced simulator designed to enhance the LEGO MINDSTORMS experience. An excellent learning tool for classroom and competitive robotics, the VRT is easy to use and is approved by teachers and students.

It looks set to be another year of great new apps in the Microsoft Store, and we are excited to soon be welcoming Synthesis: An Autodesk Technology to the Store. This app is built for design simulation and will enable students to work together to design, test and experiment with robotics, without having to touch a piece of physical hardware.

We look forward to connecting with you on this and more soon!

CIOs should lean on AI ‘giants’ for machine learning strategy

NEW YORK — Machine learning and deep learning will be part of every data science organization, according to Edd Wilder-James, former vice president of technology strategy at Silicon Valley Data Science and now an open source strategist at Google’s TensorFlow.

Wilder-James, who spoke at the Strata Data Conference, pointed to recent advancements in image and speech recognition algorithms as examples of why machine learning and deep learning are going mainstream. He believes image and speech recognition software has evolved to the point where it can see and understand some things as well as — and in some use cases better than — humans. That makes it ripe to become part of the internal workings of applications and the driver of new and better services to internal and external customers, he said.

But what investments in AI should CIOs make to provide these capabilities to their companies? When building a machine learning strategy, choice abounds, Wilder-James said.

Machine learning vs. deep learning

Deep learning is a subset of machine learning, but it’s different enough to be discussed separately, according to Wilder-James. Examples of machine learning applications include optimization, fraud detection and preventive maintenance. “We use machine learning to identify patterns,” Wilder-James said. “Here’s a pattern. Now, what do we know? What can we do as a result of identifying this pattern? Can we take action?”

Deep learning models perform tasks that more closely resemble human intelligence such as image processing and recognition. “With a massive amount of compute power, we’re able to look at a massively large number of input signals,” Wilder-James said. “And, so what a computer is able to do starts to look like human cognitive abilities.”

Some of the terrain for machine learning will look familiar to CIOs. Statistical programming languages such as SAS, SPSS and Matlab are known territory for IT departments. Open source counterparts such as R, Python and Spark are also machine-learning ready. “Open source is probably a better guarantee of stability and a good choice to make in terms of avoiding lock-in and ensuring you have support,” Wilder-James said.

Unlike other tech rollouts

The rollout of machine learning and deep learning models, however, is a different process than most technology rollouts. After getting a handle on the problem, CIOs will need to investigate if machine learning is even an appropriate solution.

“It may not be true that you can solve it with machine learning,” Wilder-James said. “This is one important difference from other technical rollouts. You don’t know if you’ll be successful or not. You have to enter into this on the pilot, proof-of-concept ladder.”

The most time-consuming step in deploying a machine learning model is feature engineering: finding and shaping the features in the data that help the algorithms tune themselves to the problem. Deep learning models skip the tedious feature engineering step and go right to training. Tuning a deep learning model correctly, however, requires immense data sets, graphics processing units or tensor processing units, and time; Wilder-James said it could take weeks or even months to train a deep learning model.
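
A small, invented example of what that hand work looks like – every feature below has to be reasoned out and coded by a person, which is exactly the step deep learning models learn on their own (the dataset and columns are hypothetical):

```python
import pandas as pd

df = pd.read_csv("transactions.csv")   # hypothetical dataset

# Every feature is hand-derived from domain knowledge.
df["amount_per_item"] = df["amount"] / df["item_count"]
ts = pd.to_datetime(df["timestamp"])
df["hour"] = ts.dt.hour
df["is_weekend"] = ts.dt.dayofweek >= 5

features = pd.get_dummies(
    df[["amount_per_item", "hour", "is_weekend", "channel"]],
    columns=["channel"],               # one-hot encode a categorical column
)
```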

One more thing to note: Building deep learning models is hard and won’t be a part of most companies’ machine learning strategy.

“You have to be aware that a lot of what’s coming out is the closest to research IT has ever been,” he said. “These things are being published in papers and deployed in production in very short cycles.”

CIOs whose companies are not inclined to invest heavily in AI research and development should instead rely on prebuilt, reusable machine and deep learning models rather than reinvent the wheel. Image recognition models, such as Inception, and natural language models, such as SyntaxNet and Parsey McParseface, are examples of models that are ready and available for use.

“You can stand on the shoulders of giants, I guess that’s what I’m trying to say,” Wilder-James said. “It doesn’t have to be from scratch.”
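
To make that concrete, loading a prebuilt image recognition model takes only a few lines; here is a sketch using the pretrained Inception weights that ship with Keras (the image file is hypothetical):

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import (
    preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights="imagenet")    # prebuilt, reusable model

img = image.load_img("photo.jpg", target_size=(299, 299))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
print(decode_predictions(model.predict(x), top=3))   # top-3 predicted labels
```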

Machine learning tech

The good news for CIOs is that vendors have set the stage to start building a machine learning strategy now. TensorFlow, a machine learning software library, is one of the best known toolkits out there. “It’s got the buzz because it’s an open source project out of Google,” Wilder-James said. “It runs fast and is ubiquitous.”

While TensorFlow itself is not terribly developer-friendly, a simplified interface called Keras eases the burden and can handle the majority of use cases. And TensorFlow isn’t the only deep learning library or framework option, either. Others include MXNet, PyTorch, CNTK, and Deeplearning4j.
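
As a rough illustration of that simplification, a complete Keras network can be defined and compiled in a handful of lines (the layer sizes and data are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=5)   # supply your own training data
```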

For CIOs who want AI to live on premises, technologies such as Nvidia’s DGX-1 box, which retails for $129,000, are available.

But CIOs can also use the cloud as a computing resource, at a cost of anywhere between $5 and $15 an hour, according to Wilder-James. “I worked it out, and the cloud cost is roughly the same as running the physical machine continuously for about a year,” he said.
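
That back-of-the-envelope estimate checks out at the top of the quoted range, using the figures above:

```python
dgx1_price = 129_000                  # dollars, from the article
for rate in (5, 15):                  # dollars per hour, quoted range
    hours = dgx1_price / rate
    print(f"${rate}/hr -> {hours:,.0f} hours = {hours / 8760:.1f} years")
# $15/hr -> 8,600 hours = 1.0 years; $5/hr -> 25,800 hours = 2.9 years
```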

Or they can choose to go the hosted platform route, where a service provider will run trained models for a company. And other tools, such as domain-specific proprietary tools like the personalization platform from Nara Logics, can fill out the AI infrastructure.

“It’s the same kind of range we have with plenty of other services out there,” he said. “Do you rent an EC2 instance to run a database or do you subscribe to Amazon Redshift? You can pick the level of abstraction that you want for these services.”

Still, before investments in technology and talent are made, a machine learning strategy should start with the basics: “The single best thing you can do to prepare with AI in the future is to develop a competency with your own data, whether it’s getting access to data, integrating data out of silos, providing data results readily to employees,” Wilder-James said. “Understanding how to get at your data is going to be the thing to prepare you best.”