Tag Archives: science

Business and innovation tips for your Imagine Cup project

Editor’s note: This blog was contributed by the Global Innovation through Science and Technology (GIST) initiative. GIST is led by the U.S. Department of State and implemented by VentureWell.

Microsoft’s Imagine Cup empowers student developers and aspiring entrepreneurs from all academic backgrounds to bring an idea to life with technology. Through competition and collaboration, it provides an opportunity to develop an application, create a business plan, and gain a keen understanding of what’s needed to bring a concept to market and make an impact. We’ve partnered with GIST to provide some top tips for turning your idea into a marketable business solution and to prepare you to present it effectively on a global stage. 

Key things to consider when developing a business idea

1. Assess whether your product is truly novel 

In the early development stages of a new idea, it’s important to assess whether your idea already exists in the current market and if so, what unique solution your application can provide. 

In the world of intellectual property law, “prior art” is the term used for relevant information that was publicly available before a patent claim. For example, if your company is working on a new type of football helmet, but another company has already given an interview about their own plans to invent such a helmet, that constitutes prior art – and it means your patent claim is likely to face a steep uphill battle. Start by asking yourself if your project is truly novel: What problem does your application solve? Are there similar solutions already on the market? If necessary, work with your university to establish whether a patent already exists. 

2. Learn to take feedback  

It’s easy to get attached to an invention. However, being too lovestruck with your technology can prevent you from absorbing vital feedback from customers, professors, mentors, and even teammates. “Feedback is learning,” says Dr. Lawrence Neeley, Associate Professor of Design and Entrepreneurship at Olin College of Engineering. “Sure, feedback can hurt, but understand that you can’t improve your invention without learning what’s wrong with it. Feedback is a mechanism for growth.” In addition, don’t lose sight of the passion that originally drove you to develop a solution, as it can put you in the right mindset to listen to feedback. By keeping the core problem at the forefront, you can more effectively pivot your technology and business model to better address market demands. Read more about how to balance your passion with real-life data to make your project shine.

3. Incorporate diversity & inclusion 

Empower everyone to benefit from your solution by considering diversity and inclusion in your project early on. “When accessibility is at the heart of inclusive design, we not only make technology that is accessible for people with disabilities, we invest in the future of natural user interface design and improved usability for everyone,” says Megan Lawrence, an Accessibility Technical Evangelist at Microsoft. Check out some resources to help you build inclusion into your innovation: 

  • Use Accessibility Insights to run accessibility testing on web pages and applications. 
  • Learn how to create inclusive design through video tutorials and downloadable toolkits. 
  • Read the story of two Microsoft teams at Ability Hacks who embraced the transformative power of technology to create inclusive solutions now used by millions of people. 

Read more tips on using inclusion as a lens to drive innovation. 

4. Consider environmental responsibility 

To maximize impact from the start, it’s critical that student innovators develop an environmentally responsible mindset at the earliest stages of their innovation, business, or manufacturing process. Here are some examples from student innovators of how they integrated environmental responsibility into their business models: 

  • Use renewable energy sources where possible, such as solar power, and implement recycling processes. 
  • Incorporate sustainable processes through things like reducing packaging, limiting plastic waste, and sourcing materials that are reusable or biodegradable.  
  • Create an innovation that solves a key environmental issue or repurposes harmful by-products, such as recovering metal water contaminants or converting ocean waste.  

Read more about how they leveraged sustainability in their projects. 

Maximizing resources for your innovation 

It can be a challenge to find support resources as a student entrepreneur. Here are some top tips for maximizing on- and off-campus benefits while you’re still in school – check out additional advice if you’re interested in learning more.  

1. Take stock of university resources 

Assess what skills you may need beyond just technical ones, and talk to faculty or administrators to develop a roadmap for your time in school. For instance, seek out seminars or courses in different departments to help sharpen writing or public speaking skills, or visit your university library to find out what resources it offers student entrepreneurs, such as makerspaces, workshops, or guest lectures. 

2. Maximize networking opportunities 

Connect with others through LinkedIn, your university’s alumni network, classes, hackathons, and more to network with industry-specific experts. Pro-tip: Imagine Cup connects you to a global community of like-minded tech enthusiasts to collaborate and innovate together, in addition to giving you access to industry professionals. 

3. Take advantage of competitions  

Approach competitions not just as an opportunity to win, but also as a chance to further refine your project and go-to-market plan. Leverage feedback and insights from judges, mentors, and peers to continue ideating and developing a marketable solution.   

Build business skills through hands-on innovation 

What better way to put these tips into practice than by bringing your own solution to life? The Imagine Cup is your opportunity to build a technology innovation from what you’re most passionate about. Regardless of where you place in the competition, you’ll have the chance to connect with like-minded tech enthusiasts across the globe, including joining a network of over two million past competitors. In addition, teams who advance to the Regional Finals will receive mentorship from industry professionals and in-person entrepreneurship workshops from GIST, led by the U.S. Department of State and implemented by VentureWell, to help elevate their solutions.   

Learn by doing, code for impact, and build purpose from your passion. Register now for the 2020 competition. 

 

Author: Microsoft News Center

Bringing together deep bioscience and AI to help patients worldwide: Novartis and Microsoft work to reinvent treatment discovery and development

In the world of commercial research and science, there’s probably no undertaking more daunting – or more expensive – than the process of bringing a new medicine to market. For a new compound to make it from initial discovery through development, testing and clinical trials to finally earn regulatory approval can take a decade or more. Nine out of 10 promising drug candidates fail somewhere along the way. As a result, on average, it costs life sciences companies $2.6 billion to introduce a single new prescription drug.

This is much more than just a challenge for life sciences companies. Streamlining drug development is an urgent issue for human health more broadly. From uncovering new ways to treat age-old sicknesses like malaria, which still kills hundreds of thousands of people every year, to finding new cancer treatments or developing new vaccines to prevent highly contagious diseases from turning into global pandemics, the impact in terms of lives saved worldwide would be enormous if we could make inventing new medicines faster.

As announced today, this is why Novartis and Microsoft are collaborating to explore how to take advantage of advanced Microsoft AI technology combined with Novartis’ deep life sciences expertise to find new ways to address the challenges underlying every phase of drug development – including research, clinical trials, manufacturing, operations and finance. In a recent interview, Novartis CEO Vas Narasimhan spoke about the potential for this alliance to unlock the power of AI to help Novartis accelerate research into new treatments for many of the thousands of diseases for which there is, as yet, no known cure.

In the biotech industry, there have been amazing scientific advances in recent years that have the potential to revolutionize the discovery of new, life-saving drugs. Because many of these advances are based on the ability to analyze huge amounts of data in new ways, developing new drugs has become as much an AI and data science problem as it is a biology and chemistry problem. This means companies like Novartis need to become data science companies to an extent never seen before. Central to our work together is a focus on empowering Novartis associates at each step of drug development to use AI to unlock the insights hidden in vast amounts of data, even if they aren’t data scientists. That’s because while the exponential increase in digital health information in recent years offers new opportunities to improve human health, making sense of all the data is a huge challenge.

The issue isn’t just one of overwhelming volume. Much of the information exists in the form of unstructured data, such as research lab notes, medical journal articles, and clinical trial results, all of which is typically stored in disconnected systems. This makes bringing all that data together extremely difficult.

Our two companies have a dream. We want all Novartis associates – even those without special expertise in data science – to be able to use Microsoft AI solutions every day, to analyze large amounts of information and discover new correlations and patterns critical to finding new medicines. The goal of this strategic collaboration is to make this dream a reality. This offers the potential to empower everyone from researchers exploring the potential of new compounds and scientists figuring out dosage levels, to clinical trial experts measuring results, operations managers seeking to make supply chains more efficient, and even business teams looking to make more effective decisions. And as associates work on new problems and develop new AI models, they will continually build on each other’s work, creating a virtuous cycle of exploration and discovery. The result? Pervasive intelligence that spans the company and reaches across the entire drug discovery process, improving Novartis’ ability to find answers to some of the world’s most pressing health challenges.

As part of our work with Novartis, data scientists from Microsoft Research and research teams from Novartis will also work together to investigate how AI can help unlock transformational new approaches in three specific areas. The first is about personalized treatment for macular degeneration – a leading cause of irreversible blindness. The second will involve exploring ways to use AI to make manufacturing new gene and cell therapies more efficient, with an initial focus on acute lymphoblastic leukemia. And the third area will focus on using AI to shorten the time required to design new medicines, using pioneering neural networks developed by Microsoft to automatically generate, screen and select promising molecules. As our work together moves forward, we expect that the scope of our joint research will grow.

At Microsoft, we’re excited about the potential for this collaboration to transform R&D in life sciences. As Microsoft CEO Satya Nadella explained, putting the power of AI in the hands of Novartis employees will give the company unprecedented opportunities to explore new frontiers of medicine that will yield new life-saving treatments for patients around the world.

While we’re just at the beginning of a long process of exploration and discovery, this strategic alliance marks the start of an important collaborative effort that promises to have a profound impact on how breakthrough medicines and treatments are developed and delivered. With the depth and breadth of knowledge that Novartis offers in bioscience and Microsoft’s unmatched expertise in computer science and AI, we have a unique opportunity to reinvent the way new medicines are created. Through this process, we believe we can help lead the way forward toward a world where high-quality treatment and care is significantly more personal, more effective, more affordable and more accessible.


Author: Steve Clarke

DevSecOps veterans share security strategy, lessons learned

SEATTLE — DevSecOps strategy is as much an art as a science, but experienced practitioners have a few pointers about how to approach it, including what not to do.

The first task DevSecOps newcomers should undertake, according to Julien Vehent, security engineer at web software firm Mozilla, is to design an effective IT security team structure. In his view, the ideal is an org chart that embeds security engineers with DevOps teams but has them report to a centralized security department. This structure helps to balance their impact on day-to-day operations with maintaining a cohesive set of broad goals for the organization.

This embedding can and should go both ways, Vehent added — security champions from DevOps teams should also have access to the central security organization to inform their work.

“Sys admins are great at security,” he said in a presentation here at DevSecCon this week. “Talk to people who do red teaming in their organization, and half the time they get caught by the sys admins.”

Once DevOps and IT security teams are aligned, the most important groundwork for improved DevOps security is to gather accurate data on IT assets and the IT environment, and give IT teams access to relevant data in context, practitioners said.

“What you really want from [DevSecOps] models is to avoid making assumptions and to test those assumptions, because assumptions lead to vulnerability,” Vehent said, recalling an incident at Mozilla where an assumption about SSL certificate expiration data brought down Mozilla’s add-ons service at launch.

Since then, Vehent’s mantra has been, “Avoid assumptions, trust the data.”
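
In that spirit, assumptions can be replaced with a measurement. The sketch below, written in Python with only the standard library, checks certificate expiry directly rather than assuming it is far off; it is a minimal illustration, not Mozilla's actual tooling, and the watch list is hypothetical.

```python
# Minimal sketch: measure certificate expiry instead of assuming it.
import socket
import ssl
import time

WARN_DAYS = 30  # illustrative alert threshold

def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of whole days before the server certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expiry = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expiry - time.time()) // 86400)

if __name__ == "__main__":
    for endpoint in ["addons.mozilla.org"]:  # hypothetical watch list
        remaining = days_until_expiry(endpoint)
        status = "ALERT" if remaining < WARN_DAYS else "OK"
        print(f"{status}: certificate for {endpoint} expires in {remaining} days")
```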

Effective DevSecOps tools help make data-driven decisions

Assumptions lead to vulnerability. Avoid assumptions, trust the data.
Julien Vehent, security engineer, Mozilla

Once a strategy is in place, it’s time to evaluate tools for security automation and visibility. Context is key in security monitoring, said Erkang Zheng, chief information security officer at LifeOmic Security, a healthcare software company, which also markets its internally developed security visibility tools as JupiterOne.

“Attackers think in graphs, defenders think in lists, and that’s how attackers win,” Zheng said during a presentation here. “Stop thinking in lists and tables, and start thinking in entities and relationships.”

For example, it’s not enough to know how many AWS Elastic Compute Cloud (EC2) instances an organization has; teams also need to understand their context by analyzing multiple factors, such as which ones are exposed to the internet, both directly and through cross-account access methods.

IT pros can configure such security visibility graphs with APIs and graph databases, or use prepackaged tools. There are also open source tools available to help developers assess the security of their own applications, such as Mozilla’s Observatory.
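
To make the graph idea concrete, here is a small Python sketch using networkx. It is not JupiterOne's product or API; the entities, relationships, and attribute names are invented for illustration.

```python
# Illustrative asset graph: entities are nodes, relationships are edges.
import networkx as nx

g = nx.DiGraph()
g.add_node("internet", kind="zone")
g.add_node("sg-web", kind="security_group", open_ports=[443])
g.add_node("i-web-01", kind="ec2_instance")
g.add_node("i-batch-07", kind="ec2_instance")
g.add_node("role-cross-acct", kind="iam_role")

g.add_edge("internet", "sg-web", relation="allows_ingress")
g.add_edge("sg-web", "i-web-01", relation="attached_to")
g.add_edge("internet", "role-cross-acct", relation="federated_access")
g.add_edge("role-cross-acct", "i-batch-07", relation="assumable_cross_account")

# Graph thinking: any instance reachable from "internet" is exposed, whether
# directly through a security group or indirectly via cross-account access.
exposed = [
    node for node, attrs in g.nodes(data=True)
    if attrs.get("kind") == "ec2_instance" and nx.has_path(g, "internet", node)
]
print("Internet-reachable instances:", exposed)
```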

LifeOmic also takes a code-driven, systematized approach to DevOps security documentation, Zheng said. Team members create “microdocuments,” similar to microservices, and check them into GitHub as version-controlled JSON and YAML files.
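
LifeOmic's actual schema isn't described in detail, so the sketch below only illustrates the idea: each policy statement lives in a small, version-controlled YAML file, and a check can validate it before merge. The field names are assumptions.

```python
# Illustrative check for a policy "microdocument" (assumed field names).
import yaml  # PyYAML

REQUIRED_FIELDS = {"id", "title", "owner", "statement", "last_reviewed"}

def validate_microdoc(path: str) -> list[str]:
    """Return human-readable problems found in a single YAML policy file."""
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    missing = REQUIRED_FIELDS - set(doc)
    return [f"{path}: missing required field '{name}'" for name in sorted(missing)]

# A CI step could run this over every policy file changed in a pull request:
# problems = validate_microdoc("policies/access-control.yaml")
```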

Another speaker urged IT pros new to DevSecOps to take a templatized approach to IT training documentation for cybersecurity that explains, in jargon-free language, the reasons for best practices and gives specific examples of how developers often want to do things versus how they should do things to ensure application security.

“The important thing is to make the secure way the easy way to do things for developers,” said Morgan Roman, application penetration tester at electronic signature software maker DocuSign. “Find bad patterns, and make the safer way to do things the default.”
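
One lightweight way to find bad patterns and nudge developers toward the safer default is a scan that runs before code review. The patterns and advice below are purely illustrative, not DocuSign's actual rule set.

```python
# Illustrative "bad pattern" scanner; each finding points to the safer default.
import re
import sys

BAD_PATTERNS = [
    (re.compile(r"subprocess\.\w+\(.*shell=True"), "pass a list of args instead of shell=True"),
    (re.compile(r"yaml\.load\((?!.*Loader)"), "use yaml.safe_load for untrusted input"),
    (re.compile(r"\"SELECT .*\"\s*%"), "use parameterized queries, not string formatting"),
]

def scan(path: str) -> list[str]:
    findings = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for lineno, line in enumerate(f, start=1):
            for pattern, advice in BAD_PATTERNS:
                if pattern.search(line):
                    findings.append(f"{path}:{lineno}: {advice}")
    return findings

if __name__ == "__main__":
    results = [finding for path in sys.argv[1:] for finding in scan(path)]
    print("\n".join(results) or "No risky patterns found.")
    sys.exit(1 if results else 0)
```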

DevSecOps how tos — and how NOT to dos

Strategic planning and assessments are important, but certain lessons about DevOps security can only be learned through experience. A panel of cybersecurity practitioners from blue-chip software companies shared their lessons learned, along with tips to help attendees avoid learning the hard way.

Attackers think in graphs, defenders think in lists, and that’s how attackers win.
Erkang Zheng, chief information security officer, LifeOmic Security

Multiple panelists said they struggled to get effective results from code analysis tools and spent time setting up software that returned very little value or, worse, created false-positive security alerts.

“We tried to do things like, ‘Hey, let’s make sure that we aren’t just checking in secrets to the code repository,'” said Hongyi Hu, security engineer at SaaS file-sharing firm Dropbox, based in San Francisco. “It turns out that there’s not really a standardized way of doing these things. … You can find things that look like secrets, but secrets don’t always look like secrets — a really weak secret might not be captured by a tool.”
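
A rough sketch shows why “looks like a secret” is a fuzzy target: simple scanners combine known key formats with an entropy score, and a weak, low-entropy secret slips past both checks. Everything below is illustrative.

```python
# Illustrative secret detector: known formats plus an entropy heuristic.
import math
import re
from collections import Counter

KNOWN_FORMATS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
]

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def looks_like_secret(token: str, entropy_threshold: float = 4.0) -> bool:
    if any(p.search(token) for p in KNOWN_FORMATS):
        return True
    # Long, high-entropy strings are suspicious even without a known format.
    return len(token) >= 20 and shannon_entropy(token) > entropy_threshold

print(looks_like_secret("AKIAIOSFODNN7EXAMPLE"))      # True: matches a known shape
print(looks_like_secret("9f8a7Qz+TrX1mKpL0vWc3EeY"))  # True: long and high-entropy
print(looks_like_secret("hunter2"))                   # False: a weak secret evades the scan
```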

Ultimately, no tool can replace effective communication within DevOps teams to improve IT security, panelists said. It sounds like a truism, but it represents a major shift in the way cybersecurity teams work, from being naysayers to acting as a consulting resource to apps teams. Often, a light-handed approach is best.

“The most effective strategy we got to with threat modeling was throwing away any heavyweight process we had,” said Zane Lackey, co-founder and CSO at WAF vendor Signal Sciences in Los Angeles. “As product teams were developing new features, someone from the security team would just sit in the back of their meeting and ask them, ‘How would you attack this?’ and then shut up.”

It takes time to gain DevOps teams’ trust after years of adversarial relationships with security teams, panelists said, but when all else fails, security pros can catch more flies with honey — or candy.

“We put out bowls of candy in the security team’s area and it encouraged people to come and ask them questions,” Lackey said. “It was actually wildly successful.”


Adobe Experience Platform adds features for data scientists

After almost a year in beta, Adobe has introduced Query Service and Data Science Workspace to the Adobe Experience Platform to enable brands to deliver tailored digital experiences to their customers, with real-time data analytics and understanding of customer behavior.

Powered by Adobe Sensei, the vendor’s AI and machine learning technology, Query Service and Data Science Workspace intend to automate tedious, manual processes and enable real-time data personalization for large organizations.

The Adobe Experience Platform — previously the Adobe Cloud Platform — is an open platform for customer experience management that synthesizes and breaks down silos for customer data in one unified customer profile.

According to Adobe, the volume of data organizations must manage has exploded. IDC predicted the Global DataSphere will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. And while more data is better, it makes it difficult for businesses and analysts to sort, digest and analyze all of it to find answers. Query Service intends to simplify this process, according to the vendor.

Query Service enables analysts and data scientists to perform queries across all data sets in the platform instead of manually combing through siloed data sets to find answers for data-related questions. Query Service supports cross-channel and cross-platform queries, including behavioral, point-of-sale and customer relationship management data. Query Service enables users to do the following:

  • run queries manually with interactive jobs or automatically with batch jobs;
  • subgroup records based on time and generate session numbers and page numbers (see the sessionization sketch after this list);
  • use tools that support complex joins, nested queries, window functions and time-partitioned queries;
  • break down data to evaluate key customer events; and
  • view and understand how customers flow across all channels.
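
Query Service itself is queried with SQL, but the time-based sessionization called out in the list above is a standard window-function pattern. As an illustration of the technique (not Adobe's API), here is a sketch in PySpark with made-up event data: a gap of more than 30 minutes starts a new session, a running sum numbers the sessions, and a row number within each session gives the page number.

```python
# Illustrative sessionization with window functions (hypothetical event data).
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("sessionize-sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", "2019-05-01 10:00:00"), ("u1", "2019-05-01 10:10:00"),
     ("u1", "2019-05-01 12:00:00"), ("u2", "2019-05-01 09:00:00")],
    ["user_id", "ts"],
).withColumn("ts", F.to_timestamp("ts"))

w = Window.partitionBy("user_id").orderBy("ts")
gap = F.col("ts").cast("long") - F.lag("ts").over(w).cast("long")

sessions = (
    events
    # A gap of more than 30 minutes (or the first event) starts a new session.
    .withColumn("new_session", F.when(gap.isNull() | (gap > 1800), 1).otherwise(0))
    # Running sum of session starts gives a per-user session number.
    .withColumn("session_number", F.sum("new_session").over(w))
    # Position of the event within its session gives the page number.
    .withColumn("page_number", F.row_number().over(
        Window.partitionBy("user_id", "session_number").orderBy("ts")))
)
sessions.show()
```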

While Query Service simplifies the data identification process, Data Science Workspace helps to digest data and enables data scientists to draw insights and take action. Using Adobe Sensei’s AI technology, Data Science Workspace automates repetitive tasks and understands and predicts customer data to provide real-time intelligence.

Also within Data Science Workspace, users can take advantage of tools to develop, train and tune machine learning models to solve business challenges, such as calculating customer predisposition to buy certain products. Data scientists can also develop custom models to pull particular insights and predictions to personalize customer experiences across all touchpoints.

Additional capabilities of Data Science Workspace enable users to perform the following tasks:

  • explore all data stored in Adobe Experience Platform, as well as deep learning libraries like Spark ML and TensorFlow;
  • use prebuilt or custom machine learning recipes for common business needs;
  • experiment with recipes to create and train an unlimited number of tracked instances;
  • publish intelligent service recipes to Adobe I/O without IT involvement; and
  • continuously evaluate intelligent service accuracy and retrain recipes as needed.

Adobe data analytics features Query Service and Data Science Workspace were first introduced as part of the Adobe Experience Platform in beta in September 2018. Adobe intends these tools to improve how data scientists handle data on the Adobe Experience Platform and to create meaningful models that developers can build on.


Data Center Scale Computing and Artificial Intelligence with Matei Zaharia, Inventor of Apache Spark

Matei Zaharia, Chief Technologist at Databricks & Assistant Professor of Computer Science at Stanford University, in conversation with Joseph Sirosh, Chief Technology Officer of Artificial Intelligence in Microsoft’s Worldwide Commercial Business


At Microsoft, we are privileged to work with individuals whose ideas are blazing a trail, transforming entire businesses through the power of the cloud, big data and artificial intelligence. Our new “Pioneers in AI” series features insights from such pathbreakers. Join us as we dive into these innovators’ ideas and the solutions they are bringing to market. See how your own organization and customers can benefit from their solutions and insights.

Our first guest in the series, Matei Zaharia, started the Apache Spark project during his PhD at the University of California, Berkeley, in 2009. His research was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in Computer Science. He is a co-founder of Databricks, which offers a Unified Analytics Platform powered by Apache Spark. Databricks’ mission is to accelerate innovation by unifying data science, engineering and business. Microsoft has partnered with Databricks to bring you Azure Databricks, a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. Azure Databricks offers one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts to generate great value from data faster.

So, let’s jump right in and see what Matei has to say about Spark, machine learning, and interesting AI applications that he’s encountered lately.

Video and podcast versions of this session are available at the links below. The podcast is also available from your Spotify app and via Stitcher. Alternatively, just continue reading the text version of their conversation below, via this blog post.

Joseph Sirosh: Matei, could you tell us a little bit about how you got started with Spark and this new revolution in analytics you are driving?

Matei Zaharia: Back in 2007, I started doing my PhD at UC Berkeley and I was very interested in data center scale computing, and we just saw at the time that there was an open source MapReduce implementation in Apache Hadoop, so I started early on by looking at that. Actually, the first project was profiling Hadoop workloads to identify some bottlenecks and, as part of that, we made some improvements to the Hadoop job scheduler and that actually went into Hadoop and I started working with some of the early users of that, especially Facebook and Yahoo. And what we saw across all of these is that this type of large data center scale computing was very powerful, there were a lot of interesting applications they could do with them, but just the map-reduce programming model alone wasn’t really sufficient, especially for machine learning – that’s something everyone wanted to do where it wasn’t a good fit but also for interactive queries and streaming and other workloads.

So, after seeing this for a while, the first project we built was the Apache Mesos cluster manager, to let you run other types of computations next to Hadoop. And then we said, you know, we should try to build our own computation engine which ended up becoming Apache Spark.

JS: What was unique about Spark?

MZ: I think there were a few interesting things about it. One of them was that it tried to be a general or unified programming model that can support many types of computations. So, before the Spark project, people wanted to do these different computations on large clusters and they were designing specialized engines to do particular things, like graph processing, SQL custom code, ETL which would be map-reduce, they were all separate projects and engines. So in Spark we kind of stepped back at them and looked at these and said is there any way we can come up with a common abstraction that can handle these workloads and we ended up with something that was a pretty small change to MapReduce – MapReduce plus fast data sharing, which is the in-memory RDDs in Spark, and just hooking these up into a graph of computations turned out to be enough to get really good performance for all the workloads and matched the specialized engines, and also much better performance if your workload combines a bunch of steps. So that is one of the things.

I think the other thing which was important is, having a unified engine, we could also have a very composable API where a lot of the things you want to use would become libraries, so now there are hundreds maybe thousands of third party packages that you can use with Apache Spark which just plug into it that you can combine into a workflow. Again, none of the earlier engines had focused on establishing a platform and an ecosystem but that’s why it’s really valuable to users and developers, is just being able to pick and choose libraries and arm them.

JS: Machine Learning is not just one single thing, it involves so many steps. Now Spark provides a simple way to compose all of these through libraries in a Spark pipeline and build an entire machine learning workflow and application. Is that why Spark is uniquely good at machine learning?

MZ: I think it’s a couple of reasons. One reason is much of machine learning is preparing and understanding the data, both the input data and also actually the predictions and the behavior of the model, and Spark really excels at that ad hoc data processing using code – you can use SQL, you can use Python, you can use DataFrames, and it just makes those operations easy, and, of course, all the operations you do also scale to large datasets, which is, of course, important because you want to train machine learning on lots of data.

Beyond that, it does support iterative in-memory computation, so many algorithms run pretty well inside it, and because of this support for composition and this API where you can plug in libraries, there are also quite a few libraries you can plug in that call external compute engines that are optimized to do different types of numerical computation.
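
As a rough illustration of that composability (a sketch, not Databricks code), the snippet below prepares a small, made-up dataset and trains a model in the same Spark job using the MLlib Pipeline API.

```python
# Illustrative end-to-end job: data prep and model training compose as
# libraries inside one Spark application.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

reviews = spark.createDataFrame(
    [("great product, works well", 1.0), ("stopped working after a week", 0.0),
     ("works great", 1.0), ("broken on arrival", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 10),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(reviews)
model.transform(reviews).select("text", "prediction").show(truncate=False)
```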

JS: So why didn’t some of these newer deep learning toolsets get built on top of Spark? Why were they all separate?

MZ: That’s a good question. I think a lot of the reason is probably just because people, you know, just started with a different programming language. A lot of these were started with C++, for example, and of course, they need to run on the GPU using CUDA which is much easier to do from C++ than from Java. But one thing we’re seeing is really good connectors between Spark and these tools. So, for example, TensorFlow has a built-in Spark connector that can be used to get data from Spark and convert it to TFRecords. It also actually connects to HDFS and different sorts of big data file systems. At the same time, in the Spark community, there are packages like deep learning pipelines from Databricks and quite a few other packages as well that let you setup a workflow of steps that include these deep learning engines and Spark processing steps.

“None of the earlier engines [prior to Apache Spark] had focused on establishing a platform and an ecosystem.”

JS: If you were rebuilding these deep learning tools and frameworks, would you recommend that people build it on top of Spark? (i.e. instead of the current approach, of having a tool, but they have an approach of doing distributed computing across GPUs on their own.)

MZ: It’s a good question. I think initially it was easier to write GPU code directly, to use CUDA and C++ and so on. And over time, actually, the community has been adding features to Spark that will make it easier to do that in there. So, there’s definitely been a lot of proposals and design to make GPU a first-class resource. There’s also this effort called Project Hydrogen which is to change the scheduler to support these MPI-like batch jobs. So hopefully it will become a good platform to do that, internally. I think one of the main benefits of that is again for users that they can either program in one programming language, they can learn just one way to deploy and manage clusters and it can do deep learning and the data preprocessing and analytics after that.

JS: That’s great. So, Spark – and Databricks as commercialized Spark – seems to be capable of doing many things in one place. But what is not good at? Can you share some areas where people should not be stretching Spark?

MZ: Definitely. One of the things it doesn’t do, by design, is transactional workloads where you have fine-grained updates. So, even though it might seem like you can store a lot of data in memory and then update it and serve it, it is not really designed for that. It is designed for computations that have a large amount of data in each step. So, it could be streaming large continuous streams, or it could be batch, but it is not for these point queries.

And I would say the other thing it does not do it is doesn’t have a built-in persistent storage system. It is designed so it’s just a compute engine and you can connect it to different types of storage and that actually makes a lot of sense, especially in the cloud, with separating compute and storage and scaling them independently. But it is different from, you know, something like a database where the storage and compute are co-designed to live together.

JS: That makes sense. What do you think of frameworks like Ray for machine learning?

MZ: There are lot of new frameworks coming out for machine learning and it’s exciting to see the innovation there, both in the programming models, the interface, and how to work with it. So I think Ray has been focused on reinforcement learning which is where one of the main things you have to do is spawn a lot of little independent tasks, so it’s a bit different from a big data framework like Spark where you’re doing one computation on lots of data – these are separate computations that will take different amounts of time, and, as far as I know, users are starting to use that and getting good traction with it. So, it will be interesting to see how these things come about.

I think the thing I’m most interested in, both for Databricks products and for Apache Spark, is just enabling it to be a platform where you can combine the best algorithms, libraries and frameworks and so on, because that’s what seems to be very valuable to end users, is they can orchestrate a workflow and just program it as easily as writing a single machine application where you just import a bunch of libraries.

JS: Now, stepping back, what do you see as the most exciting applications that are happening in AI today?

MZ: Yeah, it depends on how recent. I mean, in the past five years, deep learning is definitely the thing that has changed a lot of what we can do, and, in particular, it has made it much easier to work with unstructured data – so images, text, and so on. So that is pretty exciting.

I think, honestly, for like wide consumption of AI, the cloud computing AI services make it significantly easier. So, I mean, when you’re doing machine learning AI projects, it’s really important to be able to iterate quickly because it’s all about, you know, about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work. And I think the cloud makes it much easier.

JS: Cloud AI is super exciting, I completely agree. Now, at Stanford, being a professor, you must see a lot of really exciting pieces of work that are going on, both at Stanford and at startups nearby. What are some examples?

MZ: Yeah, there are a lot of different things. One of the things that is really useful for end users is all the work on transfer learning, and in general all the work that lets you get good results with AI using smaller training datasets. There are other approaches as well like weak supervision that do that as well. And the reason that’s important is that for web-scale problems you have lot of labeled data, so for something like web search you can solve it, but for many scientific or business problems you don’t have that, and so, how can you learn from a large dataset that’s not quite in your domain like the web and then apply to something like, say, medical images, where only a few hundred patients have a certain condition so you can’t get a zillion images. So that’s where I’ve seen a lot of exciting stuff.

But yeah, there’s everything from new hardware for machine learning where you throw away the constraints that the computation has to be precise and deterministic, to new applications, to things like, for example security of AI, adversarial examples, verifiability, I think they are all pretty interesting things you can do.

JS: What are some of the most interesting applications you have seen of AI?

MZ: So many different applications to start with. First of all, we’ve seen consumer devices that bring AI into every home, or every phone, or every PC – these have taken off very quickly and it’s something that a large fraction of customers use, so that’s pretty cool to see.

In the business space, probably some of the more exciting things are actually dealing with image data, where, using deep learning and transfer learning, you can actually start to reliably build classifiers for different types of domain data. So, whether it’s maps, understanding satellite images, or even something as simple as people uploading images of a car to a website and you try to give feedback on that so it’s easier to describe it, a lot of these are starting to happen. So, it’s kind of a new class of data, visual data – we couldn’t do that much with it automatically before, and now you can get both like little features and big products that use it.

JS: So what do you see as the future of Databricks itself? What are some of the innovations you are driving?

MZ: Databricks, for people not familiar, we offer basically, a Unified Analytics Platform, where you can work with big data mostly through Apache Spark and collaborate with it in an organization, so you can have different people, developing say notebooks to perform computations, you can have people developing production jobs, you can connect these together into workflows, and so on.

So, we’re doing a lot of things to further expand on that vision. One of the things that we announced recently is what we call machine learning runtime where we have preinstalled versions of popular machine learning libraries like XGBoost or TensorFlow or Horovod on your Databricks cluster, so you can set those up as easily as you could set up an Apache Spark cluster in the past. And then another product that we featured a lot at our Spark Summit conference this year is Databricks Delta which is basically a transactional data management layer on top of cloud object stores that lets us do things like indexing, reliable exactly-once stream processing, and so on at very massive scale, and that’s a problem that all our users have, because all our users have to set up a reliable data ingest pipeline.
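
As a rough sketch of the usage pattern (assuming a cluster where the Delta libraries are available; the storage path is hypothetical), writing and reading a Delta table looks like this:

```python
# Illustrative Delta table usage on a cluster with the Delta libraries installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()
events_path = "/mnt/datalake/events_delta"  # hypothetical storage location

# A batch write lands records as a transactional table on object storage.
batch = spark.range(0, 1000).withColumnRenamed("id", "event_id")
batch.write.format("delta").mode("append").save(events_path)

# Streaming jobs can append to the same table with exactly-once semantics,
# and readers always see a consistent snapshot.
df = spark.read.format("delta").load(events_path)
print(df.count())
```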

JS: Who are some of the most exciting customers of Databricks and what are they doing?

MZ: There are a lot of really interesting customers doing pretty cool things. So, at our conference this year, for example, one of the really cool presentations we saw was from Apple. So, Apple’s internal information security group – this is the group that does network monitoring, basically gets hundreds of terabytes of network events per day to process, to detect intrusions and information security problems. They spoke about using Databricks Delta and streaming with Apache Spark to handle all of that – so it’s one of the largest applications people have talked about publicly, and it’s very cool because the whole goal there – it’s kind of an arms race between the security team and attackers – so you really want to be able to design new rules, new measurements and add new data sources quickly. And so, the ease of programming and the ease of collaborating with this team of dozens of people was super important.

We also have some really exciting health and life sciences applications, so some of these are actually starting to discover new drugs that companies can actually productionize to tackle new diseases, and this is all based on large scale genomics and statistical studies.

And there are a lot of more fun applications as well. Like actually the largest video game in the world, League of Legends, they use Databricks and Apache Spark to detect players that are misbehaving or to recommend items to people or things like that. These are all things that were featured at the conference.

JS: If you had one advice to developers and customers using Spark or Databricks, or guidance on what they should learn, what would that be?

MZ: It’s a good question. There are a lot of high-quality training materials online, so I would say definitely look at some of those for your use case and see what other people are doing in that area. The Spark Summit conference is also a good way to see videos and talks and we make all of those available for free, the goal of that is to help and grow the developer community. So, look for someone who is doing similar things and be inspired by that and kinda see what the best practices are around that, because you might see a lot of different options for how to get started and it can be hard to see what the right path is.

JS: One last question – in recent years there’s been a lot of fear, uncertainty and doubt about AI, and a lot of popular press. Now – how real are they, and what do you think people should be thinking?

MZ: That’s a good question. My personal view is – this sort of evil artificial general intelligence stuff – we are very far away from it. And basically, if you don’t believe that, I would say just try doing machine learning tutorials and see how these models break down – you get a sense for how difficult that is.

But there are some real challenges that will come from AI, so I think one of them is the same challenge as with all technology which is, automation – how quickly does it happen. Ultimately, after automation, people usually end up being better off, but it can definitely affect some industries in a pretty bad way and if there is no time for people to transition out, that can be a problem.

I think the other interesting problem, which there is always a discussion about, is basically access to data, privacy, managing the data, and algorithmic discrimination – I think we are still figuring out how to handle that. Companies are doing their best, but there are also many unknowns as to how these techniques will play out. That’s why we’ll see better best practices or regulations and things like that.

JS: Well, thank you Matei, it’s simply amazing to see the innovations you have driven, and looking forward to more to come.

MZ: Thanks for having me.

“When you’re doing machine learning AI projects, it’s really important to be able to iterate quickly because it’s all about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work. And I think the cloud makes it much easier.”

We hope you enjoyed this blog post. This being our first episode in the series, we are eager to hear your feedback, so please share your thoughts and ideas below.

The AI / ML Blog Team


The nine roles you need on your data science research team

It’s easy to focus too much on building a data science research team loaded with Ph.D.s to do machine learning at the expense of developing other data science skills needed to compete in today’s data-driven, digital economy. While high-end, specialty data science skills for machine learning are important, they can also get in the way of a more pragmatic and useful adoption of data science. That’s the view of Cassie Kozyrkov, chief decision scientist at Google and a proponent of the democratization of data-based organizational decision-making.

To start, CIOs need to expand their thinking about the types of roles involved in implementing data science programs, Kozyrkov said at the recent Rev Data Science Leaders Summit in San Francisco.

For example, it’s important to think about data science research as a specialty role developed to provide intelligence for important business decisions. “If an answer involves one or more important decisions, then you need to bring in the data scientists,” said Kozyrkov, who designed Google’s analytics program and trained more than 15,000 Google employees in statistics, decision-making and machine learning.

But other tasks related to data analytics, like making informational charts, testing out various algorithms and making better decisions, are best handled by other data science team members with entirely different skill sets.

Data science roles: The nine must-haves

There are a variety of data science research roles for an organization to consider and certain characteristics best suited for each. Most enterprises already have correctly filled several of these data science positions, but most will also have people with the wrong skills or motivations in certain data science roles. This mismatch can slow things down or demotivate others throughout the enterprise, so it’s important for CIOs to carefully consider who staffs these roles to get the most from their data science research.

Here is Kozyrkov’s rundown of the essential data science roles and the part each plays in helping organizations make more intelligent business decisions.

Data engineers are people who have the skills and ability to get data required for analysis at scale.

Basic analysts could be anyone in the organization with a willingness to explore data and plot relationships using various tools. Kozyrkov suggested it may be hard for data scientists to cede some responsibility for basic analysis to others. But, in the long run, the value of data scientists will grow, as more people throughout the company are already doing basic analytics.

Expert analysts, on the other hand, should be able to search through data sets quickly. You don’t want to put a software engineer or very methodical person in this role, because they are too slow.

“The expert software engineer will do something beautiful, but won’t look at much of your data sets,” Kozyrkov said. You want someone who is sloppy and will run around your data. Caution is warranted in buffering expert analysts from software developers inclined to complain about sloppy — yet quickly produced — code.

Statisticians are the spoilsports who will explain how your latest theory does not hold up for 20 different reasons. These people can kill motivation and excitement. But they are also important for coming to conclusions safely for important decisions.

A machine learning engineer is not a researcher who builds algorithms. Instead, these AI-focused computer programmers excel at moving a lot of data sets through a variety of software packages to decide if the output looks promising. The best person for this job is not a perfectionist who would slow things down by looking for the best algorithm.

A good machine learning engineer, in Kozyrkov’s view, is someone who doesn’t know what they are doing and will try out everything quickly. “The perfectionist needs to have the perfection encouraged out of them,” she said.

Too many businesses are trying to staff the team with a bunch of Ph.D. researchers. These folks want to do research, not solve a business problem.
Cassie Kozyrkov, chief decision scientist at Google

A data scientist is an expert who is well-trained in statistics and also good at machine learning. They tend to be expensive, so Kozyrkov recommended using them strategically.

A data science manager is a data scientist who wakes up one day and decides he or she wants to do something different to benefit the bottom line. These folks can connect the decision-making side of business with the data science of big data. “If you find one of these, grab them and never let them go,” Kozyrkov said.

A qualitative expert is a social scientist who can assess decision-making. This person is good at helping decision-makers set up a problem in a way that can be solved with data science. They tend to have better business management training than some of the other roles.

A data science researcher has the skills to craft customized data science and machine learning algorithms. Data science researchers should not be an early hire. “Too many businesses are trying to staff the team with a bunch of Ph.D. researchers. These folks want to do research, not solve a business problem,” Kozyrkov said. “This is a hire you only need in a few cases.”

Prioritize data science research projects

For CIOs looking to build their data science research team, develop a strategy for prioritizing and assigning data science projects. (See the aforementioned advice on hiring data science researchers.)

Decisions about what to prioritize should involve front-line business managers, who can decide what data science projects are worth pursuing.

In the long run, some of the most valuable skills lie in learning how to bridge the gap between business decision-makers and other roles. Doing this in a pragmatic way requires training in statistics, neuroscience, psychology, economic management, social sciences and machine learning, Kozyrkov said. 

A new partnership to support computer science teachers

Today we are excited to announce a new partnership with the Computer Science Teachers Association (CSTA). Microsoft will provide $2 million, over three years, to help CSTA launch new chapters and strengthen existing ones. It will help them attract new members and partners to build a stronger community to serve computer science teachers.

 We’re thrilled that students of all ages are discovering the exciting – and critical – field of computer science. From the Hour of Code, to Minecraft Education, and even Advanced Placement Computer Sciences courses, participation rates are expanding. This surge of student interest, combined with the premium our economy places on technology skill of all kinds, requires us to do all we can to ensure every student has access to computer science courses. And it all starts with our teachers.  

 Nearly every teacher belongs to a professional membership organization, from social studies, to reading, to math and science. These organizations provide teachers with subject-specific professional development, up-to-date curriculum, and networking opportunities with peers and other professionals. CSTA was started in 2004 to fill this need for computer science teachers. But to meet today’s needs in this quickly changing and growing field of study, CSTA is expanding as well. We are proud to support them!

 Our investment in CSTA continues Microsoft Philanthropies’ long-standing commitment to computer science education through our Technology Education and Literacy in Schools (TEALS) program, which pairs technology industry volunteers with classroom teachers to team-teach computer science in 350 U.S. high schools. It builds on our investments in nonprofits such as Code.org, Girls Who Code, and Boys & Girls Clubs of America, with whom we partnered to create a computer science learning pathway. And it builds on our work advocating at a state and federal level for policy change and investments in computer science education across the United States.  

While technology can be a powerful learning tool, nothing can replace the expertise, guidance, and encouragement that teachers provide to students each day of the school year. I remember my own favorite teachers who helped me see a world beyond the rural town in which I grew up. I would guess that nearly everyone has a similar story. We thank our teachers and we hope that this investment in computer science teachers, through CSTA, empowers more educators to do what they do best: make a positive difference in the lives of students. To learn how you can help CSTA serve teachers, please visit https://www.csteachers.org/page/GetInvolved.

Building a data science pipeline: Benefits, cautions

Enterprises are adopting data science pipelines for artificial intelligence, machine learning and plain old statistics. A data science pipeline — a sequence of actions for processing data — will help companies be more competitive in a digital, fast-moving economy. 

Before CIOs take this approach, however, it’s important to consider some of the key differences between data science development workflows and traditional application development workflows.

Data science development pipelines used for building predictive and data science models are inherently experimental and don’t always pan out in the same way as other software development processes, such as Agile and DevOps. Because data science models break and lose accuracy in different ways than traditional IT apps do, a data science pipeline needs to be scrutinized to assure the model reflects what the business is hoping to achieve.

At the recent Rev Data Science Leaders Summit in San Francisco, leading experts explored some of these important distinctions, and elaborated on ways that IT leaders can responsibly implement a data science pipeline. Most significantly, data science development pipelines need accountability, transparency and auditability. In addition, CIOs need to implement mechanisms for addressing the degradation of a model over time, or “model drift.” Having the right teams in place in the data science pipeline is also critical: Data science generalists work best in the early stages, while specialists add value to more mature data science processes.

Data science at Moody’s

Jacob Grotta, managing director, Moody's Analytics

CIOs might want to take note from Moody’s, the financial analytics giant, which was an early pioneer in using predictive modeling to assess the risks of bonds and investment portfolios. Jacob Grotta, managing director at Moody’s Analytics, said the company has streamlined the data science pipeline it uses to create models in order to be able to quickly adapt to changing business and economic conditions.

“As soon as a new model is built, it is at its peak performance, and over time, they get worse,” Grotta said. Declining model performance can have significant impacts. For example, in the finance industry, a model that doesn’t accurately predict mortgage default rates puts a bank in jeopardy. 

Watch out for assumptions

Grotta said it is important to keep in mind that data science models are created by and represent the assumptions of the data scientists behind them. Before the 2008 financial crisis, a firm approached Grotta with a new model for predicting the value of mortgage-backed derivatives, he said. When he asked what would happen if the prices of houses went down, the firm responded that the model predicted the market would be fine. But it didn’t have any data to support this. Mistakes like these cost the economy almost $14 trillion by some estimates.

The expectation among companies often is that someone understands what the model does and its inherent risks. But these unverified assumptions can create blind spots for even the most accurate models. Grotta said it is a good practice to create lines of defense against these sorts of blind spots.

The first line of defense is to encourage the data modelers to be honest about what they do and don’t know and to be clear on the questions they are being asked to solve. “It is not an easy thing for people to do,” Grotta said.

A second line of defense is verification and validation. Model verification involves checking to see that someone implemented the model correctly, and whether mistakes were made while coding it. Model validation, in contrast, is an independent challenge process to help a person developing a model to identify what assumptions went into the data. Ultimately, Grotta said, the only way to know if the modeler’s assumptions are accurate or not is to wait for the future.

A third line of defense is an internal audit or governance process. This involves making the results of these models explainable to front-line business managers. Grotta said he was working with a bank recently that protested its bank managers would not use a model if they didn’t understand what was driving its results. But he said the managers were right to do this. Having a governance process and ensuring information flows up and down the organization is extremely important, Grotta said.

Baking in accountability

Models degrade or “drift” over time, which is part of the reason organizations need to streamline their model development processes. It can take years to craft a new model. “By that time, you might have to go back and rebuild it,” Grotta said. Critical models must be revalidated every year.

To address this challenge, CIOs should think about creating a data science pipeline with an auditable, repeatable and transparent process. This promises to allow organizations to bring the same kind of iterative agility to model development that Agile and DevOps have brought to software development.

Transparent means that upstream and downstream people understand the model drivers. It is repeatable in that someone can repeat the process around creating it. It is auditable in the sense that there is a program in place to think about how to manage the process, take in new information, and get the model through the monitoring process. There are varying levels of this kind of agility today, but Grotta believes it is important for organizations to make it easy to update data science models in order to stay competitive.

How to keep up with model drift

Nick Elprin, CEO and co-founder of Domino Data Lab, a data science platform vendor, agreed that model drift is a problem that must be addressed head on when building a data science development pipeline. In some cases, the drift might be due to changes in the environment, like changing customer preferences or behavior. In other cases, drift could be caused by more adversarial factors. For example, criminals might adopt new strategies for defeating a new fraud detection model.

Nick Elprin, CEO and co-founder, Domino Data Lab

To keep up with this drift, CIOs need to put in place a process for monitoring the effectiveness of their data models over time and for establishing thresholds for replacing those models when performance degrades.

With traditional software monitoring, IT service management teams track metrics related to CPU, network and memory usage. With data science, CIOs need to capture metrics related to the accuracy of model results. “Software for [data science] production models needs to look at the output they are getting from those models, and if drift has occurred, that should raise an alarm to retrain it,” Elprin said.
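
A minimal sketch of such a monitor follows, in Python. The accuracy metric, the 5% tolerance and the alerting step are assumptions chosen for illustration; in practice the metric and threshold would depend on the model and the business cost of a wrong prediction.

```python
# Sketch of a drift monitor: compare live accuracy against the accuracy the
# model had at validation time and flag it for retraining when it degrades.
# The 5% tolerance and the alerting step are illustrative assumptions.

def check_for_drift(y_true, y_pred, baseline_accuracy, tolerance=0.05):
    """Return True if live accuracy has drifted below the acceptable band."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    live_accuracy = correct / len(y_true)
    drifted = live_accuracy < baseline_accuracy - tolerance
    if drifted:
        # In production this might page an on-call team or trigger retraining.
        print(f"Drift detected: live accuracy {live_accuracy:.2%} "
              f"vs. baseline {baseline_accuracy:.2%}; schedule retraining")
    return drifted

# Example: a fraud model validated at 94% accuracy, scored against recent
# labeled outcomes from production traffic.
check_for_drift(y_true=[1, 0, 0, 1, 0, 0, 1, 0],
                y_pred=[1, 0, 1, 0, 0, 1, 1, 1],
                baseline_accuracy=0.94)
```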

Fashion-forward data science

At Stitch Fix, a personal shopping service, the company’s data science pipeline allows it to sell clothes online at full price. Using data science in various ways lets the company find new ways to add value against deep-discount giants like Amazon, said Eric Colson, chief algorithms officer at Stitch Fix.

For example, the data science team has used natural language processing to improve its recommendation engines and guide inventory buying. Stitch Fix also uses genetic algorithms, which mimic evolution by applying randomized changes and iteratively selecting the best results. These streamline the process of designing clothes by generating countless iterations; fashion designers then vet the resulting designs.
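
The article does not describe Stitch Fix’s implementation, so the following is only a generic Python sketch of how a genetic algorithm works: candidate designs are randomly mutated, scored against a fitness function, and the best performers survive into the next generation. The numeric design encoding and the fitness target are toy assumptions.

```python
# Generic genetic-algorithm sketch: mutate candidate "designs", score them,
# and keep the best performers for the next generation. The fitness function
# and design encoding are toy assumptions, not Stitch Fix's system.
import random

TARGET = [0.8, 0.2, 0.5, 0.9]          # stand-in for an "ideal" design profile

def fitness(design):
    """Higher is better: negative squared distance from the target profile."""
    return -sum((d - t) ** 2 for d, t in zip(design, TARGET))

def mutate(design, rate=0.3):
    """Randomly perturb some attributes of a design."""
    return [min(1.0, max(0.0, d + random.uniform(-0.2, 0.2)))
            if random.random() < rate else d
            for d in design]

# Start from random designs, then evolve over a number of generations.
population = [[random.random() for _ in TARGET] for _ in range(20)]
for generation in range(50):
    offspring = [mutate(random.choice(population)) for _ in range(20)]
    # Select the fittest candidates from parents plus offspring.
    population = sorted(population + offspring, key=fitness, reverse=True)[:20]

best = population[0]
print(f"Best candidate after evolution: {[round(x, 2) for x in best]}")
# In the workflow described above, top candidates would then go to human
# designers for vetting.
```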

This kind of digital innovation, however, was only possible, he said, because the company created an efficient data science pipeline. He added that it was also critical that the data science team is considered a top-level department at Stitch Fix and reports directly to the CEO.

Specialists or generalists?

One important consideration for CIOs in constructing the data science development pipeline is whether to recruit data science specialists or generalists. Specialists are good at optimizing one step in a complex data science pipeline. Generalists can execute all the different tasks in a data science pipeline. In the early stages of a data science initiative, generalists can adapt to changes in the workflow more easily, Colson said.

Some of these tasks include feature engineering, model training, extract, transform and load (ETL) processing, API integration, and application development. It is tempting to staff each of these tasks with specialists to improve individual performance. “This may be true of assembly lines, but with data science, you don’t know what you are building, and you need to iterate,” Colson said. Iteration requires fluidity, and if the different roles are staffed with different people, every change introduces longer wait times.

In the beginning, at least, companies will benefit more from generalists. But once data science processes are established, after a few years, specialists may become more efficient.

Align data science with business

Today a lot of data science models are built in silos that are disconnected from normal business operations, Domino’s Elprin said. To make data science effective, it must be integrated into existing business processes. That starts with aligning data science projects with business initiatives, such as reducing the cost of fraudulent claims or improving customer engagement.

In less effective organizations, management tends to start with the data the company has collected and wonder what a data science team can do with it. In more effective organizations, data science is driven by business objectives.

“Getting to digital transformation requires top down buy-in to say this is important,” Elprin said. “The most successful organizations find ways to get quick wins to get political capital. Instead of twelve-month projects, quick wins will demonstrate value, and get more concrete engagement.”

New data science platforms aim to be workflow, collaboration hubs

An emerging class of data science platforms that provide collaboration and workflow management capabilities is gaining more attention from both users and vendors — most recently Oracle, which is buying its way into the market.

Oracle’s acquisition of startup DataScience.com puts more major-vendor muscle behind the workbench-style platforms, which give data science teams a collaborative environment for developing, deploying and documenting analytical models. IBM is already in the market with its Data Science Experience platform, informally known as DSX. Other vendors include Domino Data Lab and Cloudera, which last week detailed plans for a new release of its Cloudera Data Science Workbench (CDSW) software this summer.

These technologies are a subcategory of data science platforms overall. They aren’t analytics tools; they’re hubs that data scientists can use to build predictive and machine learning models in a shared and managed space — instead of doing so on their own laptops, without a central location to coordinate workflows and maintain models. Typically, they’re aimed at teams with 10 to 20 data scientists and up.

The workbenches began appearing in 2014, but it’s only over the past year or so that they matured into products suitable for mainstream users. Even now, the market is still developing. Domino and Cloudera wouldn’t disclose the number of customers they have for their technologies; in a March interview, DataScience.com CEO Ian Swanson said only that its namesake platform has “dozens” of users.

A new way to work with data science volunteers

Thorn, a nonprofit group that fights child sex trafficking and pornography, deployed Domino’s software in early 2017. The San Francisco-based organization only has one full-time data scientist, but it taps volunteers to do analytics work that helps law enforcement agencies identify and find trafficking victims. About 20 outside data scientists are often involved at a time — a number that swells to 100 or so during hackathons that Thorn holds, said Ruben van der Dussen, director of its Innovation Lab.

That makes this sort of data science platform a good fit for the group, he said. Before, the engineers on his team had to create separate computing instances on the Amazon Elastic Compute Cloud (EC2) for volunteers and set them up to log in from their own systems. With Domino, the engineers put Docker containers on Thorn’s EC2 environment, with embedded Jupyter Notebooks that the data scientists access via the web. That lets them start analyzing data faster and frees up time for the engineers to spend on more productive tasks, van der Dussen said.
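
As a rough illustration of that setup, the sketch below uses the docker-py SDK to start a containerized Jupyter Notebook for one volunteer. The image name, port mapping and token handling are assumptions; the article does not describe Thorn’s configuration at this level of detail.

```python
# Sketch of provisioning a per-volunteer, containerized Jupyter workspace
# instead of hand-building a dedicated EC2 instance for each person.
# Image, port and token handling are illustrative assumptions.
import docker  # docker-py SDK; assumes a reachable Docker daemon

def launch_volunteer_notebook(volunteer_id, token, host_port=8888):
    """Start an isolated Jupyter container for one volunteer and return its ID."""
    client = docker.from_env()
    container = client.containers.run(
        "jupyter/base-notebook",               # public Jupyter Docker Stacks image
        detach=True,
        name=f"notebook-{volunteer_id}",
        ports={"8888/tcp": host_port},         # expose the notebook server
        environment={"JUPYTER_TOKEN": token},  # per-volunteer access token
    )
    return container.id

# Example: one isolated workspace per external data scientist.
# launch_volunteer_notebook("volunteer-042", token="replace-me", host_port=8901)
```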

He added that data security and access privileges are also easier to manage now — an important consideration, given the sensitive nature of the images, ads and other online data that Thorn analyzes with a variety of machine learning and deep learning models, including ones based on natural language processing and computer vision algorithms.

Thorn develops and trains the analytical models within the Domino platform and uses it to maintain different versions of the Jupyter Notebooks, so the work done by data scientists is documented for other volunteers to pick up on. In addition, multiple people working together on a project can collaborate through the platform. The group uses tools like Slack for direct communication, “but Domino makes it really easy to share a Notebook and for people to comment on it,” van der Dussen said.

Screenshot: Domino Data Lab’s data science platform lets users run different analytics tools in separate workspaces.

Oracle puts its money down on data science

Oracle is betting that data science platforms like DataScience.com’s will become a popular technology for organizations that want to manage their advanced analytics processes more effectively. Oracle, which announced the acquisition this month, plans to combine DataScience.com’s platform with its own AI infrastructure and model training tools as part of a data science PaaS offering in the Oracle Cloud.

By buying DataScience.com, Oracle hopes to help users get more out of their analytics efforts — and better position itself as a machine learning vendor against rivals like Amazon Web Services, IBM, Google and Microsoft. Oracle said it will continue to invest in DataScience.com’s technology, with a goal of delivering “more functionality and capabilities at a quicker pace.” It didn’t disclose what it’s paying for the Culver City, Calif., startup.

The workbench platforms centralize work on analytics projects and management of the data science workflow. Data scientists can team up on projects and run various commercial and open source analytics tools to which the platforms connect, then deploy finished models for production applications. The platforms also support data security and governance, plus version control on analytical models.

Cloudera said its upcoming CDSW 1.4 release adds features for tracking and comparing different versions of models during the development and training process, as well as the ability to deploy models as REST APIs embedded in containers for easier integration into dashboards and other applications. DataScience.com, Domino and IBM provide similar functionality in their data science platforms.
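
The pattern those deployment features automate can be sketched as follows. This minimal Flask example, with a placeholder scoring function, shows the general idea of wrapping a trained model in a REST endpoint that a dashboard or application can call; it is not CDSW’s or any other vendor’s actual API.

```python
# Minimal sketch of serving a model behind a REST endpoint, the general
# pattern the workbench platforms automate. Flask and the placeholder
# scoring function are assumptions, not any vendor's deployment mechanism.
from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    """Stand-in for a trained model's predict call."""
    return sum(features) / max(len(features), 1)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    prediction = score(payload.get("features", []))
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Inside a container, this would typically sit behind a production WSGI server.
    app.run(host="0.0.0.0", port=8080)
```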

Screenshot: Cloudera Data Science Workbench uses a sessions concept for running analytics applications.

Choices on data science tools and platforms

Deutsche Telekom AG is offering both CDSW and IBM’s DSX to users of Telekom Data Intelligence Hub, a cloud-based big data analytics service that the telecommunications company is testing with a small number of customers in Europe ahead of a planned rollout during the second half of the year.

Users can also access Jupyter, RStudio and three other open source analytics tools, said Sven Löffler, a business development executive at the Bonn, Germany, company who’s leading the implementation of the analytics service. The project team sees benefits in enabling organizations to connect to those tools through the two data science platforms and get “all this sharing and capabilities to work collaboratively with others,” he said.

However, Löffler has heard from some customers that the cost of the platforms could be a barrier compared to working directly with the open source tools as part of the service, which runs in the Microsoft Azure cloud. It’s fed by data pipelines that Deutsche Telekom is building with a new Azure version of Cloudera’s Altus Data Engineering service.