
ICYMI – Your weekly TL;DR

Hi developers,

Hope you had a good week. Here is your weekly ICYMI rounding up the links and blogs of the week!

The Cognitive Services API Series

Check out this series to learn about all of our new Cognitive Services APIs.

HoloLens Spectator View

Pretty cool, right?

Upcoming Flexera Webinar

Mark your calendars!

Why do you code?

Thank you to everyone who submitted a response!

Download Visual Studio to get started.

The Windows team would love to hear your feedback. Please keep the feedback coming using our Windows Developer UserVoice site. If you have a direct bug, please use the Windows Feedback tool built directly into Windows 10.

Cognitive Services APIs: Search

In the last few posts, we’ve explored several pieces of the wider Microsoft Cognitive Services family. Microsoft Cognitive Services puts the power of machine learning within your reach through easy-to-consume REST APIs. Using the APIs and SDKs, you can add a whole new level of cognitive awareness and intelligence to your applications.

Today, let’s take a closer look at the Search APIs and show you how to get started. The five Search APIs available are:

  • Bing Autosuggest
  • Bing Image Search
  • Bing News Search
  • Bing Video Search
  • Bing Web Search

You can use the in-browser API test console linked after each topic to see these in action right now. At the end of the article, we’ll show you how to get a free trial and your API keys so you can start adding this functionality to your app.

Okay, let’s dig in.

Bing AutoSuggest

This API provides the same feature you see as you type into the Bing search box: intelligent suggestions based on the current query. Usage is pretty simple: you send the partial query and get back a response containing suggestions. This API is typically used in application scenarios where you need an autocomplete search box, but it can do much more, because the results are not simple word matches; they’re intelligent, context-aware suggestions.

Here’s an example of getting suggestions:
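To give you a feel for the call itself, here is a minimal C# sketch of querying the Autosuggest endpoint. The v5.0 URL, the q parameter and the _subscriptionKey field are assumptions; use the endpoint and key shown in your Cognitive Services portal.

// A minimal sketch of calling Bing Autosuggest (endpoint version and key are assumptions).
private async Task<string> GetSuggestionsAsync(string partialQuery)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://api.cognitive.microsoft.com/bing/v5.0/suggestions/?q=" +
                  Uri.EscapeDataString(partialQuery);
        // The JSON response contains suggestionGroups with searchSuggestions entries.
        return await client.GetStringAsync(uri);
    }
}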

Resources for Bing AutoSuggest:

Bing Image Search

Bing Image Search puts millions of photos at your fingertips. You can define search parameters such as Market, SafeSearch, Color, Freshness, Image Type, License and Size to get more accurate results.

Sending a query to the API returns thumbnails, the full image URL, metadata (such as width and height), the original website source and much more in an easy-to-use JSON response.
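For example, a call that applies a few of those parameters might look like the following sketch. The v5.0 path and the exact query-string parameter names are assumptions based on the public documentation, and _subscriptionKey stands in for your Bing Search key.

// A sketch of an image search with a few filter parameters applied.
private async Task<string> SearchImagesAsync(string query)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://api.cognitive.microsoft.com/bing/v5.0/images/search" +
                  "?q=" + Uri.EscapeDataString(query) +
                  "&mkt=en-us&safeSearch=Strict&size=Large";
        // Each result includes thumbnailUrl, contentUrl, width, height, hostPageUrl and more.
        return await client.GetStringAsync(uri);
    }
}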

Here’s part of the result when searching “beaches”:

Resources for Bing Image Search:

Bing News Search

With the Bing News Search API, you can quickly get intelligent search results containing news articles related to your query, right within your application. In addition to the link to the article, the results include the official article image, a video URL (if one is available), related articles and provider information (the source).

Just like with Image Search, you can pass initial search parameters; here they are Market, SafeSearch and Freshness.

Here’s part of the result when searching “Microsoft Cognitive Services”:

Bing News Search Resources:

Bing Video Search

Similar to the Image Search API, you can make a call using parameters such as Market, SafeSearch, Pricing, Freshness, Resolution and Video Length. You’ll get back results containing the video name, description, video URL, thumbnail, creator, view count, encoding information and a lot more.

Here’s an example search for “beaches” again, but using the Video Search API this time:

Bing Video Search Resources

Bing Web Search

To wrap up today’s post, let’s take a look at the Bing Web Search API. This API delivers the trusted results you get from Bing into your application. It gives you access to literally billions of web documents, and you can pass parameters such as Market, SafeSearch, Relevance and Freshness. In the results, you can expect each web result to come with metadata such as name, url, displayUrl and snippet.
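Here is a rough sketch of making that call and reading a few of those fields with Json.NET. The v5.0 path and the webPages, value and deepLinks property names are assumptions based on the public response schema.

// A sketch: run a web search and print the name, URL, snippet and any deep links of each result.
private async Task SearchWebAsync(string query)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://api.cognitive.microsoft.com/bing/v5.0/search?q=" +
                  Uri.EscapeDataString(query) + "&mkt=en-us&safeSearch=Moderate";
        var json = await client.GetStringAsync(uri);

        var root = JObject.Parse(json); // Newtonsoft.Json.Linq
        foreach (var page in root["webPages"]?["value"] ?? new JArray())
        {
            Debug.WriteLine($"{page["name"]} - {page["url"]}");
            Debug.WriteLine((string)page["snippet"]);

            // deepLinks only appears when the result has deep links available.
            foreach (var link in page["deepLinks"] ?? new JArray())
            {
                Debug.WriteLine($"  deep link: {link["name"]} - {link["url"]}");
            }
        }
    }
}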

For example, here’s a search for “Microsoft Cognitive Services” again, but this time using the Web Search API:

Additionally, if the result has deep links available, you’ll see a deepLinks array. Here’s the top of the JSON response for the same search:

Bing Web Search Resources

Wrapping Up

You can easily add some great functionality to your application by introducing one or more of the Cognitive Services APIs. As promised at the top of the article, go here to enable the free trial and get your API keys, then swing by the Getting Started guide for a walkthrough on how to get up and running.

Cognitive Services APIs: Knowledge

“Ipsa scientia potestas est.” – Sir Francis Bacon

In the last two posts in this series, we covered the Speech APIs and the Language Understanding APIs. In this post, we’ll go over the Knowledge APIs. There is a natural progression here: from speech we derive meaning, and from meaning we obtain knowledge. The Knowledge APIs provide ways to link our collective knowledge and to access it more effectively.

Academic Knowledge API

Sometimes we want to find out what our social networks are saying about a given topic, like a recent wardrobe malfunction at a star-studded celebrity event. The Academic Knowledge API isn’t for that and would return terrible results if we tried.

The Academic Knowledge API, as you might guess from the name, acts as a computer librarian for retrieving the best academic research on topics you are interested in. It uses its knowledge of natural language semantics (how we speak), along with its understanding of what the papers it has indexed in the Microsoft Academic Graph are about, in order to track down the entities that are most relevant to your request.

There are five REST endpoints in the Academic Knowledge API…

{ "expr": "composite(AA.AuN==’wardrobe malfunction’)", "entities": [] }

And here are the entities you get back if you search for Francis Bacon:

{
  "expr": "composite(AA.AuN=='francis bacon')",
  "entities": [{
    "logprob": -18.364,
    "Id": 567362522,
    "Ti": "the new organon"
  }, {
    "logprob": -20.279,
    "Id": 2414579677,
    "Ti": "novum organum 1620"
  }, {
    "logprob": -20.32,
    "Id": 2493700514,
    "Ti": "confession of faith"
  }, {
    "logprob": -20.726,
    "Id": 1487541657,
    "Ti": "the instauratio magna part ii novum organum and associated texts"
  }, {
    "logprob": -20.984,
    "Id": 1905465919,
    "Ti": "la nouvelle atlantide"
  }]
}
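If you want to reproduce that result from code, here is a rough sketch of calling the evaluate endpoint. The path, region and the count and attributes parameters are assumptions based on the public documentation.

// A sketch of calling the Academic Knowledge evaluate endpoint.
private async Task<string> EvaluateAcademicExpressionAsync(string expression)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://westus.api.cognitive.microsoft.com/academic/v1.0/evaluate" +
                  "?expr=" + Uri.EscapeDataString(expression) +
                  "&count=5&attributes=Id,Ti";     // top five entities, returning Id and title
        return await client.GetStringAsync(uri);   // JSON with "expr" and "entities", as above
    }
}

// Usage: await EvaluateAcademicExpressionAsync("composite(AA.AuN=='francis bacon')");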

Knowledge Exploration Service

KES takes data and grammar that you provide and creates a service that enables interactive search with autocompletion. It basically lets you build something equivalent to the Academic Knowledge API with your own documents, whether you are dealing with cookbooks, medical data, D&D manuals or galactic star charts. Refer to the Getting Started guide to discover all that it offers.

Entity Linking Intelligent Service

ELIS searches a paragraph of text you send to it and identifies entities suitable for linking. It also returns links to Wikipedia for the entities it identifies. It’s the sort of thing that would work great as a plugin for a blog engine or a reading app.

What makes the service particularly clever is that it disambiguates words based on context. For instance, “Mars” can mean either the war god of the Roman pantheon or the fourth planet from the Sun. ELIS can figure out which is which. This is a sample of the returned results for the paragraph above (filtered to show only the Mars-related entries):

{
  "entities": [{
    "matches": [{
      "text": "Mars",
      "entries": [{
        "offset": 0
      }]
    }, {
      "text": "Red Planet",
      "entries": [{
        "offset": 172
      }]
    }],
    "name": "Mars",
    "wikipediaId": "Mars",
    "score": 0.993
  }, {
    "matches": [{
      "text": "Roman god of war",
      "entries": [{
        "offset": 122
      }]
    }],
    "name": "Mars",
    "wikipediaId": "Mars (mythology)",
    "score": 0.007
  }]
}

ELIS found two entities that match the Wikipedia entry “Mars” and one entity that matches the disambiguated Wikipedia entry “Mars (mythology).”
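A rough sketch of sending a paragraph of text to ELIS follows. The endpoint path, the plain-text request body and the region are assumptions; check the Entity Linking API reference for the exact contract.

// A sketch of calling the Entity Linking Intelligent Service with a block of text.
private async Task<string> LinkEntitiesAsync(string paragraph)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://westus.api.cognitive.microsoft.com/entitylinking/v1.0/link";
        using (var content = new StringContent(paragraph, Encoding.UTF8, "text/plain"))
        {
            var response = await client.PostAsync(uri, content);
            // The JSON response contains an "entities" array with matches, names,
            // Wikipedia IDs and scores, as shown above.
            return await response.Content.ReadAsStringAsync();
        }
    }
}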

QnA Maker

Most websites these days include a Frequently Asked Questions section. The QnA Maker is a free service that lets you take one of these FAQs and turn it into an intelligent service that understands natural language and can interact with people in a conversational way over email, Facebook and chat. QnA Maker is both a REST API and a web-based user interface.

To get started, go to the QnA Maker website, create a new service and upload or link to your FAQ. You can also enter your question and answer pairs manually.

Once the FAQ is ingested, the QnA Maker will walk you through training and testing your service until you are ready to publish.

After publishing your QnA service, you will be provided with the REST endpoint to call your service, which you can incorporate into a bot if you’d like to.
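Calling the published endpoint is a single POST. Here is a sketch; the v1.0 route, the _knowledgeBaseId field and the response shape are assumptions, and the publish page for your service shows the exact HTTP request to use.

// A sketch of asking a published QnA Maker knowledge base a question.
private async Task<string> AskQnAMakerAsync(string question)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        var uri = "https://westus.api.cognitive.microsoft.com/qnamaker/v1.0/knowledgebases/" +
                  _knowledgeBaseId + "/generateAnswer";
        var body = new StringContent("{\"question\":\"" + question + "\"}",
                                     Encoding.UTF8, "application/json");
        var response = await client.PostAsync(uri, body);
        return await response.Content.ReadAsStringAsync(); // best-matching answer plus a confidence score
    }
}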

To learn more about using the QnA Maker to build an intelligent service out of your FAQ, refer to the QnA Maker FAQ (someone clearly had a lot of fun with this).

Recommendations API

The Recommendations API supports the sort of smarts you see on shopping websites, music sharing services and web streaming services. It handles three types of recommendations:

  • Frequently bought together recommendations—people who buy this right-handed glove often also buy the matching left-handed glove with it.
  • Item to item recommendations—if you are looking at this item, may we suggest you also look at this other item that we sell?
  • Personalized user recommendations—we’ve noticed that you listen to this sort of music X a lot, so we think you would like this new album Y.

Like the QnA Maker, the Recommendations API is both a REST service and a web-based UI for creating, training and publishing functionality.

Sign up for an account key and head over to the Recommendations UI page in order to create a new project. You will then upload your product catalog and usage data and use this to train your recommendation model.

To learn more about developing recommendations for your website, read this Quick Start guide.

Wrapping Up

If, as Francis Bacon believed, knowledge is power, then Cognitive Services is its engine. The Cognitive Services Knowledge APIs help you find the information you need and also make it easier for others to access the information you already have. They harken back to the original purpose of the internet, sharing knowledge, and reinvigorate it with machine learning. The following links will help you further explore the capabilities of the Knowledge APIs.

Cognitive Services APIs: Language

In the last post, you saw how AI is used to turn speech into text through the Cognitive Services Speech APIs. Once sounds have been converted into written text, they still have to be distilled for their meaning. Human language, however, is rich in ambiguities, and understanding a sentence correctly can sometimes be extremely difficult, even for humans. We have multiple ways to say the same thing (morning star and evening star both denote the planet Venus), while the same sentence can have multiple meanings (‘I just shot an elephant in my pajamas’).

The Language APIs attack the problem of meaning from many different angles. There isn’t enough time at the moment to go into all of them in depth, but here’s a quick overview so you know what is possible with the six Cognitive Services Language APIs…

  • Bing Spell Check not only cleans up misspellings, but also recognizes slang, understands homonyms and fixes bad word breaks.
  • Microsoft Translator API, built on Deep Neural Networks, can do speech translation for 9 supported languages and text translations between 60 languages.
  • Web Language Model API uses a large reservoir of data about language usage on the web to make predictions like: how to insert word breaks in a run-on sentence (or a hashtag or URL), the likelihood that a sequence of words would appear together and the word most likely to follow after a given word sequence (sentence completion).
  • Linguistic Analysis basically parses text for you into sentences, then into parts-of-speech (nouns, verbs, adverbs, etc.), and finally into phrases (meaningful groupings of words such as prepositional phrases, relative clauses, subordinate clauses).
  • Text Analytics will sift through a block of text to determine the language it is written in (it recognizes 160), key phrases and overall sentiment (pure negative is 0 and absolutely positive is 1).
  • Language Understanding Intelligent Service (LUIS) provides a quick and easy way to determine what your users want by parsing sentences for entities (nouns) and intents (verbs), which can then be passed to appropriate services for fulfillment. For instance, “I want to hear a little night music” could open up a preferred music streaming service and commence playing Mozart. LUIS can be used with bots and speech-driven applications.

That’s the mile-high overview. Let’s now take a closer look at the last two Language APIs in this list.

Digging into Text Analytics

The Cognitive Services Text Analytics API is designed to do certain things very well, like evaluating web page reviews and comments. Many possibilities are opened up by this simple scenario. For instance, you could use this basic functionality to evaluate the opening passages of famous novels. The REST interface is straightforward: you pass a block of text to the service and request that Text Analytics return either the key phrases, the language the block of text is written in, or a sentiment score from 0 to 1 indicating whether the passage is negative or positive in tone.

The user interface for this app is going to be pretty simple. You want a TextBox for the text you need to have analyzed, a ListBox to hold the key phrases and two TextBlocks to display the language and sentiment score. And, of course, you need a Button to fire the whole thing off with a call to the Text Analytics service endpoint.

When the Analyze button is clicked, the app will use the HttpClient class to build a REST call to the service and retrieve, one at a time, the key phrases, the language and the sentiment. The sample code below uses a helper method, CallEndpoint, to construct the request. You’ll want to have a good JSON deserializer like Newtonsoft’s Json.NET (which is available as a NuGet package) to make it easier to parse the returned messages. Also, be sure to request your own subscription key to use Text Analytics.


readonly string _subscriptionKey = "xxxxxx1a89554dd493177b8f64xxxxxx";
readonly string _baseUrl = "https://westus.api.cognitive.microsoft.com/";

static async Task<String> CallEndpoint(HttpClient client, string uri, byte[] byteData)
{
    using (var content = new ByteArrayContent(byteData))
    {
        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
        var response = await client.PostAsync(uri, content);
        return await response.Content.ReadAsStringAsync();
    }
}

private async void btnAnalyze_Click(object sender, RoutedEventArgs e)
{
    using (var client = new HttpClient())
    {
        client.BaseAddress = new Uri(_baseUrl);

        // Request headers
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

        // Build request body
        string textToAnalyze = myTextBox.Text;
        byte[] byteData = Encoding.UTF8.GetBytes("{\"documents\":[" +
            "{\"id\":\"1\",\"text\":\"" + textToAnalyze + "\"}]}");

        // Detect key phrases:
        var uri = "text/analytics/v2.0/keyPhrases";
        var response = await CallEndpoint(client, uri, byteData);
        var keyPhrases = JsonConvert.DeserializeObject<KeyPhrases>(response);

        // Detect just one language:
        var queryString = "numberOfLanguagesToDetect=1";
        uri = "text/analytics/v2.0/languages?" + queryString;
        response = await CallEndpoint(client, uri, byteData);
        var detectedLanguages = JsonConvert.DeserializeObject<LanguageArray>(response);

        // Detect sentiment:
        uri = "text/analytics/v2.0/sentiment";
        response = await CallEndpoint(client, uri, byteData);
        var sentiment = JsonConvert.DeserializeObject<Sentiments>(response);

        DisplayReturnValues(keyPhrases, detectedLanguages, sentiment);
    }
}

Remarkably, this is all the code you really need to access the rich functionality of the Text Analytics API. The only things left out, to economize on space, are the class definitions for KeyPhrases, LanguageArray and Sentiments; you should be able to reconstruct these yourself from the returned JSON strings.
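If you would rather not reverse-engineer them, here is a minimal sketch of what those classes might look like. The property names assume the v2.0 response shape, so adjust them if the JSON you get back differs.

// Minimal POCOs for deserializing the Text Analytics v2.0 responses with Json.NET
// (assumes using System.Collections.Generic).
public class KeyPhrases
{
    public List<KeyPhraseDocument> documents { get; set; }
}

public class KeyPhraseDocument
{
    public string id { get; set; }
    public List<string> keyPhrases { get; set; }
}

public class LanguageArray
{
    public List<LanguageDocument> documents { get; set; }
}

public class LanguageDocument
{
    public string id { get; set; }
    public List<DetectedLanguage> detectedLanguages { get; set; }
}

public class DetectedLanguage
{
    public string name { get; set; }
    public string iso6391Name { get; set; }
    public double score { get; set; }
}

public class Sentiments
{
    public List<SentimentDocument> documents { get; set; }
}

public class SentimentDocument
{
    public string id { get; set; }
    public double score { get; set; }
}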

According to Text Analytics, the opening to James Joyce’s Ulysses (0.93 sentiment) is much more positive than the opening to Charles Dickens’ A Tale of Two Cities (0.67). You don’t have to use this just for evaluating the mood of famous opening passages, however. You could also paste in posts from your favorite social network. In fact, you can search for social media related to a certain topic of interest and find out what the average sentiment is regarding it.

You can probably see where we’re going with this. If you are running a social media campaign, you could use Text Analytics to do a qualitative evaluation of the campaign based on how the audience responds. You could even run tests to see if changes to the campaign will cause the audience’s mood to shift.

Using LUIS to figure out what your user wants

LUIS lets you build language intelligence into your speech-driven apps. Based on things that your user might say, LUIS attempts to parse those statements to figure out the Intents behind them (what your user wants to do) and the Entities involved in the request. For instance, if your app is for making travel arrangements, the Intents you are interested in are booking and cancellation, while the Entities you care about are travel dates and number of passengers. For a music playing app, the Intents you are interested in are playing and pausing, while the Entities you care about are particular songs.

In order to use LUIS, you first need to sign in through the LUIS website and either use a Cortana pre-built app or build a new app of your own. The pre-built apps are pretty extensive, and for a simple language understanding task like evaluating the phrase “Play me some Mozart,” they have no problem identifying both the intent and the entity involved.


{
    "query": "play me some mozart",
    "intents": [
        {
        "intent": "builtin.intent.ondevice.play_music"
        }
    ],
    "entities": [
        {
            "entity": "mozart",
            "type": "builtin.ondevice.music_artist_name"
        }
    ]
}

If your app does music streaming, this quick call to Cognitive Services provides all the information you need to fulfill your user’s wish. A full list of pre-built applications is available in the LUIS.ai documentation. To learn more about building applications with custom Intents and Entities, follow through these training videos.
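When you are ready to call LUIS from your own code, the published endpoint is just another REST call. Here is a sketch; the _luisAppId field, the region and the v2.0 route are assumptions, and your app’s publish page shows the exact URL to use.

// A sketch of querying a published LUIS app and reading back intents and entities.
private async Task<string> QueryLuisAsync(string utterance)
{
    using (var client = new HttpClient())
    {
        var uri = "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/" + _luisAppId +
                  "?subscription-key=" + _subscriptionKey +
                  "&q=" + Uri.EscapeDataString(utterance);
        // The JSON response mirrors the example above: a query, an intents array and an entities array.
        return await client.GetStringAsync(uri);
    }
}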

Wrapping Up

Cognitive Services provides some remarkable machine learning-based tools to help you determine meaning and intent based on human utterances, whether these utterances come from an app user talking into his or her device or someone providing feedback on social media. The following links will help you discover more about the capabilities the Cognitive Services Language APIs put at your disposal.

Cognitive Services APIs: Speech

Speech recognition is in many ways at the heart of Artificial Intelligence. The 18th Century essayist Samuel Johnson captured this beautifully when he wrote, “Language is the dress of thought.” If the ultimate goal of AI research is a machine that thinks like a human, a reasonable starting point would be to create a machine that understands how humans think. To understand how humans think, in turn, requires an understanding of what humans say.

In the previous post in this series, you learned about the Cognitive Services Vision APIs. In this post, we’re going to complement that with an overview of the Speech APIs. The Cognitive Services Speech APIs are grouped into three categories:

  • Bing Speech—convert spoken audio to text and, conversely, text to speech.
  • Speaker Recognition—identify speakers and use speech recognition for authentication.
  • Custom Speech Service (formerly CRIS)—overcome speech recognition barriers like background noise and specialized vocabulary.

A good way to understand the relationship between the Bing Speech API and the other APIs is that while Bing Speech takes raw speech and turns it into text without knowing anything about the speaker, Custom Speech Service and Speaker Recognition go further, using additional processing to clean up the raw speech or to compare it against other speech samples. They basically do extra speech analysis work.

Bing Speech for UWP

As a UWP developer, you have several options for accessing speech-to-text capabilities. You can access the UWP Speech APIs found in the Windows.Media.SpeechRecognition namespace. You can also integrate Cortana into your UWP app. Alternatively, you can go straight to the Bing Speech API which underlies both of these technologies.

Bing Speech lets you do text-to-speech and speech-to-text through REST calls to Cognitive Services. The Cognitive Services website provides samples for iOS, Android and Javascript. There’s also a client library NuGet package if you are working in WPF. For UWP, however, you will use the REST APIs.

As with the other Cognitive Services offerings, you first need to pick up a subscription key for Bing Speech in order to make calls to the API. In UWP, you then need to record microphone input using the MediaCapture class and encode it before sending it to Bing Speech. (Gotcha Warning — be sure to remember to check off the Microphone capability in your project’s app manifest file so the mic can be accessed, otherwise you may spend hours wondering why the code doesn’t work for you.)


// Configure MediaCapture for audio-only recording.
var CaptureMedia = new MediaCapture();
var captureInitSettings = new MediaCaptureInitializationSettings();
captureInitSettings.StreamingCaptureMode = StreamingCaptureMode.Audio;
await CaptureMedia.InitializeAsync(captureInitSettings);
// Record WAV audio into an in-memory stream that can later be sent to the REST endpoint.
MediaEncodingProfile encodingProfile = MediaEncodingProfile.CreateWav(AudioEncodingQuality.Medium);
AudioStream = new InMemoryRandomAccessStream();
await CaptureMedia.StartRecordToStreamAsync(encodingProfile, AudioStream);
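The REST snippet that follows sends a byte array called fileBytes. One way to get from the in-memory stream to that array, sketched here as an assumption about how the original sample bridges the two, is to stop the recording and read the stream out with a DataReader:

// Stop recording and copy the captured WAV data into a byte array (a sketch).
await CaptureMedia.StopRecordAsync();
var fileBytes = new byte[AudioStream.Size];
using (var reader = new DataReader(AudioStream.GetInputStreamAt(0)))
{
    await reader.LoadAsync((uint)AudioStream.Size);
    reader.ReadBytes(fileBytes);
}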

Once you are done recording, you can use the standard HttpClient class to send the audio stream to Cognitive Services for processing, like so…


// build REST message
cookieContainer = new CookieContainer();
handler = new HttpClientHandler() { CookieContainer = cookieContainer };
client = new HttpClient(handler);
client.DefaultRequestHeaders.TryAddWithoutValidation("Content-Type", "audio / wav; samplerate = 16000");
// authenticate the REST call
client.DefaultRequestHeaders.TryAddWithoutValidation("Authorization", _subscriptionKey);
// pass in the Bing Speech endpoint
request = new HttpRequestMessage(HttpMethod.Post, uri);
// pass in the audio stream
request.Content = new ByteArrayContent(fileBytes);
// make REST call to CogSrv
response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead, cancellationToken);

Getting these calls right may seem a bit hairy at first. To make integrating Bing Speech easier, Microsoft MVP Gian Paolo Santopaolo has created a UWP reference app on GitHub with several useful helper classes you can incorporate into your own speech recognition project. This reference app also includes a sample for reversing the process and doing text-to-speech.

Speaker Recognition

While the Bing Speech API can figure out what you are saying without knowing anything about who you are as a speaker, the Speaker Recognition API in Cognitive Services is all about figuring out who you are without caring about what, specifically, you are saying. There’s a nice symmetry to this. Using machine learning, the Speaker Recognition API finds qualities in your voice that identify you almost as well as your fingerprints or retinal pattern do.

This API is typically used for two purposes: identification and verification. Identification allows a voice to be compared to a group of voices in order to find the best match. This is the auditory equivalent to how the Cognitive Services Face API matches up faces that resemble each other.

Speaker verification allows you to use a person’s voice as part of a two-factor login mechanism. For verification to work, the speaker must say a specific, pre-selected passphrase like “apple juice tastes funny after toothpaste” or “I am going to make him an offer he cannot refuse.” The initial recording of a passphrase to compare against is called enrollment. (It hardly needs to be said but—please don’t use “password” for your speaker verification passphrase.)

There is a client library that supports speaker enrollment, speaker verification and speaker identification. Per usual, you need to sign up for a Speaker Recognition subscription key to use it. You can add the client library to your UWP project in Visual Studio by installing the Microsoft.ProjectOxford.SpeakerRecognition NuGet package.

Using the media capture code from the Bing Speech sample above to record from the microphone, and assuming that the passphrase has already been enrolled for the user under her Speaker Id (a Guid), verification is as easy as calling the Speaker Recognition client library’s VerifyAsync method and passing the audio stream and Speaker Id as parameters.


string _subscriptionKey;
Guid _speakerId;
Stream audioStream;

public async void VerifySpeaker()
{
    var serviceClient = new SpeakerVerificationServiceClient(_subscriptionKey);
    Verification response = await serviceClient.VerifyAsync(audioStream, _speakerId);

    if (response.Result == Result.Accept)
    {
        // verification successful
    }

}

Sample projects are available showing how to use Speaker Recognition with Android, Python and WPF. Because of the close similarities between UWP and WPF, you will probably find the last sample useful as a reference for using this Cognitive Service in your UWP app.

Custom Speech Service

You already know how to use the Bing Speech speech-to-text capability introduced at the top of this post. That Cognitive Service is built around generalized language models to work for most people most of the time. But what if you want to do speech recognition involving specialized jargon or vocabulary? To handle these situations, you might need a custom language model rather than the one used by the speech-to-text engine in Bing Speech.

Along the same lines, the generalized acoustic model used to train Bing Speech may not work well for you if your app is likely to be used in an atypical acoustic environment like an air hangar or a factory floor.

Custom Speech Service lets you build custom language models and custom acoustic models for your speech-to-text engine. You can then set these up as custom REST endpoints for doing calls to Cognitive Services from your app. These RESTful endpoints can also be used from any device and from any software platform that can make REST calls. It’s basically a really powerful machine learning tool that lets you take the speech recognition capabilities of your app to a whole new level. Additionally, since all that changes is the endpoint you call, any previous code you have written to use the Bing Speech API should work without any alteration other than the Uri you are targeting.

Wrapping Up

In this post, we went over the Bing Speech APIs for speech-to-text and text-to-speech, as well as the additional APIs for cleaning up raw speech input and doing comparisons and verification using speech input. In the next post in the Cognitive APIs series, we’ll take a look at using the Language Understanding Intelligent Service (LUIS) to derive meaning from speech in order to figure out what people really want when they ask for something. In the meantime, here are some additional resources so you can learn more about the Speech APIs on your own.

Cognitive Services APIs: Vision

What exactly are Cognitive Services and what are they for? Cognitive Services are a set of machine learning algorithms that Microsoft has developed to solve problems in the field of Artificial Intelligence (AI). The goal of Cognitive Services is to democratize AI by packaging it into discrete components that are easy for developers to use in their own apps. Web and Universal Windows Platform developers can consume these algorithms through standard REST calls over the Internet to the Cognitive Services APIs.

The Cognitive Services APIs are grouped into five categories…

  • Vision—analyze images and videos for content and other useful information.
  • Speech—tools to improve speech recognition and identify the speaker.
  • Language—understanding sentences and intent rather than just words.
  • Knowledge—tracks down research from scientific journals for you.
  • Search—applies machine learning to web searches.

So why is it worthwhile to provide easy access to AI? Anyone watching tech trends realizes we are in the middle of a period of huge AI breakthroughs right now, with computers beating chess champions and Go masters and passing Turing tests. All the major technology companies are in an arms race to hire the top AI researchers.

Alongside the high-profile AI problems that researchers care about, like how to pass the Turing test and how to model computer neural networks on the human brain, there are discrete problems that developers are concerned about, like tagging our family photos and finding an even lazier way to order our favorite pizza on a smartphone. The Cognitive Services APIs are a bridge that lets web and UWP developers use the resources of major AI research to solve developer problems. Let’s get started by looking at the Vision APIs.

Cognitive Services Vision APIs

The Vision APIs are broken out into five groups of tasks…

  • Computer Vision—Distill actionable information from images.
  • Content Moderator—Automatically moderate text, images and videos for profanity and inappropriate content.
  • Emotion—Analyze faces to detect a range of moods.
  • Face—identify faces and similarities between faces.
  • Video—Analyze, edit and process videos within your app.

Because the Computer Vision API is a huge topic on its own, this post will deal mainly with its capabilities as an entry point to the others. The description of how to use it, however, will give you a good sense of how to work with the other Vision APIs.

Note: Many of the Cognitive Services APIs are currently in preview and are undergoing improvement and change based on user feedback.

One of the biggest things that the Computer Vision API does is tag and categorize an image based on what it can identify inside that image. This is closely related to a computer vision problem known as object recognition. In its current state, the API recognizes about 2000 distinct objects and groups them into 87 classifications.

Using the Computer Vision API is pretty easy. There are even samples available for using it on a variety of development platforms including NodeJS, the Android SDK and the Swift SDK. Let’s do a walkthrough of building a UWP app with C#, though, since that’s the focus of this blog.

The first thing you need to do is register at the Cognitive Services site and request a key for the Computer Vision Preview (by clicking on one of the “Get Started for Free” buttons).

Next, create a new UWP project in Visual Studio and add the ProjectOxford.Vision NuGet package by opening Tools | NuGet Package Manager | Manage Packages for Solution and selecting it. (Project Oxford was an earlier name for the Cognitive Services APIs.)

For a simple user interface, you just need an Image control to preview the image, a Button to send the image to the Computer Vision REST Services and a TextBlock to hold the results. The workflow for this app is to select an image -> display the image -> send the image to the cloud -> display the results of the Computer Vision analysis.


<Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
    <Grid.RowDefinitions>
        <RowDefinition Height="9*"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Grid.ColumnDefinitions>
        <ColumnDefinition/>
        <ColumnDefinition/>
    </Grid.ColumnDefinitions>
    <Border BorderBrush="Black" BorderThickness="2">
        <Image x:Name="ImageToAnalyze" />
    </Border>
    <Button x:Name="AnalyzeButton" Content="Analyze" Grid.Row="1" Click="AnalyzeButton_Click"/>
    <TextBlock x:Name="ResultsTextBlock" TextWrapping="Wrap" Grid.Column="1" Margin="30,5"/>
</Grid>

When the Analyze Button gets clicked, the handler in the Page’s code-behind will open a FileOpenPicker so the user can select an image. In the ShowPreviewAndAnalyzeImage method, the returned image is used as the image source for the Image control.


readonly string _subscriptionKey;

public MainPage()
{
    //set your key here
    _subscriptionKey = "b1e514ef0f5b493xxxxx56a509xxxxxx";
    this.InitializeComponent();
}

private async void AnalyzeButton_Click(object sender, RoutedEventArgs e)
{
    var openPicker = new FileOpenPicker
    {
        ViewMode = PickerViewMode.Thumbnail,
        SuggestedStartLocation = PickerLocationId.PicturesLibrary
    };
    openPicker.FileTypeFilter.Add(".jpg");
    openPicker.FileTypeFilter.Add(".jpeg");
    openPicker.FileTypeFilter.Add(".png");
    openPicker.FileTypeFilter.Add(".gif");
    openPicker.FileTypeFilter.Add(".bmp");
    var file = await openPicker.PickSingleFileAsync();

    if (file != null)
    {
        await ShowPreviewAndAnalyzeImage(file);
    }
}

private async Task ShowPreviewAndAnalyzeImage(StorageFile file)
{
    //preview image
    var bitmap = await LoadImage(file);
    ImageToAnalyze.Source = bitmap;

    //analyze image
    var results = await AnalyzeImage(file);

    //"fr", "ru", "it", "hu", "ja", etc...
    var ocrResults = await AnalyzeImageForText(file, "en");

    //parse result
    ResultsTextBlock.Text = ParseResult(results) + "\n\n" + ParseOCRResults(ocrResults);
}

The real action happens when the returned image then gets passed to the VisionServiceClient class included in the Project Oxford NuGet package you imported. The Computer Vision API will try to recognize objects in the image you pass to it and recommend tags for your image. It will also analyze the image properties, color scheme, look for human faces and attempt to create a caption, among other things.


private async Task<AnalysisResult> AnalyzeImage(StorageFile file)
{

    VisionServiceClient VisionServiceClient = new VisionServiceClient(_subscriptionKey);

    using (Stream imageFileStream = await file.OpenStreamForReadAsync())
    {
        // Analyze the image for all visual features
        VisualFeature[] visualFeatures = new VisualFeature[] { VisualFeature.Adult, VisualFeature.Categories
            , VisualFeature.Color, VisualFeature.Description, VisualFeature.Faces, VisualFeature.ImageType
            , VisualFeature.Tags };
        AnalysisResult analysisResult = await VisionServiceClient.AnalyzeImageAsync(imageFileStream, visualFeatures);
        return analysisResult;
    }
}

And it doesn’t stop there. With a few lines of code, you can also use the VisionServiceClient class to look for text in the image and then return anything that the Computer Vision API finds. This OCR functionality currently recognizes about 26 different languages.


private async Task<OcrResults> AnalyzeImageForText(StorageFile file, string language)
{
    //language = "fr", "ru", "it", "hu", "ja", etc...
    VisionServiceClient VisionServiceClient = new VisionServiceClient(_subscriptionKey);
    using (Stream imageFileStream = await file.OpenStreamForReadAsync())
    {
        OcrResults ocrResult = await VisionServiceClient.RecognizeTextAsync(imageFileStream, language);
        return ocrResult;
    }
}
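The ParseResult and ParseOCRResults helpers referenced earlier aren’t shown in the listings above. Here is a minimal sketch of what they might look like, assuming the contract classes from the ProjectOxford.Vision package and a using System.Linq directive; treat the property names as assumptions to verify against the package you install.

// A sketch of turning the analysis and OCR results into display text.
private static string ParseResult(AnalysisResult result)
{
    // Pull out the auto-generated caption and the suggested tags.
    var caption = result.Description?.Captions?.FirstOrDefault()?.Text ?? "(no caption)";
    var tags = result.Tags != null
        ? string.Join(", ", result.Tags.Select(t => t.Name))
        : string.Empty;
    return "Caption: " + caption + "\nTags: " + tags;
}

private static string ParseOCRResults(OcrResults ocrResults)
{
    // Flatten the recognized regions, lines and words into a single string.
    var words = ocrResults.Regions
        .SelectMany(r => r.Lines)
        .SelectMany(l => l.Words)
        .Select(w => w.Text);
    return "Text found: " + string.Join(" ", words);
}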

Combining the image analysis and text recognition features of the Computer Vision API will return results like that shown below.

The power of this particular Cognitive Services API is that it will allow you to scan your device folders for family photos and automatically start tagging them for you. If you add in the Face API, you can also tag your photos with the names of family members and friends. Throw in the Emotion API and you can even start tagging the moods of the people in your photos. With Cognitive Services, you can take a task that normally requires human judgement and combine it with the indefatigability of a machine (in this case a machine that learns) in order to perform this activity quickly and indefinitely on as many photos as you own.

Wrapping Up

In this first post in the Cognitive API series, you received an overview of Cognitive Services and what it offers you as a developer. You also got a closer look at the Vision APIs and a walkthrough of using one of them. In the next post, we’ll take a closer look at the Speech APIs. If you want to dig deeper on your own, here are some links to help you on your way…