NEW ENTRIES

Create ADF resources

No category
PowerShell

# Connect to Azure account
Connect-AzAccount

# List all subscriptions
Get-AzSubscription

# Select Subscription
Select-AzSubscription -SubscriptionId "<SUBSCRIPTION_ID>"

# Create Resource group in East US
$resourceGroupName = "<RESOURCE_GROUP_NAME_UNIQUE>";
$ResGrp = New-AzResourceGroup $resourceGroupName -location 'East US'

# Create ADF resource
$dataFactoryName = "<DATA_FACTORY_RESOURCE_NAME_UNIQUE>";
$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName -Location $ResGrp.Location -Name $dataFactoryName

# Create a Folder for files (json) RESOURCE_FOLDER
# Create AzureStorageLinkedService.json file in RESOURCE_FOLDER folder
# Switch to that folder
Set-Location 'PATH_TO_RESOURCE_FOLDER'

# Create linked service
Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "AzureStorageLinkedService" `
-DefinitionFile ".\AzureStorageLinkedService.json"

# Output must be:
# LinkedServiceName : AzureStorageLinkedService
# ResourceGroupName : RESOURCE_GROUP_NAME_UNIQUE
# DataFactoryName : DATA_FACTORY_RESOURCE_NAME_UNIQUE
# Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobStorageLinkedService

# Create datasets
# Create a JSON file named InputDataset.json in the RESOURCE_FOLDER

# Run
Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "InputDataset" `
-DefinitionFile ".\InputDataset.json"

# Output must be:
# DatasetName : InputDataset
# ResourceGroupName : RESOURCE_GROUP_NAME_UNIQUE
# DataFactoryName : DATA_FACTORY_RESOURCE_NAME_UNIQUE
# Structure :
# Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset

# Create a JSON file named OutputDataset.json in the RESOURCE_FOLDER

# Run
Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "OutputDataset" `
-DefinitionFile ".\OutputDataset.json"

# Output must be:
# DatasetName : OutputDataset
# ResourceGroupName : RESOURCE_GROUP_NAME_UNIQUE
# DataFactoryName : DATA_FACTORY_RESOURCE_NAME_UNIQUE
# Structure :
# Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset

# Make sure you have created the folder structure and files in Azure Storage (Data Lake Gen2)!

# Create a pipeline
# Create a JSON file named myPipeline.json in the RESOURCE_FOLDER

# Run
$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "myPipeline" `
-DefinitionFile ".\myPipeline.json"

# No output (the result is stored in $DFPipeLine)

# Create a pipeline run
# Run
$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name

# No output (the run ID is stored in $RunId)

# Monitor the pipeline run

# Run
while ($True) {
    $Run = Get-AzDataFactoryV2PipelineRun `
        -ResourceGroupName $ResGrp.ResourceGroupName `
        -DataFactoryName $DataFactory.DataFactoryName `
        -PipelineRunId $RunId

    if ($Run) {
        if ( ($Run.Status -ne "InProgress") -and ($Run.Status -ne "Queued") ) {
            Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
            $Run
            break
        }
        Write-Output ("Pipeline is running...status: " + $Run.Status)
    }

    Start-Sleep -Seconds 10
}

# Run the following script to retrieve copy activity run details, for example, size of the data read/written

Write-Output "Activity run details:"
$Result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $DataFactory.DataFactoryName -ResourceGroupName $ResGrp.ResourceGroupName -PipelineRunId $RunId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$Result

Write-Output "Activity 'Output' section:"
$Result.Output -join "`r`n"

Write-Output "Activity 'Error' section:"
$Result.Error -join "`r`n"
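
The script above reads AzureStorageLinkedService.json from the resource folder, but the notes do not show what the file contains. The following is a minimal Python sketch that prints one possible definition, assuming the AzureBlobStorage linked service type reported in the expected output. The account name and key placeholders are hypothetical and must be replaced with your own values.

```python
import json


def linked_service_definition(account_name, account_key):
    """Build a linked service definition of type AzureBlobStorage."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
        "EndpointSuffix=core.windows.net"
    )
    return {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {"connectionString": connection_string},
        },
    }


# Placeholders only (assumption): substitute a real storage account and key.
definition = linked_service_definition("<ACCOUNT_NAME>", "<ACCOUNT_KEY>")
linked_service_json = json.dumps(definition, indent=4)
print(linked_service_json)
```

Save the printed JSON as AzureStorageLinkedService.json in RESOURCE_FOLDER before running Set-AzDataFactoryV2LinkedService.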





Recover key pair on EC2

Assumes you have access to the AWS console.
  • Power off the EC2 instance
  • Create an image (create AMI)
  • Go to the AMI section (from the left menu) and wait for the status to become "available"
  • Launch a new instance
    • Give it a name
    • Generate a new key pair (the public key will be stored on the new instance and the private key will be downloaded)
    • Attach the previous security group or create a new one
  • Go to the EC2 section; you should see both the old and the new instance here.
  • Go to Volumes; you should see all volumes here. Make sure you know exactly which of them belong to the old instance, because they must be deleted later to avoid charges.
  • Start the new instance and make sure all is good.
  • If applicable, detach any volume and IP from the old instance and attach them to the new instance.
  • Check everything you need again
  • Clean up
    • Go to AMI and delete the one you created from the old instance.
    • Go to Volumes, detach all volumes still attached to the old instance, then delete them
    • Go to EC2 and terminate the old instance
You can now use the new private key.

Data processing basics

Study notes

Types of data
  • Structured
    Data from table-based source systems such as a relational database, or from a flat file such as a comma-separated values (CSV) file.
    The primary element of a structured file is that the rows and columns are aligned consistently throughout the file.
  • Semi-structured
    Data such as JavaScript Object Notation (JSON) files, which may require flattening prior to loading into your source system.
    When flattened, this data doesn't have to fit neatly into a table structure.
  • Unstructured
    Data stored as key-value pairs that don't adhere to standard relational models. Other commonly used types of unstructured data include Portable Document Format (PDF) files, word processor documents, and images.
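
To make the flattening step for semi-structured data concrete, here is a small Python sketch. The `flatten` helper and the sample `order` record are hypothetical, written only for illustration; real loaders may handle arrays and naming differently.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical semi-structured record, as it might appear in a JSON file.
order = {"id": 1, "customer": {"name": "Ana", "address": {"city": "Cluj"}}}
flat_order = flatten(order)
print(flat_order)
# {'id': 1, 'customer.name': 'Ana', 'customer.address.city': 'Cluj'}
```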
Types of data by usage
  • Operational data
    Usually transactional data that is generated and stored by applications, often in a relational or non-relational database.
  • Analytical data
    Data that has been optimized for analysis and reporting, often in a data warehouse.
  • Streaming data
    Perpetual sources of data that generate data values in real-time, often relating to specific events.
    Common sources of streaming data include internet-of-things (IoT) devices and social media feeds.
  • Data pipelines
    Used to orchestrate activities that transfer and transform data.
    Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events.
  • Data lakes
    Storage repository that holds large amounts of data in native, raw formats.
    Data lake stores are optimized for scaling to massive volumes (terabytes or petabytes) of data.
  • Data warehouses
    Centralized repository of integrated data from one or more disparate sources.
    Data warehouses store current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
  • Apache Spark
    Parallel processing framework that takes advantage of in-memory processing and a distributed file storage. It's a common open-source software (OSS) tool for big data scenarios.
Data operations
  • Data integration
    Establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.
  • Data transformation
    Operational data usually needs to be transformed into a suitable structure and format for analysis.
    This is often done as part of an extract, transform, and load (ETL) process, though increasingly a variation in which you extract, load, and transform (ELT) the data is used to quickly ingest the data into a data lake and then apply "big data" processing techniques to transform it. Regardless of the approach used, the data is prepared to support downstream analytical needs.
  • Data consolidation
    Combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.
    Commonly, data from operational systems is extracted, transformed, and loaded into analytical stores such as a data lake or data warehouse.
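
The difference between ETL and ELT described above is only the order of the steps. The following Python sketch illustrates that with in-memory data. The CSV content, the function names, and the lists standing in for a data warehouse and a data lake are all hypothetical.

```python
import csv
import io

# Hypothetical export from an operational system.
RAW_CSV = "id,amount\n1,10.50\n2,3.25\n"


def extract(text):
    """Read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Convert string fields into typed values suited to analysis."""
    return [
        {"id": int(row["id"]), "amount_cents": round(float(row["amount"]) * 100)}
        for row in rows
    ]


# ETL: transform before the data reaches the analytical store.
warehouse = transform(extract(RAW_CSV))

# ELT: load the raw rows first (for example into a data lake), transform later.
data_lake = extract(RAW_CSV)
transformed_later = transform(data_lake)

print(warehouse)
print(warehouse == transformed_later)
```

Both paths end with the same prepared data. With ELT the raw copy also stays available in the lake.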


  1. Operational data is generated by applications and devices and ...
  2. Stored in Azure data storage services such as Azure SQL Database, Azure Cosmos DB, and Microsoft Dataverse.
  3. Streaming data is captured in event broker services such as Azure Event Hubs.

  1. Operational data must be captured, ingested, and consolidated into an analytical store and ...
  2. From where it can be modeled and visualized in reports and dashboards.
These tasks represent the core area of responsibility for the data engineer.
The core Azure technologies used to implement data engineering workloads include:
  • Azure Synapse Analytics
    Azure Synapse Analytics includes functionality for pipelines, data lakes, and relational data warehouses.
  • Azure Data Lake Storage Gen2
  • Azure Stream Analytics
  • Azure Data Factory
  • Azure Databricks
The analytical data stores that are populated with data produced by data engineering workloads support data modeling and visualization for reporting and analysis, often using sophisticated visualization tools such as Microsoft Power BI.

Azure Data Lake Storage Gen2
Provides a cloud-based solution for data lake storage in Microsoft Azure, and underpins many large-scale analytics solutions built on Azure.
A data lake is a repository of data that is stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, massively scalable, secure, and cost-effective data lake solution for high performance analytics built into Azure.

  • Hadoop compatible access.
    You can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure Synapse Analytics
  • Security
    Data Lake Storage supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions that don't inherit the permissions of the parent directory
  • Performance
  • Data redundancy

  • Blob
    In terms of blob manageability, blobs are stored as a single-level hierarchy in a flat namespace.
    In a flat namespace, directory-level operations such as rename or delete require a number of operations proportional to the number of objects in the structure.
  • Azure Data Lake Storage Gen2
    builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.
    Hierarchical namespaces keep the data organized, which yields better storage and retrieval performance for an analytical use case and lowers the cost of analysis.
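
The cost difference between the two namespace types can be sketched with a toy model in Python. This is only an illustration of the idea and not how Azure Storage is implemented. The dictionaries, paths, and function names are hypothetical.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Rename a 'directory' in a flat namespace: every matching blob is rewritten."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1
    return operations


def rename_hierarchical(directory, old_name, new_name):
    """Rename a directory node: one metadata operation, whatever it contains."""
    directory[new_name] = directory.pop(old_name)
    return 1


# The same 1000 files, modeled once as flat blob names and once as a directory tree.
flat = {f"raw/2023/file{i}.csv": b"" for i in range(1000)}
tree = {"raw": {"2023": {f"file{i}.csv": b"" for i in range(1000)}}}

flat_ops = rename_flat(flat, "raw/2023/", "raw/archive/")
tree_ops = rename_hierarchical(tree["raw"], "2023", "archive")
print(flat_ops, tree_ops)  # 1000 1
```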
Stages for processing big data solutions that are common to all architectures:
  1. Ingest
    The ingestion phase identifies the technology and processes that are used to acquire the source data. This data can come from files, logs, and other types of unstructured data that must be put into the data lake.
    The technology that is used will vary depending on the frequency that the data is transferred.
    1. Batch movement of data, pipelines in Azure Synapse Analytics or Azure Data Factory
    2. Real-time ingestion of data, Apache Kafka for HDInsight or Stream Analytics.
  2. Store
    The store phase identifies where the ingested data should be placed. Azure Data Lake Storage Gen2 provides a secure and scalable storage solution that is compatible with commonly used big data processing technologies.
  3. Prep and train
    The prep and train phase identifies the technologies that are used to perform data preparation and model training and scoring for machine learning solutions.
    Common technologies that are used in this phase are Azure Synapse Analytics, Azure Databricks, Azure HDInsight, and Azure Machine Learning.
  4. Model and serve
    Involves the technologies that will present the data to users. These technologies can include visualization tools such as Microsoft Power BI, or analytical data stores such as Azure Synapse Analytics. Often, a combination of multiple technologies will be used depending on the business requirements.
Use Azure Data Lake Storage Gen2 in data analytics workloads:
  • Big data processing and analytics
    Usually refers to analytical workloads that involve massive volumes of data in a variety of formats that need to be processed at a fast velocity - the so-called "three v's".
    Big data services such as Azure Synapse Analytics, Azure Databricks, and Azure HDInsight can apply data processing frameworks such as Apache Spark, Hive, and Hadoop.
  • Data warehousing
    Integrate large volumes of data stored as files in a data lake with relational tables in a data warehouse.
    There are multiple ways to implement this kind of data warehousing architecture. In one such solution, Azure Synapse Analytics hosts pipelines to perform extract, transform, and load (ETL) processes using Azure Data Factory technology.
  • Real-time data analytics
    Streaming data requires a solution that can capture and process a boundless stream of data events as they occur.
    Streaming events are often captured in a queue for processing. There are multiple technologies you can use to perform this task, including Azure Event Hubs.
    Azure Stream Analytics enables you to create jobs that query and aggregate event data as it arrives, and write the results to an output sink.
  • Data science and machine learning
    Involves the statistical analysis of large volumes of data, often using tools such as Apache Spark and scripting languages such as Python. Azure Data Lake Storage Gen2 provides a highly scalable cloud-based data store for the volumes of data required in data science workloads.
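
For the real-time analytics workload above, aggregating events as they arrive usually means grouping them into time windows. The Python sketch below counts events per fixed, non-overlapping (tumbling) window. It only illustrates the idea and is not Azure Stream Analytics query syntax. The event tuples and function name are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window."""
    counts = Counter()
    for timestamp, _payload in events:
        # Each event belongs to exactly one window, keyed by the window start.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical events: (seconds since stream start, payload).
events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (21, "e")]
window_counts = tumbling_window_counts(events, window_seconds=10)
print(window_counts)  # {0: 2, 10: 2, 20: 1}
```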
Azure Synapse Analytics
Analytical techniques that organizations commonly use:
  • Descriptive analytics
    Answers the question “What is happening in my business?”. This question is typically answered through the creation of a data warehouse in which historical data is persisted in relational tables for multidimensional modeling and reporting.
  • Diagnostic analytics
    Deals with answering the question “Why is it happening?”. This may involve exploring information that already exists in a data warehouse, but typically involves a wider search of your data estate to find more data to support this type of analysis.
  • Predictive analytics
    Enables you to answer the question “What is likely to happen in the future based on previous trends and patterns?”
  • Prescriptive analytics
    Enables autonomous decision making based on real-time or near real-time analysis of data, using predictive analytics.
Azure Synapse Analytics provides a cloud platform for all of these analytical workloads through support for multiple data storage, processing, and analysis technologies in a single, integrated solution.

To support the analytics needs of today's organizations, Azure Synapse Analytics combines a centralized service for data storage and processing with an extensible architecture through which linked services enable you to integrate commonly used data stores, processing platforms, and visualization tools.

A Synapse Analytics workspace defines an instance of the Synapse Analytics service in which you can manage the services and data resources needed for your analytics solution.
A workspace typically has a default data lake, which is implemented as a linked service to an Azure Data Lake Storage Gen2 container.
Azure Synapse Analytics includes built-in support for creating, running, and managing pipelines that orchestrate the activities necessary
  • to retrieve data from a range of sources,
  • transform the data as required, and
  • load the resulting transformed data into an analytical store.
Azure Synapse Analytics supports SQL-based data querying and manipulation through two kinds of SQL pool that are based on the SQL Server relational database engine:
  • A built-in serverless pool that is optimized for using relational SQL semantics to query file-based data in a data lake.
    Use the built-in serverless pool for cost-effective analysis and processing of file data in the data lake.
  • Custom dedicated SQL pools that host relational data warehouses.
    Use dedicated SQL pools to create relational data warehouses for enterprise data modeling and reporting.
Processing and analyzing data with Apache Spark
In Azure Synapse Analytics, you can create one or more Spark pools and use interactive notebooks to combine code and notes as you build solutions for data analytics, machine learning, and data visualization.

Exploring data with Data Explorer
Data Explorer uses an intuitive query syntax named Kusto Query Language (KQL) to enable high performance, low-latency analysis of batch and streaming data.

Azure Synapse Analytics can be integrated with other Azure data services for end-to-end analytics solutions. Integrated solutions include:
  • Azure Synapse Link
    enables near-realtime synchronization between operational data in Azure Cosmos DB, Azure SQL Database, SQL Server, and Microsoft Power Platform Dataverse and analytical data storage that can be queried in Azure Synapse Analytics.
  • Microsoft Power BI integration
    enables data analysts to integrate a Power BI workspace into a Synapse workspace, and perform interactive data visualization in Azure Synapse Studio.
  • Microsoft Purview integration
    enables organizations to catalog data assets in Azure Synapse Analytics, and makes it easier for data engineers to find data assets and track data lineage when implementing data pipelines that ingest data into Azure Synapse Analytics.
  • Azure Machine Learning integration
    enables data analysts and data scientists to integrate predictive model training and consumption into analytical solutions.
Across all organizations and industries, the common use cases for Azure Synapse Analytics are identified by the need for:
  • Large-scale data warehousing
    Data warehousing includes the need to integrate all data, including big data, to reason over data for analytics and reporting purposes from a descriptive analytics perspective, independent of its location or structure.
  • Advanced analytics
    Enables organizations to perform predictive analytics using both the native features of Azure Synapse Analytics, and integrating with other technologies such as Azure Machine Learning.
  • Data exploration and discovery
    The serverless SQL pool functionality provided by Azure Synapse Analytics enables data analysts, data engineers, and data scientists alike to explore the data within your data estate. This capability supports data discovery, diagnostic analytics, and exploratory data analysis.
  • Real time analytics
    Azure Synapse Analytics can capture, store and analyze data in real-time or near-real time with features such as Azure Synapse Link, or through the integration of services such as Azure Stream Analytics and Azure Data Explorer.
  • Data integration
    Azure Synapse Pipelines enables you to ingest, prepare, model and serve the data to be used by downstream systems. This can be used by components of Azure Synapse Analytics exclusively.
  • Integrated analytics
    With the variety of analytics that can be performed on the data at your disposal, putting together the services in a cohesive solution can be a complex operation. Azure Synapse Analytics removes this complexity by integrating the analytics landscape into one service. That way you can spend more time working with the data to bring business benefit, rather than spending much of your time provisioning and maintaining multiple systems to achieve the same outcomes.

Cognitive service terms

Personalizer
Cognitive service for a decision support solution.
Analyzes the user's real-time behavior, for example, online shopping patterns, and then helps your app choose the best content items to show.

Spatial Analysis
Cognitive service for a computer vision solution.
Ingest a live video stream, detect and track people, monitor specific regions of interest on the video, and generate an event when a specific trigger occurs.
Can monitor an area in front of the checkout counter and trigger an event when the count of people exceeds a defined number.

QnA Maker
Cognitive service for a Natural Language Processing (NLP) solution.
Helps to build a custom knowledge base to provide a natural conversation layer over your common questions and answers.

Anomaly Detector
Finds data that is an outlier or is out of trend in time-series data. It does not analyze content supplied by users for offensive content such as adult, racy, or gory images.

Content Moderator
Cognitive service that identifies content that is potentially offensive, including images that may contain adult, racy, or gory content. It flags such content automatically for a human to review in a portal.

Smart Labeler
Used in training models (object identification).
Tag uploaded images automatically, reducing the manual effort needed to improve the model. You must check and adjust tags.

NormalizePunctuation
Language Understanding - removes punctuation such as dots, commas, brackets, and others from your utterances before your model gets trained or your app's endpoint queries get predicted. However, it does not eliminate the effect of accented characters.

NormalizeDiacritics
Language Understanding - replaces accented characters, also known as diacritics, with regular characters.
Allows you to ignore the effect of diacritics during your app's training and prediction.
Only available for the supported languages like Spanish, Portuguese, Dutch, French, German, and Italian.
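
The effect of the two normalization settings can be imitated in plain Python. This sketch only illustrates what the normalized utterances look like and is not how Language Understanding implements the settings. The function names and sample utterances are hypothetical.

```python
import string
import unicodedata


def normalize_punctuation(utterance):
    """Remove ASCII punctuation such as dots, commas, and brackets."""
    return utterance.translate(str.maketrans("", "", string.punctuation))


def normalize_diacritics(utterance):
    """Replace accented characters with their unaccented base characters."""
    decomposed = unicodedata.normalize("NFD", utterance)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_punctuation("Hmm..... I will take the cappuccino!"))
# Hmm I will take the cappuccino
print(normalize_diacritics("café crème"))
# cafe creme
```

As the NormalizePunctuation entry says, removing punctuation leaves accented characters untouched, so the two settings are independent.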

Phrase list
Language Understanding - list of similar words or phrases that can be used by your model as a domain-specific vocabulary.
Example in travel industry: Single, Double, Queen, King, and Twin as a phrase list feature, so the app can recognize from the utterances preferences for a hotel room type.
The phrase list feature can improve the quality of your NLU app understanding of intents and entities

filterable
QnA - Property specifies if the field in the index can be used as a filter and can restrict the list of documents returned by the search.
The filterable property is a Boolean value defined on a field in the index.

sortable
QnA - allows other fields to be used for sorting results.
By default, search results are ordered by score.

facetable
QnA - returns a hit count by category, for example the number of results by test type.

retrievable
QnA - Boolean; if set to true, includes the field in the results of the search, if set to false the field will not be included in the search result.
Does not allow the user to search on the field or restrict results by the value of the field.
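
The four properties above are Boolean attributes set per field in a search index definition. The Python sketch below builds such a definition as a dictionary and prints it as JSON. It assumes these are Azure Cognitive Search index field attributes. The index name, field names, and the `field` helper are hypothetical.

```python
import json


def field(name, edm_type="Edm.String", **attributes):
    """Build one field definition with the given Boolean attributes."""
    definition = {"name": name, "type": edm_type}
    definition.update(attributes)
    return definition


index_fields = [
    field("id", key=True, retrievable=True),
    # Can restrict results (filterable) and return hit counts per value (facetable).
    field("testType", filterable=True, facetable=True, retrievable=True),
    # Can be used to order results instead of the default ordering by score.
    field("createdOn", "Edm.DateTimeOffset", sortable=True, retrievable=True),
    # Searchable, but never returned in the search results.
    field("internalNotes", searchable=True, retrievable=False),
]

index_definition = {"name": "tests-index", "fields": index_fields}
print(json.dumps(index_definition, indent=2))
```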

Knowledge store
Azure Cognitive Search - a location in Azure Storage that holds the data created by a Cognitive Search enrichment pipeline.
It is used for independent analysis or downstream processing in non-search scenarios like knowledge mining.
The enriched content (created by the skillset) is stored in:
  • tables TAB3 (key phrase extraction)
    table projections require the data to be mapped to the knowledge store using outputFieldMappings with:
    • sourceFieldName
    • targetFieldName
  • blob containers in storageContainer.
When you save enrichments as a table projection you need to specify:
  • source - path to projection
  • tableName - name of table in Azure Table storage
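
A table projection with `source` and `tableName` can be sketched as the `knowledgeStore` part of a skillset definition, shown here as a Python dictionary printed as JSON. The table names, key names, source paths, and the connection string placeholder are hypothetical and would have to match your own skillset.

```python
import json

knowledge_store = {
    # Placeholder (assumption): connection string of your Azure Storage account.
    "storageConnectionString": "<STORAGE_CONNECTION_STRING>",
    "projections": [
        {
            "tables": [
                {
                    "tableName": "Documents",
                    "generatedKeyName": "DocumentId",
                    "source": "/document/tableprojection",
                },
                {
                    "tableName": "KeyPhrases",
                    "generatedKeyName": "KeyPhraseId",
                    "source": "/document/tableprojection/keyPhrases/*",
                },
            ],
            "objects": [],
            "files": [],
        }
    ],
}

print(json.dumps(knowledge_store, indent=2))
```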

Projections
Enriched documents from Cognitive Services that are stored in a knowledge store.
Enhance and shape the data.
Type of projections:
  • File (images)
  • Object (JSON)
  • Table (dictionary)

Skillset
Enrichment process of Cognitive Search enrichment pipeline.
Moves a document through a sequence of enrichments that invoke atomic transformations, such as recognizing entities or translating text.
Output:
  • always a search index.
  • can be projections in a knowledge store.
Search index and knowledge store are mutually exclusive products of the same pipeline.
They are derived from the same inputs but their content is structured, stored, and used in different applications.

encryptionKey
Azure Cognitive Search, enrichment - optional and used to reference an Azure Key Vault for the skillset, not for the knowledge store.

referenceKeyName
Azure Cognitive Search, enrichment - used to relate data across projections.
If it is not specified, then the system will use generatedKeyName

fieldMappings
Property is used to map key fields.
It is optional. By default, the metadata_storage_path property is used.

storageConnectionString
Required when storing the skillset's output data into a knowledge store (not required for the indexer)

cognitiveServices
Defines the Cognitive Services resource to use to enrich the data
Required when defining the skillset (not required for the indexer).

LUISGen
Command-line tool that can generate C# and Typescript classes for your LUIS intents and entities.

LUDown
Command-line tool that you use to parse .lu files.
If the file contains intents, entities, and patterns, the LUDown tool can generate a LUIS model in JSON format.
If the file contains question and answer pairs, LUDown can generate a knowledge base in JSON format.

Chatdown
Command-line tool that can be used to generate mock conversations between a user and a bot.
Expects input in a chat file format to generate conversation transcripts in .transcript format that can be consumed by the Bot Framework Emulator.
->Work with: Bot Framework Emulator and ngrok

ngrok
1. Tool that allows you to expose a web server running on your local computer to the internet.
It helps you to locally debug a bot from any channel.
2. It is integrated with Bot Framework Emulator and allows Bot Framework Emulator to connect to remote endpoints such as the production Web app running in Azure.
Enables Bot Framework Emulator to bypass the firewall (tunnel) on your computer and connect to the Azure bot service and intercept the messages to and from the bot.
->Work with: Bot Framework Emulator and Chatdown

Bot Framework Emulator
Tool to debug your bot.
Bot Framework Emulator is a Windows desktop application that can connect to a bot and inspect the messages sent and received by the bot.
With ngrok, Bot Framework Emulator can connect to remote endpoints such as the production web app running in Azure.
Can view conversation transcripts (.transcript files) and can use these transcript files to test the bot.
Conversations between a user and a bot can be mocked up as text files in markdown in the .chat file format.
->Work with: ngrok and Chatdown

Active learning
If enabled, QnA Maker will analyze user queries and suggest alternative questions that can improve the quality of your knowledge base.
If you approve those suggestions, they will be added as alternative questions to the knowledge base, so that you can re-train your QnA Maker model to serve your customers better.

Chit-chat
Pre-built data sets.
The chit-chat feature will add a predefined personality to your bot to make it more conversational and engaging.

Precise answering
The precise answering feature uses a deep learning model to identify the intent in the customer question and match it with the best candidate answer passage from the knowledge base.

Managed keys
Encryption keys managed by Microsoft or the customer used to protect QnA Maker's data at rest.

Regular expression type
Uses a regular expression (Regex) to search for a pattern. It is used to match fixed patterns in a string such as numbers with two decimal places.
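
As an illustration of the kind of fixed pattern meant here, the following Python sketch matches numbers with exactly two decimal places. It uses Python's `re` module, not the bot's recognizer configuration. The pattern and sample text are hypothetical.

```python
import re

# One or more digits, a dot, then exactly two digits, bounded by word boundaries.
TWO_DECIMALS = re.compile(r"\b\d+\.\d{2}\b")

text = "Total 19.99, tax 1.5, shipping 4.00"
matches = TWO_DECIMALS.findall(text)
print(matches)  # ['19.99', '4.00']
```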

Default recognizer type
It includes LUIS and QnA Maker recognizers.

Custom recognizer type
Allows you to define your own recognizer in JSON format.
It may be possible to perform regular expressions using a custom recognizer in JSON, but this requires additional effort to define and test the JSON to extract the numbers.

Orchestrator recognizer type
Allows you to link other bots to your bot as skills. It may help to find patterns in text strings but with more effort and resources.

Computer Vision
Only identifies well-known brands

Object Detection
Can locate and identify logos in images. Custom Vision resource in Azure must exist.

Classification project
Classification model for Custom Vision that analyzes and describes images.

Partitions
Cognitive search - Controls the distribution of the index across the physical storage.
Partitions split data across different computing resources. This has the effect of improving the performance of slow and large queries.
For example, with three partitions, you divide your index into three slices. To meet the

Replicas
Primarily used for load balancing, and so assist with the response for all queries under load from multiple users.
Adding a replica will not make an individual query perform faster.
Cognitive search - Microsoft guarantees 99.9% availability of read-write workloads for queries and indexing if your Azure Cognitive Search resource has three or more replicas.
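
The roles of partitions and replicas can be illustrated with a toy Python model. Partitions split the index into slices, while replicas share incoming queries without making any single query faster. The round-robin assignment is an assumption made for illustration only and does not describe how the service actually distributes documents or queries. All function names are hypothetical.

```python
from itertools import cycle


def partition_index(doc_ids, partitions):
    """Divide the index into one slice per partition."""
    slices = [[] for _ in range(partitions)]
    for position, doc_id in enumerate(doc_ids):
        slices[position % partitions].append(doc_id)
    return slices


def assign_queries(queries, replicas):
    """Spread queries across replicas in turn (load balancing)."""
    replica_ids = cycle(range(replicas))
    return {query: next(replica_ids) for query in queries}


def meets_read_write_sla(replicas):
    """Three or more replicas are needed for the read-write availability guarantee."""
    return replicas >= 3


slices = partition_index(list(range(9)), partitions=3)
assignment = assign_queries(["q1", "q2", "q3", "q4"], replicas=3)
print(slices)      # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
print(assignment)  # {'q1': 0, 'q2': 1, 'q3': 2, 'q4': 0}
print(meets_read_write_sla(2), meets_read_write_sla(3))  # False True
```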

Sample labeling tool
Tool for training custom Form Recognizer.

Azure Files
Are fully managed file shares that you can mount in Windows, Linux, or macOS machines.

Custom Vision service
Allows you to train image classification and object detection algorithms that you can use in your image recognition solutions.

Azure Video Analyzer for Media
Accessible at https://www.videoindexer.ai.
With a free trial, you can use up to 600 minutes of free video indexing using the Video Analyzer for Media website or up to 2,400 minutes when accessing it through API.
The Content model customization option allows you to manage Person, Brands, and Language models, for example, to add custom faces or exclude certain brands.

endpoint.microsoft.com
Microsoft Intune portal address. Manage your mobile devices, deploy device policies, and monitor them for compliance.

azure-cognitive-services/decision repository
Common storage location for Azure Cognitive Services container images in the Decision domain, for example Anomaly Detector.

azure-cognitive-services/textanalytics repository
Common storage location for Text Analytics container images such as Key Phrase Extraction or Text Language Detection.

food(compact) domain
Allows you to classify photographs of fruit and vegetables.
Compact models are lightweight and can be exported to run on edge devices.

food domain
Will help you classify photographs of fruit and vegetables.
It is not optimized to run on edge devices.
Cannot be exported from the Custom Vision portal for offline use.

retail(compact) domain
Optimized for images that are found in a shopping catalog or shopping website.
Use it for high-precision classification between dresses, pants, shirts, etc.

products on shelves domain
Object detection domain that can detect and classify products on shelves.
It is not optimized to run on edge devices.
It cannot be exported from the Custom Vision portal for offline use.

Adaptive expressions
Used by language generation to evaluate conditions described in language generation templates.

Language generation
Enables your bot to respond with varying text phrases, creating a more natural conversation with the user.

Language understanding
Enables your bot to understand user input naturally and to determine the intent of the user.

Skills
Allow you to call one bot from another and create a seamless bot experience for the user.

Orchestrator
Combines your bot with other bots as skills and determines which bot to use to respond to the user.

Verify API
Allows a person automated entry to the premises when showing their face to the entrance gate camera.
Determines whether the face belongs to that same person.
The Face Verify API will compare the person's image against the enrolled database of persons' images and provide them access, creating a one-to-one mapping between the two images to verify if both images belong to the same person.

Detect API
Generate engagement reports using student emotions and head poses over a period of time.
Face detection captures a number of face-related attributes like head pose, gender, age, facial hair, etc.
Admins can use the JSON output from the images to study people's engagement.

Identify API
Identify persons who are attending a specific event.
Face identification allows admins to compare given event photos against all previous photographs of the same subject and do a one-to-many mapping.

Group API
Face grouping divides a set of unknown faces into smaller sets of groups based on similarity.
Additionally, it also returns a set of face IDs having no similarities.
A single person can have multiple groups, although each returned group is likely to belong to the same person.
These different groups of the same person are differentiated due to additional factors like expression.

Precision
Indicates the proportion of true positives (TP) over the sum of all true positives and false positives (FP), that is, how many of the predicted classifications were correct.
It uses the formula: TP / (TP + FP)

Recall
Indicates the fraction of actual classifications that were correctly identified.
It uses the formula: TP / (TP + FN)
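
The two formulas above can be checked with a small calculation. This is a minimal sketch; the counts are made-up values for illustration only.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```

A model can score high on one metric and low on the other, which is why both are reported.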

Roles
Are added to entities to distinguish between different contexts.
Example: flight origin and flight destination - you can add roles to the prebuilt geographyV2 entity.

Features
Provide hints for interchangeable words. Features act as synonyms for words when training a LUIS app.

Pattern
Patterns are used with entities and roles to extract information from an utterance.

synonym map
Resource that supplements the content of a search index with equivalent terms.
It cannot be used to enable search-as-you-type functionality.

Knowledge store - object or file projections
With object and file projections, you can write the content from source documents, skills output, and enrichments to blob storage.

Bot Framework Composer
Can be published as a web app to Azure App Service.
Can run as a serverless application in Azure Functions.

Skill bot
Bot that can perform tasks for another bot.
A skill manifest is required for bots that are skills.

Skill consumer bot
Bot that can call other bots as skills

skill manifest
JSON file that describes the actions that it exposes to the skill consumer and the parameters it requires.

Billable Cognitive Search
Must be used for the Azure scenarios where you expect high or frequent load.
As an all-in-one resource, Azure Cognitive Services provides access to Computer Vision, Text Analytics, and Text Translation services through the relevant endpoints.
The Computer Vision service in particular provides OCR, the capability to identify and extract text from given images.

Image Analysis
Can analyze the image content and generate captions and tags or identify celebrities and landmarks.
For the low-volume text extraction, you can use built-in OCR skills, as it allows the processing of a limited number of documents for free.

built-in Key Phrase Extraction
Can evaluate given input text and return a list of detected key phrases.

Faceted navigation in Azure Cognitive Search
Filtering feature that enables drill-down navigation in your search-enabled application.
Rather than typing your search expression, you can use faceted navigation to filter your search results through the exposed search criteria such as range or counts within your Azure Cognitive Search index.

Brands model
Identify mentions of products, services, and companies.
By using content model customization, you can configure the Brands model to identify Bing suggested brands or any other brands that you add manually.
For example, if Microsoft is mentioned in video or audio content, or if it shows up in visual text in a video, Video Analyzer for Media detects it as a brand in the content.

Person model
Recognize celebrities from video content.
By using content model customization, you can configure the Person model to detect celebrity faces from a database of over one million faces powered by various data sources like Internet Movie Database (IMDb), Wikipedia, and top LinkedIn influencers.

Language model
Ability to determine industry terms or specific vocabulary.
By using content model customization, you can configure the Language model to add your own vocabulary or industry terms that can be recognized by the Video Analyzer for Media.


cognitiveservices.azure.com domain
Can be used to access the Computer Vision API and generate the required image thumbnails. Example:
https://MYAPPLOCATION.cognitiveservices.azure.com/vision/v3.1/generateThumbnail
https://MYAPPNAME.api.cognitive.microsoft.com/vision/v3.1/generateThumbnail

azurewebsites.net
Subdomain reserved for the use of Azure web apps.
This subdomain is assigned to your custom web apps that you deploy in Azure.

azure-api.net
Subdomain is reserved for use by Azure API Management instances.
API Management can potentially hide Computer Vision endpoints by acting as a frontend interface.

Azure Application Insights
Helps in monitoring your container across multiple parameters like availability, performance, and usage. It is a recommended, but optional, setting when configuring the docker container.
It is not a required setting that must be configured for telemetry support of your container.

Direct Line speech
Channel that allows users to interact with your bot via voice.
Uses the Cognitive Service Speech-to-Text service.

Custom commands
Used with voice assistants for more complex task-orientated conversations.
Use speech-to-text to transcribe the user's speech, then take action on the natural language understanding of the text.

Language understanding
Enables your bot to understand user input naturally and to determine the intent of the user.

Language generation
Templates enable you to send a variety of simple text messages to users.

Telephony
Channel that allows users to interact with the bot over the phone.
Only enables voice over the telephone, not on other channels.

Indexer
Crawler that extracts searchable text and the respective metadata from an external Azure data source.
It populates a search index mapping between source data and your Azure Cognitive Search index.
Supported as a data source:
  • Azure Table Storage
  • Azure Data Lake Gen2
  • Azure Cosmos DB

Azure File Storage
Provides file shares in the cloud that are fully managed and accessible through the SMB or NFS protocol.

Azure Data Lake Gen1
Designed for big data analytic workloads and acts as an enterprise-wide hyper-scale repository.

Azure Bastion
Enables secure RDP (Remote Desktop Protocol) or SSH (Secure Shell) connectivity to your VM, without the need for the VM to have a public IP address.

HeroCard
Single large image and one or more buttons with text.
It has an array of card images and an array of card buttons.
The buttons can either return the selected value to the bot or open a url.

CardCarousel
Collection of cards that allows your user to view horizontally, scroll through, and select from.
The code would need to specify an attachmentLayout of carousel in order to display as a carousel.

ReceiptCard
Contains a list of ReceiptItem objects with a button at the end.
You should not use SuggestedActions. SuggestedActions displays a list of suggested actions. It uses an array of card actions.

ImBack
Shows the accept button. The ImBack activity type sends a message to the bot when the user selects the button containing the value specified in the parameters.

openUrl
Shows a webpage.
The openUrl activity type opens the URL specified in the value parameter, which in this case is the organization's privacy policy.
You should not use Signin. The Signin activity type uses OAuth to connect securely with other services.

displayText
Used with the messageBack activity type to display text in the chat. It is not sent to the bot.
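
The card and action types above can be pictured as a raw activity payload. This is a minimal sketch, assuming the Bot Framework JSON schema for a hero card attachment; the titles, image URL, and privacy link are hypothetical placeholders.

```python
import json

# Hero card: one large image plus buttons (all values are hypothetical).
hero_card = {
    "contentType": "application/vnd.microsoft.card.hero",
    "content": {
        "title": "Welcome",
        "images": [{"url": "https://example.com/banner.png"}],
        "buttons": [
            # imBack posts the value back to the bot as a user message.
            {"type": "imBack", "title": "Accept", "value": "accept"},
            # openUrl opens the value in a browser instead of messaging the bot.
            {"type": "openUrl", "title": "Privacy policy",
             "value": "https://example.com/privacy"},
        ],
    },
}

# A carousel is a message activity whose attachmentLayout is "carousel".
activity = {
    "type": "message",
    "attachmentLayout": "carousel",
    "attachments": [hero_card, hero_card],
}

print(json.dumps(activity, indent=2))
```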

Azure Form Recognizer
Service that allows you to analyze and extract text from structured documents like invoices.
Azure Cognitive Search does not provide any built-in skills to apply the Form Recognizer's functionality in an AI enrichment pipeline.
For this reason, you need to create a custom skill for it.

Bing Entity Search
Used to describe given geographical locations.
The Bing Entity Search functionality is not available in Azure Cognitive Search as a built-in cognitive skill.
For this reason, you need to build a custom skill and programmatically call the Bing Entity Search API to return descriptions for the given locations.

management.azure.com
Endpoint for the management of Azure services, including the Search service.
It enables the management and creation of keys for the Search service. Here you create the key for your search app - YOUR_COG_SERACH

YOUR_COG_SERACH.search.windows.net
Endpoint for the Search service. The client application will use this to query the indexes.

api.cognitive.microsoft.com
Endpoint used by the Bing Search services.

Microsoft.Search
Provider for the Search service.
Using this provider, you can regenerate the admin keys or create query keys for the Search service.

Microsoft.CognitiveServices
Provider for generic Cognitive Services.
It can regenerate the primary and secondary keys for the service, but it cannot generate query keys for the Search service.

Microsoft.Authorization
Provider for managing resources in Azure and can be used to define Azure policies and apply locks to resources.


SpeechRecognizer class
Start the speech service.
Can perform speech-to-text processing.

AudioDataStream
Represents an audio data stream when using speech synthesis in text-to-speech.

addPhrase
Add the ITEM_NAME_THAT_HAS_TO_BE_CONV_TO_TEXT to improve recognition of the product name.

start_continuous_recognition
Starts speech recognition until an end event is raised.
Continuous recognition can easily process audio up to 2 minutes long.

recognize_once / recognize_once_async
Methods only listen to audio for a maximum of 15 seconds.


Upload
Method of the IndexDocumentsAction class that you can use to push your data to a search index.
If the document is new, then it is inserted. If the document already exists, then this method updates its values instead.

IndexDocuments
Method of the SearchClient class that you can use to send a batch of upload, merge, or delete actions to your target search index.

Merge
Method of the IndexDocumentsAction class that you can use to update the values of an existing document.
The Merge method fails if you try to push data into a document that does not yet exist in the search index.

Autocomplete
Method of the SearchClient class that you can use to add search-as-you-type functionality to your app.
It uses text from your input field to autosuggest query terms from the matching documents in your search index.

GetDocument
Method of the SearchClient class that you can use to retrieve the specific document from your search index. In your case, you need to upload new documents instead.

customEvents
Telemetry sent by your bot to Azure Application Insights can be retrieved by the Kusto query only from the customEvents table.
Analyze your bot's telemetry

summarize operator
Produces a table with the data aggregated by specified columns.
This operator can call the count() aggregation function to return a count of the group.

Heartbeat table
Azure Monitor collects heartbeat data from Azure resources like virtual machines (VMs).

StormEvents table
Table in the Azure Data Explorer's sample database that contains information about US storms.

top operator
Returns the first N records sorted by the specified column.

Bot Framework Composer
Windows desktop application that allows you to build bots using a visual designer interface.

Log Analytics workspace
Used to send logs to a repository where they can be enriched with other monitoring logs collected by Azure Monitor to build powerful log queries.
Log Analytics is a flexible tool for log data search and analysis.
In a Log Analytics workspace you can combine your Azure Cognitive Services logs with other logs and metrics data collected by Azure Monitor, use Kusto query language to analyze your log data and also leverage other Azure Monitor capabilities such as alerts and visualizations.

Event Hub
Stream data to external log analytics solutions.
Event Hub can receive, transform and transfer millions of events per second.
With Event Hub, you can stream real-time logs and metrics from your Azure Cognitive Services to external systems such as Security Information and Event Management (SIEM) systems or third-party analytics solutions.

Azure Blob storage
Used to archive logs for audit, static analysis or backup purposes, keeping them in JSON format files.
Blob storage is optimized for storing big volumes of unstructured data.
With Blob storage you can keep your logs in a very granular format by the hour or even minutes, to assist with a potential investigation of specific incidents reported for your Azure Cognitive Services.

PythonSettings.json
It is a workspace configuration file for the Python used in the Visual Studio integrated development environment (IDE).

requirements.txt
In Python, you can use requirements.txt to specify the external packages that your Python project requires. You can install them all using the Python pip installer.

Cloning model
Copies the model into a new version.
The cloned version becomes the active version.
Cloning allows you to make changes to the model, such as adding intents, entities, and utterances, and test them without changing the original version of the model.

Pattern.any entity
Helps the model understand the importance of the word order.
It is a variable-length placeholder to indicate where the entity starts and ends in a sentence.
It is used when utterances are very similar but refer to different entities and when entities are made up of multiple words.

activity handler
Send a welcome message when a user joins the bot conversation.
An activity handler organizes the conversational logic for a bot. Activity handlers respond to events such as a user joining the bot conversation.

component dialog
Use it to create a reusable set of steps for one or more bots.
A component dialog is a set of dialogs that can call the other dialogs in the set.
The component dialog manages the child dialogs in this set.
A component dialog can be reused in the same bot or in several bots.

waterfall dialog
Use it to create a set of sequential steps for a bot.
A waterfall dialog is a set of dialog steps in a sequence to gather specific information from a user.

prompt dialog
Single individual prompt and user response.
Component dialogs and waterfall dialogs will contain one or more prompt dialogs.

Chat files
Are markdown files. They consist of two parts:
  • header - here you define participants
  • conversation

Person
Object you add to the Face API to store names and images of faces for identification.
A Person can hold up to 248 face images.

PersonDirectory
Structure into which you add the Persons and their facial images.
PersonDirectory can hold up to 75 million Persons and does not need training for new facial images.

DynamicPersonGroup
Subset of identities from a PersonDirectory that you can filter identification operations on.
Using a DynamicPersonGroup, you can increase the accuracy of facial identification by only verifying faces against a smaller list of people, instead of the entire set of faces in the PersonDirectory.

FaceList
Used for the Find Similar operation, not for identification.
Find Similar is used to find people who look similar to the facial image; it cannot be used to verify that a face belongs to a person.

PersonGroup
Can only hold 1,000 Persons on the free tier and 10,000 Persons on the standard tier.

LargePersonGroup
Can hold up to one million identities.
However, a LargePersonGroup requires training when a new facial image is added.
This training can take 30 minutes, and the face cannot be recognized until training is complete.
This solution is unable to identify new faces immediately.

Direct Line channel
Does not enable voice-in capability in your bot.
Enables communication between a client app and your bot over the HTTPS protocol.
Does not support network isolation.

Direct Line Speech
Enables voice-in or voice-out capabilities in your bot.
Does not support network isolation.

Direct Line App Service Extension
Ensures that traffic between your bot and client applications never leaves the boundaries of an Azure VNet.

You should use pattern matching or LUIS. The Speech software development kit (SDK) has two ways to recognize intents:

pattern matching
Can be used to recognize intents for simple instructions or specific phrases that the user is instructed to use.

Language Understanding (LUIS)
Model can be integrated using the speech SDK to recognize more complex intents with natural language.

Speech Synthesis Markup Language (SSML)
Adjusts the pitch, pronunciation, speaking rate, and volume of the synthesized speech output from the speech service.

Key phrase extraction
Is part of the Text Analytics API and extracts the main talking points from text.
Key phrase extraction is not integrated with the Speech SDK.

Visemes
Used to lip-sync animations of faces to speech.
Visemes use the position of the lips, jaw, and tongue when making particular sounds.

Scoring profile
Is part of the index definition and boosts search relevance based on the fields that you specify.
To favor new entries, you can use the date the product was added to boost its relevance score in search.

Computer Vision
Analyze an image and generate a human-readable phrase that describes its contents. The algorithm returns several descriptions based on different visual features, and each description is given a confidence score. The final output is a list of descriptions ordered from highest to lowest confidence.
The endpoint used in curl:
<Zone where Computer Vision resource was created>.api.cognitive.microsoft.com....



Azure Cognitive Services - Hands-On

Knowledge mining with Azure Cognitive Search

Study notes

1. Cognitive Search

Provides a cloud-based solution for indexing and querying a wide range of data sources, and creating comprehensive and high-scale search solutions.
Plain English: it lets you search all the data you have in the Azure cloud, no matter where it is or what it is (databases or documents).
To start, you need to create a resource: Azure Cognitive Search. During creation, select a tier based on what you need:
  • Free (F)
    Learn - try to see how it works
  • Basic (B)
    Small-scale search solutions that include a maximum of 15 indexes and 2 GB of index data.
  • Standard (S)
    Enterprise-scale solutions. Has variants S, S2, and S3, which offer increasing capacity, read performance, and numbers of indexes.
  • Storage Optimized (L)
    Variants L1 or L2 - support large indexes, at the cost of higher query latency.
Optimize your solution for scalability and availability by creating replicas and partitions.
Search units (SU) = replicas x partitions (R x P)
  • Replicas (R)
    Instances of the search service (similar to nodes in a cluster). A higher number gives more capacity for concurrent queries and better availability.
  • Partitions (P)
    More technical approach - divide an index into multiple storage locations, enabling you to split I/O operations such as querying or rebuilding an index.
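
The replica and partition arithmetic can be written out as a tiny helper. This is a minimal sketch of the R x P relationship only; the function name is hypothetical and tier-specific limits are not modeled.

```python
def search_units(replicas: int, partitions: int) -> int:
    """Search units (SU) used by a service: replicas x partitions."""
    if replicas < 1 or partitions < 1:
        raise ValueError("a service needs at least one replica and one partition")
    return replicas * partitions


# Example: 3 replicas for query load, 2 partitions for index size.
print(search_units(replicas=3, partitions=2))
```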
Search components:
  • Data source
    Many options: unstructured files in Blob storage, tables in SQL databases, documents in Cosmos DB, or JSON documents pushed directly into the index.
  • Skillset
    This is the AI in action. On top of the extracted data, AI skills add more details and insights via an "enrichment pipeline", for example:
    • Language used.
    • Key phrases / main topics
    • Sentiment score
    • Specific locations, people, organizations, or landmarks
    • Image descriptions, extracted text (OCR).
    • Custom skills.
  • Indexer
    Engine that drives the overall indexing process. It takes the output extracted from the data source and generated by the skillset and maps it to fields in the index.
    It populates the index.
    Field mapping:
    • Fields extracted from the source data are mapped directly to index fields.
      • Implicit mapping - automatically mapped to fields with the same name in the index.
      • Explicit mapping - the mapping is defined explicitly, so the field can be renamed in the index.
    • Fields output by skills (skillset) are explicitly mapped to target fields in the index.
  • Index
    Searchable result. Collection of JSON docs used by client application.
    It is an entity that contains details extracted and enriched (metadata, normalized images, language used, text from images, merged content from enrichment details)
    Field attributes:
    • key
      Unique key for index records.
    • searchable
      Fields whose content is searched by full-text queries.
    • filterable
      Fields that can be included in filter expressions to return only documents that match specified constraints.
    • sortable
      Fields that can be used to order the results.
    • facetable
      Fields that can be used to determine values for facets (user interface elements used to filter the results based on a list of known field values).
    • retrievable
      Fields that can be included in search results (by default, all fields are retrievable unless this attribute is explicitly removed).
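
The field attributes above map onto an index definition roughly as follows. This is a minimal sketch, assuming the JSON shape used by the REST API for index definitions; the index name, field names, and the `field` helper are hypothetical, and every attribute is set explicitly rather than relying on service defaults.

```python
import json


def field(name, edm_type, **attrs):
    """Build one field entry with every attribute stated explicitly."""
    flags = {"key": False, "searchable": False, "filterable": False,
             "sortable": False, "facetable": False, "retrievable": True}
    flags.update(attrs)
    return {"name": name, "type": edm_type, **flags}


index_definition = {
    "name": "docs-index",  # hypothetical index name
    "fields": [
        field("id", "Edm.String", key=True),
        field("content", "Edm.String", searchable=True),
        field("author", "Edm.String", filterable=True, facetable=True),
        field("last_modified", "Edm.DateTimeOffset",
              filterable=True, sortable=True),
    ],
}

print(json.dumps(index_definition, indent=2))
```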
Search an index
Based on the Lucene query syntax, which provides a rich set of query operations for searching, filtering, and sorting data in indexes.
  • Simple
    Intuitive syntax for basic searches that match literal query terms.
  • Full
    You must submit a search term and search parameters:
    • queryType - simple or full
    • searchFields - Index fields to be searched.
    • select - Fields to be included in the results.
    • searchMode - Any/All - Criteria for including results based on multiple search terms.
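
The parameters listed above can be assembled into a request body. This is a minimal sketch, assuming the JSON body shape accepted by the REST search endpoint (comma-separated strings for `searchFields` and `select`); the helper name, search term, and field names are hypothetical, and no request is actually sent.

```python
import json


def build_search_request(term, fields_to_search, fields_to_return,
                         query_type="simple", search_mode="any"):
    """Assemble the search parameters into one JSON-serializable dict."""
    if query_type not in ("simple", "full"):
        raise ValueError("queryType must be 'simple' or 'full'")
    if search_mode not in ("any", "all"):
        raise ValueError("searchMode must be 'any' or 'all'")
    return {
        "search": term,
        "queryType": query_type,
        "searchFields": ",".join(fields_to_search),
        "select": ",".join(fields_to_return),
        "searchMode": search_mode,
    }


body = build_search_request(
    "comfortable hotel",
    fields_to_search=["content", "title"],
    fields_to_return=["title", "author"],
    query_type="full",
    search_mode="all",
)
print(json.dumps(body, indent=2))
```

With `searchMode` set to `all`, a document must match every term to be included; `any` includes documents that match at least one term.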
Query processing stages
  • Query parsing.
  • Lexical analysis
  • Document retrieval
  • Scoring
Filtering results
  • Include filter criteria - valid for Simple search.
    search=TERM+author='Reviewer'
    queryType=Simple
  • Providing a parameter - an OData filter expression as a $filter parameter with a Full search expression.
    search=TERM
    $filter=author eq 'Reviewer'
    queryType=Full
  • Filtering with facets
    Facets are a useful way to present users with filtering criteria based on field values in a result set. Example: retrieve all authors as facet values, then filter by the author field.
    search=*
    $filter=author eq 'selected-facet-value-here'
Sorting results
By default, results are sorted based on the relevancy score.
Use the OData $orderby parameter, which specifies one or more sortable fields and a sort order (asc or desc).
search=*
$orderby=last_modified desc
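
The filter and sort parameters above can be encoded into a query string like this. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the field names `author` and `last_modified` are taken from the examples above.

```python
from urllib.parse import parse_qs, quote, urlencode


def odata_eq(field, value):
    """OData equality filter; single quotes inside the value are doubled."""
    escaped = value.replace("'", "''")
    return f"{field} eq '{escaped}'"


def build_query_string(term, filter_expr=None, order_by=None):
    """URL-encode the search term plus optional $filter and $orderby."""
    params = {"search": term}
    if filter_expr:
        params["$filter"] = filter_expr
    if order_by:
        params["$orderby"] = order_by
    return urlencode(params, quote_via=quote)


qs = build_query_string("*", odata_eq("author", "Reviewer"), "last_modified desc")
print(qs)
print(parse_qs(qs))  # decoded view of the same parameters
```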


Enhance the index
  • Search-as-you-type
    • Suggestions
    • Autocomplete
  • Custom scoring and result boosting
    By default, search results are sorted by a relevance score that is calculated based on a term-frequency/inverse-document-frequency (TF/IDF) algorithm. You can customize the way this score is calculated by defining a scoring profile (i.e., to increase the relevance of certain documents).
    You can modify an index definition so that it uses your custom scoring profile by default.
  • Synonyms
2. Custom skill

A custom skill must implement the input and output data schema expected by skills in an Azure Cognitive Search skillset.
  • Input Schema
    Defines a JSON structure containing a record for each document to be processed.
    Each document has a unique identifier and a data payload with one or more inputs.
  • Output schema
    Defines the results returned by your custom skill and mirrors the input schema.
    The output will contain a record for each input record, with either the results produced by the skill or details of any errors that occurred.
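
The input and output schemas can be illustrated with a small handler. This is a minimal sketch, assuming the `values` / `recordId` / `data` / `errors` / `warnings` record layout used by custom Web API skills; `run_skill` and `word_count` are hypothetical names, and the word-count logic stands in for whatever your skill really does.

```python
def run_skill(request_body, process):
    """Apply `process` to each input record and build the skill response."""
    results = []
    for record in request_body.get("values", []):
        out = {"recordId": record["recordId"], "data": {},
               "errors": [], "warnings": []}
        try:
            out["data"] = process(record["data"])
        except Exception as exc:  # report the failure for this record only
            out["errors"].append({"message": str(exc)})
        results.append(out)
    return {"values": results}


def word_count(data):
    """Hypothetical skill logic: count the words in the 'text' input."""
    return {"wordCount": len(data["text"].split())}


request = {"values": [
    {"recordId": "1", "data": {"text": "knowledge mining with search"}},
    {"recordId": "2", "data": {}},  # missing input, so an error is recorded
]}
response = run_skill(request, word_count)
print(response)
```

Each output record carries the same `recordId` as its input record, which is how the results are matched back to the documents.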
To integrate a custom skill into your indexing solution, you must add a skill for it to a skillset using the Custom.WebApiSkill skill type.
Skill definition:
  • Specify the URI to your web API endpoint, including parameters and headers if necessary.
  • Set the context to specify at which point in the document hierarchy the skill should be called
  • Assign input values, usually from existing document fields
  • Store output in a new field, optionally specifying a target field name (otherwise the output name is used)
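
The four parts of a skill definition can be put together as follows. This is a minimal sketch, assuming the JSON property names of the Custom.WebApiSkill type; the URI, the input and output names, and the target field name are hypothetical.

```python
import json

custom_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Counts words in the document content",
    # 1. URI of the web API endpoint (hypothetical), plus method and headers.
    "uri": "https://example.azurewebsites.net/api/wordcount",
    "httpMethod": "POST",
    "httpHeaders": {},
    # 2. Context: the point in the document hierarchy where the skill runs.
    "context": "/document",
    # 3. Inputs, usually taken from existing document fields.
    "inputs": [{"name": "text", "source": "/document/content"}],
    # 4. Outputs, optionally renamed through targetName.
    "outputs": [{"name": "wordCount", "targetName": "word_count"}],
}

print(json.dumps(custom_skill, indent=2))
```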
3. Knowledge stores

Consists of projections of the enriched data, which can be JSON objects, tables, or image files.
When an indexer runs the pipeline to create or update an index, the projections are generated and persisted in the knowledge store.

The result of indexing is a collection of JSON objects.
It is an output, but it can also be used as an input:
  • for integration into a data orchestration process (e.g. Data Factory)
  • normalized and imported into a relational database, where it can be used by visualization tools
  • to create an image index (save images for browsing)
Azure Cognitive Search enables you to create search solutions in which a pipeline of AI skills is used to enrich data and populate an index.
The data enrichments performed by the skills in the pipeline supplement the source data with insights:
  • Language
  • Main themes or topics
  • Sentiment score
  • Locations, people, organizations, or landmarks
  • AI-generated descriptions of images, or image text extracted by optical character recognition (OCR).
The process of indexing incrementally creates a complex document that contains the various output fields from the skills in the skillset.
The Shaper skill:
  • simplifies the mapping of these field values to projections in a knowledge store
  • creates a new field containing a simpler structure for the fields you want to map to projections
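
A Shaper skill definition might look like the following. This is a minimal sketch, assuming the JSON property names of the Util.ShaperSkill type; the skill name and the source paths are hypothetical and depend on which other skills run earlier in your skillset. The target name `projection` matches the `/projection` source used in the knowledge store example in these notes.

```python
import json

shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "name": "define-projection",  # hypothetical skill name
    "description": "Collect enriched fields into one simpler structure",
    "context": "/document",
    "inputs": [
        # Source paths are assumptions; use the paths your skills produce.
        {"name": "file_name", "source": "/document/metadata_storage_name"},
        {"name": "language", "source": "/document/language"},
        {"name": "key_phrases", "source": "/document/keyPhrases/*"},
    ],
    # The single shaped output is stored as /document/projection.
    "outputs": [{"name": "output", "targetName": "projection"}],
}

print(json.dumps(shaper_skill, indent=2))
```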
First create a knowledgeStore object.
You can define different types of projections.
Define a separate projection for each type.
Only one projection type can be populated per projection.
  • object projections
  • table projections
  • file projections
Example:
"knowledgeStore": {
  "storageConnectionString": "<storage_connection_string>",
  "projections": [
    {
      "objects": [
        {
          "storageContainer": "<container>",
          "source": "/projection"
        }
      ],
      "tables": [],
      "files": []
    },
    {
      "objects": [],
      "tables": [
        {
          "tableName": "KeyPhrases",
          "generatedKeyName": "keyphrase_id",
          "source": "projection/key_phrases/*"
        },
        {
          "tableName": "docs",
          "generatedKeyName": "document_id",
          "source": "/projection"
        }
      ],
      "files": []
    },
    {
      "objects": [],
      "tables": [],
      "files": [
        {
          "storageContainer": "<container>",
          "source": "/document/normalized_images/*"
        }
      ]
    }
  ]
}


4. Enrich index in Language Studio

We put together Data Modeling(A) and Search(B):
  1. Store documents you wish to search
    Use Blob containers
    Label each document with the class or classes it belongs to (single-label or multi-label classification); the labels are needed for the next step.
    E.g.
    ...
    "documents": [
      {
        "location": "{DOCUMENT-NAME}",
        "language": "{LANGUAGE-CODE}",
        "dataset": "{DATASET}",
        "classes": [
          {
            "category": "Class1"
          },
          {
            "category": "Class2"
          }
        ]
      }
    ...

    ===A===
  2. Create a custom text classification project
  3. Train and test your model (we have the model and its endpoint to access it)
    You can add documents to test set.
    ===B===
  4. Create a search index based on your stored documents
  5. Create a function app that will use your deployed trained model
    1. Be able to pass JSON to the custom text classification endpoint
    2. Get the response and process it
    3. Return a structured JSON message back to a custom skillset in Cognitive Search
      The function must know:
      1. The text to be classified.
      2. The endpoint for your trained custom text classification deployed model.
      3. The primary key for the custom text classification project.
      4. The project name.
      5. The deployment name.
  6. Update your search solution, your index, indexer, and custom skillset
    There are three changes you need to make in the Azure portal to enrich your search index:
    1. Add a field to your index to store the custom text classification enrichment.
    2. Add a custom skillset to call your function app with the text to classify.
    3. Map the response from the skillset into the index.
5. Implement advanced search features in Azure Cognitive Search

How search calculates scores for documents and the tools you have to influence that score.
You can boost individual terms in your search queries, add custom scoring profiles to focus on the most important field in your index, enrich your indexes with more languages, and return results based on their location.

All search engines try to return the most relevant results to search queries. Azure Cognitive Search implements an enhanced version of Apache Lucene for full-text search.

  • Query - Improve the ranking of a document with term boosting (enable the Lucene query parser).
  • Results - Improve the relevance of results by adding scoring profiles.
  • Azure Cognitive Search uses the BM25 similarity ranking algorithm. The algorithm scores documents based on the search terms used.
    The search engine scores the documents returned from the first three phases.
    By default, the search results are ordered by their search score, highest first. If two documents have an identical search score, you can break the tie by adding an $orderby clause.
    Put simply, the document score is a function of:
    • number of times identified search terms appear in a document
    • document's size
    • rarity of each of the terms.
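
The three factors above can be seen in a textbook BM25 calculation. This is a minimal sketch of the standard formula with whitespace tokenization, not the service's actual implementation; `k1=1.2` and `b=0.75` are commonly used defaults, and the sample documents are made up. It assumes a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document for the query with the textbook BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears (rarity).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # how often each term appears in this document
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through `b`.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "azure search indexes documents",
    "search search search everywhere in long documents about many other things",
    "cooking recipes",
]
scores = bm25_scores(["search"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```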
Cognitive Search lets you influence a document score using scoring profiles.
Define different weights for fields in an index.
Scoring profile functions:
  • Magnitude
    Alter scores based on a range of values for a numeric field
  • Freshness
    Alter scores based on the freshness of documents as given by a DateTimeOffset field
  • Tag
    Alter scores based on common tag values in documents and queries
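
Field weights and scoring functions come together in a scoring profile definition. This is a minimal sketch, assuming the JSON property names used for scoring profiles in an index definition; the profile name, the field names (`title`, `content`, `last_modified`, `rating`), and the boost values are hypothetical.

```python
import json

scoring_profile = {
    "name": "boost-recent-titles",  # hypothetical profile name
    # Weights: a match in title counts three times as much as in content.
    "text": {"weights": {"title": 3.0, "content": 1.0}},
    "functions": [
        {   # Freshness: boost documents modified within the last 30 days.
            "type": "freshness",
            "fieldName": "last_modified",
            "boost": 2.0,
            "interpolation": "linear",
            "freshness": {"boostingDuration": "P30D"},
        },
        {   # Magnitude: boost documents by a numeric rating from 1 to 5.
            "type": "magnitude",
            "fieldName": "rating",
            "boost": 1.5,
            "interpolation": "linear",
            "magnitude": {"boostingRangeStart": 1, "boostingRangeEnd": 5},
        },
    ],
    "functionAggregation": "sum",
}

# Fragment of an index definition that makes the profile the default.
index_fragment = {
    "scoringProfiles": [scoring_profile],
    "defaultScoringProfile": scoring_profile["name"],
}

print(json.dumps(index_fragment, indent=2))
```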

Extract text from images and documents

Study notes

1. Read text from images - OCR technology:
  • Read API
    • Read small to large volumes of text (images and PDFs).
    • New OCR API generation - greater accuracy.
    • Can read printed text in multiple languages; handwritten text in English only.
    • Asynchronous operation. The initial call returns an operation ID that is used to retrieve the results.
    • The results from the Read API are broken down by:
      It is read at line level.
      • page
      • line
      • word
  • Image analysis API
    • Preview, with reading text functionality added
    • Read small amounts of text from images.
    • Returns contextual information, including line number and position.
    • Results are returned immediately (synchronous) from a single function call.
    • Analyzes images beyond extracting text, including detecting content ...
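
The asynchronous pattern of the Read API (submit, poll with the returned ID, then walk pages, lines, and words) can be sketched as follows. This is a minimal sketch with a stubbed service call; `fake_get_operation`, the operation ID, and the sample text are hypothetical, and the result layout (`analyzeResult` / `readResults` / `lines` / `words`) is assumed from the v3.x response format.

```python
import time


def wait_for_result(get_operation, operation_id, delay=0.0, max_polls=10):
    """Poll until the operation reports 'succeeded' or 'failed'."""
    for _ in range(max_polls):
        result = get_operation(operation_id)
        if result["status"] in ("succeeded", "failed"):
            return result
        time.sleep(delay)
    raise TimeoutError("operation did not finish in time")


def iter_words(result):
    """Yield (page number, line number, word) from a finished result."""
    for page in result["analyzeResult"]["readResults"]:
        for line_no, line in enumerate(page["lines"], start=1):
            for word in line["words"]:
                yield page["page"], line_no, word["text"]


# Stub standing in for the real service: first "running", then "succeeded".
_responses = iter([
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"readResults": [
        {"page": 1, "lines": [
            {"text": "Hello world",
             "words": [{"text": "Hello"}, {"text": "world"}]},
            {"text": "Bye", "words": [{"text": "Bye"}]},
        ]},
    ]}},
])


def fake_get_operation(operation_id):
    return next(_responses)


result = wait_for_result(fake_get_operation, "hypothetical-operation-id")
words = list(iter_words(result))
print(words)
```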
2. Extract data from forms with Form Recognizer cognitive services
Use Form Recognizer cognitive service.
Create bounding boxes around detected objects in an image (text area) and then extract text.
Form Recognizer provides underlying models that have been trained on thousands of form examples.
Component services:
  • Document analysis models
    • Take input files:
      • JPEG, PNG, PDF, and TIFF
      • less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier
      • 50 x 50 pixels to 10000 x 10000 pixels
      • the training data set must be 500 pages at most
    • Return a JSON file with the location of text in bounding boxes, text content, tables, selection marks (also known as checkboxes or radio buttons), and document structure.
  • Prebuilt models
    Detect and extract information from document images and return the extracted data in a structured JSON output.
    Prebuilt models:
    • W-2 forms
    • Invoices
    • Receipts
    • ID documents
  • Custom models
    Extract data from forms specific to your business.
    Can be trained by calling the Build model API, or using Form Recognizer Studio.
    • Take input files:
      • JPEG, PNG, PDF, and TIFF
      • less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier
      • 50 x 50 pixels to 10000 x 10000 pixels
      • the training data set must be 500 pages at most
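
The input requirements listed above can be checked before a file is sent. This is a minimal sketch based only on the limits in these notes; the function name is hypothetical, and the mapping from file extension to format is an assumption.

```python
from pathlib import PurePath

# Extensions assumed to correspond to JPEG, PNG, PDF, and TIFF.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf", ".tif", ".tiff"}
# Size limits from the notes: 500 MB for paid (S0), 4 MB for free (F0).
MAX_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10000


def check_input(filename, size_bytes, width_px=None, height_px=None, tier="S0"):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file format")
    if size_bytes >= MAX_BYTES[tier]:
        problems.append(f"file too large for tier {tier}")
    for side in (width_px, height_px):
        if side is not None and not MIN_SIDE_PX <= side <= MAX_SIDE_PX:
            problems.append("image dimensions out of range")
            break
    return problems


print(check_input("invoice.pdf", 1_000_000))
print(check_input("scan.bmp", 5 * 1024 * 1024, 40, 40, tier="F0"))
```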
Process:
  • Subscribe to a resource:
    • Cognitive Service resource
    • Form Recognizer resource
  • Make sure input requirements are met
  • Decide which component of Form Recognizer to use (the options below are document analysis models, not prebuilt or custom models)
    • Layout model
      Analyzes and extracts text, tables, selection marks, and other structure elements like titles, section headings, page headers, page footers, and more
    • Read model
      Extracts printed and handwritten text, including words, locations, and detected languages.
    • General Document model
      Extracts key-value pairs in addition to text and document structure information.
To create an application that extracts data from common form types (such as those listed under Prebuilt models), use a prebuilt model. These models do not need to be trained.
To create an application that extracts data from your industry-specific forms, create a custom model. This model needs to be trained:
  • Form Recognizer service supports supervised machine learning. You can train custom models and create composite models with form documents and JSON documents that contain labeled fields
  • Train using Cognitive services:
    • Store sample forms in an Azure blob container, along with JSON files containing layout and label field information
    • Generate a shared access signature (SAS) URL for the container.
    • Use the Build model REST API function (or equivalent SDK method).
    • Use the Get model REST API function (or equivalent SDK method) to get the trained model ID.
  • Train using Form Recognizer studio
    • Custom template models
      Accurately extract labeled key-value pairs, selection marks, tables, regions, and signatures from documents. Training only takes a few minutes, and more than 100 languages are supported.
    • Custom neural models
      Deep learned models that combine layout and language features to accurately extract labeled fields from documents.
      Best for semi-structured or unstructured documents.
a) Form Recognizer models using the REST API
Custom model
REST API - you need the model ID (you get it after training is finished).
Pass this ID when you call the get_analyze_result function.
The response is in JSON format (text box locations and the text, down to words).

b) Extract data with Form Recognizer Studio
  • Document analysis models
    • Read
      Extract printed and handwritten text lines, words, locations, and detected languages from documents and images.
    • Layout
      Extract text, tables, selection marks, and structure information from documents (PDF and TIFF) and images (JPG, PNG, and BMP).
    • General Documents
      Extract key-value pairs, selection marks, and entities from documents.
  • Prebuilt models
  • Custom models (must train model):
    • Create a Form Recognizer or Cognitive Services resource
    • Collect at least 5-6 sample forms for training and upload them to your storage account container.
    • Configure cross-domain resource sharing (CORS).
    • Create a custom model project in Form Recognizer Studio.
    • Apply labels to text.
    • Train your model - receive a Model ID and Average Accuracy for tags.
    • Test model

Control version with Git

Study notes

There are many ways to describe a version control system (VCS), also called a software configuration management (SCM) system:
  • A program or set of programs that tracks changes to a collection of files.
  • Allows many people to work on a project in a coherent way.
  • Git was created by Linus Torvalds, the creator of Linux, because he needed it.
Why it is so good:
  • See all the changes made to your project, when the changes were made, and who made them.
  • Include a message with each change to explain the reasoning behind it.
  • Retrieve past versions of the entire project or of individual files.
  • Create branches, where changes can be made experimentally. This feature allows several different sets of changes (for example, features or bug fixes) to be worked on at the same time, possibly by different people, without affecting the main branch. Later, you can merge the changes you want to keep back into the main branch.
  • Attach a tag to a version—for example, to mark a new release.
Git is distributed, which means that a project's complete history is stored both on the client and on the server.
You can work on files without a network connection, check them in locally, and sync with the server when a connection becomes available.

You must be very clear with terminology:
  • Working tree
    Set of nested directories and files that contain the project that's being worked on.
  • Repository (repo)
    Directory, located at the top level of a working tree
    Git keeps here all the history and metadata for a project.
  • Bare repository
    Repository that isn't part of a working tree; it's used for sharing or backup, usually a directory with a name that ends in .git
  • Hash
    Number produced by a hash function that represents the contents of a file or another object as a fixed number of digits. Git uses hashes to tell whether a file has changed, by hashing its contents and comparing the result to the previous hash.
  • Object
    A Git repo contains four types of objects, each uniquely identified by an SHA-1 hash.
    • Blob object contains an ordinary file.
    • Tree object represents a directory; it contains names, hashes, and permissions.
    • Commit object represents a specific version of the working tree.
    • Tag - a name attached to a commit.
  • Commit
    Committing (doing) the changes you have made so that others can eventually see them, too.
  • Branch
    Series of linked commits with a given name.
    The most recent commit on a branch is called the head.
    The default branch, which is created when you initialize a repository, is called main
    The head of the current branch is named HEAD.
  • Remote
    A named reference to another Git repository. When you create a repo, Git creates a remote named origin that is the default remote for push and pull operations.
  • Commands, subcommands, and options
    Git operations are performed by using commands like git push and git pull. git is the command, and push or pull is the subcommand.
    Commands are frequently accompanied by options, which use hyphens (-) or double hyphens (--), for example git reset --hard.
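How Git uses a hash to tell whether a file changed can be illustrated in a few lines. A minimal sketch, assuming a repository that uses the default SHA-1 object format: Git hashes a blob as the header "blob <size>" plus a NUL byte, followed by the file contents. The function name is hypothetical.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    before = git_blob_hash(b"hello world\n")
    after = git_blob_hash(b"hello world!\n")
    print(before)
    # A different hash means the content changed.
    print("changed:", before != after)
```

The value printed for "hello world" followed by a newline should match what `git hash-object` reports for a file with the same contents.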
GitHub
  • Cloud platform that uses Git as its core technology.
  • Simplifies the process of collaborating on projects and provides a website, more command-line tools, and overall flow that developers and users can use to work together.
  • Acts as the remote repository mentioned earlier.
  • Key features:
    • Issues
    • Discussions
    • Pull requests
    • Notifications
    • Labels
    • Actions
    • Forks
    • Projects
Very basic commands to start with:
  • git config
    Set who is operating: name and email.
  • git init, git checkout
    Initialize Git and set the active branch.
  • git add
    Start tracking files.
  • git commit
    Commit changes (needs a message).
  • git log
    See data about commits.
  • git help
Projects are iterative. You write some code, test, and patch; others work on the same project; multiple branches are merged; errors appear and ... voilà, problems.
Using Git you can keep all this mess under control.
Helpful:
  • git log
  • git diff
  • .gitignore (file)
    Prevents extraneous files from being submitted to version control. Boilerplate .gitignore files are available for popular programming environments and languages.
Subfolders
Git doesn't consider adding an empty directory to be a change.
To keep empty directories as placeholders, a common convention is to create an empty file, often called .git-keep, in the placeholder directory.

Small changes
Put small changes in the same commit as the original.
git commit --amend --no-edit

Recover a deleted file
git checkout -- <file_name>
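If the file was deleted with git rm, the deletion is also staged, so unstage it first and then check the file out again:
git reset -- <file_name>
git checkout -- <file_name>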

Revert a commit
git revert

Collaborate on project
  • git clone
    Clone a repository
    When Git clones a repository, it creates a reference to the original repo that's called a remote, by using the name origin.
  • git pull
    Copies changes from the remote repository to the local one - only new commits. It already knows the last commit that it got from the remote repository because it saved the list of commits.
    Git pulls or pushes only when you tell it to do so.
  • git request-pull -p origin/main .
    Other users can create a pull request, asking you, as the main developer, to pull their work and commit it into the main branch.
    A pull request gives you a chance to review other collaborators' changes before you incorporate their work into the work you're doing on the project.
  • git remote
    As the project owner, you need to know how to merge pull requests:
    git remote - sets up another developer's repo as a remote
    git pull - uses that remote for pulls and pull requests
    git pull is a combination of two simpler operations: git fetch, which gets the changes, and git merge, which merges them into your repository.
Collaborate by using a shared repo
You MUST set up a central repository, which is also called a bare repository - a repository that doesn't have a working tree.
  • Everybody can push changes without worrying about which branch is checked out (no working tree).
  • Git can detect when another user has pushed changes that might conflict with yours.
  • A shared repo scales to any number of developers.
  • The shared repo is on a server that you all can access - no worrying about firewalls and permissions.
  • No need for separate accounts on the server, because Git keeps track of who made each commit.

Computer vision with Azure Cognitive Services

Study notes

1. Analyze images

Computer Vision
  • Part of artificial intelligence (AI) in which software interprets visual input: images or video feeds.
  • Designed to help you extract information from images:
    • Description and tag generation
      Determining an appropriate caption for an image, and identifying relevant "tags"
    • Object detection
      Detecting the presence and location of specific objects within the image.
    • Face detection
      Detecting the presence, location, and features of human faces in the image.
    • Image metadata, color, and type analysis
      Determining the format and size of an image, its dominant color palette, and whether it contains clip art.
    • Category identification
      Identifying an appropriate categorization for the image, and if it contains any known landmarks.
    • Brand detection
      Detecting the presence of any known brands or logos.
    • Moderation rating
      Determine if the image includes any adult or violent content.
    • Optical character recognition
      Reading text in the image.
    • Smart thumbnail generation
      Identifying the main region of interest in the image to create a smaller "thumbnail"
  • Provision:
    • Single-service resource
    • Computer Vision API in a multi-service Cognitive Services resource.
Analyze an image
Use the Analyze Image REST method or the equivalent method in the SDK (Python, C# etc)
You can use scoped functions to retrieve specific subsets of the image features, including the image description, tags, and objects in the image.
Returns a JSON document containing the requested information.
Sample:
{
"categories": [
{
"name": "_outdoor_mountain",
"confidence": "0.9"}
],
"adult": {"isAdultContent": "false", …},
..
..
}


Generate a smart-cropped thumbnail
Creates a thumbnail with different dimensions (and aspect ratio) from the source image. Optionally, image analysis determines the region of interest in the image (its main subject) and makes that the focus of the thumbnail.

2. Analyze video
Extract info:
  • Facial recognition
  • OCR
  • Speech transcription
  • Topics - key topics discussed in the video.
  • Sentiment analysis
  • Labels - label tags that identify key objects or themes throughout the video.
  • Content moderation
  • Scene segmentation
You can create custom models and train them for:
  • People.
    Add images of the faces of people you want to recognize in videos, and train a model. This may require Limited Access approval, in line with the Microsoft Responsible AI standard.
  • Language.
    Specific terminology that may not be in common usage
  • Brands.
    Train a model to recognize specific names as brands relevant to your business.
  • Animated characters.
    Detect the presence of individual animated characters in a video.
Incorporate the service into custom applications.
  • Video Analyzer for Media widgets
    share insights from specific videos with others without giving them full access to your account in the Video Analyzer for Media portal
  • Video Analyzer for Media API
    REST API that you can subscribe to in order to get a subscription key -> automate video indexing tasks, such as uploading and indexing videos, retrieving insights, and determining endpoints for Video Analyzer widgets.
    Result is in JSON.
3. Classify images
Image classification
Computer vision technique in which a model is trained to predict a class label for an image based on its contents.
  • multiclass classification - multiple classes, each image can belong to only one class.
  • multilabel classification - an image might be associated with multiple labels.
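The difference between the two classification types shows up when a set of per-class scores is turned into a prediction. A minimal sketch with made-up scores; the function names and the 0.5 threshold are assumptions for illustration, not Custom Vision behavior.

```python
def predict_multiclass(scores):
    """Multiclass: return the single label with the highest score."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.5):
    """Multilabel: return every label whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    scores = {"apple": 0.82, "banana": 0.61, "orange": 0.07}
    print(predict_multiclass(scores))
    print(predict_multilabel(scores))
```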
Classic flow for modeling/prediction:
  • Use existing (labeled) images to train a Custom Vision model.
  • Create a client application that allows others to submit new images, for which the model generates predictions.
4. Detect objects in images
Object detection
Computer vision technique in which a model is trained to detect the presence and location of one or more classes of object in an image.
  • Class label of each object detected in the image.
  • Location of each object within the image, indicated as coordinates of a bounding box that encloses the object.
Bounding boxes are defined by four values that represent the left (X) and top (Y) coordinates of the top-left corner of the bounding box, and the width and height of the bounding box. These values are expressed as proportional values relative to the source image size.
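Because the values are proportional, drawing a box requires scaling them by the image size. A minimal sketch; the function name is hypothetical and rounding to whole pixels is a design choice made here.

```python
def to_pixel_box(left, top, width, height, image_width, image_height):
    """Convert a proportional bounding box (values 0..1) to integer pixel coordinates."""
    return (
        round(left * image_width),
        round(top * image_height),
        round(width * image_width),
        round(height * image_height),
    )


if __name__ == "__main__":
    # A box starting a quarter of the way in and halfway down an 800 x 600 image.
    print(to_pixel_box(0.25, 0.5, 0.5, 0.25, 800, 600))
```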

The hardest part is training the model:
  • Add a label to every object in each image by using the interactive UI in the Custom Vision portal.
    Suggestion: train the model as soon as you have relevant images labeled, then use smart labeling - the system prefills the labels and you just confirm or change them.
  • Use labeling tools, for example the one provided in Azure Machine Learning Studio or the Microsoft Visual Object Tagging Tool (VOTT), for team work.
    In this case, you may need to adjust the output to match the measurement units expected by the Custom Vision API.
5. Detect, analyze, and recognize faces
In fact, there are multiple 'actions':
  • Face detection
  • Face analysis
  • Face recognition
You can use:
  • Computer Vision service
    Detects human faces and returns the bounding box around each face and its location (like in object detection).
  • The Face service
    Does what Computer Vision does (bounding box and location), plus:
    • Comprehensive facial feature analysis
      • Head pose
      • Glasses
      • Blur
      • Exposure
      • Noise
      • Occlusion
    • Facial landmark location
    • Face comparison and verification.
    • Facial recognition.
When using this service consider:
  • Data privacy and security
  • Transparency
  • Fairness and inclusiveness
The system has the ability to compare faces anonymously (confirm that the same person is present on two occasions, without the need to know the actual identity of the person).
When you need to positively identify individuals, you can train a facial recognition model using face images:

Training process:
  • Create a Person Group that defines the set of individuals you want to identify.
  • Add a Person to the Person Group for each individual you want to identify.
  • Add detected faces from multiple images to each person, preferably in various poses.
    The IDs of these faces will no longer expire after 24 hours (persisted faces).
  • Train the model.
The trained model is stored in your Face (or Cognitive Services) resource.
It can be used to:
  • Identify individuals in images.
  • Verify the identity of a detected face.
  • Analyze new images to find faces that are similar to a known, persisted face.

Python environment

Study notes

Build a Python app in isolation.

Solution: virtual environment
Self-contained copy of everything needed to run your program. This includes the Python interpreter and any libraries your program needs. By using a virtual environment, you can ensure your program will have access to the correct versions and resources to run correctly.

Steps:
  1. Create a virtual environment that won't affect the rest of the machine.
  2. Step into the virtual environment, where you specify the Python version and libraries that you need.
  3. Develop your program.
1. Create a virtual environment
Create & go to app folder.
Run:
python -m venv env

The result is a folder structure (may vary):
/env
|__/Include
|__/Lib
|_____site-packages
|__/Scripts
Don't put your program files in the env directory. We suggest that you put your files in the src directory or similar.

1.1. Activate the virtual environment
activate
#or
PATH_TO_ACTIVATE/activate

Result
(env) in front of prompt

To deactivate:
deactivate
Result
No more (env) in front of prompt
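Besides looking for (env) in the prompt, a program can check from the inside whether it runs in a virtual environment. A minimal sketch, assuming an environment created with the standard venv module, where `sys.prefix` differs from `sys.base_prefix` while the environment is active; the function name is hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("interpreter prefix:", sys.prefix)
```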

1.2. Install a package
pip install python-dateutil

Once installed you will see
/
|__/env
|__/Lib
|__/dateutil

To see what packages are now installed in your virtual environment, run:
pip freeze
Result:
python-dateutil==2.8.2
...

More ways to install a package:
  • Have a set of files on your machine and install from that source
    cd <to where the package is on your machine>
    python3 -m pip install .
  • Install from a GitHub repository
    python3 -m pip install git+https://github.com/your-repo.git
  • Install from a compressed archive file
    python3 -m pip install package.tar.gz
1.3. Use an installed package
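from datetime import datetime
from dateutil.relativedelta import relativedelta

now = datetime.now()
print(now)
# Plural arguments (months, weeks) are relative offsets; singular hour=10 sets the hour to 10.
now = now + relativedelta(months=1, weeks=1, hour=10)
print(now)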

Work with project files
Distribute project to others for collaboration.

Create a project file
pip freeze > requirements.txt
Result:
Creates a requirements.txt file with all the packages that the program needs.
Create a .gitignore file, and check in your application code and requirements.txt.
src/
requirements.txt
Check in the code to GitHub.
Commit
Publish to GitHub
Check Online
Result - all files from requirements.txt are online (GitHub)

Consume a project
Create a folder
Go into
Get the git URL from GitHub, and run:
git clone https://github.com/corifeanu/Azure.git

We now have the project locally.

Install requirements.
Run:
pip install -r requirements.txt
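To see what a requirements.txt file pins before installing it, the exact-version lines can be read with a few lines of Python. A minimal sketch with a hypothetical helper; it only understands `name==version` (and `name===version`) lines and ignores everything else, which is far less than what pip itself accepts.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        # lstrip("=") also covers the === form used elsewhere in these notes.
        pins[name.strip().lower()] = version.lstrip("=").strip()
    return pins


if __name__ == "__main__":
    text = "python-dateutil==2.8.2\nsix==1.16.0\n# a comment\n"
    print(parse_pins(text))
```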

Update/run your app (what is in src).

Manage dependencies
Keep upgrading your packages:
  • Bug fixes.
    The library that you use might have problems. For example, a feature doesn't work as intended and the author goes in to fix it. You most likely want to upgrade the package as soon as such a fix is in place.
  • Security issues.
    Your package might have a security vulnerability. After such a fix is released, you want to upgrade the package to protect your company and your customers.
  • Additional features.
    The release of more features is nice, though it isn't a strong reason to upgrade the package. Still, if there's a feature that you've been waiting for, you might want to upgrade for that reason
Install the latest version
Specific version
pip install python-dateutil===1.4
Check if version is available
pip install python-dateutil===randomwords
Result:
ERROR: Could not find a version that satisfies the requirement python-dateutil===randomwords (from versions: 1.4, 1.4.1, 1.5, 2.1, 2.2, 2.3, 2.4.0, 2.4.1, 2.4.1.post1, 2.4.2, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.7.0, 2.7.1, 2.7.2, 2.7.3, 2.7.4, 2.7.5, 2.8.0, 2.8.1, 2.8.2)
ERROR: No matching distribution found for python-dateutil===randomwords
Use:
pip freeze
Result
...
python-dateutil==1.4
...

Install last one
pip install python-dateutil===2.8.2

Upgrade
pip install python-dateutil --upgrade

Versioning in plain English:
  • The leftmost number is called Major. If this number increases, many things have changed, and you can no longer assume that methods are named the same or have the same number of arguments as before.
  • The middle number is called Minor. If it changes, a feature has been added.
  • The rightmost number is called Patch. If this number increases, it most likely means that a bug has been corrected.
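The Major / Minor / Patch rule above can be turned into a small check that tells how risky an upgrade is. A minimal sketch with hypothetical function names; it assumes both versions have exactly three numeric parts, so two-part versions such as 1.4 or suffixes such as .post1 are out of scope.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old, new):
    """Classify the change between two versions as major, minor, patch, or none."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    for name, before, after in zip(("major", "minor", "patch"), old_parts, new_parts):
        if before != after:
            return name  # the leftmost part that differs decides the kind
    return "none"


if __name__ == "__main__":
    print(bump_kind("2.8.1", "2.8.2"))
    print(bump_kind("2.7.5", "2.8.0"))
```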
Upgrade specific version to last one
pip install "python-dateutil==2.8.*" --upgrade

Clean up unused packages.

Remove one package only:
pip uninstall python-dateutil

Remove all installed packages, by first writing them to a requirements.txt list and then removing all packages in that list.
pip freeze > requirements.txt
pip uninstall -r requirements.txt -y

Now if you run pip freeze, you see that it contains only the following output:
pip-autoremove==0.10.0

Conversational AI solutions

Study notes

Bot = application with a conversational interface.
Must be:
  • Discoverable (integrate in Teams, website)
  • Intuitive and easy to use
  • Available on the devices and platforms used by target public
  • Solve user problems with minimal use
  • Better experience than alternative.
Responsible AI
  • Respects relevant cultural norms and guards against misuse.
  • Reliable.
  • Treats people fairly.
  • Respect user privacy.
  • Handles data securely.
  • Accepts responsibility for its operation and how it affects people.
Implement a bot solution in Azure:
  • Azure Bot Service - Dedicated cloud service.
  • Bot Framework Service - use the REST API for handling bot activities.
  • Bot Framework SDK - use tools and libraries for end-to-end bot development that abstract the REST interface - use programming languages.
Build a bot - tools:
  • Power Virtual Agents
    Power Virtual Agents (PVA) is built on the Microsoft Power Platform, and enables users to build a chatbot without requiring any code.
  • Framework Composer
    An app for developers to build, test, and publish a bot via an interactive interface.
  • Framework SDK
    Collection of libraries and tools to build, test, publish, and manage conversational bots. The SDK can connect to other AI services, covers end-to-end bot development, and offers the most authoring flexibility.
1. Developing a Bot with the Bot Framework SDK

SDK - Extensive set of tools and libraries that software engineers can use to develop bots. The SDK is available for multiple programming languages, including Microsoft C# (.NET Core), Python, and JavaScript (Node.js)

Bot templates:
Are based on the Bot class defined in the Bot Framework SDK
  • Empty Bot - skeleton.
  • Echo Bot - a simple sample that echoes messages back.
  • Core Bot - comprehensive bot that includes common bot functionality (integration with the Language Understanding service)
Bot logic:
The Bot Framework Service notifies your bot's adapter when an activity occurs in a channel by calling its Process Activity method. The adapter creates a context for the turn and calls the bot's Turn Handler method to invoke the appropriate logic for the activity.

The logic for processing the activity can be implemented in multiple ways. The Bot Framework SDK provides classes that can help you build bots that manage conversations using:
  • Activity handlers
    Event methods that you can override to handle different kinds of activities.
    For simple bots, implement an event-driven conversation model in which the events are triggered by activities: users joining the conversation, message being received.
    Activity occurs - Bot Framework Service calls the bot adapter's Process Activity function, passing the activity details - adapter creates a turn context for the activity and passes it to the bot's turn handler, which calls the individual, event-specific activity handler.
    Activities handled by the ActivityHandler base class:
    • Message received
    • Members joined the conversation
    • Members left the conversation
    • Message reaction received
    • Bot installed
  • Dialogs
    More complex patterns for handling stateful, multi-turn conversations - conversational flows where you need to store state between turns.
    • Component dialogs
      A dialog that can contain other dialogs, defined in its dialog set.
      Typically each step is a prompt dialog, so that the conversational flow consists of gathering input data from the user sequentially. Each step must be completed before passing its output on to the next step.
    • Adaptive dialogs
      Container dialog in which the flow is more flexible, allowing for interruptions, cancellations, and context switches at any point in the conversation.
      There is a recognizer that analyzes natural language input and detects intents, which can be mapped to triggers that change the flow of the conversation.
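The event-driven idea behind activity handlers (a turn handler routes each activity to an event-specific method that you override) can be sketched in plain Python. This is an illustration only and does not use the Bot Framework SDK; the class names, the activity dictionary shape, and the type strings are hypothetical.

```python
class SketchActivityHandler:
    """Illustrative stand-in for an activity handler; not the SDK class."""

    def on_turn(self, activity):
        """Turn handler: route the activity to an event-specific method."""
        handlers = {
            "message": self.on_message,
            "membersAdded": self.on_members_added,
        }
        handler = handlers.get(activity.get("type"), self.on_unrecognized)
        return handler(activity)

    def on_message(self, activity):
        return f"You said: {activity.get('text', '')}"

    def on_members_added(self, activity):
        names = ", ".join(activity.get("members", []))
        return f"Welcome, {names}!"

    def on_unrecognized(self, activity):
        return None  # ignore activity types this sketch does not know


class GreetingBot(SketchActivityHandler):
    """Override only the events this bot cares about."""

    def on_members_added(self, activity):
        return "Hello and welcome!"


if __name__ == "__main__":
    bot = GreetingBot()
    print(bot.on_turn({"type": "membersAdded", "members": ["Ana"]}))
    print(bot.on_turn({"type": "message", "text": "hi"}))
```

Dialogs add what this sketch lacks: state that is kept between turns.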
Deploy Bot
  • Create the Azure resource
  • Register an Azure app
  • Create a bot application service
  • Prepare your bot for deployment
  • Deploy your bot as a web app
  • Test and configure in the Azure portal

2. Create a Bot with the Bot Framework Composer
Bot Framework Composer is a visual designer that lets you quickly and easily build sophisticated conversational bots without writing code

Pros - compared with SDK
  • Visual design - development more accessible.
  • Save time - fewer steps to set up your environment.
  • Visualize Dialogs - easy guide the conversation.
  • Triggers - easily created
  • Enables saving of pieces of data to various scopes to remember things between dialogs or sessions.
  • Test your bot directly inside Composer via embedded Web Chat.
Any bot interaction begins with a main dialog in which the bot welcomes a user and establishes the initial conversation, and then triggers child dialogs.
Dialogs:
  • Have a flexible conversation flow, allowing for interruptions, cancellations, and context switches at any point in the conversation.
  • Consists of:
    • One or more actions - define the flow of message activities in the dialog (sending a message, prompting the user for input, asking a question, etc.)
    • Trigger - invokes the dialog logic for certain conditions or based on the intent detected.
    • Recognizer - interprets user input to determine semantic intent.
  • Has memory in which values are stored as properties.
    Properties can be defined at various scopes
    • user scope (variables that store information for the lifetime of the user session with the bot, such as user.greeted)
    • dialog scope (variables that persist for the lifetime of the dialog, such as dialog.response).
  • Adaptive - has the ability to adapt to any kind of interruption to the conversational flow.
Interruption = when the recognizer identifies input that fires a trigger, signaling a conversational context change.
The ability to handle interruptions is configurable for each user input action, under the Prompt Configurations tab of the action.

Provide good user experience - how you present the bot:
  • Text - a typical interaction
    Do not assume the user knows something. Be very precise in what you ask, and remove any ambiguity.
  • Buttons - presenting the user with buttons from which to select options.
  • Images - enhance the user experience
  • Cards - allow you to present your users
The Bot Framework Composer interface includes a Response Editor that can generate the appropriate language generation code for you, making it easier to create conversational responses.

Questions Answering solution

Study notes

Purpose: create intelligent apps that enable users to ask questions using natural language and receive appropriate answers. A replacement for FAQ sections/publications.
Question Answering and Language Understanding are similar. The output of the first is an answer; the output of the second is an action. Input and flow are exactly the same.
The two services are in fact complementary. You can build comprehensive natural language solutions that combine conversational language understanding models and question answering knowledge bases.
To create a question answering solution use:
  • REST API
  • SDK
Then
  • define knowledge base
  • train
  • publish
To start use Language Studio
  1. Create a Language resource in your Azure subscription.
    • Enable the question answering feature.
    • Create or select an Azure Cognitive Search resource to host the knowledge base index.
  2. In Language Studio, select the Language resource and create a Custom question answering project.
  3. Name the knowledge base.
  4. Add one or more data sources to populate the knowledge base:
    • URLs for web pages containing FAQs.
    • Files containing structured text from which questions and answers can be derived.
    • Pre-defined chit-chat datasets that include common conversational questions and responses in a specified style.
  5. Create the knowledge base and edit question and answer pairs in the portal.
Consider multi-turn conversation (ask more info before providing an answer).
Next:
  1. Test
  2. Deploy
  3. Use the created knowledge base.
    Question and Answer are both in JSON format.
  4. Active learning
    • Implicit feedback
      Service identifies user-provided questions that have multiple, similarly scored matches in the knowledge base.
      These are automatically clustered as alternate phrase suggestions for the possible answers that you can accept or reject in the Suggestions page for your knowledge base in Language Studio
    • Explicit feedback
      When developing a client application, you can control the number of possible question matches returned for the user's input by specifying the top parameter (the number of possible answers to return).
  5. Define synonyms
    With synonyms defined, the question answering service can find an appropriate answer regardless of which term an individual customer uses (for example, booking and reservation).
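Since the question and the answer both travel as JSON, the request body a client sends can be sketched as below. This is a hedged illustration: `top` comes from the notes above, while the `question` field name and the helper function are assumptions that should be checked against the REST reference for your API version. Nothing is sent over the network here.

```python
import json


def build_question_request(question, top=1):
    """Return the JSON body for a question, asking for up to `top` candidate answers."""
    if top < 1:
        raise ValueError("top must be at least 1")
    # "question" is an assumed field name; "top" is the parameter named in the notes.
    return json.dumps({"question": question, "top": top})


if __name__ == "__main__":
    print(build_question_request("How do I cancel a booking?", top=3))
```

Asking for more than one match is what makes explicit feedback possible: the user picks the best answer from the candidates.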
Question answering bot
A bot is a conversational application that enables users to interact using natural language through one or more channels, such as email, web chat, voice messaging, or social media platform such as Microsoft Teams.
Language Studio provides the option to easily create a bot that runs in the Azure Bot Service based on your knowledge base.
Language Studio -> deploy the bot -> use the Create Bot button to create a bot in your Azure subscription.
(You can then edit and customize your bot in the Azure portal)

Language Understanding solution with Cognitive Services

Study notes

A Language Understanding solution uses natural language processing (NLP), which means creating models able to interpret the semantic meaning of written or spoken language.
Language Service provides various features for understanding human language.

Language service features:
  • Language detection - pre-configured
  • Sentiment Analysis and opinion mining - pre-configured
  • Named Entity Recognition (NER) - pre-configured
  • Personally identifying (PII) and health (PHI) information detection - pre-configured
  • Summarization - pre-configured
  • Key phrase extraction - pre-configured
  • Entity linking - pre-configured
  • Text analytics for health - pre-configured
  • Custom text classification - learned
  • Orchestration workflow
  • Question answering - learned
  • entity recognition ?
  • intent recognition ?
  • text classification ?
  • Conversational language understanding - requires a model to be built for prediction - learned
    Core custom feature
    Helps users to build custom natural language understanding models to predict overall intent and extract important information from incoming utterances.
  • Custom named entity recognition (Custom NER) - requires a model to be built for prediction - learned
    Core custom feature

Language service features fall into two categories:
  • Pre-configured features
  • Learned features. Learned features require building and training a model to correctly predict appropriate labels.
A client application can use each feature to better communicate with users, or use them together to provide more insight into what the user is saying, intending, and asking about.

Building model:

Everything starts with creating a Language resource (a Cognitive Services resource works as well)
  • Azure portal
  • New resource -> Language resource
  • Create (you get keys and an endpoint, as usual)
By default, Azure Cognitive Service for Language comes with several pre-built capabilities like sentiment analysis, key phrase extraction, pre-built question answering, etc. Some customizable features below require additional services like Azure Cognitive Search, Blob storage, etc. to be provisioned as well. Select the custom features you want to enable as part of your Language service.

Select resource, give a name and select FREE price tier (5k transactions in 30 days)


Once done, go to Keys and Endpoints

1. Create model via REST API

  1. create your project
  2. import data
  3. train model
  4. deploy model
  5. query model
All steps are done asynchronously, for each:
  • Submit a request to the appropriate URI
  • Send a request to get the status of that job. You will get it in JSON format.
    {
    "jobId":"{JOB-ID}",
    "createdDateTime":"String",
    "lastUpdatedDateTime":"String",
    "expirationDateTime":"String",
    "status":"running"
    }
For each call you must authenticate the request by providing a specific header
Ocp-Apim-Subscription-Key -> language resource key (see above)
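
The submit-then-poll pattern described above can be sketched as follows. This is only an illustration: the names wait_for_job and fetch_status are hypothetical, and the terminal status values are an assumption (the notes only show "running"). In a real client, fetch_status would perform an HTTP GET on the job URL with the Ocp-Apim-Subscription-Key header.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; the notes above only show "running".
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def auth_headers(resource_key: str) -> Dict[str, str]:
    """Header that every request must carry (key from Keys and Endpoint)."""
    return {"Ocp-Apim-Subscription-Key": resource_key}


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_seconds: float = 1.0,
                 max_attempts: int = 30) -> dict:
    """Call fetch_status repeatedly until the job leaves the running state."""
    for _ in range(max_attempts):
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        time.sleep(interval_seconds)
    raise TimeoutError("job did not finish in time")


# Usage with a fake status source standing in for the HTTP GET.
_responses = iter([
    {"jobId": "demo", "status": "running"},
    {"jobId": "demo", "status": "running"},
    {"jobId": "demo", "status": "succeeded"},
])
final = wait_for_job(lambda: next(_responses), interval_seconds=0)
print(final["status"])
```

Injecting the status function keeps the polling logic testable without a live resource.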

2. Create model using Language Studio
  1. Open Language Studio
    Language Studio - Microsoft Azure
  2. Select Directory, subscription and Resource type (language) and Resource (the one you just created above)
  3. From Create New, select what project you wish to create and go through the process.

No matter how you create and train the model, the final step is to query it.
To query your model for a prediction, send a POST request to the appropriate URL with the appropriate body.
Every request you send must be authenticated (via header).
The body is in JSON format.
The result is in JSON format.
Example to detect language:
{
"kind": "LanguageDetection",
"parameters": {
"modelVersion": "latest"
},
"analysisInput":{
"documents":[
{
"id":"1",
"text": "This is a document written in English."
}
]
}
}

Result:
{
"kind": "LanguageDetectionResults",
"results": {
"documents": [{
"id": "1",
"detectedLanguage": {
"name": "English",
"iso6391Name": "en",
"confidenceScore": 1.0
},
"warnings": []
}],
"errors": [],
"modelVersion": "String"
}
}
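
A minimal sketch of building the request above and reading the result, using only the standard library. The URL path /language/:analyze-text and the api-version value are assumptions not stated in these notes, and the helper names are hypothetical. The request object is built but not sent.

```python
import json
import urllib.request


def build_detection_request(endpoint: str, key: str, texts: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for language detection."""
    body = {
        "kind": "LanguageDetection",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [
                {"id": str(i), "text": text}
                for i, text in enumerate(texts, start=1)
            ]
        },
    }
    # Assumed path and API version; check them against your resource.
    url = f"https://{endpoint}/language/:analyze-text?api-version=2022-05-01"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def detected_languages(response: dict) -> dict:
    """Map each document id to (language name, ISO 639-1 code, confidence)."""
    return {
        doc["id"]: (
            doc["detectedLanguage"]["name"],
            doc["detectedLanguage"]["iso6391Name"],
            doc["detectedLanguage"]["confidenceScore"],
        )
        for doc in response["results"]["documents"]
    }


req = build_detection_request(
    "myLanguageService.cognitiveservices.azure.com", "YOUR_KEY",
    ["This is a document written in English."],
)

# Parse the sample result shown above.
sample_result = json.loads("""
{"kind": "LanguageDetectionResults",
 "results": {"documents": [{"id": "1",
   "detectedLanguage": {"name": "English", "iso6391Name": "en", "confidenceScore": 1.0},
   "warnings": []}], "errors": [], "modelVersion": "String"}}
""")
detected = detected_languages(sample_result)
print(detected)
```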


Terms in NLP

  • Utterance
    The user input. Phrase that a user enters when interacting with the model (via an app that uses your Language Understanding model).
  • Intent
    Task or action the user wants to perform (meaning of an utterance).
    It is something you want the model to do; the model has to understand it and 'execute' it.
    There is a "None" intent - the default, used when no action is defined for what you ask.
  • Entity
    Add specific context to intents.

Steps:
  • Define intents model must support.
  • Define all possible utterances for every single intent (everything a user may input to request an intent)
    • Capture multiple different examples, or alternative ways of saying the same thing
    • Vary the length of the utterances from short, to medium, to long
    • Vary the location of the noun or subject of the utterance. Place it at the beginning, the end, or somewhere in between
    • Use correct grammar and incorrect grammar in different utterances to offer good training data examples
    • The precision, consistency and completeness of your labeled data are key factors to determining model performance.
      • Precisely: Label each entity to its right type always. Only include what you want extracted, avoid unnecessary data in your labels.
      • Consistently: The same entity should have the same label across all the utterances.
      • Completely: Label all the instances of the entity in all your utterances.
  • Add specific context to utterances using entities.
    Example: Utterance: What time is it in London? / Intent: GetTime/ Entities: Location (London)
    • Learned entities
      Flexible, use them in most cases. You define a learned component with a suitable name, and then associate words or phrases with it in training utterances.
      When you train your model, it learns to match the appropriate elements in the utterances with the entity.
    • List entities
      Useful when you need an entity with a specific set of possible values - for example, days of the week.
      Ex: DayOfWeek entity that includes the values "Sunday", "Monday", etc
    • Prebuilt entities
      Useful for common types such as numbers, datetimes, and names.
      Ex: when prebuilt components are added, you will automatically detect values such as "6" or organizations such as "Microsoft".
      Let the Language service automatically detect the specified type of entity, and not have to train your model with examples of that entity
      You can have up to five prebuilt components per entity.
Create, Train and deploy Model:
  1. Train it to learn intents and entities from sample utterances.
  2. Test it interactively or using a testing dataset with known labels
  3. Deploy a trained model to a public endpoint (users can use it)
  4. Review predictions and iterate on utterances to train your model
The NLP model predicts the user's intent from the user's input (natural language).
To use the model in an application, you must publish a Language Understanding app for the created model.
Every app needs an endpoint; the endpoint used to query a specific feature varies.
https://{ENDPOINT}/text/analytics/{VERSION}/{FEATURE}
  • {ENDPOINT}
    The endpoint for authenticating your API request. For example, myLanguageService.cognitiveservices.azure.com
  • {VERSION}
    The version number of the service you want to call. For example, v3.0
  • {FEATURE}
    The feature you're submitting the query to. For example, keyPhrases for key phrase detection
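
As a small illustration, the URL template above can be filled in like this. The function name feature_url is hypothetical; the values are the examples given in the list.

```python
def feature_url(endpoint: str, version: str, feature: str) -> str:
    """Fill in https://{ENDPOINT}/text/analytics/{VERSION}/{FEATURE}."""
    return f"https://{endpoint}/text/analytics/{version}/{feature}"


url = feature_url("myLanguageService.cognitiveservices.azure.com", "v3.0", "keyPhrases")
print(url)
# https://myLanguageService.cognitiveservices.azure.com/text/analytics/v3.0/keyPhrases
```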
Pre-configured features
The Language service provides certain features without any model labeling or training.
Once you create your resource, you can send your data and use the returned results within your app. See Language service features (top of this page)

Learned features
Require you to label data, train, and deploy your model to make it available to use in your application. See Language service features (top of this page)

Consume a model via an App (Process predictions)
  • Use REST APIs
  • Use a programming language-specific SDK.
Parameters to be sent:
  • kind
    Indicates language feature you're requesting.
    Ex: Conversation for conversational language understanding, EntityRecognition to detect entities.
  • parameters
    Indicates the values for various input parameters.
    Ex: projectName and deploymentName are required for conversational language understanding
  • analysisInput
    Specifies the input documents or text strings to be analyzed by the Language service.
The result consists of a hierarchy of information, in JSON format, that your application must parse.
The prediction results include the query utterance, the top (most likely) intent, other potential intents with their respective confidence score, and the entities that were detected.
Each entity includes a category and subcategory (when applicable) in addition to its confidence score (for example, "Edinburgh", which was detected as a location with confidence of 1.0).
The results may also include any other intents that were identified as a possible match, and details about the location of each entity in the utterance string.
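
A minimal sketch of parsing such a prediction. The exact field names in the sample (topIntent, intents, entities, offset, length) are an assumption about the response shape, and the sample values are made up for illustration; adjust them to what your deployment actually returns.

```python
# Hypothetical sample response; field names are assumed, values are illustrative.
sample_response = {
    "kind": "ConversationResult",
    "result": {
        "query": "What time is it in Edinburgh?",
        "prediction": {
            "topIntent": "GetTime",
            "intents": [
                {"category": "GetTime", "confidenceScore": 0.98},
                {"category": "None", "confidenceScore": 0.02},
            ],
            "entities": [
                {"category": "Location", "text": "Edinburgh",
                 "offset": 19, "length": 9, "confidenceScore": 1.0},
            ],
        },
    },
}


def summarize_prediction(response: dict) -> dict:
    """Extract the top intent, its confidence, and the detected entities."""
    result = response["result"]
    prediction = result["prediction"]
    top = prediction["topIntent"]
    scores = {i["category"]: i["confidenceScore"] for i in prediction["intents"]}
    entities = [
        {
            "category": e["category"],
            "text": e["text"],
            # Location of the entity inside the utterance string.
            "span": result["query"][e["offset"]:e["offset"] + e["length"]],
            "confidence": e["confidenceScore"],
        }
        for e in prediction["entities"]
    ]
    return {"query": result["query"], "intent": top,
            "confidence": scores[top], "entities": entities}


summary = summarize_prediction(sample_response)
print(summary["intent"], summary["entities"][0]["text"])
```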

The Language Understanding service can be deployed as a container, running in a local Docker host, an Azure Container Instance (ACI), or in an Azure Kubernetes Service (AKS) cluster.
Process:
  1. Container image for the specific Cognitive Services API you want to use is downloaded and deployed to a container host (local Docker server, ACI or AKS)
    Ex. Get image: docker pull mcr.microsoft.com/azure-cognitive-services/textanalytics/language:latest
  2. Client applications submit data to the endpoint provided by the containerized service, and retrieve results just as they would from a Cognitive Services cloud resource in Azure.
    Ex. run container
    docker run --rm -it -p 5000:5000 \
    --memory 4g \
    --cpus 1 \
    mcr.microsoft.com/azure-cognitive-services/textanalytics/language \
    Eula=accept \
    Billing={ENDPOINT} \
    ApiKey={API_KEY}

    Ex. Query service
    http://localhost:5000/
  3. Periodically, usage metrics for the containerized service are sent to a Cognitive Services resource in Azure in order to calculate billing for the service.

Speech processing and translation with Speech Services

Study notes

Speech service

  1. Speech-to-Text
    Core API
    API that enables speech recognition in which your application can accept spoken input.
  2. Text-to-Speech
    Core API
    API that enables speech synthesis in which your application can provide spoken output.
  3. Speech Translation
  4. Speaker Recognition
  5. Intent Recognition
Process:
  • Create Resource (dedicated Speech service or Cognitive Services)
  • Get the resource location and one key (Resource Keys/Endpoint)

1. Speech-to-Text
Processed: Interactive (real time) or batch.
In practice, most interactive speech-enabled applications use the Speech service through a (programming) language-specific SDK
Speech service supports speech recognition via:
  • Speech-to-text API, which is the primary way to perform speech recognition.
  • Speech-to-text Short Audio API, which is optimized for short streams of audio (up to 60 seconds).
Main parameters to configure:
  • SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
  • AudioConfig object (optional) to define the input source for the audio to be transcribed (microphone or audio file)
Result:
  • Reason: RecognizedSpeech, NoMatch or Canceled
  • Text: the transcript (when speech was recognized)
2. Text-to-Speech
Speech service offers two APIs for speech synthesis (spoken output from text):
  • Text-to-speech API, which is the primary way to perform speech synthesis.
  • Text-to-speech Long Audio API, which is designed to support batch operations that convert large volumes of text to audio.
Main parameters to configure:
  • SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
  • AudioConfig object (optional) to define the output device for the speech to be synthesized (default system speaker, null value or audio stream object that is returned directly)
Result:
  • Reason property is set to the SynthesizingAudioCompleted enumeration.
  • AudioData property contains the audio stream.

Config file - SpeechConfig
Speech service supports multiple output formats for the audio stream that is generated by speech synthesis.
Depending on your specific needs, you can choose a format based on the required:
  • Audio file type
  • Sample-rate
  • Bit-depth
SetSpeechSynthesisOutputFormat method (SpeechConfig object) - specify the required output format.
Speech service provides multiple voices that you can use to personalize your speech-enabled applications
  • Standard voices - synthetic voices created from audio samples.
  • Neural voices - more natural sounding voices created using deep neural networks.
SpeechSynthesisVoiceName - specify a voice for speech synthesis.

Speech Synthesis Markup Language
The Speech service accepts two kinds of input:
  • Plain text, which the Speech SDK submits to be synthesized into speech (via SpeakTextAsync() method)
  • Speech Synthesis Markup Language (SSML) - XML-based syntax for describing characteristics of the speech you want to generate.
    • Specify a speaking style (excited, cheerful...)
    • Insert pauses or silence.
    • Specify phonemes (phonetic pronunciations)
    • Adjust the prosody of the voice (affecting the pitch, timbre, and speaking rate).
    • Use common "say-as" rules (phone no, date...)
    • Insert recorded speech or audio (include a standard recorded message)
  • SpeakSsmlAsync() - submit the SSML description to the Speech service.
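
A minimal sketch of an SSML document that uses several of the features listed above (speaking style, prosody, a pause, and a say-as rule). The voice name, style, and phone number are placeholder assumptions, and build_ssml is a hypothetical helper. The resulting string is what you would pass to SpeakSsmlAsync(); here it is only checked for well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(text: str, voice: str = "en-US-AriaNeural",
               style: str = "cheerful", rate: str = "+10%",
               pause_ms: int = 500) -> str:
    """Return an SSML string; voice and style are assumed example values."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        # Speaking style and prosody (speaking rate) for the main text.
        f'<mstts:express-as style="{style}">'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        '</mstts:express-as>'
        # Pause before the closing sentence.
        f'<break time="{pause_ms}ms"/>'
        # say-as rule so the digits are read as a phone number.
        'Call <say-as interpret-as="telephone">555-0100</say-as>.'
        '</voice></speak>'
    )


ssml = build_ssml("Welcome & thank you for calling!")
root = ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed
print(root.tag)
```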

Translate speech
Built on speech recognition:
  • Recognize and transcribe spoken input in a specified language
  • Return translations of the transcription in one or more other languages
Prerequisites:
  • Speech or Cognitive Service resource must be already created.
  • Have the location and one key (of the above service)
Main parameters to configure:
SpeechConfig object - information required to connect to your Speech resource (location, key)
SpeechTranslationConfig object (input language, target languages)

Return if successful:
  • Reason property has the enumerated value RecognizedSpeech
  • Text property contains the transcription in the original language
  • Translations property contains a dictionary of the translations (using the two-character ISO language code, such as "en" for English, as a key).
The TranslationRecognizer returns translated transcriptions of spoken input - essentially translating audible speech to text; the translation still has to be synthesized if you want it spoken out.

Event based synthesis
For 1-to-1 translation (one source language, a single target language), you can use event-based synthesis to capture the translation as an audio stream
  • Specify the desired voice for the translated speech in the TranslationConfig.
  • Create an event handler for the TranslationRecognizer object's Synthesizing event.
  • In the event handler, use the GetAudio() method of the Result parameter (retrieve audio)
Manual synthesis
Doesn't require you to implement an event handler. You can use manual synthesis to generate audio translations for one or more target languages.
  • Use a TranslationRecognizer to translate spoken input into text transcriptions in one or more target languages.
  • Iterate through the Translations dictionary in the result of the translation operation, using a SpeechSynthesizer to synthesize an audio stream for each language.

Text processing and translation with Language Services

Study notes

Language Service

  • Language detection
    Works with documents (max 5120 characters) and single phrases. Max 1000 items (ids) per collection.
    For mixed languages => returns the predominant language.
    If the service cannot determine the language, the score is returned as NaN (Not a Number).
  • Key phrase extraction
    Process of evaluating the text of a document, or documents, and then identifying the main points around the context of the document
    Max 5120 characters. Bigger document, better results.
  • Sentiment analysis
    Quantifying positive / negative "message"
    Return overall document sentiment and individual sentence sentiment for each document submitted to the service.
    Overall document sentiment is based on sentences:
    • All sentences are neutral, the overall sentiment is neutral.
    • Sentence classifications include only positive and neutral, the overall sentiment is positive.
    • Sentence classifications include only negative and neutral, the overall sentiment is negative.
    • Sentence classifications include positive and negative, the overall sentiment is mixed.
  • Named entity recognition
    Recognize entities mentioned in text (people, locations, time periods, organizations, etc)
    • Person
    • Location
    • DateTime
    • Organization
    • Address
    • Email
    • URL
  • Entity linking
    Entities reference links to Wikipedia articles.
    Possibly, the same name might be applicable to more than one entity.
    Wikipedia provides the knowledge base for the Text Analytics service. Specific article links are determined based on entity context within the text.
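
The rules listed under Sentiment analysis above for deriving the overall document sentiment from sentence labels can be sketched as a small function. This only mirrors the four rules as written in these notes; the service applies them itself, and the function name is hypothetical.

```python
def overall_sentiment(sentence_labels: list) -> str:
    """Combine per-sentence labels into a document-level label."""
    labels = set(sentence_labels)
    has_positive = "positive" in labels
    has_negative = "negative" in labels
    if has_positive and has_negative:
        return "mixed"        # positive and negative sentences together
    if has_positive:
        return "positive"     # only positive and neutral sentences
    if has_negative:
        return "negative"     # only negative and neutral sentences
    return "neutral"          # all sentences are neutral


print(overall_sentiment(["neutral", "positive"]))              # positive
print(overall_sentiment(["positive", "negative", "neutral"]))  # mixed
```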
Translator Service

  • Language detection
    Compared with language detection in the Language service, this also returns whether translation and transliteration are supported.
    Return example.
    [
    {
    "isTranslationSupported": true,
    "isTransliterationSupported": true,
    "language": "ja",
    "score": 1.0
    }
    ]
  • Translation and One-to-many translation
    Specify a single from parameter to indicate the source language, and one or more to parameters to specify the languages into which you want the text translated
    [
    {"translations":
    [
    {"text": "Hello", "to": "en"},
    {"text": "Bonjour", "to": "fr"}
    ]
    }
    ]
  • Script transliteration
    Converting text from its native script to an alternative script.
    The language stays the same; only the script changes.

  • Translation options
    • Word alignment - spaces are not always used to separate words.
    • Sentence length - useful to know the length of a translation
    • Profanity filtering
      • NoAction: Profanities are translated along with the rest of the text.
      • Deleted: Profanities are omitted in the translation.
      • Marked: Profanities are indicated using the technique indicated in the profanityMarker parameter (if supplied).
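
A minimal sketch of a one-to-many translate call with the options above, using only the standard library. The global endpoint host and api-version=3.0 are assumptions not stated in these notes, and the helper names are hypothetical. The request body would be a JSON array such as [{"Text": "Hello"}]; nothing is sent here.

```python
from urllib.parse import urlencode

# Assumed global Translator endpoint; a regional or custom endpoint may apply.
TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com"


def build_translate_url(source: str, targets: list,
                        profanity_action: str = "NoAction",
                        include_sentence_length: bool = False) -> str:
    """One 'from' parameter and one 'to' parameter per target language."""
    params = [("api-version", "3.0"), ("from", source)]
    params += [("to", target) for target in targets]
    params.append(("profanityAction", profanity_action))
    if include_sentence_length:
        params.append(("includeSentenceLength", "true"))
    return f"{TRANSLATOR_ENDPOINT}/translate?{urlencode(params)}"


def translations_by_language(response: list) -> dict:
    """Flatten the response into {language code: translated text}."""
    return {
        t["to"]: t["text"]
        for item in response
        for t in item["translations"]
    }


url = build_translate_url("en", ["fr", "it"], profanity_action="Marked")
print(url)

# Parse the sample response shown above.
sample = [{"translations": [{"text": "Hello", "to": "en"},
                            {"text": "Bonjour", "to": "fr"}]}]
by_language = translations_by_language(sample)
print(by_language)
```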
Custom translations
Solution for businesses or industries that have specific vocabularies of terms that require custom translation.
To solve this problem, you can create a custom model that maps your own sets of source and target terms for translation.
Use Azure Custom Translator Portal

The process is similar to the one in Azure ML Designer. To create a custom model:
  1. Create a workspace linked to your Translator resource
  2. Create a project
  3. Upload training data files
  4. Train a model
Your custom model is assigned a unique category Id, which you can specify in translate calls to your Translator resource by using the category parameter, causing translation to be performed by your custom model instead of the default model.

How to use any of above services:
  • Provision a resource (Language or Cognitive service)
  • Call it using Endpoint and Key, submit data using JSON to REST interface or using an SDK.

Monitor Cognitive Service

Study notes

1. Before doing anything, you must estimate the cost.
  • Start here: Pricing Calculator | Microsoft Azure
  • Search and select Azure Cognitive Services
  • Create an estimate.
  • To create an estimate that includes multiple Cognitive Services APIs, add additional Azure Cognitive Services products to the estimate.
2. Then you must monitor the cost (this applies to any service you use)
  • Azure Portal
  • Subscriptions and select one subscription
  • Go to Cost analysis (left)
Everything is easy to understand, and the analysis can go into great detail.

3. Last but most important: create alerts (for any area, not only costs)
Checking from time to time is not an option.
Microsoft Azure provides alerting support for resources through the creation of alert rules.
There are two areas: Alerts & Metrics.

Alerts
  • Azure Portal
  • Select Resource (Cognitive services in this case)
  • Select Alerts (left)
  • Create (top) -> Alert rule
    • Scope is already selected because 'Azure Cognitive services' is already set. You can change it.
    • Condition on which the alert is triggered.
      Any of three selectors can be used to narrow search. Let's look for 'key' and set that 'Regenerating keys' will trigger the alert
    • Optional actions, such as sending an email to an administrator notifying them of the alert, or running an Azure Logic App to address the issue automatically.
    • Alert rule details, such as a name for the alert rule and the resource group in which it should be defined.
      After creation, based on what you have you may get this.

Metrics

  • Azure Portal
  • Select Resource (Cognitive services in this case)
  • Select Metrics (left)
    Selecting Data In / Azure cognitive services:
You can:
  • add multiple metrics and aggregation type (sum, mean etc.)
  • select chart type (bar, area chart, grid)
  • drill into other services/logs
  • create alert.
  • Save to (Pin to dashboard: private or shared)

4. Diagnostic logging enables you to capture rich operational data for a Cognitive Services resource, which can be used to analyze service usage and troubleshoot problems.
Create a destination for the log data (where to store data coming from logs)
You can use Azure Event Hub as a destination in order to then forward the data on to a custom telemetry solution, and you can connect directly to some third-party solutions.
Whatever you choose, you need one or both of the following resources (create them before configuring diagnostic logging):
  • Azure Log Analytics - a service that enables you to query and visualize log data within the Azure portal.
  • Azure Storage - a cloud-based data store that you can use to store log archives (which can be exported for analysis in other tools as needed).

Secure cognitive services in Azure

Study notes

By default, access to cognitive services resources is restricted by using subscription keys. Management of access to these keys is a primary consideration for security.
You should regenerate keys regularly to protect against the risk of keys being shared with or accessed by unauthorized users. You can regenerate keys by using the visual interface in the Azure portal, or by using the az cognitiveservices account keys regenerate Azure command-line interface (CLI) command.

Azure Key Vault
Keys and other secrets must not be kept in configuration files. Always use Azure Key Vault.
Access to the key vault is granted to security principals, which you can think of as user identities that are authenticated using Azure AD.
  • Administrators assign a security principal to an application (in which case it is known as a service principal) to define a managed identity for the application.
  • The application can then use this identity to access the key vault and retrieve a secret to which it has access.
Token-based authentication
Some REST interfaces support, or even require, token-based authentication.
  • You should have a subscription key (initial request)
  • Obtain an authentication token (valid 10 minutes)
  • Present the token to validate that the caller has been authenticated.
Azure Active Directory authentication
Some Cognitive Services support Azure Active Directory authentication, enabling you to grant access to specific service principals or managed identities for apps and services running in Azure.

Network Security (network access restrictions)
  • In Azure portal
  • Cognitive service - you have created
  • Networking
  • Set restrictions
    Example:


Secure key access with Azure Key Vault

You can develop applications that consume cognitive services by using a key for authentication.
This means that the application code must be able to obtain the key.
  • Store the key in an environment variable or a configuration file where the application is deployed (not recommended).
  • Store the key securely in Azure Key Vault, and provide access to the key through a managed identity (a user account used by the application itself).

Steps:
Create Resource: Key Vault and store Cognitive Service Key1 in it; the name must be Cognitive-Services-Key
Create a Service Principal (we need resource name and subscription Id)
Find its (Service Principal) Object Id (we need the Service Principal appId)
Grant the Service Principal access to the secret (which is the Key1 of the Cognitive service; we need the Key Vault name, the Service Principal Object Id and possibly the Resource Group name)

Now we're ready to use the service principal identity in an application,
It can access the secret cognitive services key in your key vault and use it to connect to your cognitive services resource.

AI - Cognitive - basic

Study notes

AI capabilities:
  • Visual perception
    Use computer vision capabilities to accept, interpret, and process input from images, video streams, and live cameras.
  • Text analysis
    Use natural language processing (NLP) to read & extract semantic meaning from text-based data.
  • Speech
    Recognize speech as input and synthesize spoken output. Speech capabilities together with analysis of text enable a form of human-computer interaction that has become known as conversational AI.
  • Decision making
    Use past experience and learned correlations to assess situations and take appropriate actions.
Terms:
  • Data science
    Discipline that focuses on the processing and analysis of data; applying statistical techniques to uncover and visualize relationships and patterns in the data, and defining experimental models that help explore those patterns.
  • Machine learning
    Subset of data science that deals with the training and validation of predictive models. Used by Data Scientist to predict values for unknown labels.
  • Artificial intelligence
    Most common is built on machine learning, can be a software that emulates one or more characteristics of human intelligence.
  • Azure Cognitive Services
    Cloud-based services that encapsulate AI capabilities.
Azure Cognitive Services capabilities:
  • Language
    • Text analysis
    • Question answering
    • Language understanding
    • Translation
  • Speech
    • Speech to Text
    • Text to Speech
    • Speech Translation
    • Speaker Recognition
  • Vision
    • Image analysis
    • Video analysis
    • Image classification
    • Object detection
    • Facial analysis
    • Optical character recognition
  • Decision
    • Anomaly detection
    • Content moderation
    • Content personalization
You can use:
  • Multi-service resource
    Cognitive services - single resource that enable Language, Computer Vision, Speech, etc.
  • Single service resource
    Each service must be provisioned separately. See list below.
Azure Cognitive services - Single service resource:
  • Language
    • Language
    • Translator
  • Speech
    • Speech
  • Vision
    • Computer Vision
    • Custom Vision
    • Face
  • Decision
    • Anomaly detection
    • Content moderation
    • Personalizer

Azure out-of-box solutions.
  • Azure Form Recognizer
    OCR solution
  • Azure Metrics Advisor
    Real-time monitoring and response to critical metrics (IoT).
  • Azure Video Analyzer for Media
    Video analysis solution.
  • Azure Immersive Reader
    Reading support for people of all ages and abilities.
  • Azure Bot Service
    Deliver conversational AI solutions.
  • Azure Cognitive Search
    Extract insights from data and documents.

AI-based services rely on trained models (models that found relations between features and labels and can predict unknown labels).
Any prediction is in fact a probability and has an associated confidence score.

When predictions affect people (most of them do), ethical considerations must be enforced.
Responsible AI:
  • Fairness
    Treat all people fairly (banks, insurance)
  • Reliability and safety
    Ex: all industry automations.
  • Privacy and security
  • Inclusiveness
    When training models make sure you include subject from all social, demographic, etc. categories.
  • Transparency
    Easy to understand what system is doing.
  • Accountability
    If there are problems, not the AI but the people who train and test the model are responsible.
Roles:
  • Data Scientist
    • Ingest and prepare data.
    • Run experiments to explore data and train predictive models.
    • Deploy and manage trained models as web services.
  • AI Engineers
    Integrates AI capabilities into applications and services consumed by end users.
    • Use Azure ML designer to train machine learning models and deploy them as REST services that can be integrated into AI-enabled applications.
    • Collaborate with data scientists to deploy models based on common frameworks.
    • Use Azure ML SDKs or CLI scripts to orchestrate DevOps processes that manage versioning, deployment, and testing of machine learning models as part of an overall application delivery solution.

Cognitive service in containers

Study notes

Containers enable you to host Azure Cognitive Services either on-premises or on Azure.
Azure Cognitive Services is provided as a cloud service.
Some Cognitive Services can be deployed in a container, which encapsulates the necessary runtime components, and which is in turn deployed in a container host that provides the underlying operating system and hardware.

A container comprises an application or service and the runtime components needed to run it.
Containers are portable across hosts.
To use a container, you typically pull the container image from a registry and deploy it to a container host, specifying any required configuration settings.
  • A Docker server.
  • An Azure Container Instance (ACI).
  • An Azure Kubernetes Service (AKS) cluster.
Deploy and use a Cognitive Services container:
  • Container image for the specific Cognitive Services API you want to use is downloaded and deployed to a container host (ex: local Docker server, ACI, AKS).
  • Client applications submit data to the endpoint provided by the containerized service, and retrieve results (as they would from a Cognitive Services cloud resource in Azure).
  • Periodically, usage metrics for the containerized service are sent to a Cognitive Services resource in Azure in order to calculate billing for the service.
Each container provides a subset of Cognitive Services functionality.
  • Key Phrase Extraction
    mcr.microsoft.com/azure-cognitive-services/keyphrase
  • Language Detection
    mcr.microsoft.com/azure-cognitive-services/language
  • Sentiment Analysis v3 (English)
    mcr.microsoft.com/azure-cognitive-services/sentiment:3.0-en
To deploy a Cognitive Services container image to a host, you must specify:
  • ApiKey
    Key from your deployed Azure Cognitive Service; used for billing.
  • Billing
    Endpoint URI from your deployed Azure Cognitive Service; used for billing.
  • Eula
    Value of accept to state you accept the license for the container.
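
As an illustration, the three required settings can be assembled into a docker run command like this. The helper name docker_run_command and the port, memory, and CPU defaults are assumptions; the image name is one of those listed above.

```python
import shlex


def docker_run_command(image: str, billing_endpoint: str, api_key: str,
                       host_port: int = 5000, memory: str = "4g",
                       cpus: int = 1) -> list:
    """Build the argument list for running a Cognitive Services container."""
    return [
        "docker", "run", "--rm", "-it",
        "-p", f"{host_port}:5000",
        "--memory", memory,
        "--cpus", str(cpus),
        image,
        "Eula=accept",                    # accept the container license
        f"Billing={billing_endpoint}",    # endpoint URI, used for billing
        f"ApiKey={api_key}",              # resource key, used for billing
    ]


cmd = docker_run_command(
    "mcr.microsoft.com/azure-cognitive-services/keyphrase",
    "{ENDPOINT}", "{API_KEY}",
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to pass to subprocess.run without shell quoting issues.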

Python development environment for Azure Machine Learning

Set up a Python development environment for Azure Machine Learning

Local and Data Science Virtual Machine (DSVM)
# Create Resource group and Workspace
from azureml.core import Workspace
ws = Workspace.create(name='vscodeml-ws',
subscription_id='SUBSCRIPTION_ID',
resource_group='vscodeml-rg',
create_resource_group=True,
location='eastus2'
)

# Write the config.json file (config file for the environment)
# A folder .azureml will be created and the config.json file will be created in it
ws.write_config(path="./", file_name="config.json")

You can do it manually
Create folder
.azureml

Create file
.azureml/config.json
{
"subscription_id": "SUBSCRIPTION_ID",
"resource_group": "vscode-ml-rg",
"workspace_name": "vscode-ml-ws"
}
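
The manual steps above can also be scripted with the standard library only. This is a sketch: the function name write_workspace_config is hypothetical, and a temporary directory stands in for your project folder.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(project_dir: str, subscription_id: str,
                           resource_group: str, workspace_name: str) -> Path:
    """Create .azureml/config.json under project_dir and return its path."""
    config_dir = Path(project_dir) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config_path


# Usage: a temporary directory stands in for the project folder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_config(tmp, "SUBSCRIPTION_ID", "vscode-ml-rg", "vscode-ml-ws")
    loaded = json.loads(path.read_text(encoding="utf-8"))
    relative = path.relative_to(tmp).as_posix()

print(relative, loaded["workspace_name"])
```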


Or from Azure ML Studio
Download config file.
# Finally load the workspace (interactive login)
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
# print details
ws.get_details()
# or just print a short message
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

You can load a specific config.json file

ws = Workspace.from_config(path="my/path/config.json")

You can load a workspace from the current run context.
Applicable if you are running on a compute cluster.
# not used in production yet
from azureml.core import Run
ws = Run.get_context().experiment.workspace




References:
Set up Python development environment - Azure Machine Learning | Microsoft Learn
azureml.core.run.Run class - Azure Machine Learning Python | Microsoft Learn


Visual Studio code in Azure ML and Git

Study notes

VS Code is a great tool to create and maintain applications, but it is also a great tool to manage Azure cloud resources, fully integrated with GitHub.
It has a massive collection of extensions.
Last but very important: you can manage Kubernetes clusters from Docker and Azure. That is the cherry on top so far, because any deep learning experiment can be developed, tested and debugged locally.

PowerShell commands history
C:\Users\USER_NAME\AppData\Roaming\Microsoft\Windows\PowerShell\PSReadline

How to use it in Azure Machine learning - experiments (Data science)?

Requirements:
Install:
  • Visual studio code
  • Visual studio code extensions - mandatory:
    • Python
    • Jupyter
    • Azure Machine Learning
  • Visual studio code extensions - useful:
    • Polyglot Notebooks - https://marketplace.visualstudio.com/items?itemName=ms-dotnettools.dotnet-interactive-vscode
    • Remote SSH

Note before reading further.
If you have a fresh install of VS Code with only one Python kernel available (installed through VS Code), most things may work, except torch and the Azure SDK.
Otherwise:
Install Anaconda
Add all packages you need via conda (Anaconda Navigator is really nice; it needs a good computer)
In VS Code, select the Python interpreter from the conda environment
You will have all you ever need for any experiment and no problems with dependencies and package updates.
In images (Docker or ACI) keep it simple: one Python version and only the packages you need, via a YAML file.


Upgrade pip (pip is the package installer for Python).
# In terminal run
python -m pip install -U pip

If pip is not installed:
# run in terminal or command windows prompt (as administrator)
# replace 19.1.1 with the last stable version, check https://pypi.org/project/pip/

python -m pip install downloads/pip-19.1.1-py2.py3-none-any.whl

#or
python -m pip install downloads/pip-19.1.1.tar.gz

#or
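# download the wheel with wget (keep the wheel file name, pip requires it), then install it
wget https://files.pythonhosted.org/packages/cb/28/91f26bd088ce8e22169032100d4260614fc3da435025ff389ef1d396a433/pip-20.2.4-py2.py3-none-any.whl -O ~/pip-20.2.4-py2.py3-none-any.whl
python -m pip install ~/pip-20.2.4-py2.py3-none-any.whl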


Install basic python libraries used in Data Science (Azure ML)
pip install pandas
pip install numpy
pip install matplotlib
pip install scipy
pip install scikit-learn

Work with deep learning:
pip install torchvision
# if that does not work, you may try
pip3 install torchvision
# or pin the versions explicitly (tricky stuff with versions)
# !pip install torch==1.9.0+cpu torchvision==0.10.0+cpu torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Just in case there are problems try this:
pip install --upgrade setuptools

Install python SDK
pip install azureml-core
pip install azure-ai-ml

Check:
pip list


Create your first Jupyter notebook.
Create a new file and give it the extension .ipynb
or
use the Command Palette (Ctrl+Shift+P) and run:
Create: New Jupyter Notebook

Write a test (Python code) and run it.



To get data from Git or any other place, you need the wget utility.
On Linux you already have it.

On Windows and macOS:
Download it from https://www.gnu.org/software/wget/
Copy it (wget.exe) to where you need it (do not leave it in the Downloads folder).
For example C:\Wget or C:\Program Files\Wget (you must create the Wget folder).

Add C:\Wget to the PATH environment variable.
There are plenty of tutorials on the net; here are two:
Windows - Command prompt: Windows CMD: PATH Variable - Add To PATH - Echo PATH - ShellHacks
Windows UI: How to Add to Windows PATH Environment Variable (helpdeskgeek.com)

Git / GitHub

Git is part of the toolkit of anyone who runs experiments.
VS Code now fully integrates Git in its UI.

Install the Git extension.

Basic operation:

1. Stage changes -> Click on + (to the right of the added/changed/deleted file, or on the top line (Changes)).
Write the commit message and click "Commit".

2. Create Branch - 3 dots (very top)
Example "My New Branch"

3. Merge a branch to master:
- Click on 3 dots (very top)
- Select [Checkout to] and then the branch TO - master in this case
- Again 3 dots
- Click on [Branch] ->[Merge Branch] and then branch FROM - My New Branch, in this case
- Again 3 dots
- Select [Pull, Push] -> [Push]


Resources:
Working with Jupyter Notebooks in Visual Studio Code
Doing Data Science in Visual Studio Code
torch.nn — PyTorch 1.13 documentation




Machine Learning terms

Study notes

Data exploration and analysis
It is an iterative process: analyse data and test hypotheses.
  • Collect and clean data
  • Apply statistical techniques to better understand data.
  • Visualise data and determine relations.
  • Check hypotheses and repeat the process.

Statistics
Science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.
It is fundamentally about taking samples of data and using probability functions to extrapolate information about the full population of data.

Statistic samples
Data we have "in hand", available to be analysed.

Statistics population
All possible data we could collect (theoretical).
We may wish to have data from the whole population, but that may not be possible within the timeframe and available resources. However, we must estimate labels with the sample we have.
Having enough samples, we can calculate the Probability Density Function.

Probability Density Function
Estimate distribution of labels for the full population

Ensemble algorithm
Works by combining multiple base estimators to produce an optimal model.

Bagging (ensemble algorithm)
Technique used when training ML models - regression.
Combines multiple base estimators to produce an optimal model by applying an aggregate function to the base collection.

Boosting (ensemble algorithm)
Technique used when training ML models - regression.
Creates a sequence of models that build on one another to improve predictive performance.


Jupyter notebook
Popular way to run basic scripts in a web browser (with a hosted notebook, no local Python installation is needed).

NumPy
Python library that gives functionality comparable with tools like MATLAB and R.
Simplify analyzing and manipulating data.

Matplotlib
Provides attractive data visualizations

Pandas
Python library for data analysis and manipulation (excel for Python) - easy to use functionality for data tables.
Simplify analyzing and manipulating data.
Include basic functionality for visualization (graphs)

TensorFlow
Open-source, end-to-end platform for machine learning.
Software library for machine learning and artificial intelligence, with a focus on training and inference of deep neural networks.
Supplies machine learning and deep learning capabilities.

Scikit-learn
Offers simple and effective predictive data analysis.

predict()
Predicts the actual class.

predict_proba()
Predicts the class probabilities.

DataFrame
Data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.
One of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

pandas.DataFrame.to_dict()
Convert the DataFrame to a dictionary.
The type of the key-value pairs can be customized with the orient parameter, which determines the type of the values of the dictionary.
  • ‘dict’ (default) :
    dict like {column -> {index -> value}}
  • ‘list’ :
    dict like {column -> [values]}
  • ‘series’ :
    dict like {column -> Series(values)}
  • ‘split’ :
    dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
  • ‘tight’ :
    dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}
  • ‘records’ :
    list like [{column -> value}, … , {column -> value}]
  • ‘index’ :
    dict like {index -> {column -> value}}
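A small sketch of a few of the orient values listed above. The two-row DataFrame is made up for illustration only; it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical two-row table, used only to show the shapes of the results.
df = pd.DataFrame(
    {"area": [11.55, 15.38], "perimeter": [13.10, 14.77]},
    index=[171, 58],
)

as_dict = df.to_dict()                     # default orient="dict": {column -> {index -> value}}
as_list = df.to_dict(orient="list")        # {column -> [values]}
as_records = df.to_dict(orient="records")  # [{column -> value}, ...]
as_index = df.to_dict(orient="index")      # {index -> {column -> value}}

print(as_list)
print(as_records)
```

The 'records' form is handy when each row has to be sent somewhere one at a time, while 'list' keeps the data column-oriented.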

Percentile
A number that describes the value below which a given percent of the values fall.

Quantile
A cut point, or line of division, that splits a probability distribution into contiguous intervals with equal probabilities.
Can be used to eliminate all values that fall below a specific percentile.
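A minimal sketch of quantiles as cut points, using only the standard library. The data values are invented, and the 'inclusive' method is just one of the available interpolation choices.

```python
import statistics

# Hypothetical sample, already sorted for readability.
values = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# n=4 asks for quartiles: three cut points that split the data into four equal parts.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Eliminate all values that fall below the 25th percentile (the first quartile).
kept = [v for v in values if v >= q1]

print(q1, median, q3)
print(kept)
```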

Probability distribution
A function that accepts as input elements of some specific set x∈X, and produces as output, real-valued numbers between 0 and 1.
A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range.
This range will be bound by the minimum and maximum possible values, but where the possible value would be plotted on the probability distribution will be determined by a number of factors. The mean (average), standard deviation, skewness, and kurtosis of the distribution are among these factors.
https://www.simplilearn.com/tutorials/statistics-tutorial/what-is-probability-distribution

Normalize data
Process data so values retain their proportional distribution, but are measured on the same scale

Proportional distribution
Distribute values across groups while taking other factors into consideration.
Example: Department intends to distribute funds for employment services across all areas of the state taking into consideration population distribution and client needs.

Correlation measurement
Quantifies the relationship between columns.
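One common way to quantify such a relationship is the Pearson correlation coefficient. That choice is an assumption here (the note does not name a specific measure), and the columns below are made up.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long numeric columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical columns: one rises with hours, the other falls.
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]
score_down = [10, 8, 6, 4, 2]

print(pearson_correlation(hours, score_up))    # close to +1: strong positive relationship
print(pearson_correlation(hours, score_down))  # close to -1: strong negative relationship
```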

Outlier
A data point that is noticeably different from the rest

Regression

Where models predict a number.
Regression works by establishing a relationship between variables in the data that represent characteristics (known as the features) of the thing being observed, and the variable we're trying to predict (known as the label).
Supervised machine learning techniques involve training a model to operate on a set of features (x1, x2...xn) and predict a label (y) using a dataset that includes some already-known label values.
Mathematical approach to find the relationship between two or more variables.
https://learn.microsoft.com/en-us/training/modules/train-evaluate-regression-models/2-what-is-regression

Linear regression
Simplest form of regression, with no limit to the number of features used.
Comes in many forms - often named by the number of features used and the shape of the curve that fits.
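A minimal sketch of linear regression with a single feature, fitted with the closed-form least squares formulas. The data points are invented so that the line is exact.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical feature (x) and label (y) values.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predict the label for an unseen feature value

print(slope, intercept, prediction)
```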

Decision trees
Take a step-by-step approach to predicting a variable.
In a bicycle rental example, the decision tree may first split the examples between those in Spring/Summer and those in Autumn/Winter, then make a prediction based on the day of the week. Spring/Summer-Monday may have a bike rental rate of 100 per day, while Autumn/Winter-Monday may have a rental rate of 20 per day.

Ensemble algorithms
Construct a large number of trees - allowing better predictions on more complex data.
Ensemble algorithms, such as Random Forest, are widely used in machine learning and science due to their strong prediction abilities.

Hyperparameters
For real-life scenarios with complex models and big datasets, a model must be fit repeatedly (train, compare, adjust, train and so on...).
Hyperparameters are values that change the way the model is fit during these loops.
Hyperparameter example: learning rate = sets how much a model is adjusted every cycle.
learning_rate - hyperparameter of the GradientBoostingRegressor estimator.
n_estimators - hyperparameter of the GradientBoostingRegressor estimator.
Type:
  • Discrete hyperparameters (select discrete values from continuous distributions)
    • qNormal distribution
    • qUniform distribution
    • qLognormal distribution
    • qLogUniform distribution
  • Continuous hyperparameters
    • Normal distribution
    • Uniform distribution
    • Lognormal distribution
    • LogUniform distribution

Normal distribution
Normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable

Uniform distribution
Continuous uniform distribution or rectangular distribution is a family of symmetric probability distributions. The distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds.


Lognormal distribution
Continuous probability distribution that models right-skewed data.
The lognormal distribution is related to logs and the normal distribution.

LogUniform distribution
Continuous probability distribution. It is characterised by its probability density function, within the support of the distribution, being proportional to the reciprocal of the variable.
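A rough sketch of how continuous and discrete (q-prefixed) hyperparameter values could be sampled from the distributions above. The helper names, ranges and seed are assumptions for illustration, not the API of any particular library.

```python
import math
import random

rng = random.Random(0)  # fixed seed so repeated runs give the same samples

def loguniform(low, high):
    # Uniform in log space, so small values are sampled as often as large ones.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def quniform(low, high, q):
    # Round a uniform sample to the nearest multiple of q, giving discrete values.
    return round(rng.uniform(low, high) / q) * q

# Hypothetical search space for the two hyperparameters named above.
learning_rate = loguniform(0.001, 1.0)     # continuous hyperparameter
n_estimators = int(quniform(50, 500, 50))  # discrete hyperparameter: 50, 100, ..., 500

print(learning_rate, n_estimators)
```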

Preprocess the Data
Perform some preprocessing of the data to make it easier for the algorithm to fit a model to it.

Scaling numeric features
Normalizing numeric features so they're on the same scale prevents features with large values from producing coefficients that disproportionately affect the predictions.
Bring all feature values between 0 and 1, e.g. 3 => 0.3, 480 => 0.48, 65 => 0.65

Encoding categorical variables
Convert categorical features into numeric representations.
S, M, L => 0, 1, 2
By using a one-hot encoding technique you can create individual binary (true/false) features for each possible category value.

One-hot encoding categorical variables
S M L
1 0 0
0 1 0
0 0 1
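The two encodings above can be sketched in plain Python. The function name is hypothetical; in practice a library encoder would normally be used.

```python
def one_hot(values, categories):
    """Return one binary feature per category for every value."""
    return [[1 if value == category else 0 for category in categories] for value in values]

sizes = ["S", "M", "L"]

# Ordinal encoding: S, M, L => 0, 1, 2
ordinal = {category: position for position, category in enumerate(sizes)}

# One-hot encoding: one column per category
encoded = one_hot(["S", "M", "L"], sizes)

print(ordinal)
for row in encoded:
    print(row)
```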

Classification
Form of machine learning in which you train a model to predict which category an item belongs to

Binary classification
Classification with two categories.

Regularization
Technique that reduces error from a model by avoiding overfitting and training the model to function properly.
Helps us control our model capacity, ensuring that our models are better at making (correct) classifications on data points that they were not trained on, which we call the ability to generalize.

Threshold
A threshold value of 0.5 is used to decide whether the predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) <= 0.5).
You can use the predict_proba method to see the probability pairs for each case
If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix.
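A small sketch of how the threshold turns probabilities into labels. The probability pairs are invented, shaped like the output of predict_proba.

```python
# Hypothetical [P(y=0), P(y=1)] pairs, one per case.
probabilities = [[0.82, 0.18], [0.40, 0.60], [0.55, 0.45], [0.11, 0.89]]

def apply_threshold(probabilities, threshold=0.5):
    """Label a case 1 when P(y=1) is above the threshold, otherwise 0."""
    return [1 if p_one > threshold else 0 for _, p_one in probabilities]

default_labels = apply_threshold(probabilities)
lower_threshold_labels = apply_threshold(probabilities, threshold=0.4)

print(default_labels)
print(lower_threshold_labels)
```

Lowering the threshold flips the third case from 0 to 1, which is why the confusion matrix metrics change with it.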

pipeline

Used extensively in machine learning, often to mean very different things.
1. Allows you to define a set of preprocessing steps that end with an algorithm.
Then fit the entire pipeline to the data => the model encapsulates all the preprocessing steps and the (regression) algorithm.


Classification algorithms
Logistic regression: a linear algorithm.
Support Vector Machine algorithms: Algorithms that define a hyperplane that separates classes.
Tree-based algorithms: Algorithms that build a decision tree to reach a prediction.
Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability (e.g. Random Forest).

Multiclass classification
Combination of multiple binary classifiers

One vs Rest (OVR)
Multiclass classification classifier.
A classifier is created for each possible class value, with a positive outcome for cases where the prediction is this class, and negative predictions for cases where the prediction is any other class
Ex:
square or not
circle or not
triangle or not
hexagon or not

One vs One (OVO)
Multiclass classification classifier
A classifier is created for each possible pair of classes. The classification problem with four shape classes would require the following binary classifiers:
square or circle
square or triangle
square or hexagon
circle or triangle
circle or hexagon
triangle or hexagon
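The number of binary classifiers each strategy needs can be sketched by enumerating them for the four shape classes above.

```python
from itertools import combinations

classes = ["square", "circle", "triangle", "hexagon"]

# One vs Rest: one binary classifier per class.
ovr_classifiers = [f"{shape} or not" for shape in classes]

# One vs One: one binary classifier per pair of classes, n * (n - 1) / 2 in total.
ovo_classifiers = [f"{first} or {second}" for first, second in combinations(classes, 2)]

print(len(ovr_classifiers), ovr_classifiers)
print(len(ovo_classifiers), ovo_classifiers)
```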

predict_proba
Returns the probabilities of each classification label.
Example:
Have a trained classification model.
You may run confusion_matrix(y_test, predictions) and check the result.
Run y_score = model.predict_proba(X_test)
We get the probabilities of 0 and 1 for every record in X_test
[[0.81651727 0.18348273]
[0.96298333 0.03701667]
[0.80862083 0.19137917]
...
[0.60688422 0.39311578]
[0.10672996 0.89327004]
[0.63865894 0.36134106]]


Stratification technique
Used (example in classification) when splitting the data to maintain the proportion of each label value in the training and validation datasets.

Clustering
Clustering is a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features
Process of grouping objects with similar objects.
‘Unsupervised’ method, where ‘training’ is done without labels.

MinMaxScaler
Normalize the numeric features so they're on the same scale.

  #   area   perimeter  compactness  kernel_length  kernel_width  asymmetry_coefficient
171   11.55  13.10      0.8455       5.167          2.845         6.715
 58   15.38  14.77      0.8857       5.662          3.419         1.999

scaled_features = MinMaxScaler().fit_transform(features[data.columns[0:6]])
Result:
array([[0.44098206, 0.50206612, 0.5707804 , 0.48648649, 0.48610121,
        0.18930164],
       [0.40509915, 0.44628099, 0.66243194, 0.36880631, 0.50106914,
        0.03288302],
       ...])

fit(data)
Computes the statistics needed for scaling a given feature (the minimum and maximum for MinMaxScaler; the mean and standard deviation for StandardScaler).

transform(data)
Performs the scaling using the statistics calculated by the .fit() method.

fit_transform()
Does both the fit and the transform.
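To make the split between fit, transform and fit_transform concrete, here is a hypothetical single-feature stand-in for MinMaxScaler. It is a sketch of the idea, not the scikit-learn implementation, and the area values are invented.

```python
class SimpleMinMaxScaler:
    """Hypothetical single-feature min-max scaler (assumes at least two distinct values)."""

    def fit(self, values):
        # Learn the statistics needed for scaling.
        self.data_min_ = min(values)
        self.data_max_ = max(values)
        return self

    def transform(self, values):
        # Apply the scaling with the statistics learned by fit().
        span = self.data_max_ - self.data_min_
        return [(value - self.data_min_) / span for value in values]

    def fit_transform(self, values):
        # Do both steps in one call.
        return self.fit(values).transform(values)

areas = [10.0, 12.5, 15.0, 20.0]
scaler = SimpleMinMaxScaler()
scaled = scaler.fit_transform(areas)

# New data is scaled with the statistics learned from the training data.
scaled_new = scaler.transform([17.5])

print(scaled, scaled_new)
```

Keeping fit and transform separate is what lets the same scaling be reused on validation or new data.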

Principal Component Analysis (PCA)
Analyze the relationships between the features and summarize each observation as coordinates for two principal components
Translate the N-dimensional feature values into two-dimensional coordinates.

Within-cluster sum of squares (WCSS)
Metric often used to measure how tightly the data points in a cluster are grouped.
Lower values mean that the data points are closer together.

k-means clustering algorithm
Iterative algorithm that tries to partition the dataset into K pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group.
The way the k-means algorithm works is as follows:
• The feature values are vectorized to define n-dimensional coordinates (where n is the number of features). In the flower example, we have two features (number of petals and number of leaves), so the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space.
• You decide how many clusters you want to use to group the flowers, and call this value k. For example, to create three clusters, you would use a k value of 3. Then k points are plotted at random coordinates. These points will ultimately be the center points for each cluster, so they're referred to as centroids.
• Each data point (in this case flower) is assigned to its nearest centroid.
• Each centroid is moved to the center of the data points assigned to it based on the mean distance between the points.
• After moving the centroid, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
• The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a pre-determined maximum number of iterations is reached.

KMeans.inertia_
Sum of squared errors (SSE).
Calculates the sum of the squared distances of all points within a cluster from the centroid of that cluster. Here the "error" is the difference between an observed point and the centroid it is assigned to.
The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Inertia can be recognized as a measure of how internally coherent clusters are.
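The k-means steps and the inertia value described above can be sketched in plain Python. As an assumption, the initial centroids are picked from the data points themselves rather than at random coordinates, and the six points are invented.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen data points
    assignments = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # clusters are stable
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Inertia: sum of squared distances from each point to its own centroid.
    inertia = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, inertia

# Hypothetical data: two well separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments, inertia = k_means(points, k=2)

print(assignments, inertia)
```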

PyTorch

Machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing (NLP)
Supply machine learning and deep learning capabilities
An open source machine learning framework that accelerates the path from research prototyping to production deployment
PyTorch datasets - the data is stored in PyTorch *tensor* objects.

torch.manual_seed()
Sets the seed for generating random numbers.
Returns a torch.Generator object.

optimizer.zero_grad()
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes.
Because of this, when you start your training loop, ideally you should zero out the gradients so that the parameter update is done correctly.


Hierarchical Clustering
Clustering algorithm in which clusters themselves belong to a larger group, which belong to even larger groups, and so on. The result is that data points can be clustered in differing degrees of precision: with a large number of very small and precise groups, or a small number of larger groups.
Useful for not only breaking data into groups, but understanding the relationships between these groups.
A major advantage of hierarchical clustering is that it does not require the number of clusters to be defined in advance, and can sometimes provide more interpretable results than non-hierarchical approaches.
The major drawback is that these approaches can take much longer to compute than simpler approaches and sometimes are not suitable for large datasets.

divisive method
Hierarchical Clustering
"Top down" approach, starting with the entire dataset and then finding partitions in a stepwise manner.

agglomerative method
Hierarchical Clustering
"Bottom up" approach. Agglomerative clustering roughly works as follows:
1. The linkage distances between each of the data points are computed.
2. Points are clustered pairwise with their nearest neighbor.
3. Linkage distances between the clusters are computed.
4. Clusters are combined pairwise into larger clusters.
5. Steps 3 and 4 are repeated until all data points are in a single cluster.

linkage function
Hierarchical Clustering - agglomerative method
Can be computed in a number of ways:
• Ward linkage measures the increase in variance for the clusters being linked.
• Average linkage uses the mean pairwise distance between the members of the two clusters.
• Complete or maximal linkage uses the maximum distance between the members of the two clusters.
Several different distance metrics are used to compute linkage functions:
• Euclidean or l2 distance is the most widely used. This metric is the only choice for the Ward linkage method. It is a measure of difference.
• Manhattan or l1 distance is robust to outliers and has other interesting properties. It is a measure of difference.
• Cosine similarity is the dot product between the location vectors divided by the magnitudes of the vectors. It is a measure of similarity.
Similarity can be quite useful when working with data such as images or text documents.
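The three metrics above can be sketched as plain Python functions. The two points are made up so the results are easy to check by hand.

```python
import math

def euclidean(a, b):
    """l2 distance: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """l1 distance: sum of the absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

# Hypothetical points on the two axes.
p, q = (0.0, 3.0), (4.0, 0.0)

print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # 0.0, the vectors are perpendicular
```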

Deep learning
Advanced form of machine learning that tries to emulate the way the human brain learns.
1. When the first neuron in the network is stimulated, the input signal is processed
2. If it exceeds a particular threshold, the neuron is activated and passes the signal on to the neurons to which it is connected.
3. These neurons in turn may be activated and pass the signal on through the rest of the network.
4. Over time, the connections between the neurons are strengthened by frequent use as you learn how to respond effectively.
Deep learning emulates this biological process using artificial neural networks that process numeric inputs rather than electrochemical stimuli.
The incoming nerve connections are replaced by numeric inputs that are typically identified as x (x1,x2…)
Associated with each x value is a weight (w)
Additionally, a bias (b) input is added to enable fine-grained control over the network
The neuron itself encapsulates a function that calculates a weighted sum of x, w, and b. This function is in turn enclosed in an activation function that constrains the result (often to a value between 0 and 1) to determine whether or not the neuron passes an output onto the next layer of neurons in the network.
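A single artificial neuron as described above can be sketched like this. The sigmoid activation is one common choice that constrains the result between 0 and 1; the input, weight and bias values are invented.

```python
import math

def sigmoid(z):
    """Activation function that squeezes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of x and w, plus the bias b, wrapped in the activation function.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Hypothetical values: 0.5 * 2.0 + (-1.0) * 1.0 + 0.0 = 0.0
output = neuron(inputs=[0.5, -1.0], weights=[2.0, 1.0], bias=0.0)

print(output)  # 0.5
```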

Deep neural network
DNN model
The deep neural network model for the classifier consists of multiple layers of artificial neurons. In this case, there are four layers:
• An input layer with a neuron for each expected input (x) value.
• Two so-called hidden layers, each containing five neurons.
• An output layer containing three neurons - one for each class probability (y) value to be predicted by the model.
Particularly useful for dealing with data that consists of large arrays of numeric values - such as images.
They are the foundation for an area of artificial intelligence called computer vision.

epochs
Training DNN model
The training process for a deep neural network consists of multiple iterations, called epochs.

backpropagation
Training DNN model
The loss from the model is calculated and used to adjust the weight and bias values.

Calculating loss
Training DNN model
The loss is calculated using a function, which operates on the results from the final layer of the network, which is also a function.
With multiple observations, we typically aggregate the variance.

Loss function
Training DNN model
The entire model, from the input layer right through to the loss calculation, is just one big nested function.
Functions have a few really useful characteristics, including:
• You can conceptualize a function as a plotted line comparing its output with each of its variables.
• You can use differential calculus to calculate the derivative of the function at any point with respect to its variables.
The derivative of a function for a given point indicates whether the slope (or gradient) of the function output (in this case, loss) is increasing or decreasing with respect to a function variable (in this case, the weight value).
A positive derivative indicates that the function is increasing, and a negative derivative indicates that it is decreasing.

optimizer
Applies the derivative-based approach described under Loss function to all of the weight and bias variables in the model, to determine in which direction we need to adjust them (up or down) to reduce the overall amount of loss in the model.
There are multiple commonly used optimization algorithms:
- Stochastic Gradient Descent (SGD),
- Adaptive Learning Rate (ADADELTA),
- Adaptive Moment Estimation (Adam), and others;
All of which are designed to figure out how to adjust the weights and biases to minimize loss.

Learning rate
How much the optimizer should adjust the weight and bias values.
A low learning rate results in small adjustments (so it can take more epochs to minimize the loss), while a high learning rate results in large adjustments (so you might miss the minimum altogether).
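The effect of the learning rate can be sketched with plain gradient descent on a made-up loss, (w - 3) ** 2, whose minimum is at w = 3. The loss, the starting point and the three rates are assumptions chosen for illustration.

```python
def loss_gradient(w):
    # Derivative of the hypothetical loss (w - 3) ** 2 with respect to w.
    return 2 * (w - 3)

def gradient_descent(learning_rate, epochs=50, start=0.0):
    w = start
    for _ in range(epochs):
        # Move against the gradient; the learning rate sets the step size.
        w -= learning_rate * loss_gradient(w)
    return w

well_tuned = gradient_descent(0.1)  # ends very close to the minimum at 3
too_low = gradient_descent(0.01)    # still far from 3 after the same number of epochs
too_high = gradient_descent(1.1)    # overshoots more on every step and diverges

print(well_tuned, too_low, too_high)
```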

Convolutional neural networks (CNN)
A CNN typically works by extracting features from images, and then feeding those features into a fully connected neural network to generate a prediction.
CNNs consist of multiple layers, each performing a specific task in extracting features or predicting labels.
The feature extraction layers in the network have the effect of reducing the number of features from the potentially huge array of individual pixel values to a smaller feature set that supports label prediction.
1. An image is passed to the convolutional layer. In this case, the image is a simple geometric shape.
2. The image is composed of an array of pixels with values between 0 and 255 (for color images, this is usually a 3-dimensional array with values for red, green, and blue channels).
3. A filter kernel is generally initialized with random weights (in this example, we've chosen values to highlight the effect that a filter might have on pixel values; but in a real CNN, the initial weights would typically be generated from a random Gaussian distribution). This filter will be used to extract a feature map from the image data.
4. The filter is convolved across the image, calculating feature values by applying a sum of the weights multiplied by their corresponding pixel values in each position. A Rectified Linear Unit (ReLU) activation function is applied to ensure negative values are set to 0.
5. After convolution, the feature map contains the extracted feature values, which often emphasize key visual attributes of the image. In this case, the feature map highlights the edges and corners of the triangle in the image.
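Steps 3 to 5 can be sketched in plain Python. This version slides the filter only over positions where it fits entirely inside the image (no padding), which is an assumption; the image and the vertical-edge kernel are invented.

```python
def convolve2d_relu(image, kernel):
    """Slide a square kernel over the image (no padding) and apply ReLU."""
    size = len(kernel)
    feature_map = []
    for r in range(len(image) - size + 1):
        row = []
        for c in range(len(image[0]) - size + 1):
            # Sum of the weights multiplied by the pixel values under the filter.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(size)
                for j in range(size)
            )
            row.append(max(0, total))  # ReLU: negative values are set to 0
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 image with a bright vertical band in the middle.
image = [
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 255, 255, 0],
]

# Hypothetical filter that responds to dark-to-bright transitions from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d_relu(image, kernel)
print(feature_map)
```

The left edge of the band gives a strong positive response, while the right edge gives a negative sum that ReLU sets to 0.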

overlay
An image is also just a matrix of pixel values. To apply the filter, you "overlay" it on an image and calculate a weighted sum of the corresponding image pixel values under the filter kernel. The result is then assigned to the center cell of an equivalent 3x3 patch in a new matrix of values that is the same size as the image

Pooling layers
After extracting feature values from images, pooling (or downsampling) layers are used to reduce the number of feature values while retaining the key differentiating features that have been extracted.
One of the most common kinds of pooling is max pooling in which a filter is applied to the image, and only the maximum pixel value within the filter area is retained. So for example, applying a 2x2 pooling kernel to the following patch of an image would produce the result 155.
1. The feature map extracted by a filter in a convolutional layer contains an array of feature values.
2. A pooling kernel is used to reduce the number of feature values. In this case, the kernel size is 2x2, so it will produce an array with a quarter of the number of feature values.
3. The pooling kernel is convolved across the feature map, retaining only the highest pixel value in each position.
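Max pooling with a 2x2 kernel can be sketched as follows. The feature map values are invented, and the kernel moves in steps equal to its own size, so the patches do not overlap.

```python
def max_pool(feature_map, size=2):
    """Keep only the highest value in each non-overlapping size x size patch."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(
                max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            )
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = [
    [0, 0, 120, 155],
    [0, 10, 30, 40],
    [5, 60, 0, 0],
    [70, 20, 0, 15],
]

pooled = max_pool(feature_map)
print(pooled)  # 16 feature values reduced to 4
```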

overfitting
The resulting model performs well with the training data but doesn't generalize well to new data on which it wasn't trained.
One technique you can use to mitigate overfitting is to include layers in which the training process randomly eliminates (or "drops") feature maps.
Other techniques you can use to mitigate overfitting include randomly flipping, mirroring, or skewing the training images to generate data that varies between training epochs.
For this reason, it’s common to use some kind of regularisation method to prevent the model from fitting too closely to the training data.

Flattening layers
The resulting feature maps are multidimensional arrays of pixel values. A flattening layer is used to flatten the feature maps into a vector of values that can be used as input to a fully connected layer.

CNN architecture
1. Images are fed into a convolutional layer. In this case, there are two filters, so each image produces two feature maps.
2. The feature maps are passed to a pooling layer, where a 2x2 pooling kernel reduces the size of the feature maps.
3. A dropping layer randomly drops some of the feature maps to help prevent overfitting.
4. A flattening layer takes the remaining feature map arrays and flattens them into a vector.
5. The vector elements are fed into a fully connected network, which generates the predictions. In this case, the network is a classification model that predicts probabilities for three possible image classes (triangle, square, and circle).

Transfer learning
Conceptually, this neural network consists of two distinct sets of layers:
1. A set of layers from the base model that perform feature extraction.
The feature extraction layers apply convolutional filters and pooling to emphasize edges, corners, and other patterns in the images that can be used to differentiate them, and in theory should work for any set of images with the same dimensions as the input layer of the network.
2. A fully connected layer that takes the extracted features and uses them for class prediction.
This approach enables you to keep the pre-trained weights for the feature extraction layers, which means you only need to train the prediction layers you have added.

Azure Machine Learning studio
Cloud-based service that helps simplify some of the tasks it takes to prepare data, train a model, and deploy a predictive service.

Azure Machine Learning workspace
Resource in your Azure subscription you use to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.

Azure Machine Learning compute
Cloud-based resources on which you can run model training and data exploration processes.
1. Compute Instances: Development workstations that data scientists can use to work with data and models.
2. Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code.
3. Inference Clusters: Deployment targets for predictive services that use your trained models.
4. Attached Compute: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.

Azure Machine Learning
Service for training and managing machine learning models, for which you need compute on which to run the training process.

Azure Automated Machine Learning
Automatically tries multiple pre-processing techniques and model-training algorithms in parallel.
These automated capabilities use the power of cloud compute to find the best performing supervised machine learning model for your data.
It provides a way to save time and resources by automating algorithm selection and hyperparameter tuning.

AutoML process
1. Prepare data: Identify the features and label in a dataset. Pre-process, or clean and transform, the data as needed.
2. Train model: Split the data into two groups, a training and a validation set. Train a machine learning model using the training data set. Test the machine learning model for performance using the validation data set.
3. Evaluate performance: Compare how close the model's predictions are to the known labels.
4. Deploy a predictive service: After you train a machine learning model, you can deploy the model as an application on a server or device so that others can use it.

Train model
You can use automated machine learning to train models for:
• Classification (predicting categories or classes)
• Regression (predicting numeric values)
• Time series forecasting (predicting numeric values at a future point in time)
In Automated Machine Learning, you can select configurations for the primary metric, type of model used for training, exit criteria, and concurrency limits.

Evaluate performance
After the job has finished you can review the best performing model.

Inference Clusters
Deployment targets for predictive services that use your trained models

Pipelines
Let you organize, manage, and reuse complex machine learning workflows across projects and users. A pipeline starts with the dataset from which you want to train the model

Components
Encapsulates one step in a machine learning pipeline

Azure Machine Learning Jobs
executes a task against a specified compute target

Stratified sampling
technique used in Machine Learning to generate a test set
Random sampling is generally fine if the original dataset is large enough; if not, a bias is introduced due to the sampling error. Stratified Sampling is a sampling method that reduces the sampling error in cases where the population can be partitioned into subgroups.
We perform Stratified Sampling by dividing the population into homogeneous subgroups, called strata, and then applying Simple Random Sampling within each subgroup.
As a result, the test set is representative of the population, since the percentage of each stratum is preserved. The strata should be disjoint; therefore, every element within the population must belong to one and only one stratum.
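The procedure described above can be sketched in plain Python. This is a minimal illustration, not a library API: the function name stratified_split, the stratum_of callback, and the sample population are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, stratum_of, test_fraction, seed=0):
    """Split records into (train, test), sampling each stratum separately."""
    rng = random.Random(seed)

    # Partition the population into disjoint strata.
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    train, test = [], []
    for members in strata.values():
        # Simple random sampling inside each stratum.
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical population: 80 records of class "a" and 20 of class "b".
population = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
train, test = stratified_split(population, lambda r: r[0], test_fraction=0.25)

# The test set keeps the 80/20 proportion of the population.
test_counts = Counter(label for label, _ in test)
print(test_counts)
```

Because each stratum is sampled on its own, the class proportions in the test set match those of the population.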

ML experiment
• a named process, usually the running of a script or a pipeline, that can generate metrics and outputs and be tracked in the Azure Machine Learning workspace
• it can be run multiple times, with different data, code, or settings; and Azure Machine Learning tracks each run, enabling you to view run history and compare results for each run.
• When you submit an experiment, you use its run context to initialize and end the experiment run that is tracked in Azure Machine Learning
1. Every experiment generates log files (keep data between runs)
2. You can view the metrics logged by an experiment run in Azure Machine Learning studio or by using the RunDetails widget in a notebook
3. In addition to logging metrics, an experiment can generate output files. The output files of an experiment are saved in its outputs folder.

experiment script
• a Python code file that contains the code you want to run in the experiment
1. To access the experiment run context (which is needed to log metrics) the script must import the azureml.core.Run class and call its get_context method.
2. To run a script as an experiment, you must define
a. a script configuration that defines the script to be run and
b. the Python environment in which to run it.
This is implemented by using a ScriptRunConfig object.

Log experiment metrics
Run object
Every experiment generates log files that include the messages that would be written to the terminal during interactive execution.
If you want to record named metrics for comparison across runs, you can do so by using the Run object; which provides a range of logging functions specifically for this purpose. These include:
• log: Record a single named value.
• log_list: Record a named list of values.
• log_row: Record a row with multiple columns.
• log_table: Record a dictionary as a table.
• log_image: Record an image file or a plot.

Environment
Defines Python packages, environment variables, and Docker settings that are used in machine learning experiments, including in data preparation, training, and deployment to a web service.
An Environment is managed and versioned in an Azure Machine Learning Workspace.
You can update an existing environment and retrieve a version to reuse.
Environments are exclusive to the workspace they are created in and can't be used across different workspaces.
Azure Machine Learning provides curated environments, which are predefined environments that offer good starting points for building your own environments. Curated environments are backed by cached Docker images, providing a reduced run preparation cost.
Environments are created by:
  • Initialize a new Environment object.
  • Use one of the Environment class methods: from_conda_specification, from_pip_requirements, or from_existing_conda_environment.
  • Use the submit method of the Experiment class to submit an experiment run without specifying an environment, including with an Estimator object.

argparse
To use parameters in a script, you must use a library such as argparse to read the arguments passed to the script and assign them to variables.
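A minimal sketch of reading a script parameter with argparse follows. The parameter name --reg-rate and its default are hypothetical, and the argument list is passed explicitly so the example runs without a command line.

```python
import argparse


def parse_args(argv=None):
    """Read script arguments; argv=None would fall back to the real command line."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--reg-rate", type=float, dest="reg_rate", default=0.01,
                        help="regularization rate (illustrative parameter)")
    return parser.parse_args(argv)


# Simulate running: python train.py --reg-rate 0.1
args = parse_args(["--reg-rate", "0.1"])
reg_rate = args.reg_rate  # assign the parsed argument to a variable
print("Regularization rate:", reg_rate)
```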

train_test_split sklearn
Splits the data into random training and test subsets.

LogisticRegression
Supervised classification algorithm.
The model builds a regression model to predict the probability that a given data entry belongs to the category numbered "1" (as opposed to "0").
Linear regression assumes that the data follows a linear function, whereas logistic regression models the data using the sigmoid function.
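As a rough illustration of how the sigmoid turns a linear score into a class probability, here is a small sketch. The weights, bias, and input values are made up for the example and are not taken from any trained model.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability that the entry belongs to class 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias  # linear part
    return sigmoid(z)


# Hypothetical parameters, as if learned during training.
weights, bias = [0.8, -0.4], 0.1
p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(round(p, 3), label)
```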

Hyperparameter
- configure how the model is trained
- top-level parameters that control the learning process and the model parameters that result from it. As a machine learning engineer designing a model, you choose and set hyperparameter values that your learning algorithm will use before the training of the model even begins

Regularization rate
(regression algorithm)
The logistic regression function, which originally takes training data X and label y as input, now needs one more input: the strength of regularization λ.
Used to train models that generalize better on unseen data, by preventing the algorithm from overfitting the training dataset.
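One common way to apply a regularization rate is to add a penalty on the size of the weights to the training loss. The sketch below assumes an L2 penalty; the source does not say which penalty is meant, and all values are illustrative.

```python
import math


def log_loss(y_true, y_prob):
    """Average negative log-likelihood of the true labels."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / len(y_true)


def regularized_loss(y_true, y_prob, weights, reg_rate):
    """Log loss plus an L2 penalty scaled by the regularization rate (assumed form)."""
    penalty = reg_rate * sum(w * w for w in weights)
    return log_loss(y_true, y_prob) + penalty


# Hypothetical labels, predicted probabilities, and model weights.
y_true, y_prob, weights = [1, 0, 1], [0.9, 0.2, 0.7], [2.0, -1.0]
plain = log_loss(y_true, y_prob)
penalised = regularized_loss(y_true, y_prob, weights, reg_rate=0.1)
print(plain, penalised)
```

A larger rate makes large weights more expensive, which pushes training toward simpler models.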

Hyperparameter
search space
Search space for hyperparameters values
To define a search space for hyperparameter tuning, create a dictionary with the appropriate parameter expression for each named hyperparameter
The specific values used in a hyperparameter tuning run depend on the type of sampling used.

Discrete hyperparameters
distributions
• qnormal
• quniform
• qlognormal
• qloguniform

Continuous hyperparameters
Distributions
• normal
• uniform
• lognormal
• loguniform

Hyperparameters search space
values sampling
Grid sampling - can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.
Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values
Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection.
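The difference between grid and random sampling can be sketched with the standard library alone. This is not the Azure Machine Learning API: the dictionary layout (lists for discrete choices, tuples for continuous ranges) and the hyperparameter names are assumptions for illustration.

```python
import itertools
import random


def grid_samples(space):
    """Yield every combination of a fully discrete search space."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def random_sample(space, rng):
    """Pick one value per hyperparameter: list = discrete, tuple = (low, high)."""
    sample = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            sample[name] = rng.choice(spec)
        else:
            low, high = spec
            sample[name] = rng.uniform(low, high)
    return sample


# Grid sampling needs all hyperparameters to be discrete.
discrete_space = {"batch_size": [16, 32, 64], "n_layers": [1, 2]}
grid = list(grid_samples(discrete_space))

# Random sampling can mix discrete and continuous hyperparameters.
mixed_space = {"batch_size": [16, 32, 64], "learning_rate": (0.001, 0.1)}
rng = random.Random(42)
draws = [random_sample(mixed_space, rng) for _ in range(5)]
print(len(grid), draws[0])
```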

Training early termination
set an early termination policy that abandons runs that are unlikely to produce a better result than previously completed runs.
The policy is evaluated at an evaluation_interval you specify, based on each time the target performance metric is logged.
You can also set a delay_evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed.
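The interplay of evaluation_interval and delay_evaluation can be sketched as below. This is an illustration only: the helper names, the slack-based stopping rule, and the metric values are hypothetical, and the timing rule (apply the policy on every multiple of evaluation_interval that is at least delay_evaluation) is an assumption about how the two settings combine.

```python
def policy_applies(interval, evaluation_interval=1, delay_evaluation=0):
    """True if the policy would be checked at this logging interval (1-based)."""
    return interval >= delay_evaluation and interval % evaluation_interval == 0


def should_abandon(current_metric, best_completed_metric, slack=0.1):
    """Hypothetical rule: stop when the run trails the best completed run by more than slack."""
    return current_metric < best_completed_metric - slack


# Hypothetical accuracy values logged by one run, one per interval.
logged_accuracy = [0.50, 0.55, 0.58, 0.60, 0.61, 0.62]
best_completed = 0.80

stopped_at = None
for interval, metric in enumerate(logged_accuracy, start=1):
    check_now = policy_applies(interval, evaluation_interval=2, delay_evaluation=3)
    if check_now and should_abandon(metric, best_completed):
        stopped_at = interval
        break

print("Run abandoned at interval:", stopped_at)
```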

Data privacy parameters
The amount of variation caused by adding noise is configurable
epsilon: This value governs the amount of additional risk that your personal data can be identified through rejecting the opt-out option and participating in a study
- A low epsilon value provides the most privacy, at the expense of less accuracy when aggregating the data
- A higher epsilon value results in aggregations that are more true to the actual data distribution, but in which the individual contribution of a single individual to the aggregated value is less obscured by noise - less privacy

Differential privacy
Technique that is designed to preserve the privacy of individual data points by adding "noise" to the data. The goal is to ensure that enough noise is added to provide privacy for individual values while ensuring that the overall statistical makeup of the data remains consistent, and aggregations produce statistically similar results as when used with the original raw data.

The noise is different for each analysis, so the results are non-deterministic – in other words, two analyses that perform the same aggregation may produce slightly different results.
The amount of variation caused by adding noise is configurable through a parameter called epsilon
  • A low epsilon value provides the most privacy, at the expense of less accuracy when aggregating the data.
  • A higher epsilon value results in aggregations that are more true to the actual data distribution, but in which the individual contribution of a single individual to the aggregated value is less obscured by noise.

SmartNoise
Create an analysis in which noise is added to the source data.
The underlying mathematics of how the noise is added can be quite complex, but SmartNoise takes care of most of the details for you.
  • Upper and lower bounds
    Clamping is used to set upper and lower bounds on values for a variable. This is required to ensure that the noise generated by SmartNoise is consistent with the expected distribution of the original data.
  • Sample size
    To generate consistent differentially private data for some aggregations, SmartNoise needs to know the size of the data sample to be generated.
  • Epsilon
    Put simplistically, epsilon is a non-negative value that provides an inverse measure of the amount of noise added to the data. A low epsilon results in a dataset with a greater level of privacy, while a high epsilon results in a dataset that is closer to the original data. Generally, you should use epsilon values between 0 and 1. Epsilon is correlated with another value named delta, that indicates the probability that a report generated by an analysis is not fully private.
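To show how bounds, sample size, and epsilon fit together, here is a sketch of a differentially private mean using Laplace noise. It does not use the SmartNoise API; the function names, the choice of the Laplace mechanism, and the sample data are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centred Laplace distribution."""
    # The difference of two exponential samples with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Mean of the clamped values plus noise calibrated to epsilon."""
    clamped = [min(max(v, lower), upper) for v in values]  # upper and lower bounds
    n = len(clamped)                                       # sample size
    sensitivity = (upper - lower) / n  # most one record can move the mean
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical ages; bounds 0..100 are an assumption about the data.
ages = [23, 35, 41, 52, 29, 64, 38, 47]
rng = random.Random(7)
low_eps = private_mean(ages, 0, 100, epsilon=0.1, rng=rng)   # more noise, more privacy
high_eps = private_mean(ages, 0, 100, epsilon=1.0, rng=rng)  # less noise, less privacy
print(sum(ages) / len(ages), low_eps, high_eps)
```

The noise scale is inversely proportional to epsilon, which is why a low epsilon gives more privacy and less accurate aggregates.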

Covariance
Establish relationships between variables.
Positive values: as one feature increases, the second tends to increase as well (direct relation).
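A small worked example of the sample covariance, using made-up numbers: a positive result indicates a direct relation, and reversing one series flips the sign, indicating an inverse relation.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

direct = covariance(hours, score)                    # both rise together
inverse = covariance(hours, list(reversed(score)))   # one rises, the other falls
print(direct, inverse)
```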

Model explainers
Use statistical techniques to calculate feature importance.
Allow you to quantify the relative influence each feature in the training dataset has on label prediction.
Explainers work by evaluating a test data set of feature cases and the labels the model predicts for them.

Global feature importance
quantifies the relative importance of each feature in the test dataset as a whole
It provides a general comparison of the extent to which each feature in the dataset influences prediction.

model-agnostic
Use ML models to study the underlying structure without assuming that it can be accurately described by the model because of its nature.



Local feature importance
measures the influence of each feature value for a specific individual prediction.
For a regression model, there are no classes so the local importance values simply indicate the level of influence each feature has on the predicted scalar label.

MimicExplainer
Model explainers
An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations.
This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based).

TabularExplainer
Model explainers
An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture.

PFIExplainer
Model explainers
A Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance.
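The shuffling idea can be sketched without any explainer library. The toy model, the data, and the function names below are hypothetical; this is not the azureml-interpret implementation.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, rng):
    """Drop in accuracy after shuffling each feature column in turn."""
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, permuted, labels))
    return importances


def model(row):
    """Hypothetical model that only looks at feature 0."""
    return int(row[0] > 0.5)


rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [model(r) for r in rows]

importances = permutation_importance(model, rows, labels, n_features=2, rng=rng)
print(importances)
```

Shuffling the feature the model depends on hurts accuracy, while shuffling the ignored feature changes nothing.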

SHAP
SHapley Additive exPlanations: probably the state of the art in machine learning explainability.
In a nutshell, SHAP values are used whenever you have a complex model (could be a gradient boosting, a neural network, or anything that takes some features as input and produces some predictions as output) and you want to understand what decisions the model is making.

Disparity
a difference in level or treatment, especially one that is seen as unfair.
In prediction it is about fairness of the model

Measuring disparity in predictions
One way to start evaluating the fairness of a model is to compare predictions for each group within a sensitive feature.
To evaluate the fairness of a model, you can apply the same predictive performance metric to subsets of the data, based on the sensitive features on which your population is grouped, and measure the disparity in those metrics across the subgroups.
Potential causes of disparity:
• Data imbalance.
• Indirect correlation
• Societal biases.
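Comparing a metric across groups of a sensitive feature can be sketched as follows. The group names, labels, and predictions are invented for the example, and the helpers are plain Python rather than Fairlearn functions.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, groups):
    """Fraction of positive predictions within each group."""
    positives, totals = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def recall_by_group(y_true, y_pred, groups):
    """True positive rate within each group."""
    hits, actual = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            actual[group] += 1
            hits[group] += pred
    return {g: hits[g] / actual[g] for g in actual}


# Hypothetical sensitive feature (age band), true labels, and predictions.
groups = ["under_50"] * 4 + ["over_50"] * 4
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]

rates = selection_rate_by_group(y_pred, groups)
recalls = recall_by_group(y_true, y_pred, groups)
disparity = max(rates.values()) - min(rates.values())
print(rates, recalls, disparity)
```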

Data imbalance
Some groups may be overrepresented in the training data, or the data may be skewed so that cases within a specific group aren't representative of the overall population.

Indirect correlation
The sensitive feature itself may not be predictive of the label, but there may be a hidden correlation between the sensitive feature and some other feature that influences the prediction. For example, there's likely a correlation between age and credit history, and there's likely a correlation between credit history and loan defaults. If the credit history feature is not included in the training data, the training algorithm may assign a predictive weight to age without accounting for credit history, which might make a difference to loan repayment probability.

Societal biases
Subconscious biases in the data collection, preparation, or modeling process may have influenced feature selection or other aspects of model design.

Fairlearn
Python package that you can use to analyze models and evaluate disparity between predictions and prediction performance for one or more sensitive features.
The mitigation support in Fairlearn is based on the use of algorithms to create alternative models that apply parity constraints to produce comparable metrics across sensitive feature groups. Fairlearn supports the following mitigation techniques.

Exponentiated Gradient
Fairlearn techniques
A reduction technique that applies a cost-minimization approach to learning the optimal trade-off of overall predictive performance and fairness disparity
- Binary classification
- Regression

Grid Search
Fairlearn techniques
A simplified version of the Exponentiated Gradient algorithm that works efficiently with small numbers of constraints
- Binary classification
- Regression

Threshold Optimizer
Fairlearn techniques
A post-processing technique that applies a constraint to an existing classifier, transforming the prediction as appropriate
- Binary classification

Fairlearn constraints
• Demographic parity
Use this constraint with any of the mitigation algorithms to minimize disparity in the selection rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that an equal number of positive predictions are made in each group.
• True positive rate parity:
Use this constraint with any of the mitigation algorithms to minimize disparity in true positive rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that each group contains a comparable ratio of true positive predictions.
• False-positive rate parity:
Use this constraint with any of the mitigation algorithms to minimize disparity in false-positive rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that each group contains a comparable ratio of false-positive predictions.
• Equalized odds:
Use this constraint with any of the mitigation algorithms to minimize disparity in combined true positive rate and false-positive rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that each group contains a comparable ratio of true positive and false-positive predictions.
• Error rate parity:
Use this constraint with any of the reduction-based mitigation algorithms (Exponentiated Gradient and Grid Search) to ensure that the error for each sensitive feature group does not deviate from the overall error rate by more than a specified amount.
• Bounded group loss:
Use this constraint with any of the reduction-based mitigation algorithms to restrict the loss for each sensitive feature group in a regression model.

data drift
change in data profiles between training and inferencing, and over time.
To monitor data drift using registered datasets, you need to register two datasets:
- A baseline dataset - usually the original training data.
- A target dataset that will be compared to the baseline based on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column so the rate of data drift can be measured.

Service tags
a group of IP address prefixes from a given Azure service
Microsoft manages the address prefixes encompassed by the service tag and automatically updates the service tag as addresses change, minimizing the complexity of frequent updates to network security rules.
You can use service tags in place of specific IP addresses when you create security rules to define network access controls on network security groups or Azure Firewall.

Azure VNet
the fundamental building block for your private network in Azure. VNet enables Azure resources, such as Azure Blob Storage and Azure Container Registry, to securely communicate with each other, the internet, and on-premises networks.
With a VNet, you can enhance security between Azure resources and filter network traffic to ensure only trusted users have access to the network.

IP address space:
When creating a VNet, you must specify a custom private IP address space using public and private (RFC 1918) addresses.

Subnets
enable you to segment the virtual network into one or more sub-networks and allocate a portion of the virtual network's address space to each subnet, enhancing security and performance.

Network interfaces (NIC)
the interconnection between a VM and a virtual network (VNet). When you create a VM in the Azure portal, a network interface is automatically created for you.

Network security groups (NSG)
can contain multiple inbound and outbound security rules that enable you to filter traffic to and from resources by source and destination IP address, port, and protocol.

Load balancers
can be configured to efficiently handle inbound and outbound traffic to VMs and VNets, while also offering metrics to monitor the health of VMs.

Service endpoints
provide the identity of your virtual network to the Azure service.
Service endpoints use public IP addresses
Once you enable service endpoints in your virtual network, you can add a virtual network rule to secure the Azure service resources to your virtual network.

Private endpoints
effectively bringing the Azure services into your VNet
Private endpoint uses a private IP address from your VNet
network interfaces that securely connect you to a service powered by Azure Private Link

Private Link Service
your own service, powered by Azure Private Link that runs behind an Azure Standard Load Balancer, enabled for Private Link access. This service can be privately connected with and consumed using Private Endpoints deployed in the user's virtual network

Azure VPN gateway
Connects on-premises networks to the VNet over a private connection. Connection is made over the public internet. There are two types of VPN gateways that you might use:
• Point-to-site: Each client computer uses a VPN client to connect to the VNet.
• Site-to-site: A VPN device connects the VNet to your on-premises network.

ExpressRoute
Connects on-premises networks into the cloud over a private connection. Connection is made using a connectivity provider.

Azure Bastion
In this scenario, you create an Azure Virtual Machine (sometimes called a jump box) inside the VNet. You then connect to the VM using Azure Bastion. Bastion allows you to connect to the VM using either an RDP or SSH session from your local web browser. You then use the jump box as your development environment. Since it is inside the VNet, it can directly access the workspace.

Azure Databricks
Microsoft analytics service, part of the Microsoft Azure cloud platform. It offers an integration between Microsoft Azure and Databricks' implementation of Apache Spark.

notebook
a document that contains runnable code, descriptive text, and visualizations.
We can override the default language by specifying the language magic command %<language> at the beginning of a cell.
The supported magic commands are:
• %python
• %r
• %scala
• %sql
Notebooks also support a few auxiliary magic commands:
• %sh: Allows you to run shell code in your notebook
• %fs: Allows you to use dbutils filesystem commands
• %md: Allows you to include various types of documentation, including text, images, and mathematical formulas and equations.


workspace
It groups objects (like notebooks, libraries, experiments) into folders,
Provides access to your data,
Provides access to the computations resources used (clusters, jobs).

cluster
set of computational resources on which you run your code (as notebooks or jobs). We can run ETL pipelines, or machine learning, data science, analytics workloads on the cluster.
• An all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
• A job cluster to run a specific job. The cluster will be terminated when the job completes (A job is a way of running a notebook or JAR either immediately or on a scheduled basis).

job
a way of running a notebook or JAR either immediately or on a scheduled basis

Databricks runtimes
the set of core components that run on Azure Databricks clusters.
Azure Databricks offers several types of runtimes:
Databricks Runtime: includes Apache Spark, components and updates that optimize the usability, performance, and security for big data analytics.
Databricks Runtime for Machine Learning: a variant that adds multiple machine learning libraries such as TensorFlow, Keras, and PyTorch.
Databricks Light: for jobs that don’t need the advanced performance, reliability, or autoscaling of the Databricks Runtime.

Azure Databricks database
a collection of tables. An Azure Databricks table is a collection of structured data.
We can cache, filter, and perform any operations supported by Apache Spark DataFrames on Azure Databricks tables. We can query tables with Spark APIs and Spark SQL.

Databricks File System (DBFS)
distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
• Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
• Allows you to interact with object storage using directory and file semantics instead of storage URLs.
• Persists files to object storage, so you won’t lose data after you terminate a cluster.

Resilient Distributed Dataset (RDD)
The fundamental data structure of Apache Spark: an immutable collection of objects that is computed on the different nodes of the cluster.
Each and every dataset in Spark RDD is logically partitioned across many servers so that they can be computed on different nodes of the cluster.

MLLib
SAME LIBRARY as Spark ML
legacy approach for machine learning on Apache Spark. It builds off of Spark's Resilient Distributed Dataset (RDD) data structure.
additional data structures on top of the RDD, such as DataFrames, have reduced the need to work directly with RDDs.
The "classic" MLLib namespace is org.apache.spark.mllib

Spark ML
SAME LIBRARY as MLLib
Primary library for machine learning development in Apache Spark.
It supports DataFrames in its API (versus the classic RDD approach).
This makes Spark ML an easier library to work with for data scientists, as Spark DataFrames share many common ideas with the DataFrames used in Pandas and R.
USE this as much as you can.
The Spark ML namespace is org.apache.spark.ml.

Train and validate a model
The process of training and validating a machine learning model using Spark ML is fairly straightforward. The steps are as follows:
• Splitting data.
• Training a model.
• Validating a model.

Splitting data
splitting data between training and validation datasets
This hold-out dataset can be useful for determining whether the training model is overfitting
DataFrames support a randomSplit() method, which makes this process of splitting data simple

Training a model
Training a model relies on three key abstractions:
• a transformer - takes a DataFrame as an input and returns a new DataFrame. Useful for performing feature engineering and feature selection, as the result of a transformer is another DataFrame. Transformers implement a .transform() method.
• an estimator - takes a DataFrame as an input and returns a model, which is itself a transformer.
ex: LinearRegression
It accepts a DataFrame and produces a Model. Estimators implement a .fit() method.
• a pipeline - combine together estimators and transformers and implement a .fit()

Validating a model
process based on built-in summary statistics
the model contains a summary object, which includes scores such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2, pronounced R-squared)
with a validation dataset, it is possible to calculate summary statistics on a never-before-seen set of data, running the model's transform() function against the validation dataset.

other machine learning frameworks
Azure Databricks supports machine learning frameworks other than Spark ML and MLLib.
For libraries that do not support distributed training, it is also possible to use a single node cluster. For example, PyTorch and TensorFlow both support single node use.

DataFrames
the distributed collections of data, organized into rows and columns. Each column in a DataFrame has a name and an associated type.
Spark DataFrames can be created from various sources, such as CSV files, JSON, Parquet files, Hive tables, log tables, and external databases.

Query dataframes
Spark SQL is a component that introduced the DataFrames, which provides support for structured and semi-structured data.
Spark has multiple interfaces (APIs) for dealing with DataFrames:
• the .sql() method, which allows you to run arbitrary SQL queries on table data.
• use the Spark domain-specific language for structured data manipulation, available in Scala, Java, Python, and R.

DataFrame API
The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate, and so on) that allow you to solve common data analysis problems efficiently. A complex operation where tables are joined, filtered, and restructured is easy to write, easy to understand, type safe, and feels natural for people with prior SQL experience.
Available statistics about the DataFrame are:
• Count
• Mean
• Stddev
• Min
• Max
• Arbitrary approximate percentiles specified as a percentage (for example, 75%).

Plot options
The following display options are available:
• We can choose the DataFrame columns to be used as axes (keys, values).
• We can choose to group our series of data.
• We can choose the aggregations to be used with our grouped data (avg, sum, count, min, max).

Machine learning
Data science technique used to extract patterns from data allowing computers to identify related data, forecast future outcomes, behaviors, and trends.
In machine learning, you train the algorithm with data and answers, also known as labels, and the algorithm learns the rules to map the data to their respective labels.

Synthetic Minority Over-sampling Technique (SMOTE)
Oversampling technique that allows us to generate synthetic samples for our minority categories
the idea is based on the K-Nearest Neighbors algorithm
We take the difference between a sample and one of its k nearest neighbours and multiply it by a random value in the range (0, 1). Finally, we generate a new synthetic sample by adding the result to the original sample.
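The interpolation step can be sketched in a few lines. This is a simplified illustration of the idea, not a full SMOTE implementation: the function names and the minority-class points are made up.

```python
import math
import random


def nearest_neighbours(sample, others, k):
    """The k points in `others` closest to `sample` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(sample, other))[:k]


def smote_sample(sample, minority, k, rng):
    """Create one synthetic point between `sample` and one of its k nearest neighbours."""
    candidates = [m for m in minority if m is not sample]
    neighbour = rng.choice(nearest_neighbours(sample, candidates, k))
    gap = rng.random()  # random value in [0, 1)
    # sample + gap * (neighbour - sample), computed per feature
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


# Hypothetical minority-class points with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
rng = random.Random(3)
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the original sample and the chosen neighbour.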

Imputation of null values
Null values refer to unknown or missing data. Strategies for dealing with this scenario include:
• Dropping these records: Works when you do not need to use the information for downstream workloads.
• Adding a placeholder (for example, -1): Allows you to see missing data later on without violating a schema.
• Basic imputing: Allows you to have a "best guess" of what the data could have been, often by using the mean or median of non-missing data for numerical data type, or most_frequent value of non-missing data for categorical data type.
• Advanced imputing: Determines the "best guess" of what data should be using more advanced strategies such as clustering machine learning algorithms or oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique).
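Basic imputing can be sketched with the standard library. The impute helper and the sample columns are hypothetical; missing values are represented here as None.

```python
from collections import Counter
from statistics import mean, median


def impute(values, strategy):
    """Replace None with the mean, median, or most frequent non-missing value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries.
ages = [30, None, 40, 50]
colours = ["red", None, "blue", "red"]

ages_filled = impute(ages, "mean")                  # numerical column
colours_filled = impute(colours, "most_frequent")   # categorical column
print(ages_filled, colours_filled)
```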

ScriptRunConfig
To submit a run, create a configuration object that describes how the experiment is run. ScriptRunConfig is an example of such a configuration object.
Identifies the Python script file to be run in the experiment. An experiment can be run based on it.
The ScriptRunConfig also determines the compute target and Python environment.

data cleaning
Imputation of null values, Duplicate records, Outliers

feature engineering
process of generating new predictive features from existing raw data
it is important to derive features from existing raw data that better represent the nature of the data and thus help improve the predictive power of the machine learning algorithms
• Aggregation (count, sum, average, mean, median, and the like)
• Part-of (year of date, month of date, week of date, and the like)
• Binning (grouping entities into bins and then applying aggregations)
• Flagging (boolean conditions resulting in True or False)
• Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
• Embedding (transforming one or more categorical or text features into a new set of features, possibly with a different cardinality)
• Deriving by example

data scaling
Bring features to similar scales
There are two common approaches to scaling numerical features:
• Normalization - mathematically rescales the data into the range [0, 1].
• Standardization - rescales the data to have mean = 0 and standard deviation = 1
For the numeric input
- compute the mean and standard deviation using all the data available in the training dataset.
- then for each individual input value, you scale that value by subtracting the mean and then dividing by the standard deviation.
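Both scaling approaches can be sketched with the standard library. The training values are made up, and the population standard deviation (pstdev) is an assumption; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def normalize(values):
    """Rescale values into the range [0, 1] (min-max scaling)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values, mu, sigma):
    """Subtract the mean, then divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]


# Hypothetical numeric feature from the training dataset.
train = [10.0, 20.0, 30.0, 40.0]

# Mean and standard deviation are computed on the training data only.
mu, sigma = mean(train), pstdev(train)

normalized = normalize(train)
standardized = standardize(train, mu, sigma)
print(normalized, standardized)
```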

data encoding
converting data into a format required for a number of information processing needs
We will look at two common approaches for encoding categorical data:
• Ordinal encoding - converts categorical data into integer codes ranging from 0 to (number of categories – 1).
• One-hot encoding - transforming each categorical value into n (= number of categories) binary values, with one of them 1, and all others 0 (recommended)
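The two encodings can be sketched as follows. The helper names and the sample column are hypothetical, and the categories are ordered alphabetically purely for reproducibility.

```python
def ordinal_encode(values):
    """Map each category to an integer code from 0 to (number of categories - 1)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Map each value to n binary flags, exactly one of which is 1."""
    codes, categories = ordinal_encode(values)
    rows = [[1 if i == code else 0 for i in range(len(categories))]
            for code in codes]
    return rows, categories


# Hypothetical categorical column.
sizes = ["small", "large", "medium", "small"]

codes, categories = ordinal_encode(sizes)
one_hot, _ = one_hot_encode(sizes)
print(categories, codes, one_hot)
```

Ordinal codes imply an order between categories, which is one reason one-hot encoding is often preferred for unordered data.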

MLflow
Open-source product designed to manage the Machine Learning development lifecycle.
Allows data scientists to train models, register those models, deploy the models to a web server, and manage model updates.
Important part of machine learning with Azure Databricks, as it integrates key operational processes with the Azure Databricks interface (it can also operate on workloads outside of Azure Databricks).
Offers a standardized format for packaging models for distribution.
Components:
• MLflow Tracking - provides the ability to audit the results of prior model training executions. It is built around runs
• MLflow Projects - a way of packaging code that allows for consistent deployment and the ability to reproduce results
• MLflow Models - a standardized model format that allows MLflow to work with models generated from several popular libraries, including scikit-learn, Keras, MLlib, ONNX, and more
• MLflow Model Registry - allows data scientists to register models in a registry
Key steps:
• Model registration - stores the details of a model in the MLflow Model Registry, along with a name for ease of access
• Model Versioning - makes model management easy by labeling new versions of models and retaining information on prior model versions automatically

MLflow Tracking
To use MLflow to track metrics for an inline experiment, you must set the MLflow tracking URI to the workspace where the experiment is being run. This enables you to use mlflow tracking methods to log data to the experiment run.
When you use MLflow tracking in an Azure ML experiment script, the MLflow tracking URI is set automatically when you start the experiment run. However, the environment in which the script is to be run must include the required mlflow packages.
It is built around runs, that is, executions of code for a data science task. Each run contains several key attributes, including:
  • Parameters:
    Key-value pairs that represent inputs. Use parameters to track hyperparameters, that is, inputs to functions that affect the machine learning process.
  • Metrics:
    Key-value pairs that represent how the model is performing. This can include evaluation measures such as Root Mean Square Error, and metrics can be updated throughout the course of a run. This allows a data scientist, for example, to track Root Mean Square Error for each epoch of a neural network.
  • Artifacts:
    Output files. Artifacts may be stored in any format, and can include models, images, log files, data files, or anything else that might be important for model analysis and understanding.
Experiments
Intended to collect and organize runs
The data scientist can then review the individual runs in order to determine which run generated the best model.
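The three run attributes described above map onto a handful of MLflow calls. The sketch below is a minimal illustration, not a full training script: the function name, the metric name "rmse", and the example values are assumptions, and it relies on the mlflow package being installed in the environment.

```python
def log_training_run(params, rmse_per_epoch, artifact_path=None, experiment_name=None):
    """Log one run: its parameters, a per-epoch metric, and an optional artifact file."""
    # Imported inside the function so that defining it does not require mlflow.
    import mlflow

    if experiment_name is not None:
        # Runs are grouped under an experiment
        mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        # Parameters: key-value inputs such as hyperparameters
        mlflow.log_params(params)

        # Metrics: the same key can be updated at each step (epoch)
        for epoch, rmse in enumerate(rmse_per_epoch):
            mlflow.log_metric("rmse", rmse, step=epoch)

        # Artifacts: any output file, for example a plot or a log
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)

        return run.info.run_id
```

A call such as log_training_run({"learning_rate": 0.01, "epochs": 3}, [0.9, 0.7, 0.6]) would record one run; comparing several such runs in the same experiment is how the best model is picked.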

Run
Single trial of an experiment.
The Run object is used to monitor the asynchronous execution of a trial, log metrics and store output of the trial, and to analyze results and access artifacts generated by the trial.
Used inside of your experimentation code to log metrics and artifacts to the Run History service.
Used outside of your experiments to monitor progress and to query and analyze the metrics and results that were generated.
Functionality of Run:
  • Storing and retrieving metrics and data
  • Uploading and downloading files
  • Using tags as well as the child hierarchy for easy lookup of past runs
  • Registering stored model files as a model that can be operationalized
  • Storing, modifying, and retrieving properties of a run
  • Loading the current run from a remote environment with the get_context method
  • Efficiently snapshotting a file or directory for reproducibility
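Inside experiment code, the usual pattern is to load the current run with get_context and log to it. The sketch below assumes the Azure ML SDK v1 (azureml-core) is installed; the function name is hypothetical.

```python
def log_to_current_run(metric_name, value):
    """Log a single metric to the run the script is currently executing in."""
    # Imported inside the function so that defining it does not require azureml-core.
    from azureml.core import Run

    # Loads the current run from the (possibly remote) execution environment
    run = Run.get_context()
    run.log(metric_name, value)
    return run
```

A training script would call, for example, log_to_current_run("rmse", 0.42) after evaluating the model.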

MLflow Projects
A project in MLflow is a method of packaging data science code. This allows other data scientists or automated processes to use the code in a consistent manner.
Each project includes at least one entry point, which is a file (either .py or .sh)
Projects also specify details about the environment.

MLflow Models
A model in MLflow is a directory containing an arbitrary set of files along with an MLmodel file in the root of the directory.
Each model has a signature, which describes the expected inputs and outputs for the model.
MLflow allows models to be of a particular flavor, which is a descriptor of which tool or library generated a model. This allows MLflow to work with a wide variety of modeling libraries, such as scikit-learn, Keras, MLlib, ONNX, and many others.

MLflow Model Registry
The MLflow Model Registry allows a data scientist to keep track of a model from MLflow Models.
The data scientist registers a model with the Model Registry, storing details such as the name of the model. Each registered model may have multiple versions, which allow a data scientist to keep track of model changes over time.
It is also possible to stage models. Each model version may be in one stage, such as Staging, Production, or Archived. Data scientists and administrators may transition a model version from one stage to the next.
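Registration and staging can be sketched as follows. This assumes the mlflow package is installed and that the run logged its model under the artifact path "model"; both the path and the function name are assumptions of the example. Check the installed MLflow version as well, since the stage-transition API may be deprecated in newer releases.

```python
def register_and_stage(run_id, model_name, stage="Staging"):
    """Register the model logged by a run, then move that version to a stage."""
    # Imported inside the function so that defining it does not require mlflow.
    import mlflow
    from mlflow.tracking import MlflowClient

    # Assumption: the run logged its model under the artifact path "model"
    model_uri = f"runs:/{run_id}/model"

    # Creates the registered model if needed and adds a new version to it
    model_version = mlflow.register_model(model_uri, model_name)

    # Transition the new version to the requested stage
    client = MlflowClient()
    client.transition_model_version_stage(
        name=model_name,
        version=model_version.version,
        stage=stage,
    )
    return model_version.version
```

Registering the same model name again adds a new version rather than overwriting the previous one.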

DatabricksStep
Specialized pipeline step supported by Azure Machine Learning (for Azure Databricks), with which you can run a notebook, script, or compiled JAR on an Azure Databricks cluster.
In order to run a pipeline step on a Databricks cluster, you need to do the following steps:
1. Attach Azure Databricks Compute to Azure Machine Learning workspace.
2. Define DatabricksStep in a pipeline.
3. Submit the pipeline.

Real-Time Inferencing
The model is deployed as part of a service that enables applications to request immediate, or real-time, predictions for individual, or small numbers of data observations.
In Azure Machine Learning, you can create real-time inferencing solutions by deploying a model as a real-time service, hosted in a containerized platform such as Azure Kubernetes Service (AKS).
You can use the service components and tools to register your model and deploy it to one of the available compute targets so it can be made available as a web service in the Azure cloud, or on an IoT Edge device:

targets
1. Local web service - Testing/debug - Good for limited testing and troubleshooting.
2. Azure Kubernetes Service (AKS) - Real-time inference - Good for high-scale production deployments. Provides autoscaling, and fast response times.
3. Azure Container Instances (ACI) - Testing - Good for low scale, CPU-based workloads.
4. Azure Machine Learning Compute Clusters - Batch inference - Run batch scoring on serverless compute. Supports normal and low-priority VMs.
5. Azure IoT Edge - (Preview) IoT module - Deploy & serve ML models on IoT devices.

Deploy a model to Azure ML
To deploy a model as an inferencing web service, you must perform the following tasks:
1. Register a trained model.
2. Define an Inference Configuration.
3. Define a Deployment Configuration.
4. Deploy the Model.
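The four tasks can be sketched with the Azure ML SDK v1 (azureml-core). This is an illustration under assumptions: the function name, the default file names score.py and environment.yml, the environment name, and the ACI sizing are all placeholders, and a workspace config.json is expected to be present.

```python
def deploy_to_aci(model_path, model_name, service_name,
                  entry_script="score.py", conda_file="environment.yml"):
    """Walk through the four deployment tasks, targeting Azure Container Instances."""
    # Imported inside the function so that defining it does not require azureml-core.
    from azureml.core import Environment, Workspace
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AciWebservice

    # Assumption: a config.json describing the workspace is available
    ws = Workspace.from_config()

    # 1. Register a trained model.
    model = Model.register(workspace=ws, model_path=model_path, model_name=model_name)

    # 2. Define an inference configuration (scoring script plus its environment).
    env = Environment.from_conda_specification(name="inference-env", file_path=conda_file)
    inference_config = InferenceConfig(entry_script=entry_script, environment=env)

    # 3. Define a deployment configuration for the chosen compute target.
    deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

    # 4. Deploy the model and wait until the service is ready.
    service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
    service.wait_for_deployment(show_output=True)
    return service
```

For a production deployment the same steps apply, with an AKS deployment configuration in place of the ACI one.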

Hyperparameter tuning
The process of choosing the hyperparameter values that give the best result on the loss function, that is, the way we penalize an algorithm for being wrong.
Within Azure Databricks, there are two approaches to tuning hyperparameters:
• Automated MLflow tracking - common and simple approach to track model training in Azure Databricks
• Hyperparameter tuning with Hyperopt.
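k-fold cross-validation
The training data is split into k folds. A model is then trained on k-1 folds of the training data and the last fold is used to evaluate its performance. This is repeated so that each fold serves once as the evaluation fold.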
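The fold logic of k-fold cross-validation can be sketched in plain Python. The function name is hypothetical and the sketch does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten samples split into three folds
folds = list(k_fold_indices(10, 3))
```

Each sample lands in the validation fold exactly once, so averaging the k evaluation scores uses all of the training data.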

automated MLflow Tracking
When you use automated MLflow for model tuning, the hyperparameter values and evaluation metrics are automatically logged in MLflow and a hierarchy will be created for the different runs that represent the distinct models you train.
To use automated MLflow tracking, you have to do the following:
1. Use a Python notebook to host your code.
2. Attach the notebook to a cluster with Databricks Runtime or Databricks Runtime for Machine Learning.
3. Set up the hyperparameter tuning with CrossValidator or TrainValidationSplit.

Hyperopt
Tool that allows you to automate the process of hyperparameter tuning and model selection.
Hyperopt is simple to use, but using it efficiently requires care. The main advantage to using Hyperopt is that it is flexible and it can optimize any Python model with hyperparameters
Hyperopt is already installed if you create a compute with the Databricks Runtime ML. To use it when training a Python model, you should follow these basic steps:
1. Define an objective function to minimize.
2. Define the hyperparameter search space.
3. Specify the search algorithm.
4. Run the Hyperopt function fmin().
The objective function represents the main purpose of training multiple models through hyperparameter tuning. Often, the objective is to minimize training or validation loss.

hyperparameter search algorithm
There are two main choices in how Hyperopt will sample over the search space:
1. hyperopt.tpe.suggest: Tree of Parzen Estimators (TPE), a Bayesian approach, which iteratively and adaptively selects new hyperparameter settings to explore based on past results.
2. hyperopt.rand.suggest: Random search, a non-adaptive approach that samples over the search space.
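The four Hyperopt steps and the choice between the two search algorithms can be sketched as below. The quadratic objective, the single hyperparameter x, its range, and the function names are invented for illustration; a real objective would train a model and return its validation loss. The sketch assumes the hyperopt package is installed.

```python
def quadratic_loss(params):
    """1. Objective function to minimize (a stand-in for a validation loss)."""
    return (params["x"] - 2.0) ** 2


def tune(objective, max_evals=50, adaptive=True):
    """Run a Hyperopt search and return the best hyperparameters and the trials."""
    # Imported inside the function so that defining it does not require hyperopt.
    from hyperopt import Trials, fmin, hp, rand, tpe

    # 2. Hyperparameter search space: one value sampled uniformly from [-5, 5]
    space = {"x": hp.uniform("x", -5.0, 5.0)}

    # 3. Search algorithm: adaptive TPE or non-adaptive random search
    algo = tpe.suggest if adaptive else rand.suggest

    # 4. Run fmin(), recording every evaluation in a Trials object
    trials = Trials()
    best = fmin(fn=objective, space=space, algo=algo, max_evals=max_evals, trials=trials)
    return best, trials
```

Calling tune(quadratic_loss) returns a dictionary with the best value found for x, which should move toward 2 as max_evals grows.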

Horovod
Helps data scientists when training deep learning models.
Allows data scientists to distribute the training process and make use of Spark's parallel processing.
Is designed to take care of the infrastructure management so that data scientists can focus on training models.

HorovodRunner
A general API that triggers Horovod jobs. The benefit of using HorovodRunner instead of the Horovod framework directly is that HorovodRunner has been designed to distribute deep learning jobs across Spark workers.
-> HorovodRunner is more stable for long-running deep learning training jobs on Azure Databricks.
Before working with Horovod and HorovodRunner, the code used to train the deep learning model should be tested on a single-node cluster.

Petastorm
Library that enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

LinearRegression
In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we'll use the LinearRegression estimator to train a linear regression model.
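A minimal sketch of the estimator pattern, assuming scikit-learn is installed. The helper name fit_line and the sample points are made up for the example.

```python
def fit_line(xs, ys):
    """Fit a one-feature linear regression and return the trained estimator."""
    # Imported inside the function so that defining it does not require scikit-learn.
    from sklearn.linear_model import LinearRegression

    # scikit-learn expects a 2-D feature matrix: one row per sample
    X = [[x] for x in xs]

    model = LinearRegression()
    model.fit(X, ys)
    return model
```

For points on the line y = 2x + 1, such as fit_line([1, 2, 3, 4], [3, 5, 7, 9]), the fitted model.coef_ and model.intercept_ should recover a slope near 2 and an intercept near 1, and model.predict([[5]]) should be close to 11.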

Authentication with Azure AD
• Interactive:
You use your account in Azure Active Directory to either manually authenticate or obtain an authentication token. Interactive authentication is used during experimentation and iterative development. It enables you to control access to resources (such as a web service) on a per-user basis.
• Service principal:
You create a service principal account in Azure Active Directory, and use it to authenticate or obtain an authentication token. A service principal is used when you need an automated process to authenticate to the service. For example, a continuous integration and deployment script that trains and tests a model every time the training code changes needs ongoing access and so would benefit from a service principal account.
• Azure CLI session:
You use an active Azure CLI session to authenticate. Azure CLI authentication is used during experimentation and iterative development, or when you need an automated process to authenticate to the service using a pre-authenticated session. You can log in to Azure via the Azure CLI on your local workstation, without storing credentials in code or prompting the user to authenticate.
• Managed identity:
When using the Azure Machine Learning SDK on an Azure Virtual Machine, you can use a managed identity for Azure. This workflow allows the VM to connect to the workspace using the managed identity, without storing credentials in code or prompting the user to authenticate. Azure Machine Learning compute clusters can also be configured to use a managed identity to access the workspace when training models.
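Choosing between a service principal and interactive login can be sketched with the Azure ML SDK v1 (azureml-core). The environment variable names AML_TENANT_ID, AML_SP_ID, and AML_SP_SECRET and the function name are hypothetical; only the authentication classes come from the SDK.

```python
import os


def get_workspace_auth():
    """Use a service principal when its secret is configured, else interactive login."""
    # Imported inside the function so that defining it does not require azureml-core.
    from azureml.core.authentication import (
        InteractiveLoginAuthentication,
        ServicePrincipalAuthentication,
    )

    # Hypothetical variable names; an automated process would set these
    secret = os.environ.get("AML_SP_SECRET")
    if secret:
        return ServicePrincipalAuthentication(
            tenant_id=os.environ["AML_TENANT_ID"],
            service_principal_id=os.environ["AML_SP_ID"],
            service_principal_password=secret,
        )

    # Falls back to a manual, per-user login for experimentation
    return InteractiveLoginAuthentication()
```

The returned object can be passed when connecting to the workspace, for example Workspace.from_config(auth=get_workspace_auth()).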

Service principal
Object that defines what the app can actually do in the specific tenant, who can access the app, and what resources the app can access.
When an application is given permission to access resources in a tenant (upon registration or consent), a service principal object is created
A service principal is used when you need an automated process to authenticate to the service.
For example, a continuous integration and deployment script that trains and tests a model every time the training code changes needs ongoing access and so would benefit from a service principal account.

Managed Identities
Managed identities allow you to authenticate services by providing an automatically managed identity for applications or services to use when connecting to Azure cloud services.
Managed identities work with any service that supports Azure AD authentication, and provide activity logs so admins can see user activity such as log-in times, when operations were started, and by whom.

Main resources
Exam DP-100: Designing and Implementing a Data Science Solution on Azure - Certifications | Microsoft Learn
Welcome to Python.org
NumPy user guide — NumPy v1.24 Manual
Introduction to Tensors | TensorFlow Core
pandas - Python Data Analysis Library (pydata.org)
All things · Deep Learning (dzlab.github.io)
API reference — pandas 1.5.3 documentation (pydata.org)
Track ML experiments and models with MLflow - Azure Machine Learning | Microsoft Learn
Lognormal Distribution: Uses, Parameters & Examples - Statistics By Jim
Normal Distribution | Examples, Formulas, & Uses (scribbr.com)

Docker in Visual Studio Code

Study notes

Why Docker here, among statistics and Azure ML?

When you create experiments, not everything goes well and you need to debug.
If the experiment involves deep learning, convolutional neural networks, or anything else that requires AKS clusters, the recommended option is to run the buggy experiment in a container on Docker and then send it back to the cloud, into the AKS cluster.

The good news is that Visual Studio Code makes our life easier.

Initial steps:
The Docker extension can scaffold Docker files for most popular development languages (C#, Node.js, Python, Ruby, Go, and Java) and customizes the generated Docker files accordingly.

The app has to be set up in the folder.
Docker/Node.js steps:
Open a terminal (command prompt in VS Code) and install Express (Node.js):
>npx express-generator
>npm install

  • Generate Docker file
Open the Command Palette and run:
>Docker: Add Docker Files to Workspace
If the selected image is Node.js, then:
- before doing this, install Node; you can do it in the terminal (VS Code)
- you may get an error: "No package.json found on workspace"

Open a terminal and run:
>npm init
Now I assume the "Docker: Add Docker Files to Workspace" command works.


Main resources:
https://code.visualstudio.com/docs/containers/overview
https://docs.docker.com/desktop/