Knowledge mining with Azure Cognitive Search

Study notes

1. Cognitive Search

Provides a cloud-based solution for indexing and querying a wide range of data sources, and for creating comprehensive, high-scale search solutions.
Plain English: it lets you search across all the data you have in the Azure cloud, regardless of where it lives or what it is: databases, documents, and more.
To start, create an Azure Cognitive Search resource. During creation, select a pricing tier based on what you need:
  • Free (F)
    Learn how the service works and try it out.
  • Basic (B)
    Small-scale search solutions that include a maximum of 15 indexes and 2 GB of index data.
  • Standard (S)
    Enterprise-scale solutions. Has variants S, S2, and S3, which offer increasing capacity, read performance, and numbers of indexes.
  • Storage Optimized (L)
    Variants L1 and L2 support large indexes at the cost of higher query latency.
Optimize your solution for scalability and availability by creating:
Replicas and partitions => search units = R × P (for example, 3 replicas × 2 partitions = 6 search units)
  • Replicas (R)
    Instances of the search service (similar to nodes in a cluster). A higher number of replicas improves availability and query throughput.
  • Partitions (P)
    Divide an index into multiple storage locations, enabling you to split I/O operations such as querying or rebuilding an index.
Search components:
  • Data source
    Many options: unstructured files in Azure Blob Storage, tables in Azure SQL Database, documents in Cosmos DB, or JSON pushed directly into the index.
  • Skillset
    This is the AI in action: on top of the extracted data, AI skills add more details/insights via an enrichment pipeline, for example:
    • Language used.
    • Key phrases / main topics
    • Sentiment score
    • Specific locations, people, organizations, or landmarks
    • Image descriptions, and text extracted from images (OCR).
    • Custom skills.
  • Indexer
    The engine that drives the overall indexing process. It takes the output already generated and maps it to fields in the index.
    It creates the index.
    Extracted fields are mapped as follows:
    • Fields extracted from the source data are mapped directly to index fields.
      • Implicit mapping - fields are automatically mapped to index fields with the same name.
      • Explicit mapping - a mapping is defined and may rename the field in the index.
    • Fields produced by skills (the skillset) are explicitly mapped to target fields in the index.
  • Index
    The searchable result: a collection of JSON documents used by the client application.
    It is an entity that contains the details extracted and enriched (metadata, normalized images, language used, text from images, merged content from enrichment details).
    Field attributes (a sample field definition follows this list):
    • key
      Unique key for index records.
    • searchable
      The content that full-text search queries run against.
    • filterable
      Fields that can be included in filter expressions to return only documents that match specified constraints.
    • sortable
      Fields that can be used to order the results.
    • facetable
      Fields that can be used to determine values for facets (user interface elements used to filter the results based on a list of known field values).
    • retrievable
      Fields that can be included in search results (by default, all fields are retrievable unless this attribute is explicitly removed).
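A minimal sketch of an index definition showing these attributes; the index and field names are illustrative, not taken from the original notes:
{
  "name": "docs-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false, "retrievable": true },
    { "name": "content", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "author", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
    { "name": "last_modified", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true }
  ]
}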
Search an index
Based on the Lucene query syntax, which provides a rich set of query operations for searching, filtering, and sorting data in indexes.
  • Simple
    An intuitive syntax for basic searches - matching literal query terms.
  • Full
    The full Lucene syntax for advanced queries. You submit the search term together with search parameters such as (see the request sketch after this list):
    • queryType - simple or full
    • searchFields - Index fields to be searched.
    • select - Fields to be included in the results.
    • searchMode - Any/All - Criteria for including results based on multiple search terms.
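A minimal sketch of a search request against the REST API; the service name, index name, key, and field names are placeholders:
POST https://<service-name>.search.windows.net/indexes/<index-name>/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: <query-key>

{
  "search": "azure search",
  "queryType": "full",
  "searchFields": "content,author",
  "select": "content,author,last_modified",
  "searchMode": "all"
}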
Query processing stages
  • Query parsing.
  • Lexical analysis
  • Document retrieval
  • Scoring
Filtering results
  • Include filter criteria - valid for Simple search.
    search=TERM+author='Reviewer'
    queryType=Simple
  • Providing a parameter - an OData filter expression as a $filter parameter with a full Lucene search expression.
    search=TERM
    $filter=author eq 'Reviewer'
    queryType=Full
  • Filtering with facets
    Facets are a useful way to present users with filtering criteria based on field values in a result set. Example: first submit a query with a facet parameter on the author field to get the list of authors, then filter by the value the user selects:
    search=*
    $filter=author eq 'selected-facet-value-here'
Sorting results
By default, results are sorted based on the relevancy score.
Use the OData $orderby parameter, which specifies one or more sortable fields and a sort order (asc or desc).
search=*
$orderby=last_modified desc


Enhance the index
  • Search-as-you-type (both features rely on a suggester defined in the index; see the sketch after this list)
    • Suggestions
    • Autocomplete
  • Custom scoring and result boosting
    By default, search results are sorted by a relevance score calculated with a term-frequency/inverse-document-frequency (TF/IDF) based algorithm. You can customize how this score is calculated by defining a scoring profile (e.g. to increase the relevance of certain documents).
    You can modify an index definition so that it uses your custom scoring profile by default.
  • Synonyms
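A minimal sketch of a suggester added to the index definition to enable search-as-you-type; the suggester name and source fields are illustrative:
"suggesters": [
  {
    "name": "sg",
    "searchMode": "analyzingInfixMatching",
    "sourceFields": [ "author", "content" ]
  }
]
Suggestions and autocomplete are then requested against the docs/suggest and docs/autocomplete endpoints, passing suggesterName=sg and the partial search text.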
2. Custom skill

A custom skill must implement the schema for input and output data that is expected by skills in an Azure Cognitive Search skillset.
  • Input Schema
    Defines a JSON structure containing a record for each document to be processed.
    Each document has a unique identifier, and a data payload with one or more inputs
  • Output schema
    Defines the structure of the results returned by your custom skill; it mirrors the input schema.
    The output will contain a record for each input record, with either the results produced by the skill or details of any errors that occurred.
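A minimal sketch of the request and response bodies a custom skill exchanges with the indexer; the data field names are illustrative:
Input:
{
  "values": [
    {
      "recordId": "1",
      "data": { "text": "Text of the first document to process" }
    }
  ]
}
Output:
{
  "values": [
    {
      "recordId": "1",
      "data": { "class": "Class1" },
      "errors": [],
      "warnings": []
    }
  ]
}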
To integrate a custom skill into your indexing solution, you must add a skill for it to a skillset using the Custom.WebApiSkill skill type.
Skill definition:
  • Specify the URI to your web API endpoint, including parameters and headers if necessary.
  • Set the context to specify at which point in the document hierarchy the skill should be called
  • Assign input values, usually from existing document fields
  • Store output in a new field, optionally specifying a target field name (otherwise the output name is used)
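A minimal sketch of such a skill definition; the URI, input source, and output names are placeholders, not values from the original notes:
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "description": "Calls a custom web API to classify text",
  "uri": "https://<your-function-app>.azurewebsites.net/api/<function-name>?code=<function-key>",
  "httpMethod": "POST",
  "context": "/document",
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "class", "targetName": "class" }
  ]
}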
3. Knowledge stores

A knowledge store consists of projections of the enriched data, which can be JSON objects, tables, or image files.
When an indexer runs the pipeline to create or update an index, the projections are generated and persisted in the knowledge store.

The result of indexing is a collection of JSON objects.
It is an output, but it can also serve as an input:
  • for integration into a data orchestration process (e.g. Azure Data Factory)
  • normalized and imported into a relational database, where it can be used by visualization tools
  • to create an image index (saving extracted images for browsing)
Azure Cognitive Search enables you to create search solutions in which a pipeline of AI skills is used to enrich data and populate an index.
The data enrichments performed by the skills in the pipeline supplement the source data with insights:
  • Language
  • Main themes or topics
  • Sentiment score
  • Locations, people, organizations, or landmarks
  • AI-generated descriptions of images, or image text extracted by optical character recognition (OCR).
The process of indexing incrementally creates a complex document that contains the various output fields from the skills in the skillset.
The Shaper skill:
  • simplifies the mapping of these field values to projections in a knowledge store
  • creates a new field containing a simpler structure for the fields you want to map to projections
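A minimal sketch of a Shaper skill, assuming illustrative source paths from earlier enrichment steps; apart from the @odata.type, the names and paths are placeholders:
{
  "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
  "name": "define-projection",
  "context": "/document",
  "inputs": [
    { "name": "file_name", "source": "/document/metadata_storage_name" },
    { "name": "url", "source": "/document/metadata_storage_path" },
    { "name": "key_phrases", "source": "/document/merged_content/keyphrases/*" },
    { "name": "sentiment", "source": "/document/merged_content/sentiment" }
  ],
  "outputs": [
    { "name": "output", "targetName": "projection" }
  ]
}
The /projection source referenced in the knowledge store example below comes from a shape like this.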
First, create a knowledgeStore object in the skillset definition.
You can define different types of projections.
Define a separate projection definition for each type, because only one projection type in each definition can be populated from the Shaper output.
  • object projections
  • table projections
  • file projections
Example:
"knowledgeStore": {
"storageConnectionString": "<storage_connection_string>",
"projections": [
{
"objects": [
{
"storageContainer": "<container>",
"source": "/projection"
}
],
"tables": [],
"files": []
},
{
"objects": [],
"tables": [
{
"tableName": "KeyPhrases",
"generatedKeyName": "keyphrase_id",
"source": "projection/key_phrases/*",
},
{
"tableName": "docs",
"generatedKeyName": "document_id",
"source": "/projection"
}
],
"files": []
},
{
"objects": [],
"tables": [],
"files": [
{
"storageContainer": "<container>",
"source": "/document/normalized_images/*"
}
]
}
]
}


4. Enrich index in Language Studio

We put together data modeling (A) and search (B):
  1. Store documents you wish to search
    Use Blob containers
    Classify the documents (single-label or multi-label), or at least label them with the category(ies) they belong to (needed for the next step).
    For example:
    ...
    "documents": [
      {
        "location": "{DOCUMENT-NAME}",
        "language": "{LANGUAGE-CODE}",
        "dataset": "{DATASET}",
        "classes": [
          { "category": "Class1" },
          { "category": "Class2" }
        ]
      }
    ...

    ===A===
  2. Create a custom text classification project
  3. Train and test your model (we have the model and its endpoint to access it)
    You can add documents to the test set.
    ===B===
  4. Create a search index based on your stored documents
  5. Create a function app that will use your deployed trained model
    1. Pass the text as JSON to the custom text classification endpoint.
    2. Get the response and process it.
    3. Return a structured JSON message back to the custom skillset in Cognitive Search.
      The function must know:
      1. The text to be classified.
      2. The endpoint for your trained custom text classification deployed model.
      3. The primary key for the custom text classification project.
      4. The project name.
      5. The deployment name.
  6. Update your search solution, your index, indexer, and custom skillset
    There are three changes you need to make in the Azure portal to enrich your search index:
    1. Add a field to your index to store the custom text classification enrichment.
    2. Add a custom skillset to call your function app with the text to classify.
    3. Map the response from the skillset into the index (a sketch of the indexer's output field mapping follows this list).
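    A minimal sketch of the output field mapping added to the indexer for step 3; the source and target field names are illustrative:
    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/class",
        "targetFieldName": "classified_category"
      }
    ]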
5. Implement advanced search features in Azure Cognitive Search

How search calculates scores for documents and the tools you have to influence that score.
You can boost individual terms in your search queries, add custom scoring profiles to focus on the most important field in your index, enrich your indexes with more languages, and return results based on their location.

All search engines try to return the most relevant results to search queries. Azure Cognitive Search implements an enhanced version of Apache Lucene for full text search.

Query - improve the ranking of a document with term boosting (requires the full Lucene query parser; for example, search=luxury^3 hotel with queryType=full boosts matches on the term "luxury").

  • Results - Improve the relevance of results by adding scoring profiles.
  • Azure Cognitive Search uses the BM25 similarity ranking algorithm. The algorithm scores documents based on the search terms used.
    The search engine scores the documents returned from the first three phases.
    By default, the search results are ordered by their search score, highest first. If two documents have an identical search score, you can break the tie by adding an $orderby clause.
    Put simply, the document score is a function of:
    • the number of times the identified search terms appear in a document
    • the document's size
    • the rarity of each of the terms.
Cognitive Search lets you influence a document's score using scoring profiles.
The simplest scoring profile defines different weights for fields in an index (a sketch follows the function list below).
Scoring profile functions:
  • Magnitude
    Alter scores based on a range of values for a numeric field
  • Freshness
    Alter scores based on the freshness of documents as given by a DateTimeOffset field
  • Distance
    Alter scores based on the distance between a reference location given in the query and a geo-point field in the document
  • Tag
    Alter scores based on common tag values in documents and queries
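A minimal sketch of a scoring profile added to the index definition, combining field weights with a freshness function; the profile name, field names, and boost values are illustrative:
"scoringProfiles": [
  {
    "name": "boost-recent",
    "text": {
      "weights": { "content": 2, "author": 1 }
    },
    "functions": [
      {
        "type": "freshness",
        "fieldName": "last_modified",
        "boost": 5,
        "interpolation": "linear",
        "freshness": { "boostingDuration": "P30D" }
      }
    ]
  }
]
A query can then apply it with the scoringProfile=boost-recent parameter, or the profile can be set as the index default via defaultScoringProfile.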
