Orbit Extract is a system that extracts both objective and subjective information from a range of documents.
Objective information has clearly defined attributes, for example identifiers, company names, and invoice amounts, and can always be traced back to its exact location in the source file.
Subjective information is derived and summarised by algorithms, for example sentiment, summarisation, and Q&A. Since absolute accuracy is not attainable here, the quality of subjective extraction is measured against manual annotations using agreed metrics such as precision and recall, alongside other chosen statistics.
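As a rough illustration of how subjective extraction quality can be measured against manual annotations, the sketch below computes precision and recall over hypothetical (item, label) pairs; the data and labels are invented for this example:

```python
def precision_recall(predicted, annotated):
    """Compare machine-produced labels against manual annotations.

    predicted / annotated are sets of (item_id, label) pairs.
    """
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall

# Illustrative data: item 2 was labelled differently by the machine.
predicted = {(1, "positive"), (2, "negative"), (3, "positive")}
annotated = {(1, "positive"), (2, "positive"), (3, "positive")}

p, r = precision_recall(predicted, annotated)
```

In practice the comparison would run over thousands of annotated examples, but the arithmetic is the same.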
The key components and architecture of the system:
The system is designed to process all unstructured data types, for example PDFs, web pages, Word documents, and scanned files.
In many cases, the analysis involves processing publicly available information. Orbit provides a wide range of off-the-shelf global datasets, which are updated daily by the Orbit Data team (please refer to Orbit Data).
For Orbit-provided content, the data is run through further processing stages, with entitlements that govern data coverage, frequency, and latency.
Orbit offers a data collection service for bespoke requirements.
Where internal documents need to be processed together with their related internal metadata, Orbit provides a number of connectors that automatically upload the data into its secure cloud environment.
Depending on the use case, a number of different pre-processing and functional algorithms can be applied. Orbit hosts and provides a range of generic processing capabilities via API, as well as training tailored, project-specific models for clients.
Orbit provides a cloud environment covering everything from data storage to computation and, where required, can also provide on-premises or hybrid deployment solutions.
A number of algorithms are available for both objective and subjective extraction; depending on the task at hand, one or more of the following will be used on client projects.
Algorithms to extract text from documents, split it into sentences, and so on.
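To give a feel for this pre-processing step, here is a minimal, naive sentence splitter; a production pipeline would also handle abbreviations, decimals, and other edge cases:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace.

    Real pipelines need more care (abbreviations like "e.g.", decimal
    numbers, quotations), but this shows the basic operation.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sentences = split_sentences("Revenue grew 12%. Costs fell. Was margin up? Yes!")
```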
Rule-based extraction
For simple objective extractions, e.g. attributes in consistent formats, rules are the most efficient approach. Rules can be explicitly defined, and users can be trained to configure them, ranging from keyword matching and regular expressions to linguistic pattern-based matching.
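As a rough sketch of the rules approach, the example below applies hypothetical regex rules to an invoice-like snippet and records character offsets for each value, so every extraction can be traced back to its location. The rule names, patterns, and document text are all invented for illustration:

```python
import re

# Hypothetical rule set: each rule names an attribute and a regex pattern
# whose first capture group is the value to extract.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:?\s*([A-Z0-9-]+)"),
    "amount": re.compile(r"Total\s+Due\s*:?\s*\$?([\d,]+\.\d{2})"),
}

def apply_rules(text):
    """Run each rule and record the value plus its character offsets."""
    results = {}
    for name, pattern in RULES.items():
        m = pattern.search(text)
        if m:
            results[name] = {"value": m.group(1), "span": m.span(1)}
    return results

doc = "Invoice No: INV-2041\nTotal Due: $1,250.00"
fields = apply_rules(doc)
```

Because rules are explicit, non-developers can be trained to adjust the patterns when document formats change.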
Template-based extraction
For lengthy documents, e.g. prospectuses and annual reports, where extractions rely on document structure, templates are the most effective method.
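The sketch below shows one way a template might work, assuming a document whose sections are marked by all-caps heading lines; each template field names the section to search and the pattern to apply within it. The template fields, section names, and document are all hypothetical:

```python
import re

# Hypothetical template for a structured report: each field says which
# section to look in and what pattern to apply inside that section.
TEMPLATE = {
    "issuer": {"section": "ISSUER DETAILS", "pattern": r"Name:\s*(.+)"},
    "coupon": {"section": "TERMS", "pattern": r"Coupon:\s*([\d.]+%)"},
}

def split_into_sections(text):
    """Split a document on all-caps heading lines (this template's assumption)."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.isupper() and line.strip():
            current = line.strip()
            sections[current] = []
        elif current:
            sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}

def apply_template(text, template):
    sections = split_into_sections(text)
    out = {}
    for field, spec in template.items():
        body = sections.get(spec["section"], "")
        m = re.search(spec["pattern"], body)
        if m:
            out[field] = m.group(1).strip()
    return out

doc = "ISSUER DETAILS\nName: Acme Corp\nTERMS\nCoupon: 4.5%"
fields = apply_template(doc, TEMPLATE)
```

Restricting each pattern to its section is what makes templates robust on long documents, where the same pattern might otherwise match in the wrong place.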
For more straightforward documents, e.g. invoices and short contracts that arrive in high volumes and in inconsistent formats, it can be easier to convert the documents to images and run Optical Character Recognition (OCR) to extract the text.
Model-based extraction
For tasks where it’s harder to define patterns or rules, the only feasible approach is to develop bespoke models.
ESG classification model
Orbit provides a sentence-level ESG topic classification model off-the-shelf to help locate relevant content more efficiently.
Question answering framework
Orbit has developed a bespoke framework that takes a natural-language question and searches pre-defined content for candidate answers, locating the relevant information.
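As a simplified stand-in for how such a framework might rank candidate answers, the sketch below scores sentences by token overlap with the question; a real system would use a trained model rather than this overlap heuristic, and all names and data here are invented:

```python
def score_candidates(question, sentences):
    """Rank candidate sentences by token overlap with the question.

    A toy stand-in for model-based answer scoring.
    """
    q_tokens = set(question.lower().split())
    scored = []
    for s in sentences:
        overlap = len(q_tokens & set(s.lower().split()))
        scored.append((overlap, s))
    scored.sort(key=lambda x: -x[0])
    return scored

sentences = [
    "The company reported revenue of 5m in 2023.",
    "The CEO resigned in March.",
    "Revenue guidance for 2024 was withdrawn.",
]
ranked = score_candidates("What revenue did the company report?", sentences)
best = ranked[0][1]
```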
Sentence-level event detection
A deep learning model that runs over public documents, for example news, to detect entities and pre-defined events at sentence level. A typical use case is scanning news for ESG controversies with a high confidence level.
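To make the idea concrete, here is a keyword-based sketch of sentence-level entity and event detection; the production system uses a deep learning model, and the event types, trigger words, and entities below are purely illustrative:

```python
# Hypothetical event lexicon: event type -> trigger words. The real
# system would score sentences with a trained model instead.
EVENT_TRIGGERS = {
    "environmental_fine": {"fine", "fined", "penalty"},
    "data_breach": {"breach", "leak", "hacked"},
}
KNOWN_ENTITIES = {"Acme Corp", "Globex"}

def detect_events(sentences):
    """Flag (entity, event) pairs found together in the same sentence."""
    hits = []
    for s in sentences:
        tokens = set(s.lower().replace(".", "").split())
        entities = [e for e in KNOWN_ENTITIES if e in s]
        for event, triggers in EVENT_TRIGGERS.items():
            if entities and tokens & triggers:
                hits.append({"sentence": s, "entities": entities, "event": event})
    return hits

news = [
    "Acme Corp was fined for illegal dumping.",
    "Markets were calm on Tuesday.",
]
events = detect_events(news)
```

Requiring an entity and an event trigger in the same sentence is what keeps the detection precise enough to scan large news volumes for controversies.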
Summarisation
Generates summaries from news, research reports, and more.
Sentiment analysis
A multi-lingual sentiment model that operates at both article and sentence level.
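As a toy illustration of how sentence-level scores can roll up to an article-level label, the sketch below uses a tiny hand-written lexicon; the real model is multi-lingual and learned, and everything here is invented for the example:

```python
# Tiny illustrative lexicon; a production model is learned, not hand-built.
POSITIVE = {"growth", "strong", "beat", "improved"}
NEGATIVE = {"loss", "weak", "missed", "declined"}

def sentence_sentiment(sentence):
    """Label one sentence by counting lexicon hits."""
    tokens = set(sentence.lower().replace(",", "").replace(".", "").split())
    score = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def article_sentiment(sentences):
    """Aggregate sentence-level labels into an article-level label."""
    labels = [sentence_sentiment(s) for s in sentences]
    score = labels.count("positive") - labels.count("negative")
    overall = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return overall, labels

article = ["Revenue showed strong growth.", "Margins declined slightly."]
overall, per_sentence = article_sentiment(article)
```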
Machine extraction can never be 100% accurate. To address this, Orbit provides user interfaces and workflows that let users correct errors manually, as well as managed services for this work. These tools:
- Allow team collaboration
- Support quality assurance checks
- Let users generate training data via manual labelling to train their own models
How it works
To receive a clean extraction as a dataset
We decide together the:
- Data source
- Extraction logic
- Extraction frequency
- Exception handling logic (if required)
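The decisions above could be captured in a project configuration like the sketch below; the key names, values, and validation rule are hypothetical, shown only to make the set of decisions concrete:

```python
# Hypothetical project configuration capturing the four decisions above.
project_config = {
    "data_source": "client_sftp://contracts/inbound",  # illustrative URI
    "extraction_logic": ["rules", "template"],         # methods to apply
    "extraction_frequency": "daily",
    "exception_handling": {"route_to": "review_queue", "max_retries": 2},
}

# Exception handling is optional; the rest must be agreed before go-live.
REQUIRED_KEYS = {"data_source", "extraction_logic", "extraction_frequency"}

def validate_config(config):
    """Return the agreed decisions still missing from a configuration."""
    return sorted(REQUIRED_KEYS - config.keys())

problems = validate_config(project_config)
```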
Clients receive the clean dataset, with the ability to check values against the original documents.
To use the system as an efficiency-improvement tool
We configure the solution for your specific use case.
If you need us to prepare raw data, we can set this up for you; otherwise, you can upload your own in-house or sourced data to the system directly.
You can then operate the system as required.