Document Classification

Modified on Mon, 19 Jun 2023

Once a document is uploaded to the Extract folder, the document editor dashboard displays the document statuses at the top and presents a card for each available document model.

Existing document models are displayed as cards, such as Invoices, Bills of Lading, and Bank Statements. You can create additional document models based on the documents you need to extract.

If the system is unable to identify the model for a newly uploaded document, it will be displayed as an Unknown Document Model, and you will need to classify it accordingly.

  1. Click on the document to open a screen where you can classify it.
  2. Choose an existing document model from the drop-down if it has already been created, or create a new document model by giving it a name and selecting the required Business object and attributes to store the document fields.
  3. The document definition screen will then open, allowing you to define the type of data extraction, mark the split identifier, or use the ML model.
  4. Click the Finish Definition button.

The document definition is now complete, and you can start training the documents.

Type of Data Extraction

  • Unstructured documents: Documents that lack a predefined format or organization, making it difficult to extract specific information automatically.
  • Semi-structured documents: Documents with some level of organization or structure that also contain unformatted or variable data. They may combine fixed fields with free-form text, allowing for partial automated data extraction. An example is an invoice with additional elements such as images and checkboxes.

Page Splitting

To handle long documents with multiple records, you can enable page splitting during the document model definition. Simply mark the attribute, such as invoice ID, as the split identifier to split the document into individual records.

E.g.

Consider a document containing 1000 invoices, each with a unique invoice number. During the document model configuration, you mark the invoice number as the split identifier. The system then splits the document into individual records based on the invoice number, so you can store and manage each invoice separately within the provided Business object.
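
The splitting behavior described above can be sketched in Python. This is only an illustration of the idea, not the product's actual implementation: the Page class, the split_by_identifier function, and the invoice_number attribute name are all hypothetical, and the sketch assumes that a page without an identifier value continues the previous record.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for one page of an uploaded document."""
    number: int
    fields: dict  # extracted attribute name -> value


def split_by_identifier(pages, split_attribute):
    """Group consecutive pages into records.

    A new record starts whenever the split identifier takes a new value.
    Pages without an identifier are appended to the current record
    (an assumption made for this sketch).
    """
    records = []
    current_id = None
    for page in pages:
        value = page.fields.get(split_attribute)
        if value is not None and value != current_id:
            records.append({"id": value, "pages": []})
            current_id = value
        # Pages seen before any identifier appears are skipped in this sketch.
        if records:
            records[-1]["pages"].append(page.number)
    return records


pages = [
    Page(1, {"invoice_number": "INV-001"}),
    Page(2, {}),  # continuation page with no invoice number
    Page(3, {"invoice_number": "INV-002"}),
    Page(4, {"invoice_number": "INV-002"}),
]
records = split_by_identifier(pages, "invoice_number")
for record in records:
    print(record["id"], record["pages"])
```

Here a four-page document yields two records, one per invoice number, each of which could then be stored as a separate entry in the Business object.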

ML Models

In the Document Editor, you have the option to utilize machine learning (ML) models for data extraction. Choose between default ML predictions, where the system automatically predicts and extracts data, or predefined models with higher accuracy. Simply select the desired ML model option during the document definition.

E.g.

Consider an invoice document where the invoice number can be extracted by a predefined model with 95% accuracy. You select this predefined model to ensure precise data extraction for the invoice number field. However, an unpredictable or unknown field may also appear, such as a "Special Discount" field that varies from invoice to invoice. In this scenario, you can rely on the default ML model for data extraction.
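
The choice between the two options can be pictured as a per-field lookup with a fallback. The Python sketch below is purely illustrative: the model names, field names, and the choose_model function are hypothetical and do not correspond to actual Document Editor settings.

```python
# Hypothetical name for the default ML prediction option.
DEFAULT_MODEL = "default_ml_prediction"


def choose_model(field_name, predefined_models):
    """Return the predefined model for a field if one exists,
    otherwise fall back to the default ML prediction."""
    return predefined_models.get(field_name, DEFAULT_MODEL)


# Assumed mapping: only the invoice number has a predefined model.
predefined_models = {"invoice_number": "predefined_invoice_number_model"}

fields = ["invoice_number", "special_discount"]
plan = {name: choose_model(name, predefined_models) for name in fields}
for name, model in plan.items():
    print(name, "->", model)
```

In this sketch the invoice number is handled by its predefined model, while the varying "Special Discount" field falls back to the default ML prediction.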