
Field Extraction

Overview

Field extraction lets you submit PDFs together with a JSON Schema and receive a structured JSON payload that conforms to that schema. Use it when you know the fields you want to extract, such as invoice numbers, totals, policy details, line items, or any repeated list of objects.

This pipeline is different from full extraction:

| Pipeline | Endpoint | Output |
| --- | --- | --- |
| Full extraction | POST /api/v1/extract | Markdown, structured document JSON, page images, image crops, and summaries |
| Field extraction | POST /api/v1/extract-fields | One {file_name}_fields.json payload per document, based on your JSON Schema |

Field extraction is asynchronous. The submit response returns a request_uid; use Request Status and Document Status to poll progress, then retrieve results with POST /api/v1/documents/fields.


POST /api/v1/generate-schema

Generates a JSON Schema from a natural-language description. This endpoint is synchronous and does not consume credits.

Parameters
| Name | In | Type | Description |
| --- | --- | --- | --- |
| description | body | string | Description of the fields you want to extract |
{
  "description": "Extract the invoice number, total amount, due date, and every line item with description, quantity, and unit price."
}
Response

200

{
  "success": true,
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total_amount": { "type": "number" },
      "due_date": { "type": "string" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          },
          "required": ["description", "quantity", "unit_price"]
        }
      }
    },
    "required": ["invoice_number", "total_amount", "line_items"]
  },
  "message": "Schema generated successfully. Review and edit before submitting to /extract-fields."
}

Note

Always review the generated schema before submitting it to /api/v1/extract-fields. The schema controls the output shape and the merge behavior for repeated fields.

Errors
| Status | Meaning |
| --- | --- |
| 400 | Empty description |
| 500 | The schema generator could not produce a valid schema |
Example: Try it out!
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/generate-schema' \
-H 'Authorization: Bearer nl_********************************' \
-H 'Content-Type: application/json' \
-d '{
  "description": "Extract invoice number, total amount, due date, and line items with description, quantity, and unit price."
}'
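
The same call can be sketched in Python with only the standard library. The helper names `build_generate_schema_request` and `generate_schema` are illustrative and not part of the API; the URL, headers, and body mirror the curl example above, and the response is assumed to have the `json_schema` key shown in the sample response.

```python
import json
import urllib.request

BASE_URL = "https://neurolinker.api.ainexxo.com"


def build_generate_schema_request(description: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/v1/generate-schema."""
    if not description.strip():
        # The endpoint documents a 400 for an empty description; fail early.
        raise ValueError("description must not be empty")
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/generate-schema",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate_schema(description: str, api_key: str) -> dict:
    """Send the request and return the json_schema field of the response."""
    request = build_generate_schema_request(description, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["json_schema"]
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.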

POST /api/v1/extract-fields

Submits PDFs and a JSON Schema for field extraction.

Parameters
| Name | Type | Description |
| --- | --- | --- |
| documents | list of files | Optional binary PDF files uploaded directly to the endpoint |
| form | JSON string | Must contain json_schema; may also contain documents_url, alias, and description |

At least one of documents or documents_url is required. The json_schema field is always required.

Form parameter
{
  "documents_url": ["https://example.com/invoice.pdf"],
  "alias": "April invoices",
  "description": "Invoice extraction batch",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total_amount": { "type": "number" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          },
          "required": ["description", "quantity", "unit_price"]
        }
      }
    },
    "required": ["invoice_number", "total_amount"]
  }
}
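
Because form is a JSON string rather than a nested object, it has to be serialized before it is attached to the multipart request. A minimal sketch follows; `build_form_field` is a hypothetical helper name, and the keys are the ones listed in the parameter table above.

```python
import json
from typing import Optional, Sequence


def build_form_field(
    json_schema: dict,
    documents_url: Optional[Sequence[str]] = None,
    alias: Optional[str] = None,
    description: Optional[str] = None,
) -> str:
    """Serialize the form parameter as a JSON string."""
    if not isinstance(json_schema, dict) or not json_schema:
        # json_schema is always required by the endpoint.
        raise ValueError("json_schema is always required")
    form = {"json_schema": json_schema}
    if documents_url:
        form["documents_url"] = list(documents_url)
    if alias is not None:
        form["alias"] = alias
    if description is not None:
        form["description"] = description
    return json.dumps(form)
```

The returned string would be sent as the value of the form part, next to any uploaded documents files. The helper does not check that at least one of documents or documents_url is present, since uploaded files are attached elsewhere.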
Response

200

{
  "request_uid": "a1c2b3d4-e5f6-7890-abcd-ef1234567890",
  "status": "PENDING",
  "performance": null,
  "msg": null
}

The status value in this response is the Celery task state. For user-facing progress, poll GET /api/v1/request-status/{request_id}.
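
A polling loop for this step might look like the sketch below. The shape of the request-status response is not described in this section, so the status lookup is passed in as a callable. The terminal state names are assumptions: only completed is mentioned on this page, and failed is a guess.

```python
import time
from typing import Callable

# Assumed terminal states; only "completed" appears in this document.
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status repeatedly until it reports a terminal state."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.lower() in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

In practice get_status would wrap the GET call to the request-status endpoint and return whichever field carries the user-facing state. The sleep and clock parameters exist only so the loop can be exercised without waiting.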

Validation and credits

Before the job is accepted, NeuroLinker:

  • validates that the schema is valid JSON Schema Draft 7;
  • validates that the schema uses the supported subset documented below;
  • counts document pages;
  • reserves field extraction credits for each processed page.

If validation fails, no credits are reserved and no files are uploaded. If there are not enough credits, the endpoint returns 402.

Errors
| Status | Meaning |
| --- | --- |
| 400 | Invalid schema, unsupported schema keyword, or missing PDFs/URLs |
| 402 | Insufficient credits |
Example: Try it out! (file upload)
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents=@"invoice.pdf"' \
-F 'form="{
  \"alias\": \"April invoices\",
  \"json_schema\": {
    \"type\": \"object\",
    \"properties\": {
      \"invoice_number\": { \"type\": \"string\" },
      \"total_amount\": { \"type\": \"number\" }
    },
    \"required\": [\"invoice_number\", \"total_amount\"]
  }
}"'
Example: Try it out! (URL upload)
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents="[]"' \
-F 'form="{
  \"documents_url\": [\"https://example.com/invoice.pdf\"],
  \"json_schema\": {
    \"type\": \"object\",
    \"properties\": {
      \"invoice_number\": { \"type\": \"string\" },
      \"total_amount\": { \"type\": \"number\" }
    },
    \"required\": [\"invoice_number\", \"total_amount\"]
  }
}"'

Supported JSON Schema subset

Field extraction accepts a restricted subset of JSON Schema Draft 7.

Accepted keywords
| Keyword | Usage |
| --- | --- |
| type | string, number, integer, boolean, array, or object |
| properties | Required on object; must be a non-empty mapping |
| required | List of property names present in properties |
| items | Required on array; must be a single schema |
| enum | Scalar fields only; values must match the field type |
| description | Optional semantic hint forwarded to extraction |
Structural rules
  • The root schema must be an object with non-empty properties.
  • Property names must be valid identifiers: no leading digits and no reserved Python keywords.
  • Union types such as "type": ["string", "null"] are not supported. For optional fields, omit the field from required.
  • Array items must be a single schema, not a tuple-style list.
Rejected keywords

anyOf, oneOf, allOf, not, $ref, $defs, definitions, pattern, format, minLength, maxLength, minimum, maximum, minItems, maxItems, additionalProperties, default, and const.
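
These rules can be checked locally before a schema is submitted. The sketch below is a client-side approximation, not the server's validator: the function names are hypothetical, Python's own `str.isidentifier` stands in for the identifier rule, and keywords that appear in neither the accepted nor the rejected list are left unflagged because this page does not say how they are treated.

```python
import keyword

# Keywords this page lists as rejected.
REJECTED = {
    "anyOf", "oneOf", "allOf", "not", "$ref", "$defs", "definitions",
    "pattern", "format", "minLength", "maxLength", "minimum", "maximum",
    "minItems", "maxItems", "additionalProperties", "default", "const",
}
SCALARS = {"string", "number", "integer", "boolean"}


def _matches(value, type_name):
    """Check an enum value against a scalar type name (bool is not a number)."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False
    if type_name == "integer":
        return isinstance(value, int)
    return isinstance(value, (int, float))


def check_node(schema, path="$"):
    """Return a list of rule violations for one schema node and its children."""
    if not isinstance(schema, dict):
        return [f"{path}: schema must be a JSON object"]
    errors = [f"{path}: rejected keyword '{k}'" for k in schema if k in REJECTED]
    node_type = schema.get("type")
    if not isinstance(node_type, str) or node_type not in SCALARS | {"array", "object"}:
        # Covers missing types and union types such as ["string", "null"].
        return errors + [f"{path}: 'type' must be one supported type name"]
    if node_type == "object":
        props = schema.get("properties")
        if not isinstance(props, dict) or not props:
            return errors + [f"{path}: 'properties' must be a non-empty mapping"]
        for name, sub in props.items():
            if not name.isidentifier() or keyword.iskeyword(name):
                errors.append(f"{path}: invalid property name '{name}'")
            errors += check_node(sub, f"{path}.{name}")
        for name in schema.get("required", []):
            if name not in props:
                errors.append(f"{path}: required name '{name}' not in properties")
    elif node_type == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            errors += check_node(items, f"{path}[]")
        else:
            errors.append(f"{path}: 'items' must be a single schema")
    if "enum" in schema:
        if node_type not in SCALARS:
            errors.append(f"{path}: 'enum' is allowed on scalar fields only")
        elif not all(_matches(v, node_type) for v in schema["enum"]):
            errors.append(f"{path}: 'enum' values must match type '{node_type}'")
    return errors


def check_schema(schema):
    """Check a full schema, including the rule that the root is an object."""
    errors = check_node(schema)
    if isinstance(schema, dict) and schema.get("type") != "object":
        errors.insert(0, "$: root schema must be an object")
    return errors
```

An empty list from check_schema means no rule from this section was violated. The server remains the authority and may still return 400.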


Schema compliance

NeuroLinker enforces schema compliance in three steps:

  1. The submitted schema is checked as JSON Schema Draft 7 and against the supported subset.
  2. The validated schema is converted at runtime into a structured response model. Required fields become required model fields; optional fields may return null.
  3. Extraction is retried automatically when the model response does not match the expected structure.

Completed field-extraction payloads include every top-level property declared in the submitted schema. If no successful batch finds a non-empty value for a property, that property is returned as null.
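
The null-filling rule can be illustrated with a short sketch. `complete_payload` is a hypothetical name, and the definition of "empty" used here (null, a blank string, or an empty list or object) is an assumption, since this page does not spell it out.

```python
def _is_empty(value) -> bool:
    """Assumed notion of empty: null, blank string, or empty list/object."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    # Zero and False count as real values.
    return False


def complete_payload(schema: dict, payload: dict) -> dict:
    """Return one entry per top-level schema property, using None for gaps."""
    return {
        name: (None if _is_empty(payload.get(name)) else payload[name])
        for name in schema.get("properties", {})
    }
```

Keys in the payload that the schema does not declare are dropped by this sketch. Whether the service does the same is not stated here.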


Multi-page merge logic

Documents are processed in page batches. Each batch returns a JSON object, and NeuroLinker then merges the batch results into one payload per document.

| Schema field type | Merge behavior |
| --- | --- |
| array | Keeps all distinct non-null values in order of appearance |
| Any non-array type | Keeps the most frequent non-null value; ties use the earliest batch |

Values are canonicalized before comparison:

  • strings are trimmed;
  • objects are compared using canonical JSON with sorted keys;
  • no lowercasing, separator removal, or fuzzy matching is applied.

This conservative behavior avoids changing identifiers such as SKUs, IBANs, policy numbers, or case-sensitive codes.
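
The merge rules above can be sketched as follows. This is an illustration of the described behavior, not NeuroLinker's code: `merge_batches` is a hypothetical name, trimming is applied only to top-level string values and string array items, and a property with no surviving value is returned as None.

```python
import json


def _canonical(value):
    """Trim strings; leave every other value untouched (no case folding)."""
    return value.strip() if isinstance(value, str) else value


def _key(value) -> str:
    """Comparison key: canonical JSON with sorted object keys."""
    return json.dumps(value, sort_keys=True)


def merge_batches(schema: dict, batches: list) -> dict:
    """Merge per-batch JSON objects into one payload per document."""
    merged = {}
    for name, prop in schema["properties"].items():
        values = [b.get(name) for b in batches if isinstance(b, dict)]
        if prop.get("type") == "array":
            seen = {}  # comparison key -> first item seen, in order of appearance
            for batch_value in values:
                if not isinstance(batch_value, list):
                    continue
                for item in batch_value:
                    if item is not None:
                        item = _canonical(item)
                        seen.setdefault(_key(item), item)
            merged[name] = list(seen.values()) or None
        else:
            counts = {}  # comparison key -> [count, value], in batch order
            for value in values:
                if value is not None:
                    value = _canonical(value)
                    counts.setdefault(_key(value), [0, value])[0] += 1
            # max() returns the first maximum, so ties go to the earliest batch.
            merged[name] = max(counts.values(), key=lambda e: e[0])[1] if counts else None
    return merged
```

For example, if three batches report invoice numbers " INV-1 ", "INV-2", and "INV-1", the merged value is "INV-1": trimming makes the first and third values equal, so that value occurs twice.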


Storage output

For each document, field extraction writes:

{user_uid}/{request_uid}/{document_uid}/
|-- {file_name}.pdf
|-- page_1.png
|-- page_2.png
|-- ...
|-- page_N.png
`-- {file_name}_fields.json

The canonical result is {file_name}_fields.json. The document Firestore metadata stores schema_used and results.result_storage_path; the extracted payload itself is stored in Firebase Storage, not duplicated inline in Firestore.


Retrieving field results

Use POST /api/v1/documents/fields after the document reaches the completed status. The request body lists the document IDs to retrieve:

{
  "document_ids": ["8b3f-..."]
}

Response:

{
  "success": true,
  "results": [
    {
      "document_id": "8b3f-...",
      "format": "fields",
      "content": {
        "invoice_number": "INV-2026-0412",
        "total_amount": 1284.5,
        "due_date": null,
        "line_items": [
          { "description": "Widget A", "quantity": 3, "unit_price": 120.0 },
          { "description": "Service B", "quantity": 1, "unit_price": 50.0 }
        ]
      },
      "schema_used": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" }
        }
      }
    }
  ],
  "total": 1,
  "successful": 1,
  "failed": 0,
  "message": "Retrieved extracted fields for 1/1 documents"
}
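
A response like the one above can be turned into a lookup from document ID to extracted fields. The sketch below assumes only the keys shown in the sample response; `index_field_results` is a hypothetical helper, and the shape of entries for failed documents is not documented here, so entries without content are skipped.

```python
def index_field_results(response: dict) -> dict:
    """Map each document_id to its extracted content."""
    if not response.get("success"):
        raise ValueError(response.get("message", "field retrieval failed"))
    indexed = {}
    for entry in response.get("results", []):
        # Skip entries that lack an ID or extracted content.
        if "document_id" in entry and "content" in entry:
            indexed[entry["document_id"]] = entry["content"]
    return indexed
```

Comparing the size of the returned mapping with the total field of the response is a quick way to notice documents that produced no result.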