Field Extraction

Overview

Field extraction lets you submit PDFs together with a JSON Schema and receive a structured JSON payload that conforms to that schema. Use it when you know the fields you want to extract, such as invoice numbers, totals, policy details, line items, or any repeated list of objects.

This pipeline is different from full extraction:

Pipeline	Endpoint	Output
Full extraction	`POST /api/v1/extract`	Markdown, structured document JSON, page images, image crops, and summaries
Field extraction	`POST /api/v1/extract-fields`	One `{file_name}_fields.json` payload per document, based on your JSON Schema

Field extraction is asynchronous. The submit response returns a request_uid; use Request Status and Document Status to poll progress, then retrieve results with POST /api/v1/documents/fields.

POST /api/v1/generate-schema

Generates a JSON Schema from a natural-language description. This endpoint is synchronous and does not consume credits.

Parameters

Name	In	Type	Description
`description`	body	string	Description of the fields you want to extract

{
  "description": "Extract the invoice number, total amount, due date, and every line item with description, quantity, and unit price."
}

Response

200

{
  "success": true,
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total_amount": { "type": "number" },
      "due_date": { "type": "string" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          },
          "required": ["description", "quantity", "unit_price"]
        }
      }
    },
    "required": ["invoice_number", "total_amount", "line_items"]
  },
  "message": "Schema generated successfully. Review and edit before submitting to /extract-fields."
}

Note

Always review the generated schema before submitting it to /api/v1/extract-fields. The schema controls the output shape and the merge behavior for repeated fields.

Errors

Status	Meaning
`400`	Empty description
`500`	The schema generator could not produce a valid schema

Example "Try it out!"

curl -L 'https://neurolinker.api.ainexxo.com/api/v1/generate-schema' \
-H 'Authorization: Bearer nl_********************************' \
-H 'Content-Type: application/json' \
-d '{
  "description": "Extract invoice number, total amount, due date, and line items with description, quantity, and unit price."
}'
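
The same call can be sketched in Python using only the standard library. The host and headers are taken from the curl example above; the function names `build_generate_schema_request` and `generate_schema` are illustrative helpers, not part of any NeuroLinker SDK.

```python
import json
import urllib.request

BASE_URL = "https://neurolinker.api.ainexxo.com"  # host from the curl example


def build_generate_schema_request(description: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/generate-schema request."""
    if not description.strip():
        # The endpoint answers 400 for an empty description, so fail early.
        raise ValueError("description must not be empty")
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/generate-schema",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate_schema(description: str, api_key: str) -> dict:
    """Send the request and return the `json_schema` object from the response."""
    request = build_generate_schema_request(description, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["json_schema"]
```

Splitting request construction from sending keeps the payload easy to inspect before any network call is made.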

POST /api/v1/extract-fields

Submits PDFs and a JSON Schema for field extraction.

Parameters

Name	Type	Description
`documents`	list of files	Optional binary PDF files uploaded directly to the endpoint
`form`	JSON string	Must contain `json_schema`; may also contain `documents_url`, `alias`, and `description`

At least one of documents or documents_url is required. The json_schema field is always required.

Form parameter

{
  "documents_url": ["https://example.com/invoice.pdf"],
  "alias": "April invoices",
  "description": "Invoice extraction batch",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total_amount": { "type": "number" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          },
          "required": ["description", "quantity", "unit_price"]
        }
      }
    },
    "required": ["invoice_number", "total_amount"]
  }
}

Response

200

{
  "request_uid": "a1c2b3d4-e5f6-7890-abcd-ef1234567890",
  "status": "PENDING",
  "performance": null,
  "msg": null
}

The status value in this response is the Celery task state. For user-facing progress, poll GET /api/v1/request-status/{request_id}, passing the request_uid returned in this response.
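
A polling loop for this step might look like the sketch below. The response shape of the request-status endpoint is not described on this page, so the sketch takes a hypothetical `fetch_status` callable that returns the current status string. The terminal state `completed` appears in this document; `failed` is an assumption.

```python
import time
from typing import Callable, Iterable


def poll_until_done(
    fetch_status: Callable[[], str],
    terminal_states: Iterable[str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until it returns a terminal state or time runs out."""
    terminal = set(terminal_states)
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Hypothetical usage, where get_request_status wraps the request-status call:
# final = poll_until_done(lambda: get_request_status(request_uid),
#                         terminal_states={"completed", "failed"})
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.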

Validation and credits

Before the job is accepted, NeuroLinker:

validates that the schema is valid JSON Schema Draft 7;
validates that the schema uses the supported subset documented below;
counts document pages;
reserves field extraction credits for each processed page.

If validation fails, no credits are reserved and no files are uploaded. If there are not enough credits, the endpoint returns 402.

Errors

Status	Meaning
`400`	Invalid schema, unsupported schema keyword, or missing PDFs/URLs
`402`	Insufficient credits

Example "Try it out! - file upload"

curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents=@"invoice.pdf"' \
-F 'form="{
  \"alias\": \"April invoices\",
  \"json_schema\": {
    \"type\": \"object\",
    \"properties\": {
      \"invoice_number\": { \"type\": \"string\" },
      \"total_amount\": { \"type\": \"number\" }
    },
    \"required\": [\"invoice_number\", \"total_amount\"]
  }
}"'

Example "Try it out! - URL upload"

curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents="[]"' \
-F 'form="{
  \"documents_url\": [\"https://example.com/invoice.pdf\"],
  \"json_schema\": {
    \"type\": \"object\",
    \"properties\": {
      \"invoice_number\": { \"type\": \"string\" },
      \"total_amount\": { \"type\": \"number\" }
    },
    \"required\": [\"invoice_number\", \"total_amount\"]
  }
}"'
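
When building the request from code, serializing the `form` field with a JSON library avoids the manual quote escaping used in the curl examples. The helper below is a minimal sketch; `build_form_field` and its `has_files` flag are illustrative names, and the early check only mirrors the documented rule that documents or documents_url must be present.

```python
import json


def build_form_field(json_schema, documents_url=None, alias=None,
                     description=None, has_files=False):
    """Serialize the `form` multipart field for POST /api/v1/extract-fields."""
    if not has_files and not documents_url:
        # Mirrors the documented rule; the server would answer 400 here.
        raise ValueError("provide uploaded files or at least one documents_url")
    form = {"json_schema": json_schema}
    if documents_url:
        form["documents_url"] = list(documents_url)
    if alias is not None:
        form["alias"] = alias
    if description is not None:
        form["description"] = description
    return json.dumps(form)
```

With the third-party requests library, the returned string would typically be passed as `data={"form": form}` alongside `files=[("documents", ("invoice.pdf", file_handle, "application/pdf"))]`, which produces a multipart body comparable to the file-upload curl example.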

Supported JSON Schema subset

Field extraction accepts a restricted subset of JSON Schema Draft 7.

Accepted keywords

Keyword	Usage
`type`	`string`, `number`, `integer`, `boolean`, `array`, or `object`
`properties`	Required on `object`; must be a non-empty mapping
`required`	List of property names present in `properties`
`items`	Required on `array`; must be a single schema
`enum`	Scalar fields only; values must match the field type
`description`	Optional semantic hint forwarded to extraction

Structural rules

The root schema must be an object with non-empty properties.
Property names must be valid identifiers: no leading digits and no reserved Python keywords.
Union types such as "type": ["string", "null"] are not supported. For optional fields, omit the field from required.
Array items must be a single schema, not a tuple-style list.

Rejected keywords

anyOf, oneOf, allOf, not, $ref, $defs, definitions, pattern, format, minLength, maxLength, minimum, maximum, minItems, maxItems, additionalProperties, default, and const.
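
A local pre-flight check can catch most subset violations before a request is sent. The sketch below is a client-side approximation of the rules listed above, not the server's validator. It uses Python's `str.isidentifier`, which is stricter than "no leading digits" (it also rejects hyphens and spaces); that choice is an assumption.

```python
import keyword

ALLOWED = {"type", "properties", "required", "items", "enum", "description"}
SCALARS = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def _enum_value_matches(value, kind):
    if kind == "boolean":
        return isinstance(value, bool)
    # bool is a subclass of int in Python, so exclude it for other types.
    return not isinstance(value, bool) and isinstance(value, SCALARS[kind])


def check_node(schema, path="$"):
    """Return a list of problems found at this node and below."""
    if not isinstance(schema, dict):
        return [f"{path}: schema must be a JSON object"]
    problems = [f"{path}: unsupported keyword {key!r}"
                for key in schema if key not in ALLOWED]
    kind = schema.get("type")
    if not isinstance(kind, str) or kind not in {*SCALARS, "array", "object"}:
        problems.append(f"{path}: type must be one supported type name")
        return problems
    if kind == "object":
        props = schema.get("properties")
        if not isinstance(props, dict) or not props:
            problems.append(f"{path}: object needs non-empty properties")
            return problems
        for name, sub in props.items():
            if not name.isidentifier() or keyword.iskeyword(name):
                problems.append(f"{path}: invalid property name {name!r}")
            problems += check_node(sub, f"{path}.{name}")
        for name in schema.get("required", []):
            if name not in props:
                problems.append(f"{path}: required name {name!r} not in properties")
    elif kind == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            problems += check_node(items, f"{path}[]")
        else:
            problems.append(f"{path}: array needs a single items schema")
    if "enum" in schema:
        if kind not in SCALARS:
            problems.append(f"{path}: enum is allowed on scalar fields only")
        else:
            problems += [f"{path}: enum value {v!r} is not a {kind}"
                         for v in schema["enum"] if not _enum_value_matches(v, kind)]
    return problems


def check_schema(schema):
    """Check a whole schema, including the root-must-be-object rule."""
    problems = check_node(schema)
    if isinstance(schema, dict) and schema.get("type") != "object":
        problems.insert(0, "$: root schema must be an object")
    return problems
```

An empty list means no problem was found locally; the endpoint remains the authority and may still reject the schema with 400.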

Schema compliance

NeuroLinker enforces schema compliance in three steps:

The submitted schema is checked as JSON Schema Draft 7 and against the supported subset.
The validated schema is converted at runtime into a structured response model. Required fields become required model fields; optional fields may return null.
Extraction is retried automatically when the model response does not match the expected structure.

Completed field-extraction payloads include every top-level property declared in the submitted schema. If no successful batch finds a non-empty value for a property, that property is returned as null.
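
To illustrate how required and optional fields could map onto a response model, the sketch below builds standard-library dataclasses from a schema. This page does not say which modeling library NeuroLinker uses, so dataclasses are a stand-in, and `build_model` and `python_type` are illustrative names. Dataclasses do not validate value types at runtime; the sketch only shows the required-versus-optional distinction.

```python
from dataclasses import field, make_dataclass
from typing import List, Optional

SCALARS = {"string": str, "number": float, "integer": int, "boolean": bool}


def python_type(schema: dict, name: str):
    """Map one schema node to a Python type annotation."""
    kind = schema["type"]
    if kind in SCALARS:
        return SCALARS[kind]
    if kind == "array":
        return List[python_type(schema["items"], name + "Item")]
    return build_model(schema, name)  # nested object


def build_model(schema: dict, name: str = "Fields"):
    """Create a dataclass whose required fields have no default."""
    required = set(schema.get("required", []))
    required_fields, optional_fields = [], []
    for prop, sub in schema["properties"].items():
        annotation = python_type(sub, name + "_" + prop)
        if prop in required:
            required_fields.append((prop, annotation))
        else:
            # Optional fields default to None, which serializes as null.
            optional_fields.append((prop, Optional[annotation], field(default=None)))
    # Dataclasses need fields without defaults to come first.
    return make_dataclass(name, required_fields + optional_fields)
```

Constructing the model without a required field raises `TypeError`, while omitted optional fields come back as `None`.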

Multi-page merge logic

Documents are processed in page batches. Each batch returns a JSON object; NeuroLinker then merges the batch results into one payload per document.

Schema field type	Merge behavior
`array`	Keeps all distinct non-null values in order of appearance
Any non-array type	Keeps the most frequent non-null value; ties use the earliest batch

Values are canonicalized before comparison:

strings are trimmed;
objects are compared using canonical JSON with sorted keys;
no lowercasing, separator removal, or fuzzy matching is applied.

This conservative behavior avoids changing identifiers such as SKUs, IBANs, policy numbers, or case-sensitive codes.
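
The merge rules can be approximated in a few lines of Python. This is a sketch of the documented behavior, not NeuroLinker's code, and it makes several assumptions: only top-level strings and string array elements are trimmed, the trimmed value is the one kept, array elements are deduplicated across batches, and a property with no surviving value becomes null.

```python
import json
from collections import Counter


def normalize(value):
    """Trim strings; leave every other value unchanged."""
    return value.strip() if isinstance(value, str) else value


def canonical(value):
    """Comparison key: canonical JSON with sorted keys, no case folding."""
    return json.dumps(value, sort_keys=True)


def merge_batches(schema: dict, batches: list) -> dict:
    """Merge per-batch JSON objects into one payload per document."""
    merged = {}
    for name, prop_schema in schema["properties"].items():
        # Batches are visited in order, so "earliest" means lowest index.
        values = [normalize(b[name]) for b in batches if b.get(name) is not None]
        if prop_schema["type"] == "array":
            seen, items = set(), []
            for batch_items in values:
                for item in batch_items:
                    if item is None:
                        continue
                    item = normalize(item)
                    key = canonical(item)
                    if key not in seen:  # keep the first appearance only
                        seen.add(key)
                        items.append(item)
            merged[name] = items or None
        elif values:
            counts = Counter(canonical(v) for v in values)
            best = max(counts.values())
            # The first value reaching the top count wins, so ties
            # resolve to the earliest batch.
            merged[name] = next(v for v in values if counts[canonical(v)] == best)
        else:
            merged[name] = None
    return merged
```

Because comparison uses exact canonical JSON, values such as "INV-1" and "inv-1" stay distinct, matching the conservative behavior described above.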

Storage output

For each document, field extraction writes:

{user_uid}/{request_uid}/{document_uid}/
|-- {file_name}.pdf
|-- page_1.png
|-- page_2.png
|-- ...
|-- page_N.png
`-- {file_name}_fields.json

The canonical result is {file_name}_fields.json. The document's Firestore metadata stores schema_used and results.result_storage_path; the extracted payload itself is stored in Firebase Storage and is not duplicated inline in Firestore.

Retrieving field results

Use POST /api/v1/documents/fields after the document reaches the completed status.

{
  "document_ids": ["8b3f-..."]
}

Response:

{
  "success": true,
  "results": [
    {
      "document_id": "8b3f-...",
      "format": "fields",
      "content": {
        "invoice_number": "INV-2026-0412",
        "total_amount": 1284.5,
        "due_date": null,
        "line_items": [
          { "description": "Widget A", "quantity": 3, "unit_price": 120.0 },
          { "description": "Service B", "quantity": 1, "unit_price": 50.0 }
        ]
      },
      "schema_used": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" }
        }
      }
    }
  ],
  "total": 1,
  "successful": 1,
  "failed": 0,
  "message": "Retrieved extracted fields for 1/1 documents"
}
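
Once the response is parsed, the payloads can be indexed by document for downstream use. The helper below is a small sketch based only on the response shape shown above; `fields_by_document` is an illustrative name, and the shape of entries for failed documents is not documented here, so entries without `"format": "fields"` are simply skipped.

```python
def fields_by_document(response: dict) -> dict:
    """Map each document_id to its extracted `content` payload."""
    if not response.get("success"):
        raise RuntimeError(response.get("message", "field retrieval failed"))
    return {
        result["document_id"]: result["content"]
        for result in response.get("results", [])
        if result.get("format") == "fields"
    }
```

Properties that no batch could fill arrive as `None`, so callers should check for it before using a value such as `due_date`.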