Field Extraction
Overview
Field extraction lets you submit PDFs together with a JSON Schema and receive a structured JSON payload that conforms to that schema. Use it when you know the fields you want to extract, such as invoice numbers, totals, policy details, line items, or any repeated list of objects.
This pipeline is different from full extraction:
| Pipeline | Endpoint | Output |
|---|---|---|
| Full extraction | POST /api/v1/extract | Markdown, structured document JSON, page images, image crops, and summaries |
| Field extraction | POST /api/v1/extract-fields | One {file_name}_fields.json payload per document, based on your JSON Schema |
Field extraction is asynchronous. The submit response returns a request_uid; use Request Status and Document Status to poll progress, then retrieve results with POST /api/v1/documents/fields.
POST /api/v1/generate-schema
Generates a JSON Schema from a natural-language description. This endpoint is synchronous and does not consume credits.
Parameters
| Name | In | Type | Description |
|---|---|---|---|
| description | body | string | Description of the fields you want to extract |
{
"description": "Extract the invoice number, total amount, due date, and every line item with description, quantity, and unit price."
}
Response
200
{
"success": true,
"json_schema": {
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"due_date": { "type": "string" },
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "integer" },
"unit_price": { "type": "number" }
},
"required": ["description", "quantity", "unit_price"]
}
}
},
"required": ["invoice_number", "total_amount", "line_items"]
},
"message": "Schema generated successfully. Review and edit before submitting to /extract-fields."
}
Note
Always review the generated schema before submitting it to /api/v1/extract-fields. The schema controls the output shape and the merge behavior for repeated fields.
Errors
| Status | Meaning |
|---|---|
| 400 | Empty description |
| 500 | The schema generator could not produce a valid schema |
Example "Try it out!"
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/generate-schema' \
-H 'Authorization: Bearer nl_********************************' \
-H 'Content-Type: application/json' \
-d '{
"description": "Extract invoice number, total amount, due date, and line items with description, quantity, and unit price."
}'
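The same call can be prepared from Python. This is a minimal sketch using only the standard library: the host and path come from the curl example above, while the helper name `build_generate_schema_request`, the placeholder key, and the choice to treat a whitespace-only description as empty are assumptions made for illustration.

```python
import json
import urllib.request

BASE_URL = "https://neurolinker.api.ainexxo.com"  # host taken from the curl example


def build_generate_schema_request(description: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/generate-schema request."""
    if not description.strip():
        # The endpoint answers 400 for an empty description, so fail early.
        # Treating whitespace-only input as empty is an assumption of this sketch.
        raise ValueError("description must not be empty")
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/generate-schema",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_generate_schema_request(
    "Extract invoice number, total amount, due date, and line items.",
    api_key="nl_your_key_here",  # placeholder, not a real key
)
```

Sending it is a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON body; the `json_schema` member of the response is the part to review and edit.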
POST /api/v1/extract-fields
Submits PDFs and a JSON Schema for field extraction.
Parameters
| Name | Type | Description |
|---|---|---|
| documents | list of files | Optional binary PDF files uploaded directly to the endpoint |
| form | JSON string | Must contain json_schema; may also contain documents_url, alias, and description |
At least one of documents or documents_url is required. The json_schema field is always required.
Form parameter
{
"documents_url": ["https://example.com/invoice.pdf"],
"alias": "April invoices",
"description": "Invoice extraction batch",
"json_schema": {
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "integer" },
"unit_price": { "type": "number" }
},
"required": ["description", "quantity", "unit_price"]
}
}
},
"required": ["invoice_number", "total_amount"]
}
}
Response
200
{
"request_uid": "a1c2b3d4-e5f6-7890-abcd-ef1234567890",
"status": "PENDING",
"performance": null,
"msg": null
}
The status value in this response is the Celery task state. For user-facing progress, poll GET /api/v1/request-status/{request_id}.
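Polling can be wrapped in a small helper. The sketch below is deliberately transport-agnostic because the body returned by the status endpoint is not described in this section: `fetch_status` is a hypothetical caller-supplied function that performs the GET and returns a status string. Of the terminal states, only `completed` appears in this documentation; `failed` and the intermediate names in the example are assumptions.

```python
import time


def wait_until_done(
    fetch_status,
    terminal=("completed", "failed"),  # "failed" is an assumed state name
    interval=2.0,
    timeout=300.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call fetch_status() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(interval)


# Example with a canned sequence instead of real HTTP calls.
canned = iter(["pending", "processing", "completed"])
final_status = wait_until_done(lambda: next(canned), sleep=lambda seconds: None)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting; in normal use the defaults are enough.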
Validation and credits
Before the job is accepted, NeuroLinker:
- validates that the schema is valid JSON Schema Draft 7;
- validates that the schema uses the supported subset documented below;
- counts document pages;
- reserves field extraction credits for each processed page.
If validation fails, no credits are reserved and no files are uploaded. If there are not enough credits, the endpoint returns 402.
Errors
| Status | Meaning |
|---|---|
| 400 | Invalid schema, unsupported schema keyword, or missing PDFs/URLs |
| 402 | Insufficient credits |
Example "Try it out! - file upload"
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents=@"invoice.pdf"' \
-F 'form="{
\"alias\": \"April invoices\",
\"json_schema\": {
\"type\": \"object\",
\"properties\": {
\"invoice_number\": { \"type\": \"string\" },
\"total_amount\": { \"type\": \"number\" }
},
\"required\": [\"invoice_number\", \"total_amount\"]
}
}"'
Example "Try it out! - URL upload"
curl -L 'https://neurolinker.api.ainexxo.com/api/v1/extract-fields' \
-H 'Authorization: Bearer nl_********************************' \
-F 'documents="[]"' \
-F 'form="{
\"documents_url\": [\"https://example.com/invoice.pdf\"],
\"json_schema\": {
\"type\": \"object\",
\"properties\": {
\"invoice_number\": { \"type\": \"string\" },
\"total_amount\": { \"type\": \"number\" }
},
\"required\": [\"invoice_number\", \"total_amount\"]
}
}"'
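Escaping the form JSON by hand, as in the curl examples, is error-prone. A minimal sketch of building the same `form` value in Python follows; the field names come from the parameter table above, while the helper name `build_form_field` is hypothetical.

```python
import json


def build_form_field(json_schema, documents_url=None, alias=None, description=None):
    """Serialize the `form` multipart field as a JSON string."""
    form = {"json_schema": json_schema}  # always required
    if documents_url:
        form["documents_url"] = list(documents_url)
    if alias is not None:
        form["alias"] = alias
    if description is not None:
        form["description"] = description
    return json.dumps(form)


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

form_value = build_form_field(
    schema,
    documents_url=["https://example.com/invoice.pdf"],
    alias="April invoices",
)
```

The resulting string is what goes into the `form` part of the multipart request, next to any `documents` file parts, with whichever HTTP client you use.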
Supported JSON Schema subset
Field extraction accepts a restricted subset of JSON Schema Draft 7.
Accepted keywords
| Keyword | Usage |
|---|---|
| type | string, number, integer, boolean, array, or object |
| properties | Required on object; must be a non-empty mapping |
| required | List of property names present in properties |
| items | Required on array; must be a single schema |
| enum | Scalar fields only; values must match the field type |
| description | Optional semantic hint forwarded to extraction |
Structural rules
- The root schema must be an object with non-empty properties.
- Property names must be valid identifiers: no leading digits and no reserved Python keywords.
- Union types such as "type": ["string", "null"] are not supported. For optional fields, omit the field from required.
- Array items must be a single schema, not a tuple-style list.
Rejected keywords
anyOf, oneOf, allOf, not, $ref, $defs, definitions, pattern, format, minLength, maxLength, minimum, maximum, minItems, maxItems, additionalProperties, default, and const.
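A schema can be checked locally against these rules before it is submitted. The sketch below is a client-side approximation, not the service's validator: it flags any keyword outside the accepted list, approximates the identifier rule with `str.isidentifier()` plus `keyword.iskeyword()`, and does not check that enum values match the field type. The function name `check_subset` is hypothetical.

```python
import keyword

ACCEPTED_KEYWORDS = {"type", "properties", "required", "items", "enum", "description"}
SCALAR_TYPES = {"string", "number", "integer", "boolean"}
ALL_TYPES = SCALAR_TYPES | {"array", "object"}


def check_subset(schema, path="$", is_root=True):
    """Return a list of problems found; an empty list means none were found."""
    if not isinstance(schema, dict):
        return [f"{path}: schema must be a JSON object"]
    problems = [
        f"{path}: unsupported keyword '{key}'"
        for key in schema
        if key not in ACCEPTED_KEYWORDS
    ]
    schema_type = schema.get("type")
    if not isinstance(schema_type, str) or schema_type not in ALL_TYPES:
        # Covers missing types, unknown types, and unions such as ["string", "null"].
        problems.append(f"{path}: 'type' must be one supported type name")
        return problems
    if is_root and schema_type != "object":
        problems.append(f"{path}: the root schema must be an object")
    if "enum" in schema and schema_type not in SCALAR_TYPES:
        problems.append(f"{path}: 'enum' is only allowed on scalar fields")
    if schema_type == "object":
        properties = schema.get("properties")
        if not isinstance(properties, dict) or not properties:
            problems.append(f"{path}: 'properties' must be a non-empty mapping")
            return problems
        for name in schema.get("required", []):
            if name not in properties:
                problems.append(f"{path}: required name '{name}' is not in 'properties'")
        for name, subschema in properties.items():
            if not name.isidentifier() or keyword.iskeyword(name):
                problems.append(f"{path}.{name}: property name is not a valid identifier")
            problems += check_subset(subschema, f"{path}.{name}", is_root=False)
    elif schema_type == "array":
        items = schema.get("items")
        if not isinstance(items, dict):
            problems.append(f"{path}: 'items' must be a single schema")
        else:
            problems += check_subset(items, f"{path}[]", is_root=False)
    return problems


rejected = {
    "type": "object",
    "properties": {
        "class": {"type": ["string", "null"]},          # keyword name and union type
        "total": {"type": "number", "minimum": 0},      # rejected keyword
    },
}
for problem in check_subset(rejected):
    print(problem)
```

A clean local result does not guarantee acceptance by the endpoint, which also validates the schema as JSON Schema Draft 7.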
Schema compliance
NeuroLinker enforces schema extraction in three steps:
- The submitted schema is checked as JSON Schema Draft 7 and against the supported subset.
- The validated schema is converted at runtime into a structured response model. Required fields become required model fields; optional fields may return null.
- Extraction is retried automatically when the model response does not match the expected structure.
Completed field-extraction payloads include every top-level property declared in the submitted schema. If no successful batch finds a non-empty value for a property, that property is returned as null.
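Because a property that was never found comes back as null rather than being omitted, consumers may want an explicit check for the fields they cannot do without. The sketch below is a defensive client-side check; the helper name `null_required_fields` is hypothetical, and whether a field listed in required can actually come back as null is not stated in this section.

```python
def null_required_fields(content: dict, schema: dict) -> list:
    """Names in the schema's top-level `required` list whose value is null or absent."""
    return [name for name in schema.get("required", []) if content.get(name) is None]


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

# due_date is null but optional, so nothing is reported for this payload.
complete = {"invoice_number": "INV-2026-0412", "total_amount": 1284.5, "due_date": None}
incomplete = {"invoice_number": "INV-2026-0412", "total_amount": None, "due_date": None}

missing_in_complete = null_required_fields(complete, schema)
missing_in_incomplete = null_required_fields(incomplete, schema)
```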
Multi-page merge logic
Documents are processed in page batches. Each batch returns a JSON object, then NeuroLinker merges the batch results into one payload per document.
| Schema field type | Merge behavior |
|---|---|
| array | Keeps all distinct non-null values in order of appearance |
| Any non-array type | Keeps the most frequent non-null value; ties use the earliest batch |
Values are canonicalized before comparison:
- strings are trimmed;
- objects are compared using canonical JSON with sorted keys;
- no lowercasing, separator removal, or fuzzy matching is applied.
This conservative behavior avoids changing identifiers such as SKUs, IBANs, policy numbers, or case-sensitive codes.
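The merge rules can be illustrated with a short sketch. It is a reading of the table and canonicalization rules above, not NeuroLinker's implementation. Assumptions: only top-level string values are trimmed, the first-seen (trimmed) value is the one kept, an array with no surviving items is returned as null, and all function names are hypothetical.

```python
import json


def canonical(value):
    """Comparison key: trimmed strings, sorted-key JSON for everything else."""
    if isinstance(value, str):
        value = value.strip()
    return json.dumps(value, sort_keys=True)


def merge_array(batch_values):
    """Keep distinct non-null items in order of appearance across batches."""
    seen, kept = set(), []
    for batch_list in batch_values:
        for item in batch_list:
            key = canonical(item)
            if item is not None and key not in seen:
                seen.add(key)
                kept.append(item)
    return kept or None


def merge_scalar(batch_values):
    """Keep the most frequent value; ties go to the earliest batch."""
    counts, first_seen = {}, {}
    for value in batch_values:
        key = canonical(value)
        counts[key] = counts.get(key, 0) + 1
        first_seen.setdefault(key, value.strip() if isinstance(value, str) else value)
    if not counts:
        return None
    winner = max(counts, key=counts.get)  # max() keeps the first key on ties
    return first_seen[winner]


def merge_batches(batches, schema):
    """Merge per-batch objects into one payload with every declared property."""
    merged = {}
    for name, prop in schema["properties"].items():
        values = [b[name] for b in batches if b.get(name) is not None]
        merge = merge_array if prop.get("type") == "array" else merge_scalar
        merged[name] = merge(values)
    return merged


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string"},
        "line_items": {"type": "array", "items": {"type": "object"}},
    },
}
batches = [
    {"invoice_number": " INV-1 ", "total_amount": 10.0,
     "line_items": [{"description": "A", "quantity": 1}]},
    {"invoice_number": "INV-2", "total_amount": None,
     "line_items": [{"quantity": 1, "description": "A"}, {"description": "B", "quantity": 2}]},
    {"invoice_number": "INV-2", "line_items": []},
]
merged = merge_batches(batches, schema)
```

In this example the second batch's first line item differs from the first batch's only in key order, so it counts as a duplicate, and `due_date` is null because no batch reported it.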
Storage output
For each document, field extraction writes:
{user_uid}/{request_uid}/{document_uid}/
|-- {file_name}.pdf
|-- page_1.png
|-- page_2.png
|-- ...
|-- page_N.png
`-- {file_name}_fields.json
The canonical result is {file_name}_fields.json. The document Firestore metadata stores schema_used and results.result_storage_path; the extracted payload itself is stored in Firebase Storage, not duplicated inline in Firestore.
Retrieving field results
Use POST /api/v1/documents/fields after the document reaches completed.
Response:
{
"success": true,
"results": [
{
"document_id": "8b3f-...",
"format": "fields",
"content": {
"invoice_number": "INV-2026-0412",
"total_amount": 1284.5,
"due_date": null,
"line_items": [
{ "description": "Widget A", "quantity": 3, "unit_price": 120.0 },
{ "description": "Service B", "quantity": 1, "unit_price": 50.0 }
]
},
"schema_used": {
"type": "object",
"properties": {
"invoice_number": { "type": "string" }
}
}
}
],
"total": 1,
"successful": 1,
"failed": 0,
"message": "Retrieved extracted fields for 1/1 documents"
}
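A response of this shape can be reduced to a mapping from document ID to extracted fields. The sketch below relies only on the members shown in the sample above; the helper name `fields_by_document` is hypothetical, and the shape of entries for failed documents is not documented here, so entries without a `content` object are skipped.

```python
def fields_by_document(response: dict) -> dict:
    """Map each document_id to its extracted `content` object."""
    if not response.get("success"):
        raise RuntimeError(response.get("message", "field retrieval failed"))
    return {
        result["document_id"]: result["content"]
        for result in response.get("results", [])
        if result.get("format") == "fields" and isinstance(result.get("content"), dict)
    }


# Abbreviated copy of the sample response above.
sample = {
    "success": True,
    "results": [
        {
            "document_id": "8b3f-...",
            "format": "fields",
            "content": {
                "invoice_number": "INV-2026-0412",
                "total_amount": 1284.5,
                "due_date": None,
            },
        }
    ],
    "total": 1,
    "successful": 1,
    "failed": 0,
}

extracted = fields_by_document(sample)
invoice_number = extracted["8b3f-..."]["invoice_number"]
```

Comparing `successful` and `failed` with `total` before using the mapping is a cheap way to notice documents that returned no fields.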