Text Extraction

Extracts text from a file.

The Text Extraction API uses HP KeyView to extract metadata and text content from a file that you provide. The API can handle over 500 different file formats (for more information, see Supported Formats).

Quick Start

You must provide an input file, which you can specify as a file, a URL, or an object reference. To create an object reference, store a file by using the Store Object API, and then pass the reference that it returns to this API. The following example submits a file:

POST /1/api/[sync|async]/extracttext/v1?file=mydoc.doc

By default, the Text Extraction API extracts all the metadata that it can from the file, as well as the text from the main content of the file. The response has the following form:

{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": [
        "Microsoft Office Word"
      ],
      "author": [
        "janesmith"
      ],
... other metadata fields ...
      "content": "This is the content of my document..."
    }
  ]
}
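As a minimal sketch of how a client might read the response above, the following Python separates the document text from the metadata fields. The function name summarize_documents and the trimmed sample payload are illustrative assumptions, not part of the API; the sketch only relies on the reference, doc_iod_reference, and content fields shown in the example.

```python
import json

# Sample payload shaped like the response above (elided metadata fields omitted).
SAMPLE_RESPONSE = """
{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": ["Microsoft Office Word"],
      "author": ["janesmith"],
      "content": "This is the content of my document..."
    }
  ]
}
"""

# Fields that identify the document or hold its text rather than metadata.
NON_METADATA_FIELDS = ("reference", "doc_iod_reference", "content")


def summarize_documents(response_text):
    """Return a list of (reference, metadata dict, content) tuples."""
    payload = json.loads(response_text)
    summaries = []
    for doc in payload.get("document", []):
        # Everything other than the identifying fields and the text is treated as metadata.
        metadata = {k: v for k, v in doc.items() if k not in NON_METADATA_FIELDS}
        # content can be absent, for example when text extraction is disabled.
        summaries.append((doc["reference"], metadata, doc.get("content", "")))
    return summaries


for reference, metadata, content in summarize_documents(SAMPLE_RESPONSE):
    print(reference, sorted(metadata), content)
```

Metadata values arrive as arrays in the example, so the sketch leaves them as lists rather than flattening them.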

If the file is password-protected, you can send the password in the password parameter. For example:

POST /1/api/[sync|async]/extracttext/v1?file=mydoc.doc&password=myfilepassword

You can also specify URLs as input. In this case, IDOL OnDemand retrieves the file and extracts the text. For example:

POST /1/api/[sync|async]/extracttext/v1?url=http://mysite.com/mydoc.doc
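When you build such a request in code, the document URL should be percent-encoded so that it survives as a single query parameter value. The following Python is a sketch under stated assumptions: the helper name build_extract_url and the placeholder key YOUR_API_KEY are hypothetical, while the host, path, and the url and apikey parameter names come from this page.

```python
from urllib.parse import urlencode

# Base address taken from the endpoint URLs listed on this page.
BASE = "https://api.idolondemand.com/1/api"


def build_extract_url(mode, params):
    """Build a Text Extraction request URL for the sync or async endpoint."""
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    # urlencode percent-encodes reserved characters such as ':' and '/' in values.
    return f"{BASE}/{mode}/extracttext/v1?{urlencode(params)}"


example = build_extract_url(
    "sync",
    {"url": "http://mysite.com/mydoc.doc", "apikey": "YOUR_API_KEY"},
)
print(example)
```

The sketch only constructs the URL; sending it as a POST request is left to whichever HTTP client you use.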

You can disable text extraction and metadata extraction individually, to return only the metadata or only the text. For example:

POST /1/api/[sync|async]/extracttext/v1?file=mydoc.doc&extract_text=false

POST /1/api/[sync|async]/extracttext/v1?file=mydoc.doc&extract_metadata=false
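Client code often holds these switches as native booleans, which need to be sent as the lowercase strings true and false. A small Python sketch follows; the function name extraction_query is an assumption for illustration, and the parameter names extract_text and extract_metadata are the ones listed in the Parameters table.

```python
from urllib.parse import urlencode


def extraction_query(extract_text=True, extract_metadata=True):
    """Return the query-string fragment for the two extraction switches."""
    params = {
        # str(True).lower() gives "true", matching the values used in the requests above.
        "extract_text": str(extract_text).lower(),
        "extract_metadata": str(extract_metadata).lower(),
    }
    return urlencode(params)


# Metadata only: turn text extraction off.
print(extraction_query(extract_text=False))
# Text only: turn metadata extraction off.
print(extraction_query(extract_metadata=False))
```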

You can also use the Text Extraction API to extract the contents of files that you have retrieved from a container file by using the Expand Container API. Other APIs use the Text Extraction API to extract content from files for further analysis, so you can send a file directly to those APIs. For example:

POST /1/api/[sync|async]/detectsentiment/v1?file=mydoc.doc

Synchronous
https://api.idolondemand.com/1/api/sync/extracttext/v1

Asynchronous
https://api.idolondemand.com/1/api/async/extracttext/v1

Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter   Description
apikey      The API key to use to authenticate the API request.

Input Source

This API accepts a single input source that can be supplied using one of the following parameters:

Parameter   Description
file        A file containing the document to process. Multipart POST only.
reference   An IDOL OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding document is passed to the API.
url         A publicly accessible HTTP URL from which the document can be retrieved.
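Because the file parameter is accepted only in a multipart POST, a client has to encode the request body as multipart/form-data. The sketch below does this with the Python standard library and builds the request without sending it. The helper names encode_multipart and build_upload_request are hypothetical, and sending apikey as a form field alongside the file is an assumption rather than something this page confirms.

```python
import urllib.request
import uuid

SYNC_ENDPOINT = "https://api.idolondemand.com/1/api/sync/extracttext/v1"


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one file part named 'file' as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part carries the raw bytes of the document to process.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f'\r\n--{boundary}--\r\n'.encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(api_key, filename, file_bytes):
    """Return an unsent POST request; pass it to urllib.request.urlopen to send it."""
    # Assumption: the API key is accepted as a form field named apikey.
    body, content_type = encode_multipart({"apikey": api_key}, filename, file_bytes)
    return urllib.request.Request(
        SYNC_ENDPOINT,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_upload_request("YOUR_API_KEY", "mydoc.doc", b"example file bytes")
print(request.get_method(), request.full_url, len(request.data))
```

A random boundary is generated for each request so that it is unlikely to collide with the file contents.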
Parameters

In addition to the above input source, this API accepts the following parameters:

Name                    Type      Description
extract_text            boolean   Whether to extract text from the file. Default value: true
extract_metadata        boolean   Whether to extract metadata from the file. Default value: true
extract_xmlattributes   boolean   Whether to extract XML attributes from the file. Default value: false
additional_metadata     array     A JSON object containing additional metadata to add to the extracted documents. This option does not apply to JSON input. To add metadata for multiple files, specify objects in order, separated by an empty object.
reference_prefix        array     A string to add to the start of the reference of documents that are extracted from a file. This option does not apply to JSON input. To add a prefix for multiple files, specify prefixes in order, separated by a space.
password                array     Passwords to use to extract the files.

All parameters for this API are optional.
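Because additional_metadata takes a JSON object, its value has to be serialized and percent-encoded before it goes into the query string. The following Python sketch covers the single-file case only. The function name optional_params and the sample values are illustrative assumptions, and how the API expects multiple array values to be transmitted is not shown here.

```python
import json
from urllib.parse import parse_qs, urlencode


def optional_params(additional_metadata=None, reference_prefix=None, password=None):
    """Encode the optional parameters for one file, skipping any that are not set."""
    params = {}
    if additional_metadata is not None:
        # Serialize the metadata object compactly; urlencode then escapes the JSON punctuation.
        params["additional_metadata"] = json.dumps(additional_metadata, separators=(",", ":"))
    if reference_prefix is not None:
        params["reference_prefix"] = reference_prefix
    if password is not None:
        params["password"] = password
    return urlencode(params)


query = optional_params({"department": "legal"}, reference_prefix="batch1_")
print(query)
# Decoding the query string recovers the original JSON object.
print(json.loads(parse_qs(query)["additional_metadata"][0]))
```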

This API returns a JSON response that is described by the model below. The model is presented both as an easy-to-read abstract definition and as a formal JSON schema.

Model

This is an abstract definition of the response that describes each of the properties that might be returned.

Text Extraction Response {
    document (array[object])
}

Model Schema

This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "type": "object",
    "properties": {
        "document": {
            "type": "array",
            "items": {
                "type": "object",
                "anyOf": [
                    {
                        "type": "object",
                        "properties": {
                            "name": {
                                "type": "string"
                            },
                            "reference": {
                                "type": "string"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "content-type": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "document_attributes": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "keyview_class": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "original_size": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "keyview_type": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "import_original_encoding": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "content": {
                                "type": "string"
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference"
                        ]
                    },
                    {
                        "type": "object",
                        "properties": {
                            "reference": {
                                "type": "string"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "error": {
                                "type": "object",
                                "properties": {
                                    "error": {
                                        "type": "integer"
                                    },
                                    "reason": {
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "error",
                                    "reason"
                                ]
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference",
                            "error"
                        ]
                    }
                ]
            }
        }
    },
    "required": [
        "document"
    ]
}
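The schema allows each entry in the document array to take one of two shapes: an extracted document, or an entry with a required error object that holds an integer error code and a reason string. As a sketch of how a client might separate the two, the following Python checks for the error property. The function name partition_documents and the sample values, including the error code and reason text, are made up for illustration.

```python
def partition_documents(response):
    """Split response['document'] into extracted documents and failures."""
    extracted, failed = [], []
    for doc in response["document"]:
        if "error" in doc:
            # Error entries carry an integer code and a human-readable reason.
            error = doc["error"]
            failed.append((doc["reference"], error["error"], error["reason"]))
        else:
            extracted.append(doc)
    return extracted, failed


# Illustrative payload: one successful entry and one failed entry.
sample = {
    "document": [
        {"reference": "root", "doc_iod_reference": "abc", "content": "Some text"},
        {
            "reference": "root:1",
            "doc_iod_reference": "def",
            "error": {"error": 1, "reason": "example failure"},
        },
    ]
}

ok_docs, failures = partition_documents(sample)
print(len(ok_docs), failures)
```

Checking for the error property is enough here because the schema requires it in the second shape and does not define it in the first.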

To try this API with your own data and use it in your own applications, you need an API Key. You can create an API Key from your account page - API Keys.
