Text Extraction

Extracts text from a file.

The Text Extraction API uses HP KeyView to extract metadata and text content from a file that you provide. The API can handle over 500 different file formats (for more information, see Supported Formats).

Quick Start

You must provide an input file, which you can specify as a file, a URL, or an object reference. To create an object reference, store a file by using the Store Object API, and then pass the reference that it returns to this API. The following example submits a file:

POST /1/api/[sync|async]/extracttext/v1 file=mydoc.doc

By default, the Text Extraction API extracts all the metadata it can from the file, along with the text from the file's main content. For example:

{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": [
        "Microsoft Office Word"
      ],
      "author": [
        "janesmith"
      ],
... other metadata fields ...
      "content": "This is the content of my document..."
    }
  ]
}
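As a minimal sketch of how a client might consume this response, the following Python separates the extracted text from the metadata fields. It assumes the response has the shape shown above (metadata values as arrays, content as a single string). The helper name split_document is hypothetical, and the sample is the example above with the elided metadata fields left out.

```python
import json

# The example response from above, without the elided metadata fields.
sample = """
{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": ["Microsoft Office Word"],
      "author": ["janesmith"],
      "content": "This is the content of my document..."
    }
  ]
}
"""


def split_document(doc):
    """Return (content, metadata) for one entry of the "document" array.

    Hypothetical helper: "content" may be absent when text extraction is
    disabled, so it falls back to an empty string.
    """
    content = doc.get("content", "")
    metadata = {key: value for key, value in doc.items() if key != "content"}
    return content, metadata


response = json.loads(sample)
content, metadata = split_document(response["document"][0])
print(content)
print(metadata["author"])
```

Keeping the metadata as a plain dictionary avoids hard-coding field names, because the set of metadata fields depends on the file format.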

If the file is protected, you can send the password in the password parameter. For example:

POST /1/api/[sync|async]/extracttext/v1 file=mydoc.doc password=myfilepassword

You can also specify URLs as input. In this case, IDOL OnDemand retrieves the file and extracts the text. For example:

/1/api/[sync|async]/extracttext/v1?url=http://mysite.com/mydoc.doc
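When you build this request in code, the document URL usually needs to be percent-encoded so that its own characters are not mistaken for part of the query string. The following Python is a sketch under that assumption. The function name build_extract_url and the placeholder key YOUR_API_KEY are hypothetical; the host and the apikey parameter are the ones listed later on this page.

```python
from urllib.parse import urlencode

API_HOST = "https://api.idolondemand.com"


def build_extract_url(document_url, api_key, mode="sync"):
    """Build a Text Extraction request URL for a document fetched by URL.

    Hypothetical helper. "mode" selects the synchronous or asynchronous
    endpoint; urlencode percent-encodes the document URL.
    """
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    query = urlencode({"url": document_url, "apikey": api_key})
    return f"{API_HOST}/1/api/{mode}/extracttext/v1?{query}"


request_url = build_extract_url("http://mysite.com/mydoc.doc", "YOUR_API_KEY")
print(request_url)
```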

You can disable metadata and text extraction individually to return only the text or only the metadata. For example:

POST /1/api/[sync|async]/extracttext/v1 file=mydoc.doc extract_text=false

POST /1/api/[sync|async]/extracttext/v1 file=mydoc.doc extract_metadata=false
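A client might wrap these two switches in a small helper. The sketch below uses the parameter names from the Parameters table (extract_text and extract_metadata) and sends only the values that differ from the documented default of true. The function name extraction_flags is hypothetical, and rejecting a request that disables both switches is a client-side choice made for this example, not documented API behavior.

```python
def extraction_flags(want_text=True, want_metadata=True):
    """Return the request fields that override the extraction defaults.

    Hypothetical helper. Both parameters default to true on the server,
    so a field is included only when the caller turns it off.
    """
    if not (want_text or want_metadata):
        # Client-side guard (an assumption): such a request would return
        # neither text nor metadata.
        raise ValueError("enable text, metadata, or both")
    flags = {}
    if not want_text:
        flags["extract_text"] = "false"
    if not want_metadata:
        flags["extract_metadata"] = "false"
    return flags


metadata_only = extraction_flags(want_text=False)
text_only = extraction_flags(want_metadata=False)
print(metadata_only)
print(text_only)
```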

You can also use the Text Extraction API to extract the contents of files that you have retrieved from a container file by using the Expand Container API. Other APIs use the Text Extraction API internally to extract content from files for further analysis, so you can send a file directly to those APIs. For example:

POST /1/api/[sync|async]/detectsentiment/v1 file=mydoc.doc

Synchronous

https://api.idolondemand.com/1/api/sync/extracttext/v1

Asynchronous

https://api.idolondemand.com/1/api/async/extracttext/v1

Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter  Description
apikey     The API key to use to authenticate the API request.

Input Source

This API accepts a single input source that can be supplied using one of the following parameters:

Parameter  Description
file       A file containing the document to process. Multipart POST only.
reference  An IDOL OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding document is passed to the API.
url        A publicly accessible HTTP URL from which the document can be retrieved.

Parameters

In addition to the above input source, this API accepts the following parameters:

Name                   Type     Description
extract_text           boolean  Whether to extract text from the file. Default value: true
extract_metadata       boolean  Whether to extract metadata from the file. Default value: true
extract_xmlattributes  boolean  Whether to extract XML attributes from the file.
additional_metadata    array    A JSON object of additional metadata to add to the extracted document. Does not apply to JSON source input. For multiple files, keep the values in file order, using an empty object as a placeholder.
reference_prefix       array    A string to prepend to the reference of extracted documents. Does not apply to JSON source input. For multiple files, keep the values in file order, using a space as a placeholder.
password               array    Passwords to use to extract the files.

All parameters for this API are optional.

This API returns a JSON response that is described by the model below. The model is presented both as an easy-to-read abstract definition and as the formal JSON schema.

Model

This is an abstract definition of the response that describes each of the properties that might be returned.

Text Extraction Response {
  document (array[object])
}

Model Schema

This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "type": "object",
    "properties": {
        "document": {
            "type": "array",
            "items": {
                "type": "object",
                "anyOf": [
                    {
                        "type": "object",
                        "properties": {
                            "name": {
                                "type": "string"
                            },
                            "reference": {
                                "type": "string"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "content-type": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "document_attributes": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "keyview_class": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "original_size": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "keyview_type": {
                                "type": "array",
                                "items": {
                                    "type": "integer"
                                }
                            },
                            "import_original_encoding": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "content": {
                                "type": "string"
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference"
                        ]
                    },
                    {
                        "type": "object",
                        "properties": {
                            "reference": {
                                "type": "string"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "error": {
                                "type": "object",
                                "properties": {
                                    "error": {
                                        "type": "integer"
                                    },
                                    "reason": {
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "error",
                                    "reason"
                                ]
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference",
                            "error"
                        ]
                    }
                ]
            }
        }
    },
    "required": [
        "document"
    ]
}
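According to the schema, each entry in the document array is either an extracted document or an error entry that carries an error object with an integer error code and a reason string. The following Python is a sketch of how a client might separate the two. The function name partition_documents is hypothetical, and the second sample entry (its references, error code, and reason) is made up purely for illustration.

```python
def partition_documents(response):
    """Split the "document" array into extracted documents and failures.

    Hypothetical helper: an entry counts as a failure when it has the
    "error" property that the schema requires for error entries.
    """
    extracted, failed = [], []
    for doc in response["document"]:
        if "error" in doc:
            err = doc["error"]
            failed.append((doc["reference"], err["error"], err["reason"]))
        else:
            extracted.append(doc)
    return extracted, failed


# Illustrative data only: the error entry below is invented for this sketch.
sample_response = {
    "document": [
        {
            "reference": "root",
            "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
            "content": "This is the content of my document...",
        },
        {
            "reference": "placeholder-reference",
            "doc_iod_reference": "placeholder-id",
            "error": {"error": 1, "reason": "placeholder reason"},
        },
    ]
}

extracted, failed = partition_documents(sample_response)
print(len(extracted), "extracted;", len(failed), "failed")
```

Checking for the error property, rather than for missing content, keeps the check valid when text extraction is disabled and successful entries carry no content.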