Extract text from files
Speed up document ingestion by extracting text from files for processing
With Squid's text extraction feature, you can easily process the contents of files for fast turnaround of information. From compliance documentation to earnings reports to scientific studies, there are many types of documents that need to be reviewed and actions that must be taken based on their contents.
Manual ingestion of this information is time-consuming and error-prone. With Squid, tasks that previously took days can be completed in seconds.
Text extraction requires admin privileges, so it should only be performed in a secure environment that has access to your Squid API key, such as the Squid backend or another server environment.
Create the extraction client
To perform text extraction on a document, first create an extraction client using the extraction() method:
const extractionClient = this.squid.extraction();
Extract the text
Use the extraction client's extractDataFromDocumentFile method to extract text from the file. The method accepts one of two input types: File or BlobAndFileName. It returns a promise that resolves to a result object containing an array of pages. The text of a given page can be accessed using its text attribute.
The following example shows extracting text using the BlobAndFileName type:
const data = {
  blob: dataBlob,
  name: 'myDocument.pdf',
};
const extractedResult =
  await extractionClient.extractDataFromDocumentFile(data);
console.log(extractedResult.pages[0].text); // 'Q4 Development Plan...'
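Because the result is organized by page, downstream code often needs the whole document as a single string. The following is a minimal sketch of that step, not part of the Squid SDK. The ExtractedPage and ExtractedDocument types and the joinPageText helper are hypothetical names, and the result shape (a pages array whose entries have a text attribute) is assumed from the example above.

```typescript
// Assumed result shape, inferred from the example above: a `pages` array
// whose entries expose a `text` attribute. These local types are
// illustrative and are not the SDK's own declarations.
interface ExtractedPage {
  text: string;
}

interface ExtractedDocument {
  pages: ExtractedPage[];
}

// Hypothetical helper: join the text of every page into one string,
// trimming each page and skipping pages that contain only whitespace.
function joinPageText(result: ExtractedDocument, separator = '\n\n'): string {
  return result.pages
    .map((page) => page.text.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}

// Stand-in data used in place of a real extraction result.
const sampleResult: ExtractedDocument = {
  pages: [{ text: 'Q4 Development Plan' }, { text: '   ' }, { text: 'Milestones' }],
};

const fullText = joinPageText(sampleResult);
console.log(fullText);
```

In real use, the value returned by extractDataFromDocumentFile would be passed in place of sampleResult, provided its shape matches the assumption above.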
Squid's extraction client uses AI to extract text from many kinds of file content. In addition to reading plain text files, it can extract text from scanned documents, tables, documents written in multiple languages, and more.
The extraction client has an additional extractDataFromDocumentUrl method, which extracts text from a document at a remote URL. Both extraction methods accept an optional options parameter that lets you specify the pageIndexes of the document to extract, along with some options for image extraction:
const extractedResult = await extractionClient.extractDataFromDocumentUrl(
  'www.file-url.com',
  { pageIndexes: [0, 1, 2] }
);
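For longer documents, listing every page index by hand becomes tedious. The sketch below builds the pageIndexes array from a start and end page. The pageRange helper is a hypothetical name, not an SDK function, and it assumes that page indexes are zero-based, as the [0, 1, 2] example suggests.

```typescript
// Hypothetical helper: builds [start, start + 1, ..., end] for use as
// the pageIndexes option. Assumes zero-based page indexes.
function pageRange(start: number, end: number): number[] {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end < start) {
    throw new RangeError(`Invalid page range: ${start}-${end}`);
  }
  return Array.from({ length: end - start + 1 }, (_, i) => start + i);
}

// Options object in the same shape as the example above.
const options = { pageIndexes: pageRange(0, 2) };
console.log(options.pageIndexes);
```

The resulting options object can be passed as the second argument to either extraction method. Validating the range up front avoids sending a request with negative or reversed indexes.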
Next steps
Once your text is extracted, you can:
- Parse the text and write to a database using Squid's database connectors.
- Pass text as part of a query to a Squid AI Agent to answer questions or take actions based on the text.
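As an illustration of the parsing step, the sketch below turns "Label: value" lines from extracted text into a flat record that could then be written to a database. The parseFields helper, the line format, and the sample text are all assumptions made for this example. Real documents will usually need parsing rules specific to their layout.

```typescript
// Hypothetical parser: collects "Label: value" lines into a flat record.
// Lines without a colon are ignored.
function parseFields(text: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of text.split(/\r?\n/)) {
    // Label is everything before the first colon; value is the rest.
    const match = line.match(/^\s*([^:]+?)\s*:\s*(.+?)\s*$/);
    if (match) {
      fields[match[1]] = match[2];
    }
  }
  return fields;
}

// Stand-in text used in place of a real extracted page.
const sampleText = 'Q4 Development Plan\nOwner: Platform Team\nDue date: 2024-12-15';

const record = parseFields(sampleText);
console.log(record);
```

Keeping the parsing logic in a plain function like this makes it easy to test separately from the extraction call and from the database write.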