Extract text from files

Speed up document ingestion by extracting text from files for processing

With Squid's text extraction feature, you can pull the text out of files and process it programmatically. Compliance documentation, earnings reports, scientific studies, and many other kinds of documents need to be reviewed and acted on based on their contents.

Manual ingestion of this information is time-consuming and error-prone. With Squid, tasks that previously took days can be completed in seconds.

Note

Text extraction requires admin privileges, so perform it only in a secure environment that has access to your Squid API key, such as the Squid backend or another server environment.

Create the extraction client

To perform text extraction on a document, first create an extraction client using the extraction() method:

Backend code
const extractionClient = this.squid.extraction();

Extract the text

Use the extraction client's extractDataFromDocumentFile method to extract text from a file. The method accepts either a File or a BlobAndFileName. It returns a promise that resolves to a result object whose pages property is an array of pages. The text of a given page can be accessed using its text attribute.

The following example shows extracting text using the BlobAndFileName type:

Backend code
const data = {
  blob: dataBlob,
  name: 'myDocument.pdf',
};

const extractedResult =
  await extractionClient.extractDataFromDocumentFile(data);
console.log(extractedResult.pages[0].text); // 'Q4 Development Plan...'

What content is extracted?

Squid's extraction client uses AI to extract text from many kinds of file content. In addition to plain text files, it handles scanned documents, tables, text in multiple languages, and more.
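
Whatever the source format, the text comes back page by page. When you want the whole document as a single string, the page texts can be joined. The following is a minimal sketch, not part of the Squid SDK: PageLike and joinPageText are hypothetical names, and the only thing assumed about a page is the text attribute described earlier. The sample data stands in for a real extraction result.

```typescript
// Hypothetical minimal shape: only the `text` attribute is assumed.
interface PageLike {
  text: string;
}

// Joins the text of every page into one string, separated by blank lines.
// Pages whose text is empty or whitespace-only are skipped.
function joinPageText(pages: PageLike[], separator = '\n\n'): string {
  return pages
    .map((page) => page.text.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}

// Stand-in data in place of a real extraction result.
const samplePages: PageLike[] = [
  { text: 'Q4 Development Plan' },
  { text: '   ' },
  { text: 'Milestones and owners' },
];

console.log(joinPageText(samplePages));
```

With a real result, you would pass extractedResult.pages instead of samplePages, assuming each page exposes text as shown in the earlier example.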

The extraction client also provides an extractDataFromDocumentUrl method, which extracts text from a document at a remote URL. Both extraction methods accept an optional options parameter that lets you specify the pageIndexes of the document to extract, along with some options for image extraction:

Backend code
const extractedResult = await extractionClient.extractDataFromDocumentUrl(
  'www.file-url.com',
  { pageIndexes: [0, 1, 2] }
);
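
For a longer run of pages, writing out every index by hand gets tedious. A small helper can build the pageIndexes list instead. This is only a sketch: pageRange is a hypothetical helper, not part of the Squid SDK, and it assumes indexes are zero-based, as the [0, 1, 2] example suggests.

```typescript
// Builds a list of consecutive page indexes starting at `start`.
// Assumption: page indexes are zero-based.
function pageRange(start: number, count: number): number[] {
  const valid =
    Math.floor(start) === start &&
    Math.floor(count) === count &&
    start >= 0 &&
    count >= 0;
  if (!valid) {
    throw new RangeError('start and count must be non-negative integers');
  }
  const indexes: number[] = [];
  for (let i = 0; i < count; i++) {
    indexes.push(start + i);
  }
  return indexes;
}

// Options object in the same shape as the example above.
const firstThreePages = { pageIndexes: pageRange(0, 3) };
console.log(firstThreePages.pageIndexes); // [ 0, 1, 2 ]
```

The resulting object can be passed as the options argument in place of a hand-written list.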

Next steps

Once your text is extracted, you can:

  • Parse the text and write to a database using Squid's database connectors.
  • Pass text as part of a query to a Squid AI Agent to answer questions or take actions based on the text.
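
As a rough illustration of the first option, the sketch below parses extracted text into a plain record that could then be written to a database. It assumes the document contains simple "Label: value" lines, which real documents often will not; parseFields and the sample text are hypothetical, and the database write itself is not shown.

```typescript
// Parses lines of the form "Label: value" into a plain object.
// Lines without a colon are ignored; a repeated label keeps its last value.
function parseFields(text: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    // Label is everything before the first colon; value is the rest, trimmed.
    const match = /^\s*([^:]+?)\s*:\s*(.+?)\s*$/.exec(lines[i]);
    if (match) {
      fields[match[1]] = match[2];
    }
  }
  return fields;
}

// Stand-in for text taken from an extracted page.
const sampleText =
  'Title: Q4 Development Plan\nOwner: Platform team\nNo fields on this line';
const parsedRecord = parseFields(sampleText);
console.log(parsedRecord);
```

For less regular documents, passing the text to a Squid AI Agent, as in the second option, is likely a better fit than hand-written parsing.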