Extract text from files

Automate tasks and speed up document ingestion by extracting text from files for processing.

With Squid's text extraction feature, you can process the contents of multiple large documents quickly. Every industry relies on documents that are integral to the business, from compliance documentation to earnings reports to scientific studies. These documents need to be reviewed, and actions must be taken based on their contents. Manual ingestion of this information is time-consuming and error-prone. With Squid, tasks that once took hours or days are completed in seconds, so you get the information you need right away.

Use cases

  • Turn the information in your unstructured documents into structured datasets: read your resources, process the text, and write the results to your data source. From there, you can query your data programmatically using the Client SDK, or use Query with AI to ask questions about your data and generate charts and graphs on the fly.
  • Extract text to pass to an AI agent that can take relevant actions depending on the contents of the text.
Note

Text extraction requires admin privileges, so it should only be performed in a secure environment that has access to your Squid API key, such as the Squid backend or another server environment.

Create the extraction client

To perform text extraction on a document, first create an extraction client using the extraction() method:

Backend code
const extractionClient = this.squid.extraction();

Extract the text

Use the extraction client's extractDataFromDocumentFile method to extract text from the file. The method accepts an argument of one of two types: File or BlobAndFileName. The following example shows extracting text using the BlobAndFileName type:

Backend code
const extractionClient = this.squid.extraction();

// dataBlob is a Blob holding the contents of the file to extract.
const data = {
  blob: dataBlob,
  name: 'myDocument.pdf',
};

const extractedResult = await extractionClient.extractDataFromDocumentFile(
  data
);
console.log(extractedResult.pages[0].text); // 'Q4 Development Plan...'

The extractDataFromDocumentFile method returns a promise that resolves to an object containing an array of pages. The text of a given page can be accessed using that page's text attribute, as in extractedResult.pages[0].text.
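
To work with a whole document rather than a single page, the per-page text can be joined into one string. The following is a minimal sketch that assumes only the shape used in the example above (a pages array whose items have a text attribute). The ExtractedPage and ExtractedDocument interfaces and the joinPageText helper are hypothetical names for illustration and are not part of the Squid SDK.

```typescript
// Hypothetical types mirroring only the fields used in the example above.
// The real result type from the Squid SDK may contain additional fields.
interface ExtractedPage {
  text: string;
}

interface ExtractedDocument {
  pages: ExtractedPage[];
}

// Joins the text of every page into one string, skipping pages that are
// empty after trimming whitespace.
function joinPageText(result: ExtractedDocument, separator = '\n\n'): string {
  return result.pages
    .map((page) => page.text.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}
```

Assuming the result has the shape described, the helper could be called as joinPageText(extractedResult) to get the full document text in page order.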

Next steps

Once your text is extracted, you can:

  • Parse the text and write to a database using Squid's Client SDK.
  • Pass text as part of a query to a Squid AI Agent to answer questions or take actions based on the text.
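
As an illustration of the parsing step, the sketch below collects "Label: value" lines from extracted text into a plain object that could then be written to a data source. It assumes the document happens to contain such labeled lines. The parseLabeledLines helper is hypothetical and not part of the Squid SDK, and real documents will usually need parsing rules specific to their layout.

```typescript
// Hypothetical helper: collects "Label: value" lines from extracted text
// into a plain object. Lines without a label or without a value are ignored.
function parseLabeledLines(text: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of text.split(/\r?\n/)) {
    const separatorIndex = line.indexOf(':');
    if (separatorIndex <= 0) continue; // no colon, or nothing before it
    const label = line.slice(0, separatorIndex).trim();
    // Everything after the first colon is the value, so values such as
    // URLs that contain further colons are kept whole.
    const value = line.slice(separatorIndex + 1).trim();
    if (label && value) fields[label] = value;
  }
  return fields;
}
```

The resulting object holds one entry per labeled line, which keeps the write to your data source separate from the parsing logic.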