LLM data ingestion pipeline with Langchain & Robocorp

Schedule and run your Python RAG data loaders in the cloud with Robocorp Control Room

Are you curious about what's happening behind the scenes with ReMark💬, a code-gen assistant trained specifically to help developers build automation bots on Robocorp? Here we expose (almost) everything about how we create the vector embeddings from various sources! ReMark💬 is trained on Robocorp documentation and examples, which live in JSON files, GitHub repos and on websites.

This example shows how to implement an LLM data ingestion pipeline with Robocorp using Langchain. The need for simple pipelines that run frequently has exploded, and one driver is retrieval-augmented generation (RAG), where source data must be loaded into a vector database as embeddings again and again to keep it current.
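
To make this concrete, here is a minimal sketch of such a pipeline using classic Langchain APIs. The source URL, chunking parameters and the FAISS vector store are illustrative choices, not necessarily what this bot uses:

```python
# Minimal RAG ingestion sketch (illustrative; the source URL, chunk sizes
# and FAISS store are placeholder choices, not this bot's exact setup).
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1. Load source documents (here: a single web page).
docs = WebBaseLoader("https://example.com/docs").load()

# 2. Split into overlapping chunks that fit the embedding model's context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks and persist them in a vector database.
#    OpenAIEmbeddings reads the API key from the OPENAI_API_KEY env variable.
store = FAISS.from_documents(chunks, OpenAIEmbeddings())
store.save_local("vectorstore")
```

Running something like this on a schedule keeps the vector database in sync with the sources; the rest of this example is about doing exactly that with Robocorp.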

The benefits of using Robocorp for RAG data ingestion:

  • Zero infra: run and schedule workflows in the Robocorp Control Room (4h/month runtime for free!)
  • Also supports running the workflows on self-hosted infrastructure
  • Connect your git repo, and your new updates deploy automatically to workers in the cloud
  • Use Asset Storage to manage configurations - update without code changes
  • Easy management of Python environments between dev and prod
  • Huge ecosystem of tools like LlamaIndex, Playwright and BeautifulSoup4
  • It's all Python 🐍

Setup

The following configuration is needed to run the ingestion pipeline:

  • Get VS Code with Robocorp Code connected to your Robocorp workspace (get a free account here)
  • An OpenAI API key in Robocorp Vault, stored in a secret called OpenAI with an entry named key.
  • Configuration data stored in Control Room Asset Storage under the name rag_loader_config. The sketch below shows how both of these are read at runtime.
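
For reference, a minimal sketch of reading the secret and the configuration, assuming the robocorp library's vault and storage helpers (the config shape shown in the comment is illustrative; the real schema is defined by the loaders in this repo):

```python
# Read secrets and configuration at runtime (config shape is illustrative).
from robocorp import storage, vault

# The OpenAI API key lives in the Vault secret named "OpenAI", entry "key".
openai_api_key = vault.get_secret("OpenAI")["key"]

# The loader configuration lives in Asset Storage. An illustrative shape:
# {"urls": [...], "whitelist": [...], "blacklist": [...]}
config = storage.get_json("rag_loader_config")
```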

The bot

The bot contains three example loaders, each implemented as a class in the loaders directory:

  • PortalLoader: Reads a JSON configuration file and traverses multiple GitHub repos to collect descriptions and code examples.
  • RoboLoader: Reads Markdown from a GitHub repo that contains Python library documentation.
  • RPALoader: Reads a JSON configuration file and scrapes documentation website contents using BeautifulSoup4.

For each loader, the URLs and whitelist/blacklist data are read from Control Room Asset Storage, meaning you can add or remove entries without code changes or deployments. A simplified sketch of one such loader is shown below.
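
As an illustration, a website loader in the spirit of RPALoader might look roughly like this. The class name, config fields and filtering logic are simplified stand-ins for the real implementations in the loaders directory:

```python
# Simplified stand-in for a website loader such as RPALoader (not the repo's exact code).
import requests
from bs4 import BeautifulSoup
from langchain.schema import Document


class WebsiteLoader:
    def __init__(self, config: dict):
        # Config comes from the rag_loader_config asset; field names are illustrative.
        self.urls = config.get("urls", [])
        self.blacklist = set(config.get("blacklist", []))

    def load(self) -> list[Document]:
        docs = []
        for url in self.urls:
            if url in self.blacklist:
                continue  # skip blacklisted entries, no code change needed
            html = requests.get(url, timeout=30).text
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            docs.append(Document(page_content=text, metadata={"source": url}))
        return docs
```

In this sketch each loader returns Langchain Document objects, so the downstream splitting and embedding steps stay the same regardless of the source.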

Control Room

Control Room lets you run the data loaders repeatedly and reliably on the schedule and configuration of your choice. It also supports alerts on errors, so you'll always know what's happening.
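
Under the hood, a scheduled run simply executes the tasks the robot declares. A minimal entry point, assuming the robocorp-tasks framework, could look like this (the task name and steps are illustrative):

```python
# Illustrative task entry point that Control Room schedules and runs.
from robocorp.tasks import task


@task
def ingest_documents():
    # 1. Read the OpenAI key from Vault and the config from Asset Storage.
    # 2. Run each loader (PortalLoader, RoboLoader, RPALoader) to collect documents.
    # 3. Split, embed and upsert the documents into the vector database.
    ...
```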

Follow the video to see how to set things up from a GitHub repo in Robocorp. This is what you'll see:

  • Connect to your repo (updates will be automatically deployed)
  • Create an Asset for config data (URLs, whitelist, blacklist)
  • Create a Process that combines the repo with a worker, in this case, the Robocorp cloud container
  • Configure a schedule
  • Set alerts, for example, only for failed runs
  • RUN IT 🏃