Openai/6913e3a3-1ed8-800a-9a2d-932e764b3c66


This conversation was exported from ChatGPT[1]

gpt-5-thinking edit

User: How could I build a simple txt file ingestion, SQLite as the datastore, ChromaDB as the vector store, and OpenAI API as… edit

How could I build a simple txt file ingestion, SQLite as the datastore, ChromaDB as the vector store, and OpenAI API as the API, RAG pipeline?

It should be written in python, and allow me to ingest new text files. I also want chunking with optional sizing.

Assistant: I put a complete, production-ready Python script on the canvas that does exactly what you asked: ingest .txt files, sto… edit

I put a complete, production-ready Python script on the canvas that does exactly what you asked: ingest .txt files, store metadata in SQLite, persist vectors in ChromaDB, and run RAG with the OpenAI API. It supports configurable chunk size/overlap and token-aware chunking (uses tiktoken if present; otherwise falls back to char-based).

What you’ll need edit

  • Python 3.9+
  • OPENAI_API_KEY in your environment
  • pip install chromadb openai tiktoken (tiktoken is optional but recommended)

Quick start edit

# 1) install
pip install chromadb openai tiktoken

=== 2) set your key ===
export OPENAI_API_KEY=sk-...

=== 3) ingest .txt files (folder or single file) ===
python rag.py ingest --path ./texts --chunk-size 1200 --chunk-overlap 200

=== 4) query ===
python rag.py query --question "Summarize the incident notes" --top-k 5
Other commands edit
  • List what’s been ingested:
python rag.py show-docs
Notes edit
  • Defaults: SQLite at ./.rag.sqlite3, Chroma at ./.chroma, collection rag_demo.
  • Embeddings: text-embedding-3-small. Generation: gpt-4o-mini. Change via env vars RAG_EMBED_MODEL/RAG_GEN_MODEL.
  • Idempotent ingestion: file checksum + mtime; re-ingesting updates vectors and SQLite rows only if the file changed.
  • Each retrieved chunk is labeled with its source filename and chunk index in the prompt for traceability.

If you want a separate “ingest daemon” or a simple REST wrapper (FastAPI) around query, say the word and I’ll add it.

References edit

  1. "Build RAG pipeline". ChatGPT. Retrieved 2025-11-12.