{{DISPLAYTITLE:Build RAG pipeline}}
''This conversation was exported from ChatGPT''<ref>{{cite web|title=Build RAG pipeline|url=https://chatgpt.com/share/6913e3a3-1ed8-800a-9a2d-932e764b3c66|publisher=ChatGPT|access-date=2025-11-12}}</ref>

== gpt-5-thinking ==

=== User ===

How could I build a simple RAG pipeline in Python, with txt-file ingestion, SQLite as the datastore, ChromaDB as the vector store, and the OpenAI API as the LLM API? It should let me ingest new text files, and I also want chunking with configurable sizing.

=== Assistant ===

I put a complete, production-ready Python script on the canvas that does exactly what you asked: it ingests .txt files, stores metadata in SQLite, persists vectors in ChromaDB, and runs RAG with the OpenAI API. It supports configurable chunk size/overlap and token-aware chunking (using tiktoken if present, otherwise falling back to character-based chunking).

==== What you’ll need ====
* Python 3.9+
* <code>OPENAI_API_KEY</code> in your environment
* <code>pip install chromadb openai tiktoken</code> (tiktoken is optional but recommended)

==== Quick start ====
<syntaxhighlight lang="bash">
# 1) install
pip install chromadb openai tiktoken

# 2) set your key
export OPENAI_API_KEY=sk-...

# 3) ingest .txt files (folder or single file)
python rag.py ingest --path ./texts --chunk-size 1200 --chunk-overlap 200

# 4) query
python rag.py query --question "Summarize the incident notes" --top-k 5
</syntaxhighlight>

===== Other commands =====
* List what’s been ingested:
<syntaxhighlight lang="bash">
python rag.py show-docs
</syntaxhighlight>

===== Notes =====
* Defaults: SQLite at <code>./.rag.sqlite3</code>, Chroma at <code>./.chroma</code>, collection <code>rag_demo</code>.
* Embeddings: <code>text-embedding-3-small</code>. Generation: <code>gpt-4o-mini</code>. Change via the env vars <code>RAG_EMBED_MODEL</code>/<code>RAG_GEN_MODEL</code>.
* Idempotent ingestion: each file’s checksum and mtime are recorded, so re-ingesting updates vectors and SQLite rows only if the file has changed.
* Each retrieved chunk is labeled with its source filename and chunk index in the prompt, for traceability.

If you want a separate “ingest daemon” or a simple REST wrapper (FastAPI) around query, say the word and I’ll add it.

== References ==
<references />

[[Category:ChatGPT Shares]]
[[Category:openai]]
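The token-aware chunking with a character-based fallback that the assistant describes could be sketched as follows. The function name, the <code>cl100k_base</code> encoding choice, and the defaults are assumptions for illustration, not the canvas script’s actual code:

```python
def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, counted in tokens if tiktoken is available."""
    step = chunk_size - chunk_overlap  # how far the window advances each iteration
    try:
        import tiktoken  # optional dependency; assumed encoding choice below
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        return [
            enc.decode(tokens[start:start + chunk_size])
            for start in range(0, len(tokens), step)
        ]
    except ImportError:
        # Fallback: character-based chunking with the same overlap scheme.
        return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Both branches use the same sliding-window scheme, so switching tiktoken on or off only changes the unit of measurement (tokens vs. characters), not the chunking behavior.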
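The idempotent-ingestion idea (checksum plus mtime tracked in SQLite) could look roughly like this. The table name, columns, and function name are assumptions, not necessarily what the canvas script uses:

```python
import hashlib
import os
import sqlite3


def file_changed(db: sqlite3.Connection, path: str) -> bool:
    """Return True if `path` is new or modified since it was last ingested."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(path TEXT PRIMARY KEY, checksum TEXT, mtime REAL)"
    )
    with open(path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()
    mtime = os.path.getmtime(path)
    row = db.execute(
        "SELECT checksum, mtime FROM documents WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == checksum and row[1] == mtime:
        return False  # unchanged: caller can skip re-embedding
    # New or changed: upsert the bookkeeping row (SQLite 3.24+ syntax).
    db.execute(
        "INSERT INTO documents (path, checksum, mtime) VALUES (?, ?, ?) "
        "ON CONFLICT(path) DO UPDATE SET checksum = excluded.checksum, "
        "mtime = excluded.mtime",
        (path, checksum, mtime),
    )
    db.commit()
    return True
```

Checking the checksum as well as the mtime means a file touched without modification is still skipped, while a genuinely edited file triggers re-embedding.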
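Labeling each retrieved chunk with its source filename and chunk index, as the notes mention, could be done with a prompt builder along these lines (the label format and instruction wording are illustrative assumptions):

```python
def build_prompt(question: str, retrieved: list[tuple[str, int, str]]) -> str:
    """Assemble a RAG prompt where each chunk carries a traceable source label.

    `retrieved` is assumed to be a list of (filename, chunk_index, text) tuples.
    """
    context = "\n\n".join(
        f"[source: {filename}, chunk {index}]\n{text}"
        for filename, index, text in retrieved
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their labels.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because the labels survive into the model’s context, the generated answer can cite them, which makes it easy to trace a claim back to a specific file and chunk.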