Built a production-ready conversational AI agent from scratch, then went beyond the tutorial: redesigned the entire knowledge base after discovering that retrieval quality depends far more on input data design than on pipeline code.
The goal was to learn LangChain hands-on by building something real and immediately useful: a chat agent for this portfolio that can answer questions about my education, professional experience, and projects, grounded entirely in my own documents (CV, thesis presentations, project write-ups).
Rather than following a tutorial, the project was built end-to-end in a single day, covering every layer of the stack: document ingestion, vector search, conversational memory, API design, security, and cloud deployment.
The chat widget is live on cv.manuelmezo.com, powered by a FastAPI backend deployed on HF Spaces.
After deploying the initial version, the agent gave poor, vague, or hallucinated answers to most interesting questions. The pipeline code was correct. The problem was the data going into it.
| v1: What we had | v2: What we built |
|---|---|
| Presentation slides (PPT → PDF): Bullet fragments, no prose, minimal text per page. A slide saying "improved DEA 300 to 90bps" is meaningless without context. | Full thesis documents + clean markdown files: Complete text with narrative, evidence, and context. Each chunk can stand alone. |
| Raw interview-prep DOCX: Mixed formats, bullet lists, rough notes. The 500-char splitter cut mid-story, leaving fragments like "Additionally he handled with 'brio' this cross-functional project..." | 17 hand-authored STAR stories in prose: Each story is a self-contained narrative unit. The splitter keeps the full Situation → Task → Action → Result coherent per chunk. |
| No cross-cutting summaries: No single document could answer "What is Manuel's working style?" because that answer lives across 17 stories, not in any one chunk. | Synthetic summary documents: Dedicated files for professional summary, technical skills taxonomy, and personality/approach, designed to answer exactly those high-level questions. |
| Blind 500-char chunking for everything: Character count is agnostic to document structure. A 500-char cut lands in the middle of a Result paragraph. | Semantic story-level chunking for markdown: Stories are split on --- separators first. Character splitting is a fallback only if a story exceeds 1,800 chars. |
| No chunk metadata: Every chunk looks the same to the retriever. No way to filter by company, topic, or skill. | Structured metadata per chunk: Each story chunk carries company, skills, year fields extracted from the header, enabling future filtered retrieval. |
17 STAR-format stories from Amazon and McKinsey, each with metadata header and clean S/T/A/R prose. Replaces a raw interview-prep DOCX.
~600-word narrative bio covering career arc, what each role taught, and target roles. Answers "who is Manuel?" directly.
Structured skills taxonomy with evidence per skill: not "knows Python" but "used for X, Y, Z; see projects A, B." Answers skills questions with specifics.
7 sections on working style, each backed by a real story from the STAR document. Answers "how does Manuel approach X?" questions.
Descriptions of all 15+ portfolio projects: tech stack, what it does, what problem it solves. Sourced from the portfolio website and GitHub.
The key insight is that different document types require different splitting strategies. A one-size-fits-all character splitter destroys narrative structure. The prepare_knowledge_base.py script applies the right strategy per file type:
import re

MAX_CHUNK = 1800  # characters; the sub-split threshold described above


def load_markdown(filepath):
    with open(filepath, encoding="utf-8") as f:
        content = f.read()

    # Primary: split on "---" story separators
    # Each STAR story stays as one coherent chunk
    raw_sections = re.split(r'\n---\n', content)

    # Fallback: split on ## headers if no --- separators
    if len(raw_sections) == 1 and len(content) > MAX_CHUNK:
        raw_sections = re.split(r'\n(?=## )', content)

    chunks = []
    for section in raw_sections:
        meta = _extract_metadata_from_section(section, filepath)
        # Sub-split only if section exceeds 1800 chars
        # (_section_chunks is defined elsewhere in the script)
        chunks.extend(_section_chunks(section, meta))
    return chunks


def _extract_metadata_from_section(text, filepath):
    # Pull Company, Skills, Year from the story header line
    company_m = re.search(r'\*\*Company:\*\*\s*([^|]+)', text)
    skills_m = re.search(r'\*\*Skills:\*\*\s*(.+)$', text, re.MULTILINE)
    return {
        "source": filepath,
        "company": company_m.group(1).strip() if company_m else "",
        "skills": skills_m.group(1).strip() if skills_m else "",
    }
The infrastructure didn't change. The answer quality improved entirely because the chunks retrieved now carry full narrative context (company, challenge, specific actions, measurable result) rather than isolated bullet fragments.
LangChain provides document loaders for almost every file format. Each loader reads a file and returns a list of Document objects, the universal unit of content in LangChain.
A Document has two fields: page_content (the raw text) and metadata (a dict with the source path, page number, etc.). This metadata travels with the content all the way to the final answer, making it possible to cite sources.
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
# PyPDFLoader splits one Document per physical PDF page
loader = PyPDFLoader("data/cv.pdf")
docs = loader.load() # returns list[Document]
print(docs[0].metadata) # {'source': 'data/cv.pdf', 'page': 0}
print(docs[0].page_content) # raw text of page 1
PyPDFLoader maps exactly to physical PDF pages. Docx2txtLoader returns the entire Word file as a single Document.
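Because the metadata stays attached to each chunk, a list of citations can be derived from whatever the retriever returns. The sketch below uses a small stand-in dataclass with the same two fields as a LangChain Document, so it runs without LangChain installed; `format_sources` and the `data/notes.docx` path are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document: same two fields, nothing else."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs):
    """Build a de-duplicated, order-preserving list of citations."""
    seen, sources = set(), []
    for d in docs:
        source = d.metadata.get("source", "unknown")
        page = d.metadata.get("page")
        # PDF page numbers in the metadata are 0-indexed, so add 1 for display.
        label = f"{source} (p. {page + 1})" if page is not None else source
        if label not in seen:
            seen.add(label)
            sources.append(label)
    return sources


retrieved = [
    Doc("first chunk text", {"source": "data/cv.pdf", "page": 0}),
    Doc("second chunk text", {"source": "data/cv.pdf", "page": 0}),
    Doc("whole Word file", {"source": "data/notes.docx"}),
]
print(format_sources(retrieved))
# ['data/cv.pdf (p. 1)', 'data/notes.docx']
```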
LLMs and embedding models have context limits. More importantly, a full PDF page often contains multiple topics; sending the whole page as context would pollute the retrieval signal. We split each document into smaller chunks that each represent one coherent idea.
RecursiveCharacterTextSplitter is the recommended default. It tries to split on paragraph breaks first, then line breaks, then spaces between words, and only as a last resort between characters, so chunks stay semantically coherent rather than cutting mid-sentence.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # max characters per chunk
    chunk_overlap=80,   # shared chars between consecutive chunks
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(raw_docs)
# Filter blank chunks (e.g. from empty presentation slides)
chunks = [c for c in chunks if c.page_content.strip()]
print(f"92 pages -> {len(chunks)} chunks")  # 92 -> 318
chunk_size: Maximum characters per chunk. Too large means noisy context; too small means insufficient information. 500 chars works well for dense PDFs.
chunk_overlap: Characters shared between consecutive chunks. Prevents a sentence from being cut right at a boundary and losing context.
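To make the two parameters concrete, here is a deliberately naive fixed-window splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators); it only illustrates how the size and overlap values interact. The function name `sliding_chunks` is made up for this sketch.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=80):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])           # [500, 500, 160]
print(chunks[0][-80:] == chunks[1][:80])  # True
```

Each window advances by 420 characters (500 minus 80), so the last 80 characters of one chunk reappear at the start of the next.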
An embedding is a vector of numbers (e.g. 384 floats) that represents the meaning of a piece of text. The key property: semantically similar text produces vectors that are close together in that high-dimensional space.
This project uses BAAI/bge-small-en-v1.5, a ~90 MB model that runs entirely locally on CPU with no API calls. BGE (BAAI General Embedding) models are trained specifically for retrieval tasks, making them a better fit for RAG than general-purpose sentence similarity models like MiniLM.
from langchain_huggingface import HuggingFaceEmbeddings
# Loads the model locally; downloads ~90MB on first run
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5"
)
# .embed_query() converts a string to a list of 384 floats
vector = embeddings.embed_query("What did Manuel study?")
print(len(vector))   # 384
print(vector[:3])    # [-0.031, 0.072, -0.014, ...]
The initial version used all-MiniLM-L6-v2 (MTEB retrieval score ~40), a fast, general-purpose model. After reviewing benchmarks, it was upgraded to BAAI/bge-small-en-v1.5 (MTEB retrieval ~51), which is the same size but ~25% better on retrieval tasks because it was trained specifically for that purpose. Cloud APIs like Gemini Embedding offer higher quality but add API cost and latency to every single query, which is not worth it when a strong local model works just as well for this use case.
A vector store is a database optimised for similarity search. Given a query vector, it finds the stored vectors that are closest to it (by cosine similarity) and returns the corresponding documents.
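The core operation can be sketched as a brute-force search in plain Python. Real vector stores use an approximate index instead of scoring every vector, and the three-dimensional vectors and chunk labels below are toy values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional vectors; the embeddings in this project have 384 dimensions.
stored = [
    ("education chunk", [0.9, 0.1, 0.0]),
    ("project chunk", [0.1, 0.9, 0.1]),
    ("hobby chunk", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], stored, k=2)
print([text for _, text in results])  # ['education chunk', 'project chunk']
```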
ChromaDB runs entirely locally: no server setup, no account, no cost. It persists to disk so the index survives restarts without re-embedding everything.
from langchain_chroma import Chroma
import uuid
# Embed all chunks once and store in ChromaDB
texts = [c.page_content for c in chunks]
metas = [c.metadata for c in chunks]
ids = [str(uuid.uuid4()) for _ in chunks]
vectors = embeddings.embed_documents(texts)
vs = Chroma(persist_directory="chroma_db", embedding_function=embeddings)
vs._collection.add(
    documents=texts,
    embeddings=vectors,
    metadatas=metas,
    ids=ids,
)
print(vs._collection.count()) # 318
LangChain's Chroma.from_documents() has a batching bug on some versions. Using _collection.add() directly bypasses it and gives full control.
persist_directory stores the SQLite database and HNSW index files locally. On subsequent runs, reload with Chroma(persist_directory=...), with no re-embedding needed.
With the vector store built, the next step is connecting retrieval to an LLM. LangChain's LCEL (LangChain Expression Language) uses the | pipe operator to chain components, conceptually similar to a Unix shell pipe.
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
retriever = vs.as_retriever(search_type="similarity", search_kwargs={"k": 8})
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    # Step 1: add 'context' key by retrieving relevant chunks
    RunnablePassthrough.assign(
        context=RunnableLambda(lambda x: x["input"]) | retriever | format_docs
    )
    # Step 2: fill the prompt template with {input} and {context}
    | answer_prompt
    # Step 3: send to the LLM
    | llm
    # Step 4: extract the plain text string from the response object
    | StrOutputParser()
)
answer = rag_chain.invoke({"input": "What did Manuel study?"})
RunnablePassthrough.assign() adds context (the retrieved text) to the input dict, so the prompt template has both input (the question) and context (the evidence) available.
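The pipe composition and the assign step can be imitated in a few lines of plain Python, which may help demystify what the chain does. This is a toy model, not LangChain's implementation: `Step`, `assign`, and the `fake_*` components are hypothetical stand-ins invented for this sketch.

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def assign(**steps):
    """Copy the input dict and add one key per step, computed from the input."""
    return Step(lambda d: {**d, **{k: s.invoke(d) for k, s in steps.items()}})


# Hypothetical stand-ins for the retriever, document formatter, prompt, and LLM.
fake_retriever = Step(lambda q: [f"chunk about: {q}"])
join_docs = Step(lambda docs: "\n\n".join(docs))
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['input']}")
fake_llm = Step(lambda prompt: prompt.upper())

chain = (
    assign(context=Step(lambda d: d["input"]) | fake_retriever | join_docs)
    | fake_prompt
    | fake_llm
)
print(chain.invoke({"input": "hi"}))
```

The original `input` key survives the assign step untouched, which is why the prompt can still read the question after retrieval has run.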
A basic RAG chain treats every question independently. If a user asks "What tools did he use for it?" after asking about the thesis, the retriever has no idea what "it" refers to. The fix requires two additions:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# Step 1: rephrase ambiguous questions using history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the question as standalone. Do not answer."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
contextualize_chain = contextualize_prompt | llm | StrOutputParser()

def contextualize_question(inputs):
    if inputs.get("chat_history"):
        return contextualize_chain.invoke(inputs)
    return inputs["input"]  # first message: no history, pass through

# Step 2: wrap chain with automatic history management
session_store = {}

def get_session_history(session_id):
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

conversational_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)
Each conversation is identified by a session_id. The server maps that ID to a ChatMessageHistory object in memory. The client stores the ID in JavaScript and sends it with every follow-up message; this is how the agent "remembers" the conversation without any login or user accounts.
The RAG chain is wrapped in a FastAPI application, exposing a single POST /chat endpoint. FastAPI was chosen for its automatic request validation via Pydantic models, async support, and auto-generated interactive docs at /docs.
Heavy resources (vectorstore, model, chain) are loaded once at startup via the @asynccontextmanager lifespan pattern, not on every request.
Request bodies are declared as Pydantic models. FastAPI validates them automatically and returns HTTP 422 with a clear error if the shape is wrong.
Browsers block cross-origin requests by default. The CORS middleware adds the headers that tell the browser your portfolio domain is allowed to call the API.
Input capped at 500 chars, output at 512 tokens, per-IP rate limit (10/min), daily global cap (200/day), session turn limit (20 turns).
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest, http_request: Request):
    # Rate limit -> daily budget -> session cap -> invoke chain
    if is_daily_budget_exceeded():
        raise HTTPException(429, "Daily limit reached.")
    if is_rate_limited(http_request.client.host):
        raise HTTPException(429, "Too many requests.")
    session_id = request.session_id or str(uuid.uuid4())
    answer = chain.invoke(
        {"input": request.message},
        config={"configurable": {"session_id": session_id}},
    )
    return ChatResponse(answer=answer, session_id=session_id)
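The two guard functions called in the endpoint are not shown. One plausible in-memory version is sketched below, assuming a sliding 60-second window per IP and a counter that resets each UTC day; the constants mirror the limits quoted in the text, while the data structures and the optional `now` parameter (added to make the functions testable) are assumptions.

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 10      # requests per IP per window (10/min in the text)
RATE_WINDOW = 60.0   # window length in seconds
DAILY_BUDGET = 200   # total requests per day across all users

_requests_by_ip = defaultdict(deque)  # ip -> timestamps of recent requests
_daily = {"day": None, "count": 0}    # current UTC day index and its counter


def is_rate_limited(ip, now=None):
    """Record a request from ip; return True if it exceeds the per-IP limit."""
    now = time.time() if now is None else now
    window = _requests_by_ip[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] >= RATE_WINDOW:
        window.popleft()
    if len(window) >= RATE_LIMIT:
        return True
    window.append(now)
    return False


def is_daily_budget_exceeded(now=None):
    """Count one request against today's budget; return True once it is spent."""
    now = time.time() if now is None else now
    day = int(now // 86400)  # UTC day index
    if _daily["day"] != day:
        _daily["day"], _daily["count"] = day, 0
    if _daily["count"] >= DAILY_BUDGET:
        return True
    _daily["count"] += 1
    return False
```

State kept this way lives in a single process and is lost on restart, which is acceptable for a small single-container deployment but would need a shared store if the API ran on several workers. With this sketch, a request that passes the daily check and is then rate limited still consumes one unit of the daily budget; checking the per-IP limit first would avoid that.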
The API is containerised with Docker and deployed as a Hugging Face Space (free tier, always-on). The frontend portfolio is hosted on Firebase Hosting.
A key design decision: the chroma_db binary files are not stored in git. Instead, the knowledge base is exported as knowledge_base.json (plain text, version-control friendly), and the Dockerfile runs build_db.py at image build time to reconstruct ChromaDB from it. Updating the knowledge base is as simple as adding documents, re-running the export script, and pushing.
# Dockerfile: builds ChromaDB at image build time
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY knowledge_base.json build_db.py ./
# Embeds chunks locally and writes chroma_db/
RUN python build_db.py
COPY api.py .
# HF Spaces requires port 7860
EXPOSE 7860
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "7860"]
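The export and reload steps described above might look roughly like the following. The actual schema of knowledge_base.json is not shown in this write-up, so the record layout (`id`, `text`, `metadata`), the function names, and the ID format are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def export_knowledge_base(chunks, path="knowledge_base.json"):
    """Write chunks (dicts with 'text' and 'metadata') as readable JSON."""
    records = [
        {"id": f"chunk-{i:04d}", "text": c["text"], "metadata": c["metadata"]}
        for i, c in enumerate(chunks)
    ]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)


def load_knowledge_base(path="knowledge_base.json"):
    """Read the export back into parallel lists ready for a vector store."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    ids = [r["id"] for r in records]
    texts = [r["text"] for r in records]
    metas = [r["metadata"] for r in records]
    return ids, texts, metas


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "knowledge_base.json"
    export_knowledge_base(
        [{"text": "A story.", "metadata": {"company": "Amazon"}}], target
    )
    ids, texts, metas = load_knowledge_base(target)
print(ids, texts, metas)
# ['chunk-0000'] ['A story.'] [{'company': 'Amazon'}]
```

Indented JSON with position-based IDs keeps diffs small and readable when a story is edited, which is the main reason to version the text export rather than the binary index.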