Everyone's playing with ChatGPT. But what if you could build a GPT that:
- Knows your own notes, codebase, and habits
- Writes like you
- Runs entirely offline
- Costs zero to query
- And doesn't leak your data to a third party?
That's exactly what I built.
I created a personal LLM that understands my files, mimics my tone, and runs on a machine with just 8GB of RAM using quantized models, sentence embeddings, and local retrieval.
In this article, I'll walk you through:
- The local LLM stack (Mistral + Ollama + LangChain)
- How I chunk, embed, and store documents
- Fine-tuning to replicate my writing style
- A local Gradio interface that feels like ChatGPT
- And how it integrates with code, email, and documentation
1. Choosing the Right Local Model (and Keeping RAM Happy)
I tried several models before landing on this combo:
- Mistral 7B (quantized) — for balanced speed and coherence
- Phi-3-mini (1.8B) — great with limited memory
- LLaMA-3 8B (if you have 16GB RAM) — crazy powerful locally
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh

Pull and run a quantized model:

ollama run mistral

Or use a lightweight option:

ollama run phi3

Check running models:
ollama list
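Once a model is pulled, Ollama also serves a local REST API on port 11434, which is handy for scripting outside the CLI. A quick sanity check (the prompt is just an example):

import requests

# Ollama's local HTTP API; nothing leaves the machine
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "mistral",
    "prompt": "Summarize what a vector database does in one sentence.",
    "stream": False,
})
print(resp.json()["response"])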
2. Preparing My Personal Knowledge Base
I had:
- Markdown notes
- PDFs from courses
- Meeting transcripts
- My own blog posts
I wanted the model to answer questions like:
"What's my process for debugging a failing pipeline?"
To do that, I created a RAG system: Retrieval Augmented Generation.
3. Chunking and Embedding Files with LangChain
Install dependencies:
pip install langchain faiss-cpu sentence-transformers PyMuPDF

Chunk and embed:
from langchain.document_loaders import PyMuPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from pathlib import Path

docs = []
for file in Path("my_notes").rglob("*"):
    if file.suffix == ".pdf":
        docs += PyMuPDFLoader(str(file)).load()
    elif file.is_file():
        docs += TextLoader(str(file)).load()  # markdown notes, transcripts, blog posts

# Split long documents into overlapping chunks so retrieval returns focused passages
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = FAISS.from_documents(chunks, embedding)
vectordb.save_local("kb_index")

Now I had a searchable, vectorized version of all my knowledge.
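Before wiring the index into a chain, it's worth sanity-checking retrieval directly; the query below is just an example:

# Pull the top 3 chunks for a sample question straight from FAISS
results = vectordb.similarity_search("What's my process for debugging a failing pipeline?", k=3)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])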
4. Wiring Up the Local Chatbot with LangChain + Ollama
from langchain.llms import Ollama
from langchain.chains import RetrievalQA

llm = Ollama(model="mistral")
# Reuse the same embedding model that built the index
retriever = FAISS.load_local("kb_index", embedding).as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

def ask(query):
    return qa_chain.run(query)

Test it:

print(ask("What's my approach to writing technical articles?"))

Boom. It answered in my voice, based on my past writings.
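If the default answers feel too generic, RetrievalQA also accepts a custom prompt through chain_type_kwargs. A sketch; the template wording is mine, only the {context} and {question} placeholders are required by the "stuff" chain:

from langchain.prompts import PromptTemplate

# Hypothetical prompt wording; steer the answers toward your own tone
template = """Answer using my notes below. Be direct, skip fluff, write like a developer.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
)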
5. Fine-Tuning for My Tone and Style
I trained the model to:
- Use my intro structure
- Avoid fluff
- Write like a developer talking to another developer
Steps:
- Export my blog articles
- Format as Alpaca-style instruction-response pairs
- Fine-tune using LoRA (Low-Rank Adaptation)
Use QLoRA with HuggingFace PEFT for lightweight training.
Example instruction format:
{
  "instruction": "Write an intro for a Python article on file parsing.",
  "input": "",
  "output": "If you've ever tried parsing a 400MB log file with regex, you know the pain. Let's fix that."
}

Train on Google Colab with an 8GB GPU using LoRA.
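Here's a minimal sketch of that QLoRA setup with Transformers and PEFT; the base model name and LoRA hyperparameters are illustrative assumptions, not the exact values I shipped:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumed base model; pick whatever matches your local setup

# Load the base model in 4-bit so it fits a small GPU (the "Q" in QLoRA)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Train only small low-rank adapter matrices instead of all 7B weights
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the full model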
Result: The model writes intros I don't have to edit.
6. Building a Local Chat Interface with Gradio
pip install gradio
import gradio as gr

# ChatInterface calls fn with the new message and the chat history
def answer_user(message, history):
    return ask(message)

gr.ChatInterface(fn=answer_user, title="My Private GPT").launch()

It looks and behaves like ChatGPT, but it's:
- Local
- Fast
- Private
- Styled like me
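Since privacy is the point, keep Gradio bound to localhost and never pass share=True, which would tunnel traffic through Gradio's servers. A small tweak to the launch call:

# Serve only on the loopback interface so nothing is reachable from outside the machine
gr.ChatInterface(fn=answer_user, title="My Private GPT").launch(
    server_name="127.0.0.1", server_port=7860
)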
7. Extending with Tools: Search, Calendar, Code Execution
LangChain + tools = AI agent.
from langchain.tools import tool

@tool
def get_calendar_events(query: str) -> str:
    """Check today's calendar events."""
    # The agent always passes a tool input string; it's ignored here
    return "No events today."

@tool
def execute_python_code(code: str) -> str:
    """Execute a Python snippet and return the resulting globals."""
    try:
        exec_globals = {}
        exec(code, exec_globals)  # runs arbitrary code; acceptable only because everything stays local
        return str(exec_globals)
    except Exception as e:
        return str(e)

Add to agent:
from langchain.agents import initialize_agent, AgentType

# @tool already produced Tool objects, so they can be passed straight to the agent
tools = [get_calendar_events, execute_python_code]
agent = initialize_agent(tools, llm=llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

print(agent.run("What's on my calendar and calculate 2**20?"))

8. Offline-Only Mode
Disable all external API calls.
Set the HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 environment variables so the Hugging Face embeddings load only from the local cache.
No Whisper, no internet, no telemetry.
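A minimal sketch of that offline setup; the environment variables have to be set before any Hugging Face model is loaded:

import os

# Force Hugging Face libraries to read only from the local cache (no downloads, no calls home)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from langchain.embeddings import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # must already be cached locally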
This means:
- You can run it on flights
- Safe for enterprise/local data
- Your codebase, documents, and emails stay private
9. Optimizing for Speed: Quantization + Caching
I used:
- GGUF quantized models from HuggingFace
- Ollama's built-in memory cache
- FAISS to retrieve only 3–5 chunks per query
Optional:
- Add cache=True on the LLM for repeated prompts (with a LangChain LLM cache configured)
- Use llm.ainvoke() for async calls
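A sketch of both tweaks, reusing the llm and vectordb objects from earlier (the cache import has moved between LangChain versions, so adjust to yours):

import langchain
from langchain.cache import InMemoryCache

# Cache identical prompts in memory so repeated questions skip the model entirely
langchain.llm_cache = InMemoryCache()

# Retrieve only the top 3 chunks per query to keep prompts short and answers fast
retriever = vectordb.as_retriever(search_kwargs={"k": 3})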
Performance:
- Mistral answers in seconds locally
- On 8GB RAM, Phi-3 runs with 50% CPU usage
10. Final Thoughts: LLMs Aren't Just for Chat — They're for You
This wasn't just a fun project; it redefined how I work:
- Every old note became searchable knowledge
- My GPT wrote emails, code, and articles like me
- All offline, all private, all mine
I now trust this system more than any cloud assistant.
And the best part? No API costs. No throttling. No data leaks.