Everyone's playing with ChatGPT. But what if you could build a GPT that:

  • Knows your own notes, codebase, and habits
  • Writes like you
  • Runs entirely offline
  • Costs zero to query
  • And doesn't leak your data to a third party?

That's exactly what I built.

I created a personal LLM that understands my files, mimics my tone, and runs on a machine with just 8GB of RAM using quantized models, sentence embeddings, and local retrieval.

In this article, I'll walk you through:

  • The local LLM stack (Mistral + Ollama + LangChain)
  • How I chunk, embed, and store documents
  • Fine-tuning to replicate my writing style
  • A local Gradio interface that feels like ChatGPT
  • And how it integrates with code, email, and documentation

1. Choosing the Right Local Model (and Keeping RAM Happy)

I tried several models before landing on this combo:

  • Mistral 7B (quantized) — for balanced speed and coherence
  • Phi-3-mini (3.8B) — great with limited memory
  • LLaMA-3 8B (if you have 16GB RAM) — crazy powerful locally

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a quantized model (Ollama downloads it on first run):

ollama run mistral

Or use a lightweight option:

ollama run phi3

List the models you've downloaded:

ollama list

2. Preparing My Personal Knowledge Base

I had:

  • Markdown notes
  • PDFs from courses
  • Meeting transcripts
  • My own blog posts

I wanted the model to answer questions like:

"What's my process for debugging a failing pipeline?"

To do that, I built a RAG (Retrieval-Augmented Generation) system: retrieve the most relevant chunks of my own documents, then let the model answer from them.

3. Chunking and Embedding Files with LangChain

Install dependencies:

pip install langchain faiss-cpu sentence-transformers PyMuPDF

Chunk and embed:

from langchain.document_loaders import PyMuPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from pathlib import Path

# Load every PDF, Markdown, and plain-text file from my notes folder
docs = []
for file in Path("my_notes").rglob("*"):
    if file.suffix == ".pdf":
        docs += PyMuPDFLoader(str(file)).load()
    elif file.suffix in {".md", ".txt"}:
        docs += TextLoader(str(file)).load()

# Split documents into overlapping chunks so retrieval stays precise
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and store them in a local FAISS index
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = FAISS.from_documents(chunks, embedding)
vectordb.save_local("kb_index")

Now I had a searchable, vectorized version of all my knowledge.
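
As a quick sanity check, before wiring in the LLM, you can query the index directly. The query string here is just an example:

# Pull back the chunks most similar to a question
results = vectordb.similarity_search("debugging a failing pipeline", k=3)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:120])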

4. Wiring Up the Local Chatbot with LangChain + Ollama

from langchain.llms import Ollama
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

llm = Ollama(model="mistral")

# Reload the index with the same embedding model used to build it
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
retriever = FAISS.load_local("kb_index", embedding).as_retriever()

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

def ask(query):
    return qa_chain.run(query)

Test it:

print(ask("What's my approach to writing technical articles?"))

Boom. It answered in my voice, based on my past writings.

5. Fine-Tuning for My Tone and Style

I trained the model to:

  • Use my intro structure
  • Avoid fluff
  • Write like a developer talking to another developer

Steps:

  1. Export my blog articles
  2. Format as Alpaca-style instruction-response pairs
  3. Fine-tune using LoRA (Low-Rank Adaptation)

Use QLoRA with HuggingFace PEFT for lightweight training.

Example instruction format:

{
  "instruction": "Write an intro for a Python article on file parsing.",
  "input": "",
  "output": "If you've ever tried parsing a 400MB log file with regex, you know the pain. Let's fix that."
}

Train on Google Colab; with QLoRA, an 8GB GPU is enough.
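
For reference, here's a minimal sketch of the LoRA/QLoRA setup with PEFT. The base model, target modules, and hyperparameters below are illustrative, not the exact values from my run:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # illustrative base model; swap in whatever you run locally

# QLoRA: load the frozen base weights in 4-bit to keep GPU memory low
model = AutoModelForCausalLM.from_pretrained(base, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable

From there, any standard supervised fine-tuning loop over the Alpaca-formatted pairs (for example, TRL's SFTTrainer) does the rest.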

Result: The model writes intros I don't have to edit.

6. Building a Local Chat Interface with Gradio

pip install gradio

import gradio as gr

def answer_user(message, history):
    # ChatInterface passes the new message plus the conversation history
    return ask(message)

gr.ChatInterface(fn=answer_user, title="My Private GPT").launch()

It looks and behaves like ChatGPT, but it's:

  • Local
  • Fast
  • Private
  • Styled like me

7. Extending with Tools: Search, Calendar, Code Execution

LangChain + tools = AI agent.

from langchain.tools import tool

@tool
def get_calendar_events(query: str) -> str:
    """Return today's calendar events."""
    return "No events today."

@tool
def execute_python_code(code: str) -> str:
    """Execute a Python snippet and return the variables it defines."""
    try:
        exec_globals = {}
        exec(code, exec_globals)
        # Return everything the snippet defined (skip the injected builtins)
        return str({k: v for k, v in exec_globals.items() if k != "__builtins__"})
    except Exception as e:
        return str(e)

Add to agent:

from langchain.agents import initialize_agent

# The @tool decorator already produced Tool objects, so pass them in directly
tools = [get_calendar_events, execute_python_code]

agent = initialize_agent(tools, llm=llm, agent="zero-shot-react-description")

print(agent.run("What's on my calendar and calculate 2**20?"))

8. Offline-Only Mode

Disable all external API calls. After the first download of the embedding model, set the HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables so Hugging Face reads only from the local cache. No Whisper, no internet, no telemetry.
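
Concretely, that looks something like this (assuming the embedding model is already in the local Hugging Face cache from a previous run):

import os

# Force Hugging Face libraries to read only from the local cache: no downloads, no telemetry
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from langchain.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # loaded from disk, no network calls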

This means:

  • You can run it on flights
  • Safe for enterprise/local data
  • Your codebase, documents, and emails stay private

9. Optimizing for Speed: Quantization + Caching

I used:

  • GGUF quantized models from HuggingFace
  • Ollama's built-in memory cache
  • FAISS to retrieve only 3–5 chunks per query

Optional:

  • Enable LangChain's LLM cache so repeated prompts return instantly
  • Use llm.ainvoke() for async calls
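
Here's a minimal sketch of both knobs, reusing the vectordb object from earlier. InMemoryCache is LangChain's simplest cache backend (exact import paths vary a little across LangChain versions):

from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

# Cache LLM outputs in memory so repeated prompts come back instantly
set_llm_cache(InMemoryCache())

# Retrieve only the top 4 chunks per query to keep the prompt small and generation fast
retriever = vectordb.as_retriever(search_kwargs={"k": 4})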

Performance:

  • Mistral answers in seconds locally
  • On 8GB of RAM, Phi-3 runs at around 50% CPU usage

10. Final Thoughts: LLMs Aren't Just for Chat — They're for You

This wasn't just a fun project. It redefined how I work:

  • Every old note became searchable knowledge
  • My GPT wrote emails, code, and articles like me
  • All offline, all private, all mine

I now trust this system more than any cloud assistant.

And the best part? No API costs. No throttling. No data leaks.
