Why this matters
The assistants most people use are a window to someone else’s computer. Every question you type travels to a company, is answered there, and on the free tiers is kept and used to train the next model. A local model turns that around.
A capable AI can run on the laptop in front of you. You install a small program, download a model once, and after that it answers from your own machine. There is no account to make, no subscription to pay, and nothing to send. The chat that on a cloud service would be logged and learned from happens in your own memory and is gone when you close it.
This is the clearest meeting point of the two halves of this site, the new AI tools and the older idea of holding your own things. The same instinct that keeps passwords on a key in your pocket keeps a model on your own disk. What you ask it is yours, and the companion guide on keeping cloud chats off the training set becomes a setting you no longer need, because there is no server to reach.
There is one honest catch, and the next section deals with it first. A model is a large thing to run, so this asks a little more of your hardware than opening a website does.
Install one free program, download a model once, and from then on you have an assistant that runs on your machine and sends nothing anywhere.
Will your laptop run it
This is the part to settle before you start. A model runs in your computer’s memory, so the question is how much memory you have, and that decides how large a model you can hold.
The rule of thumb is simple. A model squeezed down to a compact form needs roughly its own size in memory, a little less. A model of three billion parameters wants about two to three gigabytes free, and one of seven or eight billion about five to six. So a laptop with 8 GB of memory comfortably runs a small model, 16 GB is room for a mid-sized one, and 32 GB or more opens up the larger models.
Two things help. A recent Apple Silicon Mac shares its memory between the processor and the graphics chip, which suits these models well. A separate graphics card with its own memory makes replies much faster. Without one, the model runs on the main processor instead, which works and is slower, so the smaller models are the happier choice there.
You will see models offered in versions labelled with a Q and a number, such as Q4. Quantising squeezes the model’s internal numbers into fewer bits so it fits in less memory, for a small loss of accuracy. A four-bit build is the usual sweet spot, and the tools below pick a reasonable default, so this is good to recognise rather than something to agonise over.
Set it up
There are two good tools, and the right one depends on your comfort with a terminal. Ollama is a few words typed into a black window and is the quickest to script. LM Studio is an ordinary app with a chat window and never needs the terminal. Pick one.
The terminal way: Ollama
Download it for your system
Go to ollama.com and get the version for your system. Windows and Mac have an installer to double-click. On Linux, the site gives a single line to paste into a terminal. Ollama runs on Windows, macOS and Linux.
One command downloads and starts it
Open a terminal and run the line below. The first time, it downloads the model, a couple of gigabytes, then drops you into a chat where you type a question and press Enter.
ollama run llama3.2
That model, Llama 3.2, is a small one that suits most laptops. After the download, the same command starts it instantly.
Leave, list, add and remove
Type `/bye` to leave the chat. The handful of commands below cover the rest. Browse the model library for others, and match the size to your memory using the previous section.
ollama list # models you have ollama pull mistral # download another ollama rm llama3.2 # delete one
The no-terminal way: LM Studio
Download the app
Get LM Studio for your system and install it like any other application. It is free for home and work use, and runs on Windows, Mac and Linux.
Search and download from inside the app
Open the search tab, type a model name such as Llama 3.2 or Qwen 3, and pick a version sized for your machine. The app marks which ones will fit your memory. Click to download.
Load it and start typing
Open the chat tab, load the model you downloaded, and type. Everything runs on your machine, and the model answers in the window like any chat app.
Let other apps use your model
LM Studio can run a local server so other programs can use the model, in the same shape the big providers use. Turn it on only if you want a separate tool to talk to your local model. Most people never need this.
On a laptop with 8 GB of memory, start with a three-billion model such as Llama 3.2. With 16 GB, step up to an eight-billion model such as Llama 3.1, Qwen 3 or Mistral, which answer noticeably better. Try one, and adjust by how quickly it replies.
Prove it is private
The whole promise is that nothing leaves your machine. You can check it yourself in a few seconds, rather than take it on trust.
Once the model is downloaded, turn off your wifi, or pull out the network cable, and ask the model a question. It answers exactly as before. With no connection, there is nowhere for your words to go, so the reply is proof that the work is happening on your own computer and nowhere else.
Downloading a model is the only step that needs a connection. After that first pull, the model lives on your disk and runs offline. You can even copy the files to a machine that has never been online and run it there.
What it is good for
A laptop-sized model is a capable everyday tool, not a match for the largest cloud systems. Knowing where it shines keeps the experience a good one.
It is well suited to drafting and rewriting text, summarising a document you paste in, brainstorming, explaining ideas, and helping with code. Because it runs locally, it is the natural choice for anything private or sensitive, a medical question, a draft of something personal, work you are not allowed to send to a third party, and for working with no connection at all.
The limits are worth stating plainly. A model small enough for a laptop is less capable than the biggest cloud models, so expect good help rather than the last word. It has no live access to the web unless you add a tool for that, so it does not know today’s news. Very long documents strain memory. The honest summary is that you are buying privacy and independence, and paying for them with some raw capability. For a great many tasks, that is a fair trade.
If something breaks
| Symptom | What to try |
|---|---|
| replies are slow | Use a smaller model, close other heavy apps, and prefer a Q4 build. On a machine with no separate graphics card, a three-billion model is the comfortable floor. |
| it runs out of memory or crashes | The model is too big for your memory. Drop to a smaller size, or a more compressed build with a lower Q number. |
| the answers feel weak | Step up to a larger model if your memory allows, or choose one tuned for your task, such as a coding model. |
| it will not use my graphics card | Look in the tool’s settings for a GPU option. Ollama and LM Studio detect most cards, but some need it switched on by hand. |
| I am low on disk space | Models are large files. Remove ones you do not use with `ollama rm` or LM Studio’s model manager. |
| will it work with no internet | Yes, once the model is downloaded. Only that first download needs a connection. |
Quick reference
| Want | Do |
|---|---|
| The simple terminal way | Install Ollama, then run ollama run llama3.2. |
| The no-terminal way | Install LM Studio, search a model, download, chat. |
| A model for 8 GB of memory | A three-billion model such as Llama 3.2. |
| A model for 16 GB of memory | An eight-billion model such as Llama 3.1, Qwen 3 or Mistral. |
| Prove it is private | Turn the wifi off and ask it something. |
| Use it from other apps | Turn on LM Studio’s local server. |
| Free up space | Remove a model with ollama rm or in LM Studio. |
Common questions
The questions people ask before they download their first model.
Is it really free?
Yes. Ollama and LM Studio are free, and the models are open weights you download once. There is no account and no subscription. The only cost is the disk space the models take and a machine capable enough to run them.
How good is it compared to ChatGPT or Claude?
A model sized for a laptop is not as sharp as the largest cloud models, and it is fair to expect that. A current model of seven or eight billion parameters is still useful for drafting, summarising and everyday questions. You trade some capability for privacy and for the ability to work offline.
What hardware do I need?
Eight gigabytes of memory runs a small model, sixteen is comfortable for a mid-sized one, and more lets you run larger models. Apple Silicon Macs handle this well because they share memory between processor and graphics. A dedicated graphics card makes it faster but is not required.
Does it work offline?
Yes, after the one-time download of the model. With the model on disk it answers with the wifi switched off, which is the clearest proof that nothing is being sent anywhere.
Does it train on my chats or send them anywhere?
No. The model does not learn from your conversations, and nothing leaves the machine. This is the opposite of the cloud assistants and their training settings, and it needs no settings to switch off because there is no server to send to.
What does quantised mean, and which should I pick?
It means the model's numbers are squeezed into fewer bits so the model fits in less memory, for a small loss of accuracy. A four-bit build, often labelled Q4, is the usual sweet spot of size against quality. The tools pick a sensible default, so you rarely choose by hand.
Which model should I start with?
On a modest laptop, a three-billion model such as Llama 3.2. With sixteen gigabytes of memory, an eight-billion model such as Llama 3.1, Qwen 3 or Mistral. Try one, see how it runs, and step up or down from there.
Can other apps use it?
Yes. Both LM Studio and Ollama can expose the model to other programs through a local server that speaks the same shape as the big providers' interfaces. A tool that expects a cloud model can often be pointed at yours instead, with nothing leaving your machine.