Running LLMs Decentralized: Browser Inference and Internet Computer Hosting
Build privacy-first AI applications with client-side LLMs using Transformers.js, Ollama.js for local models, and decentralized hosting on the Internet Computer

Modern web applications can now leverage powerful AI capabilities directly in the browser or through decentralized hosting. This guide explores three approaches: Ollama.js for connecting to local models, Transformers.js for pure client-side inference, and Internet Computer canisters for decentralized AI hosting.
New to Ollama? Start with Running LLMs Locally with Ollama - Privacy-First AI Development to learn the CLI basics, model management, and local development workflows before diving into web integration.
Why Run LLMs in the Browser?
Benefits
- Privacy: User data never leaves their device
- Zero Backend Costs: No API fees or server infrastructure
- Offline Capability: Works without internet connection
- Low Latency: No network round-trips
- Scalability: Computation distributed across users
Use Cases
- Private chat interfaces
- Content generation tools
- Code assistants
- Text analysis and summarization
- Image captioning and understanding
- Translation services
- Sentiment analysis
Approach 1: Ollama.js (Local Model Access)
Ollama.js lets your web app connect to a locally running Ollama server, combining browser convenience with powerful local models.
Installation
npm install ollama
Basic Setup
import { Ollama } from 'ollama'
// Connect to local Ollama server
const ollama = new Ollama({ host: 'http://localhost:11434' })
// Generate text
async function generate(prompt) {
const response = await ollama.generate({
model: 'llama3.2',
prompt: prompt,
})
return response.response
}
// Usage
const result = await generate('Explain async/await in JavaScript')
console.log(result)
Streaming Responses
For real-time output:
async function streamGenerate(prompt, onChunk) {
const response = await ollama.generate({
model: 'llama3.2',
prompt: prompt,
stream: true,
})
for await (const part of response) {
onChunk(part.response)
}
}
// Usage in React
function ChatComponent() {
const [output, setOutput] = useState('')
const handleSubmit = async (prompt) => {
setOutput('')
await streamGenerate(prompt, (chunk) => {
setOutput(prev => prev + chunk)
})
}
return (
<div>
<div>{output}</div>
<button onClick={() => handleSubmit('Write a poem')}>
Generate
</button>
</div>
)
}
Chat Interface
Build conversational interfaces:
async function chat(messages) {
const response = await ollama.chat({
model: 'llama3.2',
messages: messages,
})
return response.message
}
// Usage
const conversation = [
{ role: 'user', content: 'What is recursion?' },
{ role: 'assistant', content: 'Recursion is when a function calls itself...' },
{ role: 'user', content: 'Give me an example in Python' }
]
const reply = await chat(conversation)
console.log(reply.content)
Vision Models
Process images with multi-modal models:
// Passing a file path in `images` is resolved by the Node.js client; in the browser, pass base64 (see below)
async function analyzeImage(imagePath, prompt) {
const response = await ollama.generate({
model: 'llava',
prompt: prompt,
images: [imagePath]
})
return response.response
}
// With base64 encoded images
async function analyzeBase64Image(base64Image, prompt) {
const response = await ollama.generate({
model: 'llava',
prompt: prompt,
images: [base64Image]
})
return response.response
}
// Usage in web app
const fileInput = document.getElementById('imageInput')
fileInput.addEventListener('change', async (e) => {
const file = e.target.files[0]
const reader = new FileReader()
reader.onload = async (event) => {
const base64 = event.target.result.split(',')[1]
const description = await analyzeBase64Image(
base64,
'Describe this image in detail'
)
console.log(description)
}
reader.readAsDataURL(file)
})
React Example - Complete Chat App
import { useState } from 'react'
import { Ollama } from 'ollama'
function ChatApp() {
const [messages, setMessages] = useState([])
const [input, setInput] = useState('')
const [loading, setLoading] = useState(false)
const ollama = new Ollama({ host: 'http://localhost:11434' })
const sendMessage = async () => {
if (!input.trim()) return
const userMessage = { role: 'user', content: input }
setMessages(prev => [...prev, userMessage])
setInput('')
setLoading(true)
try {
const response = await ollama.chat({
model: 'llama3.2',
messages: [...messages, userMessage],
stream: true,
})
let assistantMessage = { role: 'assistant', content: '' }
setMessages(prev => [...prev, assistantMessage])
for await (const part of response) {
assistantMessage.content += part.message.content
setMessages(prev => {
const updated = [...prev]
updated[updated.length - 1] = { ...assistantMessage }
return updated
})
}
} catch (error) {
console.error('Error:', error)
} finally {
setLoading(false)
}
}
return (
<div className="chat-container">
<div className="messages">
{messages.map((msg, idx) => (
<div key={idx} className={`message ${msg.role}`}>
<strong>{msg.role}:</strong> {msg.content}
</div>
))}
</div>
<div className="input-area">
<input
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyPress={(e) => e.key === 'Enter' && sendMessage()}
placeholder="Type a message..."
disabled={loading}
/>
<button onClick={sendMessage} disabled={loading}>
{loading ? 'Sending...' : 'Send'}
</button>
</div>
</div>
)
}
export default ChatApp
CORS Configuration
Enable CORS for browser access:
# Set Ollama origins environment variable
export OLLAMA_ORIGINS="http://localhost:3000,http://localhost:5173"
# Then start Ollama
ollama serve
Or make it persistent by setting OLLAMA_ORIGINS in the environment the Ollama service reads at startup:
# macOS (Ollama menu-bar app): set a login-wide variable, then restart Ollama
launchctl setenv OLLAMA_ORIGINS "http://localhost:3000,http://localhost:5173"
# Linux (systemd service): add an override via `systemctl edit ollama.service`
# Environment="OLLAMA_ORIGINS=http://localhost:3000,http://localhost:5173"
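To confirm the browser can actually reach the server with CORS enabled, a quick check against Ollama's /api/tags endpoint (which lists locally installed models) can be run from the page or the devtools console:
// Sanity check: if CORS is configured correctly, this returns the installed models
const res = await fetch('http://localhost:11434/api/tags')
const { models } = await res.json()
console.log(models.map((m) => m.name))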
Approach 2: Transformers.js (In-Browser Inference)
Transformers.js runs models entirely in the browser using WebAssembly (with experimental WebGPU support in newer releases), with no backend required.
Installation
npm install @xenova/transformers
Basic Text Generation
import { pipeline } from '@xenova/transformers'
// Create a text generation pipeline
const generator = await pipeline('text-generation', 'Xenova/gpt2')
// Generate text
const output = await generator('Once upon a time', {
max_new_tokens: 50,
temperature: 0.7,
})
console.log(output[0].generated_text)
Sentiment Analysis
import { pipeline } from '@xenova/transformers'
// Create sentiment analyzer
const classifier = await pipeline(
'sentiment-analysis',
'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
)
// Analyze text
const result = await classifier('I love this product!')
console.log(result)
// [{ label: 'POSITIVE', score: 0.9998 }]
Text Summarization
import { pipeline } from '@xenova/transformers'
const summarizer = await pipeline('summarization', 'Xenova/distilbart-cnn-6-6')
const text = `
The Transformer architecture was introduced in the paper "Attention is All You Need"
in 2017. It revolutionized natural language processing by replacing recurrent neural
networks with self-attention mechanisms, enabling parallel processing of sequences.
`
const summary = await summarizer(text, {
max_length: 50,
min_length: 10,
})
console.log(summary[0].summary_text)
Translation
import { pipeline } from '@xenova/transformers'
// English to German
const translator = await pipeline(
'translation',
'Xenova/nllb-200-distilled-600M'
)
const result = await translator('Hello, how are you?', {
src_lang: 'eng_Latn',
tgt_lang: 'deu_Latn',
})
console.log(result[0].translation_text)
// "Hallo, wie geht es dir?"
Question Answering
import { pipeline } from '@xenova/transformers'
const qa = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad')
const context = `
The Eiffel Tower is located in Paris, France. It was completed in 1889
and stands 330 meters tall. It was designed by Gustave Eiffel.
`
const answer = await qa({
question: 'How tall is the Eiffel Tower?',
context: context,
})
console.log(answer.answer) // "330 meters"
Image Classification
import { pipeline } from '@xenova/transformers'
// Load image classifier
const classifier = await pipeline(
'image-classification',
'Xenova/vit-base-patch16-224'
)
// From URL
const result = await classifier('https://example.com/cat.jpg')
console.log(result)
// [{ label: 'tabby cat', score: 0.95 }, ...]
// From a file input (pass an object URL, which the pipeline accepts like any other URL)
const fileInput = document.getElementById('imageInput')
fileInput.addEventListener('change', async (e) => {
  const file = e.target.files[0]
  const url = URL.createObjectURL(file)
  const result = await classifier(url)
  URL.revokeObjectURL(url)
  console.log(result)
})
Feature Extraction / Embeddings
import { pipeline } from '@xenova/transformers'
// Create embeddings for semantic search
const extractor = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2'
)
const embeddings = await extractor('This is a sample sentence', {
pooling: 'mean',
normalize: true,
})
console.log(embeddings.data) // Float32Array of embeddings
React Example - Text Summarizer
import { useState, useEffect } from 'react'
import { pipeline } from '@xenova/transformers'
function TextSummarizer() {
const [summarizer, setSummarizer] = useState(null)
const [input, setInput] = useState('')
const [summary, setSummary] = useState('')
const [loading, setLoading] = useState(false)
const [modelLoading, setModelLoading] = useState(true)
useEffect(() => {
async function loadModel() {
const model = await pipeline('summarization', 'Xenova/distilbart-cnn-6-6')
setSummarizer(model)
setModelLoading(false)
}
loadModel()
}, [])
const handleSummarize = async () => {
if (!input.trim() || !summarizer) return
setLoading(true)
try {
const result = await summarizer(input, {
max_length: 100,
min_length: 30,
})
setSummary(result[0].summary_text)
} catch (error) {
console.error('Summarization error:', error)
} finally {
setLoading(false)
}
}
if (modelLoading) {
return <div>Loading model...</div>
}
return (
<div className="summarizer">
<h2>Text Summarizer</h2>
<textarea
value={input}
onChange={(e) => setInput(e.target.value)}
placeholder="Paste text to summarize..."
rows={10}
/>
<button onClick={handleSummarize} disabled={loading}>
{loading ? 'Summarizing...' : 'Summarize'}
</button>
{summary && (
<div className="summary">
<h3>Summary:</h3>
<p>{summary}</p>
</div>
)}
</div>
)
}
export default TextSummarizer
Available Models
Popular models on Hugging Face compatible with Transformers.js:
Text Generation:
- Xenova/gpt2
- Xenova/distilgpt2
- Xenova/LaMini-Flan-T5-783M
Classification:
- Xenova/distilbert-base-uncased-finetuned-sst-2-english (sentiment)
- Xenova/toxic-bert (toxicity detection)
Summarization:
- Xenova/distilbart-cnn-6-6
- Xenova/distilbart-cnn-12-6
Translation:
- Xenova/nllb-200-distilled-600M (200 languages)
Embeddings:
- Xenova/all-MiniLM-L6-v2
- Xenova/all-mpnet-base-v2
Vision:
- Xenova/vit-base-patch16-224 (image classification)
- Xenova/clip-vit-base-patch32 (image-text matching)
Browse all models: Hugging Face Transformers.js
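Any entry above can be loaded by task name and model id. As a minimal sketch, here is the toxicity model from the list (assuming the standard text-classification task for it):
import { pipeline } from '@xenova/transformers'

// Sketch: toxicity detection with the model listed above
const toxicity = await pipeline('text-classification', 'Xenova/toxic-bert')
const result = await toxicity('You are wonderful')
console.log(result) // [{ label, score }, ...]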
Performance Optimization
Transformers.js Optimization
// Tune the ONNX Runtime WASM backend used for in-browser inference
import { env } from '@xenova/transformers'
env.backends.onnx.wasm.numThreads = 4 // Use multiple threads
env.backends.onnx.wasm.proxy = false // Run inference on the main thread instead of a web worker
// In the browser, downloaded models are cached via the Cache API (env.useBrowserCache, on by default);
// env.cacheDir only applies when running under Node.js
env.cacheDir = './.cache'
Lazy Loading
// Load model only when needed
let model = null
async function getModel() {
if (!model) {
model = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english')
}
return model
}
// Usage
const classifier = await getModel()
const result = await classifier('Great product!')
Progress Tracking
import { pipeline } from '@xenova/transformers'
// Track download progress via the pipeline's progress_callback option
const model = await pipeline('text-generation', 'Xenova/gpt2', {
  progress_callback: (progress) => {
    if (progress.status === 'progress') {
      console.log(`Loading: ${progress.file} - ${Math.round(progress.progress)}%`)
    }
  },
})
Comparison: Browser Inference vs Local Models vs Decentralized Hosting
| Feature | Transformers.js | Ollama.js | Internet Computer |
|---|---|---|---|
| Setup | Pure browser, no backend | Requires Ollama server | Deploy to ICP canisters |
| Model Size | Small-Medium (10MB-500MB) | Large (1GB-7GB+) | Flexible (canister limits) |
| Performance | Moderate (WASM/WebGPU) | Very fast (native) | Distributed (network dependent) |
| Privacy | Fully client-side | Local but requires server | Decentralized (blockchain) |
| Offline | Yes (after first load) | Yes (server must run) | No (requires internet) |
| Censorship Resistance | High | Medium | Very High |
| Cost | Free (client resources) | Free (local hardware) | ICP cycles |
| Best For | Specific tasks, embeddings | Powerful chat/completion | Censorship-resistant apps |
Real-World Examples
Sentiment-Aware Journal Entries
// Using Transformers.js for sentiment analysis
import { useState, useEffect } from 'react'
import { pipeline } from '@xenova/transformers'
const sentimentAnalyzer = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english')
function JournalEntry({ text }) {
const [sentiment, setSentiment] = useState(null)
useEffect(() => {
async function analyze() {
const result = await sentimentAnalyzer(text)
setSentiment(result[0])
}
if (text) analyze()
}, [text])
return (
<div>
<p>{text}</p>
{sentiment && (
<span className={sentiment.label}>
Mood: {sentiment.label} ({(sentiment.score * 100).toFixed(1)}%)
</span>
)}
</div>
)
}
Code Documentation Generator
// Using Ollama.js for code analysis
import { Ollama } from 'ollama'
async function generateDocs(code) {
const ollama = new Ollama({ host: 'http://localhost:11434' })
const response = await ollama.generate({
model: 'codellama',
prompt: `Generate JSDoc documentation for this code:\n\n${code}`,
stream: false,
})
return response.response
}
// Usage in code editor
const code = `
function fibonacci(n) {
if (n <= 1) return n
return fibonacci(n - 1) + fibonacci(n - 2)
}
`
const docs = await generateDocs(code)
console.log(docs)
Smart Search with Embeddings
// Using Transformers.js for semantic search
import { pipeline } from '@xenova/transformers'
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2')
// Generate embeddings for documents
const documents = [
'The cat sat on the mat',
'Dogs are great pets',
'Machine learning is fascinating',
]
const docEmbeddings = await Promise.all(
documents.map(doc => extractor(doc, { pooling: 'mean', normalize: true }))
)
// Search function
async function search(query) {
const queryEmbedding = await extractor(query, { pooling: 'mean', normalize: true })
// Cosine similarity
const scores = docEmbeddings.map((docEmb, idx) => {
const similarity = cosineSimilarity(queryEmbedding.data, docEmb.data)
return { document: documents[idx], score: similarity }
})
return scores.sort((a, b) => b.score - a.score)
}
function cosineSimilarity(a, b) {
let dotProduct = 0
let normA = 0
let normB = 0
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i]
normA += a[i] * a[i]
normB += b[i] * b[i]
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
// Search usage
const results = await search('feline animals')
console.log(results)
// [{ document: 'The cat sat on the mat', score: 0.82 }, ...]
Deployment Considerations
Browser Compatibility
- Ollama.js: Any modern browser with fetch API
- Transformers.js: Chrome 90+, Firefox 88+, Safari 15+
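Before loading a model, it can also help to feature-detect what the current browser supports. A minimal sketch using plain platform checks (not an API of either library):
// Capability check before downloading any models
function detectCapabilities() {
  return {
    wasm: typeof WebAssembly === 'object', // required by the default WASM backend
    webgpu: typeof navigator !== 'undefined' && 'gpu' in navigator, // optional acceleration
  }
}

const caps = detectCapabilities()
if (!caps.wasm) {
  console.warn('WebAssembly is unavailable; in-browser inference will not run')
}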
Bundle Size
// Lazy load to reduce initial bundle
const loadTransformers = () => import('@xenova/transformers')
// Use code splitting
const loadOllama = () => import('ollama')
Caching Strategies
// Service Worker for model caching
self.addEventListener('fetch', (event) => {
if (event.request.url.includes('huggingface.co')) {
event.respondWith(
caches.match(event.request).then((response) => {
return response || fetch(event.request).then((response) => {
const clone = response.clone()
caches.open('models').then((cache) => {
cache.put(event.request, clone)
})
return response
})
})
)
}
})
Resources
Ollama.js
Transformers.js
Approach 3: Internet Computer (Blockchain-Based Hosting)
The Internet Computer Protocol (ICP) enables blockchain-based AI applications through canister smart contracts. While the AI inference is currently centralized (run by DFINITY), your application logic executes in a tamper-proof, decentralized environment with transparent, auditable code execution.
What Makes ICP AI Different
Trustworthy Execution:
- Computation validated across 130+ independent data centers globally
- No single point of failure or centralized control
- AI agents execute only what their code allows with transparent, auditable logic
- Fully open-source infrastructure
Unique Capabilities:
- Financial Operations: Securely manage and transact digital assets within AI agents
- Massive Storage: Support for 500GB+ data per canister, ideal for RAG (retrieval-augmented generation) systems
- Multi-language Support: Build with Motoko, Rust, TypeScript, Python, or C++
- DAO Governance: AI agents can be tokenized and governed by decentralized communities
LLM Canister - Deploy AI Agents in Minutes
The official LLM Canister provides simple APIs for integrating large language models into your canisters. Currently supports Llama 3.1 8B and is free during the MVP phase.
Rust Example:
use ic_llm::{Model, Message};
// Simple prompt
let response = ic_llm::prompt(
Model::Llama3_1_8B,
"What's the speed of light?"
).await;
// Chat conversation
let messages = vec![
Message::system("You are a helpful assistant"),
Message::user("Explain blockchain in simple terms"),
];
let chat_response = ic_llm::chat(Model::Llama3_1_8B, messages).await;
Motoko Example:
import LLM "mo:llm";
// Simple prompt
let response = await LLM.prompt(
#Llama3_1_8B,
"What's the speed of light?"
);
// Chat conversation
let messages = [
{ role = #system_; content = "You are a helpful assistant" },
{ role = #user; content = "Explain blockchain in simple terms" }
];
let chatResponse = await LLM.chat(#Llama3_1_8B, messages);
TypeScript (via Azle 0.27.0+):
import { llm } from 'azle';
// Simple prompt
const response = await llm.prompt(
'Llama3_1_8B',
'What is the speed of light?'
);
// Multi-message chat
const messages = [
{ role: 'system', content: 'You are a helpful assistant' },
{ role: 'user', content: 'Explain blockchain simply' }
];
const chatResponse = await llm.chat('Llama3_1_8B', messages);
How LLM Canister Works
The system uses dedicated "AI workers" - stateless nodes that process LLM requests:
- Your canister sends a prompt to the LLM canister
- The request is queued and routed to available AI workers
- Workers execute the inference and return results
- Your canister receives the generated response
Current Limitations:
- Maximum 10 messages per chat request
- 10KiB prompt size limit
- 200-token output limit
- DFINITY controls the LLM canister and workers (decentralization planned)
Privacy: Prompts are not logged. DFINITY only tracks aggregate usage metrics.
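On the client side, one way to stay within the limits above is to trim the conversation history before each request. A rough sketch (the constants mirror the stated 10-message and 10KiB limits and are not part of any SDK):
// Rough sketch: keep a chat request within the MVP limits described above
const MAX_MESSAGES = 10
const MAX_PROMPT_BYTES = 10 * 1024

function fitToLimits(messages) {
  // Keep only the most recent messages
  let trimmed = messages.slice(-MAX_MESSAGES)
  // Drop the oldest remaining messages until the serialized payload fits the budget
  while (
    trimmed.length > 1 &&
    new TextEncoder().encode(JSON.stringify(trimmed)).length > MAX_PROMPT_BYTES
  ) {
    trimmed = trimmed.slice(1)
  }
  return trimmed
}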
Building a Full LLM Chatbot on ICP
Deploy a complete chatbot with React frontend and Rust backend canister.
Backend Canister (Rust):
use ic_cdk_macros::{query, update};
use ic_llm::{Model, Message};
#[update]
async fn chat(user_message: String, history: Vec<Message>) -> String {
let mut messages = history;
messages.push(Message::user(&user_message));
match ic_llm::chat(Model::Llama3_1_8B, messages).await {
Ok(response) => response,
Err(e) => format!("Error: {:?}", e)
}
}
#[query]
fn get_model_info() -> String {
"Using Llama 3.1 8B on Internet Computer".to_string()
}
Frontend Integration:
import { useState } from 'react'
import { Actor, HttpAgent } from '@dfinity/agent'
// idlFactory comes from the Candid declarations that dfx generates for your backend canister
// Connect to your deployed canister
const agent = new HttpAgent({ host: 'https://ic0.app' })
const canisterId = 'your-canister-id'
const actor = Actor.createActor(idlFactory, {
agent,
canisterId
})
// Chat with the AI
async function sendMessage(userMessage, conversationHistory) {
try {
const response = await actor.chat(userMessage, conversationHistory)
return response
} catch (error) {
console.error('Chat error:', error)
throw error
}
}
// React component
function ICPChatbot() {
const [messages, setMessages] = useState([])
const [input, setInput] = useState('')
const [loading, setLoading] = useState(false)
const handleSend = async () => {
if (!input.trim()) return
const userMsg = { role: 'user', content: input }
const newMessages = [...messages, userMsg]
setMessages(newMessages)
setInput('')
setLoading(true)
try {
const response = await sendMessage(input, messages)
setMessages([...newMessages, { role: 'assistant', content: response }])
} catch (error) {
console.error('Error:', error)
} finally {
setLoading(false)
}
}
return (
<div className="chatbot">
<div className="messages">
{messages.map((msg, idx) => (
<div key={idx} className={`message ${msg.role}`}>
<strong>{msg.role}:</strong> {msg.content}
</div>
))}
</div>
<div className="input-area">
<input
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyPress={(e) => e.key === 'Enter' && handleSend()}
disabled={loading}
/>
<button onClick={handleSend} disabled={loading}>
{loading ? 'Sending...' : 'Send'}
</button>
</div>
</div>
)
}
Retrieval-Augmented Generation (RAG) on ICP
Build RAG systems with embeddings stored directly in canisters for semantic search.
Motoko Canister for Embeddings:
import Array "mo:base/Array";
import Float "mo:base/Float";
import Time "mo:base/Time";
actor EmbeddingStore {
type Embedding = {
text: Text;
vector: [Float];
timestamp: Int;
};
stable var embeddings : [Embedding] = [];
stable var secretKey : Text = "your-secret-key";
// Store embedding
public shared func storeEmbedding(
key: Text,
text: Text,
vector: [Float]
) : async Bool {
if (key != secretKey) return false;
let newEmbedding : Embedding = {
text = text;
vector = vector;
timestamp = Time.now();
};
embeddings := Array.append(embeddings, [newEmbedding]);
true
};
// Retrieve all embeddings
public query func getEmbeddings(key: Text) : async ?[Embedding] {
if (key != secretKey) return null;
?embeddings
};
// Cosine similarity search
public query func search(
key: Text,
queryVector: [Float],
topK: Nat
) : async ?[Embedding] {
if (key != secretKey) return null;
// Calculate similarities and return top K results
// Implementation details omitted for brevity
?embeddings
};
}
Node.js Integration Layer:
import express from 'express';
import { HttpAgent, Actor } from '@dfinity/agent';
import { idlFactory } from './embedding-canister.did.js';
const app = express();
app.use(express.json());
const agent = new HttpAgent({ host: 'https://ic0.app' });
const actor = Actor.createActor(idlFactory, {
agent,
canisterId: process.env.CANISTER_ID
});
// Store embedding endpoint
app.post('/embeddings', async (req, res) => {
const { text, embedding } = req.body;
const result = await actor.storeEmbedding(
process.env.SECRET_KEY,
text,
embedding
);
res.json({ success: result });
});
// Retrieve embeddings endpoint
app.get('/embeddings', async (req, res) => {
const embeddings = await actor.getEmbeddings(process.env.SECRET_KEY);
res.json(embeddings);
});
// Semantic search endpoint
app.post('/search', async (req, res) => {
const { query_vector, top_k } = req.body;
const results = await actor.search(
process.env.SECRET_KEY,
query_vector,
top_k || 5
);
res.json(results);
});
app.listen(3000, () => console.log('RAG API running on port 3000'));
Quick Deployment with ICP Ninja
For Rapid Prototyping:
- Visit ICP Ninja and select the LLM Chatbot template
- Click "Deploy" to deploy directly to mainnet - no local setup required
- Get instant access to a working chatbot with Llama 3.1 8B
- Download the project files for local customization
For Local Development:
- Install DFINITY SDK
- Download Ollama and run it locally: ollama serve, then ollama run llama3.1:8b
- Deploy your canister: dfx deploy
Real-World Use Cases
Ecosystem Applications:
- ELNA.ai: Personal AI memory and knowledge management
- Anda: Conversational AI interface
- ALICE: Autonomous DAO agents
- Kinic: Decentralized search with AI
- Pickpump: DeFi tools with AI assistance
Why ICP for AI?
Versus Traditional Cloud:
- No vendor lock-in or platform censorship
- Transparent, auditable execution
- True data ownership and sovereignty
- Financial operations without intermediaries
Versus Client-Side (Transformers.js):
- Access to larger, more powerful models
- Shared compute resources across users
- Persistent memory and state management
- Integration with blockchain features
Versus Local (Ollama.js):
- No local infrastructure required
- Accessible from any device
- Collaborative AI agents
- Censorship-resistant hosting
Available Libraries & Resources
Official Libraries:
- Rust: ic-llm on docs.rs
- Motoko: mo:llm on mops.one
- TypeScript: Azle 0.27.0+
- Python & C++: Via DFINITY SDK
Documentation & Samples:
- LLM Chatbot Sample (Rust) - Complete chatbot implementation
- Introducing the LLM Canister - Official announcement and guide
- ICP Retrieval System Tutorial - RAG implementation guide
- AI Agents on Internet Computer - Overview and ecosystem
- GitHub: DFINITY LLM Examples - Code samples and templates
Conclusion
Running LLMs in decentralized ways opens up new possibilities for privacy-first, censorship-resistant AI applications. Choose Transformers.js for pure client-side inference, Ollama.js when users can run local servers, and Internet Computer for truly decentralized hosting that cannot be censored or controlled by any single entity.
All three approaches prioritize user privacy while delivering intelligent features. Start experimenting with these tools to build the next generation of decentralized AI applications!