What is Petals?
Petals introduces a collaborative approach to running large language models (LLMs). It allows users to operate demanding models such as Llama 3.1 (up to 405B parameters), Mixtral (8x22B), Falcon (40B+), and BLOOM (176B) without requiring high-end enterprise hardware. The system operates in a distributed, peer-to-peer manner, similar to BitTorrent. Users load a segment of the desired model onto their machine (compatible with consumer-grade GPUs or Google Colab) and connect to a network where other participants host the remaining parts.
This distributed structure yields inference speeds fast enough for interactive applications like chatbots, reaching up to 6 tokens per second for Llama 2 (70B). Beyond standard inference, Petals offers more flexibility than typical LLM APIs: it supports various fine-tuning methods and custom sampling techniques, and lets users execute custom computational paths through the model or inspect its hidden states. Its integration with PyTorch and 🤗 Transformers provides API-like convenience coupled with deep model access and control.
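As a rough illustration of the idea (not the Petals API), the sketch below shows how a model split into blocks can be hosted by several peers and chained together by a client, BitTorrent-style. The `Peer` class, the peer names, and the toy "blocks" are all invented for this example; a real swarm would run transformer layers and communicate over the network.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a transformer block: any function
# mapping one hidden state to the next.
Block = Callable[[float], float]

@dataclass
class Peer:
    """A participant hosting a contiguous slice of the model's blocks."""
    name: str
    blocks: List[Block]

    def forward(self, hidden: float) -> float:
        # Run only the locally hosted segment of the model.
        for block in self.blocks:
            hidden = block(hidden)
        return hidden

def distributed_forward(peers: List[Peer], x: float) -> float:
    """Chain the peers' segments, like a pipeline across the swarm."""
    for peer in peers:
        x = peer.forward(x)
    return x

# The full "model" is four blocks, split across two consumer machines.
blocks = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3, lambda h: h * h]
peers = [Peer("alice", blocks[:2]), Peer("bob", blocks[2:])]

print(distributed_forward(peers, 1.0))  # → 1.0, same as running all blocks locally
```

The key property is that no single peer needs the whole model in memory: each holds only its slice, and the client stitches the partial results together.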
Features
- Distributed LLM Execution: Runs large models across a network of user devices.
- Support for Major LLMs: Compatible with Llama 3.1, Mixtral, Falcon, BLOOM, and others.
- Consumer Hardware Compatibility: Operates on consumer-grade GPUs or Google Colab.
- Interactive Inference Speed: Delivers speeds suitable for chatbots and interactive apps (e.g., up to 6 tokens/sec for Llama 2 70B).
- Advanced Model Control: Allows fine-tuning, custom sampling, custom execution paths, and access to hidden states.
- PyTorch & Transformers Integration: Offers flexibility through integration with popular ML frameworks.
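To make the "access to hidden states" feature concrete, here is a minimal stand-alone sketch of the pattern: run a layered computation step by step and record each intermediate value instead of seeing only the final output. The function name and toy blocks are invented for illustration; in Petals the recorded values would be real transformer hidden states.

```python
def run_with_hidden_states(blocks, x):
    """Apply blocks in order, recording the hidden state after each one."""
    states = []
    for block in blocks:
        x = block(x)
        states.append(x)
    return x, states

# Two toy "layers" standing in for transformer blocks.
blocks = [lambda h: h + 1, lambda h: h * 3]
output, hiddens = run_with_hidden_states(blocks, 2)
print(output, hiddens)  # → 9 [3, 9]
```

Exposing the per-layer states like this is what enables probing, custom execution paths, and research on model internals.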
Use Cases
- Running large-scale language models on standard hardware.
- Developing and testing interactive AI applications and chatbots.
- Fine-tuning large language models for specific tasks.
- Conducting AI research requiring deep access to model internals.
- Collaboratively hosting and utilizing powerful AI models.
- Experimenting with custom inference and sampling techniques.
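As one concrete example of a custom sampling technique, the sketch below implements temperature sampling over raw logits using only the standard library. The vocabulary and logit values are made up for illustration; in practice the logits would come from the model's final layer.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, vocab, temperature=1.0, rng=random):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

# A low temperature sharpens the distribution toward the argmax token.
print(sample_token(logits, vocab, temperature=0.1))
```

Lowering the temperature makes sampling nearly greedy, while raising it flattens the distribution; frameworks that expose logits directly make experiments like this trivial to run.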
Petals Uptime Monitor
- Average Uptime: 100%
- Average Response Time: 133 ms