Serving API
LLMEngine Class
Main class for loading and serving models.
from atom import LLMEngine
llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")
Parameters:
model (str) - HuggingFace model name or path
gpu_memory_utilization (float) - GPU memory usage (0.0-1.0). Default: 0.9
max_model_len (int) - Maximum sequence length
tensor_parallel_size (int) - Number of GPUs for tensor parallelism. Default: 1
dtype (str) - Model dtype ('float16', 'bfloat16', 'float32')
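Putting these parameters together, a minimal construction sketch (the keyword values below are illustrative assumptions, not documented defaults):
from atom import LLMEngine

llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,   # use at most 90% of GPU memory
    max_model_len=4096,           # assumed value; size to your workload
    tensor_parallel_size=1,       # single GPU
    dtype="float16",
)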
Methods
generate()
Generate text from the given prompts.
from atom import SamplingParams
prompts = ["Hello, world!"]  # always a list, even for one prompt
sampling_params = SamplingParams(max_tokens=50, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)
Parameters:
prompts (list[str]) - Input prompts (must be a list, even for single prompt)
sampling_params (SamplingParams | list[SamplingParams]) - Sampling configuration
Returns:
outputs (list[str]) - Generated text strings
Note
Unlike some APIs, generate() requires prompts to be a list and returns
a list of strings, not RequestOutput objects. Parameters like max_tokens
must be specified via SamplingParams.
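For illustration, both accepted forms of sampling_params in one sketch (assuming the llm engine from above; that a per-prompt list must match the number of prompts is an assumption, not documented behavior):
# One SamplingParams shared by every prompt:
outputs = llm.generate(["Hi there", "Goodbye"], SamplingParams(max_tokens=32))

# One SamplingParams per prompt (assumed: one entry per prompt):
per_prompt = [SamplingParams(max_tokens=16), SamplingParams(max_tokens=64)]
outputs = llm.generate(["Hi there", "Goodbye"], per_prompt)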
SamplingParams
Configuration for text generation.
from atom import SamplingParams
params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
    ignore_eos=False,
    stop_strings=["</s>", "\n\n"]
)
Parameters:
temperature (float) - Sampling temperature; higher values produce more random output, lower values more deterministic output. Default: 1.0
max_tokens (int) - Maximum tokens to generate. Default: 64
ignore_eos (bool) - Whether to ignore EOS token. Default: False
stop_strings (list[str] | None) - Strings that stop generation. Default: None
Note
The following parameters are NOT currently supported and may be added in a future release: top_p, top_k, presence_penalty, frequency_penalty.
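For contrast, here are two configurations built only from the supported fields above (a sketch; the variable names are illustrative):
# Fixed-length continuation: run to max_tokens, ignoring the EOS token.
fixed_len = SamplingParams(max_tokens=128, ignore_eos=True)

# Stop early at an end-of-sequence string or a blank line.
until_break = SamplingParams(max_tokens=256, temperature=0.8,
                             stop_strings=["</s>", "\n\n"])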
Return Values
The generate() method returns a list of strings (not RequestOutput objects).
outputs = llm.generate(["Hello, world!"], sampling_params)
# outputs is list[str], e.g., ["Hello, world! How are you today?"]
Note
Unlike some LLM serving frameworks, ATOM’s generate() method returns plain strings, not structured output objects. If you need token IDs or other metadata, these are not currently exposed in the API.
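If you do need token IDs, one workaround is to re-tokenize the returned text with the model's own tokenizer. This is a sketch assuming the Hugging Face transformers library is installed; note that re-encoding is not guaranteed to reproduce the exact tokens the model sampled:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
outputs = llm.generate(["Hello, world!"], sampling_params)
# Approximate the generated token IDs by re-encoding the output text.
token_ids = tokenizer(outputs[0], add_special_tokens=False).input_ids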
Example
Complete example:
from atom import LLMEngine, SamplingParams

# Initialize the model
llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure sampling (top_p is not supported; see the note above)
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200
)

# Generate
prompts = ["Tell me about AMD GPUs"]
outputs = llm.generate(prompts, sampling_params)

# generate() returns plain strings, so pair them with their prompts
for prompt, text in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated: {text}")