Serving API

LLMEngine Class

Main class for loading and serving models.

from atom import LLMEngine

llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")

Parameters:

  • model (str) - HuggingFace model name or path

  • gpu_memory_utilization (float) - GPU memory usage (0.0-1.0). Default: 0.9

  • max_model_len (int) - Maximum sequence length

  • tensor_parallel_size (int) - Number of GPUs for tensor parallelism. Default: 1

  • dtype (str) - Model dtype ('float16', 'bfloat16', 'float32')
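For reference, here is a construction call that sets every documented parameter explicitly. This is an illustrative sketch; the values (in particular max_model_len=4096) are examples, not recommendations.

from atom import LLMEngine

llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",  # HuggingFace model name or local path
    gpu_memory_utilization=0.9,        # fraction of GPU memory to use
    max_model_len=4096,                # maximum sequence length
    tensor_parallel_size=1,            # single-GPU serving
    dtype="float16"
)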

Methods

generate()

sampling_params = SamplingParams(max_tokens=50, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

Generate text from prompts.

Parameters:

  • prompts (list[str]) - Input prompts (must be a list, even for a single prompt)

  • sampling_params (SamplingParams | list[SamplingParams]) - Sampling configuration

Returns:

  • outputs (list[str]) - Generated text, one string per input prompt

Note

Unlike some APIs, generate() requires prompts to be a list and returns a list of strings, not RequestOutput objects. Parameters like max_tokens must be specified via SamplingParams.
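A minimal sketch of this list-in, list-out contract (the prompt text is illustrative):

from atom import LLMEngine, SamplingParams

llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(max_tokens=50, temperature=0.8)

# A single prompt still goes in as a one-element list...
outputs = llm.generate(["What is tensor parallelism?"], params)

# ...and comes back as a one-element list of plain strings.
print(outputs[0])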

SamplingParams

from atom import SamplingParams

params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
    ignore_eos=False,
    stop_strings=["</s>", "\n\n"]
)

Configuration for text generation.

Parameters:

  • temperature (float) - Controls randomness. Default: 1.0

  • max_tokens (int) - Maximum tokens to generate. Default: 64

  • ignore_eos (bool) - Whether to ignore EOS token. Default: False

  • stop_strings (list[str] | None) - Strings that stop generation. Default: None

Note

The following parameters are NOT currently supported (they may be added in a future release): top_p, top_k, presence_penalty, frequency_penalty.
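Because generate() accepts either a single SamplingParams or a list of them (see the generate() signature above), sampling settings can vary across a batch. A sketch, assuming llm was constructed as shown earlier and that the list carries one entry per prompt:

from atom import SamplingParams

prompts = [
    "Write a haiku about GPUs",
    "List three facts about HBM memory"
]

# One SamplingParams per prompt: creative settings for the first,
# near-deterministic settings for the second.
per_prompt_params = [
    SamplingParams(temperature=1.0, max_tokens=64),
    SamplingParams(temperature=0.1, max_tokens=64, stop_strings=["\n\n"])
]

outputs = llm.generate(prompts, per_prompt_params)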

Return Values

The generate() method returns a list of strings (not RequestOutput objects).

outputs = llm.generate(["Hello, world!"], sampling_params)
# outputs is list[str], e.g., ["Hello, world! How are you today?"]

Note

Unlike some LLM serving frameworks, ATOM’s generate() method returns plain strings, not structured output objects. If you need token IDs or other metadata, these are not currently exposed in the API.

Example

Complete example:

from atom import LLMEngine, SamplingParams

# Initialize the engine across two GPUs
llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure sampling (top_p is not supported; see the note above)
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200
)

# Generate; outputs is a list of plain strings
prompts = ["Tell me about AMD GPUs"]
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated: {output}")