Serving API
===========

LLMEngine Class
---------------

Main class for loading and serving models.

.. code-block:: python

   from atom import LLMEngine

   llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")

**Parameters:**

* **model** (*str*) - HuggingFace model name or path
* **gpu_memory_utilization** (*float*) - Fraction of GPU memory to use (0.0-1.0). Default: 0.9
* **max_model_len** (*int*) - Maximum sequence length
* **tensor_parallel_size** (*int*) - Number of GPUs for tensor parallelism. Default: 1
* **dtype** (*str*) - Model dtype ('float16', 'bfloat16', 'float32')

Methods
^^^^^^^

generate()
""""""""""

.. code-block:: python

   from atom import SamplingParams

   # prompts must be a list, even for a single prompt
   prompts = ["Hello, world!"]
   sampling_params = SamplingParams(max_tokens=50, temperature=0.8)
   outputs = llm.generate(prompts, sampling_params)

Generate text from prompts.

**Parameters:**

* **prompts** (*list[str]*) - Input prompts (must be a list, even for a single prompt)
* **sampling_params** (*SamplingParams | list[SamplingParams]*) - Sampling configuration

**Returns:**

* **outputs** (*list[str]*) - Generated text strings

.. note::

   Unlike some APIs, ``generate()`` requires ``prompts`` to be a list and
   returns a list of strings, not RequestOutput objects. Parameters such as
   ``max_tokens`` must be specified via ``SamplingParams``.

SamplingParams
--------------

.. code-block:: python

   from atom import SamplingParams

   params = SamplingParams(
       temperature=0.8,
       max_tokens=100,
       ignore_eos=False,
       stop_strings=["", "\n\n"]
   )

Configuration for text generation.

**Parameters:**

* **temperature** (*float*) - Controls randomness. Default: 1.0
* **max_tokens** (*int*) - Maximum number of tokens to generate. Default: 64
* **ignore_eos** (*bool*) - Whether to ignore the EOS token. Default: False
* **stop_strings** (*list[str] | None*) - Strings that stop generation. Default: None

.. note::

   The following parameters are NOT currently supported (they may be added
   in the future): ``top_p``, ``top_k``, ``presence_penalty``,
   ``frequency_penalty``.

Return Values
-------------

The ``generate()`` method returns a list of strings (not RequestOutput objects).

.. code-block:: python

   outputs = llm.generate(["Hello, world!"], sampling_params)
   # outputs is list[str], e.g., ["Hello, world! How are you today?"]

.. note::

   Unlike some LLM serving frameworks, ATOM's ``generate()`` method returns
   plain strings, not structured output objects. Token IDs and other
   metadata are not currently exposed by the API.

Example
-------

Complete example:

.. code-block:: python

   from atom import LLMEngine, SamplingParams

   # Initialize model
   llm = LLMEngine(
       model="meta-llama/Llama-2-7b-hf",
       tensor_parallel_size=2,
       gpu_memory_utilization=0.9
   )

   # Configure sampling (top_p is not supported, so it is not set here)
   sampling_params = SamplingParams(
       temperature=0.7,
       max_tokens=200
   )

   # Generate
   prompts = ["Tell me about AMD GPUs"]
   outputs = llm.generate(prompts, sampling_params=sampling_params)

   # generate() returns plain strings, so pair each one with its prompt
   for prompt, text in zip(prompts, outputs):
       print(f"Prompt: {prompt}")
       print(f"Generated: {text}")
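
Because ``generate()`` accepts only a list and returns plain strings, callers who want each completion next to its prompt have to do the pairing themselves. The sketch below shows one way to wrap that contract. It assumes that outputs come back in the same order as the prompts, which this page does not state explicitly. ``generate_pairs`` and ``EchoEngine`` are hypothetical names introduced for illustration; ``EchoEngine`` is a stand-in with the same ``generate()`` shape so the snippet runs without ``atom`` or a GPU.

.. code-block:: python

   from typing import Any, List, Sequence, Tuple, Union


   def generate_pairs(
       llm: Any,
       prompts: Union[str, Sequence[str]],
       sampling_params: Any,
   ) -> List[Tuple[str, str]]:
       """Return (prompt, completion) pairs from an engine whose
       generate() takes list[str] and returns list[str].

       Hypothetical helper, not part of the documented API.
       """
       # generate() requires a list, so wrap a lone string.
       if isinstance(prompts, str):
           prompts = [prompts]
       prompt_list = list(prompts)

       outputs = llm.generate(prompt_list, sampling_params)

       # Assumption: one output per prompt, in the same order.
       if len(outputs) != len(prompt_list):
           raise ValueError(
               f"expected {len(prompt_list)} outputs, got {len(outputs)}"
           )
       return list(zip(prompt_list, outputs))


   class EchoEngine:
       """Hypothetical stand-in for LLMEngine: same generate() shape,
       but it only appends a marker to each prompt."""

       def generate(self, prompts: List[str], sampling_params: Any) -> List[str]:
           if not isinstance(prompts, list):
               raise TypeError("prompts must be a list")
           return [p + " ..." for p in prompts]


   # With the real engine, pass an LLMEngine and a SamplingParams instead.
   for prompt, text in generate_pairs(EchoEngine(), "Hello, world!", None):
       print(f"Prompt: {prompt}")
       print(f"Generated: {text}")

Returning tuples keeps the helper independent of any output class, which matches the plain-string return type described above.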