Degree Name

MS (Master of Science)

Program

Computer Science

Date of Award

5-2026

Committee Chair or Co-Chairs

Brian T. Bennett

Committee Members

Shehenaz Shaik, Mathew Desjardins

Abstract

The advancement of Large Language Models (LLMs) has fundamentally changed natural language processing. The substantial memory requirements of frontier models create a significant barrier to entry, centralizing inference. This thesis presents the design and implementation of a distributed inference framework that democratizes access to LLMs by leveraging commodity devices. The framework combines the resources of heterogeneous commercial off-the-shelf (COTS) devices into a unified compute pool, enabling inference of models that exceed a single device's memory capacity. A novel Task Partitioning Engine (TPE) analyzes model architectures, profiles node capabilities, and supports both pipeline and expert parallelism strategies. The primary contribution is a fault-tolerance mechanism that detects node failures during active inference via heartbeat monitoring, automatically recovers lost model shards, and resumes generation from the point of failure with zero token loss. Evaluation on a heterogeneous cluster demonstrates successful distributed inference across devices of differing capability and validates mid-inference recovery following node failure.
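
To make the heartbeat-based failure detection mentioned above concrete, the following Python sketch shows one plausible coordinator-side monitor that flags a worker node as failed once its heartbeats stop arriving. It is an illustrative assumption, not the thesis's actual implementation; the class name, timeout value, and on_failure callback are hypothetical placeholders for whatever the framework uses to trigger shard recovery.

    import time
    import threading

    class HeartbeatMonitor:
        """Hypothetical sketch: tracks worker heartbeats and reports stale nodes."""

        def __init__(self, timeout_s=2.0, on_failure=None):
            self.timeout_s = timeout_s
            # Callback invoked with the failed node's id; in a real system this
            # would start shard recovery before generation resumes.
            self.on_failure = on_failure or (lambda node_id: None)
            self.last_seen = {}          # node_id -> timestamp of last heartbeat
            self._lock = threading.Lock()

        def record_heartbeat(self, node_id):
            # Called whenever a heartbeat message arrives from a worker node.
            with self._lock:
                self.last_seen[node_id] = time.monotonic()

        def check(self):
            # Periodically invoked by the coordinator; any node whose last
            # heartbeat is older than the timeout is treated as failed.
            now = time.monotonic()
            with self._lock:
                stale = [n for n, t in self.last_seen.items()
                         if now - t > self.timeout_s]
                for node_id in stale:
                    del self.last_seen[node_id]
            for node_id in stale:
                self.on_failure(node_id)
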

Document Type

Thesis - unrestricted

Copyright

Copyright by the authors.
