

FORTRESS is a training-free LLM input moderation system from my first-authored paper published in TMLR. It unifies semantic retrieval and dynamic perplexity analysis in a novel, efficient pipeline. Its data-centric design adapts to new threats through simple data ingestion instead of costly retraining, ensuring a scalable defense. FORTRESS outperforms existing classifiers, achieving a 91.6% average F1 score across major benchmarks while running over 5x faster than the previous leading fine-tuned classifier.
2025
First Author
PyTorch, ChromaDB, HF Transformers
The world of Large Language Models (LLMs) is in a constant state of flux. As models become more powerful, so do the methods used to attack them. For developers and researchers, ensuring LLM safety is a perpetual cat-and-mouse game. Existing safety classifiers often require resource-intensive fine-tuning, an approach that proves brittle against the constant stream of new attack methods and demands costly, time-consuming retraining cycles.
Today, I'm excited to introduce a new paradigm in LLM safety and my latest first-author research paper published in TMLR: FORTRESS. This article provides a high-level overview, but for a full technical breakdown, the complete paper is available on OpenReview.
FORTRESS (Fast, Orchestrated Tuning-free Retrieval Ensemble for Scalable Safety) is a state-of-the-art, training-free LLM input safety classification system. It overcomes the limitations of prior work by unifying semantic retrieval and dynamic perplexity analysis within a single, efficient pipeline. At its core, FORTRESS utilizes a single, lightweight instruction-tuned language model (like Gemma or Qwen) for both embedding generation and perplexity analysis, ensuring robust performance with minimal computational overhead.
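To make the single-model, dual-signal design concrete, here is a minimal sketch of how one forward pass can yield both an embedding and a perplexity score. The checkpoint name and the mean-pooling strategy are illustrative assumptions, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; FORTRESS uses small instruction-tuned Qwen/Gemma
# models, but the exact variant here is an assumption.
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, output_hidden_states=True)
model.eval()

@torch.no_grad()
def embed_and_perplexity(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    # Embedding: mean-pool the last hidden layer over the token dimension.
    embedding = outputs.hidden_states[-1].mean(dim=1).squeeze(0)
    # Perplexity: exponentiate the mean next-token cross-entropy loss.
    perplexity = torch.exp(outputs.loss).item()
    return embedding, perplexity
```

Both signals come from the same forward pass, which is what keeps the computational overhead minimal.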
This data-centric design means FORTRESS adapts to emerging threats through simple data ingestion rather than expensive model retraining, offering a practical, scalable, and robust approach to LLM safety.
Compared to existing classifiers, FORTRESS offers a unique combination of scalability, efficiency, and leading performance.
Training-Free & Scalable: LLM safety is a constantly evolving field. Existing methods struggle with novel attack techniques, and addressing such defense gaps requires compute-intensive retraining. In stark contrast, scaling FORTRESS is as simple as ingesting new data into its vector database with no other modifications needed. This approach not only results in robust defense against new attacks but also ensures minimal degradation in previous benchmarks.
Computationally Efficient: The rise of lightweight language models has prompted the development of external safety guards that impose minimal overhead. FORTRESS uses models as small as 0.6B parameters, a fraction of the size of many competing classifiers, yet achieves results competitive with far larger models.
Leading Performance: FORTRESS consistently outperforms existing models. Our top-performing configuration achieves an average F1 unsafe score of 91.6% across nine distinct safety benchmarks spanning languages and attack vectors, all while operating over five times faster than the previous leading fine-tuned classifier.
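To illustrate the scalability claim, adapting the system to a newly observed attack family amounts to a single upsert into the vector store. This is a minimal sketch assuming a ChromaDB persistent collection and an embed() helper returning a list of floats from the same instruction-tuned LLM; the path, collection name, and metadata schema are hypothetical:

```python
import chromadb

# "Scaling by ingestion" sketch; no model weights are touched at any point.
client = chromadb.PersistentClient(path="./fortress_db")
collection = client.get_or_create_collection("safety_prompts")

def ingest(prompts, labels, categories, embed):
    """Add newly observed attack (or benign) prompts to the knowledge base."""
    start = collection.count()
    collection.add(
        ids=[f"prompt-{start + i}" for i in range(len(prompts))],
        documents=prompts,
        # Embeddings come from the same instruction-tuned LLM used elsewhere.
        embeddings=[embed(p) for p in prompts],
        metadatas=[{"label": l, "category": c} for l, c in zip(labels, categories)],
    )
```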
FORTRESS employs a two-stage detection pipeline that combines complementary analysis techniques. A dynamic ensemble strategy then intelligently weighs these signals to produce a final classification.
The first stage assesses the semantic similarity of a user's prompt against a curated vector database.
Embedding Generation: We use a single instruction-tuned LLM (e.g., Qwen, Gemma) to extract a dense vector embedding from the model's hidden states.
Similarity Search: A ChromaDB vector store performs a k-Nearest Neighbors search to retrieve the most semantically similar prompts from the database, forming an initial hypothesis about the query's safety.
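Concretely, the retrieval stage can be sketched as a k-NN query against the collection built above. The unsafe-vote fraction and nearest-neighbor category below are illustrative heuristics, not the paper's exact scoring rule:

```python
# Stage 1 sketch: query the vector store and form an initial safety hypothesis.
def retrieve_hypothesis(collection, prompt_embedding, k=5):
    results = collection.query(
        query_embeddings=[prompt_embedding],  # a list of floats
        n_results=k,
        include=["metadatas", "distances"],
    )
    neighbors = results["metadatas"][0]
    # Fraction of retrieved neighbors labeled unsafe: the semantic signal.
    unsafe_fraction = sum(m["label"] == "unsafe" for m in neighbors) / k
    # The nearest neighbor's category steers the per-category perplexity check.
    top_category = neighbors[0]["category"]
    return unsafe_fraction, top_category
```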
The second stage evaluates the linguistic typicality of the query, a method that is particularly potent for detecting novel or zero-day attacks that may not exist in the vector database.
While perplexity-based detection isn't new, a core innovation of FORTRESS is its use of dynamic, per-category perplexity thresholds. Instead of a single global threshold, the analyzer's parameters are pre-calibrated for each of the 20 safe and unsafe categories (e.g., s1_violent_crimes, content_creation). This allows the system to tune its sensitivity to the distinct linguistic characteristics of each topic, resolving the classic dilemma where a single static threshold is too strict for creative content but too lenient for sophisticated attacks.
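As a toy illustration of the idea (the threshold values here are invented; FORTRESS pre-calibrates one per category):

```python
# Illustrative per-category thresholds; the real values are pre-calibrated
# for each of the 20 categories, and these numbers are made up.
PPL_THRESHOLDS = {
    "s1_violent_crimes": 45.0,   # stricter: attack phrasing tends to be atypical
    "content_creation": 120.0,   # looser: creative prompts are legitimately unusual
}
DEFAULT_THRESHOLD = 80.0

def perplexity_flag(perplexity: float, category: str) -> bool:
    """Flag prompts that look linguistically atypical for their category."""
    return perplexity > PPL_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```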
The final classification is determined by a weighted majority vote that intelligently combines the signals from the primary and secondary detectors. The strategy's key strength is its dynamic adjustment of weights based on the coherence of the retrieval results.
When the retrieved neighbors are highly coherent (e.g., all are labeled unsafe), the system prioritizes the strong semantic signal; when they conflict, more weight shifts to the perplexity analyzer. This allows FORTRESS to be decisive when context is clear, while exercising greater caution when confronted with ambiguous or novel prompts.
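A minimal sketch of such a coherence-weighted vote follows; the specific weighting scheme is my illustration, not the published formula:

```python
# Ensemble sketch: retrieval gets more weight when its neighbors agree,
# and the perplexity signal gets more weight when they do not.
def classify(unsafe_fraction: float, ppl_flagged: bool) -> str:
    # Coherence: distance of the neighbor vote from an uninformative 50/50
    # split (0.0 = evenly split vote, 1.0 = unanimous neighbors).
    coherence = abs(unsafe_fraction - 0.5) * 2
    w_retrieval = 0.5 + 0.5 * coherence
    w_perplexity = 1.0 - w_retrieval
    score = w_retrieval * unsafe_fraction + w_perplexity * float(ppl_flagged)
    return "unsafe" if score >= 0.5 else "safe"
```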
We conducted a series of comprehensive experiments, evaluating FORTRESS against state-of-the-art baselines across nine diverse public safety benchmarks.
FORTRESS demonstrates highly effective and efficient performance. Our top configuration, FORTRESS Qwen 4B (Expanded), achieves an average F1 score of 91.6%, outperforming the previous leading baseline, GuardReasoner 8B (86.3% Avg. F1), by over 5 percentage points while being over 5x faster.
A core claim is that FORTRESS adapts via data, not retraining. We tested this by ingesting additional datasets into our vector database. As shown below, this simple data ingestion yields a significant and consistent performance uplift across all model families, boosting the average F1 score by up to 8 percentage points.
This improvement is rooted in the greater structural coherence of the knowledge base. The visualization below shows how expanding the database transforms a sparse semantic space into one with dense, clearly delineated clusters, allowing for more accurate retrieval.
Critically, this scalability comes with no performance penalty. Our analysis shows that performance steadily increases with knowledge base size, while latency remains stable or even slightly decreases.
To dissect the contribution of each component, we conducted a series of ablation studies. The results highlight the critical role of our novel design choices.
The most compelling finding is the importance of dynamic, per-category perplexity thresholds. Removing this feature and using a single global threshold caused a catastrophic performance drop of 14.7 points in the average F1 score. This proves that context-aware analysis is a cornerstone of FORTRESS's high accuracy. Furthermore, the results validate our use of an integrated instruction-tuned LLM, which outperforms pipelines using conventional, retrieval-only embedding models.
FORTRESS establishes a new paradigm for LLM security that is simultaneously robust, efficient, and perpetually adaptable. By integrating semantic retrieval with dynamic perplexity analysis in a training-free framework, it achieves state-of-the-art performance while remaining scalable and computationally lean.
Future work will focus on enhancing system autonomy by automating the discovery of new threat categories and exploring more sophisticated ensemble techniques. While the codebase is currently a proof of concept, we plan to open-source it in the future to encourage further research and development in scalable LLM safety. For a detailed exploration of our methodology and the full results, we sincerely invite you to read the complete paper on OpenReview.
This project taught me more about “taking a leap” than any I’ve done before. It began with a bold—albeit naïve—idea and a commitment to see it through. During the first month of curating the core dataset, I had no real certainty that the approach would work or that the architecture wasn’t overengineered. It felt like walking in the dark until the first usable prototype appeared and the system’s potential became visible.
After that initial fog, experiments and analysis progressed relatively smoothly. I only told my advisor when the paper was already halfway written. I’m deeply grateful to Prof. Tsai and AK for their earlier guidance—it gave me the academic footing to finish this work.
The review process was another trek through uncertainty. I initially planned for AAAI 2026 but switched to TMLR due to timing and submission length. For a first journal submission, things started well: after the AE’s recommendation, I quickly received two positive reviews; the third review prompted additional experiments that strengthened the paper. Then came the unexpected waiting—multiple delays and, at one point, the possibility of hearing nothing until late November. While juggling college applications and stress, I finally received the acceptance on October 11. Moving from a rough idea to an accepted paper was winding, but it made the result feel earned.
Perhaps, just as the saying goes, “Better to light a candle than curse the darkness,” what I’ve learned is simply the courage to light the first candle.