Adversarial Machine Learning
AI models see the world differently than humans do. Adversarial ML is the study of the optical illusions and logical traps that cause those models to fail.
The 3 Categories of Attacks
Adversarial attacks generally fall into three main categories, depending on the attacker's goal and the stage of the pipeline they target.
1. Evasion Attacks (Input)
Modifying the input data to cause a misclassification.
Example: Adding imperceptible noise to a "Stop" sign image so a self-driving car classifies it as "Speed Limit 45". In LLMs, the analogous attack is Prompt Injection.
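A common way to craft that noise is the Fast Gradient Sign Method (FGSM): nudge every input feature a tiny amount in the direction that most increases the model's loss. The sketch below applies FGSM to a toy logistic-regression "classifier" with random weights; the model, input, and epsilon budget are illustrative placeholders, not a real vision model.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=784)           # hypothetical trained weights (placeholder)
b = 0.0
x = rng.uniform(0, 1, size=784)    # hypothetical flattened input image
y = 1.0                            # true label: "stop sign"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    # Probability the toy model assigns to the "stop sign" class.
    return sigmoid(w @ x + b)

# FGSM: step each pixel by eps in the direction of the sign of dLoss/dx.
# For sigmoid + binary cross-entropy, dLoss/dx = (p - y) * w.
eps = 0.05                         # perturbation budget (illustrative)
p = predict(x)
grad_x = (p - y) * w
x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

print(f"clean prediction:       {predict(x):.3f}")
print(f"adversarial prediction: {predict(x_adv):.3f}")
print(f"max per-pixel change:   {np.abs(x_adv - x).max():.3f}")
```

No pixel moves by more than eps, yet the model's confidence collapses, which is exactly the gap between what the model "sees" and what a human sees.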
2. Poisoning Attacks (Training)
Corrupting the training data to compromise the model's integrity.
Example: Injecting malicious samples into a spam filter's feedback loop so it starts marking legitimate emails as spam.
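A minimal sketch of that feedback-loop poisoning, assuming a toy Naive Bayes spam filter built with scikit-learn; the messages, labels, and the attacker's poisoned reports are all invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical feedback-loop data: (text, label) pairs, 1 = spam, 0 = legitimate.
clean_data = [
    ("win a free prize now", 1),
    ("claim your free reward", 1),
    ("meeting moved to 3pm", 0),
    ("quarterly report attached", 0),
]

# Poisoned feedback: the attacker repeatedly reports ordinary business phrases
# as spam, nudging the retrained filter toward blocking legitimate mail.
poison = [
    ("quarterly report attached", 1),
    ("invoice for quarterly report", 1),
    ("quarterly report review notes", 1),
]

def train(samples):
    texts, labels = zip(*samples)
    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    clf = MultinomialNB().fit(X, labels)
    return vec, clf

for name, dataset in [("clean", clean_data), ("poisoned", clean_data + poison)]:
    vec, clf = train(dataset)
    target = vec.transform(["quarterly report attached"])
    print(name, "model flags 'quarterly report attached' as spam:",
          bool(clf.predict(target)[0]))
```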
3. Extraction Attacks (Model)
Stealing the model's parameters or functionality.
Example: Querying a paid API thousands of times to train a cheap "knock-off" model (Model Stealing).
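The sketch below simulates this end to end: a locally trained decision tree stands in for the paid API, and the "attacker" trains a knock-off purely from query responses. The models, query budget, and data are illustrative assumptions, not a real extraction pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# "Victim" model standing in for the paid API. In a real attack this would be
# remote, rate-limited, and billed per call.
X_private = rng.normal(size=(500, 4))
y_private = (X_private[:, 0] + X_private[:, 1] ** 2 > 0.5).astype(int)
victim = DecisionTreeClassifier(max_depth=5).fit(X_private, y_private)

def paid_api(x):
    """Simulates querying the victim model's prediction endpoint."""
    return victim.predict(x)

# Extraction: the attacker never sees the private training data,
# only the labels returned for inputs they chose themselves.
X_queries = rng.normal(size=(2000, 4))   # attacker-chosen queries
y_stolen = paid_api(X_queries)           # labels harvested from the API

knockoff = LogisticRegression(max_iter=1000).fit(X_queries, y_stolen)

# How often the knock-off agrees with the victim on fresh inputs.
X_test = rng.normal(size=(1000, 4))
agreement = (knockoff.predict(X_test) == victim.predict(X_test)).mean()
print(f"knock-off agrees with victim on {agreement:.0%} of test queries")
```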
Why Does This Happen?
Deep neural networks are highly sensitive to small, carefully chosen perturbations of their inputs. They learn statistical correlations, not causal structure, so features that merely co-occur with a class can be nudged against it without changing what a human sees.
An adversarial example exploits these "blind spots" in the model's high-dimensional decision boundary: tiny changes to thousands of individual features accumulate into a large shift in the model's output.
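One common explanation is the "linearity" argument: each feature shifts only imperceptibly, but across thousands of dimensions those shifts add up. The toy calculation below, a sketch assuming a simple linear score w·x with arbitrary random weights, shows how the worst-case shift grows with input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# With a linear score, a perturbation eta = eps * sign(w) changes the score by
# w @ eta = eps * ||w||_1, which grows with the number of input dimensions.
for dim in (10, 1_000, 100_000):
    w = rng.normal(size=dim)     # hypothetical learned weights
    eps = 0.01                   # imperceptibly small per-feature change
    eta = eps * np.sign(w)       # worst-case perturbation within that budget
    shift = w @ eta              # resulting change in the model's score
    print(f"dim={dim:>7}  per-feature change={eps}  score shift={shift:9.1f}")
```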
Defense Strategies
- Adversarial Training: Including adversarial examples in the training set so the model learns to resist them (see the sketch after this list).
- Input Sanitization: Pre-processing inputs to remove noise or malicious patterns (Railguard's core function).
- Rate Limiting: Throttling queries so attackers cannot harvest enough input/output pairs to perform extraction.
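Below is a minimal sketch of the first of these, adversarial training, on a toy logistic-regression model: each step crafts FGSM perturbations against the current weights and trains on a mix of clean and perturbed examples. The synthetic data, learning rate, and epsilon budget are illustrative assumptions, not a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classification data (illustrative).
X = rng.normal(size=(400, 20))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(20)
b = 0.0
lr, eps = 0.1, 0.1   # learning rate and FGSM budget (assumed values)

for step in range(200):
    # 1. Craft adversarial versions of the data against the current weights:
    #    move each input in the direction that most increases its loss.
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w[None, :]
    X_adv = X + eps * np.sign(grad_x)

    # 2. Take a gradient step on the mix of clean and adversarial examples.
    X_mix = np.vstack([X, X_adv])
    y_mix = np.concatenate([y, y])
    p_mix = sigmoid(X_mix @ w + b)
    w -= lr * (X_mix.T @ (p_mix - y_mix)) / len(y_mix)
    b -= lr * (p_mix - y_mix).mean()

# Robust accuracy: how often a fresh FGSM attack fails to flip the prediction.
p = sigmoid(X @ w + b)
X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])
acc_adv = ((sigmoid(X_adv @ w + b) > 0.5).astype(float) == y).mean()
print(f"accuracy on adversarial inputs after adversarial training: {acc_adv:.0%}")
```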
Secure Your AI Pipeline
Railguard sits in front of your model to detect and block evasion attacks in real time.