Deceptive Joy: AI is once again deceived by hacking techniques, with a success rate of %.
Experts have developed an innovative technique called "", which can bypass the defense mechanisms of language AI () models. This technique combines secure and insecure content in seemingly harmless contexts, tricking the model into generating potentially malicious responses.
The study involved approximately 10,000 tests on 10 different models, highlighting the widespread vulnerability of such attacks. "Prompt Injection" employs a multi-channel strategy, inserting an insecure request between two secure ones. This way, the model does not perceive the content as a threat and continues to generate responses without activating the security filters.
The attack achieved a % success rate after only iterations, demonstrating its effectiveness in bypassing standard filters. The attack process is divided into three stages: preparation, initial query, and topic exploration. In particular, the third stage, which requires further expansion, is where the model begins to generate unsafe details in a more specific manner, thereby confirming the effectiveness of the multi-path technique. This approach significantly increases the success rate compared to direct attacks.
The success of attacks varies according to the category of unsafe content. Models are more susceptible to requests related to violence and dangerous behavior, while responses related to sexual content and hate speech are handled with greater caution. This disparity indicates that the model has heightened sensitivity to certain content categories.
The importance of more structured query design and multi-level content filtering solutions was also emphasized. Recommendations include adopting services such as - and -, as well as regularly conducting model testing to strengthen the defense system and reduce vulnerabilities.
The results of this study have been shared with the Cyber Threat Alliance () for rapid implementation of preventive measures. It points out that while the issue highlights the current weaknesses of AI technology, it does not fundamentally compromise the security of the models but rather underscores the necessity of continuous improvement to address new threats.