LLM unlearning plays a crucial role in removing sensitive information from language models to mitigate potential misuse. However, previous approaches often treat nonsensical responses or template-based refusals (e.g., “Sorry, I cannot answer.”) as the unlearning target, which can give the impression of deliberate information suppression, making the process even more vulnerable to attacks and jailbreaks. Moreover, most methods rely on auxiliary models or retain datasets, which adds complexity to the unlearning process. To address these challenges, we propose MEOW, a streamlined and stealthy unlearning method that eliminates the need for auxiliary models or retain data while avoiding leakage through its use of inverted facts. These inverted facts are generated by an offline LLM and serve as fine-tuning labels. We also introduce MEMO, a novel metric that measures the model’s memorization, to select optimal fine-tuning targets. The use of inverted facts not only keeps the model’s unlearning behavior covert but also ensures that sensitive information is effectively forgotten without revealing the target data. Evaluated on the ToFU Knowledge Unlearning dataset using Llama2-7B-Chat and Phi-1.5, MEOW outperforms baselines in forgetting quality while preserving model utility. MEOW also maintains strong performance across NLU and NLG tasks and demonstrates superior resilience to attacks, validated via the Min-K% membership inference method.
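As a rough illustration of the Min-K% membership inference check mentioned above, the sketch below scores a text by averaging its lowest-probability tokens. It assumes the per-token log-probabilities have already been obtained from the model under test. The function name `min_k_percent_score` and the default `k` are illustrative assumptions, not taken from the MEOW paper.

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Average the lowest k fraction of per-token log-probabilities.

    A higher (less negative) score suggests the text looks more familiar
    to the model, i.e. more likely to have been seen in training.
    The function name and default k are illustrative assumptions.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so short texts still get a score.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical log-probabilities for a five-token text.
example = [-1.0, -2.0, -3.0, -4.0, -5.0]
score = min_k_percent_score(example, k=0.4)  # averages -5.0 and -4.0
```

Under this kind of check, an unlearned model would be expected to score forgotten texts more like unseen ones. The decision threshold is dataset-dependent and is not specified here.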
This paper studies the effectiveness-quality trade-off in red-green list watermarking for LLM text generation. It formulates watermarking as a multi-objective optimization problem and identifies a key factor behind this dilemma. Based on the analysis, MorphMark is proposed as an adaptive method that dynamically adjusts watermark strength instead of using a fixed hyperparameter. The method is model-agnostic and model-free, making deployment easier across rapidly evolving models. Experiments show improved balance between watermark detectability and text quality, while also providing strong efficiency and flexibility.
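For context, the fixed-strength red-green list baseline that this abstract contrasts with can be sketched as follows. This is a minimal sketch of the generic scheme, not of MorphMark's adaptive rule. The names `add_green_bias`, `green_list_z_score`, `delta`, and `gamma` are illustrative assumptions. Generation adds a constant bias `delta` to the logits of green-list tokens, and detection applies a one-proportion z-test to the count of green tokens.

```python
import math


def add_green_bias(logits, green_ids, delta=2.0):
    """Return a copy of logits with a fixed bias added to green-list tokens.

    delta is the fixed watermark strength: larger values make the
    watermark easier to detect but distort the distribution more.
    """
    biased = list(logits)
    for token_id in green_ids:
        biased[token_id] += delta
    return biased


def green_list_z_score(num_green, num_tokens, gamma=0.5):
    """One-proportion z-test for the number of green tokens in a text.

    gamma is the fraction of the vocabulary placed on the green list.
    Without a watermark, about gamma * num_tokens tokens are green.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not 0 < gamma < 1:
        raise ValueError("gamma must be in (0, 1)")
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std


# Hypothetical example: 75 of 100 tokens fall on the green list.
z = green_list_z_score(75, 100, gamma=0.5)
biased = add_green_bias([0.0, 1.0, 2.0], green_ids=[0, 2], delta=2.0)
```

The trade-off described in the abstract shows up in `delta`: a larger fixed value raises the z-score of watermarked text at the cost of text quality, which is what motivates adjusting the strength dynamically.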
We present Invisible Entropy, a safe and efficient low-entropy watermarking paradigm for large language models. The method improves robustness and detectability while preserving text quality under practical decoding settings. It is designed to reduce deployment overhead and maintain compatibility with modern LLM generation pipelines. Extensive experiments show favorable trade-offs against prior watermarking methods across quality, security, and efficiency metrics.
We study model extraction under adversarial settings and propose HoneypotNet, a backdoor-based attack strategy against extraction pipelines. The method injects stealthy triggers that remain latent during normal usage but induce targeted behavior in extracted surrogate models. We analyze transferability and attack success under different query budgets and defense settings. Results show that extraction systems can inherit hidden vulnerabilities, motivating stronger auditing and robust defense mechanisms.
We introduce MLLMGuard, a multi-dimensional safety evaluation suite for multimodal large language models. The framework covers diverse risk categories with bilingual data, standardized inference tooling, and both manual and automatic evaluation protocols. We further train a lightweight evaluator (GuardRank) on human annotations to enable scalable and reproducible scoring. Experiments on closed-source and open-source MLLMs show substantial safety differences across dimensions and reveal gaps not captured by existing single-axis benchmarks.
We propose Chain of History, an LLM-based framework for temporal knowledge graph completion that jointly models historical evolution and future forecasting. The approach leverages sequential temporal context to improve reasoning over dynamic entities and relations. It supports both interpolation and extrapolation settings and integrates naturally with large language model priors. Experiments on benchmark temporal KGs demonstrate competitive or superior performance with strong generalization across time horizons.
Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. To address this, we evaluate ESC models by having a role-playing agent interact with them. First, we re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model, ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotations on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but still fall short of human performance. Moreover, to automate the scoring process for future ESC models, we develop ESC-RANK, which is trained on the annotated data and surpasses GPT-4's scoring performance by 35 points.