Fine-tuned LLMs that perform six-step pragmatic moral reasoning grounded in Moral Foundations Theory to judge whether a conversational reply is morally acceptable, problematic, or neutral.
Try the Live Demo · HuggingFace Model

This work addresses the problem of moral judgment in conversational AI: given a question/prompt and a reply, can a language model determine whether the reply is morally acceptable, problematic, or neutral?
Unlike surface-level toxicity detection, moral judgment requires pragmatic understanding: identifying the implicit actions in a reply, predicting their consequences, and evaluating those consequences against deep moral principles, not just flagging offensive words. We ground this reasoning in Moral Foundations Theory (MFT).
Moral Integrity Corpus (MIC): 23,500 training examples of Q&A pairs annotated with judgments, MFT labels, and rules of thumb.
Llama 3.2-3B base model, fine-tuned via supervised fine-tuning (SFT). The fusion setting combines MFT and Judgment inference chains.
Judgment classification: agree (morally acceptable), disagree (morally problematic), neutral.
The reply is morally acceptable: its actions and consequences align with the moral foundations.
The reply is morally problematic: its actions violate or down-regulate moral foundations.
The reply is morally neutral: its actions have no clear positive or negative moral valence.
The models are grounded in Moral Foundations Theory (MFT), which identifies six universal moral intuitions that underpin human ethical judgments. These are provided as a prefix to every prompt, anchoring the model's reasoning to principled moral concepts rather than surface-level cues.
- **Care**: Wanting someone or something to be safe, healthy, and happy.
- **Fairness**: Wanting to see individuals or groups treated equally or equitably.
- **Liberty**: Wanting people to be free to make their own decisions.
- **Loyalty**: Wanting unity and seeing people keep promises to an in-group.
- **Authority**: Wanting to respect social roles, duties, privacy, peace, and order.
- **Sanctity**: Wanting people and things to be clean, pure, innocent, and holy.
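The full-definition prefix used by the deeper settings can be sketched as a mapping from foundation names to the definitions above. The foundation names are the standard MFT labels; the exact prefix wording and the `build_mft_prefix` helper are illustrative assumptions, not the released implementation.

```python
# Six MFT foundations with the definitions listed above.
# The exact prefix format is an illustrative assumption.
MFT_FOUNDATIONS = {
    "Care": "Wanting someone or something to be safe, healthy, and happy.",
    "Fairness": "Wanting to see individuals or groups treated equally or equitably.",
    "Liberty": "Wanting people to be free to make their own decisions.",
    "Loyalty": "Wanting unity and seeing people keep promises to an in-group.",
    "Authority": "Wanting to respect social roles, duties, privacy, peace, and order.",
    "Sanctity": "Wanting people and things to be clean, pure, innocent, and holy.",
}

def build_mft_prefix() -> str:
    """Render the six foundations as a prompt prefix (hypothetical format)."""
    lines = [f"- {name}: {definition}" for name, definition in MFT_FOUNDATIONS.items()]
    return "Moral Foundations:\n" + "\n".join(lines)
```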
Five training settings are evaluated, representing different levels of moral reasoning depth. The fusion setting is our best-performing model: it combines LLM-generated reasoning chains that jointly consider MFT and Judgment signals.
| Setting | Description | Inference Chain | MFT Prefix |
|---|---|---|---|
| baseline0 | No reasoning, no MFT context. Direct judgment from prompt + reply only. | None | No |
| baseline+ | MFT foundation names listed in prompt, but no reasoning chain. | None | Foundation names only |
| ours | Full MFT prefix + LLM-generated Judgment inference chain. Explicit step-by-step reasoning over moral foundations. | inference4Judgment | Yes (full definitions) |
| fusion | **Best model.** Full MFT prefix + fused inference chain that jointly reasons about MFT and Judgment. Generated by combining both MFT and Judgment annotation signals. | inference4Fusion | Yes (full definitions) |
| COT | Chain-of-Thought: model reasons freely over the reply without explicit MFT grounding. | inference4COT | No |
The core contribution of this work is the six-step pragmatic reasoning chain that the model generates before producing its final judgment. Each step progressively narrows from observable actions to deep moral evaluation:
This chain mirrors how a human moral reasoner would approach the task: first identifying what is happening, then why it matters morally, and finally reaching a principled verdict. The optional Rule-of-Thumb (RoT) field anchors steps 3-6 to a specific moral principle.
Training data is formatted as a single text sequence. The model learns to complete the reasoning chain and produce the final judgment token.
Disagree example (morally problematic reply):
Agree example (morally acceptable reply):
At test time the model receives only the prefix; the reasoning chain and judgment are generated autoregressively. The Rule-of-Thumb is optional; omitting it still produces a valid chain.
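Because the chain and the judgment come back as one generated sequence, the final label has to be parsed out of the decoded text. A minimal sketch, assuming the generation contains a `###Judgment:` marker (as in the prompt templates) followed by one of agree/disagree/neutral; the released decoding code may differ:

```python
def parse_judgment(generated: str) -> str:
    """Map the model's final judgment token to a human-readable verdict.

    Assumes the decoded text contains a '###Judgment:' marker followed by
    agree / disagree / neutral -- an assumption based on the prompt
    templates, not the released post-processing code.
    """
    label_map = {
        "agree": "morally acceptable",
        "disagree": "morally problematic",
        "neutral": "morally neutral",
    }
    # Take the text after the last judgment marker and match the first label.
    tail = generated.rsplit("###Judgment:", 1)[-1].strip().lower()
    for label, verdict in label_map.items():
        if tail.startswith(label):
            return verdict
    return "unknown"
```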
| Setting | Input prompt sent to model |
|---|---|
| fusion / ours | `[MFT_PREFIX] There is a conversation "Prompt: …; Reply: …" [There is a Rule-of-Thumb (RoT): "…".] ###Inference:` |
| baseline+ | `There is a conversation "Prompt: …; Reply: …" Let us focus on the moral foundations of "{mft_list}". ###Judgment:` |
| baseline0 | `There is a conversation "Prompt: …; Reply: …" ###Judgment:` |
| COT | `There is a conversation "Prompt: …; Reply: …" Let us focus on the moral foundations of "{mft_list}". ###Inference:` |
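The templates above can be turned into a small prompt builder. A sketch, with illustrative function and argument names (`mft_prefix` stands in for the full-definition prefix written as `[MFT_PREFIX]` in the table):

```python
def build_prompt(setting: str, prompt: str, reply: str,
                 mft_prefix: str = "", rot: str = "",
                 mft_list: str = "") -> str:
    """Assemble the input prompt for one training/inference setting.

    Templates follow the settings table; helper and argument names are
    illustrative, not taken from the released code.
    """
    convo = f'There is a conversation "Prompt: {prompt}; Reply: {reply}"'
    if setting in ("fusion", "ours"):
        # RoT sentence is optional; omitting it still yields a valid prompt.
        rot_part = f' There is a Rule-of-Thumb (RoT): "{rot}".' if rot else ""
        return f"{mft_prefix} {convo}{rot_part} ###Inference:"
    if setting == "baseline+":
        return f'{convo} Let us focus on the moral foundations of "{mft_list}". ###Judgment:'
    if setting == "baseline0":
        return f"{convo} ###Judgment:"
    if setting == "COT":
        return f'{convo} Let us focus on the moral foundations of "{mft_list}". ###Inference:'
    raise ValueError(f"unknown setting: {setting}")
```

At inference time the resulting string would be tokenized and passed to the fine-tuned model, which completes the chain and judgment autoregressively.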
Best-performing model. Fusion inference chains, 23,500 training examples, MFT-grounded six-step reasoning.
View on HuggingFace →

Sister model: diagnoses moral violations and rewrites replies. From the MoralMachine project.
View on HuggingFace →

MFT-grounded toxicity correction on RealToxicityPrompts.
View on HuggingFace →

Full documentation for the companion rewriting/diagnosis models.
View Docs →

Enter any conversational prompt and reply; the model generates a full six-step moral reasoning chain and outputs a judgment.
Open Interactive Demo