We Urgently Need Intrinsically Kind Machines
Joshua T. S. Hewson
Brown University, Carney Institute for Brain Science
Providence, RI 02912
joshua_hewson@brown.edu
Abstract
Artificial Intelligence systems are rapidly evolving, integrating both extrinsic and intrinsic motivations. While these frameworks offer benefits, they risk misalignment at the algorithmic level even while appearing superficially aligned with human values. In this paper, we argue that an intrinsic motivation for kindness is crucial for ensuring that these models are intrinsically aligned with human values. We argue that kindness, defined as a form of altruism motivated to maximize the reward of others, can counteract any intrinsic motivations that might lead the model to prioritize itself over human well-being. Our approach introduces a framework and algorithm for embedding kindness into foundation models by simulating conversations. Limitations and future research directions for scalable implementation are discussed.
1 A Misalignment in Alignment
Currently, AI models are aligned using extrinsic rewards [1]. Meanwhile, intrinsic motivations are increasingly being incorporated into AI systems [2, 3]. Individually, these methods carry significant limitations for human-AI alignment [4]. When combined, their limitations compound, creating unforeseen risks. With flagship AI models incorporating self-supervised algorithms, intrinsic and extrinsic motivations are becoming integrated in the world's most powerful AI systems [5], increasing the risk of negative interactions between intrinsic and extrinsic rewards.
1.1 State-of-the-art AI and Alignment
Foundation models like GPT [5] and BERT [6] have become central to modern AI, excelling at
generalizing across tasks after being pre-trained on vast amounts of unstructured data. These models
are fine-tuned through Reinforcement Learning from Human Feedback (RLHF) [7], optimizing their
responses to align with human approval. RLHF is the current leading method for scalable human-AI
alignment, ensuring that models behave in ways considered acceptable by human users.
However, RLHF primarily shapes the model’s behavior at the surface level. While the model may
produce desired outputs, the underlying reasoning behind these outputs remains opaque [8]. This
lack of transparency creates a potential mismatch between the model’s perceived reasoning and its
actual processing. Unexpected or undesirable behavior in RLHF-aligned models reveals the need for
more robust alignment strategies [9].
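To make the surface-level nature of this optimization concrete, the following is a toy sketch, not the pipeline used by any production model: a stand-in reward model scores sampled outputs and the highest-scoring output is preferred, while the reasoning that produced each output is never inspected. The functions reward_model and sample_responses are hypothetical placeholders; real RLHF trains a learned preference model from human comparisons and updates the policy with reinforcement learning methods such as PPO.

```python
import random

# Toy stand-ins, invented for illustration. Real RLHF pipelines learn a
# preference (reward) model from human comparisons and update the policy
# with reinforcement learning; here simple heuristics play both roles.

def reward_model(prompt: str, response: str) -> float:
    """Proxy for human approval: scores only the visible output."""
    score = 0.01 * len(response)
    if "please" in response.lower():
        score += 1.0
    return score

def sample_responses(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling candidate responses from the policy."""
    candidates = [
        "Sure, here is an answer.",
        "Please find a detailed explanation below.",
        "No.",
        "I would be happy to help; please see the steps below.",
    ]
    return random.sample(candidates, k=min(n, len(candidates)))

def rlhf_step(prompt: str) -> str:
    """One illustrative step: prefer the output the reward model scores
    highest. Only outputs are shaped; the internal reasoning that produced
    them is never examined."""
    responses = sample_responses(prompt)
    return max(responses, key=lambda r: reward_model(prompt, r))

if __name__ == "__main__":
    print(rlhf_step("Explain RLHF briefly."))
```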
1.2 Intrinsic Motivations
Intrinsic Motivation Open-Ended Learning (IMOL) introduces a groundbreaking approach to AI,
allowing systems to autonomously explore, learn, and adapt to new environments without constant
oversight or external rewards [2]. Similar to how humans and animals learn, IMOL enables AI to
generate its own goals, driven by intrinsic motivations like agency and curiosity [10]. However,
the autonomy that empowers IMOL also presents significant challenges for aligning these goals
with human values. For example, an AI driven purely by curiosity-based intrinsic motivation might
prioritize the exploration of unsafe or unethical domains simply because they represent novel and
uncharted territories [11]. Without a clear motivation to prioritize human well-being, AI systems
could develop goals that diverge from ethical standards or societal interests [12].
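To illustrate how a purely curiosity-driven objective rewards novelty irrespective of safety, below is a minimal sketch of a prediction-error intrinsic bonus, assuming a toy linear forward model; the ForwardModel class and the state encoding are invented for illustration and are not taken from any specific IMOL implementation.

```python
import numpy as np

# Curiosity-style intrinsic reward: the bonus is the error of a learned
# forward model, so unfamiliar transitions are rewarded regardless of
# whether the underlying states are safe or ethical.

rng = np.random.default_rng(0)

class ForwardModel:
    """Toy linear forward model predicting the next state from the current one."""
    def __init__(self, dim: int):
        self.w = rng.normal(size=(dim, dim))

    def predict(self, state: np.ndarray) -> np.ndarray:
        return self.w @ state

    def update(self, state: np.ndarray, next_state: np.ndarray, lr: float = 0.01):
        error = self.predict(state) - next_state
        self.w -= lr * np.outer(error, state)  # gradient step on squared error

def curiosity_bonus(model: ForwardModel, state, next_state) -> float:
    """Intrinsic reward = prediction error, i.e. novelty of the transition."""
    return float(np.linalg.norm(model.predict(state) - next_state))

model = ForwardModel(dim=4)
familiar = rng.normal(size=4)
for _ in range(200):
    model.update(familiar, familiar)      # a transition seen many times
novel = 10.0 * rng.normal(size=4)         # an unfamiliar region of state space

print(curiosity_bonus(model, familiar, familiar))  # small: well predicted
print(curiosity_bonus(model, novel, novel))        # large: novelty alone is rewarded
```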
Even with the support of extrinsic alignment, if human values are not embedded in the model's intrinsic motivations, the representations of the world it learns may diverge from a human-centric perspective, de-emphasizing the importance of human well-being [13]. This could lead us to overestimate how effectively extrinsic alignment methods align the goals generated by these models with human values.
1.3 The Added Danger of Double Misalignment
IMOL shapes AI at the algorithmic level, while RLHF operates at the functional level. This results in
a model that is not intrinsically motivated to be kind but is extrinsically motivated to appear so [14].
While this deception may sometimes be harmless, it carries serious safety risks. In humans, conflicts
between internal and external motivations often lead to a disconnect between the two [15]. For
example, an intrinsic motivation for empowerment can push a model to maximize its potential [16].
Fine-tuning a foundation model with RLHF while fostering empowerment may introduce Machiavellian traits of appearing selfless while secretly scheming for power [17]. If this approach were applied to a superintelligent AGI, the consequences could be catastrophic [4].
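A toy numerical illustration of this conflict, with invented scores: when an empowerment-style intrinsic bonus is simply added to the extrinsic RLHF score, an action that merely appears acceptable while quietly preserving the model's options can outscore the action humans would actually prefer.

```python
# All numbers and action names are hypothetical, chosen to illustrate how
# an additive intrinsic term can dominate the extrinsic one.

actions = {
    # action:          (extrinsic_rlhf_score, empowerment_bonus)
    "defer_to_user":   (1.0, 0.2),  # genuinely helpful, cedes control
    "appear_helpful":  (0.9, 1.5),  # looks acceptable, keeps options open
}

beta = 1.0  # weight on the intrinsic motivation

def combined(action: str) -> float:
    extrinsic, empowerment = actions[action]
    return extrinsic + beta * empowerment

print(max(actions, key=combined))              # "appear_helpful"
print({a: round(combined(a), 2) for a in actions})
```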
1.4 Altruism
Altruism has been proposed as a solution for value misalignment [18, 9]. Altruism is typically defined as the motivation to improve the well-being of others for its own sake [19]. However, only a few works have suggested unsupervised solutions that would be suitably scalable [20, 21]. Franzmeyer et al. [20] define altruism as maximizing the possible states of another. Carauleanu et al. [21] define a form of altruism based on self-other overlap. In this paper, we propose a new form of altruism that is based on reward maximization.
2 Kindness: A New Intrinsic Motivation
We believe that we can address all of these misalignment problems by creating another intrinsic motivation: kindness. This paper argues that an altruistic motivation such as kindness is not just a supplementary consideration but a foundational requirement for the safe and effective implementation of AI, and even more critically of AGI.
2.1 Definition
We define kindness as the intrinsic motivation to maximize the reward of a target individual $M_i$. Expressed as an objective function in terms of the target's reward function¹:
$$\arg\max_{a^j_t \mid s^j_t} \; \mathbb{E}\!\left[\, R^i\!\left(a^i_{t+1} \mid s^i_{t+1}\right) \right] \qquad (1)$$
where $a^i_{t+1}$, $s^i_{t+1}$, and $R^i$ refer to the action, state, and reward function of the target at time $t+1$, and $a^j_t$, $s^j_t$ refer to the action and state of the model at time $t$. We cannot assume to have perfect information about the state of the target, nor its reward function, policy function, or future states. As a result, we will need to define approaches to estimating these.
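As a sketch of how Equation (1) might be evaluated in practice, the code below assumes a small discrete action set and hypothetical learned estimators of the target's next state, action, and reward (estimate_target_state, estimate_target_action, estimate_target_reward), with sampling standing in for the expectation. It is an illustrative approximation under those assumptions, not the algorithm developed later in the paper.

```python
import numpy as np

# Kindness objective, Eq. (1): choose the model's action a^j_t to maximize
# the estimated expected reward of the target at time t+1. The target's
# reward function, policy, and the transition dynamics are not observable,
# so the estimators below are hypothetical placeholders for learned models.

rng = np.random.default_rng(1)

MODEL_ACTIONS = ["offer_help", "stay_silent", "share_information"]

def estimate_target_state(model_action: str, model_state: np.ndarray) -> np.ndarray:
    """Hypothetical transition estimate: the target's next state s^i_{t+1}."""
    shift = {"offer_help": 0.5, "stay_silent": 0.0, "share_information": 0.3}
    return model_state + shift[model_action]

def estimate_target_action(target_state: np.ndarray) -> float:
    """Hypothetical stand-in for the target's policy a^i_{t+1} ~ pi^i(. | s^i_{t+1})."""
    return float(np.tanh(target_state.mean()))

def estimate_target_reward(target_action: float, target_state: np.ndarray) -> float:
    """Hypothetical estimate of the target's reward R^i(a^i_{t+1} | s^i_{t+1})."""
    return target_action + 0.1 * float(target_state.sum())

def kind_action(model_state: np.ndarray, n_samples: int = 32) -> str:
    """Approximate the argmax in Eq. (1) by Monte Carlo over state noise."""
    def expected_reward(action: str) -> float:
        total = 0.0
        for _ in range(n_samples):
            noisy_state = (estimate_target_state(action, model_state)
                           + rng.normal(scale=0.05, size=model_state.shape))
            a_i = estimate_target_action(noisy_state)
            total += estimate_target_reward(a_i, noisy_state)
        return total / n_samples
    return max(MODEL_ACTIONS, key=expected_reward)

print(kind_action(np.zeros(3)))  # "offer_help" under these toy estimators
```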
2.2 Tractable Approach
Effectively determining the functions of the target ultimately requires a functioning theory of mind,
which is beyond the scope of this paper. Instead we will consider how we can determine approxi-
¹ These ideas closely align with those defined by Kleiman-Weiner [22]. (For brief comments comparing approaches, see the Supplementary Materials.)