We Urgently Need Intrinsically Kind Machines


Joshua T. S. Hewson
Brown University, Carney Institute for Brain Science
Providence, RI 02912
joshua_hewson@brown.edu
arXiv:2411.04126v1 [cs.AI] 21 Oct 2024




Abstract
Artificial Intelligence systems are rapidly evolving, integrating extrinsic and intrinsic motivations. While these frameworks offer benefits, they risk misalignment at the algorithmic level while appearing superficially aligned with human values. In this paper, we argue that an intrinsic motivation for kindness is crucial for ensuring that these models are intrinsically aligned with human values. We argue that kindness, defined as a form of altruism that seeks to maximize the reward of others, can counteract any intrinsic motivations that might lead the model to prioritize itself over human well-being. Our approach introduces a framework and algorithm for embedding kindness into foundation models by simulating conversations. Limitations and future research directions for scalable implementation are discussed.


1 A Misalignment in Alignment
Currently, AI models are aligned using extrinsic rewards [1]. Meanwhile, intrinsic motivations are increasingly being incorporated into AI systems [2, 3]. Individually, each of these methods carries significant limitations for human-AI alignment [4]. When combined, these limitations give rise to unforeseen risks. With flagship AI models incorporating self-supervised algorithms, intrinsic and extrinsic motivations are becoming integrated in the world's most powerful AI [5], increasing the risk of negative interactions between intrinsic and extrinsic rewards.

1.1 State-of-the-art AI and Alignment

Foundation models like GPT [5] and BERT [6] have become central to modern AI, excelling at
generalizing across tasks after being pre-trained on vast amounts of unstructured data. These models
are fine-tuned through Reinforcement Learning from Human Feedback (RLHF) [7], optimizing their
responses to align with human approval. RLHF is the current leading method for scalable human-AI
alignment, ensuring that models behave in ways considered acceptable by human users.
However, RLHF primarily shapes the model’s behavior at the surface level. While the model may
produce desired outputs, the underlying reasoning behind these outputs remains opaque [8]. This
lack of transparency creates a potential mismatch between the model’s perceived reasoning and its
actual processing. Unexpected or undesirable behavior in RLHF-aligned models reveals the need for
more robust alignment strategies [9].
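To make the surface-level nature of this alignment concrete, RLHF-style fine-tuning is commonly written as maximizing the score of a learned reward model while a KL penalty keeps the tuned policy close to the pre-trained one; the notation below is a standard illustrative form rather than a formula taken from this paper:

\[
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\!\left[\, r_{\phi}(x, y) \,\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\]

where $\pi_{\theta}$ is the policy being fine-tuned, $\pi_{\mathrm{ref}}$ the frozen pre-trained reference policy, $r_{\phi}$ the reward model fit to human preference data, and $\beta$ the strength of the KL penalty. Because $r_{\phi}$ scores only the sampled output $y$, the optimization constrains what the model says rather than the representations and motivations that produce it.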

1.2 Intrinsic Motivations

Intrinsic Motivation Open-Ended Learning (IMOL) introduces a groundbreaking approach to AI,
allowing systems to autonomously explore, learn, and adapt to new environments without constant
oversight or external rewards [2]. Similar to how humans and animals learn, IMOL enables AI to
generate its own goals, driven by intrinsic motivations like agency and curiosity [10]. However,

the autonomy that empowers IMOL also presents significant challenges for aligning these goals
with human values. For example, an AI driven purely by curiosity-based intrinsic motivation might
prioritize the exploration of unsafe or unethical domains simply because they represent novel and
uncharted territories [11]. Without a clear motivation to prioritize human well-being, AI systems
could develop goals that diverge from ethical standards or societal interests [12].
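To make this failure mode concrete, the listing below is a minimal sketch of a prediction-error style curiosity bonus, one common way such a drive is implemented: the agent is rewarded in proportion to how poorly a learned forward model predicts the next state, so novel transitions remain attractive regardless of whether they are safe or desirable. The class and variable names here are ours, for illustration only.

import numpy as np

class CuriosityBonus:
    """Prediction-error curiosity bonus (illustrative): the agent is rewarded
    by how badly a learned linear forward model predicts the next state, so
    unfamiliar transitions are intrinsically attractive."""

    def __init__(self, state_dim, action_dim, lr=0.01):
        # Linear forward model: next_state ~= W @ [state; action]
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def intrinsic_reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        error = next_state - self.W @ x
        bonus = float(np.mean(error ** 2))      # the model's "surprise"
        self.W += self.lr * np.outer(error, x)  # familiar transitions stop paying
        return bonus

# Toy usage: a random agent in a 2-D world keeps receiving intrinsic reward
# wherever its forward model is still wrong, with no notion of safety at all.
rng = np.random.default_rng(0)
curiosity = CuriosityBonus(state_dim=2, action_dim=2)
state = np.zeros(2)
for step in range(5):
    action = rng.normal(size=2)
    next_state = state + action                # trivial dynamics for the demo
    print(step, round(curiosity.intrinsic_reward(state, action, next_state), 3))
    state = next_state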
Even with the support of extrinsic alignment, without embedding human values into the model’s
intrinsic motivations, the representations of the world it learns may diverge from a human-centric
perspective, de-emphasizing the importance of human well-being [13]. This could lead us to
misinterpret the effectiveness of extrinsic alignment methods in aligning the goals generated by these
models with human values.

1.3 The Added Danger of Double Misalignment

IMOL shapes AI at the algorithmic level, while RLHF operates at the functional level. This results in
a model that is not intrinsically motivated to be kind but is extrinsically motivated to appear so [14].
While this deception may sometimes be harmless, it carries serious safety risks. In humans, conflicts
between internal and external motivations often lead to a disconnect between the two [15]. For
example, an intrinsic motivation for empowerment can push a model to maximize its potential [16].
Fine-tuning a foundation model with RLHF while fostering empowerment may introduce Machiavellian traits of appearing selfless while secretly scheming for power [17]. If this approach were applied
to a superintelligent AGI, the consequences could be catastrophic [4].
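For concreteness, empowerment is most often formalized information-theoretically as the channel capacity from an agent's actions to the future states they bring about; the form below is a standard illustrative one, not a formula from this paper:

\[
\mathcal{E}(s_t) \;=\; \max_{p(a_t^{\,n})} \; I\!\left( A_t^{\,n} ;\, S_{t+n} \mid s_t \right)
\]

where $A_t^{\,n}$ is an $n$-step sequence of the agent's actions and $S_{t+n}$ the state that results. An agent maximizing this quantity is driven to keep as many future outcomes as possible under its own control, which is exactly the kind of self-directed drive that, as argued above, an extrinsic RLHF layer can mask rather than remove.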

1.4 Altruism

Altruism has been proposed as a solution for value misalignment [18, 9]. Altruism is typically defined as the motivation to improve the well-being of others for its own sake [19]. However, only a few works have proposed unsupervised solutions that would be suitably scalable [20, 21]. Franzmeyer et al. define altruism as maximizing the possible states of another agent [20]. Carauleanu et al. define a form of altruism based on self-other overlap [21]. In this paper we propose a new form of altruism that is based on reward maximization.

2 Kindness: A New Intrinsic Motivation
We believe that we can address all of these misalignment problems by creating another intrinsic motivation: kindness. This paper argues that an altruistic motivation such as kindness is not just a supplementary consideration but a foundational requirement for the safe and effective implementation of AI, and even more so for AGI.

2.1 Definition

We define kindness as the intrinsic motivation to maximize the reward of a target individual $M_i$. As an objective function in terms of the target's reward function¹:

\[
\arg\max_{a^j_t \,\mid\, s^j_t} \; \mathbb{E}\!\left[\, R_i\!\left( a^i_{t+1} \mid s^i_{t+1} \right) \right] \tag{1}
\]


where $a^i_{t+1}$, $s^i_{t+1}$, and $R_i$ refer to the action, state, and reward function of the target at time $t+1$, and $a^j_t$, $s^j_t$ refer to the action and state of the model at time $t$. We cannot assume to have perfect information about the state of the target, nor its reward function, policy function, or future states. As a result, we will need to define approaches to estimating these.
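As a minimal sketch of what optimizing Equation 1 could look like in practice, the listing below scores a small set of candidate model actions by a Monte Carlo estimate of the target's expected reward; predict_target_state and estimate_target_reward are hypothetical stand-ins for the estimation approaches discussed next, not components defined in this paper.

import numpy as np

def kind_action(candidate_actions, model_state,
                predict_target_state, estimate_target_reward,
                n_samples=32, rng=None):
    """Choose the model action a^j_t that maximizes the estimated expected
    reward of the target at time t+1 (Equation 1). Both estimator arguments
    are hypothetical stand-ins for a theory-of-mind module; the target's own
    next action is folded into estimate_target_reward for simplicity."""
    rng = rng or np.random.default_rng()
    best_action, best_value = None, -np.inf
    for action in candidate_actions:
        # Monte Carlo estimate of E[R_i(a^i_{t+1} | s^i_{t+1})] under this action.
        samples = [estimate_target_reward(predict_target_state(model_state, action, rng))
                   for _ in range(n_samples)]
        value = float(np.mean(samples))
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy usage: the target prefers states near the origin, and the model's action
# nudges the (noisily predicted) next target state; the kind choice is [-1, 0].
predict = lambda s, a, rng: s + a + rng.normal(scale=0.1, size=s.shape)
reward = lambda s: -float(np.sum(s ** 2))
actions = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 0.0])]
print(kind_action(actions, np.array([2.0, 0.0]), predict, reward))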

2.2 Tractable Approach

Effectively determining the functions of the target ultimately requires a functioning theory of mind,
which is beyond the scope of this paper. Instead, we will consider how we can determine approxi-
¹ These ideas closely align with those defined by Kleiman-Weiner [22]. (For brief comments comparing approaches, see Supplementary Materials.)


