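Contents

Assignment 1:
- 1. Cosine similarity
- 2. Word analogy
- 3. Debiasing word vectors
  - 3.1 Neutralizing bias for non-gender-specific words
  - 3.2 Equalization algorithm for gender-specific words
Assignment 2: Emojify (emoji generation)
- 1. Baseline model: Emojifier-V1
  - 1.1 Dataset
  - 1.2 Model overview
  - 1.3 Implementing Emojifier-V1
  - 1.4 Testing on the training set
- 2. Emojifier-V2: Using LSTMs in Keras
  - 2.1 Model overview
  - 2.2 Keras and mini-batching
  - 2.3 The Embedding layer
  - 2.3 Building Emojifier-V2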

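Quiz: see the referenced blog post

Notes: W2. Natural Language Processing and Word Embeddings

Assignment 1:

Load pre-trained word vectors and measure similarity with the cosine of the angle between them, $\cos(\theta)$
Use word embeddings to solve analogy problems
Modify word embeddings to reduce gender bias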

import numpy as np
from w2v_utils import *

This assignment represents words with 50-dimensional GloVe vectors.

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

1. Cosine similarity

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta)$$

where $u \cdot v$ is the dot product of $u$ and $v$, $\theta$ is the angle between them, and $\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$ is the L2 norm of $u$.

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    
    distance = 0.0
    
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot/(norm_u*norm_v)
    ### END CODE HERE ###
    
    return cosine_similarity
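
To make the formula concrete, here is a minimal, dependency-free sketch that evaluates it on a few hand-picked toy vectors. The helper name `cosine_similarity_plain` and the vectors are assumptions made for illustration; they are not part of the assignment, which uses the NumPy version above.

```python
import math


def cosine_similarity_plain(u, v):
    """Cosine similarity of two equal-length sequences of numbers (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))      # u . v
    norm_u = math.sqrt(sum(a * a for a in u))   # ||u||_2
    norm_v = math.sqrt(sum(b * b for b in v))   # ||v||_2
    return dot / (norm_u * norm_v)


# Toy vectors chosen by hand, not taken from GloVe.
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity_plain([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1. The measure depends only on the angle between the vectors, not on their lengths.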

2. Word analogy

Example: man : woman --> king : queen

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a],word_to_vec_map[word_b],word_to_vec_map[word_c]
    ### END CODE HERE ###
    
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:        
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c] :
            continue
        
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
        
        # If the cosine_sim is more than the max_cosine_sim seen so far,
            # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
        
    return best_word

Test:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print ('{} -> {} :: {} -> {}'.format( *triad, complete_analogy(*triad,word_to_vec_map)))

Output:

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

Additional tests:

good -> ok :: bad -> oops
father -> dad :: mother -> mom
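
The search performed by `complete_analogy` can be traced by hand on a tiny vocabulary. The sketch below is a simplified, dependency-free illustration. The two-dimensional vectors in `toy_vectors` are invented for this example (first axis roughly "gender", second roughly "royalty"), and `toy_analogy` is a hypothetical helper, not the graded function.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made 2-D "embeddings" (assumed values, not GloVe).
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}


def toy_analogy(a, b, c, vectors):
    """Return the word w maximizing cosine(e_b - e_a, e_w - e_c), skipping the inputs."""
    target = [x - y for x, y in zip(vectors[b], vectors[a])]   # e_b - e_a
    best_word, best_sim = None, -math.inf
    for w, vec in vectors.items():
        if w in (a, b, c):
            continue
        diff = [x - y for x, y in zip(vec, vectors[c])]        # e_w - e_c
        sim = cosine(target, diff)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word


result = toy_analogy("man", "woman", "king", toy_vectors)
print(result)  # queen
```

Here `queen - king` points in exactly the same direction as `woman - man`, so its cosine similarity is 1 and it beats the only other candidate, `apple`.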

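3. Debiasing word vectors

Examine the gender bias reflected in word embeddings and explore algorithms for reducing it. As a first step, compute the difference vector g between the embeddings of "woman" and "man", which serves as a rough gender direction.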

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)

Output: a 50-dimensional vector

[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]

print ('List of names and their similarities with constructed vector:')

# girls' and boys' names
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Output:

List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.31868789859418784
ronaldo -0.31244796850329437
priya 0.17632041839009402
rahul -0.16915471039231716
danielle 0.24393299216283895
reza -0.07930429672199553
katy 0.2831068659572615
yasmin 0.2331385776792876

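As the output shows, female first names tend to have a positive cosine similarity with the vector g, while male first names tend to have a negative one.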
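
The sign pattern in the output above can be reproduced on toy data. The following is a minimal sketch under stated assumptions: the two-dimensional vectors and the keys `name_f` and `name_m` are invented for illustration, and `g_toy` plays the role of `g` from the code above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D embeddings; the first axis carries the "gender" component.
toy = {
    "man": [1.0, 0.2],
    "woman": [-1.0, 0.2],
    "name_f": [-0.6, 0.5],   # stands in for a typically female first name
    "name_m": [0.7, 0.4],    # stands in for a typically male first name
}

# Difference direction, analogous to g = e_woman - e_man.
g_toy = [w - m for w, m in zip(toy["woman"], toy["man"])]

similarities = {name: cosine(toy[name], g_toy) for name in ("name_f", "name_m")}
for name, sim in similarities.items():
    print(name, "positive" if sim > 0 else "negative")
```

A vector that leans toward the "woman" side of the difference direction gets a positive similarity, and one that leans toward the "man" side gets a negative one, which is the same pattern seen in the list of real names.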

05. Sequence Models W2. Natural Language Processing and Word Embeddings (Assignment: Word Vectors + Emoji Generation)
