Computing the similarity between word lists

25-01-29 28

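This article looks at how to compute the similarity between word lists. It also collects answers to related questions: android – how to measure the similarity (%) between 2 "black and white" images, bash – counting how many times each word in a word list appears in a file, C# .NET – comparing the similarity of 2 strings (using cosine similarity), and java – how to efficiently compute the cosine similarity between millions of strings.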

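Contents:

Computing the similarity between word lists
android – How to measure the similarity (%) between 2 "black and white" images?
bash – Counting how many times each word in a word list appears in a file?
C# .NET – Comparing the similarity of 2 strings (using cosine similarity)
java – How to efficiently compute the cosine similarity between millions of strings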

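Computing the similarity between word lists

I want to compute the similarity between two word lists. For example, this list:

['email', 'user', 'this', 'email', 'address', 'customer']

should be similar to this one:

['email', 'mail', 'address', 'netmail']

I would expect that pair to score a higher similarity percentage than a pairing with another list such as ['address', 'ip', 'network'], even though address also appears in that list.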

Answer 1

Since you have not yet shown a clear expected output, here is my best shot:

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

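For the two lists above, we compute the cosine similarity between every element of one list and every element of the other, i.e. email from list_B against each element of list_A, and so on. Each word is treated as a vector of character counts: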

from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c * c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity: dot product divided by both norms
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
threshold = 0.80     # if needed

for key in list_A:
    for word in list_B:
        try:
            # print(key)
            # print(word)
            res = cosdis(word2vec(word), word2vec(key))
            # print(res)
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res * 100))
            # if res > threshold:
            #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))
        except IndexError:
            pass

Output:

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365

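Note: I have left the threshold part commented out in the code, in case you only need the word pairs whose similarity exceeds some threshold (e.g. 80%).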

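Edit:

OP: But what I want is not a word-by-word comparison but a list-by-list one.

Using Counter and math, each list becomes a vector of word counts, and the cosine similarity is computed between the two count vectors: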

from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)

Output:

53.03300858899106
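
To check this against the expectation stated in the question, the same Counter-based measure can be applied to the other list the question mentions. The sketch below is self-contained; `list_C` and the helper `list_similarity` are names introduced here for illustration, and the zero-length guard is an addition that the answer's code does not have.

```python
from collections import Counter
import math

def counter_cosine_similarity(c1, c2):
    # Cosine similarity between two word-count vectors stored as Counters.
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    if magA == 0 or magB == 0:
        # Added guard (assumption): an empty list is treated as 0% similar.
        return 0.0
    return dotprod / (magA * magB)

def list_similarity(words_1, words_2):
    # Hypothetical helper: similarity of two word lists as a percentage.
    return counter_cosine_similarity(Counter(words_1), Counter(words_2)) * 100

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

sim_AB = list_similarity(list_A, list_B)
sim_AC = list_similarity(list_A, list_C)
print("A vs B: {:.2f}%".format(sim_AB))
print("A vs C: {:.2f}%".format(sim_AC))
```

With these inputs the pair A/B scores higher than A/C, which matches what the question asks for. Note that this measure only counts exact word matches, so mail and netmail contribute nothing to the A/B score even though they resemble email.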

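android – How to measure the similarity (%) between 2 "black and white" images?

I have searched every relevant keyword I could think of, but the results are not what I am looking for, because most of the algorithms I found focus heavily on COLOR.

The idea of my app is to find the pairs of images with the highest similarity.

For example, my input is a, and the image pool contains b, c, d, e.
The result would be b (90%), d (85%), e (80%), c (20%).

My question is: what approach can be used to compute this kind of "image similarity"?
Or do I have to build my own code from scratch?

Solution:

You can look at shape context (http://en.wikipedia.org/wiki/Shape_context) to compute the similarity between shapes. Many implementations of shape context are available on the internet, and the full paper can be read at http://www.cs.berkeley.edu/~malik/papers/BMP-shape.pdf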
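
Shape context itself is too involved for a short example. To illustrate only the ranking the question describes, here is a minimal sketch that swaps in a much simpler measure: the pixel overlap (Jaccard index) of two binary images. This is not shape context and is sensitive to shifts and scaling. The function names and the representation of images as nested lists of 0/1 are assumptions made for this example.

```python
def jaccard_similarity(img_a, img_b):
    """Percentage overlap of the foreground (1) pixels of two binary images."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    both = either = 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                both += 1      # foreground in both images
            if pa or pb:
                either += 1    # foreground in at least one image
    # Two completely empty images are treated as identical.
    return 100.0 if either == 0 else 100.0 * both / either

def rank_by_similarity(query, pool):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: jaccard_similarity(query, img) for name, img in pool.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

a = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 0]]
pool = {
    "b": [[1, 1, 0], [1, 0, 0], [0, 0, 0]],
    "c": [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
}
ranking = rank_by_similarity(a, pool)
print(ranking)
```

Replacing `jaccard_similarity` with a shape-context distance would keep the ranking step unchanged.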

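bash – Counting how many times each word in a word list appears in a file?

I have a file list.txt that contains a list of words. I want to check how many times each word appears in another file, file1.txt, and then output the results. A plain output of all the numbers would be enough, since I can add them to list.txt by hand with a spreadsheet program, but it would be even better if the script appended the number to the end of each line of list.txt, for example:

bear 3
fish 15

I tried this, but it does not work:

cat list.txt | grep -c file1.txt

You can do this in a loop that reads one word at a time from the word-list file and then counts its occurrences in the data file. For example: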

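# With no variable name given, read stores each line in REPLY.
while IFS= read -r; do
    printf '%s ' "$REPLY"
    # -F: fixed string, -o: one match per output line, -w: whole words only
    grep -Fow -- "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)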

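The "secret sauce" consists of:

- using the implicit REPLY variable;
- using process substitution to collect the words from the word-list file; and
- making sure you search for whole words in the data file.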
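
For comparison, the same whole-word counting can be sketched in Python. This is an illustration rather than a drop-in replacement for the shell loop: the function name `count_listed_words` is made up here, and it assumes every listed word consists only of letters, digits, and underscores, so that splitting the text on `\w+` matches what a whole-word search would find.

```python
import re
from collections import Counter

def count_listed_words(words, text):
    """Return how often each listed word occurs as a whole word in text."""
    # Tokenize once, then look each word up; partial matches such as
    # "fishing" for "fish" are separate tokens and are not counted.
    token_counts = Counter(re.findall(r"\w+", text))
    return {word: token_counts[word] for word in sorted(set(words))}

words = ["bear", "fish", "bear"]          # duplicates are collapsed, like sort -u
text = "bear fish bear\nfishing fish bear"
result = count_listed_words(words, text)
for word, count in result.items():
    print(word, count)
```

To use it on files, read the word list with `open(...).read().split()` and the data file with `open(...).read()`; the file names are whatever your setup uses.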

C# .NET – Comparing the similarity of 2 strings (using cosine similarity)

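Copy the code and use it:

        // Requires: using System; using System.Collections.Generic; using System.Linq;
        /// <summary>
        /// Compares the similarity of 2 strings (using cosine similarity)
        /// </summary>
        /// <param name="str1"></param>
        /// <param name="str2"></param>
        /// <returns>A number between 0 and 1</returns>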
        public static double SimilarityCos(string str1, string str2)
        {
            // Check for null/blank input before calling Trim(), which would throw on null
            if (string.IsNullOrWhiteSpace(str1) || string.IsNullOrWhiteSpace(str2))
                return 0;
            str1 = str1.Trim();
            str2 = str2.Trim();

            List<string> lstr1 = SimpParticiple(str1);
            List<string> lstr2 = SimpParticiple(str2);
            // Union of the tokens from both strings
            var strUnion = lstr1.Union(lstr2);
            // Build the count vector of each string over the union
            List<int> int1 = new List<int>();
            List<int> int2 = new List<int>();
            foreach (var item in strUnion)
            {
                int1.Add(lstr1.Count(o => o == item));
                int2.Add(lstr2.Count(o => o == item));
            }

            double s = 0;
            double den1 = 0;
            double den2 = 0;
            for (int i = 0; i < int1.Count(); i++)
            {
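                // Numerator: dot product
                s += int1[i] * int2[i];
                // Denominator (1): squared norm of the first vector
                den1 += Math.Pow(int1[i], 2);
                // Denominator (2): squared norm of the second vector
                den2 += Math.Pow(int2[i], 2);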
            }

            return s / (Math.Sqrt(den1) * Math.Sqrt(den2));
        }

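        /// <summary>
        /// Simple tokenization: splits the string into single characters. For better results, improve this step, e.g. segment 【今天天气很好】 into 【今天，天气，很好】; synonym normalization such as 【今天】=【今日】 would improve results further.
        /// </summary>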
        public static List<string> SimpParticiple(string str)
        {
            List<string> vs = new List<string>();
            foreach (var item in str)
            {
                vs.Add(item.ToString());
            }
            return vs;
        }
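
The comment on the tokenizer suggests that results depend on how the string is split. The Python sketch below mirrors the structure of the C# method and makes the tokenizer a parameter, so character-level and word-level splitting can be compared. It is a rough counterpart, not a translation of the author's code; `similarity_cos` and its `tokenize` parameter are names chosen for this example.

```python
import math
from collections import Counter

def similarity_cos(str1, str2, tokenize=list):
    """Cosine similarity (0 to 1) of two strings over their token counts.

    tokenize defaults to list, which splits a string into single
    characters like the simple tokenizer above; pass str.split to
    compare whitespace-separated words instead.
    """
    if str1 is None or str2 is None:
        return 0.0
    tokens1 = tokenize(str1.strip())
    tokens2 = tokenize(str2.strip())
    if not tokens1 or not tokens2:
        return 0.0
    c1, c2 = Counter(tokens1), Counter(tokens2)
    # Only tokens present in both strings contribute to the dot product.
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

char_level = similarity_cos("mail", "email")
word_level = similarity_cos("today is nice", "today is cold", tokenize=str.split)
print(round(char_level, 4), round(word_level, 4))
```

A proper word segmenter for Chinese text could be passed as `tokenize` in the same way; which segmenter to use is left open here.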

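java – How to efficiently compute the cosine similarity between millions of strings

I need to compute the cosine similarity between the strings in a list. For example, I have a list of more than 10 million strings, and for each string I must determine its similarity to every other string in the list. What is the best algorithm I can use to do this efficiently and quickly? Would a divide-and-conquer algorithm apply?

Edit

I want to determine which strings are most similar to a given string, and to obtain a measure/score for that similarity. I think what I want to do is along the lines of clustering, where the number of clusters is not known in advance.

Solution

Use the transposed matrix. That is what Mahout does on Hadoop to get this kind of task done quickly (or just use Mahout).

Essentially, the naive way of computing cosine similarity is bad, because you end up computing a lot of "0 * something" products. Instead, you are better off working column by column and leaving all the zeros out.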
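
The column-wise idea can be sketched with an inverted index: instead of comparing every pair of rows, walk each feature (column) and accumulate products only for the strings that actually contain it. The sketch below is a single-machine illustration under stated assumptions: it uses character counts as features and a plain dictionary as the index, and it does not reproduce what Mahout does internally. The function name `all_pairs_cosine` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def all_pairs_cosine(strings, threshold=0.0):
    """Cosine similarity for every pair of strings that shares a feature."""
    # Transposed view: feature -> list of (string index, normalized weight).
    index = defaultdict(list)
    for i, s in enumerate(strings):
        counts = Counter(s)  # assumption: character counts as features
        norm = math.sqrt(sum(v * v for v in counts.values()))
        for feature, v in counts.items():
            index[feature].append((i, v / norm))

    # Accumulate dot products column by column. Pairs with no feature in
    # common are never visited, so no "0 * something" work is done.
    scores = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, wi = postings[a]
            for b in range(a + 1, len(postings)):
                j, wj = postings[b]
                scores[(i, j)] += wi * wj
    return {pair: s for pair, s in scores.items() if s >= threshold}

strings = ["mail", "email", "xyz"]
pairs = all_pairs_cosine(strings)
for (i, j), score in pairs.items():
    print(strings[i], strings[j], round(score, 4))
```

Features shared by most strings still produce long posting lists, so at the scale of 10 million strings this would need pruning of very common features or a distributed implementation.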
