haskell – Efficient parallel strategies

Category: Programming Languages · Published: 6 years ago

Translated from: http://stackoverflow.com/questions/14624376/efficient-parallel-strategies

I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try using them with more than 1 core, the program slows down considerably.

For example, a while ago I tried to compute histograms (and from them the unique words) from ~700 documents. I thought that using file-level granularity would be OK. With -N4 I get a work balance of 1.70. However, with -N1 it runs in half the time that it does with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize, and to gain some understanding of it. How would this be parallelized so that the speed increases with the number of cores instead of decreasing?

{-# LANGUAGE OverloadedStrings #-} -- needed so the Text stop-word literals below type-check
import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')

mapReduce stratm m stratr r xs = let
  mapped = parMap stratm m xs
  reduced = r mapped `using` stratr
  in mapped `pseq` reduced

type Histogram = Map Text Int

rootDir = "/home/masse/Documents/text_conversion/"

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)

textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "." = True
        meta ".." = True
        meta _ = False

histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words

wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where
    reduce = M.unions

main = do
  list <- wordList
  print $ M.size list

As for the text files, I'm using PDFs converted to text files, so I can't provide them, but for this purpose almost any book(s) from Project Gutenberg should do.

Edit: added the imports to the script.

In practice, getting the parallel combinators to scale well can be difficult. Others have already mentioned making your code more strict to make sure you are actually doing the work in parallel, and that is definitely important.
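For example, one way to apply that advice to the question's mapReduce would be to pass rdeepseq from Control.Parallel.Strategies instead of rseq as the mapping strategy, so that each spark fully evaluates its Histogram rather than stopping at weak head normal form. This is only a minimal sketch: it reuses the question's definitions, relies on the NFData instances for Map and Text that the containers and text packages provide, and the name wordListStrict is purely illustrative.

-- Sketch: force each Histogram inside its own spark with rdeepseq,
-- instead of only evaluating it to WHNF with rseq.
wordListStrict :: IO Histogram
wordListStrict = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rdeepseq histogram rseq M.unions files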

Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversal puts more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons, and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type or, even better, the HashSet type from the unordered-containers package (since repeated string comparisons can be expensive, especially when they share common prefixes).

import           Data.HashSet                (HashSet)
import qualified Data.HashSet                as S

...

finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)

isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord

Replacing your isStopWord function with this version performs much better and scales much better (though definitely not 1-to-1). For the same reasons you could also consider using a HashMap (from the same package) instead of a Map, but I did not get a noticeable change from doing so.
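If you want to try that, here is a sketch of what the swap might look like, using Data.HashMap.Strict from the same unordered-containers package (the Histogram' and histogram' names are only for illustration):

import           Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

-- Sketch: the same word count built on a strict HashMap instead of Data.Map.
type Histogram' = HashMap Text Int

histogram' :: Text -> Histogram'
histogram' = foldl' (\m k -> HM.insertWith (+) k 1 m) HM.empty
           . filter (not . isStopWord) . T.words

The reduce step would then use HM.unions instead of M.unions.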

Another option is to increase the default heap size to take some pressure off the GC and to give it more room to move things around. Giving the compiled code a default heap size of 1GB (the -H1G flag), I get a GC balance of about 50% on 4 cores, whereas I only get ~25% without it (and it also runs ~30% faster).
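For reference, both the heap size and the number of cores are RTS options, so the binary has to be built with the threaded runtime and RTS options enabled; an invocation along these lines (the program name is only a placeholder, and -s prints the GC statistics quoted here):

ghc -O2 -threaded -rtsopts histogram.hs
./histogram +RTS -N4 -H1G -s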

With these two changes, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably there is still room for improvement judging by the GC statistics (it still spends only 58% of its time doing productive work), but doing significantly better might require a much more drastic change to your algorithm.


