haskell – 有效的并行策略

栏目: 编程语言 · 发布时间: 7年前

内容简介:代码日志版权声明:翻译自:http://stackoverflow.com/questions/14624376/efficient-parallel-strategies

我试图围绕平行策略.我想我明白了每个组合器的作用,但是每次尝试使用超过1个核心的程序时,程序会显着减慢.

例如,一段时间后,我尝试从〜700个文档中计算直方图(并从它们独特的单词).我以为使用文件级粒度会好的.与-N4我得到1.70工作平衡.然而与-N1相比,它的运行时间要比使用-N4的时间长一半.我不知道这个问题是什么,但是我想知道如何决定何时/何时/如何并行并获得一些理解.这将如何并行化,以便速度随着核心而不是减少而增加?

import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')

mapReduce stratm m stratr r xs = let
  mapped = parMap stratm m xs
  reduced = r mapped `using` stratr
  in mapped `pseq` reduced

type Histogram = Map Text Int

rootDir = "/home/masse/Documents/text_conversion/"

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)

textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "." = True
        meta ".." = True
        meta _ = False

histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words

wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $mapReduce rseq histogram rseq reduce files
  where
    reduce = M.unions

main = do
  list <- wordList
  print $M.size list

对于文本文件,我使用的pdf转换为文本文件,所以我不能提供它们,但为了这个目的,几乎任何书/书从古腾堡项目应该做.

编辑:添加导入脚本

实际上,让并联组合器进行规模化可能是困难的.

其他人已经提到使你的代码更严格,以确保你实际上

并行工作,这绝对是重要的.

两个可以真正杀死性能的东西是大量的内存遍历

垃圾收集.即使你没有生产大量的垃圾,很多

内存遍历给CPU缓存带来更多的压力,最终在你的

内存总线成为瓶颈.您的isStopWord功能执行很多

的字符串比较,并且必须遍历一个相当长的链表才能这样做.

你可以使用内置的Set类型来节省大量的工作,甚至可以节省大量的工作

来自无序容器包的HashSet类型(由于重复的字符串

比较可能是昂贵的,特别是如果他们共享公共前缀).

import           Data.HashSet                (HashSet)
import qualified Data.HashSet                as S

...

finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)

isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord

使用此版本替换您的isStopWord功能执行得更好

并且更好的缩放(尽管绝对不是1-1).你也可以考虑

使用HashMap(来自同一个包)而不是Map的原因相同,

但是我没有得到明显的改变.

另一个选择是增加默认的堆大小来取得一些

压力GC,并给它更多的空间移动的东西.给予

编译代码一个默认的堆大小为1GB(-H1G标志),我得到一个GC平衡

在4核心上约为50%,而我只有〜25%没有(它也跑〜30%

更快).

有了这两个变化,四个内核的平均运行时间(在我的机器上)

从〜10.5s降至〜3.5s.可以说,有改进的余地

GC统计数字(仍然只花58%的时间做生产性工作),

但要做得更好,可能需要更加剧烈的改变

你的算法

代码日志版权声明:

翻译自:http://stackoverflow.com/questions/14624376/efficient-parallel-strategies


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Python for Everyone

Python for Everyone

Cay S. Horstmann、Rance D. Necaise / John Wiley & Sons / 2013-4-26 / GBP 181.99

Cay Horstmann's" Python for Everyone "provides readers with step-by-step guidance, a feature that is immensely helpful for building confidence and providing an outline for the task at hand. "Problem S......一起来看看 《Python for Everyone》 这本书的介绍吧!

XML 在线格式化
XML 在线格式化

在线 XML 格式化压缩工具

html转js在线工具
html转js在线工具

html转js在线工具

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试