Translated from: http://stackoverflow.com/questions/14624376/efficient-parallel-strategies
I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try using them with more than 1 core, the program slows down considerably.

For example, a while ago I tried to compute histograms (and from them, the unique words) from ~700 documents. I figured that file-level granularity would be OK. With -N4 I get a work balance of 1.70. However, with -N1 it runs in half the time that it does with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize and gain some understanding of it. How would this be parallelized so that the speed increases with cores instead of decreasing?
import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')

mapReduce stratm m stratr r xs = let mapped = parMap stratm m xs
                                     reduced = r mapped `using` stratr
                                 in mapped `pseq` reduced

type Histogram = Map Text Int

rootDir = "/home/masse/Documents/text_conversion/"

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)

textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "." = True
        meta ".." = True
        meta _ = False

histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words

wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where reduce = M.unions

main = do
  list <- wordList
  print $ M.size list
As for the text files, I'm using pdfs converted to text files, so I can't provide them, but for this purpose almost any book from Project Gutenberg should do.
Edit: added the imports to the script.
In practice, getting the parallel combinators to scale well can be difficult. Others have mentioned making your code more strict to ensure you are actually doing the work in parallel, and that is definitely important.
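As a standalone illustration of the strictness point (this sketch is mine, not code from the question or answer, and the data is made up): rseq only evaluates a spark's result to weak head normal form, while rdeepseq forces it completely, so with rdeepseq the counting work actually happens inside the spark:

-- build with: ghc -O2 -threaded, run with: +RTS -N
import qualified Data.Map as M
import Control.Parallel.Strategies (parMap, rseq, rdeepseq)

main :: IO ()
main = do
  let chunks = [[x, x * 2, x * 3] | x <- [1 .. 10000 :: Int]]
      count  = M.fromListWith (+) . map (\k -> (k, 1 :: Int))
      -- rseq stops at WHNF: the Map's strict spine is built in the
      -- spark, but every summed count is left behind as an
      -- unevaluated (+) thunk
      lazily   = parMap rseq count chunks
      -- rdeepseq forces keys and values too, so the real work runs
      -- in parallel
      strictly = parMap rdeepseq count chunks
  print . M.size $ M.unionsWith (+) strictly
  print . M.size $ M.unionsWith (+) lazily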
Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversal puts more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type, or, even better, the HashSet type from the unordered-containers package (since repeated string comparisons can be expensive, especially when the strings share common prefixes).
import Data.HashSet (HashSet)
import qualified Data.HashSet as S

...

finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]

englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)

isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord
Replacing your isStopWord function with this version performs much better and scales much better too (though definitely not 1-to-1). For the same reason you could also consider using a HashMap (from the same package) instead of a Map, but I did not get a noticeable change from doing that.
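For reference, here is roughly what that HashMap variant could look like (my sketch under stated assumptions, not code from the answer; the stop-word set is an abbreviated stand-in for the full lists above):

{-# LANGUAGE OverloadedStrings #-}
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM
import Data.HashSet (HashSet)
import qualified Data.HashSet as HS
import Data.Text (Text)
import qualified Data.Text as T

type Histogram = HashMap Text Int

-- abbreviated stand-in for the full stop-word lists
stopWords :: HashSet Text
stopWords = HS.fromList ["the", "a", "and"]

-- same structure as the original histogram, but each insert and
-- membership test hashes the word once instead of comparing it
-- against many others
histogram :: Text -> Histogram
histogram = foldr (\k -> HM.insertWith (+) k 1) HM.empty
          . filter (\w -> not (w `HS.member` stopWords))
          . T.words

The reduce step carries over unchanged as HM.unions.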
Another option is to increase the default heap size to take some pressure off the GC and give it more room to move things around. Giving the compiled code a default heap size of 1GB (the -H1G flag), I get a GC balance of about 50% on 4 cores, whereas I only get ~25% without it (and it also runs ~30% faster).
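For concreteness, the flags are passed like this (the file name is my assumption; -rtsopts enables the runtime flags and -s prints the GC statistics discussed here):

ghc -O2 -threaded -rtsopts histogram.hs
./histogram +RTS -N4 -H1G -s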
With these two changes, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably there is still room for improvement based on the GC statistics (it still spends only 58% of the time doing productive work), but doing significantly better might require a much more drastic change to your algorithm.