Your input is a 200 MB byte array, and your output is a corresponding string that uses 2 chars per input byte. Since a .NET char occupies 2 bytes, that string takes 800 MB to store. At one point both are in memory, so you need 1 GB. You also have to assume that some memory will be used by intermediate structures while your algorithm runs, in between GC passes.
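The arithmetic above can be written out as a small F# sketch. The names are illustrative only, and the estimate deliberately ignores object headers, the StringBuilder's own buffer, and other temporary allocations, so the real peak is likely higher.

```fsharp
// Back-of-the-envelope estimate in MB (illustrative names, not from the original code).
let inputMB = 200L        // size of the byte array read from the file
let charsPerByte = 2L     // two hex digits per input byte
let bytesPerChar = 2L     // a .NET char is a 2-byte UTF-16 code unit

// Size of the resulting hex string, and the total while both are alive.
let stringMB = inputMB * charsPerByte * bytesPerChar
let peakMB = inputMB + stringMB

printfn "hex string: %d MB, both in memory: %d MB" stringMB peakMB
// prints "hex string: 800 MB, both in memory: 1000 MB"
```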
Solution 1: Buy more memory.
Solution 2: Tweak your algorithm so that it does not require the complete input to be available at once, then process it in chunks.
Solution 3: Implement regular expressions over bytes, or rewrite your algorithm to not use regular expressions. The idea is to keep the 200 MB byte array as the only big structure permanently in memory. This assumes you are OK with using 200 MB.
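As a rough sketch of Solution 2, the hex conversion can be streamed through a fixed-size buffer so that only one chunk is in memory at a time. The function name `writeHex` and its parameters are assumptions made for illustration. It uses a digit lookup table rather than a format string, and it writes to any TextWriter (for example a file) instead of building one big string.

```fsharp
open System.IO

/// Streams `input` to `output` as upper-case hex, holding only one buffer in memory.
/// (Hypothetical helper, not part of the original code.)
let writeHex (input : Stream) (output : TextWriter) (bufferSize : int) =
    let digits = "0123456789ABCDEF"
    let buffer : byte[] = Array.zeroCreate bufferSize
    let mutable read = input.Read(buffer, 0, buffer.Length)
    while read > 0 do
        for i in 0 .. read - 1 do
            let b = int buffer.[i]
            // Write the two hex digits directly; no per-byte string is allocated.
            output.Write(digits.[b >>> 4])
            output.Write(digits.[b &&& 0xF])
        read <- input.Read(buffer, 0, buffer.Length)

// Small in-memory check: the four bytes of "%PDF", read with a 3-byte buffer.
let sample = new MemoryStream([| 0x25uy; 0x50uy; 0x44uy; 0x46uy |])
let sink = new StringWriter()
writeHex sample sink 3
printfn "%s" (sink.ToString())   // prints "25504446"
```

For a real file, `input` could be `File.OpenRead(path)` and `output` a `StreamWriter`. Note that any analysis done per chunk has to cope with matches that straddle a chunk boundary, for example by overlapping consecutive chunks.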
As an aside, this is not what you asked for, but there are plenty of things you can do to make the code run faster: not faster in theory, since your complexity is already linear, which is OK, but faster by a constant factor, which can be significant in practice. For example, try not to allocate the intermediate strings. Use StringWriter and Write("{0:x2}", x).
Thanks for your input!
I realised that I clearly was not addressing the real problem, which was trying to analyse a 200 MB input as one string.
I changed my algorithm to use the PDF file structure to split the file into the parts I needed, and now I'm getting decent performance and resource usage.
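The reply does not show how the file was split, so the following is only a guess at the general shape: a naive scan over the raw bytes for an ASCII marker, with no hex string and no regex, in the spirit of Solution 3. The helper `findAll` is hypothetical. The marker `endobj` is the keyword that closes an indirect object in a PDF, but the same byte sequence can also occur inside stream data, so this is not a PDF parser.

```fsharp
open System.Text

/// Returns every offset at which `pattern` occurs in `data` (naive scan).
/// (Hypothetical helper for illustration.)
let findAll (pattern : byte[]) (data : byte[]) : int list =
    let matchesAt i =
        let mutable j = 0
        while j < pattern.Length && data.[i + j] = pattern.[j] do
            j <- j + 1
        j = pattern.Length
    [ for i in 0 .. data.Length - pattern.Length do
        if matchesAt i then yield i ]

// Tiny stand-in for file contents; a real run would use File.ReadAllBytes.
let marker = Encoding.ASCII.GetBytes("endobj")
let data = Encoding.ASCII.GetBytes("1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n")
printfn "%A" (findAll marker data)   // prints [14; 35]
```

The offsets can then be used to slice the byte array into smaller parts, each of which is cheap to analyse on its own.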
Copyright (c) 2011-2012 IntelliFactory. All rights reserved.
I am trying to open a big PDF file (60 to 200 MB) for analysis, using a regex on a string holding the hexadecimal value of the file. To do that, I read the file as a byte array, convert it to a sequence of strings containing the hex value of each byte, and then build that sequence into a single string using StringBuilder.
I've been able to control the memory usage using StringBuilder, but for the bigger files (150 MB and up) I still get OutOfMemory exceptions.

    open System
    open System.IO
    open System.Text

    let byteToHex byte =
        let sb = new System.Text.StringBuilder(byte |> Seq.length)
        byte
        |> Seq.map (fun (x : byte) -> String.Format("{0:X2}", x))
        |> Seq.iter (fun i -> sb.Append(String.Empty + i) |> ignore)
        sb.ToString()

    let file = @"c:\200mb.pdf"
    let bdata = File.ReadAllBytes(file)
    let hdata = byteToHex bdata

Anyone have a better idea?
Thanks!