Don,
Nice. Thanks for the code. Ummm, somehow you always provide a pertinent example of something you have done that fits in with something I'm doing. jjbHop (soon to get a hubFS project name and codebase) needs this code for the crawl engine (because I'll admit your code is *more better* than anything I've built to date in the crawler for Hop).
You've mentioned F# is fast; I've started at google.com and am almost done with all of the internet; progress bar says 78% complete after 163 minutes. :-)
---O
It's a script, so it works best if you go through it line by line using F# Interactive and Visual Studio, though it also works if you use F# Interactive on a command line and paste lines of code into it.
Show your friends and take them through it line by line - it really is enormous fun. I've also pasted the code below to give the code highlighter a decent run :-)
//---------------------------------------------------------------------------
// Part O. Hello World

//System.Console.WriteLine("Hello World");;
System.Console.WriteLine("Hello World");;

open Printf;;
printf "Hello World\n";;

//---------------------------------------------------------------------------
// Part I. Web.

open System.Net
open System
open System.IO

let id x = x

let req = WebRequest.Create("http://www.microsoft.com")
let resp = req.GetResponse()
let stream = resp.GetResponseStream()
let reader = new IO.StreamReader(stream)
let html = reader.ReadToEnd();;

html;;

/// Fetch the contents of a web page
let http(url: string) =
    let req = WebRequest.Create(url) in
    let resp = req.GetResponse() in
    let stream = resp.GetResponseStream() in
    let reader = new IO.StreamReader(stream) in
    let html = reader.ReadToEnd() in
    resp.Close();
    html

let google = http("http://www.google.com");;
let bbc = http("http://news.bbc.co.uk");;
let msft = http("http://www.microsoft.com");;
let nytRSS = http("http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml");;
//let bbcRSS = http("http://www.bbc.co.uk/go/homepage/int/ne/nrss/log/i/-/news/rss/newsonline_uk_edition/front_page/rss.xml")

// ----------------------------
// Windows Forms

open System.Windows.Forms;;
open System.Drawing;;

let form = new Form();;
form.Visible <- true;;
form.Text <- "Welcome to F# Interactive Programming";;
form.TopMost <- true;;

let textB = new RichTextBox();;
form.Controls.Add(textB);;
textB.Dock <- DockStyle.Fill ;;
textB.Text <- nytRSS;;
textB.ForeColor <- Color.DarkBlue;;
textB.Font <- new Font("Lucida Console",12.0f,FontStyle.Bold) ;
form.Size <- new Size(400,600);;

let setText text = textB.Text <- text
let appendText text = textB.AppendText(text + "\n");;

setText "hello";;
setText "hello again";;

//let (|>) x f = f x

let any_to_string_ex opts x = x |> any_to_layout opts |> layout_to_string opts

let show x =
    let opts= { format_options.Default with printWidth = form.Width/16 } in
    setText (any_to_string_ex opts x);;

(1,2,3) |> show;;
Array.create 100 (1,2,3) |> show;;
nytRSS |> setText;;
show 1;;

// ----------------------------
// Scan RSS for news titles

open System.Xml;;
open System.Collections;;
open System.Collections.Generic;;

let xdoc = new XmlDocument();;
xdoc.LoadXml(nytRSS);;
xdoc.SelectNodes("//title");;
xdoc.SelectNodes("//title") |> show;;

/// Hmmm... XPathNodeList supports System.IEnumerable
/// First extract the text from the nodes then display...
xdoc.SelectNodes("//title")
    |> IEnumerable.map_with_type (fun (i:XmlNode) -> i.InnerText)
    |> IEnumerable.to_list
    |> show;;

// ----------------------------
// Search for URLs in HTML

open System.Text.RegularExpressions;;

let httpPat = "http://[a-z-A-Z0-9./_]*"
let urlPat = "href=\s*\"[^\"h]*(http://[^&\"]*)\"";;

let bbcUrls = Regex.Matches(bbc,urlPat);;

let getUrls (txt:string) =
    Regex.Matches(txt,urlPat)
    |> IEnumerable.map_with_type (fun (m:Match) -> (m.Groups.Item(1)).Value)
    |> IEnumerable.to_list;;

let collectUrls url =
    appendText url;
    Application.DoEvents();
    let html = try http(url) with _ -> "" in
    let urls = getUrls html in
    urls;;

collectUrls "http://news.google.com" |> show;;

// ----------------------------
// Crawling (Synchronous)

let crawlLimit = 10;;

let rec crawl sofar url =
    if Set.size sofar >= crawlLimit or Set.mem url sofar then sofar
    else
        let urls = collectUrls url in
        List.fold_left crawl (Set.add url sofar) urls;;

textB.Clear();;
crawl Set.empty "http://news.google.com";;

// ----------------------------
// HTTP Requests (Asynchronous)

open System.Threading
open Microsoft.FSharp.Idioms
open System.Collections.Generic

let httpAsync (url:string) (cont: string -> unit) =
    let req = WebRequest.Create(url) in
    let iar =
        req.BeginGetResponse((fun iar ->
            let rsp = req.EndGetResponse(iar) in
            let str = new StreamReader(rsp.GetResponseStream()) in
            let html = str.ReadToEnd() in
            rsp.Close();
            cont html), 0) in
    ()

do httpAsync "http://www.microsoft.com" (fun html -> show html)
do httpAsync "http://www.google.com" (fun html -> show html)

let collectUrlsAsync url cont = httpAsync url (getUrls >> cont)

do collectUrlsAsync "http://news.google.com" (fun urls -> show urls)

// ----------------------------
// Crawling (Asynchronous)

/// Spawn a worker thread
let spawn (f : unit -> unit) = ThreadPool.QueueUserWorkItem(fun _ -> f ()) |> ignore

/// Add text to the window from a worker thread
let appendTextRemote t = form.Invoke(new MethodInvoker(fun () -> appendText t)) |> ignore

let addToSet (d: Dictionary<_,_>) x =
    let res = d.ContainsKey(x) in
    if not res then d.Add(x,1);
    res

/// Async crawling
let acrawl(url:string) =
    appendTextRemote (sprintf "Crawling %s..." url);
    // Local state, protected by locks
    let sofar = new Dictionary<_,_>() in
    let rec search url =
        let wasPresent = lock sofar (fun () -> addToSet sofar url) in
        if not wasPresent && sofar.Count < crawlLimit * 2 then begin
            spawn (fun () ->
                collectUrlsAsync url (fun urls ->
                    List.iter search urls;
                    appendTextRemote url)
            );
        end in
    spawn (fun () -> search url)

do textB.Clear()
do acrawl "http://news.google.com"

// ---------------------------------------------
// Random web walk

// List functions

// Random numbers
let rand = new System.Random()
let dice n = rand.Next(n)
let diceList xs dflt =
    let n = List.length xs in
    if n=0 then dflt else List.nth xs (dice n);;

// Web browser control inside a form
open System
open System.Windows.Forms;;

let wb = new WebBrowser();;
wb.Dock <- DockStyle.Fill;;
wb.AllowNavigation <- true;;

let webForm = new Form();;
webForm.Controls.Add(wb);;
webForm.Visible <- true;;
webForm.Size <- new Size(600,400);;
webForm.TopMost <- true;;

// Point it at pages and get the text
wb.Navigate("http://news.bbc.co.uk");;
let text = wb.DocumentText

// Regular expressions
open System.Text.RegularExpressions
let rx = new Regex("http://news.bbc.co.uk/[a-z0-9_.-/]+stm") // Regular expression to filter urls

let urlsOfDocument (doc : HtmlDocument) =
    let urls = doc.Links |> IEnumerable.map_with_type (fun (elt:HtmlElement) -> elt.GetAttribute("href")) in
    let urls = urls |> IEnumerable.filter (fun url -> rx.IsMatch(url)) in
    let urls = urls |> IEnumerable.to_list in
    urls;;

// Test it
urlsOfDocument (wb.Document) |> show;;

let randomLink doc =
    let urls = urlsOfDocument doc in
    let url = diceList urls "http://news.bbc.co.uk" in
    Printf.printf "JUMP: %s\n" url;
    url;;

// Test it
randomLink wb.Document |> show;;

// Click on a timer event
let randomClick () = wb.Navigate(randomLink(wb.Document))

let timer = new Timer();;
timer.Interval <- 1500;;
timer.Tick.Add(fun _ -> randomClick ());;
timer.Start();;

// Enough!
timer.Stop();;
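A note for anyone trying this on a current F# release: as far as I can tell, several names the script relies on (IEnumerable.map_with_type, List.fold_left, Set.mem, Set.size, the `or` keyword) date from early F# builds and have since been renamed or dropped. Below is a rough sketch of how the URL extraction and the synchronous crawl might look with today's core library. It is not the original author's code. The `fetch` parameter, the `limit` parameter and the example.com-style pages are assumptions made for illustration, so the crawl can be tried without touching the network or the Windows Forms window.

```fsharp
open System.Text.RegularExpressions

/// Same pattern as urlPat in the script above.
let urlPat = "href=\s*\"[^\"h]*(http://[^&\"]*)\""

/// Extract the captured URLs from a chunk of HTML.
/// Seq.cast is used because MatchCollection is enumerated as Match values.
let getUrls (txt: string) : string list =
    Regex.Matches(txt, urlPat)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> List.ofSeq

/// Depth-first crawl that stops once `limit` pages have been visited.
/// `fetch` is a hypothetical stand-in for the script's `http` function.
let rec crawl (fetch: string -> string) (limit: int) (sofar: Set<string>) (url: string) : Set<string> =
    if Set.count sofar >= limit || Set.contains url sofar then sofar
    else
        // A failed fetch is treated as an empty page, as in collectUrls.
        let html = try fetch url with _ -> ""
        List.fold (crawl fetch limit) (Set.add url sofar) (getUrls html)

// A tiny in-memory "web" (made-up pages) standing in for real HTTP requests.
let pages =
    Map.ofList [
        "http://a.example/", "<a href=\"http://b.example/\">b</a> <a href=\"http://c.example/\">c</a>"
        "http://b.example/", "<a href=\"http://a.example/\">a</a>"
    ]

let fetchFromMap (url: string) = defaultArg (Map.tryFind url pages) ""

let visited = crawl fetchFromMap 10 Set.empty "http://a.example/"
```

Passing the fetch function in as an argument keeps the crawl logic separate from the network call, so the script's `http` function (or any replacement) can be supplied in place of `fetchFromMap` when crawling real pages.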