Tadpole the Language for Scraping 0.2.0 – Complex Control Flow, Stealth and More
3 by zachperkitny | 0 comments on Hacker News.
Hello, I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.

GitHub Repo: https://ift.tt/oVU90CO
Docs: https://tadpolehq.com/

For the past 2 weeks, I've been focusing my efforts on introducing specific stealth actions, more complicated control flow actions, and a variety of evaluators for cleaning data.

Here is an example for scraping from `books.toscrape.com`:

    main {
      new_page {
        goto "https://ift.tt/BQfAZU9"
        loop {
          do {
            $$ article.product_pod {
              extract "books[]" {
                title { $ "h3 a"; attr title }
                rating { $ ".star-rating"; attr "class"; extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true; func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)" }
                price { $ "p.price_color"; text; as_float }
                in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
              }
            }
          }
          while { $ "li.next" }
          next {
            $ "li.next a" { click }
            wait_until
          }
        }
      }
    }

I've introduced actions like `apply_identity` to override User Agent headers and User Agent metadata. Here is an example module that defines different identities to apply selectively:

    module stealth {
      // Apple M2 Pro
      action apply_apple_m2 {
        apply_identity mac
        set_webgl_vendor "Apple Inc." "Apple M2"
        set_device_memory 16
        set_hardware_concurrency 8
        set_viewport 1440 900 deviceScaleFactor=2
      }

      // Windows Desktop
      action apply_windows_16_8 {
        apply_identity windows
        set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
        set_device_memory 16
        set_hardware_concurrency 8
        set_viewport 1920 1080
      }

      // Windows Budget Laptop
      action apply_windows_8_4 {
        apply_identity windows
        set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
        set_device_memory 8
        set_hardware_concurrency 4
        set_viewport 1366 768
      }
    }

The full release changelog is available here: https://ift.tt/yrMHbmh

My goals for the 0.3.0 release are to focus heavily on plugins, distributed execution through message queues, Redis support for crawling, and static parsing as an alternative to running exclusively over CDP/Chrome. I will try to keep my release cadence at every 2 weeks!
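
For anyone trying to follow the evaluator chain in the `books.toscrape.com` example, here is a rough Python sketch of what the `rating`, `price` and `in_stock` fields appear to compute once the element's attribute or text has been read. This is not how Tadpole implements them. The function names are hypothetical, and two behaviors are assumptions: that `extract` returns the first capture group, and that `as_float` picks the first decimal number out of the text.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; none of these are Tadpole APIs.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
RATING_RE = re.compile(r"star-rating (One|Two|Three|Four|Five)", re.IGNORECASE)


def parse_rating(class_attr: str) -> Optional[int]:
    """Mimic `extract ... caseInsensitive=#true` followed by the `func` lookup.

    Assumption: `extract` hands the first capture group to `func`.
    """
    match = RATING_RE.search(class_attr)
    if match is None:
        return None
    return RATING_WORDS.get(match.group(1).lower())


def parse_price(text: str) -> Optional[float]:
    """Assumed `as_float` behavior: parse the first decimal number in the text."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group(0)) if match else None


def parse_in_stock(text: str) -> bool:
    """Mimic `matches "In stock" caseInsensitive=#true` as a case-insensitive search."""
    return re.search(r"In stock", text, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(parse_rating("star-rating Three"))  # 3
    print(parse_price("£51.77"))              # 51.77
    print(parse_in_stock("\n    In stock\n")) # True
```

Keeping each field as a small chain of steps (select, read, clean, convert) is what lets the Tadpole version stay on one line per field. The Python version just spells the same steps out.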