Thursday, 28 April 2016

Web Scraping Service Vs Web Scraping Tool – Choosing The Best

Web scraping is a rapidly emerging technique of extracting data from any web source with the intent to use it for analyzing the market trends. Different business owners adopt this method to enhance their sales and growth. They are open with different tools and services of extracting data from the internet.

Here, the major question that rises is- What is suitable between web scraping services and web scraping tools or software? The feasible answer is web scraping service as it offers comparatively more benefits that any software or tool.

Advantages of web scraping services

A thick of web scraping companies render custom-based support to the businesses in data extraction. Some of the major compelling benefits of preferring web scraping services may include:

    Lowered cost: You can conveniently save your thousands of money and man-power as these services are available at comparatively low prices.

    Accuracy in results: Unlike the data extraction software, these services render premium level of accuracy in terms of results. The leading companies of web services ensure that they deliver exact outcomes to you as per your need through their services.

    Instant outcomes: It only takes maximum time duration of 3 to 4 hours to generate valuable information about any database by acquiring the services of reputed web scraping company. You can avail the advantage of time over market on your competitors.

Drawbacks of using web scraping tools

    Using data extraction software accompanies certain drawbacks with it that may include:
    Difficulty in data extraction from multifaceted websites.
    Difficulty in extracting huge bulk of data.
    It is comparatively a slower process than a service provider.
    Several sites have well-defined policies for screen scraping.

Summary: While making a comparison between web scraping tools and services, you may arrive at the conclusion that the services are much more beneficial, reliable, and efficient than the major tools.
   
 Source : http://www.web-parsing.com/blog/web-scraping-services-vs-web-scraping-tools-choosing-the-best/

Wednesday, 27 April 2016

Data Extraction is not a Rocket Science: Follow These 4 Tips to Get Exemplary Results!

Data extraction is a skill, the more you master it – more are the chances of having a lucid picture of the volatile market and getting better perceptive of constantly changing trends. Escalating volatility in the market and intensifying competition has been the most contributing factors that have led to the rise of data extraction and data mining.

Data extraction is primarily used by companies (large and small, alike) to collect data from a specific industry, or data related to targeted customers or about their competition in the market. In fact, it has become a primary tool for marketers to plan their moves for branding and promoting particular products or services. It helps a wide plethora of industrial sectors to find and learn about specific data, based on their requirements.

And now with the rise of internet, web scraping has emerged as an important aspect that contributes to your success – the success of your venture or organization. It processes the HTML of a Web page to obtain data and convert it into to another format (i.e. HTML to XML).

Various extraction tools form an integral part of data extraction and data scrapping. Following offers a brief outline of some of these tools:

Email Extraction – An email extractor tool is used to acquire the email ids from any dependable sources automatically

Screen Scrapping – Screen scraping is a practice of reading text information from a screen and collecting visual data, rather than analyzing data as done in web scraping.

Data Mining as name suggests is a process of gathering patterns from information. It basically transforms the information into formats like CSV, MS excels, HTML and so and so forth, depending to your requirements. Web Spider – A Web spider is a computer program which browses internet in a systematic, automated manner. It is used by many search engines in order to provide up-to-date data.

It is often seen that while extracting data; many get lost into the labyrinth of confusion, data overabundance, along with a lot of weird and not-so-familiar terms. Proper handling of these may sound easy, however; when not executed with appropriate procedure and processes; it may bring in disastrous results.

This no way means that data mining is a rocket science which only a few gifted and skilled people can take up. All it requires is undivided attention, keen preparation, and training, so brace up yourself for an overview of some practical tips that can help in successful data extraction and give a boost to your business.

Identify your Business Goals!:

Get a clear perspective in mind as to what are your business goals.

Data extraction can be bifurcated into various branches; and one needs to choose it wisely, depending on the business goals. E.g. your primary requirement is to get email ids of potential clients to conduct an email campaign; and for that you certainly need an email extractor. Use of this tool assists in extracting the email ids from trustworthy sources automatically. It essentially collects business contacts from various web pages, text files, HTML files, or any other format without duplicating the email ids. So, if you are not sure what you want; even applying the best tools will be of no use!

A crystal clear mindset helps in better understanding of market scenario and thus helps in formulation of powerful and effective strategies to get desired outcomes. E.g., people dealing in real estate business, should have a vision for it and which area they want to target specifically. With a clear vision they can clearly spell out what you want and where it should be.

Set Realistic Expectations:

Upon identifying your business goals, make sure to check out that they are realistic and attainable! Unrealistic and unachievable targets are the real cause for the obstacles and frustrations in the future.

Since, there are various tools that are and can be employed to extract data; vague or unclear goals make it difficult to determine which tool can be applied.

This crystal clear mindset; will help you give that insight about the direction your business is headed to.

Moreover, you can determine which method can be used to get excellent results. You can get a lucid picture of the past and present of your competitors and therefore helps in setting targets based on the others’ experiences. It is usually a wise move to set expectations that you have not achieved before.

Appoint Skilled Data Miner:

Skilled data miner with excellent data mining skills will reduce the painstaking and tiresome process of planning, devising and preparation.

For fresh start-ups, you can go ahead with the standard procedure however; if you have ample professionals at your disposal, pick up the right one who is not only knowledgeable but also reliable and sincere towards the task.

Prevent Data Deposits:

Being dead-sure of what you really want will help you avoid unnecessary data deposition.

Data mining just like real mining is a skill to know where the real treasure lies and being able to get it in the most efficient and effective way.

Being able to spot on authenticated & reliable resources, well researched information is what gives a short cut to locate the right and exact data.

If you are aimlessly opening every website; the results are bound to be ambiguous and would ultimately be a waste of time and effort.

Source : http://www.habiledata.com/blog/data-extraction-is-not-a-rocket-science-follow-these-4-tips-to-get-exemplary-results

Saturday, 11 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for their poor security, NICTA mobile systems research group leader, Aruna Seneviratne, who leads the organisation's Networks Research Group, told CSO Australia that many tracking libraries were inadvertently added when developers incorporated third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

 And collect they do. Apps analysed by the team – whose paper 'early detection of spam mobile apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information off of users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural modeling.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more-clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.

Read more: The week in security: Android apps collecting your location data, home routers hit by drive-by malware

Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.

Read more: Surveillance laws driving companies to limit data collection, developers to boost security

“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application – which is now under development to be sold to large corporates for developer training and testing.

This article is brought to you by Enex TestLab, content directors for CSO Australia.

Read more: The week in security: Budget flags encryption troubles, cross-government IAM

Feeling social? Follow us on Twitter and LinkedIn Now!

Read More:

    Victorian Commissioner for Privacy and Data Protection sorts sheep from the goats

    Better than email: VISA launches FireEye threat intel platform for merchants

Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Friday, 26 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand scraped planks of wood and not the precise milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring in the market that have an aged and unique charm with a non perfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distresses flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and achieved by rolling or pressed the wood onto a patterned surface.

The real hand scraped planks are made by craftsmen and they work on each plant individually. By using this working technique, there is complete certainty that each plank will be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by the trained carpenter or craftsmen who will produce a high-quality end product and take great care in their workmanship. It can benefit to ask the supplier of the flooring to see who completes the work.

Beside the well scraped lumber, there are also those planks that have been bought from the less than desirable sources. This is caused by the increased demand for this type of flooring. At the lower end of the market the unskilled workers are used and the end results aren't so impressive.

The high-quality plank has the distinctive look that feels and functions perfectly well as solid flooring, while the low-quality work can appear quite ugly and cheap.

Even though it might cost a little bit more, it benefits to source the hardwood floor dealers that rely on the skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Saturday, 20 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave of some potential comments: due to the way this table is setup and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"  "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1024648/"  "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"  "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"


# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

## 2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has typically involved creating a series of regular expressions that match the pieces of the page you want (e.g., URL's and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Wednesday, 3 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (a lá Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as url parameters, and I use Request’s .text function to render the page in html, which I then pass through lxml to the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload)  # This'll be the results page

html = r.text

root = lxml.html.fromstring(html)  # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

for i in root.cssselect("h1.entry-title a"):

    link = i.cssselect("a")

    text = i.text_content()

    data = {

        'uri': link[0].attrib['href'],

        'post-title': str(text),

        'search-term': str(term)

    }

    if i is not None:

        print link

        print text

        print data

        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)

    else:

        print "No results."

These return into an sqlite database via the ScraperWiki library, and I have a resulting database with the title and link to every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/