Monday, March 24, 2014

Outsourcing Web Crawls- SLAs to visit

If you deal with data on the web, you have more often than not chosen to employ an external entity for your data acquisition needs.

Here are the SLAs you'd need to consider when having a DaaS provider do the crawls for you-

1. Crawlability- If you're into the crawling business, this is the primary attribute to be assured of. Irrespective of the technical variety of the websites, crawls should run smoothly. The crawlers need to be adept at handling the roadblocks and their corresponding workarounds. Here's a post discussing these roadblocks and this one digging into AJAX pages.

2. Scalability- While crawling might seem overrated as a problem when it is done for a few web pages or a couple of sites at most, the problem changes by an order of magnitude when it needs to be done at scale. Managing multiple clusters, distributing crawls across them, monitoring them, collating results from these crawls and then grouping them is where the devils of crawling lie. Make sure your provider can handle the scale you anticipate (look for cues like thousands of sites or millions of pages). Even if your current need is a low-scale arrangement, it's better to go with a solution that's scalable so that you have a reasonably thought-out solution at your disposal with all nuts and bolts in place.

3. Data structuring capabilities- Crawling is only half the problem if your requirement is ready-to-use structured data. Every web page is different, and so are the requirements pertaining to every project. How detailed your provider can be in extracting information from any nook of the page is something for you to validate. This becomes especially critical when your vendor uses a generic crawler, in which case the number of fields is limited, as opposed to writing custom rules per site, wherein you define the data schema as per your needs. It's also a good idea to add quality checks at your end to avoid compromises, because with web-scale and automation there could be surprises.

4. Data accuracy- This is in line with the above point on structuring capabilities. You'd like access to untouched and uncontaminated information from the web pages. Most providers will extract data as-is from the site for this very reason, because any minor modification might defeat the purpose of extracting such data in most cases. However, sometimes you might be averse to too many new lines, spaces, tabs, etc. (from the web page itself), and hence some level of cleaning could be asked for.

5. Data coverage- Crawls can end up with a few pages being missed or skipped for various reasons: the page does not exist, the page times out or takes too long to load, or the crawler simply never got to that page. Although such issues are unavoidable, especially at scale, they can surely be contained by keeping logs and, at the least, being aware of which ones crept in. Discuss the tolerance levels that you're comfortable with so that the provider can configure their system accordingly.

6. Availability- Data acquisition, at its core, demands availability of the right data at the right time. Let your provider know beforehand of the uptimes that you expect. Most of the providers who run data acquisition as a primary business should be able to guarantee ~99% availability of their data delivery channels.

7. Adaptability- Let's come to terms with the fact that whichever process you have adopted, from waterfall to agile, requirements do change because of market dynamism. When acquiring data, you might realize that adding more information to the data feeds will give you a competitive edge, or you might simply have become aware of other data sources. How easily your provider can adapt to such dynamics (if at all) is something to check for upfront.

8. Maintainability- Monitoring the pipeline for regular automated feeds is as big a deal as crawling and structuring the data. Although it depends purely on your provider's business model, it's better to be aware of what's included with the project. Given how often websites change, it's better to employ someone who gets notified of changes and does the fixes, so that your team can avoid the hassles of maintaining it.
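To make a few of these SLAs concrete, here is a minimal Python sketch of checks you could run at your end on a delivered crawl. It is only an illustration under assumptions: the function names, the shape of the crawl log (a mapping from URL to a status string) and the 2% default tolerance are all hypothetical and not taken from any provider's actual system.

```python
import re


def coverage_report(expected_urls, crawl_log, tolerance=0.02):
    """Compare the URLs you asked for against a crawl log (point 5).

    crawl_log is assumed to map url -> status string, with "ok" meaning
    the page was fetched. URLs absent from the log never got crawled.
    """
    missed = {}
    for url in expected_urls:
        status = crawl_log.get(url, "never_crawled")
        if status != "ok":
            missed[url] = status
    miss_rate = len(missed) / len(expected_urls) if expected_urls else 0.0
    return {
        "miss_rate": miss_rate,
        "within_tolerance": miss_rate <= tolerance,
        "missed": missed,
    }


def clean_whitespace(text):
    """Collapse runs of new lines, spaces and tabs into one space (point 4)."""
    return re.sub(r"\s+", " ", text).strip()


def downtime_budget_hours(availability, period_hours=30 * 24):
    """Hours of downtime an availability target permits per period (point 6)."""
    return (1 - availability) * period_hours
```

For instance, a 99% availability target over a 30-day month leaves room for roughly 7.2 hours of downtime, which is worth spelling out when you discuss uptimes with a provider. The coverage report keeps the reason for each missed page, so the tolerance conversation can be about specific failure types and not just a single percentage.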

Do you think there's more to this? We welcome your comments.

Wednesday, February 26, 2014

Crawling the web: The Trends and Challenges

As an evolving field, extracting data from the web is still a gray area - without any clear ground rules regarding the legality of web scraping. With growing concerns among companies regarding how others use their data, crawling the web is gradually becoming more and more complicated. The situation is further aggravated by the growing complexity of web page elements such as AJAX.

Monday, February 17, 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. Quite often we come across offline businesses, too, that would like data gathered from the web for internal analyses, all of this eventually to serve customers faster and better. At times, when the crawl job is both complex and large in scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid, and they leave even DaaS providers perplexed. From a user's point of view, they come in various forms:

1. Load more results on the same page
2. Filter results based on various selection criteria
3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. AJAX pages, however, often load their content through background requests (frequently POST requests) that are not easy for a normal bot to trace.
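As a rough illustration of the difference, the Python sketch below builds both kinds of request with the standard library, without sending anything over the network. The URLs, parameter names and the helper names `build_get` and `build_post` are hypothetical; a real AJAX endpoint and its form fields would have to be discovered by inspecting the site's network traffic.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, params):
    """A plain page fetch: everything the server needs is in the URL."""
    return Request(f"{base_url}?{urlencode(params)}", method="GET")


def build_post(endpoint, form):
    """An AJAX-style call: the parameters travel in the request body."""
    body = urlencode(form).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Conventional marker many JavaScript libraries add to AJAX calls;
        # whether a given site checks it is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical example: page 2 of search results, fetched both ways.
get_req = build_get("https://example.com/search", {"q": "shoes", "page": 2})
post_req = build_post("https://example.com/ajax/load_more", {"q": "shoes", "page": 2})
```

The GET request can be found by following links, since the whole query sits in the URL. The POST request has no such link to follow: the bot has to know the endpoint and reconstruct the body itself, which is why "load more" buttons and filters are harder to crawl.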

[Figure: Difference between GET request and POST request in scraping]
At PromptCloud, from our experience with a number of AJAX sites on the web, we've crossed the tech barrier. Below is a quick review of the challenges that come with AJAX crawling and their indicative solutions:

Wednesday, February 12, 2014

3 Simple Steps for Incorporating Data into Your Marketing Strategy

Data is definitely a game-changer. It opens up newer opportunities for letting you save money, save time or make more money. Effective use of data can aid in decision-making or help optimize processes so that you leverage your resources to their maximum. Big data and other data-related tools and techniques such as visualization, data mining, data analytics, data integration etc. have revolutionized many sectors, from operations, research, management, tech and marketing to retail, banking, advertising and industrial automation; the list is endless. Big data efforts have indeed started showing results, coming a long way from being called 'hype' just some time ago. In fact, a study of global companies conducted by TCS suggests that 80% of companies have improved their decision making as a result of utilizing big data.

Monday, January 27, 2014

Can we use Big Data for Improving Millions of Lives?

Big data is constantly changing the fates of thousands of companies, big or small. The proliferation of tools for mining and analysing vast amounts of data is ensuring a wider reach within the corporate landscape. But can big data be used for the larger good of society?

Tuesday, January 21, 2014

Big Data Democratization via Web Scraping

If we had to put democratization of data in line with the classroom definition of democracy, it would read- Data by the people, for the people, of the people. Makes a lot of sense, doesn't it? It resonates with the general feeling we have these days with respect to easy access to data for our daily tasks, thanks to the internet revolution and now social media.

By the people- most of the public data on the web is a user group's sentiments, analyses and other information.

Thursday, January 16, 2014

12 Online tools that use Big Data for Empowering Consumers

Retailers use big data to extract as much money from consumers as they can. For consumers, doesn't it make sense to have ready access to the right kind of information, which enables them to make smart purchase decisions? Knowing how much to pay for a particular product or service helps avoid any sort of price discrimination by the seller.

Apps and websites empower anyone to research products thoroughly, make comparisons among different sellers and buy based on the best offers. Reviews and recommendations from other buyers as well as people in their own social network circles add another layer of reinforcement/dissuasion before making that buying decision.

Online cost estimators/calculators are quite useful tools as they put the power of making informed decisions in the hands of the consumers. Although such tools won't ensure that you don't get fleeced by the shopkeepers, they are surely a great way of providing a reference point in terms of how much you should be paying for consuming an equivalent product or service. Here's a list of tools that use 'big data' to enable smart buying by consumers: