Thursday, January 17, 2008

Fight Duplicate Content Filter Google

Changing Contents without Changing Theme - A Strategy to Fight Duplicate Content Filter: Any better suggestion and a Good Name?


After Google's Duplicate Content Update, I was thinking about a new way to create unique content with the same theme.

It is quite easy to rearrange words and even sentence structures, and to swap in synonyms for several keywords (unlike Semantic Markup), using Microsoft Word. But every person has limits and cannot produce a large number of versions of the same theme this way. Sooner or later, some versions, or portions of them, will be redundant.

To resolve this, the best way is to involve several content developers or analysts who replace synonyms and rearrange words and sentence structure while keeping the same theme. This can be quite handy for Affiliate Marketing websites.

Another way is to provide the content in French and Spanish, assign different French and Spanish translators (or machine translators) to translate it back into English, and then let them fine-tune (proofread) the result. I chose French and Spanish because machine translation for those languages is very smooth. Moreover, no two translations will be the same.

It can be done in 3 ways:
- With MS Word & Other Semantic/Thesaurus Tools (Something like Manual Semantic Markup)
- With Thematic Content Differentiators (English 2 English)
- With Translators (Multilingual to English)

Question-1: Can you suggest a good name for this (Changing Contents without Changing Theme)?

My suggestions are:
- Thematic Content Differentiation/Alteration
- Semantic Differentiation/Alteration
- Natural or Organic Semantic Parsing/Differentiation
- Manual Semantic Markup

I would be pleased to know your choice among these suggestions. However, I would highly appreciate an innovative name.

Question-2: Do you have any better suggestion about Differentiating Thematic Contents?

Note: The reason I don't like Semantic Markup is that it may become Black Hat in the future, and search engines reward people who work the hard way.


Stop The Slaughter Of Innocent Copy


It's one of the worst things to ever happen in the search engine copywriting field: the discovery of keyword density. Without any regard to flow or customer experience, website owners around the world began shoving keyphrases into their copy like wild men.

Key points

  • One common mistake many site owners and newbie copywriters make is to replace every single instance of a generic key term with one of their chosen keyphrases.
  • Use keyphrases to describe what your product or service is not, or what it is similar to or what it is better than.
  • Another frequent stumbling block for SEO copywriters is the use of phrases that seem to end abruptly. In these cases, simply add a word to the end.

I won't venture off into a discussion about whether keyword density is still a valid measure of search engine optimized (SEO) copywriting success. I will say, however, that the mere introduction of this concept led to the mutilation and destruction of innocent copy all across the globe. Without any regard to flow or customer experience, website owners around the world began shoving keyphrases into their copy like wild men. The results have been disastrous! Otherwise wonderful content has been utterly destroyed. This slaughter of innocent copy must stop!

All joking aside, the realization several years ago that keyword density was a factor in search engine rankings instantly transformed the landscape of copywriting for the engines. That lone concept lit a fire under people who absolutely butchered their copy for the sake of the engines. A pity really because it doesn't have to be that way.

Keep It Sounding Natural

One primary goal is to write copy so that the keyphrases are virtually undetectable when read by someone with no knowledge of SEO. One vital step in making this happen is to carefully research and select your keyphrases.

If you're writing a page about wedding gowns, it would be complicated to include keyphrases such as "wedding reception music" or "wedding caterers." The amount of traffic these terms might bring would be offset by the awkward fit with the focus of your page. Instead, opt for phrases that lend themselves directly to the topic of wedding gowns.

One common mistake many site owners and newbie copywriters make is to replace every single instance of a generic key term with one of their chosen keyphrases. Doing this in moderation is certainly acceptable, but frequently copywriters get carried away with tragic results.

For example, you would not want to have the following copy on your site:

Spanish Villas For Rent
*If you are looking for Spanish villas vacations, search our site for the best deals in Spanish villas. No other Spanish villas site has the selection of premium Spanish villas with the most sought after locations that we have. View some of our Spanish villas pictures or take virtual tours of our Spanish villas today.*

Whew! I get tired just reading that! Not only is it extremely annoying to read, but also many of the phrases are used incorrectly, making it look as though there are typos on the page. Not a pretty sight!
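To put a rough number on how stuffed that sample copy is, here is a minimal sketch that measures what share of its words belong to one keyphrase. The function name `keyphrase_density` and the formula (words inside phrase occurrences divided by total words) are assumptions for illustration; there is no single agreed definition of keyword density.

```python
import re

def keyphrase_density(text: str, phrase: str) -> float:
    """Share of the words in `text` that belong to occurrences of `phrase`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not words or n == 0:
        return 0.0
    # Count every position where the full phrase appears word for word.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)

copy = (
    "If you are looking for Spanish villas vacations, search our site for "
    "the best deals in Spanish villas. No other Spanish villas site has the "
    "selection of premium Spanish villas with the most sought after locations "
    "that we have. View some of our Spanish villas pictures or take virtual "
    "tours of our Spanish villas today."
)

density = keyphrase_density(copy, "Spanish villas")
print(f"{density:.0%} of the words belong to the keyphrase")
```

A figure like this only describes the copy; it says nothing about how any search engine scores the page.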

To keep your copy sounding as natural as possible, you need to think outside the keyword box. Most often, people believe that writing in a similar manner as the example above is the only way to use keywords in copy. Not true! In fact, far from it.

Let me share three of my favorite tips with you for creative writing with keyphrases.

Don't Use Keyphrases To Describe Your Products/Services

That's right, I said DON'T use keyphrases to describe your own products or services. Instead, use them to describe what your product or service is not, or what it is similar to or what it is better than.

An example of this is any keyphrase that begins with the word "cheap." "Cheap insurance," "cheap sunglasses," "cheap software" - the list is endless. It's simply not a good idea to call your own product cheap. Yes, I understand that people are looking for cheap things, but that is because they don't want to pay a lot. When THEY call your product cheap, it is in relation to price. When YOU call your own product or service cheap, it degrades the product or service's perceived value.

Instead, let others know that your product is NOT cheap. For example:

Unlike cheap travel insurance offered by other underwriters, our policies are provided by long-standing, publicly held companies with a history of exceptional customer service. You get affordable coverage and peace of mind.

The phrase is highly relevant to the page, you get to attract lots of visitors, and the copy is set to convince them that "cheap insurance" isn't what they really wanted after all.

How about this one? I got an email from a student asking me how to use the phrase "doggie litter box" in his copy even though that was not what he was selling. His product was a replacement for the doggie litter box, so I suggested he use the phrase in exactly that way. Here's what I would have done:

Here's a great solution for that messy doggie litter box. Attractive, compact and easy to use even in the smallest apartments, [Name of Product] is destined to replace the doggie litter box forever!

See? You aren't calling your product a litter box; rather you are positioning yourself against it to show how you are better.

Add A Word

Another frequent stumbling block for SEO copywriters is the use of phrases that seem to end abruptly. In these cases, simply add a word to the end. Here are two examples.

The phrase "web design for small business" seems out of place because, most often, we would use the plural term (small businesses) when we were writing. To correct the problem, just add a plural word to the end of the phrase. Perhaps you might talk about web design for small business startups or web design for small business owners. You get the idea.

Break It Up

When the phrases get too long, it is often best to break them up. Search engines don't pay attention to standard punctuation marks or line breaks. They read right through periods, commas, semi-colons and the like without hesitation. That means you have a lot more flexibility than you might think.

One keyphrase I had to work with was "Texas Hill Country real estate." That would get pretty cumbersome if it were left as it is seen there. But by breaking it up with some punctuation, it sounds perfectly natural. Here's how it can be done.

There is no more beautiful place than the Texas Hill Country. Real estate listings in this area are filled with stunning homes that …

Do you see what happened? I broke the phrase up using a period. In the eyes of the search engines the phrase is still intact. They don't even notice the period. That period, however, causes the reader to take a mental pause and helps alleviate any repetitive feel to the copy.
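The effect can be imitated with a toy tokenizer. This is only a sketch under the article's assumption that punctuation is dropped before phrases are matched; `tokenize` and `contains_phrase` are hypothetical helpers, not a description of how any particular search engine works.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text: str, phrase: str) -> bool:
    """True if the words of `phrase` appear consecutively once punctuation is ignored."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

copy = (
    "There is no more beautiful place than the Texas Hill Country. "
    "Real estate listings in this area are filled with stunning homes"
)

# The period between "Country" and "Real" disappears during tokenizing.
print(contains_phrase(copy, "Texas Hill Country real estate"))
```

Under this toy model the phrase still matches across the sentence break, which is the point the example copy relies on.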

If you take the time to look at SEO copywriting as an art rather than an assembly line task, your content will sound more natural, will convert better and will help prevent further additions to the already overcrowded collection of tortured copy everywhere!

Wednesday, January 16, 2008

Google's Knol Affect SEO Yahoo Ultimately Google

How Will Google's "Knol" Affect SEO, Yahoo! and Ultimately Google

I was one of the first Google Answers Researchers when they first launched their service several years ago. At the time, it seemed like the future of knowledge aggregation. The best part about Google Answers was that it allowed anyone to ask a question for a nominal fee and approved researchers could jump at the opportunity to answer it. The service was so popular that many researchers were actually making a living from answering questions. Unfortunately, when Google loses interest in something, like their Google Search SOAP API, they simply shut it down — regardless of what it means to the people who have come to depend on it for answers and income. That was the fate of Google Answers. Similar to their SOAP API, Google Answers lives on as a static relic, reminding us that they are the ones in control, not their users.

All of this came to mind when I read Rex Hammock's article about a new service Google is working on called Knol. Udi Manber, VP of Engineering at Google, described knol like this:

A knol on a particular topic is meant to be the first thing someone who searches for this topic for the first time will want to read. The goal is for knols to cover all topics, from scientific concepts, to medical information, from geographical and historical, to entertainment, from product information, to how-to-fix-it instructions. Google will not serve as an editor in any way, and will not bless any content. All editorial responsibilities and control will rest with the authors. We hope that knols will include the opinions and points of view of the authors who will put their reputation on the line. Anyone will be free to write. For many topics, there will likely be competing knols on the same subject. Competition of ideas is a good thing.

This sounds like a good idea — and may actually be one — but Rex had some sobering ideas about knol and its potential threat to Wikipedia. He listed the following reasons why Wikipedia won't be killed by Google's knol:

  1. Google's resources and dominance may be massive, but Google hasn't reached death star status
  2. Google may have more resources than anyone else, but it doesn't have enough resources to fight endless multi-front wars
  3. Google may have an army of PhDs, but Wikipedia has a militia of Ph.D candidates
  4. Knol is not an encyclopedia — or a wiki — or even kinda like a wiki, so how's it going to kill something it's not like?
  5. Wikipedia's business model crushes even Google's
  6. Knol may finally wake up the hippie fretards who keep Wikipedia from rolling in cash like the Mozilla Foundation

(Rex goes into further detail on all of those points in his blog entry, Has Google killed Wikipedia with a shot from the grassy knol? Get real.)

However, for me it has less to do with who will win than with what opportunities may arise from the introduction of a huge online resource like knol. Being an SEO specialist, I mainly concern myself with how I can use a new online service for marketing purposes. For example, if I have a client who makes, distributes or sells widgets, will I be able to write about those widgets? More importantly, will I be able to link to my client's website and will they use rel="nofollow" on the links? My guess is that I will be able to contribute, just like I can now on Wikipedia, but everything will continue to be heavily moderated by the community leaders.

As for rel="nofollow", it would be nice to see Google take a bold move and utilize rel="nofollow" in a way that makes more sense. For example, I would like a policy that automatically applies rel="nofollow" to new links, but after several months, if nobody has removed the link, the rel="nofollow" attribute is removed. That way, the community itself determines the validity of a link. If it's worth leaving on the site, then it's worth a search engine to follow it and consider it in their search algorithm.
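The policy suggested above is simple enough to sketch. Everything here is hypothetical: the function name `rel_attribute`, the 180-day stand-in for "several months", and the idea that the site records when each link was added. A link that the community has already deleted would simply never reach this function.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed probation period standing in for "several months".
PROBATION = timedelta(days=180)

def rel_attribute(added_on: date, today: date) -> Optional[str]:
    """Return "nofollow" while a link is on probation, or None once it has survived."""
    if today - added_on < PROBATION:
        return "nofollow"
    return None

today = date(2008, 1, 16)
print(rel_attribute(date(2008, 1, 2), today))  # nofollow (two weeks old)
print(rel_attribute(date(2007, 6, 1), today))  # None (older than the probation period)
```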

Regardless of how Google plans to implement knol, there is definitely a reason to be concerned if you're in the field of search marketing. Udi stated quite clearly their intention for knol as it relates to Google search. And even though he says Google won't endorse knol's content, you can't escape the obvious dual relationship that's being created (Google most certainly being the therapist in this relationship!).

A knol on a particular topic is meant to be the first thing someone who searches for this topic for the first time will want to read. The goal is for knols to cover all topics, from scientific concepts, to medical information, from geographical and historical, to entertainment, from product information, to how-to-fix-it instructions.

As Google increasingly enters the content space (no longer scraping other sites, but actually producing and delivering the content themselves), they stand to overtake their own search engine results pages (SERPs). The introduction of Universal Search and the growing number of Google Web properties will almost certainly ensure that the above-the-fold search results will be from Google entities or those closely aligned with Google. If that becomes true, and users continue to flock to Google's search engine for information, Google will not only become a target for influencing SERPs, it will also become the next target for content manipulation.

If that's truly a glimpse of what the future holds, then there may actually be a chance for other quality search engines like Yahoo! to compete. Google could easily end up shutting out their core users and promoters, thus influencing a mass exodus to other search engines. Another possibility is anti-trust. If Google succeeds in controlling all content, including the physical dissemination of that content, then Google may be in for a long Microsoft-like ride.


Encouraging people to contribute knowledge

The web contains an enormous amount of information, and Google has helped to make that information more easily accessible by providing pretty good search facilities. But not everything is written nor is everything well organized to make it easily discoverable. There are millions of people who possess useful knowledge that they would love to share, and there are billions of people who can benefit from it. We believe that many do not share that knowledge today simply because it is not easy enough to do that. The challenge posed to us by Larry, Sergey and Eric was to find a way to help people share their knowledge. This is our main goal.

Earlier this week, we started inviting a selected group of people to try a new, free tool that we are calling "knol", which stands for a unit of knowledge. Our goal is to encourage people who know a particular subject to write an authoritative article about it. The tool is still in development and this is just the first phase of testing. For now, using it is by invitation only. But we wanted to share with everyone the basic premises and goals behind this project.

The key idea behind the knol project is to highlight authors. Books have authors' names right on the cover, news articles have bylines, scientific articles always have authors -- but somehow the web evolved without a strong standard to keep authors names highlighted. We believe that knowing who wrote what will significantly help users make better use of web content. At the heart, a knol is just a web page; we use the word "knol" as the name of the project and as an instance of an article interchangeably. It is well-organized, nicely presented, and has a distinct look and feel, but it is still just a web page. Google will provide easy-to-use tools for writing, editing, and so on, and it will provide free hosting of the content. Writers only need to write; we'll do the rest.

A knol on a particular topic is meant to be the first thing someone who searches for this topic for the first time will want to read. The goal is for knols to cover all topics, from scientific concepts, to medical information, from geographical and historical, to entertainment, from product information, to how-to-fix-it instructions. Google will not serve as an editor in any way, and will not bless any content. All editorial responsibilities and control will rest with the authors. We hope that knols will include the opinions and points of view of the authors who will put their reputation on the line. Anyone will be free to write. For many topics, there will likely be competing knols on the same subject. Competition of ideas is a good thing.

Knols will include strong community tools. People will be able to submit comments, questions, edits, additional content, and so on. Anyone will be able to rate a knol or write a review of it. Knols will also include references and links to additional information. At the discretion of the author, a knol may include ads. If an author chooses to include ads, Google will provide the author with substantial revenue share from the proceeds of those ads.

Once testing is completed, participation in knols will be completely open, and we cannot expect that all of them will be of high quality. Our job in Search Quality will be to rank the knols appropriately when they appear in Google search results. We are quite experienced with ranking web pages, and we feel confident that we will be up to the challenge. We are very excited by the potential to substantially increase the dissemination of knowledge.

We do not want to build a walled garden of content; we want to disseminate it as widely as possible. Google will not ask for any exclusivity on any of this content and will make that content available to any other search engine.

As always, a picture is worth a thousand words, so an example of a knol is below (double-click on the image to see the page in full). The main content is real, and we encourage you to read it (you may sleep better afterwards!), but most of the meta-data -- like reviews, ratings, and comments -- are not real, because, of course, this has not been in the public eye as yet. Again, this is a preliminary version.

Seo Updates

Yahoo Testing Relevance Variety Search Results

Yahoo on Testing Relevance and Variety in Search Results


At Yahoo, if you’ve ever seen the words “Also Try” at the top or bottom of a set of search results, along with a list of selected queries, then you may have seen part of Yahoo’s internal relevance and variety checking process in action.

[Image: a Yahoo search box containing the search term “jaguar,” with results that include the phrase “Also try” and suggestions of other queries]

Determining Relevance and Variety

The process that provides those “also try” results also may be a way for the search engine to check up on how well they are doing - how relevant their results are, and how much variety they provide.

This relevance and variety process goes roughly (very roughly) like this:

Looking for Related Terms in Query Logs

Someone searches at Yahoo, and search results are returned. Each time someone searches like that, an entry is made in a query log.

Query logs at the search engine are looked at to find a number of the top related terms for a query. The actual number of “top related queries” might be different for each query.

These “related terms” are queries that might have included the word or words from the original query within them, and may be considered as units - distinct phrases, terms or concepts recognized by the search engine. In my “Also try” image example above, we see “jaguar cars,” “jaguar xf,” “jaguar animal pictures,” and “jaguar parts.”

Those are terms that are “related” to the query “jaguar” under this process. The related terms in log files might only be looked at for a specific period of time, like the last week or two.

If you were to then take that “top” set of queries that contained the primary query term (jaguar) and see how many times each of the related query terms appeared relative to each other, you could get a “relative frequency.”

Example of relative frequency (roughly, out of 60 appearances of related terms):

jaguar cars - 30 times (50 percent)
jaguar xf - 15 times (25 percent)
jaguar animal pictures - 10 times (17 percent)
jaguar parts - 5 times (8 percent)
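The percentages above are each term's count divided by the total for all related terms. A minimal sketch of that arithmetic, using the example counts from this post (the function name `relative_frequency` is my own):

```python
def relative_frequency(counts: dict[str, int]) -> dict[str, float]:
    """Turn raw counts into each term's share of the total."""
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in counts}
    return {term: n / total for term, n in counts.items()}

# Example counts from the post: 60 appearances of related terms in all.
query_log_counts = {
    "jaguar cars": 30,
    "jaguar xf": 15,
    "jaguar animal pictures": 10,
    "jaguar parts": 5,
}

shares = relative_frequency(query_log_counts)
for term, share in shares.items():
    print(f"{term}: {share:.0%}")
```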

Yahoo might also look to see how often the top related terms were used during query sessions from individuals, to redefine their queries. For example, how often does someone searching for “jaguar” then go on to search for “jaguar cars” or “jaguar animal pictures”?

Related Terms in Search Results

If you search for “jaguar” and look at a set number of the top search results (let’s say the top 100), you can count how many of those results match each of the top related terms. Expressing each count as a percentage of the total for all related terms gives the “relative frequency in relation to all terms in the set of terms” for each related term.

Looking at the top 100 results (to keep the math simple), we might see how often the word “jaguar” and the other term or terms appeared on the same pages in those results. Let’s just guess at some numbers to show how this works:

jaguar cars - 39 times (39 percent)
jaguar xf - 24 times (24 percent)
jaguar animal pictures - 15 times (15 percent)
jaguar parts - 22 times (22 percent)

Under the patent application, this part of the process might look at the actual content found upon the pages pointed to in the search results, or it might limit itself to only counting results where those words appear in the page title and abstract for each search result.

Comparing Query Logs with Search Results

If we match up the number of times that people searched for the top related terms for “jaguar” with the number of times that results for those related terms appear in search results for “jaguar”, we might be able to use those numbers to see how “relevant” the search results are for the primary query term “jaguar.”

jaguar cars - 50 percent of queries, 39 percent of search results
jaguar xf - 25 percent of queries, 24 percent of search results
jaguar animal pictures - 17 percent of queries, 15 percent of search results
jaguar parts - 8 percent of queries, 22 percent of search results

How well do the searches for the top related terms in query logs match up with appearances of those top related terms in search results for the primary search term or phrase?

If they match up well, then you might be able to say that the search engine is providing relevant results. If the frequencies of appearances (percentages) don’t match up well, then it’s possible that a search algorithm or two might need to be tweaked by a search engineer.
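One way to make "match up well" concrete is to look at the gap between the two percentages for each term. The sketch below is an assumption on my part: this post does not say which measure the patent application uses, so the per-term gap, the total variation distance (half the sum of the absolute gaps), and the 10-point threshold are all illustrative choices.

```python
# Shares from the example above: query logs versus search results for "jaguar".
query_share = {
    "jaguar cars": 0.50,
    "jaguar xf": 0.25,
    "jaguar animal pictures": 0.17,
    "jaguar parts": 0.08,
}
result_share = {
    "jaguar cars": 0.39,
    "jaguar xf": 0.24,
    "jaguar animal pictures": 0.15,
    "jaguar parts": 0.22,
}

def share_gaps(queries: dict[str, float], results: dict[str, float]) -> dict[str, float]:
    """Result share minus query share per term (positive = over-represented in results)."""
    terms = set(queries) | set(results)
    return {t: results.get(t, 0.0) - queries.get(t, 0.0) for t in terms}

def total_variation(queries: dict[str, float], results: dict[str, float]) -> float:
    """Half the sum of absolute gaps: 0 means identical shares, 1 means no overlap."""
    return 0.5 * sum(abs(g) for g in share_gaps(queries, results).values())

THRESHOLD = 0.10  # assumed cut-off for "doesn't match up well"

gaps = share_gaps(query_share, result_share)
flagged = sorted(t for t, g in gaps.items() if abs(g) > THRESHOLD)
print("terms to review:", flagged)
print(f"overall mismatch: {total_variation(query_share, result_share):.2f}")
```

With the guessed numbers, "jaguar parts" takes up far more of the results than of the queries, and "jaguar cars" takes up less, so those are the two terms a search engineer might look at first.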

Checking for Variety of Search Results

This might be as simple as making sure that each of the top number of top related terms that appear within the queries also appear within the top number of search results at least once.
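A minimal sketch of that check follows. The result titles are invented sample data, and treating a term as "appearing" when all of its words occur in a result's text is my simplifying assumption.

```python
def appears(term: str, text: str) -> bool:
    """True if every word of `term` occurs somewhere in `text`."""
    words = set(text.lower().split())
    return all(w in words for w in term.lower().split())

def missing_terms(related_terms: list[str], results: list[str], top_n: int = 10) -> list[str]:
    """Related terms that never show up in the first `top_n` results."""
    top = results[:top_n]
    return [t for t in related_terms if not any(appears(t, r) for r in top)]

related = ["jaguar cars", "jaguar xf", "jaguar animal pictures", "jaguar parts"]

# Invented titles standing in for the top search results.
sample_results = [
    "Jaguar cars official site",
    "Jaguar XF review and specs",
    "Used Jaguar parts",
    "Jaguar cars for sale",
]

print(missing_terms(related, sample_results))
```

In this made-up result set nothing covers "jaguar animal pictures", so the variety check would fail for that term.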

The Patent Application

Automatic relevance and variety checking for web and vertical search engines
Invented by Jignashu G. Parikh
US Patent Application 20080010269
Published January 10, 2008
Filed: July 5, 2006

Yahoo came out with another patent application a while back, Using matrix representations of search engine operations to make inferences about documents in a search engine corpus, which explores the use of query histories to improve search results.

The inventor listed in that document published a paper that appears related, titled Unity: relevance feedback using user query logs.

His co-author on that paper is listed as the inventor of this new patent application from Yahoo.

Yahoo Replaces PageRank Assumptions with User Data

PageRank is an algorithm that measures the importance or quality of a Web document.

It can be used in a number of ways by a search engine, such as being combined with relevance factors to rank search results, or to determine which web pages to crawl (pdf) and how frequently to crawl them, or which part of a database a document should be placed within.

Search algorithms are based upon assumptions about how people use the Web, how they might search, what they might pay attention to, and what they might find important. That’s true with PageRank in both theory, and how it may be used in actual practice.

Challenging PageRank Assumptions

It’s good to see folks in the search community challenging some assumptions behind PageRank. A patent application from Yahoo, published last week, raises a number of issues, and it comes from people who know PageRank very well.

Here are some problems the inventors of the patent filing point to involving some basic assumptions about PageRank:

Not All Links are Equal — people don’t randomly choose links on pages that they visit - some links are more important than others, and some are rarely followed at all, like “disclaimer” links.

The assumption that all the outgoing links in a Web page are followed by a random surfer uniformly randomly is unrealistic. In reality, links can be classified into different groups, some of which are followed rarely if at all (e.g., disclaimer links).

Such “internal links” are known to be less reliable and more self-promotional than “external links” yet are often weighted equally. Attempts to assign weights to links based on IR similarity measures have been made but are not widely used.

See, for example, The Intelligent Surfer. Probabilistic Combination of Link and Content Information in PageRank (pdf), M. Richardson and P. Domingos, Advances in Neural Information Processing Systems 14, MIT Press, 2002.

Bored Surfers Don’t Go to Random Pages — one of the assumptions of the PageRank formula is that sometimes, instead of following a link on a page, the “random surfer” will grow bored and just go anywhere else at random. The patent application notes that it is unrealistic to assume that most people using the web choose major portals and tiny home pages with an equal probability. When someone leaves a page to go somewhere else (a uniform teleportation jump to any random page under PageRank) it’s unlikely to be any random page at all where they will go.

Bored Surfers Don’t Only Go to Trusted Pages — when that “random surfer” leaves instead of following links, it’s also unlikely that they will only go to a trusted set of pages or sites, under something like TrustRank (See, for example, Combating Web Spam with TrustRank - pdf). This assumption really has nothing to do with how people actually use the Web, but is instead retrofitted into PageRank to combat link spam instead of being “reflective of real-world user behavior.”

Pages Change and Lose Value at Different Rates — the PageRank process also ignores that pages are purchased and repurposed, or decay and become less valuable over time and do so at very different rates.

Sometimes PageRank Calculations Cheat — some uses of PageRank formulations in practice are “typically implemented with regard to aggregations of pages by site, host, or domain, also referred to as ‘blocked’ PageRank.” See Exploiting the Block Structure of the Web for Computing PageRank (pdf). This means that links between pages are being somehow aggregated to a block level. The patent application tells us that, “Unfortunately, most heuristics for performing this aggregation do not work well.”
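The first two assumptions questioned above, uniform link choice and uniform teleportation, are easy to see in a textbook PageRank iteration. The sketch below is the standard power-iteration form with a damping factor of 0.85, not any search engine's production code, and the four-page site is invented.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Textbook PageRank by power iteration over a small link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption 2: a bored surfer jumps to every page with equal probability.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Assumption 1: every outgoing link is followed equally often.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Invented site: the disclaimer link counts as much as the products link.
links = {
    "home": ["products", "about", "disclaimer"],
    "products": ["home"],
    "about": ["home"],
    "disclaimer": ["home"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Because the three links on the home page are treated identically, the disclaimer page ends up with the same score as the products page, which is exactly the kind of result the patent filing calls unrealistic.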

User Sensitive PageRank Patent Application

I mentioned that the people behind the patent application know PageRank well. One of the most comprehensive and detailed documents I’ve seen on PageRank is A Survey on PageRank Computing, which was written by one of the named inventors on the following document. It’s also cited in the patent filing.

User-sensitive pagerank
Invented by Pavel Berkhin, Usama M. Fayyad, Prabhakar Raghavan, Andrew Tomkins
Assigned to Yahoo
US Patent Application 20080010281
Published January 10, 2008
Filed: June 22, 2006

Abstract

Techniques are described for generating an authority value of a first one of a plurality of documents. A first component of the authority value is generated with reference to outbound links associated with the first document. The outbound links enable access to a first subset of the plurality of documents.

A second component of the authority value is generated with reference to a second subset of the plurality of documents. Each of the second subset of documents represents a potential starting point for a user session.

A third component of the authority value is generated representing a likelihood that a user session initiated by any of a population of users will end with the first document.

The first, second, and third components of the authority value are combined to generate the authority value. At least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.

The patent application adds elements of user behavior to the calculation of PageRank.

Link Weight — the weight or value of links can be influenced by actual “user data representing a frequency with which the corresponding outbound link was selected by a population of users.”

Likelihood of Randomly Leaving to a New Page — the chance that someone might leave (or teleport) to another page instead of following a link on a page is also influenced by user data.

Satisfaction with Found Pages — the probability that someone might stop, and not visit new pages by following links on the page they are on also is calculated by looking at user data.

These three components can be used to create an “authority value” for a document on the Web.
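To get a feel for how user data could stand in for the uniform assumptions, here is a rough sketch. It is not the patent application's formulation, which this post does not reproduce. I am assuming a simple restart model: links are followed in proportion to observed clicks, each page has its own probability that the visitor stops or leaves, and a visitor who leaves begins a new session at a page drawn from observed session-start frequencies. All names and numbers are invented.

```python
def user_sensitive_rank(clicks: dict[str, dict[str, int]],
                        start_prob: dict[str, float],
                        leave_prob: dict[str, float],
                        iterations: int = 100) -> dict[str, float]:
    """Rank pages with click-weighted links and data-driven restarts.

    `start_prob` must list every page (use 0.0 for pages nobody starts on)
    and sum to 1.
    """
    pages = sorted(start_prob)
    rank = dict(start_prob)
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        restart_mass = 0.0
        for page in pages:
            out = clicks.get(page, {})
            total = sum(out.values())
            # With no recorded clicks, everyone on the page is treated as leaving.
            leave = leave_prob.get(page, 0.15) if total > 0 else 1.0
            restart_mass += rank[page] * leave
            if total > 0:
                for target, count in out.items():
                    new_rank[target] += rank[page] * (1.0 - leave) * count / total
        # People who left begin again where sessions are observed to start.
        for p in pages:
            new_rank[p] += restart_mass * start_prob[p]
        rank = new_rank
    return rank

# Invented user data for a four-page site.
clicks = {
    "home": {"products": 90, "about": 9, "disclaimer": 1},
    "products": {"home": 10},
    "about": {"home": 5},
    "disclaimer": {"home": 1},
}
start_prob = {"home": 0.70, "products": 0.25, "about": 0.05, "disclaimer": 0.0}
leave_prob = {"home": 0.10, "products": 0.50, "about": 0.30, "disclaimer": 0.30}

scores = user_sensitive_rank(clicks, start_prob, leave_prob)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

With click counts in place of equal link weights, the rarely clicked disclaimer page drops well below the products page, even though both are linked from the same place.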

The importance of anchor text, and other text associated with a link, is also addressed in User Sensitive PageRank:

According to yet another embodiment, an authority value of a first one of a plurality of documents is generated.

Text associated with each of a plurality of inbound links enabling access to the first document is identified.

A weight is assigned to the text associated with each of the inbound links.

Each of the weights is derived with reference to user data representing a frequency with which the corresponding inbound link was selected by a population of users.

The authority value is generated with reference to the weights.
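As a rough illustration of that embodiment, the sketch below gives each piece of anchor text a weight equal to its share of the clicks on the inbound links that carry it. The function name and the simple proportional weighting are assumptions; the patent application doesn't commit to this particular formula.

```python
from collections import defaultdict


def anchor_text_weights(inbound_links):
    """inbound_links: (anchor_text, click_count) pairs for one document.

    Returns each anchor text's share of all clicks on the inbound links.
    The click counts are hypothetical user data.
    """
    total = sum(count for _, count in inbound_links)
    if total == 0:
        return {}
    weights = defaultdict(float)
    for text, count in inbound_links:
        # Links with the same text (ignoring case) pool their clicks.
        weights[text.lower()] += count / total
    return dict(weights)


weights = anchor_text_weights(
    [("Cheap Flights", 30), ("cheap flights", 30), ("click here", 40)]
)
```

Under this weighting, anchor text on a link nobody clicks contributes nothing, no matter how many such links exist.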

The Role of User Data

User data incorporated into this algorithm should “reflect the behavior and/or demographics of an underlying user population.” It’s actual real user data reflecting the way that people browse pages. User Sensitive PageRank can reflect “the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected.”

Other Implications of a User Sensitive PageRank

The patent application describes a number of different mathematical formulations to calculate this User Sensitive PageRank. I’m not going to delve deeply into those. It also addresses some other interesting implications:

User Segment Personalized PageRank — user data from different demographic profiles (based upon age, gender, income, user location, user behavior, etc.) could be specified, so that search results could be different for people from those different demographics. This could be used with other approaches to personalized PageRank, like a Topic Sensitive PageRank.

People Visit Blocks — user behavior based upon visiting and browsing blocks (sites, hosts, or domains) may be helpful in understanding how people go from one block to another block, and augment a block level PageRank approach based solely upon links between those blocks.

How the Passage of Time Can Affect PageRank — PageRank should be updated regularly because the links between pages on the Web change over time. Pages that might be considered core pages can also change in significance, or go out of fashion even though the links to and from those pages haven’t changed. Incorporating user data into PageRank means that recent events can be emphasized, and older events discounted.

Choosing Pages to Crawl — PageRank can be used in determining whether to crawl and follow links associated with a page. The addition of user data in PageRank may make choosing easier.

Beyond PageRank to Analysis of Text Associated with Links — anchor text can be “one of the most useful features used in ranking retrieved Web search results.” The importance of anchor text (and related text) can be associated with user behavior scores much like the importance of link weights can vary in User Sensitive PageRank.
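The passage-of-time point above, that recent events can be emphasized and older events discounted, can be pictured with a simple decay. The sketch below counts each click as worth half as much for every 30 days of age. The exponential form and the 30-day half-life are my assumptions, not something the patent application specifies.

```python
def decayed_click_count(click_ages_days, half_life_days=30.0):
    """Sum of clicks where each click loses half its value every half-life.

    click_ages_days: ages of individual clicks in days (hypothetical data).
    """
    return sum(0.5 ** (age / half_life_days) for age in click_ages_days)


# Three clicks: one today, one 30 days old, one 60 days old.
recent_heavy = decayed_click_count([0, 30, 60])
```

A decayed count like this could stand in for the raw click count wherever user data feeds a link weight, so that a page that has gone out of fashion slowly loses influence.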

Conclusion

PageRank, in most of the different formulations that have been described in patent filings and papers, focuses upon links published upon the Web, and makes a number of assumptions about how people visit, browse, and use documents attached to those links.

User Sensitive PageRank attempts to replace some of those assumptions with actual user data about how people do travel to and use Web documents.

Google on Reading Text in Images

Google on Reading Text in Images from Street Views, Store Shelves, and Museum Interiors



One of the standard rules of search engine optimization that’s been around for a long time is that “search engines cannot read text that is placed within images.” What if that changed?

How easy or difficult is it for a search engine to recognize text within digital images and video, and index that text?

Three new Google patent applications explore that topic, and describe some ways in which Google might try to capture information from text within images.

Capturing Text from Street View Images

These patent filings don’t address text found within headings and logos, but rather much more complex pictures, including street scenes of the kind that might be taken, for instance, when filming streets for something like Google’s Street Views (video).

They also discuss the use of picture taking robots inside stores and museums.

The documents lay out some of the obstacles faced in reading that kind of text:

The text within images can be difficult to automatically identify and recognize due both to problems with image quality and environmental factors associated with the image. Low image quality is produced, for example, by low resolution, image distortions, and compression artefacts.

Environmental factors include, for example, text distance and size, shadowing and other contrast effects, foreground obstructions, and effects caused by inclement weather.

Recognizing text in images
Invented by Luc Vincent and Adrian Ulges
US Patent Application 20080002893
Published January 3, 2008
Filed June 29, 2006

Abstract

Methods, systems, and apparatus including computer program products for recognizing text in images are provided. In one implementation, a computer-implemented method for recognizing text in an image is provided. The method includes receiving a plurality of images.

The method also includes processing the images to detect a corresponding set of regions of the images, each image having a region corresponding to each other image region, as potentially containing text. The method further includes combining the regions to generate an enhanced region image and performing optical character recognition on the enhanced region image.

These other two patent filings also cover aspects of the process described:

The patent applications describe in detail how images might be processed to make it easier to identify and extract text within many different types of images.

An example of how this process might be used is that in an urban scene, text recognition could be used to “identify such things as building addresses, street signs, business names, restaurant menus, and hours of operation.”

Images from Digital Cameras and Video Recordings

Images used might include those captured from conventional digital cameras or video recording devices. Those pictures might include panoramic images, still images, or frames of digital video. A system for capturing some of the images might also incorporate the use of three-dimensional ranging data as well as location information.

That sounds like some information that might be captured when pictures are taken for a project like Google’s Street Views program.

Associating Locations with Images

A panoramic image of a street scene might capture more than one street address, like a city block, or a string of locations on a street. This might be done using a moving camera. Locations could be associated with those images using GPS coordinates.

While there are other options presented to collect GPS information, here’s a description of how it might be determined for something like the Street Views project:

Additionally, exact GPS coordinates of every image or vertical line in an image can be determined.

For example, a differential GPS antenna on a moving vehicle can be employed, along with wheel speed sensors, inertial measurement unit, and other sensors, which together allow a very accurate GPS coordinate to be computed for each image or portions of the image.

The text detection and classification (text versus non-text) process is also presented in some detail within the patent filings. Part of that process involves comparing patterns across images that resemble each other. So, known city street scenes are looked at when classifying other city street scenes to try to determine if text appears within those images.

The three-dimensional range information might also help detect false positives, where the system believes that it has identified an area that may contain text, and it actually hasn’t.

Solving Problems Reading Text

Some of the problems involved with characters that are difficult to read, because of small size or blurriness or distortions or other problems, might be solved by looking at more than one image, from pictures that are slightly offset from each other:

For example, a high speed camera taking images as a machine (e.g., a motorized vehicle) can traverse a street perpendicular to the target structures. The high speed camera can therefore capture a sequence of images slightly offset from each previous image according to the motion of the camera.

Thus, by having multiple versions of a candidate text region, the resolution of the candidate text region can be improved using the superresolution process.

Additionally, a candidate text region that is partially obstructed from one camera position may reveal the obstructed text from a different camera position (e.g., text partially obscured by a tree branch from one camera position may be clear from another).
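Real superresolution involves lining frames up at sub-pixel precision and rebuilding a sharper image, which is beyond a short example. The sketch below is a much simpler stand-in that assumes the frames are already aligned: it averages each pixel across frames and skips any pixel marked as obstructed. The function name and the use of None for obstructed pixels are my own conventions.

```python
def average_frames(frames):
    """Combine aligned grayscale frames of the same candidate text region.

    frames: list of 2D lists of pixel values; None marks an obstructed pixel.
    Each output pixel is the mean of the frames in which it was visible.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    combined = []
    for r in range(rows):
        row = []
        for c in range(cols):
            seen = [f[r][c] for f in frames if f[r][c] is not None]
            row.append(sum(seen) / len(seen) if seen else None)
        combined.append(row)
    return combined


# Two tiny one-row frames; the second pixel is hidden in the first frame.
frames = [
    [[10, None]],
    [[14, 20]],
]
combined = average_frames(frames)
```

Even this crude version shows both benefits mentioned in the quoted passage: noise is smoothed where several frames agree, and a pixel hidden in one frame is filled in from another.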

While the text recognition part of this process will try to use variations of character recognition, they may also try to find certain specific business names that are kept in a database, such as McDonalds, Fry’s Electronics, H&R Block, and Pizza Hut.

They could also try to find text from images at certain locations by looking at information from places like Yellow Pages listings.
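One way to picture the database lookup is as fuzzy matching: imperfect character recognition output is compared against a list of known names and snapped to the closest one. The sketch below uses Python's standard difflib module for that. The function name, the short name list, and the 0.7 similarity cutoff are assumptions for illustration, not details from the patent filings.

```python
import difflib

# Example names taken from the article; a real list would be far longer.
KNOWN_NAMES = ["McDonalds", "Fry's Electronics", "H&R Block", "Pizza Hut"]


def match_business_name(ocr_text, names=KNOWN_NAMES, cutoff=0.7):
    """Return the known name closest to the OCR output, or None if none is close."""
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


# "McDona1ds" has a digit 1 where the letter l should be, a typical OCR slip.
best_guess = match_business_name("McDona1ds")
```

The same idea would apply to a Yellow Pages listing for a known address: the list of candidate names simply gets much shorter, which makes a confident match more likely.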

Some Specific Applications Under this Process

Where this gets really interesting is in the descriptions of some of the ways text recognition and extraction from images might be used by the search engine, including the use of robots within stores and museums.

Image search - The text taken from images can be indexed and associated with the image. That can then be used in different search result applications like image search, and mapping, or other applications.

Images are Associated with a Mapping Program - Extracted text from street scene images can be indexed and associated with a mapping application. People can then search for a location by business name, address, store hours, or other keywords.

The mapping application can also retrieve images matching the user’s search - like looking for a McDonald’s in a particular city or near a particular address - the mapping program would create a map showing the location of the McDonald’s as well as a picture of the restaurant.

Images Near Specific Locations are Associated with Each Other - Since the images are associated with location data, the mapping program can provide images of other businesses near a searched location, and show their locations on a map.

Images of Similar Businesses Presented as Alternatives - Images of businesses that offer similar goods or services may be presented to the searcher as alternatives. So, a search for McDonalds might show other nearby fast food joints.

Advertisements Shown with Images - Advertisements can be presented along with images. When a business is shown in an image, an ad for the business may also be shown. Or, ads for alternative businesses could be displayed. And ads can be shown for products associated with the business.

Google Interior Images - While this patent filing describes many images from street scenes, this indexing can be applied to other image sets. One of the more interesting sections of this patent application:

In one implementation, a store (e.g., a grocery store or hardware store) is indexed. Images of items within the store are captured, for example, using a small motorized vehicle or robot. The aisles of the store are traversed and images of products are captured in a similar manner as discussed above.

Additionally, as discussed above, location information is associated with each image. Text is extracted from the product images. In particular, extracted text can be filtered using a product name database in order to focus character recognition results on product names.

Kudos to Sandra Niehaus, and her post Google Interiors - the day my house became searchable. I know your post was satire, Sandra, but good call.

Searching Museums for Images - Images associated with museums could also be indexed. Many museums include text displays associated with exhibits, artefacts, and objects. Images of museum items and the associated text displays can be captured using a process like that involved in indexing a store.

Location information can be associated with each captured image. Museums can be searched, or browsed, to learn about the various objects.
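The indexing idea that runs through these applications, whether the images come from streets, stores, or museums, can be sketched as a small inverted index: each word extracted from an image points back to the image and its location. The class and method names below are hypothetical, and a real system would need much more (ranking, address parsing, distance search).

```python
from collections import defaultdict


class ImageTextIndex:
    """Toy inverted index from extracted words to images and their locations."""

    def __init__(self):
        self._by_word = defaultdict(set)
        self._locations = {}

    def add(self, image_id, lat, lon, extracted_text):
        """Record an image's location and index every word found in it."""
        self._locations[image_id] = (lat, lon)
        for word in extracted_text.lower().split():
            self._by_word[word].add(image_id)

    def search(self, query):
        """Return (image_id, (lat, lon)) for images containing every query word."""
        words = query.lower().split()
        if not words:
            return []
        ids = set.intersection(*(self._by_word.get(w, set()) for w in words))
        return sorted((i, self._locations[i]) for i in ids)


# Made-up images and coordinates.
index = ImageTextIndex()
index.add("img1", 37.77, -122.42, "Pizza Hut open 11am")
index.add("img2", 37.78, -122.41, "H&R Block")
results = index.search("pizza hut")
```

Because every hit carries coordinates, the same lookup could feed a map view, a list of nearby businesses, or an aisle location inside a store.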

Why SEOs Will Kill Wikia Search

When you launch a new search engine, admitting defeat at the outset isn’t exactly a promising start. Jimmy Wales, founder of Wikia Search–which launches today–did just that.

“We want to make it really clear that when people arrive and do searches, they should not expect to find a Google killer,” Mr. Wales told the New York Times.

Hmm, not exactly the statement of faith I’d want to make when launching a new product. Sure, Wikia Search may technically be in “alpha”–what happens when that gets exhausted by start-ups, will we have to invent a new Greek letter?–but do you really want the media (and users) to form a negative sentiment before even giving Wikia a chance to impress?

Judging by early reviews, Wikia is living up to Wales’ expectations. TechCrunch’s Michael Arrington–the bestower of all tech reputations–isn’t exactly impressed with the launch.

“…it may be one of the biggest disappointments I’ve had the displeasure of reviewing.”

Makes you want to rush over there and test it out, doesn’t it?

Still, the teething problems Wikia faces now will be nothing compared to the onslaught it will face, should it reach any level of success–the kind measured by market share, not self-righteous satisfaction that your search engine is built by the people, for the people.

SEOs Will Game Wikia Search

If Jimmy Wales thought the Wikipedia-gaming by search engine optimizers was a pain in the ass, their efforts will feel like a welcome pat on the derrière compared to the kick up the butt they’ll provide, once they start figuring out the inner-workings of Wikia’s algorithm.

Like other search engines and sites that rely on the so-called “wisdom of crowds,” the Wikia search engine is likely to be susceptible to people who try to game the system, by, for example, seeking to advance the ranking of their own site. Mr. Wales said Wikia would attempt to “block them, ban them, delete their stuff,” just as other wiki projects do.

“Attempt” is the key word here. Even the mighty Google has a hard time keeping the blackest of hats under control–and their algo doesn’t encourage the input of others like Wikia’s does. Likewise, when SEOs tried to game Wikipedia, it was in an effort to obtain some valuable links that would help them with Google–Wikipedia played a secondary role to their main goal. If Wikia achieves any measurable market share, it’s going to face a direct onslaught–something that might be hard to battle, when you have such an open-door policy.

And, in case you think I’m exaggerating the danger here, it’s already started–and you’ll never guess who’s first to try and pick apart the Wikia algorithm. Google’s own Matt Cutts!

It’s very early days for Wikia. I’m not about to tell you that it will face certain doom–despite my headline–but I’m not convinced that Wikia has a model that can be sustained, once it gets beyond the market share of, say, Ask.com. Still, Jimmy Wales is a smart guy, and my predictions of failure aren’t always correct–Mahalo appears to be doing well, despite my concerns–but Wikia will at some point play a very challenging game of SEO chess. The “black” players are very adept at the game; let’s hope Wikipedia taught Wales how to defend against a checkmate.



Wiki Search Engine (née “WikiaSari”) to Launch by Year’s End

A quick refresher: WikiSeek is a search engine designed to search Wikipedia and sites that Wikipedia links to. Wikia is a for-profit company started by Wikipedia founder Jimbo Wales. WikiaSari is the name of the software behind a Wikia search engine project, named in a naming contest in 2004.

Although the wiki search engine has frequently been called “WikiaSari,” Wikia does not plan on using that name. The support site is called “Search Wikia,” but that won’t be the name of the search engine, either. Still with me?

Not to be confused with WikiSeek, the also-much-publicized NotWikiaSari wiki-inspired search engine should launch by the end of the year, according to Jimbo Wales. Wales made the statement from a Wiki Camp in India. (And no, that’s not where exiled Wikipedians go, it’s an unstructured, wiki-inspired gathering for Wikipedians.)

For those of us who haven’t been rabidly following this story, how is a wiki search engine different? Naturally, anyone can edit the SERPs. The algorithm will also be publicly available.

While Search Wikia calls this “a new free/open source search engine with user-editable search results,” I’m thinking “Spam City.” Whether by exploiting the algorithm (or editing it if it’s completely wikified) or by editing the results, unscrupulous SEOs (i.e., the ones who actually are crooks) will take advantage of this if it’s at all possible.

According to one interview, this has not escaped Wales:

Sure, says Wales, “But security through obscurity is a bad idea.” If you have published algorithms, then everyone, including scientists can see it. Then those illuminated minds could contribute to the improvement of the product, is the inference. “But if it is kept secret, then the bad guys, who have all the time in the world and are dedicated to gaining access to your algorithm, will somehow find a way.”

They’ll find a way to exploit it . . . or erase it and replace it with, “ha ha u wont find NE content hear!”

Unlike Wikipedia, NotWikiaSari is a for-profit venture. In the same interview, Wales estimates that “if [it] garners 3 per cent of the search engine market, it would be a sustainable model” based on advertising revenues.

Google Grants Free Advertising to Non-Profit Organizations via Google Grants Beta

You learn something new every day. Today I learned that Google provides free AdWords advertising to non-profit organizations via a scheme called Google Grants (Beta).

I knew that default ads for charities and other non-profit organisations are shown to searchers when there are no suitable contextual AdWords ads available, but I thought these were largely selected at random by Google staff. I know now that the organizations displayed have actually qualified via the Google Grants scheme. Here's an extract from the Grants page:

The Google Grants program supports organizations sharing our philosophy of community service to help the world in areas such as science and technology, education, global public health, the environment, youth advocacy, and the arts.

Designed for 501(c)(3) non-profit organizations, Google Grants is a unique in-kind advertising program. It harnesses the power of our flagship advertising product, Google AdWords, to non-profits seeking to inform and engage their constituents online. Google Grants has awarded AdWords advertising to hundreds of non-profit groups whose missions range from animal welfare to literacy, from supporting homeless children to promoting HIV education.

Google Grant recipients use their award of free AdWords advertising on Google.com to raise awareness and increase traffic.

I found out about Google Grants via a thread on crea8asite's new Non Profits on the Web forum.

Organizations can apply for a Grant online, but they must have current 501(c)(3) status as assigned by the Internal Revenue Service to be considered eligible and they cannot already be an AdWords advertiser. A Google Grants committee consisting of Google employees is responsible for selecting award recipients. Each organization awarded a Google Grant receives at least three months of in-kind advertising.

Interested organizations can learn more and apply online here.

The program has existed for a long, long time. However, Google is selective about whom it gives the grant to. All non-profits can apply, but not all of them receive the grant. We've applied in the past on behalf of AKT and Panos, and Google were not keen to grant either of them anything because they were UK based. So I guess Google charity begins and stays in the home!

Monday, January 14, 2008

Wikia Launches Search Engine, Challenges Google

Wikipedia founder Jimmy Wales has unveiled his rival platform, Wikia Search, for navigating the Web. But does it work? DW-WORLD asked experts on both sides and did some sample searches. The results were mixed.

"It will take a lot of time, if Wikia is ever to rival Google"

It's a David and Goliath story. With the launch of the Wikia Search engine, entrepreneur Jimmy Wales is taking on industry giant Google, which currently enjoys a 90 percent market share.

The new search engine uses Web 2.0 components and touts -- in contrast to Google -- an open algorithm for greater transparency about how the order in which the results appear is determined.

Google is taking the arrival of a new competitor in stride.

"We expressly welcome the fact that Wikia Search has now joined Google and the other competitors in Germany and other countries," Stefan Keuchel, press spokesman for Google Germany, told DW-WORLD.DE. "There's room enough for more than one provider. Both the market as a whole and users profit from competition. We and others will have to move ahead with innovations."

But some specialists question how innovative Wikia Search's software is.

"This project uses so-called grub software, something that was mothballed a few years ago and has now been revived," independent IT scientist Wolfgang Sander-Beuermann told DW-WORLD.DE. "The only thing that's new is the name."

Taking Wikia out for a spin

Wikia Search's biggest problem will probably be overcoming Google's massive head start. The search giant has been in existence since 1998, whereas the Wikia company was only founded in 2004.

Tests of the alpha version of Wikia Search instantly reveal some shortcomings. The new engine encompasses only 50-100 million websites, compared with the 10 billion claimed by Google.

For example, when DW searched Wikia for "Bundesliga," it only yielded two hits for Germany's professional soccer league -- a commercial page and a weblog.

Likewise a search for "Troja" -- the German word for "Troy" -- resulted in advertisements, instead of entries on the ancient Greek city or references to Homer's Iliad.

"It will take significant time before the evaluation systems deliver a number of quality search results," Keuchel said.

User rankings versus algorithms


"Wales' Wikipedia has established itself alongside traditional reference sources"

Wikia boss Jimmy Wales thinks his search engine has a decisive advantage over Google in that the process used to decide the ranking of hits will be transparent and open to influence by users.

"We expect Wikia Search to be like fine wine in that it will get better and better as time goes by and more and more people contribute," Wales told Reuters news agency, when the alpha program was unveiled. "I've said before that Internet searches must be more open and transparent, and today marks a major milestone in our mission to make it just that."

Google counters that its algorithm-based ranking system is a great strength.

"We revolutionized the market ten years ago with this process, and it's used by everyone else," Keuchel said. "For us, it's the best way to reach the masses. And in any case every algorithm has to be written by someone. Robots don't make them -- very, very intelligent people do. So it's hard to understand this argument by Mr. Wales."

In Wikia Search, individual users are asked to rank the quality of the hits. It's a Web 2.0 approach familiar from Wales' on-line, user-generated reference source Wikipedia.

But some analysts are unimpressed, disputing the idea that this approach will lead to more fairness and transparency.

"It isn't that great a feature," Sander-Beuermann said. "Users will rate their own websites as best so that they appear at the top of the index. Wikia Search is not a serious competitor for Google. It will more likely be a means of earning even more money."

Yahoo could work with Google on mobile programs

Yahoo Inc., after failing to dent Google Inc.'s dominance of the Internet-search market, expects to work with its rival to reach more mobile-phone users.

Yahoo says it will provide software that lets independent developers and companies build programs for phones, including handsets running Android, Google's operating software for wireless devices.

"Once it becomes reality, meaning we find devices that ship with it, we're going to make sure that Yahoo services" run on Android, executive vice president Marco Boerries said Monday. "It is one of many operating systems that we support."

Boerries, speaking at the Consumer Electronics Show in Las Vegas, said phone carriers will hold Google accountable to its commitment to giving users equal access to services such as e-mail and online maps from competitors. Yahoo, Google, and Microsoft Corp. are trying to lure advertisers that are increasingly targeting Web users on the move.

The global mobile-phone ad market will surge tenfold to $16.2 billion by 2011, according to EMarketer Inc., a research firm in New York. To capitalize on that growth, Yahoo also introduced an upgraded mobile home page and a new version of its Yahoo Go software on Monday.

Google said in November that it's working with 33 companies, including Sprint Nextel Corp., T-Mobile USA Inc., and Motorola Inc., on Android. The group, called the Open Handset Alliance, offers free software for programmers who want to develop features for the devices.

Yahoo wants to reach as many users as possible, whether they have phones with operating systems from Google, Microsoft, or Symbian Ltd., Boerries said.

Last month, Sunnyvale, Calif.-based Yahoo forged a partnership with America Movil SAB, Latin America's largest mobile-phone company, to provide Web-search services in 16 countries.

Expanding in mobile advertising has gained importance for Yahoo as its sales growth slows. Revenue rose 12 percent in the third quarter, compared with a 57 percent jump at Mountain View, Calif.-based Google.

The gap reflects Google's lead in the Internet search market. Google had 59 percent of US queries in November, compared with Yahoo's 22 percent, according to Reston, Va.-based researcher ComScore Inc.
