BloomReach is a startup located in Mountain View. We have a top-notch
team from Google, Amazon, Groupon, VMware, Yahoo!, Cisco and other
successful Valley startups. We're graduates of Stanford, Carnegie
Mellon, Berkeley, MIT, UIUC - Urbana-Champaign, UT Austin, Princeton,
Dartmouth, IITs, Harvard Business School and many more! We are
well-funded, backed by Bain Capital Ventures, Lightspeed Ventures, and
NEA Ventures. We are building a "web relevance engine": a web
scale effort to make great content on the internet discoverable. With
a talented team of engineers, we have built products that are
processing web-scale data, handling millions of page views a day, and
generating meaningful revenue.
We are looking for talented engineers with expertise in web search,
data mining, machine learning, algorithms, natural language processing
and/or large scale systems who want to be at a rapidly growing
company. If you're really smart, don't have all of the relevant domain
experience, but have great coding skills, believe you have the
technical aptitude to learn fast, and have a passion for start-ups,
let us know.
The Engineer's primary role will be processing millions of web
pages, extracting interesting pieces of data, analyzing the structure
of the web, and finding relevant correlations.
You will design and build software to tackle difficult, large-scale
machine learning and information retrieval problems, and build
large-scale backend infrastructure for low-latency, high-throughput
applications. Additionally, you will work with the rest of the
engineering team to scale machine learning algorithms and deploy them
on clusters. Be ready to lead or help with other product design and
engineering projects!
Minimum Job Requirements:
- Currently pursuing a BS/MS/PhD in Computer Science or a related
  field, with a plan to graduate in Dec 2013 or Spring
- Solid programming skills and significant development experience in a
  Linux environment
- Experience with information retrieval and machine learning at web scale
BloomReach helps its customers manage potential duplicate content by creating clusters of equivalent pages and nominating a "canonical" page to represent them all, forcing all BloomReach links to the relevant content to target the canonical page. The additional links to the canonical page will provide a better user experience while improving its page rank and relevance score.
Additionally, BloomReach makes the identity of these canonical pages available to its customers, giving them the option to automatically forward all pages in the cluster to the canonical page. This forwarding effectively eliminates all duplicate pages from the site, ensuring that the canonical page gets the most traffic, is most likely to be indexed, is most relevant to search queries, and is most likely to lead to a conversion when visited by a user.
At BloomReach, we use a variety of algorithms to build clusters of mutually equivalent web pages. Since most of our customers are e-commerce sites, we employ specific strategies for two specific page types: product pages, each of which presents all the information on a single product, and category pages, which list products and link to their product pages. We frame the problem as a general forwarding capability from less-desirable pages to more-desirable pages. Each algorithm runs individually over our customers' sites, building clusters. Finally, we merge the clusters to produce the final equivalence classes and canonical pages. Figure 1 depicts our process.
The challenges in this process include:
Scale: clustering content across 100M pages is hard. Standard clustering algorithms do not work at this scale, and merging the resulting clusters is itself difficult; one needs to implement associative merging. We have implemented it using the Hadoop framework. While it scales, it scales linearly in the number of associative steps required; we will describe this in more detail later.
Noise: the data is noisy and at times subjective. A site might be undergoing maintenance, during which every page is redirected to the home page; that does not mean we should group every page together.
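To illustrate why the merge is associative, here is a minimal Python sketch (not our production Hadoop code; the function names are illustrative). A union-find pass can fold any number of partial clusterings into equivalence classes, and because the operation is associative, partial results produced on different machines can be combined in any order:

```python
from collections import defaultdict

def merge_clusters(*cluster_lists):
    """Merge partial clusterings (each a list of sets of URLs) into
    final equivalence classes using union-find."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for clusters in cluster_lists:
        for cluster in clusters:
            members = list(cluster)
            for m in members:
                find(m)  # register every page, including singletons
            for m in members[1:]:
                parent[find(members[0])] = find(m)  # union with first member

    groups = defaultdict(set)
    for u in list(parent):
        groups[find(u)].add(u)
    return list(groups.values())
```

For example, if one algorithm groups pages a and b, and another groups b and c, the merged result places a, b, and c in a single equivalence class, regardless of the order in which the two partial results are folded in.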
For example, one algorithm for clustering operates on the content of product pages, identifying redundant products. It follows a four-step process, first filtering and parsing the raw data to isolate relevant content. Next the algorithm extracts features from the products, defined as weighted n-grams present on the page. Based on the n-grams, the algorithm employs Min-Hash to compute a cluster id as a list of hash codes. All products with the same cluster id are treated as equivalent.
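The four steps above can be sketched in Python as follows. This is a simplified illustration, not our production pipeline: it uses unweighted word n-grams and MD5-seeded hash functions as stand-ins for the weighted features and hash family we actually use, and all names are hypothetical:

```python
import hashlib
from collections import defaultdict

def ngrams(text, n=3):
    """Step 2: extract features, here unweighted word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_cluster_id(features, num_hashes=4):
    """Step 3: Min-Hash cluster id, a tuple of hash codes; each entry
    is the minimum value of one seeded hash function over all features."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16)
            for feat in features)
        for seed in range(num_hashes))

def cluster_product_pages(pages):
    """Step 4: pages with the same cluster id are treated as equivalent.
    pages: {url: parsed page content} (steps 1's output)."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[minhash_cluster_id(ngrams(text))].append(url)
    return list(clusters.values())
```

Pages with identical feature sets always receive identical cluster ids; with similar but non-identical sets, the probability of a matching Min-Hash entry equals the Jaccard similarity of the feature sets.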
A second algorithm operates on the contents of category pages, clustering pages based on the products listed. This algorithm consists of three steps, starting with filtering and parsing to list the products on the page. All pages with a similar set of products constitute a cluster, and the canonical page is the one with all the products. This is an expensive calculation but easily parallelizable.
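The all-pairs comparison at the heart of this algorithm might be sketched as below, using Jaccard similarity over product sets as an illustrative similarity measure (the actual comparison strategy and thresholds differ, and all names here are hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Similarity of two product sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_category_pages(pages, threshold=0.8):
    """pages: {url: set of product ids}. Links every pair of pages with
    similar product sets (the expensive but parallelizable step), then
    picks as canonical the page listing the most products.
    Returns {canonical_url: [member urls]}."""
    parent = {u: u for u in pages}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in combinations(pages, 2):
        if jaccard(pages[a], pages[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for u in pages:
        clusters[find(u)].append(u)
    return {max(members, key=lambda u: len(pages[u])): members
            for members in clusters.values()}
```

The pairwise loop is quadratic in the number of pages, which is why this step is expensive; since each pair can be scored independently, it parallelizes naturally.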
A final example is redirect clustering. Every page that is redirected via HTTP 301 or 302 is linked to the final page. There could be a sequence of redirects, forming a chain or tree. The cluster is all connected pages, and the canonical page is the final destination of all the pages.
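Redirect clustering can be sketched as follows, assuming the redirect map has already been collected during crawling (the data shapes and names are hypothetical):

```python
from collections import defaultdict

def redirect_clusters(redirects, all_pages):
    """redirects: {source_url: target_url} for every observed 301/302.
    Chains and trees of redirects all resolve to one terminal page,
    which becomes the cluster's canonical page."""
    def final_destination(url):
        seen = set()
        while url in redirects and url not in seen:  # guard against loops
            seen.add(url)
            url = redirects[url]
        return url

    clusters = defaultdict(list)
    for url in all_pages:
        clusters[final_destination(url)].append(url)
    return dict(clusters)
```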
Cluster merging is primarily concerned with resolving conflicts. If two pages land in the same cluster under one algorithm but not under another, the merge uses proprietary algorithms to estimate the strength of each algorithm and the likelihood that the clustering (or non-clustering) is due to noise. The merging takes place in a series of Hadoop steps.
The final result of our extensive website analysis is a subset of pages that we judge to be non-redundant and of the highest quality. We target our products at these pages, and we provide services that let our customers prune the unnecessary pages, increasing the signal-to-noise ratio of the rest of the site and raising the relevancy score of the best pages.
For definitions of Min-Hash and n-gram, please see a text-mining textbook such as Rajaraman & Ullman, Mining of Massive Datasets.