15.1: The Redis-backed indexer
- Page ID
- 12820
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)
( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\id}{\mathrm{id}}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\kernel}{\mathrm{null}\,}\)
\( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\)
\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\)
\( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)
\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)
\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)
\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vectorC}[1]{\textbf{#1}} \)
\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)
\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)
\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)In my solution, we store two kinds of structures in Redis:
- For each search term, we have a URLSet, which is a Redis Set of URLs that contain the search term.
- For each URL, we have a TermCounter, which is a Redis Hash that maps each search term to the number of times it appears.
We discussed these data types in the previous chapter. You can also read about Redis Sets and Hashes at thinkdast.com/redistypes.
In JedisIndex, I provide a method that takes a search term and returns the Redis key of its URLSet:
private String urlSetKey(String term) { return "URLSet:" + term; }
And a method that takes a URL and returns the Redis key of its TermCounter:
private String termCounterKey(String url) { return "TermCounter:" + url; }
Here’s the implementation of indexPage, which takes a URL and a jsoup Elements object that contains the DOM tree of the paragraphs we want to index:
public void indexPage(String url, Elements paragraphs) { System.out.println("Indexing " + url); // make a TermCounter and count the terms in the paragraphs TermCounter tc = new TermCounter(url); tc.processElements(paragraphs); // push the contents of the TermCounter to Redis pushTermCounterToRedis(tc); }
To index a page, we
- Make a Java TermCounter for the contents of the page, using code from a previous exercise.
- Push the contents of the TermCounter to Redis.
Here’s the new code that pushes a TermCounter to Redis:
public List<Object> pushTermCounterToRedis(TermCounter tc) { Transaction t = jedis.multi(); String url = tc.getLabel(); String hashname = termCounterKey(url); // if this page has already been indexed, delete the old hash t.del(hashname); // for each term, add an entry in the TermCounter and a new // member of the index for (String term: tc.keySet()) { Integer count = tc.get(term); t.hset(hashname, term, count.toString()); t.sadd(urlSetKey(term), url); } List<Object> res = t.exec(); return res; }
This method uses a Transaction to collect the operations and send them to the server all at once, which is much faster than sending a series of small operations.
It loops through the terms in the TermCounter. For each one it
- Finds or creates a TermCounter on Redis, then adds a field for the new term.
- Finds or creates a URLSet on Redis, then adds the current URL.
If the page has already been indexed, we delete its old TermCounter before pushing the new contents.
That’s it for indexing new pages.
The second part of the exercise asked you to write getCounts, which takes a search term and returns a map from each URL where the term appears to the number of times it appears there. Here is my solution:
public Map<String, Integer> getCounts(String term) { Map<String, Integer> map = new HashMap<String, Integer>(); Set<String> urls = getURLs(term); for (String url: urls) { Integer count = getCount(url, term); map.put(url, count); } return map; }
This method uses two helper methods:
- getURLs takes a search term and returns the Set of URLs where the term appears.
- getCount takes a URL and a term and returns the number of times the term appears at the given URL.
Here are the implementations:
public Set<String> getURLs(String term) { Set<String> set = jedis.smembers(urlSetKey(term)); return set; } public Integer getCount(String url, String term) { String redisKey = termCounterKey(url); String count = jedis.hget(redisKey, term); return new Integer(count); }
Because of the way we designed the index, these methods are simple and efficient.