The Basic of Computer Science: Dr. Manish Kumar Kamboj Assistant Professor, CSE

Some facts!

• Finding information with a search engine is like finding a needle in a haystack.

• If “the total number of words spoken by the entire human race so far” were written to a .txt file, it would be about the size of the Web.

• 1 Yotta-byte = 1,000,000,000,000,000,000,000,000 Bytes!

This assumes we know the size of the entire Web. Do we?
Can you define “the size of the Web”?

[Figure: chart with a percentage axis over the years 2000, 2005, 2010, 2014]
Slide 4
Take a moment to think about how amazing the Internet is:

–It’s always on
–It is “free”
–It’s (almost) never noticeably congested (though individual sites or access points might be)
–You can get messages to anywhere in the world
–You can communicate for free, including voice and video
–You can stream music and movies
–It is uncensored in most places (of course, this can be viewed as good or bad)

Slide 5
Search Engine
• A search engine is a software program that helps locate information stored on a computer system, typically on the WWW.

• There are two types:

✓1. Crawler based
❖Create their listings automatically, like Google and Yahoo.
❖Crawl (or “spider”) the web to create a directory of information.
❖Changes made to a page are picked up automatically.

✓2. Human powered
❖Depend on users to create the listings, for example through keyword submission.
❖The user submits a description of the webpage along with keywords.
❖A search looks only at the submitted descriptions.
• Hybrid search engines combine these two approaches, e.g. looksmart, submitexpress.

Slide 6
Components of Crawler based Search Engine

• 1. Crawler or Spider
✓ Crawls from one web page to the next by following hyperlinks, based on some criteria.
✓ Visits sites regularly to look for changes.

• 2. Index or Catalog
✓ A huge book containing a copy of every webpage that the crawler finds.
✓ Pages become searchable only after they are indexed.

• 3. Search Engine Software

✓ Searches through millions of entries in the index to find matches.
✓ Can rank the matches by their relevance to the search query.
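
The three components above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the in-memory `TOY_WEB` dictionary, the page names, and the ranking rule (count of query-word occurrences) are all hypothetical, and a real search engine fetches pages over HTTP and ranks with far richer signals.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
TOY_WEB = {
    "a.html": ("web crawler basics", ["b.html", "c.html"]),
    "b.html": ("search engine index", ["c.html"]),
    "c.html": ("crawler and index and crawler", ["a.html"]),
    "d.html": ("unreachable page", []),  # no page links here
}


def crawl(seed):
    """1. Crawler: breadth-first walk over hyperlinks, visiting each page once."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        pages[url] = text
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """2. Index: inverted index mapping word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """3. Search software: pages containing every query word, best match first."""
    words = query.lower().split()
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))

    def score(url):
        return sum(index[word][url] for word in words)

    # Higher score first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda url: (-score(url), url))


pages = crawl("a.html")
index = build_index(pages)
print(search(index, "crawler"))
```

Note that `d.html` never becomes searchable: no crawled page links to it, so it is never indexed.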

[Diagram labels: user, search queries, crawler]
Slide 7
Challenges faced in Web Crawling

• Problem of Big Data

✓ 10 billion pages of 10 KB each require 100 TB of storage for the index.
✓ Indexing and managing this much data is hard.
✓ The Web is growing much faster than we can index it.

• High Resource Requirement

✓ With 10 billion pages and 100 machines each crawling 100 pages/second, a full crawl takes about 11.6 days, even with a very fast connection.
✓ Further resources are needed for high availability and for answering queries.

• Problem of Frequent Updates

✓Many dynamic sites update more frequently, thus adding to the work.
✓Consistency problem.
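
The two figures on this slide (100 TB and 11.6 days) can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^9 KB) and that the rate of 100 pages/second applies to each machine; both are assumptions, and the function names are only illustrative.

```python
PAGES = 10_000_000_000          # 10 billion pages
PAGE_SIZE_KB = 10               # 10 KB per page
MACHINES = 100
PAGES_PER_SECOND = 100          # assumed to be per machine

KB_PER_TB = 10**9               # decimal units (assumption)
SECONDS_PER_DAY = 86_400


def storage_terabytes(pages, page_size_kb):
    """Total storage needed to keep one copy of every page."""
    return pages * page_size_kb / KB_PER_TB


def crawl_days(pages, machines, pages_per_second):
    """Time for all machines together to fetch every page once."""
    seconds = pages / (machines * pages_per_second)
    return seconds / SECONDS_PER_DAY


print(storage_terabytes(PAGES, PAGE_SIZE_KB))                   # 100.0
print(round(crawl_days(PAGES, MACHINES, PAGES_PER_SECOND), 1))  # 11.6
```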

Slide 8
Web Crawler Policies

• Politeness Policy
✓Do not overload the sites being crawled.
✓Only crawl allowed pages.
✓Respect robots.txt (more on this in the next slide)

• Robustness Policy
✓ Be immune to spider traps and other malicious behaviour from web servers.

• Parallelization Policy
✓ If the crawler is multithreaded, different threads should not visit the same site.

• Revisit Policy
✓When to check for changes.
✓If we cover too many pages, our copies go stale before we revisit them.
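
One possible way to honour the parallelization policy is to assign every host to exactly one worker thread, so that two threads never hit the same site. The sketch below does this by hashing the hostname; the function names and example URLs are hypothetical, and real crawlers may use other schemes, such as a central scheduler.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def worker_for(url, n_workers):
    """Map a URL to a worker index by hashing its host (one host -> one worker)."""
    host = urlsplit(url).hostname or ""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(host.encode("utf-8")) % n_workers


def partition(urls, n_workers):
    """Group URLs into per-worker lists; all URLs of a host share one list."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[worker_for(url, n_workers)].append(url)
    return dict(buckets)


urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/index.html",
]
print(partition(urls, n_workers=4))
```

Because each host lives in a single worker's list, that worker can also enforce the politeness policy locally, for example by pausing between requests to the same host.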

Slide 9

• Protocol for giving spiders (“robots”) limited access to a website

✓Website announces its request on what can(not) be crawled.
✓Respect robots.txt (more on this in next slide).
✓For a server, create a file /robots.txt which specifies access restrictions.

# robots.txt for

User-agent: *

Annotation: all crawlers can go anywhere!
Slide 10

# Robots.txt file for

User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/

Annotation: all crawlers are not allowed in these paths.
Slide 11

# Robots.txt for (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Crawl-delay: 2

User-agent: MSNBot
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Annotations:
• Googlebot: the Google crawler is allowed everywhere except these paths.
• slurp, MSNBot: Yahoo and MSN/Windows Live are allowed everywhere but should slow down.
• scooter: AltaVista has no limits.
• *: everyone else keep off!
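
A crawler can apply rules like these with Python's standard `urllib.robotparser` module. The sketch below is a minimal illustration, not a full crawler: the `ROBOTS_TXT` text is a hypothetical file modeled on the fragment above, the host `example.com` is a placeholder, and the `*` wildcards are dropped because this parser matches rules as plain path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the slide; wildcards omitted on purpose.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /uk/
Disallow: /france/

User-agent: slurp
Crawl-delay: 2

User-agent: scooter
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access


def allowed(agent, path):
    """True if `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "http://example.com" + path)


print(allowed("Googlebot", "/uk/page.html"))   # blocked path for Googlebot
print(allowed("Googlebot", "/index.html"))     # everything else is open to it
print(allowed("OtherBot", "/index.html"))      # falls under "User-agent: *"
print(parser.crawl_delay("slurp"))             # seconds to wait between requests
```

A polite crawler would call `allowed(...)` before every fetch and sleep for the reported crawl delay, if any, between requests to the same site.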

Slide 12
Most Valuable Asset in Today’s World


Slide 13
What is knowledge?

• Data - Facts, observations, or perceptions.

• Information - Subset of data, only including those data that possess context, relevance, and purpose.

• Knowledge - A more simplistic view considers knowledge as being at the highest level in a hierarchy with data (at the lowest level) and information (at the middle level).
•Data refers to bare facts void of context.
–A telephone number.
•Information is data in context.
–A phone book.
•Knowledge is information that facilitates action.
–Recognizing that a phone number belongs to a good client,
who needs to be called once per week to get his orders.
Slide 14
Great Predictions


• Artificial Intelligence:
– speech recognition
– Some reasoning; computer beats man in
– Privacy and security problems
– Computers can be a pain in the butt


• Missed Moore’s law and ubiquity of


Slide 15
Predicting the future

–“The future ain’t what it used to be” Yogi Berra

•Can we really predict the future?
•Who predicted the implications of the web
and search engines?
•Social networking?
•Can we understand power laws and their
–We have no examples of exponential growth in our
evolution except plagues.
•Can we understand the pervasiveness of

Slide 16
Information Science and Data Generation

• What do large amounts of information provide?

–New opportunities for search!
–New discoveries
• Business opportunities?
• Research opportunities?
• Problems?
• Wisdom search engine?

Slide 17
How Much Data Is on the Internet?

▪ The amount of data in the world was estimated at 44 zettabytes at the dawn of 2020.
▪ By 2025, the amount of data generated each day is expected to reach 463 exabytes globally.
▪ Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information.
▪ The world spends almost $1 million per minute on commodities on the Internet.
▪ Electronic Arts processes roughly 50 terabytes of data every day.
▪ By 2025, there are expected to be 75 billion Internet-of-Things (IoT) devices in the world.
▪ By 2030, nine out of every ten people aged six and above are expected to be digitally active.
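
To put these magnitudes side by side, the sketch below converts between decimal byte prefixes. The `convert` helper and the derived comparison are illustrative assumptions; only the figures 44 zettabytes and 463 exabytes per day come from the slide.

```python
# Decimal (SI) byte prefixes as powers of ten.
BYTES_PER = {
    "kilo": 10**3,
    "mega": 10**6,
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
    "yotta": 10**24,
}


def convert(value, from_prefix, to_prefix):
    """Convert a byte quantity from one SI prefix to another."""
    return value * BYTES_PER[from_prefix] / BYTES_PER[to_prefix]


world_2020_eb = convert(44, "zetta", "exa")   # 44 ZB expressed in exabytes
daily_2025_eb = 463                           # projected exabytes per day

# Days needed, at the projected 2025 rate, to generate the 2020 total.
days = world_2020_eb / daily_2025_eb
print(f"{world_2020_eb:.0f} EB / {daily_2025_eb} EB per day = {days:.1f} days")
```

Under these numbers, the entire 2020 estimate would be regenerated in roughly three months at the projected 2025 rate.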
Slide 18
How much information is there?

•Soon everything will be recorded and indexed.
•Most bytes will never be seen by humans.
•Data summarization, trend detection, and anomaly detection are key technologies.

[Figure: data-size scale with the labels Recorded, All Books MultiMedia, All books (words), Movie, A Photo, A Book against the units Exa, Peta, Tera, Giga]
Small-number prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Slide 19
Moore's Law

• Formulated by Dr. Gordon Moore in the 1960s.

• Predicts an exponential increase in component density over time, with a doubling time of 18 months.
• Applicable to microprocessors, DRAMs, DSPs, and other microelectronics.
• Monotonic increase in density observed since the
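
The doubling rule stated above can be written as growth = 2^(t / T), where T is the doubling time. The sketch below uses the 18-month figure from the slide; the function name and the sample time spans are only illustrative.

```python
def density_growth(years, doubling_years=1.5):
    """Growth factor in component density after `years`, doubling every 18 months."""
    return 2 ** (years / doubling_years)


for span in (1.5, 3, 15):
    print(f"after {span} years: x{density_growth(span):g}")
```

Ten doublings fit into 15 years, which is why the growth factor over that span is about a thousand.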

Slide 23
First Disk 1956

•4 MB

•50x24” disks

•1200 rpm

•100 ms access

•$35k/year rent

•Included computer & accounting software (tubes, not transistors)
Slide 24
10 years later

1.6 meters

30 MB

Slide 25

