
An Automatic System for
Collecting Crime Information on the Internet

Patrick S. Chen
Department of Information Management
Central Police University
Taiwan
chenps@sun4.cpu.edu.tw
 

The author wishes to thank Anthony Chu for his programming work in the research project and the anonymous referees for their valuable advice in improving the quality of the paper.

Abstract

This paper describes the operation of an automatic crime information collecting system, called e-Detective, developed by the author. It is one of the first special-purpose search engines ever developed. Since searching for suspected crime information is a difficult task, the system is designed from a practical point of view and several special functions are implemented for it. The system is composed of a Web crawler, a lexicographical parser, a database of search concepts, a match program, a natural-language processor, and a data manager. In order to search for crime information, Web pages are first analyzed lexicographically and semantically by the computer, then verified by human experts. Two experiments are reported to demonstrate the way in which the system works. The results reveal that the retrieval precision of our system is superior to that of other commercially available search engines. Thus, it is able to assist law enforcement agencies in finding information to investigate cybercrime.

Keywords: Cybercrime, e-Detective, Internet, Computer Crime


This is a Refereed article published on 31 October 2000.

Citation: Chen P, 'An Automatic System for Collecting Crime on the Internet', 2000 (3) The Journal of Information, Law and Technology (JILT). <http://elj.warwick.ac.uk/jilt/00-3/chen.html>. New citation as at 1/1/04: <http://www2.warwick.ac.uk/fac/soc/law/elj/jilt/2000_3/chen/>



1. Introduction

With the popularity of the Internet, a large quantity of information is provided on it. The amount of textual data alone is estimated to be on the order of one terabyte, not to mention other media such as images, audio, and video (Baeza-Yates, 1998). While the Internet provides us with knowledge, opportunity, and convenience of living, abuse of the network comes with it. Since Internet-related legislation is not yet mature and the infrastructure of the information society is still under construction, there is room for illegal opportunists to commit crime, commonly known as cybercrime.

The first step in investigating crime cases is to collect suspected information. However, on the Internet this cannot easily be done owing to its enormous volume. The Uniform Resource Locator (URL) mechanism of the World Wide Web (WWW, or Web for short), which enables data retrieval by address, does not effectively support a search with a specific intention. With such diverse contents and such an enormous volume of information on the Web, retrieving the data we need is far from assured (Filman and Pena-Mora, 1998; Konopnicki and Shmueli, 1995). One of the most primitive ways to investigate cybercrime is to download Web pages according to their URLs and analyze them manually. This method is both labor-intensive and inefficient, because finding suspected data among millions of Web pages is as difficult as finding a needle in a haystack. Therefore, it makes sense to develop a computer system to search for crime information automatically. In our research project we have designed and implemented an electronic detective system, the e-Detective, which can help us do this job.

The e-Detective system is a proprietary search engine, which differs from general-purpose search engines, such as AltaVista, Yahoo or Openfind, in several respects:

1) It is specially designed for the purpose of collecting crime information;

2) Search accuracy is enhanced for collecting specific information. The system is equipped with several subject-specific thesauri, i.e. databases of term phrases for specific crime types, which help us analyze crime data patterns.

With the help of this system, we expect to assist law enforcement agencies in keeping order in cyber society, so that the Internet remains a platform for the well-being of its users, not a place where criminals carry out their activities.

The e-Detective system is one of the first special-purpose search engines (Chen, 1998); it can fetch data from the Internet automatically for a specific purpose. The way the system works also differs from that of general-purpose search engines. The system is developed for law enforcement agencies and law firms that conduct highly intelligent retrieval tasks to find information for solving difficult problems. Typical examples of these tasks are finding suspected information on the Web about:

1) Child pornography;

2) Drug trafficking;

3) Infringement of intellectual property, e.g. copyright piracy; and

4) Underground monetary institutions, etc.

In order to gain an understanding of the effectiveness of the tool, we have carried out a series of system tests. Assessment of the search results reveals that the system achieves high retrieval precision. Thus, the system has proven capable of carrying out retrieval tasks to find useful information for our purpose.

The rest of the paper is organised as follows: Section 2 classifies the illegal information to be searched for on the Internet. Section 3 describes the system architecture. Section 4 deals with system implementation, where we describe the techniques used in implementing the main components of the system; the use of a new method for natural language processing (NLP) is worth mentioning. Section 5 reports the operation of the system with the help of two experiments, where we also evaluate retrieval efficacy in terms of precision and recall. The last section summarizes the contribution of the research and offers some discussion.

2. Classification of Illegal Information on the Internet

In order to find clues to a crime, the first step is to collect information, denoted as crime information in this paper. Though such information can be published in different ways, we distinguish two types of crime information on the Internet:

Type I: The act of diffusing certain kinds of information on the Internet in itself constitutes a crime. Examples of this kind of information are fraud, intimidation, defamation, infringement of copyright, etc.

Type II: The act of diffusing certain kinds of information on the Internet is not in itself a crime, but is able to help others commit crime. For example, teaching how to make bombs on the Web is protected under freedom of speech, but such action may enable others to commit crime. Whether this kind of behaviour constitutes aiding or abetting depends on the mens rea. If no actus reus is involved, it is not punishable.

Based on the nature of the information, we are only allowed to collect information that is punishable, i.e., information of Type I. Since the diffusion of this kind of information constitutes part of a crime, it is the object to be collected by our system, denoted as our search intention.

After having determined the type of information to be collected, we translate this kind of information into an adequate form that a computer system can process. In the next section we describe the architecture of a system that carries out the task of data searching.

3. System Architecture

The system is composed of three main components: A Web crawler fetches pages from the Internet one by one; a match program compares the pages with the search intention; and a data manager is responsible for the management of search results. Principles for constructing such a search engine include:

1) The information needed by law enforcement agencies will be prepared through a special process. This information need will, then, be analyzed and processed into a form representing the search intention of the law enforcement agency;

2) A Web crawler is used to collect data on the Internet;

3) A program is made to compare the search intention with Web pages lexicographically to filter out irrelevant pages;

4) Semantic analysis of the selected pages shall be done to identify the pages containing crime information;

5) Analysis of search results based on human expertise is necessary;

6) Facilities for organizing and managing search results should be provided.

Based on the above principles, we draw up a diagram (Figure 1) to demonstrate the e-Detective system.



Figure 1: Architecture of the e-Detective System


The way in which e-Detective works is described briefly as follows: A parser transforms narrative descriptions provided by the law enforcement agency into a set of concept terms representing the agency's search intention. A Web crawler fetches data from the Internet, and the parser also transforms these data into sets of concept terms. A match program compares the two sets of concept terms. The best-matched pages are first analyzed semantically by a natural language processor (NLP) and then by human experts. Thereafter, relevant pages and Web sites are stored together with their addresses. If a page is verified as containing crime information, it is processed automatically to extract new concept terms, which are added to the database to support further searches. Therefore, the e-Detective is a continuously evolving system.
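To make the data flow concrete, the following Python sketch mirrors the loop described above. It is an illustration only, not the author's implementation: fetch_page, parse and semantically_relevant are hypothetical callables standing in for the Web crawler, the parser and the NLP processor respectively.

```python
from typing import Callable, Dict, Iterable, List, Set

def run_e_detective(urls: Iterable[str],
                    concept_db: Dict[str, float],          # (term, weight) pairs
                    threshold: float,
                    fetch_page: Callable[[str], str],
                    parse: Callable[[str], Set[str]],
                    semantically_relevant: Callable[[str], bool]) -> List[str]:
    """Return pages that should be forwarded to human experts for verification."""
    candidates: List[str] = []
    for url in urls:
        text = fetch_page(url)                   # 1. Web crawler fetches the page
        page_terms = parse(text)                 # 2. lexicographical parsing
        weight = sum(w for term, w in concept_db.items()
                     if term in page_terms)      # 3. match against the concept database
        if weight < threshold:
            continue                             #    irrelevant pages are filtered out
        if semantically_relevant(text):          # 4. semantic analysis (NLP processor)
            candidates.append(url)               # 5. queued for expert verification
    return candidates
```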

4. System Implementation

We develop the e-Detective system by employing a rigorous methodology. In the following subsections we describe its four main components.

4.1 The Parser

A parser is principally a lexicographical analyzer that identifies the concept terms contained in a text. A list of stop-words is used to filter out words of little significance.
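A minimal sketch of such a parser is given below, assuming alphanumeric tokenisation and a tiny illustrative stop-word list; the actual system's parser and stop-word list are not published in the paper, and would also need to handle Chinese text.

```python
import re
from typing import List

# Tiny illustrative stop-word list; a real deployment would use a much larger,
# language-specific one (the experiments also involve Chinese-language pages).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is", "are", "to", "at", "in"}

def parse_terms(text: str) -> List[str]:
    """Lexicographical analysis: tokenise a text and drop insignificant words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(parse_terms("The compact disc is sold at the price of 20 dollars"))
# ['compact', 'disc', 'sold', 'price', '20', 'dollars']
```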

4.2 The Database of Concept Terms

We assign a weight to each term to denote its significance (e.g. its relevance to a subject). The term is stored together with its weight in a database. Thus, the content of a database is a list of 2-tuples (t, w), with t a term phrase and w its associated weight. The databases are classified according to their specific subjects, e.g., politics, economics or electronics.

4.3 The Match Program

A match program is used to compare the terms contained in a database, which is related to a certain subject, with the terms appearing in a Web page. The output of the match program is the accumulated weight of the Web page: if a term of the database also appears in the Web page, the weight of that term is added to the accumulated weight of the page. Pages with heavier weights are considered relevant to the subject. We also apply some inference rules in the match procedure (Chen, 1994); these rules are used to handle problems such as synonyms and acronyms.
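The weighting step can be sketched as follows. This is an illustration, not the system's code; the synonym table is a deliberately simplified stand-in for the inference rules of Chen (1994), and the example terms and weights are invented.

```python
from typing import Dict, Iterable, Optional

def accumulated_weight(page_terms: Iterable[str],
                       concept_db: Dict[str, float],
                       synonyms: Optional[Dict[str, str]] = None) -> float:
    """Sum the weights of database terms that appear on a page.

    `synonyms` maps alternative spellings and acronyms onto database terms;
    it is a crude simplification of the inference rules mentioned above.
    """
    synonyms = synonyms or {}
    normalised = {synonyms.get(term, term) for term in page_terms}
    return sum(weight for term, weight in concept_db.items() if term in normalised)

# A page mentioning 'vcd' matches the database term 'video cd' via the synonym rule.
db = {"video cd": 0.8, "pirated": 1.5}
print(accumulated_weight(["vcd", "pirated", "cheap"], db, synonyms={"vcd": "video cd"}))
# 2.3
```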

4.4 The NLP Processor

While the parser processes texts lexicographically, the NLP processor analyzes them semantically. The main function of the match program is to filter out irrelevant Web pages so that the search space can be narrowed; after lexicographical analysis only a limited number of pages are chosen for semantic analysis. In order to understand the meaning of the text, we have proposed a Semanto-Syntactical parsing method, which is described briefly here. Since natural language processing is not the main topic of this paper, interested readers may refer to Chen et al (1995) for further details.

A Semanto-Syntactical analyzer works according to a head-modifier principle. Tesniere (1959) proposed, in the theory of dependency grammar, that every language construct consists of two parts: A + B (head and modifier). The head is defined as follows:

Definition: Head-Operand. In a language construct 'A + B', A is the head (operator) of B (operand), if:

1) the meaning of B is a function that narrows the meaning of 'A + B'; and

2) the syntactical property of the whole construct 'A + B' coincides with that of the category of A.

With this postulate, we are able to develop a language parser. The heads of 13 language constructs (of English) listed in Figure 2 can be determined unambiguously:


No. | Construct | Head | Modifier
1 | V+XP (NP, PP, S, AP, AdvP) | V | XP
2 | P+XP (NP, AP, S) | P | XP
3 | A (Predicative)+XP (NP, PP, S) | A | XP
4 | Aux+VP | Aux | VP
5 | Comp+S | Comp | S
6 | InfV (VP)+NP (Subject) | InfV | NP
7 | Det+N | N | Det
8 | A (Attributive)+N | N | A
9 | Part+XP (NP, AP, PP) | XP | Part
10 | Adv+XP (NP, AP, PP) | XP | Adv
11 | V+Adv | V | Adv
12 | NP+XP (PP, S) | NP | XP
13 | N1+N | N | N1

Key: A (Adjective); Adv (Adverb); AdvP (Adverb Phrase); AP (Adjective Phrase); Comp (Conjunctive); Det (Determinant); N, N1 (Noun); InfV (Infinite Verb); NP (Noun Phrase); P (Preposition); Part (Particle); PP (Prepositional Phrase); S (Sentence); V (Verb); VP (Verb Phrase); XP (as indicated in the parentheses).

Figure 2: Thirteen language constructs with their heads


After parsing, a sentence will be decomposed into several constructs. We will obtain a syntax tree together with a set of heads and modifiers. The meaning of a sentence may be captured by its concept terms that are constructed in the following way:

1) Heads belonging to the categories of nouns, verbs, and adjectives are concept terms because they are semantic bearers;

2) Heads that belong to the categories of prepositions and conjunctives, together with the null connector (Bruza and van der Weide, 1991), act as connectors that combine concept terms into expressions. The null connector is used to concatenate two or more nouns. The system for the formation of expressions has a simple syntax:

Expression → Term {Connector Expression}*

Term → String

Connector → null | to | from | and | of

where a term is associated with a noun, a verb, or an adjective, and a connector determines the type of relationship between two terms.

For example, let us analyze the sentence 'The compact disc is sold at the price of 20 dollars'.



Figure 3: A parsed tree of the sentence 'The compact disc is sold at the price of 20 dollars'.


The analyzed syntax tree is illustrated in Figure 3 and the heads can be read from the left column of Figure 4.


Head | Modifier
is sold | disc
disc | the compact
at | the price
price | of 20 US dollars
of | 20 US dollars
US dollars | 20

Figure 4: Heads and modifiers of the sentence 'The compact disc is sold at the price of 20 dollars'.


We ignore the verbs and construct a set of noun phrases, as is common in the information retrieval (IR) community. The concept terms obtained from the sentence are {disc, compact disc, price, US dollars, 20 US dollars, price of 20 US dollars}. In this way we are able to extract, e.g., 'price of 20 US dollars'. If we substitute 'of' with ':', we obtain 'price: 20 US dollars', which gives us an important piece of information, the price. With traditional methods, where keywords are extracted by counting their occurrences, we may obtain, e.g., 'US dollars' but not '20 US dollars' if the occurrence frequency of the latter is not above a certain threshold. In other words, Semanto-Syntactical parsing can take the semantics of sentences into account.
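The construction of concept terms from the head/modifier pairs can be illustrated with a toy reconstruction. This is not the paper's parser: the pairs are copied from Figure 4 and the combination rules are simplified assumptions.

```python
# Head/modifier pairs for the nominal heads of Figure 4 (verbs are ignored).
pairs = [
    ("disc", "compact"),            # 'disc' modified by 'compact'
    ("price", "of 20 US dollars"),  # 'price' modified by a prepositional phrase
    ("US dollars", "20"),           # 'US dollars' modified by the numeral '20'
]

concept_terms = set()
for head, modifier in pairs:
    concept_terms.add(head)                       # the bare noun is itself a term
    if modifier.startswith("of "):
        concept_terms.add(f"{head} {modifier}")   # -> 'price of 20 US dollars'
    else:
        concept_terms.add(f"{modifier} {head}")   # -> 'compact disc', '20 US dollars'

print(sorted(concept_terms))
# ['20 US dollars', 'US dollars', 'compact disc', 'disc',
#  'price', 'price of 20 US dollars']
```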

5. Operating with the System

In this section we describe a way of collecting crime information with the help of e-Detective and evaluate its retrieval effectiveness. We provide a detailed report of the experiments, which are done in three steps:

  • Construction of a database of concept terms associated with their weights;
     

  • Determination of the threshold of the accumulated weight for Web pages to be retrieved;
     

  • Semantic analysis of the pages with accumulated weight above the threshold.

Gordon and Pathak (1999) argued that there are several criteria for evaluating search tools, e.g., retrieval speed, friendliness of the user interface, ease of browsing search results, assistance in formulating queries, and so on. In our system we emphasize the relevance of retrieved data to the search intention, in terms of precision and recall (Salton and McGill, 1983), because these are the most important indicators of the effectiveness of a search tool.

Several experiments have been made to evaluate our system; we report two of them in this paper. In the first experiment, we try to collect information on child pornography, which is illegal in most countries. The second experiment is much more appealing since it cannot be done by lexicographical methods alone. Here, we try to collect information on the selling of pirated compact discs (CDs), which may be identified by, e.g., an unreasonably low price.

To report a well-conducted experiment, it is necessary to obtain meaningful measures of performance. That is, the experiment should follow a standard design and conform to well-known measurements, such as the recall-precision curve used in the field of Information Retrieval, to allow the results to be evaluated in a familiar context. In addition, precision and recall are computed at various cut-off values.

5.1 Searching Crime Information of Child Pornography

We report the crime investigation process, which is done in the following steps:

1) Construction of a Database of Search Concepts
In the first step we construct a database of search concepts. A search concept recorded in the database has the form (term, weight). The concept terms are selected from representative pages judged by human experts, and the weights are determined by the terms' relative frequencies on those pages. In this case, we extract 336 keywords from 97 representative pages of child pornography provided by domain experts; a minimal sketch of this weighting step is given below.
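The sketch below illustrates one way to derive (term, weight) pairs from expert-selected pages, taking a term's weight to be its relative frequency over the representative pages. This is a simplified reading of the description above; the function and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

def build_concept_db(representative_pages: Iterable[List[str]]) -> Dict[str, float]:
    """Weight each term by its relative frequency over the representative pages."""
    counts: Counter = Counter()
    for page_terms in representative_pages:   # each page is already parsed into terms
        counts.update(page_terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

# Toy usage with two tiny 'pages' of parsed terms:
db = build_concept_db([["teen", "photo", "free"], ["teen", "gallery"]])
print(db["teen"])   # 0.4 (two occurrences out of five terms in total)
```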

2) Determination of the Threshold
In order to determine the threshold of accumulated weight for a Web page of child pornography, we randomly choose pages from classified Web sites to form a pool of samples. Then we insert the above-mentioned 97 representative pages into the pool. Note that the representative pages are distinct from the pages originally in the pool. The pool serves as input for the system to evaluate retrieval precision and recall. Based on the precision-recall curve we can determine an adequate threshold for retrieving crime information.

In total, 36 Web sites are selected from eight portal sites, namely Dreamer, Hinet, Kimo, Openfind, SinaNet, Todo, Yam and Yahoo! (Figure 5). These Web sites are classified under 'Sex', 'Adult', 'Women & Girls', 'Porno', 'Fortune', 'Pastime', 'Teenager', and 'Partnership'. Note that no Web page is selected twice. In total, 5267 pages are randomly chosen from these Web sites.


 

No. | Portal site | Web site | Pages chosen
1 | Dreamer | 1 | 2
2 |  | 2 | 16
3 |  | 3 | 14
4 |  | 4 | 1
5 |  | 5 | 30
6 | Hinet | 1 | 1
7 |  | 2 | 5
8 |  | 3 | 1
9 |  | 4 | 5
10 |  | 5 | 22
11 | Kimo | 1 | 155
12 |  | 2 | 131
13 |  | 3 | 9
14 |  | 4 | 11
15 |  | 5 | 5
16 | Openfind | 1 | 1
17 |  | 2 | 1
18 |  | 3 | 2
19 |  | 4 | 5
20 |  | 5 | 4298
21 | SinaNet | 1 | 9
22 |  | 2 | 6
23 |  | 3 | 2
24 |  | 4 | 1
25 |  | 5 | 1
26 | Todo | 1 | 31
27 |  | 2 | 19
28 |  | 3 | 15
29 |  | 4 | 6
30 |  | 5 | 1
31 | Yam | 1 | 128
32 | Yahoo! | 1 | 169
33 |  | 2 | 1
34 |  | 3 | 1
35 |  | 4 | 2
36 |  | 5 | 160
Total |  |  | 5267

Figure 5: Number of Web sites chosen from portal sites


Accumulated weight of a page (≥) | Pages selected by our system | Correct pages judged by domain expert | Pages with weight ≥ the weight in column 1 | Precision (%) | Recall (%)
0.00 | 139 | 139 | 5261 | 2.64 | 100.00
0.14 | 90 | 139 | 1334 | 6.75 | 64.75
0.16 | 89 | 139 | 1278 | 6.96 | 64.03
0.21 | 87 | 139 | 1150 | 7.57 | 62.59
0.26 | 86 | 139 | 1065 | 8.08 | 61.87
0.27 | 85 | 139 | 1059 | 8.03 | 61.15
0.29 | 84 | 139 | 1040 | 8.08 | 60.43
0.33 | 83 | 139 | 982 | 8.45 | 59.71
0.34 | 80 | 139 | 974 | 8.21 | 57.55
0.36 | 79 | 139 | 953 | 8.29 | 56.83
...
3.98 | 34 | 139 | 248 | 13.71 | 24.46
4.27 | 33 | 139 | 235 | 14.04 | 23.74
4.34 | 31 | 139 | 231 | 13.42 | 22.30
...
35.01 | 5 | 139 | 112 | 4.46 | 3.60
44.47 | 4 | 139 | 91 | 4.40 | 2.88
74.50 | 3 | 139 | 26 | 11.54 | 2.16
82.59 | 2 | 139 | 13 | 15.38 | 1.44
87.99 | 1 | 139 | 10 | 10.00 | 0.72

Figure 6: Determination of Threshold Based on the Precision/Recall


From Figure 6 we learn that the precision at the accumulated weight 4.27 is a local maximum. It is therefore legitimate to select an accumulated weight of 4.27 as the threshold, where we obtain a precision of 14.04% and a recall of 23.74%.
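The threshold selection itself is easy to reproduce in code. The sketch below is illustrative only: it assumes we already have the accumulated weight of every page in the pool and the set of pages judged relevant by the domain experts, and it returns a precision/recall row for every candidate threshold, from which a local maximum of precision (here 4.27) can be picked.

```python
from typing import Dict, List, Set, Tuple

def precision_recall_curve(weights: Dict[str, float],
                           relevant: Set[str]) -> List[Tuple[float, float, float]]:
    """Treat every distinct accumulated weight as a candidate threshold and
    return (threshold, precision, recall) rows, as in Figure 6."""
    rows = []
    for threshold in sorted(set(weights.values())):
        retrieved = {page for page, w in weights.items() if w >= threshold}
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        rows.append((threshold, precision, recall))
    return rows
```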

3) Evaluation of the Retrieval Effectiveness

From our experiment we learn that the retrieval precision of 14.04% is not satisfactory. The reason is that most child pornography pages are presented in the form of images, and the meaning of the few words attached to these pictures is difficult to capture by keyword comparison alone. Semantic analysis by means of natural language processing is therefore applied. Nevertheless, the work done so far is useful, since it narrows the search space to a large extent.

Next, we fetch the texts of the pages above the threshold from the Web for semantic parsing. We use the Semanto-Syntactical method to parse these texts. A text is analyzed sentence by sentence to determine whether it bears the semantics of 'age under 16', 'school children' and the like. If the meaning of any sentence is relevant to these concepts, the page containing such sentences is selected for further investigation by human experts.
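The shape of this filtering step is sketched below; concept_terms stands in for the Semanto-Syntactical parser's output for a single sentence, and the target concepts are the illustrative ones named above, so this is an assumption-laden outline rather than the system's code.

```python
from typing import Callable, Iterable, Set

TARGET_CONCEPTS = {"age under 16", "school children"}   # illustrative target concepts

def page_is_suspicious(sentences: Iterable[str],
                       concept_terms: Callable[[str], Set[str]]) -> bool:
    """Flag a page if any of its sentences yields a concept term that
    contains one of the target concepts."""
    for sentence in sentences:
        terms = concept_terms(sentence)
        if any(target in term for term in terms for target in TARGET_CONCEPTS):
            return True
    return False
```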


 

No. | Portal site | Web site | Pages chosen | Relevance judged by human experts | Relevance judged by the system | Remarks
1 | Dreamer | 1 | 2 | Yes | No | *Foreign language
2 |  | 2 | 16 | Yes | No |
3 |  | 3 | 14 | Yes | No |
4 |  | 4 | 1 | No | No |
5 |  | 5 | 30 | No | No | *Unrecognizable code
6 | Hinet | 1 | 1 | No | No |
7 |  | 2 | 5 | No | No |
8 |  | 3 | 1 | Yes | No |
9 |  | 4 | 5 | Yes | No |
10 |  | 5 | 22 | No | No |
11 | Kimo | 1 | 155 | Yes | No | *Foreign language
12 |  | 2 | 131 | No | No |
13 |  | 3 | 9 | No | No |
14 |  | 4 | 11 | Yes | No | *Foreign language
15 |  | 5 | 5 | No | No |
16 | Openfind | 1 | 1 | No | No | *Foreign language
17 |  | 2 | 1 | Yes | Yes |
18 |  | 3 | 2 | No | No | *Foreign language
19 |  | 4 | 5 | Yes | Yes |
20 |  | 5 | 4298 | Yes | No | *Foreign language
21 | SinaNet | 1 | 9 | Yes | No |
22 |  | 2 | 6 | No | No |
23 |  | 3 | 2 | Yes | No | *Foreign language
24 |  | 4 | 1 | No | No | *Foreign language
25 |  | 5 | 1 | Yes | No |
26 | Todo | 1 | 31 | No | No |
27 |  | 2 | 19 | No | No |
28 |  | 3 | 15 | Yes | Yes |
29 |  | 4 | 6 | Yes | No |
30 |  | 5 | 1 | Yes | No |
31 | Yam | 1 | 128 | Yes | Yes |
32 | Yahoo! | 1 | 169 | Yes | Yes |
33 |  | 2 | 1 | Yes | No |
34 |  | 3 | 1 | No | No |
35 |  | 4 | 2 | Yes | No |
36 |  | 5 | 160 | Yes | Yes |

 

Figure 7: Identifying relevant Web sites


Based on the data shown in Figure 7, 21 of the 36 Web sites are classified correctly by the system. The accuracy in classifying Web sites is therefore 21/36 × 100% = 58.33%.

5.2 Finding Clues of Selling Pirated CDs on the Web

In order to identify a site selling pirated CDs on the Web, we start from the assumption that pirated CDs are sold at an unreasonably low price. Therefore, our search task is to determine:

  • which sites are selling CDs; and
     

  • which CDs are sold at an unusually low price.

To fulfil the above search task we first construct a database of search concepts. Based on this database we can identify which pages are advertising the sale of CDs. Then we further identify the prices offered to the public on those pages. Here, the technique of natural language processing is a necessity.
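How the price check might look is sketched below. It is an assumption-laden illustration: it takes expressions of the form 'price: 120 NT dollars' (as produced by the connector substitution of Section 4.4), and the reference price and regular expression are invented for the example.

```python
import re
from typing import Optional

# Illustrative reference price only; in practice a market price list would be used.
REASONABLE_MIN_PRICE = 200.0    # e.g. in New Taiwan dollars for a legitimate CD

PRICE_PATTERN = re.compile(r"price\s*:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def extract_price(expression: str) -> Optional[float]:
    """Read the numeric price out of an expression such as 'price: 120 NT dollars'."""
    match = PRICE_PATTERN.search(expression)
    return float(match.group(1)) if match else None

def looks_pirated(expression: str) -> bool:
    """Flag a CD offer whose advertised price is unreasonably low."""
    price = extract_price(expression)
    return price is not None and price < REASONABLE_MIN_PRICE

print(looks_pirated("price: 100 NT dollars per CD"))   # True under these assumptions
```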

(1) Constructing a Knowledge Base

We select some typical Web pages related to the topic of CDs to form the sample space. The pages are selected from five directory-based portal sites for extracting concept terms. Domain experts are asked to judge how representative these pages are of our search intention; irrelevant documents are discarded and pages are selected anew. Analogous to the previous experiment, a representative sample of 122 pages is prepared by human experts, and 879 terms are extracted from them. The most significant concept terms, with their weights, are listed in Figure 8.

 

No. | Concept term | Frequency | Accumulated occurrences | 100% | 99.5% | 99% | 97.5% | 95% | 90%
1 | CD | 941 | 941 | 10.19 | 10.24 | 10.29 | 10.45 | 10.72 | 11.32
2 | Selection | 536 | 1477 | 15.99 | 16.07 | 16.15 | 16.40 | 16.83 | 17.77
3 | Set | 531 | 2008 | 21.74 | 21.85 | 21.96 | 22.30 | 22.89 | 24.15
4 | Piece | 343 | 2351 | 25.45 | 25.58 | 25.71 | 26.11 | 26.80 | 28.28
5 | Selected Album | 319 | 2670 | 28.91 | 29.05 | 29.20 | 29.65 | 30.43 | 32.12
6 | Chinese Version | 300 | 2970 | 32.16 | 32.32 | 32.48 | 32.98 | 33.85 | 35.73
7 | Exclusive Album | 282 | 3252 | 35.21 | 35.39 | 35.56 | 36.11 | 37.06 | 39.12
8 | Disk | 197 | 3449 | 37.34 | 37.53 | 37.72 | 38.30 | 39.31 | 41.49
9 | VCD | 186 | 3635 | 39.36 | 39.55 | 39.75 | 40.37 | 41.43 | 43.73
10 | Album | 175 | 3810 | 41.25 | 41.46 | 41.67 | 42.31 | 43.42 | 45.83
Total keywords |  |  |  | 879 | 833 | 787 | 648 | 417 | 210
Accumulated occurrences of terms |  |  |  | 9236 | 9190 | 9144 | 9005 | 8774 | 8313

*Percentages are obtained by dividing the accumulated term occurrences by the total term occurrences at each cut-off value.

Figure 8: The most frequent terms with their weights at various cut-off values


(2) Determination of Thresholds

From seven portal sites, namely Dreamer, Hinet, Kimo, Openfind, SinaNet, Todo and Yam, we randomly choose 32 Web sites, providing a total of 3290 pages indexed under CD (Figure 9) as our training set. We then compare the correctness of our classification with that of these seven search engines.


 

No. | Portal site | Web site | No of pages
1 | Dreamer | 1 | 4
2 |  | 2 | 1
3 |  | 3 | 13
4 |  | 4 | 2
5 |  | 5 | 38
6 | Hinet | 1 | 10
7 |  | 2 | 186
8 |  | 3 | 2
9 |  | 4 | 3
10 |  | 5 | 3
11 | Kimo | 1 | 1
12 |  | 2 | 1
13 | Openfind | 1 | 541
14 |  | 2 | 13
15 |  | 3 | 1
16 |  | 4 | 1
17 |  | 5 | 1
18 | SinaNet | 1 | 534
19 |  | 2 | 7
20 |  | 3 | 29
21 |  | 4 | 1
22 |  | 5 | 3
23 | Todo | 1 | 3
24 |  | 2 | 3
25 |  | 3 | 5
26 |  | 4 | 16
27 |  | 5 | 1
28 | Yam | 1 | 699
29 |  | 2 | 5
30 |  | 3 | 1090
31 |  | 4 | 4
32 |  | 5 | 69
Total |  |  | 3290

Figure 9: Selected pages indexed under CD in portal sites and Web sites


These 3290 pages, which are considered relevant to CDs by the other search engines, are the input to our system. The system output (Figure 10) gives us a hint for determining the optimal threshold based on recall and precision. Figure 10 also shows the difference between the results provided by our system and by human experts: 2907 of the 3290 pages are verified as relevant to the search intention by human experts.


Accumulated weight of a page (≥) | Pages selected by our system | Correct pages judged by domain experts | Pages with weight ≥ the weight in column 1 | Precision (%) | Recall (%)
0.005 | 3273 | 2907 | 2907 | 88.82 | 100.00
0.01 | 3053 | 2794 | 2907 | 91.52 | 96.11
0.02 | 3051 | 2792 | 2907 | 91.51 | 96.04
0.03 | 3044 | 2788 | 2907 | 91.59 | 95.91
0.05 | 3041 | 2787 | 2907 | 91.65 | 95.87
0.10 | 3036 | 2784 | 2907 | 91.70 | 95.77
0.11 | 3034 | 2782 | 2907 | 91.69 | 95.70
0.12 | 3030 | 2779 | 2907 | 91.72 | 95.60
0.13 | 3020 | 2769 | 2907 | 91.69 | 95.25
0.14 | 3018 | 2768 | 2907 | 91.72 | 95.22
...
13.79 | 1814 | 1704 | 2907 | 93.94 | 58.62
13.80 | 1768 | 1700 | 2907 | 96.15 | 58.48
13.83 | 1767 | 1699 | 2907 | 96.15 | 58.45
13.90 | 1766 | 1698 | 2907 | 96.15 | 58.41
13.92 | 1765 | 1697 | 2907 | 96.15 | 58.38
14.02 | 1763 | 1695 | 2907 | 96.14 | 58.31
...
3396.78 | 5 | 5 | 2907 | 100.00 | 0.17
3545.12 | 4 | 4 | 2907 | 100.00 | 0.14
4128.15 | 3 | 3 | 2907 | 100.00 | 0.10
4411.46 | 2 | 2 | 2907 | 100.00 | 0.07
6052.15 | 1 | 1 | 2907 | 100.00 | 0.03

Figure 10: Determination of threshold based on recall and precision


Experiments are also made at different cut-off values (Gordon and Pathak, 1999) of accumulated term occurrences. The corresponding thresholds are listed in Figure 11, from which we can ascertain that the choice of a support value of 13.8 for 99.5% of the total term occurrences is adequate: we obtain 96.15% retrieval precision and 58.48% recall. This shows the superiority of the e-Detective system in precision compared with the classification precision, about 89%, of the other search engines.


Percentage of accumulated term occurrences | Threshold for support value | Precision (%) | Recall (%)
99.5% | 13.80 | 96.15 | 58.48
99% | 14.34 | 95.37 | 58.43
97.5% | 13.76 | 96.03 | 58.47
95% | 13.92 | 95.87 | 58.34
90% | 14.61 | 96.11 | 58.13

Figure 11: Thresholds at various cut-off values of percentage of accumulated term occurrences


5.3 Summary of the Experiments

The findings in the experiments are summarized as follows:

1) We use the databases of concepts as a basis for searching data on the Web; however, not all concepts need to be used. While we use 99.5% of the most significant terms to search for CDs and attain optimal retrieval relevance, we use 97.5% of the most significant terms when searching for pornographic information;

2) The threshold for searching for pornographic information is 3.75, in contrast to the threshold of 13.8 for searching for CD-related information. The reason for this difference is that CD pages contain more narrative descriptions, whereas pornographic pages are dominated by pictures;

3) The 100% accuracy in classifying CD Web sites is much better than the 58.33% accuracy in classifying pornographic sites. The reason lies in the amount of narrative information contained in the pages;

4) Searching for crime information goes beyond syntactical comparison; semantic analysis should follow in order to extract the interesting information.

6. Conclusion and Discussion

Searching for crime information is not an easy task. In this paper we present the idea of constructing a proprietary search engine for law enforcement agencies, called the e-Detective, which differs from a common search engine in many respects: it can process Web pages both syntactically and semantically; it can work assiduously in the background and report the search results periodically; it provides the user with a well-ranked list of relevant pages for easy reference; and it organises interesting Web sites so that it knows where to acquire the information the user wants.

The first set of system evaluation statistics shows that the precision is high in the sense of information retrieval. Thus, the system has proven capable of carrying out retrieval tasks to find information supporting crime investigation. Even though the recall remains modest compared with the precision, research in this direction deserves our dedication in the future. Another research direction is the application of image processing techniques to crime information search.

References

Bruza P D and van der Weide T P (1991) 'The Modelling and Retrieval of Documents Using Index Expressions', ACM SIGIR FORUM 25(2), 91-102.

Baeza-Yates R A (1998) 'Searching the World Wide Web: Challenges and Partial Solutions', Proceedings of Annual Meeting of Pacific Neighborhood Consortium, (May 15-18), Taipei, 153-166.

Chen P S (1994) 'On Inference Rules of Logic-Based Information Retrieval Systems', Information Processing & Management, Vol. 30, No.1, 43-59.

Chen P S and Hennicker R and Jarke M (1995) 'On the Retrieval of Formal Specifications for Reuse', Journal of the Chinese Institute of Engineers.

Chen P S (1998) Collection and Investigation of Illegal Information on Networks, Technical Report (in Chinese), (Taoyuan: Central Police University).

Filman R and Pena-Mora F (1998) 'Seek, And Ye Shall Find', IEEE Internet Computing, (July/August), 78-83.

Gordon M and Pathak P (1999) 'Finding Information on the World Wide Web: the retrieval effectiveness of Search Engines', Information Processing & Management 35, 141-180.

Konopnicki D and Shmueli O (1995) 'W3QS: A Query System for the World-Wide Web', Proceedings of the 21st International Conference on Very Large Data Bases, 54-65.

Salton G and McGill M J (1983) Introduction to Modern Information Retrieval, (New York: McGraw-Hill).

Tesniere L (1959) Éléments de Syntaxe Structurale, (Paris: Klincksieck).

Hyperlinks

<http://www.altavista.com>

<http://www.yahoo.com>

<http://www.openfind.com.tw>

<http://www.dreamer.com.tw/>

<http://www.hinet.net/>

<http://www.kimo.com.tw/>

<http://www.todo.com.tw/>

<http://www.yam.com.tw/>
