Tianwang storage format for raw web pages


  • store pages according to the following format (Tianwang format):

      1) a raw page depot consists of records. Every record holds the raw data of one page. Records are stored sequentially, with no delimiter between them.
      2) a record consists of a header (HEAD), the data (DATA) and a line feed ('\n'), that is: HEAD + blank line + DATA + '\n'
      3) a header consists of properties. Each property is a non-blank line. Blank lines are forbidden in the header.
      4) a property consists of a name and a value, separated by ":".
      5) the first property of the header must be the version property, for example: version: 1.0
      6) the last property of the header must be the length property, for example: length: 1800
      7) for the sake of simplicity, all property names should be in lowercase.
      Example (the "tip" comment refers to the code that generates these lines and is not part of the record):

      version: 1.0
      url: http://net.pku.cn/~cnds/
      origin: http://net.pku.cn/~cnds
      date: Tue, 16 Sep 2003 14:13:20 GMT
      ip: 162.105.203.25
      length: 11683       // tip: length = iPage.m_nLenContent + iPage.m_nLenHeader + 1

      HTTP/1.1 200 OK
      Date: Tue, 16 Sep 2003 14:19:15 GMT
      Server: Apache/2.0.40 (Red Hat Linux)
      Last-Modified: Tue, 16 Sep 2003 13:18:19 GMT
      ETag: "10f7a5-2c8e-375a5cc0"
      Accept-Ranges: bytes
      Content-Length: 11406
      Connection: close
      Content-Type: text/html; charset=GB2312

      XXXXXXXXX
      XXXXXXXXX
      ....
      XXXXXXXXX
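
      The rules above can be sketched in code. The following is a minimal
      sketch in Python, not the original implementation. The function names
      write_record and read_records are hypothetical. It assumes that the
      length property counts the bytes of DATA (the HTTP header plus the
      page content, which seems consistent with the tip on the length line),
      that every header line ends with '\n', and that header text can be
      encoded as latin-1.

```python
import io


def write_record(out, properties, data):
    """Append one record: HEAD + blank line + DATA + '\\n'.

    properties: list of (name, value) pairs placed between version and length.
    data: raw bytes of the page (assumed: HTTP header plus content).
    """
    lines = ["version: 1.0"]                      # rule 5: version comes first
    for name, value in properties:
        lines.append(f"{name.lower()}: {value}")  # rule 7: lowercase names
    lines.append(f"length: {len(data)}")          # rule 6: length comes last
    head = "\n".join(lines) + "\n"
    out.write(head.encode("latin-1") + b"\n" + data + b"\n")


def read_records(inp):
    """Yield (properties, data) for each record in a binary stream."""
    while True:
        line = inp.readline()
        if not line:
            return                                # clean end of the depot
        props = {}
        # The header ends at the first blank line (rule 3 forbids blank
        # lines inside the header itself).
        while line not in (b"\n", b""):
            # Split on the first ":" only, so values such as URLs survive.
            name, _, value = line.decode("latin-1").partition(":")
            props[name.strip()] = value.strip()
            line = inp.readline()
        if "length" not in props:
            raise ValueError("truncated or malformed header")
        # Records carry no delimiter, so DATA is read by its length.
        data = inp.read(int(props["length"]))
        inp.read(1)                               # consume the trailing '\n'
        yield props, data


if __name__ == "__main__":
    depot = io.BytesIO()
    write_record(depot, [("url", "http://net.pku.cn/~cnds/")],
                 b"HTTP/1.1 200 OK\n\n<html></html>")
    depot.seek(0)
    for props, data in read_records(depot):
        print(props["url"], len(data))
```

      Because records have no delimiter between them, the reader relies
      entirely on the length property to find the end of DATA. A blank line
      inside the page content therefore does not confuse the parser.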