
Linux shell and the Web


1, Use wget to download files or Web pages

wget URL
Example: wget http://slynux.org
wget URL1 URL2 URL3

Use -O to specify the output file name

wget URL1 -O file1

Use -o to specify a log file, so the log is written there instead of to stdout

wget ftp://example_domain.com/somefile.img -O dloaded_file.img -o log

Use -t to specify the number of times to retry

wget -t 5 URL
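
To keep retrying indefinitely, 0 can be passed (wget treats 0 as infinite retries):

wget -t 0 URL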

Limit the download rate

wget --limit-rate 20k http://example.com/file.iso

Specify a download quota; the download quits as soon as the quota is exhausted

wget -Q 100 http://example.com/file1 http://example.com/file2

Use -c to support resuming interrupted downloads

wget -c URL

Copy or mirror an entire website

wget --mirror exampledomain.com
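
An alternative sketch using wget's standard recursion flags (-r to download recursively, -N to use timestamping, -l to cap the depth; DEPTH is a placeholder):

wget -r -N -l DEPTH exampledomain.com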

Access pages that require authentication

wget --user username --password pass URL

Use lynx to dump a web page as a plain-text file

lynx -dump URL > webpage_as_text.txt

e.g.:
lynx -dump http://google.com > plain_text_page.txt

2, A more powerful tool: cURL

--silent: do not display the progress information

curl URL --silent -O

-O writes the output to a file named after the remote file, instead of stdout

-o specifies the output file name explicitly
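
For example (the URL and file names are illustrative):

curl http://slynux.org/index.html -O # saved as index.html
curl http://slynux.org/index.html -o new.html # saved as new.html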

curl URL/file -C offset

curl -C - URL # use -C - to automatically resume an interrupted download

Use cURL to set the referer (a string identifying the page from which this page was reached)

curl --referer http://google.com http://slynux.org

Use curl to specify cookies

curl http://example.com --cookie "user=slynux;pass=hack"

Save the cookies to a file

curl URL --cookie-jar cookie-file
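
--cookie also accepts a file name, so the saved cookie file can be replayed on a later request (cookie-file as above):

curl URL --cookie cookie-file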

Use curl to set the user agent (-A or --user-agent)

curl URL --user-agent "Mozilla/5.0"

Other HTTP header information can be set with -H, for example:

curl -H "Host: www.slynux.org" -H "Accept-language: en" URL

Set the transfer rate limit with --limit-rate

curl URL --limit-rate 20k

Set the maximum file size to download

curl URL --max-filesize bytes

Use curl with HTTP authentication

curl -u user:pass http://test_auth.com
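
To be prompted for the password instead of typing it on the command line, supply only the username:

curl -u user http://test_auth.com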

Print only the HTTP header information

curl -I URL

3, Accessing Gmail from the command line

The idea is to authenticate and then download the RSS feed. The credentials are supplied with the -u username:password option.

sed -n: print only the matched portions,

See the code in [visit_gmail_command.sh](https://github.com/burness/linux_shell/tree/master/chapter5/visit_gmail_command.sh)

Key code: sed -n 's/.*<title>\(.*\)<\/title.*<author><name>\([^<]*\)<\/name><email>\([^<]*\).*/Author: \2 [\3] \nSubject: \1\n/p'

The sed command pulls the mail title, sender name, and email address out of each entry and rewrites them into the specified format: Author: \2 [\3] \nSubject: \1\n

The trailing p flag prints the matched lines.
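
A minimal sketch of what the linked script likely does, assuming the Gmail Atom feed endpoint https://mail.google.com/mail/feed/atom; the placeholder credentials and the SHOW_COUNT limit are illustrative, not the linked script verbatim:

```
#!/bin/bash
# Sketch only: variable names and the SHOW_COUNT limit are assumptions.
username="PUT_USERNAME_HERE"
password="PUT_PASSWORD_HERE"
SHOW_COUNT=5 # number of recent unread mails to show

# Fetch the authenticated Atom feed, flatten it to one line, split it so
# each <entry> sits on its own line, then extract sender and subject.
curl -u "$username:$password" --silent "https://mail.google.com/mail/feed/atom" | \
  tr -d '\n' | sed 's:</entry>:\n:g' | \
  sed -n 's/.*<title>\(.*\)<\/title.*<author><name>\([^<]*\)<\/name><email>\([^<]*\).*/Author: \2 [\3] \nSubject: \1\n/p' | \
  head -n $(( SHOW_COUNT * 3 )) # each mail produces 3 lines of output
```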

4, Parsing website data

Using lynx. It only seems to work for some sites; for example, the Baidu page does not display correctly (perhaps lynx only handles plain HTML?), while the Wikipedia page below works fine.

lynx http://zh.wikipedia.org/wiki/Lynx

5, Building an image crawler and downloader

First, parse the arguments to get the target directory and the URL;
next, extract the baseurl from the URL so that any relative paths can later be converted to absolute ones for downloading;
finally, loop over each line of the image list and download it with curl.

```
#!/bin/bash
# Image crawler: download every image referenced by a page.

if [ $# -ne 3 ];
then
    echo "Usage: $0 URL -d DIRECTORY"
    exit -1
fi

# Walk the arguments: -d takes the directory, anything else is the URL.
for i in {1..4}
do
    case $1 in
    -d) shift; directory=$1; shift;;
    *) url=${url:-$1}; shift;;
    esac
done

mkdir -p "$directory"
baseurl=$(echo "$url" | egrep -o "https?://[a-z.]+")

# Pull the src attribute out of every <img> tag into a temporary list.
curl -s "$url" | egrep -o "<img src=[^>]*>" |
sed 's/<img src="\([^"]*\).*/\1/g' > /tmp/$$.list

# Convert root-relative paths into absolute URLs.
sed -i "s|^/|$baseurl/|" /tmp/$$.list

cd "$directory"
while read filename;
do
    curl -s -O "$filename"
done < /tmp/$$.list
```
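
Usage might look like the following, assuming the script is saved as img_downloader.sh (the name and URL are illustrative):

./img_downloader.sh http://slynux.org -d images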

6, Web photo album generator

Generate an album from the photos in the current directory; essentially this just writes the photo paths into the designated part of an HTML document.

Concretely, convert is used to resize each photo into a thumbnail, which is then displayed and linked to the full-size image.

```
#!/bin/bash

echo "Creating album.."
mkdir -p thumbs
cat <<EOF1 > index.html
<html>
<head>
<style>

body
{
  width:470px;
  margin:auto;
  border: 1px dashed grey;
  padding:10px;
}

img
{
  margin:5px;
  border: 1px solid black;

}
</style>
</head>
<body>
<center><h1> #Album title </h1></center>
<p>
EOF1

for img in *.jpg;
do
  /usr/bin/convert "$img" -resize "100x" "thumbs/$img"
  echo "<a href="$img" ><img src="thumbs/$img" title="$img" /></a>" >> index.html
done

cat <<EOF2 >> index.html

</p>
</body>
</html>
EOF2

echo Album generated to index.html
```
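
To try it, run the script (saved, say, as generate_album.sh; the name is illustrative, and ImageMagick must be installed for /usr/bin/convert to exist) in a directory of .jpg files, then open the result in a browser:

cd photo_dir
./generate_album.sh # generates thumbs/ and index.html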
