2014-05-07

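Trying per-process CPU usage alerts on Windows with fluentd ... continued

Addendum, 2014.08.27

Using the TextFormatter that shipped in Release 0.10.49, I forcibly replaced the places that used plaintextformatter with a TextFormatter fixed to json. That resolved the character encoding conversion error shown at the end of this post.
fluentd/ChangeLog at master · fluent/fluentd · GitHub

This is a continuation of the previous post ("Trying per-process CPU usage alerts on Windows with fluentd" on メモ帳みたいなもの).
What I'm doing here is routine for anyone who uses fluentd, so this is mostly a memo for myself.

The "I'd also like min/max/avg per hour" item from last time, plus graphing
Added a part that receives the Windows event log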

References

http://blog.livedoor.jp/sonots/archives/25189820.html

nxlog.conf

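im_msvistalog also worked when I converted events to syslog format with Exec, but the resulting records felt redundant, so I made it take the same form as im_file.
To separate the tags received on the fluentd side, I added an Output.
Reusing the same modules across routes did not seem to work, so I ended up adding a separate Input/Processor/Output/Route for everything.
I wonder if this is really the right way...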

 define ROOT C:\Program Files (x86)\nxlog
 Moduledir %ROOT%\modules
 CacheDir %ROOT%\data
 Pidfile %ROOT%\data\nxlog.pid
 SpoolDir %ROOT%\data
 LogFile %ROOT%\data\nxlog.log

 <Extension syslog>
   Module      xm_syslog
 </Extension>

 <Extension json>
   Module      xm_json
 </Extension>

 <Input in>
   Module im_file
   File "D:\work\winfluent\srclog\log.txt"
   SavePos TRUE
   InputType LineBased
 </Input>

 <Input ev>
   Module      im_msvistalog
#   Exec        $Message = to_json(); to_syslog_bsd();
   SavePos TRUE
   ReadFromLast TRUE
 </Input>

 <Processor t>
   Module pm_transformer
   OutputFormat syslog_bsd
   Exec $Message=(": "+$raw_event);
 </Processor>

 <Processor t_ev>
   Module pm_transformer
   OutputFormat syslog_bsd
   Exec $Message=($raw_event);
 </Processor>

 <Output out>
   Module om_udp
   Host xxx.xxx.xxx.xxx
   Port 55514
 </Output>

 <Output out_ev>
   Module om_udp
   Host xxx.xxx.xxx.xxx
   Port 55515
 </Output>

 <Route r>
   Path in => t => out
 </Route>
 
 <Route r_ev>
   Path ev => t_ev => out_ev
 </Route>

Additional plugins

/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-numeric-monitor
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-growthforecast

Plugins installed but not used yet

/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-file-alternative
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-datacounter
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-redeliver

td-agent.conf

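I wanted different tags for im_file and im_msvistalog, so I added another receiving port.
In the second rewrite_tag_filter, having the PID in the tag got in the way when sending to growthforecast with "tag_for section", so I removed the PID from the tag. This is a trade-off against having the PID appear in the notifier tag, and you have to pick one.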

<source>
  type syslog
  protocol_type udp
  port 55514
  tag  winps
</source>

<source>
  type syslog
  protocol_type udp
  port 55515
  tag  winev
</source>

##################################################################
<match winev.**>
  type copy
  <store>
    type file
    path /var/log/td-agent/arch/winev
    time_slice_format  %Y%m%d
    buffer_type file
    buffer_path /var/log/td-agent/buffer/winev/
    buffer_chunk_limit 100m
    flush_interval 5s
  </store>

</match>


##################################################################
<match winps.**>
  type copy
  <store>
    type file
    path /var/log/td-agent/arch/winps
    buffer_type file
    buffer_path /var/log/td-agent/buffer/winps/
    buffer_chunk_limit 100m
    flush_interval 5s
  </store>

  <store>
    type filter
    all deny
    allow message: /firefox/, message: /Idle/
  </store>

</match>

<match filtered.**>
  type rewrite_tag_filter
  rewriterule1  host  ^(.+)$  filterrewrited.$1.winps
  remove_tag_prefix filtered
</match>
<match filterrewrited.**>
  type parser
  remove_prefix filterrewrited
  add_prefix winproc
  format /^(?<Name>[^ ]* +\d+) +(?<Cpu>\d+) +(?<Thd>\d+) +(?<Hnd>\d+) +(?<Priv>\d+) +(?<CpuTime>.+) +(?<ElapsTime>.+)$/
  key_name message
  suppress_parse_error_log true
</match>

<match winproc.**>
  type rewrite_tag_filter
  rewriterule1  Name  ^([^ ]*) +(\d+)$  $1.${tag}
  remove_tag_prefix winproc
</match>
<match firefox.**>
  type copy
  <store>
    type notifier
    <def>
      pattern     firefox
      check       numeric_upward
      warn_threshold 5
      crit_threshold 10
      target_keys Cpu
    </def>
  </store>
  <store>
    type numeric_monitor
    count_interval 60
    aggregate tag
    output_per_tag yes
    tag_prefix monitor
    monitor_key Cpu
    output_key_prefix cpu_stat
    percentiles 50
  </store>
</match>
<match Idle.**>
  type copy
  <store>
    type notifier
    <def>
      pattern     Idle
      check       numeric_downward
      warn_threshold 95
      crit_threshold 80
      target_keys Cpu
    </def>
  </store>
  <store>
    type numeric_monitor
    count_interval 60
    aggregate tag
    output_per_tag yes
    tag_prefix monitor
    monitor_key Cpu
    output_key_prefix cpu_stat
    percentiles 50
  </store>
</match>

##################################################################
<match notification.**>
  type stdout
#  type     mail
#  host     localhost
#  port     25
#  from     FROM
#  to       TO
#  subject  fluentd notification
#  out_keys pattern,target_tag,target_key,level,value,message_time
</match>

#<match monitor.**>
#  type stdout
#</match>
<match monitor.**>
  type copy
  <store>
    type stdout
  </store>
  <store>
    type growthforecast
    remove_prefix monitor
    gfapi_url http://localhost:5125/api/
    service cpustat
    tag_for section
    name_keys cpu_stat_max,cpu_stat_min,cpu_stat_avg,cpu_stat_percentile_50
  </store>
</match>
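
As a rough illustration of what the numeric_monitor store above hands to growthforecast, the following Ruby sketch computes min/max/avg and a 50th percentile over one count_interval window of Cpu values. The sample values and the nearest-rank percentile rule are assumptions for illustration; the plugin's own calculation may differ in detail.

```ruby
# Sketch only: summarizes one window of Cpu samples in the shape the
# name_keys above (cpu_stat_min/max/avg/percentile_50) suggest.
# The nearest-rank percentile is an assumption, not taken from the plugin.
def summarize(values, prefix: 'cpu_stat', percentile: 50)
  sorted = values.sort
  rank = (percentile / 100.0 * sorted.size).ceil # nearest-rank, 1-based
  {
    "#{prefix}_min" => sorted.first,
    "#{prefix}_max" => sorted.last,
    "#{prefix}_avg" => values.sum.to_f / values.size,
    "#{prefix}_percentile_#{percentile}" => sorted[[rank, 1].max - 1]
  }
end

# Hypothetical Cpu values seen for one tag during a 60-second window.
window = [3, 10, 5, 8]
stats = summarize(window)
puts stats
```

Each key in the resulting hash corresponds to one graph name listed in name_keys.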

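Other notes

Saving the event log to a file with fluent-plugin-file-alternative fails with an error in the character encoding conversion step.
I haven't tracked it down yet, so this is just a note for now. With out_file, the records were saved to a file without problems.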

2014-05-07 14:10:07 +0900 [warn]: emit transaction failed  error_class=Encoding::UndefinedConversionError error=#<Encoding::UndefinedConversionError: "\xE3" from ASCII-8BIT to UTF-8>
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `encode'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `to_json'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `stringify_record'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:115:in `format'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:527:in `block in emit'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/event.rb:54:in `call'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/event.rb:54:in `each'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:518:in `emit'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/match.rb:36:in `emit'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/engine.rb:152:in `emit_stream'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/engine.rb:132:in `emit'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/plugin/in_syslog.rb:199:in `emit'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/plugin/in_syslog.rb:173:in `receive_data'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/plugin/in_syslog.rb:245:in `call'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/plugin/in_syslog.rb:245:in `on_read'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/cool.io-1.1.1/lib/cool.io/io.rb:108:in `on_readable'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/cool.io-1.1.1/lib/cool.io/io.rb:170:in `on_readable'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/cool.io-1.1.1/lib/cool.io/loop.rb:96:in `run_once'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/cool.io-1.1.1/lib/cool.io/loop.rb:96:in `run'
  2014-05-07 14:10:07 +0900 [warn]: /usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/plugin/in_syslog.rb:118:in `run'
2014-05-07 14:10:07 +0900 [error]: syslog failed to emit error="\"\\xE3\" from ASCII-8BIT to UTF-8" error_class="Encoding::UndefinedConversionError" tag="winev.user.info" record="{\"host\":\"desktop-PC\",\"ident\":\"Service_Control_Manager\",\"pid\":\"672\",\"message\":\": 2014-05-06 20:59:12 nsr-PC INFO 7036 WWAN AutoConfig \u30B5\u30FC\u30D3\u30B9\u306F \u5B9F\u884C\u4E2D \u72B6\u614B\u306B\u79FB\u884C\u3057\u307E\u3057\u305F\u3002\\r\"}"

2014-05-04

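Trying per-process CPU usage alerts on Windows with fluentd

At some point a Windows use case was added to the fluentd docs. I noticed far too late.
　http://docs.fluentd.org/ja/articles/windows
So let's try it.
The goal: alert when the CPU usage of a specific process on Windows exceeds N%.
Following the use case, I install nxlog on the Windows side and get as far as receiving with type syslog on the fluentd side.
For the receiving fluentd, I installed td-agent to make setup easy.
I managed to get what I wanted, but the setup feels inefficient, so I'm not fully satisfied.

References

How to write filter rules　http://muddydixon.hatenablog.com/entry/2012/08/31/144853
Checking parser regular expressions　http://fluentular.herokuapp.com/

What's used on the Windows side (no need to list fluentd on the Linux side, right?)

PsList http://technet.microsoft.com/en-us/sysinternals/bb896682
nxlog http://nxlog.org/download

Rough structure

pslist -s writes the pslist results to a file
nxlog forwards the pslist results to fluentd
fluentd receives them with syslog
the filter step (filtered) narrows them down to the required processes
notifier sets thresholds and emits alerts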

nxlog.conf

 define ROOT C:\Program Files (x86)\nxlog
 Moduledir %ROOT%\modules
 CacheDir %ROOT%\data
 Pidfile %ROOT%\data\nxlog.pid
 SpoolDir %ROOT%\data
 LogFile %ROOT%\data\nxlog.log
 <Extension syslog>
   Module      xm_syslog
 </Extension>
 <Extension json>
   Module      xm_json
 </Extension>
 <Input in>
   Module im_file
   File "D:\work\winfluent\srclog\log.txt"
   SavePos TRUE
   InputType LineBased
 </Input>
 <Processor t>
   Module pm_transformer
   OutputFormat syslog_bsd
   Exec $Message=(": "+$raw_event);
 </Processor>
 <Output out>
   Module om_tcp
   Host xxx.xxx.xxx.xxx
   Port 5140
 </Output>
 <Route r>
   Path in => t => out
 </Route>

Additional plugins on the fluentd side

/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-filter
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-mail
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-notifier
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-parser
/usr/lib64/fluent/ruby/bin/gem install fluent-plugin-rewrite-tag-filter

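td-agent.conf (only the part from receiving via syslog onward; the final goal is mail notification)

Alert when the CPU usage of the firefox process is at or above 5%/10%
Alert when the CPU usage of Idle is at or below 50%/20% (a substitute for the machine's overall CPU usage)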
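
To make the two threshold rules concrete, here is a minimal Ruby sketch of the warn/crit decision. The function name alert_level and the sample values are hypothetical, and the inclusive boundaries simply follow the "at or above" / "at or below" wording of this post; how fluent-plugin-notifier treats values exactly on a threshold is not verified here.

```ruby
# Sketch of the warn/crit decision described above. The ">=" / "<="
# boundaries follow the wording of this post; the notifier plugin's exact
# boundary handling is an assumption.
def alert_level(value, check:, warn:, crit:)
  case check
  when :numeric_upward
    return 'crit' if value >= crit
    return 'warn' if value >= warn
  when :numeric_downward
    return 'crit' if value <= crit
    return 'warn' if value <= warn
  else
    raise ArgumentError, "unknown check: #{check}"
  end
  nil # within the normal range, no alert
end

firefox = { check: :numeric_upward, warn: 5, crit: 10 }
idle = { check: :numeric_downward, warn: 50, crit: 20 }

p alert_level(7, **firefox)
p alert_level(12, **firefox)
p alert_level(91, **idle)
p alert_level(15, **idle)
```

The Idle rule is checked downward because a low Idle value means the machine as a whole is busy.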

# Straight from the sample
<source>
  type syslog
  protocol_type tcp
  port 5140
  tag winlog
</source>

# Filter by process name with the filter plugin to reduce the volume passed downstream.
# Records are filtered exactly as received from syslog, so matching uses regular expression patterns.
<match winlog.**>
  type filter
  all deny
  allow message: /firefox/, message: /Idle/
</match>
#<match filtered.**>
#  type stdout
#</match>

# Passing records through parser next drops the key=host received via syslog,
# so use rewrite_tag_filter here to move the host name into the tag.
<match filtered.**>
  type rewrite_tag_filter
  rewriterule1  host  ^(.+)$  filterrewrited.$1.${tag}
  remove_tag_prefix filtered
</match>

# Use parser to assign keys to the message content and create the target_keys used by notifier.
<match filterrewrited.**>
  type parser
  remove_prefix filterrewrited
  add_prefix winproc
  format /^(?<Name>[^ ]* +\d+) +(?<Cpu>\d+) +(?<Thd>\d+) +(?<Hnd>\d+) +(?<Priv>\d+) +(?<CpuTime>.+) +(?<ElapsTime>.+)$/
  key_name message
  suppress_parse_error_log true
</match>
#<match winproc.**>
#  type stdout
#</match>

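# notifier's default alert output omits things like the process name,
# so move them into the tag with rewrite_tag_filter. Here: process name and PID.
# Also, I ran out of time looking for a way to use the process name as a key when parsing message,
# so as a stopgap the process name goes at the head of the tag. The later notifier match blocks
# can then match on the process name, which allows per-process thresholds.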
<match winproc.**>
  type rewrite_tag_filter
  rewriterule1  Name  ^([^ ]*) +(\d+)$  $1.$2.${tag}
  remove_tag_prefix winproc
</match>

# Set notifier thresholds per process name with match rules.
# Ideally it would be cleaner to write target_keys in <def> as something like ProcessName_Cpu...
<match firefox.**>
  type notifier
  <def>
    pattern     firefox
    check       numeric_upward
    warn_threshold 5
    crit_threshold 10
    target_keys Cpu
  </def>
</match>
<match Idle.**>
  type notifier
  <def>
    pattern     Idle
    check       numeric_downward
    warn_threshold 50
    crit_threshold 20
    target_keys Cpu
  </def>
</match>

# In the end, send mail notifications, post to IRC, or whatever you prefer.
<match notification.**>
  type stdout
#  type     mail
#  host     localhost
#  port     25
#  from     FROM
#  to       TO
#  subject  fluentd notification
#  out_keys pattern,target_tag,target_key,level,value,message_time
</match>
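
The parser and the second rewrite_tag_filter above can be tried outside fluentd. The Ruby sketch below applies the same format regexp and rewriterule1 pattern to one line; the message and tag values are hypothetical inputs made up for illustration, not a real capture.

```ruby
# The format regexp from the parser block above.
FORMAT = /^(?<Name>[^ ]* +\d+) +(?<Cpu>\d+) +(?<Thd>\d+) +(?<Hnd>\d+) +(?<Priv>\d+) +(?<CpuTime>.+) +(?<ElapsTime>.+)$/
# The pattern from rewriterule1 in the winproc block.
NAME_RULE = /^([^ ]*) +(\d+)$/

# Hypothetical inputs: one pslist-style line and the tag left after
# the winproc prefix has been removed.
message = 'firefox 1234 7 45 600 120000 0:01:23.456 1:02:03.004'
tag = 'desktop-PC.winlog.user.notice'

# Step 1: parser turns the message into a keyed record.
m = FORMAT.match(message)
record = m.names.zip(m.captures).to_h

# Step 2: rewrite_tag_filter moves process name and PID into the tag
# ($1.$2.${tag} in the config).
name, pid = NAME_RULE.match(record['Name']).captures
new_tag = "#{name}.#{pid}.#{tag}"

puts record
puts new_tag
```

Note that every captured value is a string at this point, so Cpu only becomes a number when a later step interprets it numerically.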

Kicking off a test on the Windows side

D:\work\winfluent\PSTools
pslist.exe -s 100 >> ..\srclog\log.txt

Sample output

2014-05-04 20:26:54 +0900 notification: {"pattern":"Idle","target_tag":"Idle.0.desktop-PC.winlog.user.notice","target_key":"Cpu","check_type":"numeric_upward","level":"crit","threshold":10.0,"value":91.0,"message_time":"2014-05-04 03:15:51 +0900"}

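Things I'd like to improve

Ideally I'd specify the process name in notifier's target_keys and set thresholds that way, but then the process name inside the received message has to become a key.
I skimmed the plugin list but couldn't easily find one that fit, so I used rewrite_tag_filter to move the process name and host name into the tag and shape the notification output that way.
In production, a separate part that copies what type syslog receives to a file is also needed, to preserve the original logs.
On the Windows side, scripts for running pslist periodically and rotating the log are needed.
I'd also like min/max/avg per hour.
Memo: fluent-plugin-forest and fluent-plugin-datacalculator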

2014-04-20

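Notes from the infrastructure session of LINE Developer Conference

It's a bit late, but I was able to attend the infrastructure session of "LINE Developer Conference", so here are my notes.
　http://line-hr.jp/archives/37147547.html
The talks were very useful. Many thanks to LINE.
As the links below show, many others have already written this up neatly, so I'm just listing what I jotted down for myself.
　http://masasuzu.hatenablog.jp/entry/2014/04/16/LINE_Developer_conference
　http://dev.classmethod.jp/server-side/line-dev-conf-infra/
　http://blog.mogmet.com/line-developer-conference-infra-introduce-network-case/
　http://blog.mogmet.com/line-developer-conference-infra-db-high-availability/

Thoughts

L3DSR and stateless SLB were very useful as input for thinking about service infrastructure going forward.
DDL operations on databases in a running service still seem to take a lot of effort.
That said, MySQL can now do online DDL, so I suspect that how to fit this into Immutable Infrastructure will be worked out and something will take shape within the next year or two.
　Reference: http://dev.classmethod.jp/cloud/aws/jawsdays2014-08/
　　(Judging from this too, how to handle DBs still seems to be an open question?)
　

Items marked with ※ are my personal impressions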

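Introduction to ITSC

　Scope
　　Data centers
　　Network
　　Servers
　　DB
　　In security, the whole stack including applications
　How they engage with services
　　Kickoff at planning/design
　　Infrastructure design and capacity design during development/testing
　　A dedicated team structure for release and operation
　A service infrastructure manager acts as the point of contact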

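System operations

　Development　-　System operations (this team)　-　Partner companies (ISP/CDN, etc.)
　Looking back over two years
　　Users, services, and servers keep growing, but the team has not grown much
　　　Because automated installation is well established (they call it Plug and Install)
　　　Problems
　　　　Connection timeouts inside the DC (application timeouts)
　　　　　Packet drops at L2 switches　　※the usual queue overflow
　　　　　　It would be nice if the application could handle it, but...
　　　　　　　For a messaging service this is hard to control on the application side (broadcast-type traffic)
　　　　　　As a result, replaced the switches with ones that have larger buffers
　　　　　　Made the network independent and isolated the servers
　　　　DC cooling
　　　　　Could no longer fit more servers in
　　　　　　Scale up
　　　　　　Consolidate onto VMs
　　　　　HBase has the largest number of servers
　　　　　　Empirically, performance hits a wall at around 1000 servers
　　　　　　Frequent hardware failures make operations costly
　　　　　　　Aiming to reduce the server count by scaling up
　　　　　　　　IOPS is the priority
　　　　　　　　　About 1500 IOPS with disks
　　　　　　　　　Tens of thousands to the 100,000 class with PCI-SSD
　　　　　　　Once the disks got faster, CPU became the bottleneck
　　　　　Virtualization hypervisor
　　　　　　Running on VMware
　　　　　　Chosen for operability and similar reasons
　　　　　　Consolidation of about 10 VMs per server　※probably around 48G and 24 cores?
　　　　Operational problems
　　　　　Traffic roughly doubles when there is an earthquake or similar, but they managed to handle it
　　　　　Load concentrated on part of Redis over the New Year holidays
　　　　　　NIC interrupts and Redis processing landed on the same CPU core, causing delays
　　　　　　Stopgap: spread the CPUs used by the Redis processes with taskset
　　　　　　Excluded the NICs from irqbalance and set NIC affinity manually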

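LINE DB systems

　Mostly about LINE games
　Various DBMSs, but MySQL is the most common at about 73%, followed by SQLServer at about 17%. Almost nothing else.
　About 1000 DB servers in operation
　On capacity estimation
　　It was painful when concurrent connections deviated from the forecast
　Automatic failover (Multi-Master Replication Manager)
　MySQL+MMM configuration　MySQL is a customized Percona-based build
　　Assign a VIP for writes and a VIP for reads
　　VIPs are moved by heartbeat
　　MMM directs the failover
　Adding shards without downtime
　　An in-house process called MGMT provides shard server information　　※something like an in-house version of fabric
　　What happens when the number of shards increases
　　　A migration map is created
　　　To move the data for its assigned shard out of the existing DBs, the DBs connect to each other and pull the data
　　　　Data is pulled with SQL-based select/insert
　　　When the copy finishes, MGMT announces the new shard DB
　　MGMT is a manager/agent system: agents are deployed on the application servers, and the application queries the agent to get the sharding map
　　The manager distributes files by PUSH and the agent loads them into memory

　・How are transactions handled while a shard is being added?
　　　Rows are selected starting from the oldest data; a timestamp column exists for identifying them
　　　Several rounds of select advance the pre-synchronization, then the shard is added in one go
　・How is zero downtime achieved when a DDL change is involved?
　　　The more common pattern is to receive the DDL in advance and apply it in a maintenance window
　　　Not everything is necessarily done without downtime

　MMM customizations
　　・One monitor server handles multiple targets
　　・Calls an external script on failover
　　・Disabled failover triggered by replication lag
　　・Added an operation on a dummy table to the SQL used for failure detection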

Answers to the pre-event questionnaire FAQ

　Scale of the LINE system
　　Servers in the tens of thousands, DBs in the thousands

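Network infrastructure efforts

　Issues
　　Internal network traffic
　　Data center space
　　
　　Network
　　Unknown Unicast Flooding　→　MAC address table FLUSH　→　unstable communication
　　　This was happening at the 1G class
　　Previously L2 per POD
　　　L2 is per SS
　　　Separated front and back networks, back at 160Gbps
　　　Made the LBs L3DSR (Tunnel method (a tunneling channel is set up between server and balancer) / DSCP method (the one adopted; control using the DSCP header))
　　
　　Data center
　　　High efficiency, high density, effective 8kVA or more
　　　51U racks　24kVA, 100 outlets of 100V power, custom-made
　　　Switches mounted at the rear; the remaining issue is cooling
　　　　Custom-made dedicated enclosures for the switches so server heat does not affect them

　　Sites built for overseas use
　　　Overseas, quality tends to degrade at carrier interconnection points
　　　Have direct paths to local ISPs wherever possible
　　　Place servers close to users
　　　Users connect to the nearest site
　　　The backbone is a ring topology (US 1, Asia 3, EU 1)　a compromise between cost and redundancy
　　　The logical network builds pseudo wires over MPLS

　　　Goals for the logical network
　　　　Terminate Internet traffic close to users
　　　　　Implemented on the client side (determined statically from the user information they have)
　　　　　GSLB is also under consideration
　　　　Inter-site communication
　　　　　Two separate networks built on MPLS, one for services and one for server-to-server communication
　　　
　　　As a messaging service
　　　　Real-time behavior is the priority
　　　　TCP sessions are kept open, so the session count rarely drops
　　　　The LB session table overflows
　　　　　How to manage sessions?
　　　　　　Add LB devices and upgrade specs
　　　　　　A configuration where the LB does not manage sessions
　　　　　　　stateless SLB　balances using a hash table
　　　　　　　Behavior when backend servers are added or removed differs by vendor
　　　　　　　　Recompute the whole hash table
　　　　　　　　　Even load is guaranteed, but the target server changes, so sessions are effectively cut once
　　　　　　　　Partial recomputation
　　　　　　　　　Even load is not preserved, but existing sessions are kept; LINE adopted this one
　　　　Summary of stateless SLB
　　　　　Scales with the number of sessions
　　　　　Need to understand the behavior on failure
　　　　　Problems
　　　　　　Uneven load
　　　　　　Monitoring
　　　　　　Vendors do not disclose the LB specifications

　　　・Things they pay attention to in cabling
　　　　　Power cables are made in-house so there is no slack
　　　　　Rack design that routes network cables through dedicated paths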
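
To picture the two recomputation behaviors noted under stateless SLB above, here is a toy Ruby sketch of my own. Real load balancers are vendor-specific and undisclosed, as noted above, so the bucket count, server names, and reassignment rule are all assumptions made for illustration. It counts how many hash buckets owned by surviving servers change owner when one server is removed.

```ruby
BUCKETS = 12 # toy hash table size (assumption)

# Initial table: bucket i is served by servers[i % servers.size].
def build_table(servers)
  Array.new(BUCKETS) { |i| servers[i % servers.size] }
end

# Full recomputation: rebuild every bucket from the surviving servers.
def remove_full(servers, dead)
  build_table(servers - [dead])
end

# Partial recomputation: only buckets that pointed at the dead server
# are handed out to the survivors; all other buckets stay put.
def remove_partial(table, servers, dead)
  alive = servers - [dead]
  n = 0
  table.map do |owner|
    next owner unless owner == dead
    replacement = alive[n % alive.size]
    n += 1
    replacement
  end
end

# Buckets whose owner was healthy but changed anyway, i.e. sessions
# that would be disrupted although their server never failed.
def disrupted(before, after, dead)
  before.zip(after).count { |b, a| b != dead && b != a }
end

servers = %w[s1 s2 s3 s4]
table = build_table(servers)
full = remove_full(servers, 's2')
partial = remove_partial(table, servers, 's2')

puts "full recompute:    #{disrupted(table, full, 's2')} healthy buckets moved"
puts "partial recompute: #{disrupted(table, partial, 's2')} healthy buckets moved"
```

In this toy model the partial approach never moves a bucket whose server is still alive, which matches the "existing sessions are kept" note. The cost is that load balance depends on how the orphaned buckets are handed out.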

That's all.

2014-04-15

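Branching serverspec tests on a command result

A small tip.
For Oracle tests I plan to run lots of SQL through sqlplus and check the results with should match, but the instance has to be up before any SQL can run.
Running sqlplus while the instance is down simply produces an error, but firing it needlessly feels wrong, so I'd like to check that the instance is running at the start of the test and branch on that.
So this is a memo of trying backend.run_command to capture a command result and branch on it.

References

http://qiita.com/doima_/items/e5ad8baa83642d07005a
※I had no idea about backend.run_command, so this was very helpful. Thanks.

What I have working for now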

describe command('su - grid -c "srvctl config database"') do
  its(:stdout) { should match /#{property[:oraclesid]}/ }
end

ps_smon = backend.run_command("ps -ef | grep ora_smon | grep -v grep | awk '{ print $NF }'")
if ps_smon.stdout == "ora_smon_#{property[:oraclesid]}\n" then
  describe command('su - oracle -c "sqlplus -s / as sysdba <<EOF
set lin 1000
set pages 0
set trims on
set tab off
select NAME||\',\'||VALUE line from v\\\\\\$parameter2 order by num ;
exit
EOF
"') do
    its(:stdout) { should match /processes,1000/ }
  end
end

A few things that tripped me up

When comparing the result of backend.run_command as a string, watch out for the trailing newline.
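
A minimal Ruby illustration of the newline pitfall, using a plain string as a stand-in for ps_smon.stdout (the SID "orcl" is hypothetical):

```ruby
# Stand-in for ps_smon.stdout: command output ends with a newline.
stdout = "ora_smon_orcl\n"
sid = 'orcl' # hypothetical SID

naive = (stdout == "ora_smon_#{sid}")          # forgets the newline
with_newline = (stdout == "ora_smon_#{sid}\n") # what the test above does
chomped = (stdout.chomp == "ora_smon_#{sid}")  # alternative: strip it first

p naive, with_newline, chomped
```

Either including "\n" in the expected string or calling chomp on the output works; the point is to be explicit about which one is used.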

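ps_smon[:stdout] also works in place of ps_smon.stdout, but it triggers the "CommandResult#[] is obsolete..." warning defined in command_result.rb.

I'm undecided on whether to move the SQL into a file, but with a here-document on the shell side, this escaping, which covers both the ruby layer and the shell layer, worked for now: "v\\\\\\$parameter2"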
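
The layers of escaping can be traced in Ruby. The sketch below reflects my reading of why six backslashes are needed: Ruby's single-quoted literal halves them, then two shell layers (the double-quoted su -c argument and the unquoted here-document) each strip one more level. The helper shell_unescape is a hypothetical, simplified model of shell backslash handling, not real shell parsing.

```ruby
# Layer 1: Ruby single-quoted literal. Each "\\" is one backslash,
# so six backslashes in the source leave three in the string.
ruby_literal = 'v\\\\\\$parameter2'

# Simplified model (assumption) of how a shell treats backslashes inside
# double quotes or an unquoted here-document: "\\" -> "\", "\$" -> "$".
def shell_unescape(str)
  str.gsub(/\\([\\$])/, '\1')
end

after_outer_shell = shell_unescape(ruby_literal)     # su -c "..." layer
after_heredoc = shell_unescape(after_outer_shell)    # <<EOF layer

puts ruby_literal
puts after_outer_shell
puts after_heredoc
```

Under this model sqlplus finally receives the view name with a bare dollar sign, which is why fewer backslashes would let one of the shells expand $parameter2 as a variable.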

2014-04-11

Solaris10 / OpenSSL 1.0.0l build

I needed to build on Solaris for the first time in a while, so here's a memo.

References
http://openssl.6102.n7.nabble.com/Runpath-definition-missing-for-libssl-so-td1375.html
https://groups.google.com/forum/#!topic/mailing.openssl.dev/F8tosbpFilE

export PATH=/usr/sfw/bin:/usr/sfw/sbin:/usr/xpg4/bin:/usr/ccs/bin:/usr/sbin:/usr/bin
export LD_OPTIONS='-L/usr/local/openssl-1.0.0l/lib -R/usr/local/openssl-1.0.0l/lib'
./Configure --prefix=/usr/local/openssl-1.0.0l no-asm shared solaris-x86-gcc
gmake
dump -Lv libssl.so |grep PATH
unset LD_OPTIONS
vi test/testssl
gmake test
gmake install

2014-03-23

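Getting started with serverspec

※Addendum 2014/4/15
Multi-line comparison with a here-document (a pseudo diff) should become available once the following pull request by tagomoris is merged.
https://github.com/serverspec/serverspec/pull/387

I have several hosts that share the same configuration but were built at slightly different times, so I needed to settle on a way to check their settings. I decided to use serverspec, which I had been meaning to try for a long time.
For now, here is a memo of the Rakefile/spec_helper.rb that form the core of running the tests.
Once it's built, you really appreciate it. Many thanks to mizzy for a wonderful tool.
The OS tests are mostly done, so next I plan to write the Oracle configuration tests.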
　

References

http://serverspec.org/
http://rubydoc.info/gems/rspec-core/RSpec/Core/RakeTask#fail_on_error-instance_method
　

Versions tested

CentOS release 6.5
ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-linux]　※I should move to 2.x...
specinfra (0.8.0)
serverspec (0.15.5)
　

Rake setup

I wanted to externalize variable parts such as IP addresses, so I based this on "How to use host specific properties" at http://serverspec.org/advanced_tips.html.
　

properties.yml

This is a trial test, so the contents are arbitrary.

vora:
  :roles:
    - ntp
    - account
    - network
  :mngip: 192.168.xxx.xxx

Rakefile

Almost identical to the sample, but I added "t.fail_on_error = false" so the run proceeds to the next host even when rspec reports failed.

require 'rake'
require 'rspec/core/rake_task'
require 'yaml'

properties = YAML.load_file('properties.yml')

desc "Run serverspec to all hosts"
task :spec => 'serverspec:all'

namespace :serverspec do
  task :all => properties.keys.map {|key| 'serverspec:' + key.split('.')[0] }
  properties.keys.each do |key|
    desc "Run serverspec to #{key}"
    RSpec::Core::RakeTask.new(key.split('.')[0].to_sym) do |t|
      ENV['TARGET_HOST'] = key
      t.fail_on_error = false
      t.pattern = 'spec/{' + properties[key][:roles].join(',') + '}/*_spec.rb'
    end
  end
end
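
To see what the Rakefile above derives from properties.yml without running rake, the Ruby sketch below reproduces just the task-name and pattern construction. The hash is a hand-written stand-in for YAML.load_file('properties.yml'), and the second host "web01.example.com" is invented to show the effect of key.split('.')[0].

```ruby
# Hypothetical stand-in for YAML.load_file('properties.yml').
# "web01.example.com" is an invented host, not from the original file.
properties = {
  'vora' => { roles: %w[ntp account network] },
  'web01.example.com' => { roles: %w[ntp] }
}

tasks = properties.map do |key, props|
  {
    # Task name uses only the part before the first dot.
    task: 'serverspec:' + key.split('.')[0],
    # TARGET_HOST keeps the full key.
    target_host: key,
    # One glob that covers every role directory of this host.
    pattern: 'spec/{' + props[:roles].join(',') + '}/*_spec.rb'
  }
end

tasks.each { |t| puts t }
```

So a host entry with roles ntp, account, and network runs every *_spec.rb under spec/ntp, spec/account, and spec/network.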

spec_helper.rb

This is also almost identical to the sample, but because I log in over ssh with password entry, I added the ASK_LOGIN_PASSWORD handling.

require 'serverspec'
require 'pathname'
require 'net/ssh'
require 'highline/import'
require 'yaml'

include Serverspec::Helper::Ssh
include Serverspec::Helper::DetectOS
include Serverspec::Helper::Properties

properties = YAML.load_file('properties.yml')
if ENV['ASK_LOGIN_PASSWORD']
  inputpassword = ask("\nEnter login password: ") { |q| q.echo = false }
else
  inputpassword = ENV['LOGIN_PASSWORD']
end

RSpec.configure do |c|
  c.host  = ENV['TARGET_HOST']
  set_property properties[c.host]
  options = Net::SSH::Config.for(c.host)
  options[:password] = inputpassword
  user    = options[:user] || Etc.getlogin
  c.ssh   = Net::SSH.start(c.host, user, options)
  c.os    = backend.check_os
end

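Summary of pitfalls and other things I did as a first-timer

・For comparing a file's entire contents, match_~checksum seems to be the baseline, but to show the file contents in the report I'd like something that diffs them. Maybe writing it all out in its(:content) is the way?
・With multiple tests, how do I keep going when one ends up failed?
　→Handled with RSpec's "fail_on_error = false". It took me a long time to get there orz
・For the Service resource type, I wanted be_disabled to test services that were intentionally disabled
　→it { should_not be_enabled } does the job.
　　However, depending on how be_enabled is implemented for a given OS, this may be inappropriate, in which case implementing be_disabled may be better. For redhat/solaris it was quick to implement.
・It was my first time with Rake/RSpec, so I spent a lot of time looking things up. Still, my understanding of Rake/RSpec improved a little... though there's plenty I don't know. Maybe I just need to build a Rails app once?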

2014-02-01

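Memo: taking a Solaris10 zfs over to Solaris11 and exposing it via smb

This is a memo from 2014/2/1 of trying something odd: taking a zfs pool built on Solaris10 over to Solaris11, exposing it as an smb share, and pulling the data out. Some of our file servers are built on Solaris10/ZFS, so it more or less doubles as a migration test.
I'm recording the date because something puzzling happened when I ran zfs upgrade. It may be a problem that gets resolved over time, so I don't want to forget when this was.

References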
　http://docs.oracle.com/cd/E26924_01/html/E25824/gayne.html
　http://edo.blog.jp/archives/1709846.html
　https://blogs.oracle.com/paulie/entry/cifs_sharing_on_solaris_11

●Create the zfs on Solaris10

mkfile 1g disk1
mkfile 1g disk2
mkfile 1g disk3
zpool create tpool /mnt/disk1 /mnt/disk2  /mnt/disk3
zfs create tpool/data
cd /tpool/data
(create some files)

●Solaris11 (without upgrade)
Follow the cifs_sharing_on_solaris_11 article to finish installing smb and so on
Then copy disk1,2,3 to /var/tmp and...

zpool import -d /var/tmp/disk1  -d /var/tmp/disk2  -d /var/tmp/disk3 -f tpool
zfs set share=name=sol1,path=/tpool/data,prot=smb tpool/data
zfs set sharesmb=on tpool/data
share
cat /etc/dfs/sharetab

●Solaris11 (with upgrade)

zpool import -d /var/tmp/disk1  -d /var/tmp/disk2  -d /var/tmp/disk3 -f tpool
zpool upgrade tpool
===
This system is currently running ZFS pool version 34.

Successfully upgraded 'tpool' from version 15 to version 34
===

zfs upgrade tpool
===
zfs upgrade tpool/data1 filesystems upgraded
===

zfs upgrade tpool/data
===
1 filesystems upgraded
===

zfs set share=name=sol1,path=/tpool/data,prot=smb tpool/data (looks like it worked, but actually did not)
===
name=sol1,path=/tpool/data,prot=smb
===

zfs set sharesmb=on tpool/data
share　(the sol1 entry did not show up...)
===
share
IPC$            smb     -       Remote IPC
c$      /var/smb/cvol   smb     -       Default Share
===

cat /etc/dfs/sharetab
share -F smb /tpool/data sol1 (so this compensates for the zfs set share~ step)
share
cat /etc/dfs/sharetab

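※Notes

As covered in the cifs_sharing_on_solaris_11 article and elsewhere, the smb package has to be installed separately
From Solaris11 U1 on, the entry goes in /etc/pam.d/other. Writing it in pam.conf gives a storm of "Permission denied"
After zpool upgrade/zfs upgrade, zfs set share=name=xx,path=/yy/yy,prot=smb yy/yy stops working for some reason. Cause unknown. I wonder if there's somewhere to send a bug report.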