Sunday, March 26, 2017

TSDB on PostgreSQL? But why?

This week I stumbled across two new time-series databases (TSDB) - which is good, but both were based on PostgreSQL, which is... confusing, to be honest.

The first is named Tgres. It was created by Grisha Trubetskoy and has already reached v0.10.0 after 18 months in development, which is quite impressive (no joke). It's written in Go, is partially Graphite-compatible, and even outperforms "ye olde Graphite" on a single EC2 i2.2xlarge instance (not go-carbon, just the normal Python Graphite, which is not quite fair IMO). Like Whisper, it's based on the idea of rollup archives, and it's also more efficient in storage - around 8 bytes per point (Whisper uses 12 bytes).
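To put the per-point figures in perspective, here is a rough back-of-the-envelope sketch. The series count, resolution and retention are made-up inputs for illustration, not numbers from Tgres or Whisper; only the 8 and 12 bytes per point come from the text, and headers, indexes and rollup archives are ignored.

```python
def storage_bytes(series, points_per_series, bytes_per_point):
    """Raw size of the data points only; ignores headers, indexes and rollups."""
    return series * points_per_series * bytes_per_point

# Hypothetical workload: 100,000 series, one point every 10 seconds, kept for 30 days.
SERIES = 100_000
POINTS = 30 * 24 * 60 * 60 // 10  # points per series over 30 days

whisper = storage_bytes(SERIES, POINTS, 12)  # 12 bytes per point
tgres = storage_bytes(SERIES, POINTS, 8)     # ~8 bytes per point

print(f"12 bytes/point: {whisper / 1e9:.1f} GB")
print(f" 8 bytes/point: {tgres / 1e9:.1f} GB")
print(f"saving: {1 - tgres / whisper:.0%}")
```

Whatever the workload, the saving on raw points is about a third, so it helps but does not change the scalability picture.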

I'm totally fine with people who want to develop something, and I'm not going to say that Grisha does not understand what he's doing - he's an experienced developer and Tgres looks very impressive. But to be honest, the whole rationale behind Tgres really puzzles me. You can read it at the link above (you can jump to the part named "Avoid Solving the Storage Problem", but the whole article is worth reading).
Grisha says: "Someone once said that “anything is possible when you don’t know what you’re talking about”, and nowhere is it more evident than in data storage. File systems and relational databases trace their origin back to the late 1960s and over half a century later I doubt that any field experts would say “the storage problem is solved”. And so it seems almost foolish to suppose that by throwing together a key-value store and a consensus algorithm or some such it is possible to come up with something better? Instead of re-inventing storage, why not focus on how to structure the data in a way that is compatible with a storage implementation that we know works and scales reliably?"
With all due respect, I think that's the wrong direction. Yes, filesystems and databases have been in development since the 1960s - and what result do we have? The storage problem is not solved, indeed, but saying "OK, screw it, let's build something on top of a weak foundation and hope that it'll be fine" is also wrong.
I think that the storage engine is the key part of any database: it both shapes and limits the DB - relational, columnar or time-series, it doesn't matter. Whisper is a good example. It has its weak points (e.g. no subsecond resolution, IO intensity, 12 bytes per point, only local storage) and its good points (quite good speed, built-in rollups). Most Graphite users know its limitations very well. On the one hand, these limitations restrict Graphite usage; on the other hand, they gave rise to the whole new generation of TSDBs and monitoring solutions that has been flourishing lately.
In the same way, Tgres inherits all the scalability flaws of PostgreSQL (and of any relational database), e.g. good vertical scalability but quite weak horizontal scalability. Yes, the author mentions clustering for Tgres, but it's the same approach we already saw in Whisper - the clustering is external, not built into the storage.

Another PostgreSQL-based database, named TimescaleDB, looks a bit better - it is still based on Postgres, although it uses its own storage engine with built-in clustering and sharding. You can check their paper, it's quite interesting. For now it looks like early InfluxDB, but the authors say that their approach is better because you can use the full power of real SQL across all your time series.
Let's see. TimescaleDB is quite young, less than 6 months in development; maybe we'll get something useful out of it. They have a good and stable foundation, so let's see how it will fit into the TSDB world.

I still strongly believe that in the database world the storage engine is king, and horizontal scalability is a must for any modern data software.




Monday, September 19, 2016

Semi-irregular Sysadmin Ninja's Github Digest (Vol. 21)



Hello, fellow readers!
Issue 21 of "Semi-irregular Sysadmin Ninja's Github Digest" is here. The last issue was very dry, so I will add more of my thoughts and funny pictures. :)
Let's go!


teeproxy
"A reverse HTTP proxy that duplicates requests."
"You may have production servers running, but you need to upgrade to a new system. You want to run A/B test on both old and new systems to confirm the new system can handle the production load, and want to see whether the new system can run in shadow mode continuously without any issue."
https://github.com/chrislusf/teeproxy


kicksat
"A tiny open source spacecraft project. http://kicksat.io"

WOW. Just W-O-W. Your eyes are not lying, it's an open-source spacecraft. "Our goal is to dramatically lower the cost of spaceflight, making it easy enough and affordable enough for anyone to explore space. We can do this by shrinking the size and mass of the spacecraft, allowing many to be launched together."
I hope the guys succeed and we'll see a KickSat launch soon!
https://github.com/kicksat


web2web
"P2P web powered by torrents and blockchain."
Rejoice, my paranoid brothers and sisters! The new Internet is here! Put your foil hats on!
It's a combination of webtorrent and blockchain to make an unseizable internet!
"When you open index.html in the browser (live demo), here's what happens:
Bitcoin address 1DhDyqB4xgDWjZzfbYGeutqdqBhSF7tGt4 is searched for the latest outgoing transaction containing OP_RETURN script. Inside the script there is a torrent infohash of webpage.html. webpage.html is downloaded from torrent via webtorrent and displayed."
https://github.com/elendirx/web2web


ironssh
"IronSSH - End-to-end secure file transfer"

"While sftp and scp use ssh to keep files secure while they are being transferred over the network, once those files hit the remote server, they are no longer protected. The ironsftp executable provides additional security. When you put a file on the server using ironsftp, the file is encrypted before it is uploaded, and it stays that way on the server. When you get a file from the server, it is downloaded then decrypted. So the file remains secure until it is at the place you want to use it - on your local machine."
https://github.com/ironcorelabs/ironssh


quinedb
"QuineDB is a quine that is also a key/value store.
If your database can't print its own source code, can you really trust it?"
A very interesting and funny project! It's a simple K/V store, written in bash 4, but it's also a quine!
"When you run it, the (possibly modified) source code of quinedb is printed to STDOUT, and the results of the specific command run are printed to STDERR."
https://github.com/gfredericks/quinedb


flask_jsondash
"Build javascript chart dashboards without any front-end code. Uses any json endpoint. JSON config only. Ready to go."
A long-time dream is fulfilled! Yes, dashboards w/o any front-end code.
https://github.com/christabor/flask_jsondash


logtrail
"LogTrail is a plugin for Kibana to view, analyze, search and tail log events from multiple hosts in realtime with devops friendly interface inspired by Papertrail."
Like "tail -f", but for ELK!

https://github.com/sivasamyk/logtrail


python-remote-companies
"A list of companies that allow remote work and use Python."
Yep, that simple, but maybe useful.
https://github.com/mariusavram91/python-remote-companies


Games on the Github 
"list of open source games and game-related projects that can be found on GitHub"
https://github.com/leereilly/games


cog
"Bringing the power of the command line to chat http://operable.io"
"Cog is an open chatops platform that gives you a secure, collaborative command line right in your chat window. It is designed to be secure, highly available, chat provider agnostic, and to be extensible using your favorite programming language."
https://github.com/operable/cog


borg
"BORG - A terminal based search engine for bash commands"

Search bash-related questions on Stack Overflow without leaving your terminal!
https://github.com/crufter/borg

git-stacktrace
"Easily figure out which git commit caused a given stacktrace https://pypi.python.org/pypi/git-stacktrace"
A slightly naive tool that helps you find out which commit is responsible for a specific stacktrace. Python and Java are supported.
https://github.com/pinterest/git-stacktrace


pyinfra
"⚡ Deploy stuff by diff-ing the state you want against the remote server"
An interesting deploy tool. Looks nice, but IMO it's better to use a real configuration management tool in this case, e.g. Salt or Ansible.
https://github.com/Fizzadar/pyinfra


And something more for ML fans:

deepregex
"Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge"
For now, it's a purely academic project - creating regexes using natural language and machine learning.
But beware - regexes are just the start; maybe someday computers will be able to write their own programs.
Why would they need people then? :)
https://github.com/nicholaslocascio/deep-regex

tensorflow_image_classifier
Machine Learning is simple! You can make your own TF image classifier in 5 minutes!
https://github.com/llSourcell/tensorflow_image_classifier

Saturday, September 3, 2016

Semi-irregular Sysadmin Ninja's Github Digest (Vol. 20)

Hello, fellow readers!
I'm back. Now back to the news! :)

parallel
Inspired by GNU Parallel, a command-line CPU load balancer written in Rust.
Same as GNU Parallel, but modern and fast.
https://github.com/mmstick/parallel

rclone
"rsync for cloud storage" - Google Drive, Amazon Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Cloudfiles, Google Cloud Storage, Yandex Files http://rclone.org
https://github.com/ncw/rclone

NASA's openmct
A web based mission control framework. https://nasa.github.io/openmct/
You will need it if your project is rocket science ;)

awesome-osx-command-line
Use your OS X terminal shell to do awesome things. A curated list of shell commands and tools specific to OS X.
https://github.com/herrbischoff/awesome-osx-command-line

workq
Job server in Go
https://github.com/iamduo/workq
Similar to Celery / Gearman but language agnostic and written in Go.

go-patterns
A curated list of Go patterns and idioms http://tmrts.com/go-patterns
https://github.com/tmrts/go-patterns
Worth checking for all Go programmers.

minio
Minio is an object storage server compatible with Amazon S3 and licensed under Apache 2.0 License https://minio.io
https://github.com/minio/minio
Defying Amazon's vendor lock-in. I haven't tried it, though.

Mozilla's http-observatory
HTTP Observatory https://observatory.mozilla.org/
https://github.com/mozilla/http-observatory
"The Mozilla HTTP Observatory is a set of tools to analyze your website and inform you if you are utilizing the many available methods to secure it."

Special bonus for machine learning lovers!

srez
Image super-resolution through deep learning
https://github.com/david-gpu/srez
"From left to right, the first column is the 16x16 input image, the second one is what you would get from a standard bicubic interpolation, the third is the output generated by the neural net, and on the right is the ground truth."
Looks like magic.


PADDLE
PArallel Distributed Deep LEarning http://www.paddlepaddle.org/
https://github.com/baidu/Paddle
"PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible and scalable deep learning platform, which is originally developed by Baidu scientists and engineers for the purpose of applying deep learning to many products at Baidu."


Saturday, August 27, 2016

Semi-irregular Sysadmin Ninja's Github Digest (Vol. 19)

I was slacking for a long time, I know. Sorry for that. I'll push two issues in a row now, this is the second one.

1. blessed-contrib
Build dashboards using ascii/ansi art and javascript
https://github.com/yaronn/blessed-contrib

2. dashiell
https://github.com/maclennann/dashiell
A websocket-y frontend to osquery and facter. http://dashiell.io

3. https://telekomlabs.github.io/
T-labs, the official GitHub page of Deutsche Telekom's R&D department.
Home of FirefoxOS and other projects - worth checking out.

4. jetpack
FreeBSD Jail/ZFS based implementation of the Application Container Specification
https://github.com/3ofcoins/jetpack
https://github.com/appc/spec

5. cachet
An open source status page system, written in PHP https://cachethq.io
https://github.com/cachethq/Cachet

6. nginx-resources
A collection of resources covering Nginx, Nginx + Lua, OpenResty and Tengine http://www.cambus.net
https://github.com/fcambus/nginx-resources

7. socketplane
SocketPlane - Multi-Host Container Networking
https://github.com/socketplane/socketplane

8. h2o
H2O - an optimized HTTP server with support for HTTP/1.x and HTTP/2
http://blog.kazuhooku.com/search/label/H2O+in%20English
https://github.com/h2o/h2o
http://www.slideshare.net/kazuho/h2o-20141103pptx

Semi-irregular Sysadmin Ninja's Github Digest (Vol. 18)

I was slacking for a long time, I know. Sorry for that. I'll push two issues in a row now, this is the first one. I'll try to make it more regular and include other sources too.

1. The Crystal Programming Language 
http://crystal-lang.org
https://github.com/manastech/crystal
A new programming language named Crystal. "We love Ruby’s efficiency for writing code. We love C’s efficiency for running code. We want the best of both worlds." Programs look like Ruby but compile to efficient native code, and it has compile-time error checking like Rust. Worth checking out if you're a PL freak like me. :)

2. chef-koans
An experimental, test-driven way to learn about Chef.
https://github.com/leftathome/chef-koans
"An experimental, test-driven way to learn about Chef. Takes some inspiration from Ruby Koans and from other things that are awesome and simple." Unfortunately, only lesson number 0 is ready now - but you're welcome to contribute, of course!
Also, if you haven't read Vim koans or Git koans - please try them, it's quite fun.

3. node-bell
Real-time anomalies detection for periodic time series.
http://eleme.io/blog/2014/metrics-monitor
https://github.com/eleme/node-bell

4. streem
prototype of stream based programming language
https://github.com/matz/streem
A prototype of a new PL from the author of Ruby - Yukihiro "matz" Matsumoto. It's at a very early stage of development.

5. sfs
Asynchronous Filesystem Replication
https://github.com/immobiliare/sfs

6. gitfs
Version controlled file system
http://presslabs.com/gitfs/
https://github.com/PressLabs/gitfs

7. awesome-courses
List of awesome university courses for learning Computer Science!
https://github.com/prakhar1989/awesome-courses

8. mochi
Dynamically typed functional programming language
https://github.com/i2y/mochi

9. consul-do
Do something based on leadership status
https://github.com/zeroXten/consul-do

10. openbay
The Pirate Bay source code.
https://github.com/isohuntto/openbay

Saturday, February 21, 2015

Introducing collectd-iostat-python


Collectd-iostat-python is an iostat plugin for collectd that allows you to graph Linux iostat metrics in Graphite or other output formats that are supported by collectd.
This plugin (and mostly this README) is a rewrite of kieran's collectd-iostat in Python, using collectd-python instead of Ruby and collectd-exec.

Why ?

Disk performance is crucial for most modern server applications, especially databases. E.g. MySQL - check out these slides from the Percona Live conference.
Although collectd provides disk statistics out of the box, graphing the metrics as shown by iostat was found to be more useful and illustrative, because iostat reports usage of block devices, partitions, multipath devices and LVM volumes.
Also, this plugin was rewritten in Python because it's the preferred language for siteops tools at my current job, and collectd-python was chosen over collectd-exec for performance and stability reasons.

How ?

Collectd-iostat-python works by calling iostat at predefined intervals and pushing the data to collectd using the collectd-python plugin.
Collectd can then be configured to write the collected data in any of the output formats supported by its write plugins, such as graphite, which was the primary use case for this plugin.

Setup

Deploy the collectd python plugin into a suitable plugin directory for your collectd instance.
Configure collectd's python plugin to execute the iostat plugin using a stanza similar to the following:

Once functioning, the iostat data should then be visible via your various output plugins.
In the case of Graphite, collectd should be writing data to graphite in the hostname_domain_tld.collectd_iostat_python.DEVICE.column-name style namespaces. Symbols like '/', '-' and '%' in metric names (but not in device names) are automatically replaced by underscores (i.e. '_').
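A minimal sketch of that naming rule, assuming a plain character substitution. The function names and the sample device and column are illustrative and not taken from the plugin's source.

```python
import re

def sanitize_metric_name(name):
    """Replace '/', '-' and '%' with underscores (hypothetical helper)."""
    return re.sub(r"[/\-%]", "_", name)

def graphite_path(host, device, column):
    """Build a host.plugin.DEVICE.column style path; only the column is sanitized."""
    return ".".join([host, "collectd_iostat_python", device, sanitize_metric_name(column)])

# Sample values only: a device-mapper device and an iostat column name.
print(graphite_path("hostname_domain_tld", "dm-0", "rkB/s"))
```

The device name is passed through untouched, which matches the "but not in device names" remark.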
Please note that the plugin takes only the last line of iostat output, so large Count values make no sense, but Count needs to be more than 1 to get actual rather than historical data. And please make sure that Interval * Count << Collectd.INTERVAL (20 seconds by default). I found that e.g. Count=2 and Interval=2 works quite well for me.
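The timing constraint can be checked before deploying a configuration. The sketch below is only an illustration: the helper name and the "at most half of the collectd interval" threshold are my assumptions for what "much smaller" means, and 20 seconds is the collectd interval quoted in the text.

```python
def check_iostat_timing(interval, count, collectd_interval=20, max_fraction=0.5):
    """Return a list of problems with an Interval/Count pair (empty list means OK).

    max_fraction is an assumed reading of '<<': one iostat run may take at most
    this share of the collectd interval.
    """
    problems = []
    if count < 2:
        problems.append("Count must be more than 1 to get actual, not historical, data")
    if interval * count > collectd_interval * max_fraction:
        problems.append("Interval * Count is too close to the collectd interval")
    return problems

print(check_iostat_timing(interval=2, count=2))   # the pair recommended in the text
print(check_iostat_timing(interval=10, count=1))  # a single report only
```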

Technical notes

For parsing iostat output I'm using jakamkon's python-iostat Python module, but as an internal part of the script instead of a separate module, because of a couple of fixes: using kilobytes instead of blocks, adding -N to iostat for LVM endpoint resolving, migrating to the subprocess module as a replacement for the deprecated popen3, objectification, etc.
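As a rough illustration of those fixes, this is how iostat could be called through the subprocess module. It is a sketch, not the plugin's actual code: the function names, the /usr/bin/iostat path and the -x flag (extended statistics) are my assumptions, while -N and the kilobyte units come from the text.

```python
import subprocess

def build_iostat_command(interval, count, path="/usr/bin/iostat"):
    """Build the iostat argument list.

    -N resolves device-mapper (LVM) names, -k reports kilobytes instead of
    blocks, -x (an assumption here) adds extended statistics.
    """
    return [path, "-N", "-k", "-x", str(interval), str(count)]

def run_iostat(interval=2, count=2, path="/usr/bin/iostat"):
    """Run iostat and return its raw text output (needs sysstat installed)."""
    result = subprocess.run(
        build_iostat_command(interval, count, path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout

# Only print the command here; run_iostat() blocks for interval * count seconds.
print(build_iostat_command(2, 2))
```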

TODO

Maybe some data aggregation is needed, e.g. we could use max / avg aggregation of the data across intervals instead of picking the last line of iostat output.
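One possible shape for such an aggregation, as a sketch only. It assumes each iostat report for a device has already been parsed into a dict of column name to float; the function name and the sample values are made up.

```python
def aggregate_reports(reports, how="avg"):
    """Combine several per-interval iostat reports for one device.

    reports: list of dicts mapping column name -> float, one dict per interval.
    how: "avg" or "max".
    """
    if not reports:
        return {}
    columns = reports[0].keys()
    if how == "max":
        return {c: max(r[c] for r in reports) for c in columns}
    if how == "avg":
        return {c: sum(r[c] for r in reports) / len(reports) for c in columns}
    raise ValueError("unknown aggregation: %s" % how)

# Made-up sample values for two intervals of one device.
samples = [{"r/s": 10.0, "w/s": 4.0}, {"r/s": 30.0, "w/s": 2.0}]
print(aggregate_reports(samples, "avg"))
print(aggregate_reports(samples, "max"))
```

In practice the first report would probably have to be dropped before aggregating, since it carries the historical data rather than the current interval.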

Tuesday, December 9, 2014

Semi-irregular Sysadmin Ninja's Github Digest (Vol. 17)

1. reptyr
Reparent a running program to a new terminal
https://github.com/nelhage/reptyr

Quite an old tool made by @nelhage; it seems he is actively developing it again. It really does change the terminal of a process. "'reptyr PID' will grab the process with id PID and attach it to your current terminal. After attaching, the process will take input from and write output to the new terminal, including ^C and ^Z."
It is also quite interesting to know how it works - check this blog post if you are curious.

2. dockerana
Docker + Graphite + Grafana = Dockerana
https://github.com/dockerana/dockerana
It's exactly what it looks like - Graphite + Grafana packed in a Docker container. Quite convenient.

3. seagull
Friendly Web UI to monitor docker daemon http://96.126.127.93:10086
https://github.com/tobegit3hub/seagull
Seagull is Docker's best friend: it provides a Web UI to monitor the docker daemon. The demo site was down at first, but the screenshots look nice, and it seems that the demo is working now.

4. Algorithms
Data Structures and Algorithms in Python
https://github.com/prakhar1989/Algorithms
Not very exciting stuff, but it might be useful. Just as it says, it is a collection of data structures and algorithms in Python.

5. pg_shard
PostgreSQL extension to scale out real-time reads and writes http://citusdata.com/docs/pg-shard
https://github.com/citusdata/pg_shard
Sharding helper extension for PostgreSQL. Nuff said, check docs.

6. peru
Maybe sometimes better than copy-paste.
https://github.com/buildinspace/peru
Ah, a nice tool. Another approach to the eternal problem of dependencies in your repos. Like "git submodules" but easier. Works with Mercurial and SVN too, not only with git.

7. awesome-public-datasets
An awesome list of (large-scale) public datasets on the Internet. (On-going collection)
https://github.com/caesar0301/awesome-public-datasets
A list of many public (but sometimes not free) datasets on the Internet, for your fun and big data projects.

8. rocket
App Container runtime
https://github.com/coreos/rocket
CoreOS is creating its own container runtime instead of using Docker. Quite a controversial decision; check their blog for an explanation.

9. instavpn
the most user-friendly L2TP/IPsec VPN server
https://github.com/sockeye44/instavpn
A very user-friendly, simple but secure VPN. Requirements: Ubuntu, 512 MB RAM. Install with curl -sS https://sockeye.cc/instavpn.sh | sudo bash, then browse to http://IP-ADDRESS:8080 or use the CLI to set it up.

10. shapeme
Evolve images using simulated annealing
https://github.com/antirez/shapeme
A small toy from @antirez - it takes a PNG and tries to evolve a bunch of triangles to copy it. Just for fun.