Today a colleague from StarterSquad and I spent some time puzzling out the deployment of a group of microservices on Vagrant. The environment was Apache + mod_wsgi + Flask. In real life, these services each run more or less happily on their own AWS Elastic Beanstalk instance. To simplify local development, we wanted to provision them all on a single Vagrant instance instead. (We use an Ansible-based tool called Prudentia to orchestrate deployment on different platforms with relative ease, but as a first step, the priority was just to make it all run there with a single command.)
The problem we eventually hit was that some services seemed to ignore environment variables set with SetEnv in the Apache configuration. That was strange, because on AWS the environment variables were set separately for every Elastic Beanstalk instance and seemed to be picked up. (These variables were used with Flask's from_envvars to override the defaults.)
Then came a learning moment for me (I had not worked much with Apache / mod_wsgi / Python before): Python applications started under mod_wsgi are supposed to read their environment variables from the per-request environ dictionary. One can copy them into the OS environment with code like this:
import os

def application(environ, start_response):
    # Copy string variables from the WSGI environ into the process environment
    # before the app module is imported (environ also holds non-string objects).
    for key, value in environ.items():
        if isinstance(value, str):
            os.environ[key] = value
    from my_module import app
    return app(environ, start_response)
However, not all of our client's existing microservices were using that technique, and yet they seemed to have the right environment variables anyway.
As it turns out, on AWS Elastic Beanstalk the configuration variables are set in the system environment, not as Apache environment variables! It would be good if the Elastic Beanstalk documentation stated that difference more clearly.
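To make the difference concrete, here is a minimal sketch of the pattern that only works when the variables live in the process environment (the names SERVICE_DB_URL and load_config are hypothetical, not from our actual services):

```python
import os

# Beanstalk-style configuration: the variable is in the *process* environment,
# so a plain os.environ read at import time sees it.
os.environ.setdefault("SERVICE_DB_URL", "sqlite:///local.db")  # local-dev default

def load_config():
    # Apache's SetEnv, by contrast, only lands in the per-request WSGI environ
    # dict, so a lookup like this never sees it unless something copies it
    # into os.environ first.
    return {"db_url": os.environ["SERVICE_DB_URL"]}

print(load_config())
```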
A Naive Coder
SW development / programming related snippets.
Tuesday, October 6, 2015
Tuesday, July 15, 2014
puzzling difference between zsh and bash
I accidentally found a little difference between zsh and bash which seems a bit annoying.
Suppose there is a simple script.
for file in `hadoop classpath | tr ':' ' ' | sort | uniq`; do echo $file; done
and the original output of `hadoop classpath` looks like this:
zsh %> hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
If I run the above in bash, the results look like this (a list of all the jars contained in the classpath, one per line):
bash-4.1$ for file in `hadoop classpath | tr ':' ' ' | sort | uniq`; do echo $file; done
/usr/local/hadoop/etc/hadoop
/usr/local/hadoop/share/hadoop/common/lib/activation-1.1.jar
/usr/local/hadoop/share/hadoop/common/lib/asm-3.2.jar
/usr/local/hadoop/share/hadoop/common/lib/avro-1.7.4.jar
/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar
/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar
/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar
/usr/local/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar
...
In zsh, however, I still get:
zsh %> for file in `hadoop classpath | tr ':' ' ' | sort | uniq`; do echo $file; done
/usr/local/hadoop/etc/hadoop
/usr/local/hadoop/share/hadoop/common/lib/*
/usr/local/hadoop/share/hadoop/common/*
/usr/local/hadoop/share/hadoop/hdfs
/usr/local/hadoop/share/hadoop/hdfs/lib/*
/usr/local/hadoop/share/hadoop/hdfs/*
/usr/local/hadoop/share/hadoop/yarn/lib/*
/usr/local/hadoop/share/hadoop/yarn/*
/usr/local/hadoop/share/hadoop/mapreduce/lib/*
/usr/local/hadoop/share/hadoop/mapreduce/*
/usr/local/hadoop/contrib/capacity-scheduler/*.jar
That was quite unexpected. Now I am wondering how to make it work in zsh.
PS I was curious enough to ask about this on SO.
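For what it's worth, the difference appears to be that bash glob-expands patterns coming out of an unquoted command substitution, while zsh leaves them alone by default. A minimal bash-runnable sketch (the zsh-side fixes mentioned in the comments are my assumption, not from the original post):

```shell
# bash expands the pattern produced by the backtick substitution into real
# file names; zsh would print the literal '*.jar' pattern instead, unless
# GLOB_SUBST is set (`setopt glob_subst`) or the word is marked with the
# zsh-specific ${~...} flag.
mkdir -p /tmp/globdemo
touch /tmp/globdemo/a.jar /tmp/globdemo/b.jar
for f in `echo '/tmp/globdemo/*.jar'`; do echo "$f"; done
```

In bash this prints the two jar paths; in zsh it prints the pattern itself.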
Saturday, July 12, 2014
How to mock a non-declared checked exception with Mockito
A little useful mocking hack which cost me quite some time to find:
To throw a checked exception from a mocked Scala method (Scala methods usually don't declare checked exceptions, which disappoints frameworks like Mockito), use an answer:
mockedObject.doSomething(any[Whatever]).answers{ _ => throw new CheckedException("ouch!") }
where CheckedException is not a subclass of RuntimeException. Not because a RuntimeException cannot be thrown that way (it can), but because there is a simpler method for those:
mockedObject.doSomething(any[Whatever]) throws new RuntimeException("ouch!")
So if you need to throw e.g. SQLException, method 1 will work, whereas method 2 will compile but complain during the test that this type of exception does not match the method's signature. For something like ArithmeticException, both methods work.
Tuesday, May 6, 2014
Sorting remote Git branches in reverse commit order
A small but useful Git + Perl command:
for k in `git branch -r | perl -pe 's/^..(.*?)( ->.*)?$/\1/'`; do echo -e `git show --pretty=format:"%Cgreen%ci %Cblue%cr%Creset" $k -- | head -n 1`\\t$k; done | sort -r
It will sort all remote Git branches in reverse order by commit date. The result looks like this:
2014-05-06 10:18:29 +0200 33 minutes ago origin/master
2014-05-06 10:18:29 +0200 33 minutes ago origin/HEAD
2014-05-05 19:10:50 +0200 16 hours ago origin/the-hot-branch
...
2014-05-01 18:33:08 +0200 5 days ago origin/less-recent-branch
...
2014-01-27 15:44:27 +0100 3 months ago origin/totally-forgotten branch
Found on SO, which in turn quoted it from another source.
One useful modification: also show the author of the last commit (%aN):
for k in `git branch -r | perl -pe 's/^..(.*?)( ->.*)?$/\1/'`; do echo -e `git show --pretty=format:"%Cgreen%ci %Cblue%cr%Creset %aN" $k -- | head -n 1`\\t$k; done | sort -r
And don't forget to call first:
git fetch -p
to prune away the zombie branch references.
Then you can easily grep for someone's stale branches :)
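As an aside (my addition, not from the original post): newer Git versions can do this sorting natively with for-each-ref, no Perl required. A self-contained demo in a throwaway repository; in a real clone you would run only the last command, after `git fetch -p`:

```shell
# Set up a throwaway repo with one commit so the listing has something to show.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git config user.email you@example.com && git config user.name You
git commit -q --allow-empty -m 'first'

# Newest-first listing; swap refs/heads for refs/remotes to see remote branches.
git for-each-ref --sort=-committerdate refs/heads \
    --format='%(committerdate:iso8601) %(committerdate:relative) %(authorname) %(refname:short)'
```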
Thursday, November 14, 2013
Abusing Scala streams
I tried a simple experiment with the Sieve of Eratosthenes, as defined in Odersky's Scala course, to find out where the limits of streams (which are, in fact, recursive functions) lie. I expected it to run out of stack at some point; instead it produced a funny error I had never seen before: the garbage collector gave up :)
Welcome to Scala version 2.10.2 (OpenJDK 64-Bit Server VM, Java 1.7.0_25).
scala> def sieve(s: Stream[Int]): Stream[Int] =
|
| s.head #:: sieve(s.tail filter (_ % s.head != 0))
sieve: (s: Stream[Int])Stream[Int]
scala> def from(n: Int): Stream[Int] = n #:: from(n+1)
from: (n: Int)Stream[Int]
scala> val primes = sieve(from(2))
primes: Stream[Int] = Stream(2, ?)
scala> primes.take(10).toList
res0: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
scala> primes.take(100).toList
res1: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541)
scala> primes.take(1000).toList
res2: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859, 863, 877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947...
scala> primes.take(10000).toList
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Throwable.getStackTraceElement(Native Method)
at java.lang.Throwable.getOurStackTrace(Throwable.java:826)
at java.lang.Throwable.printStackTrace(Throwable.java:655)
at java.lang.Throwable.printStackTrace(Throwable.java:720)
at scala.tools.nsc.util.package$$anonfun$stackTraceString$1.apply(package.scala:84)
at scala.tools.nsc.util.package$$anonfun$stackTraceString$1.apply(package.scala:84)
at scala.tools.nsc.util.package$.stringFromWriter(package.scala:73)
at scala.tools.nsc.util.package$.stackTraceString(package.scala:84)
at scala.tools.nsc.Global.throwableAsString(Global.scala:299)
at scala.tools.nsc.interpreter.ILoop$$anonfun$1.applyOrElse(ILoop.scala:533)
at scala.tools.nsc.interpreter.ILoop$$anonfun$1.applyOrElse(ILoop.scala:531)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply$mcZL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:25)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:573)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:867)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:822)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:822)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:822)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
I guess the reason is that the stream handed to the next sieve step has to be re-filtered from scratch at every step.
The Fibonacci stream (as defined in the Scaladocs for Stream) fared somewhat better; it "simply" ran out of memory:
scala> import scala.math.BigInt
import scala.math.BigInt
scala> val fibs:Stream[BigInt] = BigInt(0) #:: BigInt(1) #:: fibs.zip(
| fibs.tail).map(n=> {
| n._1 + n._2})
fibs: Stream[scala.math.BigInt] = Stream(0, ?)
scala> fibs.take(10000)
res0: scala.collection.immutable.Stream[scala.math.BigInt] = Stream(0, ?)
scala> fibs.take(10000).toList
res1: List[scala.math.BigInt] = List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049, 12586269025, 20365011074, 32951280099, 53316291173, 86267571272, 139583862445, 225851433717, 365435296162, 591286729879, 956722026041, 1548008755920, 2504730781961, 4052739537881, 6557470319842, 10610209857723, 17167680177565, 27777890035288, 44945570212853, 72723460248141, 117669030460994, 190392490709135, 308061521170129, 498454011879264, 806515533049393, 1304969544928657, 2111485077978050,...
scala> fibs.take(100000).toList
java.lang.OutOfMemoryError: Java heap space
at java.math.BigInteger.add(BigInteger.java:1075)
at java.math.BigInteger.add(BigInteger.java:1048)
at scala.math.BigInt.$plus(BigInt.scala:203)
at $anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(<console>:10)
at $anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(<console>:9)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream$$anonfun$take$2.apply(Stream.scala:731)
at scala.collection.immutable.Stream$$anonfun$take$2.apply(Stream.scala:731)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.foreach(Stream.scala:548)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:629)
at scala.collection.AbstractTraversable.to(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractTraversable.toList(Traversable.scala:105)
...
OK, let's look at the stream of the natural numbers. It survives longer, but once we reach a certain limit, things still crash horribly.
scala> def from(n:Int): Stream[Int] = n #:: from(n+1)
from: (n: Int)Stream[Int]
scala> val nats = from(0)
nats: Stream[Int] = Stream(0, ?)
scala> nats.take(10000)
res1: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> nats.take(10000).toList
res2: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176,...
scala> nats.take(100000).toList
res3: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176,...
scala> nats.take(1000000).toList
res4: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176,...
scala> nats.take(10000000).toList
java.lang.OutOfMemoryError: Java heap space
at scala.collection.immutable.Stream.take(Stream.scala:731)
at scala.collection.immutable.Stream$$anonfun$take$2.apply(Stream.scala:731)
at scala.collection.immutable.Stream$$anonfun$take$2.apply(Stream.scala:731)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.foreach(Stream.scala:548)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:629)
at scala.collection.AbstractTraversable.to(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractTraversable.toList(Traversable.scala:105)
...
scala> nats.slice(10000000,10000010).toList
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nor does it work if we pretend that only the necessary part is evaluated (the underlying list has to materialize all the preceding elements anyway), again failing at the garbage-collector level:
scala> nats.take(10000000).slice(9999990,10000000).toList
error:
while compiling: <console>
during phase: erasure
library version: version 2.10.2
compiler version: version 2.10.2
reconstructed args: -nowarn
last tree to typer: This(object $eval)
symbol: object $eval in package $line15 (flags: <module>)
symbol definition: class $eval extends Object
tpe: type
symbol owners: object $eval -> package $line15
context owners: package <root>
== Enclosing template or block ==
Import( // val <import>: ImportType(scala)
"scala" // final package scala, tree.tpe=scala.type
List(
ImportSelector(_,-1,null,-1)
)
)
== Expanded type of tree ==
ThisType(object $eval)
uncaught exception during compilation: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:745)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:790)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:702)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:566)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:573)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:867)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:822)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:822)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:822)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
By the way, at that point the REPL's compiler is already unusable (and it doesn't realize that it was the previous instruction, not this one, that sent it to ashes). Apparently a failing garbage collector is almost the last line of defense:
scala> from(10000000).take(10).toList
error:
while compiling: <console>
during phase: global=terminal, atPhase=jvm
library version: version 2.10.2
compiler version: version 2.10.2
reconstructed args:
last tree to typer: Apply(method toList)
symbol: method toList in trait TraversableOnce (flags: <method> <triedcooking>)
symbol definition: def toList(): List
tpe: List
symbol owners: method toList -> trait TraversableOnce -> package collection
context owners: object iw -> package $line16
== Enclosing template or block ==
Template( // val <local $iw>: <notype>, tree.tpe=type
"java.lang.Object" // parents
ValDef(
private
"_"
<tpt>
<empty>
)
DefDef( // def <init>(): type
<method>
"<init>"
[]
List(Nil)
<tpt> // tree.tpe=type
Block( // tree.tpe=Unit
Apply( // def <init>(): Object in class Object, tree.tpe=Object
$read$super."<init>" // def <init>(): Object in class Object, tree.tpe=()Object
Nil
)
()
)
)
)
== Expanded type of tree ==
TypeRef(
TypeSymbol(
sealed abstract class List extends collection.AbstractSeq with collection.immutable.LinearSeq with Product with collection.generic.GenericTraversableTemplate with collection.LinearSeqOptimized
)
)
uncaught exception during compilation: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.
[y/n]
even though normally, this solution would just work:
scala> def from(n:Int): Stream[Int] = n #:: from(n+1)
from: (n: Int)Stream[Int]
scala> from(10000000).take(10).toList
res0: List[Int] = List(10000000, 10000001, 10000002, 10000003, 10000004, 10000005, 10000006, 10000007, 10000008, 10000009)
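As a side note (a Python analogy of mine, not from the original session): the reason a fresh `from(10000000).take(10)` is cheap is that nothing forces the earlier elements and nothing retains them. A Python generator behaves similarly, never memoizing what it has produced:

```python
from itertools import count, islice

# Analogue of `def from(n: Int): Stream[Int] = n #:: from(n + 1)`:
# count(n) lazily yields n, n+1, n+2, ... without storing past values,
# so starting at a large number costs nothing.
def from_(n):
    return count(n)

print(list(islice(from_(10_000_000), 10)))
```

This prints the same ten numbers the Scala snippet returns. The Scala crashes above came from holding the stream's head in a val, which keeps every evaluated element reachable.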
And while we are at it... a plain (non-stream) Range seems to do a better job on large sequences of numbers:
scala> (1 to 10000000)
res1: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170...
scala> res1.slice(9999990,10000000)
res2: scala.collection.immutable.IndexedSeq[Int] = Vector(9999991, 9999992, 9999993, 9999994, 9999995, 9999996, 9999997, 9999998, 9999999, 10000000)
We can go until we hit the "real" limit:
scala> (1 to 1000000000)
res6: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170...
scala> res6.slice(999999000,1000000000)
res7: scala.collection.immutable.IndexedSeq[Int] = Vector(999999001, 999999002, 999999003, 999999004, 999999005, 999999006, 999999007, 999999008, 999999009, 999999010, 999999011, 999999012, 999999013, 999999014, 999999015, 999999016, 999999017, 999999018, 999999019, 999999020, 999999021, 999999022, 999999023, 999999024, 999999025, 999999026, 999999027, 999999028, 999999029, 999999030, 999999031, 999999032, 999999033, 999999034, 999999035, 999999036, 999999037, 999999038, 999999039, 999999040, 999999041, 999999042, 999999043, 999999044, 999999045, 999999046, 999999047, 999999048, 999999049, 999999050, 999999051, 999999052, 999999053, 999999054, 999999055, 999999056, 999999057, 999999058, 999999059, 999999060, 999999061, 999999062, 999999063, 999999064, 999999065, 999999066, 999999067, 99...
scala> (1 to 10000000000)
<console>:1: error: integer number too large
(1 to 10000000000)
^
scala> (1L to 10000000000L)
java.lang.IllegalArgumentException: 1 to 10000000000 by 1: seqs cannot contain more than Int.MaxValue elements.
at scala.collection.immutable.NumericRange$.count(NumericRange.scala:249)
at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:53)
at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:52)
at scala.collection.immutable.NumericRange.length(NumericRange.scala:55)
at scala.collection.immutable.NumericRange.toString(NumericRange.scala:208)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:321)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:333)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:341)
...
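If you genuinely need to walk a sequence longer than Int.MaxValue elements, one way around the limit (a sketch of my own, not from the session above) is an Iterator, which never materializes or counts its elements, so the "seqs cannot contain more than Int.MaxValue elements" check never fires:

```scala
object BeyondMaxValue {
  def main(args: Array[String]): Unit = {
    // Iterator.iterate produces elements on demand; nothing is stored,
    // so the Int.MaxValue length limit of seqs never comes into play.
    val huge = Iterator.iterate(1L)(_ + 1L)
    // Skip a few elements and take a handful to show the lazy traversal.
    println(huge.drop(5).take(3).toList) // List(6, 7, 8)
  }
}
```

The price, of course, is that an Iterator can only be walked once and gives no random access.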
The conclusion is the usual one: magic should only be applied where it's applicable :-}
Monday, August 26, 2013
Slick: 5 seconds of hate
A simple reminder of why I hate ORMs. Some daily experience from trying to make Slick do something more useful than the examples they generously provide in the very incomplete and hazy documentation (I just wanted to count the rows in some query):
method count in class ColumnExtensionMethods is deprecated: Use Query.count instead
After some code changes and more esoteric responses from the compiler, I get this:
Warning #2:
method count in class Query is deprecated: Use .length instead of .count
LOL
++ to add insult to injury, I can't print out the SQL statement this thing generates to get the row count, because counting is now a method of the query! (Of course, there are always the database logs...)
Argh.
Update: as a result of this rant, I had a quick twitter conversation with @StefanZeiger, who explained that myQuery.length.run will give the result I want, that Query(myQuery.length).selectStatement will produce the SQL, and that in Slick 2.0.0, coming out sometime in September, it will be possible to write myQuery.length.selectStatement directly. That is nice, except that Slick 2.0.0 won't be backwards compatible, which probably means that ongoing projects won't be able to benefit from the new library unless people are already on the 2.0.0 alpha (which we aren't).
Nevertheless, good to know that someone's working on the problem :-}
Saturday, August 10, 2013
Scala, lists, vectors and mergesort
Today I noticed the following example of mergesort, quoted in the online book "Scala by example" (9.3 Example: Merge sort, page 69):
def msort[A](less: (A, A) => Boolean)(xs: List[A]): List[A] = {
  def merge(xs1: List[A], xs2: List[A]): List[A] =
    if (xs1.isEmpty) xs2
    else if (xs2.isEmpty) xs1
    else if (less(xs1.head, xs2.head)) xs1.head :: merge(xs1.tail, xs2)
    else xs2.head :: merge(xs1, xs2.tail)
  val n = xs.length / 2
  if (n == 0) xs
  else merge(msort(less)(xs take n), msort(less)(xs drop n))
}
Further on, the authors say that the complexity of this algorithm is O(N log N), which makes it an attractive option for sorting lists.
Being in an inquisitive mood, I couldn't help wondering how big a list can, in practice, be sorted with such an algorithm. (My current machine is a 64-bit Ubuntu box with 6 GB RAM and 8 cores.)
So here we go.
scala> import scala.util.Random
scala> val randomNums = Seq.fill(1000)(Random.nextInt)
scala> msort((x:Int,y:Int)=>x<y)(randomNums.toList)
res27: List[Int] = List(-2145589793, -2143814602, -2143330861, -2142630038, -2136184047, -2135476301, -2132371275, -2129131922, -2123018872, -2120069375, -2118819114, -2117572993, -2112618055, -2102489605, -2096417279, -2095978655, -2095806565, -2087860343, -2087105558, -2085005596, -2083360810, -2077234330, -2065393243, -2058765966, -2056823240, -2053145149, -2047696716, -2044737011, -1847777706, -18259...
OK, for 1000 elements it works, for 10000 it works too. But let's push one more order of magnitude...
scala> val randomNums = Seq.fill(100000)(Random.nextInt).toList
randomNums: List[Int] = List(186865346, 638415825, -637620864, 220723809, -1536234831, 1710286185, 126091472, -1621728642, -1819749330, -294195052, -613979926, 1278841478, -111715804, -1953497441, 1891679544, 582175290, 1555531003, -430520072, 652471392, 1211722008, -112446234, -1900260621, 2058382521, 564201400, -1225275015, 2069052362, 797097978, 1077363576, 1469066877, -303059738, -166855116, -1876385701, 285630983, -1956550564, -1991336959, -1713232594, -868759609, -723403847, -282664963, 1965397484, 1563483549, -618177790, -297223307, -197365661, -703715983, 28207094, -1793590690, 374050582, 992041027, -1931269739, -932512120, -1551657371, 1523463808, -742246427, -1665172973, 50892779, -286029416, -1654054925, -874783455, -1825744857, -571856180, 289326103, -215127347, -1488600483,...
scala> msort((x:Int,y:Int)=>x<y)(randomNums)
java.lang.StackOverflowError
at .merge$1(<console>:20)
at .merge$1(<console>:22)
at .merge$1(<console>:20)
at .merge$1(<console>:20)
at .merge$1(<console>:20)
at .merge$1(<console>:20)
at .merge$1(<console>:22)
...
Oi wei, we ran out of stack!
The reasons are obvious: this line of code
xs2.head :: merge(xs1, xs2.tail)
means that while you are busy merging the partial lists, there are orphan elements hanging around in unfinished stack frames, each waiting for its turn to become the head. And the stack isn't made of rubber!
Well... there is some sport in it, of course: could we at least make it work? Using tail recursion, perhaps? (In case you have forgotten: tail recursion happens when a recursive function ends with a call to itself. The compiler can then be smart enough to emit a loop instead of a chain of recursive calls, at least if you ask it politely :) )
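As a quick aside, here is a minimal example of my own (not the merge from this post) where the recursive call really is the last thing the function does, so the annotation is accepted:

```scala
import scala.annotation.tailrec

object TailRecDemo {
  // The recursive call is in tail position: nothing is left to do after it,
  // so the compiler rewrites the recursion into a loop and the stack stays flat.
  @tailrec
  def sum(xs: List[Int], acc: Int = 0): Int = xs match {
    case Nil          => acc
    case head :: tail => sum(tail, acc + head)
  }

  def main(args: Array[String]): Unit = {
    // A depth that would blow the stack with naive recursion.
    println(sum(List.fill(1000000)(1))) // 1000000
  }
}
```

The merge below, alas, is not so lucky: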
import scala.annotation.tailrec
...
@tailrec
def merge(xs1: List[A], xs2: List[A]): List[A] =
  if (xs1.isEmpty) xs2
  else if (xs2.isEmpty) xs1
  else if (less(xs1.head, xs2.head)) xs1.head :: merge(xs1.tail, xs2)
  else xs2.head :: merge(xs1, xs2.tail)
However, this won't compile, and rightly so:
[error]
<filename>: could not optimize @tailrec annotated method merge: it contains a recursive call not in tail position
[error] else xs2.head :: merge(xs1, xs2.tail)
Ah right. Operator :: is right-associative, which means it runs only after merge(xs1, xs2.tail) has returned, so the recursive call is not in tail position and tail recursion isn't possible.
Well, without further ado, here is the variant I ended up with. It worked, in the sense that it didn't break - there was now a loop - but for a list with 100000 elements it took about 5 minutes to do the sorting!
import scala.math.Ordering.Implicits._
import scala.util.Random
import scala.annotation.tailrec

object Sorting {
  def less[T: Ordering](x: T, y: T) = x < y

  @tailrec
  def merge[A: Ordering](xs0: List[A], xs1: List[A], xs2: List[A]): List[A] =
    if (xs1.isEmpty) if (xs0.isEmpty) xs2 else xs0 ::: xs2
    else if (xs2.isEmpty) if (xs0.isEmpty) xs1 else xs0 ::: xs1
    else if (less(xs1.head, xs2.head)) merge(xs0 :+ xs1.head, xs1.tail, xs2)
    else merge(xs0 :+ xs2.head, xs1, xs2.tail)

  def msort[A: Ordering](xs: List[A]): List[A] = {
    val n = xs.length / 2
    if (n == 0) xs
    else merge(List[A](), msort(xs take n), msort(xs drop n))
  }

  def checkSorted[A: Ordering](x: A, xs: List[A]): Boolean = xs match {
    case Nil          => true
    case head :: tail => if (head < x) false else checkSorted(head, tail)
  }

  def test(size: Int, printAll: Boolean = true) = {
    val randomNums = Seq.fill(size)(Random.nextInt).toList
    val sorted = msort(randomNums)
    if (printAll) println(sorted)
    if (!checkSorted(sorted.head, sorted.tail)) println("Not sorted!!!")
  }
}
The reason this code is slow is that we have now added an accumulator and keep appending elements to its end. However, List (as also mentioned here) only guarantees quick access to the head - reaching (or appending to, with :+) the last element costs time proportional to the size of the list!
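A quick way to see that asymmetry (a rough sketch of my own, not a proper benchmark):

```scala
object AppendCost {
  // Rough timing helper; good enough to show an order-of-magnitude difference.
  def time[T](label: String)(body: => T): T = {
    val t0 = System.nanoTime
    val result = body
    println(s"$label: ${(System.nanoTime - t0) / 1000000} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val n = 20000
    // List :+ copies the whole list on every append: O(n) per step, O(n^2) in total.
    val l = time("List :+")((1 to n).foldLeft(List.empty[Int])(_ :+ _))
    // Vector :+ is effectively constant time per append: roughly O(n) in total.
    val v = time("Vector :+")((1 to n).foldLeft(Vector.empty[Int])(_ :+ _))
    assert(l == v.toList) // same result, very different cost
  }
}
```

The exact numbers depend on your machine, but the List version falls further and further behind as n grows.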
So is there anything to do to make it faster, without trying hard to imagine some other algorithm? My solution was to use Vectors instead of Lists.
import scala.math.Ordering.Implicits._
import scala.util.Random
import scala.annotation.tailrec

object Sorting {
  def less[T: Ordering](x: T, y: T) = x < y

  @tailrec
  def merge[A: Ordering](xs0: Vector[A], xs1: Vector[A], xs2: Vector[A]): Vector[A] =
    if (xs1.isEmpty) if (xs0.isEmpty) xs2 else xs0 ++ xs2
    else if (xs2.isEmpty) if (xs0.isEmpty) xs1 else xs0 ++ xs1
    else if (less(xs1.head, xs2.head)) merge(xs0 :+ xs1.head, xs1.tail, xs2)
    else merge(xs0 :+ xs2.head, xs1, xs2.tail)

  def msort[A: Ordering](xs: Vector[A]): Vector[A] = {
    val n = xs.length / 2
    if (n == 0) xs
    else merge(Vector[A](), msort(xs take n), msort(xs drop n))
  }

  def checkSorted[A: Ordering](x: A, xs: List[A]): Boolean = xs match {
    case Nil          => true
    case head :: tail => if (head < x) false else checkSorted(head, tail)
  }

  def test(size: Int, printAll: Boolean = true) = {
    val randomNums = Seq.fill(size)(Random.nextInt).toVector
    val sortedVec = msort(randomNums)
    if (printAll) println(sortedVec)
    val sorted = sortedVec.toList
    if (!checkSorted(sorted.head, sorted.tail)) println("Not sorted!!!")
  }
}
Vector provides effectively constant-time access to any element it contains, and, importantly for this algorithm, appending with :+ is effectively constant time too. It is probably slightly slower than accessing the head of a list, but for tasks like this one it seems the clearly better option. It took very little time to sort vectors of both 100000 and 1000000 random integers. With 10000000 it still takes quite some time and apparently more memory, but 10 million is already more than the number of rows in many decent tables :-)
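For comparison (an aside of my own, not part of the original experiment), the standard library's sorted, which as far as I know copies the elements into an array and delegates to Java's sort, handles the same sizes without any of this ceremony:

```scala
import scala.util.Random

object StdlibSort {
  def main(args: Array[String]): Unit = {
    val xs = Vector.fill(100000)(Random.nextInt())
    val sorted = xs.sorted
    // Sanity check: every element is <= its successor.
    assert(sorted.zip(sorted.tail).forall { case (a, b) => a <= b })
    println("sorted OK")
  }
}
```

Hand-rolling mergesort is a nice exercise, but for real work the built-in sort is hard to beat.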
I hope this was useful, or at least that you made it to the end. I still want to write a C++ implementation and see which one finishes first, so maybe there will be a part two...