My main blog
Labels
Monday, March 30, 2015
Fwd: vague question { xr - production troubleshoot
Fwd: java threading questions {xr
Fwd: MS IV 20141211
Fwd: Bloomberg interview questions 2/13/2015 {XR
Base on input 2 strings, return a regex String that can cover 2 input strings.
e.g "Bloomberg", "BloomingDale" should output "Bloom.*g.*"
I proposed a solution:
step 1: find out common segments of the 2 strings
step 2: insert ".*" in between the common segments to form a string
He asked: how to find the common segments?
also he let me think of below examples, but I really couldn't find any
clue from his example.
"Bloomberg", "BloomingDale"
"StarBucksCoffee", "RedCoffee"
In the end, he said my proposed solution is not technically optimized.
Obviously, he has better solution, but he didn't give me the hint.
So I still have no idea at all.
Can you think of a solution?
Tuesday, March 17, 2015
home network crash
Friday, March 13, 2015
BM with drift is more visual than a GBM without drift
For a BM with drift, the position at a future time T has a familiar normal distribution. The next increment (+ve or –ve), too, has a normal distribution.
The GBM has a asymmetric bell shape. Harder to visualize.
matlab | index of first positive value
See the stoch HW8 on Poisson process, where our array has timestamps of each consecutive jump, and we need to find how many jumps by time=2 minutes.
Friday, February 27, 2015
probability measure is an underlying assumption
Any BM must be described under a specific probability measure, since the normal distribution assumes a probability measure...
Any expectation must be specified under a specific probability measure.
Countably Infinite collection
In probability, stoch etc we often talk about a distribution with a countable but infinite number of possible outcomes.
countable but finite – most intuitive and simple
uncountable – if the outcome is a real number..
countable but infinite – the set of all integers, or all even numbers...
Sunday, February 15, 2015
BM hitting 3 before hitting -5
This is actually identical to the BM with upper and lower boundaries. The BM walker stops when it hits either boundary. We know it eventually stops. At that stopping time, the walker is either at 3 or -5 but which is more likely?
Ultimately, we rely on the optional stopping theorem – At the stopping time, the martingale's value is a random variable and its expectation is equal to the initial value.
common martingales in beginner stoch calculus textbook
Simple Random Walk
sum of a fair number gen, generating 3 possible numbers each time with a predictable distro
BM
GBM – used to model many asset prices
Monday, February 2, 2015
Sunday, January 25, 2015
E[Z^n] where Z~N(0,1)
Thursday, January 22, 2015
Fwd: python ctor call syntax
return riskgenerator.BasicDealValuationGenerator(envDetails)
Luckily I know BasicDealValuationGenerator is a class in the riskgenerator module, so I know this is likely be calling the ctor.
Saturday, November 15, 2014
copula - 2 contexts
http://www.stat.ubc.ca/lib/FCKuserfiles/file/huacopula.pdf is the best so far. But I feel all the texts seem to skip some essential clarification. We often have some knowledge about the marginal distributions of 2 rvars. We often have calibrated models for each. But how do we model the dependency? If we have either a copula or a joint CDF, then we can derive the other. I there are 2 distinct contexts -- A) known CDF -> copula, or B) propose copula -> CDF
--Context A: known joint CDF
I feel this is not a practical context but an academic context, but students need to build this theoretical foundation.
Given 2 marginal distro F1 and F2 and the joint distro (let's call it F(u1,u2) ) between them, we can directly produce the true copula. Denoted CF(u1, u2) on P72, True copula := the copula to reproduce the joint CDF. This true copula C contains all information on the dependence structure between U1 and U2.
http://www.stat.ncsu.edu/people/bloomfield/courses/st810j/slides/copula.pdf P9 points that if the joint CDF is known (lucky!) then we can easily find the "true" copula that's specific to that input distro.
In contrast to Context B, the true copula for a given joint distro is constructed using the input distros.
-- Context A2:
Assume the joint distribution between 2 random variables X1 and X2 is, hmm ..... stable, then there exists a definite, concrete albeit formless CDF function H(x1, x2). If the marginal CDFs are continuous, then the true copula is unique by Sklar's theorem.
--Context B: unknown joint CDF -- "model the copula i.e. dependency, and thereby the CDF between 2 observable rvars"
This is the more common situation in practice. Given 2 marginal distro F1 and F2 without the joint distro and without the dependency structure, we can propose several candidate copula distributions. Each candidate copula would produce a joint CDF. I think often we have some calibrated parametric formula for the marginal distros, but we don't know the joint distro, so we "guess" the dependency using these candidate copulas.
* A Clayton copula (a type of Archimedean copula) is one of those proposed copulas. The generic Clayton copula can apply to a lot of "input distros"
* the independence copula
* the comonotonicity copula
* the countermonotonicity copula
* Gaussian copula
In contrast to Context A, these "generic" copulas are defined without reference to the input distros. All of these copulas are agnostic of the input random variables or input distributions. They apply to a lot of different input distros. I don't think they match the "true" copula though. Each proposed copula describes a unique dependency structure.
Perhaps this is similar -- we have calibrated models of the SPX smile curve at short tenor and long tenor. What's the term structure of vol? We propose various models of the term structure, and we examine their quality. We improve on the proposed models but we can never say "Look this is the true term structure". I would say there may not exist a stable term structure.
--
A copula is a joint distro, a CDF of 2 (or more) random variables. Not a density function. As such, C(u1, u2) := Pr(U1<u1, U2<u2). It looks (and is) a function, often parameterized.
CompletionService + others (Credit Suisse IV
safe publication
fail fast vs fail safe
timers (scheduled task queue, scheduled thread pool)
3 ways to create threads
how to shut down a multithreaded app
handle all exceptions in a thread pool?
Sunday, November 2, 2014
independent events A_and_B ^ A_or_B
But how about Pr(A or B)? I find it unintuitive. Let's use discrete to illustrate. A:= coin is head. B:=dice < 6.
Use the universal rule that Pr(A or B)=Pr(A)+Pr(B)-Pr(A and B) ... Venn diagram
Saturday, October 25, 2014
top 5 expertise by popularity - not "my expertise"
Tuesday, October 21, 2014
buy/sell USDJPY means borrow USD, near-date to far-date
label – intiuitive
Get an intuitive feel – a FX swap of "buy/sell CC1CC2" is basically borrowing CC1 pledging CC2 as collateral.
Very often, CC1=USD. A lot of market participants need to borrow USD, presumably to buy oil, gold, US stocks and US gov debt.
Monday, October 13, 2014
Fwd: Markit phone interview, full time position, 10/13/14
Date: Tue, Oct 14, 2014 at 4:02 AM
Subject: Markit phone interview, full time position, 10/13/14
Friday, October 10, 2014
Fwd: Barcap onsite interview, 10/09/14
Date: Fri, Oct 10, 2014 at 7:56 AM
Subject: Barcap onsite interview, 10/09/14
Wednesday, October 8, 2014
Fwd: 10/08/14 phone interview with BNP Paribas, GWT UI developer position, Jersey City
From: Hai Yi
Fwd: onsite BofA interview, continuance of the phone interview on 10/01/14
From: Hai Yi
Date: Wed, Oct 8, 2014 at 9:42 AM
Subject: onsite BofA interview, continuance of the phone interview on 10/01/14
Friday, October 3, 2014
equity swap, according to (my interpretation of) Pravin
Today, however, you pay me a fixed x% of the notional.
So the contract always references some index.
Thursday, October 2, 2014
Fwd: 10/01/14, 5:30PM, BofA phone interview with Wilson
From: Hai Yi
Date: Thu, Oct 2, 2014 at 11:47 PM
Subject: 10/01/14, 5:30PM, BofA phone interview with Wilson
Java
1. Explain how HashMap works
2. Compare HashTable and ConcurrentHashMap. Explain how CHM handle concurrency.
3. There is an arraylist of 1,2,2,3, how do you output unique values?
how do you only output dup values?
4. Can you remove an element during iteration? How do you do it properly?
5. Can you update elements in an arrayList if its reference is defined
as "final". If so, what do you do to prevent it?
6. Name JSP internal objects. If in URL there is "id=12345", how to
retrieve that value?
7. Difference b/w StringBuffer and StringBuilder; and their difference
with String?
8. What design patterns can you name? What's singleton?
9. How do you implement Producer-consumer pattern?
10. Explain Executor framework and Threadpool
11. Explain volatile and transient
12. Explain Breath-first traversal and depth-first traversal
13. What's the new feature in JDK 8? What's the "default" keyword used
for in JDK 8 context?
Database
1. What's Union and Union all?
2. explain left join and inner join
3. There is a table "Salary" with two columns "id" and "salary", this
is oracle, write sql to list first 4. 100 record with salary from
highest to lowest
4. How do you dump the data from table A to table B, if table B
doesn't exist. Do it in one SQL statement.
5. Difference b/w function and store procedure
6. What's view? What's trigger?
7. How to call a store procedure from Java? The JDBC API.
8. What are the transaction isolation levels?
9. How do you get the current date using Oracle?
10. In PL/SQL, how do you throw an exception and how to catch it?
Unix
1. what's the command to find the files that were updated in the past 10 days?
2. How to find a file with a particular name, like "hello.java"?
3. What's the command to kill a running Java process and generate threaddump?
4. There is a file of 10 million lines, whats' the command to list
first 10 line?
5. What's the command to count the line number in a file, what parameter to use?
6. In shell script, how to split a string with dilimiter?
7. In shell script, how to check if a file exists or not?
8. what's the command to sort a file?
9. what's the command to list all running processes?
10. In VI, how to replace a repeated word in the whole text?
Wednesday, August 20, 2014
## which skill/field CAN be self-taught
Q: which fields can be truly self-taught to in-depth? For example, mobile apps, desktop apps.
Goal #1: pass job interviews
Goal #2: reach the professional level of Theoretical expertise
Goal #3: reach the professional level of Practical expertise
* Quant? too dry. Too many doubts to be clarified only by professor. However the basic math part is standard and well-defined. Goal 3 is invalid.
* high-frequency, algo? Many books but none on real tricks. Compare swing books! Very few jobs.
* network/linux optimization? need machine to try. Extreme optimization techniques are top secrets.
* FIX? Many practical issues hard to imagine. No focus.
---
* swing, wpf? -- tricks can be self-taught through experiment. Practical books are available. But practical problems are unknown.
* c# and core libraries? Real obstacle is the IDE. Practical books abound.
risk-mgmt? impossible to experiment.
coherence? too many subtopics. no focus. Hard to experiment.
tibco? no book at all. Hard to experiment
threading? Can hit Goal 2, not 3
Friday, July 4, 2014
corr (X,Y) = 1 means X and Y ~ linear
Sunday, June 22, 2014
java performance - will these help@@
Q: have a small nursery generation, so as to increase the frequency of GC collections?
Q: have maximum heap size exceeding RAM capacity.
Saturday, June 21, 2014
mlock() syscall - a few notes
See more http://virtualizationandstorage.wordpress.com/2013/11/19/algo-high-frequency-trading-design-and-optmization/
Wednesday, June 18, 2014
Libor + 50 bps -- never in IRS
Whenever we see something like "3M Libor + 50 bps", I feel it shouldn't be part of an IRS contract. IRS should use Libor + 0 on the floating side.
Such an interest rate could be the floating loan interest rate offered to a corporation by a bank.
bid/off quote in terms of interest rates
FW: grab a critical component (and make it unnatural for other developers:)
I now see more wisdom in such a "job protector".
I feel a critical component could be
- the release process like the one your colleague controls
- build process
- some scheduling tool like autosys. You can make it very complicated.
- some home-made diagnostic tool to troubleshoot a critical component.
- some wrapper component that everyone must go through to access messaging, or some critical library...
- some very important SQL query? Well, colleagues can copy it and figure out how it works. It's "more effective" if there are many such queries and these queries need a lot of tweaking each time. Then no one can become familiar with these queries and replace me!
Tuesday, June 17, 2014
a few benchmarks in finance (vwap, sharpe...
Investment performance benchmark - various indices
Investment performance benchmark - risk free rate
Investment performance benchmark - value benchmark and size benchmark. See the construction http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library/f-f_bench_factor.html
Execution benchmark - vwap. I feel this is the natural, logical benchmark. "Did I sell my 5000 shares at yesterday morning's average price?"
Execution benchmark (2nd most common) -- implementation shortfall (very similar to arrival price)
FX Vol - Butterfly or Strangle@@
Butterfly is a combination of ATM straddle and an OTM strangle, and is a more exact way of trading the smile of volatility.
The OTM strangle relates net premium, in volatility terms, over the ATM ( volatility) rate. The purchase (or sale) of an OTM Strangle still leaves the trader open to a change in the ATM rates, so it's possible for a change in the smile shape to be compensated by a change in the ATM rates. To be more exact, trader can lock in the difference between the two (ATM vs OTM volatilities) by trading the butterfly spread.
IV - 2 processes sharing a socket
http://stackoverflow.com/questions/1997622/can-i-open-a-socket-and-pass-it-to-another-process-in-linux
[[linux programmer's toolbox]]
Saturday, June 14, 2014
finance model -- various meanings, very briefly
A "model" means something different in buy-side than in derivative pricing including complex structured products.
On the buy-side, I feel a model is like a regression formula that Predicts a (single?) dependent variable using several explanatory variables. In simple words, such a model is an alpha model, which is related to a trading strategy.
Friday, June 13, 2014
S&P 500 index - various symbols
The S&P 500 is often quoted using the symbol "SPX" or "INX", and may be prefixed with a caret (^) or with a dollar sign ($).
"GSPC" is another symbol
Thursday, June 12, 2014
some skills are key to IV, some key to GTD
GTD skill - tools
IV - performance profiling, tuning
IV – algo, data structure, threading
IV - prog. language details
IV - low-level or in-depth knowledge
IV - design skills including OO and agile best practices
matlab | string vs cell of 1 string
You can check the exact type of given stringy thingy with the function class().
To convert a cell into a string, use theVariable{:}
eg: regexp() --> cell
Saturday, June 7, 2014
low-hanging? perishable? niche? (after 2nd slow job search)
Evaluation in terms of long term job security, demand, job search, job interview, salary level
^^ core c#
^ python? growing demand; low-hanging fruit; might open many golden gates at baml and jpm
^ FIX? low churn. Hard to self-learn
- socket? frequently quizzed, low churn
- wcf
v wpf? not my chosen direction at the moment. Too big a domain.
- linux/c++ optimization? too niche
^ option math, stoch? often asked in-depth, but few roles in SG
- fixed income math? not used on the buy-side
- risk mgmt math? stable demand
v quant strategy? vague, dubious domain, apparently zero role in SG
things to be read right-to-left (c++/linq ...
what-if - (binary) option valuations - develop intuition
C_0 -> S_0
P_0 -> 0.0
binary Call price ->
binary Put price ->
as t -> T
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as S -> K
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as S -> 0
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as S -> inf
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as R_disc -> inf
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as R_ disc -> 0
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as R_grow -> inf
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as R_grow -> 0
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as K -> inf
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as K -> 0
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as sigma -> inf
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
as sigma -> 0
C_0 ->
P_0 ->
binary Call price ->
binary Put price ->
matlab | find()
Focus on a vector, not a matrix.
Focus on find(some logical expression) rather than find(someVector)
http://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html says
Logical indexing is closely related to the find function. The expression A(A > 5) is equivalent to A(find(A > 5)). Therefore, better learn logical indexing first.
matlab | logical subscripting - learning notes
Note if L has 5 non-zero elements, then length(X(L)) == 5.
I think L must be an array of booleans, not doubles.
For a matrix, see http://www.mathworks.com/help/matlab/math/matrix-indexing.html#bq7egb6-1
But here's a real illustration in my code:
step = 1/200;
steps = 2/step;
reruns=500;
% generate increments
%rng(0,'twister'); % if we want repeatable
incr = randn(steps,reruns)*sqrt(step);
std(incr) % should all be around 0.07
hist(incr(:,1))
% random walker positions
p = cumsum(incr);
% select a subset of Columns, using filter on
% "200th ROW and 400th ROW" so
% row expression = wildcard; column expression = filter on Row.
% If we carelessly swap the expressions, matlab won't warn us!
qualified = p(:, (p(200,:)>0 & p(400,:)>0));
Tuesday, June 3, 2014
matlab | foreach loop on matrix
If your original matrix is a column vector, then you better transpose it before using foreach. For a given matrix, foreach takes one column at a time.
Sunday, June 1, 2014
matlab | assign to cell array
outputCell(tmp_newRow, 3:4) = num2cell(betaTukeyN)
% assign to individual cell, using braces, not parentheses
outputCell{end+1, 3} = betaTukeyN(1)
matlab | a few useful indexing techniques
extract all the odd elements
extract every 3rd element
Reverse the order of elements
--logical subscript
To replace all NaN elements with zero
Saturday, May 31, 2014
matlab | sscanf faster than str2double
trFiles = dir(fullfile(trFolder, 'trade*2013013*.csv'));
tr1D =read1csv(fullfile(trFolder, trFiles(1).name));
tic
for i=1:length(tr1D.textdata(:,4))
tt=tr1D.textdata(i,4);
dummy = sscanf(tt{:}, '%f');
end
toc
%%%%%%%%%%%
tic
str2double(tr1D.textdata(:,4));
toc
Friday, May 30, 2014
Tuesday, May 27, 2014
python exception handling, basics
scripting languages.
A complete stack trace is available.
Error handlers can clean up or defuse a problem.
python performance, brief notes
In Quartz, python performance is considered a real issue.
Toa Payoh library has a book on Python performance...
Compiling to C ... cython.
100-leveraged long Treasury position
merger arbitrage, basics
Wednesday, May 21, 2014
SLOOB locations of xap files - win7winxp
C:\Documents and Settings\A5XXXXXX\Local Settings\Application Data\Microsoft\Silverlight\OutOfBrowser - winxp
Tuesday, May 20, 2014
Monday, May 19, 2014
Re: HW6 Q2.3b regress without intercept@@
Sunday, May 18, 2014
how many job interviews Focused on financial analytics quesitons
OC
Platimus
Baml STIRT
Barclays (Neresh) vol fitter
-- less heavy focus
pimco bond position overnight risk
Citi Changi equity derivative
MS IRD trading desk team
Saturday, May 17, 2014
4th data source to a yield curve - year-end "turn"
for more details.
The year-end turn of the yield curve is defined as the sudden jump in yields during the change of the year. This usually happens at the end of the calendar year, reflecting increased market activity related to
year-end portfolio adjustments and hedging activity....When there is a year turn(s), two discount curves are
constructed: one for turn discount factors and one for the discount factors calculated from the input instruments after adjustments and the discount factor at any time is the multiplication of two.
Thursday, May 15, 2014
some IKM java questions
Base aa = new Sub(); //both Base and Sub defines STATIC method m1()
aa.m1();
(Sub)aa.m1(); // which one?
Q: Deep copy a java array?
- clone()?
- serialization?
Q: a base class method without access modifier is callable by subclass?
I think java default method access level is "package, not subclass". In contrast, c# (and c++) default is private -- http://msdn.microsoft.com/en-us/library/ms173121.aspx.
Q: if interface1 declares method2 returning Set, can an implementing class's method return SortedSet?
Monday, May 12, 2014
extreme long-short allocations by MV optimizer
This is stressed over and again in my MV optimization course...
Suppose we have only 2 securities with high correlation.
Often one of them (AA) has a slightly higher Sharpe ratio than the other. The optimizer would go long a few hundred percent (say 300%) on it, and short 200% on the other (BB). These allocation weights add up to 100%.
If we tweak the historical mean return a bit so AA's Sharpe ratio becomes slightly below BB, then the optimizer would recommend go deep short AA and deep long BB.
This is a common illustration of the over-sensitivity and instability of MV allocation algorithm. In each case, the optimization goal is maximum Sharpe ratio of the portfolio. Why Sharpe? MV
Sunday, May 11, 2014
normal variable to lognormal variable
Q: given a LogNormal variable H, how do I generate a normal variable?
A: take the log of that variable, i.e. log(H) ~ Normal()
Q: Now given a Normal variable Z ~ Normal(), how do I generate a lognormal variable?
A: exp(Z) ~ LN
Wednesday, May 7, 2014
risk premium - dynamic factor models
Tuesday, May 6, 2014
risk premium - static factor models
Sunday, May 4, 2014
simple techniques to reduce CPU cache miss
strace, ltrace, truss, oprofile, gprof - random notes
Saturday, May 3, 2014
vbscript can ...
-- are based on [[automating windows administration]] --
access registry
connect to exchange/outlook to send/receive mails
regex
user account operations
**query group membership
**query Active Directory
**CRUD
file operations
** size
** delete folder
** read the version of a (DLL, EXE) file
** recursively find all files meeting a (size, mtime, atime..) criteria
** write into a text file
matlab | swap x and y axist
view(-90,90)
set(gca,'ydir','reverse')
Thursday, May 1, 2014
/proc/{pid}/ useful content
./cmdline is a text file ...
./cmd is a symlink to the current working dir of the process
./environ is a text file showing the process's env vars
./fd/ hold symlinks to the file descriptors, including sockets
./maps is a text file showing user space memory of the process
./smaps is a text file showing detailed info on shared lib used by the process
./status is a more human-readable text file with many process details
Saturday, April 26, 2014
list slicing operation - adoption in various languages
String slicing i.e. substring is popular though.
Matlab coders like slicing!
SGD exchange rate management by MAS
SG government chose a different strategy – "managed float". Whenever SGD FX rate exceeds the policy band, MAS would buy/sell SGD in the open market, which cost tax payer's money.
Due to the impossible trinity (http://en.wikipedia.org/wiki/Impossible_trinity),
- SG government loses the freedom to set independent interest rate. SGD interest rate is forced to follow USD interest rate.
- China has capital control, Fixed exchange rate and Independent interest rate policy.
python - write on windows; deploy to linux
I now feel the platform differences are relatively rare.
http://programmers.stackexchange.com/questions/199863/developing-python-on-windows-and-deploying-to-linux says Cygwin is a useful environment for this purpose
Tuesday, April 22, 2014
oprofile - phrasebook
dtrace - comparable. I think these two are the most powerful profilers on solaris/linux.
statistical - results can be partially wrong. Example - call graph.
Per-process - profiling is possible. I think default is system-wide.
CPU - counters (hardware counters). Therefore, low impact on running apps, lower than "attachment" profilers.
userland - or kernel : both can be profiled
recompile - not required. Other profilers require recompiling.
kernel support - must be compiled in.
oprifiled - the daemon. Note there's no executable named exactly "oprofile".
[[Optimizing Linux performance]] has detailed usage examples of oprofile. [[linux programmer's toolbox]] has decent coverage too.
Sunday, April 20, 2014
shared memory - in low latency systems
Is shared-memory a popular messaging solution in low-latency trading?
I know some high-volume data processing engines (like Ab Initio) favor
shared memory as the fastest IPC solution.
However, I feel in low latency trading, messaging (like tibrv, 29west,
Solace) is more popular. For a trading engine, shared memory IPC can
be the basis of messaging between processes on the same machine, but
not across different machines.
Do your system use shared memory?
If interested, you can check out
- http://www.cisco.com/web/strategy/docs/finance/29W-INFA-Cisco-IPC-Performance-new.pdf
- http://www.informatica.com/sg/company/news-and-events-calendar/press-releases/05152013-ultramessaging.aspx
- http://solacesystems.com/blog/high-frequency-trading-to-warp-speed
Saturday, April 19, 2014
variance or stdev is additive - an illustration
Since we hit the noisegen once a year, over 5 years we get 5 random "numbers", all with the same M and sigma. Each number is the realized annual (log) return. The cumulative end-to-end return is the sum of the 5 independent random variables. This sum is a random variable with a variance, which is additive. Assumption is iid i.e. repeated noisegen hits.
-----
In a different scenario, suppose we hit the noisegen once only and multiply the same output number by 5 in a "projected 5Y return". Now std is additive.
In both cases, the mean of the 5Y end-to-end return is 5*M
sharing a huge memory chunk between 2 processes
I am not quite sure about the use cases[1], but let's say this huge file needs to be loaded into memory, by 2 processes. If readonly, then savings is possible by sharing the memory pages between the 2 processes. Basically, the 2 virtual address spaces map to the same physical pages.
Now suppose one of them -- Process-A -- needs to write to that memory. Copy-on-write takes place so Process-B isn't affected. The write is "intercepted" by the kernel which transparently creates a "copy", before committing the write to the new page.
fork() system call is another user of Copy-on-write technique.
[1] Use cases?
* perhaps a large library, where the binary code must be loaded into memory
* memory-mapped file perhaps
Thursday, April 17, 2014
## c++ instrumentation tools
oprofile **
gprof *
papi
callgrind (part of valgrind)
sar *
strace *
(Compared to strace, I feel there are more occasions when ltrace is useful.)
systemtap
tcpdump
perf,
perf_events
--
libc
*Pin threads to CPUs. This prevents threads from moving between cores and invalidating caches etc. (sched_setaffinity)
See more http://virtualizationandstorage.wordpress.com/2013/11/19/algo-high-frequency-trading-design-and-optmization/
Wednesday, April 16, 2014
coding IV | java threading | Gelber Chicago
-------fwd--------
Hi Bin,
Here is the problem I was speaking about:
1. Develop an application in Java that will satisfy the following requirements:
- Read in any number of files text files.
- Calculate 5 most commonly used letters
- Calculate 3 most commonly used characters
- Calculate 3 most commonly used numbers
- Calculate 10 most commonly used words
- Concurrently parse all files, however, limit your application to 2 parser threads. If there are more files than available parser threads, you will need to queue files for parsing.
2. Write an application in Java that will pull down the HTML content of any number of specified websites - single file per URL, no depth. Strip out all metadata and generate statistics by showing 2 most commonly used letters, numbers, characters and words.
Regards,
Brendan
Fwd: IDC interview (12/09/13) Desktop
Subject: IDC interview (12/09/13) Desktop
I got the offer for this but turned it down today - HSBC seems a better one.
a lot of questions about Swing, since the project is a desktop data consolidation application using Swing and JIDE.
OO concepts:
Open-closed principle
Encapsulation and inheritance, your understanding?
aggregation or composition, comments?
Collections:
enumerate the Collection APIs
Which Set implementation or interface retains the natural order of the elements?
[TB] Natural order is like the order in an English dictionary, right? TreeSet/SkipListSet, SortedSet (interface)
Algo:
white board Pascal triangle
Quick sorting
binary search
How to refactor the following code snippet (they took out a code from Effective Java):
public class Person {
private final Date birthDate;
public boolean isBabyBoomer() {
Calendar gmtCal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
gmtCal.set(1946, Calendar.JANUARY, 1, 0, 0, 0);
Date boomStart = gmtCal.getTime();
gmtCal.set(1965, Calendar.JANUARY, 1, 0, 0, 0);
Date boomEnd = gmtCal.getTime();
return birthDate.compareTo(boomStart) >= 0 && birthDate.compareTo(boomEnd) < 0;
}
}
[TB] I feel isBabyBoomer should become a static utility method, like
isBabyBoomer(Person) or isBabyBoomer(Date)
The boomStart and boomEnd should be static constants. No need to create them as temp objects in each invocation.
Alternatively, we only need to look at a given year of birth "YOB". Is baby boomer IIF 1946 <= YOB <= 1964 i.e. YOB between 1946 and 1964.
Am I right?
Fwd: HSBC interview (12/11/13, Jersey City), a Dodd Frank Recon Report project
Looks like most of java questions are on threading and spring, and nothing else?
See my answers below.
JAVA
if wait/notify is called outside the sync block, what exception is thrown?
[TB] Doing this won't work. It's a programmer error, not system error like out of memory. I don't remember this exception because I have never made this mistake.
So this is an obscure question.
enumerate Executors's static methods to create a thread pool
[TB] I remember ExecutorS.java has a bunch of static factory methods. They let us specify the queue data structure used, the size of the queue, (min/max) size of the pool. There are also methods to create a single-thread pool. I think there are also methods to create timer-enabled thread pool.
how do you do to ensure a thread processed prior to other active threads. What do you in 1.4 and in 1.5 or later?
[TB] Thread priority is a platform-dependent feature. Usually java threads are kernel threads, so the kernel scheduler decides whether to honour the priority I set on my java threads. Sometimes the kernel scheduler ignores the priority I set. If that happens, I call yield() in many places in the methods executed by other threads. However, the "important" thread may still get pre-empted (context-switched out). In such a case, I would need to use lock or condition variables to block all other threads. Another techniques is a global Boolean flag, set by the important thread, and read-only by all other threads. If it's 0, then all other threads will go back to a sleep loop.
Volatile variable? What's your comment?
SPRING
How many ways of instantiating a Spring container?
How many ways of injecting a bean in Spring XML config
What's drawbacks of using constructor as injection mean? circular reference, what's exception will be thrown?
Spring annotation. If a bean must be provided with a constructor with another bean injected, what's the attribute of the annotation should be used to enforce it.
What're the scopes in a bean tag in Spring XML?
If a scope being prototype, what will return from using "getBean()"
UNIX
How to check the long running java process, is it done, deadlock, hanging there or still running.
[TB] Good question. If it is done or deadlocked then cpu usage would be 0 for a long time (like a few minutes). Hang is always for specific reasons like deadlock or blocked on IO.
If the java is a network server then it could be waiting for requests, so no-activity is OK. We had better know the actual type of program it is.
[TB] Ideally there should be log entry from the process. For a network server log could say "waiting for requests". In reality, the programmer may have forgotten to do the right things.
kill -3 to get the threasdump
Monday, April 14, 2014
execution risk in a VWAP execution algo
Suppose you are given a large (500,000 shares) Sell order. Suppose your goal is minimal market impact i.e. avoid pushing up the price. What execution strategy? I don't know about other strategies, but VWAP strategies generally participate according to market volume, so given a decent implementation the market impact is often ... reduced.
I think the idea of the Exec risk is the _uncertainty_ of the final block price. If an implementation offers a very tight control and results in well-controlled final block price, then exec risk is small. http://www.cis.upenn.edu/~mkearns/finread/impshort.pdf explains with an example --
suppose MSFT trades 30 million shares on an average day. If a trader has three million MSFT shares to trade, a VWAP algorithm may be appropriate. However, if the trader
gets 30,000 shares of MSFT to trade, then the savings of market impact (by spreading the trade over the whole day) is not significant compared against the opportunity cost the trader could save by trading the stock within the next few minutes. Quick execution means the uncertainty (or std) in "final block price" is much reduced. With a small order you would achieve something close to the arrival price.
a few ideas on how to manage the manager (OC
• Result or reason? Managers want results, not reasons. When she asks me for a reason, what she really wants is how she can help solve the problem and make progress towards the result.
• I don't disclose my actual implementation details. If managers ask, I try to avoid the details. I feel managers don't want to be bothered with implementation details, esp. when the codebase grows big and my module is outside the most critical 10% core modules.
• I seldom deviate from manager's technical direction. If I must deviate, I try to keep a low profile and get things to work quickly. If I miss a deadline and attracts his question, then I try to give a reason without disclosing my deviation.
• When things get unbearable, I tell myself I am not imprisoned for life here.
• When I receive a put-down remark, I remind myself I'm competent and I generally get things done, and I am respected by my colleagues.
• When asked about progress, I used to give a guarded response like "working but there are some questions about ….".
Sunday, April 13, 2014
DOS | Temporarily change working directory for a single command
pushd myjava & java -Djava.rmi.server.codebase=http://10.203.111.166:20380/murex.download.guiclient.download -classpath %CP% com.ocbc.quest.murexgateway.MurexServer 19673 & popd
This DOS command line does 3 things
1) temporarily chdir into “myjava” folder and
2) run a (long) java command line, and then
3) restore the previous directory.
Actually, the java process blocks the script forever. If you use ctrl-C to terminate the blocking java process, you still get back into the previous directory J
See
Wednesday, April 9, 2014
WCF interview topics
wcf custom binding
wcf streaming vs buffering -- http://stackoverflow.com/questions/4043683/wcf-httptransport-streamed-vs-buffered-transfermode
wcf message encoding -- http://msdn.microsoft.com/en-us/library/ms733742(v=vs.110).aspx
Monday, April 7, 2014
algorithm efficiency at HFT shops
I feel latency due to software algorithm might be a small component of overall latency. However, the bigger portions of that latency may be unavoidable – network latency, disk write(?), serialization for transmission(?), ... So the only part we could tune might be the software algorithm.
Further, it's also possible that all the competitors are already using the same tricks to minimize network latency. In that case the competitive advantage is in the software algorithm.
I feel algorithm efficiency could be more permanent and fundamental than threading. If I compare 2 algorithms A1 and A2 and find A2 being 2x A1's speed, then no matter what threading or hardware solutions I use, A2 still beats A1.
HFT + CitiMuni -- loosely coupled, MOM-based
Sunday, April 6, 2014
black littermam -- my brief notes
First, forget about optimizer, capm, MeanVariance, or views. Let's first get a grip on Bayesian inference. Given an unfair coin, we are trying to estimate the Pr(tail) by tossing it over and over. There's uncertainty (a pr distribution) about the mean.
Updating estimates -- Now we can formulate the problem as how to update our estimate of expected return when we get some special insight (like insider news). Such an insight is called a subjective "view", in contrast to the public information about the securities. The updated estimates must be numbers, not some vague preference.
Optimizer -- Once we get the updated estimates, they go into a regular portfolio allocation optimizer.
Problems of MV -- concentration. Small change in the input (returns, covariance etc.) leading to drastic re-alloations.
The investor is uncertain in their estimates (prior and views), and expresses them as distributions of the unknown mean about the estimated mean. As a result, the posterior estimate is also a distribution.
HFT iv (Chicago-based, Jump)
Thread pool -- Your resume mentioned your home-made thread pool? How?
What data type would you use for the tasks in a thread pool?
Boost::any, boost::bind, boost::function
CPU cache – how do you use it to improve performance? Any specific techniques?
Stack size – who controls it? at Compile time or run time?
Stack overflow – who can detect it and print an error msg? JVM can do it but what if there's no VM?
Shared ptr – how is it implemented?
Scoped lock - what is it, why use it?
Your bash shell customizations as a cpp developer?
$LD_LIBRARY_PATH -- what is it?
--
After malloc(), how do you cast the pointer to MyClass* ? Do you call the ctor? How?
(This is asked again by Alex of DRW)
Wednesday, April 2, 2014
Chicago/Sing (Jump) IV Aug 2012
Q1b: Given a 100-element collection, compare performance of ... (iteration? Lookup?)
Q: UDP vs TCP diff?
%%A: multicast needs UDP.
%%A: UDP is faster - no connection setup/teardown no error check no ACK, no sequence number
Q: How would you add reliability to multicast?
%%A: sequence number
Q: How would you use tibco for trade messages vs pricing messages?
%%A: the trade msg must be delivered reliably to back office?
%%A: one of them is real time?
Q5: In your systems, how serious was data loss in non-CM multicast?
%%A: Usually not a big problem. During peak volatile periods, messaging rates could surge 500%. Data loss would deteriorate.
Q5b: how would you address the high data loss?
%%A: test with a target message rate. Beyond the target rate, we don't feel confident.
A: tune the tibco reliability parameter -- http://javarevisited.blogspot.sg/2011/04/dataloss-advisory-in-tibco-rendezvous.html
Q7: how is order state managed in your OMS engine?
%%A: if an order is half-processed and pending the 3nd reply from ECN, the single thread would block.
Q7b: even if multiple orders (for the same security) are waiting in the queue?
%%A: yes. To allow multiple orders to enter the "stream" would be dangerous.
Now I think the single thread should pick up and process all new orders and keep all pending orders in cache. Any incoming exchange messages would join the same task queue (or a separate task queue) - the same single thread.
3 main infrastructure teams
* exchange connectivity - order submission
* exchange connectivity - price feed. I think this is incoming-only, probably higher volume. Probably similar to Zhen Hai's role.
* risk infrastructure - no VaR mathematics.
Monday, March 24, 2014
class template implementation (not just declaration) must reside in *.h not *.cpp
A lot of boost libraries have full source code in *.h files.
Saturday, March 22, 2014
Matlab | clear all except breakpoint
save('tmp.mat','tmp')
clear classes % clears even more than clear all
load('tmp.mat')
dbstop(tmp)
% clean up
clear tmp
delete('tmp.mat')
Wednesday, March 19, 2014
##math games @ipad
Friday, March 14, 2014
for YuJia - parameter optimization for the strategy
The matlab fminsearch and fmincon optimizers could find only local minimum around the initial input values. If our initial values are (j=-1, g=3) then the optimizers would try j values of -0.95, -1.05 and similar values. Even though the optimizer would try dozens of combinations it would not try far-away values like -5 or +5.
Therefore these standard optimizers are limited in their search capabilities. We took inspiration from a sample solution to Homework 3 and came up with our own optimizer -- i call it randshock. It's very fast and fairly stable.
unbeatenTimesMax=1000;
yellowJersey = 0;
unbeatenTimes = 0;
while (unbeatenTimes<unbeatenTimesMax)
Target = -getMetric(mu);
if Target > yellowJersey
unbeatenTimes=0;
unbeatenHistory=nan(unbeatenTimesMax,1);
yellowJersey = Target;
mu_yj = mu;
fprintf('! \n');
else
unbeatenTimes = unbeatenTimes + 1;
unbeatenHistory(unbeatenTimes) = Target;
fprintf('\n');
end
power = 2*log(5)*rand(1,1)-log(5);
multiplier = exp(power);
%hist(multiplier,100)
g = mu_yj(2) * multiplier;
gap = g * (0.5 + rand);
j = g - gap;
mu = [j g];
end
The entire optimizer is only 20 lines short. Can be easily tested by selecting ENV.MODE_OPTIMIZE. It gives a random "shock" to the input params, and checks out the result, then repeat. If any result is better than ALL previous results, then this result gets the yellow-jersey. We would declare it the final champion only after it survives 1000 random shocks.
The random shock is designed to be big enough to reach far-away values, and sufficiently "local" to cover many many values around the current param values i.e. the current "yellow-jersey".
Without loss of generality, we will focus on the most important parameter "g". The shock is implemented as a random multiplier between 0.2 and 5. This means the randshock optimizer has a good chance (1000 tries) to reach a value 80% below the yellow-jersey or 5 times above the yellow-jersey. If all values between these 2 extreme values don't outperform the yellow jersey, then there's no point exploring further afield.
Between these 2 extremes, we randomly try 1000 different multipliers. We apply the multipler to the current "g" value. If none of them beats the yellow jersey, then we conclude the optimization.
If any one of the 1000 tries is found to outperform the yellow jersey, then it gets the yellow jersey, and we reset the "unbeatenTimes" counter, and give 1000 random shocks to the new yellow jersey.
In theory this process could continue indefinitely but in practice we always found a champion within 2000 shocks, which takes a few minutes only.
Using the randshock optimizer we found optimal param values, which are the basis of all the analysis on the strategy. Next let us look at the optimal parameter values.
The optimal values are g=6.05 and j= -2.96. ROI = $194k from initial seed capital of $1k. In other words, $1k grew almost 200 times over the InSample period (about 10 years).
Using these two optimized parameter values, the OutSample period (from late 2009 till now) generated a profit of $7576, from initial seed capital of $1k. The return is not as "fast" as InSample, but still the optimized values are usable and profitable.
Tuesday, March 11, 2014
For YuJia - trade at market close
Although a stock market may close at 4:00 PM, many stocks continue to trade for at least 20 minutes after the close. We wait for the actual close price, and make our trading decision. There are risks to this practice.
* Trading volume typically plummets when the market closes. This causes market orders to become very risky. Therefore we use limit orders.
* Limit orders may not get filled, therefore we use total daily volume to throttle our trade size.