The project should be done with the free statistical software known as "R" and the MySQL database. For this task, the statistician does not need to know PHP; the focus is on optimizing the variable settings using R and MySQL.
Project:
Integrating Open Source Optimization Software in Back Testing
Overview:
We need help implementing open source optimization software ([login to view URL]) to
bridge the gap between stochastic integer programming and mixed integer programming problems.
R needs to be set up to 'guess' until it finds the optimal values for about 15 variables. In
the past, a similar problem was solved using a brute force methodology; however, there are now
too many variable values for brute force to be feasible. Also, commercial software (SAS) was
used to do the statistical analysis, but it's difficult to find freelance statisticians who
have access to SAS (or other commercial software systems), and that forces us to rely on
open source software. Fortunately, there seem to be enough modules in R to handle the tasks.
Environment:
The operating system will be Debian 6.0 with php 5.3.3-7 and MySQL 5.1.49-3. If needed,
the system can be run on a cloud server (for processing power). The algorithm is written in
php. The data is stored in MySQL, and R seems to have modules to handle the test logistics and the
statistical analysis.
Here are the R software modules we'll probably use:
R + MySQL: [login to view URL]
R Quantitative Financial Modelling & Trading Framework: quantmod [login to view URL]
R + PHP: r-php [login to view URL] (Call php from R to run the algorithm)
R MIP Solvers: GLPK, CLP, SYMPHONY, LP_SOLVE
R Testing Pairs of Symbols for Cointegration: [login to view URL]
R + SOMA / STOPROG: a stochastic optimization algorithm similar to genetic algorithms
Background:
In the past, we used a brute force methodology to run over 1.2 million tests. A statistician
used SAS to analyze the test results and find the optimal values for about a dozen variables.
The analysis can be found here: [login to view URL]. Please read through it to
get a description of the methodology used. Also note the methodologies that didn't work. The
optimal values provided were set and new (randomized) tests were run to verify that the values
chosen were able to perform as expected. The test results showed that the optimal values work.
Since that time, we have updated the software by adding features and variable value settings.
Now, we want to use R to find the optimal values while running the tests, and to have a reusable
analysis infrastructure.
R Code the Problem:
After designing the system, you will need to encode the problem in R. The R code will set the
variable names and the ranges and/or bounds of each variable. The code will also specify
which variables should be maximized or minimized. The goal of the optimization is to find
a set of optimal values to get a predictably high "rank". Rank is a "goodness value" that is
used to compare one test result to another. Since the underlying input data (stock prices) is
random, rank will not be deterministic. Sometimes the optimal values will return a low value
(which may even be a negative number) and sometimes they will return a high value.
Here are the various variables and our goal for each:
Rank:
Rank is the "goodness value". It is calculated by subtracting the setprofit from the godprofit.
R will need to determine which variable settings result in the best average rank (as well
as the probability).
setprofit:
setprofit is calculated by buying all of the symbols and selling them after the test duration.
If possible, R should try to maximize the setprofit (by selecting the best symbols to work with).
godprofit:
godprofit is calculated by running an algorithm over the test duration and calculating the profit.
If possible, select symbols to maximize godprofit so that, in turn, the average rank will be high.
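The rank calculation described above can be sketched in a couple of lines of R; the vectors and values here are purely illustrative, not real table columns or results:

```r
# Hypothetical results for one batch of tests (illustrative values only)
godprofit <- c(120.5, 98.2, -15.0, 210.7)
setprofit <- c( 80.0, 101.3,  10.5, 150.2)

# rank = godprofit - setprofit, per the definition above
rank <- godprofit - setprofit

# The optimizer targets the best average rank across the batch
mean(rank)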
Parent Test Loop:
The software algorithm is set up to process tests in batches. This way there can be an "apples to
apples" comparison of the test results. Therefore, when R selects a test to run to get a specific
test result, it will actually get at least a dozen test results. The batch processing is also much
more efficient, as it requires far less writing and transferring of data.
startdate:
A date chosen at random from the range of dates in the database.
startquote:
A number with a decimal (e.g., 1.01). It is the lowest price a symbol can have on the random date.
set:
R will select a set of symbols to be tested based on their negative correlation strength.
Child Test Loop:
Once the above values are selected, we should set up quantmod (or another script) to loop
through and process the following variables as a related batch:
frequency:
An integer that governs how frequently the algorithm processes the dates in the test duration.
density:
An integer ( 1 - 4 ) chosen to omit specific symbols before running the tests.
level:
An integer ( 1 - 10 )
positions:
An integer ( 1 - 10 ). The algorithm can process all 10 positions at one time; however, this
variable depends on the quantity and level values being equal or higher.
quantity:
An integer ( 2 - 50 ) which is the number of symbols selected for the tests. Minimize this
value so that there are fewer symbols that need to be logistically handled on a daily basis.
duration:
An integer ( 1 - 730 ) which is the number of days (date range) of the tests. Minimize this
value so that the godprofit and rank are high in the shortest timeframe possible.
hold:
A value ( yes / no ) or ( 1 / 0 )
diversify:
A value ( yes / no ) or ( 1 / 0 ) Setting diversify to "yes" should be less risky than "no".
startcash:
An integer ( 100 - 100,000 ) which is the amount of cash used for all of the positions in the test.
Minimize this value (since it should be less risky to divide the capital over multiple positions).
cost:
An integer ( 0 - 25 ) which is the amount of money paid to the broker for each transaction. The
cost is actually set by the broker. Maximize this value to find out how much cost reduces rank.
sharebuffer:
An integer ( 1 - 5000 ) Maximize this value to reduce slippage and the logistics of trading.
profitlimitpercent:
An integer ( 0 - 10000 ). Minimize this value to tell the system to "quit while its rank is high".
stoplosspercentage:
An integer ( 1 - 100 ) The value represents the amount of cash left over after a loss in a position.
Maximize this value to reduce the 'drawdown' for each individual position.
buystop_percentage:
An integer ( 1 - 10000 ). Minimize this value to tell the algo to "quit if there's an unusually
high performance for a symbol price while the algo is holding a position in it".
sellstoppercentage:
An integer ( 0 - 100 ). Maximize this value to reduce the drawdown while the algo holds a position.
trailingpercentage:
An integer ( 0 - 100 ) Maximize this value to reduce the drawdown after a position price increase.
pricebufferpercent:
An integer ( 0 - 100 ) Maximize this value to discover how high the pricebufferpercent is before
it reduces rank. That setting will be used to control the amount of slippage the algo will allow.
Extra values:
drift:
A number ( -100 - 10000 ) with a decimal. This value can be positive or negative. It is
the change in price after the algo's last transaction. Reducing the duration should reduce drift
and return less random results.
number of transactions:
An integer ( 0 - 730 ). Reduce this value so that there are fewer logistical problems.
correlationdays:
An integer ( 0 - 90 ). This is the number of days R + testForCoint will test for a correlation
between the symbols in the set. Minimize the number of correlationdays to increase the number of
symbols that can be selected. Correlations between symbols probably break down over time.
correlationpercent:
An integer ( -100 - 100 ). Minimize this variable to find the strongest negative correlation for
the symbols in the set. This percentage value should be the 'minimum strength' of the correlation.
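The variable encoding described under "R Code the Problem" could look something like the sketch below. The names, ranges, and min/max goals are taken from the list above, but the data.frame layout itself is an assumption about how the problem might be encoded:

```r
# Sketch: encode a subset of the tunable variables with their bounds
# and optimization goal. Layout is illustrative, not prescriptive.
vars <- data.frame(
  name  = c("quantity", "duration", "startcash", "cost",
            "sharebuffer", "profitlimitpercent", "stoplosspercentage"),
  lower = c(2,  1,   100,    0,  1,    0,     1),
  upper = c(50, 730, 100000, 25, 5000, 10000, 100),
  goal  = c("min", "min", "min", "max", "max", "min", "max"),
  stringsAsFactors = FALSE
)

# Example: draw one candidate setting uniformly at random within the bounds
candidate <- setNames(
  mapply(function(lo, hi) sample(lo:hi, 1), vars$lower, vars$upper),
  vars$name
)
```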
Testing Process:
Decide what test to run:
There will be a small R script that defines the scope of variable values. R will look at the scope
of the test and decide what set of variable values to test. R will need to analyze the database to
figure out what settings to test next. R will call an MIP solver to optimize a ranking value.
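As a hedged illustration of calling one of the MIP solvers listed above (GLPK, via the Rglpk package), a toy problem might look like the following. The objective and constraints here are placeholders, not the real ranking model:

```r
library(Rglpk)  # R interface to the GLPK MIP solver

# Toy problem: maximize 3x + 7y subject to x + y <= 10, integer x and y
obj   <- c(3, 7)
mat   <- matrix(c(1, 1), nrow = 1)
dir   <- c("<=")
rhs   <- c(10)
types <- c("I", "I")  # both variables are integers

sol <- Rglpk_solve_LP(obj, mat, dir, rhs, types = types, max = TRUE)
sol$solution  # optimal integer values for x and y
sol$optimum   # objective value at the optimum
```

The real problem would replace the toy objective with the rank model and the bounds with the variable ranges listed earlier.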
Record the test variable values:
R will record the variable values in a MySQL database. The values will be stored in a settings table
called 'dayoptions' and in a test table (so that each rank result can be associated with the values).
The settings in the dayoptions table will be used by the algo during the test.
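Recording a candidate setting could be done with the R + MySQL module listed above (RMySQL). The connection details and column names below are assumptions, not the real schema:

```r
library(DBI)
library(RMySQL)

# Connection details are placeholders
con <- dbConnect(MySQL(), host = "localhost", user = "tester",
                 password = "secret", dbname = "backtest")

# Write the chosen settings into the 'dayoptions' table the algo reads from.
# Column names here are assumed; use the real schema.
dbSendQuery(con, "UPDATE dayoptions SET duration = 30, quantity = 10 WHERE id = 1")

# Also record the same values against the test row, so each rank result
# can later be joined back to the settings that produced it.
dbSendQuery(con, "INSERT INTO test (duration, quantity) VALUES (30, 10)")

dbDisconnect(con)
```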
Select symbols for the Set:
R will select the symbols to be traded. Have R take the random startdate and select symbols. Next,
it will go back 'correlationdays' in time and select the symbols that have a strong negative correlation.
The symbols do not need to be grouped in pairs, and ideally, each symbol will be negatively related
to several other symbols in the set. This symbol selection infrastructure will be routinely used to
find negatively correlated symbols (so it should be set up to be somewhat easy to use). The symbols
and prices will be in MySQL tables. R must select a random date that is at least 'duration' days
from the maximum date in the historical quotes database. The random startdate must also be at least
'correlationdays' from the minimum date in the database. If preferred, some of the functionality of
selecting the symbols can be coded in a php script (rather than R). After selecting the symbols the
set of symbols should be recorded in the 'symbols' and 'test' tables.
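The date constraints in this step can be sketched directly in R. The dates, the price matrix, and the correlation threshold below are illustrative assumptions; the real values come from the MySQL quote tables:

```r
# Placeholders for MIN(date) and MAX(date) from the historical quotes database
min_date <- as.Date("2005-01-03")
max_date <- as.Date("2011-06-30")
duration        <- 90
correlationdays <- 30

# startdate must leave 'correlationdays' of history before it and at
# least 'duration' days of data after it.
valid_from <- min_date + correlationdays
valid_to   <- max_date - duration
startdate  <- sample(seq(valid_from, valid_to, by = "day"), 1)

# Given a price matrix for the lookback window (rows = days, cols = symbols),
# one way to keep symbols that are negatively related to several others.
# 'prices' and the -0.5 / 3 thresholds are illustrative.
# cors <- cor(apply(prices, 2, function(p) diff(log(p))))
# keep <- colSums(cors < -0.5, na.rm = TRUE) >= 3
```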
Run the tests:
R can use quantmod (or we can code a php script) to loop through time and run the batch of tests. The code
will need to call the algo for each time increment. If there is some sort of malfunction, the algo
will record a notation in the database. The statistician/R programmer can decide which script will
handle the testing loop. The testing framework will record the setprofit, godprofit & rank in the
test table. Once the algo is done running, it can send a response to R.
The test process will be looped until the average 'rank' value has been optimized.
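A minimal sketch of that outer loop, assuming the php algo is invoked from the command line; the script path, iteration cap, and stopping rule are all placeholders:

```r
# Sketch: loop batches of tests until the average rank stops improving.
best_avg_rank <- -Inf
for (i in 1:100) {                  # placeholder iteration cap
  # ... choose and record the next candidate settings here ...

  # Call the php algo for this batch (the path is an assumption)
  # system("php /path/to/algo.php")

  # ... read setprofit, godprofit, and rank back from the test table ...
  ranks <- rnorm(12)                # placeholder for the batch's rank values
  avg_rank <- mean(ranks)
  if (avg_rank > best_avg_rank) best_avg_rank <- avg_rank
}
```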
Check the optimization:
The historical data will be split into parts. One part will be used to optimize the values. Next,
the tests will be run again on another part of the data to make sure the optimal values are valid.
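This check amounts to an in-sample/out-of-sample split of the historical data, which might be sketched as follows; the date range and the 70/30 ratio are assumptions:

```r
# Sketch: split the available dates into an optimization part and a
# validation part. The 70/30 ratio is illustrative.
all_dates <- seq(as.Date("2005-01-03"), as.Date("2011-06-30"), by = "day")
cut_point <- floor(0.7 * length(all_dates))

optimize_dates <- all_dates[1:cut_point]                         # find optimal values
validate_dates <- all_dates[(cut_point + 1):length(all_dates)]   # verify those values
```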